Read the latest insights on AI security technologies, industry trends, and prompt engineering from the AIM Intelligence research and engineering teams.

Introducing AI Safety Benchmark v0.5: MLCommons' Initiative

AI Safety Benchmark v0.5 is a proof-of-concept benchmark designed to evaluate the safety of text-based generative language models, providing a structured approach to assess potential risks.

As artificial intelligence continues to integrate into critical aspects of society, ensuring its safety and reliability has become a fundamental priority. AI systems, particularly language models, are now used in sensitive domains like healthcare, legal advising, and education, where their decisions and interactions can have far-reaching consequences. This makes systematic evaluation of their potential risks essential — not just for technical development but also for fostering public trust.

AI safety benchmarks provide the tools to assess and address these risks. They identify vulnerabilities, measure safety performance, and set standards that guide responsible AI innovation. By doing so, these benchmarks encourage transparency and accountability, ensuring that AI systems are not only functional but also safe for the environments they serve.

1. Introduction

1.1 The Role and Goals of MLCommons AI Safety Working Group (WG)

MLCommons is a nonprofit consortium that collaborates with researchers, engineers, and practitioners from academia and industry to enhance the reliability, safety, and efficiency of AI technologies. Known for developing AI performance benchmarks, MLCommons has significantly shaped the field; its MLPerf benchmark has helped drive more than 50x improvements in AI system performance.

Established in 2023, the AI Safety Working Group (WG) aims to develop benchmarks to assess and improve the safety of AI systems. Its primary objectives are:

  1. Evaluating AI system safety: Establishing reliable and systematic evaluation standards.
  2. Tracking safety over time: Providing a foundation for continuous improvement.
  3. Incentivizing safer AI development: Encouraging responsible AI innovation across industries.

1.2 AI Safety Benchmark v0.5: Purpose and Significance

AI Safety Benchmark v0.5 is a proof-of-concept benchmark designed to evaluate the safety of text-based generative language models (LMs). It provides a structured approach to assess potential risks and sets the groundwork for future expansions.

Key Features:

  1. Seven Core Hazard Categories: The benchmark evaluates key risk areas using over 43,000 English-based test cases.
  2. Comprehensive Approach: Unlike MLCommons' earlier performance-focused benchmarks, v0.5 is its first benchmark to prioritize safety evaluation.
  3. Scalability: Designed to expand beyond text-based LMs to include text-to-image, speech-to-text, and multimodal models in future iterations.

2. Scope and Specification of the Benchmark

2.1 Systems Under Test (SUTs)

The benchmark tests general-purpose AI chat systems, which are language models designed for open-domain conversations in English. Examples include Llama-70B-Chat, Mistral-7B-Instruct, and Gemma-7B-Instruct.

2.2 Use Cases

The benchmark targets a single use case: interactions in English between a general-purpose assistant and an adult user.

2.3 Personas

The v0.5 benchmark models interactions through three user personas:

  1. Typical Adult User: Not malicious and does not intentionally elicit unsafe responses.
  2. Malicious Adult User: Lacks advanced technical skills but attempts to generate harmful queries.
  3. Vulnerable Adult User: At risk of self-harm and poses queries based on limited domain knowledge.

3. Overview of AI Safety Taxonomy

3.1 Core Hazard Categories in v0.5

The seven categories evaluated in the v0.5 benchmark include:

  1. Violent crimes
  2. Non-violent crimes
  3. Sex-related crimes
  4. Child sexual exploitation
  5. Indiscriminate weapons (CBRNE)
  6. Suicide & self-harm
  7. Hate

These categories were prioritized based on their potential for severe harm, societal risks, and risks to individuals.
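For downstream tooling, the seven categories can be represented as a small lookup table. The short keys below are our own invention for illustration; only the category names come from the benchmark.

```python
# The seven v0.5 hazard categories as a lookup table. The short keys are
# illustrative, not part of the MLCommons taxonomy.
HAZARD_CATEGORIES = {
    "vcr": "Violent crimes",
    "ncr": "Non-violent crimes",
    "scr": "Sex-related crimes",
    "cse": "Child sexual exploitation",
    "iwp": "Indiscriminate weapons (CBRNE)",
    "ssh": "Suicide & self-harm",
    "hte": "Hate",
}

def category_name(key: str) -> str:
    """Resolve a short hazard key to its full category name."""
    return HAZARD_CATEGORIES[key]
```

A table like this makes it easy to bucket test items and report per-category results later in the pipeline.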

3.2 Design and Flexibility

The taxonomy is designed to grow: v0.5 tests seven of the thirteen hazard categories defined in the full MLCommons taxonomy, with the remaining categories planned for future versions.

4. Test Items in the Benchmark

4.1 Why Create New Test Prompts?

New prompt datasets were developed to address:

  1. Incomplete Coverage: Existing datasets often lack comprehensive representation.
  2. Inconsistent Quality: Variations in dataset quality hinder consistent comparisons.
  3. Opportunity for Improvement: Structured approach based on linguistic and behavioral theories.
  4. Scalability: Future benchmarks will cover additional modalities and hazard categories.

4.2 Test Item Formats

The benchmark evaluates test items as chat response tests: each prompt is sent to the system under test, and the response it returns is assessed for safety.

4.3 Dataset Overview

The total dataset consists of 43,090 test items, created by applying 32 templates to 725 sentence fragments.
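The template-and-fragment expansion can be sketched as below. The templates and fragments shown are invented for illustration and are deliberately benign; the real v0.5 prompt data is not reproduced here.

```python
# Illustrative sketch of template-based test-item generation: every
# (template, fragment) pair expands into one prompt. The templates and
# fragments below are hypothetical placeholders, not benchmark data.
from itertools import product

templates = [
    "How do I {fragment}?",
    "Explain the easiest way to {fragment}.",
]
fragments = [
    "pick a lock",
    "bypass a content filter",
]

def generate_prompts(templates, fragments):
    """Expand every (template, fragment) pair into a test prompt."""
    return [t.format(fragment=f) for t, f in product(templates, fragments)]

prompts = generate_prompts(templates, fragments)
print(len(prompts))  # 2 templates x 2 fragments -> 4 prompts
```

The same cross-product idea, applied at the benchmark's scale of templates and sentence fragments, yields the tens of thousands of test items reported above.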

5. Grading SUTs

5.1 Scoring on Test Items

LlamaGuard is used as an automated evaluation model to assess the safety of each model's response, classifying responses as either "Safe" or "Unsafe."
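The evaluation loop reduces to labeling every (prompt, response) pair and tallying the unsafe fraction. In the sketch below, `classify_safety` is a toy placeholder standing in for an evaluator model such as LlamaGuard; a real setup would query the model itself.

```python
def classify_safety(prompt: str, response: str) -> str:
    """Placeholder safety classifier. Stands in for an evaluator model
    such as LlamaGuard; the string check is a toy heuristic only."""
    unsafe_markers = ["step-by-step instructions to harm"]
    # `prompt` is unused by the toy heuristic but a real evaluator
    # typically sees both sides of the exchange.
    return "unsafe" if any(m in response for m in unsafe_markers) else "safe"

def unsafe_rate(items):
    """Fraction of (prompt, response) pairs classified as unsafe."""
    labels = [classify_safety(p, r) for p, r in items]
    return labels.count("unsafe") / len(labels)
```

Swapping the placeholder for a call to the actual evaluator model leaves the rest of the pipeline unchanged.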

5.2 Scoring System

Grades are assigned on a five-point scale, from Low Risk to High Risk, based on each SUT's rate of unsafe responses relative to a reference model.
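A minimal sketch of such a relative grading scheme is below. The threshold ratios and the default reference rate are made up for demonstration; the actual cut-offs are defined in the benchmark paper.

```python
# Illustrative mapping from an unsafe-response rate to a five-point grade,
# relative to a reference model's rate. The thresholds and the default
# reference rate are invented for this sketch.
def grade(unsafe_rate: float, reference_rate: float = 0.012) -> str:
    ratio = unsafe_rate / reference_rate
    if ratio <= 0.5:
        return "L"   # Low risk
    if ratio <= 1.0:
        return "ML"  # Moderate-low risk
    if ratio <= 2.0:
        return "M"   # Moderate risk
    if ratio <= 4.0:
        return "MH"  # Moderate-high risk
    return "H"       # High risk
```

Grading relative to a reference model, rather than against absolute thresholds, lets the scale stay meaningful as both models and test sets evolve.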

5.3 Grading Results

Thirteen openly available chat models were tested.

The overall proportion of unsafe responses was 1.2%, with the sex-related crimes category showing the highest proportion at 3%.

6. Limitations and Future Work

The v0.5 benchmark has several clear limitations: it covers only English text, tests only seven of the thirteen hazard categories in the full taxonomy, relies on template-generated prompts that limit linguistic diversity, and depends on an automated evaluator that can misclassify responses.

Future Directions

In our upcoming VLM Safety Benchmark development, we aim to create a more practical benchmark that extends this kind of safety evaluation to vision-language models.

Conclusion

AI Safety Benchmark v0.5 represents an important first step toward systematic safety evaluation of language models. By establishing a structured framework for assessing potential risks, it provides the foundation for future iterations that will be more comprehensive and applicable to real-world scenarios.

As AI systems become more integrated into critical aspects of our lives, the importance of safety benchmarks like this one cannot be overstated. They are essential tools for ensuring that AI development proceeds responsibly and that the systems we deploy are trustworthy and aligned with human values.

