Phare Benchmark Principles

Phare (Potential Harm Assessment & Risk Evaluation) is a benchmark that assesses language models on critical safety and security aspects. The framework establishes transparent, independent, and culturally diverse standards for evaluation.

Core Principles

Multi-lingual Design

LLM benchmarks remain predominantly English-centric, limiting their real-world applicability. The Phare benchmark evaluates models across multiple languages, incorporating cultural context beyond direct translation. We currently support English, French, and Spanish, and plan to expand to more languages in the future.

Independence

We believe benchmarks should remain independent of model developers. We maintain full autonomy in design decisions while considering input from the broader AI research community.

Integrity & Reproducibility

Benchmark data should not appear in the training set of language models. For this reason, a dedicated private hold-out dataset is used to evaluate model performance.

However, we release a significant portion of samples for each benchmarking module in a public set to facilitate independent verification and private model testing.
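The public/private split described above can be sketched as follows. This is an illustrative example only: the fraction, seed, and function names are assumptions, not Phare's actual release parameters.

```python
import random

def split_samples(samples, public_fraction=0.4, seed=42):
    """Split benchmark samples into a public set (released for
    independent verification and private model testing) and a
    private hold-out (kept unreleased to avoid training-set
    contamination). Fraction and seed here are illustrative."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * public_fraction)
    return shuffled[:cut], shuffled[cut:]

public_set, private_set = split_samples(range(100))
```

Keeping the hold-out private means reported scores cannot be inflated by models that have memorized the public samples.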

Responsibility

The goal of this benchmark is to improve the AI ecosystem through collaboration between safety researchers and model builders. We follow responsible disclosure practices, sharing our findings with model providers before public release.

Evaluation Framework

Phare consists of modular test components, each defined by three parameters: topic area, input/output modality, and target language. For example, a single module might evaluate bias and fairness using text inputs in Spanish.
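A module descriptor with these three parameters could look like the following minimal sketch. The class and field names are hypothetical, chosen only to mirror the description above; the actual framework's types may differ.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PhareModule:
    """One modular test component, identified by its three parameters."""
    topic: str     # topic area, e.g. "bias_fairness" or "hallucination"
    modality: str  # input/output modality, e.g. "text"
    language: str  # target language code, e.g. "es"

# The example from the text: bias and fairness, text inputs, Spanish.
module = PhareModule(topic="bias_fairness", modality="text", language="es")
```

Treating each (topic, modality, language) triple as its own module lets new languages or modalities be added without touching existing tests.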

Phare aims to evaluate models across four fundamental safety categories:

Hallucination

Evaluates model accuracy through factual verification and adversarial testing, with coverage across different operational contexts such as retrieval-augmented generation (RAG) and tool-based interactions.

Bias and Fairness

Measures systematic biases in model outputs, specifically focusing on discriminatory content and the reinforcement of societal stereotypes.

Harmful Content Generation

Tests the model's responses to requests for harmful content, spanning both explicitly dangerous activities (e.g., criminal behavior) and potentially harmful misinformation (e.g., unauthorized medical advice).

Intentional Abuse by Users

Assesses model robustness against adversarial attacks, including prompt injection and jailbreaking attempts, that aim to circumvent safety guardrails or manipulate model behavior.

The current version covers the first three categories; the fourth is under active development.