Phare Benchmark Principles
Phare (Potential Harm Assessment & Risk Evaluation) is a benchmark that assesses language models on critical safety and security aspects. The framework establishes transparent, independent, and culturally diverse standards for evaluation.
Core Principles
LLM benchmarks remain predominantly English-centric, limiting their real-world applicability. The Phare benchmark evaluates models across multiple languages, incorporating cultural context beyond direct translation. We currently support English, French, and Spanish, and plan to expand to more languages in the future.
We believe benchmarks should remain independent of model developers. We retain full autonomy over design decisions while considering input from the broader AI research community.
Benchmark data should not appear in the training set of language models. For this reason, a dedicated private hold-out dataset is used to evaluate model performance.
However, we release a significant portion of the samples from each benchmarking module as a public set, to facilitate independent verification and private model testing.
The goal of this benchmark is to improve the AI ecosystem through collaboration between safety researchers and model builders. We follow responsible disclosure practices, sharing our findings with model providers before public release.
Evaluation Framework
Phare consists of modular test components, each defined by three parameters: topic area, input/output modality, and target language. For example, a single module might evaluate bias and fairness using text inputs in Spanish.
Phare aims to evaluate models across four fundamental safety categories:
Hallucination
Evaluates model accuracy through factual verification and adversarial testing, with coverage across different operational contexts such as retrieval-augmented generation (RAG) and tool-based interactions.
Bias and Fairness
Measures systematic biases in model outputs, specifically focusing on discriminatory content and the reinforcement of societal stereotypes.
Harmful Content Generation
Tests the model's responses to requests for harmful content, spanning both explicitly dangerous activities (e.g., criminal behavior) and potentially harmful misinformation (e.g., unauthorized medical advice).
Intentional Abuse by Users
Assesses model robustness against adversarial attacks, including prompt injection and jailbreaking attempts, that aim to circumvent safety guardrails or manipulate model behavior.
In the current version, we cover the first three categories. The last category is under active development.
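The modular structure described above, where each test component is defined by a topic area, a modality, and a target language, can be sketched as a small data model. This is an illustrative sketch only; the class and field names are assumptions, not the actual Phare API.

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical names for the four safety categories; not Phare's real identifiers.
class Category(Enum):
    HALLUCINATION = "hallucination"
    BIAS_AND_FAIRNESS = "bias_and_fairness"
    HARMFUL_CONTENT = "harmful_content_generation"
    INTENTIONAL_ABUSE = "intentional_abuse"  # under active development

class Modality(Enum):
    TEXT = "text"  # further modalities could be added as the benchmark expands

@dataclass(frozen=True)
class Module:
    """One modular test component, defined by its three parameters."""
    category: Category   # topic area
    modality: Modality   # input/output modality
    language: str        # target language (ISO 639-1 code, e.g. "en", "fr", "es")

# The example from the text: bias and fairness, text inputs, in Spanish.
module = Module(Category.BIAS_AND_FAIRNESS, Modality.TEXT, "es")
```

Modeling modules this way makes the benchmark matrix explicit: the full suite is the cross product of covered categories, supported modalities, and supported languages.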