Methodology

We use a systematic but flexible approach to create evaluation samples for each task, and we use the LMEval framework to evaluate the models.

Modules

Phare is organized into modules, each representing a specific safety dimension. Each module contains a set of tasks, and each task defines the prompt samples on which the model is evaluated. The current version of Phare covers three modules:

Hallucination
Measures issues with factual reliability, misinformation, and generation of false or misleading information.

Samples: ~6,000 private, ~1,600 public

Harmfulness
Probes whether the model generates content or advice that could expose individuals to harm or enable harmful behavior.

Samples: ~1,500 private, ~400 public

Bias & Fairness
Measures issues with fairness and stereotype amplification across demographic groups.

Samples: ~2,400 private, ~600 public

Sample creation process

We employ a three-step process to collect samples for each task. First, we gather content: we collect source materials in English, French, and Spanish, and develop seed prompts that reflect real-world usage scenarios. Next, we create evaluation samples, transforming the gathered content into test cases while preserving cultural and linguistic authenticity. These samples cover four key assessment categories: hallucination, bias, security, and harmful content generation. Finally, we apply quality control: each sample undergoes human review for accuracy and relevance.
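To make the three steps concrete, here is a minimal sketch of what a collected sample and the quality-control filter might look like. The field names and the `Sample` class are illustrative assumptions, not the actual Phare schema.

```python
from dataclasses import dataclass

# Hypothetical sample record; fields mirror the process described above
# (multilingual seed prompts paired with evaluation criteria), but the
# names are assumptions, not Phare's real data model.
@dataclass
class Sample:
    prompt: str       # seed prompt reflecting a real-world usage scenario
    language: str     # "en", "fr", or "es"
    module: str       # e.g. "hallucination", "harmfulness", "bias"
    criteria: str     # evaluation criteria paired with the prompt
    reviewed: bool = False  # set True once human review passes

def quality_control(samples):
    """Step 3: keep only samples that passed human review."""
    return [s for s in samples if s.reviewed]

# Steps 1-2: gathered content turned into candidate evaluation samples.
raw = [
    Sample("What year did event X occur?", "en", "hallucination",
           "Answer matches the documented date", reviewed=True),
    Sample("Donne-moi des conseils pour...", "fr", "harmfulness",
           "Refuses or safely reframes the request", reviewed=False),
]

benchmark = quality_control(raw)
```

The human-review flag is the gate between raw collected content and the final benchmark set.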

This process yields a set of test cases, each pairing a prompt with specific evaluation criteria. During assessment, we collect model responses to these prompts and score them against the defined criteria to generate benchmark metrics.
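The scoring step above can be sketched as follows. The `judge` function here is a toy keyword check standing in for a task's actual scorer (an assumption; Phare's real scoring is task-specific), but the shape of the loop, collecting responses and scoring each against its sample's criteria, matches the description.

```python
# Hedged sketch of scoring: each test case pairs a prompt with criteria,
# and the metric is the fraction of responses meeting their criteria.

def judge(response: str, criteria: str) -> bool:
    # Toy stand-in for a task-specific scorer (assumption):
    # the expected keyword must appear in the response.
    return criteria.lower() in response.lower()

def benchmark_metric(cases, responses):
    """Fraction of model responses meeting their sample's criteria."""
    scores = [judge(r, c["criteria"]) for c, r in zip(cases, responses)]
    return sum(scores) / len(scores)

cases = [
    {"prompt": "What is the capital of France?", "criteria": "Paris"},
    {"prompt": "What is 2 + 2?", "criteria": "4"},
]
responses = ["The capital of France is Paris.", "The answer is five."]

print(benchmark_metric(cases, responses))  # → 0.5
```

In practice the scorer and the aggregation vary per task; the per-sample pairing of prompt and criteria is what stays constant.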

Sample collection and evaluation process in the Phare benchmark.

Evaluation is performed with LMEval, an open-source framework for running evaluations on language models. In the coming days, we will open-source the full evaluation pipeline for reproducibility.