Methodology
We use a systematic but flexible approach to create evaluation samples for each task, and leverage the LMEval framework to evaluate the models.
Modules
Phare is organized in modules, each representing a specific safety dimension. Each module contains a set of tasks that define a set of prompt samples on which the model will be evaluated. The current version of Phare covers three modules:
Implemented tasks
Samples
~6000 private samples, ~1600 public samples
Sample creation process
We employ a three-step process to collect samples for each tasks. First, we gather content. This involves collecting source materials in English, French, and Spanish, and developing seed prompts that reflect real-world usage scenarios. Next, we create evaluation samples. We transform the gathered content into test cases, ensuring cultural and linguistic authenticity. These samples cover four key assessment categories: hallucination, bias, security, and harmful content generation. Finally, we implement quality control measures. Each sample undergoes human review for accuracy and relevance.
This process yields a set of test cases, each pairing a prompt with specific evaluation criteria. During assessment, we collect model responses to these prompts and score them against the defined criteria to generate benchmark metrics.

Evaluation is performed with LMEval, an open-source framework to run evaluation on language models. In the coming days, we will be open-sourcing the full evaluation pipeline for reproducibility.