Methodology

We use a systematic but flexible approach to create evaluation samples for each task, and we use the LMEval framework to evaluate the models.

Modules

Phare is organized into modules, each representing a specific safety dimension. Each module contains a set of tasks, and each task defines the prompt samples on which the model is evaluated. The current version of Phare covers three modules:

Hallucination
Measures issues with factual reliability, misinformation, and generation of false or misleading information.

Samples: ~6,000 private, ~1,600 public

Harmfulness
Probes whether the model generates content or advice that could expose individuals to harm or enable harmful behavior.

Samples: ~1,500 private, ~400 public

Bias & Fairness
Measures issues with fairness and stereotype amplification across demographic groups.

Samples: ~2,400 private, ~600 public

Sample creation process

We employ a three-step process to collect samples for each task. First, we gather content: we collect source materials in English, French, and Spanish, and develop seed prompts that reflect real-world usage scenarios. Next, we create evaluation samples, transforming the gathered content into test cases while preserving cultural and linguistic authenticity. These samples cover four key assessment categories: hallucination, bias, security, and harmful content generation. Finally, we apply quality control: each sample undergoes human review for accuracy and relevance.
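To make the three steps concrete, here is a minimal sketch of what a collected sample and the quality-control filter might look like. The field names and the `Sample` class are illustrative assumptions, not the actual Phare schema.

```python
from dataclasses import dataclass

# Hypothetical sample record; fields mirror the process described above
# (multilingual seed prompts paired with evaluation criteria), but the
# names are assumptions, not Phare's real data model.
@dataclass
class Sample:
    prompt: str       # seed prompt reflecting a real-world usage scenario
    language: str     # "en", "fr", or "es"
    module: str       # e.g. "hallucination", "harmfulness", "bias"
    criteria: str     # evaluation criteria paired with the prompt
    reviewed: bool = False  # set True once human review passes

def quality_control(samples):
    """Step 3: keep only samples that passed human review."""
    return [s for s in samples if s.reviewed]

# Steps 1-2: gathered content turned into candidate evaluation samples.
raw = [
    Sample("What year did event X occur?", "en", "hallucination",
           "Answer matches the documented date", reviewed=True),
    Sample("Donne-moi des conseils pour...", "fr", "harmfulness",
           "Refuses or safely reframes the request", reviewed=False),
]

benchmark = quality_control(raw)
```

The human-review flag is the gate between raw collected content and the final benchmark set.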

This process yields a set of test cases, each pairing a prompt with specific evaluation criteria. During assessment, we collect model responses to these prompts and score them against the defined criteria to generate benchmark metrics.
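The scoring step above can be sketched as follows. The `judge` function here is a toy keyword check standing in for a task's actual scorer (an assumption; Phare's real scoring is task-specific), but the shape of the loop, collecting responses and scoring each against its sample's criteria, matches the description.

```python
# Hedged sketch of scoring: each test case pairs a prompt with criteria,
# and the metric is the fraction of responses meeting their criteria.

def judge(response: str, criteria: str) -> bool:
    # Toy stand-in for a task-specific scorer (assumption):
    # the expected keyword must appear in the response.
    return criteria.lower() in response.lower()

def benchmark_metric(cases, responses):
    """Fraction of model responses meeting their sample's criteria."""
    scores = [judge(r, c["criteria"]) for c, r in zip(cases, responses)]
    return sum(scores) / len(scores)

cases = [
    {"prompt": "What is the capital of France?", "criteria": "Paris"},
    {"prompt": "What is 2 + 2?", "criteria": "4"},
]
responses = ["The capital of France is Paris.", "The answer is five."]

print(benchmark_metric(cases, responses))  # → 0.5
```

In practice the scorer and the aggregation vary per task; the per-sample pairing of prompt and criteria is what stays constant.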

Sample collection and evaluation process in the Phare benchmark.

Evaluation is performed with LMEval, an open-source framework for running evaluations on language models. In the coming days, we will open-source the full evaluation pipeline for reproducibility.