Phare LLM Benchmark

Phare is a multilingual benchmark to evaluate LLMs across key safety & security dimensions, including hallucination, factual accuracy, bias, and potential harm.


Leaderboard

Note: Scores are computed by averaging the scores across all tasks and languages for each module. Higher scores are better.
The benchmarked models are instruction-tuned models that are generally available (i.e., out of the experimental or preview phase). We currently do not include reasoning models.
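
As a worked example, the overall score in the Average column can be reproduced as the unweighted mean of the three module scores, which matches the figures in the table. The snippet below is a minimal illustration only (it is not part of the Phare codebase), and the module labels follow our reading of the table columns.

```python
from statistics import mean

def overall_score(module_scores: dict[str, float]) -> float:
    """Unweighted mean of per-module scores, each already averaged
    over all tasks and languages for that module."""
    return mean(module_scores.values())

# Gemini 1.5 Pro's module scores from the leaderboard below.
gemini_1_5_pro = {"hallucination": 86.41, "harmfulness": 96.84, "bias": 93.70}

print(f"{overall_score(gemini_1_5_pro):.2f}%")  # 92.32%, the value in the Average column
```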

July 1st, 2025: we updated the bias resistance evaluation with a more advanced methodology, as described here. The results for the jailbreak resistance evaluation are currently within our provider notification period and will be released on July 5th.

| Rank | Model | Provider | Average | Hallucination | Harmfulness | Bias |
|------|-------|----------|---------|---------------|-------------|------|
| #1 | Gemini 1.5 Pro | Google | 92.32% | 86.41% | 96.84% | 93.70% |
| #2 | Llama 4 Maverick | Meta | 87.84% | 81.14% | 89.25% | 93.13% |
| #3 | Gemini 2.0 Flash | Google | 87.03% | 81.43% | 94.30% | 85.37% |
| #4 | Llama 3.1 405B | Meta | 86.42% | 76.83% | 86.49% | 95.93% |
| #5 | Deepseek V3 | Deepseek | 84.67% | 78.77% | 89.00% | 86.24% |
| #6 | Claude 3.5 Haiku | Anthropic | 83.23% | 86.33% | 95.36% | 67.98% |
| #7 | Claude 3.7 Sonnet | Anthropic | 82.16% | 89.86% | 95.52% | 61.10% |
| #8 | GPT-4o | OpenAI | 81.59% | 85.64% | 92.66% | 66.48% |
| #9 | Deepseek V3 (0324) | Deepseek | 80.54% | 73.87% | 92.80% | 74.96% |
| #10 | Mistral Small 3.1 24B | Mistral | 80.47% | 77.68% | 90.91% | 72.83% |
| #11 | Claude 3.5 Sonnet | Anthropic | 80.25% | 91.70% | 95.40% | 53.67% |
| #12 | Gemma 3 27B | Google | 79.81% | 69.48% | 91.36% | 78.59% |
| #13 | Qwen 2.5 Max | Alibaba Qwen | 77.67% | 76.91% | 89.89% | 66.22% |
| #14 | Llama 3.3 70B | Meta | 75.96% | 75.28% | 86.04% | 66.56% |
| #15 | Mistral Large | Mistral | 72.85% | 79.86% | 89.38% | 49.31% |
| #16 | Grok 2 | xAI | 72.73% | 77.20% | 91.44% | 49.56% |
| #17 | GPT-4o mini | OpenAI | 70.82% | 74.43% | 77.29% | 60.74% |
