Phare LLM Benchmark

Phare is a multilingual benchmark to evaluate LLMs across key safety & security dimensions, including hallucination, factual accuracy, bias, and potential harm.


Leaderboard

Note: Scores are computed by averaging the scores across all tasks and languages within each module; higher scores are better. The Average column is the mean of the four module scores.
The models benchmarked are instruction-tuned models that are generally available (i.e., out of the experimental or preview phase). We currently do not include reasoning models.
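To make the aggregation concrete, here is a minimal sketch of the two-level averaging described above. The data layout, module names, and score values are illustrative assumptions, not actual Phare data or Phare code.

```python
# Hypothetical sketch: per-(task, language) scores are averaged within each
# module, then the module scores are averaged for the overall "Average" figure.
from statistics import mean

# scores[module][(task, language)] = score in [0, 1]; values are made up.
scores = {
    "hallucination": {("factuality", "en"): 0.78, ("factuality", "fr"): 0.75},
    "bias": {("stereotypes", "en"): 0.95, ("stereotypes", "fr"): 0.96},
}

# Per-module score: mean over all task/language pairs in that module.
module_scores = {m: mean(task_scores.values()) for m, task_scores in scores.items()}

# Overall score: mean of the module scores (the leaderboard's "Average" column).
overall = mean(module_scores.values())

print({m: f"{s:.2%}" for m, s in module_scores.items()}, f"overall={overall:.2%}")
```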

July 1st, 2025: we released the new jailbreak resistance module and updated the bias resistance evaluation with a more advanced methodology.

| Rank | Model | Provider | Average | Hallucination | Harmfulness | Bias & Fairness | Jailbreak |
|------|-------|----------|---------|---------------|-------------|-----------------|-----------|
| #1 | Llama 3.1 405B | Meta | 85.80% | 76.83% | 86.49% | 95.93% | 83.97% |
| #2 | Gemini 1.5 Pro | Google | 79.12% | 86.41% | 96.84% | 93.70% | 39.53% |
| #3 | Llama 4 Maverick | Meta | 77.63% | 81.14% | 89.25% | 93.13% | 47.02% |
| #4 | Claude 3.5 Haiku | Anthropic | 77.20% | 86.33% | 95.36% | 67.98% | 59.11% |
| #5 | GPT-4o | OpenAI | 76.93% | 85.64% | 92.66% | 66.48% | 62.95% |
| #6 | Claude 3.5 Sonnet | Anthropic | 76.13% | 91.70% | 95.40% | 53.67% | 63.76% |
| #7 | Claude 3.7 Sonnet | Anthropic | 75.73% | 89.86% | 95.52% | 61.10% | 56.43% |
| #8 | Gemini 2.0 Flash | Google | 75.69% | 81.43% | 94.30% | 85.37% | 41.65% |
| #9 | Deepseek V3 | Deepseek | 71.49% | 78.77% | 89.00% | 86.24% | 31.96% |
| #10 | Llama 3.3 70B | Meta | 70.49% | 75.28% | 86.04% | 66.56% | 54.08% |
| #11 | Qwen 2.5 Max | Alibaba Qwen | 70.20% | 76.91% | 89.89% | 66.22% | 47.80% |
| #12 | Gemma 3 27B | Google | 69.79% | 69.48% | 91.36% | 78.59% | 39.71% |
| #13 | Mistral Small 3.1 24B | Mistral | 69.08% | 77.68% | 90.91% | 72.83% | 34.91% |
| #14 | Deepseek V3 (0324) | Deepseek | 68.97% | 73.87% | 92.80% | 74.96% | 34.25% |
| #15 | GPT-4o mini | OpenAI | 67.06% | 74.43% | 77.29% | 60.74% | 55.78% |
| #16 | Mistral Large | Mistral | 64.15% | 79.86% | 89.38% | 49.31% | 38.06% |
| #17 | Grok 2 | xAI | 61.38% | 77.20% | 91.44% | 49.56% | 27.32% |
