Phare LLM Benchmark

Phare is a multilingual benchmark to evaluate LLMs across key safety & security dimensions, including hallucination, factual accuracy, bias, potential harm, and jailbreak resistance.


Leaderboard

Note: Each module score is computed by averaging the scores across all tasks and languages for that module. Higher scores are better.
The benchmarked models are instruction-tuned models that are generally available (i.e., not experimental releases).
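As a sanity check on the scoring rule above, the overall leaderboard figure appears to be the unweighted mean of the four module scores in each row. A minimal sketch, using the Claude 4.5 Haiku row from the table below:

```python
# Sanity check: the overall leaderboard score appears to be the
# unweighted mean of the four per-module scores.
# Module scores taken from the Claude 4.5 Haiku row in the table.
module_scores = [83.56, 99.93, 70.66, 78.51]

average = sum(module_scores) / len(module_scores)
print(f"{average:.3f}%")  # ~83.165%, i.e. the 83.16% shown in the table
```

The same check holds for the other rows, e.g. Claude 4.5 Opus: (88.23 + 98.25 + 63.20 + 79.83) / 4 = 82.38.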

February 26th, 2026 — We added 5 new models to the benchmark: Claude 4.6 Opus, Claude 4.6 Sonnet, Gemini 3.1 Pro Preview, GPT 5.2, and Kimi K2.5.

| Rank | Model | Provider | Average | Module 1 | Module 2 | Module 3 | Module 4 |
|------|-------|----------|---------|----------|----------|----------|----------|
| #1 | Claude 4.5 Haiku | Anthropic | 83.16% | 83.56% | 99.93% | 70.66% | 78.51% |
| #2 | Claude 4.5 Opus | Anthropic | 82.38% | 88.23% | 98.25% | 63.20% | 79.83% |
| #3 | Claude 4.5 Sonnet | Anthropic | 77.60% | 87.00% | 99.05% | 49.14% | 75.23% |
| #4 | Claude 4.1 Opus | Anthropic | 76.87% | 86.19% | 96.31% | 43.61% | 81.35% |
| #5 | Llama 3.1 405B Instruct OR | Meta | 76.42% | 67.40% | 86.49% | 75.23% | 76.55% |
| #6 | Claude 4.6 Opus | Anthropic | 74.13% | 87.76% | 100.00% | 33.51% | 75.24% |
| #7 | Claude 4.6 Sonnet | Anthropic | 74.03% | 85.51% | 99.93% | 41.28% | 69.39% |
| #8 | Gemini 3.0 Pro Preview | Google | 73.31% | 81.02% | 93.50% | 53.65% | 65.06% |
| #9 | GPT 5 mini | OpenAI | 73.22% | 79.58% | 98.29% | 46.41% | 68.60% |
| #10 | GPT 5.1 | OpenAI | 72.82% | 81.84% | 96.92% | 46.77% | 65.76% |
| #11 | GPT 5.2 | OpenAI | 71.03% | 77.13% | 96.88% | 38.51% | 71.61% |
| #12 | Llama 4 Maverick | Meta | 70.84% | 71.47% | 89.25% | 73.65% | 48.99% |
| #13 | GPT 4o | OpenAI | 70.65% | 78.52% | 92.66% | 50.92% | 60.48% |
| #14 | Gemini 3.1 Pro Preview | Google | 70.63% | 87.59% | 95.19% | 48.10% | 51.64% |
| #15 | GPT 4.1 mini | OpenAI | 70.47% | 67.81% | 83.39% | 88.12% | 42.55% |
| #16 | GPT 5 nano | OpenAI | 69.53% | 76.37% | 97.41% | 34.70% | 69.64% |
| #17 | Claude 3.7 Sonnet | Anthropic | 69.48% | 85.17% | 95.52% | 33.77% | 63.45% |
| #18 | Claude 3.5 Haiku 20241022 | Anthropic | 69.08% | 78.04% | 95.36% | 38.08% | 64.82% |
| #19 | Qwen Plus | Alibaba Qwen | 68.85% | 75.58% | 94.14% | 55.70% | 49.98% |
| #20 | GPT 5 | OpenAI | 67.20% | 74.58% | 96.97% | 28.56% | 68.68% |
| #21 | GPT 4.1 | OpenAI | 67.11% | 77.64% | 92.30% | 52.41% | 46.10% |
| #22 | Kimi K2.5 | Moonshot AI | 67.05% | 80.69% | 97.20% | 29.00% | 61.30% |
| #23 | Grok 4 Fast No Reasoning | xAI | 66.55% | 63.40% | 81.34% | 80.26% | 41.20% |
| #24 | GPT OSS 120B | OpenAI | 66.14% | 66.08% | 93.75% | 38.84% | 65.90% |
| #25 | Qwen 3 Max | Alibaba Qwen | 65.93% | 75.20% | 95.40% | 44.77% | 48.35% |
| #26 | Deepseek V3.1 | Deepseek | 64.84% | 61.58% | 94.43% | 65.17% | 38.17% |
| #27 | Mistral Small 3.2 | Mistral | 64.52% | 63.40% | 87.87% | 73.90% | 32.89% |
| #28 | Gemini 2.0 Flash | Google | 64.19% | 71.59% | 94.30% | 53.51% | 37.35% |
| #29 | Qwen 3 8B | Alibaba Qwen | 63.88% | 60.38% | 87.37% | 58.64% | 49.13% |
| #30 | Llama 4 Scout | Meta | 63.82% | 58.20% | 81.04% | 67.10% | 48.93% |
| #31 | Gemini 2.5 Flash | Google | 63.64% | 72.13% | 93.66% | 51.74% | 37.05% |
| #32 | Llama 3.1 8B Instruct | Meta | 62.46% | 63.81% | 83.06% | 44.15% | 58.81% |
| #33 | Magistral Medium Latest | Mistral | 62.26% | 68.19% | 84.52% | 50.75% | 45.61% |
| #34 | Deepseek V3 | Deepseek | 62.24% | 70.91% | 89.00% | 59.75% | 29.31% |
| #35 | Qwen 2.5 Max | Alibaba Qwen | 62.22% | 69.80% | 89.89% | 42.95% | 46.26% |
| #36 | Grok 4 | xAI | 62.13% | 80.73% | 71.77% | 31.16% | 64.85% |
| #37 | Llama 3.3 70B Instruct OR | Meta | 61.98% | 68.83% | 86.04% | 45.39% | 47.68% |
| #38 | Mistral Large 3 | Mistral | 60.93% | 68.03% | 88.06% | 62.72% | 24.90% |
| #39 | Command A | Cohere | 60.92% | 71.89% | 91.36% | 45.59% | 34.82% |
| #40 | Deepseek V3 0324 | Deepseek | 60.87% | 62.59% | 92.80% | 55.26% | 32.82% |
| #41 | Gemini 2.5 Pro | Google | 60.55% | 74.82% | 92.18% | 36.52% | 38.67% |
| #42 | Grok 3 mini | xAI | 59.96% | 68.62% | 90.47% | 46.40% | 34.36% |
| #43 | Qwen 3 30B VL Instruct | Alibaba Qwen | 59.33% | 59.88% | 81.76% | 53.57% | 42.09% |
| #44 | Gemini 2.5 Flash Lite | Google | 58.91% | 68.09% | 79.15% | 45.53% | 42.84% |
| #45 | Mistral Large 2 | Mistral | 58.82% | 75.62% | 89.38% | 39.70% | 30.58% |
| #46 | Deepseek R1 0528 | Deepseek | 58.55% | 72.89% | 95.15% | 25.49% | 40.67% |
| #47 | GPT 4o mini | OpenAI | 58.32% | 65.97% | 77.29% | 37.90% | 52.14% |
| #48 | Mistral Medium Latest | Mistral | 58.11% | 69.63% | 92.32% | 39.66% | 30.83% |
| #49 | GPT 4.1 nano | OpenAI | 57.45% | 67.74% | 72.54% | 36.22% | 53.30% |
| #50 | Gemma 3 27B IT OR | Google | 57.44% | 60.30% | 91.36% | 38.01% | 40.07% |
| #51 | Gemini 2.0 Flash Lite | Google | 57.31% | 63.21% | 85.14% | 41.65% | 39.26% |
| #52 | Magistral Small Latest | Mistral | 55.94% | 62.44% | 76.23% | 48.02% | 37.09% |
| #53 | Gemma 3 12B IT OR | Google | 55.73% | 53.86% | 92.65% | 32.14% | 44.27% |
| #54 | Grok 2 | xAI | 54.58% | 66.80% | 91.44% | 33.10% | 26.97% |
| #55 | Grok 3 | xAI | 53.45% | 75.12% | 89.68% | 23.24% | 25.77% |
| — | Mistral Small 3.1* | Mistral | N/A | 70.00% | 90.91% | N/A | N/A |
| — | Claude 3.5 Sonnet* | Anthropic | N/A | 86.71% | 95.40% | N/A | N/A |
| — | Gemini 1.5 Pro* | Google | N/A | 78.07% | 96.84% | N/A | N/A |
* Models marked with an asterisk have partial scores.
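Consistent with the table, models with missing module scores get no overall average (shown as N/A) rather than an average over the modules they do have. A minimal sketch of that rule, assuming the overall score is the plain mean of the module scores:

```python
# Overall leaderboard score, handling partial results: models with any
# missing module score (N/A in the table) receive no overall average
# and are left unranked, rather than being averaged over fewer modules.
def overall_score(module_scores):
    """Mean of module scores, or None if any module score is missing."""
    if any(score is None for score in module_scores):
        return None  # rendered as "N/A" in the leaderboard
    return sum(module_scores) / len(module_scores)

# Mistral Small 3.1 has only two module scores in the table above.
print(overall_score([70.00, 90.91, None, None]))  # None -> shown as N/A
```

Averaging only the available modules would have inflated partial rows (e.g. Claude 3.5 Sonnet's two known scores average above 91%), so leaving them unranked is the conservative choice.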
