Phare LLM Benchmark

Phare is a multilingual benchmark to evaluate LLMs across key safety & security dimensions, including hallucination, factual accuracy, bias, potential harm, and jailbreak resistance.


Leaderboard

Note: each module score is the average of a model's scores across all tasks and languages for that module, and the overall Average used for ranking is the mean of the module scores. Higher scores are better.
The models benchmarked are instruction-tuned models that are generally available (i.e. no longer experimental).
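
As a rough illustration of this aggregation, the sketch below recomputes a single leaderboard entry from hypothetical per-task results. The module grouping follows the dimensions listed above, but the task scores, languages, and pipeline details are made up for the example and are not actual Phare data.

```python
from statistics import mean

# Hypothetical per-task scores for one model, grouped by module and language.
# Module names follow the dimensions listed above; all numbers are illustrative.
raw_scores = {
    "hallucination": {"en": [0.84, 0.81], "fr": [0.86, 0.83], "es": [0.82, 0.85]},
    "harmfulness":   {"en": [0.99, 0.98], "fr": [0.97, 1.00], "es": [0.99, 0.99]},
    "bias":          {"en": [0.72, 0.68], "fr": [0.70, 0.71], "es": [0.69, 0.73]},
    "jailbreak":     {"en": [0.78, 0.80], "fr": [0.77, 0.79], "es": [0.81, 0.76]},
}

# Module score: average over every task and language in that module.
module_scores = {
    module: mean(s for scores in by_language.values() for s in scores)
    for module, by_language in raw_scores.items()
}

# Overall score used for ranking: mean of the module scores.
average = mean(module_scores.values())

for module, score in module_scores.items():
    print(f"{module}: {score:.2%}")
print(f"Average: {average:.2%}")
```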

December 15th, 2025: we released an updated version of the jailbreak resistance module and added 33 new models to the benchmark, including 20 state-of-the-art reasoning models.

| Rank | Model | Provider | Average | Hallucination | Harmfulness | Bias | Jailbreak resistance |
|------|-------|----------|---------|---------------|-------------|------|----------------------|
| #1 | Claude 4.5 Haiku | Anthropic | 83.16% | 83.56% | 99.93% | 70.66% | 78.51% |
| #2 | Claude 4.5 Opus | Anthropic | 82.38% | 88.23% | 98.25% | 63.20% | 79.83% |
| #3 | Claude 4.5 Sonnet | Anthropic | 77.60% | 87.00% | 99.05% | 49.14% | 75.23% |
| #4 | Claude 4.1 Opus | Anthropic | 76.87% | 86.19% | 96.31% | 43.61% | 81.35% |
| #5 | Llama 3.1 405B Instruct OR | Meta | 76.42% | 67.40% | 86.49% | 75.23% | 76.55% |
| #6 | Gemini 3.0 Pro Preview | Google | 73.31% | 81.02% | 93.50% | 53.65% | 65.06% |
| #7 | GPT 5 mini | OpenAI | 73.22% | 79.58% | 98.29% | 46.41% | 68.60% |
| #8 | GPT 5.1 | OpenAI | 72.82% | 81.84% | 96.92% | 46.77% | 65.76% |
| #9 | Llama 4 Maverick | Meta | 70.84% | 71.47% | 89.25% | 73.65% | 48.99% |
| #10 | GPT-4o | OpenAI | 70.65% | 78.52% | 92.66% | 50.92% | 60.48% |
| #11 | GPT 4.1 mini | OpenAI | 70.47% | 67.81% | 83.39% | 88.12% | 42.55% |
| #12 | GPT 5 nano | OpenAI | 69.53% | 76.37% | 97.41% | 34.70% | 69.64% |
| #13 | Claude 3.7 Sonnet | Anthropic | 69.48% | 85.17% | 95.52% | 33.77% | 63.45% |
| #14 | Claude 3.5 Haiku 20241022 | Anthropic | 69.08% | 78.04% | 95.36% | 38.08% | 64.82% |
| #15 | Qwen Plus | Alibaba Qwen | 68.85% | 75.58% | 94.14% | 55.70% | 49.98% |
| #16 | GPT 5 | OpenAI | 67.20% | 74.58% | 96.97% | 28.56% | 68.68% |
| #17 | GPT 4.1 | OpenAI | 67.11% | 77.64% | 92.30% | 52.41% | 46.10% |
| #18 | Grok 4 Fast No Reasoning | xAI | 66.55% | 63.40% | 81.34% | 80.26% | 41.20% |
| #19 | GPT OSS 120B | OpenAI | 66.14% | 66.08% | 93.75% | 38.84% | 65.90% |
| #20 | Qwen 3 Max | Alibaba Qwen | 65.93% | 75.20% | 95.40% | 44.77% | 48.35% |
| #21 | Deepseek V3.1 | Deepseek | 64.84% | 61.58% | 94.43% | 65.17% | 38.17% |
| #22 | Mistral Small 3.2 | Mistral | 64.52% | 63.40% | 87.87% | 73.90% | 32.89% |
| #23 | Gemini 2.0 Flash | Google | 64.19% | 71.59% | 94.30% | 53.51% | 37.35% |
| #24 | Qwen 3 8B | Alibaba Qwen | 63.88% | 60.38% | 87.37% | 58.64% | 49.13% |
| #25 | Llama 4 Scout | Meta | 63.82% | 58.20% | 81.04% | 67.10% | 48.93% |
| #26 | Gemini 2.5 Flash | Google | 63.64% | 72.13% | 93.66% | 51.74% | 37.05% |
| #27 | Llama 3.1 8B Instruct | Meta | 62.46% | 63.81% | 83.06% | 44.15% | 58.81% |
| #28 | Magistral Medium Latest | Mistral | 62.26% | 68.19% | 84.52% | 50.75% | 45.61% |
| #29 | Deepseek V3 | Deepseek | 62.24% | 70.91% | 89.00% | 59.75% | 29.31% |
| #30 | Qwen 2.5 Max | Alibaba Qwen | 62.22% | 69.80% | 89.89% | 42.95% | 46.26% |
| #31 | Llama 3.3 70B Instruct OR | Meta | 61.98% | 68.83% | 86.04% | 45.39% | 47.68% |
| #32 | Deepseek V3 0324 | Deepseek | 60.87% | 62.59% | 92.80% | 55.26% | 32.82% |
| #33 | Gemini 2.5 Pro | Google | 60.55% | 74.82% | 92.18% | 36.52% | 38.67% |
| #34 | Grok 3 mini | xAI | 59.96% | 68.62% | 90.47% | 46.40% | 34.36% |
| #35 | Qwen 3 30B VL Instruct | Alibaba Qwen | 59.33% | 59.88% | 81.76% | 53.57% | 42.09% |
| #36 | Gemini 2.5 Flash Lite | Google | 58.91% | 68.09% | 79.15% | 45.53% | 42.84% |
| #37 | Mistral Large 2 | Mistral | 58.82% | 75.62% | 89.38% | 39.70% | 30.58% |
| #38 | Deepseek R1 0528 | Deepseek | 58.55% | 72.89% | 95.15% | 25.49% | 40.67% |
| #39 | GPT-4o mini | OpenAI | 58.32% | 65.97% | 77.29% | 37.90% | 52.14% |
| #40 | Mistral Medium Latest | Mistral | 58.11% | 69.63% | 92.32% | 39.66% | 30.83% |
| #41 | GPT 4.1 nano | OpenAI | 57.45% | 67.74% | 72.54% | 36.22% | 53.30% |
| #42 | Gemma 3 27B IT OR | Google | 57.44% | 60.30% | 91.36% | 38.01% | 40.07% |
| #43 | Gemini 2.0 Flash Lite | Google | 57.31% | 63.21% | 85.14% | 41.65% | 39.26% |
| #44 | Magistral Small Latest | Mistral | 55.94% | 62.44% | 76.23% | 48.02% | 37.09% |
| #45 | Gemma 3 12B IT OR | Google | 55.73% | 53.86% | 92.65% | 32.14% | 44.27% |
| #46 | Grok 2 | xAI | 54.58% | 66.80% | 91.44% | 33.10% | 26.97% |
| #47 | Grok 3 | xAI | 53.45% | 75.12% | 89.68% | 23.24% | 25.77% |
| | Mistral Small 3.1* | Mistral | N/A | 70.00% | 90.91% | N/A | N/A |
| | Claude 3.5 Sonnet* | Anthropic | N/A | 86.71% | 95.40% | N/A | N/A |
| | Gemini 1.5 Pro* | Google | N/A | 78.07% | 96.84% | N/A | N/A |
* Models marked with an asterisk have partial scores.
