Phare LLM Benchmark

Phare is a multilingual benchmark for evaluating LLMs across key safety & security dimensions, including hallucination, factual accuracy, bias, and potential harm. It is developed by Giskard in collaboration with Google DeepMind.


Leaderboard

Note: The scores are computed by averaging the scores across all tasks and languages for each module. Higher is better.

April 14, 2025: we have updated the benchmark with Llama 4 Maverick, Grok 2, and the new DeepSeek V3 (0324)! We will be adding new models regularly.
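The overall score in the leaderboard's first score column appears to be the plain mean of the three module scores (e.g. for Gemini 1.5 Pro: (87.06 + 96.84 + 77.96) / 3 ≈ 87.29). Below is a minimal sketch of that aggregation, assuming an unweighted mean; it is an illustration of the arithmetic, not Giskard's actual scoring pipeline, and the values are copied from the Gemini 1.5 Pro row.

```python
# Illustration only: reproduce a leaderboard "Average" from its module scores.
# Assumes each module score is itself the mean over that module's tasks and
# languages, and that the overall score is the unweighted mean of the modules.

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Module scores copied from the Gemini 1.5 Pro row of the table below.
module_scores = [87.06, 96.84, 77.96]

overall = mean(module_scores)
print(f"Overall: {overall:.2f}%")  # prints "Overall: 87.29%", matching the first column
```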

| Rank | Model | Provider | Average | Hallucination | Harmfulness | Bias & Fairness |
|------|-------|----------|---------|---------------|-------------|-----------------|
| #1 | Gemini 1.5 Pro | Google | 87.29% | 87.06% | 96.84% | 77.96% |
| #2 | Claude 3.5 Haiku | Anthropic | 82.72% | 86.97% | 95.36% | 65.81% |
| #3 | Llama 3.1 405B | Meta | 77.59% | 75.54% | 86.49% | 70.74% |
| #4 | Llama 4 Maverick | Meta | 76.72% | 77.02% | 89.25% | 63.89% |
| #5 | Claude 3.5 Sonnet | Anthropic | 75.62% | 91.09% | 95.40% | 40.37% |
| #6 | Claude 3.7 Sonnet | Anthropic | 75.53% | 89.26% | 95.52% | 41.82% |
| #7 | Gemma 3 27B | Google | 75.23% | 69.90% | 91.36% | 64.44% |
| #8 | Gemini 2.0 Flash | Google | 74.89% | 78.13% | 94.30% | 52.22% |
| #9 | DeepSeek V3 (0324) | DeepSeek | 73.92% | 77.86% | 92.80% | 51.11% |
| #10 | GPT-4o | OpenAI | 72.80% | 83.89% | 92.66% | 41.85% |
| #11 | Qwen 2.5 Max | Alibaba Qwen | 72.71% | 77.12% | 89.89% | 51.11% |
| #12 | DeepSeek V3 | DeepSeek | 70.77% | 77.91% | 89.00% | 45.39% |
| #13 | Llama 3.3 70B | Meta | 67.97% | 73.41% | 86.04% | 44.44% |
| #14 | Mistral Small 3.1 24B | Mistral | 67.88% | 77.72% | 90.91% | 35.00% |
| #15 | Mistral Large | Mistral | 66.00% | 79.72% | 89.38% | 28.89% |
| #16 | Grok 2 | xAI | 65.15% | 77.35% | 91.44% | 26.67% |
| #17 | GPT-4o mini | OpenAI | 63.93% | 74.50% | 77.29% | 40.00% |

Explore the results by task