Phare LLM Benchmark

Phare is a multilingual benchmark that evaluates LLMs across key safety and security dimensions, including hallucination, factual accuracy, bias, and potential harm. It was developed by Giskard in collaboration with Google DeepMind.


Leaderboard

Note: Each module score is computed by averaging scores across all tasks and languages for that module. Higher is better.

These are preliminary results as of March 31, 2025. We will update the leaderboard as new models are released. Contact us at [email protected] if you have questions.

| Rank | Model                 | Provider | Average | Module 1 | Module 2 | Module 3 |
|------|-----------------------|----------|---------|----------|----------|----------|
| #1   | Gemini 1.5 Pro        | Google   | 87.29%  | 87.06%   | 96.84%   | 77.96%   |
| #2   | Claude 3.5 Haiku      | Anthropic | 82.72% | 86.97%   | 95.36%   | 65.81%   |
| #3   | Llama 3.1 405B        | Meta     | 77.59%  | 75.54%   | 86.49%   | 70.74%   |
| #4   | Claude 3.5 Sonnet     | Anthropic | 75.62% | 91.09%   | 95.40%   | 40.37%   |
| #5   | Claude 3.7 Sonnet     | Anthropic | 75.53% | 89.26%   | 95.52%   | 41.82%   |
| #6   | Gemma 3 27B           | Google   | 75.23%  | 69.90%   | 91.36%   | 64.44%   |
| #7   | Gemini 2.0 Flash      | Google   | 74.89%  | 78.13%   | 94.30%   | 52.22%   |
| #8   | GPT-4o                | OpenAI   | 72.80%  | 83.89%   | 92.66%   | 41.85%   |
| #9   | Qwen 2.5 Max          | Alibaba Qwen | 72.71% | 77.12% | 89.89%   | 51.11%   |
| #10  | Deepseek V3           | Deepseek | 70.77%  | 77.91%   | 89.00%   | 45.39%   |
| #11  | Llama 3.3 70B         | Meta     | 67.97%  | 73.41%   | 86.04%   | 44.44%   |
| #12  | Mistral Small 3.1 24B | Mistral  | 67.88%  | 77.72%   | 90.91%   | 35.00%   |
| #13  | Mistral Large         | Mistral  | 66.00%  | 79.72%   | 89.38%   | 28.89%   |
| #14  | GPT-4o mini           | OpenAI   | 63.93%  | 74.50%   | 77.29%   | 40.00%   |
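As a rough consistency check, the Average column agrees with an unweighted mean of the three module scores, up to rounding of the published figures. A minimal sketch in Python (the function name is ours, not part of the benchmark), using Gemini 1.5 Pro's row as an example:

```python
def average_score(module_scores):
    """Unweighted mean of per-module percentages, rounded to 2 decimals."""
    return round(sum(module_scores) / len(module_scores), 2)

# Gemini 1.5 Pro's three module scores from the table above:
print(average_score([87.06, 96.84, 77.96]))  # 87.29, matching the Average column
```

Small discrepancies are possible for other rows, since the published per-module figures are themselves rounded before display.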

Explore the results by task