Leaderboard
Note: The scores are computed by averaging the scores across all tasks and languages for each module. Higher is better.
April 14, 2025: we have updated the benchmark with Llama 4 Maverick, Grok 2, and the new Deepseek V3 (0324). We will be adding new models regularly.
Rank | Model | Provider | Average | Module 1 | Module 2 | Module 3 |
---|---|---|---|---|---|---|
#1 | Gemini 1.5 Pro | Google | 87.29% | 87.06% | 96.84% | 77.96% |
#2 | Claude 3.5 Haiku | Anthropic | 82.72% | 86.97% | 95.36% | 65.81% |
#3 | Llama 3.1 405B | Meta | 77.59% | 75.54% | 86.49% | 70.74% |
#4 | Llama 4 Maverick | Meta | 76.72% | 77.02% | 89.25% | 63.89% |
#5 | Claude 3.5 Sonnet | Anthropic | 75.62% | 91.09% | 95.40% | 40.37% |
#6 | Claude 3.7 Sonnet | Anthropic | 75.53% | 89.26% | 95.52% | 41.82% |
#7 | Gemma 3 27B | Google | 75.23% | 69.90% | 91.36% | 64.44% |
#8 | Gemini 2.0 Flash | Google | 74.89% | 78.13% | 94.30% | 52.22% |
#9 | Deepseek V3 (0324) | Deepseek | 73.92% | 77.86% | 92.80% | 51.11% |
#10 | GPT-4o | OpenAI | 72.80% | 83.89% | 92.66% | 41.85% |
#11 | Qwen 2.5 Max | Alibaba Qwen | 72.71% | 77.12% | 89.89% | 51.11% |
#12 | Deepseek V3 | Deepseek | 70.77% | 77.91% | 89.00% | 45.39% |
#13 | Llama 3.3 70B | Meta | 67.97% | 73.41% | 86.04% | 44.44% |
#14 | Mistral Small 3.1 24B | Mistral | 67.88% | 77.72% | 90.91% | 35.00% |
#15 | Mistral Large | Mistral | 66.00% | 79.72% | 89.38% | 28.89% |
#16 | Grok 2 | xAI | 65.15% | 77.35% | 91.44% | 26.67% |
#17 | GPT-4o mini | OpenAI | 63.93% | 74.50% | 77.29% | 40.00% |
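The Average column appears to be the unweighted mean of the three module scores (e.g. Gemini 1.5 Pro: 87.06%, 96.84%, and 77.96% average to 87.29%). A minimal Python sketch of that computation, using the top row's values from the table:

```python
# Module scores for Gemini 1.5 Pro, taken from the leaderboard table above.
module_scores = [87.06, 96.84, 77.96]

# Assumption: the overall score is the unweighted mean of the module scores,
# rounded to two decimal places.
overall = round(sum(module_scores) / len(module_scores), 2)
print(f"{overall}%")  # 87.29%
```

Small discrepancies of ±0.01 in other rows are consistent with rounding of the per-module scores before display.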