Leaderboard
Note: The scores are computed by averaging the scores across all tasks and languages for each module. Higher scores are better.
The models benchmarked are instruction-tuned models that are generally available (i.e., past the experimental or preview phase). We currently do not include reasoning models.
July 1, 2025: we released the new jailbreak resistance module and updated the bias resistance evaluation with a more advanced methodology.
| Rank | Model | Provider | Average | Hallucination | Harmfulness | Bias | Jailbreak Resistance |
|---|---|---|---|---|---|---|---|
| #1 | Llama 3.1 405B | Meta | 85.80% | 76.83% | 86.49% | 95.93% | 83.97% | 
| #2 | Gemini 1.5 Pro | Google | 79.12% | 86.41% | 96.84% | 93.70% | 39.53% | 
| #3 | Llama 4 Maverick | Meta | 77.63% | 81.14% | 89.25% | 93.13% | 47.02% | 
| #4 | Claude 3.5 Haiku | Anthropic | 77.20% | 86.33% | 95.36% | 67.98% | 59.11% | 
| #5 | GPT-4o | OpenAI | 76.93% | 85.64% | 92.66% | 66.48% | 62.95% | 
| #6 | Claude 3.5 Sonnet | Anthropic | 76.13% | 91.70% | 95.40% | 53.67% | 63.76% | 
| #7 | Claude 3.7 Sonnet | Anthropic | 75.73% | 89.86% | 95.52% | 61.10% | 56.43% | 
| #8 | Gemini 2.0 Flash | Google | 75.69% | 81.43% | 94.30% | 85.37% | 41.65% | 
| #9 | DeepSeek V3 | DeepSeek | 71.49% | 78.77% | 89.00% | 86.24% | 31.96% | 
| #10 | Llama 3.3 70B | Meta | 70.49% | 75.28% | 86.04% | 66.56% | 54.08% | 
| #11 | Qwen 2.5 Max | Alibaba Qwen | 70.20% | 76.91% | 89.89% | 66.22% | 47.80% | 
| #12 | Gemma 3 27B | Google | 69.79% | 69.48% | 91.36% | 78.59% | 39.71% | 
| #13 | Mistral Small 3.1 24B | Mistral | 69.08% | 77.68% | 90.91% | 72.83% | 34.91% | 
| #14 | DeepSeek V3 (0324) | DeepSeek | 68.97% | 73.87% | 92.80% | 74.96% | 34.25% | 
| #15 | GPT-4o mini | OpenAI | 67.06% | 74.43% | 77.29% | 60.74% | 55.78% | 
| #16 | Mistral Large | Mistral | 64.15% | 79.86% | 89.38% | 49.31% | 38.06% | 
| #17 | Grok 2 | xAI | 61.38% | 77.20% | 91.44% | 49.56% | 27.32% | 
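As a sanity check on the table, the Average column matches the unweighted mean of the four module scores on every row (e.g., Llama 3.1 405B: (76.83 + 86.49 + 95.93 + 83.97) / 4 = 85.805 ≈ 85.80%). The Python sketch below makes this assumed aggregation explicit; the per-task/per-language averaging step and the module names are assumptions drawn from the notes above, not the benchmark's published pipeline.

```python
from statistics import mean

def module_score(scores: list[float]) -> float:
    """Per-module score: unweighted mean across all tasks and languages
    (assumed reading of the note at the top of the leaderboard)."""
    return mean(scores)

def average_score(module_scores: dict[str, float]) -> float:
    """Average column: unweighted mean of the module scores
    (consistent with every row of the table above)."""
    return mean(module_scores.values())

# Check against the Llama 3.1 405B row; the module names and their
# order are assumptions, not confirmed by the page.
llama_3_1_405b = {
    "hallucination": 76.83,
    "harmfulness": 86.49,
    "bias": 95.93,
    "jailbreak_resistance": 83.97,
}
avg = average_score(llama_3_1_405b)
assert abs(avg - 85.80) < 0.01  # (76.83 + 86.49 + 95.93 + 83.97) / 4 = 85.805
```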