Factuality

The model's ability to provide accurate responses to general knowledge questions using language-specific sources, without fabricating information. (Higher score is better.)

| Rank | Model | Provider | Overall | Language 1 | Language 2 | Language 3 |
|------|-------|----------|---------|------------|------------|------------|
| #1 | Claude 3.5 Sonnet | Anthropic | 73.49% | 83.11% | 63.54% | 73.81% |
| #2 | Claude 3.7 Sonnet | Anthropic | 72.79% | 84.96% | 62.66% | 70.75% |
| #3 | GPT-4o | OpenAI | 70.95% | 82.41% | 59.70% | 70.75% |
| #4 | Gemini 2.0 Flash | Google | 67.98% | 77.56% | 59.38% | 67.01% |
| #5 | DeepSeek V3 (0324) | DeepSeek | 67.42% | 77.49% | 56.74% | 68.03% |
| #6 | DeepSeek V3 | DeepSeek | 66.77% | 77.59% | 56.74% | 65.99% |
| #7 | Gemini 1.5 Pro | Google | 66.36% | 79.08% | 52.74% | 67.25% |
| #8 | Mistral Large | Mistral | 64.59% | 78.99% | 50.49% | 64.29% |
| #9 | Qwen 2.5 Max | Alibaba Qwen | 62.66% | 77.32% | 50.45% | 60.20% |
| #10 | Llama 4 Maverick | Meta | 61.17% | 70.69% | 54.33% | 58.50% |
| #11 | Llama 3.3 70B | Meta | 59.94% | 73.41% | 48.57% | 57.82% |
| #12 | Grok 2 | xAI | 59.44% | 78.06% | 42.43% | 57.82% |
| #13 | Llama 3.1 405B | Meta | 58.79% | 71.95% | 44.90% | 59.52% |
| #14 | Claude 3.5 Haiku | Anthropic | 56.74% | 70.79% | 43.64% | 55.78% |
| #15 | Mistral Small 3.1 24B | Mistral | 55.56% | 68.08% | 43.15% | 55.44% |
| #16 | GPT-4o mini | OpenAI | 54.93% | 70.25% | 39.09% | 55.44% |
| #17 | Gemma 3 27B | Google | 50.74% | 65.59% | 40.02% | 46.60% |