Factuality

The model's ability to provide accurate responses to general knowledge questions using language-specific sources, without fabricating information. (Higher score is better.)

| Rank | Model | Provider | Average | Score 1 | Score 2 | Score 3 |
|------|-------|----------|---------|---------|---------|---------|
| #1 | Claude 3.7 Sonnet | Anthropic | 74.79% | 84.96% | 66.35% | 73.05% |
| #2 | Claude 3.5 Sonnet | Anthropic | 74.46% | 82.64% | 65.26% | 75.47% |
| #3 | GPT-4o | OpenAI | 73.70% | 83.04% | 65.23% | 72.83% |
| #4 | Gemini 2.0 Flash | Google | 68.52% | 77.86% | 59.24% | 68.47% |
| #5 | DeepSeek V3 | DeepSeek | 67.64% | 76.96% | 59.27% | 66.70% |
| #6 | DeepSeek V3 (0324) | DeepSeek | 67.45% | 78.02% | 57.63% | 66.70% |
| #7 | Gemini 1.5 Pro | Google | 66.02% | 78.60% | 52.16% | 67.30% |
| #8 | Mistral Large | Mistral | 65.56% | 79.16% | 53.25% | 64.28% |
| #9 | Qwen 2.5 Max | Alibaba Qwen | 63.86% | 77.53% | 53.07% | 60.99% |
| #10 | Grok 2 | xAI | 63.05% | 78.70% | 47.89% | 62.55% |
| #11 | Llama 4 Maverick | Meta | 62.13% | 70.40% | 55.88% | 60.12% |
| #12 | Claude 3.5 Haiku | Anthropic | 59.95% | 72.65% | 47.51% | 59.68% |
| #13 | Llama 3.1 405B | Meta | 59.78% | 72.10% | 47.78% | 59.47% |
| #14 | Llama 3.3 70B | Meta | 59.47% | 71.19% | 51.02% | 56.20% |
| #15 | GPT-4o mini | OpenAI | 57.57% | 68.89% | 46.96% | 56.86% |
| #16 | Mistral Small 3.1 24B | Mistral | 57.56% | 68.27% | 49.53% | 54.88% |
| #17 | Gemma 3 27B | Google | 51.30% | 66.55% | 40.32% | 47.03% |