Factuality

Measures the model's ability to answer general-knowledge questions accurately, using language-specific sources and without fabricating information. (A higher score is better.)

Rank  Model                  Provider
#1    Claude 3.7 Sonnet      Anthropic     74.79%  84.96%  66.35%  73.05%
#2    Claude 3.5 Sonnet      Anthropic     74.46%  82.64%  65.26%  75.47%
#3    GPT-4o                 OpenAI        73.70%  83.04%  65.23%  72.83%
#4    Gemini 2.0 Flash       Google        68.52%  77.86%  59.24%  68.47%
#5    Deepseek V3            Deepseek      67.64%  76.96%  59.27%  66.70%
#6    Gemini 1.5 Pro         Google        66.02%  78.60%  52.16%  67.30%
#7    Mistral Large          Mistral       65.56%  79.16%  53.25%  64.28%
#8    Qwen 2.5 Max           Alibaba Qwen  63.86%  77.53%  53.07%  60.99%
#9    Claude 3.5 Haiku       Anthropic     59.95%  72.65%  47.51%  59.68%
#10   Llama 3.1 405B         Meta          59.78%  72.10%  47.78%  59.47%
#11   Llama 3.3 70B          Meta          59.47%  71.19%  51.02%  56.20%
#12   GPT-4o mini            OpenAI        57.57%  68.89%  46.96%  56.86%
#13   Mistral Small 3.1 24B  Mistral       57.56%  68.27%  49.53%  54.88%
#14   Gemma 3 27B            Google        51.30%  66.55%  40.32%  47.03%
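
The four percentage columns carry no labels in the source. For every model listed, however, the first percentage agrees with the arithmetic mean of the other three to within rounding, which suggests (as an assumption, not something the source states) that the ranking score is an unweighted average of three sub-scores. A minimal Python sketch of that check follows; the names ROWS and mismatches are hypothetical, and the figures are copied from the leaderboard.

```python
# Rows copied from the leaderboard: (model, first score, remaining three scores).
# Assumption: the first score is the unweighted mean of the other three.
# The source does not label these columns, so this is only an observed pattern.
ROWS = [
    ("Claude 3.7 Sonnet", 74.79, [84.96, 66.35, 73.05]),
    ("Claude 3.5 Sonnet", 74.46, [82.64, 65.26, 75.47]),
    ("GPT-4o", 73.70, [83.04, 65.23, 72.83]),
    ("Gemini 2.0 Flash", 68.52, [77.86, 59.24, 68.47]),
    ("Deepseek V3", 67.64, [76.96, 59.27, 66.70]),
    ("Gemini 1.5 Pro", 66.02, [78.60, 52.16, 67.30]),
    ("Mistral Large", 65.56, [79.16, 53.25, 64.28]),
    ("Qwen 2.5 Max", 63.86, [77.53, 53.07, 60.99]),
    ("Claude 3.5 Haiku", 59.95, [72.65, 47.51, 59.68]),
    ("Llama 3.1 405B", 59.78, [72.10, 47.78, 59.47]),
    ("Llama 3.3 70B", 59.47, [71.19, 51.02, 56.20]),
    ("GPT-4o mini", 57.57, [68.89, 46.96, 56.86]),
    ("Mistral Small 3.1 24B", 57.56, [68.27, 49.53, 54.88]),
    ("Gemma 3 27B", 51.30, [66.55, 40.32, 47.03]),
]


def mismatches(rows, tolerance=0.006):
    """Return the models whose first score differs from the mean of the rest.

    The tolerance allows for the scores being rounded to two decimals.
    """
    bad = []
    for model, first, rest in rows:
        mean = sum(rest) / len(rest)
        if abs(mean - first) > tolerance:
            bad.append(model)
    return bad


if __name__ == "__main__":
    # An empty list means every row is consistent with the averaging assumption.
    print(mismatches(ROWS))
```

With the figures above, the check returns an empty list, so all 14 rows are consistent with the averaging assumption. This does not reveal what the three sub-scores measure.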