Factuality

Measures the model's ability to answer general-knowledge questions accurately, drawing on language-specific sources and without fabricating information. (Higher score is better.)

The source labels only Rank, Model, and Provider. The score headers below are placeholders: the first score in each row matches the mean of the other three to within 0.01, so it is labeled Overall.

| Rank | Model | Provider | Overall | Sub-score 1 | Sub-score 2 | Sub-score 3 |
| #1 | Gemini 3.1 Pro Preview | Google | 85.90% | 89.68% | 85.71% | 82.31% |
| #2 | Gemini 3.0 Pro Preview | Google | 83.30% | 88.61% | 84.76% | 76.53% |
| #3 | GPT 5.1 | OpenAI | 78.20% | 85.77% | 78.10% | 70.75% |
| #4 | GPT 5 | OpenAI | 78.16% | 85.77% | 75.24% | 73.47% |
| #5 | Claude 4.6 Opus | Anthropic | 77.90% | 88.61% | 73.33% | 71.77% |
| #6 | Gemini 2.5 Pro | Google | 77.63% | 83.99% | 77.14% | 71.77% |
| #7 | GPT 5.2 | OpenAI | 77.59% | 85.41% | 75.24% | 72.11% |
| #8 | Grok 4 | xAI | 76.85% | 84.70% | 72.38% | 73.47% |
| #9 | GPT 4.1 | OpenAI | 74.75% | 83.63% | 69.52% | 71.09% |
| #10 | Claude 4.5 Opus | Anthropic | 74.70% | 86.48% | 65.71% | 71.92% |
| #11 | Claude 4.6 Sonnet | Anthropic | 74.20% | 85.77% | 68.57% | 68.26% |
| #12 | Grok 3 | xAI | 74.19% | 81.49% | 67.62% | 73.47% |
| #13 | Kimi K2.5 | Moonshot AI | 73.72% | 87.54% | 62.86% | 70.75% |
| #14 | Claude 3.5 Sonnet | Anthropic | 73.61% | 83.21% | 63.81% | 73.81% |
| #15 | Claude 3.7 Sonnet | Anthropic | 72.89% | 85.05% | 62.86% | 70.75% |
| #16 | Claude 4.1 Opus | Anthropic | 72.03% | 83.99% | 64.76% | 67.35% |
| #17 | GPT 4o | OpenAI | 71.10% | 82.56% | 60.00% | 70.75% |
| #18 | Claude 4.5 Sonnet | Anthropic | 70.04% | 83.63% | 60.95% | 65.53% |
| #19 | Deepseek R1 0528 | Deepseek | 68.37% | 80.00% | 62.86% | 62.24% |
| #20 | Gemini 2.0 Flash | Google | 68.31% | 77.94% | 60.00% | 67.01% |
| #21 | Deepseek V3 0324 | Deepseek | 67.70% | 77.94% | 57.14% | 68.03% |
| #22 | Deepseek V3 | Deepseek | 67.02% | 77.94% | 57.14% | 65.99% |
| #23 | Gemini 1.5 Pro | Google | 66.64% | 79.36% | 53.33% | 67.24% |
| #24 | Mistral Large 3 | Mistral | 66.40% | 77.58% | 59.05% | 62.59% |
| #25 | Gemini 2.5 Flash | Google | 66.34% | 79.36% | 58.10% | 61.56% |
| #26 | Mistral Large 2 | Mistral | 65.02% | 79.36% | 51.43% | 64.29% |
| #27 | Qwen 3 Max | Alibaba Qwen | 64.69% | 77.94% | 56.19% | 59.93% |
| #28 | Mistral Medium Latest | Mistral | 63.91% | 77.22% | 52.38% | 62.12% |
| #29 | Grok 3 mini | xAI | 63.72% | 76.87% | 52.38% | 61.90% |
| #30 | GPT 5 mini | OpenAI | 63.04% | 78.29% | 48.57% | 62.24% |
| #31 | Qwen 2.5 Max | Alibaba Qwen | 62.92% | 77.58% | 50.96% | 60.20% |
| #32 | Deepseek V3.1 | Deepseek | 62.07% | 77.22% | 50.48% | 58.50% |
| #33 | Llama 4 Maverick | Meta | 61.52% | 70.82% | 55.24% | 58.50% |
| #34 | Command A | Cohere | 60.88% | 72.24% | 49.52% | 60.88% |
| #35 | Llama 3.3 70B Instruct OR | Meta | 60.34% | 73.67% | 49.52% | 57.82% |
| #36 | Gemini 2.0 Flash Lite | Google | 59.93% | 71.68% | 47.62% | 60.48% |
| #37 | Grok 2 | xAI | 59.66% | 78.29% | 42.86% | 57.82% |
| #38 | Llama 3.1 405B Instruct OR | Meta | 59.16% | 72.24% | 45.71% | 59.52% |
| #39 | GPT 4.1 mini | OpenAI | 58.58% | 70.11% | 47.62% | 58.02% |
| #40 | Qwen Plus | Alibaba Qwen | 57.75% | 73.31% | 48.57% | 51.36% |
| #41 | Claude 3.5 Haiku 20241022 | Anthropic | 56.80% | 70.82% | 43.81% | 55.78% |
| #42 | Magistral Medium Latest | Mistral | 56.40% | 71.17% | 37.14% | 60.88% |
| #43 | Mistral Small 3.1 | Mistral | 55.86% | 68.33% | 43.81% | 55.44% |
| #44 | Grok 4 Fast No Reasoning | xAI | 55.56% | 70.36% | 38.10% | 58.22% |
| #45 | GPT 4o mini | OpenAI | 54.98% | 70.46% | 39.05% | 55.44% |
| #46 | Gemini 2.5 Flash Lite | Google | 54.67% | 65.84% | 44.76% | 53.40% |
| #47 | Claude 4.5 Haiku | Anthropic | 54.49% | 67.62% | 43.81% | 52.04% |
| #48 | GPT 5 nano | OpenAI | 53.15% | 66.19% | 37.14% | 56.12% |
| #49 | Mistral Small 3.2 | Mistral | 52.58% | 67.62% | 38.10% | 52.04% |
| #50 | Magistral Small Latest | Mistral | 51.91% | 63.70% | 40.00% | 52.04% |
| #51 | GPT OSS 120B | OpenAI | 51.61% | 64.41% | 43.81% | 46.60% |
| #52 | Gemma 3 27B IT OR | Google | 51.01% | 65.48% | 40.95% | 46.60% |
| #53 | Llama 4 Scout | Meta | 46.22% | 58.72% | 35.24% | 44.71% |
| #54 | GPT 4.1 nano | OpenAI | 45.46% | 61.57% | 35.24% | 39.59% |
| #55 | Qwen 3 30B VL Instruct | Alibaba Qwen | 44.86% | 60.71% | 32.38% | 41.50% |
| #56 | Gemma 3 12B IT OR | Google | 38.39% | 52.67% | 26.67% | 35.84% |
| #57 | Llama 3.1 8B Instruct | Meta | 34.97% | 48.75% | 24.76% | 31.40% |
| #58 | Qwen 3 8B | Alibaba Qwen | 31.84% | 43.06% | 22.86% | 29.59% |
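The page does not say how the four percentages in each entry relate to one another. Judging from the numbers alone, the first percentage looks like the unweighted mean of the three that follow it. The Python sketch below checks that reading on a few entries copied from the leaderboard. The names `ROWS`, `mean_gap`, and `check_rows` are hypothetical, and the tolerance is an assumption chosen because the published values are rounded to two decimals.

```python
# Sample entries copied from the leaderboard:
# (model, first percentage, the three percentages that follow it).
ROWS = [
    ("Gemini 3.1 Pro Preview", 85.90, (89.68, 85.71, 82.31)),
    ("GPT 5.1", 78.20, (85.77, 78.10, 70.75)),
    ("Claude 4.6 Opus", 77.90, (88.61, 73.33, 71.77)),
    ("Qwen 3 8B", 31.84, (43.06, 22.86, 29.59)),
]


def mean_gap(first, rest):
    """Absolute difference between the first score and the mean of the rest."""
    return abs(first - sum(rest) / len(rest))


def check_rows(rows, tolerance=0.011):
    """Return {model: gap} for entries whose gap exceeds the tolerance.

    The tolerance is an assumption: slightly above 0.01 to absorb
    two-decimal rounding in the published values.
    """
    return {
        model: round(mean_gap(first, rest), 4)
        for model, first, rest in rows
        if mean_gap(first, rest) > tolerance
    }


if __name__ == "__main__":
    for model, first, rest in ROWS:
        print(f"{model}: gap = {mean_gap(first, rest):.4f}")
    print("entries outside tolerance:", check_rows(ROWS))
```

An empty result from `check_rows` is consistent with the mean interpretation for these entries. It does not confirm how the benchmark actually aggregates its scores.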