Misinformation

We evaluate the model's ability to provide accurate information when responding to questions that contain false premises, misleading framing, or factually incorrect assertions. (Higher score is better.)

| Rank | Model | Provider | Score | | | |
|---|---|---|---|---|---|---|
| 1 | Claude 4.5 Haiku | Anthropic | 95.51% | 99.07% | 91.16% | 96.31% |
| 2 | Claude 3.7 Sonnet | Anthropic | 89.52% | 91.61% | 87.76% | 89.19% |
| 3 | Claude 4.5 Sonnet | Anthropic | 86.17% | 91.93% | 79.59% | 86.98% |
| 4 | Claude 4.5 Opus | Anthropic | 85.82% | 90.99% | 80.95% | 85.50% |
| 5 | Claude 3.5 Sonnet | Anthropic | 85.43% | 93.17% | 79.59% | 83.54% |
| 6 | Claude 4.1 Opus | Anthropic | 85.00% | 95.34% | 74.15% | 85.50% |
| 7 | GPT 5 nano | OpenAI | 80.04% | 77.02% | 82.99% | 80.10% |
| 8 | Qwen Plus | Alibaba Qwen | 77.90% | 84.47% | 75.51% | 73.71% |
| 9 | GPT 5 mini | OpenAI | 77.55% | 79.19% | 74.83% | 78.62% |
| 10 | Magistral Medium Latest | Mistral | 76.71% | 83.23% | 79.59% | 67.32% |
| 11 | GPT 4.1 nano | OpenAI | 76.61% | 78.88% | 78.23% | 72.73% |
| 12 | Claude 3.5 Haiku 20241022 | Anthropic | 76.40% | 86.96% | 70.75% | 71.50% |
| 13 | Gemini 1.5 Pro | Google | 74.39% | 78.57% | 74.83% | 69.78% |
| 14 | Llama 3.1 8B Instruct | Meta | 74.15% | 70.50% | 69.39% | 82.56% |
| 15 | Llama 3.1 405B Instruct OR | Meta | 73.59% | 90.97% | 66.67% | 63.14% |
| 16 | Gemini 2.5 Flash Lite | Google | 72.18% | 76.40% | 66.67% | 73.46% |
| 17 | Mistral Large 2 | Mistral | 68.74% | 75.16% | 69.39% | 61.67% |
| 18 | Gemini 2.5 Flash | Google | 68.38% | 74.22% | 68.03% | 62.90% |
| 19 | GPT-4o | OpenAI | 68.11% | 75.16% | 65.31% | 63.88% |
| 20 | Command A | Cohere | 67.71% | 84.78% | 59.86% | 58.48% |
| 21 | GPT 5.1 | OpenAI | 67.65% | 77.02% | 57.14% | 68.80% |
| 22 | Gemini 2.5 Pro | Google | 65.18% | 77.64% | 56.46% | 61.43% |
| 23 | Llama 4 Maverick | Meta | 64.99% | 73.91% | 59.86% | 61.18% |
| 24 | Grok 4 | xAI | 64.82% | 81.99% | 53.74% | 58.72% |
| 25 | Qwen 3 Max | Alibaba Qwen | 64.38% | 76.71% | 50.34% | 66.09% |
| 26 | Llama 3.3 70B Instruct OR | Meta | 64.24% | 81.68% | 53.06% | 57.99% |
| 27 | Grok 3 mini | xAI | 63.17% | 71.43% | 57.14% | 60.93% |
| 28 | Gemini 2.0 Flash | Google | 62.43% | 74.53% | 53.06% | 59.71% |
| 29 | Gemini 3.0 Pro Preview | Google | 62.14% | 65.22% | 57.82% | 63.39% |
| 30 | Deepseek V3.1 | Deepseek | 60.02% | 65.22% | 51.70% | 63.14% |
| 31 | Mistral Small 3.1 | Mistral | 58.70% | 65.84% | 56.46% | 53.81% |
| 32 | Gemini 2.0 Flash Lite | Google | 58.63% | 76.71% | 42.86% | 56.33% |
| 33 | Qwen 2.5 Max | Alibaba Qwen | 57.93% | 67.70% | 56.46% | 49.63% |
| 34 | Deepseek V3 0324 | Deepseek | 55.74% | 68.63% | 46.26% | 52.33% |
| 35 | GPT 4.1 | OpenAI | 55.17% | 68.94% | 44.22% | 52.33% |
| 36 | Qwen 3 8B | Alibaba Qwen | 54.09% | 60.87% | 44.90% | 56.51% |
| 37 | Deepseek R1 0528 | Deepseek | 52.92% | 58.07% | 47.62% | 53.07% |
| 38 | Grok 3 | xAI | 52.36% | 59.01% | 52.38% | 45.70% |
| 39 | Deepseek V3 | Deepseek | 50.54% | 68.01% | 37.41% | 46.19% |
| 40 | Magistral Small Latest | Mistral | 48.95% | 56.83% | 48.98% | 41.03% |
| 41 | Mistral Medium Latest | Mistral | 48.31% | 62.42% | 38.78% | 43.73% |
| 42 | Qwen 3 30B VL Instruct | Alibaba Qwen | 47.66% | 57.76% | 29.93% | 55.28% |
| 43 | GPT-4o mini | OpenAI | 46.40% | 60.87% | 38.78% | 39.56% |
| 44 | GPT 5 | OpenAI | 44.66% | 48.76% | 41.50% | 43.73% |
| 45 | Mistral Large 3 | Mistral | 43.70% | 58.07% | 34.69% | 38.33% |
| 46 | GPT 4.1 mini | OpenAI | 43.63% | 53.73% | 38.10% | 39.07% |
| 47 | Mistral Small 3.2 | Mistral | 42.65% | 50.62% | 36.05% | 41.28% |
| 48 | Llama 4 Scout | Meta | 41.90% | 55.28% | 30.61% | 39.80% |
| 49 | Gemma 3 27B IT OR | Google | 41.23% | 51.86% | 37.41% | 34.40% |
| 50 | Grok 2 | xAI | 39.97% | 45.96% | 35.37% | 38.57% |
| 51 | GPT OSS 120B | OpenAI | 38.40% | 41.61% | 27.89% | 45.70% |
| 52 | Grok 4 Fast No Reasoning | xAI | 37.06% | 41.30% | 31.29% | 38.57% |
| 53 | Gemma 3 12B IT OR | Google | 28.48% | 33.33% | 20.55% | 31.57% |
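
The page does not say how the four percentages listed for each model relate to one another. In the entries above, the first percentage appears to be the unweighted mean of the three that follow it, to within rounding. The sketch below checks that reading on three entries copied from the leaderboard. The names `ROWS`, `TOLERANCE`, and `matches_mean` are hypothetical, and the 0.02-point tolerance is an assumed allowance for rounding, not something the source specifies.

```python
from statistics import mean

# (model, first reported percentage, the three percentages that follow it),
# copied from the leaderboard entries above.
ROWS = [
    ("Claude 4.5 Haiku", 95.51, (99.07, 91.16, 96.31)),
    ("GPT 5 nano", 80.04, (77.02, 82.99, 80.10)),
    ("Gemma 3 12B IT OR", 28.48, (33.33, 20.55, 31.57)),
]

# Assumed allowance (in percentage points) for rounding in the published figures.
TOLERANCE = 0.02


def matches_mean(reported, parts, tol=TOLERANCE):
    """Return True if `reported` is within `tol` of the unweighted mean of `parts`."""
    return abs(mean(parts) - reported) <= tol


for model, reported, parts in ROWS:
    status = "matches" if matches_mean(reported, parts) else "differs from"
    print(f"{model}: reported {reported:.2f} {status} mean {mean(parts):.2f}")
```

If this reading holds, a model's rank reflects an equal weighting of the three component scores, so a low value in one of them (for example 29.93% for Qwen 3 30B VL Instruct) pulls the headline figure down as much as a low value in any other.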