Misinformation

We evaluate each model's ability to provide accurate information when responding to questions that contain false premises, misleading framing, or factually incorrect assertions. Scores are reported overall and broken down by these three categories. (Higher is better.)
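The page does not publish its item format or grading pipeline, but a minimal sketch of how such an eval item might be represented is below. The `EvalItem` schema, field names, and the example question are illustrative assumptions, not the benchmark's actual data.

```python
from dataclasses import dataclass

# Hypothetical schema for one eval item. The field names and the example
# below are illustrative assumptions, not the benchmark's actual format.
@dataclass
class EvalItem:
    category: str        # "false_premise" | "misleading_framing" | "incorrect_assertion"
    prompt: str          # question posed to the model
    embedded_claim: str  # the falsehood the model is expected to push back on

item = EvalItem(
    category="false_premise",
    prompt="Why did Einstein fail math in school?",
    embedded_claim="Einstein failed math in school",  # he did not
)
```

A model scores well on such an item by correcting the embedded falsehood rather than answering as if it were true.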

| Rank | Model | Provider | Overall | False premises | Misleading framing | Incorrect assertions |
|------|-------|----------|---------|----------------|--------------------|----------------------|
| 1 | Claude 3.7 Sonnet | Anthropic | 89.52% | 91.61% | 87.76% | 89.19% |
| 2 | Claude 3.5 Sonnet | Anthropic | 85.43% | 93.17% | 79.59% | 83.54% |
| 3 | Claude 3.5 Haiku | Anthropic | 76.40% | 86.96% | 70.75% | 71.50% |
| 4 | Gemini 1.5 Pro | Google | 74.39% | 78.57% | 74.83% | 69.78% |
| 5 | Llama 3.1 405B | Meta | 73.59% | 90.97% | 66.67% | 63.14% |
| 6 | Mistral Large | Mistral | 68.74% | 75.16% | 69.39% | 61.67% |
| 7 | GPT-4o | OpenAI | 68.11% | 75.16% | 65.31% | 63.88% |
| 8 | Llama 4 Maverick | Meta | 64.99% | 73.91% | 59.86% | 61.18% |
| 9 | Llama 3.3 70B | Meta | 64.24% | 81.68% | 53.06% | 57.99% |
| 10 | Gemini 2.0 Flash | Google | 62.43% | 74.53% | 53.06% | 59.71% |
| 11 | Mistral Small 3.1 24B | Mistral | 58.70% | 65.84% | 56.46% | 53.81% |
| 12 | Qwen 2.5 Max | Alibaba | 57.93% | 67.70% | 56.46% | 49.63% |
| 13 | DeepSeek V3 (0324) | DeepSeek | 55.74% | 68.63% | 46.26% | 52.33% |
| 14 | DeepSeek V3 | DeepSeek | 50.54% | 68.01% | 37.41% | 46.19% |
| 15 | GPT-4o mini | OpenAI | 46.40% | 60.87% | 38.78% | 39.56% |
| 16 | Gemma 3 27B | Google | 41.23% | 51.86% | 37.41% | 34.40% |
| 17 | Grok 2 | xAI | 39.97% | 45.96% | 35.37% | 38.57% |
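As a consistency check on the table, each model's overall score is the unweighted mean of its three per-category scores, to within rounding of the published subscores. A minimal sketch, using the top row (Claude 3.7 Sonnet):

```python
# Recompute the Overall column as the unweighted mean of the three
# per-category scores. Small discrepancies in a few rows come from the
# subscores being rounded before publication.
category_scores = [91.61, 87.76, 89.19]  # false premises, misleading framing, incorrect assertions
overall = sum(category_scores) / len(category_scores)
print(f"{overall:.2f}%")  # 89.52%, matching the table
```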