Misinformation

We evaluate the model's ability to provide accurate information when responding to questions that contain false premises, misleading framing, or factually incorrect assertions. (Higher score is better.)
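To make the setup concrete, below is a minimal sketch of how a false-premise evaluation loop of this kind might look. All names here (`FALSE_PREMISE_ITEMS`, `query_model`, `grade_response`) are hypothetical illustrations, not the benchmark's actual harness; a real harness would call a model API and use a more robust grader.

```python
# Minimal sketch of a false-premise evaluation loop.
# Everything below is illustrative: the benchmark's real items,
# model calls, and grading criteria are not shown on this page.

FALSE_PREMISE_ITEMS = [
    {
        "question": "Why was the Great Wall of China demolished in 1989?",
        "note": "False premise: the Great Wall was never demolished.",
    },
    {
        "question": "Which of Venus's moons is the largest?",
        "note": "False premise: Venus has no moons.",
    },
]


def query_model(question: str) -> str:
    """Stand-in for a real model call; a real harness would hit an API here."""
    return "The premise of this question is incorrect: ..."


def grade_response(response: str) -> bool:
    """Toy grader: pass if the model flags the premise as wrong.
    A production grader would likely use an LLM judge or rubric."""
    markers = ("premise", "incorrect", "false", "never happened")
    return any(m in response.lower() for m in markers)


def run_eval() -> float:
    """Return the percentage of items where the model resisted the false premise."""
    passed = sum(
        grade_response(query_model(item["question"]))
        for item in FALSE_PREMISE_ITEMS
    )
    return 100 * passed / len(FALSE_PREMISE_ITEMS)


if __name__ == "__main__":
    print(f"Accuracy: {run_eval():.2f}%")
```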

| Rank | Model | Provider | Overall | Task 1 | Task 2 | Task 3 |
|------|-------|----------|---------|--------|--------|--------|
| #1 | Claude 3.7 Sonnet | Anthropic | 81.76% | 89.25% | 76.34% | 79.67% |
| #2 | Llama 3.1 405B | Meta | 80.25% | 88.71% | 81.72% | 70.33% |
| #3 | Claude 3.5 Sonnet | Anthropic | 77.28% | 85.41% | 72.04% | 74.39% |
| #4 | Claude 3.5 Haiku | Anthropic | 76.71% | 87.63% | 74.19% | 68.29% |
| #5 | Gemini 1.5 Pro | Google | 70.41% | 75.27% | 73.91% | 62.04% |
| #6 | Llama 3.3 70B | Meta | 66.35% | 80.65% | 64.52% | 53.88% |
| #7 | Mistral Large | Mistral | 64.20% | 72.58% | 64.52% | 55.51% |
| #8 | GPT-4o | OpenAI | 63.03% | 72.04% | 62.37% | 54.69% |
| #9 | Llama 4 Maverick | Meta | 62.51% | 74.73% | 59.14% | 53.66% |
| #10 | Gemini 2.0 Flash | Google | 61.26% | 70.43% | 58.06% | 55.28% |
| #11 | Qwen 2.5 Max | Alibaba Qwen | 58.44% | 69.35% | 56.99% | 48.98% |
| #12 | Mistral Small 3.1 24B | Mistral | 57.81% | 62.90% | 61.96% | 48.57% |
| #13 | DeepSeek V3 (0324) | DeepSeek | 51.74% | 66.67% | 43.01% | 45.53% |
| #14 | DeepSeek V3 | DeepSeek | 46.16% | 62.50% | 36.96% | 39.02% |
| #15 | GPT-4o mini | OpenAI | 44.78% | 58.06% | 41.30% | 34.96% |
| #16 | Gemma 3 27B | Google | 43.53% | 52.43% | 40.86% | 37.30% |
| #17 | Grok 2 | xAI | 35.89% | 37.10% | 34.41% | 36.18% |
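The names of the three per-task columns were not preserved in the source, so they are labeled Task 1–3 above. One pattern worth noting (an inference from the numbers, not a documented formula): each model's overall score matches the unweighted mean of its three task scores to within rounding. For example, for Claude 3.5 Sonnet:

```python
# Overall score as the unweighted mean of the three task scores
# (inferred from the table, not stated by the benchmark page).
task_scores = [85.41, 72.04, 74.39]  # Claude 3.5 Sonnet's three task scores
overall = sum(task_scores) / len(task_scores)
print(f"{overall:.2f}%")  # 77.28% -- matches the listed overall score
```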