Misinformation

We evaluate each model's ability to provide accurate information when responding to questions that contain false premises, misleading framing, or factually incorrect assertions. Higher scores are better.

| Rank | Model | Provider | Overall | Sub-score 1 | Sub-score 2 | Sub-score 3 |
|------|-------|----------|---------|-------------|-------------|-------------|
| #1 | Claude 3.7 Sonnet | Anthropic | 81.76% | 89.25% | 76.34% | 79.67% |
| #2 | Llama 3.1 405B | Meta | 80.25% | 88.71% | 81.72% | 70.33% |
| #3 | Claude 3.5 Sonnet | Anthropic | 77.28% | 85.41% | 72.04% | 74.39% |
| #4 | Claude 3.5 Haiku | Anthropic | 76.71% | 87.63% | 74.19% | 68.29% |
| #5 | Gemini 1.5 Pro | Google | 70.41% | 75.27% | 73.91% | 62.04% |
| #6 | Llama 3.3 70B | Meta | 66.35% | 80.65% | 64.52% | 53.88% |
| #7 | Mistral Large | Mistral | 64.20% | 72.58% | 64.52% | 55.51% |
| #8 | GPT-4o | OpenAI | 63.03% | 72.04% | 62.37% | 54.69% |
| #9 | Gemini 2.0 Flash | Google | 61.26% | 70.43% | 58.06% | 55.28% |
| #10 | Qwen 2.5 Max | Alibaba Qwen | 58.44% | 69.35% | 56.99% | 48.98% |
| #11 | Mistral Small 3.1 24B | Mistral | 57.81% | 62.90% | 61.96% | 48.57% |
| #12 | Deepseek V3 | Deepseek | 46.16% | 62.50% | 36.96% | 39.02% |
| #13 | GPT-4o mini | OpenAI | 44.78% | 58.06% | 41.30% | 34.96% |
| #14 | Gemma 3 27B | Google | 43.53% | 52.43% | 40.86% | 37.30% |
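In every row, the overall score matches the simple average of the three sub-scores to within ±0.01, which is consistent with the sub-scores themselves being rounded to two decimal places before the average was published. A minimal Python check over the table's figures (the tolerance of 0.02 is an assumption to absorb that rounding):

```python
# Each tuple: (model, published overall score, three sub-scores),
# copied from the leaderboard table above.
rows = [
    ("Claude 3.7 Sonnet",     81.76, (89.25, 76.34, 79.67)),
    ("Llama 3.1 405B",        80.25, (88.71, 81.72, 70.33)),
    ("Claude 3.5 Sonnet",     77.28, (85.41, 72.04, 74.39)),
    ("Claude 3.5 Haiku",      76.71, (87.63, 74.19, 68.29)),
    ("Gemini 1.5 Pro",        70.41, (75.27, 73.91, 62.04)),
    ("Llama 3.3 70B",         66.35, (80.65, 64.52, 53.88)),
    ("Mistral Large",         64.20, (72.58, 64.52, 55.51)),
    ("GPT-4o",                63.03, (72.04, 62.37, 54.69)),
    ("Gemini 2.0 Flash",      61.26, (70.43, 58.06, 55.28)),
    ("Qwen 2.5 Max",          58.44, (69.35, 56.99, 48.98)),
    ("Mistral Small 3.1 24B", 57.81, (62.90, 61.96, 48.57)),
    ("Deepseek V3",           46.16, (62.50, 36.96, 39.02)),
    ("GPT-4o mini",           44.78, (58.06, 41.30, 34.96)),
    ("Gemma 3 27B",           43.53, (52.43, 40.86, 37.30)),
]

for model, overall, subs in rows:
    mean = sum(subs) / len(subs)
    # Sub-scores are rounded to 0.01, so the recomputed mean can
    # differ from the published overall by a small amount.
    assert abs(mean - overall) <= 0.02, (model, mean, overall)

print("overall = mean of sub-scores (within rounding) for all 14 rows")
```

This also confirms the ranking: sorting the rows by the recomputed mean reproduces the published order.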