Debunking

Measures the model's ability to critically evaluate and address questionable claims, including pseudoscience, conspiracy theories, and other controversial content. Higher scores are better.

| Rank | Model | Provider | | | | |
|---|---|---|---|---|---|---|
| #1 | Claude 3.5 Sonnet | Anthropic | 97.60% | 97.34% | 98.13% | 97.33% |
| #2 | Claude 3.7 Sonnet | Anthropic | 96.99% | 97.06% | 97.28% | 96.64% |
| #3 | Gemini 1.5 Pro | Google | 96.29% | 98.00% | 95.09% | 95.78% |
| #4 | Claude 3.5 Haiku | Anthropic | 95.96% | 95.98% | 96.14% | 95.75% |
| #5 | GPT-4o | OpenAI | 92.70% | 91.83% | 94.05% | 92.22% |
| #6 | Gemini 2.0 Flash | Google | 92.44% | 92.67% | 93.24% | 91.41% |
| #7 | Llama 3.1 405B | Meta | 88.13% | 92.46% | 87.11% | 84.81% |
| #8 | Mistral Small 3.1 24B | Mistral | 85.25% | 83.36% | 86.77% | 85.62% |
| #9 | Deepseek V3 | Deepseek | 84.86% | 83.83% | 85.45% | 85.30% |
| #10 | Mistral Large | Mistral | 84.77% | 85.72% | 83.03% | 85.57% |
| #11 | Qwen 2.5 Max | Alibaba Qwen | 84.09% | 86.03% | 82.50% | 83.75% |
| #12 | Llama 3.3 70B | Meta | 82.78% | 86.64% | 80.13% | 81.58% |
| #13 | GPT-4o mini | OpenAI | 81.87% | 81.43% | 82.28% | 81.90% |
| #14 | Gemma 3 27B | Google | 77.09% | 76.89% | 75.88% | 78.51% |
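
The four percentages listed for each model carry no column labels here. One reading that fits the numbers is that the first value is the arithmetic mean of the three that follow, rounded to two decimals. The sketch below checks that reading on three rows copied from the leaderboard; the interpretation itself, the helper name `mean_matches`, and the rounding tolerance are assumptions, not something the leaderboard states.

```python
# Rows copied from the leaderboard: (model, first value, three following values).
rows = [
    ("Claude 3.5 Sonnet", 97.60, [97.34, 98.13, 97.33]),
    ("Llama 3.1 405B", 88.13, [92.46, 87.11, 84.81]),
    ("Gemma 3 27B", 77.09, [76.89, 75.88, 78.51]),
]


def mean_matches(first, rest, tol=0.006):
    """Return True if `first` equals the mean of `rest` within rounding error.

    `tol` is a hypothetical tolerance: two-decimal rounding can shift the
    published value by up to 0.005, plus a little floating-point slack.
    """
    return abs(sum(rest) / len(rest) - first) <= tol


for model, first, rest in rows:
    print(model, mean_matches(first, rest))
```

If the check holds for a row, the ranking is driven by the averaged figure, so a model can rank above another while trailing it on one of the three component scores.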