Debunking
Measures the model's ability to critically evaluate and address questionable claims, including pseudoscience, conspiracy theories, and other controversial content. Higher scores are better.
Rank | Model | Provider | Average | | | |
---|---|---|---|---|---|---|
#1 | Claude 3.5 Sonnet | Anthropic | 97.60% | 97.34% | 98.13% | 97.33% |
#2 | Claude 3.7 Sonnet | Anthropic | 96.99% | 97.06% | 97.28% | 96.64% |
#3 | Gemini 1.5 Pro | Google | 96.29% | 98.00% | 95.09% | 95.78% |
#4 | Claude 3.5 Haiku | Anthropic | 95.96% | 95.98% | 96.14% | 95.75% |
#5 | GPT-4o | OpenAI | 92.70% | 91.83% | 94.05% | 92.22% |
#6 | Gemini 2.0 Flash | Google | 92.44% | 92.67% | 93.24% | 91.41% |
#7 | Llama 3.1 405B | Meta | 88.13% | 92.46% | 87.11% | 84.81% |
#8 | Mistral Small 3.1 24B | Mistral | 85.25% | 83.36% | 86.77% | 85.62% |
#9 | DeepSeek V3 | DeepSeek | 84.86% | 83.83% | 85.45% | 85.30% |
#10 | Mistral Large | Mistral | 84.77% | 85.72% | 83.03% | 85.57% |
#11 | Qwen 2.5 Max | Alibaba Qwen | 84.09% | 86.03% | 82.50% | 83.75% |
#12 | Llama 3.3 70B | Meta | 82.78% | 86.64% | 80.13% | 81.58% |
#13 | GPT-4o mini | OpenAI | 81.87% | 81.43% | 82.28% | 81.90% |
#14 | Gemma 3 27B | Google | 77.09% | 76.89% | 75.88% | 78.51% |
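
In every row, the first score column matches the mean of the other three to two decimal places, which suggests it is an overall average, though the table itself does not say so. A minimal Python sketch to check this on three rows copied from the table (the `mean_matches` helper and its tolerance are illustrative assumptions, not part of the benchmark):

```python
# Rows copied from the table: (model, first score column, remaining three score columns).
rows = [
    ("Claude 3.5 Sonnet", 97.60, (97.34, 98.13, 97.33)),
    ("GPT-4o", 92.70, (91.83, 94.05, 92.22)),
    ("Gemma 3 27B", 77.09, (76.89, 75.88, 78.51)),
]


def mean_matches(first, parts, tol=0.006):
    """Return True if `first` equals the mean of `parts` within `tol`.

    The tolerance covers rounding to two decimals (at most 0.005)
    plus a little floating-point slack.
    """
    return abs(sum(parts) / len(parts) - first) <= tol


for model, first, parts in rows:
    print(model, mean_matches(first, parts))
```

The same check can be run on the remaining rows by adding them to `rows`.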