Debunking
The model's ability to critically evaluate and address questionable claims, including pseudoscience, conspiracy theories, and other controversial content. Higher scores are better.
| Rank | Model | Provider | Average | | | |
|---|---|---|---|---|---|---|
| #1 | Claude 3.5 Sonnet | Anthropic | 97.60% | 97.34% | 98.13% | 97.33% | 
| #2 | Claude 3.7 Sonnet | Anthropic | 96.99% | 97.06% | 97.28% | 96.64% | 
| #3 | Gemini 1.5 Pro | Google | 96.29% | 98.00% | 95.09% | 95.78% | 
| #4 | Claude 3.5 Haiku | Anthropic | 95.96% | 95.98% | 96.14% | 95.75% | 
| #5 | GPT-4o | OpenAI | 92.70% | 91.83% | 94.05% | 92.22% | 
| #6 | Gemini 2.0 Flash | Google | 92.44% | 92.67% | 93.24% | 91.41% | 
| #7 | Llama 4 Maverick | Meta | 91.20% | 91.74% | 91.80% | 90.05% | 
| #8 | Llama 3.1 405B | Meta | 88.13% | 92.46% | 87.11% | 84.81% | 
| #9 | DeepSeek V3 (0324) | DeepSeek | 85.69% | 83.61% | 87.77% | 85.70% |
| #10 | Grok 2 | xAI | 85.41% | 88.32% | 81.88% | 86.04% | 
| #11 | Mistral Small 3.1 24B | Mistral | 85.25% | 83.36% | 86.77% | 85.62% | 
| #12 | DeepSeek V3 | DeepSeek | 84.86% | 83.83% | 85.45% | 85.30% |
| #13 | Mistral Large | Mistral | 84.77% | 85.72% | 83.03% | 85.57% | 
| #14 | Qwen 2.5 Max | Alibaba Qwen | 84.09% | 86.03% | 82.50% | 83.75% | 
| #15 | Llama 3.3 70B | Meta | 82.78% | 86.64% | 80.13% | 81.58% | 
| #16 | GPT-4o mini | OpenAI | 81.87% | 81.43% | 82.28% | 81.90% | 
| #17 | Gemma 3 27B | Google | 77.09% | 76.89% | 75.88% | 78.51% |
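
The labels for the three right-hand score columns did not survive in the table, but the first score column in each row appears to be the unweighted mean of the three columns that follow it, rounded to two decimals. The sketch below checks that reading against a few rows copied from the table. The helper names `mean_score` and `matches_reported` are hypothetical and are not part of any benchmark tooling.

```python
# Rows copied from the table above:
# (model, first score column, the three score columns that follow it)
rows = [
    ("Claude 3.5 Sonnet", 97.60, (97.34, 98.13, 97.33)),
    ("Gemini 1.5 Pro", 96.29, (98.00, 95.09, 95.78)),
    ("Llama 3.1 405B", 88.13, (92.46, 87.11, 84.81)),
    ("Gemma 3 27B", 77.09, (76.89, 75.88, 78.51)),
]


def mean_score(scores):
    """Unweighted mean of the sub-scores (assumed aggregation rule)."""
    return sum(scores) / len(scores)


def matches_reported(reported, scores, tolerance=0.01):
    """True if the recomputed mean agrees with the reported value,
    allowing for rounding to two decimals."""
    return abs(mean_score(scores) - reported) <= tolerance


for model, reported, scores in rows:
    status = "ok" if matches_reported(reported, scores) else "mismatch"
    print(f"{model}: recomputed {mean_score(scores):.2f}, reported {reported:.2f} ({status})")
```

The same check holds for the remaining rows, which suggests the ranking is ordered by this mean rather than by any single sub-score. Gemini 1.5 Pro, for example, has the highest value in the second score column yet ranks third overall.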