Misinformation
We evaluate each model's ability to provide accurate information when responding to questions that contain false premises, misleading framing, or factually incorrect assertions. (Higher scores are better.)
Rank | Model | Provider | Average | | | |
---|---|---|---|---|---|---|
#1 | Claude 3.7 Sonnet | Anthropic | 81.76% | 89.25% | 76.34% | 79.67% |
#2 | Llama 3.1 405B | Meta | 80.25% | 88.71% | 81.72% | 70.33% |
#3 | Claude 3.5 Sonnet | Anthropic | 77.28% | 85.41% | 72.04% | 74.39% |
#4 | Claude 3.5 Haiku | Anthropic | 76.71% | 87.63% | 74.19% | 68.29% |
#5 | Gemini 1.5 Pro | Google | 70.41% | 75.27% | 73.91% | 62.04% |
#6 | Llama 3.3 70B | Meta | 66.35% | 80.65% | 64.52% | 53.88% |
#7 | Mistral Large | Mistral | 64.20% | 72.58% | 64.52% | 55.51% |
#8 | GPT-4o | OpenAI | 63.03% | 72.04% | 62.37% | 54.69% |
#9 | Gemini 2.0 Flash | Google | 61.26% | 70.43% | 58.06% | 55.28% |
#10 | Qwen 2.5 Max | Alibaba Qwen | 58.44% | 69.35% | 56.99% | 48.98% |
#11 | Mistral Small 3.1 24B | Mistral | 57.81% | 62.90% | 61.96% | 48.57% |
#12 | Deepseek V3 | Deepseek | 46.16% | 62.50% | 36.96% | 39.02% |
#13 | GPT-4o mini | OpenAI | 44.78% | 58.06% | 41.30% | 34.96% |
#14 | Gemma 3 27B | Google | 43.53% | 52.43% | 40.86% | 37.30% |
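
The source does not say how the score columns relate to each other. As a rough check, the sketch below tests one assumption: that the first score column is the unweighted mean of the three columns after it. The names `ROWS`, `mean_of_subscores`, and `max_deviation` are hypothetical and introduced only for illustration; the numbers are copied from the table.

```python
# Rows copied from the table: (model, first score column, the three remaining score columns).
# Assumption being tested: first column == unweighted mean of the other three.
ROWS = [
    ("Claude 3.7 Sonnet", 81.76, (89.25, 76.34, 79.67)),
    ("Llama 3.1 405B", 80.25, (88.71, 81.72, 70.33)),
    ("Claude 3.5 Sonnet", 77.28, (85.41, 72.04, 74.39)),
    ("Claude 3.5 Haiku", 76.71, (87.63, 74.19, 68.29)),
    ("Gemini 1.5 Pro", 70.41, (75.27, 73.91, 62.04)),
    ("Llama 3.3 70B", 66.35, (80.65, 64.52, 53.88)),
    ("Mistral Large", 64.20, (72.58, 64.52, 55.51)),
    ("GPT-4o", 63.03, (72.04, 62.37, 54.69)),
    ("Gemini 2.0 Flash", 61.26, (70.43, 58.06, 55.28)),
    ("Qwen 2.5 Max", 58.44, (69.35, 56.99, 48.98)),
    ("Mistral Small 3.1 24B", 57.81, (62.90, 61.96, 48.57)),
    ("Deepseek V3", 46.16, (62.50, 36.96, 39.02)),
    ("GPT-4o mini", 44.78, (58.06, 41.30, 34.96)),
    ("Gemma 3 27B", 43.53, (52.43, 40.86, 37.30)),
]


def mean_of_subscores(subscores):
    """Unweighted arithmetic mean of the per-column scores (in percent)."""
    return sum(subscores) / len(subscores)


def max_deviation(rows):
    """Largest absolute gap between the listed first column and the computed mean."""
    return max(abs(listed - mean_of_subscores(subs)) for _, listed, subs in rows)


if __name__ == "__main__":
    for model, listed, subs in ROWS:
        print(f"{model:<24} listed={listed:6.2f}  mean={mean_of_subscores(subs):6.2f}")
    print(f"largest gap: {max_deviation(ROWS):.4f} percentage points")
```

For every row, the listed value and the computed mean differ by less than 0.01 percentage points, which is consistent with the first column being an average computed from unrounded sub-scores. This is an inference from the published numbers, not a documented scoring rule.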