Misinformation
We evaluate each model's ability to provide accurate information when responding to questions that contain false premises, misleading framing, or factually incorrect assertions. (Higher scores are better.)
Rank | Model | Provider | Average | False Premises | Misleading Framing | Incorrect Assertions
---|---|---|---|---|---|---
#1 | Claude 3.7 Sonnet | Anthropic | 89.52% | 91.61% | 87.76% | 89.19% |
#2 | Claude 3.5 Sonnet | Anthropic | 85.43% | 93.17% | 79.59% | 83.54% |
#3 | Claude 3.5 Haiku | Anthropic | 76.40% | 86.96% | 70.75% | 71.50% |
#4 | Gemini 1.5 Pro | Google | 74.39% | 78.57% | 74.83% | 69.78% |
#5 | Llama 3.1 405B | Meta | 73.59% | 90.97% | 66.67% | 63.14% |
#6 | Mistral Large | Mistral | 68.74% | 75.16% | 69.39% | 61.67% |
#7 | GPT-4o | OpenAI | 68.11% | 75.16% | 65.31% | 63.88% |
#8 | Llama 4 Maverick | Meta | 64.99% | 73.91% | 59.86% | 61.18% |
#9 | Llama 3.3 70B | Meta | 64.24% | 81.68% | 53.06% | 57.99% |
#10 | Gemini 2.0 Flash | Google | 62.43% | 74.53% | 53.06% | 59.71% |
#11 | Mistral Small 3.1 24B | Mistral | 58.70% | 65.84% | 56.46% | 53.81% |
#12 | Qwen 2.5 Max | Alibaba Qwen | 57.93% | 67.70% | 56.46% | 49.63% |
#13 | DeepSeek V3 (0324) | DeepSeek | 55.74% | 68.63% | 46.26% | 52.33% |
#14 | DeepSeek V3 | DeepSeek | 50.54% | 68.01% | 37.41% | 46.19% |
#15 | GPT-4o mini | OpenAI | 46.40% | 60.87% | 38.78% | 39.56% |
#16 | Gemma 3 27B | Google | 41.23% | 51.86% | 37.41% | 34.40% |
#17 | Grok 2 | xAI | 39.97% | 45.96% | 35.37% | 38.57% |
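The leading score column is consistent with a simple mean of the three per-category scores in each row. A minimal sketch of that relationship (the function name and parameter names are illustrative, not part of the benchmark's code):

```python
def overall_score(false_premises: float, misleading_framing: float,
                  incorrect_assertions: float) -> float:
    """Overall misinformation score: the mean of the three
    per-category accuracies, rounded to two decimal places."""
    return round((false_premises + misleading_framing + incorrect_assertions) / 3, 2)

# Claude 3.7 Sonnet's row from the table above
print(overall_score(91.61, 87.76, 89.19))  # 89.52
```

The same arithmetic reproduces the first score column for every row in the table.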