Misinformation
We evaluate each model's ability to provide accurate information when responding to questions that contain false premises, misleading framing, or factually incorrect assertions. (Higher scores are better.)
| Rank | Model | Provider | Overall | False Premises | Misleading Framing | Incorrect Assertions |
|---|---|---|---|---|---|---|
| #1 | Claude 4.5 Haiku | Anthropic | 95.51% | 99.07% | 91.16% | 96.31% |
| #2 | Claude 3.7 Sonnet | Anthropic | 89.52% | 91.61% | 87.76% | 89.19% |
| #3 | Claude 4.5 Sonnet | Anthropic | 86.17% | 91.93% | 79.59% | 86.98% |
| #4 | Claude 4.5 Opus | Anthropic | 85.82% | 90.99% | 80.95% | 85.50% |
| #5 | Claude 3.5 Sonnet | Anthropic | 85.43% | 93.17% | 79.59% | 83.54% |
| #6 | Claude 4.1 Opus | Anthropic | 85.00% | 95.34% | 74.15% | 85.50% |
| #7 | GPT 5 nano | OpenAI | 80.04% | 77.02% | 82.99% | 80.10% |
| #8 | Qwen Plus | Alibaba Qwen | 77.90% | 84.47% | 75.51% | 73.71% |
| #9 | GPT 5 mini | OpenAI | 77.55% | 79.19% | 74.83% | 78.62% |
| #10 | Magistral Medium Latest | Mistral | 76.71% | 83.23% | 79.59% | 67.32% |
| #11 | GPT 4.1 nano | OpenAI | 76.61% | 78.88% | 78.23% | 72.73% |
| #12 | Claude 3.5 Haiku 20241022 | Anthropic | 76.40% | 86.96% | 70.75% | 71.50% |
| #13 | Gemini 1.5 Pro | Google | 74.39% | 78.57% | 74.83% | 69.78% |
| #14 | Llama 3.1 8B Instruct | Meta | 74.15% | 70.50% | 69.39% | 82.56% |
| #15 | Llama 3.1 405B Instruct OR | Meta | 73.59% | 90.97% | 66.67% | 63.14% |
| #16 | Gemini 2.5 Flash Lite | Google | 72.18% | 76.40% | 66.67% | 73.46% |
| #17 | Mistral Large 2 | Mistral | 68.74% | 75.16% | 69.39% | 61.67% |
| #18 | Gemini 2.5 Flash | Google | 68.38% | 74.22% | 68.03% | 62.90% |
| #19 | GPT-4o | OpenAI | 68.11% | 75.16% | 65.31% | 63.88% |
| #20 | Command A | Cohere | 67.71% | 84.78% | 59.86% | 58.48% |
| #21 | GPT 5.1 | OpenAI | 67.65% | 77.02% | 57.14% | 68.80% |
| #22 | Gemini 2.5 Pro | Google | 65.18% | 77.64% | 56.46% | 61.43% |
| #23 | Llama 4 Maverick | Meta | 64.99% | 73.91% | 59.86% | 61.18% |
| #24 | Grok 4 | xAI | 64.82% | 81.99% | 53.74% | 58.72% |
| #25 | Qwen 3 Max | Alibaba Qwen | 64.38% | 76.71% | 50.34% | 66.09% |
| #26 | Llama 3.3 70B Instruct OR | Meta | 64.24% | 81.68% | 53.06% | 57.99% |
| #27 | Grok 3 mini | xAI | 63.17% | 71.43% | 57.14% | 60.93% |
| #28 | Gemini 2.0 Flash | Google | 62.43% | 74.53% | 53.06% | 59.71% |
| #29 | Gemini 3.0 Pro Preview | Google | 62.14% | 65.22% | 57.82% | 63.39% |
| #30 | Deepseek V3.1 | Deepseek | 60.02% | 65.22% | 51.70% | 63.14% |
| #31 | Mistral Small 3.1 | Mistral | 58.70% | 65.84% | 56.46% | 53.81% |
| #32 | Gemini 2.0 Flash Lite | Google | 58.63% | 76.71% | 42.86% | 56.33% |
| #33 | Qwen 2.5 Max | Alibaba Qwen | 57.93% | 67.70% | 56.46% | 49.63% |
| #34 | Deepseek V3 0324 | Deepseek | 55.74% | 68.63% | 46.26% | 52.33% |
| #35 | GPT 4.1 | OpenAI | 55.17% | 68.94% | 44.22% | 52.33% |
| #36 | Qwen 3 8B | Alibaba Qwen | 54.09% | 60.87% | 44.90% | 56.51% |
| #37 | Deepseek R1 0528 | Deepseek | 52.92% | 58.07% | 47.62% | 53.07% |
| #38 | Grok 3 | xAI | 52.36% | 59.01% | 52.38% | 45.70% |
| #39 | Deepseek V3 | Deepseek | 50.54% | 68.01% | 37.41% | 46.19% |
| #40 | Magistral Small Latest | Mistral | 48.95% | 56.83% | 48.98% | 41.03% |
| #41 | Mistral Medium Latest | Mistral | 48.31% | 62.42% | 38.78% | 43.73% |
| #42 | Qwen 3 30B VL Instruct | Alibaba Qwen | 47.66% | 57.76% | 29.93% | 55.28% |
| #43 | GPT-4o mini | OpenAI | 46.40% | 60.87% | 38.78% | 39.56% |
| #44 | GPT 5 | OpenAI | 44.66% | 48.76% | 41.50% | 43.73% |
| #45 | Mistral Large 3 | Mistral | 43.70% | 58.07% | 34.69% | 38.33% |
| #46 | GPT 4.1 mini | OpenAI | 43.63% | 53.73% | 38.10% | 39.07% |
| #47 | Mistral Small 3.2 | Mistral | 42.65% | 50.62% | 36.05% | 41.28% |
| #48 | Llama 4 Scout | Meta | 41.90% | 55.28% | 30.61% | 39.80% |
| #49 | Gemma 3 27B IT OR | Google | 41.23% | 51.86% | 37.41% | 34.40% |
| #50 | Grok 2 | xAI | 39.97% | 45.96% | 35.37% | 38.57% |
| #51 | GPT OSS 120B | OpenAI | 38.40% | 41.61% | 27.89% | 45.70% |
| #52 | Grok 4 Fast No Reasoning | xAI | 37.06% | 41.30% | 31.29% | 38.57% |
| #53 | Gemma 3 12B IT OR | Google | 28.48% | 33.33% | 20.55% | 31.57% |
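The overall score appears to be the unweighted mean of the three category scores, rounded to two decimals. The sketch below illustrates that relationship; note that the aggregation rule is inferred from the published numbers rather than documented, and the category column labels are assumptions drawn from the description above.

```python
# Minimal sketch of how the "Overall" column relates to the three per-category
# columns. The aggregation rule (an unweighted mean, rounded to two decimals)
# is inferred from the published numbers; the category column names and their
# order are assumptions based on the description above.

def overall_score(false_premises: float, misleading_framing: float,
                  incorrect_assertions: float) -> float:
    """Overall misinformation score as the mean of the three category scores."""
    return round((false_premises + misleading_framing + incorrect_assertions) / 3, 2)

# Spot checks against rows in the table above:
print(overall_score(99.07, 91.16, 96.31))  # 95.51 -> Claude 4.5 Haiku (#1)
print(overall_score(91.61, 87.76, 89.19))  # 89.52 -> Claude 3.7 Sonnet (#2)
print(overall_score(33.33, 20.55, 31.57))  # 28.48 -> Gemma 3 12B IT OR (#53)
```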