Factuality
The model's ability to provide accurate responses to general-knowledge questions using language-specific sources, without fabricating information. (Higher scores are better.)
Rank | Model | Provider | Average | | | |
---|---|---|---|---|---|---|
#1 | Claude 3.5 Sonnet | Anthropic | 73.49% | 83.11% | 63.54% | 73.81% |
#2 | Claude 3.7 Sonnet | Anthropic | 72.79% | 84.96% | 62.66% | 70.75% |
#3 | GPT-4o | OpenAI | 70.95% | 82.41% | 59.70% | 70.75% |
#4 | Gemini 2.0 Flash | Google | 67.98% | 77.56% | 59.38% | 67.01% |
#5 | DeepSeek V3 (0324) | DeepSeek | 67.42% | 77.49% | 56.74% | 68.03% |
#6 | DeepSeek V3 | DeepSeek | 66.77% | 77.59% | 56.74% | 65.99% |
#7 | Gemini 1.5 Pro | Google | 66.36% | 79.08% | 52.74% | 67.25% |
#8 | Mistral Large | Mistral | 64.59% | 78.99% | 50.49% | 64.29% |
#9 | Qwen 2.5 Max | Alibaba Qwen | 62.66% | 77.32% | 50.45% | 60.20% |
#10 | Llama 4 Maverick | Meta | 61.17% | 70.69% | 54.33% | 58.50% |
#11 | Llama 3.3 70B | Meta | 59.94% | 73.41% | 48.57% | 57.82% |
#12 | Grok 2 | xAI | 59.44% | 78.06% | 42.43% | 57.82% |
#13 | Llama 3.1 405B | Meta | 58.79% | 71.95% | 44.90% | 59.52% |
#14 | Claude 3.5 Haiku | Anthropic | 56.74% | 70.79% | 43.64% | 55.78% |
#15 | Mistral Small 3.1 24B | Mistral | 55.56% | 68.08% | 43.15% | 55.44% |
#16 | GPT-4o mini | OpenAI | 54.93% | 70.25% | 39.09% | 55.44% |
#17 | Gemma 3 27B | Google | 50.74% | 65.59% | 40.02% | 46.60% |
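The fourth column appears to be the unweighted mean of the three component scores that follow it: recomputing it from the table matches the listed value to within ±0.01 rounding for every row. A minimal sanity check, using a few rows copied from the table above (the assumption that the average is unweighted is ours, not stated in the source):

```python
# Sanity check: recompute the "Average" column as the unweighted mean
# of the three component scores, rounded to two decimal places.
# Rows copied verbatim from the leaderboard table.
rows = {
    "Claude 3.5 Sonnet": (73.49, (83.11, 63.54, 73.81)),
    "GPT-4o": (70.95, (82.41, 59.70, 70.75)),
    "Gemma 3 27B": (50.74, (65.59, 40.02, 46.60)),
}

for model, (listed_avg, scores) in rows.items():
    mean = round(sum(scores) / len(scores), 2)
    # Allow ±0.01 in case the listed average was computed from
    # unrounded underlying scores.
    assert abs(mean - listed_avg) <= 0.01, (model, mean, listed_avg)
    print(f"{model}: recomputed {mean:.2f}, listed {listed_avg:.2f}")
```

For most rows the recomputed mean matches exactly; the ±0.01 tolerance covers rows (e.g. Llama 3.3 70B) where the published average was evidently rounded from higher-precision component scores.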