Factuality
The model's ability to provide accurate responses to general knowledge questions using language-specific sources, without fabricating information. (Higher score is better.)
Rank | Model | Provider | Average | Lang. 1 | Lang. 2 | Lang. 3 |
---|---|---|---|---|---|---|
#1 | Claude 3.7 Sonnet | Anthropic | 74.79% | 84.96% | 66.35% | 73.05% |
#2 | Claude 3.5 Sonnet | Anthropic | 74.46% | 82.64% | 65.26% | 75.47% |
#3 | GPT-4o | OpenAI | 73.70% | 83.04% | 65.23% | 72.83% |
#4 | Gemini 2.0 Flash | Google | 68.52% | 77.86% | 59.24% | 68.47% |
#5 | DeepSeek V3 | DeepSeek | 67.64% | 76.96% | 59.27% | 66.70% |
#6 | Gemini 1.5 Pro | Google | 66.02% | 78.60% | 52.16% | 67.30% |
#7 | Mistral Large | Mistral | 65.56% | 79.16% | 53.25% | 64.28% |
#8 | Qwen 2.5 Max | Alibaba Qwen | 63.86% | 77.53% | 53.07% | 60.99% |
#9 | Claude 3.5 Haiku | Anthropic | 59.95% | 72.65% | 47.51% | 59.68% |
#10 | Llama 3.1 405B | Meta | 59.78% | 72.10% | 47.78% | 59.47% |
#11 | Llama 3.3 70B | Meta | 59.47% | 71.19% | 51.02% | 56.20% |
#12 | GPT-4o mini | OpenAI | 57.57% | 68.89% | 46.96% | 56.86% |
#13 | Mistral Small 3.1 24B | Mistral | 57.56% | 68.27% | 49.53% | 54.88% |
#14 | Gemma 3 27B | Google | 51.30% | 66.55% | 40.32% | 47.03% |
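The first score column appears to be the arithmetic mean of the three language-specific columns, rounded to two decimals (this holds for every row in the table). A minimal sketch checking that relationship on a few rows, under that assumption:

```python
# Assumption: the "Average" column is the mean of the three
# language-specific scores, rounded to two decimal places.
# A small sample of rows from the table above:
rows = {
    "Claude 3.7 Sonnet": (74.79, [84.96, 66.35, 73.05]),
    "GPT-4o":            (73.70, [83.04, 65.23, 72.83]),
    "Gemma 3 27B":       (51.30, [66.55, 40.32, 47.03]),
}

for model, (published_avg, scores) in rows.items():
    mean = sum(scores) / len(scores)
    # Tolerance of 0.005 allows for rounding in the published figures.
    assert abs(mean - published_avg) < 0.005, (model, mean, published_avg)
```

The same check passes for all fourteen rows, which suggests the overall ranking is a simple unweighted average across the three evaluated languages.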