Leaderboard
Note: The scores are computed by averaging the scores across all tasks and languages for each module. Higher is better.
April 14, 2025: we have updated the benchmark with Llama 4 Maverick, Grok 2, and the new Deepseek V3 (0324). We will be adding new models regularly.
Rank | Model | Provider | Average | Module 1 | Module 2 | Module 3 |
---|---|---|---|---|---|---|
#1 | Gemini 1.5 Pro | Google | 87.29% | 87.06% | 96.84% | 77.96% |
#2 | Claude 3.5 Haiku | Anthropic | 82.72% | 86.97% | 95.36% | 65.81% |
#3 | Llama 3.1 405B | Meta | 77.59% | 75.54% | 86.49% | 70.74% |
#4 | Llama 4 Maverick | Meta | 76.72% | 77.02% | 89.25% | 63.89% |
#5 | Claude 3.5 Sonnet | Anthropic | 75.62% | 91.09% | 95.40% | 40.37% |
#6 | Claude 3.7 Sonnet | Anthropic | 75.53% | 89.26% | 95.52% | 41.82% |
#7 | Gemma 3 27B | Google | 75.23% | 69.90% | 91.36% | 64.44% |
#8 | Gemini 2.0 Flash | Google | 74.89% | 78.13% | 94.30% | 52.22% |
#9 | Deepseek V3 (0324) | Deepseek | 73.92% | 77.86% | 92.80% | 51.11% |
#10 | GPT-4o | OpenAI | 72.80% | 83.89% | 92.66% | 41.85% |
#11 | Qwen 2.5 Max | Alibaba Qwen | 72.71% | 77.12% | 89.89% | 51.11% |
#12 | Deepseek V3 | Deepseek | 70.77% | 77.91% | 89.00% | 45.39% |
#13 | Llama 3.3 70B | Meta | 67.97% | 73.41% | 86.04% | 44.44% |
#14 | Mistral Small 3.1 24B | Mistral | 67.88% | 77.72% | 90.91% | 35.00% |
#15 | Mistral Large | Mistral | 66.00% | 79.72% | 89.38% | 28.89% |
#16 | Grok 2 | xAI | 65.15% | 77.35% | 91.44% | 26.67% |
#17 | GPT-4o mini | OpenAI | 63.93% | 74.50% | 77.29% | 40.00% |
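The Average column appears to be the unweighted mean of the three module scores (e.g. Gemini 1.5 Pro: 87.06%, 96.84%, and 77.96% average to 87.29%). A minimal Python sketch of that computation, using the top row's values from the table:

```python
# Module scores for Gemini 1.5 Pro, taken from the leaderboard table above.
module_scores = [87.06, 96.84, 77.96]

# Assumption: the overall score is the unweighted mean of the module scores,
# rounded to two decimal places.
overall = round(sum(module_scores) / len(module_scores), 2)
print(f"{overall}%")  # 87.29%
```

Small discrepancies of ±0.01 in other rows are consistent with rounding of the per-module scores before display.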