Leaderboard
Note: Each module's score is the average across all tasks and languages for that module. Higher is better.
These are preliminary results as of March 31, 2025; we will update them as new models are released. Contact us at [email protected] if you have questions.
Rank | Model | Provider | Average | Module 1 | Module 2 | Module 3 |
---|---|---|---|---|---|---|
#1 | Gemini 1.5 Pro | Google | 87.29% | 87.06% | 96.84% | 77.96% |
#2 | Claude 3.5 Haiku | Anthropic | 82.72% | 86.97% | 95.36% | 65.81% |
#3 | Llama 3.1 405B | Meta | 77.59% | 75.54% | 86.49% | 70.74% |
#4 | Claude 3.5 Sonnet | Anthropic | 75.62% | 91.09% | 95.40% | 40.37% |
#5 | Claude 3.7 Sonnet | Anthropic | 75.53% | 89.26% | 95.52% | 41.82% |
#6 | Gemma 3 27B | Google | 75.23% | 69.90% | 91.36% | 64.44% |
#7 | Gemini 2.0 Flash | Google | 74.89% | 78.13% | 94.30% | 52.22% |
#8 | GPT-4o | OpenAI | 72.80% | 83.89% | 92.66% | 41.85% |
#9 | Qwen 2.5 Max | Alibaba Qwen | 72.71% | 77.12% | 89.89% | 51.11% |
#10 | Deepseek V3 | Deepseek | 70.77% | 77.91% | 89.00% | 45.39% |
#11 | Llama 3.3 70B | Meta | 67.97% | 73.41% | 86.04% | 44.44% |
#12 | Mistral Small 3.1 24B | Mistral | 67.88% | 77.72% | 90.91% | 35.00% |
#13 | Mistral Large | Mistral | 66.00% | 79.72% | 89.38% | 28.89% |
#14 | GPT-4o mini | OpenAI | 63.93% | 74.50% | 77.29% | 40.00% |
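The Average column appears to be the unweighted mean of the three module scores, rounded to two decimals. A minimal sketch checking that relationship on a few rows from the table above (the per-row tolerance allows for rounding in the published module scores):

```python
# Assumption: the Average column is the unweighted mean of the three
# module columns; values below are copied from the leaderboard table.
LEADERBOARD = [
    ("Gemini 1.5 Pro",   87.29, (87.06, 96.84, 77.96)),
    ("Claude 3.5 Haiku", 82.72, (86.97, 95.36, 65.81)),
    ("GPT-4o mini",      63.93, (74.50, 77.29, 40.00)),
]

def overall(module_scores):
    """Unweighted mean of the per-module scores."""
    return sum(module_scores) / len(module_scores)

for model, listed_average, modules in LEADERBOARD:
    # Allow a small tolerance: the module scores are themselves rounded,
    # so the recomputed mean can differ from the listed average by ~0.01.
    assert abs(overall(modules) - listed_average) < 0.02, model
```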