Leaderboard
Note: The scores are computed by averaging the scores across all
tasks and languages for each module. A higher score is better.
The models benchmarked are instruction-tuned models that are generally available
(i.e., out of the experimental or preview phase). We currently do not include
reasoning models.
July 1st, 2025: we updated the bias resistance evaluation with a more advanced methodology, as described here. The results for the jailbreak resistance evaluation are currently under our provider notification period and will be released on July 5th.
Rank | Model | Provider | Average | Hallucination | Harmfulness | Bias |
---|---|---|---|---|---|---|
#1 | Gemini 1.5 Pro | Google | 92.32% | 86.41% | 96.84% | 93.70% |
#2 | Llama 4 Maverick | Meta | 87.84% | 81.14% | 89.25% | 93.13% |
#3 | Gemini 2.0 Flash | Google | 87.03% | 81.43% | 94.30% | 85.37% |
#4 | Llama 3.1 405B | Meta | 86.42% | 76.83% | 86.49% | 95.93% |
#5 | DeepSeek V3 | DeepSeek | 84.67% | 78.77% | 89.00% | 86.24% |
#6 | Claude 3.5 Haiku | Anthropic | 83.23% | 86.33% | 95.36% | 67.98% |
#7 | Claude 3.7 Sonnet | Anthropic | 82.16% | 89.86% | 95.52% | 61.10% |
#8 | GPT-4o | OpenAI | 81.59% | 85.64% | 92.66% | 66.48% |
#9 | DeepSeek V3 (0324) | DeepSeek | 80.54% | 73.87% | 92.80% | 74.96% |
#10 | Mistral Small 3.1 24B | Mistral | 80.47% | 77.68% | 90.91% | 72.83% |
#11 | Claude 3.5 Sonnet | Anthropic | 80.25% | 91.70% | 95.40% | 53.67% |
#12 | Gemma 3 27B | Google | 79.81% | 69.48% | 91.36% | 78.59% |
#13 | Qwen 2.5 Max | Alibaba Qwen | 77.67% | 76.91% | 89.89% | 66.22% |
#14 | Llama 3.3 70B | Meta | 75.96% | 75.28% | 86.04% | 66.56% |
#15 | Mistral Large | Mistral | 72.85% | 79.86% | 89.38% | 49.31% |
#16 | Grok 2 | xAI | 72.73% | 77.20% | 91.44% | 49.56% |
#17 | GPT-4o mini | OpenAI | 70.82% | 74.43% | 77.29% | 60.74% |
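The figures above are consistent with the overall score being the unweighted mean of the three module scores. The minimal sketch below illustrates that aggregation under this assumption, using Gemini 1.5 Pro's row from the table; the dictionary keys simply mirror the module columns.

```python
from statistics import mean

# Per-module scores for Gemini 1.5 Pro, copied from the table above.
module_scores = {
    "hallucination": 86.41,
    "harmfulness": 96.84,
    "bias": 93.70,
}

# Assumption: the leaderboard average is the unweighted mean of the
# module scores; the table's numbers are consistent with this.
average = mean(module_scores.values())
print(f"Average: {average:.2f}%")  # -> Average: 92.32%
```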