Leaderboard
Note: The scores are computed by averaging the scores across all
tasks and languages for each module; the Average column is the mean of the four module scores and determines the ranking. Higher scores are better.
The benchmarked models are instruction-tuned models that are generally available (i.e., out of the experimental or preview phase). We do not currently include reasoning models.
July 1st, 2025: we released the new jailbreak resistance module and updated the bias resistance evaluation with a more advanced methodology.
Rank | Model | Provider | Average | Hallucination | Harmfulness | Bias | Jailbreak resistance |
---|---|---|---|---|---|---|---|
#1 | Llama 3.1 405B | Meta | 85.80% | 76.83% | 86.49% | 95.93% | 83.97% |
#2 | Gemini 1.5 Pro | Google | 79.12% | 86.41% | 96.84% | 93.70% | 39.53% |
#3 | Llama 4 Maverick | Meta | 77.63% | 81.14% | 89.25% | 93.13% | 47.02% |
#4 | Claude 3.5 Haiku | Anthropic | 77.20% | 86.33% | 95.36% | 67.98% | 59.11% |
#5 | GPT-4o | OpenAI | 76.93% | 85.64% | 92.66% | 66.48% | 62.95% |
#6 | Claude 3.5 Sonnet | Anthropic | 76.13% | 91.70% | 95.40% | 53.67% | 63.76% |
#7 | Claude 3.7 Sonnet | Anthropic | 75.73% | 89.86% | 95.52% | 61.10% | 56.43% |
#8 | Gemini 2.0 Flash | Google | 75.69% | 81.43% | 94.30% | 85.37% | 41.65% |
#9 | DeepSeek V3 | DeepSeek | 71.49% | 78.77% | 89.00% | 86.24% | 31.96% |
#10 | Llama 3.3 70B | Meta | 70.49% | 75.28% | 86.04% | 66.56% | 54.08% |
#11 | Qwen 2.5 Max | Alibaba Qwen | 70.20% | 76.91% | 89.89% | 66.22% | 47.80% |
#12 | Gemma 3 27B | Google | 69.79% | 69.48% | 91.36% | 78.59% | 39.71% |
#13 | Mistral Small 3.1 24B | Mistral | 69.08% | 77.68% | 90.91% | 72.83% | 34.91% |
#14 | DeepSeek V3 (0324) | DeepSeek | 68.97% | 73.87% | 92.80% | 74.96% | 34.25% |
#15 | GPT-4o mini | OpenAI | 67.06% | 74.43% | 77.29% | 60.74% | 55.78% |
#16 | Mistral Large | Mistral | 64.15% | 79.86% | 89.38% | 49.31% | 38.06% |
#17 | Grok 2 | xAI | 61.38% | 77.20% | 91.44% | 49.56% | 27.32% |