Leaderboard
Note: The scores are computed by averaging the scores across all tasks and languages for each module. Higher scores are better.
The models benchmarked are instruction-tuned models that are generally available (i.e., past the experimental or preview phase). We currently do not include reasoning models.
July 1, 2025: we released the new jailbreak resistance module and updated the bias resistance evaluation with a more advanced methodology.
| Rank | Model | Provider | Average | Hallucination | Harmfulness | Bias | Jailbreak Resistance |
|---|---|---|---|---|---|---|---|
| #1 | Llama 3.1 405B | Meta | 85.80% | 76.83% | 86.49% | 95.93% | 83.97% | 
| #2 | Gemini 1.5 Pro | Google | 79.12% | 86.41% | 96.84% | 93.70% | 39.53% | 
| #3 | Llama 4 Maverick | Meta | 77.63% | 81.14% | 89.25% | 93.13% | 47.02% | 
| #4 | Claude 3.5 Haiku | Anthropic | 77.20% | 86.33% | 95.36% | 67.98% | 59.11% | 
| #5 | GPT-4o | OpenAI | 76.93% | 85.64% | 92.66% | 66.48% | 62.95% | 
| #6 | Claude 3.5 Sonnet | Anthropic | 76.13% | 91.70% | 95.40% | 53.67% | 63.76% | 
| #7 | Claude 3.7 Sonnet | Anthropic | 75.73% | 89.86% | 95.52% | 61.10% | 56.43% | 
| #8 | Gemini 2.0 Flash | Google | 75.69% | 81.43% | 94.30% | 85.37% | 41.65% | 
| #9 | DeepSeek V3 | DeepSeek | 71.49% | 78.77% | 89.00% | 86.24% | 31.96% | 
| #10 | Llama 3.3 70B | Meta | 70.49% | 75.28% | 86.04% | 66.56% | 54.08% | 
| #11 | Qwen 2.5 Max | Alibaba Qwen | 70.20% | 76.91% | 89.89% | 66.22% | 47.80% | 
| #12 | Gemma 3 27B | Google | 69.79% | 69.48% | 91.36% | 78.59% | 39.71% | 
| #13 | Mistral Small 3.1 24B | Mistral | 69.08% | 77.68% | 90.91% | 72.83% | 34.91% | 
| #14 | DeepSeek V3 (0324) | DeepSeek | 68.97% | 73.87% | 92.80% | 74.96% | 34.25% | 
| #15 | GPT-4o mini | OpenAI | 67.06% | 74.43% | 77.29% | 60.74% | 55.78% | 
| #16 | Mistral Large | Mistral | 64.15% | 79.86% | 89.38% | 49.31% | 38.06% | 
| #17 | Grok 2 | xAI | 61.38% | 77.20% | 91.44% | 49.56% | 27.32% | 
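As a sanity check on the table, the Average column matches the unweighted mean of the four module scores on every row (e.g., Llama 3.1 405B: (76.83 + 86.49 + 95.93 + 83.97) / 4 = 85.805 ≈ 85.80%). The Python sketch below makes this assumed aggregation explicit; the per-task/per-language averaging step and the module names are assumptions drawn from the notes above, not the benchmark's published pipeline.

```python
from statistics import mean

def module_score(scores: list[float]) -> float:
    """Per-module score: unweighted mean across all tasks and languages
    (assumed reading of the note at the top of the leaderboard)."""
    return mean(scores)

def average_score(module_scores: dict[str, float]) -> float:
    """Average column: unweighted mean of the module scores
    (consistent with every row of the table above)."""
    return mean(module_scores.values())

# Check against the Llama 3.1 405B row; the module names and their
# order are assumptions, not confirmed by the page.
llama_3_1_405b = {
    "hallucination": 76.83,
    "harmfulness": 86.49,
    "bias": 95.93,
    "jailbreak_resistance": 83.97,
}
avg = average_score(llama_3_1_405b)
assert abs(avg - 85.80) < 0.01  # (76.83 + 86.49 + 95.93 + 83.97) / 4 = 85.805
```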