Leaderboard
Note: The scores are computed by averaging the scores across all
tasks and languages for each module. A higher score is better.
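The Average column is consistent with a simple unweighted mean of the four module scores in each row (e.g. for Claude 4.5 Haiku: (83.56 + 99.93 + 70.66 + 78.51) / 4 ≈ 83.16). A minimal sketch of that aggregation, assuming an unweighted mean and using generic module labels since the source table leaves those columns unnamed:

```python
from statistics import mean

# A minimal sketch (not the benchmark's own code) of how the Average
# column appears to be derived: the unweighted mean of the four
# per-module scores in each row. Module names are left generic
# because the source table does not label those columns.
def overall_score(module_scores: list[float]) -> float:
    """Return the unweighted mean of a model's per-module scores."""
    return mean(module_scores)

# Claude 4.5 Haiku's four module scores, taken from the table below:
haiku = [83.56, 99.93, 70.66, 78.51]
print(f"{overall_score(haiku):.2f}%")  # ~83.16%, the value in its Average column
```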
The models benchmarked are instruction-tuned models that are generally available (i.e. not experimental releases).
December 15th, 2025: we released an updated version of the jailbreak resistance module and added 33 new models to the benchmark, including 20 state-of-the-art reasoning models.
| Rank | Model | Provider | Average | Module 1 | Module 2 | Module 3 | Module 4 |
|---|---|---|---|---|---|---|---|
| #1 | Claude 4.5 Haiku | Anthropic | 83.16% | 83.56% | 99.93% | 70.66% | 78.51% |
| #2 | Claude 4.5 Opus | Anthropic | 82.38% | 88.23% | 98.25% | 63.20% | 79.83% |
| #3 | Claude 4.5 Sonnet | Anthropic | 77.60% | 87.00% | 99.05% | 49.14% | 75.23% |
| #4 | Claude 4.1 Opus | Anthropic | 76.87% | 86.19% | 96.31% | 43.61% | 81.35% |
| #5 | Llama 3.1 405B Instruct OR | Meta | 76.42% | 67.40% | 86.49% | 75.23% | 76.55% |
| #6 | Gemini 3.0 Pro Preview | Google | 73.31% | 81.02% | 93.50% | 53.65% | 65.06% |
| #7 | GPT 5 mini | OpenAI | 73.22% | 79.58% | 98.29% | 46.41% | 68.60% |
| #8 | GPT 5.1 | OpenAI | 72.82% | 81.84% | 96.92% | 46.77% | 65.76% |
| #9 | Llama 4 Maverick | Meta | 70.84% | 71.47% | 89.25% | 73.65% | 48.99% |
| #10 | GPT-4o | OpenAI | 70.65% | 78.52% | 92.66% | 50.92% | 60.48% |
| #11 | GPT 4.1 mini | OpenAI | 70.47% | 67.81% | 83.39% | 88.12% | 42.55% |
| #12 | GPT 5 nano | OpenAI | 69.53% | 76.37% | 97.41% | 34.70% | 69.64% |
| #13 | Claude 3.7 Sonnet | Anthropic | 69.48% | 85.17% | 95.52% | 33.77% | 63.45% |
| #14 | Claude 3.5 Haiku 20241022 | Anthropic | 69.08% | 78.04% | 95.36% | 38.08% | 64.82% |
| #15 | Qwen Plus | Alibaba Qwen | 68.85% | 75.58% | 94.14% | 55.70% | 49.98% |
| #16 | GPT 5 | OpenAI | 67.20% | 74.58% | 96.97% | 28.56% | 68.68% |
| #17 | GPT 4.1 | OpenAI | 67.11% | 77.64% | 92.30% | 52.41% | 46.10% |
| #18 | Grok 4 Fast No Reasoning | xAI | 66.55% | 63.40% | 81.34% | 80.26% | 41.20% |
| #19 | GPT OSS 120B | OpenAI | 66.14% | 66.08% | 93.75% | 38.84% | 65.90% |
| #20 | Qwen 3 Max | Alibaba Qwen | 65.93% | 75.20% | 95.40% | 44.77% | 48.35% |
| #21 | Deepseek V3.1 | Deepseek | 64.84% | 61.58% | 94.43% | 65.17% | 38.17% |
| #22 | Mistral Small 3.2 | Mistral | 64.52% | 63.40% | 87.87% | 73.90% | 32.89% |
| #23 | Gemini 2.0 Flash | Google | 64.19% | 71.59% | 94.30% | 53.51% | 37.35% |
| #24 | Qwen 3 8B | Alibaba Qwen | 63.88% | 60.38% | 87.37% | 58.64% | 49.13% |
| #25 | Llama 4 Scout | Meta | 63.82% | 58.20% | 81.04% | 67.10% | 48.93% |
| #26 | Gemini 2.5 Flash | Google | 63.64% | 72.13% | 93.66% | 51.74% | 37.05% |
| #27 | Llama 3.1 8B Instruct | Meta | 62.46% | 63.81% | 83.06% | 44.15% | 58.81% |
| #28 | Magistral Medium Latest | Mistral | 62.26% | 68.19% | 84.52% | 50.75% | 45.61% |
| #29 | Deepseek V3 | Deepseek | 62.24% | 70.91% | 89.00% | 59.75% | 29.31% |
| #30 | Qwen 2.5 Max | Alibaba Qwen | 62.22% | 69.80% | 89.89% | 42.95% | 46.26% |
| #31 | Llama 3.3 70B Instruct OR | Meta | 61.98% | 68.83% | 86.04% | 45.39% | 47.68% |
| #32 | Deepseek V3 0324 | Deepseek | 60.87% | 62.59% | 92.80% | 55.26% | 32.82% |
| #33 | Gemini 2.5 Pro | Google | 60.55% | 74.82% | 92.18% | 36.52% | 38.67% |
| #34 | Grok 3 mini | xAI | 59.96% | 68.62% | 90.47% | 46.40% | 34.36% |
| #35 | Qwen 3 30B VL Instruct | Alibaba Qwen | 59.33% | 59.88% | 81.76% | 53.57% | 42.09% |
| #36 | Gemini 2.5 Flash Lite | Google | 58.91% | 68.09% | 79.15% | 45.53% | 42.84% |
| #37 | Mistral Large 2 | Mistral | 58.82% | 75.62% | 89.38% | 39.70% | 30.58% |
| #38 | Deepseek R1 0528 | Deepseek | 58.55% | 72.89% | 95.15% | 25.49% | 40.67% |
| #39 | GPT-4o mini | OpenAI | 58.32% | 65.97% | 77.29% | 37.90% | 52.14% |
| #40 | Mistral Medium Latest | Mistral | 58.11% | 69.63% | 92.32% | 39.66% | 30.83% |
| #41 | GPT 4.1 nano | OpenAI | 57.45% | 67.74% | 72.54% | 36.22% | 53.30% |
| #42 | Gemma 3 27B IT OR | Google | 57.44% | 60.30% | 91.36% | 38.01% | 40.07% |
| #43 | Gemini 2.0 Flash Lite | Google | 57.31% | 63.21% | 85.14% | 41.65% | 39.26% |
| #44 | Magistral Small Latest | Mistral | 55.94% | 62.44% | 76.23% | 48.02% | 37.09% |
| #45 | Gemma 3 12B IT OR | Google | 55.73% | 53.86% | 92.65% | 32.14% | 44.27% |
| #46 | Grok 2 | xAI | 54.58% | 66.80% | 91.44% | 33.10% | 26.97% |
| #47 | Grok 3 | xAI | 53.45% | 75.12% | 89.68% | 23.24% | 25.77% |
| N/A | Mistral Small 3.1* | Mistral | N/A | 70.00% | 90.91% | N/A | N/A |
| N/A | Claude 3.5 Sonnet* | Anthropic | N/A | 86.71% | 95.40% | N/A | N/A |
| N/A | Gemini 1.5 Pro* | Google | N/A | 78.07% | 96.84% | N/A | N/A |
* Models marked with an asterisk have partial scores and are therefore unranked.