Prompt Injection
Measures each model's resistance to known prompt injection attacks; higher scores are better. The Overall score is the unweighted average of the three category sub-scores.
| Rank | Model | Provider | Overall | Sub-score 1 | Sub-score 2 | Sub-score 3 |
|---|---|---|---|---|---|---|
| #1 | Claude 4.5 Haiku | Anthropic | 98.07% | 97.86% | 97.59% | 98.75% |
| #2 | Claude 4.1 Opus | Anthropic | 97.75% | 97.86% | 97.89% | 97.49% |
| #3 | Claude 4.5 Sonnet | Anthropic | 97.42% | 97.33% | 98.19% | 96.74% |
| #4 | Claude 4.5 Opus | Anthropic | 97.05% | 97.33% | 97.59% | 96.24% |
| #5 | Claude 3.5 Haiku 20241022 | Anthropic | 93.40% | 93.58% | 91.87% | 94.74% |
| #6 | GPT 5 mini | OpenAI | 86.55% | 90.22% | 87.16% | 82.28% |
| #7 | Claude 3.7 Sonnet | Anthropic | 86.35% | 87.70% | 86.14% | 85.21% |
| #8 | GPT 5 nano | OpenAI | 86.17% | 90.37% | 84.94% | 83.21% |
| #9 | GPT OSS 120B | OpenAI | 86.00% | 92.51% | 84.04% | 81.45% |
| #10 | GPT 5.1 | OpenAI | 85.19% | 89.84% | 84.04% | 81.70% |
| #11 | GPT 5 | OpenAI | 81.30% | 85.95% | 79.82% | 78.14% |
| #12 | Llama 3.1 405B Instruct OR | Meta | 79.06% | 78.61% | 84.64% | 73.93% |
| #13 | GPT 4.1 nano | OpenAI | 75.85% | 82.89% | 70.48% | 74.19% |
| #14 | GPT-4o | OpenAI | 71.17% | 75.40% | 68.67% | 69.42% |
| #15 | GPT 4.1 | OpenAI | 69.98% | 79.68% | 63.86% | 66.42% |
| #16 | Qwen Plus | Alibaba Qwen | 68.88% | 79.21% | 66.27% | 61.15% |
| #17 | Llama 4 Maverick | Meta | 68.33% | 61.50% | 79.82% | 63.66% |
| #18 | Gemini 3.0 Pro Preview | Google | 67.99% | 77.01% | 66.57% | 60.40% |
| #19 | GPT-4o mini | OpenAI | 67.28% | 73.26% | 63.86% | 64.74% |
| #20 | Qwen 2.5 Max | Alibaba Qwen | 66.73% | 77.53% | 59.52% | 63.16% |
| #21 | Qwen 3 Max | Alibaba Qwen | 64.44% | 72.47% | 61.45% | 59.40% |
| #22 | GPT 4.1 mini | OpenAI | 61.33% | 69.35% | 60.24% | 54.39% |
| #23 | Grok 4 Fast No Reasoning | xAI | 60.31% | 63.10% | 59.94% | 57.89% |
| #24 | Llama 4 Scout | Meta | 55.76% | 52.94% | 64.46% | 49.87% |
| #25 | Deepseek R1 0528 | Deepseek | 54.41% | 47.59% | 60.24% | 55.39% |
| #26 | Qwen 3 30B VL Instruct | Alibaba Qwen | 53.93% | 63.64% | 50.90% | 47.24% |
| #27 | Gemini 2.5 Flash Lite | Google | 53.18% | 57.22% | 52.71% | 49.62% |
| #28 | Deepseek V3.1 | Deepseek | 52.52% | 48.66% | 54.52% | 54.39% |
| #29 | Gemini 2.5 Flash | Google | 50.14% | 54.55% | 51.51% | 44.36% |
| #30 | Llama 3.1 8B Instruct | Meta | 49.57% | 51.34% | 53.78% | 43.61% |
| #31 | Gemini 2.5 Pro | Google | 48.59% | 54.01% | 46.39% | 45.36% |
| #32 | Llama 3.3 70B Instruct OR | Meta | 48.10% | 40.64% | 58.43% | 45.23% |
| #33 | Gemini 2.0 Flash | Google | 46.15% | 47.59% | 45.76% | 45.09% |
| #34 | Gemini 2.0 Flash Lite | Google | 46.13% | 52.41% | 39.88% | 46.10% |
| #35 | Gemma 3 12B IT OR | Google | 45.86% | 47.06% | 42.77% | 47.74% |
| #36 | Deepseek V3 0324 | Deepseek | 44.98% | 47.06% | 45.78% | 42.11% |
| #37 | Qwen 3 8B | Alibaba Qwen | 44.33% | 45.45% | 43.94% | 43.61% |
| #38 | Gemma 3 27B IT OR | Google | 41.97% | 36.36% | 48.19% | 41.35% |
| #39 | Magistral Medium Latest | Mistral | 41.33% | 37.43% | 52.71% | 33.83% |
| #40 | Deepseek V3 | Deepseek | 38.19% | 39.57% | 39.16% | 35.84% |
| #41 | Grok 2 | xAI | 35.57% | 35.14% | 36.97% | 34.60% |
| #42 | Mistral Small 3.2 | Mistral | 34.20% | 33.69% | 34.34% | 34.59% |
| #43 | Grok 3 | xAI | 32.99% | 33.16% | 35.24% | 30.58% |
| #44 | Mistral Medium Latest | Mistral | 31.00% | 28.34% | 36.45% | 28.21% |
| #45 | Mistral Large 2 | Mistral | 30.58% | 27.27% | 36.14% | 28.32% |
| #46 | Grok 3 mini | xAI | 28.72% | 24.60% | 33.73% | 27.82% |
| #47 | Magistral Small Latest | Mistral | 22.09% | 17.11% | 27.11% | 22.06% |
| | Mistral Small 3.1* | Mistral | N/A | N/A | N/A | N/A |
| | Claude 3.5 Sonnet* | Anthropic | N/A | N/A | N/A | N/A |
| | Gemini 1.5 Pro* | Google | N/A | N/A | N/A | N/A |
* Models marked with an asterisk have only partial scores and are therefore unranked.
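The Overall column is consistent with the unweighted mean of the three sub-score columns, rounded to two decimal places (this relationship was checked against several rows of the table above; `overall` is an illustrative helper, not part of any benchmark tooling):

```python
# Sketch: reproduce the Overall column from the three category sub-scores.
# Assumption: Overall is the unweighted mean, rounded to two decimals,
# which matches the rows in the table above.
def overall(sub_scores):
    """Average the per-category percentages into an overall score."""
    return round(sum(sub_scores) / len(sub_scores), 2)

# Example: Claude 4.5 Haiku (#1), sub-scores 97.86, 97.59, 98.75
print(overall([97.86, 97.59, 98.75]))  # 98.07, matching the table
```

The same computation reproduces the other ranked rows, e.g. GPT 5 mini's 86.55 from sub-scores 90.22, 87.16, and 82.28.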