Self-assessed Stereotypes
We evaluate the model's ability to recognize its own stereotypical associations by having it generate stories about characters with specific attributes (e.g., gender, nationality), then asking it to analyze whether its narrative choices reflect societal stereotypes. (Higher score is better.)
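The two-step procedure described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual harness: the function name `run_self_assessment`, the prompt wording, and the `canned_model` stand-in are all assumptions, and the scoring step is left out because the source does not describe it.

```python
from typing import Callable, Dict


def run_self_assessment(
    model: Callable[[str], str], attribute: str, value: str
) -> Dict[str, str]:
    """Hypothetical two-step trial: generate a story, then have the same model review it."""
    # Step 1: ask for a story about a character with the given attribute.
    # The prompt wording is an assumption, not the benchmark's real prompt.
    story_prompt = (
        "Write a short story about a character. "
        f"The character's {attribute} is: {value}."
    )
    story = model(story_prompt)

    # Step 2: ask the same model whether its own narrative choices
    # reflect societal stereotypes tied to that attribute.
    review_prompt = (
        "Here is a story you wrote:\n\n"
        f"{story}\n\n"
        "Do any of your narrative choices reflect societal stereotypes "
        f"about the character's {attribute}? Explain briefly."
    )
    analysis = model(review_prompt)
    return {"story": story, "analysis": analysis}


def canned_model(prompt: str) -> str:
    """Stand-in for a real model call, used only to make the sketch runnable."""
    if prompt.startswith("Here is a story you wrote"):
        return "The character's hobbies may follow a common stereotype."
    return "Ana, an engineer, spent the weekend fixing her motorbike."


result = run_self_assessment(canned_model, "nationality", "Brazilian")
print(result["analysis"])
```

In a real run, `canned_model` would be replaced by a call to the model under evaluation, and the returned analysis would be passed to whatever scoring procedure the benchmark uses.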
| Rank | Model | Provider | Overall | | | |
|---|---|---|---|---|---|---|
| #1 | GPT 4.1 mini | OpenAI | 88.12% | 89.05% | 88.27% | 87.03% |
| #2 | Grok 4 Fast No Reasoning | xAI | 80.26% | 81.61% | 79.58% | 79.61% |
| #3 | Llama 3.1 405B Instruct OR | Meta | 75.23% | 66.73% | 81.82% | 77.14% |
| #4 | Mistral Small 3.2 | Mistral | 73.90% | 73.25% | 75.18% | 73.28% |
| #5 | Llama 4 Maverick | Meta | 73.65% | 68.84% | 74.61% | 77.49% |
| #6 | Claude 4.5 Haiku | Anthropic | 70.66% | 62.75% | 74.22% | 75.00% |
| #7 | Llama 4 Scout | Meta | 67.10% | 69.04% | 63.18% | 69.09% |
| #8 | Deepseek V3.1 | Deepseek | 65.17% | 60.38% | 75.09% | 60.03% |
| #9 | Claude 4.5 Opus | Anthropic | 63.20% | 61.43% | 58.70% | 69.46% |
| #10 | Deepseek V3 | Deepseek | 59.75% | 60.81% | 54.43% | 64.01% |
| #11 | Qwen 3 8B | Alibaba Qwen | 58.64% | 60.90% | 53.21% | 61.80% |
| #12 | Qwen Plus | Alibaba Qwen | 55.70% | 48.68% | 58.14% | 60.28% |
| #13 | Deepseek V3 0324 | Deepseek | 55.26% | 55.68% | 68.70% | 41.41% |
| #14 | Gemini 3.0 Pro Preview | Google | 53.65% | 61.49% | 50.18% | 49.29% |
| #15 | Qwen 3 30B VL Instruct | Alibaba Qwen | 53.57% | 50.42% | 60.71% | 49.56% |
| #16 | Gemini 2.0 Flash | Google | 53.51% | 56.92% | 56.42% | 47.20% |
| #17 | GPT 4.1 | OpenAI | 52.41% | 52.00% | 49.74% | 55.50% |
| #18 | Gemini 2.5 Flash | Google | 51.74% | 47.82% | 50.07% | 57.31% |
| #19 | GPT-4o | OpenAI | 50.92% | 55.17% | 51.56% | 46.04% |
| #20 | Magistral Medium Latest | Mistral | 50.75% | 34.52% | 60.98% | 56.75% |
| #21 | Claude 4.5 Sonnet | Anthropic | 49.14% | 57.63% | 45.66% | 44.12% |
| #22 | Magistral Small Latest | Mistral | 48.02% | 42.13% | 51.84% | 50.09% |
| #23 | GPT 5.1 | OpenAI | 46.77% | 50.09% | 50.73% | 39.49% |
| #24 | GPT 5 mini | OpenAI | 46.41% | 44.05% | 54.20% | 40.98% |
| #25 | Grok 3 mini | xAI | 46.40% | 46.14% | 42.04% | 51.01% |
| #26 | Gemini 2.5 Flash Lite | Google | 45.53% | 36.02% | 51.07% | 49.51% |
| #27 | Llama 3.3 70B Instruct OR | Meta | 45.39% | 42.90% | 48.06% | 45.21% |
| #28 | Qwen 3 Max | Alibaba Qwen | 44.77% | 45.88% | 49.08% | 39.37% |
| #29 | Llama 3.1 8B Instruct | Meta | 44.15% | 30.35% | 50.44% | 51.67% |
| #30 | Claude 4.1 Opus | Anthropic | 43.61% | 45.36% | 41.47% | 44.00% |
| #31 | Qwen 2.5 Max | Alibaba Qwen | 42.95% | 30.40% | 48.04% | 50.42% |
| #32 | Gemini 2.0 Flash Lite | Google | 41.65% | 52.34% | 33.04% | 39.57% |
| #33 | Mistral Large 2 | Mistral | 39.70% | 39.19% | 34.47% | 45.45% |
| #34 | Mistral Medium Latest | Mistral | 39.66% | 37.96% | 39.98% | 41.03% |
| #35 | GPT OSS 120B | OpenAI | 38.84% | 32.83% | 42.77% | 40.90% |
| #36 | Claude 3.5 Haiku 20241022 | Anthropic | 38.08% | 38.21% | 37.26% | 38.78% |
| #37 | Gemma 3 27B IT OR | Google | 38.01% | 31.52% | 43.06% | 39.45% |
| #38 | GPT-4o mini | OpenAI | 37.90% | 39.05% | 37.02% | 37.63% |
| #39 | Gemini 2.5 Pro | Google | 36.52% | 35.48% | 34.26% | 39.83% |
| #40 | GPT 4.1 nano | OpenAI | 36.22% | 33.65% | 37.37% | 37.63% |
| #41 | GPT 5 nano | OpenAI | 34.70% | 40.36% | 35.17% | 28.56% |
| #42 | Claude 3.7 Sonnet | Anthropic | 33.77% | 35.99% | 28.18% | 37.16% |
| #43 | Grok 2 | xAI | 33.10% | 40.21% | 30.45% | 28.63% |
| #44 | Gemma 3 12B IT OR | Google | 32.14% | 28.44% | 35.32% | 32.66% |
| #45 | GPT 5 | OpenAI | 28.56% | 27.69% | 34.15% | 23.86% |
| #46 | Deepseek R1 0528 | Deepseek | 25.49% | 25.96% | 24.03% | 26.49% |
| #47 | Grok 3 | xAI | 23.24% | 28.10% | 24.46% | 17.14% |
| | Mistral Small 3.1* | Mistral | N/A | N/A | N/A | N/A |
| | Claude 3.5 Sonnet* | Anthropic | N/A | N/A | N/A | N/A |
| | Gemini 1.5 Pro* | Google | N/A | N/A | N/A | N/A |

\* Models marked with an asterisk have partial scores.
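
The source does not say what the four score columns are. From the numbers alone, the first score column, which determines the ranking, appears to be the unweighted mean of the other three. The short check below uses four rows copied from the table. The helper name `mean_gap` and the tolerance are assumptions, chosen to allow for the published values being rounded to two decimals.

```python
# Rows copied from the table: (first score column, then the three remaining columns).
rows = {
    "GPT 4.1 mini": (88.12, 89.05, 88.27, 87.03),
    "Llama 3.1 405B Instruct OR": (75.23, 66.73, 81.82, 77.14),
    "Claude 4.5 Haiku": (70.66, 62.75, 74.22, 75.00),
    "Grok 3": (23.24, 28.10, 24.46, 17.14),
}


def mean_gap(first: float, *others: float) -> float:
    """Absolute difference between the first score and the mean of the others."""
    return abs(first - sum(others) / len(others))


# Tolerance is an assumption: it covers rounding of each published value.
TOLERANCE = 0.02

for model_name, scores in rows.items():
    gap = mean_gap(*scores)
    status = "consistent" if gap < TOLERANCE else "inconsistent"
    print(f"{model_name}: gap = {gap:.4f} ({status} with a simple mean)")
```

This is an observation about the published figures rather than a documented definition, so a weighted or otherwise different aggregation cannot be ruled out.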