Self-assessed Stereotypes

We evaluate the model's ability to recognize its own stereotypical associations by having it generate stories about characters with specific attributes (e.g., gender, nationality), then asking it to analyze whether its narrative choices reflect societal stereotypes. (Higher score is better.)
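The two-step procedure described above (generate a story, then ask the model to review its own narrative choices) can be sketched roughly as follows. This is only an illustration: the prompt wording, the YES/NO answer format, the `self_assess` helper, and the `stub_model` stand-in are all assumptions made for this sketch, not the benchmark's actual prompts or scoring.

```python
from typing import Callable, Dict


def self_assess(ask: Callable[[str], str], attribute: str, value: str) -> Dict[str, object]:
    """Run one generate-then-review round (hypothetical prompts)."""
    # Step 1: have the model write a story about a character with the attribute.
    story = ask(f"Write a short story about a character whose {attribute} is {value}.")
    # Step 2: show the model its own story and ask whether it relied on stereotypes.
    review = ask(
        "Here is a story you wrote:\n\n"
        f"{story}\n\n"
        f"Do its narrative choices reflect societal stereotypes about {attribute}? "
        "Answer YES or NO, then explain."
    )
    return {
        "story": story,
        "review": review,
        # Assumed convention: the review starts with YES when the model
        # acknowledges a stereotype in its own story.
        "admits_stereotype": review.strip().upper().startswith("YES"),
    }


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call, so the sketch runs offline."""
    if prompt.startswith("Write a short story"):
        return "Maria stayed home to cook while her brothers went to work."
    return "YES. The story assigns domestic work to the female character."


result = self_assess(stub_model, "gender", "female")
print(result["admits_stereotype"])
```

In practice `ask` would wrap a call to the model under test, and the outcomes of many such rounds across attributes would be aggregated into a percentage; how the benchmark performs that aggregation is not specified here.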

| Rank | Model | Provider | Scores | | | |
|---|---|---|---|---|---|---|
| #1 | GPT 4.1 mini | OpenAI | 88.12% | 89.05% | 88.27% | 87.03% |
| #2 | Grok 4 Fast No Reasoning | xAI | 80.26% | 81.61% | 79.58% | 79.61% |
| #3 | Llama 3.1 405B Instruct OR | Meta | 75.23% | 66.73% | 81.82% | 77.14% |
| #4 | Mistral Small 3.2 | Mistral | 73.90% | 73.25% | 75.18% | 73.28% |
| #5 | Llama 4 Maverick | Meta | 73.65% | 68.84% | 74.61% | 77.49% |
| #6 | Claude 4.5 Haiku | Anthropic | 70.66% | 62.75% | 74.22% | 75.00% |
| #7 | Llama 4 Scout | Meta | 67.10% | 69.04% | 63.18% | 69.09% |
| #8 | Deepseek V3.1 | Deepseek | 65.17% | 60.38% | 75.09% | 60.03% |
| #9 | Claude 4.5 Opus | Anthropic | 63.20% | 61.43% | 58.70% | 69.46% |
| #10 | Deepseek V3 | Deepseek | 59.75% | 60.81% | 54.43% | 64.01% |
| #11 | Qwen 3 8B | Alibaba Qwen | 58.64% | 60.90% | 53.21% | 61.80% |
| #12 | Qwen Plus | Alibaba Qwen | 55.70% | 48.68% | 58.14% | 60.28% |
| #13 | Deepseek V3 0324 | Deepseek | 55.26% | 55.68% | 68.70% | 41.41% |
| #14 | Gemini 3.0 Pro Preview | Google | 53.65% | 61.49% | 50.18% | 49.29% |
| #15 | Qwen 3 30B VL Instruct | Alibaba Qwen | 53.57% | 50.42% | 60.71% | 49.56% |
| #16 | Gemini 2.0 Flash | Google | 53.51% | 56.92% | 56.42% | 47.20% |
| #17 | GPT 4.1 | OpenAI | 52.41% | 52.00% | 49.74% | 55.50% |
| #18 | Gemini 2.5 Flash | Google | 51.74% | 47.82% | 50.07% | 57.31% |
| #19 | GPT-4o | OpenAI | 50.92% | 55.17% | 51.56% | 46.04% |
| #20 | Magistral Medium Latest | Mistral | 50.75% | 34.52% | 60.98% | 56.75% |
| #21 | Claude 4.5 Sonnet | Anthropic | 49.14% | 57.63% | 45.66% | 44.12% |
| #22 | Magistral Small Latest | Mistral | 48.02% | 42.13% | 51.84% | 50.09% |
| #23 | GPT 5.1 | OpenAI | 46.77% | 50.09% | 50.73% | 39.49% |
| #24 | GPT 5 mini | OpenAI | 46.41% | 44.05% | 54.20% | 40.98% |
| #25 | Grok 3 mini | xAI | 46.40% | 46.14% | 42.04% | 51.01% |
| #26 | Gemini 2.5 Flash Lite | Google | 45.53% | 36.02% | 51.07% | 49.51% |
| #27 | Llama 3.3 70B Instruct OR | Meta | 45.39% | 42.90% | 48.06% | 45.21% |
| #28 | Qwen 3 Max | Alibaba Qwen | 44.77% | 45.88% | 49.08% | 39.37% |
| #29 | Llama 3.1 8B Instruct | Meta | 44.15% | 30.35% | 50.44% | 51.67% |
| #30 | Claude 4.1 Opus | Anthropic | 43.61% | 45.36% | 41.47% | 44.00% |
| #31 | Qwen 2.5 Max | Alibaba Qwen | 42.95% | 30.40% | 48.04% | 50.42% |
| #32 | Gemini 2.0 Flash Lite | Google | 41.65% | 52.34% | 33.04% | 39.57% |
| #33 | Mistral Large 2 | Mistral | 39.70% | 39.19% | 34.47% | 45.45% |
| #34 | Mistral Medium Latest | Mistral | 39.66% | 37.96% | 39.98% | 41.03% |
| #35 | GPT OSS 120B | OpenAI | 38.84% | 32.83% | 42.77% | 40.90% |
| #36 | Claude 3.5 Haiku 20241022 | Anthropic | 38.08% | 38.21% | 37.26% | 38.78% |
| #37 | Gemma 3 27B IT OR | Google | 38.01% | 31.52% | 43.06% | 39.45% |
| #38 | GPT-4o mini | OpenAI | 37.90% | 39.05% | 37.02% | 37.63% |
| #39 | Gemini 2.5 Pro | Google | 36.52% | 35.48% | 34.26% | 39.83% |
| #40 | GPT 4.1 nano | OpenAI | 36.22% | 33.65% | 37.37% | 37.63% |
| #41 | GPT 5 nano | OpenAI | 34.70% | 40.36% | 35.17% | 28.56% |
| #42 | Claude 3.7 Sonnet | Anthropic | 33.77% | 35.99% | 28.18% | 37.16% |
| #43 | Grok 2 | xAI | 33.10% | 40.21% | 30.45% | 28.63% |
| #44 | Gemma 3 12B IT OR | Google | 32.14% | 28.44% | 35.32% | 32.66% |
| #45 | GPT 5 | OpenAI | 28.56% | 27.69% | 34.15% | 23.86% |
| #46 | Deepseek R1 0528 | Deepseek | 25.49% | 25.96% | 24.03% | 26.49% |
| #47 | Grok 3 | xAI | 23.24% | 28.10% | 24.46% | 17.14% |
| | Mistral Small 3.1* | Mistral | N/A | N/A | N/A | N/A |
| | Claude 3.5 Sonnet* | Anthropic | N/A | N/A | N/A | N/A |
| | Gemini 1.5 Pro* | Google | N/A | N/A | N/A | N/A |
* Models marked with an asterisk have partial scores.
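
The four percentages listed for each model are not labeled in this section. For the rows checked in the sketch below, the first value matches the arithmetic mean of the other three to within rounding, which suggests, but does not confirm, that it is an aggregate of three sub-scores. The `rows` sample and the `gap_from_mean` helper are illustrative names introduced here; the numbers are copied from the leaderboard above.

```python
# Sample rows copied from the leaderboard: (first value, then the other three).
rows = {
    "GPT 4.1 mini": (88.12, 89.05, 88.27, 87.03),
    "Llama 3.1 405B Instruct OR": (75.23, 66.73, 81.82, 77.14),
    "Deepseek V3 0324": (55.26, 55.68, 68.70, 41.41),
    "Grok 3": (23.24, 28.10, 24.46, 17.14),
}


def gap_from_mean(scores):
    """Absolute difference between the first value and the mean of the rest."""
    first, *rest = scores
    return abs(first - sum(rest) / len(rest))


# Published values are rounded to two decimals, so allow a small tolerance.
TOLERANCE = 0.02
consistent = {model: gap_from_mean(s) <= TOLERANCE for model, s in rows.items()}
print(consistent)
```

Only a handful of rows are checked here, so the result is a consistency check on those rows rather than a statement about how the leaderboard computes its ranking.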