Self-assessed Stereotypes

We evaluate the model's ability to recognize its own stereotypical associations by having it generate stories about characters with specific attributes (e.g., gender, nationality), then asking it to analyze whether its narrative choices reflect societal stereotypes. (Higher score is better.)
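The two-step procedure described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual harness: the `model` callable, the prompt wording, and the scoring rule (the fraction of stories that the model's own review does not flag as stereotypical) are all assumptions, since the source does not specify how the percentage is computed.

```python
from typing import Callable, Sequence


def self_assessment_score(
    model: Callable[[str], str],
    attributes: Sequence[str],
) -> float:
    """Hypothetical scorer: share of stories the model's own review does not flag.

    `model` is any function mapping a prompt string to a response string.
    The prompts and the YES/NO parsing below are illustrative assumptions.
    """
    if not attributes:
        raise ValueError("at least one attribute is required")

    unflagged = 0
    for attribute in attributes:
        # Step 1: the model writes a story about a character with the attribute.
        story = model(f"Write a short story about a character who is {attribute}.")
        # Step 2: the same model reviews its own narrative choices.
        verdict = model(
            "Does the following story rely on societal stereotypes about "
            f"{attribute}? Answer YES or NO.\n\n{story}"
        )
        if verdict.strip().upper().startswith("NO"):
            unflagged += 1
    return unflagged / len(attributes)
```

Any callable can stand in for `model`, so the loop can be exercised with a stub before it is wired to a real API client.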

| Rank | Model | Provider | Mean | Score 1 | Score 2 | Score 3 |
|---|---|---|---|---|---|---|
| #1 | Llama 3.1 405B | Meta | 95.93% | 97.78% | 93.33% | 96.67% |
| #2 | Gemini 1.5 Pro | Google | 93.70% | 93.33% | 91.11% | 96.67% |
| #3 | Llama 4 Maverick | Meta | 93.13% | 84.28% | 100.00% | 95.11% |
| #4 | Deepseek V3 | Deepseek | 86.24% | 77.50% | 95.67% | 85.56% |
| #5 | Gemini 2.0 Flash | Google | 85.37% | 85.56% | 88.33% | 82.22% |
| #6 | Gemma 3 27B | Google | 78.59% | 82.22% | 71.33% | 82.22% |
| #7 | Deepseek V3 (0324) | Deepseek | 74.96% | 77.78% | 83.33% | 63.78% |
| #8 | Mistral Small 3.1 24B | Mistral | 72.83% | 74.33% | 71.37% | 72.78% |
| #9 | Claude 3.5 Haiku | Anthropic | 67.98% | 69.67% | 53.33% | 80.95% |
| #10 | Llama 3.3 70B | Meta | 66.56% | 64.13% | 70.00% | 65.56% |
| #11 | GPT-4o | OpenAI | 66.48% | 70.00% | 75.00% | 54.44% |
| #12 | Qwen 2.5 Max | Alibaba Qwen | 66.22% | 68.33% | 73.67% | 56.67% |
| #13 | Claude 3.7 Sonnet | Anthropic | 61.10% | 67.67% | 54.52% | 61.11% |
| #14 | GPT-4o mini | OpenAI | 60.74% | 57.78% | 70.00% | 54.44% |
| #15 | Claude 3.5 Sonnet | Anthropic | 53.67% | 54.00% | 45.33% | 61.67% |
| #16 | Grok 2 | xAI | 49.56% | 51.94% | 39.90% | 56.83% |
| #17 | Mistral Large | Mistral | 49.31% | 48.63% | 46.30% | 53.02% |
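
In the figures above, the first percentage listed for each model matches the arithmetic mean of the three that follow it, to within rounding. The source does not name the three component scores, so the short check below treats them as unlabeled values; the helper `mean_matches` and its tolerance are illustrative choices, and only three rows are reproduced for brevity.

```python
# Each tuple: (first reported percentage, then the three percentages after it).
rows = {
    "Llama 3.1 405B": (95.93, 97.78, 93.33, 96.67),
    "Gemini 1.5 Pro": (93.70, 93.33, 91.11, 96.67),
    "Llama 4 Maverick": (93.13, 84.28, 100.00, 95.11),
}


def mean_matches(reported: float, parts: list[float], tol: float = 0.01) -> bool:
    """True if `reported` equals the mean of `parts` within `tol` percentage points."""
    return abs(sum(parts) / len(parts) - reported) <= tol


for name, (reported, *parts) in rows.items():
    print(f"{name}: mean of components matches reported value -> "
          f"{mean_matches(reported, parts)}")
```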