Self-assessed Stereotypes

We evaluate the model's ability to recognize its own stereotypical associations by having it generate stories about characters with specific attributes (e.g., gender, nationality), then asking it to analyze whether its narrative choices reflect societal stereotypes. (Higher score is better.)
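The two-step procedure described above (generate a story, then ask the same model to review it) can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual harness: the prompt wording, the `generate` callable, the `self_assess` and `run` helpers, and the `echo_model` stand-in are all hypothetical, and the scoring step is omitted because the text does not describe it.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical prompts; the benchmark's real wording is not given in the text.
STORY_PROMPT = "Write a short story about a character who is {attribute}."
REVIEW_PROMPT = (
    "Here is a story you wrote:\n\n{story}\n\n"
    "Do its narrative choices reflect societal stereotypes about the "
    "character's {category}? Answer YES or NO, then explain."
)


def self_assess(
    generate: Callable[[str], str], category: str, attribute: str
) -> Dict[str, str]:
    """Run one generate-then-review round against the same model."""
    # Step 1: the model writes a story about a character with the attribute.
    story = generate(STORY_PROMPT.format(attribute=attribute))
    # Step 2: the same model is asked to analyze its own story.
    review = generate(REVIEW_PROMPT.format(story=story, category=category))
    return {
        "category": category,
        "attribute": attribute,
        "story": story,
        "review": review,
    }


def run(
    generate: Callable[[str], str], cases: List[Tuple[str, str]]
) -> List[Dict[str, str]]:
    """Collect one record per (category, attribute) pair."""
    return [self_assess(generate, category, attribute) for category, attribute in cases]


def echo_model(prompt: str) -> str:
    """Stand-in model so the sketch runs offline; it returns canned text."""
    if prompt.startswith("Write a short story"):
        return "Once upon a time..."
    return "NO. The story avoids stereotyped roles."


records = run(echo_model, [("gender", "a woman"), ("nationality", "French")])
```

In practice `generate` would wrap a call to the model under evaluation, and a separate grading step would turn each review into the percentage reported on the leaderboard.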

| Rank | Model | Provider |  |  |  |  |
|---|---|---|---|---|---|---|
| 1 | Gemini 1.5 Pro | Google | 77.96% | 83.33% | 68.89% | 81.67% |
| 2 | Llama 3.1 405B | Meta | 70.74% | 52.22% | 70.00% | 90.00% |
| 3 | Claude 3.5 Haiku | Anthropic | 65.81% | 61.33% | 56.11% | 80.00% |
| 4 | Gemma 3 27B | Google | 64.44% | 66.67% | 66.67% | 60.00% |
| 5 | Llama 4 Maverick | Meta | 63.89% | 51.00% | 84.44% | 56.22% |
| 6 | Gemini 2.0 Flash | Google | 52.22% | 55.00% | 55.00% | 46.67% |
| 7 | Deepseek V3 (0324) | Deepseek | 51.11% | 53.33% | 54.44% | 45.56% |
| 8 | Qwen 2.5 Max | Alibaba Qwen | 51.11% | 46.67% | 53.33% | 53.33% |
| 9 | Deepseek V3 | Deepseek | 45.39% | 38.61% | 53.67% | 43.89% |
| 10 | Llama 3.3 70B | Meta | 44.44% | 40.00% | 46.67% | 46.67% |
| 11 | GPT-4o | OpenAI | 41.85% | 40.00% | 52.22% | 33.33% |
| 12 | Claude 3.7 Sonnet | Anthropic | 41.82% | 43.89% | 41.57% | 40.00% |
| 13 | Claude 3.5 Sonnet | Anthropic | 40.37% | 40.00% | 33.33% | 47.78% |
| 14 | GPT-4o mini | OpenAI | 40.00% | 40.00% | 40.00% | 40.00% |
| 15 | Mistral Small 3.1 24B | Mistral | 35.00% | 35.00% | 36.67% | 33.33% |
| 16 | Mistral Large | Mistral | 28.89% | 28.33% | 28.89% | 29.44% |
| 17 | Grok 2 | xAI | 26.67% | 26.67% | 26.67% | 26.67% |
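
The leaderboard does not label the four percentages listed for each model. Reading the numbers, the first value appears to be the unweighted mean of the other three, to rounding, and the ranking follows that first value. This is an inference from the data, not something the source states. A small check, using a few rows copied from the leaderboard and a hypothetical `mean_matches` helper, might look like this:

```python
from typing import Dict, List, Tuple

# (model, first percentage, remaining three percentages), copied from the leaderboard.
ROWS: List[Tuple[str, float, Tuple[float, float, float]]] = [
    ("Gemini 1.5 Pro", 77.96, (83.33, 68.89, 81.67)),
    ("Llama 3.1 405B", 70.74, (52.22, 70.00, 90.00)),
    ("Claude 3.5 Haiku", 65.81, (61.33, 56.11, 80.00)),
    ("Grok 2", 26.67, (26.67, 26.67, 26.67)),
]


def mean_matches(first: float, others: Tuple[float, ...], tol: float = 0.01) -> bool:
    """Return True if `first` equals the mean of `others` within `tol` points.

    The tolerance absorbs rounding, since the published values carry only
    two decimal places.
    """
    return abs(sum(others) / len(others) - first) <= tol


def check(rows: List[Tuple[str, float, Tuple[float, float, float]]]) -> Dict[str, bool]:
    """Map each model name to whether its first score matches the mean."""
    return {model: mean_matches(first, others) for model, first, others in rows}


results = check(ROWS)
```

If the assumption holds, every entry in `results` is `True`. A `False` entry would suggest either a weighted aggregate or a transcription error in that row.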