Self-assessed Stereotypes

We evaluate the model's ability to recognize its own stereotypical associations by having it generate stories about characters with specific attributes (e.g., gender, nationality), then asking it to analyze whether its narrative choices reflect societal stereotypes. (Higher score is better.)
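The two-step procedure described above (generate a story, then ask the same model to review it) can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual harness: the prompt wording, the `self_assess` helper, the YES/NO answer convention, and `stub_model` are all assumptions made for the example, and the scoring that produces the percentages is not shown because the text does not describe it.

```python
from typing import Callable, Dict

# Hypothetical prompt templates; the benchmark's real prompts are not given in the text.
STORY_PROMPT = "Write a short story about a character whose {attribute} is {value}."
REVIEW_PROMPT = (
    "Here is a story you wrote:\n\n{story}\n\n"
    "Do its narrative choices reflect societal stereotypes about {attribute}? "
    "Answer YES or NO first, then explain."
)


def self_assess(model: Callable[[str], str], attribute: str, value: str) -> Dict[str, object]:
    """Run one generate-then-review round against a text-in, text-out model."""
    # Step 1: the model writes a story about a character with the given attribute.
    story = model(STORY_PROMPT.format(attribute=attribute, value=value))
    # Step 2: the same model is asked to analyze its own story.
    review = model(REVIEW_PROMPT.format(story=story, attribute=attribute))
    # Assumed convention: the review starts with YES when a stereotype is recognized.
    flagged = review.strip().upper().startswith("YES")
    return {"attribute": attribute, "value": value,
            "story": story, "review": review, "flagged": flagged}


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call, used only to make the sketch runnable."""
    if prompt.startswith("Here is a story"):
        return "YES. The nurse is written as female by default."
    return "Maria, a nurse, finished her shift and walked home."


result = self_assess(stub_model, "gender", "female")
```

Passing the model as a plain callable keeps the sketch independent of any particular provider SDK; a real run would replace `stub_model` with a function that calls the model under test.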

| Rank | Model | Provider | Score | | | |
|---|---|---|---|---|---|---|
| #1 | Gemini 1.5 Pro | Google | 77.96% | 83.33% | 68.89% | 81.67% |
| #2 | Llama 3.1 405B | Meta | 70.74% | 52.22% | 70.00% | 90.00% |
| #3 | Claude 3.5 Haiku | Anthropic | 65.81% | 61.33% | 56.11% | 80.00% |
| #4 | Gemma 3 27B | Google | 64.44% | 66.67% | 66.67% | 60.00% |
| #5 | Gemini 2.0 Flash | Google | 52.22% | 55.00% | 55.00% | 46.67% |
| #6 | Qwen 2.5 Max | Alibaba Qwen | 51.11% | 46.67% | 53.33% | 53.33% |
| #7 | Deepseek V3 | Deepseek | 45.39% | 38.61% | 53.67% | 43.89% |
| #8 | Llama 3.3 70B | Meta | 44.44% | 40.00% | 46.67% | 46.67% |
| #9 | GPT-4o | OpenAI | 41.85% | 40.00% | 52.22% | 33.33% |
| #10 | Claude 3.7 Sonnet | Anthropic | 41.82% | 43.89% | 41.57% | 40.00% |
| #11 | Claude 3.5 Sonnet | Anthropic | 40.37% | 40.00% | 33.33% | 47.78% |
| #12 | GPT-4o mini | OpenAI | 40.00% | 40.00% | 40.00% | 40.00% |
| #13 | Mistral Small 3.1 24B | Mistral | 35.00% | 35.00% | 36.67% | 33.33% |
| #14 | Mistral Large | Mistral | 28.89% | 28.33% | 28.89% | 29.44% |
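
In the results listed above, the first percentage for each model matches the mean of the three percentages that follow it, up to rounding. A quick way to check this is sketched below; the `mean_matches` helper and the 0.02-point tolerance are assumptions made for the example, and only a few rows are copied in.

```python
# (reported first percentage, the three percentages that follow it)
rows = {
    "Gemini 1.5 Pro": (77.96, [83.33, 68.89, 81.67]),
    "Llama 3.1 405B": (70.74, [52.22, 70.00, 90.00]),
    "Gemma 3 27B": (64.44, [66.67, 66.67, 60.00]),
    "Mistral Large": (28.89, [28.33, 28.89, 29.44]),
}


def mean_matches(reported: float, parts: list, tol: float = 0.02) -> bool:
    """Return True when the reported value is within tol of the mean of parts."""
    return abs(sum(parts) / len(parts) - reported) <= tol


checks = {model: mean_matches(reported, parts) for model, (reported, parts) in rows.items()}
```

The tolerance absorbs two-decimal rounding in the published figures, for example the Gemma 3 27B row, where the computed mean is about 64.447 against a reported 64.44.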