Self-assessed Stereotypes
We evaluate whether the model can recognize its own stereotypical associations. The model first generates stories about characters with specific attributes (e.g., gender, nationality) and is then asked to analyze whether its narrative choices reflect societal stereotypes. Higher scores are better. In the table, the first score column is the mean of the three score columns that follow it.
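
The two-step procedure described above (generate a story, then ask the model to review it) can be sketched roughly as follows. This is only an illustration under stated assumptions: the prompt templates, the `generate` callable, the `self_assess` helper, and the YES/NO parsing are all hypothetical and are not taken from the benchmark, and the sketch does not attempt to reproduce how the published scores are computed.

```python
from typing import Callable, Dict

# Hypothetical prompt templates; the benchmark's actual prompts are not given here.
STORY_PROMPT = "Write a short story about a character whose {attribute} is {value}."
REVIEW_PROMPT = (
    "Here is a story you wrote:\n\n{story}\n\n"
    "Do its narrative choices reflect societal stereotypes about {attribute}? "
    "Answer YES or NO, then explain briefly."
)


def self_assess(
    generate: Callable[[str], str],  # assumed interface: prompt in, model text out
    attribute: str,
    value: str,
) -> Dict[str, object]:
    """Run one generate-then-review round and return the raw record."""
    # Step 1: the model writes a story about a character with the given attribute.
    story = generate(STORY_PROMPT.format(attribute=attribute, value=value))
    # Step 2: the same model is asked to judge its own story.
    review = generate(REVIEW_PROMPT.format(story=story, attribute=attribute))
    # Assumed convention: the review begins with YES when a stereotype is detected.
    flagged = review.strip().upper().startswith("YES")
    return {
        "attribute": attribute,
        "value": value,
        "story": story,
        "review": review,
        "flagged_stereotype": flagged,
    }
```

Passing the model as a plain callable keeps the sketch independent of any particular provider's client library; any function that maps a prompt string to a completion string would fit.
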
Rank | Model | Provider | Mean | | | |
---|---|---|---|---|---|---|
#1 | Gemini 1.5 Pro | Google | 77.96% | 83.33% | 68.89% | 81.67% |
#2 | Llama 3.1 405B | Meta | 70.74% | 52.22% | 70.00% | 90.00% |
#3 | Claude 3.5 Haiku | Anthropic | 65.81% | 61.33% | 56.11% | 80.00% |
#4 | Gemma 3 27B | Google | 64.44% | 66.67% | 66.67% | 60.00% |
#5 | Gemini 2.0 Flash | Google | 52.22% | 55.00% | 55.00% | 46.67% |
#6 | Qwen 2.5 Max | Alibaba Qwen | 51.11% | 46.67% | 53.33% | 53.33% |
#7 | DeepSeek V3 | DeepSeek | 45.39% | 38.61% | 53.67% | 43.89% |
#8 | Llama 3.3 70B | Meta | 44.44% | 40.00% | 46.67% | 46.67% |
#9 | GPT-4o | OpenAI | 41.85% | 40.00% | 52.22% | 33.33% |
#10 | Claude 3.7 Sonnet | Anthropic | 41.82% | 43.89% | 41.57% | 40.00% |
#11 | Claude 3.5 Sonnet | Anthropic | 40.37% | 40.00% | 33.33% | 47.78% |
#12 | GPT-4o mini | OpenAI | 40.00% | 40.00% | 40.00% | 40.00% |
#13 | Mistral Small 3.1 24B | Mistral | 35.00% | 35.00% | 36.67% | 33.33% |
#14 | Mistral Large | Mistral | 28.89% | 28.33% | 28.89% | 29.44% |
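
The score columns carry no labels here, but the arithmetic suggests that the first percentage in each row is the mean of the other three. A minimal check, using three rows copied from the table (the `mean_matches` helper and the 0.01 rounding tolerance are assumptions made for this sketch):

```python
# Rows copied from the table: (first column, then the three remaining columns).
rows = {
    "Gemini 1.5 Pro": (77.96, 83.33, 68.89, 81.67),
    "Llama 3.1 405B": (70.74, 52.22, 70.00, 90.00),
    "Mistral Large": (28.89, 28.33, 28.89, 29.44),
}


def mean_matches(first: float, parts: tuple, tol: float = 0.01) -> bool:
    """Return True if `first` equals the mean of `parts` within rounding tolerance."""
    return abs(first - sum(parts) / len(parts)) <= tol


# Compare the first column against the mean of the remaining three for each row.
checks = {name: mean_matches(vals[0], vals[1:]) for name, vals in rows.items()}
```

The tolerance allows for the table's values being rounded to two decimal places.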