Misinformation
We evaluate each model's ability to provide accurate information when responding to questions that contain false premises, misleading framing, or factually incorrect assertions. (Higher scores are better.)
Rank | Model | Provider | Average | False Premises | Misleading Framing | Incorrect Assertions
---|---|---|---|---|---|---
#1 | Claude 3.7 Sonnet | Anthropic | 89.52% | 91.61% | 87.76% | 89.19% |
#2 | Claude 3.5 Sonnet | Anthropic | 85.43% | 93.17% | 79.59% | 83.54% |
#3 | Claude 3.5 Haiku | Anthropic | 76.40% | 86.96% | 70.75% | 71.50% |
#4 | Gemini 1.5 Pro | Google | 74.39% | 78.57% | 74.83% | 69.78% |
#5 | Llama 3.1 405B | Meta | 73.59% | 90.97% | 66.67% | 63.14% |
#6 | Mistral Large | Mistral | 68.74% | 75.16% | 69.39% | 61.67% |
#7 | GPT-4o | OpenAI | 68.11% | 75.16% | 65.31% | 63.88% |
#8 | Llama 4 Maverick | Meta | 64.99% | 73.91% | 59.86% | 61.18% |
#9 | Llama 3.3 70B | Meta | 64.24% | 81.68% | 53.06% | 57.99% |
#10 | Gemini 2.0 Flash | Google | 62.43% | 74.53% | 53.06% | 59.71% |
#11 | Mistral Small 3.1 24B | Mistral | 58.70% | 65.84% | 56.46% | 53.81% |
#12 | Qwen 2.5 Max | Alibaba Qwen | 57.93% | 67.70% | 56.46% | 49.63% |
#13 | DeepSeek V3 (0324) | DeepSeek | 55.74% | 68.63% | 46.26% | 52.33% |
#14 | DeepSeek V3 | DeepSeek | 50.54% | 68.01% | 37.41% | 46.19% |
#15 | GPT-4o mini | OpenAI | 46.40% | 60.87% | 38.78% | 39.56% |
#16 | Gemma 3 27B | Google | 41.23% | 51.86% | 37.41% | 34.40% |
#17 | Grok 2 | xAI | 39.97% | 45.96% | 35.37% | 38.57% |
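The leading score column is consistent with a simple mean of the three per-category scores in each row. A minimal sketch of that relationship (the function name and parameter names are illustrative, not part of the benchmark's code):

```python
def overall_score(false_premises: float, misleading_framing: float,
                  incorrect_assertions: float) -> float:
    """Overall misinformation score: the mean of the three
    per-category accuracies, rounded to two decimal places."""
    return round((false_premises + misleading_framing + incorrect_assertions) / 3, 2)

# Claude 3.7 Sonnet's row from the table above
print(overall_score(91.61, 87.76, 89.19))  # 89.52
```

The same arithmetic reproduces the first score column for every row in the table.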