Tools Reliability

Measures the model's ability to use tools effectively and robustly across varied scenarios, including imperfect inputs such as missing data, extra arguments, or malformed requests. (Higher score is better.)
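To make the three kinds of imperfect input concrete, here is a minimal sketch of a checker that flags missing arguments, extra arguments, and malformed requests in a JSON tool call. The `get_weather` tool, its schema layout, and the `check_tool_call` helper are all hypothetical and chosen only for illustration; the benchmark's actual test cases and scoring method are not described here.

```python
import json

# Hypothetical tool schema: the tool name and fields are illustrative only.
WEATHER_TOOL = {
    "name": "get_weather",
    "required": {"city": str},
    "optional": {"units": str},
}


def check_tool_call(raw: str, tool: dict) -> list[str]:
    """Return a list of problems found in a raw JSON tool call (empty if none)."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return ["malformed request: not valid JSON"]

    # The call must be an object carrying an "arguments" object.
    if not isinstance(call, dict) or not isinstance(call.get("arguments"), dict):
        return ["malformed request: expected an object with an 'arguments' object"]

    args = call["arguments"]
    allowed = {**tool["required"], **tool["optional"]}
    problems = []

    # Missing data: a required argument is absent.
    for name in tool["required"]:
        if name not in args:
            problems.append(f"missing argument: {name}")

    # Extra arguments, or known arguments with the wrong type.
    for name, value in args.items():
        if name not in allowed:
            problems.append(f"extra argument: {name}")
        elif not isinstance(value, allowed[name]):
            problems.append(f"wrong type for argument: {name}")

    return problems
```

A well-formed call such as `{"arguments": {"city": "Paris"}}` yields an empty list, while each imperfect variant yields one or more problem strings that a harness could feed back to the model.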

| Rank | Model | Provider | Score | | | |
|---|---|---|---|---|---|---|
| #1 | Claude 3.5 Sonnet | Anthropic | 86.50% | 85.94% | 86.08% | 87.46% |
| #2 | Mistral Large | Mistral | 78.57% | 74.69% | 77.91% | 83.11% |
| #3 | Claude 3.5 Haiku | Anthropic | 77.58% | 73.90% | 77.70% | 81.13% |
| #4 | Deepseek V3 | Deepseek | 77.44% | 76.84% | 76.75% | 78.74% |
| #5 | GPT-4o | OpenAI | 77.31% | 77.66% | 76.35% | 77.94% |
| #6 | Grok 2 | xAI | 76.58% | 77.16% | 79.81% | 72.77% |
| #7 | Mistral Small 3.1 24B | Mistral | 75.25% | 71.28% | 72.33% | 82.13% |
| #8 | Claude 3.7 Sonnet | Anthropic | 74.99% | 76.60% | 73.49% | 74.88% |
| #9 | GPT-4o mini | OpenAI | 74.46% | 73.54% | 75.98% | 73.87% |
| #10 | Qwen 2.5 Max | Alibaba Qwen | 68.33% | 66.66% | 68.94% | 69.39% |
| #11 | Gemma 3 27B | Google | 68.12% | 60.57% | 65.13% | 78.64% |
| #12 | Gemini 1.5 Pro | Google | 68.00% | 79.76% | 60.29% | 63.96% |
| #13 | Llama 4 Maverick | Meta | 63.94% | 59.91% | 68.85% | 63.07% |
| #14 | Llama 3.3 70B | Meta | 62.38% | 63.35% | 58.33% | 65.44% |
| #15 | Gemini 2.0 Flash | Google | 54.85% | 58.93% | 52.19% | 53.42% |
| #16 | Llama 3.1 405B | Meta | 45.66% | 51.84% | 43.79% | 41.36% |
| #17 | Deepseek V3 (0324) | Deepseek | 36.19% | 32.55% | 36.88% | 39.14% |
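The page does not label the four percentages listed for each model. One reading consistent with the numbers is that the first percentage is the overall score and equals the mean of the other three, up to rounding. The sketch below checks that arithmetic on a handful of rows copied from the leaderboard; the interpretation itself is an assumption, not something the source states, and `mean_gap` is a helper name introduced here.

```python
# Rows copied from the leaderboard: (model, first percentage, remaining three).
# Treating the first value as an overall score is an assumption.
ROWS = [
    ("Claude 3.5 Sonnet", 86.50, (85.94, 86.08, 87.46)),
    ("Mistral Large", 78.57, (74.69, 77.91, 83.11)),
    ("GPT-4o", 77.31, (77.66, 76.35, 77.94)),
    ("Gemini 1.5 Pro", 68.00, (79.76, 60.29, 63.96)),
    ("Deepseek V3 (0324)", 36.19, (32.55, 36.88, 39.14)),
]


def mean_gap(first: float, rest: tuple[float, ...]) -> float:
    """Absolute difference between the first value and the mean of the rest."""
    return abs(first - sum(rest) / len(rest))


# Largest deviation across the sampled rows, in percentage points.
max_gap = max(mean_gap(first, rest) for _, first, rest in ROWS)

for model, first, rest in ROWS:
    print(f"{model}: listed {first:.2f}, mean of rest {sum(rest) / len(rest):.2f}")
```

On these rows the listed first value stays within 0.01 percentage points of the computed mean, which fits an average taken over unrounded sub-scores.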