Tools Reliability

Measures how effectively and robustly a model uses tools across various scenarios, including imperfect inputs such as missing data, extra arguments, or malformed requests. (A higher score is better.)
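
To make the three kinds of imperfect input concrete, here is a minimal sketch that classifies a raw tool-call argument string as malformed, missing required fields, or carrying extra fields. It is an illustration only and not the benchmark's scoring method; the function name `check_tool_call`, the `required`/`optional` parameters, and the example field names are all hypothetical.

```python
import json


def check_tool_call(raw_arguments, required, optional=()):
    """Classify a raw JSON argument string against a hypothetical tool spec."""
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError:
        # The request body is not valid JSON at all.
        return {"status": "malformed", "missing": [], "extra": []}
    if not isinstance(args, dict):
        # Valid JSON, but not an object of named arguments.
        return {"status": "malformed", "missing": [], "extra": []}

    allowed = set(required) | set(optional)
    missing = sorted(set(required) - set(args))  # required fields not supplied
    extra = sorted(set(args) - allowed)          # fields the tool does not accept
    status = "ok" if not missing and not extra else "imperfect"
    return {"status": status, "missing": missing, "extra": extra}


if __name__ == "__main__":
    spec = ["city", "date"]  # hypothetical required arguments
    print(check_tool_call('{"city": "Paris"}', spec))
    print(check_tool_call('{"city": "Paris", "date": "2024-01-01", "units": "C"}', spec))
    print(check_tool_call('{"city": ', spec))
```

A robust tool user would typically ask for the missing field, drop or question the extra one, and refuse to act on the malformed request rather than guess.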

Rank  Model                  Provider       Score
#1    Claude 3.5 Sonnet      Anthropic      85.34%  86.11%  84.74%  85.17%
#2    Mistral Large          Mistral        78.89%  75.16%  78.02%  83.50%
#3    Claude 3.5 Haiku       Anthropic      78.68%  74.68%  78.62%  82.72%
#4    Gemini 1.5 Pro         Google         74.90%  79.84%  73.37%  71.48%
#5    Mistral Small 3.1 24B  Mistral        73.98%  69.84%  76.88%  75.22%
#6    GPT-4o mini            OpenAI         73.12%  71.07%  74.63%  73.67%
#7    Deepseek V3            Deepseek       72.74%  70.24%  72.79%  75.18%
#8    Claude 3.7 Sonnet      Anthropic      72.68%  74.34%  69.36%  74.33%
#9    Gemma 3 27B            Google         69.30%  60.77%  67.25%  79.87%
#10   Qwen 2.5 Max           Alibaba Qwen   68.33%  66.10%  69.07%  69.81%
#11   GPT-4o                 OpenAI         64.84%  65.69%  63.04%  65.78%
#12   Llama 3.3 70B          Meta           48.71%  60.21%  40.32%  45.59%
#13   Llama 3.1 405B         Meta           32.32%  40.85%  27.49%  28.62%
#14   Gemini 2.0 Flash       Google         31.80%  35.39%  29.14%  30.88%
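
For each model listed above, the first percentage appears to be the mean of the three that follow it, to within rounding. The source does not label these columns, so this reading is an inference from the numbers rather than a stated definition. A short check, with the values copied from the leaderboard and a tolerance of 0.01 percentage points assumed to absorb rounding:

```python
# (model, first value, remaining three values), copied from the leaderboard.
ROWS = [
    ("Claude 3.5 Sonnet", 85.34, (86.11, 84.74, 85.17)),
    ("Mistral Large", 78.89, (75.16, 78.02, 83.50)),
    ("Claude 3.5 Haiku", 78.68, (74.68, 78.62, 82.72)),
    ("Gemini 1.5 Pro", 74.90, (79.84, 73.37, 71.48)),
    ("Mistral Small 3.1 24B", 73.98, (69.84, 76.88, 75.22)),
    ("GPT-4o mini", 73.12, (71.07, 74.63, 73.67)),
    ("Deepseek V3", 72.74, (70.24, 72.79, 75.18)),
    ("Claude 3.7 Sonnet", 72.68, (74.34, 69.36, 74.33)),
    ("Gemma 3 27B", 69.30, (60.77, 67.25, 79.87)),
    ("Qwen 2.5 Max", 68.33, (66.10, 69.07, 69.81)),
    ("GPT-4o", 64.84, (65.69, 63.04, 65.78)),
    ("Llama 3.3 70B", 48.71, (60.21, 40.32, 45.59)),
    ("Llama 3.1 405B", 32.32, (40.85, 27.49, 28.62)),
    ("Gemini 2.0 Flash", 31.80, (35.39, 29.14, 30.88)),
]


def mean_gap(first, parts):
    """Absolute difference between the first value and the mean of the rest."""
    return abs(first - sum(parts) / len(parts))


def check(rows, tolerance=0.01):
    # The tolerance is an assumption: published values are rounded to 2 decimals.
    return {model: mean_gap(first, parts) <= tolerance for model, first, parts in rows}


if __name__ == "__main__":
    for model, consistent in check(ROWS).items():
        print(model, "consistent" if consistent else "inconsistent")
```

Read this way, a model's rank can hide large spreads between the three component values: Gemma 3 27B, for example, ranges from 60.77% to 79.87%.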