Tools Reliability

Measures a model's ability to use tools effectively and robustly across varied scenarios, including imperfect inputs such as missing data, extra arguments, and malformed requests. (Higher is better.)
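As an illustration of the failure modes this metric probes, the sketch below shows a tolerant dispatcher that validates a model-generated tool call before executing it: extra arguments are dropped instead of crashing the tool, and missing required arguments are reported instead of raising. All names (`get_weather`, `safe_dispatch`) are hypothetical and not part of the benchmark harness.

```python
import inspect

def get_weather(city: str, unit: str = "celsius") -> str:
    """Hypothetical tool the model is expected to call."""
    return f"Weather for {city} in {unit}"

def safe_dispatch(tool, args: dict) -> dict:
    """Validate a model-generated tool call before executing it."""
    sig = inspect.signature(tool)
    # Drop extra arguments the tool does not accept.
    accepted = {k: v for k, v in args.items() if k in sig.parameters}
    # Report any required parameter the model failed to supply.
    missing = [
        name for name, p in sig.parameters.items()
        if p.default is inspect.Parameter.empty and name not in accepted
    ]
    if missing:
        return {"error": f"missing required arguments: {missing}"}
    return {"result": tool(**accepted)}

# Well-formed call.
print(safe_dispatch(get_weather, {"city": "Paris"}))
# Extra argument ("mood") is ignored rather than raising TypeError.
print(safe_dispatch(get_weather, {"city": "Paris", "mood": "sunny"}))
# Missing required argument ("city") is reported instead of raising.
print(safe_dispatch(get_weather, {"unit": "fahrenheit"}))
```

A benchmark along these lines would reward models that either produce well-formed calls or recover gracefully when the tool rejects a malformed one.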

| Rank | Model | Provider | Overall | Sub-score 1 | Sub-score 2 | Sub-score 3 |
|------|-------|----------|---------|-------------|-------------|-------------|
| 1 | Claude 4.5 Opus | Anthropic | 93.07% | 90.86% | 95.07% | 93.29% |
| 2 | Claude 4.5 Sonnet | Anthropic | 92.21% | 88.12% | 94.39% | 94.11% |
| 3 | Claude 4.1 Opus | Anthropic | 91.18% | 88.12% | 93.37% | 92.05% |
| 4 | Claude 3.5 Sonnet | Anthropic | 89.92% | 88.24% | 88.78% | 92.74% |
| 5 | Gemini 3.0 Pro Preview | Google | 89.11% | 86.86% | 89.62% | 90.86% |
| 6 | Claude 4.5 Haiku | Anthropic | 84.76% | 79.33% | 86.05% | 88.90% |
| 7 | GPT 4.1 | OpenAI | 84.47% | 84.80% | 82.31% | 86.30% |
| 8 | Grok 3 | xAI | 83.15% | 79.10% | 81.46% | 88.90% |
| 9 | GPT 5.1 | OpenAI | 82.76% | 79.93% | 80.27% | 88.08% |
| 10 | Claude 3.5 Haiku 20241022 | Anthropic | 82.59% | 78.03% | 81.80% | 87.95% |
| 11 | Mistral Large 2 | Mistral | 82.50% | 77.08% | 81.80% | 88.63% |
| 12 | Mistral Medium Latest | Mistral | 82.39% | 78.95% | 78.23% | 90.00% |
| 13 | GPT 4.1 mini | OpenAI | 82.13% | 82.30% | 77.38% | 86.71% |
| 14 | GPT-4o | OpenAI | 81.68% | 80.05% | 79.93% | 85.07% |
| 15 | GPT 5 mini | OpenAI | 81.58% | 79.45% | 79.93% | 85.34% |
| 16 | Claude 3.7 Sonnet | Anthropic | 81.15% | 79.81% | 79.93% | 83.70% |
| 17 | Grok 2 | xAI | 80.95% | 79.69% | 83.16% | 80.00% |
| 18 | Deepseek V3 | Deepseek | 80.17% | 79.10% | 79.08% | 82.33% |
| 19 | GPT-4o mini | OpenAI | 79.77% | 77.67% | 80.27% | 81.37% |
| 20 | GPT OSS 120B | OpenAI | 79.34% | 76.01% | 78.02% | 83.97% |
| 21 | Mistral Small 3.1 | Mistral | 78.84% | 74.11% | 75.85% | 86.58% |
| 22 | Grok 4 Fast No Reasoning | xAI | 77.87% | 73.04% | 77.55% | 83.01% |
| 23 | Mistral Small 3.2 | Mistral | 77.59% | 73.16% | 78.23% | 81.37% |
| 24 | GPT 5 | OpenAI | 77.09% | 72.57% | 77.21% | 81.51% |
| 25 | Deepseek R1 0528 | Deepseek | 76.45% | 72.49% | 75.17% | 81.71% |
| 26 | GPT 5 nano | OpenAI | 75.77% | 71.62% | 70.75% | 84.93% |
| 27 | Qwen 3 Max | Alibaba Qwen | 74.84% | 69.60% | 76.70% | 78.22% |
| 28 | Gemini 1.5 Pro | Google | 74.69% | 83.08% | 67.20% | 73.77% |
| 29 | Qwen 2.5 Max | Alibaba Qwen | 72.98% | 69.12% | 74.49% | 75.34% |
| 30 | Gemma 3 27B IT OR | Google | 71.43% | 63.18% | 70.41% | 80.68% |
| 31 | Magistral Small Latest | Mistral | 70.68% | 63.57% | 67.69% | 80.80% |
| 32 | Qwen Plus | Alibaba Qwen | 70.35% | 65.32% | 70.41% | 75.31% |
| 33 | Gemini 2.5 Pro | Google | 69.17% | 69.08% | 70.36% | 68.08% |
| 34 | Llama 4 Maverick | Meta | 67.31% | 61.77% | 71.77% | 68.39% |
| 35 | Gemini 2.5 Flash | Google | 67.14% | 65.32% | 66.50% | 69.59% |
| 36 | Gemma 3 12B IT OR | Google | 66.60% | 59.98% | 63.95% | 75.89% |
| 37 | Llama 3.3 70B Instruct OR | Meta | 66.35% | 65.40% | 63.46% | 70.19% |
| 38 | GPT 4.1 nano | OpenAI | 65.88% | 61.76% | 61.22% | 74.66% |
| 39 | Magistral Medium Latest | Mistral | 64.13% | 56.41% | 66.67% | 69.32% |
| 40 | Gemini 2.0 Flash | Google | 62.97% | 65.48% | 60.27% | 63.15% |
| 41 | Gemini 2.5 Flash Lite | Google | 61.88% | 59.86% | 61.39% | 64.38% |
| 42 | Qwen 3 8B | Alibaba Qwen | 61.26% | 54.51% | 63.10% | 66.16% |
| 43 | Qwen 3 30B VL Instruct | Alibaba Qwen | 57.88% | 52.26% | 59.18% | 62.19% |
| 44 | Grok 3 mini | xAI | 56.60% | 56.18% | 58.84% | 54.79% |
| 45 | Gemini 2.0 Flash Lite | Google | 51.52% | 51.31% | 49.83% | 53.42% |
| 46 | Llama 3.1 405B Instruct OR | Meta | 47.75% | 51.87% | 45.01% | 46.36% |
| 47 | Deepseek V3 0324 | Deepseek | 40.64% | 35.83% | 40.51% | 45.58% |
| 48 | Deepseek V3.1 | Deepseek | 35.08% | 43.59% | 31.80% | 29.86% |

Note: Llama 4 Scout and Llama 3.1 8B Instruct are excluded because the Azure AI API does not support tool calling for these models.