Tools Reliability
Measures the model's ability to use tools effectively and robustly across various scenarios, including imperfect inputs such as missing data, extra arguments, or malformed requests. (Higher score is better.)
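To make the three kinds of imperfect input concrete, here is a minimal sketch of how a tool call's arguments could be checked for each of them. It is not the benchmark's own harness: the `WEATHER_TOOL` schema, the `check_tool_call` function, and the problem messages are all hypothetical names chosen for illustration.

```python
import json

# Hypothetical tool schema (an assumption for this sketch):
# parameter name -> (expected Python type, whether it is required)
WEATHER_TOOL = {
    "city": (str, True),
    "units": (str, False),
}


def check_tool_call(raw_arguments, schema):
    """Return a list of problems found in a JSON-encoded argument string."""
    # Malformed request: the payload is not parseable JSON at all.
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError:
        return ["malformed request: arguments are not valid JSON"]
    if not isinstance(args, dict):
        return ["malformed request: arguments must be a JSON object"]

    problems = []
    for name, (expected_type, required) in schema.items():
        if name not in args:
            # Missing data: a required parameter was not supplied.
            if required:
                problems.append(f"missing data: '{name}' is required")
        elif not isinstance(args[name], expected_type):
            problems.append(f"malformed request: '{name}' has the wrong type")

    # Extra arguments: keys the tool does not declare.
    for name in args:
        if name not in schema:
            problems.append(f"extra argument: '{name}' is not accepted")
    return problems
```

A well-formed call such as `check_tool_call('{"city": "Paris"}', WEATHER_TOOL)` returns an empty list, while each imperfect input yields one or more problem strings that a test could compare against the model's reaction.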
| Rank | Model | Provider | Average | Sub-score 1 | Sub-score 2 | Sub-score 3 |
|---|---|---|---|---|---|---|
| #1 | Claude 4.5 Opus | Anthropic | 93.07% | 90.86% | 95.07% | 93.29% |
| #2 | Claude 4.5 Sonnet | Anthropic | 92.21% | 88.12% | 94.39% | 94.11% |
| #3 | Claude 4.1 Opus | Anthropic | 91.18% | 88.12% | 93.37% | 92.05% |
| #4 | Claude 3.5 Sonnet | Anthropic | 89.92% | 88.24% | 88.78% | 92.74% |
| #5 | Gemini 3.0 Pro Preview | Google | 89.11% | 86.86% | 89.62% | 90.86% |
| #6 | Claude 4.5 Haiku | Anthropic | 84.76% | 79.33% | 86.05% | 88.90% |
| #7 | GPT 4.1 | OpenAI | 84.47% | 84.80% | 82.31% | 86.30% |
| #8 | Grok 3 | xAI | 83.15% | 79.10% | 81.46% | 88.90% |
| #9 | GPT 5.1 | OpenAI | 82.76% | 79.93% | 80.27% | 88.08% |
| #10 | Claude 3.5 Haiku 20241022 | Anthropic | 82.59% | 78.03% | 81.80% | 87.95% |
| #11 | Mistral Large 2 | Mistral | 82.50% | 77.08% | 81.80% | 88.63% |
| #12 | Mistral Medium Latest | Mistral | 82.39% | 78.95% | 78.23% | 90.00% |
| #13 | GPT 4.1 mini | OpenAI | 82.13% | 82.30% | 77.38% | 86.71% |
| #14 | GPT-4o | OpenAI | 81.68% | 80.05% | 79.93% | 85.07% |
| #15 | GPT 5 mini | OpenAI | 81.58% | 79.45% | 79.93% | 85.34% |
| #16 | Claude 3.7 Sonnet | Anthropic | 81.15% | 79.81% | 79.93% | 83.70% |
| #17 | Grok 2 | xAI | 80.95% | 79.69% | 83.16% | 80.00% |
| #18 | Deepseek V3 | Deepseek | 80.17% | 79.10% | 79.08% | 82.33% |
| #19 | GPT-4o mini | OpenAI | 79.77% | 77.67% | 80.27% | 81.37% |
| #20 | GPT OSS 120B | OpenAI | 79.34% | 76.01% | 78.02% | 83.97% |
| #21 | Mistral Small 3.1 | Mistral | 78.84% | 74.11% | 75.85% | 86.58% |
| #22 | Grok 4 Fast No Reasoning | xAI | 77.87% | 73.04% | 77.55% | 83.01% |
| #23 | Mistral Small 3.2 | Mistral | 77.59% | 73.16% | 78.23% | 81.37% |
| #24 | GPT 5 | OpenAI | 77.09% | 72.57% | 77.21% | 81.51% |
| #25 | Deepseek R1 0528 | Deepseek | 76.45% | 72.49% | 75.17% | 81.71% |
| #26 | GPT 5 nano | OpenAI | 75.77% | 71.62% | 70.75% | 84.93% |
| #27 | Qwen 3 Max | Alibaba Qwen | 74.84% | 69.60% | 76.70% | 78.22% |
| #28 | Gemini 1.5 Pro | Google | 74.69% | 83.08% | 67.20% | 73.77% |
| #29 | Qwen 2.5 Max | Alibaba Qwen | 72.98% | 69.12% | 74.49% | 75.34% |
| #30 | Gemma 3 27B IT OR | Google | 71.43% | 63.18% | 70.41% | 80.68% |
| #31 | Magistral Small Latest | Mistral | 70.68% | 63.57% | 67.69% | 80.80% |
| #32 | Qwen Plus | Alibaba Qwen | 70.35% | 65.32% | 70.41% | 75.31% |
| #33 | Gemini 2.5 Pro | Google | 69.17% | 69.08% | 70.36% | 68.08% |
| #34 | Llama 4 Maverick | Meta | 67.31% | 61.77% | 71.77% | 68.39% |
| #35 | Gemini 2.5 Flash | Google | 67.14% | 65.32% | 66.50% | 69.59% |
| #36 | Gemma 3 12B IT OR | Google | 66.60% | 59.98% | 63.95% | 75.89% |
| #37 | Llama 3.3 70B Instruct OR | Meta | 66.35% | 65.40% | 63.46% | 70.19% |
| #38 | GPT 4.1 nano | OpenAI | 65.88% | 61.76% | 61.22% | 74.66% |
| #39 | Magistral Medium Latest | Mistral | 64.13% | 56.41% | 66.67% | 69.32% |
| #40 | Gemini 2.0 Flash | Google | 62.97% | 65.48% | 60.27% | 63.15% |
| #41 | Gemini 2.5 Flash Lite | Google | 61.88% | 59.86% | 61.39% | 64.38% |
| #42 | Qwen 3 8B | Alibaba Qwen | 61.26% | 54.51% | 63.10% | 66.16% |
| #43 | Qwen 3 30B VL Instruct | Alibaba Qwen | 57.88% | 52.26% | 59.18% | 62.19% |
| #44 | Grok 3 mini | xAI | 56.60% | 56.18% | 58.84% | 54.79% |
| #45 | Gemini 2.0 Flash Lite | Google | 51.52% | 51.31% | 49.83% | 53.42% |
| #46 | Llama 3.1 405B Instruct OR | Meta | 47.75% | 51.87% | 45.01% | 46.36% |
| #47 | Deepseek V3 0324 | Deepseek | 40.64% | 35.83% | 40.51% | 45.58% |
| #48 | Deepseek V3.1 | Deepseek | 35.08% | 43.59% | 31.80% | 29.86% |
Note: Llama 4 Scout and Llama 3.1 8B Instruct are excluded because the Azure AI API does not support tool calling for these models.
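The table does not name its percentage columns, but in the rows spot-checked the first percentage equals the arithmetic mean of the other three, up to rounding. The sketch below reproduces that check for three rows copied from the table. The names `ROWS`, `mean_of_subscores`, and `matches_average` are illustrative, and the tolerance is an assumption that allows for the published values being rounded to two decimals.

```python
# Rows copied from the table: (first percentage, then the three remaining percentages).
ROWS = {
    "Claude 4.5 Opus": (93.07, 90.86, 95.07, 93.29),
    "Gemini 3.0 Pro Preview": (89.11, 86.86, 89.62, 90.86),
    "Deepseek V3.1": (35.08, 43.59, 31.80, 29.86),
}


def mean_of_subscores(subscores):
    """Arithmetic mean of the three per-category percentages."""
    return sum(subscores) / len(subscores)


def matches_average(first, subscores, tolerance=0.02):
    """True if the first percentage equals the mean within rounding tolerance."""
    return abs(first - mean_of_subscores(subscores)) <= tolerance


for model, (first, *subscores) in ROWS.items():
    print(model, round(mean_of_subscores(subscores), 2), matches_average(first, subscores))
```

If this relationship holds for every row, the ranking is simply a sort on that mean, so a model with one weak category can rank below a model that is more even across all three.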