Tools Reliability
Measures how effectively and robustly a model uses tools across a range of scenarios, including imperfect inputs such as missing data, extra arguments, or malformed requests. (Higher scores are better.)
Rank | Model | Provider | Average | | | |
---|---|---|---|---|---|---|
#1 | Claude 3.5 Sonnet | Anthropic | 85.34% | 86.11% | 84.74% | 85.17% |
#2 | Mistral Large | Mistral | 78.89% | 75.16% | 78.02% | 83.50% |
#3 | Claude 3.5 Haiku | Anthropic | 78.68% | 74.68% | 78.62% | 82.72% |
#4 | Gemini 1.5 Pro | Google | 74.90% | 79.84% | 73.37% | 71.48% |
#5 | Mistral Small 3.1 24B | Mistral | 73.98% | 69.84% | 76.88% | 75.22% |
#6 | GPT-4o mini | OpenAI | 73.12% | 71.07% | 74.63% | 73.67% |
#7 | Deepseek V3 | Deepseek | 72.74% | 70.24% | 72.79% | 75.18% |
#8 | Claude 3.7 Sonnet | Anthropic | 72.68% | 74.34% | 69.36% | 74.33% |
#9 | Gemma 3 27B | Google | 69.30% | 60.77% | 67.25% | 79.87% |
#10 | Qwen 2.5 Max | Alibaba Qwen | 68.33% | 66.10% | 69.07% | 69.81% |
#11 | GPT-4o | OpenAI | 64.84% | 65.69% | 63.04% | 65.78% |
#12 | Llama 3.3 70B | Meta | 48.71% | 60.21% | 40.32% | 45.59% |
#13 | Llama 3.1 405B | Meta | 32.32% | 40.85% | 27.49% | 28.62% |
#14 | Gemini 2.0 Flash | Google | 31.80% | 35.39% | 29.14% | 30.88% |
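
The arithmetic behind the table can be checked directly: in every row, the first score column equals the mean of the three columns that follow it, to within rounding. The sketch below verifies this for three rows copied from the table and also reports the gap between each model's best and worst sub-score. The three sub-score columns are unlabeled in the table, so the code treats them only as positions; the names `rows`, `mean_of_subscores`, and `spread` are illustrative and not part of the benchmark.

```python
# Score columns copied from three rows of the table:
# (first score column, then the three unlabeled sub-score columns).
rows = {
    "Claude 3.5 Sonnet": (85.34, 86.11, 84.74, 85.17),
    "Gemma 3 27B": (69.30, 60.77, 67.25, 79.87),
    "Llama 3.3 70B": (48.71, 60.21, 40.32, 45.59),
}


def mean_of_subscores(scores):
    """Mean of the three sub-score columns (everything after the first)."""
    _, *subscores = scores
    return sum(subscores) / len(subscores)


def spread(scores):
    """Gap in percentage points between the best and worst sub-score."""
    _, *subscores = scores
    return max(subscores) - min(subscores)


for model, scores in rows.items():
    reported = scores[0]
    recomputed = mean_of_subscores(scores)
    # Allow 0.01 of slack because the table rounds to two decimals.
    matches = abs(recomputed - reported) <= 0.01
    print(f"{model}: reported {reported:.2f}, recomputed {recomputed:.2f}, "
          f"match={matches}, spread={spread(scores):.2f} pts")
```

The spread is worth a look alongside the rank: a model with a mid-table average can still vary by almost 20 points across the three sub-scores, which the single averaged figure hides.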