Tools Reliability
Measures how effectively and robustly a model uses tools across a range of scenarios, including when the inputs are imperfect: missing data, extra arguments, or malformed requests. (Higher score is better.)
Rank | Model | Provider | Average | | | |
---|---|---|---|---|---|---|
#1 | Claude 3.5 Sonnet | Anthropic | 86.50% | 85.94% | 86.08% | 87.46% |
#2 | Mistral Large | Mistral | 78.57% | 74.69% | 77.91% | 83.11% |
#3 | Claude 3.5 Haiku | Anthropic | 77.58% | 73.90% | 77.70% | 81.13% |
#4 | Deepseek V3 | Deepseek | 77.44% | 76.84% | 76.75% | 78.74% |
#5 | GPT-4o | OpenAI | 77.31% | 77.66% | 76.35% | 77.94% |
#6 | Grok 2 | xAI | 76.58% | 77.16% | 79.81% | 72.77% |
#7 | Mistral Small 3.1 24B | Mistral | 75.25% | 71.28% | 72.33% | 82.13% |
#8 | Claude 3.7 Sonnet | Anthropic | 74.99% | 76.60% | 73.49% | 74.88% |
#9 | GPT-4o mini | OpenAI | 74.46% | 73.54% | 75.98% | 73.87% |
#10 | Qwen 2.5 Max | Alibaba Qwen | 68.33% | 66.66% | 68.94% | 69.39% |
#11 | Gemma 3 27B | Google | 68.12% | 60.57% | 65.13% | 78.64% |
#12 | Gemini 1.5 Pro | Google | 68.00% | 79.76% | 60.29% | 63.96% |
#13 | Llama 4 Maverick | Meta | 63.94% | 59.91% | 68.85% | 63.07% |
#14 | Llama 3.3 70B | Meta | 62.38% | 63.35% | 58.33% | 65.44% |
#15 | Gemini 2.0 Flash | Google | 54.85% | 58.93% | 52.19% | 53.42% |
#16 | Llama 3.1 405B | Meta | 45.66% | 51.84% | 43.79% | 41.36% |
#17 | Deepseek V3 (0324) | Deepseek | 36.19% | 32.55% | 36.88% | 39.14% |
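
To make the three kinds of imperfect input named in the description concrete, here is a minimal sketch of how a harness might classify a single tool call as having missing data, extra arguments, or a malformed request. It is an illustration only and not the benchmark's actual scoring method; `ToolSpec`, `check_tool_call`, and the `get_weather` tool are hypothetical names introduced for this example.

```python
import json
from dataclasses import dataclass, field


@dataclass
class ToolSpec:
    """Hypothetical description of a tool: its name and accepted arguments."""
    name: str
    required: set
    optional: set = field(default_factory=set)


def check_tool_call(spec: ToolSpec, raw_arguments: str) -> dict:
    """Classify a raw JSON argument string against a tool spec.

    Returns a dict with a status plus sorted lists of missing and extra
    argument names. This is an assumed classification scheme, not the
    benchmark's own.
    """
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError:
        # Malformed request: the payload is not valid JSON at all.
        return {"status": "malformed", "missing": [], "extra": []}

    if not isinstance(args, dict):
        # Valid JSON, but not an object of named arguments.
        return {"status": "malformed", "missing": [], "extra": []}

    provided = set(args)
    missing = sorted(spec.required - provided)
    extra = sorted(provided - spec.required - spec.optional)

    if missing:
        status = "missing_arguments"
    elif extra:
        status = "extra_arguments"
    else:
        status = "ok"
    return {"status": status, "missing": missing, "extra": extra}


if __name__ == "__main__":
    weather = ToolSpec("get_weather", required={"city"}, optional={"unit"})
    for payload in ('{"city": "Paris"}',
                    '{"unit": "C"}',
                    '{"city": "Paris", "zip": "75001"}',
                    '{"city": '):
        print(payload, "->", check_tool_call(weather, payload))
```

A real evaluation would also need to judge what the model does next, for example whether it asks for the missing value instead of inventing one; this sketch covers only the input classification step.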