WildToolBench shows no LLM exceeds 15 percent accuracy on tool-use tasks that reflect real user behaviors like compositional orchestration, implicit intents across turns, and mixed instructions.
1 Pith paper cites this work.
fields: cs.HC (1)
years: 2026 (1)
verdicts: UNVERDICTED (1)
representative citing papers
Benchmarking LLM Tool-Use in the Wild