Tool-use agents suffer large accuracy drops from reward and transition perturbations but domain-randomized RL on static perturbations closes about 27% of the unseen transition gap while retaining most clean performance.
Acebench: A comprehensive evaluation of llm tool usage.Findings of the Association for Computational Linguistics: EMNLP, 2025: 12970–12998
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
citation-role summary
dataset 1
citation-polarity summary
fields
cs.AI 1years
2026 1verdicts
UNVERDICTED 1roles
dataset 1polarities
use dataset 1representative citing papers
citing papers explorer
-
When Simulation Lies: A Sim-to-Real Benchmark and Domain-Randomized RL Recipe for Tool-Use Agents
Tool-use agents suffer large accuracy drops from reward and transition perturbations but domain-randomized RL on static perturbations closes about 27% of the unseen transition gap while retaining most clean performance.