ToolEmu uses LM-based tool emulation to test LM agents on 36 high-stakes tools and 144 cases, revealing that even the safest agent fails 23.9% of the time.
The failure of an [Agent] to deal with underspecified instructions can often result in incorrect tool calls, which requires your careful attention.
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.AI (1)
Years: 2023 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers:
Identifying the Risks of LM Agents with an LM-Emulated Sandbox