When Users Are Happy but Agents Are Wrong: Multi-Dimensional Evaluation of Tool-Augmented Dialogue

2025 · cs.CL · arXiv 2510.19186

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

Evaluating conversational AI systems that use external tools is challenging, as errors can arise from complex interactions among user, agent, and tools. While existing evaluation methods assess either user satisfaction or agents' tool-calling capabilities, they fail to capture critical errors in multi-turn tool-augmented dialogues, such as when agents misinterpret tool results yet appear satisfactory to users. We introduce TRACE, a benchmark of systematically synthesized tool-augmented conversations covering diverse error cases. Evaluation with state-of-the-art conversation evaluation frameworks reveals that all approaches remain far from ideal performance, demonstrating the fundamental difficulty of this benchmark.

representative citing papers

When the Database Fails: Prompting LLM Dialogue Agents for Safe Recovery in Task-Oriented Dialogue

cs.CL · 2026-06-30 · unverdicted · novelty 5.0

Guided-Retry prompting cuts hallucination from 30.5% to 15.3% on MultiWOZ and 20.9% to 12.2% on SGD in LLM dialogue agents facing database failures.

citing papers explorer

Showing 1 of 1 citing paper.

When the Database Fails: Prompting LLM Dialogue Agents for Safe Recovery in Task-Oriented Dialogue

cs.CL · 2026-06-30 · unverdicted · none · ref 11 · internal anchor
