τ²-bench provides a Dec-POMDP-based telecom domain with compositional task generation and a tool-constrained user simulator to measure agent performance drops in dual-control versus single-control settings.
Action-based conversations dataset: A corpus for building more in-depth task-oriented dialogue systems
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
citation-role summary
background 2
citation-polarity summary
fields
cs.AI 2verdicts
UNVERDICTED 2roles
background 2polarities
background 2representative citing papers
τ-bench shows state-of-the-art agents like GPT-4o succeed on under 50% of tool-using, rule-following tasks and are inconsistent across repeated trials.
citing papers explorer
-
$\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment
τ²-bench provides a Dec-POMDP-based telecom domain with compositional task generation and a tool-constrained user simulator to measure agent performance drops in dual-control versus single-control settings.
-
$\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
τ-bench shows state-of-the-art agents like GPT-4o succeed on under 50% of tool-using, rule-following tasks and are inconsistent across repeated trials.