τ²-bench: Evaluating conversational agents in a dual-control environment

Victor Barres, Honghua Dong, Soham Ray, Xujie Si, Karthik Narasimhan · 2025

3 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Beyond Cooperative Simulators: Generating Realistic User Personas for Robust Evaluation of LLM Agents

cs.AI · 2026-05-13 · unverdicted · novelty 7.0

PPol uses LLM-driven evolutionary program search to create diverse human-like user personas for simulators, yielding 33-62% fitness gains and +17% agent task success on retail and airline domains.

EnvSimBench: A Benchmark for Evaluating and Improving LLM-Based Environment Simulation

cs.AI · 2026-05-08 · unverdicted · novelty 6.0

EnvSimBench reveals that state-of-the-art LLMs exhibit a universal state change cliff in environment simulation, with a new constraint-driven pipeline raising synthesis yield by 6.8% and cutting costs over 90%.

Look Before You Leap: Autonomous Exploration for LLM Agents

cs.AI · 2026-05-15 · unverdicted · novelty 5.0

LLM agents improve adaptability by first using an interaction budget for systematic exploration measured via Exploration Checkpoint Coverage before executing tasks.

citing papers explorer

Showing 3 of 3 citing papers.

Beyond Cooperative Simulators: Generating Realistic User Personas for Robust Evaluation of LLM Agents
cs.AI · 2026-05-13 · unverdicted · none · ref 1

EnvSimBench: A Benchmark for Evaluating and Improving LLM-Based Environment Simulation
cs.AI · 2026-05-08 · unverdicted · none · ref 30

Look Before You Leap: Autonomous Exploration for LLM Agents
cs.AI · 2026-05-15 · unverdicted · none · ref 4
