τ²-bench: Evaluating conversational agents in a dual-control environment
3 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.AI (3)
years: 2026 (3)
verdicts: UNVERDICTED (3)
roles: background (1)
polarities: background (1)
representative citing papers
EnvSimBench reveals that state-of-the-art LLMs exhibit a universal state change cliff in environment simulation, with a new constraint-driven pipeline raising synthesis yield by 6.8% and cutting costs over 90%.
LLM agents improve adaptability by first using an interaction budget for systematic exploration measured via Exploration Checkpoint Coverage before executing tasks.
citing papers explorer
- Beyond Cooperative Simulators: Generating Realistic User Personas for Robust Evaluation of LLM Agents
  PPol uses LLM-driven evolutionary program search to create diverse human-like user personas for simulators, yielding 33-62% fitness gains and +17% agent task success on retail and airline domains.
- EnvSimBench: A Benchmark for Evaluating and Improving LLM-Based Environment Simulation
  EnvSimBench reveals that state-of-the-art LLMs exhibit a universal state change cliff in environment simulation, with a new constraint-driven pipeline raising synthesis yield by 6.8% and cutting costs over 90%.
- Look Before You Leap: Autonomous Exploration for LLM Agents
  LLM agents improve adaptability by first using an interaction budget for systematic exploration measured via Exploration Checkpoint Coverage before executing tasks.