RealUserSim: Bridging the Reality Gap in Agent Benchmarking via Grounded User Simulation

2026 · cs.HC · arXiv 2605.20204

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

LLM-based user simulation is the primary mechanism for end-to-end agent evaluation, yet simulated users are poor proxies for real humans: unconstrained LLM defaults produce a Formalism Ceiling (style match rates of 6-8% against real users), while hand-crafted behavioral directives trigger Directive Amplification, where models hyper-interpret instructions into unnatural behavioral extremes that vary dramatically across simulator models. We present RealUserSim, the first user simulation framework grounded in real behavioral data. From 14,000+ authentic human-LLM conversations (WildChat), we extract 7,275 executable behavioral profiles and use them to ground LLM simulators. A fidelity benchmark (PT3) on 600 conversations across 71+ domains with anti-leakage controls shows that grounded simulation raises match rate from 24.2% to 45.3% across five behavioral dimensions. Agent evaluation on TauBench with 6 simulator models and extensive analysis shows that grounded simulation acts as a realistic stress test, surfacing three failure mechanisms invisible to cooperative simulators (mean -3.2% to -3.5% task success degradation), while Directive Amplification in existing benchmarks produces unrealistic behavior that compromises the validity of agent evaluation.

representative citing papers

BehaviorBench: Modeling Real-World User Decisions from Behavioral Traces

cs.AI · 2026-06-01 · unverdicted · novelty 7.0

BehaviorBench reconstructs 2,000 real wallets into 141k belief and 1.4M trade prediction tasks to test whether personalization from history improves model performance over non-personalized baselines.

citing papers explorer

BehaviorBench: Modeling Real-World User Decisions from Behavioral Traces

cs.AI · 2026-06-01 · unverdicted · none · ref 19 · internal anchor
