User-turn generation reveals that LLMs' interaction awareness is largely decoupled from task accuracy, remaining near zero in deterministic settings even as accuracy scales to 96.8% on GSM8K.
Mind the sim2real gap in user simulation for agentic tasks.arXiv preprint arXiv:2603.11245
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 5roles
background 2polarities
background 2representative citing papers
Real developer IDE traces differ substantially from LLM simulations in behavior and structure; current proactive assistants are unreliable on real traces, and simulated data cannot substitute for real data in training.
DITTO uses RL with verbal feedback to train LLMs for human behavior simulation, reporting 36% average gains over base models and outperforming GPT-5.4 on 6 of 10 SOUL benchmark tasks.
Fine-tuned simulators grounded in real human data produce LLM assistants that win more often against real users than those trained against role-playing simulators.
citing papers explorer
-
Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models
User-turn generation reveals that LLMs' interaction awareness is largely decoupled from task accuracy, remaining near zero in deterministic settings even as accuracy scales to 96.8% on GSM8K.
-
An Empirical Study of Proactive Coding Assistants in Real-World Software Development
Real developer IDE traces differ substantially from LLM simulations in behavior and structure; current proactive assistants are unreliable on real traces, and simulated data cannot substitute for real data in training.
-
Reinforcing Human Behavior Simulation via Verbal Feedback
DITTO uses RL with verbal feedback to train LLMs for human behavior simulation, reporting 36% average gains over base models and outperforming GPT-5.4 on 6 of 10 SOUL benchmark tasks.
-
Quantifying the Utility of User Simulators for Building Collaborative LLM Assistants
Fine-tuned simulators grounded in real human data produce LLM assistants that win more often against real users than those trained against role-playing simulators.
- EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents