MirrorBench: A Benchmark to Evaluate Conversational User-Proxy Agents for Human-Likeness

Anil Babu Ankisettipalli; Ashutosh Hathidara; Julien Yu; Sebastian Schreiber; Vaishali Senthil

arxiv: 2601.08118 · v3 · pith:DHFWS5CXnew · submitted 2026-01-13 · 💻 cs.AI · cs.LG

keywords: mirrorbench · user · conversational · across · agents · benchmarking · framework · human

Large language models (LLMs) are increasingly used as human simulators, both for evaluating conversational systems and for generating fine-tuning data. However, naive "act-as-a-user" prompting often yields verbose, unrealistic utterances, motivating principled evaluation of *user proxy agents*. We present **MirrorBench**, a reproducible and extensible benchmarking framework that evaluates user proxies solely on their ability to produce human-like user utterances across diverse conversational regimes, explicitly decoupled from downstream task success. **MirrorBench** combines three lexical-diversity metrics (**MATTR**, **Yule's $K$**, and **HD-D**) with three LLM-judge-based metrics (**GTEval**, **Pairwise Indistinguishability**, and **Rubric-and-Reason**), and contextualizes judge scores using Human-Human and Proxy-Proxy calibration controls. Across four public datasets, **MirrorBench** yields variance-aware comparisons and reveals systematic gaps between user proxies and real human users. The framework is open-sourced at https://github.com/SAP/mirrorbench and includes a command-line interface for running and managing user-proxy benchmarking experiments.
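For readers unfamiliar with the three lexical-diversity metrics named in the abstract, the sketch below computes them from their standard textbook definitions. It is not taken from the MirrorBench code base: the function names, the whitespace tokenization, the MATTR window of 50 tokens, and the HD-D sample size of 42 are all illustrative assumptions, and the framework's actual preprocessing and parameters may differ.

```python
import math
from collections import Counter


def mattr(tokens, window=50):
    """Moving-average type-token ratio: mean TTR over all sliding windows.

    `window` is an assumed default, not a MirrorBench setting.
    """
    if not tokens:
        raise ValueError("tokens must be non-empty")
    if len(tokens) <= window:
        # Text shorter than one window: fall back to the plain TTR.
        return len(set(tokens)) / len(tokens)
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)


def yules_k(tokens):
    """Yule's K = 10^4 * (sum of squared type frequencies - N) / N^2.

    Higher values indicate more repetition (lower lexical diversity).
    """
    n = len(tokens)
    if n == 0:
        raise ValueError("tokens must be non-empty")
    sum_sq = sum(f * f for f in Counter(tokens).values())
    return 1e4 * (sum_sq - n) / (n * n)


def hdd(tokens, sample_size=42):
    """HD-D: for each type, the hypergeometric probability that it appears
    at least once in a random sample of `sample_size` tokens, divided by
    `sample_size` and summed over types. `sample_size` is an assumed default.
    """
    n = len(tokens)
    if n < sample_size:
        raise ValueError("text is shorter than the sample size")
    denom = math.comb(n, sample_size)
    total = 0.0
    for f in Counter(tokens).values():
        # Probability that none of this type's f occurrences is sampled.
        p_absent = math.comb(n - f, sample_size) / denom
        total += (1.0 - p_absent) / sample_size
    return total


if __name__ == "__main__":
    # Hypothetical user utterance, tokenized naively on whitespace.
    utterance = "can you check my order status and tell me when my order ships"
    toks = utterance.lower().split()
    print(mattr(toks, window=5), yules_k(toks), hdd(toks, sample_size=5))
```

All three metrics are designed to be less sensitive to text length than the raw type-token ratio, which makes them easier to compare across short user turns and longer ones.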

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models
cs.AI 2026-04 unverdicted novelty 8.0

User-turn generation reveals that LLMs' interaction awareness is largely decoupled from task accuracy, remaining near zero in deterministic settings even as accuracy scales to 96.8% on GSM8K.

SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science
cs.AI 2026-05 unverdicted novelty 7.0

SCICONVBENCH is a new benchmark evaluating LLMs on multi-turn disambiguation and inconsistency resolution for task formulation in computational science, with frontier models reaching only 52.7% success on fluid mechan...

Reinforcing Human Behavior Simulation via Verbal Feedback
cs.LG 2026-05 unverdicted novelty 6.0

DITTO uses RL with verbal feedback to train LLMs for human behavior simulation, reporting 36% average gains over base models and outperforming GPT-5.4 on 6 of 10 SOUL benchmark tasks.