
arxiv: 2601.08118 · v3 · pith:DHFWS5CXnew · submitted 2026-01-13 · 💻 cs.AI · cs.LG

MirrorBench: A Benchmark to Evaluate Conversational User-Proxy Agents for Human-Likeness

Pith reviewed 2026-05-21 15:09 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords MirrorBench · user proxy agents · human-likeness · conversational agents · LLM evaluation · lexical diversity · AI simulators · benchmarking framework

The pith

MirrorBench shows that current user-proxy agents produce systematically less human-like utterances than real users when measured with combined lexical and LLM metrics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MirrorBench to test whether AI agents that simulate users in conversations generate responses that match real human patterns. It deliberately separates this test from any measure of task success so the focus stays on utterance realism alone. The framework applies three statistical measures of word variety alongside three AI judging methods and anchors the scores with real human conversations and proxy-to-proxy exchanges as references. Results from four public datasets indicate consistent differences between the proxies and actual people. This separation matters because many AI systems now rely on simulated users both to test other models and to create training data.

Core claim

MirrorBench is a reproducible benchmarking framework that evaluates user proxies solely on their ability to produce human-like utterances. It integrates the lexical-diversity metrics MATTR, Yule's K, and HD-D with the LLM-judge metrics GTEval, Pairwise Indistinguishability, and Rubric-and-Reason, then contextualizes results using Human-Human and Proxy-Proxy calibration controls to yield variance-aware comparisons that reveal systematic gaps between proxies and real human users across four public datasets.
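As a rough illustration of the three lexical-diversity statistics named above, here is a minimal sketch using their standard textbook definitions. The whitespace tokenization, the window size of 50, and the sample size of 42 are conventional defaults assumed for illustration; the summary does not state which settings MirrorBench itself uses, and the function names are hypothetical.

```python
from collections import Counter
from math import comb


def mattr(tokens, window=50):
    """Moving-Average Type-Token Ratio: mean TTR over all sliding windows."""
    if not tokens:
        return 0.0
    if len(tokens) <= window:
        # Text no longer than one window: fall back to the plain TTR.
        return len(set(tokens)) / len(tokens)
    n_windows = len(tokens) - window + 1
    ttrs = (len(set(tokens[i:i + window])) / window for i in range(n_windows))
    return sum(ttrs) / n_windows


def yules_k(tokens):
    """Yule's K = 1e4 * (sum_i f_i^2 - N) / N^2; higher means more repetition."""
    n = len(tokens)
    if n == 0:
        return 0.0
    sum_sq = sum(f * f for f in Counter(tokens).values())
    return 1e4 * (sum_sq - n) / (n * n)


def hdd(tokens, sample_size=42):
    """HD-D: expected TTR of a random sample drawn without replacement."""
    n = len(tokens)
    if n == 0:
        return 0.0
    s = min(sample_size, n)  # assumption: shrink the sample for short texts
    total_draws = comb(n, s)
    score = 0.0
    for f in Counter(tokens).values():
        # Hypergeometric probability that this type is absent from the sample.
        p_absent = comb(n - f, s) / total_draws
        score += (1.0 - p_absent) / s
    return score


if __name__ == "__main__":
    # Hypothetical user utterance, tokenized on whitespace.
    toks = "ok so how do i reset my password i forgot my password again".split()
    print(mattr(toks, window=5), yules_k(toks), hdd(toks, sample_size=5))
```

Note the opposite directions: MATTR and HD-D rise with more varied vocabulary, while Yule's K rises with more repetition.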

What carries the argument

MirrorBench, a framework that combines three lexical-diversity metrics with three LLM-judge metrics and calibrates scores against human-human and proxy-proxy controls to isolate human-likeness.
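One simple way controls of this kind can be used is to place a proxy's judge score on the interval spanned by the two reference scores. The linear rescaling below is purely an assumption for illustration: this summary does not give the paper's calibration formula, and `calibrated_position` is a hypothetical name.

```python
def calibrated_position(proxy_score, hh_score, pp_score):
    """Locate a proxy score between two control scores.

    Assumption (not taken from the paper): a linear rescaling where
    0.0 means the proxy sits at the Proxy-Proxy control and 1.0 means
    it sits at the Human-Human control.
    """
    if hh_score == pp_score:
        raise ValueError("controls coincide; the rescaling is undefined")
    return (proxy_score - pp_score) / (hh_score - pp_score)


if __name__ == "__main__":
    # Hypothetical judge scores on one dataset.
    print(calibrated_position(proxy_score=0.6, hh_score=0.8, pp_score=0.4))
```

Whatever the exact formula, the point of such anchoring is that a raw judge score becomes interpretable only relative to what the same judge assigns to the reference conversations.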

If this is right

  • User proxies can be compared in a variance-aware manner across multiple datasets.
  • Systematic gaps between proxies and humans become visible for guiding improvements.
  • Evaluation of utterance realism is decoupled from task-specific performance.
  • The open-source framework with its command-line interface supports reproducible experiments.
  • The approach is extensible to additional datasets and metrics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Proxies that score higher on MirrorBench could produce more natural synthetic data for training dialogue models.
  • The calibration controls might be adapted to test realism in other generative tasks such as story or code generation.
  • High MirrorBench scores could be checked for correlation with better results in live human-AI interaction studies.

Load-bearing premise

The selected lexical-diversity metrics and LLM-judge rubrics together with human-human and proxy-proxy controls are sufficient to measure human-likeness independently of any downstream task.

What would settle it

A user proxy that matches human scores on every MirrorBench metric yet produces conversations that human raters still identify as artificial in blind side-by-side comparisons.
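The blind side-by-side check described above could be scored with an exact one-sided binomial test: if raters cannot tell proxy from human, they should pick the artificial conversation about half the time. This is a sketch under that assumption; the trial counts are hypothetical and the function name is illustrative.

```python
from math import comb


def binomial_upper_tail(correct, trials, p=0.5):
    """P(X >= correct) for X ~ Binomial(trials, p).

    Under the null hypothesis that raters are guessing (p = 0.5),
    a small value means they identify the proxy better than chance.
    """
    if not 0 <= correct <= trials:
        raise ValueError("correct must lie between 0 and trials")
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(correct, trials + 1)
    )


if __name__ == "__main__":
    # Hypothetical study: raters spot the proxy in 70 of 100 blind pairs.
    print(binomial_upper_tail(70, 100))
```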

Figures

Figures reproduced from arXiv: 2601.08118 by Anil Babu Ankisettipalli, Ashutosh Hathidara, Julien Yu, Sebastian Schreiber, Vaishali Senthil.

Figure 1. MirrorBench benchmark pipeline, spanning dataset preparation, synthetic rollouts, the evaluation suite, and an…
Figure 2. Human-likeness of five user-proxy LLMs across four datasets. Higher is better for judge-based metrics (GTEval, PI,…
Figure 4. Judge–human correlation on ChatbotArena: Corre…
Figure 3. Judge sensitivity of judge-realism metrics on Chatb…
Figure 5. Assistant sensitivity on ChatbotArena (scores).
Figure 8. Avg per-sample telemetry for GPT-4o as user-proxy
Figure 9. Cost-quality trade-off for PI: cost per evaluation (USD) vs. PI Δ𝑤 (↑ better). Judge = Claude-4-Sonnet; assistant = GPT-4o; temperature = 0; cache off. Markers denote user-proxies; labels denote datasets. Dashed line: Pareto frontier. See…
Figure 10. MirrorBench Architecture: Six-layer stack from low-level execution backends & persistence up through the core…
Figure 11. MirrorBench execution flow. The framework de…
Figure 12. Judge prompt for Pairwise Indistinguishability
Figure 14. Judge prompt for RNR Metric.
    User-Proxy System Prompt: You are simulating a real human user for the MirrorBench evaluation harness. Respond with the next USER turn only. Do not write assistant messages, notes, or any other analysis. Your utterance should be like a real user and the context should be based on the following information provided
Figure 15. System prompt for User-Proxy.
    Assistant System Prompt: You are the assistant in a MirrorBench replay. The user-proxy agent is attempting to reproduce the USER side of the real conversation provided below. But the user-proxy does not have access to the real conversation history. Instead, it only has access to the conversation summary. You need to respond as the assistant. But we are providing you with the re…
Figure 17. Plan manifest (JSON, abridged). Fully resolved,…
Figure 18. Final run report (JSON, abridged). Aggregate metric…
Figure 19. Job configuration (abridged). YAML that fully specifies a…
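
The Pareto frontier mentioned in the Figure 9 caption can be computed with a single sorted pass. The sketch below assumes lower cost and higher quality are both preferred; the labels and numbers are invented for illustration and are not results from the paper.

```python
def pareto_frontier(points):
    """Return the non-dominated (label, cost, quality) points.

    A point is kept if no other point is at most as costly while
    scoring strictly higher on quality.
    """
    # Cheapest first; among equal costs, best quality first.
    ordered = sorted(points, key=lambda p: (p[1], -p[2]))
    frontier = []
    best_quality = float("-inf")
    for label, cost, quality in ordered:
        if quality > best_quality:
            frontier.append((label, cost, quality))
            best_quality = quality
    return frontier


if __name__ == "__main__":
    # Hypothetical (proxy, cost in USD, quality score) triples.
    candidates = [
        ("proxy-a", 1.0, 0.2),
        ("proxy-b", 2.0, 0.5),
        ("proxy-c", 2.5, 0.4),
        ("proxy-d", 3.0, 0.9),
    ]
    for row in pareto_frontier(candidates):
        print(row)
```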
Original abstract

Large language models (LLMs) are increasingly used as human simulators, both for evaluating conversational systems and for generating fine-tuning data. However, naive "act-as-a-user" prompting often yields verbose, unrealistic utterances, motivating principled evaluation of *user proxy agents*. We present **MirrorBench**, a reproducible and extensible benchmarking framework that evaluates user proxies solely on their ability to produce human-like user utterances across diverse conversational regimes, explicitly decoupled from downstream task success. **MirrorBench** combines three lexical-diversity metrics (**MATTR**, **Yule's $K$**, and **HD-D**) with three LLM-judge-based metrics (**GTEval**, **Pairwise Indistinguishability**, and **Rubric-and-Reason**), and contextualizes judge scores using Human-Human and Proxy-Proxy calibration controls. Across four public datasets, **MirrorBench** yields variance-aware comparisons and reveals systematic gaps between user proxies and real human users. The framework is open sourced at https://github.com/SAP/mirrorbench and includes a command-line interface for running and managing user-proxy benchmarking experiments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces MirrorBench, a reproducible and extensible benchmarking framework for evaluating conversational user-proxy agents solely on their ability to generate human-like utterances, decoupled from any downstream task success. It combines three lexical-diversity metrics (MATTR, Yule’s K, HD-D) with three LLM-judge metrics (GTEval, Pairwise Indistinguishability, Rubric-and-Reason), contextualized via Human-Human and Proxy-Proxy calibration controls. Evaluations across four public datasets are claimed to yield variance-aware comparisons that reveal systematic gaps between proxies and real human users; the framework and CLI are open-sourced.

Significance. If the metrics are shown to validly track human-likeness, MirrorBench would provide a useful, task-independent standard for developing and comparing user simulators, which is relevant for both evaluation of conversational systems and synthetic data generation. The open-source release, CLI, and explicit calibration controls are concrete strengths that support reproducibility. However, the significance is limited by the absence of direct validation that the LLM-judge components correlate with human judgments rather than model-specific artifacts.

major comments (2)
  1. [Abstract and Evaluation Methodology] Abstract / Evaluation section: The claim that MirrorBench isolates and quantifies human-likeness rests primarily on the three LLM-judge rubrics. No inter-rater agreement, correlation with human ratings, or ablation swapping the underlying judge model is reported on the same utterance sets. Lexical metrics (MATTR, Yule’s K, HD-D) address only surface diversity, so the judge components are load-bearing for the 'systematic gaps' conclusion; without this validation the gaps may reflect judge preferences (e.g., verbosity or formality) rather than genuine human-likeness.
  2. [Experiments and Calibration Controls] §4 (Experiments) and controls description: While Human-Human and Proxy-Proxy controls are a positive design choice, they calibrate relative to the same LLM judges. This does not close the loop on whether observed differences survive replacement of the judge model or addition of human correlation data.
minor comments (3)
  1. [Methodology] Clarify the exact prompting templates and temperature settings used for the LLM judges to improve reproducibility.
  2. [Results] The abstract mentions 'variance-aware comparisons' but does not specify the statistical tests or confidence intervals used; add these details in the results section.
  3. Consider adding a limitations section discussing potential biases in the chosen public datasets.
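
To make the request for confidence intervals concrete, here is one common option: a percentile bootstrap interval for the difference in mean score between proxy and human utterances. This is a generic sketch, not the procedure used in the paper; the score lists, the unpaired resampling, and the function name are assumptions.

```python
import random
from statistics import mean


def bootstrap_mean_diff_ci(proxy_scores, human_scores,
                           n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(proxy) - mean(human).

    Returns (point_estimate, lower_bound, upper_bound). The two groups
    are resampled independently, i.e. the samples are treated as unpaired.
    """
    if not proxy_scores or not human_scores:
        raise ValueError("both score lists must be non-empty")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    diffs = []
    for _ in range(n_boot):
        p = rng.choices(proxy_scores, k=len(proxy_scores))
        h = rng.choices(human_scores, k=len(human_scores))
        diffs.append(mean(p) - mean(h))
    diffs.sort()
    lo_idx = int((alpha / 2) * n_boot)
    hi_idx = min(n_boot - 1, int((1 - alpha / 2) * n_boot))
    point = mean(proxy_scores) - mean(human_scores)
    return point, diffs[lo_idx], diffs[hi_idx]


if __name__ == "__main__":
    # Hypothetical per-conversation judge scores.
    proxy = [0.42, 0.55, 0.38, 0.61, 0.47, 0.50]
    human = [0.71, 0.66, 0.80, 0.59, 0.74, 0.69]
    print(bootstrap_mean_diff_ci(proxy, human))
```

An interval that excludes zero would support calling a proxy-human gap systematic rather than sampling noise.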

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on MirrorBench. The comments correctly identify the need for additional validation of the LLM-judge components to support claims of human-likeness. We address each point below and describe the revisions we will make.

Point-by-point responses
  1. Referee: [Abstract and Evaluation Methodology] Abstract / Evaluation section: The claim that MirrorBench isolates and quantifies human-likeness rests primarily on the three LLM-judge rubrics. No inter-rater agreement, correlation with human ratings, or ablation swapping the underlying judge model is reported on the same utterance sets. Lexical metrics (MATTR, Yule’s K, HD-D) address only surface diversity, so the judge components are load-bearing for the 'systematic gaps' conclusion; without this validation the gaps may reflect judge preferences (e.g., verbosity or formality) rather than genuine human-likeness.

    Authors: We agree that the manuscript does not currently report inter-rater agreement, human correlation coefficients, or judge-model ablations, and that this leaves the LLM-judge results open to the possibility of model-specific biases. The lexical metrics supply an independent, non-LLM signal, and the Human-Human / Proxy-Proxy controls provide relative calibration, yet these do not fully substitute for direct human validation. In the revised manuscript we will add (i) an ablation that re-runs the three LLM-judge metrics with a second judge model on the same utterance sets and reports consistency of the observed gaps, and (ii) a small-scale human rating study on a stratified sample of utterances to compute Spearman correlations between human scores and each LLM-judge metric. These additions will appear in a new subsection of the evaluation methodology. revision: yes
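
The Spearman correlation proposed in this response is the Pearson correlation of the two rank vectors, with tied values sharing their average rank. A dependency-free sketch follows; the rating values are hypothetical, and in practice a library routine such as SciPy's `spearmanr` would normally be used instead.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with >= 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / sqrt(sxx * syy)


if __name__ == "__main__":
    # Hypothetical human ratings vs. LLM-judge scores for five utterances.
    human = [4, 2, 5, 3, 1]
    judge = [0.8, 0.3, 0.9, 0.7, 0.2]
    print(spearman(human, judge))
```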

  2. Referee: [Experiments and Calibration Controls] §4 (Experiments) and controls description: While Human-Human and Proxy-Proxy controls are a positive design choice, they calibrate relative to the same LLM judges. This does not close the loop on whether observed differences survive replacement of the judge model or addition of human correlation data.

    Authors: The calibration controls establish baselines within a single evaluation framework, allowing us to quantify how far proxies depart from human distributions under the same judging criteria. We acknowledge that this design does not yet demonstrate invariance to judge-model choice or alignment with human perception. The revisions outlined in our response to the first comment—specifically the judge-model ablation and the human correlation pilot—directly address this concern and will be integrated into the experimental results and discussion sections. revision: yes

Circularity Check

0 steps flagged

No circularity: MirrorBench defines new evaluation procedures without reducing claims to self-defined inputs or self-citations

full rationale

The paper presents MirrorBench as a benchmarking framework that combines known lexical-diversity metrics (MATTR, Yule’s K, HD-D) with LLM-judge rubrics and Human-Human/Proxy-Proxy controls. No derivation chain, equations, or predictions are shown that reduce a result to fitted parameters or prior self-citations by construction. The central claim of revealing systematic gaps rests on the explicit definition of these procedures rather than any self-referential reduction. The framework is self-contained as a measurement setup against external datasets and standard metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the untested premise that the selected diversity and judge metrics are valid proxies for human-likeness; no free parameters or new entities are introduced in the abstract description.

axioms (1)
  • domain assumption: Lexical diversity statistics (MATTR, Yule's K, HD-D) plus LLM judges can isolate human-likeness when calibrated against human-human and proxy-proxy dialogues.
    Invoked in the abstract when the authors state that the metrics evaluate human-likeness across conversational regimes.

pith-pipeline@v0.9.0 · 5742 in / 1205 out tokens · 50121 ms · 2026-05-21T15:09:01.756536+00:00 · methodology



Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models

    cs.AI 2026-04 unverdicted novelty 8.0

    User-turn generation reveals that LLMs' interaction awareness is largely decoupled from task accuracy, remaining near zero in deterministic settings even as accuracy scales to 96.8% on GSM8K.

  2. SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science

    cs.AI 2026-05 unverdicted novelty 7.0

    SCICONVBENCH is a new benchmark evaluating LLMs on multi-turn disambiguation and inconsistency resolution for task formulation in computational science, with frontier models reaching only 52.7% success on fluid mechan...

  3. Reinforcing Human Behavior Simulation via Verbal Feedback

    cs.LG 2026-05 unverdicted novelty 6.0

    DITTO uses RL with verbal feedback to train LLMs for human behavior simulation, reporting 36% average gains over base models and outperforming GPT-5.4 on 6 of 10 SOUL benchmark tasks.
