MirrorBench: A Benchmark to Evaluate Conversational User-Proxy Agents for Human-Likeness

Anil Babu Ankisettipalli; Ashutosh Hathidara; Julien Yu; Sebastian Schreiber; Vaishali Senthil

arxiv: 2601.08118 · v3 · pith:DHFWS5CXnew · submitted 2026-01-13 · 💻 cs.AI · cs.LG


Pith reviewed 2026-05-21 15:09 UTC · model grok-4.3


keywords: MirrorBench · user proxy agents · human-likeness · conversational agents · LLM evaluation · lexical diversity · AI simulators · benchmarking framework


The pith

MirrorBench shows that current user-proxy agents produce systematically less human-like utterances than real users when measured with combined lexical and LLM metrics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MirrorBench to test whether AI agents that simulate users in conversations generate responses that match real human patterns. It deliberately separates this test from any measure of task success so the focus stays on utterance realism alone. The framework applies three statistical measures of word variety alongside three AI judging methods and anchors the scores with real human conversations and proxy-to-proxy exchanges as references. Results from four public datasets indicate consistent differences between the proxies and actual people. This separation matters because many AI systems now rely on simulated users both to test other models and to create training data.

Core claim

MirrorBench is a reproducible benchmarking framework that evaluates user proxies solely on their ability to produce human-like utterances. It integrates the lexical-diversity metrics MATTR, Yule's K, and HD-D with the LLM-judge metrics GTEval, Pairwise Indistinguishability, and Rubric-and-Reason, then contextualizes results using Human-Human and Proxy-Proxy calibration controls to yield variance-aware comparisons that reveal systematic gaps between proxies and real human users across four public datasets.
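For readers unfamiliar with the three lexical-diversity metrics named above, the following is a minimal sketch of their textbook definitions. It is not the MirrorBench implementation: the function names, the whitespace-token input, the window size of 50, the sample size of 42, and the short-text fallbacks are all assumptions made for illustration.

```python
from collections import Counter
from math import comb


def mattr(tokens, window=50):
    """Moving-Average Type-Token Ratio: mean TTR over all sliding windows."""
    if not tokens:
        return 0.0
    if len(tokens) <= window:
        # Assumed fallback: text shorter than one window uses plain TTR.
        return len(set(tokens)) / len(tokens)
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)


def yules_k(tokens):
    """Yule's K = 1e4 * (sum_i i^2 * V_i - N) / N^2, where V_i is the
    number of types seen exactly i times. Lower K means more diverse text."""
    n = len(tokens)
    if n == 0:
        return 0.0
    freq_of_freq = Counter(Counter(tokens).values())
    s2 = sum(i * i * v for i, v in freq_of_freq.items())
    return 1e4 * (s2 - n) / (n * n)


def hdd(tokens, sample_size=42):
    """HD-D: sum over types of P(type appears in a random sample of
    `sample_size` tokens) / sample_size, using the hypergeometric distribution."""
    n = len(tokens)
    if n == 0:
        return 0.0
    if n < sample_size:
        # Assumed fallback: too few tokens to draw a sample, use plain TTR.
        return len(set(tokens)) / n
    total = 0.0
    for count in Counter(tokens).values():
        # Probability that none of this type's tokens land in the sample.
        p_absent = comb(n - count, sample_size) / comb(n, sample_size)
        total += (1.0 - p_absent) / sample_size
    return total
```

All three functions take a pre-tokenized list of strings, so tokenization and case folding remain the caller's choice, and those choices affect the scores.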

What carries the argument

MirrorBench, a framework that combines three lexical-diversity metrics with three LLM-judge metrics and calibrates scores against human-human and proxy-proxy controls to isolate human-likeness.
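One way to picture how two controls can anchor a judge score is a simple rescaling between them. The sketch below is a hypothetical normalization, not the formula used in the paper: it assumes the Proxy-Proxy control acts as a lower anchor and the Human-Human control as an upper anchor, and the function name `calibrate` is invented for this example.

```python
def calibrate(proxy_score, human_human, proxy_proxy):
    """Rescale a raw judge score so that the proxy-proxy control maps to 0
    and the human-human control maps to 1.

    Hypothetical normalization for illustration; the paper may contextualize
    judge scores differently.
    """
    span = human_human - proxy_proxy
    if abs(span) < 1e-12:
        # If both controls score the same, the judge cannot separate them.
        raise ValueError("controls coincide; calibrated score is undefined")
    return (proxy_score - proxy_proxy) / span
```

Under this assumed scheme, a value near 1 would mean the proxy is scored like the human reference, and a value near 0 would mean it is scored like the proxy baseline.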

If this is right

User proxies can be compared in a variance-aware manner across multiple datasets.
Systematic gaps between proxies and humans become visible for guiding improvements.
Evaluation of utterance realism is decoupled from task-specific performance.
The open-source framework with its command-line interface supports reproducible experiments.
The approach is extensible to additional datasets and metrics.
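The phrase "variance-aware" in the first point is not tied to a specific statistical procedure in the text above. As one plausible reading, the sketch below attaches a percentile bootstrap confidence interval to a mean per-sample score. The choice of bootstrap, the function name, the resample count, and the seed are all assumptions rather than details taken from the paper.

```python
import random
from statistics import mean


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Return (point_estimate, lower, upper) for the mean of `scores`
    using a percentile bootstrap. Illustrative only."""
    if not scores:
        raise ValueError("scores must be non-empty")
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(scores, k=n)) for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(scores), lower, upper
```

Two proxies whose intervals overlap heavily on the same dataset would then be hard to rank with confidence, which is the kind of caution a variance-aware comparison is meant to encourage.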

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Proxies that score higher on MirrorBench could produce more natural synthetic data for training dialogue models.
The calibration controls might be adapted to test realism in other generative tasks such as story or code generation.
High MirrorBench scores could be checked for correlation with better results in live human-AI interaction studies.

Load-bearing premise

The selected lexical-diversity metrics and LLM-judge rubrics together with human-human and proxy-proxy controls are sufficient to measure human-likeness independently of any downstream task.

What would settle it

A user proxy that matches human scores on every MirrorBench metric yet produces conversations that human raters still identify as artificial in blind side-by-side comparisons.

Figures

Figures reproduced from arXiv: 2601.08118 by Anil Babu Ankisettipalli, Ashutosh Hathidara, Julien Yu, Sebastian Schreiber, Vaishali Senthil.

**Figure 1.** MirrorBench benchmark pipeline, spanning dataset preparation, synthetic rollouts, the evaluation suite, and an…

**Figure 2.** Human-likeness of five user-proxy LLMs across four datasets. Higher is better for judge-based metrics (GTEval, PI,…

**Figure 3.** Judge sensitivity of judge-realism metrics on Chatb…

**Figure 4.** Judge–human correlation on ChatbotArena: Corre…

**Figure 5.** Assistant sensitivity on ChatbotArena (scores).

**Figure 8.** Avg per-sample telemetry for GPT-4o as user-proxy

**Figure 9.** Cost-quality trade-off for PI: cost per evaluation (USD) vs. PI Δ𝑤 (↑ better). Judge = Claude-4-Sonnet; assistant = GPT-4o; temperature = 0; cache off. Markers denote user-proxies; labels denote datasets. Dashed line: Pareto frontier. See…

**Figure 10.** MirrorBench Architecture: Six-layer stack from low-level execution backends & persistence up through the core…

**Figure 11.** MirrorBench execution flow. The framework de…

**Figure 12.** Judge prompt for Pairwise Indistinguishability

**Figure 14.** Judge prompt for RNR Metric. User-Proxy System Prompt: You are simulating a real human user for the MirrorBench evaluation harness. Respond with the next USER turn only. Do not write assistant messages, notes, or any other analysis. Your utterance should be like a real user and the context should be based on the following information provided…

**Figure 15.** System prompt for User-Proxy. Assistant System Prompt: You are the assistant in a MirrorBench replay. The user-proxy agent is attempting to reproduce the USER side of the real conversation provided below. But the user-proxy does not have access to the real conversation history. Instead, it only has access to the conversation summary. You need to respond as the assistant. But we are providing you with the re…

**Figure 17.** Plan manifest (JSON, abridged). Fully resolved,…

**Figure 18.** Final run report (JSON, abridged). Aggregate metric…

**Figure 19.** Job configuration (abridged). YAML that fully specifies a…

Original abstract

Large language models (LLMs) are increasingly used as human simulators, both for evaluating conversational systems and for generating fine-tuning data. However, naive "act-as-a-user" prompting often yields verbose, unrealistic utterances, motivating principled evaluation of *user proxy agents*. We present **MirrorBench**, a reproducible and extensible benchmarking framework that evaluates user proxies solely on their ability to produce human-like user utterances across diverse conversational regimes, explicitly decoupled from downstream task success. **MirrorBench** combines three lexical-diversity metrics (**MATTR**, **Yule's $K$**, and **HD-D**) with three LLM-judge-based metrics (**GTEval**, **Pairwise Indistinguishability**, and **Rubric-and-Reason**), and contextualizes judge scores using Human-Human and Proxy-Proxy calibration controls. Across four public datasets, **MirrorBench** yields variance-aware comparisons and reveals systematic gaps between user proxies and real human users. The framework is open sourced at https://github.com/SAP/mirrorbench and includes a command-line interface for running and managing user-proxy benchmarking experiments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

MirrorBench packages lexical diversity metrics with LLM judges and calibration controls to score user-proxy human-likeness independent of task success, but the abstract reports no human correlation for the judges.


MirrorBench is a benchmark that scores how human-like LLM user proxies sound by combining three lexical diversity measures with three LLM-judge rubrics and adding Human-Human plus Proxy-Proxy controls for calibration. It runs on four public datasets and ships open-source code with a CLI. That setup is the main new piece: a ready-to-run framework that keeps the evaluation separate from downstream task performance, which matters for people generating data or testing systems with simulators. The paper does a clean job laying out the metric suite and showing variance-aware comparisons that point to gaps between proxies and real users. The open release makes it easy for others to try or extend.

The soft spot is exactly what the stress test flags. The lexical metrics only capture surface diversity, so the LLM judges carry the load for the human-likeness claim. No human-rater correlation numbers or judge-model ablations appear in the abstract, and the controls use the same judges so they do not break the potential artifact loop. If the judges reward verbosity or formality from their training data instead of genuine human patterns, the reported gaps may not hold.

This is aimed at researchers and engineers building conversational agents who need practical ways to vet simulators. A reader focused on evaluation methods would get concrete value from the code and the metric combination, even if they add their own human checks later. It deserves peer review because the framework is timely and reproducible, though referees will likely press on the judge validation.

Referee Report

2 major / 3 minor

Summary. The paper introduces MirrorBench, a reproducible and extensible benchmarking framework for evaluating conversational user-proxy agents solely on their ability to generate human-like utterances, decoupled from any downstream task success. It combines three lexical-diversity metrics (MATTR, Yule’s K, HD-D) with three LLM-judge metrics (GTEval, Pairwise Indistinguishability, Rubric-and-Reason), contextualized via Human-Human and Proxy-Proxy calibration controls. Evaluations across four public datasets are claimed to yield variance-aware comparisons that reveal systematic gaps between proxies and real human users; the framework and CLI are open-sourced.

Significance. If the metrics are shown to validly track human-likeness, MirrorBench would provide a useful, task-independent standard for developing and comparing user simulators, which is relevant for both evaluation of conversational systems and synthetic data generation. The open-source release, CLI, and explicit calibration controls are concrete strengths that support reproducibility. However, the significance is limited by the absence of direct validation that the LLM-judge components correlate with human judgments rather than model-specific artifacts.

major comments (2)

[Abstract and Evaluation Methodology] Abstract / Evaluation section: The claim that MirrorBench isolates and quantifies human-likeness rests primarily on the three LLM-judge rubrics. No inter-rater agreement, correlation with human ratings, or ablation swapping the underlying judge model is reported on the same utterance sets. Lexical metrics (MATTR, Yule’s K, HD-D) address only surface diversity, so the judge components are load-bearing for the 'systematic gaps' conclusion; without this validation the gaps may reflect judge preferences (e.g., verbosity or formality) rather than genuine human-likeness.
[Experiments and Calibration Controls] §4 (Experiments) and controls description: While Human-Human and Proxy-Proxy controls are a positive design choice, they calibrate relative to the same LLM judges. This does not close the loop on whether observed differences survive replacement of the judge model or addition of human correlation data.

minor comments (3)

[Methodology] Clarify the exact prompting templates and temperature settings used for the LLM judges to improve reproducibility.
[Results] The abstract mentions 'variance-aware comparisons' but does not specify the statistical tests or confidence intervals used; add these details in the results section.
Consider adding a limitations section discussing potential biases in the chosen public datasets.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on MirrorBench. The comments correctly identify the need for additional validation of the LLM-judge components to support claims of human-likeness. We address each point below and describe the revisions we will make.


Referee: [Abstract and Evaluation Methodology] Abstract / Evaluation section: The claim that MirrorBench isolates and quantifies human-likeness rests primarily on the three LLM-judge rubrics. No inter-rater agreement, correlation with human ratings, or ablation swapping the underlying judge model is reported on the same utterance sets. Lexical metrics (MATTR, Yule’s K, HD-D) address only surface diversity, so the judge components are load-bearing for the 'systematic gaps' conclusion; without this validation the gaps may reflect judge preferences (e.g., verbosity or formality) rather than genuine human-likeness.

Authors: We agree that the manuscript does not currently report inter-rater agreement, human correlation coefficients, or judge-model ablations, and that this leaves the LLM-judge results open to the possibility of model-specific biases. The lexical metrics supply an independent, non-LLM signal, and the Human-Human / Proxy-Proxy controls provide relative calibration, yet these do not fully substitute for direct human validation. In the revised manuscript we will add (i) an ablation that re-runs the three LLM-judge metrics with a second judge model on the same utterance sets and reports consistency of the observed gaps, and (ii) a small-scale human rating study on a stratified sample of utterances to compute Spearman correlations between human scores and each LLM-judge metric. These additions will appear in a new subsection of the evaluation methodology. revision: yes
Referee: [Experiments and Calibration Controls] §4 (Experiments) and controls description: While Human-Human and Proxy-Proxy controls are a positive design choice, they calibrate relative to the same LLM judges. This does not close the loop on whether observed differences survive replacement of the judge model or addition of human correlation data.

Authors: The calibration controls establish baselines within a single evaluation framework, allowing us to quantify how far proxies depart from human distributions under the same judging criteria. We acknowledge that this design does not yet demonstrate invariance to judge-model choice or alignment with human perception. The revisions outlined in our response to the first comment—specifically the judge-model ablation and the human correlation pilot—directly address this concern and will be integrated into the experimental results and discussion sections. revision: yes
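The human correlation pilot proposed in the responses above comes down to a rank correlation between human ratings and a judge metric on the same utterances. The sketch below shows how Spearman's coefficient could be computed with the standard library only. The function names are illustrative, any input data would have to come from an actual rating study, and the tie handling (average ranks) is the conventional choice rather than something the rebuttal specifies.

```python
def average_ranks(values):
    """Assign 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5
```

Because the coefficient depends only on ranks, it is insensitive to a judge that uses a different numeric scale from human raters, which suits a comparison between rubric scores and human ratings.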

Circularity Check

0 steps flagged

No circularity: MirrorBench defines new evaluation procedures without reducing claims to self-defined inputs or self-citations


The paper presents MirrorBench as a benchmarking framework that combines known lexical-diversity metrics (MATTR, Yule’s K, HD-D) with LLM-judge rubrics and Human-Human/Proxy-Proxy controls. No derivation chain, equations, or predictions are shown that reduce a result to fitted parameters or prior self-citations by construction. The central claim of revealing systematic gaps rests on the explicit definition of these procedures rather than any self-referential reduction. The framework is self-contained as a measurement setup against external datasets and standard metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the untested premise that the selected diversity and judge metrics are valid proxies for human-likeness; no free parameters or new entities are introduced in the abstract description.

axioms (1)

domain assumption: Lexical diversity statistics (MATTR, Yule's K, HD-D) plus LLM judges can isolate human-likeness when calibrated against human-human and proxy-proxy dialogues.
Invoked in the abstract when the authors state that the metrics evaluate human-likeness across conversational regimes.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: MirrorBench combines three lexical-diversity metrics (MATTR, Yule’s K, and HD-D) with three LLM-judge-based metrics (GTEval, Pairwise Indistinguishability, and Rubric-and-Reason), and contextualizes judge scores using Human-Human and Proxy-Proxy calibration controls.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear

Paper passage: Across four public datasets, MirrorBench yields variance-aware comparisons and reveals systematic gaps between user proxies and real human users.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models
cs.AI · 2026-04 · unverdicted · novelty 8.0
User-turn generation reveals that LLMs' interaction awareness is largely decoupled from task accuracy, remaining near zero in deterministic settings even as accuracy scales to 96.8% on GSM8K.

SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science
cs.AI · 2026-05 · unverdicted · novelty 7.0
SCICONVBENCH is a new benchmark evaluating LLMs on multi-turn disambiguation and inconsistency resolution for task formulation in computational science, with frontier models reaching only 52.7% success on fluid mechan...

Reinforcing Human Behavior Simulation via Verbal Feedback
cs.LG · 2026-05 · unverdicted · novelty 6.0
DITTO uses RL with verbal feedback to train LLMs for human behavior simulation, reporting 36% average gains over base models and outperforming GPT-5.4 on 6 of 10 SOUL benchmark tasks.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · cited by 3 Pith papers · 6 internal anchors

[1]

Adnan Ahmad, Stefan Hillmann, and Sebastian Möller. 2025. Simulating User Diversity in Task-Oriented Dialogue Systems using Large Language Models. arXiv:2502.12813 [cs.CL] https://arxiv.org/abs/2502.12813

work page arXiv 2025
[2]

Mohammad Aliannejadi, Julia Kiseleva, Aleksandr Chuklin, Jeff Dalton, and Mikhail Burtsev. 2021. Building and Evaluating Open-Domain Dialogue Corpora with Clarifying Questions. InEMNLP

work page 2021
[3]

Bruce Croft

Mohammad Aliannejadi, Hamed Zamani, Fabio Crestani, and W. Bruce Croft

work page
[4]

InProceedings of the 42nd International ACM SIGIR Confer- ence on Research and Development in Information Retrieval(Paris, France)(SI- GIR’19)

Asking Clarifying Questions in Open-Domain Information-Seeking Conversations. InProceedings of the 42nd International ACM SIGIR Confer- ence on Research and Development in Information Retrieval(Paris, France)(SI- GIR’19). Association for Computing Machinery, New York, NY, USA, 475–484. doi:10.1145/3331184.3331265

work page doi:10.1145/3331184.3331265
[5]

2025.System Card: Claude Opus 4 & Claude Sonnet

Anthropic. 2025.System Card: Claude Opus 4 & Claude Sonnet

work page 2025
[6]

Anthropic

Technical Report. Anthropic. https://www-cdn.anthropic.com/ 6d8a8055020700718b0c49369f60816ba2a7c285.pdf Accessed: 2025-10-22

work page 2025
[7]

Krisztian Balog and ChengXiang Zhai. 2023. User Simulation for Evaluating Information Access Systems. InProceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region(Beijing, China)(SIGIR-AP ’23). Association for Computing Machinery, New York, NY, USA, 302–305. doi:10.1145/3624...

work page doi:10.1145/3624918.3629549 2023
[8]

Carmona, Robert Szava-Kovats, and Meelis Pärtel

Carlos P. Carmona, Robert Szava-Kovats, and Meelis Pärtel. 2019. Estimating probabilistic dark diversity based on the hypergeometric distribution.bioRxiv (2019). arXiv:https://www.biorxiv.org/content/early/2019/05/15/636753.full.pdf doi:10.1101/636753

work page doi:10.1101/636753 2019
[9]

John W Chotlos. 1944. IV. A statistical and comparative analysis of individual written language samples.Psychological Monographs56, 2 (1944), 75

work page 1944
[10]

Covington and Joe D

Michael A. Covington and Joe D. McFall. 2010. Cutting the Gordian Knot: The Moving-Average Type–Token Ratio (MATTR).Journal of Quantitative Linguistics 17 (2010), 100 – 94. https://api.semanticscholar.org/CorpusID:18924254

work page 2010
[11]

2025.System Card: Gemini 2.5 Pro

Google. 2025.System Card: Gemini 2.5 Pro. Technical Report. Google. https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini- 2-5-Pro-Model-Card.pdf Accessed: 2025-10-22

work page 2025
[12]

Ashutosh Hathidara, Julien Yu, and Sebastian Schreiber. 2025. Disambiguation- Centric Finetuning Makes Enterprise Tool-Calling LLMs More Realistic and Less Risky. arXiv:2507.03336 [cs.AI] https://arxiv.org/abs/2507.03336

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

H. S. Heaps. 1978.Information Retrieval: Computational and Theoretical Aspects. Academic Press, Inc., USA

work page 1978
[14]

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. 2024. Gpt-4o system card.arXiv preprint arXiv:2410.21276(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[15]

M. G. Kendall. 1938. A NEW MEASURE OF RANK CORRELATION.Biometrika 30 (1938), 81–93. https://api.semanticscholar.org/CorpusID:120478295

work page 1938
[16]

Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang, Shayne Longpre, Hwaran Lee, Sangdoo Yun, Seongjin Shin, Sungdong Kim, James Thorne, and Minjoon Seo

work page
[17]

InThe Twelfth International Conference on Learning Representations

Prometheus: Inducing Fine-Grained Evaluation Capability in Language Models. InThe Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=8euJaTveKw

work page
[18]

2000.CRC Standard Probability and Statis- tics Tables and Formulae, Student Edition

Stephen Kokoska and Dan Zwillinger. 2000.CRC Standard Probability and Statis- tics Tables and Formulae, Student Edition. doi:10.1201/b16923 Section 14.7

work page doi:10.1201/b16923 2000
[19]

Andreas Köpf, Yannic Kilcher, Dimitri von Rütte, Sotiris Anagnostidis, Zhi Rui Tam, Keith Stevens, Abdullah Barhoum, Duc Minh Nguyen, Oliver Stanley, Richárd Nagyfi, Shahul ES, Sameer Suri, David Alexandrovich Glushkov, Ar- nav Varma Dantuluri, Andrew Maguire, Christoph Schuhmann, Huu Nguyen, and Alexander Julian Mattick. 2023. OpenAssistant Conversations...

work page 2023
[20]

Kowalski

Charles J. Kowalski. 2018. On the Effects of Non-Normality on the Distribution of the Sample Product-Moment Correlation Coeffi- cient.Journal of the Royal Statistical Society Series C: Applied Statis- tics21, 1 (12 2018), 1–12. arXiv:https://academic.oup.com/jrsssc/article- pdf/21/1/1/48613051/jrsssc_21_1_1.pdf doi:10.2307/2346598

work page doi:10.2307/2346598 2018
[22]

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. 2025. AgentBench: Evaluating LLMs as Agents. arXiv:2308.03688 [cs.AI] https://arxiv.org/abs...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[23]

Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-eval: NLG evaluation using gpt-4 with better human alignment. arXiv preprint arXiv:2303.16634(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[24]

Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-Eval: NLG Evaluation using Gpt-4 with Better Human Alignment. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 2511–2522. doi:...

work page doi:10.18653/v1/2023.emnlp- 2023
[25]

Jiarui Lu, Thomas Holleis, Yizhe Zhang, Bernhard Aumayer, Feng Nan, Haoping Bai, Shuang Ma, Shen Ma, Mengyu Li, Guoli Yin, Zirui Wang, and Ruoming Pang. 2025. ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities. InFindings of the Association for Computational Linguistics: NAACL 2025, Luis Chiruzzo, Alan ...

work page doi:10.18653/v1/2025.findings-naacl.65 2025
[26]

Ritchie, Soren Min- dermann, Evan Hubinger, Ethan Perez, and Kevin Troy

Aengus Lynch, Benjamin Wright, Caleb Larson, Stuart J. Ritchie, Soren Min- dermann, Evan Hubinger, Ethan Perez, and Kevin Troy. 2025. Agentic Mis- alignment: How LLMs Could Be Insider Threats. arXiv:2510.05179 [cs.CR] https://arxiv.org/abs/2510.05179

work page arXiv 2025
[27]

Philip Mccarthy and Scott Jarvis. 2007. Vocd: A theoretical and empirical evalua- tion. Language Testing, 24, 459-488.Language Testing - LANG TEST24 (10 2007), 459–488. doi:10.1177/0265532207080767

work page doi:10.1177/0265532207080767 2007
[28]

Jordan, and Ion Stoica

Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, and Ion Stoica. 2018. Ray: a distributed framework for emerging AI applica- tions. InProceedings of the 13th USENIX Conference on Operating Systems Design and Implementation(Carlsbad, CA, USA)(OSDI’18). ...

work page 2018
[29]

Tarek Naous, Philippe Laban, Wei Xu, and Jennifer Neville. 2025. Flipping the Di- alogue: Training and Evaluating User Language Models. arXiv:2510.06552 [cs.CL] https://arxiv.org/abs/2510.06552

work page arXiv 2025
[30]

2025.GPT-5 System Card

OpenAI. 2025.GPT-5 System Card. Technical Report. OpenAI. https://cdn.openai. com/gpt-5-system-card.pdf Accessed: 2025-10-22

work page 2025
[31]

OpenAI. 2025. gpt-oss-120b & gpt-oss-20b Model Card. arXiv:2508.10925 [cs.CL] https://arxiv.org/abs/2508.10925

work page internal anchor Pith review Pith/arXiv arXiv 2025
[32]

Xuchen Pan, Dawei Gao, Yuexiang Xie, Yushuo Chen, Zhewei Wei, Yaliang Li, Bolin Ding, Ji-Rong Wen, and Jingren Zhou. 2024. Very Large-Scale Multi-Agent Simulation in AgentScope. arXiv:2407.17789 [cs.MA] https://arxiv.org/abs/2407. 17789

work page arXiv 2024
[33]

Aishwarya Sahoo, Jeevana Kruthi Karnuthala, Tushar Parmanand Budhwani, Pranchal Agarwal, Sankaran Vaidyanathan, Alexa Siu, Franck Dernoncourt, Jennifer Healey, Nedim Lipka, Ryan Rossi, Uttaran Bhattacharya, and Branislav Kveton. 2025. Quantitative LLM Judges. arXiv:2506.02945 [cs.CL] https://arxiv. org/abs/2506.02945

work page arXiv 2025
[34]

Schatzmann and S

J. Schatzmann and S. Young. 2009. The Hidden Agenda User Simulation Model. Trans. Audio, Speech and Lang. Proc.17, 4 (May 2009), 733–747. doi:10.1109/TASL. 2008.2012071

work page doi:10.1109/tasl 2009
[35]

Kumiko Tanaka-Ishii and Shunsuke Aihara. 2015. Computational constancy measures of texts-yule’s k and rényi’s entropy.Comput. Linguist.41, 3 (Sept. 2015), 481–502. doi:10.1162/COLI_a_00228

work page doi:10.1162/coli_a_00228 2015
[36]

Aman Singh Thakur, Kartik Choudhary, Venkat Srinik Ramayapally, Sankaran Vaidyanathan, and Dieuwke Hupkes. 2025. Judging the Judges: Evaluating Align- ment and Vulnerabilities in LLMs-as-Judges. InProceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM2), Ofir Arviv, Miruna Clinciu, Kaus- tubh Dhole, Rotem Dror, Sebastian Gehrmann, E...

work page 2025
[37]

Kuang Wang, Xianfei Li, Shenghao Yang, Li Zhou, Feng Jiang, and Haizhou Li

work page
[38]

Know You First and Be You Better: Modeling Human-Like User Simulators via Implicit Profiles. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (Eds.). Association for Computational Linguistics, Vienna, Austria, 21082...

work page doi:10.18653/v1/ 2025
[39]

Sai Wang, Senthilnathan Subramanian, Mudit Sahni, Praneeth Gone, Lingjie Meng, Xiaochen Wang, Nicolas Ferradas Bertoli, Tingxian Cheng, and Jun Xu

work page
[40]

[1] Adnan Ahmad, Stefan Hillmann, and Sebastian Möller. 2025. Simulating User Diversity in Task-Oriented Dialogue Systems using Large Language Models. arXiv:2502.12813 [cs.CL] https://arxiv.org/abs/2502.12813

[2] Mohammad Aliannejadi, Julia Kiseleva, Aleksandr Chuklin, Jeff Dalton, and Mikhail Burtsev. 2021. Building and Evaluating Open-Domain Dialogue Corpora with Clarifying Questions. In EMNLP.

[3] Mohammad Aliannejadi, Hamed Zamani, Fabio Crestani, and W. Bruce Croft. 2019. Asking Clarifying Questions in Open-Domain Information-Seeking Conversations. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (Paris, France) (SIGIR ’19). Association for Computing Machinery, New York, NY, USA, 475–484. doi:10.1145/3331184.3331265

[5] Anthropic. 2025. System Card: Claude Opus 4 & Claude Sonnet 4. Technical Report. Anthropic. https://www-cdn.anthropic.com/6d8a8055020700718b0c49369f60816ba2a7c285.pdf Accessed: 2025-10-22

[7] Krisztian Balog and ChengXiang Zhai. 2023. User Simulation for Evaluating Information Access Systems. In Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region (Beijing, China) (SIGIR-AP ’23). Association for Computing Machinery, New York, NY, USA, 302–305. doi:10.1145/3624...

[8] Carlos P. Carmona, Robert Szava-Kovats, and Meelis Pärtel. 2019. Estimating probabilistic dark diversity based on the hypergeometric distribution. bioRxiv (2019). https://www.biorxiv.org/content/early/2019/05/15/636753.full.pdf doi:10.1101/636753

[9] John W Chotlos. 1944. IV. A statistical and comparative analysis of individual written language samples. Psychological Monographs 56, 2 (1944), 75.

[10] Michael A. Covington and Joe D. McFall. 2010. Cutting the Gordian Knot: The Moving-Average Type–Token Ratio (MATTR). Journal of Quantitative Linguistics 17 (2010), 94–100. https://api.semanticscholar.org/CorpusID:18924254

[11] Google. 2025. System Card: Gemini 2.5 Pro. Technical Report. Google. https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-2-5-Pro-Model-Card.pdf Accessed: 2025-10-22

[12] Ashutosh Hathidara, Julien Yu, and Sebastian Schreiber. 2025. Disambiguation-Centric Finetuning Makes Enterprise Tool-Calling LLMs More Realistic and Less Risky. arXiv:2507.03336 [cs.AI] https://arxiv.org/abs/2507.03336

[13] H. S. Heaps. 1978. Information Retrieval: Computational and Theoretical Aspects. Academic Press, Inc., USA.

[14] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. 2024. GPT-4o System Card. arXiv preprint arXiv:2410.21276 (2024).

[15] M. G. Kendall. 1938. A New Measure of Rank Correlation. Biometrika 30 (1938), 81–93. https://api.semanticscholar.org/CorpusID:120478295

[16] Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang, Shayne Longpre, Hwaran Lee, Sangdoo Yun, Seongjin Shin, Sungdong Kim, James Thorne, and Minjoon Seo. 2024. Prometheus: Inducing Fine-Grained Evaluation Capability in Language Models. In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=8euJaTveKw

[18] Stephen Kokoska and Dan Zwillinger. 2000. CRC Standard Probability and Statistics Tables and Formulae, Student Edition. doi:10.1201/b16923 Section 14.7

[19] Andreas Köpf, Yannic Kilcher, Dimitri von Rütte, Sotiris Anagnostidis, Zhi Rui Tam, Keith Stevens, Abdullah Barhoum, Duc Minh Nguyen, Oliver Stanley, Richárd Nagyfi, Shahul ES, Sameer Suri, David Alexandrovich Glushkov, Arnav Varma Dantuluri, Andrew Maguire, Christoph Schuhmann, Huu Nguyen, and Alexander Julian Mattick. 2023. OpenAssistant Conversations...

[20] Charles J. Kowalski. 2018. On the Effects of Non-Normality on the Distribution of the Sample Product-Moment Correlation Coefficient. Journal of the Royal Statistical Society Series C: Applied Statistics 21, 1 (12 2018), 1–12. https://academic.oup.com/jrsssc/article-pdf/21/1/1/48613051/jrsssc_21_1_1.pdf doi:10.2307/2346598

[21] Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. 2025. AgentBench: Evaluating LLMs as Agents. arXiv:2308.03688 [cs.AI] https://arxiv.org/abs...

[22] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. arXiv preprint arXiv:2303.16634 (2023).

[23] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 2511–2522. doi:...

[24] Jiarui Lu, Thomas Holleis, Yizhe Zhang, Bernhard Aumayer, Feng Nan, Haoping Bai, Shuang Ma, Shen Ma, Mengyu Li, Guoli Yin, Zirui Wang, and Ruoming Pang. 2025. ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities. In Findings of the Association for Computational Linguistics: NAACL 2025, Luis Chiruzzo, Alan ...

[25] Aengus Lynch, Benjamin Wright, Caleb Larson, Stuart J. Ritchie, Soren Mindermann, Evan Hubinger, Ethan Perez, and Kevin Troy. 2025. Agentic Misalignment: How LLMs Could Be Insider Threats. arXiv:2510.05179 [cs.CR] https://arxiv.org/abs/2510.05179

[26] Philip McCarthy and Scott Jarvis. 2007. vocd: A theoretical and empirical evaluation. Language Testing 24 (10 2007), 459–488. doi:10.1177/0265532207080767

[27] Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, and Ion Stoica. 2018. Ray: a distributed framework for emerging AI applications. In Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation (Carlsbad, CA, USA) (OSDI ’18). ...

[28] Tarek Naous, Philippe Laban, Wei Xu, and Jennifer Neville. 2025. Flipping the Dialogue: Training and Evaluating User Language Models. arXiv:2510.06552 [cs.CL] https://arxiv.org/abs/2510.06552

[29] OpenAI. 2025. GPT-5 System Card. Technical Report. OpenAI. https://cdn.openai.com/gpt-5-system-card.pdf Accessed: 2025-10-22

[30] OpenAI. 2025. gpt-oss-120b & gpt-oss-20b Model Card. arXiv:2508.10925 [cs.CL] https://arxiv.org/abs/2508.10925

[31] Xuchen Pan, Dawei Gao, Yuexiang Xie, Yushuo Chen, Zhewei Wei, Yaliang Li, Bolin Ding, Ji-Rong Wen, and Jingren Zhou. 2024. Very Large-Scale Multi-Agent Simulation in AgentScope. arXiv:2407.17789 [cs.MA] https://arxiv.org/abs/2407.17789

[32] Aishwarya Sahoo, Jeevana Kruthi Karnuthala, Tushar Parmanand Budhwani, Pranchal Agarwal, Sankaran Vaidyanathan, Alexa Siu, Franck Dernoncourt, Jennifer Healey, Nedim Lipka, Ryan Rossi, Uttaran Bhattacharya, and Branislav Kveton. 2025. Quantitative LLM Judges. arXiv:2506.02945 [cs.CL] https://arxiv.org/abs/2506.02945

[33] J. Schatzmann and S. Young. 2009. The Hidden Agenda User Simulation Model. Trans. Audio, Speech and Lang. Proc. 17, 4 (May 2009), 733–747. doi:10.1109/TASL.2008.2012071

[34] Kumiko Tanaka-Ishii and Shunsuke Aihara. 2015. Computational constancy measures of texts-Yule’s K and Rényi’s entropy. Comput. Linguist. 41, 3 (Sept. 2015), 481–502. doi:10.1162/COLI_a_00228

[35] Aman Singh Thakur, Kartik Choudhary, Venkat Srinik Ramayapally, Sankaran Vaidyanathan, and Dieuwke Hupkes. 2025. Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges. In Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM2), Ofir Arviv, Miruna Clinciu, Kaustubh Dhole, Rotem Dror, Sebastian Gehrmann, E...

[36] Kuang Wang, Xianfei Li, Shenghao Yang, Li Zhou, Feng Jiang, and Haizhou Li. 2025. Know You First and Be You Better: Modeling Human-Like User Simulators via Implicit Profiles. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (Eds.). Association for Computational Linguistics, Vienna, Austria, 21082...

[38] Sai Wang, Senthilnathan Subramanian, Mudit Sahni, Praneeth Gone, Lingjie Meng, Xiaochen Wang, Nicolas Ferradas Bertoli, Tingxian Cheng, and Jun Xu. 2025. Configurable multi-agent framework for scalable and realistic testing of LLM-based agents. arXiv:2507.14705 [cs.AI] https://arxiv.org/abs/2507.14705

[40] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. In Proceedings of the 36th International Conference on Neural Information Processing Systems (New Orleans, LA, USA) (NIPS ’22). Curran Associates Inc., Red Hook, NY...

[41] Weinan Zhang, Muning Wen, Jun Wang, Haoyu Zhang, Qiuying Peng, Cheng Jin, Xihuai Wang, Qiqiang Lin, Xiaoyun Mo, and Jiamu Zhou. 2025. HammerBench: Fine-Grained Function-Calling Evaluation in Real Mobile Device Scenarios. arXiv:2412.16516 [cs.CL] https://arxiv.org/abs/2412.16516

Preprint Under Review, Feb 2025, Ashutosh Hathidara, Julien Yu, Vaishali S...

[42] Yanzhe Zhang and Diyi Yang. 2025. Searching for Privacy Risks in LLM Agents via Simulation. arXiv:2508.10880 [cs.CR] https://arxiv.org/abs/2508.10880

[43] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track. https://openreview...

[44] Xuhui Zhou, Zhe Su, Tiwalayo Eisape, Hyunwoo Kim, and Maarten Sap. 2024. Is this the real life? Is this just fantasy? The Misleading Success of Simulating Social Interactions With LLMs. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (Eds.). Association for Computa...

[45] Ruizhe Zhu, Hao Zhu, Yaxuan Li, Syang Zhou, Shijing Cai, Malgorzata Lazuka, and Elliott Ash. 2025. DialogueForge: LLM Simulation of Human-Chatbot Dialogue. arXiv:2507.15752 [cs.CL] https://arxiv.org/abs/2507.15752

A MirrorBench System

In this section, we explain the system aspects of the benchmarking framework. Although the metric definitions and execut...

1. **Style Similarity**: Do the proxy user responses match the conversational style of real user responses (formality, tone, verbosity)?
2. **Realism**: Do the proxy user responses sound natural and human-like?
3. **Contextual Appropriateness**: Are the proxy user responses appropriate given the conversation context?

Note: You should not evaluate based on the content of the responses, only their style, realism, contextual appropriateness, and tone.

## Instructions:
- Focus exclusively on comparing USER responses (ignore assistant responses)
- Consider the overall c...

1. Concise and real-user like language
2. Does not sound scripted or artificial
3. Real-user like tone and style

Return JSON: {"reasoning": "<1-2 sentences>", "verdict": <"NO" or "YES">}.

[Conversation]
{conversation}

Output ONLY valid JSON, no additional text.

Figure 14: Judge prompt for RNR Metric

User-Proxy System Prompt

You are simulating a real human user for the MirrorBench evaluation harness. Respond with the next USER turn only....
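The RNR judge prompt in Figure 14 asks the judge to return only a JSON object with a `reasoning` string and a `verdict` of "YES" or "NO". As a minimal sketch of how a harness might validate such a reply before scoring it, the snippet below parses and checks the two fields. The function name `parse_rnr_verdict` and its error handling are hypothetical and are not taken from the MirrorBench codebase.

```python
import json


def parse_rnr_verdict(raw: str) -> tuple[str, str]:
    """Parse a judge reply shaped like {"reasoning": "...", "verdict": "YES" | "NO"}.

    Hypothetical helper for illustration; returns (verdict, reasoning).
    Raises ValueError when the reply is not valid JSON or a field is malformed.
    """
    # json.JSONDecodeError subclasses ValueError, so callers catch one type.
    data = json.loads(raw.strip())
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")

    # Normalise case so "yes" and "Yes" are accepted as "YES".
    verdict = str(data.get("verdict", "")).strip().upper()
    if verdict not in {"YES", "NO"}:
        raise ValueError(f"unexpected verdict: {verdict!r}")

    reasoning = data.get("reasoning")
    if not isinstance(reasoning, str) or not reasoning.strip():
        raise ValueError("reasoning must be a non-empty string")

    return verdict, reasoning.strip()


if __name__ == "__main__":
    reply = '{"reasoning": "Short and casual, like a real user.", "verdict": "YES"}'
    print(parse_rnr_verdict(reply))
```

Rejecting malformed replies with a single exception type keeps the caller free to decide whether to re-query the judge or drop the sample; that policy is left open here because the source does not specify it.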