MirrorBench: A Benchmark to Evaluate Conversational User-Proxy Agents for Human-Likeness
Pith reviewed 2026-05-21 15:09 UTC · model grok-4.3
The pith
MirrorBench shows that current user-proxy agents produce systematically less human-like utterances than real users when measured with combined lexical and LLM metrics.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MirrorBench is a reproducible benchmarking framework that evaluates user proxies solely on their ability to produce human-like utterances. It integrates the lexical-diversity metrics MATTR, Yule's K, and HD-D with the LLM-judge metrics GTEval, Pairwise Indistinguishability, and Rubric-and-Reason, then contextualizes results using Human-Human and Proxy-Proxy calibration controls to yield variance-aware comparisons that reveal systematic gaps between proxies and real human users across four public datasets.
What carries the argument
MirrorBench, a framework that combines three lexical-diversity metrics with three LLM-judge metrics and calibrates scores against human-human and proxy-proxy controls to isolate human-likeness.
If this is right
- User proxies can be compared in a variance-aware manner across multiple datasets.
- Systematic gaps between proxies and humans become visible for guiding improvements.
- Evaluation of utterance realism is decoupled from task-specific performance.
- The open-source framework with its command-line interface supports reproducible experiments.
- The approach is extensible to additional datasets and metrics.
Where Pith is reading between the lines
- Proxies that score higher on MirrorBench could produce more natural synthetic data for training dialogue models.
- The calibration controls might be adapted to test realism in other generative tasks such as story or code generation.
- High MirrorBench scores could be checked for correlation with better results in live human-AI interaction studies.
Load-bearing premise
The selected lexical-diversity metrics and LLM-judge rubrics together with human-human and proxy-proxy controls are sufficient to measure human-likeness independently of any downstream task.
What would settle it
A user proxy that matches human scores on every MirrorBench metric yet produces conversations that human raters still identify as artificial in blind side-by-side comparisons.
Figures
read the original abstract
Large language models (LLMs) are increasingly used as human simulators, both for evaluating conversational systems and for generating fine-tuning data. However, naive "act-as-a-user" prompting often yields verbose, unrealistic utterances, motivating principled evaluation of *user proxy agents*. We present **MirrorBench**, a reproducible and extensible benchmarking framework that evaluates user proxies solely on their ability to produce human-like user utterances across diverse conversational regimes, explicitly decoupled from downstream task success. **MirrorBench** combines three lexical-diversity metrics (**MATTR**, **Yule's~$K$**, and **HD-D**) with three LLM-judge-based metrics (**GTEval**, **Pairwise Indistinguishability**, and **Rubric-and-Reason**), and contextualizes judge scores using Human-Human and Proxy-Proxy calibration controls. Across four public datasets, **MirrorBench** yields variance-aware comparisons and reveals systematic gaps between user proxies and real human users. The framework is open sourced at https://github.com/SAP/mirrorbench and includes a command-line interface for running and managing user-proxy benchmarking experiments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MirrorBench, a reproducible and extensible benchmarking framework for evaluating conversational user-proxy agents solely on their ability to generate human-like utterances, decoupled from any downstream task success. It combines three lexical-diversity metrics (MATTR, Yule’s K, HD-D) with three LLM-judge metrics (GTEval, Pairwise Indistinguishability, Rubric-and-Reason), contextualized via Human-Human and Proxy-Proxy calibration controls. Evaluations across four public datasets are claimed to yield variance-aware comparisons that reveal systematic gaps between proxies and real human users; the framework and CLI are open-sourced.
Significance. If the metrics are shown to validly track human-likeness, MirrorBench would provide a useful, task-independent standard for developing and comparing user simulators, which is relevant for both evaluation of conversational systems and synthetic data generation. The open-source release, CLI, and explicit calibration controls are concrete strengths that support reproducibility. However, the significance is limited by the absence of direct validation that the LLM-judge components correlate with human judgments rather than model-specific artifacts.
major comments (2)
- [Abstract and Evaluation Methodology] Abstract / Evaluation section: The claim that MirrorBench isolates and quantifies human-likeness rests primarily on the three LLM-judge rubrics. No inter-rater agreement, correlation with human ratings, or ablation swapping the underlying judge model is reported on the same utterance sets. Lexical metrics (MATTR, Yule’s K, HD-D) address only surface diversity, so the judge components are load-bearing for the 'systematic gaps' conclusion; without this validation the gaps may reflect judge preferences (e.g., verbosity or formality) rather than genuine human-likeness.
- [Experiments and Calibration Controls] §4 (Experiments) and controls description: While Human-Human and Proxy-Proxy controls are a positive design choice, they calibrate relative to the same LLM judges. This does not close the loop on whether observed differences survive replacement of the judge model or addition of human correlation data.
minor comments (3)
- [Methodology] Clarify the exact prompting templates and temperature settings used for the LLM judges to improve reproducibility.
- [Results] The abstract mentions 'variance-aware comparisons' but does not specify the statistical tests or confidence intervals used; add these details in the results section.
- Consider adding a limitations section discussing potential biases in the chosen public datasets.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on MirrorBench. The comments correctly identify the need for additional validation of the LLM-judge components to support claims of human-likeness. We address each point below and describe the revisions we will make.
read point-by-point responses
-
Referee: [Abstract and Evaluation Methodology] Abstract / Evaluation section: The claim that MirrorBench isolates and quantifies human-likeness rests primarily on the three LLM-judge rubrics. No inter-rater agreement, correlation with human ratings, or ablation swapping the underlying judge model is reported on the same utterance sets. Lexical metrics (MATTR, Yule’s K, HD-D) address only surface diversity, so the judge components are load-bearing for the 'systematic gaps' conclusion; without this validation the gaps may reflect judge preferences (e.g., verbosity or formality) rather than genuine human-likeness.
Authors: We agree that the manuscript does not currently report inter-rater agreement, human correlation coefficients, or judge-model ablations, and that this leaves the LLM-judge results open to the possibility of model-specific biases. The lexical metrics supply an independent, non-LLM signal, and the Human-Human / Proxy-Proxy controls provide relative calibration, yet these do not fully substitute for direct human validation. In the revised manuscript we will add (i) an ablation that re-runs the three LLM-judge metrics with a second judge model on the same utterance sets and reports consistency of the observed gaps, and (ii) a small-scale human rating study on a stratified sample of utterances to compute Spearman correlations between human scores and each LLM-judge metric. These additions will appear in a new subsection of the evaluation methodology. revision: yes
-
Referee: [Experiments and Calibration Controls] §4 (Experiments) and controls description: While Human-Human and Proxy-Proxy controls are a positive design choice, they calibrate relative to the same LLM judges. This does not close the loop on whether observed differences survive replacement of the judge model or addition of human correlation data.
Authors: The calibration controls establish baselines within a single evaluation framework, allowing us to quantify how far proxies depart from human distributions under the same judging criteria. We acknowledge that this design does not yet demonstrate invariance to judge-model choice or alignment with human perception. The revisions outlined in our response to the first comment—specifically the judge-model ablation and the human correlation pilot—directly address this concern and will be integrated into the experimental results and discussion sections. revision: yes
Circularity Check
No circularity: MirrorBench defines new evaluation procedures without reducing claims to self-defined inputs or self-citations
full rationale
The paper presents MirrorBench as a benchmarking framework that combines known lexical-diversity metrics (MATTR, Yule’s K, HD-D) with LLM-judge rubrics and Human-Human/Proxy-Proxy controls. No derivation chain, equations, or predictions are shown that reduce a result to fitted parameters or prior self-citations by construction. The central claim of revealing systematic gaps rests on the explicit definition of these procedures rather than any self-referential reduction. The framework is self-contained as a measurement setup against external datasets and standard metrics.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Lexical diversity statistics (MATTR, Yule's K, HD-D) plus LLM judges can isolate human-likeness when calibrated against human-human and proxy-proxy dialogues.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
MirrorBench combines three lexical-diversity metrics (MATTR, Yule’s K, and HD-D) with three LLM-judge-based metrics (GTEval, Pairwise Indistinguishability, and Rubric-and-Reason), and contextualizes judge scores using Human-Human and Proxy-Proxy calibration controls.
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Across four public datasets, MirrorBench yields variance-aware comparisons and reveals systematic gaps between user proxies and real human users.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 3 Pith papers
-
Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models
User-turn generation reveals that LLMs' interaction awareness is largely decoupled from task accuracy, remaining near zero in deterministic settings even as accuracy scales to 96.8% on GSM8K.
-
SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science
SCICONVBENCH is a new benchmark evaluating LLMs on multi-turn disambiguation and inconsistency resolution for task formulation in computational science, with frontier models reaching only 52.7% success on fluid mechan...
-
Reinforcing Human Behavior Simulation via Verbal Feedback
DITTO uses RL with verbal feedback to train LLMs for human behavior simulation, reporting 36% average gains over base models and outperforming GPT-5.4 on 6 of 10 SOUL benchmark tasks.
Reference graph
Works this paper leans on
- [1]
-
[2]
Mohammad Aliannejadi, Julia Kiseleva, Aleksandr Chuklin, Jeff Dalton, and Mikhail Burtsev. 2021. Building and Evaluating Open-Domain Dialogue Corpora with Clarifying Questions. InEMNLP
work page 2021
- [3]
-
[4]
Asking Clarifying Questions in Open-Domain Information-Seeking Conversations. InProceedings of the 42nd International ACM SIGIR Confer- ence on Research and Development in Information Retrieval(Paris, France)(SI- GIR’19). Association for Computing Machinery, New York, NY, USA, 475–484. doi:10.1145/3331184.3331265
-
[5]
2025.System Card: Claude Opus 4 & Claude Sonnet
Anthropic. 2025.System Card: Claude Opus 4 & Claude Sonnet
work page 2025
- [6]
-
[7]
Krisztian Balog and ChengXiang Zhai. 2023. User Simulation for Evaluating Information Access Systems. InProceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region(Beijing, China)(SIGIR-AP ’23). Association for Computing Machinery, New York, NY, USA, 302–305. doi:10.1145/3624...
-
[8]
Carmona, Robert Szava-Kovats, and Meelis Pärtel
Carlos P. Carmona, Robert Szava-Kovats, and Meelis Pärtel. 2019. Estimating probabilistic dark diversity based on the hypergeometric distribution.bioRxiv (2019). arXiv:https://www.biorxiv.org/content/early/2019/05/15/636753.full.pdf doi:10.1101/636753
-
[9]
John W Chotlos. 1944. IV. A statistical and comparative analysis of individual written language samples.Psychological Monographs56, 2 (1944), 75
work page 1944
-
[10]
Michael A. Covington and Joe D. McFall. 2010. Cutting the Gordian Knot: The Moving-Average Type–Token Ratio (MATTR).Journal of Quantitative Linguistics 17 (2010), 100 – 94. https://api.semanticscholar.org/CorpusID:18924254
work page 2010
-
[11]
2025.System Card: Gemini 2.5 Pro
Google. 2025.System Card: Gemini 2.5 Pro. Technical Report. Google. https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini- 2-5-Pro-Model-Card.pdf Accessed: 2025-10-22
work page 2025
-
[12]
Ashutosh Hathidara, Julien Yu, and Sebastian Schreiber. 2025. Disambiguation- Centric Finetuning Makes Enterprise Tool-Calling LLMs More Realistic and Less Risky. arXiv:2507.03336 [cs.AI] https://arxiv.org/abs/2507.03336
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[13]
H. S. Heaps. 1978.Information Retrieval: Computational and Theoretical Aspects. Academic Press, Inc., USA
work page 1978
-
[14]
Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. 2024. Gpt-4o system card.arXiv preprint arXiv:2410.21276(2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[15]
M. G. Kendall. 1938. A NEW MEASURE OF RANK CORRELATION.Biometrika 30 (1938), 81–93. https://api.semanticscholar.org/CorpusID:120478295
work page 1938
-
[16]
Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang, Shayne Longpre, Hwaran Lee, Sangdoo Yun, Seongjin Shin, Sungdong Kim, James Thorne, and Minjoon Seo
-
[17]
InThe Twelfth International Conference on Learning Representations
Prometheus: Inducing Fine-Grained Evaluation Capability in Language Models. InThe Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=8euJaTveKw
-
[18]
2000.CRC Standard Probability and Statis- tics Tables and Formulae, Student Edition
Stephen Kokoska and Dan Zwillinger. 2000.CRC Standard Probability and Statis- tics Tables and Formulae, Student Edition. doi:10.1201/b16923 Section 14.7
-
[19]
Andreas Köpf, Yannic Kilcher, Dimitri von Rütte, Sotiris Anagnostidis, Zhi Rui Tam, Keith Stevens, Abdullah Barhoum, Duc Minh Nguyen, Oliver Stanley, Richárd Nagyfi, Shahul ES, Sameer Suri, David Alexandrovich Glushkov, Ar- nav Varma Dantuluri, Andrew Maguire, Christoph Schuhmann, Huu Nguyen, and Alexander Julian Mattick. 2023. OpenAssistant Conversations...
work page 2023
-
[20]
Charles J. Kowalski. 2018. On the Effects of Non-Normality on the Distribution of the Sample Product-Moment Correlation Coeffi- cient.Journal of the Royal Statistical Society Series C: Applied Statis- tics21, 1 (12 2018), 1–12. arXiv:https://academic.oup.com/jrsssc/article- pdf/21/1/1/48613051/jrsssc_21_1_1.pdf doi:10.2307/2346598
-
[22]
Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. 2025. AgentBench: Evaluating LLMs as Agents. arXiv:2308.03688 [cs.AI] https://arxiv.org/abs...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[23]
Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-eval: NLG evaluation using gpt-4 with better human alignment. arXiv preprint arXiv:2303.16634(2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[24]
Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-Eval: NLG Evaluation using Gpt-4 with Better Human Alignment. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 2511–2522. doi:...
-
[25]
Jiarui Lu, Thomas Holleis, Yizhe Zhang, Bernhard Aumayer, Feng Nan, Haoping Bai, Shuang Ma, Shen Ma, Mengyu Li, Guoli Yin, Zirui Wang, and Ruoming Pang. 2025. ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities. InFindings of the Association for Computational Linguistics: NAACL 2025, Luis Chiruzzo, Alan ...
-
[26]
Ritchie, Soren Min- dermann, Evan Hubinger, Ethan Perez, and Kevin Troy
Aengus Lynch, Benjamin Wright, Caleb Larson, Stuart J. Ritchie, Soren Min- dermann, Evan Hubinger, Ethan Perez, and Kevin Troy. 2025. Agentic Mis- alignment: How LLMs Could Be Insider Threats. arXiv:2510.05179 [cs.CR] https://arxiv.org/abs/2510.05179
-
[27]
Philip Mccarthy and Scott Jarvis. 2007. Vocd: A theoretical and empirical evalua- tion. Language Testing, 24, 459-488.Language Testing - LANG TEST24 (10 2007), 459–488. doi:10.1177/0265532207080767
-
[28]
Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, and Ion Stoica. 2018. Ray: a distributed framework for emerging AI applica- tions. InProceedings of the 13th USENIX Conference on Operating Systems Design and Implementation(Carlsbad, CA, USA)(OSDI’18). ...
work page 2018
- [29]
-
[30]
OpenAI. 2025.GPT-5 System Card. Technical Report. OpenAI. https://cdn.openai. com/gpt-5-system-card.pdf Accessed: 2025-10-22
work page 2025
-
[31]
OpenAI. 2025. gpt-oss-120b & gpt-oss-20b Model Card. arXiv:2508.10925 [cs.CL] https://arxiv.org/abs/2508.10925
work page internal anchor Pith review Pith/arXiv arXiv 2025
- [32]
-
[33]
Aishwarya Sahoo, Jeevana Kruthi Karnuthala, Tushar Parmanand Budhwani, Pranchal Agarwal, Sankaran Vaidyanathan, Alexa Siu, Franck Dernoncourt, Jennifer Healey, Nedim Lipka, Ryan Rossi, Uttaran Bhattacharya, and Branislav Kveton. 2025. Quantitative LLM Judges. arXiv:2506.02945 [cs.CL] https://arxiv. org/abs/2506.02945
-
[34]
J. Schatzmann and S. Young. 2009. The Hidden Agenda User Simulation Model. Trans. Audio, Speech and Lang. Proc.17, 4 (May 2009), 733–747. doi:10.1109/TASL. 2008.2012071
-
[35]
Kumiko Tanaka-Ishii and Shunsuke Aihara. 2015. Computational constancy measures of texts-yule’s k and rényi’s entropy.Comput. Linguist.41, 3 (Sept. 2015), 481–502. doi:10.1162/COLI_a_00228
-
[36]
Aman Singh Thakur, Kartik Choudhary, Venkat Srinik Ramayapally, Sankaran Vaidyanathan, and Dieuwke Hupkes. 2025. Judging the Judges: Evaluating Align- ment and Vulnerabilities in LLMs-as-Judges. InProceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM2), Ofir Arviv, Miruna Clinciu, Kaus- tubh Dhole, Rotem Dror, Sebastian Gehrmann, E...
work page 2025
-
[37]
Kuang Wang, Xianfei Li, Shenghao Yang, Li Zhou, Feng Jiang, and Haizhou Li
-
[38]
Know You First and Be You Better: Modeling Human-Like User Simulators via Implicit Profiles. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (Eds.). Association for Computational Linguistics, Vienna, Austria, 21082...
-
[39]
Sai Wang, Senthilnathan Subramanian, Mudit Sahni, Praneeth Gone, Lingjie Meng, Xiaochen Wang, Nicolas Ferradas Bertoli, Tingxian Cheng, and Jun Xu
-
[40]
arXiv:2507.14705 [cs.AI] https://arxiv.org/abs/2507.14705
Configurable multi-agent framework for scalable and realistic testing of llm-based agents. arXiv:2507.14705 [cs.AI] https://arxiv.org/abs/2507.14705
-
[41]
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. InProceedings of the 36th International Conference on Neural Information Processing Systems(New Orleans, LA, USA) (NIPS ’22). Curran Associates Inc., Red Hook, NY...
work page 2022
-
[42]
Weinan Zhang, Muning Wen, Jun Wang, Haoyu Zhang, Qiuying Peng, Cheng Jin, Xihuai Wang, Qiqiang Lin, Xiaoyun Mo, and Jiamu Zhou. 2025. Hammer- Bench: Fine-Grained Function-Calling Evaluation in Real Mobile Device Scenar- ios. arXiv:2412.16516 [cs.CL] https://arxiv.org/abs/2412.16516 Preprint Under Review, Feb 2025, Ashutosh Hathidara, Julien Yu, Vaishali S...
-
[43]
Yanzhe Zhang and Diyi Yang. 2025. Searching for Privacy Risks in LLM Agents via Simulation. arXiv:2508.10880 [cs.CR] https://arxiv.org/abs/2508.10880
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[44]
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. InThirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track. https://openreview...
work page 2023
-
[45]
Xuhui Zhou, Zhe Su, Tiwalayo Eisape, Hyunwoo Kim, and Maarten Sap. 2024. Is this the real life? Is this just fantasy? The Misleading Success of Simulating Social Interactions With LLMs. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (Eds.). Association for Computa...
-
[46]
retry with exponential backoff
Ruizhe Zhu, Hao Zhu, Yaxuan Li, Syang Zhou, Shijing Cai, Malgorzata Lazuka, and Elliott Ash. 2025. DialogueForge: LLM Simulation of Human-Chatbot Dia- logue. arXiv:2507.15752 [cs.CL] https://arxiv.org/abs/2507.15752 A MirrorBench System In this section, we explain the system aspects of the benchmarking framework. Although the metric definitions and execut...
-
[47]
**Style Similarity**: Do the proxy user responses match the conversational style of real user responses (formality, tone, verbosity)?
-
[48]
**Realism**: Do the proxy user responses sound natural and human-like?
-
[49]
**Contextual Appropriateness**: Are the proxy user responses appropriate given the conversation context? Note: You should not evaluate based on the content of the responses, only their style, realism, contextual appropriateness, and tone. ## Instructions: - Focus exclusively on comparing USER responses (ignore assistant responses) - Consider the overall c...
-
[50]
Concise and real-user like language
-
[51]
Does not sound scripted or artificial
-
[52]
Real-user like tone and style Return JSON: {“reasoning": “<1-2 sentences>", “verdict": <“NO" or “YES">}. [Conversation] {conversation} Output ONLY valid JSON, no additional text. Figure 14: Judge prompt for RNR Metric User-Proxy System Prompt You are simulating a real human user for theMirror- Benchevaluation harness. Respond with the next USER turn only....
work page 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.