pith. machine review for the scientific record.

arxiv: 2604.16310 · v1 · submitted 2026-01-30 · 💻 cs.IR · cs.AI · cs.CL

Recognition: no theorem link

RAG-DIVE: A Dynamic Approach for Multi-Turn Dialogue Evaluation in Retrieval-Augmented Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 09:11 UTC · model grok-4.3

classification 💻 cs.IR · cs.AI · cs.CL
keywords RAG evaluation · multi-turn dialogue · dynamic evaluation · LLM simulation · retrieval-augmented generation · conversation generation · system assessment

The pith

RAG-DIVE uses an LLM to generate and validate multi-turn conversations for dynamically evaluating RAG system performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Static evaluation datasets for retrieval-augmented generation systems miss the back-and-forth of real user interactions. RAG-DIVE addresses this by having one LLM component create ongoing user queries while another checks their quality and coherence. A third component then scores the RAG system on both individual turns and the full dialogue. Experiments show the method detects changes when the RAG system is modified and produces trends similar to those from fixed datasets.

Core claim

RAG-DIVE introduces a three-stage process where an LLM simulates user conversations dynamically, validates them for quality, and then evaluates the RAG system's responses across the entire interaction to produce both turn-level and aggregated multi-turn performance metrics.

What carries the argument

The RAG-DIVE framework, consisting of the Conversation Generator, which creates multi-turn queries; the Conversation Validator, which filters and corrects invalid or low-quality outputs; and the Conversation Evaluator, which computes per-turn and multi-turn metrics.
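The paper publishes no implementation, but the three-stage loop is concrete enough to sketch. Below is a minimal Python sketch of that loop; the prompt wordings, the VALID/INVALID decision rule, the 1-5 judging scale, and the flat-mean aggregation are editorial assumptions rather than the authors' method, and `llm` and `rag_system` are placeholder callables standing in for the simulation model and the system under test.

    # Editorial sketch of the RAG-DIVE loop (Generator -> Validator -> Evaluator).
    # Prompts, decision rules, and the scoring scale are illustrative assumptions,
    # not the authors' implementation. `llm` and `rag_system` are plain callables:
    #   llm(prompt: str) -> str                  (simulation / judge model)
    #   rag_system(query: str, history) -> str   (the system under evaluation)

    def run_rag_dive(rag_system, llm, topic, max_turns=5, max_retries=2):
        """Simulate one multi-turn conversation and score it per turn."""
        history = []      # list of (user_query, rag_answer) pairs
        turn_scores = []  # one numeric score per accepted turn

        for _ in range(max_turns):
            # (1) Conversation Generator: propose the next user query in context.
            query = llm(f"Acting as a user interested in {topic}, write the next "
                        f"question given this conversation so far: {history}")

            # (2) Conversation Validator: filter or repair incoherent queries.
            for _ in range(max_retries):
                verdict = llm(f"Given the conversation {history}, is this follow-up "
                              f"coherent and answerable? Query: {query!r}. "
                              f"Reply VALID or INVALID.")
                if verdict.strip().upper().startswith("VALID"):
                    break
                query = llm(f"Rewrite this query so it is coherent with {history}: {query}")

            answer = rag_system(query, history)
            history.append((query, answer))

            # (3) Conversation Evaluator: per-turn LLM-as-a-judge rating.
            rating = llm(f"Rate the faithfulness and relevance of the answer {answer!r} "
                         f"to the query {query!r} on a 1-5 scale. Reply with one number.")
            turn_scores.append(float(rating.strip()))

        # Aggregate per-turn scores into a single multi-turn metric (flat mean here).
        multi_turn_score = sum(turn_scores) / len(turn_scores)
        return turn_scores, multi_turn_score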

If this is right

  • RAG systems can be tested in adaptive settings that better mirror real usage.
  • Performance differences from system changes become detectable without new static datasets.
  • Evaluations can run repeatedly for consistency checks.
  • Trends from dynamic generation align with those from traditional static evaluations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The approach could extend to other interactive AI systems beyond RAG.
  • Longer conversations might reveal cumulative effects not visible in short static tests.
  • Cost of repeated LLM calls for generation and evaluation may limit scale in practice.

Load-bearing premise

That conversations generated by the LLM and filtered by the validator are a close enough match for how actual users would interact with the RAG system.

What would settle it

Run RAG-DIVE on a deliberately modified RAG system and check whether its metrics track the performance degradations that independent human assessments of the same system detect; if they fail to, the framework's central claim does not hold.
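A minimal sketch of that settling experiment, assuming per-variant RAG-DIVE scores and matching human assessments are on hand: check whether the two rank the system variants the same way. The variant names, the scores, and the 0.7 rank-correlation threshold are hypothetical placeholders, and scipy is an assumed dependency.

    # Sketch of the settling test: do RAG-DIVE's aggregated scores move with
    # independent human assessments across deliberately modified system variants?
    # Variant names, scores, and the 0.7 threshold below are hypothetical.

    from statistics import mean
    from scipy.stats import spearmanr  # assumed dependency

    def tracks_degradation(dive_scores, human_scores, min_rho=0.7):
        """Both arguments: dict mapping variant name -> list of trial scores."""
        variants = sorted(dive_scores)
        dive_means = [mean(dive_scores[v]) for v in variants]
        human_means = [mean(human_scores[v]) for v in variants]
        rho, p_value = spearmanr(dive_means, human_means)
        return rho >= min_rho, rho, p_value

    # Hypothetical example: a baseline plus two deliberately degraded variants.
    dive = {"baseline": [4.1, 4.0, 4.2], "truncated_index": [3.3, 3.4, 3.1], "weaker_llm": [2.9, 3.0, 2.8]}
    human = {"baseline": [4.3, 4.2, 4.1], "truncated_index": [3.5, 3.3, 3.6], "weaker_llm": [2.8, 3.0, 2.7]}
    settled, rho, p = tracks_degradation(dive, human)
    print(f"rank agreement rho={rho:.2f} (p={p:.3f}); tracks human-observed degradation: {settled}")

If the rank agreement breaks, or RAG-DIVE reports degradations that the human raters do not see, the load-bearing premise above is in trouble.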

Figures

Figures reproduced from arXiv: 2604.16310 by Benedikt Dornauer, Jan-Henrik Böttcher, Klaus Schmid, Lorenz Brehme, Mircea-Cristian Racasan, Ruth Breu.

Figure 1. Illustration of RAG-DIVE.
Original abstract

Evaluating Retrieval-Augmented Generation (RAG) systems using static multi-turn datasets fails to capture the dynamic nature of real-world dialogues. Existing evaluation methods rely on predefined datasets, which restrict them to static, one-directional queries and limit their ability to capture the adaptive, context-dependent performance of RAG systems in interactive, multi-turn settings. Thus, we introduce the RAG-DIVE, a Dynamic Interactive Validation and Evaluation approach, that simulates user interactions with RAG systems. RAG-DIVE leverages an LLM to generate multi-turn conversations dynamically and is organized into three components. The dialogue generation stage consists of the (1) Conversation Generator, which simulates a user by creating multi-turn queries, and the (2) Conversation Validator, which filters and corrects invalid or low-quality outputs to ensure coherent conversations. The evaluation stage is handled by the (3) Conversation Evaluator, which assesses the RAG system's performance across the entire dialogue and generates both per-turn and multi-turn metrics that provide an aggregated view of system behavior. We validated RAG-DIVE through two experimental setups. First, we tested a sample RAG system, including human evaluation of dialogue quality, repeated trials to assess consistency, and an ablation study showing that RAG-DIVE detects performance changes caused by system modifications. Second, we compared RAG-DIVE with a traditional static dataset evaluation on an industrial RAG system under different configurations to verify whether both approaches reveal similar performance trends. Our findings demonstrate that RAG-DIVE facilitates dynamic, interaction-driven evaluation for multi-turn conversations, thereby advancing the assessment of RAG systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces RAG-DIVE, a dynamic framework for evaluating RAG systems in multi-turn dialogues. It uses an LLM-based Conversation Generator to simulate user queries, a Validator to filter and correct low-quality outputs for coherence, and an Evaluator to produce per-turn and aggregated multi-turn metrics. Validation occurs in two setups: (1) human evaluation of dialogue quality, consistency checks via repeated trials, and ablation on a sample RAG system to detect performance changes from modifications; (2) trend comparison against static dataset evaluation on an industrial RAG system under varying configurations. The central claim is that this interaction-driven approach advances assessment by capturing adaptive, context-dependent behavior beyond static datasets.

Significance. If the generated dialogues prove representative of real user distributions, RAG-DIVE could meaningfully advance RAG evaluation by enabling adaptive testing of context-dependent retrieval and generation. The human validation and ablation components provide some internal checks, but the absence of distributional matching to human data and lack of quantitative results reduce the immediate significance. The work addresses a real gap in static multi-turn evaluation but requires stronger evidence to shift practice.

major comments (3)
  1. [Experiments (first setup)] The human evaluation of dialogue quality and ablation study (first experimental setup) confirm internal coherence and sensitivity to artificial system changes, but provide no test of whether LLM-generated conversations match real human multi-turn statistical properties (e.g., intent shifts, follow-up coherence, topic drift). This distributional match is load-bearing for the claim that RAG-DIVE enables faithful dynamic evaluation. A sketch of one such check appears after the minor comments below.
  2. [Experiments (comparison setup)] The second experiment reports trend agreement with static dataset evaluation on an industrial RAG system, but this comparison treats the LLM simulation as the reference without independent human dialogue ground truth; any detected deltas could reflect LLM artifacts (e.g., reduced lexical diversity or overly logical follow-ups) rather than genuine RAG performance differences.
  3. [Abstract and Experiments] No quantitative results, metric values, error bars, or details on how per-turn and aggregated metrics are computed appear in the description of either experiment, preventing assessment of effect sizes, consistency, or practical utility of the reported trends.
minor comments (2)
  1. [Method] The flow among the three components (Generator, Validator, Evaluator) would be clearer with a diagram or pseudocode listing the exact LLM prompts and decision rules for validation.
  2. [Related Work] Ensure the related-work section cites recent benchmarks on multi-turn RAG evaluation and human-LLM dialogue alignment studies.
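To make major comment 1 concrete, here is one way such a distributional check could be run, sketched under assumptions: the statistics used (type-token ratio as a lexical-diversity proxy, mean query length, and consecutive-turn lexical overlap as a crude topic-drift proxy) are editorial stand-ins rather than the paper's metrics, and scipy is an assumed dependency.

    # Sketch of the distributional check from major comment 1: compare surface
    # statistics of generated vs. real user dialogues. The chosen statistics are
    # illustrative proxies, not the paper's metrics.

    from scipy.stats import ks_2samp  # assumed dependency

    def dialogue_stats(queries):
        """queries: the list of user query strings from one conversation."""
        tokens = [t.lower() for q in queries for t in q.split()]
        type_token_ratio = len(set(tokens)) / max(len(tokens), 1)
        avg_query_len = sum(len(q.split()) for q in queries) / len(queries)
        drift = [
            1 - len(set(a.lower().split()) & set(b.lower().split()))
                / max(len(set(b.lower().split())), 1)
            for a, b in zip(queries, queries[1:])
        ]
        topic_drift = sum(drift) / max(len(drift), 1)
        return type_token_ratio, avg_query_len, topic_drift

    def compare_corpora(generated, human, alpha=0.05):
        """Each argument: a list of dialogues, each a list of user query strings."""
        report = {}
        names = ["type_token_ratio", "avg_query_len", "topic_drift"]
        for i, name in enumerate(names):
            g = [dialogue_stats(d)[i] for d in generated]
            h = [dialogue_stats(d)[i] for d in human]
            statistic, p_value = ks_2samp(g, h)
            report[name] = {"ks": statistic, "p": p_value, "distinguishable": p_value < alpha}
        return report

If every statistic remains easily distinguishable between the two corpora, the generated dialogues are not yet a stand-in for real users; if none is, the premise gains some support, at least for these proxies.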

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their insightful comments on our paper. We address each of the major comments below and outline the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Experiments (first setup)] The human evaluation of dialogue quality and ablation study (first experimental setup) confirm internal coherence and sensitivity to artificial system changes, but provide no test of whether LLM-generated conversations match real human multi-turn statistical properties (e.g., intent shifts, follow-up coherence, topic drift). This distributional match is load-bearing for the claim that RAG-DIVE enables faithful dynamic evaluation.

    Authors: We acknowledge that demonstrating distributional similarity to real human multi-turn dialogues (e.g., via statistical properties like intent shifts or topic drift) would provide stronger evidence for the representativeness of the generated conversations. Our current validation relies on human assessment of dialogue quality and coherence together with ablation results showing sensitivity to RAG modifications. We will revise the manuscript to explicitly note this as a limitation and outline plans for future distributional comparisons against human dialogue corpora. revision: partial

  2. Referee: [Experiments (comparison setup)] The second experiment reports trend agreement with static dataset evaluation on an industrial RAG system, but this comparison treats the LLM simulation as the reference without independent human dialogue ground truth; any detected deltas could reflect LLM artifacts (e.g., reduced lexical diversity or overly logical follow-ups) rather than genuine RAG performance differences.

    Authors: The second experiment compares performance trends between RAG-DIVE and static evaluation on the same industrial system to illustrate that the dynamic approach yields consistent signals under varying configurations. We do not treat the LLM-generated dialogues as ground truth; rather, we present them as a complementary method. We will clarify this intent and add discussion of possible LLM artifacts as a limitation in the revised text. revision: partial

  3. Referee: [Abstract and Experiments] No quantitative results, metric values, error bars, or details on how per-turn and aggregated metrics are computed appear in the description of either experiment, preventing assessment of effect sizes, consistency, or practical utility of the reported trends.

    Authors: We agree that the current manuscript omits specific quantitative values, error bars, and metric computation details. In the revised version we will add the per-turn and aggregated metric values, standard deviations from repeated trials, and explicit formulas or procedures used to compute them, enabling readers to assess effect sizes and consistency. revision: yes
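To make the revision promised in response 3 concrete, here is a minimal sketch of one way to report aggregated multi-turn metrics with repeated-trial variability. The flat per-conversation mean and the unweighted mean and standard deviation across runs are illustrative choices, and the example numbers are placeholders, not results from the paper.

    # Sketch of aggregation with repeated-trial variability (cf. response 3).
    # The flat mean per conversation and the across-run mean/std are assumptions;
    # the example numbers are placeholders, not results from the paper.

    from statistics import mean, stdev

    def conversation_score(turn_scores):
        """Aggregate per-turn judge ratings (e.g., 1-5) into one dialogue score."""
        return mean(turn_scores)

    def trial_summary(trials):
        """trials: one list of per-turn scores per repeated run of the same setup."""
        per_run = [conversation_score(t) for t in trials]
        return {
            "multi_turn_mean": mean(per_run),
            "multi_turn_std": stdev(per_run) if len(per_run) > 1 else 0.0,
            "runs": len(per_run),
        }

    # Placeholder repeated trials of one configuration:
    print(trial_summary([[4, 5, 3, 4], [4, 4, 4, 3], [5, 4, 3, 4]]))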

Circularity Check

0 steps flagged

No circularity detected in framework definition or validation

Full rationale

The paper introduces RAG-DIVE as a three-component framework (Conversation Generator, Validator, Evaluator) that uses an external LLM to simulate dialogues, followed by human evaluation of quality, consistency trials, ablation on system modifications, and trend comparison against static datasets. No equations, parameters, or derivations are present. No self-citations, uniqueness theorems, or ansatzes are invoked to justify core choices. The central claim that the method enables dynamic evaluation rests on empirical checks against external benchmarks (human raters and static baselines) rather than reducing to any quantity defined or fitted within the paper itself. This is a standard methodological proposal with independent validation steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the unstated premise that LLM-simulated dialogues are sufficiently representative of real users and that the validator produces high-quality data without introducing bias. No free parameters, axioms, or invented entities are explicitly listed in the abstract.

pith-pipeline@v0.9.0 · 5618 in / 1159 out tokens · 20534 ms · 2026-05-16T09:11:30.967978+00:00 · methodology

discussion (0)

