{"work":{"id":"d0c30cd7-81e1-4159-a87f-f6adca77ff08","openalex_id":null,"doi":null,"arxiv_id":"2306.05685","raw_key":null,"title":"Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena","authors":null,"authors_text":"Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang","year":2023,"venue":"cs.CL","abstract":"Evaluating large language model (LLM) based chat assistants is challenging due to their broad capabilities and the inadequacy of existing benchmarks in measuring human preferences. To address this, we explore using strong LLMs as judges to evaluate these models on more open-ended questions. We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases, as well as limited reasoning ability, and propose solutions to mitigate some of them. We then verify the agreement between LLM judges and human preferences by introducing two benchmarks: MT-bench, a multi-turn question set; and Chatbot Arena, a crowdsourced battle platform. Our results reveal that strong LLM judges like GPT-4 can match both controlled and crowdsourced human preferences well, achieving over 80% agreement, the same level of agreement between humans. Hence, LLM-as-a-judge is a scalable and explainable way to approximate human preferences, which are otherwise very expensive to obtain. Additionally, we show our benchmark and traditional benchmarks complement each other by evaluating several variants of LLaMA and Vicuna. 
The MT-bench questions, 3K expert votes, and 30K conversations with human preferences are publicly available at https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge.","external_url":"https://arxiv.org/abs/2306.05685","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-06-30T01:34:09.589492+00:00","pith_arxiv_id":"2306.05685","created_at":"2026-05-08T21:39:24.676569+00:00","updated_at":"2026-06-30T01:34:09.589492+00:00","title_quality_ok":true,"display_title":"Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena","render_title":"Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena"},"hub":{"state":{"work_id":"d0c30cd7-81e1-4159-a87f-f6adca77ff08","tier":"super_hub","tier_reason":"100+ Pith inbound or 10,000+ external citations","pith_inbound_count":215,"external_cited_by_count":null,"distinct_field_count":17,"first_pith_cited_at":"2023-04-24T16:31:06+00:00","last_pith_cited_at":"2026-06-26T18:21:58+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-30T01:59:22.416724+00:00","tier_text":"super_hub"},"tier":"super_hub","role_counts":[{"context_role":"background","n":15},{"context_role":"method","n":10},{"context_role":"dataset","n":4},{"context_role":"baseline","n":1}],"polarity_counts":[{"context_polarity":"background","n":14},{"context_polarity":"use_method","n":9},{"context_polarity":"use_dataset","n":4},{"context_polarity":"unclear","n":2},{"context_polarity":"baseline","n":1}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena","claims":[{"claim_text":"Evaluating large language model (LLM) based chat assistants is challenging due to their broad capabilities and the inadequacy of existing benchmarks in measuring human preferences. 
To address this, we explore using strong LLMs as judges to evaluate these models on more open-ended questions. We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases, as well as limited reasoning ability, and propose solutions to mitigate some of them. We then verify the agreement between LLM judges and human preferences by introducing two benchmarks: MT-be","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"6B achieves a very significant improvement over previous state-of-the-art methods [46, 67, 111] on linear probing. To our knowledge, this represents the currently best linear evaluation results without the JFT dataset [173]. Transfer to Semantic Segmentation. To investigate the pixel-level perceptual capacity of InternViT-6B, we conduct extensive experiments of semantic segmentation on the ADE20K [185] dataset. Following ViT-22B [37], we be- 6 method IN-1K IN-A IN-R IN-V2 IN-Sketch ObjectNet","claim_type":"dataset","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"issues, not just LLM limitations or simple prompt following, and require more than superficial fixes, thereby highlighting the need for structural MAS redesigns. 3 2 Related Work 2.1 Challenges in Agentic Systems The promising capabilities of agentic systems have inspired research into solving specific challenges. For instance, Agent Workflow Memory [22] addresses long-horizon web navigation by introducing workflow memory. 
DSPy [23] tackles issues in programming agentic flows, while StateFlow [2","claim_type":"background","confidence":0.85,"evidence_strength":"citation_context"},{"claim_text":"MotionBench [48] 68.4 68.4 62.8 GLM-4V MVBench [73] 74.4 74.3 76.4 InternVL-2.5 TOMATO [117] 44.7 44.2 46.9∗ Gemini 2.5 Pro TVBench [19] 63.6 61.5 62.6∗ Gemini 2.5 Pro Dream-1K [139] 43.9 42.6 42.0 Tarsier2 TempCompass [82] 83.7 83.1 75.8∗ Gemini 2.5 Pro Long video LongVideoBench [147] 74.0 74.4 66.7 GPT-4o LVBench [142] 64.6 64.0 69.2∗ Gemini 2.5 Pro MLVU [178] 82.1 81.8 81.2∗ Gemini 2.5 Pro VideoMME(w/o sub)[32] 77.9 77.6 87.0∗ Gemini 2.5 Pro TemporalBench [12] 79.8 78.9 73.3 GPT-4o Streaming ","claim_type":"dataset","confidence":0.8,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena because it crossed a citation-hub threshold. Current citing contexts most often use it as dataset evidence (2 contexts).","role_counts":[{"n":2,"context_role":"dataset"},{"n":1,"context_role":"background"}]},"error":null,"updated_at":"2026-05-15T15:28:05.564307+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"5a7fcdc8-d7d4-4530-8705-9dea86ab6528","orcid":null,"display_name":"Lianmin Zheng"},{"id":"4ef8e6aa-db29-4ec4-8285-a4cf7b409b4e","orcid":null,"display_name":"Wei-Lin Chiang"},{"id":"5a46be7c-004b-4f61-b7c8-295640d0378a","orcid":null,"display_name":"Ying Sheng"},{"id":"e54cb289-9eb8-4dad-93b5-185f17758a9b","orcid":null,"display_name":"Siyuan Zhuang"},{"id":"a9e5264f-b9cf-40d1-9426-9312a33621b9","orcid":null,"display_name":"Zhanghao Wu"},{"id":"36f8753e-7cbb-4fd8-8ecd-43113e7c6619","orcid":null,"display_name":"Yonghao 
Zhuang"}]},"error":null,"updated_at":"2026-05-15T15:28:07.107557+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T06:27:12.452905+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":11},{"title":"Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks","work_id":"27eaec54-c105-4969-8188-da5f0fca3688","shared_citers":10},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":10},{"title":"A Survey on LLM-as-a-Judge","work_id":"2676656a-67bd-4ad5-bad6-cb6f5fcdbfbe","shared_citers":9},{"title":"AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation","work_id":"92b7eb9c-c3d8-4518-a376-06fa15dd895b","shared_citers":9},{"title":"ReAct: Synergizing Reasoning and Acting in Language Models","work_id":"407a2351-25f1-497d-b611-f77d0292a8e6","shared_citers":9},{"title":"Training language models to follow instructions with human feedback","work_id":"52aff42f-4fa9-4fcf-bdb3-1459b9bebf65","shared_citers":9},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":8},{"title":"Holistic Evaluation of Language Models","work_id":"cc02a01e-7218-47dc-8e66-3333e7e4adec","shared_citers":8},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":8},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":8},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":7},{"title":"WebArena: A Realistic Web Environment for Building Autonomous 
Agents","work_id":"7058ffd2-a339-4102-89eb-248eeb074652","shared_citers":7},{"title":"AgentBench: Evaluating LLMs as Agents","work_id":"a37549b4-4c94-412d-acc4-4efeb08509be","shared_citers":6},{"title":"Constitutional AI: Harmlessness from AI Feedback","work_id":"faaaa4e0-2676-4fac-a0b4-99aef10d2095","shared_citers":6},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":6},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":6},{"title":"LoRA: Low-Rank Adaptation of Large Language Models","work_id":"0426219a-789e-4964-adc8-a04538510818","shared_citers":6},{"title":"Measuring Massive Multitask Language Understanding","work_id":"e87ec49a-544b-4ec8-8991-75298c64ff5e","shared_citers":6},{"title":"MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework","work_id":"891b9780-a800-4e3c-bba0-53597ab8dc98","shared_citers":6},{"title":"Mistral 7B","work_id":"eb5e1305-ad11-4875-ad8d-ad8b8f697599","shared_citers":6},{"title":"Reflexion: Language Agents with Verbal Reinforcement Learning","work_id":"778f739e-5f55-4961-8a2a-e4736a2757f4","shared_citers":6},{"title":"Self-Consistency Improves Chain of Thought Reasoning in Language Models","work_id":"8c6d5a6b-b5cc-4105-9c84-9c34bb9375bb","shared_citers":6},{"title":"SWE-bench: Can Language Models Resolve Real-World GitHub Issues?","work_id":"d0effe15-a689-441a-8e3f-ea35f1c4e4b1","shared_citers":6}],"time_series":[{"n":6,"year":2023},{"n":2,"year":2024},{"n":3,"year":2025},{"n":77,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T06:37:44.684648+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical 
Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T06:27:04.120326+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena","claims":[{"claim_text":"Evaluating large language model (LLM) based chat assistants is challenging due to their broad capabilities and the inadequacy of existing benchmarks in measuring human preferences. To address this, we explore using strong LLMs as judges to evaluate these models on more open-ended questions. We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases, as well as limited reasoning ability, and propose solutions to mitigate some of them. We then verify the agreement between LLM judges and human preferences by introducing two benchmarks: MT-be","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"6B achieves a very significant improvement over previous state-of-the-art methods [46, 67, 111] on linear probing. To our knowledge, this represents the currently best linear evaluation results without the JFT dataset [173]. Transfer to Semantic Segmentation. To investigate the pixel-level perceptual capacity of InternViT-6B, we conduct extensive experiments of semantic segmentation on the ADE20K [185] dataset. Following ViT-22B [37], we be- 6 method IN-1K IN-A IN-R IN-V2 IN-Sketch ObjectNet","claim_type":"dataset","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"issues, not just LLM limitations or simple prompt following, and require more than superficial fixes, thereby highlighting the need for structural MAS redesigns. 
3 2 Related Work 2.1 Challenges in Agentic Systems The promising capabilities of agentic systems have inspired research into solving specific challenges. For instance, Agent Workflow Memory [22] addresses long-horizon web navigation by introducing workflow memory. DSPy [23] tackles issues in programming agentic flows, while StateFlow [2","claim_type":"background","confidence":0.85,"evidence_strength":"citation_context"},{"claim_text":"MotionBench [48] 68.4 68.4 62.8 GLM-4V MVBench [73] 74.4 74.3 76.4 InternVL-2.5 TOMATO [117] 44.7 44.2 46.9∗ Gemini 2.5 Pro TVBench [19] 63.6 61.5 62.6∗ Gemini 2.5 Pro Dream-1K [139] 43.9 42.6 42.0 Tarsier2 TempCompass [82] 83.7 83.1 75.8∗ Gemini 2.5 Pro Long video LongVideoBench [147] 74.0 74.4 66.7 GPT-4o LVBench [142] 64.6 64.0 69.2∗ Gemini 2.5 Pro MLVU [178] 82.1 81.8 81.2∗ Gemini 2.5 Pro VideoMME(w/o sub)[32] 77.9 77.6 87.0∗ Gemini 2.5 Pro TemporalBench [12] 79.8 78.9 73.3 GPT-4o Streaming ","claim_type":"dataset","confidence":0.8,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena because it crossed a citation-hub threshold. Current citing contexts most often use it as dataset evidence (2 contexts).","role_counts":[{"n":2,"context_role":"dataset"},{"n":1,"context_role":"background"}]},"error":null,"updated_at":"2026-05-15T15:28:07.112619+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena","claims":[{"claim_text":"Evaluating large language model (LLM) based chat assistants is challenging due to their broad capabilities and the inadequacy of existing benchmarks in measuring human preferences. To address this, we explore using strong LLMs as judges to evaluate these models on more open-ended questions. 
We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases, as well as limited reasoning ability, and propose solutions to mitigate some of them. We then verify the agreement between LLM judges and human preferences by introducing two benchmarks: MT-be","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T06:37:36.284010+00:00"}},"summary":{"title":"Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena","claims":[{"claim_text":"Evaluating large language model (LLM) based chat assistants is challenging due to their broad capabilities and the inadequacy of existing benchmarks in measuring human preferences. To address this, we explore using strong LLMs as judges to evaluate these models on more open-ended questions. We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases, as well as limited reasoning ability, and propose solutions to mitigate some of them. 
We then verify the agreement between LLM judges and human preferences by introducing two benchmarks: MT-be","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":11},{"title":"Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks","work_id":"27eaec54-c105-4969-8188-da5f0fca3688","shared_citers":10},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":10},{"title":"A Survey on LLM-as-a-Judge","work_id":"2676656a-67bd-4ad5-bad6-cb6f5fcdbfbe","shared_citers":9},{"title":"AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation","work_id":"92b7eb9c-c3d8-4518-a376-06fa15dd895b","shared_citers":9},{"title":"ReAct: Synergizing Reasoning and Acting in Language Models","work_id":"407a2351-25f1-497d-b611-f77d0292a8e6","shared_citers":9},{"title":"Training language models to follow instructions with human feedback","work_id":"52aff42f-4fa9-4fcf-bdb3-1459b9bebf65","shared_citers":9},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":8},{"title":"Holistic Evaluation of Language Models","work_id":"cc02a01e-7218-47dc-8e66-3333e7e4adec","shared_citers":8},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":8},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":8},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":7},{"title":"WebArena: A Realistic Web Environment for Building Autonomous 
Agents","work_id":"7058ffd2-a339-4102-89eb-248eeb074652","shared_citers":7},{"title":"AgentBench: Evaluating LLMs as Agents","work_id":"a37549b4-4c94-412d-acc4-4efeb08509be","shared_citers":6},{"title":"Constitutional AI: Harmlessness from AI Feedback","work_id":"faaaa4e0-2676-4fac-a0b4-99aef10d2095","shared_citers":6},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":6},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":6},{"title":"LoRA: Low-Rank Adaptation of Large Language Models","work_id":"0426219a-789e-4964-adc8-a04538510818","shared_citers":6},{"title":"Measuring Massive Multitask Language Understanding","work_id":"e87ec49a-544b-4ec8-8991-75298c64ff5e","shared_citers":6},{"title":"MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework","work_id":"891b9780-a800-4e3c-bba0-53597ab8dc98","shared_citers":6},{"title":"Mistral 7B","work_id":"eb5e1305-ad11-4875-ad8d-ad8b8f697599","shared_citers":6},{"title":"Reflexion: Language Agents with Verbal Reinforcement Learning","work_id":"778f739e-5f55-4961-8a2a-e4736a2757f4","shared_citers":6},{"title":"Self-Consistency Improves Chain of Thought Reasoning in Language Models","work_id":"8c6d5a6b-b5cc-4105-9c84-9c34bb9375bb","shared_citers":6},{"title":"SWE-bench: Can Language Models Resolve Real-World GitHub Issues?","work_id":"d0effe15-a689-441a-8e3f-ea35f1c4e4b1","shared_citers":6}],"time_series":[{"n":6,"year":2023},{"n":2,"year":2024},{"n":3,"year":2025},{"n":77,"year":2026}],"dependency_candidates":[]},"authors":[{"id":"5a7fcdc8-d7d4-4530-8705-9dea86ab6528","orcid":null,"display_name":"Lianmin Zheng","source":"manual","import_confidence":0.72},{"id":"e54cb289-9eb8-4dad-93b5-185f17758a9b","orcid":null,"display_name":"Siyuan Zhuang","source":"manual","import_confidence":0.72},{"id":"4ef8e6aa-db29-4ec4-8285-a4cf7b409b4e","orcid":null,"display_name":"Wei-Lin 
Chiang","source":"manual","import_confidence":0.72},{"id":"5a46be7c-004b-4f61-b7c8-295640d0378a","orcid":null,"display_name":"Ying Sheng","source":"manual","import_confidence":0.72},{"id":"36f8753e-7cbb-4fd8-8ecd-43113e7c6619","orcid":null,"display_name":"Yonghao Zhuang","source":"manual","import_confidence":0.72},{"id":"a9e5264f-b9cf-40d1-9426-9312a33621b9","orcid":null,"display_name":"Zhanghao Wu","source":"manual","import_confidence":0.72}]}}