super hub Mixed citations

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Lianmin Zheng, Siyuan Zhuang, Wei-Lin Chiang, Ying Sheng, Yonghao Zhuang, Zhanghao Wu · 2023 · cs.CL · arXiv 2306.05685

Mixed citation behavior. Most common role is background (47%).

204 Pith papers citing it

Background 47% of classified citations

open full Pith review browse 204 citing papers more from Lianmin Zheng arXiv PDF

abstract

Evaluating large language model (LLM) based chat assistants is challenging due to their broad capabilities and the inadequacy of existing benchmarks in measuring human preferences. To address this, we explore using strong LLMs as judges to evaluate these models on more open-ended questions. We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases, as well as limited reasoning ability, and propose solutions to mitigate some of them. We then verify the agreement between LLM judges and human preferences by introducing two benchmarks: MT-bench, a multi-turn question set; and Chatbot Arena, a crowdsourced battle platform. Our results reveal that strong LLM judges like GPT-4 can match both controlled and crowdsourced human preferences well, achieving over 80% agreement, the same level of agreement between humans. Hence, LLM-as-a-judge is a scalable and explainable way to approximate human preferences, which are otherwise very expensive to obtain. Additionally, we show our benchmark and traditional benchmarks complement each other by evaluating several variants of LLaMA and Vicuna. The MT-bench questions, 3K expert votes, and 30K conversations with human preferences are publicly available at https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 15 method 10 dataset 4 baseline 1

citation-polarity summary

background 14 use method 9 use dataset 4 unclear 2 baseline 1

claims ledger

abstract Evaluating large language model (LLM) based chat assistants is challenging due to their broad capabilities and the inadequacy of existing benchmarks in measuring human preferences. To address this, we explore using strong LLMs as judges to evaluate these models on more open-ended questions. We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases, as well as limited reasoning ability, and propose solutions to mitigate some of them. We then verify the agreement between LLM judges and human preferences by introducing two benchmarks: MT-be

authors

Lianmin Zheng Siyuan Zhuang Wei-Lin Chiang Ying Sheng Yonghao Zhuang Zhanghao Wu

co-cited works

representative citing papers

Why Do Multi-Agent LLM Systems Fail?

cs.AI · 2025-03-17 · unverdicted · novelty 8.0

The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.

ORPO: Monolithic Preference Optimization without Reference Model

cs.CL · 2024-03-12 · conditional · novelty 8.0

ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.

LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

cs.CL · 2023-08-28 · unverdicted · novelty 8.0

LongBench is the first bilingual multi-task benchmark for long context understanding in LLMs, containing 21 datasets in 6 categories with average lengths of 6711 words (English) and 13386 characters (Chinese).

Universal and Transferable Adversarial Attacks on Aligned Language Models

cs.CL · 2023-07-27 · accept · novelty 8.0

Gradient and greedy search over token suffixes produces universal, transferable adversarial prompts that elicit objectionable outputs from aligned models including black-box commercial systems.

AI Fiction in the Wild

cs.CL · 2026-06-22 · unverdicted · novelty 7.0

Analysis of 500k ChatGPT logs shows over one-third of conversations generate fiction, dominated by power users with repetitive and niche patterns.

Low-Resource Safety Failures Are Action Failures, Not Representation Failures

cs.CL · 2026-05-31 · conditional · novelty 7.0

Low-resource safety failures are action failures because the harmfulness representation transfers but the decision calibration does not; this is fixed by recalibrating a high-resource gate with 1-4 target-language examples.

RWGBench: Evaluating Scholarly Positioning in Related Work Generation

cs.DL · 2026-05-30 · unverdicted · novelty 7.0

RWGBench is a citation-centric benchmark for related work generation built from 40k CS papers and a 100-paper test set, with multi-dimensional metrics that better match human expert judgment than standard similarity scores.

Bounded Behavioral Indistinguishability for Black-Box LLM Distillation

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

Introduces (ε,q,t,A)-behavioral indistinguishability and shows via Qwen/Llama experiments that LoRA distillation boosts semantic similarity but leaves detectable behavioral differences under adversarial evaluation.

Asking For An Old Friend: Diagnosing and Mitigating Temporal Failure Modes in LLM-based Statutory Question Answering

cs.CL · 2026-05-22 · unverdicted · novelty 7.0

LLMs show severe staleness after training cutoffs and recency bias on historical German statutes; RAG with version filtering mitigates both better than web search.

TO-Agents: A Multi-Agent AI Pipeline for Preference-Guided Topology Optimization

cs.AI · 2026-05-20 · unverdicted · novelty 7.0

A multi-agent pipeline iteratively refines topology optimization outputs to match natural language preferences for branched structures, achieving 60% success rate across replicates in cantilever and phone-stand tasks.

A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents

cs.AI · 2026-05-19 · unverdicted · novelty 7.0

Introduces the stochastic-deterministic boundary (SDB) as a load-bearing primitive for LLM agent runtimes and provides a five-step methodology plus catalog of six patterns adapted from distributed systems.

DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows

cs.AI · 2026-05-18 · unverdicted · novelty 7.0

DecisionBench supplies a fixed task suite, model pool, delegation interface, and multi-axis metrics to evaluate emergent delegation, showing similar quality across awareness conditions but 15-31 point headroom under perfect delegation.

SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science

cs.AI · 2026-05-18 · unverdicted · novelty 7.0

SCICONVBENCH is a new benchmark evaluating LLMs on multi-turn disambiguation and inconsistency resolution for task formulation in computational science, with frontier models reaching only 52.7% success on fluid mechanics disambiguation cases.

Recall Isn't Enough: Bounding Commitments in Personalized Language Systems

cs.AI · 2026-05-15 · unverdicted · novelty 7.0

CBEA with LCV bounds evidence sets and validates commitments before response generation, achieving zero failures in scoped tests at 0.49-0.60 availability versus near-zero for baselines.

Belief Engine: Configurable and Inspectable Stance Dynamics in Multi-Agent LLM Deliberation

cs.AI · 2026-05-14 · unverdicted · novelty 7.0

Belief Engine is a configurable belief-update mechanism for multi-agent LLM systems that uses structured argument extraction and log-odds stance updates to make evidence-grounded deliberation inspectable and controllable.

Test-Time Hinting for Black-Box Vision-Language Models

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

Test-Time Hinting trains a hint generator to prepend contextual guidance to VLM prompts, improving accuracy on natural-image VQA benchmarks with generalization to unseen tasks and models.

Likelihood scoring for continuations of mathematical text: a self-supervised benchmark with tests for shortcut vulnerabilities

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

Presents a likelihood-based benchmark for equation-suffix prediction in technical papers with controls to detect shortcut vulnerabilities in model forecasts.

Collective Alignment in LLM Multi-Agent Systems: Disentangling Bias from Cooperation via Statistical Physics

cond-mat.stat-mech · 2026-05-11 · unverdicted · novelty 7.0

LLM multi-agent systems on lattices show bias-driven order-disorder crossovers instead of true phase transitions, with extracted effective couplings and fields serving as model-specific fingerprints.

SlimSpec: Low-Rank Draft LM-Head for Accelerated Speculative Decoding

cs.LG · 2026-05-11 · unverdicted · novelty 7.0 · 2 refs

SlimSpec replaces the standard LM-head in draft models with a low-rank version to deliver 4-5x faster speculative decoding while preserving full vocabulary and competitive acceptance rates.

ProactBench: Beyond What The User Asked For

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

ProactBench measures LLM conversational proactivity in three phases using 198 multi-agent dialogues and finds recovery behavior hard to predict from existing benchmarks.

Causal Stories from Sensor Traces: Auditing Epistemic Overreach in LLM-Generated Personal Sensing Explanations

cs.HC · 2026-05-09 · accept · novelty 7.0

LLMs routinely produce unsupported causal stories for personal sensing anomalies, and richer evidence or constrained prompts do not reliably eliminate this epistemic overreach.

GameGen-Verifier: Parallel Keypoint-Based Verification for LLM-Generated Games via Runtime State Injection

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

GameGen-Verifier decomposes game specifications into keypoints, injects runtime states for targeted checks, and achieves 92.2% accuracy on 100 games while running up to 16.6x faster than agent-based baselines.

Instruction Tuning Changes How Upstream State Conditions Late Readout: A Cross-Patching Diagnostic

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Instruction tuning makes late-layer computation depend more on the model's own post-trained upstream state than on base-model upstream state, producing a consistent +1.68 logit interaction effect across five model families.

EditPropBench: Measuring Factual Edit Propagation in Scientific Manuscripts

cs.CL · 2026-05-03 · unverdicted · novelty 7.0

EditPropBench evaluates LLM editors on propagating factual edits to dependent claims in synthetic scientific manuscripts, showing that even the strongest systems miss roughly 30% of required updates on hard cases.

citing papers explorer

Showing 29 of 29 citing papers after filters.

Why Do Multi-Agent LLM Systems Fail? cs.AI · 2025-03-17 · unverdicted · none · ref 22 · internal anchor
The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
VLegal-Bench: Cognitively Grounded Benchmark for Vietnamese Legal Reasoning of Large Language Models cs.CL · 2025-12-16 · conditional · none · ref 27 · internal anchor
VLegal-Bench supplies 10,450 expert-validated samples for evaluating LLMs on Vietnamese legal questions, retrieval, multi-step reasoning, and scenario solving.
Evalet: Evaluating Large Language Models through Functional Fragmentation cs.HC · 2025-09-14 · conditional · none · ref 100 · internal anchor
Evalet applies functional fragmentation to deliver fragment-level qualitative analysis of LLM evaluations, with a user study showing 48% more misalignment detections than holistic scoring.
LLMs Judging LLMs: A Simplex Perspective cs.LG · 2025-05-28 · unverdicted · none · ref 1 · internal anchor
A simplex geometry framework shows when LLM-as-judge rankings are identifiable, provides a basis for preferring binary over multi-level scoring, and uses Bayesian priors to model epistemic uncertainty about judges.
TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models cs.CL · 2025-04-29 · unverdicted · none · ref 42 · internal anchor
The authors generate and publicly release the first large-scale open dataset of three million structured moral fables produced by small open language models together with a reproducible LLM-judge evaluation pipeline.
PRIMETIME : Limits of LLMs in Temporal Primitives cs.NE · 2025-04-22 · unverdicted · none · ref 50 · internal anchor
PRIMETIME generator reveals that LLM datetime parsing and arithmetic primitives are individually unreliable but fully learnable via fine-tuning, enabling frontier-level accuracy on event planning with small LoRA models.
Response-Based Knowledge Distillation for Multilingual Jailbreak Prevention Unwittingly Compromises Safety cs.CL · 2025-12-08 · unverdicted · none · ref 52 · internal anchor
Distilling safe refusal behavior from OpenAI o1-mini into Llama-3, Gemma-2, and Qwen3 models via response-based LoRA on multilingual jailbreak data increases jailbreak success rates on MultiJail by up to 16.6 points.
Reading Between the Lines: The One-Sided Conversation Problem cs.CL · 2025-11-04 · unverdicted · none · ref 5 · internal anchor
The one-sided conversation problem is introduced with empirical results showing that future-turn access and utterance length improve missing-turn reconstruction while high-quality summaries are possible without reconstruction on standard dialogue datasets.
LLM Prompt Duel Optimizer: Efficient Label-Free Prompt Optimization cs.CL · 2025-10-14 · unverdicted · none · ref 5 · internal anchor
Prompt Duel Optimizer uses dueling bandits and LLM-as-judge pairwise feedback with Double Thompson Sampling and top-performer mutation to find stronger prompts than label-free baselines on BBH and MS MARCO under limited comparison budgets.
Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation cs.AI · 2025-10-05 · unverdicted · none · ref 33 · internal anchor
A Dirichlet-prior Bayesian estimator for model success probability replaces Pass@k, delivering faster-converging and more stable rankings with credible intervals on math benchmarks.
QuiLL: An LLM-Based Vulnerability Assessment Framework for the Wild cs.CR · 2025-10-05 · unverdicted · none · ref 49 · internal anchor
QuiLL is a new evaluation pipeline that uses optimized LLM prompts, dynamic in-context learning from an NVD vector store, and a novel accuracy-plus-reasoning metric to benchmark vulnerability detection in real code.
Efficient and Transferable Agentic Knowledge Graph RAG via Reinforcement Learning cs.CL · 2025-09-30 · unverdicted · none · ref 45 · internal anchor
KG-R1 trains a single RL agent to retrieve from and reason over knowledge graphs in one loop, achieving higher accuracy with fewer tokens than multi-module baselines and transferring to unseen graphs.
Fin-PRM: A Domain-Specialized Process Reward Model for Financial Reasoning in Large Language Models cs.CL · 2025-08-21 · unverdicted · none · ref 30 · internal anchor
Fin-PRM is a domain-specialized process reward model that supplies binary step-level and trajectory-level supervision signals for financial reasoning in LLMs and outperforms general PRMs on CFLUE and FinQA benchmarks.
Listener-Rewarded Thinking in VLMs for Image Preferences cs.CV · 2025-06-28 · unverdicted · none · ref 34 · internal anchor
Listener-augmented GRPO uses an independent frozen VLM to provide dense confidence scores on reasoning traces, yielding 67.4% accuracy on ImageReward, up to +6% OOD gains on 1.2M-vote human data, and fewer reasoning contradictions.
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents cs.CL · 2025-06-13 · conditional · none · ref 36 · internal anchor
DeepResearch Bench supplies 100 expert-crafted PhD-level tasks and two human-aligned evaluation frameworks to measure deep research agents on report quality and citation accuracy.
Latent Trajectory Dynamics in Large Language Models: A Manifold Evolution Framework with Empirical Validation cs.CL · 2025-05-24 · unverdicted · none · ref 19 · internal anchor
DMET models LLM generation as controlled dynamical trajectories on a semantic manifold, with three proxy metrics that predict output quality and support adaptive decoding to lower perplexity.
Tuning Language Models for Robust Prediction of Diverse User Behaviors cs.CL · 2025-05-23 · unverdicted · none · ref 48 · internal anchor
BehaviorLM applies progressive fine-tuning in two stages to let LLMs predict both frequent anchor and rare tail user behaviors more robustly on real-world datasets.
PEFT-Factory: Unified Parameter-Efficient Fine-Tuning of Autoregressive Large Language Models cs.CL · 2025-12-02 · unverdicted · none · ref 85 · internal anchor
PEFT-Factory supplies a ready-to-use, extensible codebase that unifies 19 PEFT methods and evaluation pipelines for fine-tuning large autoregressive language models.
Patching LLM Like Software: A Lightweight Method for Improving Safety Policy in Large Language Models cs.AI · 2025-11-11 · unverdicted · none · ref 3 · 2 links · internal anchor
Prepending a compact learnable prefix to LLMs produces safety gains comparable to next-generation aligned models while preserving fluency and adding negligible parameters.
ASTRA: An Automated Framework for Strategy Discovery, Retrieval, and Evolution for Jailbreaking LLMs cs.CR · 2025-11-04 · unverdicted · none · ref 60 · internal anchor
ASTRA is an automated closed-loop framework that discovers, retrieves, and evolves jailbreak attack strategies for LLMs using a dynamic three-tier strategy library and outperforms baselines in black-box settings.
MARS-SQL: A multi-agent reinforcement learning framework for Text-to-SQL cs.CL · 2025-11-02 · unverdicted · none · ref 42 · internal anchor
MARS-SQL trains a multi-agent RL system with ReAct-style interaction and generative validation to produce SQL queries, reaching 77.84% execution accuracy on BIRD dev and 89.75% on Spider test.
HFX: Joint Design of Algorithms and Systems for Multi-SLO Serving and Fast Scaling cs.DC · 2025-08-21 · unverdicted · none · ref 55 · internal anchor
HFX jointly designs scheduling and scaling for multi-SLO LLM serving, achieving up to 4.44x higher SLO attainment, 65.82% lower latency, and 49.81% lower cost than prior systems on multi-task workloads.
Mitigating Hallucination in Large Vision-Language Models via Adaptive Attention Calibration cs.CV · 2025-05-27 · unverdicted · none · ref 24 · internal anchor
CAAC mitigates hallucinations in LVLMs via Visual-Token Calibration and Adaptive Attention Re-Scaling guided by model confidence, showing gains on CHAIR, AMBER, and POPE especially in long-form generation.
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities cs.CL · 2025-07-07 · unverdicted · none · ref 11 · internal anchor
Gemini 2.5 Pro and Flash models are presented as achieving frontier performance in reasoning, coding, and long-context multimodal tasks while spanning a cost-capability Pareto curve.
Sensorimotor Self-Recognition in Multimodal Large Language Model-Driven Robots cs.AI · 2025-05-25 · unverdicted · none · ref 39 · internal anchor
Multimodal LLMs in robots develop self-identification and predictive awareness through sensorimotor loops, with structural equation modeling linking sensory integration to dimensions of the minimal self.
Seed1.5-VL Technical Report cs.CV · 2025-05-11 · unverdicted · none · ref 178 · internal anchor
Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.
Towards EnergyGPT: A Large Language Model Specialized for the Energy Sector cs.CL · 2025-09-08 · unverdicted · none · ref 61 · internal anchor
Fine-tuned LLaMA 3.1-8B variants for the energy sector outperform the base model on domain QA benchmarks, with LoRA delivering similar gains at lower training cost.
Aligning Deep Implicit Preferences by Learning to Reason Defensively cs.AI · 2025-10-13 · unreviewed · ref 1 · internal anchor
Are Large Pre-trained Vision Language Models Effective Construction Safety Inspectors? cs.CV · 2025-08-14 · unreviewed · ref 31 · internal anchor

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer