Judging llm-as-a-judge with mt-bench and chatbot arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al · 2024

4 Pith papers cite this work. Polarity classification is still indexing.

4 Pith papers citing it

browse 4 citing papers

representative citing papers

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

cs.AI · 2024-08-12 · unverdicted · novelty 8.0

The AI Scientist framework enables LLMs to independently conduct the full scientific process from idea generation to paper writing and review, demonstrated across three ML subfields with papers costing under $15 each.

MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

cs.CV · 2024-06-13 · conditional · novelty 7.0

MuirBench is a new benchmark showing that top multimodal LLMs struggle with robust multi-image understanding, with GPT-4o at 68% and open-source models below 33% accuracy.

WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs

cs.CL · 2024-06-26 · conditional · novelty 6.0

WildGuard is a new open moderation model and dataset for LLM safety that identifies harmful prompts, risky responses, and refusal rates, achieving SOTA open-source performance and sometimes exceeding GPT-4 while cutting jailbreak success from 79.8% to 2.4%.

Multiple Choice Questions: Reasoning Makes Large Language Models (LLMs) More Self-Confident, Especially When They are Wrong

cs.CL · 2025-01-16 · unverdicted · novelty 5.0

Reasoning before answering MCQs increases LLM confidence more for incorrect answers and degrades calibration on a 57-subject benchmark across seven models.

citing papers explorer

Showing 4 of 4 citing papers.

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery cs.AI · 2024-08-12 · unverdicted · none · ref 113
The AI Scientist framework enables LLMs to independently conduct the full scientific process from idea generation to paper writing and review, demonstrated across three ML subfields with papers costing under $15 each.
MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding cs.CV · 2024-06-13 · conditional · none · ref 70
MuirBench is a new benchmark showing that top multimodal LLMs struggle with robust multi-image understanding, with GPT-4o at 68% and open-source models below 33% accuracy.
WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs cs.CL · 2024-06-26 · conditional · none · ref 43
WildGuard is a new open moderation model and dataset for LLM safety that identifies harmful prompts, risky responses, and refusal rates, achieving SOTA open-source performance and sometimes exceeding GPT-4 while cutting jailbreak success from 79.8% to 2.4%.
Multiple Choice Questions: Reasoning Makes Large Language Models (LLMs) More Self-Confident, Especially When They are Wrong cs.CL · 2025-01-16 · unverdicted · none · ref 3
Reasoning before answering MCQs increases LLM confidence more for incorrect answers and degrades calibration on a 57-subject benchmark across seven models.

Judging llm-as-a-judge with mt-bench and chatbot arena

fields

years

verdicts

representative citing papers

citing papers explorer