Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

2023

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

RTLC -- Research, Teach-to-Learn, Critique: A three-stage prompting paradigm inspired by the Feynman Learning Technique that lifts LLM-as-judge accuracy on JudgeBench with no fine-tuning

cs.CL · 2026-05-13 · unverdicted · novelty 6.0

RTLC prompting lifts Claude 3.7 Sonnet pairwise accuracy on 350 hard JudgeBench items from 64.6% to 78.6% via a Research-Teach-Critique scaffold that beats self-consistency.

Two Ways to De-Bias an LLM-as-a-Judge: A Continuous-Score Comparison of Hierarchical Bayesian Calibration and Neural-ODE Score Transport

cs.CL · 2026-05-09 · unverdicted · novelty 5.0

With 100 anchors the Bayesian linear corrector matches or beats the Neural-ODE flow on distribution recovery while both fix mean offset; with 1500 anchors the flow wins on MAE, Pearson correlation, and KL divergence.

citing papers explorer

Showing 2 of 2 citing papers.

RTLC -- Research, Teach-to-Learn, Critique: A three-stage prompting paradigm inspired by the Feynman Learning Technique that lifts LLM-as-judge accuracy on JudgeBench with no fine-tuning

cs.CL · 2026-05-13 · unverdicted · none · ref 3

Two Ways to De-Bias an LLM-as-a-Judge: A Continuous-Score Comparison of Hierarchical Bayesian Calibration and Neural-ODE Score Transport

cs.CL · 2026-05-09 · unverdicted · none · ref 1
