LLMs Cannot Reliably Judge (Yet?): A Comprehensive Assessment on the Robustness of LLM-as-a-Judge

LLMs Cannot Reliably Judge (Yet?): A Comprehensive Assessment on the Robustness of LLM-as-a-Judge · 2025 · arXiv 2506.09443

7 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary: background 2

citation-polarity summary: background 2

representative citing papers

RankJudge: A Multi-Turn LLM-as-a-Judge Synthetic Benchmark Generator

cs.CL · 2026-05-20 · unverdicted · novelty 7.0

RankJudge creates paired multi-turn conversations with isolated single-turn flaws to generate unambiguous benchmarks for LLM-as-a-judge systems across ML, biomedicine, and finance domains.

MCJudgeBench: A Benchmark for Constraint-Level Judge Evaluation in Multi-Constraint Instruction Following

cs.CL · 2026-05-05 · unverdicted · novelty 7.0

MCJudgeBench evaluates LLM judges at the constraint level with gold labels and inconsistency metrics, showing that strong overall performance does not ensure reliable detection of "partial" or "no" cases or stability under perturbations.

Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges

cs.AI · 2026-05-07 · unverdicted · novelty 6.0

LLM safety judges flip verdicts on equivalent policy rewrites up to 9.1% of the time and cannot distinguish meaningful from meaningless changes, requiring new invariance-based reliability metrics.

AutoPyVerifier: Learning Compact Executable Verifiers for Large Language Model Outputs

cs.CL · 2026-04-24 · unverdicted · novelty 6.0

AutoPyVerifier learns compact sets of executable Python verifiers from labeled LLM outputs via LLM synthesis and DAG search, improving objective prediction by up to 55 F1 points and downstream LLM accuracy by up to 17 points.

Label Effects: Shared Heuristic Reliance in Trust Assessment by Humans and LLM-as-a-Judge

cs.AI · 2026-04-07 · unverdicted · novelty 6.0

Both humans and LLMs trust content more when labeled human-authored than AI-generated, with LLMs showing denser attention to labels and higher uncertainty under AI labels, mirroring human heuristic patterns.

Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

cs.LG · 2026-04-15 · unverdicted · novelty 5.0

The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.

Proxy Reward Internalization and Mechanistic Exploitation: A Learned Precursor to Reward Hacking and Its Generalization

cs.AI · 2026-06-08 · unverdicted · novelty 4.0

Proxy RL produces a staged proxy-internalization capability that emerges before and predicts reward hacking in coding environments.

citing papers explorer

RankJudge: A Multi-Turn LLM-as-a-Judge Synthetic Benchmark Generator · cs.CL · 2026-05-20 · unverdicted · none · ref 40
MCJudgeBench: A Benchmark for Constraint-Level Judge Evaluation in Multi-Constraint Instruction Following · cs.CL · 2026-05-05 · unverdicted · none · ref 1
Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges · cs.AI · 2026-05-07 · unverdicted · none · ref 25
AutoPyVerifier: Learning Compact Executable Verifiers for Large Language Model Outputs · cs.CL · 2026-04-24 · unverdicted · none · ref 14
Label Effects: Shared Heuristic Reliance in Trust Assessment by Humans and LLM-as-a-Judge · cs.AI · 2026-04-07 · unverdicted · none · ref 27
Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges · cs.LG · 2026-04-15 · unverdicted · none · ref 42
Proxy Reward Internalization and Mechanistic Exploitation: A Learned Precursor to Reward Hacking and Its Generalization · cs.AI · 2026-06-08 · unverdicted · none · ref 138
