RankJudge creates paired multi-turn conversations with isolated single-turn flaws to generate unambiguous benchmarks for LLM-as-a-judge systems across ML, biomedicine, and finance domains.
LLMs Cannot Reliably Judge (Yet?): A Comprehensive Assessment on the Robustness of LLM-as-a-Judge
7 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 7verdicts
UNVERDICTED 7roles
background 2polarities
background 2representative citing papers
MCJudgeBench evaluates LLM judges at the constraint level with gold labels and inconsistency metrics, showing that overall performance does not ensure reliable detection of partial or no cases or stability under perturbations.
LLM safety judges flip verdicts on equivalent policy rewrites up to 9.1% of the time and cannot distinguish meaningful from meaningless changes, requiring new invariance-based reliability metrics.
AutoPyVerifier learns compact sets of executable Python verifiers from labeled LLM outputs via LLM synthesis and DAG search, improving objective prediction by up to 55 F1 points and downstream LLM accuracy by up to 17 points.
Both humans and LLMs trust content more when labeled human-authored than AI-generated, with LLMs showing denser attention to labels and higher uncertainty under AI labels, mirroring human heuristic patterns.
The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.
Proxy RL produces a staged proxy-internalization capability that emerges before and predicts reward hacking in coding environments.
citing papers explorer
-
RankJudge: A Multi-Turn LLM-as-a-Judge Synthetic Benchmark Generator
RankJudge creates paired multi-turn conversations with isolated single-turn flaws to generate unambiguous benchmarks for LLM-as-a-judge systems across ML, biomedicine, and finance domains.
-
MCJudgeBench: A Benchmark for Constraint-Level Judge Evaluation in Multi-Constraint Instruction Following
MCJudgeBench evaluates LLM judges at the constraint level with gold labels and inconsistency metrics, showing that overall performance does not ensure reliable detection of partial or no cases or stability under perturbations.
-
Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges
LLM safety judges flip verdicts on equivalent policy rewrites up to 9.1% of the time and cannot distinguish meaningful from meaningless changes, requiring new invariance-based reliability metrics.
-
AutoPyVerifier: Learning Compact Executable Verifiers for Large Language Model Outputs
AutoPyVerifier learns compact sets of executable Python verifiers from labeled LLM outputs via LLM synthesis and DAG search, improving objective prediction by up to 55 F1 points and downstream LLM accuracy by up to 17 points.
-
Label Effects: Shared Heuristic Reliance in Trust Assessment by Humans and LLM-as-a-Judge
Both humans and LLMs trust content more when labeled human-authored than AI-generated, with LLMs showing denser attention to labels and higher uncertainty under AI labels, mirroring human heuristic patterns.
-
Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges
The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.
-
Proxy Reward Internalization and Mechanistic Exploitation: A Learned Precursor to Reward Hacking and Its Generalization
Proxy RL produces a staged proxy-internalization capability that emerges before and predicts reward hacking in coding environments.