Breaking the mirror: Activation-based mitigation of self-preference in LLM evaluators. arXiv preprint arXiv:2509.03647

Dani Roytburg, Matthew Bozoukov, Matthew Nguyen, Jou Barzdukas, Simon Fu, Narmeen Oozeer · arXiv 2509.03647

1 Pith paper cites this work. Polarity classification is still indexing.

1 Pith paper citing it

representative citing papers

Self-Preference Bias in Rubric-Based Evaluation of Large Language Models

cs.CL · 2026-04-08 · unverdicted · novelty 7.0

Rubric-based LLM judges show self-preference bias, incorrectly marking their own failed outputs as satisfied up to 50% more often on verifiable benchmarks and skewing scores by 10 points on subjective ones.

citing papers explorer

Showing 1 of 1 citing paper.

Self-Preference Bias in Rubric-Based Evaluation of Large Language Models · cs.CL · 2026-04-08 · unverdicted · none · ref 13
