An empirical study of RLHF pipelines classifies failure modes such as reward hacking by analyzing directions of change in learned reward and judge scores across training checkpoints and shows they can be localized and partially predicted.
Feedback loops with language models drive in-context reward hacking,
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.LG 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
When RLHF Fails: A Mechanistic Taxonomy of Reward Hacking, Collapse, and Evaluator Gaming
An empirical study of RLHF pipelines classifies failure modes such as reward hacking by analyzing directions of change in learned reward and judge scores across training checkpoints and shows they can be localized and partially predicted.