The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.
URL https://proceedings.neurips.cc/paper_files/paper/2024/file/02fd91a387a6a5a5751e81b58a75af90-Paper-Datasets_and_Benchmarks_Track.pdf
2 Pith papers cite this work. Polarity classification is still indexing.
citation summary
years: 2026 (2)
verdicts: UNVERDICTED (2)
roles: method (1)
polarities: use method (1)
representative citing papers
Larger differences in generator capability between chosen and rejected reasoning traces improve out-of-domain performance, while filtering pairs by sample-level quality deltas enables more data-efficient training.
citing papers explorer
- Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges
  The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.
- Decomposing the Delta: What Do Models Actually Learn from Preference Pairs?
  Larger differences in generator capability between chosen and rejected reasoning traces improve out-of-domain performance, while filtering pairs by sample-level quality deltas enables more data-efficient training.