Correlated proxies: A new definition and improved mitigation for reward hacking. arXiv preprint arXiv:2403.03185
2 Pith papers cite this work. Polarity classification is still indexing.
Citing papers by year: 2026 (2). Verdicts: unverdicted (2).
citing papers explorer
- Do Synthetic Trajectories Reflect Real Reward Hacking? A Systematic Study on Monitoring In-the-Wild Hacking in Code Generation
Synthetic reward hacking data does not capture natural hacking behaviors in code generation RL, causing monitors trained on it to generalize poorly compared to those trained on in-the-wild trajectories.
- ConsistRM: Improving Generative Reward Models via Consistency-Aware Self-Training
ConsistRM improves generative reward models via consistency-aware self-training, outperforming vanilla RFT by 1.5% on average across five benchmarks and four base models.