Mona: Myopic optimization with non-myopic approval can mitigate multi-step reward hacking

Sebastian Farquhar, Vikrant Varma, David Lindner, David Elson, Caleb Biddulph, Ian Goodfellow, Rohin Shah · 2025 · arXiv 2501.13011

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

read on arXiv browse 2 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Evaluating Plan Compliance in Autonomous Programming Agents

cs.SE · 2026-04-13 · unverdicted · novelty 7.0

Autonomous programming agents frequently fail to follow instructed plans, falling back on incomplete internalized workflows, while standard plans and periodic reminders improve performance but poor plans can degrade it more than no plan.

Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

cs.LG · 2026-04-15 · unverdicted · novelty 5.0

The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.

citing papers explorer

Showing 2 of 2 citing papers.

Evaluating Plan Compliance in Autonomous Programming Agents cs.SE · 2026-04-13 · unverdicted · none · ref 8
Autonomous programming agents frequently fail to follow instructed plans, falling back on incomplete internalized workflows, while standard plans and periodic reminders improve performance but poor plans can degrade it more than no plan.
Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges cs.LG · 2026-04-15 · unverdicted · none · ref 181
The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.

Mona: Myopic optimization with non-myopic approval can mitigate multi-step reward hacking

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer