Feedback loops with language models drive in-context reward hacking

2023 · arXiv 2309.04509

1 Pith paper cites this work. Polarity classification is still indexing.


representative citing papers

When RLHF Fails: A Mechanistic Taxonomy of Reward Hacking, Collapse, and Evaluator Gaming

cs.LG · 2026-06-02 · unverdicted · novelty 5.0

An empirical study of RLHF pipelines classifies failure modes such as reward hacking by analyzing the directions of change in learned-reward and judge scores across training checkpoints, and shows that these failures can be localized and partially predicted.

citing papers explorer

Showing 1 of 1 citing paper.

When RLHF Fails: A Mechanistic Taxonomy of Reward Hacking, Collapse, and Evaluator Gaming

cs.LG · 2026-06-02 · unverdicted · none · ref 19
