From Curiosity to Caution: Mitigating Reward Hacking for Best-of-N with Pessimism
Pith reviewed 2026-05-10 19:35 UTC · model grok-4.3
The pith
Penalizing prediction error from an error model lowers rewards for atypical responses and mitigates reward hacking in Best-of-N sampling.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Caution mitigates reward hacking in Best-of-N (BoN) sampling by training an error model on typical responses and using its prediction error to lower reward estimates for atypical ones; the approach is computationally efficient in practice and provably superior to standard BoN in a simplified linear setting.
What carries the argument
Caution, which treats prediction error from an error model trained on typical responses as a signal of uncertainty and lowers the corresponding reward estimates.
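The selection rule can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function names, the toy scores, and the penalty weight `lam` are assumptions, and the only part taken from the paper is the idea of subtracting a multiple of the prediction error from the reward estimate before picking the best candidate.

```python
import numpy as np

def best_of_n(rewards):
    """Standard BoN: index of the highest reward-model score."""
    return int(np.argmax(rewards))

def cautious_best_of_n(rewards, errors, lam=1.0):
    """Pessimistic BoN: subtract lam * prediction error, then take the argmax."""
    rewards = np.asarray(rewards, dtype=float)
    errors = np.asarray(errors, dtype=float)
    lcb = rewards - lam * errors  # lower-confidence-bound style score
    return int(np.argmax(lcb))

# Toy scores (hypothetical): candidate 1 has the highest reward
# but also a large prediction error, so it looks atypical.
rewards = [1.0, 3.0, 2.0]
errors = [0.1, 5.0, 0.2]
standard_pick = best_of_n(rewards)
cautious_pick = cautious_best_of_n(rewards, errors, lam=0.5)
```

With `lam=0` the cautious rule reduces to standard BoN, so the penalty weight controls how strongly atypical candidates are discounted.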
If this is right
- Larger N can be used in Best-of-N without the usual performance degradation from reward hacking.
- In linear settings the method selects higher-quality responses than standard Best-of-N with high probability.
- Caution requires only an additional error-model training step and remains simple to run at scale.
- Prediction error serves as a practical out-of-distribution signal that improves reward-based selection.
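For reference, the quantities behind the linear-setting claim can be written out as follows. The penalty, the pessimistic reward, and the final bound are the formulas quoted in the "Lean theorems connected to this paper" entries on this page; the two argmax definitions are an assumed reading of the indices that appear in that bound, not something the page states.

```latex
\begin{align*}
\alpha(x,y) &= \lVert P_\theta(x,y) - T(x,y) \rVert^2, \\
\hat{r}_{\mathrm{LCB}}(x,y) &= \hat{r}(x,y) - \lambda\,\alpha(x,y), \\
\hat{i} &= \arg\max_{i \le N} \hat{r}(x,y_i), \qquad
i_{\mathrm{pess}} = \arg\max_{i \le N} \hat{r}_{\mathrm{LCB}}(x,y_i), \\
\mathbb{E}\bigl[r^\star(y_{i_{\mathrm{pess}}}) - r^\star(y_{\hat{i}})\bigr] &\gtrsim \sqrt{\log N}.
\end{align*}
```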
Where Pith is reading between the lines
- The same error-model penalty could be applied inside the reward-model training loop rather than only at inference time.
- Curiosity-style prediction-error signals might be repurposed for other detection tasks in language-model outputs.
- Combining caution with stronger base reward models could produce further gains on tasks that benefit from heavy inference compute.
Load-bearing premise
That prediction error from an error model trained on typical responses reliably identifies responses that exploit imperfections in the reward model and that lowering rewards for these responses improves overall generation quality without introducing new failure modes.
What would settle it
An experiment showing that responses flagged with high prediction error are actually preferred by human judges or that caution produces worse final outputs than standard BoN on a task with a trustworthy reward model.
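One way such a check might be scored is with an AUROC over prediction errors, comparing responses labeled as reward hacking against responses labeled as benign. The sketch below is an assumption about how the test could be run, not a procedure from the paper; the error values are made up, and in practice the labels would have to come from human judgments.

```python
def auroc(scores_pos, scores_neg):
    """Probability that a random positive outscores a random negative (ties count 1/2)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical prediction errors for labeled responses.
errors_hacking = [2.1, 1.7, 0.9]
errors_benign = [0.3, 1.0, 0.5, 0.2]
separation = auroc(errors_hacking, errors_benign)
```

A value near 1 would mean the error model ranks hacking responses above benign ones; a value near or below 0.5 would count against the load-bearing premise.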
Original abstract
Inference-time compute scaling has emerged as a powerful paradigm for improving language model performance on a wide range of tasks, but the question of how best to use the additional compute remains open. A popular approach is Best-of-N (BoN) sampling, where N candidate responses are generated, scored according to a reward model, and the highest-scoring response is selected. While this approach can improve performance, it is vulnerable to reward hacking, where performance degrades as N increases due to the selection of responses that exploit imperfections in the reward model instead of genuinely improving generation quality. Prior attempts to mitigate reward hacking, via stronger reward models or heavy-handed distributional regularization, either fail to fully address over-optimization or are too conservative to exploit additional compute. In this work, we explore the principle of pessimism in reinforcement learning (RL), which uses lower confidence bounds on value estimates to avoid out-of-distribution (OOD) actions with uncertain reward estimates. Our approach, termed caution, can be seen as the reverse of curiosity: where curiosity rewards prediction error as a signal of novelty, caution penalizes prediction error as a signal of distributional uncertainty. Practically, caution trains an error model on typical responses and uses its prediction error to lower reward estimates for atypical ones. Our extensive empirical evaluation demonstrates that caution is a simple, computationally efficient approach that substantially mitigates reward hacking in BoN sampling. We also provide a theoretical analysis in a simplified linear setting, which shows that caution provably improves over the standard BoN approach. Together, our results not only establish caution as a practical solution to reward hacking, but also provide evidence that curiosity-based approaches can be a general OOD detection technique in LLM settings.
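To make the error-model idea concrete, here is a toy linear sketch, assuming the penalty form α(x,y)=‖P_θ(x,y)−T(x,y)‖² that this page quotes from the paper. Everything else is an illustrative assumption: the feature dimensions, the fixed random target map `T`, and the least-squares predictor `P` stand in for whatever networks the authors actually use. Typical features are confined to a subspace `V`, so the fitted predictor matches the target there and errs only on components outside it.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, m = 8, 3, 4  # feature dim, dim of typical subspace V, target output dim

# Orthonormal basis of R^d; the first k columns span the "typical" subspace V.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
V, V_perp = Q[:, :k], Q[:, k:]

T = rng.normal(size=(d, m))              # fixed random target map (never trained)
X_typ = rng.normal(size=(200, k)) @ V.T  # typical features, all inside V

# Predictor P: minimum-norm least-squares fit of the target on typical data.
P, *_ = np.linalg.lstsq(X_typ, X_typ @ T, rcond=None)

def alpha(x):
    """Squared prediction error ||x P - x T||^2 for one feature vector."""
    return float(np.sum((x @ P - x @ T) ** 2))

x_typical = rng.normal(size=k) @ V.T
x_atypical = x_typical + 2.0 * V_perp[:, 0]  # same point pushed off V
err_typical, err_atypical = alpha(x_typical), alpha(x_atypical)
```

In this toy setup the error is numerically zero for the in-subspace point and grows with the off-subspace component, which is the behavior a penalty of this kind relies on.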
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes 'caution' as a method to mitigate reward hacking in Best-of-N (BoN) sampling: an auxiliary error model is trained on typical responses, and its prediction error is subtracted from the reward model's scores to penalize atypical (potentially hacking) candidates. It claims this is simple and efficient, with extensive empirical results showing substantial mitigation across tasks, plus a theoretical analysis in a simplified linear setting proving improvement over standard BoN.
Significance. If the central claims hold, this offers a lightweight, inference-time approach to scaling compute in LLMs without the conservatism of heavy regularization or the cost of stronger reward models. The combination of broad empirical evaluation and a clean linear theory (with explicit assumptions) is a strength that could make the method easy to adopt and extend.
major comments (2)
- §3 (Linear Setting): The proof shows improvement under the assumption that the error model exactly recovers uncertainty with a shared feature map; this does not automatically establish robustness when both the reward and error models are neural nets trained on overlapping but non-identical distributions, leaving the transfer to the empirical regime unaddressed.
- §5 (Empirical Evaluation): The claim of 'substantially mitigates reward hacking' requires evidence that the error model systematically assigns higher error to actual reward-hacking responses (rather than merely in-distribution atypical ones); without ablations on the error-model training distribution or checks against the skeptic's concern that subtle exploits remain unpenalized, the load-bearing assumption remains unverified.
minor comments (2)
- Abstract and §1: Clarify that the theoretical guarantee is restricted to the linear case and does not claim parameter-free behavior in the neural setting.
- Figure captions and §5: Include error bars, number of runs, and statistical tests to support the 'substantial' improvement claims.
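As an illustration of the reporting asked for in the last comment, the sketch below computes a mean gain and a normal-approximation 95% interval across runs. The per-seed numbers are invented for the example and do not come from the paper; with only a handful of seeds, a t-based interval would be somewhat wider.

```python
import math
import statistics

def mean_and_ci(values, z=1.96):
    """Mean and normal-approximation 95% interval half-width across runs."""
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))  # standard error
    return mean, z * sem

# Hypothetical per-seed gains (caution minus standard BoN) at a fixed N.
gains = [0.04, 0.06, 0.05, 0.03, 0.07]
mean_gain, half_width = mean_and_ci(gains)
```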
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive assessment of the paper's potential impact. We address each major comment below and outline revisions to strengthen the manuscript.
- Referee: §3 (Linear Setting): The proof shows improvement under the assumption that the error model exactly recovers uncertainty with a shared feature map; this does not automatically establish robustness when both the reward and error models are neural nets trained on overlapping but non-identical distributions, leaving the transfer to the empirical regime unaddressed.
  Authors: We agree that the theoretical result in §3 is derived under a simplified linear setting with explicit assumptions, including exact recovery of uncertainty by the error model via a shared feature map. This provides a clean, provable improvement over standard BoN but does not formally establish the result for neural networks with non-identical training distributions. The linear case is intended to illustrate the core pessimism principle and offer intuition. In the revision we will add a dedicated paragraph in the discussion section clarifying these assumptions, their role as a stepping stone, and the empirical results as the primary evidence for practical applicability, without claiming a direct theoretical transfer.
  Revision: yes
- Referee: §5 (Empirical Evaluation): The claim of 'substantially mitigates reward hacking' requires evidence that the error model systematically assigns higher error to actual reward-hacking responses (rather than merely in-distribution atypical ones); without ablations on the error-model training distribution or checks against the skeptic's concern that subtle exploits remain unpenalized, the load-bearing assumption remains unverified.
  Authors: Our experiments demonstrate that caution reduces performance degradation as N grows and yields higher final performance than standard BoN across tasks, consistent with mitigation of reward hacking. However, we acknowledge that the current manuscript does not include explicit ablations verifying that the error model assigns systematically higher errors specifically to reward-hacking responses (as opposed to other atypical but non-hacking ones) or varying the error-model training distribution. We will add such analyses in the revised version, including error-score distributions on identified hacking examples and sensitivity checks on the error-model training data, to more directly substantiate the mechanism.
  Revision: yes
Circularity Check
No significant circularity; derivation relies on independent error model and linear theory
The paper's central approach trains a separate error model on typical responses to penalize prediction error when adjusting rewards for BoN candidates. This is not defined in terms of the target reward or selection outcome. The theoretical analysis is performed in a simplified linear setting that derives improvement over standard BoN under explicit assumptions about shared features and uncertainty recovery; it does not reduce to a fit on the same data or a self-citation chain. No load-bearing step equates a prediction or uniqueness claim to its own inputs by construction, and the empirical claims rest on external evaluation rather than internal redefinition.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Prediction error from an error model trained on typical responses serves as a valid proxy for the distributional uncertainty that causes reward hacking.
invented entities (1)
- Caution mechanism (error-model pessimism): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (J uniquely forced by Aczél-class functional equation): echoes
  Echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: caution trains an error model on typical responses and uses its prediction error to lower reward estimates for atypical ones... $\alpha(x,y) = \lVert P_\theta(x,y) - T(x,y) \rVert^2$ ... $\hat{r}_{\mathrm{LCB}} = \hat{r} - \lambda\alpha$
- IndisputableMonolith/Cost.lean, Jcost_pos_of_ne_one (positive cost exactly off identity): refines
  Refines: relation between the paper passage and the cited Recognition theorem.
  Passage: linear setting... $\alpha(y)$ approximates projection onto $V^\perp$... $\mathbb{E}[r^\star(y_{i_{\mathrm{pess}}}) - r^\star(y_{\hat{i}})] \gtrsim \sqrt{\log N}$
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.