arxiv: 2604.04648 · v1 · submitted 2026-04-06 · 💻 cs.LG

From Curiosity to Caution: Mitigating Reward Hacking for Best-of-N with Pessimism

Pith reviewed 2026-05-10 19:35 UTC · model grok-4.3

classification 💻 cs.LG
keywords reward hacking · best-of-n sampling · inference-time scaling · pessimism · reward models · language models · out-of-distribution detection

The pith

Penalizing prediction error from an error model lowers rewards for atypical responses and mitigates reward hacking in Best-of-N sampling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shows how to fix a failure mode in Best-of-N sampling, a common way to use extra compute at inference time. In Best-of-N, a language model generates many candidate answers, a reward model scores them, and the highest-scoring one is kept. As the number of candidates grows, the selected answers increasingly exploit flaws in the reward model instead of improving in genuine quality. The proposed fix, called caution, trains a separate error model on typical responses and then reduces the reward score of any candidate whose prediction error is high. Empirical results indicate that this simple change largely prevents the performance drop, and an analysis in a simplified linear setting proves that the method improves over plain Best-of-N.
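
The selection rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `best_of_n`, the `penalty_weight` parameter, and the toy scores are all assumptions made for the example.

```python
def best_of_n(candidates, reward_fn, error_fn=None, penalty_weight=1.0):
    """Return the candidate with the highest (optionally pessimistic) score.

    reward_fn  -- maps a candidate to its reward-model score
    error_fn   -- maps a candidate to the error model's prediction error;
                  when None, this is plain Best-of-N
    penalty_weight -- hypothetical knob controlling how strongly error is penalized
    """
    def score(candidate):
        s = reward_fn(candidate)
        if error_fn is not None:
            s -= penalty_weight * error_fn(candidate)
        return s

    return max(candidates, key=score)


# Toy, made-up scores: the exploit gets a higher reward but is atypical.
toy_reward = {"short correct answer": 0.7, "verbose exploit": 0.9}
toy_error = {"short correct answer": 0.05, "verbose exploit": 0.6}
candidates = list(toy_reward)

plain_choice = best_of_n(candidates, toy_reward.get)
cautious_choice = best_of_n(candidates, toy_reward.get, toy_error.get)
print(plain_choice)     # verbose exploit
print(cautious_choice)  # short correct answer
```

With the penalty switched off the exploit wins on raw reward; with it switched on, the atypical candidate's score drops below the typical one.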

Core claim

Caution mitigates reward hacking in BoN by training an error model on typical responses and using its prediction error to lower reward estimates for atypical ones; the approach is computationally efficient in practice and provably superior to standard BoN in a simplified linear setting.

What carries the argument

Caution, which treats prediction error from an error model trained on typical responses as a signal of uncertainty and lowers the corresponding reward estimates.
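
To see why prediction error can act as an uncertainty signal, consider a deliberately simplified linear analogue. The figures indicate that the paper trains a predictor to match reward-model features; the sketch below swaps that for a small least-squares fit. The matrices `W_target` and `X_typical` are hypothetical stand-ins, not anything taken from the paper.

```python
import numpy as np

# Frozen "target" feature map (hypothetical stand-in for reward-model features).
W_target = np.arange(12, dtype=float).reshape(4, 3) / 10.0

# "Typical" inputs vary only along the first two coordinates.
X_typical = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
])

# Fit the predictor to the target on typical data only.
# For rank-deficient problems np.linalg.lstsq returns the minimum-norm solution,
# so directions never seen in training are mapped to zero.
W_pred, *_ = np.linalg.lstsq(X_typical, X_typical @ W_target, rcond=None)


def prediction_error(x):
    """Squared distance between predictor output and frozen target output."""
    diff = x @ W_pred - x @ W_target
    return float(np.sum(diff ** 2))


err_typical = prediction_error(np.array([2.0, -1.0, 0.0, 0.0]))
err_atypical = prediction_error(np.array([2.0, -1.0, 1.0, 1.0]))
print(err_typical, err_atypical)
```

The predictor reproduces the target on inputs that stay inside the span of the training data, so their error is numerically zero. An input with a component outside that span gets a strictly positive error, which is the quantity caution would penalize.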

If this is right

  • Larger N can be used in Best-of-N without the usual performance degradation from reward hacking.
  • In linear settings the method selects higher-quality responses than standard Best-of-N with high probability.
  • Caution requires only an additional error-model training step and remains simple to run at scale.
  • Prediction error serves as a practical out-of-distribution signal that improves reward-based selection.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same error-model penalty could be applied inside the reward-model training loop rather than only at inference time.
  • Curiosity-style prediction-error signals might be repurposed for other detection tasks in language-model outputs.
  • Combining caution with stronger base reward models could produce further gains on tasks that benefit from heavy inference compute.

Load-bearing premise

That prediction error from an error model trained on typical responses reliably identifies responses that exploit imperfections in the reward model and that lowering rewards for these responses improves overall generation quality without introducing new failure modes.

What would settle it

An experiment showing that responses flagged with high prediction error are actually preferred by human judges or that caution produces worse final outputs than standard BoN on a task with a trustworthy reward model.

Figures

Figures reproduced from arXiv: 2604.04648 by Adam Block, Zhiwei Steven Wu, Zhuohao Yu.

Figure 1: Average Accuracy with different sampling budgets for Best-of-N on the GSM8k dataset. We see that standard Best-of-N sampling (blue, red, and gold) suffers from reward hacking, exhibiting the characteristic rise-and-fall pattern as N increases. In contrast, caution (our approach, green) consistently improves with larger N, effectively mitigating reward hacking. This counterintuitive phenomenon, whereby incr…
Figure 2: Overview. Predictor is trained to match RM features on typical responses; at inference, we select the candidate with the highest pessimistic reward, down-weighting OOD ones. reward level: we penalize OOD responses by subtracting per-response uncertainty estimates from the scores assigned by rˆ, and then select the response with the highest pessimistic score. Conceptually, caution is the dual of curiosity, …
Figure 3: Scaling over N across distributions and domains. Best-of-N sampling on GSM8K, MATH-500, and BigBench-Hard. Curves compare selection by Reward Model, Pessimism, and Reward Model + Pessimism. Note that the pessimism function is trained only on GSM8K; thus, MATH-500 represents an out-of-distribution setting, while BigBench-Hard represents a fully out-of-domain setting. λ > 0 to be a pessimistic variant of the…
Figure 4: Contrasting Selection Behaviors: Reward Hacking vs. Format Compliance. Two representative examples showing how reward models favor verbose responses regardless of correctness, while our curiosity-driven pessimism prioritizes format compliance and distributional familiarity. RM assigns high scores to detailed responses regardless of correctness, while pessimism detects distributional deviation from trainin…
Figure 5: Pessimism–Reward visualization on GSM8K. Each row shows one problem: a scatter plot of z-normalized pessimism (x-axis) and z-normalized reward (y-axis), with green points for correct responses and red for incorrect. Upper-left points (high reward, low pessimism) illustrate reward hacking: responses that score well despite low distributional support. Lower-right points (low reward, high pessimism) are well-…
Figure 6: Caution (RND-on-RM-features) scaling with λ. Best-of-N accuracy on GSM8K versus samples per problem (x-axis) for Pessimism strengths λ ∈ {0.0, 0.2, 0.4, 0.6, 0.8, 1.0}. The predictor is trained against a frozen target network built from reward-model features. Larger λ increases pessimism strength; λ = 0 reduces to Reward-Model-only selection, and λ = 1 to pessimism-only. Moderate–high weights (roughly 0.6…
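
The endpoints described in the Figure 6 caption (λ = 0 is reward-only, λ = 1 is pessimism-only) are consistent with a convex combination of the two signals. The sketch below assumes that form and assumes both signals are z-normalized within each problem, as in Figure 5. Neither assumption is confirmed by the text shown here, and `pessimism_score` is taken to be higher for more familiar responses (for example, negated prediction error). All numbers are made up.

```python
import numpy as np


def z_normalize(values):
    """Standardize scores within one problem's candidate pool."""
    values = np.asarray(values, dtype=float)
    std = values.std()
    if std == 0:
        return np.zeros_like(values)
    return (values - values.mean()) / std


def select_index(rewards, pessimism_score, lam):
    """Index of the candidate maximizing (1 - lam) * z(reward) + lam * z(pessimism)."""
    combined = (1.0 - lam) * z_normalize(rewards) + lam * z_normalize(pessimism_score)
    return int(np.argmax(combined))


# Toy pool of three candidates; candidate 1 has the top reward but is very atypical.
rewards = [0.2, 0.9, 0.6]
pessimism_score = [0.1, -2.0, 0.3]

picks = {
    lam: select_index(rewards, pessimism_score, lam)
    for lam in (0.0, 0.2, 0.4, 0.6, 0.8, 1.0)
}
print(picks)
```

At λ = 0 the selection follows the reward model alone, and at λ = 1 it follows the familiarity signal alone; intermediate values trade the two off.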
Figure 7: Traditional RND (random targets) baseline. Best-of-N GSM8K accuracy when using classical RND with a randomly initialized target network (no reward-model features), sweeping λ ∈ {0.0, 0.2, 0.4, 0.6, 0.8, 1.0}. Unlike our caution variant, this baseline shows little to no scaling benefit and generally does not surpass the Reward-Model-only curve, indicating that semantic grounding from reward-model features…
Figure 8: Comparison of pessimism-based sampling with alternative baseline methods.
Original abstract

Inference-time compute scaling has emerged as a powerful paradigm for improving language model performance on a wide range of tasks, but the question of how best to use the additional compute remains open. A popular approach is BoN sampling, where N candidate responses are generated, scored according to a reward model, and the highest-scoring response is selected. While this approach can improve performance, it is vulnerable to reward hacking, where performance degrades as N increases due to the selection of responses that exploit imperfections in the reward model instead of genuinely improving generation quality. Prior attempts to mitigate reward hacking, via stronger reward models or heavy-handed distributional regularization, either fail to fully address over-optimization or are too conservative to exploit additional compute. In this work, we explore the principle of pessimism in RL, which uses lower confidence bounds on value estimates to avoid OOD actions with uncertain reward estimates. Our approach, termed as caution, can be seen as the reverse of curiosity: where curiosity rewards prediction error as a signal of novelty, caution penalizes prediction error as a signal of distributional uncertainty. Practically, caution trains an error model on typical responses and uses its prediction error to lower reward estimates for atypical ones. Our extensive empirical evaluation demonstrates that caution is a simple, computationally efficient approach that substantially mitigates reward hacking in BoN sampling. We also provide a theoretical analysis in a simplified linear setting, which shows that caution provably improves over the standard BoN approach. Together, our results not only establish caution as a practical solution to reward hacking, but also provide evidence that curiosity-based approaches can be a general OOD detection technique in LLM settings.
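
The lower-confidence-bound idea in the abstract can be written schematically as follows. The notation is an assumption made for illustration and is not taken from the paper: $\hat{r}$ is the learned reward, $u \ge 0$ is a per-response uncertainty estimate (for caution, the error model's prediction error), $\lambda \ge 0$ is a penalty weight, and $\pi$ is the base policy that generates the $N$ candidates.

```latex
\hat{r}_{\lambda}(x, y) = \hat{r}(x, y) - \lambda\, u(x, y),
\qquad
\hat{y}_{\text{caution}} \in \arg\max_{i \in \{1, \dots, N\}} \hat{r}_{\lambda}(x, y_i),
\qquad
y_1, \dots, y_N \sim \pi(\cdot \mid x).
```

Setting $\lambda = 0$ recovers standard Best-of-N. The Figure 6 caption suggests the experiments may use a different parameterization in which $\lambda$ ranges over $[0, 1]$, so this form should be read as a sketch of the principle only.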

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes 'caution' as a method to mitigate reward hacking in Best-of-N (BoN) sampling: an auxiliary error model is trained on typical responses, and its prediction error is subtracted from the reward model's scores to penalize atypical (potentially hacking) candidates. It claims this is simple and efficient, with extensive empirical results showing substantial mitigation across tasks, plus a theoretical analysis in a simplified linear setting proving improvement over standard BoN.

Significance. If the central claims hold, this offers a lightweight, inference-time approach to scaling compute in LLMs without the conservatism of heavy regularization or the cost of stronger reward models. The provision of both broad empirical evaluation and a clean linear theory (with explicit assumptions) is a positive strength that could make the method easy to adopt and extend.

major comments (2)
  1. §3 (Linear Setting): The proof shows improvement under the assumption that the error model exactly recovers uncertainty with a shared feature map; this does not automatically establish robustness when both the reward and error models are neural nets trained on overlapping but non-identical distributions, leaving the transfer to the empirical regime unaddressed.
  2. §5 (Empirical Evaluation): The claim of 'substantially mitigates reward hacking' requires evidence that the error model systematically assigns higher error to actual reward-hacking responses (rather than merely in-distribution atypical ones); without ablations on the error-model training distribution or checks against the skeptic's concern that subtle exploits remain unpenalized, the load-bearing assumption remains unverified.
minor comments (2)
  1. Abstract and §1: Clarify that the theoretical guarantee is restricted to the linear case and does not claim parameter-free behavior in the neural setting.
  2. Figure captions and §5: Include error bars, number of runs, and statistical tests to support the 'substantial' improvement claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the paper's potential impact. We address each major comment below and outline revisions to strengthen the manuscript.

  1. Referee: §3 (Linear Setting): The proof shows improvement under the assumption that the error model exactly recovers uncertainty with a shared feature map; this does not automatically establish robustness when both the reward and error models are neural nets trained on overlapping but non-identical distributions, leaving the transfer to the empirical regime unaddressed.

    Authors: We agree that the theoretical result in §3 is derived under a simplified linear setting with explicit assumptions, including exact recovery of uncertainty by the error model via a shared feature map. This provides a clean, provable improvement over standard BoN but does not formally establish the result for neural networks with non-identical training distributions. The linear case is intended to illustrate the core pessimism principle and offer intuition. In the revision we will add a dedicated paragraph in the discussion section clarifying these assumptions, their role as a stepping stone, and the empirical results as the primary evidence for practical applicability, without claiming a direct theoretical transfer.

    revision: yes

  2. Referee: §5 (Empirical Evaluation): The claim of 'substantially mitigates reward hacking' requires evidence that the error model systematically assigns higher error to actual reward-hacking responses (rather than merely in-distribution atypical ones); without ablations on the error-model training distribution or checks against the skeptic's concern that subtle exploits remain unpenalized, the load-bearing assumption remains unverified.

    Authors: Our experiments demonstrate that caution reduces performance degradation as N grows and yields higher final performance than standard BoN across tasks, consistent with mitigation of reward hacking. However, we acknowledge that the current manuscript does not include explicit ablations verifying that the error model assigns systematically higher errors specifically to reward-hacking responses (as opposed to other atypical but non-hacking ones) or varying the error-model training distribution. We will add such analyses in the revised version, including error-score distributions on identified hacking examples and sensitivity checks on the error-model training data, to more directly substantiate the mechanism.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on independent error model and linear theory

The paper's central approach trains a separate error model on typical responses to penalize prediction error when adjusting rewards for BoN candidates. This is not defined in terms of the target reward or selection outcome. The theoretical analysis is performed in a simplified linear setting that derives improvement over standard BoN under explicit assumptions about shared features and uncertainty recovery; it does not reduce to a fit on the same data or a self-citation chain. No load-bearing step equates a prediction or uniqueness claim to its own inputs by construction, and the empirical claims rest on external evaluation rather than internal redefinition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach rests on the domain assumption that prediction error correlates with reward uncertainty and on the practical choice of training an error model, but no explicit free parameters or invented entities beyond the caution mechanism itself are detailed in the abstract.

axioms (1)
  • domain assumption: Prediction error from an error model trained on typical responses serves as a valid proxy for distributional uncertainty that causes reward hacking
    Invoked in the description of how caution lowers reward estimates for atypical responses.
invented entities (1)
  • Caution mechanism (error-model pessimism) · no independent evidence
    purpose: To penalize OOD responses in BoN sampling
    Introduced as the reverse of curiosity for OOD detection in LLM reward settings

pith-pipeline@v0.9.0 · 5602 in / 1476 out tokens · 75458 ms · 2026-05-10T19:35:28.562857+00:00 · methodology
