Cheap Reward Hacking Detection

Iv\'an Belenky; Joaqu\'in Itria; Steven Johns

arxiv: 2606.08893 · v1 · pith:M4ZCIMTOnew · submitted 2026-06-08 · 💻 cs.LG · cs.AI· cs.CR

Cheap Reward Hacking Detection

Iv\'an Belenky , Joaqu\'in Itria , Steven Johns This is my paper

Pith reviewed 2026-06-27 17:19 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CR

keywords reward hacking detectiontransformer encodertrajectory embeddinglinear probeLLM judgecost efficiencyAI alignment monitoring

0 comments

The pith

A small transformer encoder detects reward hacking nearly as accurately as an LLM judge but at four orders of magnitude lower cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors train a small transformer encoder to embed Terminal-Wrench trajectories onto a unit sphere such that distances approximate the L1 distance between reward and metadata signals. A linear probe on these embeddings detects reward hacking on the cleaned test split with AUC 0.9467 and TPR at 5 percent FPR of 0.8296. This matches the performance of a sanitized LLM-as-judge on the same split while exceeding its TPR at low FPR and operating at roughly four orders of magnitude lower per-trajectory cost. Stripping natural-language reasoning from the encoder input at probe time drops AUC to 0.6213, showing the model integrates both behavior and reasoning. The result offers a concrete, low-cost alternative for identifying reward hacking in trajectories.

Core claim

The authors show that a small transformer encoder trained to map trajectories to a unit sphere, where embedding distance approximates L1 distance between reward and metadata signals, enables a linear probe to detect reward hacking on the cleaned test split with AUC 0.9467 and TPR@5%FPR 0.8296. This matches the TW sanitized LLM-as-judge AUC of 0.9510 and exceeds its TPR@5%FPR of 0.7130 on the same information condition, at roughly four orders of magnitude lower per-trajectory cost. The encoder is not a pure behavior reader, as removing natural-language reasoning drops AUC to 0.6213.

What carries the argument

The small transformer encoder that maps Terminal-Wrench trajectories onto a unit sphere where embedding distance approximates the L1 distance between reward and metadata signals.

If this is right

Reward hacking detection becomes feasible at much lower per-trajectory cost than LLM-based judging.
The linear probe achieves comparable AUC and higher TPR at 5 percent FPR than the LLM judge under the same information condition.
Detection performance requires the full input that includes natural-language reasoning, as performance falls sharply without it.
The embedding approach provides a scalable method for monitoring trajectories without repeated large-model calls.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Routine, low-cost scanning of large numbers of trajectories could become practical in ongoing training or evaluation loops.
Similar embedding-plus-probe pipelines might be tested on other misalignment signals beyond reward hacking.
The cost reduction could allow more trajectories to be audited in settings where LLM calls are currently prohibitive.

Load-bearing premise

The cleaning of the test split does not introduce selection bias that favors the embedding method over the LLM baseline.

What would settle it

Testing the linear probe on a freshly collected set of trajectories without the reported cleaning step and observing a substantial drop in AUC or TPR would show whether the reported detection performance holds.

Figures

Figures reproduced from arXiv: 2606.08893 by Iv\'an Belenky, Joaqu\'in Itria, Steven Johns.

**Figure 2.** Figure 2: ROC of the logistic-regression probe on the 690-trajectory test split (15h run checkpoint). [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: UMAP [7] projection of the 690 test-split embeddings. Left: colored by is_hack. Right: colored by probe confidence P(hack). In deployment, the harness that runs the agent produces the reasoning steps, and the monitor reads them downstream. The sanitized row is the realistic operating condition. 4.2.3 Stripped-trained encoder The probe-time ablation above strips reasoning from a model that was trained on re… view at source ↗

**Figure 4.** Figure 4: Per-trajectory scores of the full encoder ( [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: |∆|-weighted density of occlusion saliency over hack trajectories by absolute action index (log scale). The signal concentrates in the first ≈ 20 actions. 15h checkpoint, n = 200 trajectories.5 First-order only Occlusion removes one action at a time. Interaction effects between actions are invisible to this measurement. 4.4 Token-space probing A complementary question: given the probe, what would a maximal… view at source ↗

read the original abstract

A small transformer encoder is trained to map Terminal-Wrench trajectories onto a unit sphere where embedding distance approximates the $L_1$ distance between reward and metadata signals. A linear probe on top of that embedding detects reward hacking on the cleaned test split with AUC $0.9467$ and TPR@5%FPR $0.8296$, matching the TW sanitized LLM-as-judge AUC ($0.9510$ on the cleaned split) and exceeding its TPR@5%FPR ($0.7130$ vs $0.8296$) on the same information condition, at roughly four orders of magnitude lower per-trajectory cost. The encoder is not a pure behavior reader: stripping natural-language reasoning from its input at probe time drops AUC to $0.6213$.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Small transformer embedding matches LLM judge on reward hacking detection at 10kx lower cost, but unspecified test-split cleaning is the load-bearing assumption.

read the letter

The core result here is a small transformer encoder that embeds Terminal-Wrench trajectories onto the unit sphere so that Euclidean distance approximates L1 distance between reward and metadata signals. A linear probe on those embeddings then flags reward hacking with AUC 0.9467 and TPR@5%FPR 0.8296 on the cleaned test split, roughly matching the sanitized LLM judge while costing four orders of magnitude less per trajectory. The ablation that drops AUC to 0.6213 when natural-language reasoning is stripped shows the model is using more than raw behavior.

That is the actual advance: a cheap, fixed-size embedding that can stand in for expensive LLM judging on this specific task. The numbers are concrete and the cost comparison is straightforward.

The soft spot is exactly the one the stress-test flags. The abstract (and the provided summary) gives no description of how the test split was cleaned, what fraction of trajectories were removed, or whether the removal rule was independent of the embedding and probe. If the cleaning step preferentially drops cases where the unit-sphere approximation is weak, the reported parity with the LLM judge becomes hard to interpret. The paper would be stronger with an explicit statement of the cleaning criteria and a check that performance holds on the uncleaned split or on a held-out set cleaned by a different rule.

Model size, training data sources, and exact training procedure are also missing from the abstract-level description, though the full manuscript presumably supplies them. Those details matter for reproducibility but are secondary to the cleaning question.

This is the kind of practical, engineering-oriented paper that belongs in a reading group focused on scalable oversight or cheap monitoring. A serious referee should see it, mainly to press on the cleaning procedure and to ask for the missing implementation details. I would send it to review rather than desk-reject.

Referee Report

1 major / 0 minor

Summary. The paper trains a small transformer encoder to map Terminal-Wrench trajectories to a unit sphere such that embedding distance approximates L1 distance between reward and metadata signals. A linear probe on this embedding detects reward hacking on the cleaned test split with AUC 0.9467 and TPR@5%FPR 0.8296, matching the TW-sanitized LLM-as-judge AUC (0.9510) and exceeding its TPR@5%FPR (0.7130 vs 0.8296) at four orders of magnitude lower cost. Stripping natural-language reasoning from the input drops AUC to 0.6213.

Significance. If the performance numbers hold without selection bias from the cleaning step, the work demonstrates a scalable, low-cost embedding-based detector for reward hacking that could complement or replace expensive LLM judges in RLHF pipelines. The ablation showing that the encoder relies on natural-language reasoning (rather than raw behavior alone) strengthens the claim that the representation captures semantically relevant features.

major comments (1)

[Abstract] Abstract: The central performance claims (AUC 0.9467, TPR@5%FPR 0.8296 on the cleaned test split) rest on an unspecified cleaning procedure for the test split. No criteria, fraction of trajectories removed, or verification that removal was independent of the embedding/probe outputs are provided. This is load-bearing because preferential removal of cases where the unit-sphere embedding fails to separate hacked trajectories could artifactually produce parity or superiority over the LLM judge baseline.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their careful reading and for highlighting an important omission in our description of the experimental pipeline. We address the major comment below.

read point-by-point responses

Referee: [Abstract] Abstract: The central performance claims (AUC 0.9467, TPR@5%FPR 0.8296 on the cleaned test split) rest on an unspecified cleaning procedure for the test split. No criteria, fraction of trajectories removed, or verification that removal was independent of the embedding/probe outputs are provided. This is load-bearing because preferential removal of cases where the unit-sphere embedding fails to separate hacked trajectories could artifactually produce parity or superiority over the LLM judge baseline.

Authors: We agree that the cleaning procedure must be fully specified for the results to be interpretable. The current manuscript does not provide the criteria, the fraction removed, or an explicit statement of independence from the embedding/probe. In the revised manuscript we will add a dedicated subsection describing the cleaning rules (removal of trajectories with missing or uncomputable reward/metadata fields), the exact fraction removed from the test split, and confirmation that cleaning occurred prior to any embedding training or probe fitting and was performed solely on metadata completeness. We will also add a short discussion of why the cleaning criteria are unlikely to introduce the selection bias the referee describes. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical performance reported on held-out test split after training.

full rationale

The paper describes training a transformer encoder to produce embeddings and then evaluating a linear probe's detection performance (AUC, TPR) on a cleaned test split. No equations, derivations, or self-citations are presented that reduce the reported metrics to fitted quantities by construction, nor is any uniqueness theorem or ansatz imported from prior author work. The central result is an empirical measurement rather than an algebraic identity or renamed input. The unspecified cleaning procedure raises validity concerns but does not constitute definitional or self-referential circularity under the enumerated patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review; no full methods section available to enumerate fitted parameters, background axioms, or new entities.

axioms (1)

domain assumption Embedding distance on the unit sphere approximates the L1 distance between reward and metadata signals
Stated directly in the abstract as the training target for the encoder.

pith-pipeline@v0.9.1-grok · 5662 in / 1263 out tokens · 26898 ms · 2026-06-27T17:19:28.148094+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

18 extracted references · 6 canonical work pages · 5 internal anchors

[1]

Concrete Problems in AI Safety

Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety.arXiv preprint arXiv:1606.06565, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[2]

The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models

Alexander Pan, Kush Bhatia, and Jacob Steinhardt. The effects of reward misspecification: Mapping and mitigating misaligned models.arXiv preprint arXiv:2201.03544, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[3]

Reward tampering problems and solutions in reinforcement learning: A causal influence diagram perspective

Tom Everitt, Marcus Hutter, Ramana Kumar, and Victoria Krakovna. Reward tampering problems and solutions in reinforcement learning: A causal influence diagram perspective. Synthese, 198(Suppl 27):6435–6467, 2021

2021
[4]

Bisimulation metrics for continuous Markov decision processes.SIAM Journal on Computing, 40(6):1662–1714, 2011

Norm Ferns, Prakash Panangaden, and Doina Precup. Bisimulation metrics for continuous Markov decision processes.SIAM Journal on Computing, 40(6):1662–1714, 2011

2011
[5]

Density-based clustering based on hierarchical density estimates

Ricardo JGB Campello, Davoud Moulavi, and Jörg Sander. Density-based clustering based on hierarchical density estimates. InPacific-Asia Conference on Knowledge Discovery and Data Mining, pages 160–172. Springer, 2013

2013
[6]

Attention is all you need.Advances in Neural Information Processing Systems, 30, 2017

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in Neural Information Processing Systems, 30, 2017

2017
[7]

UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction

Leland McInnes, John Healy, and James Melville. UMAP: Uniform manifold approximation and projection for dimension reduction.arXiv preprint arXiv:1802.03426, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[8]

SMART: Robust and efficient fine-tuning for pre-trained natural language models through principled regularized optimization

Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Tuo Zhao. SMART: Robust and efficient fine-tuning for pre-trained natural language models through principled regularized optimization. InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2177–2190, 2020

2020
[9]

FreeLB: Enhanced adversarial training for natural language understanding.arXiv preprint arXiv:1909.11764, 2019

Chen Zhu, Yu Cheng, Zhe Gan, Siqi Sun, Tom Goldstein, and Jingjing Liu. FreeLB: Enhanced adversarial training for natural language understanding.arXiv preprint arXiv:1909.11764, 2019

work page arXiv 1909
[10]

A simple framework for contrastive learning of visual representations

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. InInternational Conference on Machine Learning, pages 1597–1607. PMLR, 2020

2020
[11]

Defining and characterizing reward gaming.Advances in Neural Information Processing Systems, 35: 9460–9471, 2022

Joar Skalse, Nikolaus Howe, Dmitrii Krasheninnikov, and David Krueger. Defining and characterizing reward gaming.Advances in Neural Information Processing Systems, 35: 9460–9471, 2022

2022
[12]

Specification gaming: The flip side of AI ingenuity.DeepMind Blog, 2020

Victoria Krakovna, Jonathan Uesato, Vladimir Mikulik, Matthew Rahtz, Tom Everitt, Ralph Bolber, Marcus Hutter, and Shane Legg. Specification gaming: The flip side of AI ingenuity.DeepMind Blog, 2020. URL https://deepmindsafetyresearch.medium.com/ specification-gaming-the-flip-side-of-ai-ingenuity-c85bdb0deeb4

2020
[13]

Programming as theory building.Microprocessing and Microprogramming, 15(5): 253–261, 1985

Peter Naur. Programming as theory building.Microprocessing and Microprogramming, 15(5): 253–261, 1985

1985
[14]

Categorizing Variants of Goodhart's Law

David Manheim and Scott Garrabrant. Categorizing variants of Goodhart’s law.arXiv preprint arXiv:1803.04585, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[15]

Towards understanding sycophancy in language models

Mrinank Sharma, Meg Tong, Tomek Korbak, David Duvenaud, Amanda Askell, Sam Bow- man, Esin Durmus, Zac Hatfield-Dodds, Scott Johnston, Shauna Kravec, et al. Towards understanding sycophancy in language models. InInternational Conference on Learning Representations, volume 2024, pages 110–144, 2024. 19

2024
[16]

Terminal Wrench: A Dataset of 331 Reward-Hackable Environments and 3,632 Exploit Trajectories

Ivan Bercovich, Ivgeni Segal, Kexun Zhang, Shashwat Saxena, Aditi Raghunathan, and Ziqian Zhong. Terminal wrench: A dataset of 331 reward-hackable environments and 3,632 exploit trajectories.arXiv preprint arXiv:2604.17596, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[17]

MICo: Improved representations via sampling-based state similarity for Markov decision processes

Pablo Samuel Castro, Tyler Kastner, Prakash Panangaden, and Mark Rowland. MICo: Improved representations via sampling-based state similarity for Markov decision processes. Advances in Neural Information Processing Systems, 34:30113–30126, 2021

2021
[18]

Sliced and Radon Wasserstein barycenters of measures.Journal of Mathematical Imaging and Vision, 51(1): 22–45, 2015

Nicolas Bonneel, Julien Rabin, Gabriel Peyré, and Hanspeter Pfister. Sliced and Radon Wasserstein barycenters of measures.Journal of Mathematical Imaging and Vision, 51(1): 22–45, 2015. 20

2015

[1] [1]

Concrete Problems in AI Safety

Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety.arXiv preprint arXiv:1606.06565, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[2] [2]

The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models

Alexander Pan, Kush Bhatia, and Jacob Steinhardt. The effects of reward misspecification: Mapping and mitigating misaligned models.arXiv preprint arXiv:2201.03544, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[3] [3]

Reward tampering problems and solutions in reinforcement learning: A causal influence diagram perspective

Tom Everitt, Marcus Hutter, Ramana Kumar, and Victoria Krakovna. Reward tampering problems and solutions in reinforcement learning: A causal influence diagram perspective. Synthese, 198(Suppl 27):6435–6467, 2021

2021

[4] [4]

Bisimulation metrics for continuous Markov decision processes.SIAM Journal on Computing, 40(6):1662–1714, 2011

Norm Ferns, Prakash Panangaden, and Doina Precup. Bisimulation metrics for continuous Markov decision processes.SIAM Journal on Computing, 40(6):1662–1714, 2011

2011

[5] [5]

Density-based clustering based on hierarchical density estimates

Ricardo JGB Campello, Davoud Moulavi, and Jörg Sander. Density-based clustering based on hierarchical density estimates. InPacific-Asia Conference on Knowledge Discovery and Data Mining, pages 160–172. Springer, 2013

2013

[6] [6]

Attention is all you need.Advances in Neural Information Processing Systems, 30, 2017

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in Neural Information Processing Systems, 30, 2017

2017

[7] [7]

UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction

Leland McInnes, John Healy, and James Melville. UMAP: Uniform manifold approximation and projection for dimension reduction.arXiv preprint arXiv:1802.03426, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[8] [8]

SMART: Robust and efficient fine-tuning for pre-trained natural language models through principled regularized optimization

Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Tuo Zhao. SMART: Robust and efficient fine-tuning for pre-trained natural language models through principled regularized optimization. InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2177–2190, 2020

2020

[9] [9]

FreeLB: Enhanced adversarial training for natural language understanding.arXiv preprint arXiv:1909.11764, 2019

Chen Zhu, Yu Cheng, Zhe Gan, Siqi Sun, Tom Goldstein, and Jingjing Liu. FreeLB: Enhanced adversarial training for natural language understanding.arXiv preprint arXiv:1909.11764, 2019

work page arXiv 1909

[10] [10]

A simple framework for contrastive learning of visual representations

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. InInternational Conference on Machine Learning, pages 1597–1607. PMLR, 2020

2020

[11] [11]

Defining and characterizing reward gaming.Advances in Neural Information Processing Systems, 35: 9460–9471, 2022

Joar Skalse, Nikolaus Howe, Dmitrii Krasheninnikov, and David Krueger. Defining and characterizing reward gaming.Advances in Neural Information Processing Systems, 35: 9460–9471, 2022

2022

[12] [12]

Specification gaming: The flip side of AI ingenuity.DeepMind Blog, 2020

Victoria Krakovna, Jonathan Uesato, Vladimir Mikulik, Matthew Rahtz, Tom Everitt, Ralph Bolber, Marcus Hutter, and Shane Legg. Specification gaming: The flip side of AI ingenuity.DeepMind Blog, 2020. URL https://deepmindsafetyresearch.medium.com/ specification-gaming-the-flip-side-of-ai-ingenuity-c85bdb0deeb4

2020

[13] [13]

Programming as theory building.Microprocessing and Microprogramming, 15(5): 253–261, 1985

Peter Naur. Programming as theory building.Microprocessing and Microprogramming, 15(5): 253–261, 1985

1985

[14] [14]

Categorizing Variants of Goodhart's Law

David Manheim and Scott Garrabrant. Categorizing variants of Goodhart’s law.arXiv preprint arXiv:1803.04585, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[15] [15]

Towards understanding sycophancy in language models

Mrinank Sharma, Meg Tong, Tomek Korbak, David Duvenaud, Amanda Askell, Sam Bow- man, Esin Durmus, Zac Hatfield-Dodds, Scott Johnston, Shauna Kravec, et al. Towards understanding sycophancy in language models. InInternational Conference on Learning Representations, volume 2024, pages 110–144, 2024. 19

2024

[16] [16]

Terminal Wrench: A Dataset of 331 Reward-Hackable Environments and 3,632 Exploit Trajectories

Ivan Bercovich, Ivgeni Segal, Kexun Zhang, Shashwat Saxena, Aditi Raghunathan, and Ziqian Zhong. Terminal wrench: A dataset of 331 reward-hackable environments and 3,632 exploit trajectories.arXiv preprint arXiv:2604.17596, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[17] [17]

MICo: Improved representations via sampling-based state similarity for Markov decision processes

Pablo Samuel Castro, Tyler Kastner, Prakash Panangaden, and Mark Rowland. MICo: Improved representations via sampling-based state similarity for Markov decision processes. Advances in Neural Information Processing Systems, 34:30113–30126, 2021

2021

[18] [18]

Sliced and Radon Wasserstein barycenters of measures.Journal of Mathematical Imaging and Vision, 51(1): 22–45, 2015

Nicolas Bonneel, Julien Rabin, Gabriel Peyré, and Hanspeter Pfister. Sliced and Radon Wasserstein barycenters of measures.Journal of Mathematical Imaging and Vision, 51(1): 22–45, 2015. 20

2015