Signed Compression Progress on a Sealed Audit is Goodhart-Resistant

Ayush Mittal; Dhruv Gupta

arxiv: 2606.11417 · v1 · pith:BZ2N6ZMLnew · submitted 2026-06-09 · 💻 cs.LG · cs.AI· stat.ML

Signed Compression Progress on a Sealed Audit is Goodhart-Resistant

Ayush Mittal , Dhruv Gupta This is my paper

Pith reviewed 2026-06-27 13:50 UTC · model grok-4.3

classification 💻 cs.LG cs.AIstat.ML

keywords compression progressintrinsic motivationsealed audittelescoping rewarduniform deviation boundGoodhart resistanceARC-TGIreinforcement learning

0 comments

The pith

Signed decrease of a fixed sealed audit loss makes cumulative intrinsic reward telescope exactly to endpoint audit improvement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that defining intrinsic reward as the signed reduction in a fixed sealed audit loss causes total reward to equal the net change in that loss value. This prevents any training policy from collecting high rewards while the audited performance stays flat or declines. With a finite audit panel the same equality holds up to an additive uniform deviation term that depends only on panel size and model class. The result requires no horizon bound and survives adaptive behavior over time.

Core claim

Defining intrinsic reward as r_t = E(theta_{t-1}) - E(theta_t) where E is a fixed sealed audit loss causes the cumulative reward sum to telescope directly to E(theta_0) - E(theta_T). For finite audits the empirical version of this sum is bounded by the true improvement plus 2 Delta_n(F, delta), the uniform deviation of the model class. The guarantee is therefore horizon-free once the sealed panel controls the class uniformly.

What carries the argument

The signed decrease r_t = E(theta_{t-1}) - E(theta_t) of the fixed sealed audit loss E, which produces exact telescoping of cumulative reward.

If this is right

Cumulative reward cannot grow indefinitely without corresponding audit improvement.
The bound holds regardless of training horizon or policy adaptivity.
Clipping the progress signal, reusing the audit panel, or applying it to a neural class with vacuous Delta_n removes the guarantee.
Experiments on ARC-TGI confirm that deviation scales roughly as n to the minus 0.5 and that the signed signal resists clip-farming and stream-leakage attacks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same telescoping construction could be tested on other intrinsic signals such as prediction error or empowerment.
Designing new sealed-audit protocols for larger model classes would make the deviation bound practical for modern reinforcement learning.
The failure-mode list suggests concrete defenses such as one-time panels and capacity-limited auditors that could be validated on additional benchmarks.

Load-bearing premise

The audit loss E must stay fixed and sealed while the model class admits a non-vacuous uniform deviation bound over the audit panel.

What would settle it

An agent that collects cumulative signed reward exceeding the observed audit improvement by more than twice the measured uniform deviation on the given panel size.

Figures

Figures reproduced from arXiv: 2606.11417 by Ayush Mittal, Dhruv Gupta.

**Figure 1.** Figure 1: The measurement frame. Training data may be selected adaptively, but reward is computed only from the signed change in a fixed audit loss. This makes intrinsic reward an endpoint accounting identity over the sealed audit. 2 Related work Goodhart’s law and reward hacking. Optimizing a proxy until it diverges from the target it stands for is Goodhart’s law [14], and its learned-agent form is reward hacking o… view at source ↗

**Figure 2.** Figure 2: Boundary tests from the experiment artifact. Stream scoring: stream-scored progress can exceed sealed-audit progress by roughly 40×. Clipping: clipped reward farms a positive cumulative signal while signed reward equals endpoint change. Panel memorization: direct reusable-panel training yields positive apparent progress and negative true progress. Calibration: temperature scaling changes log-loss/ECE at fi… view at source ↗

**Figure 3.** Figure 3: Finite-audit concentration: empirical uniform deviation of floored cross-entropy over a finite family decays as ∆n ∝ n −0.527, close to the n −1/2 prediction. The observed slope is −0.527 over 20 seeds, matching the predicted n −1/2 rate; the Hoeffding constant itself is loose. The false-positive budget is therefore an empirically measurable quantity. 7.4 Reward-signal ablation under noise distractors The … view at source ↗

**Figure 4.** Figure 4: Reward-signal ablation with the same scheduler. Final active-cell reconstruction accuracy (left axis) and distractor allocation fraction (right axis) per reward signal, 20 seeds. Audit-CP is the strongest non-oracle reward signal and spends less than half the distractor budget of prediction-error curiosity. 8 [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Sealed-panel learning curves for an audit-CP run. Audit NLL (left axis) falls, equivalently cumulative audit compression progress rises, while cell and active-cell reconstruction accuracy (right axis) improve over training. The signed audit-CP signal tracks downstream reconstruction. One-step log-loss changes differ from one-step hard-accuracy changes; the reward-signal ablation reports both metrics. 7.5 A… view at source ↗

**Figure 6.** Figure 6: Adaptive scalar-feedback overfitting (single-cell calibration, 20 seeds; error bars are one standard deviation). Left: the attacker has power against naive reusable release, with gap 3.24±0.07 > 2∆n = 1.64 and a win in all 20 seeds. Fresh subsampling, laddering, rounding, and one-shot release keep the gap below threshold. Right: the gap falls below threshold once panel capacity exceeds the query budget, cr… view at source ↗

**Figure 7.** Figure 7: Proof architecture. Signed audit progress is an endpoint potential drop, which keeps the mechanized theorems short. The finite-class and ARC experiments instantiate the structural result; the boundary experiments mark assumptions that cannot be dropped. Paper result Lean declaration Meaning T1 exact telescoping cumCP_telescope P t (E(gt) − E(gt+1)) = E(g0) − E(gT ) T1 finite budget cumCP_le_of_lb If the te… view at source ↗

read the original abstract

Compression progress is a long-standing proposal for intrinsic motivation: reward an agent when its world model becomes better at predicting or compressing experience. The folk claim is that this reward is "credible" because it is paid only for learning. We make this precise and prove it. If intrinsic reward is the signed decrease of a fixed sealed-audit loss, r_t = E(theta_{t-1}) - E(theta_t), then cumulative reward telescopes exactly to endpoint audit improvement, so no policy can push reward up indefinitely while true audit performance stagnates or degrades. For finite audit panels the same result holds with a sharp false-positive budget: cumulative empirical reward is at most true audit improvement plus 2 Delta_n(F, delta), the uniform audit deviation of the model class. This is horizon-free: adaptivity over time costs nothing once the sealed panel uniformly controls the class. The theorem also identifies the failure modes: the guarantee disappears if progress is clipped, scored on the agent's own stream, exposed to a high-capacity model on a reusable panel, or applied to a neural class that makes Delta_n vacuous. We give a Lean 4 mechanization of the structural core (telescoping, the finite-audit bound, finite Gibbs, and the entropy floor) and an experiment suite on ARC-TGI grid-transformation generators with adaptive holdout attacks. Experiments confirm the theory: finite-audit deviation scales as n^{-0.527}; signed progress resists clip-farming, stream leakage, and noisy-TV curiosity; naive reusable audits are exploitable by black-box scalar feedback, while standard release defenses keep the attack below the 2 Delta_n threshold. Signed compression progress on a sealed audit is an accounting signal of genuine improvement.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper pins down the telescoping property of signed compression progress on a fixed sealed audit and backs it with Lean proofs plus experiments that check the stated failure modes.

read the letter

The core contribution is a clean accounting identity: if reward is defined as the signed drop in a fixed sealed loss E, then cumulative reward equals endpoint improvement plus at most 2 Delta_n for finite panels. They mechanize the telescoping sum, the finite-audit bound, finite Gibbs, and entropy floor in Lean 4, and they run experiments on ARC-TGI generators that directly test clipping, stream leakage, reusable panels, and noisy-TV attacks.

What stands out is that the paper states the conditions under which the guarantee fails and then checks those conditions empirically. The deviation scales roughly as n to the -0.5, reusable audits are shown to be exploitable, and the defended versions stay under the bound. That combination of formal core and targeted stress tests is useful.

The main limitation is practical reach. The uniform deviation bound Delta_n is only useful when the model class is simple enough for it to be non-vacuous; for typical neural nets it may not give tight control. The experiments stay on grid-transformation tasks, so it is not yet clear how the same construction behaves in richer environments or with larger models. The sealed-audit assumption is also strong and may be hard to maintain in open settings.

This is worth reading for anyone working on intrinsic motivation or Goodhart issues in RL. The formalization and the explicit failure analysis make it a solid candidate for peer review even if the practical scope is still narrow.

Referee Report

0 major / 3 minor

Summary. The paper claims that if intrinsic reward is defined as the signed decrease of a fixed sealed-audit loss E, r_t = E(θ_{t-1}) - E(θ_t), then cumulative reward telescopes exactly to endpoint audit improvement, preventing indefinite reward inflation without genuine progress. For finite audit panels the same identity holds up to a uniform deviation bound of 2Δ_n(F, δ). The work supplies a Lean 4 mechanization of the telescoping identity, finite-audit bound, finite Gibbs, and entropy floor, together with ARC-TGI experiments that test clipping, stream leakage, reusable panels, and noisy-TV attacks, explicitly listing the conditions (clipping, reusable panel, vacuous Δ_n) under which the guarantee fails.

Significance. If the central telescoping identity and uniform bound hold, the result supplies a parameter-free, horizon-free accounting identity that converts a sealed audit into a Goodhart-resistant intrinsic reward, with machine-checked proofs and targeted experiments that directly probe the listed failure modes. This strengthens the theoretical basis for compression-progress rewards in RL by making the resistance claim falsifiable and by quantifying the false-positive budget for finite panels.

minor comments (3)

[§3] §3 (uniform deviation): the statement of Δ_n(F, δ) should explicitly reference the covering-number or Rademacher complexity argument used to obtain the n^{-0.527} empirical rate, so readers can verify the constant factors in the 2Δ_n budget.
[Table 1] Table 1 (ARC-TGI results): the reusable-panel attack row reports deviation above 2Δ_n; confirm that the black-box scalar feedback is exactly the setting excluded by the sealed-audit assumption, or add a column showing the defended release case.
The Lean mechanization is cited but the repository or theorem names are not listed in the main text; add a short appendix table mapping each mechanized lemma to the corresponding paper equation.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. The report correctly captures the telescoping identity, the 2 Delta_n bound, the Lean mechanization, and the enumerated failure modes.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper defines intrinsic reward explicitly as the signed decrease r_t = E(theta_{t-1}) - E(theta_t) on a fixed sealed audit loss and shows that the cumulative sum therefore equals the total endpoint improvement by the telescoping property of summation. This is presented as a direct algebraic identity (with an added uniform deviation bound for finite panels), not as a derived prediction or first-principles result obtained from other inputs. The paper states the conditions under which the guarantee holds or fails, supplies a Lean 4 mechanization of the identity and bounds, and reports experiments testing the listed failure modes. No load-bearing step reduces by construction to a fitted parameter, self-citation chain, or smuggled ansatz; the derivation is self-contained against the stated assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Review is abstract-only; the ledger is therefore minimal and provisional.

axioms (2)

domain assumption The audit loss E is a fixed, sealed function independent of the agent's policy and data stream.
Required for the telescoping identity to hold exactly.
domain assumption The model class F admits a uniform deviation bound Delta_n(F, delta) that holds simultaneously for all models in the class.
Needed for the finite-audit false-positive budget.

pith-pipeline@v0.9.1-grok · 5849 in / 1419 out tokens · 20826 ms · 2026-06-27T13:50:31.186321+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

18 extracted references · 7 linked inside Pith

[1]

Schapire

Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem.SIAM Journal on Computing, 32(1):48–77, 2002

2002
[2]

The Ladder: A reliable leaderboard for machine learning competitions

Avrim Blum and Moritz Hardt. The Ladder: A reliable leaderboard for machine learning competitions. arXiv:1502.04585, 2015

Pith/arXiv arXiv 2015
[3]

Exploration by random network distillation

Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. arXiv:1810.12894, 2018

Pith/arXiv arXiv 2018
[4]

ARC Prize 2025: Technical report

François Chollet, Mike Knoop, Gregory Kamradt, and Bryan Landers. ARC Prize 2025: Technical report. arXiv:2601.10904, 2026

arXiv 2025
[5]

Generalization in adaptive data analysis and holdout reuse

Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. Generalization in adaptive data analysis and holdout reuse. arXiv:1506.02629, 2015

Pith/arXiv arXiv 2015
[6]

Zarin Nishat, Dhananjay Bhandiwad, Andrei Aioanei, and Sahar Vahdati

Jens Lehmann, Syeda Khushbakht, Nikoo Salehfard, Nur A. Zarin Nishat, Dhananjay Bhandiwad, Andrei Aioanei, and Sahar Vahdati. ARC-TGI: Human-validated task generators with reasoning chain templates for ARC-AGI. arXiv:2603.05099, 2026

arXiv 2026
[7]

Efros, and Trevor Darrell

Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven exploration by self- supervised prediction. arXiv:1705.05363, 2017

Pith/arXiv arXiv 2017
[8]

A possibility for implementing curiosity and boredom in model-building neural controllers

Jürgen Schmidhuber. A possibility for implementing curiosity and boredom in model-building neural controllers. InFrom Animals to Animats: Proceedings of the First International Conference on Simulation of Adaptive Behavior, MIT Press/Bradford Books, 1991

1991
[9]

Jürgen Schmidhuber. Driven by compression progress: A simple principle explains essential aspects of subjective beauty, novelty, surprise, interestingness, attention, curiosity, creativity, art, science, music, jokes. arXiv:0812.4360, 2008

Pith/arXiv arXiv 2008
[10]

Formal theory of creativity, fun, and intrinsic motivation (1990–2010).IEEE Transactions on Autonomous Mental Development, 2(3):230–247, 2010

Jürgen Schmidhuber. Formal theory of creativity, fun, and intrinsic motivation (1990–2010).IEEE Transactions on Autonomous Mental Development, 2(3):230–247, 2010

1990
[11]

Concrete problems in AI safety

Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. arXiv:1606.06565, 2016

Pith/arXiv arXiv 2016
[12]

Tilmann Gneiting and Adrian E. Raftery. Strictly proper scoring rules, prediction, and estimation.Journal of the American Statistical Association, 102(477):359–378, 2007

2007
[13]

Joel Lehman, Jeff Clune, Dusan Misevic, and others. The surprising creativity of digital evolution: A collection of anecdotes from the evolutionary computation and artificial life research communities.Artificial Life, 26(2):274– 306, 2020

2020
[14]

Categorizing variants of Goodhart’s law

David Manheim and Scott Garrabrant. Categorizing variants of Goodhart’s law. arXiv:1803.04585, 2018

Pith/arXiv arXiv 2018
[15]

The effects of reward misspecification: Mapping and mitigating misaligned models

Alexander Pan, Kush Bhatia, and Jacob Steinhardt. The effects of reward misspecification: Mapping and mitigating misaligned models. InInternational Conference on Learning Representations (ICLR), 2022

2022
[16]

Do ImageNet classifiers generalize to ImageNet? InInternational Conference on Machine Learning (ICML), 2019

Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do ImageNet classifiers generalize to ImageNet? InInternational Conference on Machine Learning (ICML), 2019

2019
[17]

A meta-analysis of overfitting in machine learning

Rebecca Roelofs, Vaishaal Shankar, Benjamin Recht, Sara Fridovich-Keil, Moritz Hardt, John Miller, and Ludwig Schmidt. A meta-analysis of overfitting in machine learning. InAdvances in Neural Information Processing Systems (NeurIPS), 2019

2019
[18]

Joar Skalse, Nikolaus H. R. Howe, Dmitrii Krasheninnikov, and David Krueger. Defining and characterizing reward hacking. InAdvances in Neural Information Processing Systems (NeurIPS), 2022. 16

2022

[1] [1]

Schapire

Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem.SIAM Journal on Computing, 32(1):48–77, 2002

2002

[2] [2]

The Ladder: A reliable leaderboard for machine learning competitions

Avrim Blum and Moritz Hardt. The Ladder: A reliable leaderboard for machine learning competitions. arXiv:1502.04585, 2015

Pith/arXiv arXiv 2015

[3] [3]

Exploration by random network distillation

Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. arXiv:1810.12894, 2018

Pith/arXiv arXiv 2018

[4] [4]

ARC Prize 2025: Technical report

François Chollet, Mike Knoop, Gregory Kamradt, and Bryan Landers. ARC Prize 2025: Technical report. arXiv:2601.10904, 2026

arXiv 2025

[5] [5]

Generalization in adaptive data analysis and holdout reuse

Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. Generalization in adaptive data analysis and holdout reuse. arXiv:1506.02629, 2015

Pith/arXiv arXiv 2015

[6] [6]

Zarin Nishat, Dhananjay Bhandiwad, Andrei Aioanei, and Sahar Vahdati

Jens Lehmann, Syeda Khushbakht, Nikoo Salehfard, Nur A. Zarin Nishat, Dhananjay Bhandiwad, Andrei Aioanei, and Sahar Vahdati. ARC-TGI: Human-validated task generators with reasoning chain templates for ARC-AGI. arXiv:2603.05099, 2026

arXiv 2026

[7] [7]

Efros, and Trevor Darrell

Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven exploration by self- supervised prediction. arXiv:1705.05363, 2017

Pith/arXiv arXiv 2017

[8] [8]

A possibility for implementing curiosity and boredom in model-building neural controllers

Jürgen Schmidhuber. A possibility for implementing curiosity and boredom in model-building neural controllers. InFrom Animals to Animats: Proceedings of the First International Conference on Simulation of Adaptive Behavior, MIT Press/Bradford Books, 1991

1991

[9] [9]

Jürgen Schmidhuber. Driven by compression progress: A simple principle explains essential aspects of subjective beauty, novelty, surprise, interestingness, attention, curiosity, creativity, art, science, music, jokes. arXiv:0812.4360, 2008

Pith/arXiv arXiv 2008

[10] [10]

Formal theory of creativity, fun, and intrinsic motivation (1990–2010).IEEE Transactions on Autonomous Mental Development, 2(3):230–247, 2010

Jürgen Schmidhuber. Formal theory of creativity, fun, and intrinsic motivation (1990–2010).IEEE Transactions on Autonomous Mental Development, 2(3):230–247, 2010

1990

[11] [11]

Concrete problems in AI safety

Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. arXiv:1606.06565, 2016

Pith/arXiv arXiv 2016

[12] [12]

Tilmann Gneiting and Adrian E. Raftery. Strictly proper scoring rules, prediction, and estimation.Journal of the American Statistical Association, 102(477):359–378, 2007

2007

[13] [13]

Joel Lehman, Jeff Clune, Dusan Misevic, and others. The surprising creativity of digital evolution: A collection of anecdotes from the evolutionary computation and artificial life research communities.Artificial Life, 26(2):274– 306, 2020

2020

[14] [14]

Categorizing variants of Goodhart’s law

David Manheim and Scott Garrabrant. Categorizing variants of Goodhart’s law. arXiv:1803.04585, 2018

Pith/arXiv arXiv 2018

[15] [15]

The effects of reward misspecification: Mapping and mitigating misaligned models

Alexander Pan, Kush Bhatia, and Jacob Steinhardt. The effects of reward misspecification: Mapping and mitigating misaligned models. InInternational Conference on Learning Representations (ICLR), 2022

2022

[16] [16]

Do ImageNet classifiers generalize to ImageNet? InInternational Conference on Machine Learning (ICML), 2019

Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do ImageNet classifiers generalize to ImageNet? InInternational Conference on Machine Learning (ICML), 2019

2019

[17] [17]

A meta-analysis of overfitting in machine learning

Rebecca Roelofs, Vaishaal Shankar, Benjamin Recht, Sara Fridovich-Keil, Moritz Hardt, John Miller, and Ludwig Schmidt. A meta-analysis of overfitting in machine learning. InAdvances in Neural Information Processing Systems (NeurIPS), 2019

2019

[18] [18]

Joar Skalse, Nikolaus H. R. Howe, Dmitrii Krasheninnikov, and David Krueger. Defining and characterizing reward hacking. InAdvances in Neural Information Processing Systems (NeurIPS), 2022. 16

2022