arxiv: 2605.15960 · v2 · pith:OZPPKI72new · submitted 2026-05-15 · 💻 cs.AI · cs.LG

Imperfect World Models are Exploitable

Pith reviewed 2026-05-20 18:47 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords model exploitation · reward hacking · reinforcement learning · world models · safe planning · policy sets

The pith

Imperfect world models in reinforcement learning are exploitable: a model can rank one policy above another while the true environment ranks them the other way.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a formal definition of model exploitation where a world model prefers one policy over another but the true transition model does not. It develops a general theory showing that such exploitation is essentially unavoidable when the set of policies is large, and this theory also covers reward hacking as a special case. The authors find that finite policy set conditions preventing reward hacking do not apply to exploitation. They propose a relaxed definition of exploitation and identify a safe planning horizon where it can be avoided, bridging reward hacking and model exploitation while highlighting limits in safe planning with imperfect models.

Core claim

We propose a novel definition of model exploitation in reinforcement learning. Informally, a world model is exploitable if it implies that one policy should be strictly preferred over another while the environment's true transition model implies the reverse. We analogize our definition with a prior characterization of reward hacking but show that the associated proof of inevitability does not transfer to exploitation. To overcome this obstruction, we develop a general theory of reward hacking and model exploitation that proves that exploitation is essentially unavoidable on large policy sets and yields the corresponding claim for hacking as a special case. Unfortunately, we also find that the conditions that guarantee unhackability in finite policy sets have no counterpart that precludes exploitation.
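
The informal definition above can be written out symbolically. The following is a sketch in assumed notation, not the paper's own: $T$ is the true transition model, $\hat{T}$ the world model, $\Pi$ the policy set, and $J_{M}(\pi)$ the expected return of policy $\pi$ when transitions follow $M$ (with the reward function held fixed).

```latex
\hat{T} \text{ is exploitable relative to } T \text{ on } \Pi
\quad\Longleftrightarrow\quad
\exists\, \pi, \pi' \in \Pi :\;
J_{\hat{T}}(\pi) > J_{\hat{T}}(\pi')
\;\text{ and }\;
J_{T}(\pi) < J_{T}(\pi').
```

Reading "the reverse" as a strict inequality in the opposite direction is itself an assumption here; the paper's formal statement may treat ties differently.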

What carries the argument

The general theory of reward hacking and model exploitation that proves unavoidability via combinatorial arguments on sufficiently large policy sets.

If this is right

  • Exploitation cannot be ruled out in large policy sets by the same finite-set conditions that prevent reward hacking.
  • A relaxed version of exploitation can be avoided only inside a derived safe planning horizon.
  • Reward hacking is recovered as a special case inside the same general theory.
  • Standard planning with imperfect world models therefore has inherent limits once policy spaces grow large.
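
The safe-horizon idea in the list above can be illustrated numerically. The sketch below rests on assumptions that are not taken from the paper: a pair of policies counts as exploited at horizon H when the model prefers one by more than `eps` while the true model prefers the other by more than `eps`, returns are undiscounted H-step sums, and the toy chain and its numbers are hypothetical.

```python
import itertools


def horizon_value(T, R, policy, horizon, start=0):
    """Undiscounted expected return over `horizon` steps from `start`.

    T[s][a] is a list of next-state probabilities; R[s][a] is a scalar reward.
    """
    n = len(T)
    v = [0.0] * n
    for _ in range(horizon):
        v = [
            R[s][policy[s]] + sum(p * v[t] for t, p in enumerate(T[s][policy[s]]))
            for s in range(n)
        ]
    return v[start]


def eps_exploited(T_true, T_model, R, horizon, eps):
    """Assumed relaxed test: some pair is preferred by more than eps under the
    model and dis-preferred by more than eps under the true transitions."""
    n_states, n_actions = len(T_true), len(T_true[0])
    policies = list(itertools.product(range(n_actions), repeat=n_states))
    v_true = [horizon_value(T_true, R, pi, horizon) for pi in policies]
    v_model = [horizon_value(T_model, R, pi, horizon) for pi in policies]
    for i, j in itertools.permutations(range(len(policies)), 2):
        if v_model[i] - v_model[j] > eps and v_true[j] - v_true[i] > eps:
            return True
    return False


def largest_safe_horizon(T_true, T_model, R, eps, max_horizon):
    """Largest H such that no horizon in 1..H is eps-exploited (0 if none)."""
    safe = 0
    for h in range(1, max_horizon + 1):
        if eps_exploited(T_true, T_model, R, h, eps):
            break
        safe = h
    return safe


# Hypothetical 3-state chain. In state 0, action 0 collects a small safe reward
# and stays; action 1 moves to state 1, which pays 1 per step until it decays
# into the absorbing zero-reward state 2.
T_TRUE = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.0, 0.5, 0.5], [0.0, 0.5, 0.5]],  # true survival probability 0.5
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
]
T_MODEL = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.0, 0.9, 0.1], [0.0, 0.9, 0.1]],  # model overestimates survival
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
]
REWARD = [[0.5, 0.0], [1.0, 1.0], [0.0, 0.0]]
```

Calling `largest_safe_horizon(T_TRUE, T_MODEL, REWARD, eps=0.1, max_horizon=10)` scans horizons in increasing order and stops at the first exploited one. In this toy chain the model's optimism about state 1 only matters once the horizon is long enough for the overestimated survival to outweigh the safe reward.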

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Model-based agents may require explicit mechanisms to detect or bound model-reversal pairs rather than relying on policy-set size alone.
  • The result suggests that scaling policy spaces in real-world applications increases vulnerability even when model error is small.
  • Testable extensions could include measuring the shortest safe horizon in grid-world or continuous-control environments with controlled model mismatch.

Load-bearing premise

The policy set must be large enough for the combinatorial argument to establish that exploitation is unavoidable.

What would settle it

Constructing a large discrete policy space together with a mildly inaccurate world model and checking whether there always exists at least one pair of policies whose preference order reverses between the model and the true transitions.
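
A minimal sketch of such a check, under stated assumptions: a small tabular MDP, the policy set taken to be all deterministic stationary policies, discounted return from a fixed start state, and a hypothetical `perturb` helper that mixes each transition row with random noise to stand in for a "mildly inaccurate" model. None of these choices come from the paper.

```python
import itertools
import random


def policy_value(T, R, policy, gamma, start=0, sweeps=500):
    """Discounted value of a deterministic policy from `start`.

    T[s][a] is a list of next-state probabilities; R[s][a] is a scalar reward.
    Uses plain iterative policy evaluation (an illustrative choice).
    """
    n = len(T)
    v = [0.0] * n
    for _ in range(sweeps):
        v = [
            R[s][policy[s]]
            + gamma * sum(p * v[t] for t, p in enumerate(T[s][policy[s]]))
            for s in range(n)
        ]
    return v[start]


def random_mdp(n_states, n_actions, rng):
    """Random normalised transition rows and rewards in [0, 1)."""
    T = []
    for _ in range(n_states):
        rows = []
        for _ in range(n_actions):
            w = [rng.random() for _ in range(n_states)]
            total = sum(w)
            rows.append([x / total for x in w])
        T.append(rows)
    R = [[rng.random() for _ in range(n_actions)] for _ in range(n_states)]
    return T, R


def perturb(T, eps, rng):
    """Hypothetical inaccurate model: (1 - eps) * T + eps * random noise."""
    noisy = []
    for rows in T:
        new_rows = []
        for row in rows:
            w = [rng.random() for _ in row]
            total = sum(w)
            new_rows.append(
                [(1 - eps) * p + eps * x / total for p, x in zip(row, w)]
            )
        noisy.append(new_rows)
    return noisy


def reversed_pairs(T_true, T_model, R, gamma=0.9, margin=1e-9):
    """Policy pairs ranked strictly one way by the model, the other way by T_true."""
    n_states, n_actions = len(T_true), len(T_true[0])
    policies = list(itertools.product(range(n_actions), repeat=n_states))
    v_true = [policy_value(T_true, R, pi, gamma) for pi in policies]
    v_model = [policy_value(T_model, R, pi, gamma) for pi in policies]
    pairs = []
    for i, j in itertools.combinations(range(len(policies)), 2):
        d_model = v_model[i] - v_model[j]
        d_true = v_true[i] - v_true[j]
        # The margin guards against treating numerical ties as strict preferences.
        if abs(d_model) > margin and abs(d_true) > margin and d_model * d_true < 0:
            pairs.append((policies[i], policies[j]))
    return pairs
```

For instance, `T, R = random_mdp(4, 3, random.Random(0))` followed by `reversed_pairs(T, perturb(T, 0.05, random.Random(1)), R)` lists the reversed pairs among the 81 deterministic policies. Whether that list is non-empty for a given seed and noise level is exactly what the experiment would measure; exhaustive enumeration only scales to small state and action counts.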

Figures

Figures reproduced from arXiv: 2605.15960 by David Abel, Esmeralda S. Whitammer, Logan Mondal Bhamidipaty, Mykel J. Kochenderfer, Subramanian Ramamoorthy.

Figure 1: Taxonomy of transition model relationships in a 3-state MDP […]
Figure 2: Gradients for the value curves in […]
Figure 3: Examples of ε-exploitability and a contour plot for the safe horizon. (a, b) The exploitable transition model pairs from […]
Original abstract

We propose a novel definition of model exploitation in reinforcement learning. Informally, a world model is exploitable if it implies that one policy should be strictly preferred over another while the environment's true transition model implies the reverse. We analogize our definition with a prior characterization of reward hacking but show that the associated proof of inevitability does not transfer to exploitation. To overcome this obstruction, we develop a general theory of reward hacking and model exploitation that proves that exploitation is essentially unavoidable on large policy sets and yields the corresponding claim for hacking as a special case. Unfortunately, we also find that the conditions that guarantee unhackability in finite policy sets have no counterpart that precludes exploitation. Consequently, we introduce a relaxed notion of exploitation and derive a safe horizon within which it can be avoided. Taken together, our results establish a formal bridge between reward hacking and model exploitation and elucidate the limits of safe planning in world models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes a definition of model exploitation in RL (a world model is exploitable if it strictly prefers one policy over another while the true transitions reverse the preference), shows that the reward-hacking inevitability proof does not transfer, develops a general theory proving exploitation is essentially unavoidable on large policy sets (with hacking as special case), observes that finite-set unhackability conditions have no counterpart for exploitation, and introduces a relaxed exploitation notion together with a safe horizon within which it can be avoided.

Significance. If the combinatorial argument is made rigorous with explicit conditions, the work would establish a useful formal bridge between reward hacking and model exploitation and clarify fundamental limits on safe planning with imperfect world models in RL. The generalization of unavoidability results and the relaxed safe-horizon construction are potentially valuable for robustness research.

major comments (1)
  1. [Development of the general theory (after the observation that the reward-hacking proof does not transfer)] The central unavoidability claim for exploitation rests on a combinatorial counting argument that applies only once the policy set is 'large enough,' yet no explicit minimal cardinality bound is supplied and no verification is given that typical structured or parameterized policy classes (e.g., neural-network policies) satisfy the required conditions. This is load-bearing for the claim that exploitation is 'essentially unavoidable' and that finite unhackability conditions have 'no counterpart.'
minor comments (1)
  1. [Section introducing the relaxed notion] Notation for the relaxed exploitation notion and the safe horizon could be introduced with an explicit equation or definition to improve readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading and constructive feedback. We address the single major comment point by point below, providing the strongest honest response consistent with the manuscript.

Point-by-point responses
  1. Referee: [Development of the general theory (after the observation that the reward-hacking proof does not transfer)] The central unavoidability claim for exploitation rests on a combinatorial counting argument that applies only once the policy set is 'large enough,' yet no explicit minimal cardinality bound is supplied and no verification is given that typical structured or parameterized policy classes (e.g., neural-network policies) satisfy the required conditions. This is load-bearing for the claim that exploitation is 'essentially unavoidable' and that finite unhackability conditions have 'no counterpart.'

    Authors: We thank the referee for highlighting the need for greater precision here. The combinatorial argument proceeds by showing that the number of possible preference orderings inducible by transition models is finite and bounded (at most exponential in the size of the state-action space), so that once the policy set exceeds this number, a pigeonhole argument forces the existence of at least one exploitable pair. While the main text and abstract emphasize the qualitative conclusion for 'large' sets, the appendix proof already contains the dependence on cardinality; we will state the threshold (in terms of the number of distinct transition functions) explicitly in the main body of the revised manuscript. Regarding structured or parameterized classes such as neural-network policies, the result is stated for arbitrary policy sets and therefore applies whenever a given parameterization induces a sufficiently large effective set of distinct policies. We will add a clarifying paragraph noting that overparameterized networks typically realize large policy sets in practice, while acknowledging that a fully rigorous embedding of specific architectures into the counting argument is left for future work. These changes will strengthen the presentation of the unavoidability claim and the contrast with finite-set unhackability without altering the core theorems.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is conditional on explicit large-set assumption without self-referential reduction

full rationale

The paper first defines model exploitation by direct analogy to a prior reward-hacking characterization, explicitly notes that the existing inevitability proof does not transfer, and then constructs a separate combinatorial argument that holds only when the policy set is large enough for a counting argument to produce reversing preference pairs. This cardinality condition is introduced as an assumption rather than derived from the definition itself. No equation equates a derived quantity to a fitted parameter or prior result by construction, no self-citation supplies the uniqueness or ansatz for the central claim, and the subsequent relaxed exploitation notion plus safe-horizon bound are obtained by standard relaxation of the same combinatorial setup. The overall chain therefore remains self-contained against external benchmarks and does not reduce to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on standard RL transition models and policy sets plus the new exploitation definition; no free parameters are fitted to data, but the size of the policy set functions as a key domain assumption for the unavoidability result.

axioms (1)
  • domain assumption: The policy set is large enough for the combinatorial argument establishing unavoidability to apply
    Invoked when proving that exploitation is essentially unavoidable on large policy sets after noting that the reward-hacking proof does not transfer.
invented entities (1)
  • Model exploitation (no independent evidence)
    purpose: Formal characterization of when a world model implies the wrong policy preference relative to the true transition model
    New definition introduced to analogize with reward hacking; no independent evidence provided beyond the definition itself.

pith-pipeline@v0.9.0 · 5701 in / 1380 out tokens · 49687 ms · 2026-05-20T18:47:30.307209+00:00 · methodology

