Imperfect World Models are Exploitable
Pith reviewed 2026-05-20 18:47 UTC · model grok-4.3
The pith
Imperfect world models in reinforcement learning are exploitable by policies that reverse their ranking under the true environment.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose a novel definition of model exploitation in reinforcement learning. Informally, a world model is exploitable if it implies that one policy should be strictly preferred over another while the environment's true transition model implies the reverse. We analogize our definition with a prior characterization of reward hacking but show that the associated proof of inevitability does not transfer to exploitation. To overcome this obstruction, we develop a general theory of reward hacking and model exploitation that proves that exploitation is essentially unavoidable on large policy sets and yields the corresponding claim for hacking as a special case. Unfortunately, we also find thatthe
What carries the argument
The general theory of reward hacking and model exploitation that proves unavoidability via combinatorial arguments on sufficiently large policy sets.
If this is right
- Exploitation cannot be ruled out in large policy sets by the same finite-set conditions that prevent reward hacking.
- A relaxed version of exploitation can be avoided only inside a derived safe planning horizon.
- Reward hacking is recovered as a special case inside the same general theory.
- Standard planning with imperfect world models therefore has inherent limits once policy spaces grow large.
Where Pith is reading between the lines
- Model-based agents may require explicit mechanisms to detect or bound model-reversal pairs rather than relying on policy-set size alone.
- The result suggests that scaling policy spaces in real-world applications increases vulnerability even when model error is small.
- Testable extensions could include measuring the shortest safe horizon in grid-world or continuous-control environments with controlled model mismatch.
Load-bearing premise
The policy set must be large enough for the combinatorial argument to establish that exploitation is unavoidable.
What would settle it
Constructing a large discrete policy space together with a mildly inaccurate world model and checking whether there always exists at least one pair of policies whose preference order reverses between the model and the true transitions.
Figures
read the original abstract
We propose a novel definition of model exploitation in reinforcement learning. Informally, a world model is exploitable if it implies that one policy should be strictly preferred over another while the environment's true transition model implies the reverse. We analogize our definition with a prior characterization of reward hacking but show that the associated proof of inevitability does not transfer to exploitation. To overcome this obstruction, we develop a general theory of reward hacking and model exploitation that proves that exploitation is essentially unavoidable on large policy sets and yields the corresponding claim for hacking as a special case. Unfortunately, we also find that the conditions that guarantee unhackability in finite policy sets have no counterpart that precludes exploitation. Consequently, we introduce a relaxed notion of exploitation and derive a safe horizon within which it can be avoided. Taken together, our results establish a formal bridge between reward hacking and model exploitation and elucidate the limits of safe planning in world models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a definition of model exploitation in RL (a world model is exploitable if it strictly prefers one policy over another while the true transitions reverse the preference), shows that the reward-hacking inevitability proof does not transfer, develops a general theory proving exploitation is essentially unavoidable on large policy sets (with hacking as special case), observes that finite-set unhackability conditions have no counterpart for exploitation, and introduces a relaxed exploitation notion together with a safe horizon within which it can be avoided.
Significance. If the combinatorial argument is made rigorous with explicit conditions, the work would establish a useful formal bridge between reward hacking and model exploitation and clarify fundamental limits on safe planning with imperfect world models in RL. The generalization of unavoidability results and the relaxed safe-horizon construction are potentially valuable for robustness research.
major comments (1)
- [Development of the general theory (after the observation that the reward-hacking proof does not transfer)] The central unavoidability claim for exploitation rests on a combinatorial counting argument that applies only once the policy set is 'large enough,' yet no explicit minimal cardinality bound is supplied and no verification is given that typical structured or parameterized policy classes (e.g., neural-network policies) satisfy the required conditions. This is load-bearing for the claim that exploitation is 'essentially unavoidable' and that finite unhackability conditions have 'no counterpart.'
minor comments (1)
- [Section introducing the relaxed notion] Notation for the relaxed exploitation notion and the safe horizon could be introduced with an explicit equation or definition to improve readability.
Simulated Author's Rebuttal
We thank the referee for their careful reading and constructive feedback. We address the single major comment point by point below, providing the strongest honest response consistent with the manuscript.
read point-by-point responses
-
Referee: [Development of the general theory (after the observation that the reward-hacking proof does not transfer)] The central unavoidability claim for exploitation rests on a combinatorial counting argument that applies only once the policy set is 'large enough,' yet no explicit minimal cardinality bound is supplied and no verification is given that typical structured or parameterized policy classes (e.g., neural-network policies) satisfy the required conditions. This is load-bearing for the claim that exploitation is 'essentially unavoidable' and that finite unhackability conditions have 'no counterpart.'
Authors: We thank the referee for highlighting the need for greater precision here. The combinatorial argument proceeds by showing that the number of possible preference orderings inducible by transition models is finite and bounded (at most exponential in the size of the state-action space), so that once the policy set exceeds this number, a pigeonhole argument forces the existence of at least one exploitable pair. While the main text and abstract emphasize the qualitative conclusion for 'large' sets, the appendix proof already contains the dependence on cardinality; we will make the explicit threshold (in terms of the number of distinct transition functions) explicit in the main body of the revised manuscript. Regarding structured or parameterized classes such as neural-network policies, the result is stated for arbitrary policy sets and therefore applies whenever a given parameterization induces a sufficiently large effective set of distinct policies. We will add a clarifying paragraph noting that overparameterized networks typically realize large policy sets in practice, while acknowledging that a fully rigorous embedding of specific architectures into the counting argument is left for future work. These changes will strengthen the presentation of the unavoidability claim and the contrast with finite-set unhackability without altering the core theorems. revision: yes
Circularity Check
No significant circularity; derivation is conditional on explicit large-set assumption without self-referential reduction
full rationale
The paper first defines model exploitation by direct analogy to a prior reward-hacking characterization, explicitly notes that the existing inevitability proof does not transfer, and then constructs a separate combinatorial argument that holds only when the policy set is large enough for a counting argument to produce reversing preference pairs. This cardinality condition is introduced as an assumption rather than derived from the definition itself. No equation equates a derived quantity to a fitted parameter or prior result by construction, no self-citation supplies the uniqueness or ansatz for the central claim, and the subsequent relaxed exploitation notion plus safe-horizon bound are obtained by standard relaxation of the same combinatorial setup. The overall chain therefore remains self-contained against external benchmarks and does not reduce to its inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The policy set is large enough for the combinatorial argument establishing unavoidability to apply
invented entities (1)
-
Model exploitation
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Defining and Characterizing Reward Hacking , year =
Skalse, Joar and Howe, Nikolaus and Krasheninnikov, Dmitrii and Krueger, David , journal =. Defining and Characterizing Reward Hacking , year =
-
[2]
International Conference on Learning Representations (ICLR) , year=
Correlated proxies: A new definition and improved mitigation for reward hacking , author=. International Conference on Learning Representations (ICLR) , year=
-
[3]
Deisenroth, Marc and Rasmussen, Carl E , journal=
-
[4]
Journal of statistical mechanics: theory and experiment , volume=
Path integrals and symmetry breaking for optimal control theory , author=. Journal of statistical mechanics: theory and experiment , volume=
-
[5]
Neural Information Processing Systems (NIPS) , year=
Exploiting model uncertainty estimates for safe dynamic control learning , author=. Neural Information Processing Systems (NIPS) , year=
-
[6]
Artificial Intelligence and Statistics (AISTATS) , year=
A reduction of imitation learning and structured prediction to no-regret online learning , author=. Artificial Intelligence and Statistics (AISTATS) , year=
-
[7]
Journal of Mathematical Analysis and Applications , volume=
Optimal control of. Journal of Mathematical Analysis and Applications , volume=. 1965 , publisher=
work page 1965
-
[8]
A new measure of rank correlation , author=. Biometrika , volume=. 1938 , publisher=
work page 1938
-
[9]
Doyle, John , journal=. Guaranteed margins for. 1978 , publisher=
work page 1978
-
[10]
Cosmos World Foundation Model Platform for Physical AI
Cosmos world foundation model platform for physical ai , author=. arXiv preprint arXiv:2501.03575 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[11]
Balestriero, Randall and LeCun, Yann , journal=. Le
-
[12]
Neural Information Processing Systems (NIPS) , year=
Inverse reward design , author=. Neural Information Processing Systems (NIPS) , year=
-
[13]
Real and complex analysis , author=
-
[14]
Hilbert's fifth problem and related topics , author=. 2014 , publisher=
work page 2014
-
[15]
International Conference on Robotics and Automation (ICRA) , year=
Simulation-based reinforcement learning for real-world autonomous driving , author=. International Conference on Robotics and Automation (ICRA) , year=
-
[16]
2020 IEEE Symposium Series on Computational Intelligence (SSCI) , pages=
Sim-to-real transfer in deep reinforcement learning for robotics: a survey , author=. 2020 IEEE Symposium Series on Computational Intelligence (SSCI) , pages=. 2020 , organization=
work page 2020
-
[17]
Yu, Tianhe and Thomas, Garrett and Yu, Lantao and Ermon, Stefano and Zou, James Y and Levine, Sergey and Finn, Chelsea and Ma, Tengyu , journal=
-
[18]
European Conference on Artificial Life , pages=
Noise and the reality gap: The use of simulation in evolutionary robotics , author=. European Conference on Artificial Life , pages=. 1995 , organization=
work page 1995
-
[19]
Mathematics of Operations Research , volume=
Robust dynamic programming , author=. Mathematics of Operations Research , volume=. 2005 , publisher=
work page 2005
-
[20]
Nilim, Arnab and El Ghaoui, Laurent , journal=. Robust control of. 2005 , publisher=
work page 2005
-
[21]
Dyna, an integrated architecture for learning, planning, and reacting , author=. ACM Sigart Bulletin , volume=. 1991 , publisher=
work page 1991
-
[22]
Neural Information Processing Systems (NeurIPS) , year=
Sample-efficient reinforcement learning with stochastic ensemble value expansion , author=. Neural Information Processing Systems (NeurIPS) , year=
-
[23]
International Conference on Learning Representations (ICLR) , year=
Model-Ensemble Trust-Region Policy Optimization , author=. International Conference on Learning Representations (ICLR) , year=
-
[24]
Neural Information Processing Systems (NeurIPS) , year=
Proper value equivalence , author=. Neural Information Processing Systems (NeurIPS) , year=
- [25]
-
[26]
International Conference on Learning Representations (ICLR) , year=
Dream to control: Learning behaviors by latent imagination , author=. International Conference on Learning Representations (ICLR) , year=
- [27]
-
[28]
International Conference on Machine Learning (ICML) , year=
Goal Misgeneralization in Deep Reinforcement Learning , author=. International Conference on Machine Learning (ICML) , year=
-
[29]
Journal of Machine Learning Research , volume=
R-max -- a general polynomial time algorithm for near-optimal reinforcement learning , author=. Journal of Machine Learning Research , volume=
-
[30]
A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27 , author=. Open Review , volume=
work page 2022
- [31]
-
[32]
Mastering diverse control tasks through world models , volume =
Hafner, Danijar and Pasukonis, Jurgis and Ba, Jimmy and Lillicrap, Timothy , journal =. Mastering diverse control tasks through world models , volume =
-
[33]
Training Agents Inside of Scalable World Models , author=. 2025 , journal=
work page 2025
-
[34]
Lars Peter Hansen and Thomas J. Sargent , publisher =. Robustness , year =
-
[35]
Carnegie-Rochester conference series on public policy , volume=
Econometric policy evaluation: A critique , author=. Carnegie-Rochester conference series on public policy , volume=. 1976 , organization=
work page 1976
-
[36]
Kidambi, Rahul and Rajeswaran, Aravind and Netrapalli, Praneeth and Joachims, Thorsten , journal=
-
[37]
Neural Information Processing Systems (NeurIPS) , year=
When to trust your model: Model-based policy optimization , author=. Neural Information Processing Systems (NeurIPS) , year=
-
[38]
Neural Information Processing Systems (NeurIPS) , year=
Deep reinforcement learning in a handful of trials using probabilistic dynamics models , author=. Neural Information Processing Systems (NeurIPS) , year=
-
[39]
International Conference on Intelligent Robots and Systems (IROS) , year=
Domain randomization for transferring deep neural networks from simulation to the real world , author=. International Conference on Intelligent Robots and Systems (IROS) , year=
-
[40]
International Conference on Machine Learning (ICML) , year=
Learning latent dynamics for planning from pixels , author=. International Conference on Machine Learning (ICML) , year=
-
[41]
Neural Information Processing Systems (NeurIPS) , year =
Recurrent World Models Facilitate Policy Evolution , author =. Neural Information Processing Systems (NeurIPS) , year =
-
[42]
Journal of Economic Theory , volume=
The arbitrage theory of capital asset pricing , author=. Journal of Economic Theory , volume=
-
[43]
A practiced practice: Speedrunning through space with de Certeau and Virilio , author=. Game Studies , volume=
-
[44]
Approximation of large-scale dynamical systems , author=. 2005 , publisher=
work page 2005
-
[45]
IEEE Transactions on Systems Science and Cybernetics , volume=
A formal basis for the heuristic determination of minimum cost paths , author=. IEEE Transactions on Systems Science and Cybernetics , volume=. 1968 , publisher=
work page 1968
-
[46]
The Quarterly Journal of Economics , pages=
A behavioral model of rational choice , author=. The Quarterly Journal of Economics , pages=. 1955 , publisher=
work page 1955
-
[47]
Gemini: A Family of Highly Capable Multimodal Models
Gemini: a family of highly capable multimodal models , author=. arXiv preprint arXiv:2312.11805 , year=
work page internal anchor Pith review Pith/arXiv arXiv
- [48]
-
[49]
arXiv preprint arXiv:1812.01129 , year=
Mitigating planner overfitting in model-based reinforcement learning , author=. arXiv preprint arXiv:1812.01129 , year=
-
[50]
Neural Information Processing Systems (NIPS) , year=
Autonomous helicopter flight via reinforcement learning , author=. Neural Information Processing Systems (NIPS) , year=
-
[51]
Autonomous Agents and Multiagent Systems (AAMAS) , year=
The dependence of effective planning horizon on model accuracy , author=. Autonomous Agents and Multiagent Systems (AAMAS) , year=
-
[52]
The quiet revolution of numerical weather prediction , author=. Nature , volume=. 2015 , publisher=
work page 2015
-
[53]
International Conference on Machine Learning (ICML) , year=
Policy invariance under reward transformations: Theory and application to reward shaping , author=. International Conference on Machine Learning (ICML) , year=
-
[54]
Abel, David and Dabney, Will and Harutyunyan, Anna and Ho, Mark K and Littman, Michael and Precup, Doina and Singh, Satinder , journal=. On the expressivity of
-
[55]
The big world hypothesis and its ramifications for artificial intelligence , author=. Finding the Frame: An
-
[56]
Vehicle System Dynamics , year=
THE MAGIC FORMULA TYRE MODEL , author=. Vehicle System Dynamics , year=
-
[57]
Reinforcement Learning Conference (RLC) , year=
An Optimal Tightness Bound for the Simulation Lemma , author=. Reinforcement Learning Conference (RLC) , year=
-
[58]
Reinforcement learning: An introduction , author=. 1998 , publisher=
work page 1998
- [59]
-
[60]
Near-optimal reinforcement learning in polynomial time , author=. Machine learning , volume=. 2002 , publisher=
work page 2002
-
[61]
Neural Information Processing Systems (NeurIPS) , year=
The value equivalence principle for model-based reinforcement learning , author=. Neural Information Processing Systems (NeurIPS) , year=
-
[62]
Mathematische Annalen , volume=
Beweis der Invarianz des n -dimensionalen Gebiets , author=. Mathematische Annalen , volume=. 1911 , publisher=
work page 1911
-
[63]
Transactions of the Linnean Society of London , volume =
Bates, Henry Walter , title =. Transactions of the Linnean Society of London , volume =
-
[64]
Dynamic Programming and Optimal Control: Volume I , author=. 2012 , publisher=
work page 2012
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.