Imperfect World Models are Exploitable

David Abel; Esmeralda S. Whitammer; Logan Mondal Bhamidipaty; Mykel J. Kochenderfer; Subramanian Ramamoorthy

arxiv: 2605.15960 · v2 · pith:OZPPKI72new · submitted 2026-05-15 · 💻 cs.AI · cs.LG


Pith reviewed 2026-05-20 18:47 UTC · model grok-4.3

classification 💻 cs.AI cs.LG

keywords: model exploitation · reward hacking · reinforcement learning · world models · safe planning · policy sets

The pith

Imperfect world models in reinforcement learning are exploitable: they can rank one policy above another when the true environment ranks the two the other way around.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a formal definition of model exploitation: a world model strictly prefers one policy over another while the true transition model implies the reverse. It develops a general theory showing that such exploitation is essentially unavoidable on large policy sets, with the corresponding result for reward hacking as a special case. The authors also find that the conditions guaranteeing unhackability on finite policy sets have no counterpart that precludes exploitation. They therefore propose a relaxed notion of exploitation and derive a safe planning horizon within which it can be avoided, bridging reward hacking and model exploitation while highlighting the limits of safe planning with imperfect models.
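The informal definition can be written in symbols as follows. This is only a sketch: the notation ($T$, $\hat{T}$, $\Pi$, $J$) is assumed here for illustration and may differ from the paper's, and "the reverse" is read as a strict reversal.

```latex
% Assumed notation (not taken from the paper):
%   T        true transition model
%   \hat{T}  world model
%   \Pi      policy set
%   J_M(\pi) expected return of policy \pi when transitions follow model M
\hat{T} \text{ is exploitable relative to } T \text{ on } \Pi
\iff
\exists\, \pi_1, \pi_2 \in \Pi :\;
J_{\hat{T}}(\pi_1) > J_{\hat{T}}(\pi_2)
\;\text{ and }\;
J_{T}(\pi_1) < J_{T}(\pi_2).
```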

Core claim

What carries the argument

The general theory of reward hacking and model exploitation that proves unavoidability via combinatorial arguments on sufficiently large policy sets.

If this is right

Exploitation cannot be ruled out in large policy sets by the same finite-set conditions that prevent reward hacking.
A relaxed version of exploitation can be avoided within a derived safe planning horizon.
Reward hacking is recovered as a special case inside the same general theory.
Standard planning with imperfect world models therefore has inherent limits once policy spaces grow large.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Model-based agents may require explicit mechanisms to detect or bound model-reversal pairs rather than relying on policy-set size alone.
The result suggests that scaling policy spaces in real-world applications increases vulnerability even when model error is small.
Testable extensions could include measuring the shortest safe horizon in grid-world or continuous-control environments with controlled model mismatch.

Load-bearing premise

The policy set must be large enough for the combinatorial argument to establish that exploitation is unavoidable.

What would settle it

Constructing a large discrete policy space together with a mildly inaccurate world model and checking whether there always exists at least one pair of policies whose preference order reverses between the model and the true transitions.
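A check of this kind can be sketched for a tiny tabular MDP. The code below is illustrative only and is not taken from the paper: the function names (`policy_return`, `exploitable_pairs`, `make_T`), the two-state toy environment, and the choice to enumerate all deterministic stationary policies are assumptions made for the example. It evaluates every policy exactly under the true and the modeled transitions and lists the ordered pairs whose strict preference reverses.

```python
import itertools
import numpy as np


def policy_return(T, R, policy, gamma, start):
    """Expected discounted return of a deterministic policy in a tabular MDP."""
    n_states = T.shape[0]
    states = np.arange(n_states)
    P = T[states, policy]   # (n_states, n_states) next-state distribution per state
    r = R[states, policy]   # (n_states,) reward collected in each state
    # Solve the Bellman equation v = r + gamma * P v for the policy's values.
    v = np.linalg.solve(np.eye(n_states) - gamma * P, r)
    return float(start @ v)


def exploitable_pairs(T_true, T_model, R, gamma, start, tol=1e-9):
    """Ordered pairs (p, q): the model strictly prefers p, the truth strictly prefers q."""
    n_states, n_actions = R.shape
    policies = list(itertools.product(range(n_actions), repeat=n_states))
    j_true = [policy_return(T_true, R, np.array(p), gamma, start) for p in policies]
    j_model = [policy_return(T_model, R, np.array(p), gamma, start) for p in policies]
    pairs = []
    for i, j in itertools.permutations(range(len(policies)), 2):
        if j_model[i] > j_model[j] + tol and j_true[i] < j_true[j] - tol:
            pairs.append((policies[i], policies[j]))
    return pairs


def make_T(p_reach):
    """Hypothetical toy MDP: state 1 is absorbing and rewarding; from state 0,
    action 0 reaches state 1 with probability p_reach, action 1 with probability 0.5."""
    T = np.zeros((2, 2, 2))
    T[0, 0] = [1.0 - p_reach, p_reach]
    T[0, 1] = [0.5, 0.5]
    T[1, :, 1] = 1.0
    return T


R = np.array([[0.0, 0.0], [1.0, 1.0]])  # reward depends only on the current state
start = np.array([1.0, 0.0])            # always start in state 0
T_true = make_T(0.4)                    # in truth, action 0 is worse than action 1
T_model = make_T(0.6)                   # the model believes action 0 is better

pairs = exploitable_pairs(T_true, T_model, R, gamma=0.9, start=start)
for preferred_by_model, preferred_by_truth in pairs:
    print(preferred_by_model, "beats", preferred_by_truth, "only in the model")
```

Exhaustive enumeration scales as the number of actions raised to the number of states, so this approach is only practical for very small environments; larger experiments would need sampling or a structured policy class.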

Figures

Figures reproduced from arXiv: 2605.15960 by David Abel, Esmeralda S. Whitammer, Logan Mondal Bhamidipaty, Mykel J. Kochenderfer, Subramanian Ramamoorthy.

**Figure 2.** Gradients for the value curves in …

**Figure 3.** Examples of ε-exploitability and a contour plot for the safe horizon. (a, b) The exploitable transition model pairs from …

Original abstract

We propose a novel definition of model exploitation in reinforcement learning. Informally, a world model is exploitable if it implies that one policy should be strictly preferred over another while the environment's true transition model implies the reverse. We analogize our definition with a prior characterization of reward hacking but show that the associated proof of inevitability does not transfer to exploitation. To overcome this obstruction, we develop a general theory of reward hacking and model exploitation that proves that exploitation is essentially unavoidable on large policy sets and yields the corresponding claim for hacking as a special case. Unfortunately, we also find that the conditions that guarantee unhackability in finite policy sets have no counterpart that precludes exploitation. Consequently, we introduce a relaxed notion of exploitation and derive a safe horizon within which it can be avoided. Taken together, our results establish a formal bridge between reward hacking and model exploitation and elucidate the limits of safe planning in world models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper defines model exploitation cleanly and proves it is unavoidable for large policy sets, with reward hacking as a special case, but leaves the required set size unspecified.

The main takeaway is that they give a precise definition of model exploitation, where a world model reverses the true preference ordering between two policies, and then prove that this is essentially unavoidable once the policy set is large enough, treating reward hacking as a special case of the same framework. They also show that the finite-set conditions that block hacking have no direct counterpart here, so they introduce a relaxed notion of exploitation together with a safe horizon within which it can be avoided.

Referee Report

1 major / 1 minor

Summary. The paper proposes a definition of model exploitation in RL (a world model is exploitable if it strictly prefers one policy over another while the true transitions reverse the preference), shows that the reward-hacking inevitability proof does not transfer, develops a general theory proving exploitation is essentially unavoidable on large policy sets (with hacking as a special case), observes that finite-set unhackability conditions have no counterpart for exploitation, and introduces a relaxed exploitation notion together with a safe horizon within which it can be avoided.

Significance. If the combinatorial argument is made rigorous with explicit conditions, the work would establish a useful formal bridge between reward hacking and model exploitation and clarify fundamental limits on safe planning with imperfect world models in RL. The generalization of unavoidability results and the relaxed safe-horizon construction are potentially valuable for robustness research.

major comments (1)

[Development of the general theory (after the observation that the reward-hacking proof does not transfer)] The central unavoidability claim for exploitation rests on a combinatorial counting argument that applies only once the policy set is 'large enough,' yet no explicit minimal cardinality bound is supplied and no verification is given that typical structured or parameterized policy classes (e.g., neural-network policies) satisfy the required conditions. This is load-bearing for the claim that exploitation is 'essentially unavoidable' and that finite unhackability conditions have 'no counterpart.'

minor comments (1)

[Section introducing the relaxed notion] Notation for the relaxed exploitation notion and the safe horizon could be introduced with an explicit equation or definition to improve readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading and constructive feedback. We address the single major comment point by point below, providing the strongest honest response consistent with the manuscript.

Referee: [Development of the general theory (after the observation that the reward-hacking proof does not transfer)] The central unavoidability claim for exploitation rests on a combinatorial counting argument that applies only once the policy set is 'large enough,' yet no explicit minimal cardinality bound is supplied and no verification is given that typical structured or parameterized policy classes (e.g., neural-network policies) satisfy the required conditions. This is load-bearing for the claim that exploitation is 'essentially unavoidable' and that finite unhackability conditions have 'no counterpart.'

Authors: We thank the referee for highlighting the need for greater precision here. The combinatorial argument shows that the number of possible preference orderings inducible by transition models is finite and bounded (at most exponential in the size of the state-action space), so that once the policy set exceeds this number, a pigeonhole argument forces the existence of at least one exploitable pair. While the main text and abstract emphasize the qualitative conclusion for 'large' sets, the appendix proof already contains the dependence on cardinality; we will state the threshold (in terms of the number of distinct transition functions) explicitly in the main body of the revised manuscript. Regarding structured or parameterized classes such as neural-network policies, the result is stated for arbitrary policy sets and therefore applies whenever a given parameterization induces a sufficiently large effective set of distinct policies. We will add a clarifying paragraph noting that overparameterized networks typically realize large policy sets in practice, while acknowledging that a fully rigorous embedding of specific architectures into the counting argument is left for future work. These changes will strengthen the presentation of the unavoidability claim and the contrast with finite-set unhackability without altering the core theorems.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is conditional on explicit large-set assumption without self-referential reduction

The paper first defines model exploitation by direct analogy to a prior reward-hacking characterization, explicitly notes that the existing inevitability proof does not transfer, and then constructs a separate combinatorial argument that holds only when the policy set is large enough for a counting argument to produce reversing preference pairs. This cardinality condition is introduced as an assumption rather than derived from the definition itself. No equation equates a derived quantity to a fitted parameter or prior result by construction, no self-citation supplies the uniqueness or ansatz for the central claim, and the subsequent relaxed exploitation notion plus safe-horizon bound are obtained by standard relaxation of the same combinatorial setup. The overall chain therefore remains self-contained against external benchmarks and does not reduce to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on standard RL transition models and policy sets plus the new exploitation definition; no free parameters are fitted to data, but the size of the policy set functions as a key domain assumption for the unavoidability result.

axioms (1)

domain assumption: The policy set is large enough for the combinatorial argument establishing unavoidability to apply.
Invoked when proving that exploitation is essentially unavoidable on large policy sets after noting that the reward-hacking proof does not transfer.

invented entities (1)

Model exploitation (no independent evidence)
purpose: Formal characterization of when a world model implies the wrong policy preference relative to the true transition model.
New definition introduced to analogize with reward hacking; no independent evidence provided beyond the definition itself.

pith-pipeline@v0.9.0 · 5701 in / 1380 out tokens · 49687 ms · 2026-05-20T18:47:30.307209+00:00 · methodology

