
arxiv: 2606.21173 · v1 · pith:YPS6WFBMnew · submitted 2026-06-19 · 💻 cs.LG · cs.AI

Inverting the Bellman Equation: From Q-Values to World Models

Pith reviewed 2026-06-26 14:31 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords reinforcement learning · world models · Q-learning · goal-conditioned RL · transition kernel · Bellman equation · model-free methods

The pith

Value-based agents trained on a rich set of rewards implicitly encode an accurate world model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper challenges the traditional split between model-free reinforcement learning, which learns values, and model-based approaches, which learn transitions. It establishes that training Q-values over enough different reward functions, as in goal-conditioned settings, builds a complete internal model of the environment's dynamics inside the agent. A new extraction method called P-learning inverts the usual process to recover that model from the agent's Q-values, policies, and rewards. If correct, this means many value-based agents already contain the information needed for planning and generalization without separate model learning.

Core claim

Value-based agents trained on a sufficiently rich set of reward functions implicitly encode a unique and accurate world model. To extract this model in practice, P-learning samples from an agent's Q-values, policies and rewards to decode its internal model of the environment. Sufficient conditions are given on the type and number of goals for which agents encode the true transition kernel P, covering stochastic and deterministic MDPs over finite or continuous state spaces. Even when assumptions are violated, agents trained on a handful of reward functions encode accurate dynamics, and policies trained on the implicit model perform well on out-of-distribution goals.
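For intuition, the finite-state case can be sketched as a linear system. The notation below ($V_g$, $\pi_g$, $r_g$, the stacked matrix $\mathbf{V}$) is assumed for illustration and need not match the paper's; this is the textbook linear-algebra argument, not necessarily the paper's proof.

```latex
% Bellman equation for goal g under policy \pi_g, finite state space S:
Q_g(s,a) = r_g(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V_g(s'),
\qquad
V_g(s') = \sum_{a'} \pi_g(a' \mid s')\, Q_g(s',a').

% Stacking the equations over goals g \in G for a fixed pair (s,a):
\frac{1}{\gamma}\bigl(Q_g(s,a) - r_g(s,a)\bigr)_{g \in G}
  = \mathbf{V}\, P(\cdot \mid s,a),
\qquad
\mathbf{V} \in \mathbb{R}^{|G| \times |S|},\quad \mathbf{V}_{g,s'} = V_g(s').
```

Under these assumptions, $P(\cdot \mid s,a)$ is the unique solution whenever $\mathbf{V}$ has full column rank, which requires $|G| \ge |S|$. This is consistent with the worst-case bound $|G| \ge |S|$ quoted for stochastic dynamics in the figure captions; the deterministic case is reported to need far fewer goals.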

What carries the argument

P-learning, the inverse analogue to Q-learning that decodes the transition kernel from sampled Q-values, policies and rewards.
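To make the inversion idea concrete, here is a minimal tabular sketch. It is not P-learning itself, which is described as a sampling-based procedure; it is an exact least-squares counterpart on a small random MDP. The function names (`evaluate_q`, `recover_kernel`), the random MDP and the use of fixed stochastic policies are all assumptions made for illustration.

```python
import numpy as np

def evaluate_q(P, r, pi, gamma):
    """Exact Q^pi for one reward. P: (S, A, S), r: (S, A), pi: (S, A)."""
    S, A, _ = P.shape
    # State-action transition matrix: T[(s,a),(s',a')] = P[s,a,s'] * pi[s',a']
    T = (P[:, :, :, None] * pi[None, None, :, :]).reshape(S * A, S * A)
    q = np.linalg.solve(np.eye(S * A) - gamma * T, r.reshape(S * A))
    return q.reshape(S, A)

def recover_kernel(Q, R, Pi, gamma):
    """Least-squares inversion of the Bellman equation.

    Q, R, Pi have shape (G, S, A): one Q-function, reward and policy per goal.
    Returns an estimate of P with shape (S, A, S).
    """
    G, S, A = Q.shape
    V = np.einsum("gsa,gsa->gs", Pi, Q)        # V_g(s') = sum_a' pi_g(a'|s') Q_g(s',a')
    B = ((Q - R) / gamma).reshape(G, S * A)    # left-hand side, one column per (s, a)
    X, *_ = np.linalg.lstsq(V, B, rcond=None)  # solve V @ X = B, X has shape (S, S*A)
    return X.T.reshape(S, A, S)

# Hypothetical toy problem: 5 states, 3 actions, 8 random goals.
rng = np.random.default_rng(0)
S, A, G, gamma = 5, 3, 8, 0.9
P_true = rng.dirichlet(np.ones(S), size=(S, A))   # random transition kernel
R = rng.normal(size=(G, S, A))                     # one random reward per goal
Pi = rng.dirichlet(np.ones(A), size=(G, S))        # one random policy per goal
Q = np.stack([evaluate_q(P_true, R[g], Pi[g], gamma) for g in range(G)])

P_hat = recover_kernel(Q, R, Pi, gamma)
max_error = np.abs(P_hat - P_true).max()
```

With exact Q-values and at least as many generic goals as states, the recovered kernel should match the true one up to floating-point error. Reducing `G` below `S` makes the system under-determined in this stochastic toy setting.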

If this is right

  • The extracted model from a position-only trained Reacher agent supports quasi-optimal policies on velocity-based out-of-distribution goals.
  • Agents encode accurate dynamics in Reacher, MountainCar and stochastic FourRooms even when the formal sufficient conditions are not fully met.
  • Policies trained exclusively on the agent's implicit world model match performance on tasks outside the original training distribution.
  • Model-free value functions contain usable transition knowledge that connects them to model-based planning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the encoding holds across algorithms, training on diverse goals could serve as a practical route to reliable internal simulators without explicit dynamics data.
  • The result offers one explanation for why goal-conditioned agents often generalize better than agents trained on a single reward.
  • Checking whether P-learning recovers consistent models from different value-based methods would test how general the implicit encoding is.

Load-bearing premise

The reward functions or goals used during training must be rich enough in type and number to uniquely determine the true transition kernel.

What would settle it

Apply P-learning to extract a transition kernel from an agent's Q-values and compare it directly to the true environment dynamics on a held-out set of states and actions; mismatch would show the encoding claim does not hold.
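The comparison described above could be scored with a normalised mean squared error on held-out state-action pairs. The sketch below assumes NMSE means MSE divided by the variance of the target, which is one common convention; the normalisation used in the paper is not stated here. The functions `true_step` and `extracted_step` are hypothetical stand-ins for the environment and the extracted model.

```python
import numpy as np

def nmse(pred, target):
    """MSE normalised by the variance of the target (assumed convention)."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.mean((pred - target) ** 2) / np.var(target))

# Hypothetical one-dimensional deterministic dynamics.
def true_step(s, a):
    return s + 0.1 * a

# Hypothetical extracted model with a small systematic error.
def extracted_step(s, a):
    return s + 0.1 * a + 0.001 * np.sin(s)

# Held-out states and actions, not used to fit the model.
rng = np.random.default_rng(0)
held_out_s = rng.uniform(-1.0, 1.0, size=1000)
held_out_a = rng.choice([-1.0, 0.0, 1.0], size=1000)

score = nmse(extracted_step(held_out_s, held_out_a),
             true_step(held_out_s, held_out_a))
```

Under this convention a score near 0 indicates close agreement, while a score near 1 means the model is no better than predicting the mean next state.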

Figures

Figures reproduced from arXiv: 2606.21173 by Alexander D. Goldie, Alistair Letcher, Jakob N. Foerster, Jonathan Richens, Mattie Fellows, Oliver Richardson.

Figure 1. Illustration of P-learning: extracting the world model contained in an agent’s Q-values. Code available at github.com/aletcher/inverting-bellman.
Figure 2. Q-values (top) and implicit WM (bottom) of a Reacher agent trained on |G| = 4 goals. Learnt values Q_g(θ, ω, a) are coarsely similar (NMSE = 5.7 × 10⁻¹) to ground truth Q^true, shown for g = (0, 1), a = (1, 1) and two slices ω = (1, −1), θ = (1, −1). Nevertheless, the extracted WM P(θ, ω, a) is highly accurate (NMSE = 1.2 × 10⁻⁴), shown for prediction of y-position P_y vs true P_y^true (same slice). Bottom …
Figure 3. Mean return ± SE (10 training seeds and 512 env. resets), for optimal (R⋆) vs WM-trained (R_WM) policies on 3 unseen goals. We visualise slices of the learnt Q-values, extracted WM and trajectory rollouts in …
Figure 4. Agent return on training goals (left), implicit world model MSE (middle) and return of …
Figure 5. Quiver plots of WMs extracted from two agents trained with position- and velocity-based goals, vs ground truth, for action a = right. Each arrow starts from a state s and points to the predicted or true next-state s′ = P(s, a). Again, planning inside the implicit WM reveals implicit generalisation capabilities ( …
Figure 6. PQN return (left) and WM dynamics MSE (right) vs training step for …
Figure 7. Pearson and Spearman correlations over time between …
Figure 8. Mean return ± SE (10 training seeds), for optimal (R⋆) vs WM-trained (R_WM) policies, on two unseen goals, in each FourRooms variant, for varying number of training goals |G| on the x-axis. In the deterministic variant, recovery is exact at every |G| ≥ 1, in line with Theorem 2. In the windy variant, |G| = 4 training goals already drive the WM-derived policy to within ∼1% of optimal return. In the telepor…
Figure 9. Deterministic FourRooms: a single generic goal drives the extracted WM to zero error during training, in line with Theorem 2, and training a policy inside of it produces a policy (solid) that is optimal (dashed) on both unseen goals.
Figure 10. Windy FourRooms: the single-goal regime is no longer sufficient for stochastic dynamics, but |G| = 4 drives the WM-derived policy to within a few percent of the oracle on both unseen goals. The windy variant introduces local stochasticity: the chosen action is realised faithfully with probability 1/2, and rotated 90° counter-clockwise or clockwise each with probability 1/4. This places the variant in t…
Figure 11. Teleporting FourRooms: for |G| = 20, the extracted WM produces quasi-optimal policies on OOD goals, well below our worst-case theoretical bound |G| ≥ |S| = 68.
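The windy dynamics described in the Figure 10 caption can be written down as a small transition model. The sketch below is an illustration only: the action encoding (0 = up, 1 = right, 2 = down, 3 = left), the `MOVES` table and the set-of-free-cells grid representation are assumptions, not taken from the paper's code.

```python
import numpy as np

# Hypothetical action encoding in clockwise order: 0=up, 1=right, 2=down, 3=left.
MOVES = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}  # (row, col) deltas

def windy_action_distribution(action, n_actions=4):
    """Probability of each realised action given the chosen one."""
    probs = np.zeros(n_actions)
    probs[action] += 0.5                      # realised faithfully
    probs[(action + 1) % n_actions] += 0.25   # rotated 90 degrees clockwise
    probs[(action - 1) % n_actions] += 0.25   # rotated 90 degrees counter-clockwise
    return probs

def next_state_distribution(pos, action, free_cells):
    """Distribution over next cells; moves into walls leave the agent in place."""
    dist = {}
    for realised, p in enumerate(windy_action_distribution(action)):
        if p == 0:
            continue
        dr, dc = MOVES[realised]
        nxt = (pos[0] + dr, pos[1] + dc)
        if nxt not in free_cells:
            nxt = pos
        dist[nxt] = dist.get(nxt, 0.0) + p
    return dist

# Tiny two-cell corridor: moving right from (0, 0) succeeds half the time.
corridor = {(0, 0), (0, 1)}
dist_right = next_state_distribution((0, 0), 1, corridor)
```

In this corridor the two rotated outcomes both hit walls, so the agent stays put with probability 1/2 and reaches `(0, 1)` with probability 1/2.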
Original abstract

Model-based and model-free reinforcement learning are traditionally viewed as separate paradigms: instead of learning a model of the transition kernel $P$, model-free agents typically estimate value functions tied to a specific policy and reward. In this paper, we challenge this dichotomy by proving that value-based agents trained on a sufficiently rich set of reward functions, e.g. using goal-conditioned RL, implicitly encode a unique and accurate world model. To extract this model in practice, we introduce \textit{$P$-learning}, an inverse analogue to $Q$-learning that samples from an agent's $Q$-values, policies and rewards to decode its internal model of the environment. We then provide sufficient conditions on the type and number of goals for which agents encode the true kernel $P$, covering both stochastic and deterministic MDPs over finite or continuous state spaces. Even when our assumptions are violated, we empirically demonstrate that agents trained on a handful of reward functions encode accurate dynamics in $\texttt{Reacher}$, $\texttt{MountainCar}$ and stochastic variants of $\texttt{FourRooms}$. Surprisingly, we find that policies trained exclusively on a \texttt{Reacher} agent's implicit world model are quasi-optimal on out-of-distribution, velocity-based goals despite position-only training -- suggesting that agents contain hidden generalisation capabilities and providing a new lens into the connection between model-based, model-free, and goal-conditioned RL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proves that value-based agents trained on a sufficiently rich set of reward functions (e.g. via goal-conditioned RL) implicitly encode a unique and accurate world model of the transition kernel P. It introduces P-learning, an inverse procedure to Q-learning that extracts the model by sampling from the agent's Q-values, policies and rewards. Sufficient conditions on the type and cardinality of goals are stated to guarantee recovery of the true P for both stochastic and deterministic MDPs over finite or continuous state spaces. Empirical results in Reacher, MountainCar and stochastic FourRooms show that agents encode accurate dynamics even when assumptions are mildly violated, and that policies trained exclusively on the implicit model achieve quasi-optimal performance on out-of-distribution velocity-based goals.

Significance. If the central claim and proof hold, the work supplies a concrete mathematical bridge between model-free value-based RL and model-based methods by showing that sufficiently diverse reward training causes agents to contain hidden, extractable world models. The explicit uniqueness proof, the P-learning algorithm, the parameter-free character of the inversion under the stated conditions, and the generalization experiments are all strengths that would be credited in a review. The result offers a new lens on goal-conditioned RL and could motivate new algorithms for model extraction and improved OOD performance.

major comments (2)
  1. [theoretical results / sufficient conditions] The uniqueness theorem (main theoretical section) asserts that Q-values over a rich reward set determine P uniquely; however, the proof sketch in the abstract and the empirical section do not clarify whether the stated cardinality conditions remain sufficient when the reward functions are linearly dependent or when the policy class is restricted, which is load-bearing for the claim that any goal-conditioned agent encodes the true kernel.
  2. [P-learning procedure] § on P-learning: the inversion procedure is presented as sampling from Q, π and r to recover P, but the manuscript does not report the sample complexity or the numerical stability of the inversion step when Q-values are estimated from finite data; this directly affects whether the extracted model is accurate enough to support the reported quasi-optimal OOD policies.
minor comments (2)
  1. [continuous state spaces] Notation for the continuous-state case should explicitly distinguish the measure-theoretic version of the Bellman equation from the finite case to avoid ambiguity in the uniqueness argument.
  2. [empirical evaluation] The Reacher and MountainCar experiments would benefit from an ablation that varies the number of training goals while holding total samples fixed, to quantify how quickly the implicit model accuracy saturates.
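The numerical-stability concern in major comment 2 can be probed with a toy experiment. The sketch below assumes a finite-state, least-squares form of the inversion (not the paper's sampling-based P-learning) and uses a random matrix `V` as a stand-in for the matrix of per-goal state values; all names and sizes are hypothetical.

```python
import numpy as np

def recover_row(V, b):
    """Least-squares solution of V @ p = b for one row p = P(.|s,a)."""
    p, *_ = np.linalg.lstsq(V, b, rcond=None)
    return p

rng = np.random.default_rng(0)
S, G = 6, 12
V = rng.normal(size=(G, S))         # stand-in for the value matrix V[g, s']
p_true = rng.dirichlet(np.ones(S))  # one true transition row
b = V @ p_true                      # exact targets (Q - r) / gamma
eps = rng.normal(size=G)            # fixed noise direction, scaled below

errors = {}
for sigma in (0.0, 1e-3, 1e-1):
    noisy_b = b + sigma * eps       # simulates estimation error in Q
    errors[sigma] = np.linalg.norm(recover_row(V, noisy_b) - p_true)

# Worst-case amplification of target noise is 1 / smallest singular value.
smallest_sv = np.linalg.svd(V, compute_uv=False).min()
condition_number = np.linalg.cond(V)
```

In this linear setting the recovery error grows in proportion to the noise level and is bounded by `sigma * ||eps|| / smallest_sv`, so a poorly conditioned value matrix (goals whose values are nearly linearly dependent) would amplify Q-value estimation error.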

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their positive assessment and recommendation of minor revision. The comments highlight important points for clarification, which we address below.

Point-by-point responses
  1. Referee: [theoretical results / sufficient conditions] The uniqueness theorem (main theoretical section) asserts that Q-values over a rich reward set determine P uniquely; however, the proof sketch in the abstract and the empirical section do not clarify whether the stated cardinality conditions remain sufficient when the reward functions are linearly dependent or when the policy class is restricted, which is load-bearing for the claim that any goal-conditioned agent encodes the true kernel.

    Authors: The uniqueness theorem provides sufficient conditions on the cardinality and type of goals that ensure the reward set allows unique determination of P. These conditions are intended to guarantee that the rewards provide independent information sufficient for inversion; linear dependence among rewards would reduce the effective cardinality below the threshold, thus not satisfying the stated conditions. We will add an explicit remark in the theorem statement and surrounding discussion to clarify this aspect. The theorem applies under the policy class used by the agent for the given rewards, as detailed in the assumptions; no restriction beyond that is claimed. This clarification will be incorporated in the revision.

    Revision: partial

  2. Referee: [P-learning procedure] § on P-learning: the inversion procedure is presented as sampling from Q, π and r to recover P, but the manuscript does not report the sample complexity or the numerical stability of the inversion step when Q-values are estimated from finite data; this directly affects whether the extracted model is accurate enough to support the reported quasi-optimal OOD policies.

    Authors: The manuscript indeed does not provide theoretical sample complexity analysis for the P-learning inversion or a dedicated study of numerical stability with finite-sample Q-value estimates. The empirical results across the environments demonstrate that the procedure yields models accurate enough for the reported OOD performance. In the revised manuscript, we will include additional discussion on the numerical stability observed in the experiments and note the lack of theoretical sample complexity bounds as a direction for future work.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper presents an explicit mathematical proof that Q-values over a sufficiently rich set of reward functions uniquely determine the transition kernel P (covering finite/continuous and deterministic/stochastic MDPs), grounded in the Bellman equation and stated conditions on reward type and cardinality. The derivation of the inversion procedure (P-learning) follows directly from this uniqueness result rather than from any fitted parameter or self-referential definition. No load-bearing step reduces by construction to the paper's own inputs, and the central claim does not rely on self-citations for its uniqueness theorem or ansatz. The result is therefore self-contained against external mathematical benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The claim rests on standard MDP and Bellman equation properties plus the unstated but load-bearing assumption that the training rewards meet the sufficient conditions for unique recovery of P.

axioms (2)
  • [standard math] The Bellman optimality equation holds for the learned Q-values
    Invoked implicitly when relating Q to the transition kernel P
  • [domain assumption] The environment is an MDP (finite or continuous state space, stochastic or deterministic)
    Stated as the setting in which the sufficient conditions apply
invented entities (1)
  • P-learning procedure [no independent evidence]
    purpose: Inverse method to decode the transition kernel from Q-values, policies and rewards
    Newly introduced extraction technique; no independent evidence supplied in abstract


A Proofs for P-Learning

A.1 Derivation of P-Learning

We provide pseudocode f...

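… to form an unbiased estimate $\hat{\delta}_1$ of the outer factor $\delta_\phi$, and an independent $(s'_2, a'_2)$ to estimate the inner expectation. Independence ensures unbiased estimation:

$$\nabla_\phi L(\phi) = \mathbb{E}_{(g,s,a)\sim d}\Big[\delta_\phi(s,a,g)\,\mathbb{E}_{s'\sim P_\phi(s,a),\,a'\sim\pi(s',g)}\big[\hat{\delta}_\phi \nabla_\phi \log P_\phi(s' \mid s,a)\big]\Big] = \mathbb{E}_{(g,s,a)\sim d,\; s_{1,2}\sim P_\phi(s,a),\; a_{1,2}\sim\pi(s_{1,2},g)}\Big[\hat{\delta}_1 \hat{\delta}_2 \nabla_\phi \log P_\phi(s_2 \mid s,a)\Big].$$

This derivation leads to the pseudocode...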

… $= 0$ and thus $\det(M^{\pi}) = 0$. However, note that the matrix is full-rank for all $\gamma \neq 1/\sqrt{2}$, and for any perturbation of the policies, making singularities “exceptionally rare” as guaranteed by Theorem 3.

C.6 Arbitrary Q-Values

Theorems 2 and 3 give conditions under which the true kernel P is identified uniquely by exact Q-values, or approximately by ϵ-approx...

Deterministic (Figures 8 and 9). Each action moves the agent one cell in the chosen cardinal direction; walls keep the agent in place. A single training goal (|G| = 1) already drives P̂ to zero world-model MSE within 8M env steps and the WM-derived policy is exactly optimal on both unseen goals, in line with Theorem 2.

Windy (Figures 8 and 10). The chosen action is realised w.p. 1/2, perturbed 90° counter-clockwise or clockwise each w.p. 1/4. Here, |G| = 4 training goals already drive the WM-derived policy to within ∼1% of optimal return on both unseen goals.

Teleporting (Figures 8 and 11). Four cells (4,4), (4,6), (6,4), (6,6) teleport the agent uniformly into the 16 cells of the diagonally opposite 4×4 room. Here, |G| = 20 training goals, well short of the |G| = |S| = 68 worst case of Theorem 3, already drive the WM-derived policy to within ∼0.1% of optimal return on both unseen goals. For each variant, the training ...

This places the variant in the local-stochastic regime of Theorem 5. We train PQN on |G| ∈ {1, 2, 3, 4} training goals over 10 seeds, recover P̂ via the local LP, and evaluate on the same two unseen goals as the deterministic variant. As expected for stochastic dynamics, the single-goal regime is under-determined, but |G| ≥ 3 already drives the WM-derived pol...
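The P-learning derivation excerpt above relies on two independent samples so that a product of expectations is estimated without bias. The following sketch illustrates only that general double-sampling principle on scalar Gaussian draws; the variables `mu`, `sigma` and the sample size are hypothetical, and nothing here reproduces the paper's estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 1.0, 2.0, 200_000

# Two independent draws per trial, standing in for the two sampled
# next-state/action pairs used for the outer and inner factors.
x1 = rng.normal(mu, sigma, size=n)
x2 = rng.normal(mu, sigma, size=n)

# Target quantity: (E[x])^2 = mu^2, a product of two expectations.
single_sample = np.mean(x1 * x1)  # reuses one draw: expectation is mu^2 + sigma^2
double_sample = np.mean(x1 * x2)  # independent draws: expectation is mu^2
```

Reusing the same draw for both factors inflates the estimate by the variance, whereas independent draws leave it centred on the target. This is the same issue that motivates double sampling in residual-gradient methods.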