arxiv: 2507.05561 · v2 · submitted 2025-07-08 · 💻 cs.LG · q-bio.NC

Preemptive Solving of Future Problems: Multitask Preplay in Humans and Machines

Pith reviewed 2026-05-19 06:35 UTC · model grok-4.3

classification 💻 cs.LG q-bio.NC
keywords multitask preplay · counterfactual simulation · human generalization · predictive representations · reinforcement learning · Craftax · transfer learning

The pith

Humans and machines use multitask preplay to learn solutions to future tasks by simulating them during current task experiences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes that humans draw on experience from one task to preemptively solve other tasks that are accessible but not currently pursued. This idea is formalized as Multitask Preplay, an algorithm that uses replayed experience as the starting point for counterfactual simulation and thereby builds predictive representations for later use. These representations enable faster adaptation when the future task becomes relevant. In a grid-world setting, the approach matches how people generalize better than standard planning or predictive-representation methods, even when participants did not know in advance that they would need to generalize. It also lets artificial agents transfer learned behaviors to novel Craftax worlds that share task co-occurrence structure.

Core claim

Multitask Preplay formalizes the process by which experience on a pursued task serves as the starting point for counterfactual simulation of an accessible but unpursued task, thereby learning a predictive representation that supports fast and adaptive performance on that task later on. This mechanism better predicts human generalization behavior in small grid-world tasks than traditional methods and enables artificial agents to acquire transferable behaviors in novel Craftax worlds that share task co-occurrence structure.

What carries the argument

Multitask Preplay is the central algorithm that replays experience from one task to initiate counterfactual simulation of unpursued tasks, building predictive representations for future adaptive performance.
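
As a rough illustration of the mechanism described above, the following Python sketch implements a tabular toy version. It is not the authors' implementation: the seven-cell corridor, the known deterministic model in `step`, the hyperparameters, and every function name are assumptions chosen for brevity, and a per-task Q-table stands in for the predictive representation. The agent only ever acts on the "pursued" task, but each real step triggers a few counterfactual rollouts for the "unpursued" task that start from replayed states.

```python
import random

# Hypothetical toy world: a corridor of cells 0..6 with one goal per task.
N_STATES = 7
ACTIONS = (-1, +1)                       # step left / step right
GOALS = {"pursued": 6, "unpursued": 0}   # goal cell for each task
GAMMA, ALPHA, EPS = 0.9, 0.5, 0.2        # illustrative hyperparameters

def step(state, action_idx):
    """Deterministic corridor model, assumed known to the agent."""
    return min(max(state + ACTIONS[action_idx], 0), N_STATES - 1)

def act(q, task, state, rng):
    """Epsilon-greedy action for `task`, breaking ties at random."""
    if rng.random() < EPS:
        return rng.randrange(len(ACTIONS))
    values = q[task][state]
    best = max(values)
    return rng.choice([a for a, v in enumerate(values) if v == best])

def td_update(q, task, s, a, s2):
    """Q-learning update for `task`; returns True when its goal is reached."""
    done = s2 == GOALS[task]
    r = 1.0 if done else 0.0
    target = r if done else r + GAMMA * max(q[task][s2])
    q[task][s][a] += ALPHA * (target - q[task][s][a])
    return done

def preplay(q, task, start, rng, horizon=10):
    """Counterfactual rollout for `task`, starting from a replayed state."""
    s = start
    for _ in range(horizon):
        a = act(q, task, s, rng)
        s2 = step(s, a)
        if td_update(q, task, s, a, s2):
            break
        s = s2

def train(episodes=200, n_sim=2, seed=0):
    rng = random.Random(seed)
    q = {t: [[0.0, 0.0] for _ in range(N_STATES)] for t in GOALS}
    replay = []                          # states visited on the pursued task
    for _ in range(episodes):
        s = 3                            # every episode starts mid-corridor
        for _ in range(20):
            a = act(q, "pursued", s, rng)
            s2 = step(s, a)
            replay.append(s)
            done = td_update(q, "pursued", s, a, s2)
            # Preplay: simulate the unpursued task from replayed experience.
            for _ in range(n_sim):
                preplay(q, "unpursued", rng.choice(replay), rng)
            if done:
                break
            s = s2
    return q
```

Calling `train()` returns Q-tables for both tasks. In this toy setting the learned values should favor moving left from the start cell under the unpursued task, even though the agent never acted on that task in real experience.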

Load-bearing premise

The proposed counterfactual simulation in multitask preplay accurately models the cognitive processes humans use for generalization rather than other mechanisms producing similar patterns.

What would settle it

Observing human generalization behavior in the grid-world that matches predictions from associative learning or explicit planning models but deviates from the multitask preplay simulations.

Figures

Figures reproduced from arXiv: 2507.05561 by Honglak Lee, Sam Hall-McMaster, Samuel J. Gershman, Wilka Carvalho.

Figure 1: Overview of Multitask Preplay. (1) At time …
Figure 2: Overview of algorithms. (a) Q-learning is a model-free algorithm that updates cached reward …
Figure 3: Grid-world experiments and model comparisons. Across experiments, subjects learn to obtain training objects (blue boxes) before being evaluated on test objects (red/yellow boxes). Orange grid squares mark spawning locations during training; white stars indicate spawning locations present in both training and evaluation; red/yellow stars show novel evaluation spawning locations. All figures display optimal …
Figure 4: 2D Minecraft experiments and model comparisons. Subjects learn to obtain stones (either a ruby, sapphire, or diamond) across 4 procedurally generated maps. Two are train stones, 1 is an evaluation stone. In one condition, they are told the evaluation stone; in another, they are not. (A) An example of the full map. Orange stars indicate spawning locations during training; the white star was a spawn location…
Figure 5: AI simulation results studying generalization to 10,000 unique new testing environments. (A) Each point is the mean and standard error across 5 model initializations (140 total individual training runs). Model-based methods are run for 1 million training steps and model-free methods are run for 10 million training steps. (B) Performance on individual subtasks during generalization when trained with 512 tra…
Figure 6: In our implementation of Dyna, synthetic data τsim is sampled from the environment model by running the model forward from observations in the replay buffer. The action selection policy is the same as the one used to generate real actions. This is repeated to produce nsim trajectories stored in a preplay buffer. The synthetic data are treated in the same way as real data, entering into the TD loss (equatio…
Figure 6: Comparison between Dyna and Multitask Preplay. In our experiments, …
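
The Dyna procedure summarized in the Figure 6 caption (seed rollouts from replayed observations, act with the same policy used for real actions, collect nsim synthetic trajectories, and feed them into the same TD loss) can be sketched in a few lines of Python. This is a minimal tabular sketch under stated assumptions, not the paper's code: the five-cell chain, the hand-written `model` standing in for a learned environment model, the hyperparameters, and all function names are hypothetical.

```python
import random

N = 5  # hypothetical chain of cells 0..4 with the goal at cell 4

def model(s, a):
    """Stand-in for a learned environment model (here simply the true chain)."""
    s2 = min(max(s + (1 if a == 1 else -1), 0), N - 1)
    done = s2 == N - 1
    return s2, (1.0 if done else 0.0), done

def simulate(model, policy, start, horizon):
    """Roll the model forward from `start`, choosing actions with `policy`."""
    traj, s = [], start
    for _ in range(horizon):
        a = policy(s)
        s2, r, done = model(s, a)
        traj.append((s, a, r, s2, done))
        if done:
            break
        s = s2
    return traj

def dyna_rollouts(model, policy, replay_states, n_sim, horizon, rng):
    """Collect n_sim synthetic trajectories seeded from replayed states."""
    return [simulate(model, policy, rng.choice(replay_states), horizon)
            for _ in range(n_sim)]

def td_update(q, transitions, alpha=0.5, gamma=0.9):
    """Q-learning sweep; real and synthetic transitions are treated alike."""
    for s, a, r, s2, done in transitions:
        target = r if done else r + gamma * max(q[s2])
        q[s][a] += alpha * (target - q[s][a])

def run(seed=0, iters=200, n_sim=4, horizon=8):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N)]

    def policy(s):
        # Same epsilon-greedy policy for real and simulated actions.
        if rng.random() < 0.2 or q[s][0] == q[s][1]:
            return rng.randrange(2)
        return 0 if q[s][0] > q[s][1] else 1

    replay_states, s = [0], 0
    for _ in range(iters):
        a = policy(s)
        s2, r, done = model(s, a)
        td_update(q, [(s, a, r, s2, done)])        # real transition
        replay_states.append(s)
        for traj in dyna_rollouts(model, policy, replay_states,
                                  n_sim, horizon, rng):
            td_update(q, traj)                     # synthetic transitions
        s = 0 if done else s2
    return q
```

In this sketch the synthetic rollouts target the same task the agent is pursuing. Going by the description of Multitask Preplay elsewhere on this page, the difference would be that its rollouts are simulated for an accessible but unpursued task.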
Original abstract

Humans can pursue a near-infinite variety of tasks, but typically can only pursue a small number at the same time. We hypothesize that humans leverage experience on one task to preemptively learn solutions to other tasks that were accessible but not pursued. We formalize this idea as Multitask Preplay, a novel algorithm that replays experience on one task as the starting point for "preplay" -- counterfactual simulation of an accessible but unpursued task. Preplay is used to learn a predictive representation that can support fast, adaptive task performance later on. We first show that, compared to traditional planning and predictive representation methods, multitask preplay better predicts how humans generalize to tasks that were accessible but not pursued in a small grid-world, even when people didn't know they would need to generalize to these tasks. We then show these predictions generalize to Craftax, a partially observable 2D Minecraft environment. Finally, we show that Multitask Preplay enables artificial agents to learn behaviors that transfer to novel Craftax worlds sharing task co-occurrence structure. These findings demonstrate that Multitask Preplay is a scalable theory of how humans counterfactually learn and generalize across multiple tasks; endowing artificial agents with the same capacity can significantly improve their performance in challenging multitask environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Multitask Preplay, an algorithm that replays experience from pursued tasks as the starting point for counterfactual simulation of accessible but unpursued tasks, thereby learning predictive representations that support later generalization. It reports that this approach better predicts human choices in a small grid-world task (even without foreknowledge of the need to generalize) than traditional planning or predictive-representation baselines, extends the predictions to the Craftax environment, and enables artificial agents to achieve better transfer to novel Craftax worlds that share task co-occurrence structure.

Significance. If the central empirical claims hold after controlling for experience volume and baseline construction, the work would offer a concrete, scalable mechanism for counterfactual multitask learning that links human generalization to improved agent transfer in partially observable environments. The combination of human behavioral modeling and agent experiments in a complex domain like Craftax is a notable strength, as is the emphasis on preemptive rather than reactive learning.

major comments (2)
  1. [Human results section] Human-experiment results (likely §3 or §4): the reported superiority of Multitask Preplay over predictive-representation baselines does not yet isolate the contribution of the explicit counterfactual task-switch from the simple fact of additional replay on the same trajectories. A standard successor-feature or replay baseline trained on identical pursued-task data (without the preplay switch) must be shown to underperform; otherwise the specific preplay mechanism is not required to explain the human generalization patterns.
  2. [Agent experiments] Agent transfer experiments (likely §5): the claim that Multitask Preplay enables transfer to novel Craftax worlds sharing task co-occurrence structure requires an ablation confirming that the benefit survives when the amount of total experience and the replay buffer contents are matched across conditions. Without this, the improvement could be driven by richer experience rather than the counterfactual structure.
minor comments (2)
  1. [Algorithm description] Clarify the precise definition of 'accessible but not pursued' tasks and how the set of counterfactual simulations is chosen; the current description leaves open whether the algorithm assumes oracle knowledge of which tasks will be relevant later.
  2. [Figures] Figure captions and legends should explicitly state the number of participants, trials per condition, and statistical tests used for the human prediction comparisons.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments on our manuscript. We have addressed each of the major comments point by point below and made revisions to the manuscript where necessary to strengthen the empirical controls.

Point-by-point responses
  1. Referee: [Human results section] Human-experiment results (likely §3 or §4): the reported superiority of Multitask Preplay over predictive-representation baselines does not yet isolate the contribution of the explicit counterfactual task-switch from the simple fact of additional replay on the same trajectories. A standard successor-feature or replay baseline trained on identical pursued-task data (without the preplay switch) must be shown to underperform; otherwise the specific preplay mechanism is not required to explain the human generalization patterns.

    Authors: We agree that isolating the specific role of the counterfactual task-switch is important for interpreting the human results. Our existing predictive-representation baselines are trained solely on pursued-task data without any task-switching or counterfactual simulation. To directly address this concern, we will incorporate an additional baseline that performs extra replays exclusively on the pursued-task trajectories, matching the volume of experience but without the preplay switch to unpursued tasks. We will report these results in the revised human-experiment section to demonstrate that the preplay mechanism contributes beyond additional replay alone. revision: yes

  2. Referee: [Agent experiments] Agent transfer experiments (likely §5): the claim that Multitask Preplay enables transfer to novel Craftax worlds sharing task co-occurrence structure requires an ablation confirming that the benefit survives when the amount of total experience and the replay buffer contents are matched across conditions. Without this, the improvement could be driven by richer experience rather than the counterfactual structure.

    Authors: We concur that controlling for total experience and replay buffer contents is essential to attribute the transfer benefits specifically to the counterfactual structure. While we matched the total number of environment steps in the original experiments, the replay buffers in the Multitask Preplay condition contain additional simulated trajectories. We will add an ablation in which the baseline agent's replay buffer is supplemented with an equivalent volume of additional pursued-task replays or random experiences to match the buffer contents. This will be presented in the revised agent experiments section to confirm that the structured counterfactual preplay provides the observed advantage. revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical comparisons rest on independent baselines and external human data

full rationale

The paper introduces Multitask Preplay as an algorithm that replays pursued-task experience to simulate unpursued tasks, then reports that this yields better predictions of human generalization in a grid-world and transfer in Craftax than traditional planning and predictive-representation baselines. No equations, parameter-fitting procedures, or self-citation chains are visible that would reduce the reported predictions or superiority claims to quantities defined by construction from the same human or agent data. The central results are empirical model comparisons against external benchmarks (human choices and novel environments), which remain falsifiable and do not rely on self-definitional loops, fitted-input predictions, or load-bearing uniqueness theorems imported from the authors' prior work. The derivation is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are specified in the available text.

pith-pipeline@v0.9.0 · 5772 in / 1307 out tokens · 44533 ms · 2026-05-19T06:35:05.701064+00:00 · methodology
