Preemptive Solving of Future Problems: Multitask Preplay in Humans and Machines

Honglak Lee; Sam Hall-McMaster; Samuel J. Gershman; Wilka Carvalho

arxiv: 2507.05561 · v2 · submitted 2025-07-08 · 💻 cs.LG · q-bio.NC

Wilka Carvalho, Sam Hall-McMaster, Honglak Lee, Samuel J. Gershman

Pith reviewed 2026-05-19 06:35 UTC · model grok-4.3

keywords: multitask preplay, counterfactual simulation, human generalization, predictive representations, reinforcement learning, Craftax, transfer learning

The pith

Humans and machines use multitask preplay to learn solutions to future tasks by simulating them during current task experiences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes that humans draw on experience from one task to preemptively solve other tasks that are accessible but not currently pursued. This idea is formalized as Multitask Preplay, an algorithm that uses replayed experience as the starting point for counterfactual simulation and builds predictive representations for later use. These representations enable faster adaptation when the future task becomes relevant. The approach predicts how people generalize in a small grid-world better than standard planning or predictive-representation methods, even when participants did not know they would need to generalize. It also lets artificial agents learn behaviors that transfer to novel Craftax worlds sharing the same task co-occurrence structure.

Core claim

Multitask Preplay formalizes the process by which experience on a pursued task serves as the starting point for counterfactual simulation of an accessible but unpursued task, thereby learning a predictive representation that supports fast and adaptive performance on that task later on. This mechanism better predicts human generalization behavior in small grid-world tasks than traditional methods and enables artificial agents to acquire transferable behaviors in novel Craftax worlds that share task co-occurrence structure.

What carries the argument

Multitask Preplay is the central algorithm that replays experience from one task to initiate counterfactual simulation of unpursued tasks, building predictive representations for future adaptive performance.
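A minimal tabular sketch of this idea, not the authors' implementation. It assumes a hypothetical 9-state corridor with an exactly known transition model, and it uses a goal-conditioned Q-table as a stand-in for the paper's predictive representation. All names (`train`, `td_update`, `GOALS`) and hyperparameters are illustrative assumptions.

```python
import random

# Hypothetical 1-D corridor: states 0..8, actions -1 (left) and +1 (right).
N_STATES = 9
ACTIONS = (-1, +1)
GOALS = {"pursued": 0, "unpursued": 8}  # two objects at opposite ends
START = 4
GAMMA, ALPHA, EPSILON = 0.9, 0.5, 0.2


def step(state, action):
    """Deterministic transition model (assumed known and exact in this sketch)."""
    return min(max(state + action, 0), N_STATES - 1)


def choose(q, goal, state, rng):
    """Epsilon-greedy action for the given goal, with random tie-breaking."""
    if rng.random() < EPSILON:
        return rng.choice(ACTIONS)
    best = max(q[goal][state].values())
    return rng.choice([a for a in ACTIONS if q[goal][state][a] == best])


def td_update(q, goal, s, a, s_next):
    """One Q-learning backup on the goal-conditioned table; returns True at the goal."""
    done = s_next == GOALS[goal]
    reward = 1.0 if done else 0.0
    target = reward if done else GAMMA * max(q[goal][s_next].values())
    q[goal][s][a] += ALPHA * (target - q[goal][s][a])
    return done


def train(episodes=200, n_preplay=4, horizon=12, seed=0):
    rng = random.Random(seed)
    q = {g: {s: {a: 0.0 for a in ACTIONS} for s in range(N_STATES)} for g in GOALS}
    replay = []  # states actually visited while pursuing the "pursued" goal
    for _ in range(episodes):
        # Real experience: act only for the pursued goal.
        s = START
        for _ in range(horizon):
            a = choose(q, "pursued", s, rng)
            s_next = step(s, a)
            replay.append(s)
            if td_update(q, "pursued", s, a, s_next):
                break
            s = s_next
        # Preplay: start from a replayed state, simulate the unpursued goal.
        for _ in range(n_preplay):
            s = rng.choice(replay)
            for _ in range(horizon):
                a = choose(q, "unpursued", s, rng)
                s_next = step(s, a)
                if td_update(q, "unpursued", s, a, s_next):
                    break
                s = s_next
    return q


def greedy_path(q, goal, start=START, max_steps=20):
    """Follow the greedy policy for a goal; ties resolve to the first action (left)."""
    path = [start]
    while path[-1] != GOALS[goal] and len(path) <= max_steps:
        s = path[-1]
        path.append(step(s, max(ACTIONS, key=lambda act: q[goal][s][act])))
    return path


q = train()
print(greedy_path(q, "unpursued"))
```

Real experience in this sketch only ever optimizes the pursued goal. Any competence on the unpursued goal comes from simulated rollouts that begin at replayed states, which is the property the summary attributes to preplay.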

Load-bearing premise

The counterfactual simulation proposed in Multitask Preplay accurately models the cognitive process humans use to generalize, rather than some other mechanism that happens to produce similar behavioral patterns.

What would settle it

Observing human generalization behavior in the grid-world that matches predictions from associative learning or explicit planning models but deviates from the multitask preplay simulations.

Figures

Figures reproduced from arXiv: 2507.05561 by Honglak Lee, Sam Hall-McMaster, Samuel J. Gershman, Wilka Carvalho.

**Figure 2.** Overview of algorithms. (a) Q-learning is a model-free algorithm that updates cached reward …

**Figure 3.** Grid-world experiments and model comparisons. Across experiments, subjects learn to obtain training objects (blue boxes) before being evaluated on test objects (red/yellow boxes). Orange grid squares mark spawning locations during training; white stars indicate spawning locations present in both training and evaluation; red/yellow stars show novel evaluation spawning locations. All figures display optimal …

**Figure 4.** 2D Minecraft experiments and model comparisons. Subjects learn to obtain stones (either a ruby, sapphire, or diamond) across 4 procedurally generated maps. Two are train stones, 1 is an evaluation stone. In one condition, they are told the evaluation stone; in another, they are not. (A) An example of the full map. Orange stars indicate spawning locations during training; the white star was a spawn location…

**Figure 5.** AI simulation results studying generalization to 10,000 unique new testing environments. (A) Each point is the mean and standard error across 5 model initializations (140 total individual training runs). Model-based methods are run for 1 million training steps and model-free methods are run for 10 million training steps. (B) Performance on individual subtasks during generalization when trained with 512 tra…

**Figure 6.** In our implementation of Dyna, synthetic data τ_sim is sampled from the environment model by running the model forward from observations in the replay buffer. The action selection policy is the same as the one used to generate real actions. This is repeated to produce n_sim trajectories stored in a preplay buffer. The synthetic data are treated in the same way as real data, entering into the TD loss (equatio…

**Figure 6.** Comparison between Dyna and Multitask Preplay. In our experiments, …
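The Dyna procedure described in the first Figure 6 caption can be sketched roughly as follows. This is an illustrative toy, not the paper's code: the 5-state chain, `chain_model`, `random_policy`, and the tabular `q` are all hypothetical stand-ins, and the random policy merely plays the role of "the same policy used to generate real actions".

```python
import random


def simulate_trajectories(replay_buffer, model, policy, n_sim, horizon, rng):
    """Roll the model forward from replayed states to get n_sim synthetic trajectories."""
    trajectories = []
    for _ in range(n_sim):
        state = rng.choice(replay_buffer)  # start from a real, replayed observation
        trajectory = []
        for _ in range(horizon):
            action = policy(state, rng)
            next_state, reward = model(state, action)
            trajectory.append((state, action, reward, next_state))
            state = next_state
        trajectories.append(trajectory)
    return trajectories


def td_loss(q, transitions, gamma=0.9):
    """Mean squared one-step TD error; real and synthetic transitions are treated alike."""
    errors = []
    for s, a, r, s_next in transitions:
        target = r + gamma * max(q[s_next].values())
        errors.append((target - q[s][a]) ** 2)
    return sum(errors) / len(errors)


# Hypothetical 5-state chain: reward 1 for entering state 4.
def chain_model(state, action):
    next_state = min(max(state + action, 0), 4)
    return next_state, float(next_state == 4)


def random_policy(state, rng):
    return rng.choice((-1, 1))


rng = random.Random(0)
q = {s: {-1: 0.0, 1: 0.0} for s in range(5)}
real = [(2, 1, 0.0, 3), (3, 1, 1.0, 4)]          # (s, a, r, s') from real experience
replay_buffer = [s for s, _, _, _ in real]
synthetic = simulate_trajectories(
    replay_buffer, chain_model, random_policy, n_sim=3, horizon=4, rng=rng
)
batch = real + [t for trajectory in synthetic for t in trajectory]
loss = td_loss(q, batch)
```

In this sketch the synthetic rollouts simulate the same task that generated the real data. The abstract describes Multitask Preplay as instead simulating an accessible but unpursued task from the replayed starting point.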

Original abstract

Humans can pursue a near-infinite variety of tasks, but typically can only pursue a small number at the same time. We hypothesize that humans leverage experience on one task to preemptively learn solutions to other tasks that were accessible but not pursued. We formalize this idea as Multitask Preplay, a novel algorithm that replays experience on one task as the starting point for "preplay" -- counterfactual simulation of an accessible but unpursued task. Preplay is used to learn a predictive representation that can support fast, adaptive task performance later on. We first show that, compared to traditional planning and predictive representation methods, multitask preplay better predicts how humans generalize to tasks that were accessible but not pursued in a small grid-world, even when people didn't know they would need to generalize to these tasks. We then show these predictions generalize to Craftax, a partially observable 2D Minecraft environment. Finally, we show that Multitask Preplay enables artificial agents to learn behaviors that transfer to novel Craftax worlds sharing task co-occurrence structure. These findings demonstrate that Multitask Preplay is a scalable theory of how humans counterfactually learn and generalize across multiple tasks; endowing artificial agents with the same capacity can significantly improve their performance in challenging multitask environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Multitask Preplay uses replayed experience for counterfactual simulation of unpursued tasks to build predictive representations, and the human gridworld results are the clearest new piece, though the mechanism may not be fully isolated from simpler replay.

The main point is that the authors formalize multitask preplay as replaying real trajectories from one task to run targeted counterfactual simulations on an accessible but unpursued task, then using the results to learn a predictive representation for later generalization. They test this in a small grid world with humans and report better prediction of choices on the unpursued tasks compared to standard planning and predictive-representation baselines, even when participants had no reason to expect the test. The same idea is then applied to Craftax, where agents show transfer to new worlds that share task co-occurrence structure.

Referee Report

2 major / 2 minor

Summary. The paper proposes Multitask Preplay, an algorithm that replays experience from pursued tasks as the starting point for counterfactual simulation of accessible but unpursued tasks, thereby learning predictive representations that support later generalization. It reports that this approach better predicts human choices in a small grid-world task (even without foreknowledge of the need to generalize) than traditional planning or predictive-representation baselines, extends the predictions to the Craftax environment, and enables artificial agents to achieve better transfer to novel Craftax worlds that share task co-occurrence structure.

Significance. If the central empirical claims hold after controlling for experience volume and baseline construction, the work would offer a concrete, scalable mechanism for counterfactual multitask learning that links human generalization to improved agent transfer in partially observable environments. The combination of human behavioral modeling and agent experiments in a complex domain like Craftax is a notable strength, as is the emphasis on preemptive rather than reactive learning.

major comments (2)

[Human results section] Human-experiment results (likely §3 or §4): the reported superiority of Multitask Preplay over predictive-representation baselines does not yet isolate the contribution of the explicit counterfactual task-switch from the simple fact of additional replay on the same trajectories. A standard successor-feature or replay baseline trained on identical pursued-task data (without the preplay switch) must be shown to underperform; otherwise the specific preplay mechanism is not required to explain the human generalization patterns.
[Agent experiments] Agent transfer experiments (likely §5): the claim that Multitask Preplay enables transfer to novel Craftax worlds sharing task co-occurrence structure requires an ablation confirming that the benefit survives when the amount of total experience and the replay buffer contents are matched across conditions. Without this, the improvement could be driven by richer experience rather than the counterfactual structure.

minor comments (2)

[Algorithm description] Clarify the precise definition of 'accessible but not pursued' tasks and how the set of counterfactual simulations is chosen; the current description leaves open whether the algorithm assumes oracle knowledge of which tasks will be relevant later.
[Figures] Figure captions and legends should explicitly state the number of participants, trials per condition, and statistical tests used for the human prediction comparisons.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments on our manuscript. We have addressed each of the major comments point by point below and made revisions to the manuscript where necessary to strengthen the empirical controls.

Referee: [Human results section] Human-experiment results (likely §3 or §4): the reported superiority of Multitask Preplay over predictive-representation baselines does not yet isolate the contribution of the explicit counterfactual task-switch from the simple fact of additional replay on the same trajectories. A standard successor-feature or replay baseline trained on identical pursued-task data (without the preplay switch) must be shown to underperform; otherwise the specific preplay mechanism is not required to explain the human generalization patterns.

Authors: We agree that isolating the specific role of the counterfactual task-switch is important for interpreting the human results. Our existing predictive-representation baselines are trained solely on pursued-task data without any task-switching or counterfactual simulation. To directly address this concern, we will incorporate an additional baseline that performs extra replays exclusively on the pursued-task trajectories, matching the volume of experience but without the preplay switch to unpursued tasks. We will report these results in the revised human-experiment section to demonstrate that the preplay mechanism contributes beyond additional replay alone.

Revision: yes

Referee: [Agent experiments] Agent transfer experiments (likely §5): the claim that Multitask Preplay enables transfer to novel Craftax worlds sharing task co-occurrence structure requires an ablation confirming that the benefit survives when the amount of total experience and the replay buffer contents are matched across conditions. Without this, the improvement could be driven by richer experience rather than the counterfactual structure.

Authors: We concur that controlling for total experience and replay buffer contents is essential to attribute the transfer benefits specifically to the counterfactual structure. While we matched the total number of environment steps in the original experiments, the replay buffers in the Multitask Preplay condition contain additional simulated trajectories. We will add an ablation in which the baseline agent's replay buffer is supplemented with an equivalent volume of additional pursued-task replays or random experiences to match the buffer contents. This will be presented in the revised agent experiments section to confirm that the structured counterfactual preplay provides the observed advantage.

Revision: yes
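One way to make the matched-experience control concrete is to track each condition's update budget explicitly. The sketch below is a hypothetical bookkeeping helper, not something taken from the paper: the `Budget` fields, the function names, and the numbers in the usage lines are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    env_steps: int          # real environment interactions
    replay_updates: int     # TD updates on real pursued-task data
    simulated_updates: int  # TD updates on model-generated data
    simulated_task: str     # "unpursued" (preplay), "pursued", or "none"

    @property
    def total_updates(self) -> int:
        return self.replay_updates + self.simulated_updates


def replay_matched_control(preplay: Budget) -> Budget:
    """Control that turns every simulated update into an extra pursued-task replay."""
    return Budget(preplay.env_steps, preplay.total_updates, 0, "none")


def is_matched(a: Budget, b: Budget) -> bool:
    """Two conditions are matched if real steps and total updates both agree."""
    return a.env_steps == b.env_steps and a.total_updates == b.total_updates


# Hypothetical numbers, for illustration only.
preplay = Budget(env_steps=1000, replay_updates=1000,
                 simulated_updates=4000, simulated_task="unpursued")
control = replay_matched_control(preplay)
unmatched = Budget(env_steps=1000, replay_updates=1000,
                   simulated_updates=0, simulated_task="none")
```

With budgets matched this way, any remaining performance gap can more plausibly be attributed to what is simulated rather than to how many updates each agent receives.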

Circularity Check

0 steps flagged

No significant circularity: empirical comparisons rest on independent baselines and external human data

The paper introduces Multitask Preplay as an algorithm that replays pursued-task experience to simulate unpursued tasks, then reports that this yields better predictions of human generalization in a grid-world and transfer in Craftax than traditional planning and predictive-representation baselines. No equations, parameter-fitting procedures, or self-citation chains are visible that would reduce the reported predictions or superiority claims to quantities defined by construction from the same human or agent data. The central results are empirical model comparisons against external benchmarks (human choices and novel environments), which remain falsifiable and do not rely on self-definitional loops, fitted-input predictions, or load-bearing uniqueness theorems imported from the authors' prior work. The derivation is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are specified in the available text.

[1] [1]

When does model-based control pay off? PLoS computational biology , 12(8):e1005090, 2016

Wouter Kool, Fiery A Cushman, and Samuel J Gershman. When does model-based control pay off? PLoS computational biology , 12(8):e1005090, 2016

work page 2016

[2] [2]

Predictive representations: building blocks of intelligence

Wilka Carvalho, Momchil S Tomov, William de Cothi, Caswell Barry, and Samuel J Gershman. Predictive representations: building blocks of intelligence. Neural Computation , pages 1–74, 2024

work page 2024

[3] [3]

Human-level control through deep reinforcement learning

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. nature, 518(7540):529–533, 2015

work page 2015

[4] [4]

Integrated architectures for learning, planning, and reacting based on approximating dynamic programming

Richard S Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Machine learning proceedings 1990 , pages 216–224. Elsevier, 1990

work page 1990

[5] [5]

Retrospective revaluation in sequential decision making: a tale of two systems

Samuel J Gershman, Arthur B Markman, and A Ross Otto. Retrospective revaluation in sequential decision making: a tale of two systems. Journal of Experimental Psychology: General , 143(1):182, 2014

work page 2014

[6] [6]

Universal value function approximators

Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function approximators. In International conference on machine learning , pages 1312–1320. PMLR, 2015

work page 2015

[7] [7]

Learning to achieve goals

Leslie Pack Kaelbling. Learning to achieve goals. In IJCAI, volume 2, pages 1094–1098. Citeseer, 1993

work page 1993

[8] [8]

Multi-task reinforcement learning in humans

Momchil S Tomov, Eric Schulz, and Samuel J Gershman. Multi-task reinforcement learning in humans. Nature Human Behaviour , 5(6):764–773, 2021

work page 2021

[9] [9]

Open-ended learning leads to generally capable agents

Open Ended Learning Team, Adam Stooke, Anuj Mahajan, Catarina Barros, Charlie Deck, Jakob Bauer, Jakub Sygnowski, Maja Trebacz, Max Jaderberg, Michael Mathieu, et al. Open-ended learning leads to generally capable agents. arXiv preprint arXiv:2107.12808, 2021

work page arXiv 2021

[10] André Barreto, Will Dabney, Rémi Munos, Jonathan J Hunt, Tom Schaul, Hado P van Hasselt, and David Silver. Successor features for transfer in reinforcement learning. Advances in Neural Information Processing Systems, 30, 2017

[11] Andre Barreto, Diana Borsa, John Quan, Tom Schaul, David Silver, Matteo Hessel, Daniel Mankowitz, Augustin Zidek, and Remi Munos. Transfer in deep reinforcement learning using successor features and generalised policy improvement. In International Conference on Machine Learning, pages 501–510. PMLR, 2018

[12] Ida Momennejad, Evan M Russek, Jin H Cheong, Matthew M Botvinick, Nathaniel Douglass Daw, and Samuel J Gershman. The successor representation in human reinforcement learning. Nature Human Behaviour, 1(9):680–692, 2017

[13] Ida Momennejad, A Ross Otto, Nathaniel D Daw, and Kenneth A Norman. Offline replay supports planning in human reinforcement learning. eLife, 7:e32548, 2018

[14] Quentin JM Huys, Níall Lally, Paul Faulkner, Neir Eshel, Erich Seifritz, Samuel J Gershman, Peter Dayan, and Jonathan P Roiser. Interplay of approximate planning strategies. Proceedings of the National Academy of Sciences, 112(10):3098–3103, 2015

[15] Michael Matthews, Michael Beukman, Benjamin Ellis, Mikayel Samvelyan, Matthew Jackson, Samuel Coward, and Jakob Foerster. Craftax: A lightning-fast benchmark for open-ended reinforcement learning. In International Conference on Machine Learning (ICML), 2024

[16] Rishabh Agarwal, Marlos C Machado, Pablo Samuel Castro, and Marc G Bellemare. Contrastive behavioral similarity embeddings for generalization in reinforcement learning. arXiv preprint arXiv:2101.05265, 2021

[17] Tejas D Kulkarni, Karthik Narasimhan, Ardavan Saeedi, and Josh Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. Advances in neural information processing systems, 29, 2016

[18] Diana Borsa, André Barreto, John Quan, Daniel Mankowitz, Rémi Munos, Hado Van Hasselt, David Silver, and Tom Schaul. Universal successor features approximators. arXiv preprint arXiv:1812.07626, 2018

[19] Agnes Norbury, Trevor W Robbins, and Ben Seymour. Value generalization in human avoidance learning. eLife, 7:e34779, 2018

[20] Eric Schulz, Emmanouil Konstantinidis, and Maarten Speekenbrink. Putting bandits into context: How function learning supports decision making. Journal of experimental psychology: learning, memory, and cognition, 44(6):927, 2018

[21] Charley M Wu, Björn Meder, and Eric Schulz. Unifying principles of generalization: past, present, and future. Annual Review of Psychology, 76, 2024

[22] Sam Hall-McMaster, Momchil S Tomov, Samuel J Gershman, and Nicolas W Schuck. Neural evidence that humans reuse strategies to solve new tasks. PLoS Biology, 23:e3003174, 2025

[23] Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. Advances in neural information processing systems, 30, 2017

[24] Rui Yang, Meng Fang, Lei Han, Yali Du, Feng Luo, and Xiu Li. MHER: Model-based hindsight experience replay. arXiv preprint arXiv:2107.00306, 2021

[25] Zhaohan Daniel Guo and Emma Brunskill. Directed exploration for reinforcement learning. arXiv preprint arXiv:1906.07805, 2019

[26] Zhizhou Ren, Kefan Dong, Yuan Zhou, Qiang Liu, and Jian Peng. Exploration via hindsight goal generation. Advances in Neural Information Processing Systems, 32, 2019

[27] Lorenzo Moro, Amarildo Likmeta, Enrico Prati, Marcello Restelli, et al. Goal-directed planning via hindsight experience replay. In 10th International Conference on Learning Representations, ICLR 2022, pages 1–16, 2022

[28] Vivek Veeriah, Junhyuk Oh, and Satinder Singh. Many-goals reinforcement learning. arXiv preprint arXiv:1806.09605, 2018

[29] Russell Mendonca, Oleh Rybkin, Kostas Daniilidis, Danijar Hafner, and Deepak Pathak. Discovering and achieving goals via world models. Advances in Neural Information Processing Systems, 34:24379–24391, 2021

[30] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023

[31] Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering Atari, Go, chess and shogi by planning with a learned model. Nature, 588(7839):604–609, 2020

[32] Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, et al. Habitat: A platform for embodied AI research. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9339–9347, 2019

[33] Andrew Szot, Alexander Clegg, Eric Undersander, Erik Wijmans, Yili Zhao, John Turner, Noah Maestre, Mustafa Mukadam, Devendra Singh Chaplot, Oleksandr Maksymets, et al. Habitat 2.0: Training home assistants to rearrange their habitat. Advances in neural information processing systems, 34:251–266, 2021

[34] Wilka Carvalho, Anthony Liang, Kimin Lee, Sungryull Sohn, Honglak Lee, Richard L Lewis, and Satinder Singh. Reinforcement learning for sparse-reward object-interaction tasks in a first-person simulated 3D environment. arXiv preprint arXiv:2010.15195, 2020

[35] George Dragoi and Susumu Tonegawa. Preplay of future place cell sequences by hippocampal cellular assemblies. Nature, 469(7330):397–401, 2011

[36] George Dragoi and Susumu Tonegawa. Distinct preplay of multiple novel spatial experiences in the rat. Proceedings of the National Academy of Sciences, 110(22):9100–9105, 2013

[37] H Freyja Ólafsdóttir, Caswell Barry, Aman B Saleem, Demis Hassabis, and Hugo J Spiers. Hippocampal place cells construct reward related sequences through unexplored space. eLife, 4:e06063, 2015

[38] Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 2018

[39] Steven Kapturowski, Georg Ostrovski, John Quan, Remi Munos, and Will Dabney. Recurrent experience replay in distributed reinforcement learning. In International conference on learning representations, 2018

[40] Matthew W Hoffman, Bobak Shahriari, John Aslanides, Gabriel Barth-Maron, Nikola Momchev, Danila Sinopalnikov, Piotr Stańczyk, Sabela Ramos, Anton Raichuk, Damien Vincent, et al. Acme: A research framework for distributed reinforcement learning. arXiv preprint arXiv:2006.00979, 2020

[41] Chris Lu, Jakub Kuba, Alistair Letcher, Luke Metz, Christian Schroeder de Witt, and Jakob Foerster. Discovered policy optimisation. Advances in Neural Information Processing Systems, 35:16455–16468, 2022

[42] Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 conference of the north american chapter of the association for computational linguistics: Human language technologies, pages 746–751, 2013

[43] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

[44] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997

[45] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015

[46] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

[47] Karl Cobbe, Christopher Hesse, Jacob Hilton, and John Schulman. Leveraging procedural generation to benchmark reinforcement learning. arXiv preprint arXiv:1912.01588, 2019

[48] Heinrich Küttler, Nantas Nardelli, Alexander Miller, Roberta Raileanu, Marco Selvatici, Edward Grefenstette, and Tim Rocktäschel. The NetHack learning environment. Advances in Neural Information Processing Systems, 33:7671–7684, 2020

[49] Antoine Dedieu, Joseph Ortiz, Xinghua Lou, Carter Wendelken, Wolfgang Lehrach, J Swaroop Guntupalli, Miguel Lazaro-Gredilla, and Kevin Patrick Murphy. Improving transformer world models for data-efficient RL. ICML, 2025

[50] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

Methods

Models

Preliminaries

We formulate domains as Partially Observable Controlled Markov Processes C = ⟨S, A, X, P, O⟩, where S denotes the environment state space, A denotes its action s...