ARROW: Augmented Replay for RObust World models

Abdallah Al Siyabi; Abdulaziz Alyahya; Gideon Kowadlo; Levin Kuhlmann; Luke Yang; Markus R. Ernst

arxiv: 2603.11395 · v2 · pith:B4LNMOKD · submitted 2026-03-12 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-21 11:15 UTC · model grok-4.3

keywords continual reinforcement learning · world models · replay buffers · catastrophic forgetting · model-based RL · DreamerV3 · Atari · Procgen

The pith

ARROW uses a short-term and long-term replay buffer in DreamerV3 to reduce forgetting on unrelated tasks while matching forward transfer on related ones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ARROW to address continual reinforcement learning, where agents must learn new skills without losing old ones. It extends the DreamerV3 world model with a memory-efficient replay buffer that splits into a short-term buffer holding recent experiences and a long-term buffer that uses intelligent sampling to keep task diversity. This bio-inspired design replays experiences to the predictive world model rather than directly to the policy, aiming to avoid the large memory needs and forgetting seen in standard model-free replay methods. Tests on Atari games without shared structure and Procgen variants with shared structure show the approach yields substantially less forgetting than same-size baselines on non-overlapping tasks.

Core claim

ARROW maintains two complementary buffers: a short-term buffer for recent experiences and a long-term buffer that preserves task diversity through intelligent sampling, demonstrating substantially less forgetting on tasks without shared structure compared to model-free and model-based baselines with replay buffers of the same size, while maintaining comparable forward transfer.

What carries the argument

The distribution-matching long-term buffer that intelligently samples past experiences to maintain task diversity during world model training.
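
A minimal Python sketch may help picture the two-buffer idea. It assumes the long-term buffer is a reservoir sample in which each rollout chunk gets a random key and the chunks with the largest keys are kept, which is the mechanism mentioned in the paper passage quoted in the Lean-theorem section of this page. The class name `DualReplayBuffer`, the capacities, and the 50/50 sampling mix are hypothetical choices for illustration, not the authors' implementation.

```python
import heapq
import random
from collections import Counter, deque


class DualReplayBuffer:
    """Hypothetical sketch: FIFO short-term buffer plus reservoir long-term buffer."""

    def __init__(self, short_capacity, long_capacity, seed=0):
        self.rng = random.Random(seed)
        self.short = deque(maxlen=short_capacity)  # recent chunks, oldest evicted first
        self.long_capacity = long_capacity
        self._heap = []   # min-heap of (key, counter, chunk); smallest key is evicted
        self._count = 0   # unique tie-breaker so chunks themselves are never compared

    def add(self, chunk):
        self.short.append(chunk)
        entry = (self.rng.random(), self._count, chunk)
        self._count += 1
        if len(self._heap) < self.long_capacity:
            heapq.heappush(self._heap, entry)
        elif entry[0] > self._heap[0][0]:
            # Keeping the largest keys yields a uniform sample of the whole stream.
            heapq.heapreplace(self._heap, entry)

    def long_term_chunks(self):
        return [chunk for _, _, chunk in self._heap]

    def sample(self, batch_size, long_fraction=0.5):
        # Assumed mixing rule: a fixed fraction of each batch comes from long-term memory.
        n_long = min(int(batch_size * long_fraction), len(self._heap))
        n_short = min(batch_size - n_long, len(self.short))
        batch = self.rng.sample(self.long_term_chunks(), n_long)
        batch += self.rng.sample(list(self.short), n_short)
        return batch


def demo():
    """Stream three tasks in sequence and report which tasks each buffer still holds."""
    buf = DualReplayBuffer(short_capacity=300, long_capacity=300, seed=0)
    for task in ("task_a", "task_b", "task_c"):
        for step in range(1000):
            buf.add({"task": task, "step": step})
    short_tasks = Counter(c["task"] for c in buf.short)
    long_tasks = Counter(c["task"] for c in buf.long_term_chunks())
    return short_tasks, long_tasks
```

In `demo()`, the short-term buffer ends up holding only chunks from the last task, while the reservoir keeps chunks from all three tasks in roughly equal shares. That contrast is the intuition behind preserving task diversity with a fixed memory budget.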

If this is right

Model-based methods can achieve better retention in continual RL without needing larger replay memory than model-free baselines.
Replaying experiences to the world model supports retention on tasks with no shared structure.
Forward transfer remains intact on tasks that do share structure, such as Procgen CoinRun variants.
Bio-inspired replay mechanisms offer a scalable path for continual learning in reinforcement learning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The dual-buffer design might allow world models to act as compact, reusable memory stores for sequences of many more tasks.
Similar sampling strategies could be tested in non-RL continual learning domains where distribution shift also drives forgetting.
If the long-term buffer scales well, it could lower the memory footprint for deployed agents that encounter open-ended task streams.

Load-bearing premise

The intelligent sampling from the long-term buffer preserves task diversity and prevents the distribution shift that causes forgetting, without introducing instabilities into world model training.

What would settle it

Running the same Atari continual learning experiments but replacing the intelligent sampling with uniform random selection from the long-term buffer and measuring whether the reduction in forgetting disappears.

Figures

Figures reproduced from arXiv: 2603.11395 by Abdallah Al Siyabi, Abdulaziz Alyahya, Gideon Kowadlo, Levin Kuhlmann, Luke Yang, Markus R. Ernst.

**Figure 2.** Experiment setup. (A) Augmented buffer used in ARROW. (B) Continual learning tasks with …

**Figure 3.** Atari median normalized performance (Eq. 1). Shaded area depicts 0.25 and 0.75 quartiles of 5 seeds. Bold line segments indicate training of task. (A) Default order of tasks (one-cycle). (B) Reversed order of tasks (one-cycle). (C) Default order of tasks (two-cycle). The dotted vertical line marks the end of cycle 1 and the beginning of cycle 2.

**Figure 4.** Atari metrics shown as median with (0.25 - 0.75) quartile confidence intervals, across 5 seeds, and calculated using normalized scores (Eq. 1). (A) Default task order (one-cycle). (B) Reversed task order (one-cycle). (C) Default task order (two-cycle).

Accompanying text: "ARROW maintains the highest WC-ACC (0.618), confirming that its stability advantage is robust to task ordering."

**Figure 5.** CoinRun median normalized performance (Eq. 1). Shaded area depicts 0.25 and 0.75 quartiles of 5 seeds. Bold line segments indicate training of task. (A) Default order of tasks (one-cycle). (B) Reversed order of tasks (one-cycle). (C) Default order of tasks (two-cycle). The dotted vertical line marks the end of cycle 1 and the beginning of cycle 2.

**Figure 6.** CoinRun metrics shown as median with (0.25 - 0.75) quartile confidence intervals, across 5 seeds, and calculated using normalized scores (Eq. 1). (A) Default order of tasks (one-cycle). (B) Reversed order of tasks (one-cycle). (C) Default order of tasks (two-cycle).
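
The captions above summarize each curve as a median with 0.25 and 0.75 quartiles over 5 seeds. The sketch below shows one way to compute such a summary in Python. The scores are hypothetical placeholders, and the `inclusive` interpolation method is an assumption, since the captions do not say which quantile convention the paper uses.

```python
import statistics


def median_with_quartiles(scores):
    """Return (q25, median, q75) for a list of per-seed normalized scores."""
    q25, q50, q75 = statistics.quantiles(scores, n=4, method="inclusive")
    return q25, q50, q75


# Hypothetical normalized scores for one task at one checkpoint, one per seed.
seed_scores = [0.52, 0.61, 0.58, 0.70, 0.49]
q25, med, q75 = median_with_quartiles(seed_scores)
```

With only 5 seeds, the quartiles coincide with the second and fourth sorted values under this convention, so the shaded band is effectively set by two runs.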

Original abstract

Continual reinforcement learning challenges agents to acquire new skills while retaining previously learned ones with the goal of improving performance in both past and future tasks. Most existing approaches rely on model-free methods with replay buffers to mitigate catastrophic forgetting; however, these solutions often face significant scalability challenges due to large memory demands. Drawing inspiration from neuroscience, where the brain replays experiences to a predictive World Model rather than directly to the policy, we present ARROW (Augmented Replay for RObust World models), a model-based continual RL algorithm that extends DreamerV3 with a memory-efficient, distribution-matching replay buffer. Unlike standard fixed-size FIFO buffers, ARROW maintains two complementary buffers: a short-term buffer for recent experiences and a long-term buffer that preserves task diversity through intelligent sampling. We evaluate ARROW on two challenging continual RL settings: tasks without shared structure (Atari), and tasks with shared structure, where knowledge transfer is possible (Procgen CoinRun variants). Compared to model-free and model-based baselines with replay buffers of the same size, ARROW demonstrates substantially less forgetting on tasks without shared structure, while maintaining comparable forward transfer. Our findings highlight the potential of model-based RL and bio-inspired approaches for continual reinforcement learning, warranting further research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ARROW adds a short-term plus long-term buffer with distribution matching to DreamerV3, but the abstract gives no numbers or sampling details so the forgetting reduction is hard to assess.

The main takeaway is that ARROW tries to solve memory bloat in continual RL by replaying to the world model instead of the policy, using two buffers instead of one fixed FIFO. The short-term buffer handles recent data while the long-term one samples to keep task diversity. This is a direct extension of DreamerV3 with a neuroscience-inspired twist, and the paper runs it on Atari (no shared structure) and Procgen CoinRun variants (shared structure). It claims less forgetting than same-size baselines on the first set and comparable forward transfer on the second. That setup is straightforward and targets a real deployment issue in long-horizon agents. The framing is clear and the choice of benchmarks makes sense for testing both forgetting and transfer.

The soft spots sit in the missing pieces. No quantitative results, error bars, or ablation numbers appear in the abstract, and the exact implementation of the intelligent sampling for the long-term buffer is not described. Without those, it is difficult to judge whether the distribution matching actually works as intended or whether it adds training instability. The full paper may fill this in, but the current text leaves the central empirical claim unverified.

This paper is aimed at people already working on model-based continual RL who want practical memory tricks. A reader looking for new theory or large performance jumps will not find much, but someone testing incremental buffer designs could pick up the architecture idea. I would send it for peer review so the experiments and sampling method can be checked properly, though it reads as solid engineering rather than a foundational shift.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes ARROW, a model-based continual RL method extending DreamerV3 with dual replay buffers: a short-term buffer for recent experiences and a long-term buffer that uses distribution-matching sampling to preserve task diversity. It evaluates the approach on Atari (tasks without shared structure) and Procgen CoinRun variants (tasks with shared structure), claiming substantially less forgetting than model-free and model-based baselines that use replay buffers of the same size, while maintaining comparable forward transfer.

Significance. If the empirical results hold, the work would be significant for continual RL by showing that a bio-inspired, model-based replay mechanism can reduce catastrophic forgetting more effectively than standard FIFO buffers while remaining memory-efficient. The dual-buffer design and explicit focus on distribution matching provide a concrete, testable alternative to existing replay strategies.

major comments (2)

Abstract and Experiments section: The central claim of 'substantially less forgetting' on non-shared tasks is not supported by any quantitative metrics, error bars, run counts, or ablation results in the provided text. Without these, the magnitude and reliability of the improvement cannot be assessed and the comparison to same-size-buffer baselines remains unverifiable.
Method section (distribution-matching sampling): The description of how the long-term buffer performs 'intelligent sampling' to preserve task diversity is missing algorithmic details, pseudocode, or a precise objective. This step is load-bearing for the weakest assumption and for the claimed advantage over standard replay; its absence prevents verification that no new instabilities are introduced in world-model training.

minor comments (2)

The abstract refers to 'replay buffers of the same-size' for baselines; clarify whether total memory footprint or per-buffer capacity is matched, as ARROW maintains two buffers.
Add a table or plot in the results section that directly reports forgetting metrics (e.g., performance drop on previous tasks) with standard deviations for ARROW versus each baseline.
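
To illustrate the kind of forgetting table the second minor comment asks for, here is a small Python sketch. It assumes forgetting is defined as the drop in normalized score on a task between the end of that task's own training phase and the end of the full task sequence. That is a common convention, not necessarily the metric the paper uses, and the numbers are hypothetical placeholders rather than reported results.

```python
import statistics


def forgetting(score_after_own_training, score_at_end):
    """Drop in normalized score on a task between the end of its own
    training phase and the end of the whole task sequence."""
    return score_after_own_training - score_at_end


def summarize_forgetting(per_seed_pairs):
    """Mean and sample standard deviation of forgetting across seeds."""
    drops = [forgetting(after, end) for after, end in per_seed_pairs]
    return statistics.mean(drops), statistics.stdev(drops)


# Hypothetical (score after own training, score at end) pairs, one per seed.
pairs = [(0.80, 0.60), (0.75, 0.65), (0.90, 0.60), (0.85, 0.70), (0.70, 0.55)]
mean_drop, std_drop = summarize_forgetting(pairs)
```

Reporting `mean_drop` and `std_drop` per method and per task, side by side, would let a reader judge both the size and the reliability of any claimed reduction.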

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, indicating the revisions we will incorporate in the next version to strengthen the presentation and verifiability of our results.

Referee: Abstract and Experiments section: The central claim of 'substantially less forgetting' on non-shared tasks is not supported by any quantitative metrics, error bars, run counts, or ablation results in the provided text. Without these, the magnitude and reliability of the improvement cannot be assessed and the comparison to same-size-buffer baselines remains unverifiable.

Authors: We agree that the abstract as currently written does not contain explicit quantitative metrics, error bars, or run counts, which limits immediate assessment of the effect size. The full experiments section reports comparative performance on Atari and Procgen, but to address this directly we will revise the abstract to include key quantitative results (e.g., average forgetting reduction percentages relative to same-size FIFO baselines) and will add explicit statements on the number of independent runs (5 seeds per condition) together with standard error bars. We will also expand the experiments section with a dedicated ablation table isolating the contribution of the long-term distribution-matching buffer. These additions will make the magnitude and reliability of the improvement verifiable without altering the underlying claims.

revision: yes

Referee: Method section (distribution-matching sampling): The description of how the long-term buffer performs 'intelligent sampling' to preserve task diversity is missing algorithmic details, pseudocode, or a precise objective. This step is load-bearing for the weakest assumption and for the claimed advantage over standard replay; its absence prevents verification that no new instabilities are introduced in world-model training.

Authors: We acknowledge that the current method description is insufficiently precise on the distribution-matching sampling mechanism. In the revised manuscript we will add a dedicated subsection with the exact objective (minimizing a divergence measure between the empirical task distribution in the long-term buffer and a uniform target distribution over observed tasks), the sampling procedure, and full pseudocode. This will clarify how the buffer differs from FIFO replay and allow direct inspection of potential effects on world-model training stability. Our internal experiments showed no introduced instabilities, but the added formalization will enable readers to verify this.

revision: yes
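
The objective named in the simulated rebuttal, a divergence between the buffer's empirical task distribution and a uniform target over observed tasks, can be made concrete with a short Python sketch. The rebuttal does not say which divergence is meant, so the choice of KL divergence here is an assumption, and the function name `kl_to_uniform` is hypothetical.

```python
import math
from collections import Counter


def kl_to_uniform(task_labels, observed_tasks):
    """KL(empirical || uniform) in nats, where the uniform target spans
    every task observed so far, including tasks absent from the buffer."""
    counts = Counter(task_labels)
    total = len(task_labels)
    k = len(observed_tasks)
    kl = 0.0
    for task in observed_tasks:
        p = counts[task] / total
        if p > 0:  # terms with p == 0 contribute nothing (0 * log 0 := 0)
            kl += p * math.log(p * k)
    return kl


tasks = ["a", "b", "c"]
balanced = ["a"] * 100 + ["b"] * 100 + ["c"] * 100   # diversity preserved
skewed = ["a"] * 10 + ["b"] * 20 + ["c"] * 270       # mostly the latest task
fifo_like = ["c"] * 300                              # only the latest task left

kl_balanced = kl_to_uniform(balanced, tasks)
kl_skewed = kl_to_uniform(skewed, tasks)
kl_fifo = kl_to_uniform(fifo_like, tasks)
```

Under this measure a perfectly balanced buffer scores zero, and a buffer that holds only the most recent task reaches the maximum of `log(k)` for `k` observed tasks, so the value could serve as a simple diagnostic when comparing buffer designs.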

Circularity Check

0 steps flagged

No significant circularity

The paper describes an empirical extension of DreamerV3 using dual replay buffers (short-term and long-term with intelligent sampling) and reports performance via direct experimental comparison to model-free and model-based baselines on Atari and Procgen tasks. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claims rest on observed reductions in forgetting and maintained transfer, which are externally falsifiable through replication rather than reducing to the method's own definitions or prior author work by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract does not introduce or rely on any explicit free parameters, axioms, or invented entities beyond the standard assumptions of DreamerV3 and replay-buffer methods.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "ARROW maintains two complementary buffers: a short-term buffer for recent experiences and a long-term buffer that preserves task diversity through intelligent sampling... reservoir sampling by assigning each rollout chunk a random key"

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We evaluate ARROW on two challenging continual RL settings: Tasks without shared structure (Atari), and tasks with shared structure (Procgen CoinRun variants)"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

A Tabular data & additional results (excerpt; truncated in the source)

| Variant | Procgen flag | Description |
| --- | --- | --- |
| CoinRun | (none) | regularly rendered game |
| +NB | use_backgrounds = False | removes decorative backgrounds |
| +RT | restrict_themes = True | restricts the set of level themes |
| +GA | use_generated_assets = True | enables procedurally generated assets |
| +MA | use_monochrome_assets = Tr... | ... |

[1] [1]

Layer Normalization

URL https://arxiv. org/abs/1607.06450. M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The Arcade Learning Environment: An Evaluation Platform for General Agents.Journal of Artificial Intelligence Research, 47:253–279, June

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

doi: 10.1613/jair.3912

ISSN 1076-9757. doi: 10.1613/jair.3912. Arslan Chaudhry, Albert Gordo, Puneet Dokania, Philip Torr, and David Lopez-Paz. Using hindsight to anchor past knowledge in continual learning.Proceedings of the AAAI Conference on Artificial Intelligence, 35(8):6993–7001, May

work page doi:10.1613/jair.3912

[3] [3]

URL https://ojs.aaai.org/index.php/AAAI/ article/view/16861

doi: 10.1609/aaai.v35i8.16861. URL https://ojs.aaai.org/index.php/AAAI/ article/view/16861. Zhiyuan Chen and Bing Liu. Continual learning and catastrophic forgetting. InLifelong Machine Learning, pp. 55–75. Springer International Publishing, Cham,

work page doi:10.1609/aaai.v35i8.16861

[4] [4]

doi: 10.1007/978-3-031-01581-6_4

ISBN 978-3-031-01581-6. doi: 10.1007/978-3-031-01581-6_4. Karl Cobbe, Chris Hesse, Jacob Hilton, and John Schulman. Leveraging procedural generation to benchmark reinforcement learning. In Hal Daumé III and Aarti Singh (eds.),Proceedings of the 37th International Conference on Machine Learning, volume119ofProceedings of Machine Learning Research, pp.2048–...

work page doi:10.1007/978-3-031-01581-6_4 2048

[5] [5]

PathNet: Evolution Channels Gradient Descent in Super Neural Networks

URL https://arxiv.org/abs/1701.08734. Robert M. French. Catastrophic forgetting in connectionist networks.Trends in Cognitive Sciences, 3(4): 128–135,

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

doi: 10.1016/S1364-6613(99)01294-2

ISSN 1364-6613. doi: 10.1016/S1364-6613(99)01294-2. URL https://www.sciencedirect. com/science/article/pii/S1364661399012942. David Ha and Jürgen Schmidhuber. Recurrent world models facilitate policy evolution. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.),Advances in Neural Information Processing Systems, volume

work page doi:10.1016/s1364-6613(99)01294-2

[7] [7]

cc/paper_files/paper/2018/hash/2de5d16682c3c35007e4e92982f1a2ba-Abstract.html

URL https://papers.neurips. cc/paper_files/paper/2018/hash/2de5d16682c3c35007e4e92982f1a2ba-Abstract.html. DanijarHafner, TimothyLillicrap, IanFischer, RubenVillegas, DavidHa, HonglakLee, andJamesDavidson. Learning latent dynamics for planning from pixels. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.),Proceedings of the 36th International Conferen...

work page 2018

[8] [8]

doi: 10.1038/s41586-025-08744-2

doi: 10.1038/s41586-025-08744-2. Nicklas Hansen, Hao Su, and Xiaolong Wang. TD-MPC2: Scalable, robust world models for continuous control. InThe Twelfth International Conference on Learning Representations,

work page doi:10.1038/s41586-025-08744-2

[9] [9]

doi: 10.1016/j.neuron.2017

ISSN 0896-6273. doi: 10.1016/j.neuron.2017. 06.011. URL https://www.sciencedirect.com/science/article/pii/S0896627317305093. Dan Hendrycks and Kevin Gimpel. Bridging nonlinearities and stochastic regularizers with gaussian error linear units,

[10]

Gaussian Error Linear Units (GELUs)

URL http://arxiv.org/abs/1606.08415. Yizhou Huang, Kevin Xie, Homanga Bharadhwaj, and Florian Shkurti. Continual model-based reinforcement learning with hypernetworks. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pp. 799–805,

[11]

Lee, Matthew Tan, Yuke Zhu, and Jeannette Bohg

doi: 10.1109/ICRA48506.2021.9560793. David Isele and Akansel Cosgun. Selective experience replay for lifelong learning. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1), Apr

[12]

doi: 10.1609/aaai.v32i1.11595. URL https://ojs.aaai.org/index.php/AAAI/article/view/11595. Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1):99–134,

[13]

ISSN 0004-3702. doi: 10.1016/S0004-3702(98)00023-X. URL https://www.sciencedirect.com/science/article/pii/S000437029800023X. Samuel Kessler, Piotr Milos, Jack Parker-Holder, and Stephen J. Roberts. The surprising effectiveness of latent world models for continual reinforcement learning. In Deep Reinforcement Learning Workshop NeurIPS 2022,

[14]

ISSN 1076-9757. doi: 10.1613/jair.1.13673. James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. Proceedings of the...

[15]

doi: 10.1073/pnas.1611835114. URL https://www.pnas.org/doi/abs/10.1073/pnas.1611835114. Matthias De Lange, Gido van de Ven, and Tinne Tuytelaars. Continual evaluation for lifelong learning: Identifying the stability gap,

[16]

URL https://arxiv.org/abs/2205.13452. Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551,

[17]

Backpropagation applied to handwritten zip code recognition

doi: 10.1162/neco.1989.1.4.541. Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17(39):1–40,

[18]

URL https://proceedings.neurips.cc/paper_files/paper/2017/file/f87522788a2be2d171666752f97ddebb-Paper.pdf. Marlos C. Machado, Marc G. Bellemare, Erik Talvitie, Joel Veness, Matthew Hausknecht, and Michael Bowling. Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents. Journal of Artificial Intelligence Resea...

[19]

ISSN 1076-9757. doi: 10.1613/jair.5699. Mackenzie Weygandt Mathis. The neocortical column as a universal template for perception and world-model learning. Nature Reviews Neuroscience, 24(1):3–3,

[20]

doi: 10.1038/s41583-022-00658-6. Michael McCloskey and Neal J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Gordon H. Bower (ed.), Psychology of learning and motivation, volume 24 of Psychology of Learning and Motivation, pp. 109–165. Academic Press,

[21]

doi: 10.1016/S0079-7421(08)60536-8. URL https://www.sciencedirect.com/science/article/pii/S0079742108605368. Martial Mermillod, Aurélia Bugaiska, and Patrick Bonin. The stability-plasticity dilemma: investigating the continuum from catastrophic forgetting to age-limited learning effects. Frontiers in Psychology, Volume 4 - 2013,

[22]

ISSN 1664-1078. doi: 10.3389/fpsyg.2013.00504. URL https://www.frontiersin.org/journals/psychology/articles/10.3389/fpsyg.2013.00504. Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis An...

[23]

Deep online learning via meta-learning: Continual adaptation for model-based RL

Anusha Nagabandi, Chelsea Finn, and Sergey Levine. Deep online learning via meta-learning: Continual adaptation for model-based RL. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9,

[24]

Dota 2 with Large Scale Deep Reinforcement Learning

URL https://arxiv.org/abs/1912.06680. German I. Parisi, Ronald Kemker, Jose L. Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review. Neural Networks, 113:54–71,

[25]

doi: 10.1016/j.neunet.2019.01.012. Martin L. Puterman. Chapter 8 Markov decision processes. In Stochastic Models, volume 2 of Handbooks in Operations Research and Management Science, pp. 331–434. Elsevier,

[26]

doi: 10.1016/S0927-0507(05)80172-0. URL https://www.sciencedirect.com/science/article/pii/S0927050705801720. Ali Rahimi-Kalahroudi, Janarthanan Rajendran, Ida Momennejad, Harm van Seijen, and Sarath Chandar. Replay buffer with local forgetting for adapting to local environment changes in deep model-based reinforcement learning. In Sarath Chandar, Razvan Pascanu, Han...

[27]

URL https://proceedings.mlr.press/v232/rahimi-kalahroudi23a.html. Matthew Riemer, Ignacio Cases, Robert Ajemian, Miao Liu, Irina Rish, Yuhai Tu, and Gerald Tesauro. Learning to learn without forgetting by maximizing transfer and minimizing interference. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9,

[28]

URL https://proceedings.neurips.cc/paper_files/paper/2019/file/fa7cdfad1a5aaf8370ebeda47a1ff1c3-Paper.pdf. Jonathan Schwarz, Wojciech Czarnecki, Jelena Luketina, Agnieszka Grabska-Barwinska, Yee Whye Teh, Razvan Pascanu, and Raia Hadsell. Progress & compress: A scalable framework for continual learning. In Jennifer Dy and Andreas Krause (eds.), Proceeding...

[29]

MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE

doi: 10.1109/IROS.2012.6386109. Oriol Vinyals, Igor Babuschkin, Wojciech M. Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H. Choi, Richard Powell, Timo Ewalds, Petko Georgiev, Junhyuk Oh, Dan Horgan, Manuel Kroiss, Ivo Danihelka, Aja Huang, Laurent Sifre, Trevor Cai, John P. Agapiou, Max Jaderberg, Alexander S. Vezhnevets, Rémi Leblo...

[30]

doi: 10.1038/s41586-019-1724-z. Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256,

[31]

doi: 10.1007/BF00992696. Yaosheng Xu, Dailin Hu, Litian Liang, Stephen Marcus McAleer, Pieter Abbeel, and Roy Fox. Target entropy annealing for discrete soft actor-critic. In Deep RL Workshop NeurIPS 2021,

[32]

URL https://arxiv.org/abs/2401.16650.

A Tabular data & additional results

Variant | Procgen flag | Description
Coinrun | — | regularly rendered game
+NB | use_backgrounds = False | removes decorative backgrounds
+RT | restrict_themes = True | restricts the set of level themes
+GA | use_generated_assets = True | enables procedurally generated assets
+MA | use_monochrome_assets = Tr...
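
The variant labels above pair a short suffix with a single Procgen flag. As a rough illustration only, the Python sketch below turns such a label into a dictionary of keyword arguments. The `Base+XX+YY` label syntax, the names `VARIANT_FLAGS` and `variant_kwargs`, and the idea that suffixes can be stacked are assumptions made for this example, not details taken from the paper. The +MA row is truncated in the source, so it is left out rather than guessed.

```python
from typing import Dict

# Flag settings copied from the variant list above. The +MA row is truncated
# in the source, so it is deliberately omitted instead of being filled in.
VARIANT_FLAGS: Dict[str, Dict[str, bool]] = {
    "NB": {"use_backgrounds": False},
    "RT": {"restrict_themes": True},
    "GA": {"use_generated_assets": True},
}


def variant_kwargs(label: str) -> Dict[str, bool]:
    """Translate a label such as 'Coinrun+NB+RT' into keyword arguments.

    The 'Base+XX+YY' syntax is a hypothetical convention for this sketch.
    """
    base, *suffixes = label.split("+")
    if base.strip().lower() != "coinrun":
        raise ValueError(f"unknown base game: {base!r}")
    kwargs: Dict[str, bool] = {}
    for suffix in suffixes:
        key = suffix.strip().upper()
        if key not in VARIANT_FLAGS:
            raise KeyError(f"no flag recorded for variant suffix {key!r}")
        kwargs.update(VARIANT_FLAGS[key])
    return kwargs


if __name__ == "__main__":
    print(variant_kwargs("Coinrun"))        # plain game, no overrides
    print(variant_kwargs("Coinrun+NB+RT"))  # two overrides combined
```

Assuming the `procgen` package is installed, the resulting dictionary could plausibly be forwarded as environment options, for example `gym.make("procgen:procgen-coinrun-v0", **variant_kwargs("Coinrun+NB"))`. Whether the paper's variants are applied one at a time or cumulatively is not stated in this excerpt.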
