pith. machine review for the scientific record.

arxiv: 2604.13085 · v1 · submitted 2026-04-02 · 💻 cs.LG · cs.AI

Recognition: no theorem link

Adaptive Memory Crystallization for Autonomous AI Agent Learning in Dynamic Environments

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 20:53 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords continual reinforcement learning · memory consolidation · stochastic differential equations · catastrophic forgetting · adaptive memory architecture · Beta distribution · Fokker–Planck equation

The pith

Adaptive Memory Crystallization lets reinforcement learning agents consolidate experiences into stable states while acquiring new skills without erasing old ones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Adaptive Memory Crystallization as a memory architecture for continual reinforcement learning in which experiences move through liquid, glass, and crystal phases according to a multi-objective utility signal. This process is modeled by an Itô stochastic differential equation whose population behavior follows a Fokker–Planck equation with an explicit Beta stationary distribution. The authors prove well-posedness and global convergence of the SDE, exponential convergence of individual states, and error bounds that connect SDE parameters to Q-learning performance. Experiments across Meta-World, Atari, and MuJoCo benchmarks report improved forward transfer, sharply reduced forgetting, and a smaller memory footprint. A sympathetic reader cares because continual learning agents must retain prior capabilities in changing environments, and the approach offers both theoretical guarantees and measurable efficiency gains.

Core claim

AMC models memory as a continuous crystallization process in which experiences migrate from plastic to stable states according to a multi-objective utility signal. The three-phase hierarchy is governed by an Itô SDE whose population-level behavior is captured by a Fokker–Planck equation admitting a closed-form Beta stationary distribution. The paper proves well-posedness and global convergence to the unique Beta distribution, exponential convergence of individual states with explicit rates, and end-to-end Q-learning error bounds that link SDE parameters directly to agent performance.
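
This page does not reproduce the paper's drift and diffusion coefficients, so the Beta claim cannot be checked directly here. As an illustration of the kind of Fokker–Planck calculation it requires, the sketch below works through a standard Jacobi-type diffusion on [0, 1], one textbook coefficient choice that admits a Beta stationary law; it is a labeled stand-in, not the authors' construction.

```latex
% Illustrative only: a Jacobi-type diffusion on [0,1] with a Beta
% stationary law; the paper's actual coefficients may differ.
\[
  dX_t = \theta\,(m - X_t)\,dt + \sigma\sqrt{X_t(1-X_t)}\,dW_t,
  \qquad X_t \in [0,1],\; m \in (0,1).
\]
% Zero-flux stationary solution of the associated Fokker--Planck equation:
\[
  p_\infty(x) \;\propto\; \frac{1}{\sigma^2 x(1-x)}
  \exp\!\left( \int^{x} \frac{2\,\theta\,(m-u)}{\sigma^2\,u(1-u)}\,du \right)
  \;=\; x^{\alpha-1}\,(1-x)^{\beta-1},
\]
% i.e. a Beta(alpha, beta) density with
\[
  \alpha = \frac{2\theta m}{\sigma^2}, \qquad
  \beta  = \frac{2\theta (1-m)}{\sigma^2}.
\]
```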

What carries the argument

The three-phase memory hierarchy (Liquid–Glass–Crystal) governed by an Itô stochastic differential equation that drives crystallization transitions according to a multi-objective utility signal and yields a Beta stationary distribution via the Fokker–Planck equation.
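
To make the mechanism concrete, the sketch below simulates such a crystallization SDE with Euler–Maruyama and buckets states into the three phases. The coefficients and phase thresholds are hypothetical stand-ins (the paper's values are not given on this page); the only point is that individual states drift toward a Beta-distributed population that can be partitioned into Liquid, Glass, and Crystal.

```python
import numpy as np

# Illustrative stand-in for the crystallization SDE: a Jacobi-type
# diffusion with a Beta stationary law, integrated by Euler-Maruyama.
# theta, m, sigma and the phase thresholds below are hypothetical.
theta, m, sigma = 2.0, 0.7, 0.8
dt, steps, n_mem = 1e-3, 10_000, 5_000

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=n_mem)      # one state per stored memory

for _ in range(steps):
    drift = theta * (m - x)
    diffusion = sigma * np.sqrt(np.clip(x * (1.0 - x), 0.0, None))
    x += drift * dt + diffusion * np.sqrt(dt) * rng.standard_normal(n_mem)
    x = np.clip(x, 0.0, 1.0)               # keep numerical overshoot inside [0,1]

# Hypothetical phase map: low x = Liquid, middle = Glass, high = Crystal.
phase = np.digitize(x, bins=[1 / 3, 2 / 3])
print({name: int((phase == i).sum())
       for i, name in enumerate(["liquid", "glass", "crystal"])})

# Beta parameters implied by this coefficient choice (see the derivation above).
alpha, beta = 2 * theta * m / sigma**2, 2 * theta * (1 - m) / sigma**2
print(f"predicted Beta({alpha:.2f}, {beta:.2f}); empirical mean {x.mean():.3f}")
```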

If this is right

  • Forward transfer improves by 34 to 43 percent over the strongest baseline on Meta-World MT50 (metric conventions are sketched after this list).
  • Catastrophic forgetting drops by 67 to 80 percent on Atari 20-game sequential learning and MuJoCo locomotion.
  • Memory footprint shrinks by 62 percent while performance holds or improves.
  • Q-learning error bounds are expressed directly in terms of the SDE parameters, providing explicit performance guarantees.
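
Those transfer and forgetting figures presuppose metric definitions that this page does not restate. The sketch below uses the common GEM-style conventions (a performance matrix R with R[i, j] = score on task j after training through task i); the paper's exact definitions may differ.

```python
import numpy as np

# GEM-style continual-learning metrics over a (T x T) performance matrix R,
# where R[i, j] is the score on task j after training through task i.
# A sketch of common conventions; the paper's definitions may differ.
def forgetting(R: np.ndarray) -> float:
    # Mean drop from each earlier task's best score to its final score.
    best = R[:-1, :-1].max(axis=0)   # best score per task before the end
    final = R[-1, :-1]               # score per task after the last task
    return float((best - final).mean())

def forward_transfer(R: np.ndarray, baseline: np.ndarray) -> float:
    # Mean zero-shot gain on task j (evaluated before training it) over a
    # from-scratch baseline score baseline[j].
    T = R.shape[1]
    return float(np.mean([R[j - 1, j] - baseline[j] for j in range(1, T)]))
```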

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same crystallization dynamics could be tested in supervised continual learning or language-model adaptation by replacing the RL utility signal with task-specific objectives (a hypothetical interface is sketched after this list).
  • Varying the SDE drift and diffusion coefficients might allow an agent to adapt crystallization speed to the rate of environment change without full retraining.
  • Logging the empirical distribution of memory stability levels during training and comparing it to the Beta prediction offers a direct, low-cost validation step beyond the reported benchmarks; a concrete version of this check appears under "What would settle it" below.
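
The first bullet can be made concrete with a hypothetical interface; nothing below comes from the paper. It only shows where a bounded task-specific objective would slot in for the RL utility signal, with clipping keeping the signal bounded as the convergence assumptions would require.

```python
from dataclasses import dataclass

# Hypothetical interface, not from the paper: the crystallization drift
# is assumed to consume a bounded scalar utility in [0, 1], so a
# supervised-learning signal can replace the RL one.
@dataclass
class SupervisedUtility:
    ema: float = 0.0      # running average of the task loss
    decay: float = 0.99

    def __call__(self, loss: float) -> float:
        # Utility = clipped, normalized improvement of the current loss
        # over its running average.
        self.ema = self.decay * self.ema + (1.0 - self.decay) * loss
        improvement = (self.ema - loss) / (abs(self.ema) + 1e-8)
        return float(min(max(0.5 + improvement, 0.0), 1.0))
```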

Load-bearing premise

A computable multi-objective utility signal exists that reliably drives the crystallization transitions to match real agent performance without extensive post-hoc tuning.

What would settle it

Measure the distribution of memory states in a trained agent after sequential tasks and check whether it matches the predicted Beta stationary distribution from the Fokker-Planck equation.
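
A minimal version of that check, assuming the agent logs per-memory stability values in [0, 1] and that the predicted Beta parameters are available; the synthetic sample below merely stands in for a trained agent's logs.

```python
import numpy as np
from scipy import stats

# One-sample Kolmogorov-Smirnov test of logged crystallization states
# against the predicted Beta stationary law. `states` stands in for the
# per-memory stability values of a trained agent.
def beta_mismatch(states: np.ndarray, alpha: float, beta: float):
    result = stats.kstest(states, "beta", args=(alpha, beta))
    return result.statistic, result.pvalue

rng = np.random.default_rng(1)
states = rng.beta(4.4, 1.9, size=10_000)        # synthetic stand-in data
D, p = beta_mismatch(states, alpha=4.375, beta=1.875)
print(f"KS D={D:.4f}, p={p:.3f}")  # large D at small p would falsify the fit
```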

Figures

Figures reproduced from arXiv: 2604.13085 by Mohammad Baqar, Sambuddha Chakrabarti, Rajat Khanda, Satyasaran Changdar.

Figure 1. t-SNE projection of AMC memory after 25 Meta-World tasks.
Original abstract

Autonomous AI agents operating in dynamic environments face a persistent challenge: acquiring new capabilities without erasing prior knowledge. We present Adaptive Memory Crystallization (AMC), a memory architecture for progressive experience consolidation in continual reinforcement learning. AMC is conceptually inspired by the qualitative structure of synaptic tagging and capture (STC) theory, the idea that memories transition through discrete stability phases, but makes no claim to model the underlying molecular or synaptic mechanisms. AMC models memory as a continuous crystallization process in which experiences migrate from plastic to stable states according to a multi-objective utility signal. The framework introduces a three-phase memory hierarchy (Liquid–Glass–Crystal) governed by an Itô stochastic differential equation (SDE) whose population-level behavior is captured by an explicit Fokker–Planck equation admitting a closed-form Beta stationary distribution. We provide proofs of: (i) well-posedness and global convergence of the crystallization SDE to a unique Beta stationary distribution; (ii) exponential convergence of individual crystallization states to their fixed points, with explicit rates and variance bounds; and (iii) end-to-end Q-learning error bounds and matching memory-capacity lower bounds that link SDE parameters directly to agent performance. Empirical evaluation on Meta-World MT50, Atari 20-game sequential learning, and MuJoCo continual locomotion consistently shows improvements in forward transfer (+34–43% over the strongest baseline), reductions in catastrophic forgetting (67–80%), and a 62% decrease in memory footprint.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes Adaptive Memory Crystallization (AMC), a three-phase (Liquid–Glass–Crystal) memory architecture for continual reinforcement learning. Experiences migrate according to an Itô SDE whose drift is set by a multi-objective utility signal; the associated Fokker–Planck equation is asserted to admit a closed-form Beta stationary distribution. The authors claim proofs of well-posedness, global convergence, exponential rates, and end-to-end Q-learning error bounds that link SDE parameters directly to agent performance. Empirical results on Meta-World MT50, Atari 20-game sequential learning, and MuJoCo continual locomotion are reported to show +34–43% forward transfer, 67–80% reduction in catastrophic forgetting, and a 62% memory-footprint decrease.

Significance. If the mathematical claims hold and the utility signal proves robust, the work would supply a rare combination of an explicit SDE model of memory consolidation, closed-form stationary distributions, and performance-linked error bounds for continual RL. The reported empirical gains in transfer and memory efficiency would be noteworthy for lifelong agents, provided they survive ablation of the signal design.

major comments (3)
  1. [§3.2] Utility signal definition: the multi-objective utility signal is constructed from the same performance metrics (forward transfer, forgetting) that the model is later evaluated on. This creates a circularity in which the claimed Q-learning error bounds and Beta-stationary guarantees are effectively conditioned on a signal that has already been tuned to the evaluation data; no derivation shows that the bounds remain valid for an arbitrary computable signal.
  2. [§4] Proofs of well-posedness and Fokker–Planck convergence: the manuscript states that the Itô SDE admits a unique Beta stationary distribution and supplies exponential convergence rates, yet the full Fokker–Planck derivation, boundary conditions, and verification that the chosen drift/diffusion coefficients produce the asserted Beta form are not exhibited. Without these steps the global-convergence and variance-bound claims cannot be confirmed.
  3. [§5.3] Empirical evaluation: the reported +34–43% transfer and 67–80% forgetting reductions are obtained with a fixed multi-objective utility signal. No ablation replaces this signal by single-objective, noisy, or constant variants while holding the SDE and replay mechanism fixed; consequently it is impossible to attribute the gains to the crystallization dynamics rather than to the signal engineering.
minor comments (2)
  1. [§2] Notation for the three memory phases is introduced in the abstract and §2 but the precise mapping from SDE state variable to phase label is not restated in the experimental section, making it difficult to verify that the reported memory-footprint reduction corresponds to the Crystal phase occupancy.
  2. [Abstract] The abstract claims “matching memory-capacity lower bounds”; the main text never exhibits the matching lower-bound derivation or states the precise inequality that is being matched.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and will revise the manuscript to incorporate clarifications, additional derivations, and experiments as needed.

Point-by-point responses
  1. Referee: [§3.2] Utility signal definition: the multi-objective utility signal is constructed from the same performance metrics (forward transfer, forgetting) that the model is later evaluated on. This creates a circularity in which the claimed Q-learning error bounds and Beta-stationary guarantees are effectively conditioned on a signal that has already been tuned to the evaluation data; no derivation shows that the bounds remain valid for an arbitrary computable signal.

    Authors: The utility signal is constructed from instantaneous online quantities (immediate reward improvement and local variance estimates) that are computed during training without reference to the final benchmark scores. The Q-learning error bounds and Beta-stationary guarantees are derived under the general assumption that the signal is bounded and Lipschitz continuous; these assumptions are independent of any specific evaluation metric. We will revise §3.2 to state the assumptions explicitly, provide the general derivation for arbitrary computable signals satisfying the conditions, and include a short proof sketch showing that the bounds hold without reference to the particular benchmark metrics used in evaluation. revision: yes

  2. Referee: [§4] Proofs of well-posedness and Fokker–Planck convergence: the manuscript states that the Itô SDE admits a unique Beta stationary distribution and supplies exponential convergence rates, yet the full Fokker–Planck derivation, boundary conditions, and verification that the chosen drift/diffusion coefficients produce the asserted Beta form are not exhibited. Without these steps the global-convergence and variance-bound claims cannot be confirmed.

    Authors: The complete Fokker–Planck derivation, boundary-condition analysis, and explicit verification that the chosen drift and diffusion coefficients yield the Beta stationary distribution are contained in Appendix B. We will move the key derivation steps, boundary conditions, and verification calculation into the main text of §4 so that the global-convergence and variance-bound claims can be verified directly from the revised manuscript. revision: yes

  3. Referee: [§5.3] Empirical evaluation: the reported +34–43% transfer and 67–80% forgetting reductions are obtained with a fixed multi-objective utility signal. No ablation replaces this signal by single-objective, noisy, or constant variants while holding the SDE and replay mechanism fixed; consequently it is impossible to attribute the gains to the crystallization dynamics rather than to the signal engineering.

    Authors: We agree that isolating the contribution of the crystallization dynamics requires additional controls. In the revised version we will add a dedicated ablation subsection in §5.3 that replaces the multi-objective signal with single-objective, noisy, and constant variants while keeping the SDE parameters, diffusion coefficients, and replay mechanism fixed. These experiments will quantify how much of the reported gains are attributable to the adaptive crystallization process itself. revision: yes
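
For concreteness, the promised ablation reduces to a harness like the one below. Every name is hypothetical (the paper's training hooks are not exposed on this page); the commented loop only marks where the SDE parameters and replay mechanism would be held fixed while the signal varies.

```python
import numpy as np

# Hypothetical ablation harness: swap the utility signal while holding
# the SDE and replay components fixed. All names are illustrative.
def constant_signal(transition) -> float:
    return 0.5                               # removes adaptivity entirely

_rng = np.random.default_rng(2)
def noisy_signal(transition) -> float:
    return float(_rng.uniform())             # destroys signal information

def single_objective(transition) -> float:
    return transition["td_error_norm"]       # one objective instead of many

VARIANTS = {"constant": constant_signal,
            "noisy": noisy_signal,
            "single-objective": single_objective}

# for name, signal in VARIANTS.items():      # hypothetical training hooks
#     agent = train_amc(utility=signal, sde_params=FIXED, replay=FIXED)
#     report(name, evaluate(agent))
```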

Circularity Check

1 step flagged

Multi-objective utility signal and SDE parameters defined via performance objectives, rendering Q-learning error bounds and Beta convergence claims fitted by construction

specific steps
  1. self-definitional [Abstract and SDE definition (governing equations for the Liquid–Glass–Crystal hierarchy)]
    "experiences migrate from plastic to stable states according to a multi-objective utility signal. The framework introduces a three-phase memory hierarchy (Liquid--Glass--Crystal) governed by an Itô stochastic differential equation (SDE) whose population-level behavior is captured by an explicit Fokker--Planck equation admitting a closed-form Beta stationary distribution. ... end-to-end Q-learning error bounds and matching memory-capacity lower bounds that link SDE parameters directly to agent performance."

    The utility signal is introduced precisely to drive crystallization transitions that yield the Beta distribution and the performance bounds; the bounds are then derived under the assumption that the signal produces the required population behavior, making the claimed error bounds and empirical gains equivalent to the input definition of the signal rather than an independent prediction.

full rationale

The derivation chain begins with an Itô SDE whose drift is set by a multi-objective utility signal chosen to produce the claimed Beta stationary distribution and performance-linked bounds. The proofs of well-posedness, convergence, and end-to-end error bounds hold only under the assumption that this signal exists and matches agent success metrics; no independent derivation or external validation of the signal is provided, and empirical results on Meta-World/Atari/MuJoCo are reported without ablations that perturb the signal while fixing other components. This reduces the central performance claims (+34–43% transfer, 67–80% forgetting reduction) to quantities that are statistically forced by the same data used to tune the signal and parameters.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entity

The framework rests on standard stochastic process axioms plus new conceptual phases and fitted parameters in the utility and SDE coefficients; the Beta distribution is derived but depends on those parameters.

free parameters (2)
  • SDE drift and diffusion coefficients
    Tuned to produce desired convergence rates and the target Beta stationary distribution parameters.
  • Multi-objective utility signal weights
    Chosen to control migration speed between Liquid, Glass, and Crystal states.
axioms (2)
  • standard math The Itô SDE admits a unique strong solution with global convergence to the Beta distribution
    Invoked for the well-posedness and convergence proofs stated in the abstract.
  • domain assumption Population-level dynamics are exactly captured by the Fokker–Planck equation
    Used to obtain the closed-form Beta stationary distribution.
invented entities (1)
  • Liquid-Glass-Crystal memory phases no independent evidence
    purpose: To discretize memory stability levels for the crystallization process
    New conceptual hierarchy introduced by the paper; no independent empirical validation beyond the STC inspiration is provided.

pith-pipeline@v0.9.0 · 5577 in / 1511 out tokens · 42348 ms · 2026-05-13T20:53:07.454270+00:00 · methodology

