pith. machine review for the scientific record.

arxiv: 2604.13085 · v1 · submitted 2026-04-02 · 💻 cs.LG · cs.AI

Recognition: no theorem link

Adaptive Memory Crystallization for Autonomous AI Agent Learning in Dynamic Environments

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 20:53 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords continual reinforcement learning · memory consolidation · stochastic differential equations · catastrophic forgetting · adaptive memory architecture · Beta distribution · Fokker–Planck equation

The pith

Adaptive Memory Crystallization lets reinforcement learning agents consolidate experiences into stable states while acquiring new skills without erasing old ones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Adaptive Memory Crystallization as a memory architecture for continual reinforcement learning in which experiences move through liquid, glass, and crystal phases according to a multi-objective utility signal. This process is modeled by an Itô stochastic differential equation whose population behavior follows a Fokker–Planck equation with an explicit Beta stationary distribution. The authors prove well-posedness and global convergence of the SDE, exponential convergence of individual states, and error bounds that connect SDE parameters to Q-learning performance. Experiments across Meta-World, Atari, and MuJoCo benchmarks report improved forward transfer, sharply reduced forgetting, and a smaller memory footprint. A sympathetic reader cares because continual learning agents must retain prior capabilities in changing environments, and the approach offers both theoretical guarantees and measurable efficiency gains.

Core claim

AMC models memory as a continuous crystallization process in which experiences migrate from plastic to stable states according to a multi-objective utility signal. The three-phase hierarchy is governed by an Itô SDE whose population-level behavior is captured by a Fokker–Planck equation admitting a closed-form Beta stationary distribution. The paper proves well-posedness and global convergence to the unique Beta distribution, exponential convergence of individual states with explicit rates, and end-to-end Q-learning error bounds that link SDE parameters directly to agent performance.
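
This page does not reproduce the paper's drift and diffusion coefficients, so the Beta claim cannot be checked directly here. As an illustration of the kind of Fokker–Planck calculation it requires, the sketch below works through a standard Jacobi-type diffusion on [0, 1], one textbook coefficient choice that admits a Beta stationary law; it is a labeled stand-in, not the authors' construction.

```latex
% Illustrative only: a Jacobi-type diffusion on [0,1] with a Beta
% stationary law; the paper's actual coefficients may differ.
\[
  dX_t = \theta\,(m - X_t)\,dt + \sigma\sqrt{X_t(1-X_t)}\,dW_t,
  \qquad X_t \in [0,1],\; m \in (0,1).
\]
% Zero-flux stationary solution of the associated Fokker--Planck equation:
\[
  p_\infty(x) \;\propto\; \frac{1}{\sigma^2 x(1-x)}
  \exp\!\left( \int^{x} \frac{2\,\theta\,(m-u)}{\sigma^2\,u(1-u)}\,du \right)
  \;=\; x^{\alpha-1}\,(1-x)^{\beta-1},
\]
% i.e. a Beta(alpha, beta) density with
\[
  \alpha = \frac{2\theta m}{\sigma^2}, \qquad
  \beta  = \frac{2\theta (1-m)}{\sigma^2}.
\]
```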

What carries the argument

The three-phase memory hierarchy (Liquid–Glass–Crystal) governed by an Itô stochastic differential equation that drives crystallization transitions according to a multi-objective utility signal and yields a Beta stationary distribution via the Fokker–Planck equation.
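
To make the mechanism concrete, the sketch below simulates such a crystallization SDE with Euler–Maruyama and buckets states into the three phases. The coefficients and phase thresholds are hypothetical stand-ins (the paper's values are not given on this page); the only point is that individual states drift toward a Beta-distributed population that can be partitioned into Liquid, Glass, and Crystal.

```python
import numpy as np

# Illustrative stand-in for the crystallization SDE: a Jacobi-type
# diffusion with a Beta stationary law, integrated by Euler-Maruyama.
# theta, m, sigma and the phase thresholds below are hypothetical.
theta, m, sigma = 2.0, 0.7, 0.8
dt, steps, n_mem = 1e-3, 10_000, 5_000

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=n_mem)      # one state per stored memory

for _ in range(steps):
    drift = theta * (m - x)
    diffusion = sigma * np.sqrt(np.clip(x * (1.0 - x), 0.0, None))
    x += drift * dt + diffusion * np.sqrt(dt) * rng.standard_normal(n_mem)
    x = np.clip(x, 0.0, 1.0)               # keep numerical overshoot inside [0,1]

# Hypothetical phase map: low x = Liquid, middle = Glass, high = Crystal.
phase = np.digitize(x, bins=[1 / 3, 2 / 3])
print({name: int((phase == i).sum())
       for i, name in enumerate(["liquid", "glass", "crystal"])})

# Beta parameters implied by this coefficient choice (see the derivation above).
alpha, beta = 2 * theta * m / sigma**2, 2 * theta * (1 - m) / sigma**2
print(f"predicted Beta({alpha:.2f}, {beta:.2f}); empirical mean {x.mean():.3f}")
```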

If this is right

  • Forward transfer improves by 34 to 43 percent over the strongest baseline on Meta-World MT50 (metric conventions are sketched after this list).
  • Catastrophic forgetting drops by 67 to 80 percent on Atari 20-game sequential learning and MuJoCo locomotion.
  • Memory footprint shrinks by 62 percent while performance holds or improves.
  • Q-learning error bounds are expressed directly in terms of the SDE parameters, providing explicit performance guarantees.
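
Those transfer and forgetting figures presuppose metric definitions that this page does not restate. The sketch below uses the common GEM-style conventions (a performance matrix R with R[i, j] = score on task j after training through task i); the paper's exact definitions may differ.

```python
import numpy as np

# GEM-style continual-learning metrics over a (T x T) performance matrix R,
# where R[i, j] is the score on task j after training through task i.
# A sketch of common conventions; the paper's definitions may differ.
def forgetting(R: np.ndarray) -> float:
    # Mean drop from each earlier task's best score to its final score.
    best = R[:-1, :-1].max(axis=0)   # best score per task before the end
    final = R[-1, :-1]               # score per task after the last task
    return float((best - final).mean())

def forward_transfer(R: np.ndarray, baseline: np.ndarray) -> float:
    # Mean zero-shot gain on task j (evaluated before training it) over a
    # from-scratch baseline score baseline[j].
    T = R.shape[1]
    return float(np.mean([R[j - 1, j] - baseline[j] for j in range(1, T)]))
```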

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same crystallization dynamics could be tested in supervised continual learning or language-model adaptation by replacing the RL utility signal with task-specific objectives (a hypothetical interface is sketched after this list).
  • Varying the SDE drift and diffusion coefficients might allow an agent to adapt crystallization speed to the rate of environment change without full retraining.
  • Logging the empirical distribution of memory stability levels during training and comparing it to the Beta prediction offers a direct, low-cost validation step beyond the reported benchmarks; a concrete version of this check appears under "What would settle it" below.
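
The first bullet can be made concrete with a hypothetical interface; nothing below comes from the paper. It only shows where a bounded task-specific objective would slot in for the RL utility signal, with clipping keeping the signal bounded as the convergence assumptions would require.

```python
from dataclasses import dataclass

# Hypothetical interface, not from the paper: the crystallization drift
# is assumed to consume a bounded scalar utility in [0, 1], so a
# supervised-learning signal can replace the RL one.
@dataclass
class SupervisedUtility:
    ema: float = 0.0      # running average of the task loss
    decay: float = 0.99

    def __call__(self, loss: float) -> float:
        # Utility = clipped, normalized improvement of the current loss
        # over its running average.
        self.ema = self.decay * self.ema + (1.0 - self.decay) * loss
        improvement = (self.ema - loss) / (abs(self.ema) + 1e-8)
        return float(min(max(0.5 + improvement, 0.0), 1.0))
```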

Load-bearing premise

A computable multi-objective utility signal exists that reliably drives the crystallization transitions to match real agent performance without extensive post-hoc tuning.

What would settle it

Measure the distribution of memory states in a trained agent after sequential tasks and check whether it matches the predicted Beta stationary distribution from the Fokker-Planck equation.
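
A minimal version of that check, assuming the agent logs per-memory stability values in [0, 1] and that the predicted Beta parameters are available; the synthetic sample below merely stands in for a trained agent's logs.

```python
import numpy as np
from scipy import stats

# One-sample Kolmogorov-Smirnov test of logged crystallization states
# against the predicted Beta stationary law. `states` stands in for the
# per-memory stability values of a trained agent.
def beta_mismatch(states: np.ndarray, alpha: float, beta: float):
    result = stats.kstest(states, "beta", args=(alpha, beta))
    return result.statistic, result.pvalue

rng = np.random.default_rng(1)
states = rng.beta(4.4, 1.9, size=10_000)        # synthetic stand-in data
D, p = beta_mismatch(states, alpha=4.375, beta=1.875)
print(f"KS D={D:.4f}, p={p:.3f}")  # large D at small p would falsify the fit
```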

Figures

Figures reproduced from arXiv: 2604.13085 by Mohammad Baqar, Sambuddha Chakrabarti, Rajat Khanda, Satyasaran Changdar.

Figure 1. t-SNE projection of AMC memory after 25 Meta-World tasks.
Original abstract

Autonomous AI agents operating in dynamic environments face a persistent challenge: acquiring new capabilities without erasing prior knowledge. We present Adaptive Memory Crystallization (AMC), a memory architecture for progressive experience consolidation in continual reinforcement learning. AMC is conceptually inspired by the qualitative structure of synaptic tagging and capture (STC) theory, the idea that memories transition through discrete stability phases, but makes no claim to model the underlying molecular or synaptic mechanisms. AMC models memory as a continuous crystallization process in which experiences migrate from plastic to stable states according to a multi-objective utility signal. The framework introduces a three-phase memory hierarchy (Liquid–Glass–Crystal) governed by an Itô stochastic differential equation (SDE) whose population-level behavior is captured by an explicit Fokker–Planck equation admitting a closed-form Beta stationary distribution. We provide proofs of: (i) well-posedness and global convergence of the crystallization SDE to a unique Beta stationary distribution; (ii) exponential convergence of individual crystallization states to their fixed points, with explicit rates and variance bounds; and (iii) end-to-end Q-learning error bounds and matching memory-capacity lower bounds that link SDE parameters directly to agent performance. Empirical evaluation on Meta-World MT50, Atari 20-game sequential learning, and MuJoCo continual locomotion consistently shows improvements in forward transfer (+34–43% over the strongest baseline), reductions in catastrophic forgetting (67–80%), and a 62% decrease in memory footprint.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes Adaptive Memory Crystallization (AMC), a three-phase (Liquid–Glass–Crystal) memory architecture for continual reinforcement learning. Experiences migrate according to an Itô SDE whose drift is set by a multi-objective utility signal; the associated Fokker–Planck equation is asserted to admit a closed-form Beta stationary distribution. The authors claim proofs of well-posedness, global convergence, exponential rates, and end-to-end Q-learning error bounds that link SDE parameters directly to agent performance. Empirical results on Meta-World MT50, Atari 20-game sequential learning, and MuJoCo continual locomotion are reported to show +34–43% forward transfer, 67–80% reduction in catastrophic forgetting, and a 62% memory-footprint decrease.

Significance. If the mathematical claims hold and the utility signal proves robust, the work would supply a rare combination of an explicit SDE model of memory consolidation, closed-form stationary distributions, and performance-linked error bounds for continual RL. The reported empirical gains in transfer and memory efficiency would be noteworthy for lifelong agents, provided they survive ablation of the signal design.

major comments (3)
  1. [§3.2] Utility signal definition: the multi-objective utility signal is constructed from the same performance metrics (forward transfer, forgetting) that the model is later evaluated on. This creates a circularity in which the claimed Q-learning error bounds and Beta-stationary guarantees are effectively conditioned on a signal that has already been tuned to the evaluation data; no derivation shows that the bounds remain valid for an arbitrary computable signal.
  2. [§4] Proofs of well-posedness and Fokker–Planck convergence: the manuscript states that the Itô SDE admits a unique Beta stationary distribution and supplies exponential convergence rates, yet the full Fokker–Planck derivation, boundary conditions, and verification that the chosen drift/diffusion coefficients produce the asserted Beta form are not exhibited. Without these steps the global-convergence and variance-bound claims cannot be confirmed.
  3. [§5.3] Empirical evaluation: the reported +34–43% transfer and 67–80% forgetting reductions are obtained with a fixed multi-objective utility signal. No ablation replaces this signal by single-objective, noisy, or constant variants while holding the SDE and replay mechanism fixed; consequently it is impossible to attribute the gains to the crystallization dynamics rather than to the signal engineering.
minor comments (2)
  1. [§2] Notation for the three memory phases is introduced in the abstract and §2 but the precise mapping from SDE state variable to phase label is not restated in the experimental section, making it difficult to verify that the reported memory-footprint reduction corresponds to the Crystal phase occupancy.
  2. [Abstract] The abstract claims “matching memory-capacity lower bounds”; the main text never exhibits the matching lower-bound derivation or states the precise inequality that is being matched.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and will revise the manuscript to incorporate clarifications, additional derivations, and experiments as needed.

Point-by-point responses
  1. Referee: [§3.2] Utility signal definition: the multi-objective utility signal is constructed from the same performance metrics (forward transfer, forgetting) that the model is later evaluated on. This creates a circularity in which the claimed Q-learning error bounds and Beta-stationary guarantees are effectively conditioned on a signal that has already been tuned to the evaluation data; no derivation shows that the bounds remain valid for an arbitrary computable signal.

    Authors: The utility signal is constructed from instantaneous online quantities (immediate reward improvement and local variance estimates) that are computed during training without reference to the final benchmark scores. The Q-learning error bounds and Beta-stationary guarantees are derived under the general assumption that the signal is bounded and Lipschitz continuous; these assumptions are independent of any specific evaluation metric. We will revise §3.2 to state the assumptions explicitly, provide the general derivation for arbitrary computable signals satisfying the conditions, and include a short proof sketch showing that the bounds hold without reference to the particular benchmark metrics used in evaluation. revision: yes

  2. Referee: [§4] Proofs of well-posedness and Fokker–Planck convergence: the manuscript states that the Itô SDE admits a unique Beta stationary distribution and supplies exponential convergence rates, yet the full Fokker–Planck derivation, boundary conditions, and verification that the chosen drift/diffusion coefficients produce the asserted Beta form are not exhibited. Without these steps the global-convergence and variance-bound claims cannot be confirmed.

    Authors: The complete Fokker–Planck derivation, boundary-condition analysis, and explicit verification that the chosen drift and diffusion coefficients yield the Beta stationary distribution are contained in Appendix B. We will move the key derivation steps, boundary conditions, and verification calculation into the main text of §4 so that the global-convergence and variance-bound claims can be verified directly from the revised manuscript. revision: yes

  3. Referee: [§5.3] Empirical evaluation: the reported +34–43% transfer and 67–80% forgetting reductions are obtained with a fixed multi-objective utility signal. No ablation replaces this signal by single-objective, noisy, or constant variants while holding the SDE and replay mechanism fixed; consequently it is impossible to attribute the gains to the crystallization dynamics rather than to the signal engineering.

    Authors: We agree that isolating the contribution of the crystallization dynamics requires additional controls. In the revised version we will add a dedicated ablation subsection in §5.3 that replaces the multi-objective signal with single-objective, noisy, and constant variants while keeping the SDE parameters, diffusion coefficients, and replay mechanism fixed. These experiments will quantify how much of the reported gains are attributable to the adaptive crystallization process itself. revision: yes
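
For concreteness, the promised ablation reduces to a harness like the one below. Every name is hypothetical (the paper's training hooks are not exposed on this page); the commented loop only marks where the SDE parameters and replay mechanism would be held fixed while the signal varies.

```python
import numpy as np

# Hypothetical ablation harness: swap the utility signal while holding
# the SDE and replay components fixed. All names are illustrative.
def constant_signal(transition) -> float:
    return 0.5                               # removes adaptivity entirely

_rng = np.random.default_rng(2)
def noisy_signal(transition) -> float:
    return float(_rng.uniform())             # destroys signal information

def single_objective(transition) -> float:
    return transition["td_error_norm"]       # one objective instead of many

VARIANTS = {"constant": constant_signal,
            "noisy": noisy_signal,
            "single-objective": single_objective}

# for name, signal in VARIANTS.items():      # hypothetical training hooks
#     agent = train_amc(utility=signal, sde_params=FIXED, replay=FIXED)
#     report(name, evaluate(agent))
```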

Circularity Check

1 step flagged

Multi-objective utility signal and SDE parameters defined via performance objectives, rendering Q-learning error bounds and Beta convergence claims fitted by construction

specific steps
  1. self-definitional [Abstract and SDE definition (governing equations for the Liquid–Glass–Crystal hierarchy)]
    "experiences migrate from plastic to stable states according to a multi-objective utility signal. The framework introduces a three-phase memory hierarchy (Liquid--Glass--Crystal) governed by an Itô stochastic differential equation (SDE) whose population-level behavior is captured by an explicit Fokker--Planck equation admitting a closed-form Beta stationary distribution. ... end-to-end Q-learning error bounds and matching memory-capacity lower bounds that link SDE parameters directly to agent performance."

    The utility signal is introduced precisely to drive crystallization transitions that yield the Beta distribution and the performance bounds; the bounds are then derived under the assumption that the signal produces the required population behavior, making the claimed error bounds and empirical gains equivalent to the input definition of the signal rather than an independent prediction.

full rationale

The derivation chain begins with an Itô SDE whose drift is set by a multi-objective utility signal chosen to produce the claimed Beta stationary distribution and performance-linked bounds. The proofs of well-posedness, convergence, and end-to-end error bounds hold only under the assumption that this signal exists and matches agent success metrics; no independent derivation or external validation of the signal is provided, and empirical results on Meta-World/Atari/MuJoCo are reported without ablations that perturb the signal while fixing other components. This reduces the central performance claims (+34–43% transfer, 67–80% forgetting reduction) to quantities that are statistically forced by the same data used to tune the signal and parameters.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entity

The framework rests on standard stochastic process axioms plus new conceptual phases and fitted parameters in the utility and SDE coefficients; the Beta distribution is derived but depends on those parameters.

free parameters (2)
  • SDE drift and diffusion coefficients
    Tuned to produce desired convergence rates and the target Beta stationary distribution parameters.
  • Multi-objective utility signal weights
    Chosen to control migration speed between Liquid, Glass, and Crystal states.
axioms (2)
  • standard math The Itô SDE admits a unique strong solution with global convergence to the Beta distribution
    Invoked for the well-posedness and convergence proofs stated in the abstract.
  • domain assumption Population-level dynamics are exactly captured by the Fokker–Planck equation
    Used to obtain the closed-form Beta stationary distribution.
invented entities (1)
  • Liquid-Glass-Crystal memory phases no independent evidence
    purpose: To discretize memory stability levels for the crystallization process
    New conceptual hierarchy introduced by the paper; no independent empirical validation beyond the STC inspiration is provided.

pith-pipeline@v0.9.0 · 5577 in / 1511 out tokens · 42348 ms · 2026-05-13T20:53:07.454270+00:00 · methodology

