Beyond Single-Model Optimization: Preserving Plasticity in Continual Reinforcement Learning

Lute Lillo; Nick Cheney

arxiv: 2604.15414 · v2 · pith:YRRAMBCSnew · submitted 2026-04-16 · 💻 cs.LG · cs.AI · cs.NE

Pith reviewed 2026-05-10 10:57 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.NE

keywords continual reinforcement learning · policy archives · plasticity preservation · latent space alignment · transfer in RL · MiniGrid · behavioral diversity

The pith

Maintaining archives of diverse policies in a shared latent space preserves plasticity in continual reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Continual reinforcement learning struggles when agents lock onto one evolving policy, because even retained solutions can lose their value as starting points after new tasks interfere. The paper shows that archiving multiple behaviorally related policies per task, kept comparable through a shared latent space, lets the agent pick among competent alternatives instead of depending on a single representative. A reader would care because this directly targets the retention-adaptation tradeoff that limits lifelong agents, and the MiniGrid results indicate agents can master more tasks, rebound quicker on old ones, and hold higher performance over time. The core shift is from isolated solutions to neighborhoods of nearby competent policies that support ongoing relearning.

Core claim

The paper claims that source-optimal policies are often not transfer-optimal even inside a local competent neighborhood, and that effective reuse therefore requires retaining and selecting among multiple nearby alternatives rather than collapsing to one representative; TeLAPA achieves this by placing behaviorally diverse policies into per-task archives connected by a shared latent space that keeps them reusable under non-stationary drift.

What carries the argument

TeLAPA (Transfer-Enabled Latent-Aligned Policy Archives), which builds per-task archives of behaviorally diverse policies kept aligned and comparable inside one shared latent space.
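
To make the archive idea concrete, here is a minimal sketch of a per-task archive indexed by position in a shared latent space. It is not the authors' implementation: the grid-cell replacement rule is borrowed from MAP-Elites-style quality-diversity methods, and `Elite`, `TaskArchive`, `cell_size`, and `tolerance` are hypothetical names and parameters chosen for illustration.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Elite:
    policy_id: str   # hypothetical handle to stored policy parameters
    latent: tuple    # coordinates in the shared latent space
    fitness: float   # performance on the task that owns the archive


@dataclass
class TaskArchive:
    cell_size: float = 0.5                     # assumed latent grid resolution
    cells: dict = field(default_factory=dict)  # latent cell -> best elite there

    def _cell(self, latent):
        return tuple(math.floor(x / self.cell_size) for x in latent)

    def add(self, elite):
        """Keep only the fittest elite per latent cell; report if it was stored."""
        key = self._cell(elite.latent)
        incumbent = self.cells.get(key)
        if incumbent is None or elite.fitness > incumbent.fitness:
            self.cells[key] = elite
            return True
        return False

    def good_enough(self, tolerance=0.1):
        """Elites whose fitness is within `tolerance` of the archive best."""
        if not self.cells:
            return []
        best = max(e.fitness for e in self.cells.values())
        return [e for e in self.cells.values() if e.fitness >= best - tolerance]

    def neighbours(self, anchor, k=3, tolerance=0.1):
        """The k good-enough elites closest to `anchor` in latent space."""
        pool = [e for e in self.good_enough(tolerance) if e is not anchor]
        return sorted(pool, key=lambda e: math.dist(e.latent, anchor.latent))[:k]


# Illustrative values only; they are not taken from the paper.
archive = TaskArchive()
for elite in [
    Elite("a", (0.1, 0.1), 0.90),
    Elite("b", (0.2, 0.2), 0.95),  # same cell as "a", so it replaces "a"
    Elite("c", (0.9, 0.1), 0.88),
    Elite("d", (2.4, 2.4), 0.40),  # stored, but not good enough
]:
    archive.add(elite)

source_best = max(archive.good_enough(), key=lambda e: e.fitness)
candidates = [source_best] + archive.neighbours(source_best, k=2)
print([e.policy_id for e in candidates])
```

The point of the sketch is the last two lines: transfer starts from a small set of competent neighbours rather than from the single source-best policy.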

If this is right

Agents learn more tasks successfully across a long sequence of changing environments.
Competence on revisited tasks is recovered faster once interference from new learning has occurred.
Overall performance across the entire task sequence remains higher than with single-model methods.
Selecting from multiple competent nearby policies proves more effective for transfer than using any single source-optimal policy.
Single-model preservation cannot fully solve loss of plasticity because it collapses alternatives that remain useful.
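
The claim that selecting among competent alternatives beats the source-optimal policy can be phrased as a per-pair transfer gap. The sketch below is one plausible way to compute it, assuming each archived policy has a scalar score on the source task and on the target task; `transfer_gap` and `tolerance` are hypothetical names, and the paper's exact good-enough criterion may differ.

```python
def transfer_gap(source_scores, target_scores, tolerance=0.1):
    """Return (gap, source_best_id, transfer_best_id) for one source -> target pair.

    source_scores / target_scores map policy ids to performance on each task.
    The good-enough set is every policy within `tolerance` of the best source
    score (an assumed criterion). The gap is how much target performance is
    lost by transferring the source-best policy instead of the best
    good-enough one, so it is never negative.
    """
    source_best = max(source_scores, key=source_scores.get)
    threshold = source_scores[source_best] - tolerance
    good_enough = [p for p, s in source_scores.items() if s >= threshold]
    transfer_best = max(good_enough, key=target_scores.get)
    gap = target_scores[transfer_best] - target_scores[source_best]
    return gap, source_best, transfer_best


# Illustrative numbers only; they are not results from the paper.
source = {"p1": 0.95, "p2": 0.90, "p3": 0.50}
target = {"p1": 0.30, "p2": 0.70, "p3": 0.90}
gap, source_best, transfer_best = transfer_gap(source, target)
print(source_best, transfer_best, round(gap, 2))
```

Here `p3` transfers best overall but is excluded because it is not competent on the source task; the gap is measured only against good-enough alternatives.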

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same neighborhood principle could be tested in domains with stronger or more abrupt task shifts to see whether the latent alignment still maintains reuse value.
Storing and searching multiple policies per task introduces a storage and selection cost that future implementations would need to manage efficiently.
The finding that local competence does not guarantee transfer optimality suggests similar archive structures might help in other non-stationary learning settings beyond reinforcement learning.

Load-bearing premise

Organizing policies into behaviorally diverse neighborhoods via a shared latent space will reliably preserve plasticity and transfer better than single-model preservation.

What would settle it

An experiment in the same MiniGrid continual-learning setting where TeLAPA shows no gains in task success rate, recovery speed after interference, or final retention compared with strong single-policy baselines would falsify the central result.

Figures

Figures reproduced from arXiv: 2604.15414 by Lute Lillo, Nick Cheney.

**Figure 1.** Transfer gap in target task performance between the best good-enough elite and source-optimal elite. Source-best is not always transfer-best. We demonstrate the failure of single-model preservation and show that source-optimality does not uniquely determine transfer-optimality. For each s → t pair, we compare the target performance of the source-best elite against the policy with the highest target perform…

**Figure 2.** Relative latent spread of the good-enough set. The good-enough set occupies a broad latent basin. If the good-enough set were concentrated at essentially one point, then preserving a single representative policy could indeed be sufficient.

**Figure 3.** Latent distance-rank bins, relative to the source-best, pooled within the local competent basin B. Transfer differs even within a local competent neighborhood. A stricter test is whether the source-best policy is already insufficient even within its own local competent neighborhood, rather than only because better transfer seeds exist farther away in the source-competent set. To test this, …

**Figure 4.** Transfer gain obtained by searching only within the local basin. Collapsing the local basin to one preserved policy incurs a transfer cost.

**Figure 5.** Pooled QD structure across runs. We visualize the density of archive elites as a function of fitness and normalized novelty ν_norm. Small novelty indicates local redundancy, whereas large novelty indicates more isolated or frontier elites. Differences across tasks reveal how each archive balances competence and behavioral spread within its own latent geometry.

**Figure 6.** Stepping-stone lineage structure across task archives. (a) Immediate source-task archive → current target-task archive. Rows denote the most recent source archive used for transfer, and columns denote the current target-task archive. Each cell shows how many evaluated candidates for that target were drawn from that immediate source archive. (b) Recorded archive visits in lineage. Columns denote the current…

**Figure 7.** Lineage utility across transferred candidates. Each panel compares the distributions of final transfer performance for two candidate groups defined by a different lineage property, shown separately for each target task. (a) Breadth-rich versus breadth-poor lineage histories. Candidates are grouped by whether their lineage contains the original archive family corresponding to the current target. (b) Revisit…

**Figure 8.** Cross-task geometry in the shared latent space. Left: t-SNE of elites from the best run (highest mean SR across tasks). Middle: same visualization for the worst run (lowest mean SR across tasks). Right: mean separation ratio matrix S_kℓ (Eq. 61), measuring centroid distance normalized by within-task dispersion.
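
The separation ratio in Figure 8 is described only as centroid distance normalized by within-task dispersion; Eq. 61 itself is not reproduced on this page. The sketch below is therefore an assumption-laden reading: it takes dispersion to be the mean Euclidean distance of a task's elites to their centroid and normalizes by the average dispersion of the two tasks. The function names are hypothetical.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length latent vectors."""
    dim = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(dim))


def dispersion(points):
    """Mean distance of the points to their centroid (assumed definition)."""
    c = centroid(points)
    return sum(math.dist(p, c) for p in points) / len(points)


def separation_ratio(points_k, points_l):
    """Centroid distance between two tasks' elites over their average dispersion.

    Values well above 1 suggest the two archives occupy distinct latent
    regions; values near or below 1 suggest they overlap.
    """
    gap = math.dist(centroid(points_k), centroid(points_l))
    spread = 0.5 * (dispersion(points_k) + dispersion(points_l))
    return gap / spread if spread > 0 else math.inf


# Illustrative latent coordinates only; they are not data from the paper.
task_a = [(0.0, 0.0), (2.0, 0.0)]
task_b = [(10.0, 0.0), (12.0, 0.0)]
ratio = separation_ratio(task_a, task_b)
print(ratio)
```

Other normalizations (for example, using only the row task's dispersion, which would make the matrix asymmetric) are equally consistent with the caption.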

Original abstract

Continual reinforcement learning must balance retention with adaptation, yet many methods still rely on single-model preservation, committing to one evolving policy as the main reusable solution across tasks. Even when a previously successful policy is retained, it may no longer provide a reliable starting point for rapid adaptation after interference, reflecting a form of loss of plasticity that single-policy preservation cannot address. Inspired by quality-diversity methods, we introduce TeLAPA (Transfer-Enabled Latent-Aligned Policy Archives), a continual RL framework that organizes behaviorally diverse policy neighborhoods into per-task archives and maintains a shared latent space so that archived policies remain comparable and reusable under non-stationary drift. This perspective shifts continual RL from retaining isolated solutions to maintaining skill-aligned neighborhoods with competent and behaviorally related policies that support future relearning. In our MiniGrid CL setting, TeLAPA learns more tasks successfully, recovers competence faster on revisited tasks after interference, and retains higher performance across a sequence of tasks. Our analyses show that source-optimal policies are often not transfer-optimal, even within a local competent neighborhood, and that effective reuse depends on retaining and selecting among multiple nearby alternatives rather than collapsing them to one representative. Together, these results reframe continual RL around reusable and competent policy neighborhoods, providing a route beyond single-model preservation toward more plastic lifelong agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TeLAPA keeps multiple policies per task in a shared latent space to fight plasticity loss in continual RL, but the gains are shown only on MiniGrid.

The core move here is to stop treating a single evolving policy as the reusable unit and instead archive behaviorally diverse policies per task, then keep them aligned in a latent space so they stay comparable after drift. That reframing, plus the claim that source-optimal policies are often not the best transfer points even inside a competent neighborhood, is the actual new piece relative to standard single-model preservation work in continual RL. The MiniGrid results back the practical side: more tasks solved, faster recovery on revisited ones, and better retained performance across the sequence. The analyses that show why collapsing to one representative hurts reuse are the most useful part of the evidence so far. Credit for grounding the idea in quality-diversity methods without overclaiming prior results.

The experiments stay inside MiniGrid, a low-dimensional discrete domain where behavioral diversity is relatively easy to capture and align. No quantitative details, error bars, or ablation numbers appear in the abstract, and there is no test of whether the latent alignment remains stable once policy variance grows or task interference becomes non-convex, as it does in MuJoCo-style or Atari continual sequences. If the shared space stops producing comparable neighborhoods under stronger drift, the method reduces to the single-policy case it wants to beat.

This is aimed at people already working on lifelong RL and plasticity, especially those who have tried regularization or replay and still see collapse. A reader who wants a concrete alternative to single-model retention will get something usable to try or extend. The idea is coherent enough and the MiniGrid evidence is positive enough that it deserves referee time rather than a desk reject, though any review should press for broader domains and clearer controls on the archive and alignment steps. I would send it for peer review.

Referee Report

3 major / 2 minor

Summary. The paper proposes TeLAPA (Transfer-Enabled Latent-Aligned Policy Archives), a continual RL framework that maintains per-task archives of behaviorally diverse policies aligned in a shared latent space. This shifts from single-model preservation to retaining skill-aligned neighborhoods to better address loss of plasticity and enable reuse under task interference. In MiniGrid continual learning experiments, TeLAPA is reported to solve more tasks, recover faster after interference, and retain higher performance; analyses indicate that source-optimal policies are often not transfer-optimal within neighborhoods.

Significance. If the core mechanism proves robust, the work offers a promising reframing of continual RL around reusable policy neighborhoods rather than isolated solutions, potentially improving plasticity in lifelong agents. The MiniGrid results provide initial support for the value of latent alignment and multi-policy selection over single-policy baselines, but the narrow domain limits broader impact assessment.

major comments (3)

§4 (Experiments): All quantitative results and claims of superior task learning, recovery speed, and retention are demonstrated exclusively on MiniGrid, a low-dimensional discrete domain. The central claim that shared latent spaces produce stable, reusable, behaviorally diverse neighborhoods under non-stationary drift lacks validation in higher-variance settings such as continuous-control MuJoCo or Atari sequences, where a failure of latent alignment would directly undermine the transfer mechanism.
§4.2 and §5 (Analyses): The assertion that source-optimal policies are not transfer-optimal and that effective reuse requires multiple nearby alternatives rests on MiniGrid-specific metrics without reported error bars, statistical tests, or ablation controls on latent space dimensionality and archive size. This makes it impossible to assess whether the neighborhood advantage is robust or an artifact of the low-dimensional setting.
§3 (Method): The description of how the shared latent space maintains comparability of archived policies after task drift is high-level; no concrete mechanism, loss terms, or stability guarantees are provided to ensure that behavioral diversity remains aligned and selectable when policy parameters evolve under interference, which is load-bearing for the 'beyond single-model' claim.

minor comments (2)

Abstract: Quantitative details, error bars, and baseline comparisons are absent, reducing the ability to evaluate the strength of the reported gains in task success, recovery, and retention.
§3 (Notation): The distinction between 'source-optimal' and 'transfer-optimal' policies is introduced in the analyses but would benefit from an explicit definition or equation early in the method section to avoid ambiguity.

Simulated Authors' Rebuttal

3 responses · 1 unresolved

We thank the referee for their insightful comments on our manuscript. We have carefully considered each major point and provide detailed responses below, along with indications of how we plan to revise the paper.

Referee: §4 (Experiments): All quantitative results and claims of superior task learning, recovery speed, and retention are demonstrated exclusively on MiniGrid, a low-dimensional discrete domain. The central claim that shared latent spaces produce stable, reusable, behaviorally diverse neighborhoods under non-stationary drift lacks validation in higher-variance settings such as continuous-control MuJoCo or Atari sequences, where a failure of latent alignment would directly undermine the transfer mechanism.

Authors: We acknowledge the limitation that all empirical results are confined to the MiniGrid environment. MiniGrid was selected as it provides a controlled setting to study continual learning with clear task boundaries and measurable interference effects, allowing us to focus on the plasticity preservation mechanism without the confounding factors of high-dimensional observations or continuous actions. The proposed framework is intended to be general, but we agree that additional validation in domains like MuJoCo or Atari would be valuable to address concerns about latent alignment under greater variance. In the revised version, we will add a dedicated limitations subsection discussing the challenges of scaling the latent alignment to these domains and suggest directions for future empirical work. We cannot perform new large-scale experiments in this revision cycle.
Revision: partial

Referee: §4.2 and §5 (Analyses): The assertion that source-optimal policies are not transfer-optimal and that effective reuse requires multiple nearby alternatives rests on MiniGrid-specific metrics without reported error bars, statistical tests, or ablation controls on latent space dimensionality and archive size. This makes it impossible to assess whether the neighborhood advantage is robust or an artifact of the low-dimensional setting.

Authors: We appreciate this feedback on the rigor of our analyses. In the revised manuscript, we will include error bars (standard deviation across multiple seeds) for all quantitative results in §4.2 and §5. We will also conduct and report statistical tests to compare TeLAPA against baselines. Furthermore, we will add ablation studies varying the latent space dimensionality and archive size to demonstrate the robustness of the multi-policy neighborhood approach. These changes will strengthen the evidence that the observed advantages are not artifacts of the specific setting.
Revision: yes

Referee: §3 (Method): The description of how the shared latent space maintains comparability of archived policies after task drift is high-level; no concrete mechanism, loss terms, or stability guarantees are provided to ensure that behavioral diversity remains aligned and selectable when policy parameters evolve under interference, which is load-bearing for the 'beyond single-model' claim.

Authors: We will revise §3 to provide a more detailed description of the latent alignment process. This will include the specific loss functions employed to project policies into the shared latent space, the criteria for maintaining behavioral diversity within archives, and any mechanisms or regularizations used to promote stability under policy updates and task drift. We will also discuss the selection strategy for retrieving policies from the archive during adaptation. If appropriate, we will include an algorithmic outline to clarify the overall procedure.
Revision: yes
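
The second response above commits to per-seed standard deviations and statistical tests without naming a test. As a rough illustration of what such reporting could look like, the sketch below computes mean and sample standard deviation across seeds and an exact paired sign-flip permutation test. The choice of test, the function names, and all numbers are assumptions for illustration, not the authors' protocol or results.

```python
import itertools
import statistics


def mean_and_std(per_seed):
    """Mean and sample standard deviation of one metric across seeds."""
    std = statistics.stdev(per_seed) if len(per_seed) > 1 else 0.0
    return statistics.mean(per_seed), std


def sign_flip_pvalue(scores_a, scores_b):
    """Two-sided exact paired sign-flip permutation test.

    Assumes scores_a[i] and scores_b[i] come from the same seed. Enumerates
    all 2**n sign patterns, so it is only practical for a modest seed count.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # small slack for float rounding
            hits += 1
    return hits / total


# Made-up per-seed success rates, purely to exercise the functions.
method_scores = [0.80, 0.90, 0.85, 0.95]
baseline_scores = [0.60, 0.70, 0.65, 0.75]
mean, std = mean_and_std(method_scores)
p_value = sign_flip_pvalue(method_scores, baseline_scores)
print(f"{mean:.3f} +/- {std:.3f}, p = {p_value:.3f}")
```

With only four seeds the smallest attainable two-sided p-value is 2/16, which is one reason a referee would want the seed count reported alongside the error bars.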

standing simulated objections not resolved

Validation of the approach in higher-variance domains such as MuJoCo or Atari, which would require substantial new experiments not feasible in the current revision.

Circularity Check

0 steps flagged

No circularity: framework and results are empirically grounded without self-referential reductions.

The paper presents TeLAPA as an empirically evaluated framework for continual RL that organizes policies into latent-aligned archives, drawing inspiration from quality-diversity methods. No equations, fitted parameters, or derivations are described that reduce the central claims (e.g., faster recovery via neighborhood selection) to inputs by construction. Analyses of source-optimal vs. transfer-optimal policies and MiniGrid results provide independent empirical content. Self-citations, if present for inspiration, are not load-bearing for the core argument, which rests on experimental comparisons rather than tautological definitions or imported uniqueness theorems.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The abstract introduces the TeLAPA framework and the concept of skill-aligned neighborhoods without specifying free parameters, background axioms, or new physical entities; the approach relies on standard RL assumptions plus the unstated premise that latent alignment preserves behavioral comparability under drift.

axioms (1)

Domain assumption: Policies that are behaviorally similar remain comparable and reusable when projected into a shared latent space even after non-stationary task drift.
This is the core premise enabling the archive-based transfer; it is invoked when the abstract states that archived policies remain comparable and reusable under non-stationary drift.

invented entities (1)

Transfer-Enabled Latent-Aligned Policy Archives (TeLAPA) · no independent evidence
Purpose: Organize behaviorally diverse policy neighborhoods per task to support future relearning and plasticity.
The framework itself is the central new construct introduced to move beyond single-model preservation.
