pith. machine review for the scientific record.

arxiv: 2604.08336 · v1 · submitted 2026-04-09 · 💻 cs.LG

Recognition: 2 Lean theorem links

Leveraging Complementary Embeddings for Replay Selection in Continual Learning with Small Buffers

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:44 UTC · model grok-4.3

classification 💻 cs.LG
keywords continual learning · replay selection · catastrophic forgetting · self-supervised embeddings · supervised embeddings · graph-based selection · small memory buffers · MERS

The pith

A graph-based selector that fuses supervised and self-supervised embeddings improves replay quality for continual learning under tight memory limits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that combining class-specific supervised embeddings with class-agnostic self-supervised embeddings in a graph structure enables better selection of replay samples than either embedding type alone. This matters for continual learning because small buffers force algorithms to be highly selective, and missed semantic information leads to faster forgetting of earlier tasks. The proposed MERS method requires no extra model parameters or larger buffers yet delivers measurable gains on standard image benchmarks when memory is scarce.

Core claim

The authors introduce Multiple Embedding Replay Selection (MERS), a graph-based replay selection strategy that integrates both supervised and self-supervised embeddings to rank and retain samples for the memory buffer. They show that this integration produces consistent accuracy improvements over existing single-embedding selection baselines across multiple replay-based continual learning algorithms, with the largest relative gains occurring in low-memory regimes on CIFAR-100 and TinyImageNet.

What carries the argument

MERS, a graph-based selector that builds a unified representation from complementary supervised and self-supervised embeddings to decide which incoming samples to store in the replay buffer.
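The extracted text does not specify the fusion mechanism (a gap the referee report below also flags), so the following is only a minimal sketch under assumed choices: α-weighted concatenation of L2-normalized embeddings, with greedy farthest-point selection standing in for the paper's graph-based (MaxHerding-style) selector. Names like `fuse` and `select_buffer` are illustrative, not the authors' API.

```python
import math

def l2norm(v):
    """Normalize a vector to unit length; leave all-zero vectors alone."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def fuse(sup_emb, ssl_emb, alpha=0.5):
    """Hypothetical fusion: alpha-weighted concatenation of the normalized
    supervised and self-supervised embeddings. The paper's actual graph
    construction is not specified in the text reviewed here."""
    return [alpha * x for x in l2norm(sup_emb)] + \
           [(1 - alpha) * x for x in l2norm(ssl_emb)]

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def select_buffer(sup, ssl, k):
    """Greedy farthest-point selection over the fused space, a simple
    coverage-seeking stand-in for the graph-based selector; returns the
    indices of the samples to keep in the replay buffer."""
    points = [fuse(s, z) for s, z in zip(sup, ssl)]
    chosen = [0]  # seed with the first sample
    while len(chosen) < min(k, len(points)):
        # keep the point farthest from everything already in the buffer
        best = max((i for i in range(len(points)) if i not in chosen),
                   key=lambda i: min(dist(points[i], points[j]) for j in chosen))
        chosen.append(best)
    return chosen
```

Because the selector only consumes per-sample embeddings and a budget `k`, any replay-based algorithm (ER, ER-ACE) could call it in place of its existing selection step, which is the "drop-in" property the paper claims.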

If this is right

  • MERS functions as a drop-in module that can replace the selection component in many existing replay-based continual learning algorithms.
  • Performance advantages grow as buffer size shrinks, making the approach especially relevant for edge or resource-limited deployments.
  • No increase in stored parameters or replay volume is required, preserving the original memory footprint.
  • The same dual-embedding graph construction can be applied to any dataset where both supervised labels and self-supervised pretraining are feasible.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The result implies that representation diversity itself, rather than any single embedding quality, may be the key lever for effective replay under extreme memory constraints.
  • Similar fusion of multiple embedding sources could be tested in non-replay continual learning settings such as regularization-based or architecture-based methods.
  • One could measure whether the benefit scales with the degree of semantic overlap between the supervised and self-supervised views on a given task sequence.
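That overlap could be probed with a representation-similarity score. A minimal sketch using linear CKA (centered kernel alignment) on toy feature matrices follows; CKA is one standard choice for this, not a measurement the paper reports, and the numbers are illustrative only.

```python
def centered(X):
    """Subtract the per-feature mean from a row-major matrix (samples x features)."""
    means = [sum(col) / len(col) for col in zip(*X)]
    return [[x - m for x, m in zip(row, means)] for row in X]

def gram_fro(A, B):
    """||A^T B||_F^2 for row-major matrices with equal sample counts."""
    s = 0.0
    for i in range(len(A[0])):
        for j in range(len(B[0])):
            v = sum(a[i] * b[j] for a, b in zip(A, B))
            s += v * v
    return s

def linear_cka(X, Y):
    """Linear CKA in [0, 1]: 1 means the two embedding spaces are linearly
    redundant; low values mean they carry complementary structure."""
    X, Y = centered(X), centered(Y)
    return gram_fro(X, Y) / (gram_fro(X, X) ** 0.5 * gram_fro(Y, Y) ** 0.5)

# Toy check: an embedding space is fully redundant with itself.
sup = [[1.0, 0.2], [0.4, 1.0], [0.9, 0.1], [0.2, 0.8]]
print(round(linear_cka(sup, sup), 3))  # → 1.0
```

If the supervised and self-supervised embeddings of a task's samples already score near 1, fusion has little headroom; a low score is consistent with the complementarity hypothesis.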

Load-bearing premise

Self-supervised representations encode class-relevant semantics that supervised embeddings overlook and that are useful for choosing good replay samples.

What would settle it

If MERS produces no statistically significant reduction in forgetting or increase in final accuracy compared with a strong single supervised-embedding baseline on TinyImageNet using a 200-sample buffer, the central claim would not hold.
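One concrete form of that test: compare per-seed FAA for MERS and the baseline with a paired t-test. The accuracy numbers below are invented for illustration and are not results from the paper.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(xs, ys):
    """Paired t-statistic over per-seed (MERS, baseline) accuracy pairs."""
    diffs = [x - y for x, y in zip(xs, ys)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

mers_faa     = [31.2, 30.8, 31.9, 30.5, 31.4]  # illustrative seeds only
baseline_faa = [29.9, 29.5, 30.6, 29.8, 30.1]

t = paired_t(mers_faa, baseline_faa)
# With n-1 = 4 degrees of freedom, |t| > 2.776 rejects equality at the 5%
# level; failing this test at |M| = 200 on TinyImageNet would undermine
# the central claim as stated above.
```

In practice one would use `scipy.stats.ttest_rel` for the accompanying p-value; the hand-rolled statistic here just keeps the sketch dependency-free.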

Figures

Figures reproduced from arXiv: 2604.08336 by Danit Yanowsky, Daphna Weinshall.

Figure 1
Figure 1: Illustration of MERS in the class-incremental learning (CIL) setup, after training episode T.
Figure 3
Figure 3: FAA (left) and AAA (right) on Split TinyImageNet for ER-ACE with buffer size |M| = 1000, using MERS, compared against alternative selection strategies.
Figure 4
Figure 4: FAA of MERS with ER-ACE-STAR on Split CIFAR-100 using different embeddings: SimCLR, VICReg, and DINOv2.
Figure 2
Figure 2: FAA as a function of memory size |M| on Split CIFAR-100 for three continual learning algorithms (Section 5.1). Results with MERS are compared against alternative selection strategies (Section 5.2); the selection-strategy legend is shown in panel (c).
Figure 6
Figure 6: Improvements in FAA on CIFAR-100 as a function of |M| while varying the RBF bandwidth σ in MaxHerding.
Figure 7
Figure 7: MERS ProbCover: FAA as a function of memory size |M| on Split CIFAR-100 for three continual learning algorithms (Section 5.1), compared against alternative selection strategies (Section 5.2).
Figure 8
Figure 8: MERS ProbCover: FAA (right) and AAA (left) as a function of memory size |M| on Split CIFAR-100 with ER-ACE, compared against alternative selection strategies.
Figure 9
Figure 9: MERS ProbCover: stability and forgetting of ER-ACE-STAR with MERS as a function of |M| on Split CIFAR-100.
Figure 10
Figure 10: MERS ProbCover: stability and forgetting of ER-ACE with MERS as a function of |M| on Split CIFAR-100.
Figure 11
Figure 11: MERS ProbCover: stability and forgetting of ER with MERS as a function of |M| on Split CIFAR-100.
Figure 15
Figure 15: MERS MaxHerding: ablation of the embedding weight α using K-NN density estimators on Split CIFAR-100 with ER-ACE. The baseline corresponds to Eq. (5); a minor but consistent improvement is observed with this weighting.
Figure 12
Figure 12: AAA as a function of memory size |M| on Split CIFAR-100 for different continual learning algorithms, compared against alternative selection strategies.
Figure 13
Figure 13: Stability and forgetting of ER-ACE with MERS as a function of |M| on Split CIFAR-100.
Figure 14
Figure 14: Stability and forgetting of ER with MERS as a function of |M| on Split CIFAR-100.
Original abstract

Catastrophic forgetting remains a key challenge in Continual Learning (CL). In replay-based CL with severe memory constraints, performance critically depends on the sample selection strategy for the replay buffer. Most existing approaches construct memory buffers using embeddings learned under supervised objectives. However, class-agnostic, self-supervised representations often encode rich, class-relevant semantics that are overlooked. We propose a new method, Multiple Embedding Replay Selection, MERS, which replaces the buffer selection module with a graph-based approach that integrates both supervised and self-supervised embeddings. Empirical results show consistent improvements over SOTA selection strategies across a range of continual learning algorithms, with particularly strong gains in low-memory regimes. On CIFAR-100 and TinyImageNet, MERS outperforms single-embedding baselines without adding model parameters or increasing replay volume, making it a practical, drop-in enhancement for replay-based continual learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Multiple Embedding Replay Selection (MERS), a graph-based replay buffer selection module for replay-based continual learning that integrates supervised and self-supervised embeddings. It claims this yields consistent improvements over SOTA selection strategies across CL algorithms, with particularly strong gains in low-memory regimes on CIFAR-100 and TinyImageNet, without adding parameters or increasing replay volume.

Significance. If the performance gains are shown to stem specifically from the complementarity of the two embedding types rather than the graph construction alone, the method would provide a practical, drop-in enhancement for memory-constrained continual learning. The evaluation on standard benchmarks and emphasis on low-buffer regimes addresses a relevant practical constraint in the field.

major comments (2)
  1. [§5] §5 (Experimental results): The central claim that gains derive from complementary supervised and self-supervised embeddings is not supported by ablations that fix the graph-based selection procedure and vary only the embedding sources (supervised-only graph, self-supervised-only graph, and combined). Without these controls, it remains possible that any sufficiently rich single embedding fed into the same graph module would produce similar results, directly weakening the load-bearing assumption stated in the abstract and introduction.
  2. [§4] §4 (Method): The description of the graph construction does not specify how the two embedding types are fused (e.g., joint node/edge features, separate graphs with combined scoring, or concatenation before distance computation). This detail is necessary to evaluate whether the reported improvements are reproducible and attributable to complementarity rather than implementation choices such as the distance metric.
minor comments (2)
  1. [Abstract] Abstract: Replace qualitative statements such as 'consistent improvements' and 'particularly strong gains' with specific quantitative deltas (e.g., average accuracy lift on CIFAR-100 at buffer size 500) to allow readers to assess effect sizes immediately.
  2. Figures: Ensure all plots include error bars or statistical significance markers when comparing MERS against baselines, and label the exact buffer sizes and CL algorithms used in each panel.
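The ablation requested in major comment 1 can be sketched as a grid that holds the selection procedure fixed and sweeps only the embedding source and buffer size. `run_cl_experiment` below is a placeholder returning made-up numbers so the harness runs end to end; it is not the paper's pipeline and implies nothing about the real results.

```python
from itertools import product

# Ablation axes: one fixed graph selector, three embedding sources,
# several buffer sizes.
EMBEDDINGS = ["supervised_only", "ssl_only", "combined"]
BUFFER_SIZES = [200, 500, 1000]

def run_cl_experiment(embedding_source, buffer_size, seed):
    """Placeholder for the real benchmark run (e.g. ER-ACE on Split
    CIFAR-100). Returns a fabricated FAA so the loop is executable;
    swap in the actual trainer to run the ablation for real."""
    return 25.0 + 2.0 * EMBEDDINGS.index(embedding_source) + 0.1 * seed

def ablation_grid(seeds=(0, 1, 2)):
    """Mean FAA per (embedding source, buffer size) cell."""
    results = {}
    for emb, buf in product(EMBEDDINGS, BUFFER_SIZES):
        runs = [run_cl_experiment(emb, buf, s) for s in seeds]
        results[(emb, buf)] = sum(runs) / len(runs)
    return results
```

The complementarity claim survives only if the "combined" column beats both single-embedding columns under this identical selection procedure.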

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments, which have identified key opportunities to strengthen the presentation of our core hypothesis and the reproducibility of the method. We address each major comment below and commit to revisions that directly respond to the concerns raised.

Point-by-point responses
  1. Referee: [§5] §5 (Experimental results): The central claim that gains derive from complementary supervised and self-supervised embeddings is not supported by ablations that fix the graph-based selection procedure and vary only the embedding sources (supervised-only graph, self-supervised-only graph, and combined). Without these controls, it remains possible that any sufficiently rich single embedding fed into the same graph module would produce similar results, directly weakening the load-bearing assumption stated in the abstract and introduction.

    Authors: We agree that the requested ablations are necessary to rigorously support the claim of complementarity. In the revised manuscript we will add experiments that hold the graph construction and selection procedure fixed while varying only the embedding sources: supervised-only, self-supervised-only, and the combined embeddings used by MERS. These results will be reported in §5 under the same protocols and benchmarks as the main experiments. We believe the combined variant will demonstrate clear gains over the single-embedding graphs, thereby confirming that the observed improvements are attributable to the integration of complementary representations rather than the graph module alone. revision: yes

  2. Referee: [§4] §4 (Method): The description of the graph construction does not specify how the two embedding types are fused (e.g., joint node/edge features, separate graphs with combined scoring, or concatenation before distance computation). This detail is necessary to evaluate whether the reported improvements are reproducible and attributable to complementarity rather than implementation choices such as the distance metric.

    Authors: We apologize for the omission of these implementation details. The revised §4 will explicitly describe the fusion mechanism, including how the supervised and self-supervised embeddings are combined to form node features, the precise distance metric applied, and whether a single joint graph or multiple graphs are used. This clarification will enable full reproducibility and allow readers to assess whether the gains arise from complementarity or from other design choices. revision: yes

Circularity Check

0 steps flagged

No significant circularity

Full rationale

The paper describes an empirical method (MERS) for replay buffer selection in continual learning by combining supervised and self-supervised embeddings in a graph-based module. No equations, derivations, fitted parameters, or predictions appear in the provided text. All claims rest on experimental comparisons against external SOTA baselines on standard datasets (CIFAR-100, TinyImageNet). No self-citations form load-bearing chains, no ansatzes are smuggled, and no results reduce to inputs by construction. The central performance gains are presented as measured outcomes rather than self-referential logic, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that self-supervised embeddings contain overlooked class-relevant information; no free parameters or invented entities are mentioned.

axioms (1)
  • domain assumption Class-agnostic, self-supervised representations often encode rich, class-relevant semantics that are overlooked by supervised embeddings.
    Explicitly stated in the abstract as the motivation for combining the two embedding types.

pith-pipeline@v0.9.0 · 5445 in / 1190 out tokens · 109677 ms · 2026-05-10T17:44:31.458028+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

