arxiv: 2606.03843 · v1 · pith:3F5IW2SMnew · submitted 2026-06-02 · 💻 cs.LG · cs.AI

Re-Evaluating Continual Learning with Few-Shot Adaptation

Pith reviewed 2026-06-28 11:33 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords continual learning · few-shot evaluation · per-shot plasticity · meta-learning · learning-to-learn · stability and plasticity · image classification

The pith

Few-shot evaluation reveals that meta-learning future tasks induces learning-to-learn behavior in continual learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Continual learning methods are evaluated on sequences of tasks using 0-shot performance to measure stability via forgetting and plasticity via recent-task accuracy. The paper argues this standard misses the ability to retain information while adapting quickly, since it assumes perfect recall. It introduces few-shot evaluation on image classification task sequences and a per-shot plasticity metric that tracks adaptation across successive examples of new tasks. Adding foresight by meta-learning a short sequence of future tasks produces learning-to-learn behavior over the full sequence.
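The difference between 0-shot and few-shot evaluation can be sketched in a few lines. The example below is a minimal illustration, not the paper's protocol: it assumes a toy binary logistic model stored as a weight list, and the names `adapt` and `k_shot_accuracy`, the learning rate, and the step count are all hypothetical. The only idea it is meant to show is that a copy of the checkpoint is briefly fine-tuned on k labeled examples before being scored, with k = 0 recovering the usual 0-shot measurement.

```python
import math

def predict(w, x):
    """Probability of class 1 for a logistic model; w[-1] is the bias."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + w[-1]
    return 1.0 / (1.0 + math.exp(-z))

def accuracy(w, data):
    """Fraction of (x, y) pairs classified correctly at a 0.5 threshold."""
    return sum((predict(w, x) >= 0.5) == bool(y) for x, y in data) / len(data)

def adapt(w, support, lr=0.5, steps=20):
    """Fine-tune a copy of the checkpoint on the support set (assumed SGD recipe)."""
    w = list(w)  # the stored checkpoint itself is left untouched
    for _ in range(steps):
        for x, y in support:
            err = predict(w, x) - y  # gradient of the log loss w.r.t. the logit
            for i, xi in enumerate(x):
                w[i] -= lr * err * xi
            w[-1] -= lr * err
    return w

def k_shot_accuracy(checkpoint, support, query, k):
    """Accuracy on the query set after adapting on the first k support examples."""
    return accuracy(adapt(checkpoint, support[:k]), query)
```

Calling `k_shot_accuracy` with increasing k on the same checkpoint gives the accuracy-versus-shots curve that few-shot evaluation is built on; a checkpoint that scores poorly at k = 0 but recovers after a handful of examples has lost recall without losing the underlying information.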

Core claim

The paper claims that incorporating foresight into continual learning methods by meta-learning a short sequence of future tasks induces learning-to-learn behavior over the task sequence, as shown by improved per-shot plasticity under few-shot evaluation on continual image classification benchmarks.
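To make the foresight idea concrete, here is a hedged sketch of one meta-update over a small set of look-ahead tasks. It is not the authors' implementation: the Figure 5 caption says only that methods are updated with MAML on 1 or 3 look-ahead tasks in parallel, so this example swaps in a first-order approximation of MAML on a toy scalar regression model y = w * x. The function names, the (support, query) task layout, and both learning rates are assumptions.

```python
def grad(w, data):
    """Gradient of mean squared error for the scalar model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in data) / len(data)

def fomaml_step(w, lookahead_tasks, inner_lr=0.1, outer_lr=0.1):
    """One first-order MAML update; each task is a (support, query) pair."""
    meta_grad = 0.0
    for support, query in lookahead_tasks:
        # Inner loop: one adaptation step on the task's support examples.
        adapted = w - inner_lr * grad(w, support)
        # Outer signal: query-loss gradient at the adapted weight. The
        # first-order variant ignores how `adapted` depends on `w`.
        meta_grad += grad(adapted, query)
    # The tasks are handled in parallel by averaging their meta-gradients.
    return w - outer_lr * meta_grad / len(lookahead_tasks)
```

Repeating `fomaml_step` moves `w` toward a starting point from which a single inner step does well on every look-ahead task, which is the learning-to-learn behavior the claim refers to. Full MAML would also differentiate through the inner step.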

What carries the argument

The per-shot plasticity metric, which measures how performance on a new task improves with each additional labeled example, paired with meta-learning of a short future-task sequence to inject foresight into the continual learner.
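As a rough illustration of what a per-shot measurement could look like, the sketch below takes a list of accuracies indexed by the number of labeled examples seen and reports the gain from each extra shot, plus a normalized area under that curve. The function names and the area-based summary are assumptions made for this example. The paper's own metric (its figure captions refer to SAUCE, described as a scaled regret over early training) is not defined on this page and may differ.

```python
def per_shot_gains(acc_by_shot):
    """acc_by_shot[k] is accuracy after k labeled examples; return each shot's gain."""
    return [b - a for a, b in zip(acc_by_shot, acc_by_shot[1:])]

def normalized_area(acc_by_shot):
    """Trapezoidal area under the k-shot accuracy curve, divided by the shot count.

    A learner at accuracy 1.0 for every k scores 1.0. Of two learners with the
    same endpoints, the one that improves earlier scores higher.
    """
    n_shots = len(acc_by_shot) - 1
    area = sum((a + b) / 2 for a, b in zip(acc_by_shot, acc_by_shot[1:]))
    return area / n_shots
```

A summary of this kind rewards fast early adaptation, which a single final-accuracy number cannot distinguish from slow adaptation that reaches the same endpoint.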

If this is right

  • Few-shot evaluation supplies a finer-grained picture of stability and plasticity than 0-shot alone.
  • Popular continual learning strategies exhibit previously unseen behaviors once assessed with per-shot plasticity.
  • Meta-learning a short sequence of future tasks produces measurable learning-to-learn across an entire task stream.
  • Foresight-augmented methods retain and adapt information more effectively when new tasks arrive in sequence.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same foresight mechanism could be tested in settings where task boundaries are unknown in advance.
  • Per-shot plasticity may serve as a diagnostic for how quickly any sequential learner recovers from distribution shifts.
  • If the pattern holds, training pipelines for deployed systems could routinely include short future-task meta-training to improve long-term adaptation.

Load-bearing premise

0-shot evaluation requires perfect recall across tasks and therefore cannot measure a method's capacity for quick adaptation to new information.

What would settle it

The central claim would be falsified if experiments applying the same base continual learning methods with and without the meta-learning foresight step found no measurable difference in per-shot plasticity on the task sequences.
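One simple way to read "no measurable difference" is a paired comparison across random seeds. The sketch below assumes each run yields one per-shot plasticity score per seed, and that the with-foresight and without-foresight runs share seeds; the function name and this pairing scheme are assumptions, not something the paper or this page specifies.

```python
import math
import statistics

def paired_effect(with_foresight, without_foresight):
    """Mean per-seed difference and its standard error (needs at least two seeds)."""
    diffs = [a - b for a, b in zip(with_foresight, without_foresight)]
    mean = statistics.mean(diffs)
    sem = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean, sem
```

If the mean difference is small relative to its standard error for every base method and task sequence, the foresight step would have no detectable effect under this reading.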

Figures

Figures reproduced from arXiv: 2606.03843 by Amogh Inamdar, Matthew So, Richard Zemel, Vici Milenia.

Figure 1: Few-shot evaluation and adaptation in continual learning. A model …
Figure 2: 0-shot vs. 5-shot accuracy on backward tasks in 4 common CL sequences. With 0-shot evaluation, replay-based methods almost match prior accuracy, but other methods show significant forgetting, especially the non-stability-preserving SGD baseline. However, 5-shot evaluation tells a different story: performance recovers drastically and nearly matches the prior best in every setting.
Figure 3: CL Plasticity (0-shot accuracy on the current task) is not highly informative of method performance, as many methods saturate on the most common benchmarks for visual CL. In contrast, few-shot forward transfer effectively characterizes model plasticity in visual CL. Evaluated 10-shot, EWC appears to sharply lose plasticity on long sequences, AGEM’s plasticity benefits from a lack of cross-task structure, a…
Figure 4: Per-checkpoint backward and forward performance on task- and domain-incremental …
Figure 5: Performance of 5-shot backward (upper row) and forward (lower row) transfer with foresight meta-learning added to CL methods. We update each method with MAML on 1 (SEQ-MNIST-5) or 3 (others) look-ahead tasks in parallel (forward measurement does not include look-ahead tasks). Many methods see an improvement in forward transfer. Though we only apply meta-learning in the forward direction, we also observe im…
Figure 6: 5-shot backwards and forwards accuracy of …
Figure 7: A metric that captures the rate of adaptation as the scaled regret over early training.
Figure 8: Forward and backward per-shot plasticity, measured by averaged SAUCE, for each method …
Figure 9: Forward and backward per-shot plasticity on 20-task sequences. The solid lines represent …
Figure 10: Mean k-shot accuracy of non-meta CL baselines on SEQ-MNIST-5 for backward, current, and forward tasks, averaged across all checkpoints. Error bars denote standard error over 10 seeds.
Figure 11: K-shot evaluation accuracy for SGD, evaluated on …
Figure 12: K-shot evaluation accuracy for SGD, evaluated on …
Figure 13: Improvement in 5-shot forward transfer with …
Figure 14: Mean k-shot accuracy of meta-SGD variants on ROT-MNIST-20 for backward, current, and forward tasks, averaged across all checkpoints. Error bars denote standard error over 10 seeds.
Figure 15: K-shot evaluation accuracy for parallel MAML with 20 adaptation steps and 20 meta …
Figure 16: Forward and backward per-shot plasticity on the task incremental …
Original abstract

Continual learning methods aim to maximize the stability and plasticity of machine learning models that are trained on a sequence of tasks. The standard measure of stability (i.e., forgetting) is the 0-shot performance of a model on previously learned tasks, and plasticity, the performance on the most recently learned task. However, 0-shot evaluation does not fully measure a model or method's ability to retain learned information or adapt quickly to new information, as it requires perfect recall across multiple tasks. In this paper, we propose few-shot evaluation as a more comprehensive assessment of the stability and plasticity of a continual learning system. We conduct a fine-grained assessment on task sequences for continual image classification and find that this paradigm produces novel insights into the performance of popular continual learning strategies. Through few-shot evaluation with a novel metric -- per-shot plasticity -- we show that adding 'foresight' to continual learning methods via the meta-learning of a short sequence of future tasks induces learning-to-learn behavior over the task sequence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 1 minor

Summary. The paper argues that 0-shot evaluation is insufficient to measure stability and plasticity in continual learning because it requires perfect recall across tasks. It proposes few-shot evaluation as a more comprehensive paradigm, introduces the per-shot plasticity metric, and reports that meta-learning a short sequence of future tasks to add 'foresight' induces learning-to-learn behavior, yielding novel insights into popular continual learning strategies on continual image classification task sequences.

Significance. If the empirical findings on per-shot plasticity and the foresight effect hold under rigorous controls, the work could shift evaluation standards in continual learning away from 0-shot metrics and highlight a practical way to improve adaptation via meta-learning of future tasks. The introduction of a targeted metric for few-shot regimes is a clear contribution if the results prove reproducible.

minor comments (1)
  1. The abstract refers to a 'fine-grained assessment on task sequences for continual image classification' but does not name the datasets, number of tasks, or specific CL baselines evaluated; this detail would strengthen the claim of novel insights.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their summary of the manuscript and for noting its potential to influence evaluation standards in continual learning. The report lists no specific major comments under the MAJOR COMMENTS section, so we provide no point-by-point responses below. We remain available to address any additional concerns or requests for clarification.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper is an empirical study that proposes few-shot evaluation and a per-shot plasticity metric for assessing continual learning methods, then reports experimental findings on image classification task sequences. The central claim (that meta-learning a short sequence of future tasks induces learning-to-learn behavior) is presented as an observed outcome of experiments rather than a mathematical derivation or prediction that reduces to fitted parameters or self-citations by construction. No load-bearing self-citations, self-definitional constructs, or renamings of known results appear in the abstract or described methodology. The work is self-contained against external benchmarks via standard continual learning evaluations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities can be identified.

pith-pipeline@v0.9.1-grok · 5707 in / 985 out tokens · 18166 ms · 2026-06-28T11:33:39.116865+00:00 · methodology

