arxiv: 2604.04420 · v1 · submitted 2026-04-06 · 💻 cs.LG · cs.AI

Is Prompt Selection Necessary for Task-Free Online Continual Learning?

Pith reviewed 2026-05-10 19:33 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords continual learning · task-free learning · online learning · prompt tuning · vision transformer · catastrophic forgetting · streaming data · classifier design

The pith

Prompt selection from a pool is not necessary for state-of-the-art task-free online continual learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Task-free online continual learning requires models to learn from a non-stationary data stream without task boundaries or revisiting samples. The paper shows that prompt selection strategies, which choose from a pool based on input, often select poor prompts and underperform despite extra training. The authors instead propose using one fixed prompt inserted into every self-attention block, calculating logits via cosine similarity to limit forgetting, and masking out logits of classes absent from the current batch. This straightforward setup delivers top results on standard benchmarks. Readers should care because it simplifies real-time adaptation in dynamic environments where task cues are absent.

Core claim

The authors claim that prompt selection strategies in task-free online continual learning frequently fail to pick suitable prompts, which leads to suboptimal performance. They propose SinglePrompt, which (i) injects a single prompt into each self-attention block, (ii) computes logits with a cosine similarity-based design to reduce forgetting in the classifier weights, and (iii) masks logits for classes not present in the current minibatch. They report state-of-the-art performance across various online continual learning benchmarks, without task boundaries or multiple passes over the data.
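The first ingredient, a single learnable prompt prepended to the keys and values of a self-attention block (as the Figure 1 caption describes), can be sketched in plain Python. This is a minimal single-head illustration under stated assumptions, not the authors' implementation: the function name `prefix_attention`, the list-of-lists layout, and the omission of query/key/value projections and multiple heads are all hypothetical simplifications.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def prefix_attention(queries, keys, values, prompt_k, prompt_v):
    """Single-head attention with prompts prepended to keys and values.

    queries, keys, values: lists of token vectors (seq_len x dim).
    prompt_k, prompt_v: lists of prompt vectors (prompt_len x dim);
        in a real model these would be the learnable parameters.
    Returns one output vector per query, so the sequence length
    is unchanged by the prompt.
    """
    ext_keys = prompt_k + keys        # prompt tokens come first
    ext_values = prompt_v + values
    scale = 1.0 / math.sqrt(len(queries[0]))
    dim = len(ext_values[0])
    outputs = []
    for q in queries:
        weights = softmax([dot(q, k) * scale for k in ext_keys])
        out = [sum(w * v[d] for w, v in zip(weights, ext_values))
               for d in range(dim)]
        outputs.append(out)
    return outputs
```

Because the prompt only extends the keys and values, the output keeps one vector per input token. With empty prompt lists the function reduces to ordinary attention.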

What carries the argument

The SinglePrompt mechanism, which consists of injecting one shared prompt into each self-attention block of a transformer, computing classification logits with cosine similarity instead of dot product, and applying logit masking for unseen classes, carries the argument by focusing optimization on the classifier while avoiding the pitfalls of adaptive selection.
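The two classifier-side ingredients named above can be illustrated with a short Python sketch. It is written under assumptions rather than taken from the paper's code: the function names, the `scale` factor, and the choice to implement masking by setting absent-class logits to negative infinity before the softmax are all hypothetical.

```python
import math


def cosine_logits(feature, class_weights, scale=1.0):
    """Return one logit per class as scale * cos(feature, weight).

    `scale` is an assumed temperature-like factor; the paper's
    actual value (if any) is not given in this summary.
    """
    f_norm = math.sqrt(sum(x * x for x in feature)) or 1e-12
    logits = []
    for w in class_weights:
        w_norm = math.sqrt(sum(x * x for x in w)) or 1e-12
        inner = sum(a * b for a, b in zip(feature, w))
        logits.append(scale * inner / (f_norm * w_norm))
    return logits


def masked_cross_entropy(logits, label, batch_labels):
    """Cross-entropy over only the classes present in the minibatch.

    Classes absent from `batch_labels` get a logit of -inf, so they
    contribute nothing to the softmax normalizer and receive no
    gradient from this batch.
    """
    present = set(batch_labels)
    masked = [z if c in present else -math.inf
              for c, z in enumerate(logits)]
    peak = max(masked)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in masked))
    return log_norm - masked[label]
```

In this sketch, rescaling a class weight vector leaves its cosine logit unchanged, and a large logit for a class that is absent from the minibatch has no effect on the loss.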

Load-bearing premise

That the failures of prompt selection are general across methods and datasets, rather than specific to the implementations tested, and that the single prompt design with cosine logits and masking is broadly sufficient without any form of task information.

What would settle it

An experiment showing that an improved prompt selection strategy achieves higher accuracy than SinglePrompt on the same continual learning benchmarks, or ablation studies where removing the cosine design or masking causes SinglePrompt to underperform significantly.

Figures

Figures reproduced from arXiv: 2604.04420 by Haemin Lee, Hankook Lee, Seoyoung Park.

Figure 1: An overview of the proposed SinglePrompt. When minibatch $B_t = \{(x_t^{(i)}, y_t^{(i)})\}_{i=1}^{B_t}$ is provided, it passes through a pretrained Vision Transformer encoder. At the $i$-th self-attention block $f_i$, the input sequence $h_{i-1}$ is given, and during the attention operation the learnable prompts $p_i^k$ and $p_i^v$ are prepended to the key and value, respectively. Only the class token from the encoder's output seque…
Figure 2: Histograms of prompt selection counts per class on task-free continual learning using CIFAR100 […
Figure 3: Prompt selection failures in task-based methods on CI…
Figure 4: Visualization of the L2 norms of the weights for each…
Figure 5: Ablation study on Prompt length. The x-axis denotes the…
Figure 7: Visualization of the Si-blurry scenario. The dataset for…
Figure 6: Anytime inference accuracy curves of SinglePrompt and…
Figure 8: Prompt selection failures on additional datasets, showing…
Figure 9: Failure of ConvPrompt [20] selection in task-based continual learning on CIFAR100 [13]. The x-axis represents the task ID of input sample and the y-axis indicates the average cosine similarity between each task's samples and their assigned keys. (a) Result in the offline continual learning setting, where class information of upcoming tasks is available. Task similarity is computed using class descriptor…
Original abstract

Task-free online continual learning has recently emerged as a realistic paradigm for addressing continual learning in dynamic, real-world environments, where data arrive in a non-stationary stream without clear task boundaries and can only be observed once. To consider such challenging scenarios, many recent approaches have employed prompt selection, an adaptive strategy that selects prompts from a pool based on input signals. However, we observe that such selection strategies often fail to select appropriate prompts, yielding suboptimal results despite additional training of key parameters. Motivated by this observation, we propose a simple yet effective SinglePrompt that eliminates the need for prompt selection and focuses on classifier optimization. Specifically, we simply (i) inject a single prompt into each self-attention block, (ii) employ a cosine similarity-based logit design to alleviate the forgetting effect inherent in the classifier weights, and (iii) mask logits for unexposed classes in the current minibatch. With this simple task-free design, our framework achieves state-of-the-art performance across various online continual learning benchmarks. Source code is available at https://github.com/efficient-learning-lab/SinglePrompt.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper observes that prompt selection strategies in task-free online continual learning often fail to select appropriate prompts despite extra training of key parameters. It proposes a simple SinglePrompt framework that (i) injects a single prompt into each self-attention block, (ii) employs a cosine similarity-based logit design to alleviate forgetting in classifier weights, and (iii) masks logits for unexposed classes per minibatch. This task-free design (no boundaries or replay) is claimed to achieve state-of-the-art performance across standard online continual learning benchmarks, with public code provided.

Significance. If the empirical results hold under rigorous verification, the work demonstrates that complex prompt selection may be unnecessary, shifting focus to minimal classifier optimizations in non-stationary streams. The public repository strengthens the contribution by enabling direct reproducibility checks on the reported SOTA benchmarks.

major comments (1)
  1. [Abstract and §4 (Experiments)] The SOTA claim requires explicit reporting of all baselines, statistical significance tests, number of runs, and ablations on the three proposed components; without these, it is unclear whether the performance gains are attributable to the single-prompt design or to implementation details.
minor comments (2)
  1. [§3.2] The cosine logit formulation should include a short derivation or reference showing how it explicitly counters the bias in classifier weights compared to standard softmax.
  2. [Figure 2 or equivalent] Clarify the masking operation's effect on the loss computation to confirm it does not inadvertently use future class information.
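For the first minor comment, one common way to write the contrast between dot-product and cosine logits is sketched here. The notation is assumed for illustration and not taken from the paper: $h$ is the class-token feature, $w_c$ the classifier weight for class $c$, $\theta_c$ the angle between them, and $s$ a hypothetical scale factor.

```latex
z_c^{\mathrm{dot}} = w_c^\top h = \|w_c\|\,\|h\|\cos\theta_c ,
\qquad
z_c^{\mathrm{cos}} = s\,\frac{w_c^\top h}{\|w_c\|\,\|h\|} = s\cos\theta_c .
```

Under this reading, the dot-product logit grows with $\|w_c\|$, so classes whose weight norms have grown during recent updates are favored regardless of angle, whereas the cosine logit does not depend on $\|w_c\|$ at all. Whether this is the exact bias the paper targets would need to be confirmed against §3.2 and the weight-norm visualization in Figure 4.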

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for minor revision. We address the major comment on empirical rigor and SOTA claims below, agreeing that additional details will strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract and §4 (Experiments)] The SOTA claim requires explicit reporting of all baselines, statistical significance tests, number of runs, and ablations on the three proposed components; without these, it is unclear whether the performance gains are attributable to the single-prompt design or to implementation details.

    Authors: We agree that rigorous SOTA claims benefit from explicit and comprehensive reporting. In the revised manuscript, we will expand the experiments section (§4) and update the abstract as needed to: (i) explicitly list all baselines with their original citations and any adaptations for the task-free online continual learning setting; (ii) report all results as mean ± standard deviation over a specified number of independent runs (we will use 5 runs for consistency with common practice in the field); (iii) include statistical significance tests (e.g., paired t-tests or Wilcoxon signed-rank tests with p-values) between SinglePrompt and the strongest baselines; and (iv) add a dedicated ablation study isolating the contribution of each of the three components (single prompt per attention block, cosine similarity-based logit design, and per-minibatch logit masking for unexposed classes), along with their cumulative effects. These changes will clarify that performance improvements arise from the proposed design choices rather than implementation artifacts. The publicly available code repository already supports full reproduction of the reported results, which can serve as an immediate verification aid.

    Revision: yes
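The reporting promised in points (ii) and (iii) could look roughly like the following sketch, which uses only the Python standard library. The helper names `mean_std` and `paired_t_statistic` are hypothetical, and the code deliberately stops at the t statistic and degrees of freedom: turning these into a p-value requires a t-distribution table or a statistics library.

```python
import math
import statistics


def mean_std(values):
    """Mean and sample standard deviation across independent runs."""
    return statistics.mean(values), statistics.stdev(values)


def paired_t_statistic(method_a, method_b):
    """Paired t statistic for per-run accuracies of two methods.

    Both lists must be aligned so that index i holds the results of
    the same run (same seed and stream order) for each method.
    Returns (t, degrees_of_freedom). Raises ZeroDivisionError if all
    paired differences are identical.
    """
    diffs = [a - b for a, b in zip(method_a, method_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)
    t = mean_diff / (std_diff / math.sqrt(n))
    return t, n - 1
```

With only 5 runs the test has 4 degrees of freedom, so a non-parametric alternative such as the Wilcoxon signed-rank test mentioned in the rebuttal may also be worth reporting.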

Circularity Check

0 steps flagged

No significant circularity

Full rationale

The paper's chain consists of an empirical observation (prompt selection often fails in task-free online CL) followed by a direct proposal of three concrete architectural choices (single prompt per self-attention block, cosine-similarity logits, per-minibatch logit masking) that are motivated by that observation and then validated on public benchmarks. No equations are presented whose outputs are defined in terms of their own inputs; no parameter is fitted on a subset and then relabeled as a prediction; no uniqueness theorem or ansatz is imported via self-citation to force the design; and the central claim (SOTA performance with a strictly task-free method) remains externally falsifiable via the linked repository. The construction is therefore self-contained and does not reduce to its own premises by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no specific free parameters, axioms, or invented entities are detailed. The method builds on standard transformer self-attention and continual learning assumptions like non-stationary data streams.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages

  [1] Jihwan Bang, Heesu Kim, YoungJoon Yoo, Jung-Woo Ha, and Jonghyun Choi. Rainbow memory: Continual learning with a memory of diverse samples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8218–8227, 2021.

  [2] Pietro Buzzega, Matteo Boschini, Angelo Porrello, Davide Abati, and Simone Calderara. Dark experience for general continual learning: a strong, simple baseline. In Advances in Neural Information Processing Systems, pages 15920–15930. Curran Associates, Inc., 2020.

  [3] Lucas Caccia, Rahaf Aljundi, Nader Asadi, Tinne Tuytelaars, Joelle Pineau, and Eugene Belilovsky. New insights on reducing abrupt representation change in online continual learning. arXiv preprint arXiv:2104.05025, 2021.

  [4] Shoufa Chen, Chongjian Ge, Zhan Tong, Jiangliu Wang, Yibing Song, Jue Wang, and Ping Luo. Adaptformer: Adapting vision transformers for scalable visual recognition. arXiv preprint arXiv:2205.13535, 2022.

  [5] Zhiyuan Chen and Bing Liu. Lifelong machine learning. Morgan & Claypool Publishers, 2018.

  [6] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.

  [7] Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, and Justin Gilmer. The many faces of robustness: A critical analysis of out-of-distribution generalization. ICCV, 2021.

  [8] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.

  [9] Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. In European Conference on Computer Vision (ECCV), 2022.

  [10] Zhiqi Kang, Liyuan Wang, Xingxing Zhang, and Karteek Alahari. Advancing prompt-based methods for replay-independent general continual learning. In The Thirteenth International Conference on Learning Representations, 2025.

  [11] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.

  [12] Hyunseo Koh, Dahyun Kim, Jung-Woo Ha, and Jonghyun Choi. Online continual learning on class incremental blurry task configuration with anytime inference. In ICLR, 2022.

  [13] Alex Krizhevsky. Learning multiple layers of features from tiny images. pages 32–33, 2009.

  [14] Yann Le and Xuan Yang. Tiny imagenet visual recognition challenge. CS 231N, 7(7):3, 2015.

  [15] Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3045–3059, Online and Punta Cana, Dominican Republic, 2021. Association for Computational Linguistics.

  [16] Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pages 4582–4597, Online, 2021. Association for Computational Linguistics.

  [17] Zhizhong Li and Derek Hoiem. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2935–2947, 2017.

  [18] Jun-Yeong Moon, Keon-Hee Park, Jung Uk Kim, and Gyeong-Moon Park. Online class incremental learning on stochastic blurry task boundary via mask and visual prompt tuning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023.

  [19] David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy Lillicrap, and Gregory Wayne. Experience replay for continual learning. Advances in Neural Information Processing Systems, 32, 2019.

  [20] Anurag Roy, Riddhiman Moulick, Vinay Verma, Saptarshi Ghosh, and Abir Das. Convolutional prompting meets language models for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

  [21] Zifeng Wang, Zizhao Zhang, Sayna Ebrahimi, Ruoxi Sun, Han Zhang, Chen-Yu Lee, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, et al. Dualprompt: Complementary prompting for rehearsal-free continual learning. European Conference on Computer Vision, 2022.

  [22] Zifeng Wang, Zizhao Zhang, Chen-Yu Lee, Han Zhang, Ruoxi Sun, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, and Tomas Pfister. Learning to prompt for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 139–149,

  [23] Xiwen Wei, Guihong Li, and Radu Marculescu. Online-lora: Task-free online continual learning via low rank adaptation. arXiv preprint arXiv:2411.05663, 2024.

  [24] Maxime Zanella and Ismail Ben Ayed. Low-rank few-shot adaptation of vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 1593–1603, 2024.

  [25] Da-Wei Zhou, Hai-Long Sun, Jingyi Ning, Han-Jia Ye, and De-Chuan Zhan. Continual learning with pre-trained models: A survey. In IJCAI, pages 8363–8371, 2024.

  [26] Da-Wei Zhou, Qi-Wei Wang, Zhi-Hong Qi, Han-Jia Ye, De-Chuan Zhan, and Ziwei Liu. Class-incremental learning: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):9851–9873, 2024.

Supplementary Material

A. Evaluation Metrics

In this section, we provide a...