Is Prompt Selection Necessary for Task-Free Online Continual Learning?
Pith reviewed 2026-05-10 19:33 UTC · model grok-4.3
The pith
Prompt selection from a pool is not necessary for state-of-the-art task-free online continual learning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors claim that prompt selection strategies in task-free online continual learning frequently fail to pick suitable prompts, which leads to suboptimal performance. They propose SinglePrompt, which (i) injects a single prompt into each self-attention block, (ii) computes logits with a cosine similarity-based design to reduce forgetting in the classifier weights, and (iii) masks logits for classes not exposed in the current minibatch. They report that this method achieves state-of-the-art performance across various online continual learning benchmarks without task boundaries and with a single pass over the data.
What carries the argument
The SinglePrompt mechanism carries the argument. It injects one shared prompt into each self-attention block of a transformer, computes classification logits with cosine similarity instead of a dot product, and masks the logits of unseen classes. This focuses optimization on the classifier and avoids the pitfalls of adaptive selection.
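The classifier side of this mechanism can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names, the `scale` temperature, and the toy vectors are assumptions introduced here, and the exposed-class set is assumed to be the set of labels present in the current minibatch.

```python
import math

def cosine_logits(feature, class_weights, scale=10.0):
    """Return scale * cos(feature, w_c) for each class weight vector w_c.

    `scale` is a hypothetical temperature; the page does not give its value.
    """
    f_norm = math.sqrt(sum(x * x for x in feature)) or 1.0
    logits = []
    for w in class_weights:
        w_norm = math.sqrt(sum(x * x for x in w)) or 1.0
        dot = sum(a * b for a, b in zip(feature, w))
        logits.append(scale * dot / (f_norm * w_norm))
    return logits

def mask_unexposed(logits, exposed_classes):
    """Set the logit of every class absent from the minibatch to -inf."""
    return [z if c in exposed_classes else -math.inf
            for c, z in enumerate(logits)]

def cross_entropy(logits, label):
    """Softmax cross-entropy; -inf logits receive zero probability."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

# Toy minibatch (made-up numbers): two samples, three classes, class 1 unseen.
features = [[1.0, 0.0], [0.0, 2.0]]
labels = [0, 2]
weights = [[3.0, 0.0], [5.0, 5.0], [0.0, 0.1]]  # very different norms
exposed = set(labels)  # assumption: exposed = labels in this minibatch

losses = [
    cross_entropy(mask_unexposed(cosine_logits(f, weights), exposed), y)
    for f, y in zip(features, labels)
]
```

Because each logit depends only on the angle between the feature and the class weight, the large-norm weight for class 1 gains no advantage, and masking keeps that unseen class out of the softmax for this minibatch.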
Load-bearing premise
The argument assumes that the failures of prompt selection are general across methods and datasets rather than specific to the implementations tested, and that the single-prompt design with cosine logits and masking is broadly sufficient without any form of task information.
What would settle it
An experiment showing that an improved prompt selection strategy achieves higher accuracy than SinglePrompt on the same continual learning benchmarks, or ablation studies where removing the cosine design or masking causes SinglePrompt to underperform significantly.
Original abstract
Task-free online continual learning has recently emerged as a realistic paradigm for addressing continual learning in dynamic, real-world environments, where data arrive in a non-stationary stream without clear task boundaries and can only be observed once. To consider such challenging scenarios, many recent approaches have employed prompt selection, an adaptive strategy that selects prompts from a pool based on input signals. However, we observe that such selection strategies often fail to select appropriate prompts, yielding suboptimal results despite additional training of key parameters. Motivated by this observation, we propose a simple yet effective SinglePrompt that eliminates the need for prompt selection and focuses on classifier optimization. Specifically, we simply (i) inject a single prompt into each self-attention block, (ii) employ a cosine similarity-based logit design to alleviate the forgetting effect inherent in the classifier weights, and (iii) mask logits for unexposed classes in the current minibatch. With this simple task-free design, our framework achieves state-of-the-art performance across various online continual learning benchmarks. Source code is available at https://github.com/efficient-learning-lab/SinglePrompt.
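Component (i) can be illustrated with a deliberately simplified sketch. The abstract does not say how the prompt enters the attention block, so the prefix-style injection below (prompt vectors prepended to keys and values) is an assumption, as are the identity Q/K/V projections, the omission of residual connections, and every name and number in the code. What it shows is that each block owns one fixed prompt, applied to every input with no pool and no selection step.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_with_prompt(tokens, prompt_k, prompt_v):
    """Single-head attention with prompt vectors prepended to keys/values.

    Identity projections stand in for the learned Q, K, V maps (assumption).
    """
    keys = prompt_k + tokens
    values = prompt_v + tokens
    dim = len(tokens[0])
    outputs = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in keys]
        attn = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(attn, values))
                        for j in range(dim)])
    return outputs

def forward_single_prompt(tokens, prompts):
    """Apply one fixed (prompt_k, prompt_v) pair per block, in order."""
    for prompt_k, prompt_v in prompts:
        tokens = attention_with_prompt(tokens, prompt_k, prompt_v)
    return tokens

# Toy input: 3 tokens of width 2, two blocks, prompt length 1 per block.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
prompts = [([[0.5, -0.5]], [[2.0, 2.0]]),
           ([[0.1, 0.1]], [[-1.0, 1.0]])]
out = forward_single_prompt(tokens, prompts)
```

The token count is unchanged because the prompt extends only the keys and values. A pool-based method would add a query-key matching step before each block to pick which prompt to use; here that step is absent.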
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper observes that prompt selection strategies in task-free online continual learning often fail to select appropriate prompts despite extra training of key parameters. It proposes a simple SinglePrompt framework that (i) injects a single prompt into each self-attention block, (ii) employs a cosine similarity-based logit design to alleviate forgetting in classifier weights, and (iii) masks logits for unexposed classes per minibatch. This task-free design (no boundaries or replay) is claimed to achieve state-of-the-art performance across standard online continual learning benchmarks, with public code provided.
Significance. If the empirical results hold under rigorous verification, the work demonstrates that complex prompt selection may be unnecessary, shifting focus to minimal classifier optimizations in non-stationary streams. The public repository strengthens the contribution by enabling direct reproducibility checks on the reported SOTA benchmarks.
major comments (1)
- [Abstract and §4 (Experiments)] The SOTA claim requires explicit reporting of all baselines, statistical significance tests, number of runs, and ablations on the three proposed components; without these, it is unclear whether the performance gains are attributable to the single-prompt design or to implementation details.
minor comments (2)
- [§3.2] The cosine logit formulation should include a short derivation or reference showing how it explicitly counters the bias in classifier weights compared to standard softmax.
- [Figure 2 or equivalent] Clarify the masking operation's effect on the loss computation to confirm it does not inadvertently use future class information.
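For context on the §3.2 comment, the textbook identity relating a dot-product logit to a cosine logit is sketched below. The symbols are notation assumed here and are not taken from the paper: $h$ is the feature vector, $w_c$ and $b_c$ are the weight and bias for class $c$, $\theta_c$ is the angle between $h$ and $w_c$, and $\tau$ is a scale factor.

```latex
\begin{align}
z_c^{\mathrm{dot}} &= w_c^\top h + b_c
  = \lVert w_c \rVert \, \lVert h \rVert \cos\theta_c + b_c, \\
z_c^{\mathrm{cos}} &= \tau \, \frac{w_c^\top h}{\lVert w_c \rVert \, \lVert h \rVert}
  = \tau \cos\theta_c .
\end{align}
```

In the dot-product form, the weight norm $\lVert w_c \rVert$ and the bias $b_c$ can raise or lower a class logit independently of how well the feature aligns with that class. The cosine form removes both terms, so only the angle matters. Whether this is the argument the paper itself makes is not stated on this page.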
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for minor revision. We address the major comment on empirical rigor and SOTA claims below, agreeing that additional details will strengthen the manuscript.
Point-by-point responses
- Referee: [Abstract and §4 (Experiments)] The SOTA claim requires explicit reporting of all baselines, statistical significance tests, number of runs, and ablations on the three proposed components; without these, it is unclear whether the performance gains are attributable to the single-prompt design or to implementation details.
Authors: We agree that rigorous SOTA claims benefit from explicit and comprehensive reporting. In the revised manuscript, we will expand the experiments section (§4) and update the abstract as needed to: (i) explicitly list all baselines with their original citations and any adaptations for the task-free online continual learning setting; (ii) report all results as mean ± standard deviation over a specified number of independent runs (we will use 5 runs for consistency with common practice in the field); (iii) include statistical significance tests (e.g., paired t-tests or Wilcoxon signed-rank tests with p-values) between SinglePrompt and the strongest baselines; and (iv) add a dedicated ablation study isolating the contribution of each of the three components (single prompt per attention block, cosine similarity-based logit design, and per-minibatch logit masking for unexposed classes), along with their cumulative effects. These changes will clarify that performance improvements arise from the proposed design choices rather than implementation artifacts. The publicly available code repository already supports full reproduction of the reported results, which can serve as an immediate verification aid.
Revision: yes
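The reporting described in points (ii) and (iii) can be sketched with the Python standard library. The accuracy lists below are placeholder numbers invented for illustration and are not results from the paper; the function names are likewise hypothetical. The sketch computes mean ± sample standard deviation over five runs and the paired t statistic on per-run differences.

```python
import math
import statistics

def mean_std(values):
    """Mean and sample standard deviation (n - 1 denominator)."""
    return statistics.mean(values), statistics.stdev(values)

def paired_t_statistic(a, b):
    """t statistic for paired samples: mean(d) / (stdev(d) / sqrt(n))."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

# Placeholder accuracies over 5 seeds: NOT taken from the paper.
method_acc = [70.0, 71.0, 69.0, 72.0, 70.0]
baseline_acc = [68.0, 70.0, 68.0, 69.0, 69.0]

m_mean, m_std = mean_std(method_acc)
b_mean, b_std = mean_std(baseline_acc)
t_stat = paired_t_statistic(method_acc, baseline_acc)

print(f"method:   {m_mean:.2f} ± {m_std:.2f}")
print(f"baseline: {b_mean:.2f} ± {b_std:.2f}")
print(f"paired t statistic (df = {len(method_acc) - 1}): {t_stat:.2f}")
```

Turning the statistic into a p-value requires the t distribution with n − 1 degrees of freedom, which the standard library does not provide; `scipy.stats.ttest_rel` and `scipy.stats.wilcoxon` are common choices for that step. Pairing assumes both methods were run with the same seeds and the same stream order.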
Circularity Check
No significant circularity
Full rationale
The paper's chain consists of an empirical observation (prompt selection often fails in task-free online CL) followed by a direct proposal of three concrete architectural choices (single prompt per self-attention block, cosine-similarity logits, per-minibatch logit masking) that are motivated by that observation and then validated on public benchmarks. No equations are presented whose outputs are defined in terms of their own inputs; no parameter is fitted on a subset and then relabeled as a prediction; no uniqueness theorem or ansatz is imported via self-citation to force the design; and the central claim (SOTA performance with a strictly task-free method) remains externally falsifiable via the linked repository. The construction is therefore self-contained and does not reduce to its own premises by definition.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: employ a cosine similarity-based logit design to alleviate the forgetting effect inherent in the classifier weights
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean: LogicNat_induction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: inject a single prompt into each self-attention block
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.