pith. sign in

arxiv: 2605.14938 · v1 · pith:5PSRNSJOnew · submitted 2026-05-14 · 💻 cs.LG · cs.CV

Octopus: History-Free Gradient Orthogonalization for Continual Learning in Multimodal Large Language Models

Pith reviewed 2026-06-30 20:54 UTC · model grok-4.3

classification 💻 cs.LG cs.CV
keywords continual learningmultimodal large language modelsgradient orthogonalizationcatastrophic forgettinghistory-freetwo-stage finetuningplasticity stability
0
0 comments X

The pith

Octopus achieves continual learning in multimodal LLMs by enforcing gradient orthogonality without historical data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Octopus as a framework for sequential learning in multimodal large language models that avoids the need to store past task data. It relies on history-free gradient orthogonalization to keep updates from interfering with previously learned knowledge. The method uses a two-stage process to first adapt to new tasks and then apply regularization, aiming to maintain both the ability to learn new information and retain old skills. This approach is shown to outperform earlier methods on the UCIT benchmark.

Core claim

Octopus establishes that a two-stage finetuning strategy based on History-Free Gradient Orthogonalization (HiFGO) can enforce gradient-level orthogonality without historical task data, decoupling task adaptation from regularization to achieve a balance between plasticity and stability in multimodal large language models, leading to state-of-the-art performance on UCIT with improvements of 2.14% in average and 6.82% in last metrics.

What carries the argument

History-Free Gradient Orthogonalization (HiFGO), a mechanism that makes current task gradients orthogonal to those of previous tasks without accessing stored data.

Load-bearing premise

Enforcing gradient-level orthogonality without any historical task data is sufficient to prevent parameter interference and achieve stable balance between plasticity and stability.

What would settle it

A demonstration that models trained with HiFGO still suffer substantial catastrophic forgetting on earlier tasks despite orthogonal gradients.

Figures

Figures reproduced from arXiv: 2605.14938 by Chao Ma, Shanyan Guan, Weijia Zhang, Wei Li, Xuanming Shang, Yanhao Ge, Yuehao Liu.

Figure 1
Figure 1. Figure 1: Performance comparison between Octopus (ours) and [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overall pipeline for continual learning in MLLMs ( [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Effectiveness of HiFGO. We present the performance [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Backward transfer (BTW) of model after fine-tuning on [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Average performance after fine-tuning on each task un [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Finetuning curve comparison between w/ two-stage finetuning and w/o two-stage finetuning on VizWiz-Caption (a), IconQA (b) and CLEVR-Math (c). SPICE SPICE was proposed in [2] as a scene￾graph–based semantic evaluation metric that assesses the degree of semantic alignment between generated and refer￾ence descriptions by comparing their structured representa￾tions of objects, attributes, and relations. The c… view at source ↗
Figure 7
Figure 7. Figure 7: ∆θ T s→tHS∆θs→t Comparison for Seq. FT and Octopus 𝜃𝐴 ∗ 𝜃B ∗ 𝜃 ∗ 𝑔𝐺𝑃𝑊𝐶 𝑂0 𝑤1 𝑤2 ∇𝐿𝑐𝑒 ∇𝐿orth ⊥ 𝑔𝐺𝑃𝑊𝐶 𝜃𝐵 [PITH_FULL_IMAGE:figures/full_fig_p014_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: A toy example for GPWC. which is a weighted combination with the eigenvalues {λi} of HA as weights. This indicates that GPWC primarily cap￾tures the projection of v onto the high-curvature directions of HA. These directions correspond to those along which updating θ ∗ A under the data distribution of T ask B is more likely to degrade the performance on T ask A. Therefore, constraining the parameter update … view at source ↗
Figure 9
Figure 9. Figure 9: Sensitivity analysis of λ1 and λ2 formance on TA deteriorates significantly. Now consider the case where the GPWC constraint is introduced. When the model parameters move toward a point θB, the GPWC direction is defined as the direction from θ ∗ A to θ ∗ B. No￾tably, this direction also corresponds to the direction along which the loss on TA increases during optimization for TB. Therefore, the gradient dir… view at source ↗
Figure 10
Figure 10. Figure 10: Instance-wise comparison between Octopus (ours) and [PITH_FULL_IMAGE:figures/full_fig_p015_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Comparison of the additional number of parameters [PITH_FULL_IMAGE:figures/full_fig_p016_11.png] view at source ↗
read the original abstract

Continual learning in multimodal large language models (MLLMs) aims to sequentially acquire knowledge while mitigating catastrophic forgetting, yet existing methods face inherent limitations: architecture-based approaches incur additional computational overhead and often generalize poorly to new tasks, rehearsal-based methods rely on storing historical data, raising privacy and storage concerns, and conventional regularization-based strategies alone are insufficient to fully prevent parameter interference. We propose Octopus, a two-stage continual learning framework based on History-Free Gradient Orthogonalization (HiFGO), which enforces gradient-level orthogonality without historical task data. Our proposed two-stage finetuning strategy decouples task adaptation from regularization, achieving a principled balance between plasticity and stability. Experiments on UCIT show that Octopus establishes state-of-the-art performance, surpassing prior SOTA by 2.14% and 6.82% in terms of Avg and Last.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Octopus, a two-stage continual learning framework for multimodal large language models (MLLMs) based on History-Free Gradient Orthogonalization (HiFGO). HiFGO is claimed to enforce gradient-level orthogonality to prior tasks without access to historical data or gradients. The two-stage finetuning decouples task adaptation from regularization to balance plasticity and stability. Experiments on the UCIT benchmark report state-of-the-art results, surpassing prior SOTA by 2.14% in Avg and 6.82% in Last metrics.

Significance. If the HiFGO mechanism can be shown to achieve genuine orthogonality to unseen prior task gradients using only current-task information and if the reported gains are causally linked to this mechanism rather than other factors, the work would meaningfully advance continual learning for MLLMs by eliminating rehearsal-based storage/privacy issues and architecture overhead. The explicit two-stage decoupling is a constructive design choice that could be adopted more broadly.

major comments (2)
  1. [Method (HiFGO definition)] Method section describing HiFGO: the central claim that gradient orthogonality to all prior tasks can be enforced history-free must be supported by an explicit formulation (e.g., how current-batch statistics or parameter constraints serve as a proxy for past gradient directions) together with either a proof of orthogonality preservation or an empirical verification measuring angles against held-out past gradients; without this, the reported interference mitigation cannot be attributed to the stated mechanism.
  2. [Experiments (UCIT evaluation)] Experiments section on UCIT results: the SOTA claim (2.14% Avg, 6.82% Last gains) is presented without error bars, number of random seeds, ablation studies isolating HiFGO from the two-stage strategy, or full implementation details (hyperparameters, exact architecture modifications), rendering it impossible to verify reproducibility or rule out confounding factors.
minor comments (1)
  1. [Abstract] Abstract: the UCIT benchmark is referenced without a one-sentence description of its task sequence or modalities, which would aid readers unfamiliar with the continual-learning MLLM literature.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of HiFGO and strengthen the experimental reporting. We address each point below and will incorporate the suggested additions in the revised manuscript.

read point-by-point responses
  1. Referee: [Method (HiFGO definition)] Method section describing HiFGO: the central claim that gradient orthogonality to all prior tasks can be enforced history-free must be supported by an explicit formulation (e.g., how current-batch statistics or parameter constraints serve as a proxy for past gradient directions) together with either a proof of orthogonality preservation or an empirical verification measuring angles against held-out past gradients; without this, the reported interference mitigation cannot be attributed to the stated mechanism.

    Authors: We agree that an explicit formulation and supporting verification are necessary to substantiate the history-free orthogonality claim. The current manuscript presents the HiFGO update rule in Section 3.2, but we will expand this in revision to include: (i) the precise mathematical definition showing how per-batch gradient statistics and a projection operator serve as a proxy without storing past gradients, (ii) a short derivation establishing that the resulting update direction is orthogonal to the span of prior-task gradients under the stated assumptions, and (iii) an empirical verification subsection that reports cosine angles between the HiFGO-adjusted gradients and held-out past-task gradients on UCIT sequences. These additions will directly link the observed interference reduction to the mechanism. revision: yes

  2. Referee: [Experiments (UCIT evaluation)] Experiments section on UCIT results: the SOTA claim (2.14% Avg, 6.82% Last gains) is presented without error bars, number of random seeds, ablation studies isolating HiFGO from the two-stage strategy, or full implementation details (hyperparameters, exact architecture modifications), rendering it impossible to verify reproducibility or rule out confounding factors.

    Authors: We acknowledge that the current experimental section lacks sufficient statistical detail and controls. In the revised version we will: (i) report means and standard deviations over at least three random seeds for all UCIT metrics, (ii) add an ablation table that isolates HiFGO from the two-stage finetuning schedule, and (iii) include a dedicated reproducibility subsection (or appendix) listing all hyperparameters, optimizer settings, exact model modifications, and training schedules. These changes will allow readers to reproduce the results and assess the contribution of each component. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper proposes a two-stage continual learning framework (Octopus) based on HiFGO for enforcing gradient orthogonality without historical data. No equations, derivations, or self-citations appear in the provided abstract or text that reduce the claimed SOTA gains (2.14% Avg, 6.82% Last on UCIT) to a fitted parameter or input quantity defined from the same data by construction. The performance claims rest on experimental benchmarks rather than any self-definitional or load-bearing self-citation step. The method is presented as an empirical engineering contribution whose validity is tested externally, making the derivation chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

The abstract introduces HiFGO as a new mechanism whose internal construction is not detailed; no free parameters or additional axioms are explicitly listed.

invented entities (1)
  • HiFGO no independent evidence
    purpose: Enforce gradient orthogonality without historical task data
    New named technique introduced to solve the stated limitations of prior continual learning approaches.

pith-pipeline@v0.9.1-grok · 5697 in / 1135 out tokens · 25869 ms · 2026-06-30T20:54:18.362675+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

74 extracted references · 18 canonical work pages · 7 internal anchors

  1. [1]

    Ss-il: Separated softmax for incremental learning

    Hongjoon Ahn, Jihwan Kwak, Subin Lim, Hyeonsu Bang, Hyojun Kim, and Taesup Moon. Ss-il: Separated softmax for incremental learning. InICCV, 2021. 1

  2. [2]

    Spice: Semantic propositional image cap- tion evaluation

    Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. Spice: Semantic propositional image cap- tion evaluation. InECCV, 2016. 1, 2

  3. [3]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 2

  4. [4]

    Dark experience for general continual learning: a strong, simple baseline.NeurIPS, 2020

    Pietro Buzzega, Matteo Boschini, Angelo Porrello, Davide Abati, and Simone Calderara. Dark experience for general continual learning: a strong, simple baseline.NeurIPS, 2020. 2

  5. [5]

    Continual llava: Continual instruction tuning in large vision-language models.arXiv preprint arXiv:2411.02564, 2024

    Meng Cao, Yuyang Liu, Yingfei Liu, Tiancai Wang, Ji- ahua Dong, Henghui Ding, Xiangyu Zhang, Ian Reid, and Xiaodan Liang. Continual llava: Continual instruction tuning in large vision-language models.arXiv preprint arXiv:2411.02564, 2024. 1

  6. [6]

    Multitask learning.Machine learning, 1997

    Rich Caruana. Multitask learning.Machine learning, 1997. 2

  7. [7]

    Continual learning with tiny episodic memories

    Arslan Chaudhry, Marcus Rohrbach, Mohamed Elhoseiny, Thalaiyasingam Ajanthan, P Dokania, P Torr, and M Ran- zato. Continual learning with tiny episodic memories. In Workshop on Multi-Task and Lifelong Reinforcement Learn- ing, 2019. 1

  8. [8]

    Coin: A benchmark of continual instruction tuning for multimodel large language models

    Cheng Chen, Junchen Zhu, Xu Luo, Heng T Shen, Jingkuan Song, and Lianli Gao. Coin: A benchmark of continual instruction tuning for multimodel large language models. NeurIPS, 2024. 1, 5, 7, 2

  9. [9]

    Adaptformer: Adapting vision transformers for scalable visual recognition

    Shoufa Chen, Chongjian Ge, Zhan Tong, Jiangliu Wang, Yibing Song, Jue Wang, and Ping Luo. Adaptformer: Adapting vision transformers for scalable visual recognition. NeurIPS, 2022. 2

  10. [10]

    Multi-task learning with deep neural networks: A survey.arXiv preprint arXiv:2009.09796, 2020

    Michael Crawshaw. Multi-task learning with deep neural networks: A survey.arXiv preprint arXiv:2009.09796, 2020. 2

  11. [11]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. InCVPR, 2009. 1

  12. [12]

    Meteor universal: Lan- guage specific translation evaluation for any target language

    Michael Denkowski and Alon Lavie. Meteor universal: Lan- guage specific translation evaluation for any target language. InProceedings of the ninth workshop on statistical machine translation, 2014. 7, 1

  13. [13]

    Podnet: Pooled outputs distil- lation for small-tasks incremental learning

    Arthur Douillard, Matthieu Cord, Charles Ollion, Thomas Robert, and Eduardo Valle. Podnet: Pooled outputs distil- lation for small-tasks incremental learning. InECCV, 2020. 1

  14. [14]

    Orthogonal gradient descent for continual learning

    Mehrdad Farajtabar, Navid Azizan, Alex Mott, and Ang Li. Orthogonal gradient descent for continual learning. InInter- national conference on artificial intelligence and statistics,

  15. [15]

    Catastrophic forgetting in connectionist networks.Trends in cognitive sciences, 1999

    Robert M French. Catastrophic forgetting in connectionist networks.Trends in cognitive sciences, 1999. 1, 3

  16. [16]

    Cyclical Annealing Schedule: A Simple Approach to Mitigating KL Vanishing

    Hao Fu, Chunyuan Li, Xiaodong Liu, Jianfeng Gao, Asli Ce- likyilmaz, and Lawrence Carin. Cyclical annealing schedule: A simple approach to mitigating kl vanishing.arXiv preprint arXiv:1903.10145, 2019. 6

  17. [17]

    ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

    Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Dan Zhang, Diego Rojas, Guanyu Feng, Hanlin Zhao, et al. Chatglm: A family of large language models from glm-130b to glm-4 all tools.arXiv preprint arXiv:2406.12793, 2024. 2

  18. [18]

    Making the v in vqa matter: Elevating the role of image understanding in visual question answer- ing

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Ba- tra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answer- ing. InCVPR, 2017. 1

  19. [19]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Ab- hinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024. 2

  20. [20]

    Hide- llava: Hierarchical decoupling for continual instruction tun- ing of multimodal large language model.arXiv preprint arXiv:2503.12941, 2025

    Haiyang Guo, Fanhu Zeng, Ziwei Xiang, Fei Zhu, Da- Han Wang, Xu-Yao Zhang, and Cheng-Lin Liu. Hide- llava: Hierarchical decoupling for continual instruction tun- ing of multimodal large language model.arXiv preprint arXiv:2503.12941, 2025. 1, 2, 5, 6, 7, 4

  21. [21]

    Vizwiz grand challenge: Answering visual questions from blind people

    Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. Vizwiz grand challenge: Answering visual questions from blind people. InCVPR, 2018. 1

  22. [22]

    The many faces of robust- ness: A critical analysis of out-of-distribution generalization

    Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kada- vath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robust- ness: A critical analysis of out-of-distribution generalization. InICCV, 2021. 1

  23. [23]

    Parameter-efficient transfer learning for nlp

    Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. InICML, 2019. 2

  24. [24]

    Lora: Low-rank adaptation of large language models.ICLR,

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR,

  25. [25]

    Distilling causal effect of data in class- incremental learning

    Xinting Hu, Kaihua Tang, Chunyan Miao, Xian-Sheng Hua, and Hanwang Zhang. Distilling causal effect of data in class- incremental learning. InCVPR, 2021. 1

  26. [26]

    Cl-moe: Enhancing multi- modal large language model with dual momentum mixture- of-experts for continual visual question answering

    Tianyu Huai, Jie Zhou, Xingjiao Wu, Qin Chen, Qingchun Bai, Ze Zhou, and Liang He. Cl-moe: Enhancing multi- modal large language model with dual momentum mixture- of-experts for continual visual question answering. InCVPR,

  27. [27]

    Gqa: A new dataset for real-world visual reasoning and compositional question answering

    Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. InCVPR, 2019. 1

  28. [28]

    Vi- sual prompt tuning

    Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Vi- sual prompt tuning. InECCV, 2022. 2

  29. [29]

    Hierarchical-task-aware multi-modal mix- ture of incremental lora experts for embodied continual learning.arXiv preprint arXiv:2506.04595, 2025

    Ziqi Jia, Anmin Wang, Xiaoyang Qu, Xiaowen Yang, and Jianzong Wang. Hierarchical-task-aware multi-modal mix- ture of incremental lora experts for embodied continual learning.arXiv preprint arXiv:2506.04595, 2025. 2

  30. [30]

    Referitgame: Referring to objects in pho- tographs of natural scenes

    Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in pho- tographs of natural scenes. InEMNLP, 2014. 1

  31. [31]

    Overcoming catastrophic forgetting in neu- ral networks.Proceedings of the national academy of sci- ences, 2017

    James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska- Barwinska, et al. Overcoming catastrophic forgetting in neu- ral networks.Proceedings of the national academy of sci- ences, 2017. 2, 5, 7

  32. [32]

    The Power of Scale for Parameter-Efficient Prompt Tuning

    Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning.arXiv preprint arXiv:2104.08691, 2021. 2

  33. [33]

    Multimodal arxiv: A dataset for improving scientific comprehension of large vision-language models.arXiv preprint arXiv:2403.00231, 2024

    Lei Li, Yuqi Wang, Runxin Xu, Peiyi Wang, Xiachong Feng, Lingpeng Kong, and Qi Liu. Multimodal arxiv: A dataset for improving scientific comprehension of large vision-language models.arXiv preprint arXiv:2403.00231, 2024. 1

  34. [34]

    Prefix-Tuning: Optimizing Continuous Prompts for Generation

    Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimiz- ing continuous prompts for generation.arXiv preprint arXiv:2101.00190, 2021. 2

  35. [35]

    Learning without forgetting

    Zhizhong Li and Derek Hoiem. Learning without forgetting. TPAMI, 2017. 2, 5, 7

  36. [36]

    Inflora: Interference-free low-rank adaptation for continual learning

    Yan-Shuo Liang and Wu-Jun Li. Inflora: Interference-free low-rank adaptation for continual learning. InCVPR, 2024. 2

  37. [37]

    Rouge: A package for automatic evaluation of summaries

    Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. InText summarization branches out, pages 74–81, 2004. 7, 1

  38. [38]

    Clevr-math: A dataset for compositional lan- guage, visual and mathematical reasoning.arXiv preprint arXiv:2208.05358, 2022

    Adam Dahlgren Lindstr ¨om and Savitha Sam Abra- ham. Clevr-math: A dataset for compositional lan- guage, visual and mathematical reasoning.arXiv preprint arXiv:2208.05358, 2022. 1

  39. [39]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. InCVPR,

  40. [40]

    Gradient episodic memory for continual learning.NeurIPS, 2017

    David Lopez-Paz and Marc’Aurelio Ranzato. Gradient episodic memory for continual learning.NeurIPS, 2017. 2, 7

  41. [41]

    Iconqa: A new benchmark for abstract diagram under- standing and visual language reasoning.arXiv preprint arXiv:2110.13214, 2021

    Pan Lu, Liang Qiu, Jiaqi Chen, Tony Xia, Yizhou Zhao, Wei Zhang, Zhou Yu, Xiaodan Liang, and Song-Chun Zhu. Iconqa: A new benchmark for abstract diagram under- standing and visual language reasoning.arXiv preprint arXiv:2110.13214, 2021. 1

  42. [42]

    Learn to explain: Multimodal reasoning via thought chains for science question answering.NeurIPS,

    Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering.NeurIPS,

  43. [43]

    MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathemat- ical reasoning of foundation models in visual contexts.arXiv preprint arXiv:2310.02255, 2023. 2

  44. [44]

    Piggy- back: Adapting a single network to multiple tasks by learn- ing to mask weights

    Arun Mallya, Dillon Davis, and Svetlana Lazebnik. Piggy- back: Adapting a single network to multiple tasks by learn- ing to mask weights. InECCV, 2018. 2

  45. [45]

    Generation and comprehension of unambiguous object descriptions

    Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. In CVPR, 2016. 1

  46. [46]

    Catastrophic inter- ference in connectionist networks: The sequential learning problem

    Michael McCloskey and Neal J Cohen. Catastrophic inter- ference in connectionist networks: The sequential learning problem. InPsychology of learning and motivation. 1989. 1, 3

  47. [47]

    Ocr-vqa: Visual question answering by reading text in images

    Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. Ocr-vqa: Visual question answering by reading text in images. InICDAR, 2019. 1

  48. [48]

    Bleu: a method for automatic evaluation of machine translation

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. InProceedings of the 40th annual meeting of the Association for Computational Linguistics, 2002. 7, 1

  49. [49]

    Continual lifelong learning with neural networks: A review.Neural networks, 2019

    German I Parisi, Ronald Kemker, Jose L Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review.Neural networks, 2019. 1

  50. [50]

    Flickr30k entities: Collecting region-to-phrase corre- spondences for richer image-to-sentence models

    Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazeb- nik. Flickr30k entities: Collecting region-to-phrase corre- spondences for richer image-to-sentence models. InICCV,

  51. [51]

    Language models are unsu- pervised multitask learners.OpenAI blog, 2019

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsu- pervised multitask learners.OpenAI blog, 2019. 2

  52. [52]

    Correlated low-rank adaptation for convnets

    Wu Ran, Weijia Zhang, ShuYang Pang, Qi Zhu, Jinfan Liu, JingSheng Liu, Xin Cao, Qiang Li, Yichao Yan, and Chao Ma. Correlated low-rank adaptation for convnets. In NeurIPS, 2025. 2

  53. [53]

    Experience replay for continual learning

    David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy Lil- licrap, and Gregory Wayne. Experience replay for continual learning. InNeurIPS, 2019. 1

  54. [54]

    Continual learning with deep generative replay.NeurIPS,

    Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual learning with deep generative replay.NeurIPS,

  55. [55]

    Towards vqa models that can read

    Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. InCVPR,

  56. [56]

    Coda-prompt: Contin- ual decomposed attention-based prompting for rehearsal-free continual learning

    James Seale Smith, Leonid Karlinsky, Vyshnavi Gutta, Paola Cascante-Bonilla, Donghyun Kim, Assaf Arbelle, Rameswar Panda, Rogerio Feris, and Zsolt Kira. Coda-prompt: Contin- ual decomposed attention-based prompting for rehearsal-free continual learning. InCVPR, 2023. 2

  57. [57]

    Adaptive memory replay for continual learning

    James Seale Smith, Lazar Valkov, Shaunak Halbe, Vysh- navi Gutta, Rogerio Feris, Zsolt Kira, and Leonid Karlinsky. Adaptive memory replay for continual learning. InCVPR,

  58. [58]

    Improving multimodal large lan- guage models using continual learning.arXiv preprint arXiv:2410.19925, 2024

    Shikhar Srivastava, Md Yousuf Harun, Robik Shrestha, and Christopher Kanan. Improving multimodal large lan- guage models using continual learning.arXiv preprint arXiv:2410.19925, 2024. 1

  59. [59]

    Model merging with svd to tie the knots.arXiv preprint arXiv:2410.19735, 2024

    George Stoica, Pratik Ramesh, Boglarka Ecsedi, Leshem Choshen, and Judy Hoffman. Model merging with svd to tie the knots.arXiv preprint arXiv:2410.19735, 2024. 2

  60. [60]

    Cider: Consensus-based image description evalua- tion

    Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evalua- tion. InCVPR, pages 4566–4575, 2015. 7, 1

  61. [61]

    Meet jeanie: a similarity measure for 3d skeleton sequences via temporal-viewpoint alignment.IJCV, 2024

    Lei Wang, Jun Liu, Liang Zheng, Tom Gedeon, and Piotr Koniusz. Meet jeanie: a similarity measure for 3d skeleton sequences via temporal-viewpoint alignment.IJCV, 2024. 1

  62. [62]

    A comprehensive survey of continual learning: Theory, method and application.TPAMI, 2024

    Liyuan Wang, Xingxing Zhang, Hang Su, and Jun Zhu. A comprehensive survey of continual learning: Theory, method and application.TPAMI, 2024. 1

  63. [63]

    Or- thogonal subspace learning for language model continual learning

    Xiao Wang, Tianze Chen, Qiming Ge, Han Xia, Rong Bao, Rui Zheng, Qi Zhang, Tao Gui, and Xuan-Jing Huang. Or- thogonal subspace learning for language model continual learning. InFindings of the Association for Computational Linguistics: EMNLP, 2023. 1, 2, 4, 5, 7

  64. [64]

    Dualprompt: Complementary prompting for rehearsal-free continual learning

    Zifeng Wang, Zizhao Zhang, Sayna Ebrahimi, Ruoxi Sun, Han Zhang, Chen-Yu Lee, Xiaoqi Ren, Guolong Su, Vin- cent Perot, Jennifer Dy, et al. Dualprompt: Complementary prompting for rehearsal-free continual learning. InECCV,

  65. [65]

    Learning to prompt for continual learning

    Zifeng Wang, Zizhao Zhang, Chen-Yu Lee, Han Zhang, Ruoxi Sun, Xiaoqi Ren, Guolong Su, Vincent Perot, Jen- nifer Dy, and Tomas Pfister. Learning to prompt for continual learning. InCVPR, 2022. 1, 2, 5

  66. [66]

    Large scale in- cremental learning

    Yue Wu, Yinpeng Chen, Lijuan Wang, Yuancheng Ye, Zicheng Liu, Yandong Guo, and Yun Fu. Large scale in- cremental learning. InCVPR, 2019. 1

  67. [67]

    Restricted orthogonal gradient projection for continual learning.AI Open, 2023

    Zeyuan Yang, Zonghan Yang, Yichen Liu, Peng Li, and Yang Liu. Restricted orthogonal gradient projection for continual learning.AI Open, 2023. 2

  68. [68]

    A survey on multimodal large language models.National Science Review, 2024

    Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models.National Science Review, 2024. 1

  69. [69]

    An overview of multi-task learn- ing.National Science Review, 2018

    Yu Zhang and Qiang Yang. An overview of multi-task learn- ing.National Science Review, 2018. 2

  70. [70]

    A survey on multi-task learn- ing.IEEE transactions on knowledge and data engineering,

    Yu Zhang and Qiang Yang. A survey on multi-task learn- ing.IEEE transactions on knowledge and data engineering,

  71. [71]

    Chartedit: How far are mllms from automating chart analysis? eval- uating mllms’ capability via chart editing.arXiv preprint arXiv:2505.11935, 2025

    Xuanle Zhao, Xuexin Liu, Haoyue Yang, Xianzhen Luo, Fanhu Zeng, Jianling Li, Qi Shi, and Chi Chen. Chartedit: How far are mllms from automating chart analysis? eval- uating mllms’ capability via chart editing.arXiv preprint arXiv:2505.11935, 2025. 2

  72. [72]

    Rethinking gradient projection continual learning: Stability/plasticity feature space decou- pling

    Zhen Zhao, Zhizhong Zhang, Xin Tan, Jun Liu, Yanyun Qu, Yuan Xie, and Lizhuang Ma. Rethinking gradient projection continual learning: Stability/plasticity feature space decou- pling. InCVPR, 2023. 4

  73. [73]

    Class-incremental learning: A survey.TPAMI, 2024

    Da-Wei Zhou, Qi-Wei Wang, Zhi-Hong Qi, Han-Jia Ye, De- Chuan Zhan, and Ziwei Liu. Class-incremental learning: A survey.TPAMI, 2024. 1

  74. [74]

    source→target

    Hao Zhu, Yifei Zhang, Junhao Dong, and Piotr Koniusz. Bilora: Almost-orthogonal parameter spaces for continual learning. InCVPR, 2025. 1, 2 Octopus: History-Free Gradient Orthogonalization for Continual Learning in Multimodal Large Language Models Supplementary Material This material provides supplementary information on the proposed Octopus framework. It...