arxiv: 2605.16353 · v2 · pith:C4RCUVGR · submitted 2026-05-08 · 💻 cs.CV · cs.AI

StrLoRA: Towards Streaming Continual Visual Instruction Tuning for MLLMs

Pith reviewed 2026-05-20 23:27 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords: streaming continual learning, visual instruction tuning, multimodal large language models, expert routing, LoRA, catastrophic forgetting, task-aware adaptation, non-stationary data streams

The pith

A two-stage expert routing system lets multimodal models learn from mixed-task visual data streams while reducing forgetting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out a realistic Streaming Continual Visual Instruction Tuning setting in which data reaches the model as successive chunks that each contain an unknown mixture of evolving tasks. Prior continual-tuning methods cannot cope because they lack ways to separate those mixed tasks inside a single chunk. StrLoRA therefore introduces a regularized two-stage routing process: textual instructions first activate only a sparse set of relevant experts, after which cross-modal attention computes token-level weights that blend the chosen experts with local visual features. A stability regularizer keeps the routing distribution close to its historical average across the non-stationary stream. A reader would care because the approach moves continual learning from artificial single-task increments toward the interleaved, ever-changing data flows that real systems actually encounter.

Core claim

In the StrCVIT setting, where each data chunk mixes multiple dynamically changing tasks, StrLoRA performs task-aware expert selection by using the textual instruction to activate a sparse subset of experts, followed by token-wise expert weighting computed through cross-modal attention between local visual tokens and the global instruction representation. Routing-stability regularization then aligns the current routing distribution with a historical exponential moving average to preserve consistency across the stream. Experiments on a newly constructed StrCVIT benchmark show that this framework substantially outperforms existing continual visual instruction tuning methods.

What carries the argument

StrLoRA's two-stage expert routing: instruction-based sparse selection of experts, followed by token-wise cross-modal attention weighting, plus routing-stability regularization against a historical moving average.
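
The two routing stages can be illustrated with a minimal sketch. It is not the authors' implementation: the expert keys, the dot-product scoring, and the elementwise product used as a stand-in for cross-modal attention are all assumptions made for illustration, and the names `select_experts` and `token_expert_weights` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def select_experts(instruction_vec, expert_keys, top_k):
    """Stage 1 (assumed form): keep the top_k experts whose keys best match
    the instruction representation."""
    scores = [dot(instruction_vec, key) for key in expert_keys]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])

def token_expert_weights(visual_tokens, instruction_vec, expert_keys, active):
    """Stage 2 (assumed form): per-token weights over the active experts only.

    Each visual token is modulated by the global instruction vector through an
    elementwise product (a simple stand-in for cross-modal attention), scored
    against the active experts' keys, and normalised with softmax.
    """
    weights = []
    for token in visual_tokens:
        fused = [t * g for t, g in zip(token, instruction_vec)]
        scores = [dot(fused, expert_keys[e]) for e in active]
        weights.append(softmax(scores))
    return weights

# Toy example: 4 experts with 2-d keys, one instruction, two visual tokens.
expert_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
instruction = [1.0, 0.5]
visual_tokens = [[1.0, 0.0], [0.0, 2.0]]

active = select_experts(instruction, expert_keys, top_k=2)
weights = token_expert_weights(visual_tokens, instruction, expert_keys, active)
print(active)   # indices of the experts kept by stage 1
print(weights)  # one weight vector per visual token, each summing to 1
```

The point of the split is that the sparse set is fixed once per sample by the instruction, while the blend inside that set varies per visual token.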

If this is right

  • The model can acquire new visual-language abilities while simultaneously reinforcing recurring ones inside the same data chunks.
  • Cross-task interference drops because only a sparse, instruction-selected subset of experts is activated at any time.
  • Token-level weighting supplies fine-grained adaptation: each visual token blends the selected experts according to its local content.
  • Routing distributions remain consistent over time, limiting abrupt shifts that would otherwise accelerate forgetting.
  • Overall accuracy on streaming benchmarks rises because the framework handles both novelty and repetition without separate task boundaries.
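
The routing-stability idea behind the fourth point can be sketched as follows. This is a hedged illustration, not the paper's loss: the use of KL divergence, its direction, and the decay value 0.9 are assumptions, and `update_ema` and `stability_penalty` are hypothetical names.

```python
import math

def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def update_ema(reference, current, decay=0.9):
    """Blend the historical reference with the current routing distribution."""
    return [decay * r + (1.0 - decay) * c for r, c in zip(reference, current)]

def stability_penalty(current, reference):
    """Penalty that grows as the current routing drifts from the reference.
    The KL form and its direction are assumptions of this sketch."""
    return kl_divergence(current, reference)

# Toy example: a uniform historical reference and a skewed current routing.
reference = [0.25, 0.25, 0.25, 0.25]
current = [0.7, 0.1, 0.1, 0.1]

penalty = stability_penalty(current, reference)        # compare against history first
reference = update_ema(reference, current, decay=0.9)  # then fold the chunk into history
print(penalty)
print(reference)
```

A decay close to 1 makes the reference change slowly, which trades responsiveness to new tasks for consistency across chunks.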

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same instruction-driven routing pattern could be tested on continual streams that mix vision with other modalities such as audio or text-only updates.
  • When instructions become genuinely ambiguous, adding a lightweight visual-only routing cue might reduce reliance on text alone.
  • Longer streams with hundreds of chunks would provide a direct test of whether the moving-average regularizer continues to stabilize performance.

Load-bearing premise

Textual instructions alone can reliably distinguish and route the heterogeneous task samples inside each non-stationary data chunk without causing significant misrouting or interference.

What would settle it

Two controlled tests would settle it. First, build a stream in which task instructions within the same chunk are deliberately made similar or ambiguous, and measure whether StrLoRA loses its performance advantage over baselines. Second, remove the stability regularizer and run long sequences to see whether the advantage survives without it.

Figures

Figures reproduced from arXiv: 2605.16353 by Chang Che, Cheems Wang, Hui Ma, Zenglin Shi, Ziqi Wang.

Figure 1: (a) In StrCVIT, data arrives as a single-pass stream of chunks, each containing a dynamic …

Figure 2: (a) CKA similarity [12] heatmap of StrCVIT, showing that standard MoE routing leads to homogeneous expert activation across samples from different tasks. (b) t-SNE visualization [13] of visual and textual features, showing that textual embeddings form well-separated clusters and provide a more reliable signal for task discrimination than overlapping visual features.

Figure 3: StrLoRA framework. StrLoRA first performs task-aware expert selection to activate a …

Figure 4: Overall performance over the data stream on InternVL3.5-8B-Pretrained.

Figure 5: Expert activation proportions across tasks. Each stacked bar represents the activation …

Figure 6: Mean accuracy differences between original and proxy test sets on data 001 to data 025.

Figure 7: Composition of data chunks from data 001 to data 025. Each chunk contains 2,000 samples, …

Figure 8: Qualitative temporal forgetting examples across the streaming process. Each panel tracks …

Original abstract

Continual Visual Instruction Tuning (CVIT) enables Multimodal Large Language Models to incrementally acquire new abilities. However, existing CVIT methods operate under a restrictive task-incremental setting, where each training phase corresponds to a single, predefined task. This does not reflect real-world conditions, where data arrives as a continuous stream of interleaved and dynamically evolving tasks. To bridge this gap, we introduce Streaming CVIT (StrCVIT), a more general and realistic setting where models learn from a stream of data chunks containing a dynamic mixture of tasks. In StrCVIT, a model must simultaneously acquire new abilities, reinforce recurring abilities, and mitigate forgetting. Existing CVIT methods fail here as they cannot reliably distinguish or adapt to the heterogeneous task samples within each chunk. We therefore propose StrLoRA, a regularized two-stage expert routing framework. StrLoRA first performs task-aware expert selection using the textual instruction to activate a sparse subset of relevant experts, reducing cross-task interference. It then applies token-wise expert weighting within this subset, where contribution weights are computed via cross-modal attention between local visual tokens and the global instruction representation. To maintain stability across the non-stationary stream, a routing-stability regularization aligns current routing distributions with a historical exponential moving average reference. Extensive experiments on a newly developed StrCVIT benchmark show that StrLoRA substantially outperforms existing methods, effectively enhancing the model's abilities from continuously evolving data streams. The code is available at https://github.com/chanceche/StrCVIT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Streaming Continual Visual Instruction Tuning (StrCVIT), a realistic setting in which MLLMs must learn from non-stationary streams of data chunks that interleave multiple evolving tasks. It proposes StrLoRA, a two-stage regularized expert-routing framework: textual instructions first select a sparse subset of experts to reduce cross-task interference, after which token-wise weights are computed via cross-modal attention between local visual tokens and the global instruction embedding; a routing-stability regularizer aligns current routing distributions to an exponential moving average of historical references. Experiments on a newly constructed StrCVIT benchmark are reported to show that StrLoRA substantially outperforms prior CVIT methods while mitigating forgetting and acquiring new abilities.

Significance. If the empirical claims hold, the work would meaningfully extend continual learning research for multimodal models beyond the restrictive task-incremental paradigm toward streaming, mixed-task scenarios that better match real-world data arrival. The release of code and the benchmark dataset constitutes a concrete contribution that supports reproducibility and follow-on research.

major comments (2)
  1. [§3.1] Task-aware Expert Selection: the central claim that textual-instruction routing reliably distinguishes heterogeneous tasks inside each mixed chunk and thereby reduces cross-task interference is load-bearing, yet the manuscript provides no direct measurement (e.g., routing precision, expert-overlap statistics, or interference ablation) on mixed-task chunks; without such evidence the reported gains could be driven by easier separation in the benchmark rather than by the proposed mechanism.
  2. [§4] Experiments: performance tables report absolute improvements without error bars, standard deviations across random seeds, or statistical significance tests; given the non-stationary nature of the stream, these omissions make it impossible to judge whether the claimed “substantial” outperformance is robust.
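
As an illustration of the kind of measurement the first major comment asks for, the sketch below computes one possible expert-overlap statistic: the mean pairwise Jaccard overlap between the sets of experts that different tasks activate. The log format, the task labels, and the function names are hypothetical; the paper may define overlap differently.

```python
from collections import defaultdict

def expert_sets_by_task(routing_log):
    """Collect the set of experts each task activated.

    routing_log is a list of (task_label, active_expert_ids) pairs, assumed to
    be recorded while processing mixed-task chunks with known labels.
    """
    sets = defaultdict(set)
    for task, experts in routing_log:
        sets[task].update(experts)
    return dict(sets)

def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def mean_pairwise_overlap(routing_log):
    """Average Jaccard overlap over all unordered task pairs (lower is better)."""
    sets = expert_sets_by_task(routing_log)
    tasks = sorted(sets)
    pairs = [(s, t) for i, s in enumerate(tasks) for t in tasks[i + 1:]]
    if not pairs:
        return 0.0
    return sum(jaccard(sets[s], sets[t]) for s, t in pairs) / len(pairs)

# Hypothetical log: two tasks sharing one expert out of four distinct experts.
log = [("ocr", [0, 1]), ("ocr", [0, 1]), ("chart", [2, 3]), ("chart", [1, 2])]
overlap = mean_pairwise_overlap(log)
print(overlap)
```

Reporting this value for both instruction-based routing and a non-task-aware baseline on the same chunks would show whether sparser, task-specific activation is actually occurring.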
minor comments (2)
  1. [Abstract] The abstract states that the code is available but does not include the exact commit or installation instructions; adding these would aid immediate reproducibility.
  2. [§3.2] Notation for the EMA reference distribution and the cross-modal attention weights should be introduced once with a single consistent symbol table rather than redefined inline in multiple subsections.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript accordingly to strengthen the presentation of our contributions.

Point-by-point responses
  1. Referee: [§3.1] Task-aware Expert Selection: the central claim that textual-instruction routing reliably distinguishes heterogeneous tasks inside each mixed chunk and thereby reduces cross-task interference is load-bearing, yet the manuscript provides no direct measurement (e.g., routing precision, expert-overlap statistics, or interference ablation) on mixed-task chunks; without such evidence the reported gains could be driven by easier separation in the benchmark rather than by the proposed mechanism.

    Authors: We agree that direct measurements of the routing mechanism on mixed-task chunks would provide stronger support for the claim. In the revised manuscript we have added a dedicated analysis subsection that reports routing precision, expert-overlap statistics, and an interference ablation performed specifically on the mixed chunks of the StrCVIT benchmark. These new results show that textual-instruction routing achieves high task-separation precision and substantially lower expert overlap than non-task-aware baselines, directly linking the observed gains to reduced cross-task interference rather than benchmark artifacts.

    Revision: yes

  2. Referee: [§4] Experiments: performance tables report absolute improvements without error bars, standard deviations across random seeds, or statistical significance tests; given the non-stationary nature of the stream, these omissions make it impossible to judge whether the claimed “substantial” outperformance is robust.

    Authors: We acknowledge that variability measures are essential for assessing robustness under non-stationary streams. We have re-executed all experiments across five random seeds, added error bars and standard deviations to the tables in Section 4, and included paired statistical significance tests. The updated results confirm that StrLoRA’s improvements remain statistically significant (p < 0.05) and consistent across seeds, supporting the robustness of the reported gains.

    Revision: yes
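
For readers unfamiliar with the paired test mentioned in the second response, a minimal sketch follows. The scores are placeholder values invented for illustration and are not results from the paper; the function name is hypothetical. With five seeds the test has 4 degrees of freedom, for which the two-sided 5% critical value of Student's t is 2.776.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(method_scores, baseline_scores):
    """Return (t, degrees_of_freedom) for a paired t-test over matched seeds."""
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    n = len(diffs)
    spread = stdev(diffs)  # sample standard deviation of the per-seed gaps
    if spread == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    return mean(diffs) / (spread / math.sqrt(n)), n - 1

# Placeholder accuracies for five seeds (NOT taken from the paper).
method = [61.0, 62.0, 60.5, 61.5, 62.5]
baseline = [58.0, 58.5, 58.5, 58.0, 59.0]

t_value, dof = paired_t_statistic(method, baseline)
T_CRIT_DF4 = 2.776  # two-sided 5% critical value for 4 degrees of freedom
print(t_value, dof, abs(t_value) > T_CRIT_DF4)
```

Pairing by seed matters here because both methods see the same stream order under a given seed, so the per-seed gap removes variance that an unpaired comparison would count as noise.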

Circularity Check

0 steps flagged

No significant circularity; new construction with independent regularization

The paper defines a new streaming CVIT setting and proposes StrLoRA as an explicit two-stage routing framework plus EMA-based stability regularizer. Task-aware expert selection uses textual instructions directly, token-wise weighting uses cross-modal attention, and the regularizer aligns current routing distributions to a historical moving average reference; none of these steps are shown to reduce by construction to fitted parameters or prior self-citations. The derivation chain remains self-contained against external benchmarks and does not invoke uniqueness theorems or ansatzes from overlapping prior work as load-bearing premises.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The approach rests on standard mixture-of-experts routing assumptions and the newly defined streaming data model; no explicit free parameters or invented entities are detailed in the abstract.

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · 8 internal anchors

  [1] S. Bai, Y. Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge et al., “Qwen3-VL technical report,” arXiv preprint arXiv:2511.21631, 2025.
  [2] G. Team, A. Kamath, J. Ferret, S. Pathak, N. Vieillard, R. Merhej, S. Perrin, T. Matejovicova et al., “Gemma 3 technical report,” arXiv preprint arXiv:2503.19786, 2025.
  [3] W. Wang, Z. Gao, L. Gu, H. Pu, L. Cui, X. Wei, Z. Liu, L. Jing, S. Ye, J. Shao et al., “InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency,” arXiv preprint arXiv:2508.18265, 2025.
  [4] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan et al., “The Llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024.
  [5] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv et al., “Qwen3 technical report,” arXiv preprint arXiv:2505.09388, 2025.
  [6] C. Chen, J. Zhu, X. Luo, H. T. Shen, J. Song, and L. Gao, “Coin: A benchmark of continual instruction tuning for multimodel large language models,” Advances in neural information processing systems, vol. 37, pp. 57 817–57 840, 2024.
  [7] Z. Wang, C. Che, Q. Wang, Y. Li, Z. Shi, and M. Wang, “Smolora: Exploring and defying dual catastrophic forgetting in continual visual instruction tuning,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 177–186.
  [8] C. Che, Z. Wang, P. Yang, C. Wang, H. Ma, and Z. Shi, “LoRA in LoRA: Towards parameter-efficient architecture expansion for continual visual instruction tuning,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 24, pp. 19 978–19 986, Mar. 2026.
  [9] W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,” Journal of Machine Learning Research, vol. 23, no. 120, pp. 1–39, 2022.
  [10] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean, “Outrageously large neural networks: The sparsely-gated mixture-of-experts layer,” arXiv preprint arXiv:1701.06538, 2017.
  [11] L. Song, Z. Chen, K. Pan, X. Han, X. Gan, Y. Pan, X. Sun, X. Wang, and X. Shang, “Kss-moe: Knowledge space synergy framework in mixture of experts for continual visual instruction tuning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 30, 2026, pp. 25 536–25 544.
  [12] S. Kornblith, M. Norouzi, H. Lee, and G. Hinton, “Similarity of neural network representations revisited,” in International conference on machine learning. PMLR, 2019, pp. 3519–3529.
  [13] L. Van der Maaten and G. Hinton, “Visualizing data using t-sne,” Journal of machine learning research, vol. 9, no. 11, 2008.
  [14] S. Kullback and R. A. Leibler, “On information and sufficiency,” The annals of mathematical statistics, vol. 22, no. 1, pp. 79–86, 1951.
  [15] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in CVPR, 2009.
  [16] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba, “Places: A 10 million image database for scene recognition,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 6, pp. 1452–1464, 2017.
  [17] C. Sima, K. Renz, K. Chitta, L. Chen, H. Zhang, C. Xie, J. Beißwenger, P. Luo, A. Geiger, and H. Li, “Drivelm: Driving with graph visual question answering,” in European conference on computer vision. Springer, 2024, pp. 256–274.
  [18] Z. Wang, C. Che, Q. Wang, H. Ma, Z. Shi, C. G. Snoek, and M. Wang, “Harmonious parameter adaptation in continual visual instruction tuning for safety-aligned mllms,” arXiv preprint arXiv:2511.20158, 2025.
  [19] A. Mishra, S. Shekhar, A. K. Singh, and A. Chakraborty, “Ocr-vqa: Visual question answering by reading text in images,” in ICDAR. IEEE, 2019, pp. 947–952.
  [20] O. Sidorov, R. Hu, M. Rohrbach, and A. Singh, “Textcaps: a dataset for image captioning with reading comprehension,” in ECCV. Springer, 2020, pp. 742–758.
  [21] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh, “Making the v in vqa matter: Elevating the role of image understanding in visual question answering,” in CVPR, 2017.
  [22] S. Lobry, D. Marcos, J. Murray, and D. Tuia, “Rsvqa: Visual question answering for remote sensing data,” IEEE Transactions on Geoscience and Remote Sensing, vol. 58, no. 12, pp. 8555–8566, 2020.
  [23] A. Masry, X. L. Do, J. Q. Tan, S. Joty, and E. Hoque, “Chartqa: A benchmark for question answering about charts with visual and logical reasoning,” in Findings of the association for computational linguistics: ACL 2022, 2022, pp. 2263–2279.
  [24] D. A. Hudson and C. D. Manning, “Gqa: A new dataset for real-world visual reasoning and compositional question answering,” in CVPR, 2019, pp. 6700–6709.
  [25] A. Chaudhry, M. Rohrbach, M. Elhoseiny, T. Ajanthan, P. K. Dokania, P. H. Torr, and M. Ranzato, “On tiny episodic memories in continual learning,” arXiv preprint arXiv:1902.10486, 2019.
  [26] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “LoRA: Low-rank adaptation of large language models,” arXiv preprint arXiv:2106.09685, 2021.
  [27] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska et al., “Overcoming catastrophic forgetting in neural networks,” Proceedings of the national academy of sciences, vol. 114, no. 13, pp. 3521–3526, 2017.
  [28] H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” in NeurIPS, 2023, pp. 34 892–34 916.
  [29] D. Zhang, Y. Ren, Z.-Z. Li, Y. Yu, J. Dong, C. Li, Z. Ji, and J. Bai, “Enhancing multimodal continual instruction tuning with branchlora,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 5743–5756.
  [30] Y. Yu, D. Zhang, Y. Ren, X. Zhao, X. Chen, and C. Chu, “Progressive LoRA for multimodal continual instruction tuning,” in Findings of the Association for Computational Linguistics: ACL 2025, 2025, pp. 2779–2796.
  [31] R. Aljundi, K. Kelchtermans, and T. Tuytelaars, “Task-free continual learning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 11 254–11 263.
  [32] J.-Y. Moon, K.-H. Park, J. U. Kim, and G.-M. Park, “Online class incremental learning on stochastic blurry task boundary via mask and visual prompt tuning,” in Proceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 11 731–11 741.
  [33] F. Ye and A. G. Bors, “Online task-free continual generative and discriminative learning via dynamic cluster memory,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 26 202–26 212.
  [34] L. Wang, L. Xiang, Y. Wei, Y. Ru, Y. Wang, and Z. He, “Symmetric image-text tuning with entropy-guided fusion for online continual learning in non-stationary visual streams,” IEEE Transactions on Image Processing, vol. 35, pp. 3202–3213, 2026.
  [35] M. Lee, M. Seo, T. Qu, T. Tuytelaars, and J. Choi, “Oasis: Online sample selection for continual visual instruction tuning,” arXiv preprint arXiv:2506.02011, 2025.
  [36] T. Huai, J. Zhou, X. Wu, Q. Chen, Q. Bai, Z. Zhou, and L. He, “Cl-moe: Enhancing multimodal large language model with dual momentum mixture-of-experts for continual visual question answering,” in Proceedings of the computer vision and pattern recognition conference, 2025, pp. 19 608–19 617.
  [37] C. Ge, X. Wang, Z. Zhang, H. Chen, J. Fan, L. Huang, H. Xue, and W. Zhu, “Dynamic mixture of curriculum lora experts for continual multimodal instruction tuning,” in International Conference on Machine Learning. PMLR, 2025, pp. 19 011–19 033.
  [38] Y. Zhao, J. Huang, J. Hu, X. Wang, Y. Mao, D. Zhang, Z. Jiang, Z. Wu, B. Ai, A. Wang et al., “Swift: a scalable lightweight infrastructure for fine-tuning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 28, 2025, pp. 29 733–29 735.

A Limitations Due to the difficulty of collecting real-world MLLM data streams with natur...