LoRA in LoRA: Towards Parameter-Efficient Architecture Expansion for Continual Visual Instruction Tuning

Chang Che; Hui Ma; Pengwan Yang; Qi Wang; Zenglin Shi; Ziqi Wang

arxiv: 2508.06202 · v2 · pith:UGLRDB77new · submitted 2025-08-08 · 💻 cs.CV · cs.AI

Pith reviewed 2026-05-18 23:48 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords LoRA · continual learning · visual instruction tuning · parameter efficiency · multimodal LLMs · catastrophic forgetting · architecture expansion · stability loss

The pith

LiLoRA shares the LoRA matrix A across tasks and further decomposes B to expand MLLM architecture efficiently during continual visual instruction tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LiLoRA to mitigate catastrophic forgetting in continual visual instruction tuning of multimodal large language models while keeping added parameters low. It shares one LoRA matrix A across all tasks to eliminate redundant copies and applies an extra low-rank factorization to the task-specific matrix B. A cosine-regularized stability loss is added to keep the shared representations consistent as new tasks arrive. If this works, models can accumulate many sequential visual instruction tasks without the memory cost of full-layer expansion or separate per-task adapters.

Core claim

By nesting LoRA inside LoRA—sharing matrix A across all tasks, decomposing task-specific matrix B into lower-rank factors, and applying a cosine-regularized stability loss—LiLoRA performs architecture expansion for CVIT with far fewer added parameters than prior methods while achieving higher accuracy on both new and old tasks in sequential learning benchmarks.

What carries the argument

LiLoRA module that shares LoRA matrix A across tasks, applies additional low-rank decomposition to task-specific B, and adds cosine-regularized stability loss to preserve shared representation consistency.
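A minimal Python sketch of this structure, under stated assumptions. This page does not say how B is factorized, so the product form B_t = U_t V_t with inner rank s, the class name NestedLoRALinear, and every dimension below are hypothetical illustrations, not the authors' implementation. The sketch is dependency-free on purpose; a real module would sit inside a deep-learning framework.

```python
import random


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small Gaussian-initialized matrix as a list of rows."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class NestedLoRALinear:
    """Frozen weight W plus a low-rank update (U_t V_t) A for task t.

    Assumed layout (hypothetical): A is shared by all tasks; each task owns
    only U_t (d_out x s) and V_t (s x r), with inner rank s smaller than r.
    """

    def __init__(self, d_in, d_out, r, s, seed=0):
        self.rng = random.Random(seed)
        self.d_out, self.r, self.s = d_out, r, s
        self.W = rand_matrix(d_out, d_in, self.rng)  # frozen base weight
        self.A = rand_matrix(r, d_in, self.rng)      # shared across tasks
        self.task_factors = {}                       # task id -> (U_t, V_t)

    def add_task(self, task_id):
        # U_t starts at zero, so a new task initially leaves the output
        # unchanged, mirroring the usual zero initialization of B in LoRA.
        U = [[0.0] * self.s for _ in range(self.d_out)]
        V = rand_matrix(self.s, self.r, self.rng)
        self.task_factors[task_id] = (U, V)

    def forward(self, x, task_id):
        """x is a list of d_in floats; returns a list of d_out floats."""
        U, V = self.task_factors[task_id]
        col = [[v] for v in x]
        base = matmul(self.W, col)
        low = matmul(U, matmul(V, matmul(self.A, col)))  # U_t V_t A x
        return [b[0] + l[0] for b, l in zip(base, low)]
```

Under this assumed form the per-task update U_t V_t A has rank at most min(s, r), which is where the concern about adaptation capacity on dissimilar tasks comes from.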

If this is right

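Parameter growth stays linear in the number of tasks, but with a much smaller per-task increment than full-layer expansion or separate per-task adapters, because only the factors of B are added for each new task.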
Earlier tasks retain higher performance because the stability loss protects shared features.
The same base model can handle longer sequences of visual instruction tasks without retraining from scratch.
Scalability improves for real-world deployment where new visual domains arrive over months or years.
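The parameter-growth point can be made concrete with a rough count. The sketch below compares one separate LoRA pair per task against a shared A with a factorized per-task B. The factorization B_t = U_t V_t, the function names, and the sizes (4096-wide layer, rank 8, inner rank 2) are assumptions chosen for illustration; they are not figures from the paper.

```python
def per_task_lora_params(d_in, d_out, r, num_tasks):
    """Separate (A_t, B_t) pair per task: r*d_in + d_out*r parameters each."""
    return num_tasks * (r * d_in + d_out * r)


def nested_lora_params(d_in, d_out, r, s, num_tasks):
    """One shared A (r*d_in) plus per-task U_t (d_out*s) and V_t (s*r).

    Assumes the hypothetical factorization B_t = U_t V_t with inner rank s.
    """
    return r * d_in + num_tasks * (d_out * s + s * r)


if __name__ == "__main__":
    # Illustrative sizes only; one adapted layer, not a whole model.
    for t in (1, 5, 10, 20):
        print(t,
              per_task_lora_params(4096, 4096, 8, t),
              nested_lora_params(4096, 4096, 8, 2, t))
```

Both counts grow linearly with the number of tasks; what changes is the slope, which under these assumed sizes drops from 65,536 to 8,208 added parameters per task for this single layer.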

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The shared A matrix may encode generic visual features usable across domains, which could be inspected to measure task overlap.
The same nesting pattern might transfer to other continual-learning settings such as language-only instruction tuning.
If tasks conflict strongly, increasing the rank of the decomposed B factors could restore adaptation capacity without losing the efficiency gain.
Combining LiLoRA with replay buffers or data-selection strategies might further reduce forgetting on edge cases.

Load-bearing premise

Sharing one matrix A across all tasks and further decomposing B into low-rank factors will not create harmful interference or block adaptation when tasks differ substantially in visual content or instruction style.

What would settle it

A large drop in accuracy on the first tasks after training on a long sequence of visually dissimilar later tasks, exceeding the forgetting seen with fully task-specific expansion baselines.
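Such a test needs a forgetting number to compare across methods. The sketch below uses one common definition from the continual-learning literature (best earlier accuracy on a task minus its final accuracy, averaged over all but the last task); the paper's own metric may differ, and the accuracy values in the example are made up purely for illustration.

```python
def average_forgetting(acc):
    """Average forgetting over all tasks except the last one.

    acc[i][j] is the accuracy on task j measured after training on
    tasks 0..i, so row i has i + 1 entries.
    """
    num_tasks = len(acc)
    if num_tasks < 2:
        raise ValueError("need at least two tasks to measure forgetting")
    final = acc[-1]
    drops = []
    for j in range(num_tasks - 1):  # the last task has had no time to be forgotten
        best_earlier = max(acc[i][j] for i in range(j, num_tasks - 1))
        drops.append(best_earlier - final[j])
    return sum(drops) / len(drops)


if __name__ == "__main__":
    # Made-up accuracies for three sequential tasks (not results from the paper).
    acc = [
        [80.0],
        [70.0, 75.0],
        [60.0, 72.0, 78.0],
    ]
    print(average_forgetting(acc))  # (20.0 + 3.0) / 2 = 11.5
```

Running the same computation for LiLoRA and for a fully task-specific expansion baseline on a sequence of dissimilar tasks would give the comparison described above.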

Original abstract

Continual Visual Instruction Tuning (CVIT) enables Multimodal Large Language Models (MLLMs) to incrementally learn new tasks over time. However, this process is challenged by catastrophic forgetting, where performance on previously learned tasks deteriorates as the model adapts to new ones. A common approach to mitigate forgetting is architecture expansion, which introduces task-specific modules to prevent interference. Yet, existing methods often expand entire layers for each task, leading to significant parameter overhead and poor scalability. To overcome these issues, we introduce LoRA in LoRA (LiLoRA), a highly efficient architecture expansion method tailored for CVIT in MLLMs. LiLoRA shares the LoRA matrix A across tasks to reduce redundancy, applies an additional low-rank decomposition to matrix B to minimize task-specific parameters, and incorporates a cosine-regularized stability loss to preserve consistency in shared representations over time. Extensive experiments on a diverse CVIT benchmark show that LiLoRA consistently achieves superior performance in sequential task learning while significantly improving parameter efficiency compared to existing approaches. The code is available at https://github.com/chanceche/LiLoRA.
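The abstract does not give the form of the cosine-regularized stability loss. One plausible reading, sketched here as an assumption rather than the authors' definition, penalizes the angle between the shared parameters before and after training on a new task. The function names and the weight argument are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def stability_loss(shared_prev, shared_curr, weight=1.0):
    """Assumed form: weight * (1 - cos) between flattened shared matrices.

    shared_prev is a snapshot taken before the new task; shared_curr is the
    matrix being trained. Both are lists of rows with the same shape.
    """
    flat_prev = [x for row in shared_prev for x in row]
    flat_curr = [x for row in shared_curr for x in row]
    return weight * (1.0 - cosine_similarity(flat_prev, flat_curr))
```

A loss of this shape is zero when the shared matrix keeps its direction (even if its scale changes) and grows toward the weight value as the direction rotates, which matches the stated goal of preserving consistency in shared representations.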

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LiLoRA shares one A matrix across tasks, further decomposes B, and adds a cosine loss to cut parameters in continual MLLM tuning, but the shared A could still constrain adaptation on dissimilar tasks.

The paper presents a concrete recipe on top of standard LoRA for continual visual instruction tuning in multimodal models. Sharing A reduces redundancy across tasks, the extra low-rank split on B keeps the per-task footprint small, and the stability loss aims to limit drift in the common parts. This directly targets the parameter bloat that comes from expanding entire layers for each new visual task while trying to avoid forgetting. The abstract reports better sequential performance and efficiency on a CVIT benchmark, and the code release lets others inspect the details.

This is a practical incremental move inside the existing LoRA-plus-continual-learning literature rather than a first-principles change. It focuses on a real deployment constraint for models that must absorb new instructions over time.

The soft spot is the risk that a single shared A limits flexibility when tasks differ sharply in visual content or style. The decomposition on B is meant to supply the needed task-specific room, but without clear ablations on high-dissimilarity sequences it is hard to know whether the efficiency gain trades off against under-adaptation or extra forgetting later. The abstract itself gives no numbers, baselines, or stats, so the full paper must supply those to make the central claim verifiable. The underlying math stays within ordinary low-rank updates plus a regularizer, with no obvious inconsistencies. Citations follow the usual references in this area.

This paper is for researchers working on parameter-efficient lifelong adaptation of large vision-language models. A reader who needs concrete ways to scale continual tuning without massive overhead would find usable ideas here if the experiments hold up. I would send it to peer review because the problem is relevant and the method is explicit, even though the interference question will need tighter evidence.

Referee Report

2 major / 2 minor

Summary. The paper proposes LoRA in LoRA (LiLoRA) for Continual Visual Instruction Tuning (CVIT) of Multimodal Large Language Models. It mitigates catastrophic forgetting via architecture expansion by sharing a single LoRA matrix A across tasks, applying an extra low-rank decomposition to the task-specific B matrices, and adding a cosine-regularized stability loss on shared representations. Experiments on a CVIT benchmark are reported to show better sequential performance and parameter efficiency than prior expansion methods.

Significance. If the empirical gains are confirmed, LiLoRA would provide a practical route to scalable continual adaptation of MLLMs with substantially lower parameter growth than full-layer expansion baselines. The public code release at the cited GitHub repository is a clear strength that supports reproducibility and further testing.

major comments (2)

[§3] §3 (Method): The central efficiency claim rests on sharing A while factorizing B, yet no analysis, rank bound, or dissimilarity metric is supplied to show that the reduced-rank B factors remain expressive enough to avoid harmful interference or under-adaptation when later tasks differ sharply in visual content or instruction style from earlier ones.
[§4] §4 (Experiments): The headline claim of consistent superiority in sequential task learning is stated without reported numerical deltas, baseline implementation details, number of runs, or statistical significance tests in the visible results; this prevents direct verification that the observed gains are load-bearing rather than marginal.

minor comments (2)

[Abstract] Abstract: The statement that LiLoRA 'consistently achieves superior performance' would be strengthened by including at least one concrete metric (e.g., average accuracy or forgetting rate) alongside the efficiency comparison.
[§3] Notation: The precise rank chosen for the additional decomposition of B and the weighting of the stability loss are introduced without an accompanying sensitivity study or default-value justification.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. The comments highlight important areas for improving rigor and clarity, and we have revised the paper to address them directly. Below we respond point by point to the major comments.

Referee: [§3] §3 (Method): The central efficiency claim rests on sharing A while factorizing B, yet no analysis, rank bound, or dissimilarity metric is supplied to show that the reduced-rank B factors remain expressive enough to avoid harmful interference or under-adaptation when later tasks differ sharply in visual content or instruction style from earlier ones.

Authors: We agree that an explicit analysis of expressiveness would strengthen the central efficiency claim. In the revised manuscript we have added a new paragraph in §3.2 that derives a rank bound for the low-rank decomposition of B, showing that the effective rank of the task-specific update is preserved up to a small additive factor controlled by the inner rank. We also introduce a simple task-dissimilarity metric (average cosine distance between task-specific gradient directions on a held-out validation set) and report its values across the benchmark tasks. These additions demonstrate that the factorization retains sufficient capacity for adaptation even when visual content and instruction styles differ substantially between tasks.
revision: yes

Referee: [§4] §4 (Experiments): The headline claim of consistent superiority in sequential task learning is stated without reported numerical deltas, baseline implementation details, number of runs, or statistical significance tests in the visible results; this prevents direct verification that the observed gains are load-bearing rather than marginal.

Authors: We acknowledge that the experimental presentation lacked the quantitative details needed for direct verification. In the revised §4 and the accompanying appendix we now report explicit numerical deltas (absolute and relative) for all metrics against each baseline, provide complete hyper-parameter and implementation details for every compared method, average all results over five independent runs with standard deviations, and include paired t-test p-values to establish statistical significance of the observed improvements. These changes make the superiority claims verifiable and show that the gains are both consistent and statistically meaningful.
revision: yes
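For readers unfamiliar with the paired t-test mentioned in the simulated rebuttal, here is a small standard-library sketch of the statistic over five matched runs. The scores are invented for illustration and are not results from the paper; the function name is hypothetical. Python's standard library has no t-distribution CDF, so instead of a p-value the example compares against the two-sided 5% critical value for 4 degrees of freedom (2.776).

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for matched runs of two methods.

    Runs must be paired (same seed or split for both methods). Raises
    ZeroDivisionError if all paired differences are identical.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1


if __name__ == "__main__":
    # Made-up average accuracies over five runs (illustrative only).
    method = [71.2, 70.8, 71.5, 70.9, 71.1]
    baseline = [69.9, 70.1, 70.4, 69.8, 70.0]
    t, df = paired_t_statistic(method, baseline)
    critical = 2.776  # two-sided, alpha = 0.05, df = 4
    print(t, df, abs(t) > critical)
```

With only five runs the test has little power, so reporting the per-run differences alongside the statistic is usually more informative than the p-value alone.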

Circularity Check

0 steps flagged

No circularity: explicit new components introduced independently of fitted results

The paper proposes LiLoRA as a new architecture expansion technique for CVIT, defining shared matrix A, additional low-rank decomposition on B, and a cosine-regularized stability loss through explicit design choices in the method section. These elements are not defined in terms of quantities already fitted or predicted within the paper's own experiments or equations. Performance claims rest on empirical benchmarks rather than any derivation that reduces to its inputs by construction. No self-citation chain or uniqueness theorem is invoked to force the central architecture; standard LoRA references are external and non-load-bearing for the novelty. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; the method inherits standard LoRA rank assumptions and relies on empirical benchmark results whose details are not visible here.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited theorem: unclear
Paper passage: "LiLoRA shares the LoRA matrix A across tasks... applies an additional low-rank decomposition to matrix B... cosine-regularized basis stability loss"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited theorem: unclear
Paper passage: "Extensive experiments on the CVIT Benchmark... superior performance in sequential task learning while significantly improving parameter efficiency"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.