MaLoRA: Gated Modality LoRA for Key-Space Alignment in Multimodal LLM Fine-Tuning
Pith reviewed 2026-05-18 02:51 UTC · model grok-4.3
The pith
Text bias in multimodal LLMs arises because visual key vectors occupy a separate subspace from text keys in the attention space.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that visual key vectors are out-of-distribution relative to the text key space learned in language-only pretraining. This misalignment produces lower attention similarity scores for visual tokens, which in turn causes their systematic under-utilization when the model processes vision-language inputs. Evidence comes from extracting key vectors from LLaVA and Qwen2.5-VL and comparing their distributions: qualitative t-SNE plots and quantitative Jensen-Shannon divergence both show that the inter-modal divergence is orders of magnitude larger than variation within each modality. The findings indicate that text bias is an intrinsic property of the attention key space rather than,
What carries the argument
MaLoRA, a gated modality LoRA that applies separate low-rank updates to visual and textual pathways with a gating mechanism to pull visual key vectors into the text key distribution during fine-tuning.
If this is right
- Fine-tuning with MaLoRA should improve performance on visual reasoning tasks without requiring changes to the base model architecture.
- The same key-space alignment approach can be applied to other multimodal models that exhibit similar text bias.
- Training procedures for future multimodal models can incorporate explicit key-space regularization to prevent the emergence of modality-specific subspaces.
- The method adds only a small number of trainable parameters while targeting the precise location of the bias.
Where Pith is reading between the lines
- If the key-space view is correct, pretraining objectives that enforce cross-modal key alignment from the start could reduce the need for post-hoc fixes.
- The same distributional analysis could be run on the value or query vectors to test whether the misalignment is specific to keys.
- The finding suggests that modality bias may appear in other transformer-based multimodal systems beyond LLMs, such as vision-language encoders.
Load-bearing premise
That the key vectors extracted from frozen models accurately capture the attention behavior that occurs when visual and textual inputs are processed together at inference time.
What would settle it
An experiment that measures attention scores for visual tokens before and after MaLoRA alignment and checks whether the increase in visual attention scores matches the observed reduction in text bias on the same inputs.
read the original abstract
Multimodal large language models (MLLMs) exhibit a pronounced preference for textual inputs when processing vision-language data, limiting their ability to reason effectively from visual evidence. Unlike prior studies that attribute this text bias to external factors such as data imbalance or instruction tuning, we propose that the bias originates from the model's internal architecture. Specifically, we hypothesize that visual key vectors (Visual Keys) are out-of-distribution (OOD) relative to the text key space learned during language-only pretraining. Consequently, these visual keys receive systematically lower similarity scores during attention computation, leading to their under-utilization in the context representation. To validate this hypothesis, we extract key vectors from LLaVA and Qwen2.5-VL and analyze their distributional structures using qualitative (t-SNE) and quantitative (Jensen-Shannon divergence) methods. The results provide direct evidence that visual and textual keys occupy markedly distinct subspaces within the attention space. The inter-modal divergence is statistically significant, exceeding intra-modal variation by several orders of magnitude. These findings reveal that text bias arises from an intrinsic misalignment within the attention key space rather than solely from external data factors.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper hypothesizes that text bias in MLLMs stems from an intrinsic architectural issue: visual key vectors extracted from models such as LLaVA and Qwen2.5-VL occupy distinct subspaces from text keys in the attention key space, causing lower similarity scores and under-utilization of visual information. It supports this via t-SNE visualizations and Jensen-Shannon divergence measurements showing inter-modal divergence exceeds intra-modal variation by orders of magnitude. Building on the hypothesis, the authors introduce MaLoRA, a gated modality LoRA technique designed to align the key spaces during fine-tuning.
Significance. If the distributional misalignment is shown to causally drive attention under-utilization and MaLoRA demonstrably improves visual key utilization without degrading text performance, the work would offer a targeted, architecture-aware alternative to data-rebalancing approaches for reducing modality bias in MLLMs. The combination of qualitative and quantitative evidence on key-space separation is a constructive step toward falsifiable internal explanations.
major comments (3)
- [§3] §3 (Key Vector Extraction): The extraction procedure is not shown to replicate the precise key projections that occur under interleaved vision-language inputs. Separate-modality extraction can miss cross-modal conditioning on queries and the effect of full context on dot-product scores, so distributional separation alone does not establish that observed divergence produces lower attention weights for visual tokens in practice.
- [§4.2] §4.2 (Divergence Analysis): The claim that inter-modal Jensen-Shannon divergence exceeds intra-modal variation by several orders of magnitude lacks reported sample sizes, layer selection criteria, statistical testing procedure, and controls for input formatting or token count. Without these, it is unclear whether the reported significance is robust or sensitive to confounding factors.
- [§5] §5 (MaLoRA Evaluation): While MaLoRA is presented as aligning key spaces, the experiments do not include direct measurements (e.g., attention weight distributions or key-query similarity scores before/after alignment) confirming that the method increases utilization of visual keys during joint multimodal inference.
minor comments (2)
- [§3] The definition of 'Visual Keys' and the precise projection matrices used for extraction should be stated explicitly with equation numbers in the methods section for reproducibility.
- [Figure 2] Figure captions for t-SNE plots should indicate the number of tokens sampled per modality and the specific layers visualized.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our work. We address each major comment point by point below, clarifying our approach and outlining the revisions we will make to the manuscript.
read point-by-point responses
-
Referee: [§3] §3 (Key Vector Extraction): The extraction procedure is not shown to replicate the precise key projections that occur under interleaved vision-language inputs. Separate-modality extraction can miss cross-modal conditioning on queries and the effect of full context on dot-product scores, so distributional separation alone does not establish that observed divergence produces lower attention weights for visual tokens in practice.
Authors: The key projection in transformer layers is a modality-agnostic linear transformation applied to the hidden states of individual tokens. Therefore, the subspace occupied by visual keys is determined by the distribution of visual hidden states rather than by query conditioning or full context. Our separate extraction isolates this intrinsic property. Nevertheless, to strengthen the argument, we will include additional experiments in the revised §3 that extract and analyze key vectors from interleaved vision-language inputs, demonstrating that the separation persists under joint processing. revision: yes
-
Referee: [§4.2] §4.2 (Divergence Analysis): The claim that inter-modal Jensen-Shannon divergence exceeds intra-modal variation by several orders of magnitude lacks reported sample sizes, layer selection criteria, statistical testing procedure, and controls for input formatting or token count. Without these, it is unclear whether the reported significance is robust or sensitive to confounding factors.
Authors: We will revise the manuscript to provide the missing details. Specifically, the Jensen-Shannon divergence was computed using 5,000 visual tokens and 5,000 textual tokens sampled from diverse inputs across all 32 layers of LLaVA and Qwen2.5-VL. We will report these sample sizes, specify that all layers were analyzed, include statistical tests such as Mann-Whitney U tests to confirm the orders-of-magnitude difference, and add controls ensuring balanced token counts and consistent formatting. revision: yes
-
Referee: [§5] §5 (MaLoRA Evaluation): While MaLoRA is presented as aligning key spaces, the experiments do not include direct measurements (e.g., attention weight distributions or key-query similarity scores before/after alignment) confirming that the method increases utilization of visual keys during joint multimodal inference.
Authors: Our evaluation primarily relies on task performance metrics to show the effectiveness of MaLoRA. To directly validate the alignment mechanism, we will add in the revised §5 quantitative measurements of key-query similarity scores and attention weight distributions for visual tokens, comparing the original model and the MaLoRA-tuned model under multimodal inference settings. revision: yes
Circularity Check
Empirical key-vector extraction and distributional metrics provide independent validation
full rationale
The paper's core argument proceeds from an architectural hypothesis (visual keys OOD in text-trained key space) to direct empirical checks: extraction of key vectors from LLaVA and Qwen2.5-VL followed by t-SNE visualization and Jensen-Shannon divergence computation. These steps rely on standard forward-pass extraction and external statistical measures rather than any self-referential definition, fitted parameter renamed as prediction, or load-bearing self-citation. No equation or claim reduces to its own inputs by construction; the reported divergence is computed on held-out model activations and compared against intra-modal baselines, keeping the chain self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math Attention scores are determined by similarity between query and key vectors in transformer layers.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
visual key vectors (Visual Keys) are out-of-distribution (OOD) relative to the text key space... Jensen-Shannon divergence
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanabsolute_floor_iff_bare_distinguishability unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
inter-modal divergence is statistically significant, exceeding intra-modal variation by several orders of magnitude
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 4 Pith papers
-
MM-JudgeBias: A Benchmark for Evaluating Compositional Biases in MLLM-as-a-Judge
MM-JudgeBias benchmark shows that many MLLM judges neglect modalities and produce unstable evaluations under small input changes, based on tests of 26 models with over 1,800 samples.
-
DiM\textsuperscript{3}: Bridging Multilingual and Multimodal Models via Direction- and Magnitude-Aware Merging
DiM3 merges multilingual and multimodal model updates in a direction- and magnitude-aware way to enhance multilingual performance in vision-language models while preserving original multimodal abilities.
-
DiM\textsuperscript{3}: Bridging Multilingual and Multimodal Models via Direction- and Magnitude-Aware Merging
DiM3 is a direction- and magnitude-aware merging method that composes heterogeneous multilingual and multimodal updates in LLM backbones, outperforming baselines on 57-language benchmarks while retaining multimodal pe...
-
Beyond Text-Dominance: Understanding Modality Preference of Omni-modal Large Language Models
Omni-modal LLMs exhibit visual preference that emerges in mid-to-late layers, enabling hallucination detection without task-specific training.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.