
arXiv: 2510.26721 · v2 · submitted 2025-10-30 · 💻 cs.AI · cs.MM

MaLoRA: Gated Modality LoRA for Key-Space Alignment in Multimodal LLM Fine-Tuning

Pith reviewed 2026-05-18 02:51 UTC · model grok-4.3

classification 💻 cs.AI cs.MM
keywords text bias · multimodal LLM · key space · LoRA · attention mechanism · vision-language models · modality alignment

The pith

Text bias in multimodal LLMs arises because visual key vectors occupy a separate subspace from text keys in the attention space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multimodal large language models show a strong preference for textual inputs over visual ones when processing combined data. Earlier explanations pointed to data imbalance or instruction tuning, yet this work traces the bias to an internal architectural feature. Visual key vectors extracted during attention computation sit out of distribution relative to the text key space established in language pretraining. As a result, these visual keys receive systematically lower similarity scores and contribute less to the final context representation. Distributional analysis on models such as LLaVA and Qwen2.5-VL, using both t-SNE visualizations and Jensen-Shannon divergence, confirms that inter-modal separation greatly exceeds intra-modal variation. The paper introduces MaLoRA, a gated modality LoRA, to align the key spaces during fine-tuning and thereby increase visual utilization.

Core claim

The paper claims that visual key vectors are out-of-distribution relative to the text key space learned in language-only pretraining. This misalignment produces lower attention similarity scores for visual tokens, which in turn causes their systematic under-utilization when the model processes vision-language inputs. Evidence comes from extracting key vectors from LLaVA and Qwen2.5-VL and comparing their distributions: qualitative t-SNE plots and quantitative Jensen-Shannon divergence both show that the inter-modal divergence is orders of magnitude larger than variation within each modality. The findings indicate that text bias is an intrinsic property of the attention key space rather than solely a product of external data factors.
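The inter-modal versus intra-modal comparison can be illustrated with a minimal sketch. The page does not say how the paper estimates Jensen-Shannon divergence on high-dimensional keys, so this example assumes a simple approach: project keys to one dimension, bin them on a shared grid, and compare histograms. The function names, the bin count, and the synthetic Gaussian samples are all assumptions made for illustration, not data or code from the paper.

```python
import math
import random

def histogram(values, lo, hi, bins):
    """Normalized equal-width histogram of values over [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = max(0, min(int((v - lo) / width), bins - 1))
        counts[idx] += 1
    return [c / len(values) for c in counts]

def kl_bits(p, q):
    """KL(p || q) in bits; bins where p is zero contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits, bounded between 0 and 1."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_bits(p, m) + 0.5 * kl_bits(q, m)

# Synthetic 1-D stand-ins for projected key vectors (not data from the paper).
rng = random.Random(0)
text_a = [rng.gauss(0.0, 1.0) for _ in range(2000)]
text_b = [rng.gauss(0.0, 1.0) for _ in range(2000)]
visual = [rng.gauss(6.0, 1.0) for _ in range(2000)]

# A shared binning grid keeps the three histograms comparable.
pooled = text_a + text_b + visual
lo, hi, bins = min(pooled), max(pooled), 40
h_a, h_b, h_v = (histogram(x, lo, hi, bins) for x in (text_a, text_b, visual))

intra = js_divergence(h_a, h_b)  # text vs. text baseline
inter = js_divergence(h_a, h_v)  # text vs. visual
print(f"intra-modal JSD: {intra:.4f} bits, inter-modal JSD: {inter:.4f} bits")
```

The intra-modal baseline matters here: two finite samples from the same distribution never give exactly zero divergence, so the inter-modal value is only meaningful relative to that noise floor.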

What carries the argument

MaLoRA, a gated modality LoRA that applies separate low-rank updates to visual and textual pathways with a gating mechanism to pull visual key vectors into the text key distribution during fine-tuning.
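The page describes MaLoRA only at this level of detail, so the following is a rough sketch of what a gated, per-modality low-rank update on the key projection might look like, not the authors' implementation. It assumes a frozen base matrix `W_k`, one LoRA pair (`A`, `B`) per modality with the usual zero initialization of `B`, and a scalar sigmoid gate per modality. The class name, the gate form, and all shapes are hypothetical.

```python
import math
import random

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class GatedModalityLoRAKey:
    """Hypothetical key projection: frozen W_k plus a gated low-rank update per modality."""

    def __init__(self, d_model, d_key, rank, seed=0):
        rng = random.Random(seed)

        def rand(rows, cols):
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.W_k = rand(d_key, d_model)  # frozen base weights (assumed)
        # A is (rank x d_model) with random init; B is (d_key x rank) with zero init,
        # so the update B @ A starts at zero, as in standard LoRA.
        self.A = {m: rand(rank, d_model) for m in ("text", "visual")}
        self.B = {m: [[0.0] * rank for _ in range(d_key)] for m in ("text", "visual")}
        self.gate_logit = {"text": 0.0, "visual": 0.0}  # learnable scalars (assumed)

    def key(self, h, modality):
        """Key for one token: W_k h + gate * B_m (A_m h), using that token's modality."""
        base = matvec(self.W_k, h)
        delta = matvec(self.B[modality], matvec(self.A[modality], h))
        g = sigmoid(self.gate_logit[modality])
        return [b + g * d for b, d in zip(base, delta)]

layer = GatedModalityLoRAKey(d_model=8, d_key=4, rank=2)
h = [1.0] * 8
print(layer.key(h, "visual") == matvec(layer.W_k, h))  # untrained adapter leaves keys unchanged
```

Routing by modality means an update learned for visual tokens cannot disturb text keys, which is one plausible reading of "separate low-rank updates to visual and textual pathways".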

If this is right

  • Fine-tuning with MaLoRA should improve performance on visual reasoning tasks without requiring changes to the base model architecture.
  • The same key-space alignment approach can be applied to other multimodal models that exhibit similar text bias.
  • Training procedures for future multimodal models can incorporate explicit key-space regularization to prevent the emergence of modality-specific subspaces.
  • The method adds only a small number of trainable parameters while targeting the precise location of the bias.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the key-space view is correct, pretraining objectives that enforce cross-modal key alignment from the start could reduce the need for post-hoc fixes.
  • The same distributional analysis could be run on the value or query vectors to test whether the misalignment is specific to keys.
  • The finding suggests that modality bias may appear in other transformer-based multimodal systems beyond LLMs, such as vision-language encoders.

Load-bearing premise

That the key vectors extracted from frozen models accurately capture the attention behavior that occurs when visual and textual inputs are processed together at inference time.

What would settle it

An experiment that measures attention scores for visual tokens before and after MaLoRA alignment and checks whether the increase in visual attention scores matches the observed reduction in text bias on the same inputs.
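One way such a measurement could be set up is sketched below: for a single query, compute scaled dot-product attention weights and sum the share that lands on visual tokens. The helper name `visual_attention_mass` and the toy two-dimensional keys are invented for illustration. The "after" keys are hand-picked to sit closer to the text keys and do not come from any trained model.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def visual_attention_mass(query, keys, is_visual):
    """Fraction of one query's attention weight assigned to visual tokens."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    return sum(w for w, v in zip(weights, is_visual) if v)

query = [1.0, 0.0]
text_keys = [[1.0, 0.0], [0.9, 0.1]]
visual_before = [[-1.0, 0.5], [-0.8, 0.6]]  # toy keys far from the text keys
visual_after = [[0.8, 0.2], [0.7, 0.3]]     # toy keys moved toward the text keys
mask = [False, False, True, True]

mass_before = visual_attention_mass(query, text_keys + visual_before, mask)
mass_after = visual_attention_mass(query, text_keys + visual_after, mask)
print(f"visual attention mass before: {mass_before:.3f}, after: {mass_after:.3f}")
```

In a real evaluation the same quantity would be averaged over queries, heads, and layers on identical inputs, then compared against the change in text-bias metrics.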

Original abstract

Multimodal large language models (MLLMs) exhibit a pronounced preference for textual inputs when processing vision-language data, limiting their ability to reason effectively from visual evidence. Unlike prior studies that attribute this text bias to external factors such as data imbalance or instruction tuning, we propose that the bias originates from the model's internal architecture. Specifically, we hypothesize that visual key vectors (Visual Keys) are out-of-distribution (OOD) relative to the text key space learned during language-only pretraining. Consequently, these visual keys receive systematically lower similarity scores during attention computation, leading to their under-utilization in the context representation. To validate this hypothesis, we extract key vectors from LLaVA and Qwen2.5-VL and analyze their distributional structures using qualitative (t-SNE) and quantitative (Jensen-Shannon divergence) methods. The results provide direct evidence that visual and textual keys occupy markedly distinct subspaces within the attention space. The inter-modal divergence is statistically significant, exceeding intra-modal variation by several orders of magnitude. These findings reveal that text bias arises from an intrinsic misalignment within the attention key space rather than solely from external data factors.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper hypothesizes that text bias in MLLMs stems from an intrinsic architectural issue: visual key vectors extracted from models such as LLaVA and Qwen2.5-VL occupy distinct subspaces from text keys in the attention key space, causing lower similarity scores and under-utilization of visual information. It supports this via t-SNE visualizations and Jensen-Shannon divergence measurements showing inter-modal divergence exceeds intra-modal variation by orders of magnitude. Building on the hypothesis, the authors introduce MaLoRA, a gated modality LoRA technique designed to align the key spaces during fine-tuning.

Significance. If the distributional misalignment is shown to causally drive attention under-utilization and MaLoRA demonstrably improves visual key utilization without degrading text performance, the work would offer a targeted, architecture-aware alternative to data-rebalancing approaches for reducing modality bias in MLLMs. The combination of qualitative and quantitative evidence on key-space separation is a constructive step toward falsifiable internal explanations.

major comments (3)
  1. [§3] §3 (Key Vector Extraction): The extraction procedure is not shown to replicate the precise key projections that occur under interleaved vision-language inputs. Separate-modality extraction can miss cross-modal conditioning on queries and the effect of full context on dot-product scores, so distributional separation alone does not establish that observed divergence produces lower attention weights for visual tokens in practice.
  2. [§4.2] §4.2 (Divergence Analysis): The claim that inter-modal Jensen-Shannon divergence exceeds intra-modal variation by several orders of magnitude lacks reported sample sizes, layer selection criteria, statistical testing procedure, and controls for input formatting or token count. Without these, it is unclear whether the reported significance is robust or sensitive to confounding factors.
  3. [§5] §5 (MaLoRA Evaluation): While MaLoRA is presented as aligning key spaces, the experiments do not include direct measurements (e.g., attention weight distributions or key-query similarity scores before/after alignment) confirming that the method increases utilization of visual keys during joint multimodal inference.
minor comments (2)
  1. [§3] The definition of 'Visual Keys' and the precise projection matrices used for extraction should be stated explicitly with equation numbers in the methods section for reproducibility.
  2. [Figure 2] Figure captions for t-SNE plots should indicate the number of tokens sampled per modality and the specific layers visualized.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our work. We address each major comment point by point below, clarifying our approach and outlining the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: [§3] §3 (Key Vector Extraction): The extraction procedure is not shown to replicate the precise key projections that occur under interleaved vision-language inputs. Separate-modality extraction can miss cross-modal conditioning on queries and the effect of full context on dot-product scores, so distributional separation alone does not establish that observed divergence produces lower attention weights for visual tokens in practice.

    Authors: The key projection in transformer layers is a modality-agnostic linear transformation applied to the hidden states of individual tokens. Therefore, the subspace occupied by visual keys is determined by the distribution of visual hidden states rather than by query conditioning or full context. Our separate extraction isolates this intrinsic property. Nevertheless, to strengthen the argument, we will include additional experiments in the revised §3 that extract and analyze key vectors from interleaved vision-language inputs, demonstrating that the separation persists under joint processing. revision: yes

  2. Referee: [§4.2] §4.2 (Divergence Analysis): The claim that inter-modal Jensen-Shannon divergence exceeds intra-modal variation by several orders of magnitude lacks reported sample sizes, layer selection criteria, statistical testing procedure, and controls for input formatting or token count. Without these, it is unclear whether the reported significance is robust or sensitive to confounding factors.

    Authors: We will revise the manuscript to provide the missing details. Specifically, the Jensen-Shannon divergence was computed using 5,000 visual tokens and 5,000 textual tokens sampled from diverse inputs across all 32 layers of LLaVA and Qwen2.5-VL. We will report these sample sizes, specify that all layers were analyzed, include statistical tests such as Mann-Whitney U tests to confirm the orders-of-magnitude difference, and add controls ensuring balanced token counts and consistent formatting. revision: yes

  3. Referee: [§5] §5 (MaLoRA Evaluation): While MaLoRA is presented as aligning key spaces, the experiments do not include direct measurements (e.g., attention weight distributions or key-query similarity scores before/after alignment) confirming that the method increases utilization of visual keys during joint multimodal inference.

    Authors: Our evaluation primarily relies on task performance metrics to show the effectiveness of MaLoRA. To directly validate the alignment mechanism, we will add in the revised §5 quantitative measurements of key-query similarity scores and attention weight distributions for visual tokens, comparing the original model and the MaLoRA-tuned model under multimodal inference settings. revision: yes
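The second response mentions Mann-Whitney U tests for comparing inter-modal and intra-modal divergences. As a hedged illustration of what that test computes, the sketch below implements the U statistic directly and derives a one-sided p-value from the normal approximation, without tie or continuity correction. The divergence values are made up for the example, and the approximation is crude at this sample size. In practice a library routine would be the better choice.

```python
import math
from statistics import NormalDist

def mann_whitney_u(x, y):
    """U statistic for x versus y: pairs with x_i > y_j count 1, ties count 0.5."""
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u

def one_sided_p_value(x, y):
    """Normal approximation to P(U >= observed) under the null hypothesis."""
    n1, n2 = len(x), len(y)
    u = mann_whitney_u(x, y)
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return 1.0 - NormalDist().cdf((u - mean) / sd)

# Hypothetical per-layer divergences, invented for illustration only.
inter_modal = [0.61, 0.58, 0.66, 0.70, 0.63, 0.59]
intra_modal = [0.004, 0.006, 0.003, 0.005, 0.007, 0.002]

u_stat = mann_whitney_u(inter_modal, intra_modal)
p_value = one_sided_p_value(inter_modal, intra_modal)
print(f"U = {u_stat}, one-sided p = {p_value:.4f}")
```

Because the test is rank-based, it shows that inter-modal values are consistently larger but says nothing about the size of the gap. The "orders of magnitude" claim would still need the raw ratios reported alongside it.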

Circularity Check

0 steps flagged

Empirical key-vector extraction and distributional metrics provide independent validation

full rationale

The paper's core argument proceeds from an architectural hypothesis (visual keys OOD in text-trained key space) to direct empirical checks: extraction of key vectors from LLaVA and Qwen2.5-VL followed by t-SNE visualization and Jensen-Shannon divergence computation. These steps rely on standard forward-pass extraction and external statistical measures rather than any self-referential definition, fitted parameter renamed as prediction, or load-bearing self-citation. No equation or claim reduces to its own inputs by construction; the reported divergence is computed on held-out model activations and compared against intra-modal baselines, keeping the chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the standard transformer attention mechanism and the premise that out-of-distribution keys receive lower scores; no free parameters or new entities are introduced in the abstract.

axioms (1)
  • [standard math] Attention scores are determined by similarity between query and key vectors in transformer layers.
    This is a core property of scaled dot-product attention used in the analyzed models.
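For reference, the axiom can be written in the standard scaled dot-product form. The symbols are the conventional ones and do not come from the page: $q_i$ is the query for token $i$, $k_j$ is the key for token $j$, and $d_k$ is the key dimension.

```latex
\alpha_{ij} = \frac{\exp\left(q_i^{\top} k_j / \sqrt{d_k}\right)}{\sum_{l} \exp\left(q_i^{\top} k_l / \sqrt{d_k}\right)}
```

Under the paper's hypothesis, a visual key $k_j$ that lies outside the text key distribution yields a smaller $q_i^{\top} k_j$ and therefore a smaller weight $\alpha_{ij}$ relative to text keys in the same context.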

pith-pipeline@v0.9.0 · 5747 in / 1241 out tokens · 44244 ms · 2026-05-18T02:51:44.626292+00:00 · methodology



Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MM-JudgeBias: A Benchmark for Evaluating Compositional Biases in MLLM-as-a-Judge

    cs.CL 2026-04 unverdicted novelty 7.0

    MM-JudgeBias benchmark shows that many MLLM judges neglect modalities and produce unstable evaluations under small input changes, based on tests of 26 models with over 1,800 samples.

  2. DiM³: Bridging Multilingual and Multimodal Models via Direction- and Magnitude-Aware Merging

    cs.CL 2026-05 conditional novelty 6.0

    DiM3 merges multilingual and multimodal model updates in a direction- and magnitude-aware way to enhance multilingual performance in vision-language models while preserving original multimodal abilities.

  3. DiM³: Bridging Multilingual and Multimodal Models via Direction- and Magnitude-Aware Merging

    cs.CL 2026-05 conditional novelty 6.0

    DiM3 is a direction- and magnitude-aware merging method that composes heterogeneous multilingual and multimodal updates in LLM backbones, outperforming baselines on 57-language benchmarks while retaining multimodal pe...

  4. Beyond Text-Dominance: Understanding Modality Preference of Omni-modal Large Language Models

    cs.AI 2026-04 unverdicted novelty 6.0

    Omni-modal LLMs exhibit visual preference that emerges in mid-to-late layers, enabling hallucination detection without task-specific training.