pith. machine review for the scientific record.

arXiv: 2604.16264 · v1 · submitted 2026-04-17 · cs.CV · cs.LG

Information Router for Mitigating Modality Dominance in Vision-Language Models

Pith reviewed 2026-05-10 08:36 UTC · model grok-4.3

classification: cs.CV · cs.LG
keywords: information · modality · model · MoIR · dominance · attention · models

The pith

MoIR mitigates modality dominance in VLMs by explicitly enriching low-information tokens with routed data from stronger modalities prior to LLM processing, yielding more balanced contributions and improved robustness under degradation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models combine pictures and text but often lean too heavily on whichever input feels stronger, such as ignoring a fuzzy image and trusting only the caption. MoIR scans the tokens coming from each modality, spots the ones that carry little useful signal, and pulls in missing pieces from the other modality to fill them out. This happens before the big language model sees the combined input, so the AI starts with richer, more even information. Experiments on standard benchmarks show the approach keeps performance steadier even when one input is deliberately degraded.
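
To make the mechanism concrete, here is a minimal sketch of what token-level routing of this kind could look like. The norm-based information score, the threshold tau, and the attention-based routing step are illustrative assumptions; the abstract does not specify the paper's actual scoring function or router.

    import torch
    import torch.nn.functional as F

    def route_information(tokens_a, tokens_b, tau=0.5):
        # Hypothetical MoIR-style routing (not the paper's actual code).
        # tokens_a: (N, d) tokens of the weaker / degraded modality
        # tokens_b: (M, d) tokens of the stronger modality
        # tau:      assumed "information density threshold" (a free parameter)

        # Placeholder information score: per-token feature norm, min-max
        # normalized to [0, 1]; the paper's score is not specified.
        score = tokens_a.norm(dim=-1)
        score = (score - score.min()) / (score.max() - score.min() + 1e-8)
        low_info = score < tau  # tokens flagged as less informative

        # Assumed routing step: pull complementary content from the stronger
        # modality via scaled dot-product attention over its tokens.
        d = tokens_a.shape[-1]
        attn = F.softmax(tokens_a @ tokens_b.T / d ** 0.5, dim=-1)  # (N, M)
        routed = attn @ tokens_b                                    # (N, d)

        # Enrich only the low-information tokens before fusion with the LLM.
        return torch.where(low_info.unsqueeze(-1), tokens_a + routed, tokens_a)

On this reading, a degraded image stream would reach the language model with its weakest tokens already augmented by text-derived content, which is the property the robustness claims rest on.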

Core claim

By modifying information availability, MoIR enables reliable shifts in modality dominance even when one modality is degraded. Experimental results show that MoIR consistently yields more balanced modality contributions and improves robustness and downstream performance, particularly under modality degradation.

Load-bearing premise

That less informative tokens can be accurately identified and that complementary information from the stronger modality is both available and safe to inject without adding noise or incorrect details that could mislead the downstream model.

Original abstract

Vision-language models (VLMs) have demonstrated strong performance across a wide range of benchmarks, yet they often suffer from modality dominance, where predictions rely disproportionately on a single modality. Prior approaches primarily address this issue by steering the model's attention allocation, implicitly assuming that all modalities provide sufficient information. However, attention only determines where the model focuses, and cannot enrich information that is missing or ambiguous. In the real world, input modalities often differ in information density and their signal-to-noise ratios. In such cases, simply adjusting the model's attention does not resolve the underlying lack of information. In this paper, we propose MoIR (Multi-modal Information Router), an information-level fusion method that explicitly reduces information disparity prior to fusion. MoIR identifies less informative tokens and routes complementary information from a stronger modality, constructing information-dense token representations before they are processed by a large language model. By modifying information availability, MoIR enables reliable shifts in modality dominance, even when one modality is degraded. We evaluate MoIR on three widely used multi-modal benchmarks across multiple model backbones. Experimental results show that MoIR consistently demonstrates more balanced modality contribution and improves robustness and downstream performance, particularly under modality degradation. These findings demonstrate that explicitly modifying cross-modal information is an effective and complementary strategy for mitigating modality dominance in multi-modal reasoning models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes MoIR (Multi-modal Information Router), an information-level fusion method for VLMs that identifies less informative tokens and routes complementary information from the stronger modality to construct denser token representations before LLM processing. It claims this explicitly reduces cross-modal information disparity, enabling reliable shifts in modality dominance even under degradation, with consistent improvements in balanced contribution, robustness, and downstream performance across three benchmarks and multiple backbones.

Significance. If the central claims hold after addressing validation gaps, the work introduces a useful complement to attention-steering approaches by operating at the information level rather than focus allocation. The multi-backbone evaluation and explicit degradation testing are strengths that could support broader adoption in robust multi-modal systems, provided the token-routing mechanism proves reliable and non-misleading.

major comments (3)
  1. [Abstract / Experiments] Abstract and experimental section: The reported consistent gains lack any specification of baselines, statistical tests (e.g., significance levels or variance across runs), or exact degradation protocols (e.g., noise type, intensity, or which modality is affected). This directly undermines verification of the robustness and balanced-contribution claims, as the performance lift could arise from confounding factors.
  2. [Method] Method description (likely §3): The identification of 'less informative tokens' and the routing of complementary information are described at a high level without the concrete scoring function, threshold computation (noted as the free parameter 'information density threshold'), or any precision/recall metrics on the detector. Under degradation this detector operates on noisy inputs, yet no error analysis or ablation isolates its contribution from incidental regularization effects.
  3. [Experiments] Experimental results: No controls or ablations are described that separate MoIR's information-routing effect from other fusion modifications, and no validation is provided that injected content from the stronger modality is accurate and non-hallucinated. This is load-bearing for the claim that MoIR 'enables reliable shifts in modality dominance' rather than merely regularizing the model.
minor comments (1)
  1. [Method] Notation for token representations and routing could be clarified with an explicit equation or diagram showing the pre- and post-MoIR token states; a hedged form of such an equation is sketched below.
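
One plausible way to write the requested pre- and post-routing token states, in notation we are supplying rather than the authors' (the score s, threshold tau, and routing operator R are all assumptions):

    % Hypothetical pre-/post-MoIR token states (not taken from the paper).
    % h_i: token i of the weaker modality; H': the stronger modality's tokens;
    % s(.): an assumed per-token information score; \tau: the density threshold;
    % R(., .): the routing operator injecting complementary content.
    \[
      \tilde{h}_i =
      \begin{cases}
        h_i + R(h_i, H'), & s(h_i) < \tau \\
        h_i,              & s(h_i) \ge \tau
      \end{cases}
    \]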

Circularity Check

0 steps flagged

No circularity: method defined and evaluated independently

full rationale

The paper defines MoIR as a constructive information-level fusion procedure that first identifies less-informative tokens and then routes complementary content from the stronger modality before LLM processing. This construction is presented as an explicit design choice, not derived from any equation or prior result that reduces to the same inputs. Evaluation occurs on external public benchmarks with reported robustness gains under degradation; no fitted parameters are relabeled as predictions, no self-citation chain supplies the central uniqueness claim, and no ansatz or renaming of known patterns is invoked to close the derivation. The reported improvements therefore remain independent of the method's own definitions.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 1 invented entity

The approach assumes the existence of measurable information density per token and the availability of useful complementary signals in the stronger modality; these are not derived from first principles but introduced as operational definitions for the router.

free parameters (1)
  • information density threshold
    Used to decide which tokens are less informative and require routing; its value is not specified in the abstract and would need fitting or tuning on data (a hedged calibration sketch follows this ledger).
invented entities (1)
  • Multi-modal Information Router (MoIR) · no independent evidence
    purpose: To identify low-information tokens and route complementary data from stronger modalities
    New architectural component introduced to modify information availability at the token level.
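
For illustration only, one assumed way to set such a threshold is to calibrate it on held-out tokens so that a chosen fraction is treated as informative; the score and the percentile rule below are placeholders, not the paper's procedure.

    import torch

    def calibrate_threshold(val_tokens, keep_fraction=0.7):
        # Assumed calibration for the "information density threshold":
        # pick tau so that roughly keep_fraction of validation tokens are
        # treated as informative and the rest become routing targets.
        scores = val_tokens.norm(dim=-1)  # placeholder score, as in the sketch above
        scores = (scores - scores.min()) / (scores.max() - scores.min() + 1e-8)
        tau = torch.quantile(scores, 1.0 - keep_fraction)
        return tau.item()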

pith-pipeline@v0.9.0 · 5555 in / 1182 out tokens · 27922 ms · 2026-05-10T08:36:46.154364+00:00 · methodology

Reference graph

Works this paper leans on

26 extracted references · 9 canonical work pages · 2 internal anchors

  1. Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al., "InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks," in CVPR, 2024, pp. 24185–24198.

  2. Wonjae Kim, Bokyung Son, and Ildoo Kim, "ViLT: Vision-and-language transformer without convolution or region supervision," in International Conference on Machine Learning. PMLR, 2021, pp. 5583–5594.

  3. Mong Yuan Sim, Wei Emma Zhang, Xiang Dai, and Biaoyan Fang, "Can VLMs actually see and read? A survey on modality collapse in vision-language models," in Findings of the Association for Computational Linguistics: ACL 2025, 2025, pp. 24452–24470.

  4. Jiayang Wu, Wensheng Gan, Zefeng Chen, Shicheng Wan, and Philip S Yu, "Multimodal large language models: A survey," in BigData. IEEE, 2023, pp. 2247–2256.

  5. Wujian Peng, Sicheng Xie, Zuyao You, Shiyi Lan, and Zuxuan Wu, "Synthesize, diagnose, and optimize: Towards fine-grained vision-language understanding," in CVPR, 2024, pp. 13279–13288.

  6. Weitong Cai, Jiabo Huang, Shaogang Gong, Hailin Jin, and Yang Liu, "MLLM as video narrator: Mitigating modality imbalance in video moment retrieval," Pattern Recognition, vol. 166, pp. 111670, 2025.

  7. Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan, "Learn to explain: Multimodal reasoning via thought chains for science question answering," NeurIPS, vol. 35, pp. 2507–2521, 2022.

  8. Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee, "Improved baselines with visual instruction tuning," in CVPR, 2024, pp. 26296–26306.

  9. Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al., "Qwen2.5-VL technical report," arXiv preprint arXiv:2502.13923, 2025.

  10. Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al., "LoRA: Low-rank adaptation of large language models," ICLR, vol. 1, no. 2, pp. 3, 2022.

  11. Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham, "VizWiz grand challenge: Answering visual questions from blind people," in CVPR, 2018, pp. 3608–3617.

  12. Xinyu Fang, Kangrui Mao, Haodong Duan, Xiangyu Zhao, Yining Li, Dahua Lin, and Kai Chen, "MMBench-Video: A long-form multi-shot benchmark for holistic video understanding," NeurIPS, vol. 37, pp. 89098–89124, 2024.

  13. Huyu Wu, Meng Tang, Xinhan Zheng, and Haiyun Jiang, "When language overrules: Revealing text dominance in multimodal large language models," arXiv preprint arXiv:2508.10552, 2025.

  14. Adrián Javaloy, Maryam Meghdadi, and Isabel Valera, "Mitigating modality collapse in multimodal VAEs via impartial optimization," in International Conference on Machine Learning. PMLR, 2022, pp. 9938–9964.

  15. Jean Park, Kuk Jin Jang, Basam Alasaly, Sriharsha Mopidevi, Andrew Zolensky, Eric Eaton, Insup Lee, and Kevin Johnson, "Assessing modality bias in video question answering benchmarks with multimodal large language models," in AAAI, 2025, vol. 39, pp. 19821–19829.

  16. Kiran Kokilepersaud, Seulgi Kim, Mohit Prabhushankar, and Ghassan AlRegib, "HEX: Hierarchical emergence exploitation in self-supervised algorithms," arXiv preprint arXiv:2410.23200, 2024.

  17. Seulgi Kim, Kiran Kokilepersaud, Mohit Prabhushankar, and Ghassan AlRegib, "Countering multimodal representation collapse through rank-targeted fusion," arXiv preprint arXiv:2511.06450, 2025.

  18. Kiran Kokilepersaud, Mohit Prabhushankar, and Ghassan AlRegib, "AdaDim: Dimensionality adaptation for SSL representational dynamics," arXiv preprint arXiv:2505.12576, 2025.

  19. Cem Akkus, Luyang Chu, Vladana Djakovic, Steffen Jauch-Walser, Philipp Koch, Giacomo Loss, Christopher Marquardt, Marco Moldovan, Nadja Sauter, Maximilian Schneider, et al., "Multimodal deep learning," arXiv preprint arXiv:2301.04856, 2023.

  20. Ghazal Kaviani, Yavuz Yarici, Seulgi Kim, Mohit Prabhushankar, Ghassan AlRegib, Mashhour Solh, and Ameya Patil, "Hierarchical and multimodal data for daily activity understanding," arXiv preprint arXiv:2504.17696, 2025.

  21. Seulgi Kim, Ghazal Kaviani, Mohit Prabhushankar, and Ghassan AlRegib, "Multi-level and multi-modal action anticipation," arXiv preprint arXiv:2506.02382, 2025.

  22. Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency, "Multimodal machine learning: A survey and taxonomy," TPAMI, vol. 41, no. 2, pp. 423–443, 2018.

  23. Yuwei Zhou, Xin Wang, Hong Chen, Xuguang Duan, and Wenwu Zhu, "Intra- and inter-modal curriculum for multimodal learning," in Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 3724–3735.

  24. James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al., "Overcoming catastrophic forgetting in neural networks," Proceedings of the National Academy of Sciences, vol. 114, no. 13, pp. 3521–3526, 2017.

  25. Ilya Loshchilov and Frank Hutter, "Decoupled weight decay regularization," arXiv preprint arXiv:1711.05101, 2017.

  26. Olivier Roy and Martin Vetterli, "The effective rank: A measure of effective dimensionality," in 2007 15th European Signal Processing Conference. IEEE, 2007, pp. 606–610.