Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models
Pith reviewed 2026-05-16 08:15 UTC · model grok-4.3
The pith
ReAlign aligns text embeddings to the image embedding distribution through a training-free, three-step process driven by unpaired-data statistics, letting MLLMs pretrain on unpaired text instead of large-scale paired image-text data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The modality gap decomposes inside a frozen reference frame into stable biases and anisotropic residuals; ReAlign then uses massive unpaired statistics to perform Anchor, Trace, and Centroid Alignment, moving text representations into the image distribution so that unpaired text can replace paired image-text data during MLLM pretraining.
What carries the argument
The Fixed-frame Modality Gap Theory, which splits the gap into stable biases and anisotropic residuals, and the three-step ReAlign procedure (Anchor, Trace, Centroid Alignment) that applies those statistics to shift text embeddings.
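To make the three-step idea concrete, here is a minimal sketch in plain Python. The page does not give the paper's equations, so the reading used here is an assumption: "Anchor" is taken as a mean shift, "Trace" as a rescaling that matches total variance (the trace of the covariance), and "Centroid" as a correction of the centroid drift introduced by renormalization. The function names (`realign_sketch`, `toy_embeddings`) and the synthetic data are hypothetical.

```python
import math
import random


def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n, d = len(vectors), len(vectors[0])
    return [sum(v[j] for v in vectors) / n for j in range(d)]


def total_variance(vectors, mu):
    """Trace of the sample covariance: mean squared distance to the mean."""
    return sum(sum((x - m) ** 2 for x, m in zip(v, mu)) for v in vectors) / len(vectors)


def normalize(v):
    """Project a vector onto the unit sphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def realign_sketch(text, image):
    """Hypothetical three-step moment matching (not the paper's exact method)."""
    mu_t, mu_i = mean(text), mean(image)
    # Step 1 (assumed "Anchor"): remove the mean offset between modalities.
    # Step 2 (assumed "Trace"): rescale residuals so total variance matches.
    scale = math.sqrt(total_variance(image, mu_i) / total_variance(text, mu_t))
    moved = [[mi + scale * (x - mt) for x, mt, mi in zip(v, mu_t, mu_i)] for v in text]
    # Step 3 (assumed "Centroid"): renormalize, then undo the centroid drift
    # that renormalization introduces.
    on_sphere = [normalize(v) for v in moved]
    drift = [a - b for a, b in zip(mu_i, mean(on_sphere))]
    return [[x + d for x, d in zip(v, drift)] for v in on_sphere]


def toy_embeddings(center, n, rng):
    """Unit-norm Gaussian cloud around `center` (synthetic stand-in data)."""
    return [normalize([c + rng.gauss(0.0, 0.2) for c in center]) for _ in range(n)]


rng = random.Random(0)
image = toy_embeddings([1.0, 0.0, 0.0, 0.0], 200, rng)
text = toy_embeddings([0.0, 1.0, 0.0, 0.0], 200, rng)
aligned = realign_sketch(text, image)
gap_before = math.dist(mean(text), mean(image))
gap_after = math.dist(mean(aligned), mean(image))
print(f"centroid gap before: {gap_before:.3f}, after: {gap_after:.2e}")
```

Note that the two sets are never paired: only their means and total variances are used. The final centroid correction leaves the vectors slightly off the unit sphere, which is a simplification of this sketch rather than a claim about ReAlign.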
Load-bearing premise
Statistics drawn from unpaired text and image sets accurately capture the target image distribution once the reference frame is held fixed.
What would settle it
Train two otherwise identical MLLMs—one with ReAlign on unpaired text, one with standard paired data—then compare zero-shot visual reasoning accuracy; a large and consistent gap in favor of the paired version would falsify the substitution claim.
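One way to judge whether such a gap is "large and consistent" is a paired bootstrap over per-example correctness on a shared benchmark. The sketch below is an illustration only: the function name `paired_bootstrap_diff`, the 95% interval, and the synthetic correctness vectors are all assumptions, not part of the reviewed paper.

```python
import random


def paired_bootstrap_diff(correct_a, correct_b, n_resamples=2000, seed=0):
    """Return (accuracy of A minus accuracy of B, 95% bootstrap interval).

    correct_a / correct_b hold 0/1 correctness for the same examples,
    in the same order, for two models.
    """
    assert len(correct_a) == len(correct_b)
    rng = random.Random(seed)
    n = len(correct_a)
    observed = (sum(correct_a) - sum(correct_b)) / n
    diffs = []
    for _ in range(n_resamples):
        # Resample example indices with replacement, keeping the pairing.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int(0.025 * n_resamples)]
    hi = diffs[int(0.975 * n_resamples) - 1]
    return observed, (lo, hi)


# Synthetic stand-in results: model A trained on paired data,
# model B trained with ReAlign on unpaired text.
paired_model = [1] * 80 + [0] * 20
realign_model = [1] * 60 + [0] * 40
observed, (lo, hi) = paired_bootstrap_diff(paired_model, realign_model)
print(f"accuracy difference: {observed:.3f}, 95% interval: [{lo:.3f}, {hi:.3f}]")
```

An interval that stays well above zero across several benchmarks would count against the substitution claim, while an interval straddling zero would be consistent with it.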
Original abstract
Despite the success of multimodal contrastive learning in aligning visual and linguistic representations, a persistent geometric anomaly, the Modality Gap, remains: embeddings of distinct modalities expressing identical semantics occupy systematically offset regions. Prior approaches to bridge this gap are largely limited by oversimplified isotropic assumptions, hindering their application in large-scale scenarios. In this paper, we address these limitations by precisely characterizing the geometric shape of the modality gap and leveraging it for efficient model scaling. First, we propose the Fixed-frame Modality Gap Theory, which decomposes the modality gap within a frozen reference frame into stable biases and anisotropic residuals. Guided by this precise modeling, we introduce ReAlign, a training-free modality alignment strategy. Utilizing statistics from massive unpaired data, ReAlign aligns text representation into the image representation distribution via a three-step process comprising Anchor, Trace, and Centroid Alignment, thereby explicitly rectifying geometric misalignment. Building on ReAlign, we propose ReVision, a scalable training paradigm for Multimodal Large Language Models (MLLMs). ReVision integrates ReAlign into the pretraining stage, enabling the model to learn the distribution of visual representations from unpaired text before visual instruction tuning, without the need for large-scale, high-quality image-text pairs. Our framework demonstrates that statistically aligned unpaired data can effectively substitute for expensive image-text pairs, offering a robust path for the efficient scaling of MLLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that the Fixed-frame Modality Gap Theory decomposes the modality gap in a frozen reference frame into stable biases plus anisotropic residuals; this decomposition guides ReAlign, a training-free three-step procedure (Anchor, Trace, Centroid Alignment) that uses first- and second-order statistics from massive unpaired corpora to map text embeddings onto the image distribution. ReAlign is then embedded in the ReVision pretraining paradigm, allowing MLLMs to learn visual representations from unpaired text before instruction tuning and thereby substituting for large-scale paired image-text data.
Significance. If the geometric modeling and substitution claim hold, the work would be significant for efficient MLLM scaling: it offers a concrete mechanism to leverage abundant unpaired text in place of expensive paired data, potentially lowering pretraining costs while preserving alignment quality. The shift from isotropic to anisotropic residual modeling could also inform subsequent embedding-alignment research.
major comments (2)
- [§3] §3 (Fixed-frame Modality Gap Theory): the central substitution claim—that unpaired-text moments accurately proxy the target image distribution inside the frozen frame—receives no quantitative validation, ablation, or held-out benchmark; without such evidence the three-step transform may map text outside the true image manifold when semantic coverage or marginals differ.
- [§4] §4 (ReAlign procedure): the Anchor/Trace/Centroid steps are defined using statistics computed from the same unpaired corpora later used for training; the manuscript supplies no external validation set or independence test to demonstrate that these statistics remain unbiased with respect to the downstream MLLM task.
minor comments (2)
- [§4] Notation for the three alignment steps should be accompanied by explicit equations showing how the bias vector, residual covariance, and centroid shift are computed from the unpaired statistics.
- [Abstract] The abstract states the method but reports no quantitative results, ablation tables, or error analysis; adding these in the experimental section would strengthen readability.
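As a hedged illustration of the kind of explicit equations the first minor comment asks for, the block below writes out plain sample-moment estimators. The notation is assumed, not taken from the manuscript: $x_i$ are text embeddings, $y_j$ are image embeddings from an unpaired set, and $\hat{x}_i$ denotes a text embedding after the earlier alignment steps.

```latex
% Assumed notation: x_1..x_N are text embeddings and y_1..y_M are image
% embeddings drawn from unpaired sets in the same frozen encoder frame.
\begin{align}
\mu_T &= \frac{1}{N}\sum_{i=1}^{N} x_i, &
\mu_I &= \frac{1}{M}\sum_{j=1}^{M} y_j, \\
b &= \mu_I - \mu_T, \\  % bias vector
\Sigma_T &= \frac{1}{N}\sum_{i=1}^{N} (x_i-\mu_T)(x_i-\mu_T)^{\top}, &
\Sigma_I &= \frac{1}{M}\sum_{j=1}^{M} (y_j-\mu_I)(y_j-\mu_I)^{\top}, \\
c &= \mu_I - \frac{1}{N}\sum_{i=1}^{N} \hat{x}_i  % centroid shift
\end{align}
```

None of these quantities requires paired samples, which is the property the substitution claim relies on. Whether the manuscript's residual covariance is defined per modality, as here, or on paired gaps is not stated on this page.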
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The concerns about empirical validation of the substitution claim and statistical independence are important, and we address them point by point below. We commit to adding the requested quantitative evidence and tests in the revised manuscript.
-
Referee: [§3] §3 (Fixed-frame Modality Gap Theory): the central substitution claim—that unpaired-text moments accurately proxy the target image distribution inside the frozen frame—receives no quantitative validation, ablation, or held-out benchmark; without such evidence the three-step transform may map text outside the true image manifold when semantic coverage or marginals differ.
Authors: We acknowledge that the manuscript lacks direct quantitative validation of the proxy assumption. While Section 5 reports downstream MLLM performance gains when substituting paired data with ReAlign-aligned unpaired text, we agree this does not explicitly measure manifold adherence or distribution fidelity. In the revision we will add a dedicated ablation subsection using held-out image-text pairs, reporting metrics such as Wasserstein distance and maximum mean discrepancy between aligned text and image embeddings, plus coverage tests that vary semantic marginals to verify the three-step transform remains inside the image manifold.
Revision: yes
-
Referee: [§4] §4 (ReAlign procedure): the Anchor/Trace/Centroid steps are defined using statistics computed from the same unpaired corpora later used for training; the manuscript supplies no external validation set or independence test to demonstrate that these statistics remain unbiased with respect to the downstream MLLM task.
Authors: We clarify that the statistics are computed from large-scale, general-purpose unpaired corpora that are disjoint from the specific instruction-tuning datasets used in downstream tasks. Nevertheless, we agree an explicit independence test is absent. In the revision we will introduce an external validation protocol: moments will be recomputed on a held-out disjoint subset of the corpora, and we will report the resulting change (or lack thereof) in downstream MLLM task performance to demonstrate that the Anchor/Trace/Centroid statistics remain unbiased.
Revision: yes
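The first response names maximum mean discrepancy (MMD) as one of the planned fidelity metrics. As a rough sketch of what such a measurement involves, the snippet below computes a biased (V-statistic) estimate of squared MMD with an RBF kernel in plain Python. The kernel choice, the bandwidth, and the synthetic Gaussian samples are assumptions for illustration; the authors do not specify their estimator.

```python
import math
import random


def rbf(x, y, bandwidth):
    """Gaussian (RBF) kernel between two vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between two samples."""
    def avg_kernel(a, b):
        return sum(rbf(u, v, bandwidth) for u in a for v in b) / (len(a) * len(b))
    return avg_kernel(xs, xs) + avg_kernel(ys, ys) - 2.0 * avg_kernel(xs, ys)


def gaussian_cloud(center, n, rng):
    """Synthetic stand-in for a set of embeddings."""
    return [[c + rng.gauss(0.0, 1.0) for c in center] for _ in range(n)]


rng = random.Random(0)
image_like = gaussian_cloud([0.0, 0.0], 100, rng)
well_aligned = gaussian_cloud([0.0, 0.0], 100, rng)   # same distribution
misaligned = gaussian_cloud([3.0, 3.0], 100, rng)     # shifted distribution

print("MMD^2, well aligned:", mmd_squared(image_like, well_aligned))
print("MMD^2, misaligned:  ", mmd_squared(image_like, misaligned))
```

In a real ablation the two inputs would be held-out image embeddings and ReAlign-processed text embeddings. The double loop is quadratic in sample size, so a practical version would subsample or vectorize.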
Circularity Check
No significant circularity in the derivation chain
The paper's claimed chain introduces the Fixed-frame Modality Gap Theory as an explicit decomposition of the gap into stable biases plus anisotropic residuals inside a frozen reference frame, then defines ReAlign as a three-step (Anchor/Trace/Centroid) procedure that computes alignment transforms from unpaired-data moments, and finally integrates the result into ReVision pretraining. None of these steps reduces by construction to its inputs: the decomposition is presented as a modeling choice, the alignment statistics are computed externally from unpaired corpora, and the substitution claim is a downstream consequence rather than a definitional identity. No self-citations, fitted parameters renamed as predictions, or ansatzes smuggled via prior work appear in the provided derivation. The central result therefore remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The modality gap can be decomposed into stable biases and anisotropic residuals within a frozen reference frame.
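One way to write this assumption down, offered as a sketch under assumed notation rather than the paper's own formulation, is to treat the gap for a semantically matched text/image pair $(x_i, y_i)$ as a constant bias plus a zero-mean residual whose covariance is not a multiple of the identity:

```latex
% Assumed notation: (x_i, y_i) is a semantically matched text/image embedding
% pair in the frozen frame; g_i is the per-pair gap.
\begin{align}
g_i &= y_i - x_i = b + r_i, \\
b &= \mathbb{E}[g_i], \qquad \mathbb{E}[r_i] = 0, \\
\operatorname{Cov}(r_i) &= \Sigma_r, \qquad \Sigma_r \neq \sigma^2 I \ \text{in general}.
\end{align}
```

The isotropic assumption criticized in the abstract corresponds to the special case $\Sigma_r = \sigma^2 I$.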
Forward citations
Cited by 4 Pith papers
- UniCVR: From Alignment to Reranking for Unified Zero-Shot Composed Visual Retrieval
  UniCVR is the first unified zero-shot framework that handles composed image, multi-turn image, and video retrieval by MLLM-VLP alignment plus dual-level reranking.
- Anisotropic Modality Align
  Modality representations share dominant semantic geometry but have an anisotropic residual gap; AnisoAlign corrects source representations boundedly using target geometry for unpaired alignment.
- When Language Overwrites Vision: Over-Alignment and Geometric Debiasing in Vision-Language Models
  Decoder-based VLMs over-align visual features to a universal text subspace, injecting linguistic bias; projecting out its top principal components reduces hallucinations on POPE, CHAIR, AMBER and improves long-form ca...
- When Language Overwrites Vision: Over-Alignment and Geometric Debiasing in Vision-Language Models
  Decoder-based VLMs hallucinate due to geometric over-alignment of visual embeddings with the text manifold in a universal dataset-agnostic subspace, mitigated by projecting out the linguistic bias.
Appendix A: Modality Gap Phenomenon Essential Causes
This appendix section explains why a modality gap exists at all in the setting studied in Section 3. We emphasize structural necessity: the modality...