pith. machine review for the scientific record.

arxiv: 2602.07026 · v2 · submitted 2026-02-02 · 💻 cs.CV · cs.AI · cs.MM

Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models

Pith reviewed 2026-05-16 08:15 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.MM
keywords modality gap · subspace alignment · multimodal large language models · unpaired data · pretraining · geometric misalignment · ReAlign · ReVision

The pith

ReAlign aligns text embeddings to image distributions via a training-free three-step process using unpaired data, letting MLLMs pretrain without paired image-text examples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that image and text embeddings for the same meaning sit in systematically offset regions, and this offset can be broken into fixed biases plus direction-dependent residuals inside a locked reference frame. From that breakdown the authors derive ReAlign, which shifts text points into the image cloud by computing three adjustments—anchor point, trace direction, and centroid offset—from large amounts of unpaired text and image statistics. Once aligned, the text alone supplies the visual distribution the model needs during pretraining, after which ordinary instruction tuning finishes the job. If the claim holds, the expensive step of collecting matched image-text pairs can be replaced by abundant separate text corpora, lowering the data cost of scaling multimodal models.

Core claim

The modality gap decomposes inside a frozen reference frame into stable biases and anisotropic residuals; ReAlign then uses massive unpaired statistics to perform Anchor, Trace, and Centroid Alignment, moving text representations into the image distribution so that unpaired text can replace paired image-text data during MLLM pretraining.

What carries the argument

The Fixed-frame Modality Gap Theory, which splits the gap into stable biases and anisotropic residuals, and the three-step ReAlign procedure (Anchor, Trace, Centroid Alignment) that applies those statistics to shift text embeddings.
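
The three-step idea can be illustrated with a minimal moment-matching sketch. The review does not reproduce the paper's equations, so the step definitions below are assumptions made only for illustration: "Anchor" is read as removing the text mean, "Trace" as rescaling so the total variance (the trace of the covariance) matches the image cloud, and "Centroid" as moving the result onto the image centroid. The function name `realign_sketch` is hypothetical. Note that the two inputs need not have the same number of rows, which is the point of using unpaired statistics.

```python
import numpy as np

def realign_sketch(text_emb, image_emb):
    """Moment-matching sketch. Step definitions are assumptions, not the paper's equations."""
    mu_t = text_emb.mean(axis=0)   # first-order statistic of the text set
    mu_i = image_emb.mean(axis=0)  # first-order statistic of the image set

    # Step 1 (assumed "Anchor"): remove the text mean so the cloud sits at the origin.
    centered = text_emb - mu_t

    # Step 2 (assumed "Trace"): rescale so the total variance matches the image cloud.
    trace_t = centered.var(axis=0).sum()
    trace_i = (image_emb - mu_i).var(axis=0).sum()
    scaled = centered * np.sqrt(trace_i / trace_t)

    # Step 3 (assumed "Centroid"): translate the rescaled cloud onto the image centroid.
    return scaled + mu_i

# Unpaired toy data: different sample counts, different means and spreads.
rng = np.random.default_rng(0)
text = rng.normal(loc=2.0, scale=0.5, size=(300, 8))
image = rng.normal(loc=-1.0, scale=2.0, size=(500, 8))
aligned = realign_sketch(text, image)
```

A single global scale is isotropic by construction, so this sketch does not capture the direction-dependent residuals the paper emphasizes; it only shows how first- and second-order statistics from separate corpora can drive a training-free transform.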

Load-bearing premise

Statistics drawn from unpaired text and image sets accurately capture the target image distribution once the reference frame is held fixed.

What would settle it

Train two otherwise identical MLLMs—one with ReAlign on unpaired text, one with standard paired data—then compare zero-shot visual reasoning accuracy; a large and consistent gap in favor of the paired version would falsify the substitution claim.
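
One way to make "large and consistent gap" concrete is a paired bootstrap over per-item correctness on a shared benchmark. The sketch below is an assumption about how such a comparison could be scored, not a protocol from the paper; the function name `paired_bootstrap_gap` and the 0/1 correctness inputs are hypothetical.

```python
import numpy as np

def paired_bootstrap_gap(correct_paired, correct_realign, n_boot=2000, seed=0):
    """Accuracy gap (paired minus ReAlign) with a 95% bootstrap interval.

    Both inputs are 0/1 correctness flags for the same benchmark items, in the same order.
    """
    rng = np.random.default_rng(seed)
    a = np.asarray(correct_paired, dtype=float)
    b = np.asarray(correct_realign, dtype=float)
    diffs = a - b                                  # per-item difference in correctness
    n = len(diffs)
    idx = rng.integers(0, n, size=(n_boot, n))     # resample items with replacement
    boot = diffs[idx].mean(axis=1)                 # gap in each bootstrap replicate
    lo, hi = np.percentile(boot, [2.5, 97.5])
    return diffs.mean(), lo, hi
```

An interval that sits well above zero across several benchmarks would count against the substitution claim; an interval that straddles zero would be consistent with it.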

Original abstract

Despite the success of multimodal contrastive learning in aligning visual and linguistic representations, a persistent geometric anomaly, the Modality Gap, remains: embeddings of distinct modalities expressing identical semantics occupy systematically offset regions. Prior approaches to bridge this gap are largely limited by oversimplified isotropic assumptions, hindering their application in large-scale scenarios. In this paper, we address these limitations by precisely characterizing the geometric shape of the modality gap and leveraging it for efficient model scaling. First, we propose the Fixed-frame Modality Gap Theory, which decomposes the modality gap within a frozen reference frame into stable biases and anisotropic residuals. Guided by this precise modeling, we introduce ReAlign, a training-free modality alignment strategy. Utilizing statistics from massive unpaired data, ReAlign aligns text representation into the image representation distribution via a three-step process comprising Anchor, Trace, and Centroid Alignment, thereby explicitly rectifying geometric misalignment. Building on ReAlign, we propose ReVision, a scalable training paradigm for Multimodal Large Language Models (MLLMs). ReVision integrates ReAlign into the pretraining stage, enabling the model to learn the distribution of visual representations from unpaired text before visual instruction tuning, without the need for large-scale, high-quality image-text pairs. Our framework demonstrates that statistically aligned unpaired data can effectively substitute for expensive image-text pairs, offering a robust path for the efficient scaling of MLLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that the Fixed-frame Modality Gap Theory decomposes the modality gap in a frozen reference frame into stable biases plus anisotropic residuals; this decomposition guides ReAlign, a training-free three-step procedure (Anchor, Trace, Centroid Alignment) that uses first- and second-order statistics from massive unpaired corpora to map text embeddings onto the image distribution. ReAlign is then embedded in the ReVision pretraining paradigm, allowing MLLMs to learn visual representations from unpaired text before instruction tuning and thereby substituting for large-scale paired image-text data.

Significance. If the geometric modeling and substitution claim hold, the work would be significant for efficient MLLM scaling: it offers a concrete mechanism to leverage abundant unpaired text in place of expensive paired data, potentially lowering pretraining costs while preserving alignment quality. The shift from isotropic to anisotropic residual modeling could also inform subsequent embedding-alignment research.

major comments (2)
  1. [§3] §3 (Fixed-frame Modality Gap Theory): the central substitution claim—that unpaired-text moments accurately proxy the target image distribution inside the frozen frame—receives no quantitative validation, ablation, or held-out benchmark; without such evidence the three-step transform may map text outside the true image manifold when semantic coverage or marginals differ.
  2. [§4] §4 (ReAlign procedure): the Anchor/Trace/Centroid steps are defined using statistics computed from the same unpaired corpora later used for training; the manuscript supplies no external validation set or independence test to demonstrate that these statistics remain unbiased with respect to the downstream MLLM task.
minor comments (2)
  1. [§4] Notation for the three alignment steps should be accompanied by explicit equations showing how the bias vector, residual covariance, and centroid shift are computed from the unpaired statistics.
  2. [Abstract] The abstract states the method but reports no quantitative results, ablation tables, or error analysis; adding these in the experimental section would strengthen readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The concerns about empirical validation of the substitution claim and statistical independence are important, and we address them point by point below. We commit to adding the requested quantitative evidence and tests in the revised manuscript.

Point-by-point responses
  1. Referee: [§3] §3 (Fixed-frame Modality Gap Theory): the central substitution claim—that unpaired-text moments accurately proxy the target image distribution inside the frozen frame—receives no quantitative validation, ablation, or held-out benchmark; without such evidence the three-step transform may map text outside the true image manifold when semantic coverage or marginals differ.

    Authors: We acknowledge that the manuscript lacks direct quantitative validation of the proxy assumption. While Section 5 reports downstream MLLM performance gains when substituting paired data with ReAlign-aligned unpaired text, we agree this does not explicitly measure manifold adherence or distribution fidelity. In the revision we will add a dedicated ablation subsection using held-out image-text pairs, reporting metrics such as Wasserstein distance and maximum mean discrepancy between aligned text and image embeddings, plus coverage tests that vary semantic marginals to verify the three-step transform remains inside the image manifold. revision: yes

  2. Referee: [§4] §4 (ReAlign procedure): the Anchor/Trace/Centroid steps are defined using statistics computed from the same unpaired corpora later used for training; the manuscript supplies no external validation set or independence test to demonstrate that these statistics remain unbiased with respect to the downstream MLLM task.

    Authors: We clarify that the statistics are computed from large-scale, general-purpose unpaired corpora that are disjoint from the specific instruction-tuning datasets used in downstream tasks. Nevertheless, we agree an explicit independence test is absent. In the revision we will introduce an external validation protocol: moments will be recomputed on a held-out disjoint subset of the corpora, and we will report the resulting change (or lack thereof) in downstream MLLM task performance to demonstrate that the Anchor/Trace/Centroid statistics remain unbiased. revision: yes
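
The two checks promised in the responses can be sketched in a few lines. The first function is a standard biased (V-statistic) estimate of squared maximum mean discrepancy with an RBF kernel, one of the metrics named in the first response. The second is a simple split-half comparison of means, offered as an assumption about what "recomputing moments on a held-out disjoint subset" might look like. The function names, the bandwidth default, and the toy data are hypothetical.

```python
import numpy as np

def rbf_mmd2(x, y, bandwidth=1.0):
    """Biased estimate of squared MMD between samples x and y under an RBF kernel."""
    def sq_dists(a, b):
        d = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * a @ b.T
        return np.maximum(d, 0.0)  # clip tiny negatives caused by rounding
    gamma = 1.0 / (2.0 * bandwidth ** 2)
    k_xx = np.exp(-gamma * sq_dists(x, x))
    k_yy = np.exp(-gamma * sq_dists(y, y))
    k_xy = np.exp(-gamma * sq_dists(x, y))
    return k_xx.mean() + k_yy.mean() - 2.0 * k_xy.mean()

def split_half_mean_drift(emb, seed=0):
    """Norm of the difference between the means of two disjoint random halves of emb."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(emb))
    half = len(emb) // 2
    first, second = emb[perm[:half]], emb[perm[half:]]
    return np.linalg.norm(first.mean(axis=0) - second.mean(axis=0))

# Toy embeddings standing in for aligned text (x) and image (y, z) features.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 5))
y = rng.normal(size=(150, 5))   # same distribution as x
z = y + 3.0                     # shifted distribution
```

On real embeddings the kernel bandwidth matters; a common heuristic is the median pairwise distance, which is not implemented here. A small split-half drift suggests the first-order statistic is stable across subsets, but it says nothing about downstream task bias, which still needs the end-to-end comparison the authors describe.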

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

Full rationale

The paper's claimed chain introduces the Fixed-frame Modality Gap Theory as an explicit decomposition of the gap into stable biases plus anisotropic residuals inside a frozen reference frame, then defines ReAlign as a three-step (Anchor/Trace/Centroid) procedure that computes alignment transforms from unpaired-data moments, and finally integrates the result into ReVision pretraining. None of these steps reduces by construction to its inputs: the decomposition is presented as a modeling choice, the alignment statistics are computed externally from unpaired corpora, and the substitution claim is a downstream consequence rather than a definitional identity. No self-citations, fitted parameters renamed as predictions, or ansatzes smuggled via prior work appear in the provided derivation. The central result therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unproven assumption that the modality gap admits a stable bias-plus-anisotropic-residual decomposition inside any frozen reference frame and that unpaired statistics suffice to recover the target image distribution without additional parameters.

axioms (1)
  • domain assumption: The modality gap can be decomposed into stable biases and anisotropic residuals within a frozen reference frame.
    Invoked in the Fixed-frame Modality Gap Theory section of the abstract.
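
The axiom can be probed numerically on a small diagnostic set. The sketch below assumes access to paired embeddings from a frozen encoder, which is needed only for this diagnostic and is not part of the unpaired pipeline. It reads "stable bias" as the mean gap vector and "anisotropic residual" as a residual covariance whose eigenvalues are far from uniform. Both readings, the function name `gap_decomposition`, and the anisotropy ratio are assumptions for illustration.

```python
import numpy as np

def gap_decomposition(image_emb, text_emb):
    """Split per-pair gap vectors into a constant bias and a residual covariance spectrum."""
    gaps = image_emb - text_emb                 # one gap vector per pair
    bias = gaps.mean(axis=0)                    # constant offset shared by all pairs
    residuals = gaps - bias                     # what the constant offset does not explain
    cov = np.cov(residuals, rowvar=False)
    eigvals = np.linalg.eigvalsh(cov)[::-1]     # descending order
    anisotropy = eigvals[0] / eigvals.mean()    # close to 1 when residuals are isotropic
    return bias, eigvals, anisotropy

# Toy pairs: a fixed offset plus noise that is three times wider along the first axis.
rng = np.random.default_rng(0)
text = rng.normal(size=(5000, 4))
offset = np.array([1.0, 2.0, 3.0, 4.0])
noise = rng.normal(size=(5000, 4)) * np.array([3.0, 1.0, 1.0, 1.0])
image = text + offset + noise
bias, eigvals, anisotropy = gap_decomposition(image, text)
```

If the bias estimate changes little across data subsets while the anisotropy ratio stays well above 1, that is consistent with the assumed decomposition; a ratio near 1 would suggest the isotropic models the paper criticizes are adequate for that encoder.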

pith-pipeline@v0.9.0 · 5596 in / 1319 out tokens · 21614 ms · 2026-05-16T08:15:12.284635+00:00 · methodology


Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. UniCVR: From Alignment to Reranking for Unified Zero-Shot Composed Visual Retrieval

    cs.CV 2026-04 unverdicted novelty 8.0

    UniCVR is the first unified zero-shot framework that handles composed image, multi-turn image, and video retrieval by MLLM-VLP alignment plus dual-level reranking.

  2. Anisotropic Modality Align

    cs.MM 2026-05 unverdicted novelty 6.0

    Modality representations share dominant semantic geometry but have an anisotropic residual gap; AnisoAlign corrects source representations boundedly using target geometry for unpaired alignment.

  3. When Language Overwrites Vision: Over-Alignment and Geometric Debiasing in Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Decoder-based VLMs over-align visual features to a universal text subspace, injecting linguistic bias; projecting out its top principal components reduces hallucinations on POPE, CHAIR, AMBER and improves long-form ca...

  4. When Language Overwrites Vision: Over-Alignment and Geometric Debiasing in Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Decoder-based VLMs hallucinate due to geometric over-alignment of visual embeddings with the text manifold in a universal dataset-agnostic subspace, mitigated by projecting out the linguistic bias.
