MiMIC: Mitigating Visual Modality Collapse in Universal Multimodal Retrieval While Avoiding Semantic Misalignment
Pith reviewed 2026-05-09 21:53 UTC · model grok-4.3
The pith
MiMIC pairs a fusion-in-decoder architecture with single-modality mixin and random caption dropout to prevent both visual modality collapse and semantic misalignment in universal multimodal retrieval.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MiMIC combines a fusion-in-decoder architecture for effective multimodal integration with robust training via single-modality mixin and random caption dropout. Together, these mitigate visual modality collapse while avoiding semantic misalignment, yielding consistent outperformance over early- and late-fusion baselines on the WebQA+ and EVQA+ datasets.
What carries the argument
A fusion-in-decoder architecture combined with single-modality mixin and random caption dropout, which balances visual and textual contributions during both integration and training.
Load-bearing premise
The fusion-in-decoder architecture combined with single-modality mixin and random caption dropout will reliably prevent both visual collapse and semantic misalignment without introducing new trade-offs.
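The paper excerpt names the training recipe but gives no pseudocode. Below is a minimal sketch of how caption dropout and single-modality mixin are typically implemented as training-time augmentation; the function name, argument shapes, and probabilities are hypothetical illustrations, not taken from MiMIC.

```python
import random

def augment_example(image, text, caption,
                    p_caption_drop=0.3, p_single_modality=0.2):
    """Toy sketch of the two robustness tricks described in the review.

    - Random caption dropout: with probability p_caption_drop, remove the
      caption so the model must rely on visual features.
    - Single-modality mixin: with probability p_single_modality, keep only
      one modality, forcing each encoder path to stay informative on its own.

    All names and probabilities here are hypothetical, not from the paper.
    """
    if caption is not None and random.random() < p_caption_drop:
        caption = None
    if random.random() < p_single_modality:
        if random.random() < 0.5:
            text, caption = None, None   # image-only example
        else:
            image = None                 # text-only example
    return image, text, caption
```

Applied per example inside the training loop, this kind of augmentation is what would let a retriever keep working when, as on WebQA+ and EVQA+, captions are missing at inference time.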
What would settle it
A controlled ablation on WebQA+ where removing the caption dropout or switching back to early/late fusion causes MiMIC to match or fall below baseline performance while exhibiting either visual collapse or increased embedding distances for semantically related pairs.
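Both failure modes the ablation would need to detect reduce to simple embedding-space probes: visual collapse shows up as an embedding that barely changes when the image is removed, and semantic misalignment as large distances between related pairs. A toy NumPy illustration; the embedder interface and the stand-in embedders are assumptions for the sketch, not part of the paper.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def visual_collapse_score(embed, image, text):
    """1 minus the similarity between the full embedding and a text-only
    embedding; a score near 0 means the image barely moves the embedding,
    the signature of visual modality collapse."""
    return 1.0 - cosine(embed(image, text), embed(None, text))

def misalignment_score(emb_a, emb_b):
    """Cosine distance between embeddings of a semantically related pair;
    large values indicate semantic misalignment."""
    return 1.0 - cosine(emb_a, emb_b)

# Two toy embedders to exercise the probes (stand-ins, not real models).
rng = np.random.default_rng(0)
IMG_DIR, TXT_DIR = rng.normal(size=8), rng.normal(size=8)

def collapsed_embed(image, text):
    return TXT_DIR                      # ignores the image entirely

def balanced_embed(image, text):
    return TXT_DIR + IMG_DIR if image is not None else TXT_DIR
```

On these toy embedders, `visual_collapse_score(collapsed_embed, ...)` is 0, while `balanced_embed` yields a strictly positive score, which is the direction of evidence the proposed ablation would look for.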
Original abstract
Universal Multimodal Retrieval (UMR) aims to map different modalities (e.g., visual and textual) into a shared embedding space for multi-modal retrieval. Existing UMR methods can be broadly divided into two categories: early-fusion approaches, such as Marvel, which projects visual features into the language model (LM) space for integrating with text modality, and late-fusion approaches, such as UniVL-DR, which encode visual and textual inputs using separate encoders and obtain fused embeddings through addition. Our pilot study reveals that Marvel exhibits visual modality collapse, which is characterized by the model's tendency to disregard visual features while depending excessively on textual cues. In contrast, although UniVL-DR is less affected by this issue, it is more susceptible to semantic misalignment, where semantically related content is positioned far apart in the embedding space. To address these challenges, we propose MiMIC, which introduces two key innovations: (1) a fusion-in-decoder architecture for effective multimodal integration, and (2) robust training through single modality mixin and random caption dropout. Experiments on the WebQA+ and EVQA+ datasets, where images in documents or queries might lack captions, indicate that MiMIC consistently outperforms both early- and late-fusion baselines.
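The two fusion strategies the abstract contrasts can be sketched in a few lines: late fusion adds the outputs of separate encoders, while a fusion-in-decoder attends jointly over the token sequences of both modalities before any pooling. The single-head attention pooling below is a toy stand-in for illustration, not MiMIC's actual decoder.

```python
import numpy as np

def late_fusion(img_vec, txt_vec):
    """UniVL-DR-style late fusion: separate encoders, fused by addition."""
    return img_vec + txt_vec

def fid_style_fusion(img_tokens, txt_tokens, query):
    """Toy fusion-in-decoder stand-in: attend over the concatenation of
    image and text token sequences with a single query vector, so neither
    modality is pre-collapsed to one vector before fusion happens."""
    tokens = np.concatenate([img_tokens, txt_tokens], axis=0)   # (N, d)
    scores = tokens @ query / np.sqrt(tokens.shape[1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                                    # softmax
    return weights @ tokens                                     # (d,)
```

The design difference matters for the collapse diagnosis: in the additive scheme each modality's contribution is fixed by its encoder, whereas in the decoder-side scheme the attention weights can rebalance visual versus textual tokens per query.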
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes MiMIC, a fusion-in-decoder architecture augmented with single-modality mixin and random caption dropout, to mitigate visual modality collapse observed in early-fusion UMR models (e.g., Marvel) and semantic misalignment in late-fusion models (e.g., UniVL-DR). It evaluates the approach on modified WebQA+ and EVQA+ benchmarks featuring missing captions and claims consistent outperformance over both early- and late-fusion baselines.
Significance. If the reported empirical gains prove robust under detailed scrutiny, the work would supply a practical training recipe and architecture for universal multimodal retrieval that balances modality integration without the two identified failure modes, potentially benefiting retrieval systems handling incomplete multimodal documents.
Major comments (1)
- Abstract: the central claim that MiMIC 'consistently outperforms' early- and late-fusion baselines rests entirely on experimental assertions, yet the abstract supplies no quantitative metrics, ablation results, statistical significance tests, or controls for confounding factors; this absence is load-bearing because the soundness of the contribution cannot be assessed without them.
Simulated Author's Rebuttal
We thank the referee for the insightful comments and the recommendation for major revision. We provide a point-by-point response to the major comment below.
Point-by-point responses
Referee: Abstract: the central claim that MiMIC 'consistently outperforms' early- and late-fusion baselines rests entirely on experimental assertions, yet the abstract supplies no quantitative metrics, ablation results, statistical significance tests, or controls for confounding factors; this absence is load-bearing because the soundness of the contribution cannot be assessed without them.
Authors: We agree with the referee that the abstract would be strengthened by the inclusion of quantitative metrics to support the claim of consistent outperformance. In the revised manuscript, we will update the abstract to include specific performance improvements observed in our experiments on the WebQA+ and EVQA+ datasets. While the detailed ablation studies, statistical significance tests, and controls for confounding factors (including the handling of missing captions in the modified benchmarks) are thoroughly discussed in Sections 4 and 5, we will consider adding a concise reference to these in the abstract if space allows. This revision should address the concern regarding the assessability of the contribution.
Revision: yes
Circularity Check
No significant circularity; empirical architecture validated on benchmarks
Full rationale
The paper advances an empirical proposal consisting of a fusion-in-decoder architecture together with single-modality mixin and random caption dropout training. Its central claims rest on reported performance gains versus early- and late-fusion baselines on the WebQA+ and EVQA+ datasets rather than on any closed-form derivations, fitted-parameter predictions, or self-referential definitions. The pilot study on Marvel and UniVL-DR is invoked solely for motivation; no load-bearing step reduces by construction to the inputs, self-citations, or ansatzes. The work is therefore self-contained as an architecture-plus-recipe contribution whose validity is externally falsifiable through the stated experimental results.