pith. machine review for the scientific record.

arxiv: 2604.21326 · v1 · submitted 2026-04-23 · 💻 cs.CV · cs.AI

Recognition: unknown

MiMIC: Mitigating Visual Modality Collapse in Universal Multimodal Retrieval While Avoiding Semantic Misalignment

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 21:53 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords universal multimodal retrieval · visual modality collapse · semantic misalignment · fusion-in-decoder · modality mixin · caption dropout · WebQA+ · EVQA+

The pith

MiMIC combines a fusion-in-decoder architecture with single-modality mixin and random caption dropout to prevent both visual modality collapse and semantic misalignment in universal multimodal retrieval.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing early-fusion methods like Marvel tend to ignore visual features and rely too heavily on text, while late-fusion methods like UniVL-DR keep semantically related content too far apart in the shared embedding space. The paper argues these two failure modes are not inevitable. MiMIC integrates visual and textual inputs inside the decoder rather than before or after separate encoders, and it trains on mixtures that include single-modality examples and randomly dropped captions. On WebQA+ and EVQA+, dataset variants whose images may lack captions, this setup yields better retrieval than either baseline. The result matters because real-world multimodal search often encounters incomplete captions and needs embeddings that weight both modalities fairly.
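The robust-training half of the recipe, single-modality mixin plus random caption dropout, amounts to a batch augmentation step before encoding. A minimal sketch, assuming dict-shaped examples with optional 'image' and 'caption' fields; the function name, field names, and probabilities are illustrative assumptions, not taken from the paper:

```python
import random

def augment_batch(batch, p_caption_drop=0.3, p_single_modality=0.2, seed=None):
    """Hedged sketch of MiMIC-style robust-training augmentation.

    - single-modality mixin: with probability p_single_modality, a
      two-modality example is reduced to exactly one modality.
    - caption dropout: with probability p_caption_drop, an image
      document loses its caption, forcing reliance on visual features.
    """
    rng = random.Random(seed)
    out = []
    for ex in batch:
        ex = dict(ex)  # copy so the caller's data is not mutated
        has_both = ex.get("image") is not None and ex.get("caption") is not None
        if has_both and rng.random() < p_single_modality:
            # keep exactly one modality, chosen at random
            ex[rng.choice(["image", "caption"])] = None
        elif has_both and rng.random() < p_caption_drop:
            ex["caption"] = None  # caption dropout
        out.append(ex)
    return out
```

Text-only examples pass through unchanged, so the retriever still sees the full range of modality combinations it will face at inference time.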

Core claim

MiMIC pairs a fusion-in-decoder architecture for multimodal integration with robust training via single-modality mixin and random caption dropout; together these mitigate visual modality collapse while avoiding semantic misalignment, yielding consistently better retrieval than early- and late-fusion baselines on the WebQA+ and EVQA+ datasets.

What carries the argument

A fusion-in-decoder architecture combined with single-modality mixin and random caption dropout, which together balance visual and textual contributions during both integration and training.
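The contrast between the two fusion styles can be made concrete with a toy sketch: late fusion sums independently encoded vectors with fixed, equal weight, while fusion-in-decoder lets a decoder query attend jointly over image and text token states, so the mixing weights are learned. The single-query softmax attention, the shapes, and all values here are illustrative assumptions, not the paper's architecture:

```python
import math

def late_fusion(img_vec, txt_vec):
    """Late fusion (UniVL-DR style): encode separately, sum embeddings.
    The mixing weight is fixed to equal addition."""
    return [a + b for a, b in zip(img_vec, txt_vec)]

def fusion_in_decoder(img_tokens, txt_tokens, query):
    """Toy fusion-in-decoder step: one decoder query cross-attends over
    the concatenated image and text token states, producing a fused
    vector whose modality mix is data-dependent, not hard-coded."""
    tokens = img_tokens + txt_tokens
    scores = [sum(q * t for q, t in zip(query, tok)) for tok in tokens]
    m = max(scores)                      # numerically stable softmax
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    weights = [w / z for w in weights]
    dim = len(query)
    return [sum(w * tok[i] for w, tok in zip(weights, tokens))
            for i in range(dim)]
```

In the toy setting, a query aligned with the visual token pulls the fused vector toward the image side instead of averaging it away, which is the behavior additive late fusion cannot express.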

Load-bearing premise

The fusion-in-decoder architecture, combined with single-modality mixin and random caption dropout, will reliably prevent both visual modality collapse and semantic misalignment without introducing new trade-offs.

What would settle it

A controlled ablation on WebQA+ where removing the caption dropout or switching back to early/late fusion causes MiMIC to match or fall below baseline performance while exhibiting either visual collapse or increased embedding distances for semantically related pairs.
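The two measurements such an ablation would hinge on, embedding distance between semantically related pairs and recall@k, are simple to state. A hedged sketch with toy cosine-space helpers; none of this is the paper's evaluation code, and the function names are assumptions:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def misalignment(pairs):
    """Mean cosine distance (1 - cos) over semantically related
    query/document embedding pairs; higher means related content
    sits farther apart in the shared space."""
    return sum(1 - cosine(q, d) for q, d in pairs) / len(pairs)

def recall_at_k(ranked_ids, gold_ids, k):
    """Fraction of gold documents appearing in the top-k ranked list."""
    top = set(ranked_ids[:k])
    return sum(g in top for g in gold_ids) / len(gold_ids)
```

An ablation that removes caption dropout should show misalignment rising or recall@k falling relative to full MiMIC if the premise above holds.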

Figures

Figures reproduced from arXiv: 2604.21326 by Cam-Tu Nguyen, Chuanghao Ding, Juan Li, Xujie Zhang.

Figure 1: First row demonstrates the visual modality …
Figure 2: Different Fusion Strategies. (a) Late Fusion. …
Figure 3: Embeddings of T2I queries, image documents (I_VC or I_V), and text documents (T).
Figure 4: T2I_V and T2I_VC using Marvel and UniVL-DR. T2I_VC retrieves from D_IVC, a corpus of images with captions; T2I_V retrieves from D_IV, a corpus of images without captions.
Figure 6: T2T and T2ALL using Marvel and UniVL-DR. T2ALL retrieves from a corpus of I_VC and T documents; T2T retrieves from a corpus of T documents.
Figure 7: (left) The encoding of different modalities …
Figure 8: Different modality retrieval tasks: T2T (left) …
Figure 9: The Recall@20 performance of MiMIC using …
Figure 10: The Recall@20 performance varies with the …
Figure 11: Embeddings in MiMIC space, with the accompanying table:

    Model        k=5     k=50
    UniVL-DR     0.0204  0.0182
    Marvel-ANCE  0.0230  0.0208
    MiMIC-ANCE   0.1010  0.0936

Figure 12: The retrieval performance of T2T (left) and …
read the original abstract

Universal Multimodal Retrieval (UMR) aims to map different modalities (e.g., visual and textual) into a shared embedding space for multi-modal retrieval. Existing UMR methods can be broadly divided into two categories: early-fusion approaches, such as Marvel, which projects visual features into the language model (LM) space for integrating with text modality, and late-fusion approaches, such as UniVL-DR, which encode visual and textual inputs using separate encoders and obtain fused embeddings through addition. Our pilot study reveals that Marvel exhibits visual modality collapse, which is characterized by the model's tendency to disregard visual features while depending excessively on textual cues. In contrast, although UniVL-DR is less affected by this issue, it is more susceptible to semantic misalignment, where semantically related content is positioned far apart in the embedding space. To address these challenges, we propose MiMIC, which introduces two key innovations: (1) a fusion-in-decoder architecture for effective multimodal integration, and (2) robust training through single modality mixin and random caption dropout. Experiments on the WebQA+ and EVQA+ datasets, where image in documents or queries might lack captions, indicate that MiMIC consistently outperforms both early- and late-fusion baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes MiMIC, a fusion-in-decoder architecture augmented with single-modality mixin and random caption dropout, to mitigate visual modality collapse observed in early-fusion UMR models (e.g., Marvel) and semantic misalignment in late-fusion models (e.g., UniVL-DR). It evaluates the approach on modified WebQA+ and EVQA+ benchmarks featuring missing captions and claims consistent outperformance over both early- and late-fusion baselines.

Significance. If the reported empirical gains prove robust under detailed scrutiny, the work would supply a practical training recipe and architecture for universal multimodal retrieval that balances modality integration without the two identified failure modes, potentially benefiting retrieval systems handling incomplete multimodal documents.

major comments (1)
  1. Abstract: the central claim that MiMIC 'consistently outperforms' early- and late-fusion baselines rests entirely on experimental assertions, yet the abstract supplies no quantitative metrics, ablation results, statistical significance tests, or controls for confounding factors; this absence is load-bearing because the soundness of the contribution cannot be assessed without them.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the insightful comments and the recommendation for major revision. We provide a point-by-point response to the major comment below.

read point-by-point responses
  1. Referee: [—] Abstract: the central claim that MiMIC 'consistently outperforms' early- and late-fusion baselines rests entirely on experimental assertions, yet the abstract supplies no quantitative metrics, ablation results, statistical significance tests, or controls for confounding factors; this absence is load-bearing because the soundness of the contribution cannot be assessed without them.

    Authors: We agree with the referee that the abstract would be strengthened by the inclusion of quantitative metrics to support the claim of consistent outperformance. In the revised manuscript, we will update the abstract to include specific performance improvements observed in our experiments on the WebQA+ and EVQA+ datasets. While the detailed ablation studies, statistical significance tests, and controls for confounding factors (including the handling of missing captions in the modified benchmarks) are thoroughly discussed in Sections 4 and 5, we will consider adding a concise reference to these in the abstract if space allows. This revision should address the concern regarding the assessability of the contribution. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical architecture validated on benchmarks

full rationale

The paper advances an empirical proposal consisting of a fusion-in-decoder architecture together with single-modality mixin and random caption dropout training. Its central claims rest on reported performance gains versus early- and late-fusion baselines on the WebQA+ and EVQA+ datasets rather than on any closed-form derivations, fitted-parameter predictions, or self-referential definitions. The pilot study on Marvel and UniVL-DR is invoked solely for motivation; no load-bearing step reduces by construction to the inputs, self-citations, or ansatzes. The work is therefore self-contained as an architecture-plus-recipe contribution whose validity is externally falsifiable through the stated experimental results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations, free parameters, or new postulated entities appear in the abstract; the contribution is an empirical method proposal.

pith-pipeline@v0.9.0 · 5528 in / 1074 out tokens · 41482 ms · 2026-05-09T21:53:09.203158+00:00 · methodology

discussion (0)

