Compressing then Matching: An Efficient Pre-training Paradigm for Multimodal Embedding
Pith reviewed 2026-05-17 23:25 UTC · model grok-4.3
The pith
A compressed pre-training phase lets MLLMs become competitive multimodal embedding models with minimal data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose CoMa, a compressed pre-training phase which serves as a warm-up stage for contrastive learning. Experiments demonstrate that with only a small amount of pre-training data, we can transform an MLLM into a competitive embedding model that achieves new state-of-the-art results among MLLMs of comparable size on the MMEB.
What carries the argument
The CoMa paradigm: first compress the inputs to build understanding, then use contrastive matching to learn discriminative embeddings.
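To make the matching stage concrete, here is a minimal sketch of an in-batch contrastive loss (InfoNCE), a common choice for this kind of training. The abstract does not state the exact loss CoMa uses, so the function names, the temperature value, and the toy vectors below are all assumptions for illustration, not the authors' implementation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(queries, targets, temperature=0.05):
    """Mean in-batch contrastive loss; targets[i] is the positive for queries[i].

    Hypothetical helper: every other target in the batch acts as a negative.
    """
    q = [l2_normalize(v) for v in queries]
    t = [l2_normalize(v) for v in targets]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, tj)) / temperature for tj in t]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denom - logits[i]
    return total / len(q)


# Toy embeddings (illustrative only): correctly paired vs. deliberately mismatched.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

The loss is near zero when each query is closest to its own target and grows when the pairing is wrong, which is the "discriminative" pressure the second stage is described as applying.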
Load-bearing premise
The two objectives of building comprehensive understanding and emphasizing discriminative features can be decoupled such that compression enables superior contrastive learning performance.
What would settle it
A direct comparison showing no performance gain from adding the compression stage before contrastive learning on the MMEB benchmark would falsify the approach.
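The comparison described above can be sketched as a per-dataset difference between two training arms. The function name, the dataset keys, and the scores are placeholders invented for illustration; none of them are results from the paper.

```python
def mean_gain(with_stage, without_stage):
    """Per-dataset gain (with minus without) and its mean over shared datasets."""
    shared = sorted(set(with_stage) & set(without_stage))
    if not shared:
        raise ValueError("no datasets in common")
    gains = {name: with_stage[name] - without_stage[name] for name in shared}
    return gains, sum(gains.values()) / len(shared)


# Placeholder numbers for illustration only; NOT results from the paper.
two_stage = {"dataset_a": 0.5, "dataset_b": 0.75}
contrastive_only = {"dataset_a": 0.5, "dataset_b": 0.25}

gains, avg = mean_gain(two_stage, contrastive_only)
print(gains, avg)
```

Under this framing, a mean gain indistinguishable from zero across the MMEB datasets, with both arms trained on the same data, would be the outcome that counts against the approach.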
Original abstract
Multimodal Large Language Models advance multimodal representation learning by acquiring transferable semantic embeddings, thereby substantially enhancing performance across a range of vision-language tasks, including cross-modal retrieval, clustering, and classification. An effective embedding is expected to comprehensively preserve the semantic content of the input while simultaneously emphasizing features that are discriminative for downstream tasks. Recent approaches demonstrate that MLLMs can be adapted into competitive embedding models via large-scale contrastive learning, enabling the simultaneous optimization of two complementary objectives. We argue that the two aforementioned objectives can be decoupled: a comprehensive understanding of the input enables the embedding model to achieve superior performance on downstream tasks via contrastive learning. In this paper, we propose CoMa, a compressed pre-training phase, which serves as a warm-up stage for contrastive learning. Experiments demonstrate that with only a small amount of pre-training data, we can transform an MLLM into a competitive embedding model. CoMa achieves new state-of-the-art results among MLLMs of comparable size on the MMEB, realizing optimization in both efficiency and effectiveness. Our project is available at https://github.com/Trustworthy-Information-Access/CoMa.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes CoMa, a two-stage paradigm for adapting Multimodal Large Language Models into embedding models. It introduces a compressed pre-training phase as a warm-up to build comprehensive input understanding, followed by contrastive learning to emphasize discriminative features. The central claim is that decoupling these objectives allows transforming an MLLM into a competitive embedding model using only a small amount of pre-training data, achieving new state-of-the-art results on the MMEB benchmark among MLLMs of comparable size.
Significance. If the efficiency and SOTA claims hold after proper isolation of contributions, this work would offer a practical recipe for data-efficient adaptation of MLLMs to multimodal embedding tasks, potentially reducing computational costs in vision-language representation learning while improving performance on retrieval, clustering, and classification.
major comments (2)
- Abstract and Experiments: The central claim that the compressed pre-training phase produces a 'comprehensive understanding' enabling superior contrastive performance is not isolated. No ablation is described that runs the contrastive stage alone on identical small data or measures understanding proxies (e.g., zero-shot VQA or captioning accuracy immediately after the compressed phase), so it remains possible that reported MMEB gains are driven entirely by the contrastive objective rather than the proposed decoupling.
- Abstract: The assertion of new SOTA results on MMEB after the two-stage procedure provides no details on baselines, data splits, statistical significance, or ablation controls. Without these, it is impossible to judge whether the efficiency claim (small data, competitive embedding model) is supported.
minor comments (1)
- Abstract: The phrase 'small amount of pre-training data' should be quantified with exact dataset sizes or percentages to allow reproducibility assessment.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, offering clarifications from the manuscript and indicating revisions that will strengthen the presentation of our contributions.
- Referee: Abstract and Experiments: The central claim that the compressed pre-training phase produces a 'comprehensive understanding' enabling superior contrastive performance is not isolated. No ablation is described that runs the contrastive stage alone on identical small data or measures understanding proxies (e.g., zero-shot VQA or captioning accuracy immediately after the compressed phase), so it remains possible that reported MMEB gains are driven entirely by the contrastive objective rather than the proposed decoupling.
  Authors: We agree that a more explicit isolation of the compressed pre-training stage's contribution would strengthen the central claim. Our existing experiments compare the full CoMa pipeline against direct contrastive learning on larger-scale data and demonstrate consistent gains from the two-stage approach, but we did not include a direct head-to-head ablation of contrastive learning alone using precisely the same small pre-training corpus. We will add this ablation (contrastive-only on the small data) together with zero-shot VQA and captioning metrics evaluated immediately after the compressed phase. These results will be reported in a new subsection of the Experiments section to quantify the 'comprehensive understanding' benefit and rule out the possibility that gains arise solely from the contrastive objective.
  Revision: yes
- Referee: Abstract: The assertion of new SOTA results on MMEB after the two-stage procedure provides no details on baselines, data splits, statistical significance, or ablation controls. Without these, it is impossible to judge whether the efficiency claim (small data, competitive embedding model) is supported.
  Authors: The abstract provides a concise summary of the main result, while the full experimental details (the specific MLLM baselines of comparable size, the exact data splits and volumes used in the compressed pre-training phase, multiple-run averaging for statistical reliability, and ablation controls) are presented in Section 4 and Tables 1–3. To address the concern directly, we will revise the abstract to include a brief parenthetical reference to the key baselines and the limited data scale (e.g., “using only X% of typical contrastive data”) while preserving its length constraints. We will also add an explicit sentence in the abstract pointing readers to the experimental setup for full details on splits and significance.
  Revision: partial
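The multiple-run averaging mentioned in the rebuttal could be reported along the lines of the sketch below, which summarizes each training arm by its mean and sample standard deviation across runs. The helper names, the threshold `k`, and the toy scores are assumptions for illustration. The gap check is a crude heuristic, not a formal significance test, and is not a procedure the paper is known to use.

```python
import statistics


def summarize_runs(scores):
    """Return (mean, sample standard deviation) of one arm's scores across runs."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std


def gap_exceeds_noise(arm_a, arm_b, k=2.0):
    """Heuristic: is the mean gap larger than k times the larger run-to-run std?"""
    mean_a, std_a = summarize_runs(arm_a)
    mean_b, std_b = summarize_runs(arm_b)
    return abs(mean_a - mean_b) > k * max(std_a, std_b)


# Toy scores from three hypothetical seeds per arm; NOT results from the paper.
clear_gap = gap_exceeds_noise([1.0, 2.0, 3.0], [11.0, 12.0, 13.0])
within_noise = gap_exceeds_noise([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
print(clear_gap, within_noise)
```

Reporting the per-arm spread alongside the mean lets a reader judge whether a claimed improvement is larger than seed-to-seed variation, which is the gap the referee points to.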
Circularity Check
Empirical recipe with no derivation chain that reduces to inputs
The paper advances an empirical pre-training paradigm (compressed phase as warm-up for contrastive learning) based on the hypothesis that understanding and discrimination objectives can be decoupled. No equations, uniqueness theorems, fitted parameters renamed as predictions, or self-citation load-bearing steps appear in the provided abstract or method description. Claims rest on experimental results on MMEB rather than any closed-form reduction or self-referential construction. The central argument is presented as a testable recipe, not a mathematical derivation equivalent to its inputs by definition.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: A comprehensive understanding of the input enables the embedding model to achieve superior performance on downstream tasks via contrastive learning.
Forward citations
Cited by 1 Pith paper
- CausalEmbed: Auto-Regressive Multi-Vector Generation in Latent Space for Visual Document Embedding
  CausalEmbed uses auto-regressive generation with iterative margin loss to produce multi-vector embeddings that reduce visual token counts 30-155x while retaining competitive performance on VDR benchmarks.
- Focus on distinct key elements (objects, actions, settings)
- Be clear and answerable from visual content alone
- Avoid subjective interpretations.
Consider, but not limited to the following questions:
- Main objects and their attributes (type, color, position)
- The scene context (time/weather if apparent, location)
- Notable relationships between elements
[Output]:
Figure 6: Prompt for Data Generation
Table 4: Detailed MMEB Results. Performance of baselines and CoMa variants across 20 in-distribution (IND) and 16 out-of-distribution (OOD) datasets. OOD datasets are highlighted with a yellow background. Columns: CLIP, VLM2Vec, MMRet, UniME, mmE5, MoCa-7B, CoMa-3B, CoMa-7B. Classification...