Compressing then Matching: An Efficient Pre-training Paradigm for Multimodal Embedding

Biao Yang; Da Li; Fan Yang; Guorui Zhou; Jiafeng Guo; Keping Bi; Tingting Gao; Wei Yuan; Yan Wang; Yuxiao Luo

arxiv: 2511.08480 · v3 · submitted 2025-11-11 · 💻 cs.CV · cs.IR

Pith reviewed 2026-05-17 23:25 UTC · model grok-4.3

keywords: multimodal embedding · pre-training · contrastive learning · MLLM · data efficiency · vision language · MMEB · embedding model

The pith

A compressed pre-training phase lets MLLMs become competitive multimodal embedding models with minimal data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper argues that the objectives of preserving semantic content and emphasizing discriminative features can be decoupled in pre-training multimodal large language models. It introduces a compressed pre-training stage that builds comprehensive understanding as a warm-up before contrastive learning. This approach requires only a small amount of data to adapt MLLMs into effective embedding models. Readers might care because it promises more efficient ways to create models for tasks like retrieval, clustering, and classification. The method achieves state-of-the-art results on the MMEB benchmark among models of similar size.

Core claim

We propose CoMa, a compressed pre-training phase which serves as a warm-up stage for contrastive learning. Experiments demonstrate that with only a small amount of pre-training data, we can transform an MLLM into a competitive embedding model that achieves new state-of-the-art results among MLLMs of comparable size on the MMEB.

What carries the argument

The CoMa paradigm of first compressing inputs to build understanding then using contrastive matching to learn discriminative embeddings.

Load-bearing premise

The two objectives of building comprehensive understanding and emphasizing discriminative features can be decoupled such that compression enables superior contrastive learning performance.

What would settle it

A direct comparison showing no performance gain from adding the compression stage before contrastive learning on the MMEB benchmark would falsify the approach.

Figures

Figures reproduced from arXiv: 2511.08480 by Biao Yang, Da Li, Fan Yang, Guorui Zhou, Jiafeng Guo, Keping Bi, Tingting Gao, Wei Yuan, Yan Wang, Yuxiao Luo.

**Figure 2.** The impact of different numbers of compressed tokens on performance.

**Figure 3.** Similarity between compression tokens with ...

**Figure 4.** Representations of queries and targets across ...

**Figure 6.** Prompt for Data Generation

Original abstract

Multimodal Large Language Models advance multimodal representation learning by acquiring transferable semantic embeddings, thereby substantially enhancing performance across a range of vision-language tasks, including cross-modal retrieval, clustering, and classification. An effective embedding is expected to comprehensively preserve the semantic content of the input while simultaneously emphasizing features that are discriminative for downstream tasks. Recent approaches demonstrate that MLLMs can be adapted into competitive embedding models via large-scale contrastive learning, enabling the simultaneous optimization of two complementary objectives. We argue that the two aforementioned objectives can be decoupled: a comprehensive understanding of the input enables the embedding model to achieve superior performance on downstream tasks via contrastive learning. In this paper, we propose CoMa, a compressed pre-training phase, which serves as a warm-up stage for contrastive learning. Experiments demonstrate that with only a small amount of pre-training data, we can transform an MLLM into a competitive embedding model. CoMa achieves new state-of-the-art results among MLLMs of comparable size on the MMEB, realizing optimization in both efficiency and effectiveness. Our project is available at https://github.com/Trustworthy-Information-Access/CoMa.
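For readers unfamiliar with the contrastive stage the abstract refers to, here is a minimal sketch of a generic in-batch InfoNCE loss, a common objective for contrastive embedding training. It is not taken from the CoMa code. The function names `l2_normalize` and `info_nce_loss`, the temperature value, and the use of in-batch negatives are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def info_nce_loss(
    queries: Sequence[Sequence[float]],
    targets: Sequence[Sequence[float]],
    temperature: float = 0.05,  # hypothetical value, not from the paper
) -> float:
    """Generic in-batch InfoNCE: targets[i] is the positive for queries[i],
    and every other target in the batch acts as a negative."""
    if len(queries) != len(targets) or not queries:
        raise ValueError("queries and targets must be non-empty and equal in length")
    q = [l2_normalize(v) for v in queries]
    t = [l2_normalize(v) for v in targets]

    total = 0.0
    for i, qi in enumerate(q):
        # Temperature-scaled cosine similarity of query i to every target.
        logits = [sum(a * b for a, b in zip(qi, tj)) / temperature for tj in t]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        # Cross-entropy with the matching index as the correct class.
        total += log_denom - logits[i]
    return total / len(q)
```

Under this objective the loss falls as each query moves closer to its own target than to the other targets in the batch, which is the "discriminative" pressure the abstract contrasts with comprehensive understanding.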

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CoMa's compressed warm-up before contrastive learning is a practical efficiency tweak for MLLM embeddings, but the decoupling benefit is not yet isolated from the contrastive stage itself.

Colleague, the main thing here is that CoMa runs a compressed pre-training phase as a warm-up, then switches to contrastive learning to turn an MLLM into an embedding model. The authors report that this sequence delivers competitive results on MMEB with only a small amount of data, and they claim new SOTA among similar-sized models. The efficiency angle is the practical hook.

What the paper actually does is lay out an empirical recipe that treats understanding and discrimination as separable steps. The compressed stage is meant to build semantic coverage first so the later contrastive pass can focus on discriminative features without as much data. That sequencing is presented cleanly and the GitHub release helps with checking the details. The approach sits as an incremental but usable extension of staged training ideas already in the contrastive literature.

The soft spot is exactly the one the stress-test flags. The central claim needs the compressed phase to add something that plain contrastive learning on the same small data would not. Without an ablation that runs the contrastive stage alone, or a proxy measure of understanding right after the warm-up, the gains could be driven entirely by the second stage. The abstract states SOTA results but the summary gives little on baselines, splits, or significance, so the evidence for the decoupling is still thin.

This paper is for people working on efficient adaptation of multimodal models for retrieval and classification. Readers who need concrete recipes for limited-data embedding training will find the method description useful. It shows straightforward thinking about the objectives and engages the efficiency problem directly.

I would send it to peer review. The efficiency claim is worth referee time even if the current experiments need tightening on the stage contributions.

Referee Report

2 major / 1 minor

Summary. The paper proposes CoMa, a two-stage paradigm for adapting Multimodal Large Language Models into embedding models. It introduces a compressed pre-training phase as a warm-up to build comprehensive input understanding, followed by contrastive learning to emphasize discriminative features. The central claim is that decoupling these objectives allows transforming an MLLM into a competitive embedding model using only a small amount of pre-training data, achieving new state-of-the-art results on the MMEB benchmark among MLLMs of comparable size.

Significance. If the efficiency and SOTA claims hold after proper isolation of contributions, this work would offer a practical recipe for data-efficient adaptation of MLLMs to multimodal embedding tasks, potentially reducing computational costs in vision-language representation learning while improving performance on retrieval, clustering, and classification.

major comments (2)

Abstract and Experiments: The central claim that the compressed pre-training phase produces a 'comprehensive understanding' enabling superior contrastive performance is not isolated. No ablation is described that runs the contrastive stage alone on identical small data or measures understanding proxies (e.g., zero-shot VQA or captioning accuracy immediately after the compressed phase), so it remains possible that reported MMEB gains are driven entirely by the contrastive objective rather than the proposed decoupling.
Abstract: The assertion of new SOTA results on MMEB after the two-stage procedure provides no details on baselines, data splits, statistical significance, or ablation controls. Without these, it is impossible to judge whether the efficiency claim (small data, competitive embedding model) is supported.

minor comments (1)

Abstract: The phrase 'small amount of pre-training data' should be quantified with exact dataset sizes or percentages to allow reproducibility assessment.
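One way the comparison requested above could be reported is a paired bootstrap over per-dataset scores from a two-stage run and a contrastive-only run trained on the same data. The sketch below is illustrative only. The function name `paired_bootstrap_ci`, the resample count, and the percentile method are assumptions, and nothing here reflects how the authors evaluate CoMa.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_ci(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_resamples: int = 2000,  # hypothetical default
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean difference, CI lower, CI upper) for scores_a - scores_b.

    scores_a[i] and scores_b[i] must come from the same dataset, so that
    resampling datasets keeps each pair together.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and equal in length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    rng = random.Random(seed)  # fixed seed so the interval is reproducible

    means = []
    for _ in range(n_resamples):
        # Resample datasets with replacement and average the paired differences.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()

    # Percentile interval: cut alpha/2 of the resampled means from each tail.
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return sum(diffs) / n, means[lo_idx], means[hi_idx]


if __name__ == "__main__":
    # Placeholder values for illustration only; these are NOT results from the paper.
    two_stage = [0.60, 0.50, 0.70, 0.65]
    contrastive_only = [0.55, 0.52, 0.60, 0.60]
    print(paired_bootstrap_ci(two_stage, contrastive_only))
```

If the resulting interval excludes zero, the gain from the warm-up stage is unlikely to be an artifact of which datasets happened to be in the benchmark. With only a few dozen datasets the interval will be wide, so it complements rather than replaces multiple training runs.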

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, offering clarifications from the manuscript and indicating revisions that will strengthen the presentation of our contributions.

Referee: Abstract and Experiments: The central claim that the compressed pre-training phase produces a 'comprehensive understanding' enabling superior contrastive performance is not isolated. No ablation is described that runs the contrastive stage alone on identical small data or measures understanding proxies (e.g., zero-shot VQA or captioning accuracy immediately after the compressed phase), so it remains possible that reported MMEB gains are driven entirely by the contrastive objective rather than the proposed decoupling.

Authors: We agree that a more explicit isolation of the compressed pre-training stage's contribution would strengthen the central claim. Our existing experiments compare the full CoMa pipeline against direct contrastive learning on larger-scale data and demonstrate consistent gains from the two-stage approach, but we did not include a direct head-to-head ablation of contrastive learning alone using precisely the same small pre-training corpus. We will add this ablation (contrastive-only on the small data) together with zero-shot VQA and captioning metrics evaluated immediately after the compressed phase. These results will be reported in a new subsection of the Experiments section to quantify the 'comprehensive understanding' benefit and rule out the possibility that gains arise solely from the contrastive objective.

Revision: yes

Referee: Abstract: The assertion of new SOTA results on MMEB after the two-stage procedure provides no details on baselines, data splits, statistical significance, or ablation controls. Without these, it is impossible to judge whether the efficiency claim (small data, competitive embedding model) is supported.

Authors: The abstract provides a concise summary of the main result, while the full experimental details (including the specific MLLM baselines of comparable size, the exact data splits and volumes used in the compressed pre-training phase, multiple-run averaging for statistical reliability, and ablation controls) are presented in Section 4 and Tables 1–3. To address the concern directly, we will revise the abstract to include a brief parenthetical reference to the key baselines and the limited data scale (e.g., “using only X% of typical contrastive data”) while preserving its length constraints. We will also add an explicit sentence in the abstract pointing readers to the experimental setup for full details on splits and significance.

Revision: partial

Circularity Check

0 steps flagged

Empirical recipe with no derivation chain that reduces to inputs

The paper advances an empirical pre-training paradigm (compressed phase as warm-up for contrastive learning) based on the hypothesis that understanding and discrimination objectives can be decoupled. No equations, uniqueness theorems, fitted parameters renamed as predictions, or self-citation load-bearing steps appear in the provided abstract or method description. Claims rest on experimental results on MMEB rather than any closed-form reduction or self-referential construction. The central argument is presented as a testable recipe, not a mathematical derivation equivalent to its inputs by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that comprehensive semantic understanding can be acquired separately and then leveraged by contrastive learning. No explicit free parameters or new entities are introduced in the abstract.

axioms (1)

Domain assumption: A comprehensive understanding of the input enables the embedding model to achieve superior performance on downstream tasks via contrastive learning.
Explicitly stated as the key argument enabling the decoupling of objectives.

pith-pipeline@v0.9.0 · 5528 in / 1315 out tokens · 34306 ms · 2026-05-17T23:25:35.696527+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CausalEmbed: Auto-Regressive Multi-Vector Generation in Latent Space for Visual Document Embedding
cs.CL · 2026-01 · unverdicted · novelty 6.0

CausalEmbed uses auto-regressive generation with iterative margin loss to produce multi-vector embeddings that reduce visual token counts 30-155x while retaining competitive performance on VDR benchmarks.

