
arxiv: 2605.15868 · v1 · pith:2Y7Z4U6Bnew · submitted 2026-05-15 · 💻 cs.CV

SOLAR: Self-supervised Joint Learning for Symmetric Multimodal Retrieval

Pith reviewed 2026-05-20 19:21 UTC · model grok-4.3

classification 💻 cs.CV
keywords symmetric multimodal retrieval · self-supervised learning · image-text pairs · intersection mask · hard-negative samples · multimodal embeddings · vision-language models

The pith

A self-supervised two-stage framework learns intersection masks from unlabeled image-text pairs to generate positive and hard-negative samples for symmetric multimodal retrieval.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets symmetric multimodal-to-multimodal retrieval, where an image or text query can retrieve from either modality interchangeably. Existing methods rely on labeled asymmetric datasets and struggle here. SOLAR instead uses readily available web-scale unlabeled image-text pairs in two stages: first learning an intersection mask that aligns shared semantics while preserving differences, then applying that mask to create positive pairs and hard negatives through selective masking. This produces embeddings that perform symmetric retrieval without supervision. The authors also release a human-verified benchmark of positive and hard-negative pairs to test the setting realistically.

Core claim

The central claim is that an intersection mask learned from image-text pairs aligns shared semantics while preserving modality-specific differences, and that using this mask to mask different parts of images or texts produces effective positive and hard-negative samples that enable self-supervised learning of multimodal embeddings competitive with or better than supervised vision-language models on symmetric retrieval.
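The referee summary further down this page describes stage 2 as contrastive embedding learning. As a rough illustration of why hard negatives matter in such an objective, here is a minimal sketch of a generic InfoNCE-style loss. It is not the authors' loss: the function names, the temperature value, and the toy vectors are all assumptions made for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def info_nce(anchor, positive, negatives, temperature=0.07):
    """Generic InfoNCE loss for one anchor (hypothetical helper, not from the paper).

    The loss is -log softmax of the positive's scaled similarity against the
    positive plus all negatives, computed with a stable log-sum-exp.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    top = max(logits)
    log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_denominator - logits[0]

anchor = [1.0, 0.0]
positive = [0.9, 0.1]
easy_loss = info_nce(anchor, positive, [[0.0, 1.0]])   # negative far from the anchor
hard_loss = info_nce(anchor, positive, [[0.8, 0.2]])   # negative close to the anchor
print(easy_loss, hard_loss)
```

With an easy negative the loss is close to zero and carries almost no training signal; a negative that sits near the anchor yields a much larger loss, which is the usual motivation for mining or constructing hard negatives.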

What carries the argument

The intersection mask, which identifies overlapping semantics between an image and its paired text and is applied to construct positive and hard-negative samples by masking non-overlapping parts.
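To make the idea of an intersection mask concrete, the sketch below marks an image patch as "shared" when its embedding is similar enough to the global text embedding. This is a deliberately simplified stand-in: the paper learns the mask with a dedicated module, whereas the fixed threshold, the function names, and the two-dimensional toy embeddings here are assumptions for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def intersection_mask(text_global, patch_embeddings, threshold=0.5):
    """Return 1 for patches whose similarity to the text exceeds the threshold, else 0.

    Hypothetical helper: a hard threshold replaces the learned mask generator.
    """
    return [1 if cosine(text_global, p) > threshold else 0 for p in patch_embeddings]

text_vec = [1.0, 0.0]                              # toy global text embedding
patches = [[0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]     # toy image patch embeddings
mask = intersection_mask(text_vec, patches, threshold=0.8)
print(mask)  # [1, 0, 0]
```

Only the first patch clears the 0.8 threshold, so it alone is treated as part of the image-text intersection; the other two patches would count as image-specific content.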

If this is right

  • Symmetric retrieval becomes feasible without large labeled asymmetric datasets.
  • Web-scale unlabeled pairs suffice to train embeddings that handle interchangeable image and text queries.
  • Model size and embedding dimension can be reduced substantially while improving performance on the target task.
  • A reproducible human-verified benchmark now exists for measuring progress on realistic symmetric multimodal retrieval.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The mask-based sample construction could transfer to other multimodal alignment problems such as captioning or visual question answering.
  • Extending the approach to additional modalities like audio or video might require only modest changes to the mask-learning stage.
  • The performance gain with far fewer parameters suggests the method captures essential cross-modal structure more efficiently than current supervised alternatives.

Load-bearing premise

That an automatically learned intersection mask from image-text pairs will reliably separate shared semantics from differences in a way that produces useful positive and hard-negative samples for embedding training.

What would settle it

A controlled run on the authors' human-verified benchmark where SOLAR fails to match or exceed the strongest supervised vision-language model in retrieval accuracy.

Figures

Figures reproduced from arXiv: 2605.15868 by Hang Yu, Peng Di, Wenjie Yang, Yuyu Guo.

Figure 1: A comparison of existing multimodal retrieval paradigms with the symmetric MM2MM task addressed in this paper. Tasks are categorized based on two properties: whether the retrieval is symmetric (query and content are interchangeable) and whether both are multimodal (MM2MM vs. UM2MM/MM2UM).
Figure 2: The data augmentation and annotation pipeline. To collect the hard samples, we employ an image editing process based on VLMs, LLMs and SDs. Subsequently, symmetric positive-negative pairs are manually annotated through text rewriting and image replacing, ensuring mutual consistency in both query-positive and query-negative interactions. More examples are provided in App. B.
Figure 3: Architectural overview of two training stages and inference pipeline. In training stage 1, the Mask Generation module (MaskGen), guided by an alignment signal from Global-to-Local Alignment (L_GLA) and Local Distillation (L_LD), disentangles shared (intersection) and unique (difference) information. The resulting mask orchestrates two key objectives: Masked Image-Text Contrastive (L_ITC) learning aligns the s…
Figure 4: Training loss curves for different settings in Stage 1.
Figure 5: More examples of our sym-MM2MM dataset.
Figure 6: The distribution of categories for sym-MM2MM.
Figure 7: Impact of the distance threshold t on hierarchical clustering for image segmentation; C is the number of resulting clusters.
Figure 8: Top-5 retrieval results for a query from the sym-MM2MM benchmark. The positive sample is marked with ✓ and the negative with ✗.
Figure 9: Top-5 retrieval results for a query from the sym-MM2MM benchmark. The positive sample is marked with ✓ and the negative with ✗.
Figure 10: Impact of the margin hyperparameter δ in GLA.
Figure 11: The evolution of similarity score distributions for positive (in-pair) and negative (out-of-pair) text-global-to-image-local comparisons during Stage 1 training.
Figure 12: Visualization of the global-to-local similarity in the single-object case.
Figure 13: Visualization of the global-to-local similarity in the multiple-object case.
Figure 14: The influence of edge semantics.
Figure 15: mIoU trend for Stage 1.
Abstract

In this work, we address the critical yet underexplored challenge of symmetric multimodal-to-multimodal (MM2MM) retrieval, where queries and contexts are interchangeable. Existing universal multimodal retrieval works struggle with this task, as they are constrained by the labeled asymmetric datasets used. We propose SOLAR (Self-supervised jOint LeArning for symmetric multimodal Retrieval), a novel two-stage self-supervised framework that leverages readily available unlabeled web-scale image-text pairs. Based on the observation that both semantic alignment and discrepancies exist between the two modalities, in the first stage we learn the intersection mask of each image-text pair, allowing us to align the intersection while preserving the semantics of the difference. In the second stage, the learned mask is further used to construct positive and hard-negative samples by masking different parts of the image/text, which enables self-supervised multimodal embedding learning. Complementing this framework, we present a new benchmark featuring high-quality human-verified positive and hard-negative pairs to evaluate symmetric MM2MM retrieval under realistic conditions, as well as the corresponding pipeline. Extensive experiments against ten SOTA methods show SOLAR surpasses the strongest supervised VLM by 7.08 points on this benchmark, with over 50x fewer model parameters and a 5x smaller embedding dimension. Code and benchmark will be available soon.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents SOLAR, a two-stage self-supervised framework for symmetric multimodal-to-multimodal retrieval. Stage 1 learns an intersection mask from unlabeled web-scale image-text pairs to align shared semantics while preserving differences. Stage 2 applies the mask to generate positive and hard-negative samples for contrastive embedding learning. A new human-verified benchmark is introduced, and experiments against ten SOTA methods report that SOLAR exceeds the strongest supervised VLM by 7.08 points with over 50x fewer parameters and 5x smaller embedding dimension.

Significance. If the central results hold, the work would be significant for demonstrating that self-supervision on readily available unlabeled data can outperform supervised VLMs on symmetric retrieval while achieving substantial efficiency gains. Explicit credit is due for the planned release of code and the new benchmark, which supports reproducibility and enables future comparisons.

major comments (2)
  1. [Experiments] The 7.08-point gain on the new benchmark (reported in the abstract and Experiments section) is load-bearing for the central claim yet depends on the unverified premise that the learned intersection mask produces genuinely hard negatives. No quantitative metrics on mask quality (e.g., IoU against human annotations of shared regions) or hardness statistics for the generated negatives are provided, nor is there an ablation that removes the mask stage while keeping other components fixed.
  2. [§3] §3 (Method), the description of stage-2 sample construction: the precise mechanism by which the intersection mask is used to create positives versus hard-negatives is not formalized with equations or pseudocode, making it impossible to verify that the contrastive objective isolates semantic differences without inadvertently removing shared semantics.
minor comments (2)
  1. [Abstract] The abstract states 'over 50x fewer model parameters' without naming the exact supervised VLM baseline or reporting absolute parameter counts and embedding dimensions for all compared methods.
  2. [Benchmark] Benchmark construction details (human verification protocol, inter-annotator agreement, and rules for selecting hard negatives) are mentioned but lack sufficient specificity to allow independent replication or assessment of potential overlap with the web-scale training data.
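Major comment 1 asks for mask-quality metrics and hardness statistics. The sketch below shows one way such numbers could be computed: IoU between a predicted binary mask and a human-annotated one, and the mean cosine similarity between each generated negative and its positive. The function names and toy inputs are assumptions for illustration; the paper's actual evaluation protocol is not described on this page.

```python
import math

def mask_iou(pred, ref):
    """Intersection-over-union of two equal-length binary masks."""
    inter = sum(1 for p, r in zip(pred, ref) if p and r)
    union = sum(1 for p, r in zip(pred, ref) if p or r)
    return inter / union if union else 1.0  # two empty masks agree perfectly

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_hardness(positive, negatives):
    """Average similarity of negatives to the positive; higher means harder."""
    return sum(cosine(positive, n) for n in negatives) / len(negatives)

iou = mask_iou([1, 1, 0, 0], [1, 0, 1, 0])                 # 1 shared cell, 3 in the union
hardness = mean_hardness([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(iou, hardness)
```

Reporting both numbers side by side would let a reader check whether better masks actually coincide with harder negatives, which is the link the referee says is unverified.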

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. We address each major comment below and have revised the manuscript to incorporate the requested clarifications and additional analyses.

Point-by-point responses
  1. Referee: [Experiments] The 7.08-point gain on the new benchmark (reported in the abstract and Experiments section) is load-bearing for the central claim yet depends on the unverified premise that the learned intersection mask produces genuinely hard negatives. No quantitative metrics on mask quality (e.g., IoU against human annotations of shared regions) or hardness statistics for the generated negatives are provided, nor is there an ablation that removes the mask stage while keeping other components fixed.

    Authors: We agree that direct quantitative validation of the intersection mask quality and an ablation study would provide stronger support for the central claims. In the revised manuscript we add (i) IoU scores between the learned masks and human annotations of shared regions on a sampled subset, (ii) hardness statistics such as average similarity of the generated negatives to the positives, and (iii) an ablation that disables the mask-based sample construction while keeping the contrastive learning stage and all other components fixed. These additions confirm the mask's role in producing effective hard negatives. revision: yes

  2. Referee: [§3] §3 (Method), the description of stage-2 sample construction: the precise mechanism by which the intersection mask is used to create positives versus hard-negatives is not formalized with equations or pseudocode, making it impossible to verify that the contrastive objective isolates semantic differences without inadvertently removing shared semantics.

    Authors: We thank the referee for noting the lack of formalization. Section 3 has been revised to include explicit equations and pseudocode that define how the intersection mask is applied: positives are formed by masking non-intersection regions to emphasize alignment on shared semantics, while hard-negatives are formed by masking intersection regions to highlight differences. This formalization makes clear that the contrastive objective operates on semantic discrepancies while the mask preserves shared content. revision: yes
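The rule stated in the second response (keep the intersection for positives, remove it for hard negatives) can be sketched as follows. This is an illustration of that one sentence, not the revised manuscript's pseudocode: the `[MASK]` placeholder, the function name, and the toy caption are assumptions, and the same logic would apply to a list of image patches.

```python
MASK = "[MASK]"  # hypothetical placeholder for a masked token or patch

def build_views(units, intersection_mask, mask_token=MASK):
    """Build a positive and a hard-negative view from one sample.

    units: tokens (or patch identifiers) of the sample.
    intersection_mask: 1 where the unit belongs to the shared semantics, else 0.
    """
    if len(units) != len(intersection_mask):
        raise ValueError("units and mask must have the same length")
    # Positive: mask the non-intersection units, keeping only shared semantics.
    positive = [u if m else mask_token for u, m in zip(units, intersection_mask)]
    # Hard negative: mask the intersection units, keeping only the differences.
    hard_negative = [mask_token if m else u for u, m in zip(units, intersection_mask)]
    return positive, hard_negative

tokens = ["red", "car", "on", "beach"]
mask = [0, 1, 0, 1]
pos, neg = build_views(tokens, mask)
print(pos)  # ['[MASK]', 'car', '[MASK]', 'beach']
print(neg)  # ['red', '[MASK]', 'on', '[MASK]']
```

The two views are complementary by construction, so every unit survives in exactly one of them. Whether the hard negative still looks superficially similar to the original depends entirely on the quality of the mask, which is the dependency the referee's first comment targets.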

Circularity Check

0 steps flagged

Self-supervised framework shows no circularity

full rationale

The paper presents a two-stage self-supervised framework for symmetric multimodal retrieval using unlabeled web-scale image-text pairs. Stage 1 learns an intersection mask from the observation of shared semantics and differences; stage 2 uses the mask to construct positives and hard-negatives for contrastive embedding learning. Evaluation occurs on a separately introduced human-verified benchmark. No derivation step reduces by construction to its inputs, no fitted parameters are relabeled as predictions, and no load-bearing self-citations or imported uniqueness theorems appear. The chain is data-driven from external unlabeled sources and remains self-contained against the new benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The paper rests on the domain assumption that semantic alignment and discrepancies coexist between modalities and introduces the intersection mask as a learned component without external independent evidence.

axioms (1)
  • domain assumption: Both semantic alignment and discrepancies exist between the two modalities
    Explicitly stated as the observation that motivates the first-stage mask learning.
invented entities (1)
  • intersection mask (no independent evidence)
    purpose: To align shared semantics while preserving modality-specific differences in image-text pairs
    New component learned in stage one and then used to construct training samples; no independent falsifiable evidence outside the paper is mentioned.

pith-pipeline@v0.9.0 · 5765 in / 1349 out tokens · 48538 ms · 2026-05-20T19:21:04.771954+00:00 · methodology


