SOLAR: Self-supervised Joint Learning for Symmetric Multimodal Retrieval
Pith reviewed 2026-05-20 19:21 UTC · model grok-4.3
The pith
A self-supervised two-stage framework learns intersection masks from unlabeled image-text pairs to generate positive and hard-negative samples for symmetric multimodal retrieval.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that an intersection mask learned from image-text pairs aligns shared semantics while preserving modality-specific differences. Using this mask to hide different parts of an image or text yields positive and hard-negative samples, which in turn enable self-supervised learning of multimodal embeddings that are competitive with, or better than, supervised vision-language models on symmetric retrieval.
What carries the argument
The intersection mask identifies the semantics shared by an image and its paired text. Masking different parts of the image or text with it produces the positive and hard-negative samples used for embedding training.
If this is right
- Symmetric retrieval becomes feasible without large labeled asymmetric datasets.
- Web-scale unlabeled pairs suffice to train embeddings that handle interchangeable image and text queries.
- Model size and embedding dimension can be reduced substantially while improving performance on the target task.
- A reproducible human-verified benchmark now exists for measuring progress on realistic symmetric multimodal retrieval.
Where Pith is reading between the lines
- The mask-based sample construction could transfer to other multimodal alignment problems such as captioning or visual question answering.
- Extending the approach to additional modalities like audio or video might require only modest changes to the mask-learning stage.
- The performance gain with far fewer parameters suggests the method captures essential cross-modal structure more efficiently than current supervised alternatives.
Load-bearing premise
That an automatically learned intersection mask from image-text pairs will reliably separate shared semantics from differences in a way that produces useful positive and hard-negative samples for embedding training.
What would settle it
A controlled run on the authors' human-verified benchmark where SOLAR fails to match or exceed the strongest supervised vision-language model in retrieval accuracy.
Original abstract
In this work, we address the critical yet underexplored challenge of symmetric multimodal-to-multimodal (MM2MM) retrieval, where queries and contexts are interchangeable. Existing universal multimodal retrieval works struggle with this task, as they are constrained by the labeled asymmetric datasets used. We introduce SOLAR (Self-supervised jOint LeArning for symmetric multimodal Retrieval), a novel two-stage self-supervised framework that leverages readily available unlabeled web-scale image-text pairs. Based on the observation that both semantic alignment and discrepancies exist between the two modalities, in the first stage we learn the intersection mask of each image-text pair, allowing us to align the intersection while preserving the semantics of the differences. In the second stage, the learned mask is further utilized to construct positive and hard-negative samples by masking different parts of the image or text, which enables self-supervised multimodal embedding learning. Complementing this framework, we present a new benchmark featuring high-quality human-verified positive and hard-negative pairs to evaluate symmetric MM2MM retrieval under realistic conditions, as well as the corresponding pipeline. Extensive experiments against ten SOTA methods show SOLAR surpasses the strongest supervised VLM by 7.08 points on this benchmark, with over 50x fewer model parameters and a 5x smaller embedding dimension. Code and benchmark will be available soon.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents SOLAR, a two-stage self-supervised framework for symmetric multimodal-to-multimodal retrieval. Stage 1 learns an intersection mask from unlabeled web-scale image-text pairs to align shared semantics while preserving differences. Stage 2 applies the mask to generate positive and hard-negative samples for contrastive embedding learning. A new human-verified benchmark is introduced, and experiments against ten SOTA methods report that SOLAR exceeds the strongest supervised VLM by 7.08 points with over 50x fewer parameters and 5x smaller embedding dimension.
Significance. If the central results hold, the work would be significant for demonstrating that self-supervision on readily available unlabeled data can outperform supervised VLMs on symmetric retrieval while achieving substantial efficiency gains. Explicit credit is due for the planned release of code and the new benchmark, which supports reproducibility and enables future comparisons.
major comments (2)
- [Experiments] The 7.08-point gain on the new benchmark (reported in the abstract and Experiments section) is load-bearing for the central claim yet depends on the unverified premise that the learned intersection mask produces genuinely hard negatives. No quantitative metrics on mask quality (e.g., IoU against human annotations of shared regions) or hardness statistics for the generated negatives are provided, nor is there an ablation that removes the mask stage while keeping other components fixed.
- [§3] §3 (Method), the description of stage-2 sample construction: the precise mechanism by which the intersection mask is used to create positives versus hard-negatives is not formalized with equations or pseudocode, making it impossible to verify that the contrastive objective isolates semantic differences without inadvertently removing shared semantics.
minor comments (2)
- [Abstract] The abstract states 'over 50x fewer model parameters' without naming the exact supervised VLM baseline or reporting absolute parameter counts and embedding dimensions for all compared methods.
- [Benchmark] Benchmark construction details (human verification protocol, inter-annotator agreement, and rules for selecting hard negatives) are mentioned but lack sufficient specificity to allow independent replication or assessment of potential overlap with the web-scale training data.
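The mask-quality check requested in the first major comment can be illustrated with a small sketch. It computes intersection-over-union (IoU) between a learned binary mask and a human-annotated one. The flat 0/1 list representation and the function name `mask_iou` are assumptions made for illustration; the page does not say how the paper stores or evaluates its masks.

```python
from typing import Sequence


def mask_iou(pred: Sequence[int], ref: Sequence[int]) -> float:
    """IoU between two flat binary masks of equal length (hypothetical helper)."""
    if len(pred) != len(ref):
        raise ValueError("masks must have the same length")
    # Count positions marked in both masks, and positions marked in either.
    inter = sum(1 for p, r in zip(pred, ref) if p and r)
    union = sum(1 for p, r in zip(pred, ref) if p or r)
    # Convention chosen here: two empty masks agree perfectly.
    return inter / union if union else 1.0


# Toy example: 6 patches (or tokens), 1 = marked as shared semantics.
learned = [1, 1, 0, 0, 1, 0]
human = [1, 0, 0, 0, 1, 1]
print(mask_iou(learned, human))  # 2 shared / 4 in union -> 0.5
```

Averaging this score over a sampled, human-annotated subset would give one number for mask quality, which is the kind of evidence the comment asks for.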
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. We address each major comment below and have revised the manuscript to incorporate the requested clarifications and additional analyses.
-
Referee: [Experiments] The 7.08-point gain on the new benchmark (reported in the abstract and Experiments section) is load-bearing for the central claim yet depends on the unverified premise that the learned intersection mask produces genuinely hard negatives. No quantitative metrics on mask quality (e.g., IoU against human annotations of shared regions) or hardness statistics for the generated negatives are provided, nor is there an ablation that removes the mask stage while keeping other components fixed.
Authors: We agree that direct quantitative validation of the intersection mask quality and an ablation study would provide stronger support for the central claims. In the revised manuscript we add (i) IoU scores between the learned masks and human annotations of shared regions on a sampled subset, (ii) hardness statistics such as average similarity of the generated negatives to the positives, and (iii) an ablation that disables the mask-based sample construction while keeping the contrastive learning stage and all other components fixed. These additions confirm the mask's role in producing effective hard negatives. revision: yes
-
Referee: [§3] §3 (Method), the description of stage-2 sample construction: the precise mechanism by which the intersection mask is used to create positives versus hard-negatives is not formalized with equations or pseudocode, making it impossible to verify that the contrastive objective isolates semantic differences without inadvertently removing shared semantics.
Authors: We thank the referee for noting the lack of formalization. Section 3 has been revised to include explicit equations and pseudocode that define how the intersection mask is applied: positives are formed by masking non-intersection regions to emphasize alignment on shared semantics, while hard-negatives are formed by masking intersection regions to highlight differences. This formalization makes clear that the contrastive objective operates on semantic discrepancies while the mask preserves shared content. revision: yes
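The rule stated in the second response (positives mask the non-intersection regions, hard negatives mask the intersection regions) can be sketched on text tokens as follows. This is a minimal illustration, not the authors' implementation: the `[MASK]` placeholder, the boolean per-token mask, and the name `build_views` are all assumptions, and the paper may mask image patches or features in a different way.

```python
from typing import List, Sequence, Tuple

MASK_TOKEN = "[MASK]"  # hypothetical placeholder; the paper's masking scheme is not given here


def build_views(
    tokens: Sequence[str], intersection: Sequence[bool]
) -> Tuple[List[str], List[str]]:
    """Return (positive, hard_negative) views of one text from its intersection mask."""
    if len(tokens) != len(intersection):
        raise ValueError("tokens and mask must have the same length")
    # Positive: keep the shared (intersection) tokens, hide everything else.
    positive = [t if shared else MASK_TOKEN for t, shared in zip(tokens, intersection)]
    # Hard negative: hide the shared tokens, keep the modality-specific rest.
    hard_negative = [MASK_TOKEN if shared else t for t, shared in zip(tokens, intersection)]
    return positive, hard_negative


tokens = ["a", "red", "car", "parked", "outside"]
intersection = [False, True, True, False, False]  # assumed output of stage 1
pos, neg = build_views(tokens, intersection)
print(pos)  # ['[MASK]', 'red', 'car', '[MASK]', '[MASK]']
print(neg)  # ['a', '[MASK]', '[MASK]', 'parked', 'outside']
```

The two views are complementary by construction: every token survives in exactly one of them, so the hard negative stays superficially close to the anchor while lacking the shared content.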
Circularity Check
Self-supervised framework shows no circularity
The paper presents a two-stage self-supervised framework for symmetric multimodal retrieval using unlabeled web-scale image-text pairs. Stage 1 learns an intersection mask from the observation of shared semantics and differences; stage 2 uses the mask to construct positives and hard-negatives for contrastive embedding learning. Evaluation occurs on a separately introduced human-verified benchmark. No derivation step reduces by construction to its inputs, no fitted parameters are relabeled as predictions, and no load-bearing self-citations or imported uniqueness theorems appear. The chain is data-driven from external unlabeled sources and remains self-contained against the new benchmark.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Both semantic alignment and discrepancies exist between the two modalities
invented entities (1)
- intersection mask: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "learn the intersection mask of image-text pair, allowing us to align intersection while preserving semantic of difference... construct positive and hardnegative samples via masking different parts"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[3]
In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV)
URL https://openreview.net/forum?id=nZeVKeeFYf9. Hu, H., Luan, Y., Chen, Y., Khandelwal, U., Joshi, M., Lee, K., Toutanova, K., and Chang, M. Open-domain visual entity recognition: Towards recognizing millions of wikipedia entities. In IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, pp. 12031–12041. I...
-
[5]
doi: 10.48550/ARXIV.2505.19650. URL https://doi.org/10.48550/arXiv.2505.19650. Li, M., Zhang, Y., Long, D., Chen, K., Song, S., Bai, S., Yang, Z., Xie, P., Yang, A., Liu, D., et al. Qwen3-vl-embedding and qwen3-vl-reranker: A unified framework for state-of-the-art multimodal...
-
[6]
In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T
URL https://openreview.net/forum?id=i45NQb2iKO. Lin, T., Maire, M., Belongie, S. J., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft COCO: common objects in context. In Fleet, D. J., Pajdla, T., Schiele, B., and Tuytelaars, T. (eds.), Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-1...
-
[7]
PMLR, 2021. URL http://proceedings.mlr.press/v139/radford21a.html. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pp. 10674–10685. IEEE, 2022. doi: 10.1109/C...
-
[8]
URL https://openreview.net/forum?id=NZQkumsNlf. Zhan, J., Dai, J., Ye, J., Zhou, Y., Zhang, D., Liu, Z., Zhang, X., Yuan, R., Zhang, G., Li, L., et al. Anygpt: Unified multimodal llm with discrete sequence modeling. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 9637–9662, 2024. Z...
-
[9]
Freeman, Frédo Durand, Eli Shechtman, and Xun Huang
URL https://openreview.net/forum?id=Zc22RDtsvP. Zhang, X., Zhang, Y., Xie, W., Li, M., Dai, Z., Long, D., Xie, P., Zhang, M., Li, W., and Zhang, M. Bridging modalities: Improving universal multimodal retrieval by multimodal large language models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-1...
-
[10]
and MagicLens (Zhang et al., 2024) employ a late fusion approach that integrates features extracted by CLIP using an additional vision-language (VL) encoder. More recent techniques, such as MM-Embed (Lin et al., 2025) and VLM2Vec (Jiang et al., 2025), utilize multimodal large language models (VLMs) to fuse features, often using the hidden state of the [EN...
-
[11]
2) Multi-Task Contrastive Learning and Supervised Fine-Tuning
Contrastive Pre-training. 2) Multi-Task Contrastive Learning and Supervised Fine-Tuning. 3) Distillation and Model Merging. Training data consists of 40M high-quality collected data and 300M synthesized data. For all VLM-based methods, we employ an adaptation strategy to make them suitable for the symmetric MM2MM retrieval task. As these models are instru...
-
[12]
Corpus Construction: We first assemble a large-scale candidate corpus comprising 4 million image-text pairs from the LAION dataset
-
[13]
Multi-Modal Feature Extraction: We utilize a suite of powerful pre-trained models to extract features for all samples in the corpus. The models include a text embedding model (BGE-m3), a visual feature extractor (DINOv2), a cross-modal model (CLIP), and our own model from Stage 1.
3. Candidate Retrieval: For each training sample (anchor), we retrieve hard n...
-
[14]
Final Set Aggregation: The final hard-negative set for each anchor is formed by the union of all candidates retrieved from the different models and similarity metrics. During training, for each anchor sample, we randomly sample two hard negatives from its aggregated set to be used in the contrastive loss computation. This approach provides a diverse and ch...
-
[15]
During this final phase, the model operates with a stable masking policy (i.e., using the hard mask $\hat{M}$ directly). However, $\hat{M}$ itself is not fixed; it remains dynamic and is re-computed for each batch based on the evolving feature similarities. This ensures the model fully converges while perfecting its representation based on its own continuously refined,...
-
[16]
For fair comparison, the batch size is set to 32 for all methods, and all VLM-based methods use LoRA fine-tuning with rank 8. Because the VLM-based methods are all trained with a contrastive loss, the only factor that influences their training time is the backbone, so we divide them into 3 categories based on their backbone: 1) Qwen2-VL-7B (shortened to Qwen), including MM-E...
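The final set aggregation step in excerpt [14] above (union of candidates from all retrievers, then two random hard negatives per anchor) can be sketched as follows. The integer sample ids, the source names used as dictionary keys, and both function names are hypothetical, and removing the anchor from its own pool is an added assumption that the excerpt does not state.

```python
import random
from typing import Dict, Iterable, List, Optional, Set


def aggregate_hard_negatives(
    candidates_by_source: Dict[str, Iterable[int]], anchor_id: int
) -> Set[int]:
    """Union the candidate ids retrieved by every source for one anchor."""
    pool: Set[int] = set()
    for candidate_ids in candidates_by_source.values():
        pool.update(candidate_ids)
    # Assumption: the anchor itself is never used as its own hard negative.
    pool.discard(anchor_id)
    return pool


def sample_hard_negatives(
    pool: Set[int], k: int = 2, rng: Optional[random.Random] = None
) -> List[int]:
    """Draw up to k distinct hard negatives from the aggregated pool."""
    rng = rng or random.Random()
    ordered = sorted(pool)  # sorted so that a seeded rng gives repeatable draws
    return rng.sample(ordered, min(k, len(ordered)))


# Hypothetical candidate ids for anchor 0 from the four feature extractors.
candidates = {
    "bge_m3": [3, 5, 0],
    "dinov2": [5, 7],
    "clip": [7, 9],
    "stage1": [3],
}
pool = aggregate_hard_negatives(candidates, anchor_id=0)
negatives = sample_hard_negatives(pool, k=2, rng=random.Random(0))
print(sorted(pool), len(negatives))
```

Taking the union rather than an intersection keeps negatives that only one retriever considers close, which matches the stated goal of a diverse candidate set.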