UNIV: Unified Foundation Model for Infrared and Visible Modalities
Pith reviewed 2026-05-18 16:23 UTC · model grok-4.3
The pith
UNIV builds a shared feature space for infrared and visible images by contrasting semantically matched patches to bypass modal pattern biases.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
UNIV is a unified foundation model whose Patch Cross-modal Contrastive Learning constructs a single cross-modal feature space. A frozen pre-trained model supplies pseudo patch pairs according to semantic similarity; the contrastive objective then attracts corresponding infrared-visible representations while repelling unrelated ones, simultaneously improving alignment and inter-class separability so that the model attends to semantics rather than sensor-specific shortcuts.
What carries the argument
Patch Cross-modal Contrastive Learning (PCCL), which samples pseudo patch pairs from semantic similarity computed by a frozen model and performs attraction-repulsion alignment to enforce a unified semantic space across modalities.
If this is right
- Infrared semantic segmentation improves by 1.7 mIoU over prior cross-modal baselines.
- Infrared object detection improves by 0.7 mAP while RGB detection remains competitive.
- The MVIP dataset of 98,992 aligned pairs enables systematic study of cross-modal correspondence.
- The same alignment objective simultaneously raises inter-class semantic separability inside each modality.
Where Pith is reading between the lines
- The same pseudo-pair construction could be tested on thermal-depth or RGB-LiDAR pairs to check generality beyond the infrared-visible case.
- A controlled study that swaps the frozen sampler for a different pre-trained backbone would reveal how much the gains depend on the particular semantic prior.
- Deploying the model on unpaired real-world infrared-visible streams would test whether the learned alignment transfers without the MVIP pairing.
Load-bearing premise
Pseudo patch pairs sampled by a frozen pre-trained model based on semantic similarity provide reliable cross-modal correspondences that reflect true underlying semantics rather than inheriting biases from the pre-trained model.
What would settle it
Retraining the model after replacing PCCL with either random patch pairing or single-modality contrastive loss and checking whether the reported gains of +1.7 mIoU on infrared segmentation and +0.7 mAP on infrared detection vanish.
Figures
read the original abstract
Joint RGB-infrared perception is essential for achieving robustness under diverse weather and illumination conditions. Although foundation models excel within single modalities, they suffer from substantial cross-modal degradation, an issue we attribute to a pattern shortcut, i.e., a modal bias that prioritizes superficial sensor patterns over underlying semantics. To address this problem, we introduce UNIV, a Unified foundation model for Infrared and Visible modalities. At the core of UNIV lies Patch Cross-modal Contrastive Learning (PCCL), a self-supervised contrastive learning strategy that constructs a unified cross-modal feature space. PCCL employs a frozen pre-trained model to sample pseudo patch pairs based on semantic similarity, and aligns infrared-visible representations by attracting semantically related pairs while repelling unrelated ones. This process simultaneously enhances cross-modal alignment and inter-class semantic separability, guiding the model to focus on semantic structure rather than falling into pattern shortcuts. To further enable cross-modal learning, we introduce MVIP, the most comprehensive visible-infrared benchmark to date, containing 98,992 precisely aligned image pairs across diverse scenes. Extensive experiments demonstrate UNIV's superior performance on infrared tasks (+1.7 mIoU for semantic segmentation and +0.7 mAP for detection), while maintaining competitive accuracy on RGB tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces UNIV, a unified foundation model for infrared and visible modalities that addresses cross-modal degradation attributed to pattern shortcuts (modal bias favoring sensor patterns over semantics). The core contribution is Patch Cross-modal Contrastive Learning (PCCL), which uses a frozen pre-trained model to sample pseudo patch pairs based on semantic similarity and performs contrastive alignment to build a unified feature space. The work also presents MVIP, a new benchmark with 98,992 aligned visible-infrared image pairs, and reports performance gains on infrared tasks (+1.7 mIoU for segmentation, +0.7 mAP for detection) while remaining competitive on RGB tasks.
Significance. If the central mechanism of PCCL produces genuine semantic alignment rather than reinforcing biases, the approach could meaningfully advance robust multi-modal perception for applications like autonomous driving under varying conditions. The introduction of the large-scale MVIP benchmark is a clear community contribution that enables standardized evaluation of cross-modal methods. The paper ships a new dataset and a self-supervised alignment strategy, which are strengths for reproducibility and further research.
major comments (2)
- The central claim that PCCL avoids pattern shortcuts by focusing on semantics rests on the assumption that pseudo patch pairs sampled by the frozen pre-trained model reflect true underlying object semantics across modalities. The manuscript provides no direct validation, ablation, or analysis (e.g., qualitative inspection of pair quality, comparison to RGB-only sampling, or measurement of bias inheritance) to rule out that the pairs instead capture sensor-specific textures or co-occurrence statistics; without this, the reported infrared gains cannot be confidently attributed to the proposed semantic alignment rather than architecture, data, or training schedule.
- The experimental section reports specific deltas (+1.7 mIoU, +0.7 mAP) but lacks details on baseline implementations, statistical significance testing, variance across runs, or full ablation studies isolating PCCL from other components; this makes it difficult to assess the robustness of the superiority claims on infrared tasks.
minor comments (2)
- Clarify the exact training protocol, including how the frozen model is chosen, the temperature parameter in contrastive loss, and the proportion of pseudo pairs used during training.
- The MVIP benchmark description would benefit from explicit discussion of alignment precision metrics and potential domain gaps relative to existing visible-infrared datasets.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the major comments point by point below and outline revisions to strengthen the presentation of our claims and experimental details.
read point-by-point responses
-
Referee: The central claim that PCCL avoids pattern shortcuts by focusing on semantics rests on the assumption that pseudo patch pairs sampled by the frozen pre-trained model reflect true underlying object semantics across modalities. The manuscript provides no direct validation, ablation, or analysis (e.g., qualitative inspection of pair quality, comparison to RGB-only sampling, or measurement of bias inheritance) to rule out that the pairs instead capture sensor-specific textures or co-occurrence statistics; without this, the reported infrared gains cannot be confidently attributed to the proposed semantic alignment rather than architecture, data, or training schedule.
Authors: We appreciate this observation on the need for direct evidence. The PCCL design relies on a frozen pre-trained model to sample pairs via semantic similarity in feature space, with the intent of prioritizing object-level semantics over sensor-specific patterns. However, we acknowledge that the original manuscript does not include explicit validation such as pair visualizations or targeted ablations. In the revised version, we will add qualitative examples of sampled pseudo patch pairs, an ablation contrasting semantic sampling against random or RGB-only baselines, and discussion of potential bias metrics to better support attribution of the infrared gains to semantic alignment. revision: yes
-
Referee: The experimental section reports specific deltas (+1.7 mIoU, +0.7 mAP) but lacks details on baseline implementations, statistical significance testing, variance across runs, or full ablation studies isolating PCCL from other components; this makes it difficult to assess the robustness of the superiority claims on infrared tasks.
Authors: We agree that greater experimental transparency is warranted for assessing robustness. The reported deltas are based on our implementations, but we will expand the experimental section in revision to detail baseline setups (including any infrared adaptations), report means and standard deviations over multiple runs, include statistical significance tests for the key improvements, and provide expanded ablations that isolate PCCL from the MVIP dataset and training schedule. These changes will improve reproducibility and allow clearer evaluation of the claims. revision: yes
Circularity Check
No significant circularity detected; derivation is self-contained
full rationale
The paper defines PCCL as a contrastive alignment procedure that samples pseudo patch pairs from an independent frozen external pre-trained model and then applies standard contrastive loss; this construction does not reduce to the target performance metrics by definition. MVIP is presented as a new dataset with standard evaluation metrics, and reported gains (+1.7 mIoU, +0.7 mAP) are empirical outcomes on tasks rather than quantities forced by the sampling procedure or any self-citation chain. No equations, uniqueness theorems, or fitted parameters are shown to be equivalent to the inputs by construction, and the method relies on external components for its key step, keeping the chain independent.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Semantic similarity from a frozen pre-trained model accurately identifies cross-modal patch correspondences
invented entities (2)
-
PCCL
no independent evidence
-
MVIP
no independent evidence
Reference graph
Works this paper leans on
-
[1]
https://github.com/ultralytics/ultralytics
Yolov8. https://github.com/ultralytics/ultralytics. 5
-
[2]
Self-supervised learning from images with a joint-embedding predictive architecture
Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bo- janowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15619–15629, 2023. 2
work page 2023
-
[3]
End- to-end object detection with transformers
Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End- to-end object detection with transformers. InECCV, pages 213–229. Springer, 2020. 5
work page 2020
-
[4]
Emerg- ing properties in self-supervised vision transformers
Mathilde Caron, Hugo Touvron, Ishan Misra, Herv ´e J´egou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerg- ing properties in self-supervised vision transformers. In CVPR, pages 9650–9660, 2021. 1, 2, 5
work page 2021
-
[5]
Encoder-decoder with atrous separable convolution for semantic image segmentation
Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, pages 801–818, 2018. 5
work page 2018
-
[6]
An empiri- cal study of training self-supervised vision transformers
Xinlei Chen, Saining Xie, and Kaiming He. An empiri- cal study of training self-supervised vision transformers. In CVPR, pages 9640–9649, 2021. 1, 7
work page 2021
-
[7]
Tagging before alignment: Integrating multi- modal tags for video-text retrieval
Yizhen Chen, Jie Wang, Lijian Lin, Zhongang Qi, Jin Ma, and Ying Shan. Tagging before alignment: Integrating multi- modal tags for video-text retrieval. InProceedings of the AAAI conference on artificial intelligence, pages 396–404,
-
[8]
Learning a similarity metric discriminatively, with application to face verification
Sumit Chopra, Raia Hadsell, and Yann LeCun. Learning a similarity metric discriminatively, with application to face verification. In2005 IEEE computer society conference on computer vision and pattern recognition (CVPR’05), pages 539–546. IEEE, 2005. 2
work page 2005
-
[9]
Imagenet: A large-scale hierarchical image database
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. InCVPR, pages 248–255. Ieee, 2009. 5
work page 2009
-
[10]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin. Bert: Pre-training of deep bidirectional transformers for language understanding.arXiv preprint arXiv:1810.04805, 2018. 2
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[11]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020. 2
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[12]
Flir thermal dataset version 1.3 [dataset],
Teledyne FLIR. Flir thermal dataset version 1.3 [dataset],
-
[13]
https://www.flir.com/oem/adas/adas-dataset-form/. 2, 4
-
[14]
Catastrophic forgetting in connectionist networks.Trends in cognitive sciences, 3(4):128–135, 1999
Robert M French. Catastrophic forgetting in connectionist networks.Trends in cognitive sciences, 3(4):128–135, 1999. 8
work page 1999
-
[15]
Convmae: Masked convolution meets masked au- toencoders.arXiv preprint arXiv:2205.03892, 2022
Peng Gao, Teli Ma, Hongsheng Li, Ziyi Lin, Jifeng Dai, and Yu Qiao. Convmae: Masked convolution meets masked au- toencoders.arXiv preprint arXiv:2205.03892, 2022. 1, 2, 5, 6, 7
-
[16]
Mfnet: Towards real-time se- mantic segmentation for autonomous vehicles with multi- spectral scenes
Qishen Ha, Kohei Watanabe, Takumi Karasawa, Yoshitaka Ushiku, and Tatsuya Harada. Mfnet: Towards real-time se- mantic segmentation for autonomous vehicles with multi- spectral scenes. InIROS, pages 5108–5115. IEEE, 2017. 6, 7
work page 2017
-
[17]
Kaiming He, Georgia Gkioxari, Piotr Doll ´ar, and Ross Gir- shick. Mask r-cnn. InICCV, pages 2961–2969, 2017. 5
work page 2017
-
[18]
Masked autoencoders are scalable vision learners
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Doll´ar, and Ross Girshick. Masked autoencoders are scalable vision learners. InCVPR, pages 16000–16009, 2022. 1, 5, 6, 7
work page 2022
-
[19]
Jefferson Hernandez, Ruben Villegas, and Vicente Ordonez. Vic-mae: Self-supervised representation learning from im- ages and video with contrastive masked autoencoders. In European Conference on Computer Vision, pages 444–463. Springer, 2024. 2
work page 2024
-
[20]
Milan: Masked image pretraining on language assisted representation.arXiv preprint arXiv:2208.06049,
Zejiang Hou, Fei Sun, Yen-Kuang Chen, Yuan Xie, and Sun- Yuan Kung. Milan: Masked image pretraining on language assisted representation.arXiv preprint arXiv:2208.06049,
-
[21]
Interlaced sparse self-attention for semantic segmentation.arXiv preprint arXiv:1907.12273, 2019
Lang Huang, Yuhui Yuan, Jianyuan Guo, Chao Zhang, Xilin Chen, and Jingdong Wang. Interlaced sparse self-attention for semantic segmentation.arXiv preprint arXiv:1907.12273, 2019. 5
-
[22]
Multispectral pedestrian detection: Benchmark dataset and baseline
Soonmin Hwang, Jaesik Park, Namil Kim, Yukyung Choi, and In So Kweon. Multispectral pedestrian detection: Benchmark dataset and baseline. InCVPR, pages 1037– 1045, 2015. 2, 4
work page 2015
-
[23]
Llvip: A visible-infrared paired dataset for low-light vision
Xinyu Jia, Chuang Zhu, Minzhen Li, Wenqi Tang, and Wenli Zhou. Llvip: A visible-infrared paired dataset for low-light vision. InCVPR, pages 3496–3504, 2021. 2, 4
work page 2021
-
[24]
Jie Jiang, Shaobo Min, Weijie Kong, Hongfa Wang, Zhifeng Li, and Wei Liu. Tencent text-video retrieval: hierarchi- cal cross-modal interactions with multi-level representations. IEEE Access, 2022. 2
work page 2022
-
[25]
Chenglong Li, Wei Xia, Yan Yan, Bin Luo, and Jin Tang. Segmenting objects in day and night: Edge-conditioned cnn for thermal image semantic segmentation.TNNLS, 32(7): 3069–3082, 2020. 6, 7
work page 2020
-
[26]
Lasher: A large-scale high- diversity benchmark for rgbt tracking.TIP, 31:392–404,
Chenglong Li, Wanlin Xue, Yaqing Jia, Zhichen Qu, Bin Luo, Jin Tang, and Dengdi Sun. Lasher: A large-scale high- diversity benchmark for rgbt tracking.TIP, 31:392–404,
-
[27]
Feature pyramid networks for object detection
Tsung-Yi Lin, Piotr Doll ´ar, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. InCVPR, pages 2117–2125,
-
[28]
Infmae: A foundation model in the infrared modality
Fangcen Liu, Chenqiang Gao, Yaming Zhang, Junjie Guo, Jinghao Wang, and Deyu Meng. Infmae: A foundation model in the infrared modality. InECCV, pages 420–437. Springer, 2025. 1, 2, 5, 6
work page 2025
-
[29]
Jinyuan Liu, Xin Fan, Zhanbo Huang, Guanyao Wu, Risheng Liu, Wei Zhong, and Zhongxuan Luo. Target-aware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection. InCVPR, pages 5802–5811, 2022. 5
work page 2022
-
[30]
X-clip: End-to-end multi-grained con- trastive learning for video-text retrieval
Yiwei Ma, Guohai Xu, Xiaoshuai Sun, Ming Yan, Ji Zhang, and Rongrong Ji. X-clip: End-to-end multi-grained con- trastive learning for video-text retrieval. InProceedings of the 30th ACM international conference on multimedia, pages 638–647, 2022. 3
work page 2022
-
[31]
Fudong Nian, Ling Ding, Yuxia Hu, and Yanhong Gu. Multi- level cross-modal semantic alignment network for video–text retrieval.Mathematics, 10(18):3346, 2022. 2
work page 2022
-
[32]
DINOv2: Learning Robust Visual Features without Supervision
Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023. 2
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[33]
Learn- ing transferable visual models from natural language super- vision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- ing transferable visual models from natural language super- vision. InICML, pages 8748–8763. PMLR, 2021. 2, 8
work page 2021
-
[34]
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks.PAMI, 39(6):1137–1149, 2016. 5
work page 2016
-
[35]
Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas. The earth mover’s distance as a metric for image retrieval.In- ternational journal of computer vision, 40(2):99–121, 2000. 6
work page 2000
-
[36]
Yiming Sun, Bing Cao, Pengfei Zhu, and Qinghua Hu. Drone-based rgb-infrared cross-modality vehicle detection via uncertainty-aware learning.TCSVT, 32(10):6700–6713,
-
[37]
Linfeng Tang, Jiteng Yuan, Hao Zhang, Xingyu Jiang, and Jiayi Ma. Piafusion: A progressive infrared and visible im- age fusion network based on illumination aware.Information Fusion, 2022. 1, 5, 6
work page 2022
-
[38]
Fine-grained action retrieval through multiple parts- of-speech embeddings
Michael Wray, Diane Larlus, Gabriela Csurka, and Dima Damen. Fine-grained action retrieval through multiple parts- of-speech embeddings. InProceedings of the IEEE/CVF in- ternational conference on computer vision, pages 450–459,
-
[39]
Unified perceptual parsing for scene understand- ing
Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene understand- ing. InECCV, pages 418–434, 2018. 5
work page 2018
-
[40]
Hitea: Hierarchical temporal- aware video-language pre-training
Qinghao Ye, Guohai Xu, Ming Yan, Haiyang Xu, Qi Qian, Ji Zhang, and Fei Huang. Hitea: Hierarchical temporal- aware video-language pre-training. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 15405–15416, 2023. 2
work page 2023
-
[41]
Sigmoid loss for language image pre-training
Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InProceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023. 2
work page 2023
-
[42]
Tao Zhang, Kun Ding, Jinyong Wen, Yu Xiong, Zeyu Zhang, Shiming Xiang, and Chunhong Pan. Pad: Self-supervised pre-training with patchwise-scale adapter for infrared im- ages.arXiv preprint arXiv:2312.08192, 2023. 1, 2, 6, 7
-
[43]
Tao Zhang, Jinyong Wen, Zhen Chen, Kun Ding, Shiming Xiang, and Chunhong Pan. Unip: Rethinking pre-trained attention patterns for infrared semantic segmentation.arXiv preprint arXiv:2502.02257, 2025. 1
-
[44]
Scene parsing through ade20k dataset
Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. InCVPR, pages 633–641, 2017. 6
work page 2017
-
[45]
iBOT: Image BERT Pre-Training with Online Tokenizer
Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. ibot: Image bert pre-training with online tokenizer.arXiv preprint arXiv:2111.07832,
work page internal anchor Pith review Pith/arXiv arXiv
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.