UNIV: Unified Foundation Model for Infrared and Visible Modalities

Chen Min; Fangyuan Mao; Fuyang Liu; Jilin Mei; Meiqi Wu; Shun Lu; Shuo Wang; Xiaokun Feng; Yu Hu

arxiv: 2509.15642 · v3 · pith:XT2JVNDRnew · submitted 2025-09-19 · 💻 cs.CV

UNIV: Unified Foundation Model for Infrared and Visible Modalities

Fangyuan Mao , Shuo Wang , Jilin Mei , Shun Lu , Chen Min , Fuyang Liu , Xiaokun Feng , Meiqi Wu

show 1 more author

Yu Hu

This is my paper

Pith reviewed 2026-05-18 16:23 UTC · model grok-4.3

classification 💻 cs.CV

keywords unified foundation modelcross-modal contrastive learninginfrared visible perceptionpattern shortcutsemantic alignmentMVIP benchmarksemantic segmentationobject detection

0 comments

The pith

UNIV builds a shared feature space for infrared and visible images by contrasting semantically matched patches to bypass modal pattern biases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that foundation models degrade across infrared and visible modalities because they exploit superficial sensor patterns instead of shared semantics. UNIV counters this with Patch Cross-modal Contrastive Learning that pulls representations of semantically similar patches together and pushes unrelated ones apart. The method rests on pseudo pairs generated by a frozen model and is supported by a new benchmark of nearly 99,000 precisely aligned image pairs. A sympathetic reader would care because joint RGB-infrared perception promises robustness in poor lighting or weather without sacrificing performance on ordinary RGB tasks.

Core claim

UNIV is a unified foundation model whose Patch Cross-modal Contrastive Learning constructs a single cross-modal feature space. A frozen pre-trained model supplies pseudo patch pairs according to semantic similarity; the contrastive objective then attracts corresponding infrared-visible representations while repelling unrelated ones, simultaneously improving alignment and inter-class separability so that the model attends to semantics rather than sensor-specific shortcuts.

What carries the argument

Patch Cross-modal Contrastive Learning (PCCL), which samples pseudo patch pairs from semantic similarity computed by a frozen model and performs attraction-repulsion alignment to enforce a unified semantic space across modalities.

If this is right

Infrared semantic segmentation improves by 1.7 mIoU over prior cross-modal baselines.
Infrared object detection improves by 0.7 mAP while RGB detection remains competitive.
The MVIP dataset of 98,992 aligned pairs enables systematic study of cross-modal correspondence.
The same alignment objective simultaneously raises inter-class semantic separability inside each modality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pseudo-pair construction could be tested on thermal-depth or RGB-LiDAR pairs to check generality beyond the infrared-visible case.
A controlled study that swaps the frozen sampler for a different pre-trained backbone would reveal how much the gains depend on the particular semantic prior.
Deploying the model on unpaired real-world infrared-visible streams would test whether the learned alignment transfers without the MVIP pairing.

Load-bearing premise

Pseudo patch pairs sampled by a frozen pre-trained model based on semantic similarity provide reliable cross-modal correspondences that reflect true underlying semantics rather than inheriting biases from the pre-trained model.

What would settle it

Retraining the model after replacing PCCL with either random patch pairing or single-modality contrastive loss and checking whether the reported gains of +1.7 mIoU on infrared segmentation and +0.7 mAP on infrared detection vanish.

Figures

Figures reproduced from arXiv: 2509.15642 by Chen Min, Fangyuan Mao, Fuyang Liu, Jilin Mei, Meiqi Wu, Shun Lu, Shuo Wang, Xiaokun Feng, Yu Hu.

**Figure 1.** Figure 1: Visualization of attention maps from the final layer of different pre-trained models for a specific patch (marked with a star). (a) and (c) show a cyclist and a vehicle in the infrared modality, while (b) and (d) show the same objects in the visible RGB modality. Our UNIV establishes excellent cross-modal alignment and inter-class separability, enabling consistent attention localization for focused regi… view at source ↗

**Figure 2.** Figure 2: Proposed Unified Feature Space. UNIV learns a semantically rich and robust feature mapping by promoting intraclass cross-modal alignment and enhancing inter-class separation. 2. Related Works 2.1. The Foundation Model The success of pre-trained models like BERT [10], ViTs [11], and CLIP [32] demonstrate their ability to learn general representations from vast datasets. Following this paradigm, infrared-s… view at source ↗

**Figure 3.** Figure 3: Overview of the proposed UNIV. Attention maps from a frozen visible RGB backbone are transformed to similarity pseudolabels MA. The cross-modality similarity map MIA and visible modality similarity map MV A are optimized to align with MA. positive samples, optimizing the KL divergence is equivalent to optimizing a binary cross-entropy (BCE) loss between MA and σ(MIA) as shown below: LIA ≡ DKL(MA∥σ(MIA))… view at source ↗

**Figure 4.** Figure 4: Cross-modal Aligment. Our model achieves excellent cross-modal alignment with efficient inter-class separability. 4.5.1. Cross-modal Alignment As shown in [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 6.** Figure 6: Heterogeneous vs Aligned Feature Space. Left: MCMAE projection feature space. Right: our proposed unified feature space. The t-SNE analysis in [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

read the original abstract

Joint RGB-infrared perception is essential for achieving robustness under diverse weather and illumination conditions. Although foundation models excel within single modalities, they suffer from substantial cross-modal degradation, an issue we attribute to a pattern shortcut, i.e., a modal bias that prioritizes superficial sensor patterns over underlying semantics. To address this problem, we introduce UNIV, a Unified foundation model for Infrared and Visible modalities. At the core of UNIV lies Patch Cross-modal Contrastive Learning (PCCL), a self-supervised contrastive learning strategy that constructs a unified cross-modal feature space. PCCL employs a frozen pre-trained model to sample pseudo patch pairs based on semantic similarity, and aligns infrared-visible representations by attracting semantically related pairs while repelling unrelated ones. This process simultaneously enhances cross-modal alignment and inter-class semantic separability, guiding the model to focus on semantic structure rather than falling into pattern shortcuts. To further enable cross-modal learning, we introduce MVIP, the most comprehensive visible-infrared benchmark to date, containing 98,992 precisely aligned image pairs across diverse scenes. Extensive experiments demonstrate UNIV's superior performance on infrared tasks (+1.7 mIoU for semantic segmentation and +0.7 mAP for detection), while maintaining competitive accuracy on RGB tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

UNIV's PCCL uses semantic pseudo-pairs from a frozen model to align IR and visible, plus a large new MVIP benchmark, but the link from those pairs to the reported gains needs tighter verification.

read the letter

The main takeaway is that this paper builds a unified foundation model for infrared and visible by introducing Patch Cross-modal Contrastive Learning. PCCL samples pseudo patch pairs based on semantic similarity from a frozen pre-trained model, then applies contrastive loss to pull matching pairs together and push others apart. They also release MVIP, a benchmark with nearly 99,000 aligned image pairs across varied scenes. That dataset size is a concrete step forward for anyone training on these modalities together. The reported lifts on infrared segmentation and detection while staying competitive on RGB are the kind of practical result that matters for all-weather perception systems. The approach is a direct extension of contrastive ideas to cross-modal patches, which is a reasonable move beyond simple fusion or single-modality pretraining. The soft spot is in how much the gains can be credited to the semantic alignment step. The abstract gives deltas but little on exact baselines, statistical significance, or ablations that isolate the pair-sampling mechanism. If the frozen model mostly groups patches by texture or co-occurrence rather than true object semantics, the contrastive objective could reinforce modal biases instead of removing them. That possibility is not ruled out by the information given, so the central story about avoiding pattern shortcuts rests on somewhat thin evidence. This paper is aimed at researchers working on multi-modal foundation models for real-world vision under changing illumination and weather. Someone building or evaluating cross-modal systems would get usable ideas from the dataset and the training recipe. It is solid enough to deserve a serious referee who can ask for the missing controls and checks on the pair quality. I would send it out for review rather than desk reject.

Referee Report

2 major / 2 minor

Summary. The paper introduces UNIV, a unified foundation model for infrared and visible modalities that addresses cross-modal degradation attributed to pattern shortcuts (modal bias favoring sensor patterns over semantics). The core contribution is Patch Cross-modal Contrastive Learning (PCCL), which uses a frozen pre-trained model to sample pseudo patch pairs based on semantic similarity and performs contrastive alignment to build a unified feature space. The work also presents MVIP, a new benchmark with 98,992 aligned visible-infrared image pairs, and reports performance gains on infrared tasks (+1.7 mIoU for segmentation, +0.7 mAP for detection) while remaining competitive on RGB tasks.

Significance. If the central mechanism of PCCL produces genuine semantic alignment rather than reinforcing biases, the approach could meaningfully advance robust multi-modal perception for applications like autonomous driving under varying conditions. The introduction of the large-scale MVIP benchmark is a clear community contribution that enables standardized evaluation of cross-modal methods. The paper ships a new dataset and a self-supervised alignment strategy, which are strengths for reproducibility and further research.

major comments (2)

The central claim that PCCL avoids pattern shortcuts by focusing on semantics rests on the assumption that pseudo patch pairs sampled by the frozen pre-trained model reflect true underlying object semantics across modalities. The manuscript provides no direct validation, ablation, or analysis (e.g., qualitative inspection of pair quality, comparison to RGB-only sampling, or measurement of bias inheritance) to rule out that the pairs instead capture sensor-specific textures or co-occurrence statistics; without this, the reported infrared gains cannot be confidently attributed to the proposed semantic alignment rather than architecture, data, or training schedule.
The experimental section reports specific deltas (+1.7 mIoU, +0.7 mAP) but lacks details on baseline implementations, statistical significance testing, variance across runs, or full ablation studies isolating PCCL from other components; this makes it difficult to assess the robustness of the superiority claims on infrared tasks.

minor comments (2)

Clarify the exact training protocol, including how the frozen model is chosen, the temperature parameter in contrastive loss, and the proportion of pseudo pairs used during training.
The MVIP benchmark description would benefit from explicit discussion of alignment precision metrics and potential domain gaps relative to existing visible-infrared datasets.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comments point by point below and outline revisions to strengthen the presentation of our claims and experimental details.

read point-by-point responses

Referee: The central claim that PCCL avoids pattern shortcuts by focusing on semantics rests on the assumption that pseudo patch pairs sampled by the frozen pre-trained model reflect true underlying object semantics across modalities. The manuscript provides no direct validation, ablation, or analysis (e.g., qualitative inspection of pair quality, comparison to RGB-only sampling, or measurement of bias inheritance) to rule out that the pairs instead capture sensor-specific textures or co-occurrence statistics; without this, the reported infrared gains cannot be confidently attributed to the proposed semantic alignment rather than architecture, data, or training schedule.

Authors: We appreciate this observation on the need for direct evidence. The PCCL design relies on a frozen pre-trained model to sample pairs via semantic similarity in feature space, with the intent of prioritizing object-level semantics over sensor-specific patterns. However, we acknowledge that the original manuscript does not include explicit validation such as pair visualizations or targeted ablations. In the revised version, we will add qualitative examples of sampled pseudo patch pairs, an ablation contrasting semantic sampling against random or RGB-only baselines, and discussion of potential bias metrics to better support attribution of the infrared gains to semantic alignment. revision: yes
Referee: The experimental section reports specific deltas (+1.7 mIoU, +0.7 mAP) but lacks details on baseline implementations, statistical significance testing, variance across runs, or full ablation studies isolating PCCL from other components; this makes it difficult to assess the robustness of the superiority claims on infrared tasks.

Authors: We agree that greater experimental transparency is warranted for assessing robustness. The reported deltas are based on our implementations, but we will expand the experimental section in revision to detail baseline setups (including any infrared adaptations), report means and standard deviations over multiple runs, include statistical significance tests for the key improvements, and provide expanded ablations that isolate PCCL from the MVIP dataset and training schedule. These changes will improve reproducibility and allow clearer evaluation of the claims. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected; derivation is self-contained

full rationale

The paper defines PCCL as a contrastive alignment procedure that samples pseudo patch pairs from an independent frozen external pre-trained model and then applies standard contrastive loss; this construction does not reduce to the target performance metrics by definition. MVIP is presented as a new dataset with standard evaluation metrics, and reported gains (+1.7 mIoU, +0.7 mAP) are empirical outcomes on tasks rather than quantities forced by the sampling procedure or any self-citation chain. No equations, uniqueness theorems, or fitted parameters are shown to be equivalent to the inputs by construction, and the method relies on external components for its key step, keeping the chain independent.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central claim rests on the domain assumption that semantic similarity signals from a frozen pre-trained model transfer reliably across modalities; no free parameters or invented physical entities are described in the abstract.

axioms (1)

domain assumption Semantic similarity from a frozen pre-trained model accurately identifies cross-modal patch correspondences
Invoked to construct pseudo patch pairs for PCCL training

invented entities (2)

PCCL no independent evidence
purpose: Self-supervised contrastive strategy to align infrared and visible representations
Core training method introduced in the paper
MVIP no independent evidence
purpose: Large-scale precisely aligned visible-infrared image benchmark
New dataset released to support cross-modal training and evaluation

pith-pipeline@v0.9.0 · 5771 in / 1347 out tokens · 58947 ms · 2026-05-18T16:23:39.813412+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · 4 internal anchors

[1]

https://github.com/ultralytics/ultralytics

Yolov8. https://github.com/ultralytics/ultralytics. 5

work page
[2]

Self-supervised learning from images with a joint-embedding predictive architecture

Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bo- janowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15619–15629, 2023. 2

work page 2023
[3]

End- to-end object detection with transformers

Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End- to-end object detection with transformers. InECCV, pages 213–229. Springer, 2020. 5

work page 2020
[4]

Emerg- ing properties in self-supervised vision transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Herv ´e J´egou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerg- ing properties in self-supervised vision transformers. In CVPR, pages 9650–9660, 2021. 1, 2, 5

work page 2021
[5]

Encoder-decoder with atrous separable convolution for semantic image segmentation

Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, pages 801–818, 2018. 5

work page 2018
[6]

An empiri- cal study of training self-supervised vision transformers

Xinlei Chen, Saining Xie, and Kaiming He. An empiri- cal study of training self-supervised vision transformers. In CVPR, pages 9640–9649, 2021. 1, 7

work page 2021
[7]

Tagging before alignment: Integrating multi- modal tags for video-text retrieval

Yizhen Chen, Jie Wang, Lijian Lin, Zhongang Qi, Jin Ma, and Ying Shan. Tagging before alignment: Integrating multi- modal tags for video-text retrieval. InProceedings of the AAAI conference on artificial intelligence, pages 396–404,

work page
[8]

Learning a similarity metric discriminatively, with application to face verification

Sumit Chopra, Raia Hadsell, and Yann LeCun. Learning a similarity metric discriminatively, with application to face verification. In2005 IEEE computer society conference on computer vision and pattern recognition (CVPR’05), pages 539–546. IEEE, 2005. 2

work page 2005
[9]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. InCVPR, pages 248–255. Ieee, 2009. 5

work page 2009
[10]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin. Bert: Pre-training of deep bidirectional transformers for language understanding.arXiv preprint arXiv:1810.04805, 2018. 2

work page internal anchor Pith review Pith/arXiv arXiv 2018
[11]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020. 2

work page internal anchor Pith review Pith/arXiv arXiv 2010
[12]

Flir thermal dataset version 1.3 [dataset],

Teledyne FLIR. Flir thermal dataset version 1.3 [dataset],

work page
[13]

https://www.flir.com/oem/adas/adas-dataset-form/. 2, 4

work page
[14]

Catastrophic forgetting in connectionist networks.Trends in cognitive sciences, 3(4):128–135, 1999

Robert M French. Catastrophic forgetting in connectionist networks.Trends in cognitive sciences, 3(4):128–135, 1999. 8

work page 1999
[15]

Convmae: Masked convolution meets masked au- toencoders.arXiv preprint arXiv:2205.03892, 2022

Peng Gao, Teli Ma, Hongsheng Li, Ziyi Lin, Jifeng Dai, and Yu Qiao. Convmae: Masked convolution meets masked au- toencoders.arXiv preprint arXiv:2205.03892, 2022. 1, 2, 5, 6, 7

work page arXiv 2022
[16]

Mfnet: Towards real-time se- mantic segmentation for autonomous vehicles with multi- spectral scenes

Qishen Ha, Kohei Watanabe, Takumi Karasawa, Yoshitaka Ushiku, and Tatsuya Harada. Mfnet: Towards real-time se- mantic segmentation for autonomous vehicles with multi- spectral scenes. InIROS, pages 5108–5115. IEEE, 2017. 6, 7

work page 2017
[17]

Mask r-cnn

Kaiming He, Georgia Gkioxari, Piotr Doll ´ar, and Ross Gir- shick. Mask r-cnn. InICCV, pages 2961–2969, 2017. 5

work page 2017
[18]

Masked autoencoders are scalable vision learners

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Doll´ar, and Ross Girshick. Masked autoencoders are scalable vision learners. InCVPR, pages 16000–16009, 2022. 1, 5, 6, 7

work page 2022
[19]

Vic-mae: Self-supervised representation learning from im- ages and video with contrastive masked autoencoders

Jefferson Hernandez, Ruben Villegas, and Vicente Ordonez. Vic-mae: Self-supervised representation learning from im- ages and video with contrastive masked autoencoders. In European Conference on Computer Vision, pages 444–463. Springer, 2024. 2

work page 2024
[20]

Milan: Masked image pretraining on language assisted representation.arXiv preprint arXiv:2208.06049,

Zejiang Hou, Fei Sun, Yen-Kuang Chen, Yuan Xie, and Sun- Yuan Kung. Milan: Masked image pretraining on language assisted representation.arXiv preprint arXiv:2208.06049,

work page arXiv
[21]

Interlaced sparse self-attention for semantic segmentation.arXiv preprint arXiv:1907.12273, 2019

Lang Huang, Yuhui Yuan, Jianyuan Guo, Chao Zhang, Xilin Chen, and Jingdong Wang. Interlaced sparse self-attention for semantic segmentation.arXiv preprint arXiv:1907.12273, 2019. 5

work page arXiv 1907
[22]

Multispectral pedestrian detection: Benchmark dataset and baseline

Soonmin Hwang, Jaesik Park, Namil Kim, Yukyung Choi, and In So Kweon. Multispectral pedestrian detection: Benchmark dataset and baseline. InCVPR, pages 1037– 1045, 2015. 2, 4

work page 2015
[23]

Llvip: A visible-infrared paired dataset for low-light vision

Xinyu Jia, Chuang Zhu, Minzhen Li, Wenqi Tang, and Wenli Zhou. Llvip: A visible-infrared paired dataset for low-light vision. InCVPR, pages 3496–3504, 2021. 2, 4

work page 2021
[24]

Tencent text-video retrieval: hierarchi- cal cross-modal interactions with multi-level representations

Jie Jiang, Shaobo Min, Weijie Kong, Hongfa Wang, Zhifeng Li, and Wei Liu. Tencent text-video retrieval: hierarchi- cal cross-modal interactions with multi-level representations. IEEE Access, 2022. 2

work page 2022
[25]

Segmenting objects in day and night: Edge-conditioned cnn for thermal image semantic segmentation.TNNLS, 32(7): 3069–3082, 2020

Chenglong Li, Wei Xia, Yan Yan, Bin Luo, and Jin Tang. Segmenting objects in day and night: Edge-conditioned cnn for thermal image semantic segmentation.TNNLS, 32(7): 3069–3082, 2020. 6, 7

work page 2020
[26]

Lasher: A large-scale high- diversity benchmark for rgbt tracking.TIP, 31:392–404,

Chenglong Li, Wanlin Xue, Yaqing Jia, Zhichen Qu, Bin Luo, Jin Tang, and Dengdi Sun. Lasher: A large-scale high- diversity benchmark for rgbt tracking.TIP, 31:392–404,

work page
[27]

Feature pyramid networks for object detection

Tsung-Yi Lin, Piotr Doll ´ar, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. InCVPR, pages 2117–2125,

work page
[28]

Infmae: A foundation model in the infrared modality

Fangcen Liu, Chenqiang Gao, Yaming Zhang, Junjie Guo, Jinghao Wang, and Deyu Meng. Infmae: A foundation model in the infrared modality. InECCV, pages 420–437. Springer, 2025. 1, 2, 5, 6

work page 2025
[29]

Target-aware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection

Jinyuan Liu, Xin Fan, Zhanbo Huang, Guanyao Wu, Risheng Liu, Wei Zhong, and Zhongxuan Luo. Target-aware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection. InCVPR, pages 5802–5811, 2022. 5

work page 2022
[30]

X-clip: End-to-end multi-grained con- trastive learning for video-text retrieval

Yiwei Ma, Guohai Xu, Xiaoshuai Sun, Ming Yan, Ji Zhang, and Rongrong Ji. X-clip: End-to-end multi-grained con- trastive learning for video-text retrieval. InProceedings of the 30th ACM international conference on multimedia, pages 638–647, 2022. 3

work page 2022
[31]

Multi- level cross-modal semantic alignment network for video–text retrieval.Mathematics, 10(18):3346, 2022

Fudong Nian, Ling Ding, Yuxia Hu, and Yanhong Gu. Multi- level cross-modal semantic alignment network for video–text retrieval.Mathematics, 10(18):3346, 2022. 2

work page 2022
[32]

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[33]

Learn- ing transferable visual models from natural language super- vision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- ing transferable visual models from natural language super- vision. InICML, pages 8748–8763. PMLR, 2021. 2, 8

work page 2021
[34]

Faster r-cnn: Towards real-time object detection with region proposal networks.PAMI, 39(6):1137–1149, 2016

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks.PAMI, 39(6):1137–1149, 2016. 5

work page 2016
[35]

The earth mover’s distance as a metric for image retrieval.In- ternational journal of computer vision, 40(2):99–121, 2000

Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas. The earth mover’s distance as a metric for image retrieval.In- ternational journal of computer vision, 40(2):99–121, 2000. 6

work page 2000
[36]

Drone-based rgb-infrared cross-modality vehicle detection via uncertainty-aware learning.TCSVT, 32(10):6700–6713,

Yiming Sun, Bing Cao, Pengfei Zhu, and Qinghua Hu. Drone-based rgb-infrared cross-modality vehicle detection via uncertainty-aware learning.TCSVT, 32(10):6700–6713,

work page
[37]

Piafusion: A progressive infrared and visible im- age fusion network based on illumination aware.Information Fusion, 2022

Linfeng Tang, Jiteng Yuan, Hao Zhang, Xingyu Jiang, and Jiayi Ma. Piafusion: A progressive infrared and visible im- age fusion network based on illumination aware.Information Fusion, 2022. 1, 5, 6

work page 2022
[38]

Fine-grained action retrieval through multiple parts- of-speech embeddings

Michael Wray, Diane Larlus, Gabriela Csurka, and Dima Damen. Fine-grained action retrieval through multiple parts- of-speech embeddings. InProceedings of the IEEE/CVF in- ternational conference on computer vision, pages 450–459,

work page
[39]

Unified perceptual parsing for scene understand- ing

Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene understand- ing. InECCV, pages 418–434, 2018. 5

work page 2018
[40]

Hitea: Hierarchical temporal- aware video-language pre-training

Qinghao Ye, Guohai Xu, Ming Yan, Haiyang Xu, Qi Qian, Ji Zhang, and Fei Huang. Hitea: Hierarchical temporal- aware video-language pre-training. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 15405–15416, 2023. 2

work page 2023
[41]

Sigmoid loss for language image pre-training

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InProceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023. 2

work page 2023
[42]

Pad: Self-supervised pre-training with patchwise-scale adapter for infrared im- ages.arXiv preprint arXiv:2312.08192, 2023

Tao Zhang, Kun Ding, Jinyong Wen, Yu Xiong, Zeyu Zhang, Shiming Xiang, and Chunhong Pan. Pad: Self-supervised pre-training with patchwise-scale adapter for infrared im- ages.arXiv preprint arXiv:2312.08192, 2023. 1, 2, 6, 7

work page arXiv 2023
[43]

Unip: Rethinking pre-trained attention patterns for infrared semantic segmentation.arXiv preprint arXiv:2502.02257, 2025

Tao Zhang, Jinyong Wen, Zhen Chen, Kun Ding, Shiming Xiang, and Chunhong Pan. Unip: Rethinking pre-trained attention patterns for infrared semantic segmentation.arXiv preprint arXiv:2502.02257, 2025. 1

work page arXiv 2025
[44]

Scene parsing through ade20k dataset

Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. InCVPR, pages 633–641, 2017. 6

work page 2017
[45]

iBOT: Image BERT Pre-Training with Online Tokenizer

Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. ibot: Image bert pre-training with online tokenizer.arXiv preprint arXiv:2111.07832,

work page internal anchor Pith review Pith/arXiv arXiv

[1] [1]

https://github.com/ultralytics/ultralytics

Yolov8. https://github.com/ultralytics/ultralytics. 5

work page

[2] [2]

Self-supervised learning from images with a joint-embedding predictive architecture

Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bo- janowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15619–15629, 2023. 2

work page 2023

[3] [3]

End- to-end object detection with transformers

Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End- to-end object detection with transformers. InECCV, pages 213–229. Springer, 2020. 5

work page 2020

[4] [4]

Emerg- ing properties in self-supervised vision transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Herv ´e J´egou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerg- ing properties in self-supervised vision transformers. In CVPR, pages 9650–9660, 2021. 1, 2, 5

work page 2021

[5] [5]

Encoder-decoder with atrous separable convolution for semantic image segmentation

Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, pages 801–818, 2018. 5

work page 2018

[6] [6]

An empiri- cal study of training self-supervised vision transformers

Xinlei Chen, Saining Xie, and Kaiming He. An empiri- cal study of training self-supervised vision transformers. In CVPR, pages 9640–9649, 2021. 1, 7

work page 2021

[7] [7]

Tagging before alignment: Integrating multi- modal tags for video-text retrieval

Yizhen Chen, Jie Wang, Lijian Lin, Zhongang Qi, Jin Ma, and Ying Shan. Tagging before alignment: Integrating multi- modal tags for video-text retrieval. InProceedings of the AAAI conference on artificial intelligence, pages 396–404,

work page

[8] [8]

Learning a similarity metric discriminatively, with application to face verification

Sumit Chopra, Raia Hadsell, and Yann LeCun. Learning a similarity metric discriminatively, with application to face verification. In2005 IEEE computer society conference on computer vision and pattern recognition (CVPR’05), pages 539–546. IEEE, 2005. 2

work page 2005

[9] [9]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. InCVPR, pages 248–255. Ieee, 2009. 5

work page 2009

[10] [10]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin. Bert: Pre-training of deep bidirectional transformers for language understanding.arXiv preprint arXiv:1810.04805, 2018. 2

work page internal anchor Pith review Pith/arXiv arXiv 2018

[11] [11]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020. 2

work page internal anchor Pith review Pith/arXiv arXiv 2010

[12] [12]

Flir thermal dataset version 1.3 [dataset],

Teledyne FLIR. Flir thermal dataset version 1.3 [dataset],

work page

[13] [13]

https://www.flir.com/oem/adas/adas-dataset-form/. 2, 4

work page

[14] [14]

Catastrophic forgetting in connectionist networks.Trends in cognitive sciences, 3(4):128–135, 1999

Robert M French. Catastrophic forgetting in connectionist networks.Trends in cognitive sciences, 3(4):128–135, 1999. 8

work page 1999

[15] [15]

Convmae: Masked convolution meets masked au- toencoders.arXiv preprint arXiv:2205.03892, 2022

Peng Gao, Teli Ma, Hongsheng Li, Ziyi Lin, Jifeng Dai, and Yu Qiao. Convmae: Masked convolution meets masked au- toencoders.arXiv preprint arXiv:2205.03892, 2022. 1, 2, 5, 6, 7

work page arXiv 2022

[16] [16]

Mfnet: Towards real-time se- mantic segmentation for autonomous vehicles with multi- spectral scenes

Qishen Ha, Kohei Watanabe, Takumi Karasawa, Yoshitaka Ushiku, and Tatsuya Harada. Mfnet: Towards real-time se- mantic segmentation for autonomous vehicles with multi- spectral scenes. InIROS, pages 5108–5115. IEEE, 2017. 6, 7

work page 2017

[17] [17]

Mask r-cnn

Kaiming He, Georgia Gkioxari, Piotr Doll ´ar, and Ross Gir- shick. Mask r-cnn. InICCV, pages 2961–2969, 2017. 5

work page 2017

[18] [18]

Masked autoencoders are scalable vision learners

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Doll´ar, and Ross Girshick. Masked autoencoders are scalable vision learners. InCVPR, pages 16000–16009, 2022. 1, 5, 6, 7

work page 2022

[19] [19]

Vic-mae: Self-supervised representation learning from im- ages and video with contrastive masked autoencoders

Jefferson Hernandez, Ruben Villegas, and Vicente Ordonez. Vic-mae: Self-supervised representation learning from im- ages and video with contrastive masked autoencoders. In European Conference on Computer Vision, pages 444–463. Springer, 2024. 2

work page 2024

[20] [20]

Milan: Masked image pretraining on language assisted representation.arXiv preprint arXiv:2208.06049,

Zejiang Hou, Fei Sun, Yen-Kuang Chen, Yuan Xie, and Sun- Yuan Kung. Milan: Masked image pretraining on language assisted representation.arXiv preprint arXiv:2208.06049,

work page arXiv

[21] [21]

Interlaced sparse self-attention for semantic segmentation.arXiv preprint arXiv:1907.12273, 2019

Lang Huang, Yuhui Yuan, Jianyuan Guo, Chao Zhang, Xilin Chen, and Jingdong Wang. Interlaced sparse self-attention for semantic segmentation.arXiv preprint arXiv:1907.12273, 2019. 5

work page arXiv 1907

[22] [22]

Multispectral pedestrian detection: Benchmark dataset and baseline

Soonmin Hwang, Jaesik Park, Namil Kim, Yukyung Choi, and In So Kweon. Multispectral pedestrian detection: Benchmark dataset and baseline. InCVPR, pages 1037– 1045, 2015. 2, 4

work page 2015

[23] [23]

Llvip: A visible-infrared paired dataset for low-light vision

Xinyu Jia, Chuang Zhu, Minzhen Li, Wenqi Tang, and Wenli Zhou. Llvip: A visible-infrared paired dataset for low-light vision. InCVPR, pages 3496–3504, 2021. 2, 4

work page 2021

[24] [24]

Tencent text-video retrieval: hierarchi- cal cross-modal interactions with multi-level representations

Jie Jiang, Shaobo Min, Weijie Kong, Hongfa Wang, Zhifeng Li, and Wei Liu. Tencent text-video retrieval: hierarchi- cal cross-modal interactions with multi-level representations. IEEE Access, 2022. 2

work page 2022

[25] [25]

Segmenting objects in day and night: Edge-conditioned cnn for thermal image semantic segmentation.TNNLS, 32(7): 3069–3082, 2020

Chenglong Li, Wei Xia, Yan Yan, Bin Luo, and Jin Tang. Segmenting objects in day and night: Edge-conditioned cnn for thermal image semantic segmentation.TNNLS, 32(7): 3069–3082, 2020. 6, 7

work page 2020

[26] [26]

Lasher: A large-scale high- diversity benchmark for rgbt tracking.TIP, 31:392–404,

Chenglong Li, Wanlin Xue, Yaqing Jia, Zhichen Qu, Bin Luo, Jin Tang, and Dengdi Sun. Lasher: A large-scale high- diversity benchmark for rgbt tracking.TIP, 31:392–404,

work page

[27] [27]

Feature pyramid networks for object detection

Tsung-Yi Lin, Piotr Doll ´ar, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. InCVPR, pages 2117–2125,

work page

[28] [28]

Infmae: A foundation model in the infrared modality

Fangcen Liu, Chenqiang Gao, Yaming Zhang, Junjie Guo, Jinghao Wang, and Deyu Meng. Infmae: A foundation model in the infrared modality. InECCV, pages 420–437. Springer, 2025. 1, 2, 5, 6

work page 2025

[29] [29]

Target-aware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection

Jinyuan Liu, Xin Fan, Zhanbo Huang, Guanyao Wu, Risheng Liu, Wei Zhong, and Zhongxuan Luo. Target-aware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection. InCVPR, pages 5802–5811, 2022. 5

work page 2022

[30] [30]

X-clip: End-to-end multi-grained con- trastive learning for video-text retrieval

Yiwei Ma, Guohai Xu, Xiaoshuai Sun, Ming Yan, Ji Zhang, and Rongrong Ji. X-clip: End-to-end multi-grained con- trastive learning for video-text retrieval. InProceedings of the 30th ACM international conference on multimedia, pages 638–647, 2022. 3

work page 2022

[31] [31]

Multi- level cross-modal semantic alignment network for video–text retrieval.Mathematics, 10(18):3346, 2022

Fudong Nian, Ling Ding, Yuxia Hu, and Yanhong Gu. Multi- level cross-modal semantic alignment network for video–text retrieval.Mathematics, 10(18):3346, 2022. 2

work page 2022

[32] [32]

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023

[33] [33]

Learn- ing transferable visual models from natural language super- vision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- ing transferable visual models from natural language super- vision. InICML, pages 8748–8763. PMLR, 2021. 2, 8

work page 2021

[34] [34]

Faster r-cnn: Towards real-time object detection with region proposal networks.PAMI, 39(6):1137–1149, 2016

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks.PAMI, 39(6):1137–1149, 2016. 5

work page 2016

[35] [35]

The earth mover’s distance as a metric for image retrieval.In- ternational journal of computer vision, 40(2):99–121, 2000

Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas. The earth mover’s distance as a metric for image retrieval.In- ternational journal of computer vision, 40(2):99–121, 2000. 6

work page 2000

[36] [36]

Drone-based rgb-infrared cross-modality vehicle detection via uncertainty-aware learning.TCSVT, 32(10):6700–6713,

Yiming Sun, Bing Cao, Pengfei Zhu, and Qinghua Hu. Drone-based rgb-infrared cross-modality vehicle detection via uncertainty-aware learning.TCSVT, 32(10):6700–6713,

work page

[37] [37]

Piafusion: A progressive infrared and visible im- age fusion network based on illumination aware.Information Fusion, 2022

Linfeng Tang, Jiteng Yuan, Hao Zhang, Xingyu Jiang, and Jiayi Ma. Piafusion: A progressive infrared and visible im- age fusion network based on illumination aware.Information Fusion, 2022. 1, 5, 6

work page 2022

[38] [38]

Fine-grained action retrieval through multiple parts- of-speech embeddings

Michael Wray, Diane Larlus, Gabriela Csurka, and Dima Damen. Fine-grained action retrieval through multiple parts- of-speech embeddings. InProceedings of the IEEE/CVF in- ternational conference on computer vision, pages 450–459,

work page

[39] [39]

Unified perceptual parsing for scene understand- ing

Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene understand- ing. InECCV, pages 418–434, 2018. 5

work page 2018

[40] [40]

Hitea: Hierarchical temporal- aware video-language pre-training

Qinghao Ye, Guohai Xu, Ming Yan, Haiyang Xu, Qi Qian, Ji Zhang, and Fei Huang. Hitea: Hierarchical temporal- aware video-language pre-training. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 15405–15416, 2023. 2

work page 2023

[41] [41]

Sigmoid loss for language image pre-training

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InProceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023. 2

work page 2023

[42] [42]

Pad: Self-supervised pre-training with patchwise-scale adapter for infrared im- ages.arXiv preprint arXiv:2312.08192, 2023

Tao Zhang, Kun Ding, Jinyong Wen, Yu Xiong, Zeyu Zhang, Shiming Xiang, and Chunhong Pan. Pad: Self-supervised pre-training with patchwise-scale adapter for infrared im- ages.arXiv preprint arXiv:2312.08192, 2023. 1, 2, 6, 7

work page arXiv 2023

[43] [43]

Unip: Rethinking pre-trained attention patterns for infrared semantic segmentation.arXiv preprint arXiv:2502.02257, 2025

Tao Zhang, Jinyong Wen, Zhen Chen, Kun Ding, Shiming Xiang, and Chunhong Pan. Unip: Rethinking pre-trained attention patterns for infrared semantic segmentation.arXiv preprint arXiv:2502.02257, 2025. 1

work page arXiv 2025

[44] [44]

Scene parsing through ade20k dataset

Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. InCVPR, pages 633–641, 2017. 6

work page 2017

[45] [45]

iBOT: Image BERT Pre-Training with Online Tokenizer

Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. ibot: Image bert pre-training with online tokenizer.arXiv preprint arXiv:2111.07832,

work page internal anchor Pith review Pith/arXiv arXiv