Adding Thermal Awareness to Visual Systems in Real-Time via Distilled Diffusion Models

Junli Gong; Weifeng Su; Wenjun Dong; Yiuming Cheung; Yuchen Guo

arxiv: 2605.06010 · v1 · submitted 2026-05-07 · 💻 cs.CV · cs.AI

Adding Thermal Awareness to Visual Systems in Real-Time via Distilled Diffusion Models

Yuchen Guo , Junli Gong , Wenjun Dong , Yiuming Cheung , Weifeng Su This is my paper

Pith reviewed 2026-05-08 14:10 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords image fusionthermal imagingdiffusion modelsreal-time perceptionRGB-infrared fusionmodel distillationautonomous drivingplug-and-play module

0 comments

The pith

FusionProxy distills a diffusion ensemble into a lightweight module that fuses RGB and thermal images at real-time speeds while remaining fully independent of the downstream vision task.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that variance statistics extracted once from a teacher diffusion model in both pixel and feature space can supervise a student network to deliver high-quality thermal-RGB fusion without any joint training with the final perception model. This matters for systems that must operate continuously because purely RGB cameras lose reliability at night or in fog while current fusion techniques are too slow for edge deployment. If the claim holds, thermal information becomes a plug-in upgrade that any existing visual pipeline can adopt without retraining or added latency.

Core claim

A student fusion network trained solely on per-pixel variance maps from a frozen diffusion teacher ensemble in raw image space and in backbone feature space can achieve diffusion-level fusion quality, enabling direct insertion into any RGB-based perception system to improve performance on static recognition tasks and robustness in dynamic closed-loop driving scenarios while sustaining real-time inference across GPU and commodity hardware platforms.

What carries the argument

FusionProxy, a distilled student model that uses per-pixel variance in raw image space to weight supervision and per-pixel variance inside frozen backbones to route feature alignment.

If this is right

Any pre-trained RGB vision model can receive thermal awareness by simply inserting the trained FusionProxy module with no further fine-tuning.
Static recognition accuracy improves under low-light and adverse weather conditions where RGB alone degrades.
Closed-loop autonomous driving systems gain measurable robustness from the added thermal channel without sacrificing latency.
The same fusion module runs at interactive rates on both high-end GPUs and lower-power commodity processors.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same variance-based distillation recipe could be tested on other sensor pairs such as RGB plus event cameras or depth.
Because the module is task-agnostic, it might allow rapid prototyping of multi-modal perception stacks where thermal data is added only to selected subsystems.
If the variance statistics prove stable across scenes, the teacher ensemble might be replaced by a smaller set of diffusion samples without loss of student quality.

Load-bearing premise

Per-pixel variance statistics taken from a teacher diffusion ensemble are sufficient by themselves to train a student that matches diffusion fusion quality without any joint optimization against the downstream perception task.

What would settle it

A standard RGB object detector equipped with FusionProxy shows no accuracy gain over the RGB-only baseline on a nighttime or fog object-detection benchmark, or its measured frame rate falls below real-time thresholds on the target hardware.

Figures

Figures reproduced from arXiv: 2605.06010 by Junli Gong, Weifeng Su, Wenjun Dong, Yiuming Cheung, Yuchen Guo.

**Figure 1.** Figure 1: Top: End-to-end closed-loop autonomous driving pipeline under degraded visibility. Bottom-left: Thermal radiation exposes pedestrians and vehicles invisible to RGB. Bottom-right: The fused output is directly consumable by frozen downstream models without retraining. ∗Correspondence to: yuchenguo2027@u.northwestern.edu, wfsu@bnbu.edu.cn. Preprint. arXiv:2605.06010v1 [cs.CV] 7 May 2026 view at source ↗

**Figure 2.** Figure 2: The FusionProxy framework. Dual diffusion teachers generate a sample ensemble whose view at source ↗

**Figure 3.** Figure 3: Qualitative comparison of image fusion results with state-of-the-art methods. view at source ↗

**Figure 4.** Figure 4: Closed-loop autonomous driving in CARLA: visual examples of degraded-visibility view at source ↗

read the original abstract

Purely RGB-based vision models often fail to provide reliable cues in challenging scenarios such as nighttime and fog, leading to degraded performance and safety risks. Infrared imaging captures heat-emitting sources and provides critical complementary information, but existing high-fidelity fusion methods suffer from prohibitive latency, rendering them impractical for real-time edge deployment. To address this, we propose FusionProxy, a real-time image fusion module designed as a fully independent, plug-and-play component with diffusion level quality. FusionProxy exploits two complementary statistics of a teacher sample ensemble: per-pixel variance in raw image space, used to weight pixel-level supervision, and per-pixel variance inside frozen foundation backbones, used to route feature-level alignment spatially. Once trained, FusionProxy can be directly integrated into any visual perception system without joint optimization. Extensive experiments demonstrate that our method achieves superior performance on static recognition tasks and significantly enhances robustness in dynamic tasks, including closed-loop autonomous driving. Crucially, FusionProxy achieves real-time inference speeds on diverse platforms, from high-end GPUs to commodity hardware, providing a flexible and generalizable solution for all-day perception.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

FusionProxy distills a fast RGB-IR fusion module from diffusion teacher variance stats in a plug-and-play way, but the no-joint-optimization assumption is the part that needs checking to back the robustness gains.

read the letter

The main point is that this paper trains a lightweight FusionProxy network to do RGB-IR fusion by matching per-pixel variance maps from a diffusion teacher ensemble, both at the raw image level for supervision weights and inside frozen backbones for feature routing. Once trained, it drops into any existing vision pipeline without further tuning and runs in real time even on modest hardware. That addresses a clear practical gap: high-quality fusion is usually too slow for edge use in driving or robotics, while simple RGB models drop off in night or fog. The dual-variance approach is a concrete engineering choice that builds on distillation ideas but applies them specifically to this fusion setting, and the claim of diffusion-level quality at much lower cost is worth testing if the numbers hold. The experiments reportedly show better static recognition and improved closed-loop driving robustness, which would matter if real. The soft spot sits exactly on the central assumption. Training only on those variance statistics without any downstream task signal means the student has to infer the right fusion behavior purely from how the teacher ensemble varies. If those maps do not fully encode the complementary thermal cues that actually help recognition or control, the reported gains on dynamic tasks could be weaker or come from other factors in the setup. The abstract is confident on the results, but the strength depends on whether the ablations isolate the variance supervision and whether the driving evaluations use realistic closed-loop metrics rather than just offline accuracy. This is aimed at people working on all-weather perception stacks who need something they can insert without retraining everything. A reader already doing fusion or distillation work would get the most out of the implementation details and speed measurements. It deserves peer review because the problem is real, the method is testable, and the claims are specific enough for referees to check the evidence directly.

Referee Report

3 major / 2 minor

Summary. The paper proposes FusionProxy, a plug-and-play real-time image fusion module that distills from a teacher diffusion ensemble using per-pixel variance statistics in raw image space (for weighted pixel supervision) and frozen backbone feature space (for spatial routing). Once trained in a task-agnostic manner, it can be inserted into any RGB visual perception pipeline without joint optimization, claiming superior performance on static recognition tasks, enhanced robustness in dynamic tasks including closed-loop autonomous driving, and real-time inference across hardware platforms.

Significance. If the central claims hold, the work would address a practical gap in all-day perception by enabling high-quality thermal fusion at low latency without retraining downstream models, potentially benefiting autonomous systems and edge deployment where existing fusion methods are too slow.

major comments (3)

[Method (training procedure description)] The core training procedure (per-pixel variance from teacher ensemble in image and feature spaces, no joint optimization) is load-bearing for the plug-and-play claim, yet the manuscript provides no direct ablation or comparison showing that these variance maps alone recover the full complementary thermal cues or fusion behavior that drives the reported gains on closed-loop driving robustness. This matches the weakest assumption identified in the stress test.
[Abstract and Experiments section] Abstract and experimental claims assert 'superior performance' and 'significantly enhances robustness' on static and dynamic tasks, but without visible quantitative tables, baselines, or metrics (e.g., mAP deltas, collision rates in driving sim), it is impossible to assess whether the data support the claims or whether the student matches teacher-level fusion quality.
[Experiments (runtime evaluation)] The real-time inference claim across 'diverse platforms from high-end GPUs to commodity hardware' is central to practicality, but lacks specific latency numbers, hardware specs, or comparison to non-distilled fusion baselines in the provided description.

minor comments (2)

[Method] Notation for the two variance statistics (raw image vs. feature space) should be formalized with equations to clarify the supervision and routing losses.
[Method] The abstract mentions 'frozen foundation backbones' but does not specify which models or layers are used; this detail is needed for reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to strengthen the presentation of our method, results, and claims.

read point-by-point responses

Referee: [Method (training procedure description)] The core training procedure (per-pixel variance from teacher ensemble in image and feature spaces, no joint optimization) is load-bearing for the plug-and-play claim, yet the manuscript provides no direct ablation or comparison showing that these variance maps alone recover the full complementary thermal cues or fusion behavior that drives the reported gains on closed-loop driving robustness. This matches the weakest assumption identified in the stress test.

Authors: We agree that a targeted ablation would more clearly isolate the role of the variance-based supervision. In the revised version we will add an ablation study (new Table in Section 4) that compares the full FusionProxy against (i) uniform pixel supervision without variance weighting and (ii) feature-space alignment without spatial routing. These variants will be evaluated on both static recognition mAP and closed-loop CARLA collision rates, directly showing how the per-pixel variance statistics recover complementary thermal cues while preserving the no-joint-optimization property that enables plug-and-play use. The training procedure itself is fully specified in Section 3.2. revision: yes
Referee: [Abstract and Experiments section] Abstract and experimental claims assert 'superior performance' and 'significantly enhances robustness' on static and dynamic tasks, but without visible quantitative tables, baselines, or metrics (e.g., mAP deltas, collision rates in driving sim), it is impossible to assess whether the data support the claims or whether the student matches teacher-level fusion quality.

Authors: Quantitative tables already appear in Section 4, reporting mAP deltas on recognition benchmarks and collision-rate reductions in CARLA closed-loop driving against RGB-only and competing fusion baselines. The student model is shown to reach within a small margin of the teacher ensemble while running in real time. To improve visibility we will (a) insert concrete metric references into the abstract (e.g., “+3.2 mAP, 18 % fewer collisions”) and (b) add a compact student-vs-teacher quality table. These changes will make the supporting data immediately accessible. revision: partial
Referee: [Experiments (runtime evaluation)] The real-time inference claim across 'diverse platforms from high-end GPUs to commodity hardware' is central to practicality, but lacks specific latency numbers, hardware specs, or comparison to non-distilled fusion baselines in the provided description.

Authors: Section 4.3 already contains runtime measurements on multiple platforms. In the revision we will replace the current summary with a dedicated table listing exact latency (ms) and FPS on NVIDIA RTX 3090, Jetson Nano, and CPU-only hardware, together with direct comparisons against non-distilled diffusion fusion and other real-time baselines. This will quantify the speedup obtained by distillation and substantiate the cross-platform real-time claim. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the proposed distillation procedure

full rationale

The paper describes FusionProxy as a student model trained solely on per-pixel variance statistics extracted from an external teacher diffusion ensemble (in raw image space and frozen backbone feature space). No equations, derivations, or self-referential definitions are present that would reduce the claimed fusion quality or downstream performance gains to the training inputs by construction. The training is explicitly task-agnostic and plug-and-play with no joint optimization, and performance is validated empirically on separate static and dynamic benchmarks. This setup is self-contained against external teacher data and does not rely on load-bearing self-citations or fitted parameters renamed as predictions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no information on free parameters, axioms, or invented entities; full text would be required to populate this ledger accurately.

pith-pipeline@v0.9.0 · 5499 in / 1135 out tokens · 55712 ms · 2026-05-08T14:10:38.350351+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

46 extracted references · 12 canonical work pages · 3 internal anchors

[1]

Deep learning models for digital image processing: a review.Artificial Intelligence Review, 57(1):11, 2024

R Archana and PS Eliahim Jeevaraj. Deep learning models for digital image processing: a review.Artificial Intelligence Review, 57(1):11, 2024

2024
[2]

Heat-assisted detection and ranging.Nature, 619(7971):743–748, 2023

Fanglin Bao, Xueji Wang, Shree Hari Sureshbabu, Gautam Sreekumar, Liping Yang, Vaneet Aggarwal, Vishnu N Boddeti, and Zubin Jacob. Heat-assisted detection and ranging.Nature, 619(7971):743–748, 2023

2023
[3]

CARLA: An open urban driving simulator

Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. CARLA: An open urban driving simulator. InProceedings of the 1st Annual Conference on Robot Learning, pages 1–16, 2017

2017
[4]

Fuse4seg: image-level fusion based multi-modality medical image segmentation.arXiv preprint arXiv:2409.10328, 2024

Yuchen Guo and Weifeng Su. Fuse4seg: Image-level fusion based multi-modality medical image segmentation.arXiv preprint arXiv:2409.10328, 2024

work page arXiv 2024
[5]

Dae-fuse: An adaptive dis- criminative autoencoder for multi-modality image fusion.arXiv preprint arXiv:2409.10080, 2024

Yuchen Guo, Ruoxiang Xu, Rongcheng Li, and Weifeng Su. Dae-fuse: An adaptive dis- criminative autoencoder for multi-modality image fusion.arXiv preprint arXiv:2409.10080, 2024

work page arXiv 2024
[6]

Ultralytics yolov8, 2023

Glenn Jocher, Ayush Chaurasia, and Jing Qiu. Ultralytics yolov8, 2023. URL https:// github.com/ultralytics/ultralytics

2023
[7]

Musiq: Multi-scale image quality transformer

Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang. Musiq: Multi-scale image quality transformer. InProceedings of the IEEE/CVF international conference on computer vision, pages 5148–5157, 2021

2021
[8]

Segment anything

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023

2023
[9]

Rethinking vision transformers for mobilenet size and speed

Yanyu Li, Ju Hu, Yang Wen, Georgios Evangelidis, Kamyar Salahi, Yanzhi Wang, Sergey Tulyakov, and Jian Ren. Rethinking vision transformers for mobilenet size and speed. In Proceedings of the IEEE/CVF international conference on computer vision, pages 16889– 16900, 2023

2023
[10]

Harnessing diffusion-yielded score priors for image restoration.ACM Transactions on Graphics (TOG), 44(6):1–21, 2025

Xinqi Lin, Fanghua Yu, Jinfan Hu, Zhiyuan You, Wu Shi, Jimmy S Ren, Jinjin Gu, and Chao Dong. Harnessing diffusion-yielded score priors for image restoration.ACM Transactions on Graphics (TOG), 44(6):1–21, 2025

2025
[11]

Target-aware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection

Jinyuan Liu, Xin Fan, Zhanbo Huang, Guanyao Wu, Risheng Liu, Wei Zhong, and Zhongxuan Luo. Target-aware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5802–5811, 2022

2022
[12]

Multi-interactive feature learning and a full-time multi-modality benchmark for image fusion and segmentation

Jinyuan Liu, Zhu Liu, Guanyao Wu, Long Ma, Risheng Liu, Wei Zhong, Zhongxuan Luo, and Xin Fan. Multi-interactive feature learning and a full-time multi-modality benchmark for image fusion and segmentation. InInternational Conference on Computer Vision, 2023

2023
[13]

Infrared and visible image fusion methods and applications: A survey.Information fusion, 45:153–178, 2019

Jiayi Ma, Yong Ma, and Chang Li. Infrared and visible image fusion methods and applications: A survey.Information fusion, 45:153–178, 2019

2019
[14]

Ddcgan: A dual- discriminator conditional generative adversarial network for multi-resolution image fusion

Jiayi Ma, Han Xu, Junjun Jiang, Xiaoguang Mei, and Xiao-Ping Zhang. Ddcgan: A dual- discriminator conditional generative adversarial network for multi-resolution image fusion. IEEE Transactions on Image Processing, 29:4980–4995, 2020

2020
[15]

On distillation of guided diffusion models

Chenlin Meng, Robin Rombach, Ruiqi Gao, Diederik Kingma, Stefano Ermon, Jonathan Ho, and Tim Salimans. On distillation of guided diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14297–14306, 2023

2023
[16]

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023. 10

work page internal anchor Pith review arXiv 2023
[17]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

2023
[18]

Interference measurements between single-beam, mechanical scanning, time-of-flight lidars.Optical Engineering, 59(5): 053106–053106, 2020

Gerald B Popko, Thomas K Gaylord, and Christopher R Valenta. Interference measurements between single-beam, mechanical scanning, time-of-flight lidars.Optical Engineering, 59(5): 053106–053106, 2020

2020
[19]

Mobilenetv4: Universal models for the mobile ecosystem

Danfeng Qin, Chas Leichner, Manolis Delakis, Marco Fornoni, Shixin Luo, Fan Yang, Weijun Wang, Colby Banbury, Chengxi Ye, Berkin Akin, et al. Mobilenetv4: Universal models for the mobile ecosystem. InEuropean conference on computer vision, pages 78–96. Springer, 2024

2024
[20]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

2021
[21]

Very Deep Convolutional Networks for Large-Scale Image Recognition

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition.arXiv preprint arXiv:1409.1556, 2014

work page internal anchor Pith review arXiv 2014
[22]

Denoising Diffusion Implicit Models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models.arXiv preprint arXiv:2010.02502, 2020

work page internal anchor Pith review arXiv 2010
[23]

Consistency models

Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. 2023

2023
[24]

Piafusion: A progressive infrared and visible image fusion network based on illumination aware.Information Fusion, 83-84:79–92, 2022

Linfeng Tang, Jiteng Yuan, Hao Zhang, Xingyu Jiang, and Jiayi Ma. Piafusion: A progressive infrared and visible image fusion network based on illumination aware.Information Fusion, 83-84:79–92, 2022. ISSN 1566-2535

2022
[25]

Mask-difuser: A masked diffusion model for unified unsupervised image fusion.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

Linfeng Tang, Chunyu Li, and Jiayi Ma. Mask-difuser: A masked diffusion model for unified unsupervised image fusion.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

2025
[26]

Controlfusion: A controllable image fusion framework with language-vision degradation prompts.arXiv preprint arXiv:2503.23356, 2025

Linfeng Tang, Yeda Wang, Zhanchuan Cai, Junjun Jiang, and Jiayi Ma. Controlfusion: A controllable image fusion framework with language-vision degradation prompts.arXiv preprint arXiv:2503.23356, 2025

work page arXiv 2025
[27]

Self-driving cars and lidar

Simon Verghese. Self-driving cars and lidar. InCLEO: Applications and Technology, pages AM3A–1. Optica Publishing Group, 2017

2017
[28]

Exploring clip for assessing the look and feel of images

Jianyi Wang, Kelvin CK Chan, and Chen Change Loy. Exploring clip for assessing the look and feel of images. InProceedings of the AAAI conference on artificial intelligence, volume 37, pages 2555–2563, 2023

2023
[29]

Image quality assessment: from error visibility to structural similarity

Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: From error visibility to structural similarity.IEEE Transactions on Image Processing, 13(4): 600–612, 2004. doi: 10.1109/TIP.2003.819861

work page doi:10.1109/tip.2003.819861 2004
[30]

Efficient rectified flow for image fusion.arXiv preprint arXiv:2509.16549, 2025

Zirui Wang, Jiayi Zhang, Tianwei Guan, Yuhan Zhou, Xingyuan Li, Minjing Dong, and Jinyuan Liu. Efficient rectified flow for image fusion.arXiv preprint arXiv:2509.16549, 2025

work page arXiv 2025
[31]

Convnext v2: Co-designing and scaling convnets with masked autoencoders

Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, and Saining Xie. Convnext v2: Co-designing and scaling convnets with masked autoencoders. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16133–16142, 2023

2023
[32]

Segformer: Simple and efficient design for semantic segmentation with transformers.Advances in neural information processing systems, 34:12077–12090, 2021

Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. Segformer: Simple and efficient design for semantic segmentation with transformers.Advances in neural information processing systems, 34:12077–12090, 2021

2021
[33]

U2fusion: A unified unsuper- vised image fusion network.IEEE transactions on pattern analysis and machine intelligence, 44(1):502–518, 2020

Han Xu, Jiayi Ma, Junjun Jiang, Xiaojie Guo, and Haibin Ling. U2fusion: A unified unsuper- vised image fusion network.IEEE transactions on pattern analysis and machine intelligence, 44(1):502–518, 2020

2020
[34]

Diff-if: Multi-modality image fusion via diffusion model with fusion knowledge prior.Information Fusion, 110:102450, 2024

Xunpeng Yi, Linfeng Tang, Hao Zhang, Han Xu, and Jiayi Ma. Diff-if: Multi-modality image fusion via diffusion model with fusion knowledge prior.Information Fusion, 110:102450, 2024. 11

2024
[35]

Text-if: Leveraging semantic text guidance for degradation-aware and interactive image fusion

Xunpeng Yi, Han Xu, Hao Zhang, Linfeng Tang, and Jiayi Ma. Text-if: Leveraging semantic text guidance for degradation-aware and interactive image fusion. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27026–27035, 2024

2024
[36]

Teaching large language models to regress accurate image quality scores using score distribution

Zhiyuan You, Xin Cai, Jinjin Gu, Tianfan Xue, and Chao Dong. Teaching large language models to regress accurate image quality scores using score distribution. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 14483–14494, 2025

2025
[37]

Scaling up to excellence: Practicing model scaling for photo-realistic image restoration in the wild

Fanghua Yu, Jinjin Gu, Zheyuan Li, Jinfan Hu, Xiangtao Kong, Xintao Wang, Jingwen He, Yu Qiao, and Chao Dong. Scaling up to excellence: Practicing model scaling for photo-realistic image restoration in the wild. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 25669–25680, 2024

2024
[38]

Image fusion meets deep learning: A survey and perspective.Information Fusion, 76:323–336, 2021

Hao Zhang, Han Xu, Xin Tian, Junjun Jiang, and Jiayi Ma. Image fusion meets deep learning: A survey and perspective.Information Fusion, 76:323–336, 2021

2021
[39]

Adding conditional control to text-to-image diffusion models

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. InProceedings of the IEEE/CVF international conference on computer vision, pages 3836–3847, 2023

2023
[40]

Mim- icmotion: High-quality human motion video generation with confidence-aware pose guidance.arXiv preprint arXiv:2406.19680, 2024

Yuang Zhang, Jiaxi Gu, Li-Wen Wang, Han Wang, Junqi Cheng, Yuefeng Zhu, and Fangyuan Zou. Mimicmotion: High-quality human motion video generation with confidence-aware pose guidance.arXiv preprint arXiv:2406.19680, 2024

work page arXiv 2024
[41]

Didfuse: Deep im- age decomposition for infrared and visible image fusion.arXiv preprint arXiv:2003.09210,

Zixiang Zhao, Shuang Xu, Chunxia Zhang, Junmin Liu, Pengfei Li, and Jiangshe Zhang. Didfuse: Deep image decomposition for infrared and visible image fusion.arXiv preprint arXiv:2003.09210, 2020

work page arXiv 2003
[42]

Cddfuse: Correlation-driven dual-branch feature decomposition for multi- modality image fusion

Zixiang Zhao, Haowen Bai, Jiangshe Zhang, Yulun Zhang, Shuang Xu, Zudi Lin, Radu Timofte, and Luc Van Gool. Cddfuse: Correlation-driven dual-branch feature decomposition for multi- modality image fusion. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5906–5916, June 2023

2023
[43]

Ddfm: denoising diffusion model for multi-modality image fusion

Zixiang Zhao, Haowen Bai, Yuanzhi Zhu, Jiangshe Zhang, Shuang Xu, Yulun Zhang, Kai Zhang, Deyu Meng, Radu Timofte, and Luc Van Gool. Ddfm: denoising diffusion model for multi-modality image fusion. InProceedings of the IEEE/CVF international conference on computer vision, pages 8082–8093, 2023

2023
[44]

Equivariant multi-modality image fusion

Zixiang Zhao, Haowen Bai, Jiangshe Zhang, Yulun Zhang, Kai Zhang, Shuang Xu, Dongdong Chen, Radu Timofte, and Luc Van Gool. Equivariant multi-modality image fusion. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024

2024
[45]

Image fusion via vision-language model.arXiv preprint arXiv:2402.02235,

Zixiang Zhao, Lilun Deng, Haowen Bai, Yukun Cui, Zhipeng Zhang, Yulun Zhang, Haotong Qin, Dongdong Chen, Jiangshe Zhang, Peng Wang, et al. Image fusion via vision-language model.arXiv preprint arXiv:2402.02235, 2024

work page arXiv 2024
[46]

A unified solution to video fusion: From multi-frame learning to benchmarking.arXiv preprint arXiv:2505.19858, 2025

Zixiang Zhao, Haowen Bai, Bingxin Ke, Yukun Cui, Lilun Deng, Yulun Zhang, Kai Zhang, and Konrad Schindler. A unified solution to video fusion: From multi-frame learning to benchmarking.arXiv preprint arXiv:2505.19858, 2025. A Foundation Backbone Feature Extraction Details This section details the foundation backbone configuration referenced in Sec. 4.1. B...

work page arXiv 2025

[1] [1]

Deep learning models for digital image processing: a review.Artificial Intelligence Review, 57(1):11, 2024

R Archana and PS Eliahim Jeevaraj. Deep learning models for digital image processing: a review.Artificial Intelligence Review, 57(1):11, 2024

2024

[2] [2]

Heat-assisted detection and ranging.Nature, 619(7971):743–748, 2023

Fanglin Bao, Xueji Wang, Shree Hari Sureshbabu, Gautam Sreekumar, Liping Yang, Vaneet Aggarwal, Vishnu N Boddeti, and Zubin Jacob. Heat-assisted detection and ranging.Nature, 619(7971):743–748, 2023

2023

[3] [3]

CARLA: An open urban driving simulator

Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. CARLA: An open urban driving simulator. InProceedings of the 1st Annual Conference on Robot Learning, pages 1–16, 2017

2017

[4] [4]

Fuse4seg: image-level fusion based multi-modality medical image segmentation.arXiv preprint arXiv:2409.10328, 2024

Yuchen Guo and Weifeng Su. Fuse4seg: Image-level fusion based multi-modality medical image segmentation.arXiv preprint arXiv:2409.10328, 2024

work page arXiv 2024

[5] [5]

Dae-fuse: An adaptive dis- criminative autoencoder for multi-modality image fusion.arXiv preprint arXiv:2409.10080, 2024

Yuchen Guo, Ruoxiang Xu, Rongcheng Li, and Weifeng Su. Dae-fuse: An adaptive dis- criminative autoencoder for multi-modality image fusion.arXiv preprint arXiv:2409.10080, 2024

work page arXiv 2024

[6] [6]

Ultralytics yolov8, 2023

Glenn Jocher, Ayush Chaurasia, and Jing Qiu. Ultralytics yolov8, 2023. URL https:// github.com/ultralytics/ultralytics

2023

[7] [7]

Musiq: Multi-scale image quality transformer

Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang. Musiq: Multi-scale image quality transformer. InProceedings of the IEEE/CVF international conference on computer vision, pages 5148–5157, 2021

2021

[8] [8]

Segment anything

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023

2023

[9] [9]

Rethinking vision transformers for mobilenet size and speed

Yanyu Li, Ju Hu, Yang Wen, Georgios Evangelidis, Kamyar Salahi, Yanzhi Wang, Sergey Tulyakov, and Jian Ren. Rethinking vision transformers for mobilenet size and speed. In Proceedings of the IEEE/CVF international conference on computer vision, pages 16889– 16900, 2023

2023

[10] [10]

Harnessing diffusion-yielded score priors for image restoration.ACM Transactions on Graphics (TOG), 44(6):1–21, 2025

Xinqi Lin, Fanghua Yu, Jinfan Hu, Zhiyuan You, Wu Shi, Jimmy S Ren, Jinjin Gu, and Chao Dong. Harnessing diffusion-yielded score priors for image restoration.ACM Transactions on Graphics (TOG), 44(6):1–21, 2025

2025

[11] [11]

Target-aware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection

Jinyuan Liu, Xin Fan, Zhanbo Huang, Guanyao Wu, Risheng Liu, Wei Zhong, and Zhongxuan Luo. Target-aware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5802–5811, 2022

2022

[12] [12]

Multi-interactive feature learning and a full-time multi-modality benchmark for image fusion and segmentation

Jinyuan Liu, Zhu Liu, Guanyao Wu, Long Ma, Risheng Liu, Wei Zhong, Zhongxuan Luo, and Xin Fan. Multi-interactive feature learning and a full-time multi-modality benchmark for image fusion and segmentation. InInternational Conference on Computer Vision, 2023

2023

[13] [13]

Infrared and visible image fusion methods and applications: A survey.Information fusion, 45:153–178, 2019

Jiayi Ma, Yong Ma, and Chang Li. Infrared and visible image fusion methods and applications: A survey.Information fusion, 45:153–178, 2019

2019

[14] [14]

Ddcgan: A dual- discriminator conditional generative adversarial network for multi-resolution image fusion

Jiayi Ma, Han Xu, Junjun Jiang, Xiaoguang Mei, and Xiao-Ping Zhang. Ddcgan: A dual- discriminator conditional generative adversarial network for multi-resolution image fusion. IEEE Transactions on Image Processing, 29:4980–4995, 2020

2020

[15] [15]

On distillation of guided diffusion models

Chenlin Meng, Robin Rombach, Ruiqi Gao, Diederik Kingma, Stefano Ermon, Jonathan Ho, and Tim Salimans. On distillation of guided diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14297–14306, 2023

2023

[16] [16]

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023. 10

work page internal anchor Pith review arXiv 2023

[17] [17]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

2023

[18] [18]

Interference measurements between single-beam, mechanical scanning, time-of-flight lidars.Optical Engineering, 59(5): 053106–053106, 2020

Gerald B Popko, Thomas K Gaylord, and Christopher R Valenta. Interference measurements between single-beam, mechanical scanning, time-of-flight lidars.Optical Engineering, 59(5): 053106–053106, 2020

2020

[19] [19]

Mobilenetv4: Universal models for the mobile ecosystem

Danfeng Qin, Chas Leichner, Manolis Delakis, Marco Fornoni, Shixin Luo, Fan Yang, Weijun Wang, Colby Banbury, Chengxi Ye, Berkin Akin, et al. Mobilenetv4: Universal models for the mobile ecosystem. InEuropean conference on computer vision, pages 78–96. Springer, 2024

2024

[20] [20]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

2021

[21] [21]

Very Deep Convolutional Networks for Large-Scale Image Recognition

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition.arXiv preprint arXiv:1409.1556, 2014

work page internal anchor Pith review arXiv 2014

[22] [22]

Denoising Diffusion Implicit Models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models.arXiv preprint arXiv:2010.02502, 2020

work page internal anchor Pith review arXiv 2010

[23] [23]

Consistency models

Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. 2023

2023

[24] [24]

Piafusion: A progressive infrared and visible image fusion network based on illumination aware.Information Fusion, 83-84:79–92, 2022

Linfeng Tang, Jiteng Yuan, Hao Zhang, Xingyu Jiang, and Jiayi Ma. Piafusion: A progressive infrared and visible image fusion network based on illumination aware.Information Fusion, 83-84:79–92, 2022. ISSN 1566-2535

2022

[25] [25]

Mask-difuser: A masked diffusion model for unified unsupervised image fusion.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

Linfeng Tang, Chunyu Li, and Jiayi Ma. Mask-difuser: A masked diffusion model for unified unsupervised image fusion.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

2025

[26] [26]

Controlfusion: A controllable image fusion framework with language-vision degradation prompts.arXiv preprint arXiv:2503.23356, 2025

Linfeng Tang, Yeda Wang, Zhanchuan Cai, Junjun Jiang, and Jiayi Ma. Controlfusion: A controllable image fusion framework with language-vision degradation prompts.arXiv preprint arXiv:2503.23356, 2025

work page arXiv 2025

[27] [27]

Self-driving cars and lidar

Simon Verghese. Self-driving cars and lidar. InCLEO: Applications and Technology, pages AM3A–1. Optica Publishing Group, 2017

2017

[28] [28]

Exploring clip for assessing the look and feel of images

Jianyi Wang, Kelvin CK Chan, and Chen Change Loy. Exploring clip for assessing the look and feel of images. InProceedings of the AAAI conference on artificial intelligence, volume 37, pages 2555–2563, 2023

2023

[29] [29]

Image quality assessment: from error visibility to structural similarity

Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: From error visibility to structural similarity.IEEE Transactions on Image Processing, 13(4): 600–612, 2004. doi: 10.1109/TIP.2003.819861

work page doi:10.1109/tip.2003.819861 2004

[30] [30]

Efficient rectified flow for image fusion.arXiv preprint arXiv:2509.16549, 2025

Zirui Wang, Jiayi Zhang, Tianwei Guan, Yuhan Zhou, Xingyuan Li, Minjing Dong, and Jinyuan Liu. Efficient rectified flow for image fusion.arXiv preprint arXiv:2509.16549, 2025

work page arXiv 2025

[31] [31]

Convnext v2: Co-designing and scaling convnets with masked autoencoders

Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, and Saining Xie. Convnext v2: Co-designing and scaling convnets with masked autoencoders. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16133–16142, 2023

2023

[32] [32]

Segformer: Simple and efficient design for semantic segmentation with transformers.Advances in neural information processing systems, 34:12077–12090, 2021

Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. Segformer: Simple and efficient design for semantic segmentation with transformers.Advances in neural information processing systems, 34:12077–12090, 2021

2021

[33] [33]

U2fusion: A unified unsuper- vised image fusion network.IEEE transactions on pattern analysis and machine intelligence, 44(1):502–518, 2020

Han Xu, Jiayi Ma, Junjun Jiang, Xiaojie Guo, and Haibin Ling. U2fusion: A unified unsuper- vised image fusion network.IEEE transactions on pattern analysis and machine intelligence, 44(1):502–518, 2020

2020

[34] [34]

Diff-if: Multi-modality image fusion via diffusion model with fusion knowledge prior.Information Fusion, 110:102450, 2024

Xunpeng Yi, Linfeng Tang, Hao Zhang, Han Xu, and Jiayi Ma. Diff-if: Multi-modality image fusion via diffusion model with fusion knowledge prior.Information Fusion, 110:102450, 2024. 11

2024

[35] [35]

Text-if: Leveraging semantic text guidance for degradation-aware and interactive image fusion

Xunpeng Yi, Han Xu, Hao Zhang, Linfeng Tang, and Jiayi Ma. Text-if: Leveraging semantic text guidance for degradation-aware and interactive image fusion. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27026–27035, 2024

2024

[36] [36]

Teaching large language models to regress accurate image quality scores using score distribution

Zhiyuan You, Xin Cai, Jinjin Gu, Tianfan Xue, and Chao Dong. Teaching large language models to regress accurate image quality scores using score distribution. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 14483–14494, 2025

2025

[37] [37]

Scaling up to excellence: Practicing model scaling for photo-realistic image restoration in the wild

Fanghua Yu, Jinjin Gu, Zheyuan Li, Jinfan Hu, Xiangtao Kong, Xintao Wang, Jingwen He, Yu Qiao, and Chao Dong. Scaling up to excellence: Practicing model scaling for photo-realistic image restoration in the wild. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 25669–25680, 2024

2024

[38] [38]

Image fusion meets deep learning: A survey and perspective.Information Fusion, 76:323–336, 2021

Hao Zhang, Han Xu, Xin Tian, Junjun Jiang, and Jiayi Ma. Image fusion meets deep learning: A survey and perspective.Information Fusion, 76:323–336, 2021

2021

[39] [39]

Adding conditional control to text-to-image diffusion models

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. InProceedings of the IEEE/CVF international conference on computer vision, pages 3836–3847, 2023

2023

[40] [40]

Mim- icmotion: High-quality human motion video generation with confidence-aware pose guidance.arXiv preprint arXiv:2406.19680, 2024

Yuang Zhang, Jiaxi Gu, Li-Wen Wang, Han Wang, Junqi Cheng, Yuefeng Zhu, and Fangyuan Zou. Mimicmotion: High-quality human motion video generation with confidence-aware pose guidance.arXiv preprint arXiv:2406.19680, 2024

work page arXiv 2024

[41] [41]

Didfuse: Deep im- age decomposition for infrared and visible image fusion.arXiv preprint arXiv:2003.09210,

Zixiang Zhao, Shuang Xu, Chunxia Zhang, Junmin Liu, Pengfei Li, and Jiangshe Zhang. Didfuse: Deep image decomposition for infrared and visible image fusion.arXiv preprint arXiv:2003.09210, 2020

work page arXiv 2003

[42] [42]

Cddfuse: Correlation-driven dual-branch feature decomposition for multi- modality image fusion

Zixiang Zhao, Haowen Bai, Jiangshe Zhang, Yulun Zhang, Shuang Xu, Zudi Lin, Radu Timofte, and Luc Van Gool. Cddfuse: Correlation-driven dual-branch feature decomposition for multi- modality image fusion. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5906–5916, June 2023

2023

[43] [43]

Ddfm: denoising diffusion model for multi-modality image fusion

Zixiang Zhao, Haowen Bai, Yuanzhi Zhu, Jiangshe Zhang, Shuang Xu, Yulun Zhang, Kai Zhang, Deyu Meng, Radu Timofte, and Luc Van Gool. Ddfm: denoising diffusion model for multi-modality image fusion. InProceedings of the IEEE/CVF international conference on computer vision, pages 8082–8093, 2023

2023

[44] [44]

Equivariant multi-modality image fusion

Zixiang Zhao, Haowen Bai, Jiangshe Zhang, Yulun Zhang, Kai Zhang, Shuang Xu, Dongdong Chen, Radu Timofte, and Luc Van Gool. Equivariant multi-modality image fusion. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024

2024

[45] [45]

Image fusion via vision-language model.arXiv preprint arXiv:2402.02235,

Zixiang Zhao, Lilun Deng, Haowen Bai, Yukun Cui, Zhipeng Zhang, Yulun Zhang, Haotong Qin, Dongdong Chen, Jiangshe Zhang, Peng Wang, et al. Image fusion via vision-language model.arXiv preprint arXiv:2402.02235, 2024

work page arXiv 2024

[46] [46]

A unified solution to video fusion: From multi-frame learning to benchmarking.arXiv preprint arXiv:2505.19858, 2025

Zixiang Zhao, Haowen Bai, Bingxin Ke, Yukun Cui, Lilun Deng, Yulun Zhang, Kai Zhang, and Konrad Schindler. A unified solution to video fusion: From multi-frame learning to benchmarking.arXiv preprint arXiv:2505.19858, 2025. A Foundation Backbone Feature Extraction Details This section details the foundation backbone configuration referenced in Sec. 4.1. B...

work page arXiv 2025