
arxiv: 2604.08924 · v1 · submitted 2026-04-10 · 💻 cs.CV

Customized Fusion: A Closed-Loop Dynamic Network for Adaptive Multi-Task-Aware Infrared-Visible Image Fusion

Pith reviewed 2026-05-10 17:19 UTC · model grok-4.3

classification 💻 cs.CV
keywords infrared-visible image fusion, multi-task adaptation, closed-loop optimization, semantic compensation, adaptive fusion network, computer vision, dynamic network

The pith

A closed-loop dynamic network customizes infrared-visible fusion for multiple downstream tasks by feeding back task performance to compensate semantics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to overcome the limitation that standard fusion methods for infrared and visible images cannot adjust themselves when used for different tasks such as detection or segmentation. It proposes a network that creates an explicit loop: task results influence a compensation module which then modifies the fusion process on the fly. This module draws from a bank of basis vectors and injects task-specific adjustments into the network architecture. The adjustments are guided by a reward or penalty based on whether task accuracy improves or declines. As a result, the same fusion model can serve multiple tasks without being retrained from scratch for each one.

Core claim

The central claim is that a closed-loop optimization mechanism, built around a Requirement-driven Semantic Compensation module, can transmit semantic needs from downstream tasks back to the fusion network. The module employs a Basis Vector Bank together with an Architecture-Adaptive Semantic Injection block to alter network behavior according to task requirements, so that the fused image actively supports whichever task is active without any retraining of the fusion weights.
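The loop described above can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the blend-based `fuse`, the squared-error `task_score`, and the accept-or-revert random search are stand-ins, not the paper's fusion network, task heads, or training rule. The only property it mirrors is that the fusion step stays frozen while a small set of compensation coefficients is adjusted from task feedback.

```python
import random

def fuse(ir, vis, alpha):
    """Toy frozen fusion: a fixed pixel-wise blend (hypothetical stand-in)."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(ir, vis)]

def compensate(fused, coeffs, basis):
    """Add a correction built from a weighted sum of basis vectors."""
    out = list(fused)
    for c, vec in zip(coeffs, basis):
        out = [o + c * v for o, v in zip(out, vec)]
    return out

def task_score(image, target):
    """Stand-in for a downstream metric (higher is better)."""
    return -sum((x - t) ** 2 for x, t in zip(image, target))

def closed_loop(ir, vis, basis, target, steps=200, step_size=0.1, seed=0):
    """Adapt only the compensation coefficients; the fusion step never changes."""
    rng = random.Random(seed)
    coeffs = [0.0] * len(basis)
    fused = fuse(ir, vis, alpha=0.5)
    best = task_score(compensate(fused, coeffs, basis), target)
    history = [best]
    for _ in range(steps):
        trial = [c + rng.uniform(-step_size, step_size) for c in coeffs]
        score = task_score(compensate(fused, trial, basis), target)
        if score > best:
            # Task score improved: keep the new coefficients.
            coeffs, best = trial, score
        # Otherwise the trial is discarded and the old coefficients stay.
        history.append(best)
    return coeffs, history
```

In this sketch the recorded score can never decrease, because a trial is only kept when it improves the score. A learned compensation module trained with gradients would not have that guarantee, which is why stability is a fair question for the real method.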

What carries the argument

The Requirement-driven Semantic Compensation (RSC) module, which uses a Basis Vector Bank and Architecture-Adaptive Semantic Injection block to reshape the fusion network according to measured task performance.
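One way to picture a basis vector bank feeding an injection block is sketched below. This is a hedged illustration, not the paper's design: the softmax mixing of basis vectors and the channel-wise scale-and-shift modulation are assumptions chosen for simplicity, and the names `compensation_vector` and `inject` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def compensation_vector(bank, task_scores):
    """Mix the basis vectors with softmax weights (assumed mixing rule)."""
    weights = softmax(task_scores)
    dim = len(bank[0])
    return [sum(w * vec[i] for w, vec in zip(weights, bank)) for i in range(dim)]

def inject(features, comp):
    """Channel-wise modulation (assumed): the first half of comp scales
    each channel, the second half shifts it."""
    channels = len(features)
    scale, shift = comp[:channels], comp[channels:]
    return [
        [(1.0 + scale[k]) * x + shift[k] for x in features[k]]
        for k in range(channels)
    ]
```

For example, with two channels and a bank of two 4-dimensional basis vectors, `inject(features, compensation_vector(bank, [0.0, 0.0]))` applies the average of the two basis vectors. A compensation vector of all zeros leaves the features unchanged, so the unmodified fusion output is recoverable as a special case.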

If this is right

  • The fusion network maintains high visual quality on standard benchmarks while gaining the ability to serve multiple tasks.
  • Explicit feedback from task metrics drives semantic changes, removing the need to retrain the fusion model for each new task.
  • A reward-penalty rule based on task performance variations guides the compensation process.
  • The same trained model exhibits measurable adaptability across the M3FD, FMB, and VT5000 datasets.
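The reward-penalty rule mentioned in the list above is not specified in detail on this page. A minimal sketch, assuming the signal is simply the signed change in a task metric with an optional dead zone and a heavier weight on declines, might look like this. The parameters `tolerance` and `penalty_scale` are hypothetical.

```python
def reward_penalty(prev_metric, new_metric, tolerance=0.0, penalty_scale=1.0):
    """Return a positive reward when the task metric improves and a
    negative penalty when it declines (assumed rule, higher metric is better)."""
    delta = new_metric - prev_metric
    if delta > tolerance:
        return delta
    if delta < -tolerance:
        # Declines can be weighted more heavily than gains.
        return penalty_scale * delta
    # Changes inside the dead zone are treated as noise.
    return 0.0
```

Setting `tolerance` above zero keeps small metric fluctuations from being read as real improvements or regressions, which matters when the metric is evaluated on a small batch.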

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The closed-loop idea could be applied to other multi-modal fusion settings where the best output depends on which task is running at the moment.
  • By avoiding separate fusion models for each task, the approach may lower overall storage and compute costs in systems that switch between tasks.
  • If the compensation remains stable over long sequences of changing tasks, the method might support continuous online adaptation in deployed vision systems.

Load-bearing premise

Measured changes in downstream task performance can be translated into stable, useful adjustments to the fusion network without causing instability or overfitting to individual tasks.

What would settle it

Running the method on a new task or dataset where the adapted fusion produces lower task accuracy than a fixed, non-adaptive fusion baseline would show that the closed-loop compensation is not providing the claimed benefit.
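The check described above reduces to a per-task comparison. The sketch below assumes both systems are scored with the same higher-is-better metric per task; the function name and the `margin` parameter are hypothetical, and any numbers passed in would come from the reader's own evaluation rather than from the paper.

```python
def tasks_where_baseline_wins(adaptive, baseline, margin=0.0):
    """Return the sorted task names where the fixed baseline beats the
    adaptive fusion by more than `margin`."""
    return sorted(
        task
        for task in adaptive
        if task in baseline and adaptive[task] + margin < baseline[task]
    )
```

An empty result is consistent with the claim on the tasks tested. A non-empty result on a new task or dataset is the kind of evidence that would count against it.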

Figures

Figures reproduced from arXiv: 2604.08924 by Huafeng Li, Juan Cheng, Yafei Zhang, Yu Liu, Zengyi Yang, Zhiqin Zhu.

Figure 1: Comparison of processing paradigms between existing …
Figure 2: Overview of the adaptive multi-task-aware infrared-visible image fusion network. The network forms a semantic transmission …
Figure 3: Architecture of the A2SI block. The A2SI block com…
Figure 4: Qualitative comparison between the proposed method and the “task network retraining” methods.
Figure 5: Qualitative comparison between the proposed method …
Figure 6: Qualitative (a) and quantitative (b) comparison between …
Figure 9: Network architecture of VFN. The VFN (a) consists of a Feature Extraction Blocks (FEB) (b) and a Fusion Feature Reconstruction …
Figure 10: Qualitative comparison between the proposed method and existing state-of-the-art approaches. The first and second columns …
Figure 11: Qualitative comparison between the full model and the …
Figure 12: Training loss curves of the proposed method.
Original abstract

Infrared-visible image fusion aims to integrate complementary information for robust visual understanding, but existing fusion methods struggle with simultaneously adapting to multiple downstream tasks. To address this issue, we propose a Closed-Loop Dynamic Network (CLDyN) that can adaptively respond to the semantic requirements of diverse downstream tasks for task-customized image fusion. Specifically, CLDyN introduces a closed-loop optimization mechanism that establishes a semantic transmission chain to achieve explicit feedback from downstream tasks to the fusion network through a Requirement-driven Semantic Compensation (RSC) module. The RSC module leverages a Basis Vector Bank (BVB) and an Architecture-Adaptive Semantic Injection (A2SI) block to customize the network architecture according to task requirements, thereby enabling task-specific semantic compensation and allowing the fusion network to actively adapt to diverse tasks without retraining. To promote semantic compensation, a reward-penalty strategy is introduced to reward or penalize the RSC module based on task performance variations. Experiments on the M3FD, FMB, and VT5000 datasets demonstrate that CLDyN not only maintains high fusion quality but also exhibits strong multi-task adaptability. The code is available at https://github.com/YR0211/CLDyN.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a Closed-Loop Dynamic Network (CLDyN) for adaptive infrared-visible image fusion across multiple downstream tasks. It features a closed-loop optimization with a Requirement-driven Semantic Compensation (RSC) module that utilizes a Basis Vector Bank (BVB) and Architecture-Adaptive Semantic Injection (A2SI) block to customize the fusion network based on task semantics. A reward-penalty strategy guides the adaptation using variations in task performance, allowing the system to respond to diverse tasks without retraining. Validation is performed on the M3FD, FMB, and VT5000 datasets, asserting maintained fusion quality alongside multi-task adaptability.

Significance. Should the proposed closed-loop mechanism prove stable and effective in providing task-driven customization, this contribution would be significant for infrared-visible fusion research. It tackles the challenge of task-specific adaptation in fusion networks, which could streamline applications requiring robustness to varying semantic needs, such as in object detection or segmentation pipelines. The public code release aids in verifying and extending the work.

major comments (2)
  1. [RSC module and reward-penalty strategy] The reward-penalty strategy employs downstream task performance to directly influence the RSC module's adjustments to the fusion network. This creates a potential circular dependency, where the performance metric serves both as the driver for modification and the evaluator of the output. To support the central claim of reliable adaptation without retraining, the paper must demonstrate the stability of this process, perhaps through convergence proofs or extensive empirical validation beyond the reported datasets.
  2. [Experiments section] The experiments claim strong multi-task adaptability on M3FD, FMB, and VT5000, yet the provided description lacks specific quantitative metrics, ablation studies isolating the contributions of BVB and A2SI, and error analysis. This omission weakens the ability to assess whether the closed-loop truly enables the claimed customization or if results could be due to other factors.
minor comments (2)
  1. Consider adding a table summarizing quantitative fusion metrics (e.g., PSNR, SSIM) and task performance improvements across datasets for clarity.
  2. [Abstract] The abstract states 'strong multi-task adaptability' without supporting numbers; including one or two key results would strengthen the summary.
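For readers unfamiliar with the fusion metrics named in the minor comments, PSNR can be computed as below. The `psnr` function follows the standard definition. Averaging the score against the two source images in `fusion_psnr` is a common convention in the fusion literature and is assumed here; the page does not state how the paper computes it. Images are represented as flat lists of pixel values for simplicity.

```python
import math

def psnr(reference, test, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if not reference or len(reference) != len(test):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

def fusion_psnr(ir, vis, fused, max_value=255.0):
    """Average of the PSNR against each source image (assumed convention)."""
    return 0.5 * (psnr(ir, fused, max_value) + psnr(vis, fused, max_value))
```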

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications and outlining planned revisions to improve the paper's rigor and clarity.

Point-by-point responses
  1. Referee: [RSC module and reward-penalty strategy] The reward-penalty strategy employs downstream task performance to directly influence the RSC module's adjustments to the fusion network. This creates a potential circular dependency, where the performance metric serves both as the driver for modification and the evaluator of the output. To support the central claim of reliable adaptation without retraining, the paper must demonstrate the stability of this process, perhaps through convergence proofs or extensive empirical validation beyond the reported datasets.

    Authors: We acknowledge the valid concern about potential circular dependency in the closed-loop design. The reward-penalty mechanism uses performance variations as feedback to adjust the RSC module via the BVB and A2SI, but the downstream metrics (e.g., detection mAP or segmentation IoU) are computed independently on the fused output after each adaptation step, breaking direct circularity. While a formal convergence proof is not provided in the current manuscript due to the non-convex and dynamic nature of the architecture search, we will add extensive empirical validation in the revision, including convergence plots of task performance over adaptation iterations, stability analysis across random seeds, and results on additional task variations within the M3FD, FMB, and VT5000 datasets. These additions will support the claim of reliable adaptation without retraining. revision: partial

  2. Referee: [Experiments section] The experiments claim strong multi-task adaptability on M3FD, FMB, and VT5000, yet the provided description lacks specific quantitative metrics, ablation studies isolating the contributions of BVB and A2SI, and error analysis. This omission weakens the ability to assess whether the closed-loop truly enables the claimed customization or if results could be due to other factors.

    Authors: We appreciate this observation and agree that more granular details are needed. The original manuscript reports quantitative fusion metrics (e.g., PSNR, SSIM, VIF) and downstream task results (e.g., mAP on detection), but we will expand the experiments section to include: (1) specific numerical tables with all metrics and standard deviations, (2) dedicated ablation studies isolating BVB and A2SI contributions (with and without each component), and (3) error analysis including per-task performance breakdowns, failure case discussions, and statistical significance tests. These revisions will better demonstrate the closed-loop's role in customization. revision: yes
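The stability analysis promised in the first response could be summarized with two simple statistics: the spread of the final task metric across random seeds, and a flatness test on the tail of each adaptation curve. The sketch below is one possible form of such a check; the function names, the window size, and the tolerance are all hypothetical choices.

```python
import statistics

def seed_stability(final_scores):
    """Mean and sample standard deviation of the final metric across seeds."""
    mean = statistics.mean(final_scores)
    std = statistics.stdev(final_scores) if len(final_scores) > 1 else 0.0
    return mean, std

def has_converged(curve, window=5, tol=1e-3):
    """True when the last `window` values of the curve vary by at most `tol`."""
    if len(curve) < window:
        return False
    tail = curve[-window:]
    return max(tail) - min(tail) <= tol
```

A small standard deviation across seeds together with flat tails on every curve would support the stability claim empirically. Neither statistic substitutes for a convergence proof.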

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The paper's central mechanism (closed-loop optimization via the RSC module with BVB, A2SI, and a reward-penalty strategy based on downstream task performance variations) is presented as an external feedback process from task metrics to network adaptation, not as a self-referential definition or a fitted parameter renamed as a prediction. No equation or step in the abstract reduces the claimed semantic transmission chain to its own inputs by construction. The provided text does not rely on self-citations, uniqueness theorems, or ansatzes to justify the architecture. The multi-dataset experiments are cited as empirical support for stability and adaptability, so the argument rests on external benchmarks rather than on a tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The central claim depends on the newly introduced RSC module and its two components (BVB and A2SI), whose effectiveness is asserted without external validation or parameter-free derivation; the reward-penalty loop is an ad hoc training signal whose stability is assumed rather than proven.

axioms (1)
  • Domain assumption: downstream task performance provides a stable and informative signal for adjusting fusion parameters.
    Invoked in the description of the reward-penalty strategy that drives the RSC module.
invented entities (3)
  • Requirement-driven Semantic Compensation (RSC) module (no independent evidence)
    Purpose: to receive task feedback and customize fusion via semantic compensation.
    New component introduced to close the loop between fusion and downstream tasks.
  • Basis Vector Bank (BVB) (no independent evidence)
    Purpose: to provide basis vectors for architecture adaptation.
    New data structure introduced inside the RSC module.
  • Architecture-Adaptive Semantic Injection (A2SI) block (no independent evidence)
    Purpose: to inject task-specific semantics into the network.
    New architectural block for dynamic customization.

pith-pipeline@v0.9.0 · 5532 in / 1411 out tokens · 44949 ms · 2026-05-10T17:19:12.350503+00:00 · methodology



Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · 1 internal anchor

  1. [1]

    Task- driven image fusion with learnable fusion loss

    Haowen Bai, Jiangshe Zhang, Zixiang Zhao, Yichen Wu, Lilun Deng, Yukun Cui, Tao Feng, and Shuang Xu. Task- driven image fusion with learnable fusion loss. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7457–7468, 2025. 1, 2, 6

  2. [2]

    Deep unfolding multi-modal image fusion network via attri- bution analysis.IEEE Transactions on Circuits and Systems for Video Technology, 35(4):3498–3511, 2025

    Haowen Bai, Zixiang Zhao, Jiangshe Zhang, Baisong Jiang, Lilun Deng, Yukun Cui, Shuang Xu, and Chunxia Zhang. Deep unfolding multi-modal image fusion network via attri- bution analysis.IEEE Transactions on Circuits and Systems for Video Technology, 35(4):3498–3511, 2025. 1, 2

  3. [3]

    Closed-loop visuomotor control with gen- erative expectation for robotic manipulation

    Qingwen Bu, Jia Zeng, Li Chen, Yanchao Yang, Guyue Zhou, Junchi Yan, Ping Luo, Heming Cui, Yi Ma, and Hongyang Li. Closed-loop visuomotor control with gen- erative expectation for robotic manipulation. InAdvances in Neural Information Processing Systems (NeurIPS), pages 139002–139029, 2024. 2

  4. [4]

    Conditional controllable image fusion

    Bing Cao, Xingxin Xu, Pengfei Zhu, Qilong Wang, and Qinghua Hu. Conditional controllable image fusion. InAd- vances in Neural Information Processing Systems (NeurIPS), pages 120311–120335, 2024. 1

  5. [5]

    End- to-end object detection with transformers

    Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End- to-end object detection with transformers. InProceedings of the European Conference on Computer Vision (ECCV), pages 213–229, 2020. 7, 8

  6. [6]

    Varshney

    Hao Chen and Pramod K. Varshney. A human perception inspired quality metric for image fusion based on regional information.Information Fusion, 8(2):193–207, 2007. 6

  7. [7]

    Sdsfusion: A semantic-aware infrared and visible image fusion network for degraded scenes.IEEE Transactions on Image Processing, 34:3139–3153, 2025

    Jun Chen, Liling Yang, Wei Yu, Wenping Gong, Zhanchuan Cai, and Jiayi Ma. Sdsfusion: A semantic-aware infrared and visible image fusion network for degraded scenes.IEEE Transactions on Image Processing, 34:3139–3153, 2025. 2

  8. [8]

    Yin Chen and Rick S. Blum. A new automated quality as- sessment algorithm for image fusion.Image and Vision Com- puting, 27(10):1421–1432, 2009. 6

  9. [9]

    Dynamic convolution: Attention over convolution kernels

    Yinpeng Chen, Xiyang Dai, Mengchen Liu, Dongdong Chen, Lu Yuan, and Zicheng Liu. Dynamic convolution: Attention over convolution kernels. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11027–11036, 2020. 2

  10. [10]

    One model for all: Low-level task interaction is a key to task-agnostic image fu- sion

    Chunyang Cheng, Tianyang Xu, Zhenhua Feng, Xiaojun Wu, Zhangyong Tang, Hui Li, Zeyang Zhang, Sara Atito, Muhammad Awais, and Josef Kittler. One model for all: Low-level task interaction is a key to task-agnostic image fu- sion. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), pages 28102– 28112, 2025. 1

  11. [11]

    Clever, Greg Turk, C

    Zackory Erickson, Henry M. Clever, Greg Turk, C. Karen Liu, and Charles C. Kemp. Deep haptic model predictive control for robot-assisted dressing. In2018 IEEE Inter- national Conference on Robotics and Automation (ICRA), pages 4437–4444, 2018. 2

  12. [12]

    Sam-guided multi-level collaborative transformer for infrared and visible image fusion.Pattern Recognition, 162:111391, 2025

    Lin Guo, Xiaoqing Luo, Yue Liu, Zhancheng Zhang, and Xi- aojun Wu. Sam-guided multi-level collaborative transformer for infrared and visible image fusion.Pattern Recognition, 162:111391, 2025. 2

  13. [13]

    Dynamic neural networks: A sur- vey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(11):7436–7456, 2022

    Yizeng Han, Gao Huang, Shiji Song, Le Yang, Honghui Wang, and Yulin Wang. Dynamic neural networks: A sur- vey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(11):7436–7456, 2022. 2

  14. [14]

    Llvip: A visible-infrared paired dataset for low- light vision

    Xinyu Jia, Chuang Zhu, Minzhen Li, Wenqi Tang, and Wenli Zhou. Llvip: A visible-infrared paired dataset for low- light vision. InProceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), pages 3496–3504, 2021. 5

  15. [15]

    Kingma and Jimmy Ba

    Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. InProceedings of International Conference on Learning Representations (ICLR), 2015. 6

  16. [16]

    Berg, Wan-Yen Lo, Piotr Dollar, and Ross Girshick

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C. Berg, Wan-Yen Lo, Piotr Dollar, and Ross Girshick. Segment anything. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4015–4026, 2023. 2

  17. [17]

    Hui Li, Congcong Bian, Zeyang Zhang, Xiaoning Song, Xi Li, and XiaoJun Wu. Occo: Lvm-guided infrared and visible image fusion framework based on object-aware and contex- tual contrastive learning.International Journal of Computer Vision, 133(9):6611–6635, 2025. 6

  18. [18]

    Hui Li, Congcong Bian, Zeyang Zhang, Xiaoning Song, Xi Li, and Xiao-Jun Wu. Occo: Lvm-guided infrared and visi- ble image fusion framework based on object-aware and con- textual contrastive learning.International Journal of Com- puter Vision, 133(9):6611–6635, 2025. 1

  19. [19]

    Huafeng Li, Zengyi Yang, Yafei Zhang, Wei Jia, Zheng- tao Yu, and Yu Liu. Mulfs-cap: Multimodal fusion- supervised cross-modality alignment perception for unreg- istered infrared-visible image fusion.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(5):3673– 3690, 2025

  20. [20]

    From text to pix- els: A context-aware semantic synergy solution for infrared and visible image fusion.arXiv preprint arXiv:2401.00421,

    Xingyuan Li, Yang Zou, Jinyuan Liu, Zhiying Jiang, Long Ma, Xin Fan, and Risheng Liu. From text to pixels: A context-aware semantic synergy solution for infrared and visible image fusion.arXiv preprint arXiv:2401.00421, 2023

  21. [21]

    Contourlet residual for prompt learning enhanced infrared image super-resolution

    Xingyuan Li, Jinyuan Liu, Zhixin Chen, Yang Zou, Long Ma, Xin Fan, and Risheng Liu. Contourlet residual for prompt learning enhanced infrared image super-resolution. InEuropean Conference on Computer Vision, pages 270–

  22. [22]

    Difiisr: A diffu- sion model with gradient guidance for infrared image super- resolution

    Xingyuan Li, Zirui Wang, Yang Zou, Zhixin Chen, Jun Ma, Zhiying Jiang, Long Ma, and Jinyuan Liu. Difiisr: A diffu- sion model with gradient guidance for infrared image super- resolution. InProceedings of the Computer Vision and Pat- tern Recognition Conference, pages 7534–7544, 2025. 1

  23. [23]

    Fusion from decomposition: A self-supervised approach for image fusion and beyond.arXiv preprint arXiv: 2410.12274, 2024

    Pengwei Liang, Junjun Jiang, Qing Ma, Xianming Liu, and Jiayi Ma. Fusion from decomposition: A self-supervised approach for image fusion and beyond.arXiv preprint arXiv: 2410.12274, 2024. 2

  24. [24]

    Conflict-averse gradient descent for multi-task learn- ing

    Bo Liu, Xingchao Liu, Xiaojie Jin, Peter Stone, and Qiang Liu. Conflict-averse gradient descent for multi-task learn- ing. InAdvances in Neural Information Processing Systems (NeurIPS), pages 18878–18890, 2021. 4

  25. [25]

    Target-aware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection

    Jinyuan Liu, Xin Fan, Zhanbo Huang, Guanyao Wu, Risheng Liu, Wei Zhong, and Zhongxuan Luo. Target-aware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5802–5811,

  26. [26]

    Multi-interactive feature learning and a full-time multi-modality benchmark for image fusion and segmentation

    Jinyuan Liu, Zhu Liu, Guanyao Wu, Long Ma, Risheng Liu, Wei Zhong, Zhongxuan Luo, and Xin Fan. Multi-interactive feature learning and a full-time multi-modality benchmark for image fusion and segmentation. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 8081–8090, 2023. 2, 5, 6, 7, 1, 3

  27. [27]

    Coconet: Coupled con- trastive learning network with multi-level feature ensemble for multi-modality image fusion.International Journal of Computer Vision, 132(5):1748–1775, 2024

    Jinyuan Liu, Runjia Lin, Guanyao Wu, Risheng Liu, Zhongxuan Luo, and Xin Fan. Coconet: Coupled con- trastive learning network with multi-level feature ensemble for multi-modality image fusion.International Journal of Computer Vision, 132(5):1748–1775, 2024. 6, 1

  28. [28]

    Infrared and visible image fusion: From data compatibility to task adaption.IEEE Transactions on Pattern Analysis and Ma- chine Intelligence, 47(4):2349–2369, 2025

    Jinyuan Liu, Guanyao Wu, Zhu Liu, Di Wang, Zhiying Jiang, Long Ma, Wei Zhong, Xin Fan, and Risheng Liu. Infrared and visible image fusion: From data compatibility to task adaption.IEEE Transactions on Pattern Analysis and Ma- chine Intelligence, 47(4):2349–2369, 2025. 6

  29. [29]

    Infrared and visible image fusion: From data compatibility to task adaption.IEEE Transactions on Pattern Analysis and Ma- chine Intelligence, 47(4):2349–2369, 2025

    Jinyuan Liu, Guanyao Wu, Zhu Liu, Di Wang, Zhiying Jiang, Long Ma, Wei Zhong, Xin Fan, and Risheng Liu. Infrared and visible image fusion: From data compatibility to task adaption.IEEE Transactions on Pattern Analysis and Ma- chine Intelligence, 47(4):2349–2369, 2025. 5

  30. [30]

    A task-guided, implicitly-searched and meta-initialized deep model for image fusion.IEEE Transactions on Pat- tern Analysis and Machine Intelligence, 46(10):6594–6609,

    Risheng Liu, Zhu Liu, Jinyuan Liu, Xin Fan, and Zhongxuan Luo. A task-guided, implicitly-searched and meta-initialized deep model for image fusion.IEEE Transactions on Pat- tern Analysis and Machine Intelligence, 46(10):6594–6609,

  31. [31]

    Yu Liu, Zhengzheng Qi, Juan Cheng, and Xun Chen. Re- thinking the effectiveness of objective evaluation metrics in multi-focus image fusion: A statistic-based approach.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8):5806–5819, 2024. 6

  32. [32]

    Bi-level dynamic learning for jointly multi- modality image fusion and beyond

    Zhu Liu, Jinyuan Liu, Guanyao Wu, Long Ma, Xin Fan, and Risheng Liu. Bi-level dynamic learning for jointly multi- modality image fusion and beyond. InProceedings of the Thirty-Second International Joint Conference on Artificial Intelligence (IJCAI), pages 1240–1248, 2023. 2

  33. [33]

    Paif: Perception-aware infrared-visible image fusion for attack-tolerant semantic segmentation

    Zhu Liu, Jinyuan Liu, Benzhuang Zhang, Long Ma, Xin Fan, and Risheng Liu. Paif: Perception-aware infrared-visible image fusion for attack-tolerant semantic segmentation. In Proceedings of the 31st ACM International Conference on Multimedia (ACM MM), page 3706–3714, 2023. 2

  34. [34]

    Infrared and visible im- age fusion methods and applications: A survey.Information Fusion, 45:153–178, 2019

    Jiayi Ma, Yong Ma, and Chang Li. Infrared and visible im- age fusion methods and applications: A survey.Information Fusion, 45:153–178, 2019. 6

  35. [35]

    Jane Wang, and Xun Chen

    Yu Shi, Yu Liu, Juan Cheng, Z. Jane Wang, and Xun Chen. Vdmufusion: A versatile diffusion model-based unsuper- vised framework for image fusion.IEEE Transactions on Image Processing, 34:441–454, 2025. 1

  36. [36]

    Det- fusion: A detection-driven infrared and visible image fusion network

    Yiming Sun, Bing Cao, Pengfei Zhu, and Qinghua Hu. Det- fusion: A detection-driven infrared and visible image fusion network. InProceedings of the 30th ACM International Con- ference on Multimedia (ACM MM), page 4003–4011, 2022. 1, 2

  37. [37]

    Task-gated multi- expert collaboration network for degraded multi-modal im- age fusion

    Yiming Sun, Xin Li, Pengfei Zhu, Qinghua Hu, Dongwei Ren, Huiying Xu, and Xinzhong Zhu. Task-gated multi- expert collaboration network for degraded multi-modal im- age fusion. InProceedings of 42nd International Conference on Machine Learning (ICML), 2025. 1

  38. [38]

    Image fusion in the loop of high-level vision tasks: A semantic-aware real- time infrared and visible image fusion network.Information Fusion, 82:28–42, 2022

    Linfeng Tang, Jiteng Yuan, and Jiayi Ma. Image fusion in the loop of high-level vision tasks: A semantic-aware real- time infrared and visible image fusion network.Information Fusion, 82:28–42, 2022. 1, 2

  39. [39]

    Piafusion: A progressive infrared and visible im- age fusion network based on illumination aware.Information Fusion, 83-84:79–92, 2022

    Linfeng Tang, Jiteng Yuan, Hao Zhang, Xingyu Jiang, and Jiayi Ma. Piafusion: A progressive infrared and visible im- age fusion network based on illumination aware.Information Fusion, 83-84:79–92, 2022. 5

  40. [40]

    Linfeng Tang, Hao Zhang, Han Xu, and Jiayi Ma. Rethink- ing the necessity of image fusion in high-level vision tasks: A practical infrared and visible image fusion network based on progressive semantic injection and scene fidelity.Infor- mation Fusion, 99:101870, 2023. 2

  41. [41]

    C2rf: Bridging multi-modal image regis- tration and fusion via commonality mining and contrastive learning.International Journal of Computer Vision, 133(8): 5262–5280, 2025

    Linfeng Tang, Qinglong Yan, Xinyu Xiang, Leyuan Fang, and Jiayi Ma. C2rf: Bridging multi-modal image regis- tration and fusion via commonality mining and contrastive learning.International Journal of Computer Vision, 133(8): 5262–5280, 2025. 1

  42. [42]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth´ee Lacroix, Baptiste Rozi`ere, Naman Goyal, Eric Hambro, Faisal Azhar, Aure- lien Rodriguez, Armand Joulin, Edouard Grave, and Guil- laume Lample. Llama: Open and efficient foundation lan- guage models.arXiv preprint arXiv: 2302.13971, 2023. 7

  43. [43]

    Rgbt salient object detection: A large- scale dataset and benchmark.IEEE Transactions on Multi- media, 25:4163–4176, 2023

    Zhengzheng Tu, Yan Ma, Zhun Li, Chenglong Li, Jieming Xu, and Yongtao Liu. Rgbt salient object detection: A large- scale dataset and benchmark.IEEE Transactions on Multi- media, 25:4163–4176, 2023. 5, 3

  44. [44]

    An inter- actively reinforced paradigm for joint infrared-visible image fusion and saliency object detection.Information Fusion, 98: 101828, 2023

    Di Wang, Jinyuan Liu, Risheng Liu, and Xin Fan. An inter- actively reinforced paradigm for joint infrared-visible image fusion and saliency object detection.Information Fusion, 98: 101828, 2023. 2, 6, 7, 1

  45. [45]

    Di Wang, Xianghao Jiao, Jinyuan Liu, and Xin Fan. Robust one-stop multi-modality image registration-fusion- segmentation framework against misalignments and adver- sarial attacks.IEEE Transactions on Multimedia, 27:4531– 4543, 2025. 1 10

  46. [46]

    Every sam drop counts: Embracing semantic priors for multi-modality image fusion and beyond

    Guanyao Wu, Haoyu Liu, Hongming Fu, Yichuan Peng, Jinyuan Liu, Xin Fan, and Risheng Liu. Every sam drop counts: Embracing semantic priors for multi-modality image fusion and beyond. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR), pages 17882–17891, 2025. 1, 2, 6

  47. [47]

    Segformer: Simple and efficient design for semantic segmentation with transform- ers

    Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. Segformer: Simple and efficient design for semantic segmentation with transform- ers. InAdvances in Neural Information Processing Systems (NeurIPS), 2021. 5

  48. [48]

    FusionDN: A unified densely connected network for image fusion. Proceedings of the AAAI Conference on Artificial Intelligence, 34(07):12484–12491, 2020

    Han Xu, Jiayi Ma, Zhuliang Le, Junjun Jiang, and Xiaojie Guo. FusionDN: A unified densely connected network for image fusion. Proceedings of the AAAI Conference on Artificial Intelligence, 34(07):12484–12491, 2020.

  49. [49]

    U2Fusion: A unified unsupervised image fusion network. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(1):502–518, 2022

    Han Xu, Jiayi Ma, Junjun Jiang, Xiaojie Guo, and Haibin Ling. U2Fusion: A unified unsupervised image fusion network. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(1):502–518, 2022.

  50. [50]

    Objective image fusion performance measure. Electronics Letters, 36(4):308–309, 2000

    Costas S Xydeas, Vladimir Petrovic, et al. Objective image fusion performance measure. Electronics Letters, 36(4):308–309, 2000.

  51. [51]

    Instruction-driven fusion of infrared–visible images: Tailoring for diverse downstream tasks. Information Fusion, 121:103148, 2025

    Zengyi Yang, Yafei Zhang, Huafeng Li, and Yu Liu. Instruction-driven fusion of infrared–visible images: Tailoring for diverse downstream tasks. Information Fusion, 121:103148, 2025.

  52. [52]

    Text-IF: Leveraging semantic text guidance for degradation-aware and interactive image fusion

    Xunpeng Yi, Han Xu, Hao Zhang, Linfeng Tang, and Jiayi Ma. Text-IF: Leveraging semantic text guidance for degradation-aware and interactive image fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 27016–27025, 2024.

  53. [53]

    MRFS: Mutually reinforcing image fusion and segmentation

    Hao Zhang, Xuhui Zuo, Jie Jiang, Chunchao Guo, and Jiayi Ma. MRFS: Mutually reinforcing image fusion and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26964–26973, 2024.

  54. [54]

    OmniFuse: Composite degradation-robust image fusion with language-driven semantics. IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(9):7577–7595,

    Hao Zhang, Lei Cao, Xuhui Zuo, Zhenfeng Shao, and Jiayi Ma. OmniFuse: Composite degradation-robust image fusion with language-driven semantics. IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(9):7577–7595,

  55. [55]

    Visible and infrared image fusion using deep learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(8):10535–10554,

    Xingchen Zhang and Yiannis Demiris. Visible and infrared image fusion using deep learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(8):10535–10554,

  56. [56]

    VIFB: A visible and infrared image fusion benchmark

    Xingchen Zhang, Ping Ye, and Gang Xiao. VIFB: A visible and infrared image fusion benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 468–478, 2020.

  57. [57]

    MetaFusion: Infrared and visible image fusion via meta-feature embedding from object detection

    Wenda Zhao, Shigeng Xie, Fan Zhao, You He, and Huchuan Lu. MetaFusion: Infrared and visible image fusion via meta-feature embedding from object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13955–13965, 2023.

  58. [58]

    FreeFusion: Infrared and visible image fusion via cross reconstruction learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(9):8040–8056,

    Wenda Zhao, Hengshuai Cui, Haipeng Wang, You He, and Huchuan Lu. FreeFusion: Infrared and visible image fusion via cross reconstruction learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(9):8040–8056,

  59. [59]

    Complementary trilateral decoder for fast and accurate salient object detection

    Zhirui Zhao, Changqun Xia, Chenxi Xie, and Jia Li. Complementary trilateral decoder for fast and accurate salient object detection. In Proceedings of the 29th ACM International Conference on Multimedia (ACM MM), pages 4967–4975,

  60. [60]

    Equivariant multi-modality image fusion

    Zixiang Zhao, Haowen Bai, Jiangshe Zhang, Yulun Zhang, Kai Zhang, Shuang Xu, Dongdong Chen, Radu Timofte, and Luc Van Gool. Equivariant multi-modality image fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 25912–25921,

  61. [61]

    task network retraining

    Customized Fusion: A Closed-Loop Dynamic Network for Adaptive Multi-Task-Aware Infrared-Visible Image Fusion

    Supplementary Material

    A. More Details of VFN

    In the first stage, we train the VFN to focus on generating visually guided fused images. In the second stage, the VFN is frozen, while the RSC module assists in adapting the VFN to various downs...