pith. machine review for the scientific record.

arxiv: 2604.02896 · v1 · submitted 2026-04-03 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

EvaNet: Towards More Efficient and Consistent Infrared and Visible Image Fusion Assessment

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 20:26 UTC · model grok-4.3

classification 💻 cs.CV
keywords: image fusion evaluation · infrared visible fusion · lightweight neural network · contrastive learning · consistency assessment · information preservation

The pith

A lightweight network evaluates infrared-visible image fusion up to 1000 times faster while aligning better with human perception.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Traditional metrics borrowed from other vision tasks are slow and often fail to capture true fusion quality. EvaNet uses a divide-and-conquer strategy that first decomposes the fused image into separate infrared and visible components before scoring information preservation in each. A lightweight network performs the scoring after training with contrastive learning and perceptual guidance from a large language model. The work also adds a consistency framework that measures how well any metric matches human visual perception or downstream task results. Experiments across standard benchmarks show the approach delivers much higher speed and consistency.
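The description above maps onto a simple two-stage pattern: decompose the fused image, then score all surrogate metrics in one forward pass. Below is a minimal PyTorch sketch of that pattern; the module names (`InformationProbe`, `SurrogateScorer`), layer sizes, and the three-branch layout are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the EvaNet-style evaluation pattern described above.
# All names, sizes, and the three-branch layout are illustrative assumptions.
import torch
import torch.nn as nn

class InformationProbe(nn.Module):
    """Tiny decoder that extracts one modality's contribution from a fused image."""
    def __init__(self, ch=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 1, 3, padding=1),
        )
    def forward(self, fused):
        return self.net(fused)

class SurrogateScorer(nn.Module):
    """Three branches (infrared, visible, environment) pooled into one vector
    of surrogate metric predictions."""
    def __init__(self, n_metrics=4, ch=16):
        super().__init__()
        def branch():
            return nn.Sequential(
                nn.Conv2d(1, ch, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            )
        self.ir_branch, self.vis_branch, self.env_branch = branch(), branch(), branch()
        self.head = nn.Linear(3 * ch, n_metrics)
    def forward(self, ir_part, vis_part, fused):
        feats = torch.cat([self.ir_branch(ir_part),
                           self.vis_branch(vis_part),
                           self.env_branch(fused)], dim=1)
        return self.head(feats)  # one forward pass -> all metric estimates

# Usage: decompose first, then score information preservation per modality.
fused = torch.rand(1, 1, 256, 256)          # grayscale fused image
ir_probe, vis_probe = InformationProbe(), InformationProbe()
scorer = SurrogateScorer(n_metrics=4)
scores = scorer(ir_probe(fused), vis_probe(fused), fused)
print(scores.shape)  # torch.Size([1, 4])
```

Keeping both the probes and the scorer small is what makes this single forward pass cheap relative to transform-based metrics computed one at a time.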

Core claim

EvaNet approximates widely used fusion metrics by decomposing the fused result into infrared and visible components, then using a lightweight network to measure information preservation in each component separately; the network is trained via contrastive learning informed by LLM-provided perceptual assessments, and the method includes a new consistency evaluation framework that uses independent no-reference scores and downstream task performance as objective references.

What carries the argument

Lightweight evaluation network that decomposes fused images into infrared and visible components before assessing modality-specific information preservation, trained with contrastive learning and LLM perceptual input.
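As a rough illustration of how contrastive learning and LLM perceptual guidance could enter such training, the sketch below combines a regression term toward traditional metric values with a ranking term between a better and a worse fusion of the same scene. The specific losses, weights, and the role given to the LLM score are assumptions; the paper's actual objective is not specified in the text available here.

```python
# Illustrative training objective for a metric-surrogate network (not the
# paper's specification): regression to traditional metric values plus a
# contrastive/ranking term guided by LLM perceptual scores.
import torch
import torch.nn.functional as F

def surrogate_loss(pred_good, pred_bad, target_good, target_bad,
                   llm_score_good, llm_score_bad, margin=0.1, alpha=0.5):
    # Regression: match the traditional metric values the network approximates.
    reg = F.mse_loss(pred_good, target_good) + F.mse_loss(pred_bad, target_bad)
    # Ranking: the fusion the LLM judges perceptually better should receive a
    # higher mean surrogate score than the worse one.
    sign = torch.sign(llm_score_good - llm_score_bad)
    rank = F.margin_ranking_loss(pred_good.mean(dim=1), pred_bad.mean(dim=1),
                                 sign, margin=margin)
    return reg + alpha * rank

# Usage with dummy tensors (batch of 8 image pairs, 4 surrogate metrics each).
pred_good = torch.rand(8, 4, requires_grad=True)
pred_bad = torch.rand(8, 4, requires_grad=True)
target_good, target_bad = torch.rand(8, 4), torch.rand(8, 4)
llm_good, llm_bad = torch.rand(8), torch.rand(8)
loss = surrogate_loss(pred_good, pred_bad, target_good, target_bad, llm_good, llm_bad)
loss.backward()
```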

If this is right

  • Evaluation becomes fast enough to apply to large-scale fusion datasets and repeated experiments.
  • The consistency framework supplies objective references for checking how any metric aligns with perception and tasks.
  • Separate scoring of infrared and visible preservation allows targeted diagnosis of modality-specific weaknesses in a fusion method.
  • Overall iteration cycles in fusion research shorten because reliable quality checks no longer require heavy computation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The decomposition-plus-lightweight-model pattern could transfer to quality assessment in other multi-modal fusion settings such as medical or remote-sensing images.
  • Real-time fusion pipelines in surveillance or autonomous systems could incorporate instant EvaNet-style checks for ongoing quality monitoring.
  • Replacing complex image transformations with learned approximations opens a route to parameter-free or low-cost evaluation pipelines in related vision tasks.

Load-bearing premise

The decomposition step separates information without introducing new biases, and the trained lightweight network can faithfully approximate complex traditional metrics.

What would settle it

On a new set of fused images, compute both EvaNet scores and traditional metric scores; if the two sets of scores systematically disagree on images where human observers clearly prefer one result over another, the approximation claim fails.
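A minimal sketch of that settling test, assuming per-image score arrays and a small set of human-labelled preference pairs (all values here are hypothetical):

```python
# Compare EvaNet-style scores with a traditional metric on held-out fused
# images, then check agreement on pairs where humans expressed a clear
# preference. Score arrays and preference labels below are hypothetical.
import numpy as np
from scipy.stats import spearmanr

evanet = np.array([0.62, 0.71, 0.55, 0.80, 0.47])       # surrogate scores
traditional = np.array([0.60, 0.69, 0.58, 0.78, 0.50])  # e.g. VIF on same images
rho, p = spearmanr(evanet, traditional)
print(f"rank correlation rho={rho:.2f} (p={p:.3f})")

# Pairs (i, j) where human observers clearly preferred image i over image j.
human_prefers = [(3, 4), (1, 2), (0, 4)]
disagreements = [(i, j) for i, j in human_prefers
                 if (evanet[i] - evanet[j]) * (traditional[i] - traditional[j]) < 0]
print("systematic disagreement on human-labelled pairs:", disagreements or "none")
```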

Figures

Figures reproduced from arXiv: 2604.02896 by Chunyang Cheng, Hui Li, Josef Kittler, Tao Zhou, Tianyang Xu, Xiao-Jun Wu, Zhangyong Tang.

Figure 1: An illustration of the consistency and the efficiency issues (sub-figure (a) and (b)) of existing image fusion metrics. Image fusion evaluation relies heavily on traditional signal processing techniques (e.g., discrete cosine transform (DCT) and wavelet transform) or metrics adapted from other vision tasks. Without appropriate adjustment, these metrics lack consistency, i.e., better visualisation cannot a… view at source ↗
Figure 2: The speed (milliseconds per image) imbalance between inference and evaluation in image fusion. Traditional metrics rely on separate, computationally intensive procedures to perform different assessments, significantly slowing down the evaluation phase. In contrast, the proposed EvaNet generates multiple evaluation results simultaneously within a single forward pass, offering acceleration by a factor of up … view at source ↗
Figure 3: Overview of the proposed EvaNet framework. Our method replaces traditional image fusion assessment processes by a lightweight learning-based network to significantly improve evaluation efficiency. In addition, a divide-and-conquer strategy is used to disentangle and independently measure the information preserved from each source modality. The environment branch, as part of the three-branch design, introdu… view at source ↗
Figure 4: The network architecture of the proposed EvaNet (see Sec. III-B). The model consists of two main components. The left part shows the modality-specific decomposition, implemented using two lightweight information probes [11]. The right part illustrates the surrogate metric prediction process, consisting of three branches: two modality branches correspond to the infrared and visible inputs, while the environ… view at source ↗
Figure 6: An overview of the training strategy based on contrastive … view at source ↗
Figure 7: Typical and zero-input decomposition results produced by the Information Probe. (a) and (b) show successful disentanglement of modality-specific content from two representative fused images with different degrees of thermal information preservation. (c) presents the zero-decomposition case, where an all-zero input yields two clean blank outputs without artefacts, verifying the robustness of the probe and i… view at source ↗
Figure 8: The fusion results of two representative approaches A (SDNet) … view at source ↗
Figure 10: Metric-wise attribution visualisation of “Conv3” for two fusion results on the LLVIP dataset. Brighter regions indicate higher contribution to the predicted metric. Different metrics attend to distinct image regions in accordance with their underlying definitions. … view at source ↗
Figure 12: The qualitative results produced by the EvaNet when … view at source ↗
Figure 13: The qualitative results produced by the EvaNet when … view at source ↗
Figure 14: The quantitative results of the ablation study of the number of … view at source ↗
Figure 15: Detection results visualisation of two representative image … view at source ↗
Figure 16: Segmentation results visualisation of two representative image … view at source ↗
Figure 17: The quantitative results reporting metric consistency using CLIP-IQA as the reference model, … view at source ↗
Figure 18: The results of comparison of the proposed EvaNet and the … view at source ↗
Figure 19: A comparison of the EN values of the source images and the two state-of-the-art (Text-IF [51] and GIFNet [8]) image fusion results. view at source ↗
Figure 20: Qualitative results obtained for the daytime scene and … view at source ↗
original abstract

Evaluation is essential in image fusion research, yet most existing metrics are directly borrowed from other vision tasks without proper adaptation. These traditional metrics, often based on complex image transformations, not only fail to capture the true quality of the fusion results but also are computationally demanding. To address these issues, we propose a unified evaluation framework specifically tailored for image fusion. At its core is a lightweight network designed efficiently to approximate widely used metrics, following a divide-and-conquer strategy. Unlike conventional approaches that directly assess similarity between fused and source images, we first decompose the fusion result into infrared and visible components. The evaluation model is then used to measure the degree of information preservation in these separated components, effectively disentangling the fusion evaluation process. During training, we incorporate a contrastive learning strategy and inform our evaluation model by perceptual scene assessment provided by a large language model. Last, we propose the first consistency evaluation framework, which measures the alignment between image fusion metrics and human visual perception, using both independent no-reference scores and downstream tasks performance as objective references. Extensive experiments show that our learning-based evaluation paradigm delivers both superior efficiency (up to 1,000 times faster) and greater consistency across a range of standard image fusion benchmarks. Our code will be publicly available at https://github.com/AWCXV/EvaNet.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper proposes EvaNet, a lightweight network for efficient evaluation of infrared-visible image fusion. It uses a divide-and-conquer pipeline that first decomposes the fused image into separate infrared and visible components, then applies the network (trained via contrastive learning and LLM-provided perceptual assessments) to measure information preservation in each. The work also introduces a consistency evaluation framework that aligns metrics with human perception using independent no-reference scores and downstream task performance. The central claims are up to 1000x speedup over traditional metrics and improved consistency on standard fusion benchmarks, with code to be released publicly.

Significance. If the decomposition fidelity and approximation accuracy hold, the framework could substantially accelerate iteration in image fusion research by replacing slow, unadapted traditional metrics with a fast, tailored alternative. The consistency framework addresses a recognized gap by grounding evaluation in human-aligned and task-based references, potentially improving reliability of fusion algorithm comparisons. Public code availability supports reproducibility.

major comments (3)
  1. [Decomposition approach] Decomposition module: The divide-and-conquer claim that decomposition 'disentangles' evaluation without new biases is load-bearing for both the approximation fidelity and the 1000x efficiency assertion. The manuscript provides no quantitative validation of decomposition fidelity (e.g., cross-modal leakage metrics, statistical preservation tests, or ablation on separation artifacts), leaving open the possibility that reported gains reflect decomposition artifacts rather than true metric approximation.
  2. [Experiments] Experimental results: The abstract states that 'extensive experiments' demonstrate superior efficiency and consistency, yet the provided text supplies no specific numbers, baselines, error analysis, or ablation tables for the contrastive+LLM training components. This absence makes it impossible to assess whether the lightweight network faithfully approximates complex traditional metrics or whether consistency improvements are statistically significant.
  3. [Consistency evaluation framework] Consistency framework: The new framework relies on independent no-reference scores and downstream tasks as objective references. The manuscript should explicitly demonstrate (via correlation analysis or controlled tests) that these references remain independent of the approximated metrics; otherwise the consistency claim risks moderate circularity.
minor comments (3)
  1. [Abstract] Abstract: The 'up to 1,000 times faster' claim should specify the exact traditional metrics used as baseline, the hardware platform, and whether the speedup includes decomposition overhead (see the timing-harness sketch after this list).
  2. [Methods] Notation and terminology: Ensure consistent definition of 'information preservation' when applied to the decomposed components versus the original fused image; minor inconsistencies appear in the high-level description.
  3. [Related work] References: Add citations to recent LLM-based perceptual evaluation works in vision to better situate the training strategy.
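A minimal timing-harness sketch of the kind of benchmark the first minor comment asks for, with cheap stand-ins (`traditional_metric`, `probe`, `scorer`) in place of the real implementations; it only illustrates the protocol (same images, same device, decomposition overhead included), not actual speedup numbers.

```python
# Timing-harness sketch: per-image milliseconds for a traditional metric vs a
# single surrogate forward pass. The metric and model here are cheap stand-ins.
import time
import torch

def time_fn(fn, images, warmup=3, reps=20):
    for img in images[:warmup]:
        fn(img)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(reps):
        for img in images:
            fn(img)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / (reps * len(images)) * 1e3  # ms/image

images = [torch.rand(1, 1, 256, 256) for _ in range(8)]
traditional_metric = lambda img: torch.fft.fft2(img).abs().mean()    # stand-in cost
probe = torch.nn.Conv2d(1, 1, 3, padding=1)                          # stand-in decomposition
scorer = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(256 * 256, 4))
surrogate = lambda img: scorer(probe(img))                           # overhead included

ms_trad = time_fn(traditional_metric, images)
ms_surr = time_fn(surrogate, images)
print(f"traditional: {ms_trad:.3f} ms/image, surrogate: {ms_surr:.3f} ms/image, "
      f"speedup ≈ {ms_trad / ms_surr:.1f}x")
```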

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the referee's thorough review and valuable suggestions. We address each major comment below and plan to incorporate revisions to improve the manuscript's clarity and rigor.

point-by-point responses
  1. Referee: Decomposition module: The divide-and-conquer claim that decomposition 'disentangles' evaluation without new biases is load-bearing for both the approximation fidelity and the 1000x efficiency assertion. The manuscript provides no quantitative validation of decomposition fidelity (e.g., cross-modal leakage metrics, statistical preservation tests, or ablation on separation artifacts), leaving open the possibility that reported gains reflect decomposition artifacts rather than true metric approximation.

    Authors: We agree that quantitative validation of the decomposition fidelity is crucial to substantiate our claims. The current manuscript includes some qualitative examples of decomposition, but lacks the specific metrics suggested. In the revised manuscript, we will add quantitative evaluations such as cross-modal leakage using metrics like mutual information between components, statistical preservation tests, and ablations on separation artifacts. This will confirm that the decomposition does not introduce biases and that the efficiency gains are genuine (a minimal sketch of such a leakage check appears after these responses). revision: yes

  2. Referee: Experimental results: The abstract states that 'extensive experiments' demonstrate superior efficiency and consistency, yet the provided text supplies no specific numbers, baselines, error analysis, or ablation tables for the contrastive+LLM training components. This absence makes it impossible to assess whether the lightweight network faithfully approximates complex traditional metrics or whether consistency improvements are statistically significant.

    Authors: We acknowledge that the abstract and initial presentation could benefit from more concrete details. The full paper contains tables and figures with specific results, including speedup comparisons and consistency scores. To address this, we will revise the manuscript to include explicit numerical values, baseline comparisons, error bars, and ablation studies for the training components in the main text and abstract. We will also add statistical analysis to demonstrate significance. revision: yes

  3. Referee: Consistency framework: The new framework relies on independent no-reference scores and downstream tasks as objective references. The manuscript should explicitly demonstrate (via correlation analysis or controlled tests) that these references remain independent of the approximated metrics; otherwise the consistency claim risks moderate circularity.

    Authors: This is a valid concern. The references were selected to be independent, as no-reference scores (e.g., BRISQUE, NIQE) and task performances (e.g., detection accuracy) do not directly depend on the fusion metrics being approximated. However, to strengthen this, we will include in the revision correlation analyses and controlled experiments showing low correlation between the references and EvaNet predictions, thereby mitigating any risk of circularity. revision: yes
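A minimal sketch of the cross-modal leakage check mentioned in response 1, using a histogram estimate of mutual information; the component and source arrays here are random placeholders, and the binning choice is an assumption rather than anything prescribed by the paper.

```python
# Leakage check sketch: low MI between the decomposed IR and VIS components
# (and high MI between each component and its own source) would support the
# claim that decomposition disentangles rather than mixes modalities.
import numpy as np

def mutual_information(a, b, bins=32):
    """Histogram estimate of MI (in nats) between two same-sized images."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px, py = pxy.sum(axis=1, keepdims=True), pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

ir_component = np.random.rand(256, 256)   # placeholder decomposition outputs
vis_component = np.random.rand(256, 256)
ir_source = np.random.rand(256, 256)      # placeholder source image

print("cross-component MI (leakage):", mutual_information(ir_component, vis_component))
print("IR component vs IR source MI:", mutual_information(ir_component, ir_source))
```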

Circularity Check

1 step flagged

Network trained to approximate traditional metrics creates moderate fitted-input circularity in consistency claims

specific steps
  1. fitted input called prediction [Abstract]
    "At its core is a lightweight network designed efficiently to approximate widely used metrics, following a divide-and-conquer strategy. ... During training, we incorporate a contrastive learning strategy and inform our evaluation model by perceptual scene assessment provided by a large language model."

    The network is trained to reproduce the very traditional metrics it is later used to replace. Consequently, any measured consistency or efficiency gain that flows through the network's outputs is statistically constrained by the supervised approximation objective rather than constituting an independent first-principles result.

full rationale

The paper's core claim is that a lightweight network, trained via contrastive learning plus LLM perceptual labels, approximates complex traditional metrics under a decomposition strategy and thereby delivers superior efficiency and consistency. This approximation is explicitly supervised on the target metrics, so downstream consistency scores that rely on the network's outputs inherit dependence on the fitted quantities. The added consistency framework references independent no-reference scores and downstream task performance, which supplies some external anchoring and prevents full circularity. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling appear in the provided text; the decomposition step is presented as a design choice rather than a derived result. Overall score reflects partial reduction of the evaluation outputs to the training targets without the entire derivation collapsing.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so concrete free parameters, axioms, and invented entities cannot be extracted. The approach implicitly rests on standard deep-learning assumptions about network approximation power and the domain assumption that source decomposition aids evaluation.

pith-pipeline@v0.9.0 · 5548 in / 1122 out tokens · 44887 ms · 2026-05-13T20:26:41.646916+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · 3 internal anchors

  1. [1]

    Visible and infrared image fusion using deep learning,

    X. Zhang and Y. Demiris, “Visible and infrared image fusion using deep learning,” IEEE TPAMI, 2023

  2. [2]

    Current advances and future perspectives of image fusion: A comprehensive review,

    S. Karim, G. Tong, J. Li, A. Qadir, U. Farooq, and Y. Yu, “Current advances and future perspectives of image fusion: A comprehensive review,” Information Fusion, vol. 90, pp. 185–217, 2023

  3. [3]

    Mulfs-cap: Multimodal fusion-supervised cross-modality alignment perception for unregistered infrared-visible image fusion,

    H. Li, Z. Yang, Y. Zhang, W. Jia, Z. Yu, and Y. Liu, “Mulfs-cap: Multimodal fusion-supervised cross-modality alignment perception for unregistered infrared-visible image fusion,” IEEE TPAMI, 2025

  4. [4]

    Mmdrfuse: Distilled mini-model with dynamic refresh for multi-modality image fusion,

    Y. Deng, T. Xu, C. Cheng, X.-J. Wu, and J. Kittler, “Mmdrfuse: Distilled mini-model with dynamic refresh for multi-modality image fusion,” in Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 7326–7335

  5. [5]

    S4fusion: Saliency-aware selective state space model for infrared and visible image fusion,

    H. Ma, H. Li, C. Cheng, G. Wang, X. Song, and X.-J. Wu, “S4fusion: Saliency-aware selective state space model for infrared and visible image fusion,” IEEE Transactions on Image Processing, 2025

  6. [6]

    Learning a graph neural network with cross modality interaction for image fusion,

    J. Li, J. Chen, J. Liu, and H. Ma, “Learning a graph neural network with cross modality interaction for image fusion,” in ACM MM, 2023, pp. 4471–4479

  7. [7]

    Target-aware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection,

    J. Liu, X. Fan, Z. Huang, G. Wu, R. Liu, W. Zhong, and Z. Luo, “Target-aware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection,” in CVPR, 2022, pp. 5802–5811

  8. [8]

    One model for all: Low-level task interaction is a key to task-agnostic image fusion,

    C. Cheng, T. Xu, Z. Feng, X. Wu, Z. Tang, H. Li, Z. Zhang, S. Atito, M. Awais, and J. Kittler, “One model for all: Low-level task interaction is a key to task-agnostic image fusion,” in CVPR, 2025, pp. 28102–28112

  9. [9]

    Fusionmamba: Dynamic feature enhancement for multimodal image fusion with mamba,

    X. Xie, Y. Cui, T. Tan, X. Zheng, and Z. Yu, “Fusionmamba: Dynamic feature enhancement for multimodal image fusion with mamba,” Visual Intelligence, vol. 2, no. 1, p. 37, 2024

  10. [10]

    Self-learning hyperspectral and multispectral image fusion via adaptive residual guided subspace diffusion model,

    J. Zhu, H. Wang, Y. Xu, Z. Wu, and Z. Wei, “Self-learning hyperspectral and multispectral image fusion via adaptive residual guided subspace diffusion model,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 17862–17871

  11. [11]

    Fusionbooster: A unified image fusion boosting paradigm,

    C. Cheng, T. Xu, X.-J. Wu, H. Li, X. Li, and J. Kittler, “Fusionbooster: A unified image fusion boosting paradigm,” IJCV, 2025

  12. [12]

    Omnifuse: Composite degradation-robust image fusion with language-driven semantics,

    H. Zhang, L. Cao, X. Zuo, Z. Shao, and J. Ma, “Omnifuse: Composite degradation-robust image fusion with language-driven semantics,” IEEE TPAMI, 2025

  13. [13]

    Freefusion: Infrared and visible image fusion via cross reconstruction learning,

    W. Zhao, H. Cui, H. Wang, Y. He, and H. Lu, “Freefusion: Infrared and visible image fusion via cross reconstruction learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 9, pp. 8040–8056, 2025

  14. [14]

    Generative-based fusion mechanism for multi-modal tracking,

    Z. Tang, T. Xu, X. Wu, X.-F. Zhu, and J. Kittler, “Generative-based fusion mechanism for multi-modal tracking,” in AAAI, vol. 38, no. 6, 2024, pp. 5189–5197

  15. [15]

    Coconut: Modernizing coco segmentation,

    X. Deng, Q. Yu, P. Wang, X. Shen, and L.-C. Chen, “Coconut: Modernizing coco segmentation,” in CVPR, 2024, pp. 21863–21873

  16. [16]

    Learning feature restoration transformer for robust dehazing visual object tracking,

    T. Xu, Y. Pan, Z. Feng, X. Zhu, C. Cheng, X.-J. Wu, and J. Kittler, “Learning feature restoration transformer for robust dehazing visual object tracking,” IJCV, vol. 132, no. 12, pp. 6021–6038, 2024

  17. [17]

    Llvip: A visible-infrared paired dataset for low-light vision,

    X. Jia, C. Zhu, M. Li, W. Tang, and W. Zhou, “Llvip: A visible-infrared paired dataset for low-light vision,” in ICCV, 2021, pp. 3496–3504

  18. [18]

    Rethinking the effectiveness of objective evaluation metrics in multi-focus image fusion: A statistic-based approach,

    Y. Liu, Z. Qi, J. Cheng, and X. Chen, “Rethinking the effectiveness of objective evaluation metrics in multi-focus image fusion: A statistic-based approach,” IEEE TPAMI, 2024

  19. [19]

    Lrrnet: A novel representation learning guided fusion network for infrared and visible images,

    H. Li, T. Xu, X.-J. Wu, J. Lu, and J. Kittler, “Lrrnet: A novel representation learning guided fusion network for infrared and visible images,” IEEE TPAMI, 2023

  20. [20]

    Mdlatlrr: A novel decomposition method for infrared and visible image fusion,

    H. Li, X.-J. Wu, and J. Kittler, “Mdlatlrr: A novel decomposition method for infrared and visible image fusion,” IEEE TIP, vol. 29, pp. 4733–4746, 2020

  21. [21]

    Eggdd: An explicit dependency model for multi-modal medical image fusion in shift-invariant shearlet transform domain,

    L. Wang, B. Li, and L.-f. Tian, “Eggdd: An explicit dependency model for multi-modal medical image fusion in shift-invariant shearlet transform domain,” Information Fusion, vol. 19, pp. 29–37, 2014

  22. [22]

    Textfusion: Unveiling the power of textual semantics for controllable image fusion,

    C. Cheng, T. Xu, X.-J. Wu, H. Li, X. Li, Z. Tang, and J. Kittler, “Textfusion: Unveiling the power of textual semantics for controllable image fusion,” arXiv preprint arXiv:2312.14209, 2023

  23. [23]

    Mrfs: Mutually reinforcing image fusion and segmentation,

    H. Zhang, X. Zuo, J. Jiang, C. Guo, and J. Ma, “Mrfs: Mutually reinforcing image fusion and segmentation,” in CVPR, 2024, pp. 26974–26983

  24. [24]

    Densefuse: A fusion approach to infrared and visible images,

    H. Li and X.-J. Wu, “Densefuse: A fusion approach to infrared and visible images,” IEEE TIP, vol. 28, no. 5, pp. 2614–2623, 2018

  25. [25]

    Task-driven image fusion with learnable fusion loss,

    H. Bai, J. Zhang, Z. Zhao, Y. Wu, L. Deng, Y. Cui, T. Feng, and S. Xu, “Task-driven image fusion with learnable fusion loss,” in CVPR, June 2025, pp. 7457–7468

  26. [26]

    Every sam drop counts: Embracing semantic priors for multi-modality image fusion and beyond,

    G. Wu, H. Liu, H. Fu, Y. Peng, J. Liu, X. Fan, and R. Liu, “Every sam drop counts: Embracing semantic priors for multi-modality image fusion and beyond,” in CVPR, June 2025, pp. 17882–17891

  27. [27]

    Dcevo: Discriminative cross-dimensional evolutionary learning for infrared and visible image fusion,

    J. Liu, B. Zhang, Q. Mei, X. Li, Y. Zou, Z. Jiang, L. Ma, R. Liu, and X. Fan, “Dcevo: Discriminative cross-dimensional evolutionary learning for infrared and visible image fusion,” in CVPR, 2025, pp. 2226–2235

  28. [28]

    Mambadfuse: A mamba-based dual-phase model for multi-modality image fusion,

    Z. Li, H. Pan, K. Zhang, Y. Wang, and F. Yu, “Mambadfuse: A mamba-based dual-phase model for multi-modality image fusion,” arXiv preprint arXiv:2404.08406, 2024

  29. [29]

    Ddfm: Denoising diffusion model for multi-modality image fusion,

    Z. Zhao, H. Bai, Y. Zhu, J. Zhang, S. Xu, Y. Zhang, K. Zhang, D. Meng, R. Timofte, and L. Van Gool, “Ddfm: Denoising diffusion model for multi-modality image fusion,” in ICCV, October 2023, pp. 8082–8093

  30. [30]

    Diff-if: Multi-modality image fusion via diffusion model with fusion knowledge prior,

    X. Yi, L. Tang, H. Zhang, H. Xu, and J. Ma, “Diff-if: Multi-modality image fusion via diffusion model with fusion knowledge prior,” Information Fusion, vol. 110, p. 102450, 2024

  31. [31]

    U2fusion: A unified unsupervised image fusion network,

    H. Xu, J. Ma, J. Jiang, X. Guo, and H. Ling, “U2fusion: A unified unsupervised image fusion network,” IEEE TPAMI, 2020

  32. [32]

    Mufusion: A general unsupervised image fusion network based on memory unit,

    C. Cheng, T. Xu, and X.-J. Wu, “Mufusion: A general unsupervised image fusion network based on memory unit,” Information Fusion, vol. 92, pp. 80–92, 2023

  33. [33]

    Metafusion: Infrared and visible image fusion via meta-feature embedding from object detection,

    W. Zhao, S. Xie, F. Zhao, Y. He, and H. Lu, “Metafusion: Infrared and visible image fusion via meta-feature embedding from object detection,” in CVPR, June 2023, pp. 13955–13965

  34. [34]

    Tgfuse: An infrared and visible image fusion approach based on transformer and generative adversarial network,

    D. Rao, T. Xu, and X.-J. Wu, “Tgfuse: An infrared and visible image fusion approach based on transformer and generative adversarial network,” IEEE TIP, 2023

  35. [35]

    Swinfusion: Cross-domain long-range learning for general image fusion via swin transformer,

    J. Ma, L. Tang, F. Fan, J. Huang, X. Mei, and Y. Ma, “Swinfusion: Cross-domain long-range learning for general image fusion via swin transformer,” IEEE/CAA Journal of Automatica Sinica, vol. 9, no. 7, pp. 1200–1217, 2022

  36. [36]

    Cddfuse: Correlation-driven dual-branch feature decomposition for multi-modality image fusion,

    Z. Zhao, H. Bai, J. Zhang, Y. Zhang, S. Xu, Z. Lin, R. Timofte, and L. Van Gool, “Cddfuse: Correlation-driven dual-branch feature decomposition for multi-modality image fusion,” in CVPR, 2023, pp. 5906–5916

  37. [37]

    Restormer: Efficient transformer for high-resolution image restoration,

    S. W. Zamir, A. Arora, S. Khan, M. Hayat, F. S. Khan, and M.-H. Yang, “Restormer: Efficient transformer for high-resolution image restoration,” in CVPR, 2022, pp. 5728–5739

  38. [38]

    Multi-interactive feature learning and a full-time multi-modality benchmark for image fusion and segmentation,

    J. Liu, Z. Liu, G. Wu, L. Ma, R. Liu, W. Zhong, Z. Luo, and X. Fan, “Multi-interactive feature learning and a full-time multi-modality benchmark for image fusion and segmentation,” in ICCV, 2023, pp. 8115–8124

  39. [39]

    A novel state space model with local enhancement and state sharing for image fusion,

    Z. Cao, X. Wu, L.-J. Deng, and Y. Zhong, “A novel state space model with local enhancement and state sharing for image fusion,” in ACM MM, 2024, pp. 1235–1244

  40. [40]

    Denoising diffusion probabilistic models,

    J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in neural information processing systems, vol. 33, pp. 6840–6851, 2020

  41. [41]

    Perceptual quality assessment for multi-exposure image fusion,

    K. Ma, K. Zeng, and Z. Wang, “Perceptual quality assessment for multi-exposure image fusion,” IEEE TIP, vol. 24, no. 11, pp. 3345–3356, 2015

  42. [42]

    A non-reference image fusion metric based on mutual information of image features,

    M. B. A. Haghighat, A. Aghagolzadeh, and H. Seyedarabi, “A non-reference image fusion metric based on mutual information of image features,” Computers & Electrical Engineering, vol. 37, no. 5, pp. 744–756, 2011

  43. [43]

    Image information and visual quality,

    H. R. Sheikh and A. C. Bovik, “Image information and visual quality,” IEEE TIP, vol. 15, no. 2, pp. 430–444, 2006

  44. [44]

    Very deep convolutional networks for large-scale image recognition,

    K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in International Conference on Learning Representations, May 2015

  45. [45]

    GPT-4 Technical Report

    J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023

  46. [46]

    Ifcnn: A general image fusion framework based on convolutional neural network,

    Y. Zhang, Y. Liu, P. Sun, H. Yan, X. Zhao, and L. Zhang, “Ifcnn: A general image fusion framework based on convolutional neural network,” Information Fusion, vol. 54, pp. 99–118, 2020

  47. [47]

    Sdnet: A versatile squeeze-and-decomposition network for real-time image fusion,

    H. Zhang and J. Ma, “Sdnet: A versatile squeeze-and-decomposition network for real-time image fusion,” IJCV, pp. 1–25, 2021

  48. [48]

    Rfn-nest: An end-to-end residual fusion network for infrared and visible images,

    H. Li, X.-J. Wu, and J. Kittler, “Rfn-nest: An end-to-end residual fusion network for infrared and visible images,” Information Fusion, vol. 73, pp. 72–86, 2021

  49. [49]

    Ydtr: infrared and visible image fusion via y-shape dynamic transformer,

    W. Tang, F. He, and Y. Liu, “Ydtr: infrared and visible image fusion via y-shape dynamic transformer,” IEEE Transactions on Multimedia, 2022

  50. [50]

    Reconet: Recurrent correction network for fast and efficient multi-modality image fusion,

    Z. Huang, J. Liu, X. Fan, R. Liu, W. Zhong, and Z. Luo, “Reconet: Recurrent correction network for fast and efficient multi-modality image fusion,” in ECCV. Springer, 2022, pp. 539–555

  51. [51]

    Text-if: Leveraging semantic text guidance for degradation-aware and interactive image fusion,

    X. Yi, H. Xu, H. Zhang, L. Tang, and J. Ma, “Text-if: Leveraging semantic text guidance for degradation-aware and interactive image fusion,” in CVPR, 2024, pp. 27026–27035

  52. [52]

    Equivariant multi-modality image fusion,

    Z. Zhao, H. Bai, J. Zhang, Y. Zhang, K. Zhang, S. Xu, D. Chen, R. Timofte, and L. Van Gool, “Equivariant multi-modality image fusion,” in CVPR, 2024, pp. 25912–25921

  53. [53]

    Coconet: Coupled contrastive learning network with multi-level feature ensemble for multi-modality image fusion,

    J. Liu, R. Lin, G. Wu, R. Liu, Z. Luo, and X. Fan, “Coconet: Coupled contrastive learning network with multi-level feature ensemble for multi-modality image fusion,” IJCV, vol. 132, no. 5, pp. 1748–1775, 2024

  54. [54]

    Deep neural networks for no-reference and full-reference image quality assessment,

    S. Bosse, D. Maniry, K.-R. Müller, T. Wiegand, and W. Samek, “Deep neural networks for no-reference and full-reference image quality assessment,” IEEE TIP, vol. 27, no. 1, pp. 206–219, 2017

  55. [55]

    Gemini: A Family of Highly Capable Multimodal Models

    G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican et al., “Gemini: a family of highly capable multimodal models,” arXiv preprint arXiv:2312.11805, 2023

  56. [56]

    Qwen Technical Report

    J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang et al., “Qwen technical report,” arXiv preprint arXiv:2309.16609, 2023

  57. [57]

    You only look once: Unified, real-time object detection,

    J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in CVPR, 2016, pp. 779–788

  58. [58]

    Pidnet: A real-time semantic segmentation network inspired by pid controllers,

    J. Xu, Z. Xiong, and S. P. Bhattacharyya, “Pidnet: A real-time semantic segmentation network inspired by pid controllers,” in CVPR, 2023, pp. 19529–19539

  59. [59]

    Exploring clip for assessing the look and feel of images,

    J. Wang, K. C. Chan, and C. C. Loy, “Exploring clip for assessing the look and feel of images,” in AAAI, vol. 37, no. 2, 2023, pp. 2555–2563

  60. [60]

    U2fusion: A unified unsupervised image fusion network,

    H. Xu, J. Ma, J. Jiang, X. Guo, and H. Ling, “U2fusion: A unified unsupervised image fusion network,” IEEE TPAMI, vol. 44, no. 1, pp. 502–518, 2022