arxiv: 2605.06049 · v1 · submitted 2026-05-07 · 💻 cs.CV

Recognition: unknown

Fusion in Your Way: Aligning Image Fusion with Heterogeneous Demands via Direct Preference Optimization

Weijian Su , Songqian Zhang , Yuqi Han , Jian Zhuang , Yongdong Huang , Qiang Zhang

Authors on Pith no claims yet

Pith reviewed 2026-05-09 15:22 UTC · model grok-4.3

classification 💻 cs.CV

keywords infrared visible image fusiondirect preference optimizationlatent diffusion modelspreference alignmentadaptive fusionmulti-modal processingtask-oriented transferability

0 comments

The pith

Direct preference optimization lets image fusion adapt to varied human and machine vision demands by tuning diffusion models on heterogeneous signals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes DPOFusion to solve the problem that existing infrared-visible image fusion methods produce fixed outputs unable to match different preferences. It first trains a property-aligned latent diffusion model to create diverse fusion candidates, then applies instance direct preference optimization to a controllable diffusion model so that signals from humans, vision-language models, and task networks can directly steer the result. This produces fusion images that align with the chosen preference while maintaining quality for both viewing and downstream tasks. A sympathetic reader would care because real applications often require different trade-offs, such as clearer details for people or better accuracy for object detectors, yet current techniques cannot switch between them without new training.

Core claim

DPOFusion integrates PALDM, which uses a latent fusion prior and joint conditional loss to generate diverse candidate fusions, with PCLDM that is fine-tuned via IDPO on preference data from humans, vision-language models, and task-driven networks, thereby enabling direct control over the final fusion result to achieve precise alignment and improved adaptive quality.

What carries the argument

Instance direct preference optimization (IDPO) applied to the preference-controllable latent diffusion model (PCLDM) after initialization from the property-aligned latent diffusion model (PALDM), allowing preference signals to steer the fusion output.

If this is right

Fusion results can be guided by task requirements while still satisfying human visual preferences.
A single model can transfer to new tasks without retraining the base diffusion prior.
Adaptive fusion quality improves over fixed-output methods on both subjective and objective metrics.
Preference alignment becomes possible across human viewers, language models, and vision networks in one framework.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same preference-tuning approach could extend to other multi-modal fusion problems where demands differ by application.
Real-time adaptation might become feasible if the IDPO step is made efficient enough for on-device use.
Combining signals from multiple sources may reveal whether certain preferences are inherently compatible or require explicit trade-off handling.

Load-bearing premise

Heterogeneous preference signals from humans, vision-language models, and task-driven networks can be consistently captured and used to fine-tune the diffusion models without introducing alignment inconsistencies or task-specific biases.

What would settle it

Run the model on a set of input pairs where human preference labels conflict with task-network preference labels and check whether the generated fusions show systematic bias toward one source or fail to match the selected preference.

Figures

Figures reproduced from arXiv: 2605.06049 by Jian Zhuang, Qiang Zhang, Songqian Zhang, Weijian Su, Yongdong Huang, Yuqi Han.

**Figure 1.** Figure 1: Overview of our DPOFusion framework. (a) We collect heterogeneous preference demands for a given scene from diverse view at source ↗

**Figure 2.** Figure 2: Preference heterogeneity in IVIF. (a) Overall preferences view at source ↗

**Figure 3.** Figure 3: The overall architecture of DPOFusion. It consists of PALDM and PCLDM. PALDM provides candidate fusion samples, and view at source ↗

**Figure 4.** Figure 4: Qualitative comparison of DPOFusion (fine-tuned with human and VLM feedback) against state-of-the-art fusion methods on view at source ↗

**Figure 5.** Figure 5: Qualitative comparison of semantic segmentation performance between our RLDF-Seg and state-of-the-art methods on the MSRS view at source ↗

**Figure 6.** Figure 6: Qualitative comparison of object detection performance between our RLDF-OD and state-of-the-art methods on the M3FD view at source ↗

**Figure 8.** Figure 8: Quantitative comparison of alignment performance for view at source ↗

**Figure 9.** Figure 9: User interface for human feedback annotation. The left side of the interface displays the entire fused image for the user to select view at source ↗

**Figure 10.** Figure 10: Qualitative comparison of DPOFusion against state-of-the-art fusion methods on the LLVIP, MSRS, and RoadScene datasets. view at source ↗

**Figure 11.** Figure 11: PALDM results conditioned on corresponding text prompts. view at source ↗

**Figure 12.** Figure 12: Qualitative results of the IDPO ablation study for DPO view at source ↗

**Figure 13.** Figure 13: Sensitivity analysis of βt and µ view at source ↗

**Figure 14.** Figure 14: Preference evaluation result. study with 10 annotators on 20 images uses a 1–5 Likert scale to assess thermal target visibility and texture details (5 = excellent). RLHF exhibits the most prominent thermal targets and texture details. C.3. Complexity and runtime comparison We evaluate the model size, computational complexity, and inference time of various methods on input images of size 256 × 256. Sp… view at source ↗

read the original abstract

As a key technique in multi-modal processing, infrared and visible image fusion (IVIF) plays a crucial role in integrating complementary spectral information for visual enhancement and downstream vision tasks. Despite remarkable progress, existing methods struggle to flexibly accommodate heterogeneous demands. Achieving adaptive fusion that aligns with various preferences from both human and machine vision remains an open and challenging problem. To address this challenge, we propose DPOFusion, a direct preference optimization (DPO) framework integrating the property-aligned latent diffusion model (PALDM) and the preference-controllable latent diffusion model (PCLDM), enabling task-guided, preference-adaptive IVIF for both human and machine vision. The PALDM leverages a latent fusion prior and a joint conditional loss to generate diverse candidate fusion results with various properties. PCLDM is subsequently fine-tuned via instance direct preference optimization (IDPO), enabling direct control of the final fusion results with heterogeneous preference signals. Experimental results demonstrate that our framework not only attains precise preference alignment among humans, vision-language models, and task-driven networks, but also sets a new benchmark for adaptive fusion quality and task-oriented transferability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DPOFusion adapts direct preference optimization to image fusion via two new diffusion variants, but the abstract leaves the experimental claims unverified.

read the letter

The core contribution is taking DPO from language models and applying it to infrared-visible image fusion so the output can be steered by mixed signals from humans, vision-language models, and task networks. They split the work into PALDM, which produces a range of candidate fusions using a latent prior and joint conditional loss, and PCLDM, which gets fine-tuned with instance DPO to pick or blend according to those preferences. That combination is not in the prior IVIF literature they cite, so the framing is fresh even if the building blocks are established techniques. The paper does a clean job stating the practical problem: most fusion methods lock in one trade-off and cannot shift when the downstream user changes. Framing the goal as controllable alignment rather than a single metric is useful. The soft spots sit in the experimental side. The abstract asserts precise alignment and a new benchmark, yet gives no numbers on how preference pairs were constructed for the non-human sources, no ablation on mixing versus single-source training, and no comparison details. If the VLM scores and task accuracies pull in opposite directions on edge cases, the IDPO step could settle on an averaged compromise instead of true multi-way alignment; that risk is not addressed in the summary. The full paper will need to show the calibration and the controls before the claims land. This work is aimed at computer-vision researchers who already use diffusion models for fusion and want more flexible control for detection, segmentation, or human viewing. A reader who follows preference optimization or multimodal alignment will find the adaptation worth looking at. It deserves a serious referee because the problem is real, the method is grounded in reproducible prior work, and the experiments can be checked once the details are in front of us.

Referee Report

2 major / 1 minor

Summary. The paper proposes DPOFusion, a framework for infrared-visible image fusion (IVIF) that integrates a property-aligned latent diffusion model (PALDM) to generate diverse candidate fusions via latent fusion priors and joint conditional loss, followed by a preference-controllable latent diffusion model (PCLDM) fine-tuned using instance direct preference optimization (IDPO) to align outputs with heterogeneous preference signals from humans, vision-language models, and task-driven networks.

Significance. If the experimental claims hold, the work offers a flexible DPO-based approach to preference-adaptive fusion that could improve both perceptual quality and downstream task transferability without task-specific retraining. The integration of diffusion priors with direct preference optimization for multi-source alignment is a novel direction in multi-modal fusion, though its impact depends on rigorous validation of the alignment mechanism.

major comments (2)

[Abstract] Abstract: The central claims that the framework 'attains precise preference alignment among humans, vision-language models, and task-driven networks' and 'sets a new benchmark for adaptive fusion quality and task-oriented transferability' are asserted without any reported datasets, metrics, baselines, quantitative results, or error analysis, which are load-bearing for assessing the experimental contribution.
[Method] PCLDM and IDPO construction (inferred from method description): The process for converting non-human signals (VLM similarity scores or task-network accuracy metrics) into win/lose preference pairs is not shown to be calibrated against human judgments; without this or an ablation comparing mixed-source IDPO versus single-source training, it remains unclear whether conflicting signals lead to averaged compromises rather than true heterogeneous alignment.

minor comments (1)

[Abstract] The acronyms PALDM and PCLDM are introduced in the abstract without immediate expansion or definition on first use.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing clarifications based on the content of the full paper and indicating where revisions will be made to improve clarity and completeness.

read point-by-point responses

Referee: [Abstract] Abstract: The central claims that the framework 'attains precise preference alignment among humans, vision-language models, and task-driven networks' and 'sets a new benchmark for adaptive fusion quality and task-oriented transferability' are asserted without any reported datasets, metrics, baselines, quantitative results, or error analysis, which are load-bearing for assessing the experimental contribution.

Authors: The abstract serves as a concise summary of the paper's contributions and key findings. The full manuscript provides the supporting experimental details in Section 4, including evaluations on standard IVIF datasets (TNO, RoadScene, M3FD), quantitative metrics for fusion quality (EN, SD, SSIM, VIF, Qabf) and task performance (mAP for detection, mIoU for segmentation), comparisons against multiple baselines (e.g., FusionGAN, U2Fusion, SwinFusion, and recent diffusion-based methods), tables with mean results and standard deviations from repeated runs, and qualitative visualizations. We will revise the abstract to include a brief reference to the evaluation protocol and datasets to better contextualize the claims. revision: partial
Referee: [Method] PCLDM and IDPO construction (inferred from method description): The process for converting non-human signals (VLM similarity scores or task-network accuracy metrics) into win/lose preference pairs is not shown to be calibrated against human judgments; without this or an ablation comparing mixed-source IDPO versus single-source training, it remains unclear whether conflicting signals lead to averaged compromises rather than true heterogeneous alignment.

Authors: We appreciate this observation on the IDPO mechanism. The manuscript describes the conversion process in detail: VLM similarity scores are mapped to preference probabilities via a temperature-scaled sigmoid, while task-network metrics (e.g., accuracy or mAP gains) label pairs as preferred/non-preferred based on relative improvements above a threshold. Although an explicit cross-calibration experiment matching non-human signals directly to human judgments is not presented, the paper includes a human user study validating overall alignment. To address concerns about signal conflicts, we will add a new ablation study in the revised version comparing mixed heterogeneous IDPO training against single-source (human-only and task-only) variants, with results showing that the mixed approach achieves balanced improvements without reducing to simple compromises. revision: yes

Circularity Check

0 steps flagged

Low circularity; relies on external DPO technique and diffusion priors

full rationale

The paper's core framework applies the established Direct Preference Optimization (DPO) algorithm to fine-tune a latent diffusion model (PCLDM) using preference signals derived from humans, VLMs, and task networks. No derivation step reduces a claimed prediction or alignment result to a fitted parameter or self-defined quantity by construction. The PALDM generates candidates via a latent fusion prior and joint conditional loss, after which IDPO performs standard preference optimization; these steps are independent of the final alignment claims and do not import uniqueness theorems or ansatzes from the authors' prior work. Self-citation load is minimal and non-load-bearing. The construction of preference pairs for non-human sources is described as an engineering choice rather than a mathematical reduction that forces the outcome.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

Based on abstract only; no explicit free parameters, axioms, or invented entities are detailed beyond naming the new models PALDM and PCLDM.

invented entities (2)

PALDM (property-aligned latent diffusion model) no independent evidence
purpose: Generate diverse candidate fusion results with various properties using latent fusion prior and joint conditional loss
Introduced as core component for candidate generation
PCLDM (preference-controllable latent diffusion model) no independent evidence
purpose: Enable direct control of fusion results via instance direct preference optimization
Introduced as the fine-tuned model for preference alignment

pith-pipeline@v0.9.0 · 5507 in / 1096 out tokens · 24369 ms · 2026-05-09T15:22:32.529639+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

60 extracted references · 11 canonical work pages · 3 internal anchors

[1]

Direct Preference Optimization: Your Language Model is Secretly a Reward Model , url =

Rafailov, Rafael and Sharma, Archit and Mitchell, Eric and Manning, Christopher D and Ermon, Stefano and Finn, Chelsea , booktitle = NIPS, editor =. Direct Preference Optimization: Your Language Model is Secretly a Reward Model , url =
[2]

2024 , pages =

Wallace, Bram and Dang, Meihua and Rafailov, Rafael and Zhou, Linqi and Lou, Aaron and Purushwalkam, Senthil and Ermon, Stefano and Xiong, Caiming and Joty, Shafiq and Naik, Nikhil , title =. 2024 , pages =

2024
[3]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Dustin Podell and Zion English and Kyle Lacey and Andreas Blattmann and Tim Dockhorn and Jonas Müller and Joe Penna and Robin Rombach , year=. 2307.01952 , archivePrefix=

work page internal anchor Pith review arXiv
[4]

2025 , pages =

Huang, Qihan and Chan, Long and Liu, Jinlong and He, Wanggui and Jiang, Hao and Song, Mingli and Song, Jie , title =. 2025 , pages =

2025
[5]

2509.01181 , archivePrefix=

Qiaoqiao Jin and Siming Fu and Dong She and Weinan Jia and Hualiang Wang and Mu Liu and Jidong Jiang , year=. 2509.01181 , archivePrefix=

work page arXiv
[6]

2024 , eprint=

Token-level Direct Preference Optimization , author=. 2024 , eprint=

2024
[7]

Wu, Junkang and Xie, Yuexiang and Yang, Zhengyi and Wu, Jiancan and Gao, Jinyang and Ding, Bolin and Wang, Xiang and He, Xiangnan , booktitle = NIPS, doi =
[8]

Xiao, Teng and Yuan, Yige and Zhu, Huaisheng and Li, Mingxiao and Honavar, Vasant G , booktitle = NIPS, doi =
[9]

2023 , pages =

Zhao, Zixiang and Bai, Haowen and Zhu, Yuanzhi and Zhang, Jiangshe and Xu, Shuang and Zhang, Yulun and Zhang, Kai and Meng, Deyu and Timofte, Radu and Van Gool, Luc , title =. 2023 , pages =

2023
[10]

2024 , issn =

Mining Li and Ronghao Pei and Tianyou Zheng and Yang Zhang and Weiwei Fu , keywords =. 2024 , issn =. doi:https://doi.org/10.1016/j.eswa.2023.121664 , url =

work page doi:10.1016/j.eswa.2023.121664 2024
[11]

2304.04774 , archivePrefix=

ZiHan Cao and ShiQi Cao and Xiao Wu and JunMing Hou and Ran Ran and Liang-Jian Deng , year=. 2304.04774 , archivePrefix=

work page arXiv
[12]

2023 , volume=

Croitoru, Florinel-Alin and Hondru, Vlad and Ionescu, Radu Tudor and Shah, Mubarak , journal=PAMI, title=. 2023 , volume=

2023
[13]

2025 , volume=

Huang, Yi and Huang, Jiancheng and Liu, Yifan and Yan, Mingfu and Lv, Jiaxi and Liu, Jianzhuang and Xiong, Wei and Zhang, He and Cao, Liangliang and Chen, Shifeng , journal=PAMI, title=. 2025 , volume=

2025
[14]

2022 , pages =

Rombach, Robin and Blattmann, Andreas and Lorenz, Dominik and Esser, Patrick and Ommer, Bj\"orn , title =. 2022 , pages =

2022
[15]

2022 , pages =

Zamir, Syed Waqas and Arora, Aditya and Khan, Salman and Hayat, Munawar and Khan, Fahad Shahbaz and Yang, Ming-Hsuan , title =. 2022 , pages =

2022
[16]

2023 , pages =

Zhang, Lvmin and Rao, Anyi and Agrawala, Maneesh , title =. 2023 , pages =

2023
[17]

2021 , editor =

Learning Transferable Visual Models From Natural Language Supervision , author =. 2021 , editor =

2021
[18]

2025 , eprint=

Qwen3-Omni Technical Report , author=. 2025 , eprint=

2025
[19]

and Lo, Wan-Yen and Dollar, Piotr and Girshick, Ross , title =

Kirillov, Alexander and Mintun, Eric and Ravi, Nikhila and Mao, Hanzi and Rolland, Chloe and Gustafson, Laura and Xiao, Tete and Whitehead, Spencer and Berg, Alexander C. and Lo, Wan-Yen and Dollar, Piotr and Girshick, Ross , title =. 2023 , pages =

2023
[20]

and Luo, Ping , booktitle = NIPS, editor =

Xie, Enze and Wang, Wenhai and Yu, Zhiding and Anandkumar, Anima and Alvarez, Jose M. and Luo, Ping , booktitle = NIPS, editor =
[22]

2021 , pages =

Jia, Xinyu and Zhu, Chuang and Li, Minzhen and Tang, Wenqi and Zhou, Wenli , title =. 2021 , pages =

2021
[23]

2022 , pages =

Liu, Jinyuan and Fan, Xin and Huang, Zhanbo and Wu, Guanyao and Liu, Risheng and Zhong, Wei and Luo, Zhongxuan , title =. 2022 , pages =

2022
[24]

2022 , issn =

Linfeng Tang and Jiteng Yuan and Hao Zhang and Xingyu Jiang and Jiayi Ma , keywords =. 2022 , issn =. doi:https://doi.org/10.1016/j.inffus.2022.03.007 , url =

work page doi:10.1016/j.inffus.2022.03.007 2022
[25]

doi:10.1609/aaai.v34i07.6936 , abstractNote=

Xu, Han and Ma, Jiayi and Le, Zhuliang and Jiang, Junjun and Guo, Xiaojie , year=. doi:10.1609/aaai.v34i07.6936 , abstractNote=

work page doi:10.1609/aaai.v34i07.6936
[26]

Kang, Le and Ye, Peng and Li, Yi and Doermann, David , title =
[27]

2021 , pages =

Ke, Junjie and Wang, Qifei and Wang, Yilin and Milanfar, Peyman and Yang, Feng , title =. 2021 , pages =

2021
[28]

2022 , volume=

Xu, Han and Ma, Jiayi and Jiang, Junjun and Guo, Xiaojie and Ling, Haibin , journal=PAMI, title=. 2022 , volume=

2022
[29]

2024 , pages =

Zheng, Naishan and Zhou, Man and Huang, Jie and Hou, Junming and Li, Haoying and Xu, Yuan and Zhao, Feng , title =. 2024 , pages =

2024
[30]

2024 , pages =

Zhao, Zixiang and Bai, Haowen and Zhang, Jiangshe and Zhang, Yulun and Zhang, Kai and Xu, Shuang and Chen, Dongdong and Timofte, Radu and Van Gool, Luc , title =. 2024 , pages =

2024
[31]

2025 , pages =

Liu, Jinyuan and Zhang, Bowei and Mei, Qingyun and Li, Xingyuan and Zou, Yang and Jiang, Zhiying and Ma, Long and Liu, Risheng and Fan, Xin , title =. 2025 , pages =

2025
[32]

2024 , pages =

Yi, Xunpeng and Xu, Han and Zhang, Hao and Tang, Linfeng and Ma, Jiayi , title =. 2024 , pages =

2024
[33]

2025 , pages =

Cheng, Chunyang and Xu, Tianyang and Feng, Zhenhua and Wu, Xiaojun and Tang, Zhangyong and Li, Hui and Zhang, Zeyang and Atito, Sara and Awais, Muhammad and Kittler, Josef , title =. 2025 , pages =

2025
[34]

2025 , pages =

Yi, Xunpeng and Zhang, Yibing and Xiang, Xinyu and Yan, Qinglong and Xu, Han and Ma, Jiayi , title =. 2025 , pages =

2025
[35]

2025 , pages =

Wu, Guanyao and Liu, Haoyu and Fu, Hongming and Peng, Yichuan and Liu, Jinyuan and Fan, Xin and Liu, Risheng , title =. 2025 , pages =

2025
[36]

2024 , issn =

Xunpeng Yi and Linfeng Tang and Hao Zhang and Han Xu and Jiayi Ma , keywords =. 2024 , issn =. doi:https://doi.org/10.1016/j.inffus.2024.102450 , url =

work page doi:10.1016/j.inffus.2024.102450 2024
[37]

Zhang, Hao and Cao, Lei and Ma, Jaiyi , booktitle = NIPS, doi =
[38]

2023 , volume=

Yue, Jun and Fang, Leyuan and Xia, Shaobo and Deng, Yue and Ma, Jiayi , journal=TIP, title=. 2023 , volume=

2023
[39]

Jane and Chen, Xun , journal=TIP, title=

Shi, Yu and Liu, Yu and Cheng, Juan and Wang, Z. Jane and Chen, Xun , journal=TIP, title=. 2025 , volume=

2025
[40]

2023 , pages =

Zhou, Yufan and Liu, Bingchen and Zhu, Yizhe and Yang, Xiao and Chen, Changyou and Xu, Jinhui , title =. 2023 , pages =

2023
[41]

Xue, Zeyue and Song, Guanglu and Guo, Qiushan and Liu, Boxiao and Zong, Zhuofan and Liu, Yu and Luo, Ping , booktitle = NIPS, editor =
[42]

2023 , pages =

Kawar, Bahjat and Zada, Shiran and Lang, Oran and Tov, Omer and Chang, Huiwen and Dekel, Tali and Mosseri, Inbar and Irani, Michal , title =. 2023 , pages =

2023
[43]

and Ren, Jian , title =

Zhang, Zhixing and Han, Ligong and Ghosh, Arnab and Metaxas, Dimitris N. and Ren, Jian , title =. 2023 , pages =

2023
[44]

2022 , pages =

Lugmayr, Andreas and Danelljan, Martin and Romero, Andres and Yu, Fisher and Timofte, Radu and Van Gool, Luc , title =. 2022 , pages =

2022
[45]

2023 , pages =

Anciukevi. 2023 , pages =

2023
[46]

2025 , pages =

Yang, Zhihe and Luo, Xufang and Han, Dongqi and Xu, Yunjian and Li, Dongsheng , title =. 2025 , pages =

2025
[47]

2025 , volume=

Liu, Jinyuan and Wu, Guanyao and Liu, Zhu and Wang, Di and Jiang, Zhiying and Ma, Long and Zhong, Wei and Fan, Xin and Liu, Risheng , journal=PAMI, title=. 2025 , volume=

2025
[48]

2024 , pages =

Yang, Kai and Tao, Jian and Lyu, Jiafei and Ge, Chunjiang and Chen, Jiaxin and Shen, Weihan and Zhu, Xiaolong and Li, Xiu , title =. 2024 , pages =

2024
[49]

2504.15176 , archivePrefix=

Miaomiao Cai and Simiao Li and Wei Li and Xudong Huang and Hanting Chen and Jie Hu and Yunhe Wang , year=. 2504.15176 , archivePrefix=

work page arXiv
[50]

Shusheng Xu and Wei Fu and Jiaxuan Gao and Wenjie Ye and Weilin Liu and Zhiyu Mei and Guangju Wang and Chao Yu and Yi Wu , year=. Is. 2404.10719 , archivePrefix=

work page arXiv
[51]

2025 , pages =

Wu, Xun and Huang, Shaohan and Jiang, Lingjie and Wei, Furu , title =. 2025 , pages =

2025
[52]

2024 , pages =

Xu, Yanwu and Zhao, Yang and Xiao, Zhisheng and Hou, Tingbo , title =. 2024 , pages =

2024
[53]

2024 , pages =

Mou, Chong and Wang, Xintao and Song, Jiechong and Shan, Ying and Zhang, Jian , title =. 2024 , pages =

2024
[54]

2022 , volume=

Zhang, Xingchen , journal=PAMI, title=. 2022 , volume=

2022
[55]

2026 , volume=

Tang, Linfeng and Li, Chunyu and Ma, Jiayi , journal=PAMI, title=. 2026 , volume=

2026
[56]

2025 , volume=

Liang, Pengwei and Jiang, Junjun and Ma, Qing and Wang, Chenyang and Liu, Xianming and Ma, Jiayi , journal=TIP, title=. 2025 , volume=

2025
[57]

, Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, et al. , Learning transferable visual models from natural language supervision. In Int. Conf. Mach. Learn, pages 8748-8763, 2021

2021
[58]

Alexander Kirillov, Eric Mintun, Nikhila Ravi, et al., Segment anything. In Int. Conf. Comput. Vis., pages 4015–4026, 2023

2023
[59]

Qwen3-Omni Technical Report

Jin Xu, Zhifang Guo, Hangrui Hu, et al. Qwen3-omni technical report. arXiv preprint arXiv:2509.17765, 2025

work page internal anchor Pith review arXiv 2025
[60]

SegFormer: Simple and efficient design for semantic segmentation with transformers

Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inform. Process. Syst., 34:12077–12090, 2021

2021
[61]

YOLOv11: An Overview of the Key Architectural Enhancements

Rahima Khanam and Muhammad Hussain. Yolov11: An overview of the key architectural enhancements. arXiv preprint arXiv:2410.17725, 2024

work page internal anchor Pith review arXiv 2024