pith. machine review for the scientific record. sign in

arxiv: 2604.03117 · v1 · submitted 2026-04-03 · 💻 cs.CV

Recognition: no theorem link

Revealing Physical-World Semantic Vulnerabilities: Universal Adversarial Patches for Infrared Vision-Language Models

Authors on Pith no claims yet

Pith reviewed 2026-05-13 19:39 UTC · model grok-4.3

classification 💻 cs.CV
keywords adversarial patchesinfrared vision-language modelsuniversal attacksphysical-world vulnerabilitiessemantic alignmentrepresentation disruptioncross-modal robustness
0
0 comments X

The pith

A universal curved-grid patch disrupts semantic understanding in infrared vision-language models even after physical deployment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes the Universal Curved-Grid Patch to attack infrared vision-language models by targeting their internal visual representations rather than labels or text prompts. It combines a continuous mesh parameterization with an optimization objective that drives the patch to create subspace departure and topology disruption in the model's embedding space. A sympathetic reader would care because these models are built for reliable perception in low-visibility settings such as night operations or thermal surveillance, where a physically realizable patch could silently degrade cross-modal alignment. The method incorporates meta differential evolution and expectation-over-transformation modeling to support transfer across architectures and real-world conditions.

Core claim

UCGP integrates Curved-Grid Mesh parameterization for continuous, low-frequency patch generation with a unified representation-driven objective that promotes subspace departure, topology disruption, and stealth. Combined with Meta Differential Evolution and EOT-augmented TPS deformation, the resulting patches weaken cross-modal semantic alignment and compromise semantic understanding across diverse IR-VLM architectures while preserving cross-model transferability, cross-dataset generalization, and physical effectiveness.

What carries the argument

Universal Curved-Grid Patch (UCGP) with Curved-Grid Mesh parameterization and a representation-driven objective that directly attacks the visual embedding space instead of labels or prompts.

If this is right

  • UCGP transfers across different IR-VLM architectures without architecture-specific retraining.
  • The patches generalize to unseen infrared datasets while keeping attack strength.
  • Physical deployment remains effective after printing and under natural scene variations.
  • The attack resists several existing defense strategies aimed at adversarial patches.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Representation-level attacks may prove more general than label-level attacks when models must handle open-ended semantic tasks.
  • Security evaluations for multimodal systems should include physical patch testing in the infrared domain rather than digital-only checks.
  • Similar mesh-based parameterization could be tested on other sensor-language combinations such as radar or depth data.

Load-bearing premise

The Curved-Grid Mesh parameterization and unified objective can be reliably optimized to retain effectiveness after physical printing, placement, and real-world imaging variations.

What would settle it

A controlled real-world test in which the printed UCGP is placed on objects and imaged with an actual infrared camera, then fed to multiple IR-VLMs, shows no measurable drop in semantic understanding accuracy relative to clean inputs.

Figures

Figures reproduced from arXiv: 2604.03117 by Chengyin Hu, Jiahuan Long, Junqi Wu, Tingsong Jiang, Wen Yao, Xiang Chen, Yikun Guo, Yiwei Wei, Yuxian Dong.

Figure 1
Figure 1. Figure 1: Overview of the attack mechanism. Prior work has shown that carefully designed perturbations can deceive deep visual models in both digital and physical settings[7, 8]. As multimodal perception evolves from closed￾set recognition to open-ended semantic understanding, the security question also changes: failures no longer stop at wrong labels, but can propagate into generated descriptions and an￾swers. For … view at source ↗
Figure 2
Figure 2. Figure 2: Overall framework of UCGP. where 𝑔𝑢,𝑣 ∈ [0, 1] controls cell activation and 𝑑𝑢,𝑣,𝑒 parame￾terizes the corresponding edge deformation. The role of CGM is to inject a geometric inductive bias into the parameter space, steering the optimizer toward low￾frequency structures that are easier to deploy, attach, and stably capture. As shown in [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: CGM parameterization examples. (a) Mesh topology [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Experimental setup. (a) Tripod, cold patches, and [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: UCGP-generated adversarial samples and their classi [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Semantic failures of IR-VLMs in captioning, VQA, and physical deployment under the proposed patch. (A) Captioning [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Overview of the physical attack. (a) Deployment [PITH_FULL_IMAGE:figures/full_fig_p007_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Transfer results over ASR, Δ𝑠target, and 𝑀adv. (a) Cross-dataset transfer. (b) Cross-model transfer. D1–D5 denote Infrared-COCO, LSOTB-TIR, LLVIP, M3FD, and FLIR, respectively [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Ablation results on key hyperparameters. (a) Pop [PITH_FULL_IMAGE:figures/full_fig_p008_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Radar comparison of BBox ratio and grid count. [PITH_FULL_IMAGE:figures/full_fig_p008_10.png] view at source ↗
read the original abstract

Infrared vision-language models (IR-VLMs) have emerged as a promising paradigm for multimodal perception in low-visibility environments, yet their robustness to adversarial attacks remains largely unexplored. Existing adversarial patch methods are mainly designed for RGB-based models in closed-set settings and are not readily applicable to the open-ended semantic understanding and physical deployment requirements of infrared VLMs. To bridge this gap, we propose Universal Curved-Grid Patch (UCGP), a universal physical adversarial patch framework for IR-VLMs. UCGP integrates Curved-Grid Mesh (CGM) parameterization for continuous, low-frequency, and deployable patch generation with a unified representation-driven objective that promotes subspace departure, topology disruption, and stealth. To improve robustness under real-world deployment and domain shift, we further incorporate Meta Differential Evolution and EOT-augmented TPS deformation modeling. Rather than manipulating labels or prompts, UCGP directly disrupts the visual representation space, weakening cross-modal semantic alignment. Extensive experiments demonstrate that UCGP consistently compromises semantic understanding across diverse IR-VLM architectures while maintaining cross-model transferability, cross-dataset generalization, real-world physical effectiveness, and robustness against defenses. These findings reveal a previously overlooked robustness vulnerability in current infrared multimodal systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes the Universal Curved-Grid Patch (UCGP) framework for generating universal physical adversarial patches targeting infrared vision-language models (IR-VLMs). It introduces Curved-Grid Mesh (CGM) parameterization for continuous low-frequency patch generation combined with a unified representation-driven objective that targets subspace departure, topology disruption, and stealth. Optimization employs Meta Differential Evolution and EOT-augmented TPS deformation modeling to enhance robustness under domain shift. The central claim is that UCGP compromises semantic understanding across diverse IR-VLM architectures while achieving cross-model transferability, cross-dataset generalization, real-world physical effectiveness, and defense robustness, without directly manipulating labels or prompts.

Significance. If the physical and transfer results hold under rigorous validation, the work is significant for exposing a previously unexamined vulnerability in IR-VLMs, which are increasingly relevant for low-visibility multimodal perception in safety-critical domains. The shift from closed-set classification attacks to open-ended semantic disruption via representation-space methods is a useful extension of prior adversarial patch literature. The parameterization and deformation modeling could support reproducible follow-up work if code and exact experimental protocols are released.

major comments (3)
  1. [§5.2] §5.2 (Physical Deployment): The physical-world effectiveness claim rests on EOT-augmented TPS deformation bridging simulation to real IR sensors, yet no quantitative ablation compares attack success rates with and without thermal-drift or emissivity modeling; this directly affects whether the reported robustness generalizes beyond controlled indoor conditions.
  2. [Table 3] Table 3 (Cross-Dataset Results): Reported success rates for cross-dataset generalization lack error bars, statistical significance tests, or per-run variance; without these, the generalization claim cannot be assessed as load-bearing for the overall vulnerability revelation.
  3. [§4.1] §4.1 (Objective Formulation): The unified representation-driven objective combines subspace departure and topology disruption, but the relative weighting between terms is not ablated; this leaves unclear whether the reported effectiveness is driven by the CGM parameterization or by the specific loss combination.
minor comments (2)
  1. [Abstract] Abstract: The abstract asserts 'extensive experiments' and 'robustness against defenses' without summarizing any quantitative metrics, success rates, or ablation highlights; adding a single sentence with key numbers would improve clarity.
  2. [§3.1] Notation: The definition of the Curved-Grid Mesh parameterization in §3.1 uses several free parameters whose ranges are not explicitly bounded in the text, which could hinder reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment point by point below, indicating the revisions we will incorporate to strengthen the paper.

read point-by-point responses
  1. Referee: [§5.2] §5.2 (Physical Deployment): The physical-world effectiveness claim rests on EOT-augmented TPS deformation bridging simulation to real IR sensors, yet no quantitative ablation compares attack success rates with and without thermal-drift or emissivity modeling; this directly affects whether the reported robustness generalizes beyond controlled indoor conditions.

    Authors: We acknowledge that while EOT-augmented TPS captures geometric deformations, an explicit quantitative ablation isolating thermal-drift and emissivity effects was not included. In the revised manuscript, we will add this ablation study to §5.2, reporting attack success rates under simulated thermal variations with and without these factors to better substantiate generalization claims beyond controlled conditions. revision: yes

  2. Referee: [Table 3] Table 3 (Cross-Dataset Results): Reported success rates for cross-dataset generalization lack error bars, statistical significance tests, or per-run variance; without these, the generalization claim cannot be assessed as load-bearing for the overall vulnerability revelation.

    Authors: We agree that statistical rigor is necessary to support the generalization claims. We will revise Table 3 to include error bars (standard deviations across runs), per-run variance, and statistical significance tests such as paired t-tests between datasets. revision: yes

  3. Referee: [§4.1] §4.1 (Objective Formulation): The unified representation-driven objective combines subspace departure and topology disruption, but the relative weighting between terms is not ablated; this leaves unclear whether the reported effectiveness is driven by the CGM parameterization or by the specific loss combination.

    Authors: We recognize that the relative weights were determined via preliminary tuning without a full ablation. In the revised manuscript, we will add an ablation study in §4.1 that varies the weights of the subspace departure and topology disruption terms, reporting their individual contributions and combined effects on performance to clarify the role of each component versus the CGM parameterization. revision: yes

Circularity Check

0 steps flagged

No circularity: UCGP framework integrates standard optimization with new parameterization without self-referential reductions

full rationale

The paper defines UCGP via Curved-Grid Mesh (CGM) parameterization plus a representation-driven objective (subspace departure + topology disruption), then optimizes via Meta Differential Evolution and EOT-augmented TPS. These steps are presented as constructive engineering choices applied to the IR-VLM attack problem, not as quantities fitted to the target metric and then renamed as predictions. No equations equate the claimed physical effectiveness to the inputs by construction, and no load-bearing self-citation chain or uniqueness theorem is invoked to force the result. Experiments are offered as external validation rather than tautological confirmation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

The central claim rests on the effectiveness of the newly introduced CGM parameterization and representation-driven objective; no explicit free parameters or background axioms are stated in the abstract.

invented entities (2)
  • Universal Curved-Grid Patch (UCGP) no independent evidence
    purpose: Generate universal physical adversarial patches for IR-VLMs
    Core proposed artifact whose effectiveness is asserted but not independently evidenced in the abstract.
  • Curved-Grid Mesh (CGM) no independent evidence
    purpose: Enable continuous low-frequency patch generation
    New parameterization introduced to support the patch framework.

pith-pipeline@v0.9.0 · 5548 in / 1145 out tokens · 35607 ms · 2026-05-13T19:39:46.714495+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LLM-as-Judge Framework for Evaluating Tone-Induced Hallucination in Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Ghost-100 benchmark shows prompt tone drives hallucination rates and intensities in VLMs, with non-monotonic peaks at intermediate pressure and task-specific differences that aggregate metrics hide.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · cited by 1 Pith paper · 7 internal anchors

  1. [1]

    Learning Transferable Visual Models From Natural Language Supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021. URLhttps://arxiv.org/ abs/2103.00020

  2. [2]

    Infrared-llava: Enhancing understanding of infrared images in multi-modal large language models

    Shixin Jiang, Zerui Chen, Jiafeng Liang, Yanyan Zhao, Ming Liu, and Bing Qin. Infrared-llava: Enhancing understanding of infrared images in multi-modal large language models. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 8573– 8591, 2024. doi: 10.18653/v1/2024.findings-emnlp.501. URL https://aclanthology.org/2024.findings-...

  3. [3]

    IRGPT: Under- standing real-world infrared image with bi-cross-modal curriculum on large-scale benchmark

    Zhe Cao, Jin Zhang, and Ruiheng Zhang. IRGPT: Under- standing real-world infrared image with bi-cross-modal curriculum on large-scale benchmark. arXiv preprint arXiv:2507.14449, 2025. URLhttps://arxiv.org/ abs/2507.14449. Accepted by ICCV 2025

  4. [4]

    RGB-Th-Bench: A dense benchmark for visual-thermal understanding of vision language models

    MehdiMoshtaghi,SiavashH.Khajavi,andJoniPajarinen. RGB-Th-Bench: A dense benchmark for visual-thermal understanding of vision language models. arXiv preprint arXiv:2503.19654, 2025. URLhttps://arxiv.org/ abs/2503.19654. arXiv:2503.19654

  5. [5]

    IF-Bench: Benchmarking and enhancing mllms for infrared images with generative visual prompt- ing

    Tao Zhang, Yuyang Hong, Yang Xia, Kun Ding, Zeyu Zhang, Ying Wang, Shiming Xiang, and Chunhong Pan. IF-Bench: Benchmarking and enhancing mllms for infrared images with generative visual prompt- ing. https://arxiv.org/abs/2512.09663 , 2025. arXiv:2512.09663

  6. [6]

    ThermEval: A structured bench- markforevaluationofvision-languagemodelsonthermal imagery

    Ayush Shrivastava, Kirtan Gangani, Laksh Jain, Mayank Goel, and Nipun Batra. ThermEval: A structured bench- markforevaluationofvision-languagemodelsonthermal imagery. https : / / arxiv . org / abs / 2602 . 14989,

  7. [7]

    Synthesizing robust adversarial examples

    AnishAthalye,LoganEngstrom,AndrewIlyas,andKevin Kwok. Synthesizing robust adversarial examples. InPro- ceedingsofthe35thInternationalConferenceonMachine Learning, volume 80 ofProceedings of Machine Learn- ing Research, pages 284–293, Stockholm, Sweden, 2018. PMLR. URL https://proceedings.mlr.press/ v80/athalye18b.html

  8. [8]

    Robustphysical-worldattackson deep learning visual classification

    Kevin Eykholt, Ivan Evtimov, Earlence Fernandes, Bo Li, Amir Rahmati, Chaowei Xiao, Atul Prakash, Tadayoshi Kohno,andDawnSong. Robustphysical-worldattackson deep learning visual classification. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1625–1634, Salt Lake City, UT, USA,

  9. [9]

    doi: 10.1109/CVPR.2018.00175

    IEEE. doi: 10.1109/CVPR.2018.00175

  10. [10]

    Visual adversarial examples jailbreak aligned large language models.ProceedingsoftheAAAIConferenceonArtificial Intelligence, 38(19):21527–21536, 2024

    Xiangyu Qi, Kaixuan Huang, Ashwinee Panda, Peter Henderson, Mengdi Wang, and Prateek Mittal. Visual adversarial examples jailbreak aligned large language models.ProceedingsoftheAAAIConferenceonArtificial Intelligence, 38(19):21527–21536, 2024. doi: 10.1609/ aaai.v38i19.30150

  11. [11]

    Peng Xie, Yequan Bie, Jianda Mao, Yangqiu Song, Yang Wang, Hao Chen, and Kani Chen. Chain of attack: On the robustness of vision-language models against transfer- basedadversarialattacks.InProceedingsoftheIEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14752–14761, Nashville, TN, USA, 2025. IEEE. doi: 10.1109/CVPR52734.2025.01368. 9

  12. [12]

    arXiv preprint arXiv:2312.14217, 2024

    ChengyinHuandWeiwenShi.Adversarialinfraredcurves: An attack on infrared pedestrian detectors in the physical world. arXiv preprint arXiv:2312.14217, 2024. URL https://arxiv.org/abs/2312.14217

  13. [13]

    Multi-view black-box physical attacks on infrared pedes- trian detectors using adversarial infrared grid

    Kalibinuer Tiliwalidi, Chengyin Hu, and Weiwen Shi. Multi-view black-box physical attacks on infrared pedes- trian detectors using adversarial infrared grid. arXiv preprintarXiv:2407.01168,2024. URL https://arxiv. org/abs/2407.01168

  14. [14]

    Review on infrared imaging technology.Sustainability, 14(18):11161, 2022

    Fujin Hou, Yan Zhang, Yong Zhou, Mei Zhang, Bin Lv, and Jianqing Wu. Review on infrared imaging technology.Sustainability, 14(18):11161, 2022. doi: 10.3390/su141811161

  15. [15]

    In2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)

    Xinyu Jia, Chuang Zhu, Minzhen Li, Wenqi Tang, and Wenli Zhou. Llvip: A visible-infrared paired dataset for low-light vision. InProceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pages 3489–3497, Montreal, QC, Canada, 2021. IEEE. doi: 10.1109/ICCVW54120.2021.00389

  16. [16]

    Revisiting the adversarial robustness of vision language models: a multimodal perspective.arXiv preprint arXiv:2404.19287, 2024

    Wanqi Zhou, Shuanghao Bai, Danilo P. Mandic, Qibin Zhao, and Badong Chen. Revisiting the adversarial ro- bustness of vision language models: a multimodal per- spective. arXiv preprint arXiv:2404.19287, 2024. URL https://arxiv.org/abs/2404.19287

  17. [17]

    Explaining and Harnessing Adversarial Examples

    IanJ.Goodfellow,JonathonShlens,andChristianSzegedy. Explaining and harnessing adversarial examples. InInter- national Conference on Learning Representations, 2015. URLhttps://arxiv.org/abs/1412.6572

  18. [18]

    Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks

    Xiaopei Zhu, Zhanhao Hu, Siyuan Huang, Jianmin Li, and Xiaolin Hu. Infrared invisible clothing: Hiding frominfrareddetectorsatmultipleanglesinrealworld. In ProceedingsoftheIEEE/CVFConferenceonComputerVi- sion and Pattern Recognition, 2022. doi: 10.48550/arXiv. 2205.05909. URL https : / / openaccess . thecvf . com / content / CVPR2022 / html / Zhu _ Infrar...

  19. [19]

    Physically ad- versarial infrared patches with learnable shapes and lo- cations

    Xingxing Wei, Jie Yu, and Yao Huang. Physically ad- versarial infrared patches with learnable shapes and lo- cations. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023. doi: 10.48550/arXiv.2303.13868. URL https://arxiv. org/abs/2303.13868

  20. [20]

    arXivpreprintarXiv:2304.10712,2023

    Chengyin Hu, Weiwen Shi, Tingsong Jiang, Wen Yao, LingTian,andXiaoqianChen.Adversarialinfraredblocks: Amulti-viewblack-boxattacktothermalinfrareddetectors inphysicalworld. arXivpreprintarXiv:2304.10712,2023. URLhttps://arxiv.org/abs/2304.10712

  21. [21]

    Differential evolution – a simple and efficient heuristic for global optimization over continuous spaces.Journal of Global Optimization, 11(4):341–359, 1997

    Rainer Storn and Kenneth Price. Differential evolution – a simple and efficient heuristic for global optimization over continuous spaces.Journal of Global Optimization, 11(4):341–359, 1997. doi: 10.1023/A:1008202821328

  22. [22]

    Completely derandomized self-adaptation in evolution strategies

    Nikolaus Hansen and Andreas Ostermeier. Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9(2):159–195, 2001. doi: 10.1162/106365601750190398

  23. [23]

    Black-box adversarial attacks with limited queries and information

    Andrew Ilyas, Logan Engstrom, Anish Athalye, and Jessy Lin. Black-box adversarial attacks with limited queries and information. InProceedings of the 35th Interna- tional Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2142– 2151, Stockholm, Sweden, 2018. PMLR. URLhttp: //proceedings.mlr.press/v80/ilyas18a.html

  24. [24]

    Square attack: A query-efficient black-box adversarial attack via random search

    Maksym Andriushchenko and Francesco Croce. Square attack: A query-efficient black-box adversarial attack via random search. InComputer Vision – ECCV 2020, volume 12368 ofLecture Notes in Computer Science, pages 484–501, Cham, Switzerland, 2020. Springer. doi: 10.1007/978-3-030-58592-1_29

  25. [25]

    FredL.Bookstein.Principalwarps: Thin-platesplinesand thedecompositionofdeformations.IEEETransactionson PatternAnalysisandMachineIntelligence,11(6):567–585,

  26. [26]

    doi: 10.1109/34.24792

  27. [27]

    Uni- fied adversarial patch for visible-infrared cross-modal attacks in the physical world.IEEE Transactions on Pat- ternAnalysisandMachineIntelligence,46(4):2348–2363, 2023

    Xingxing Wei, Yao Huang, Yitong Sun, and Jie Yu. Uni- fied adversarial patch for visible-infrared cross-modal attacks in the physical world.IEEE Transactions on Pat- ternAnalysisandMachineIntelligence,46(4):2348–2363, 2023

  28. [28]

    Hinton and Sam T

    Geoffrey E. Hinton and Sam T. Roweis. Stochastic neigh- bor embedding. InAdvances in Neural Information Pro- cessing Systems 15, 2002

  29. [29]

    Visualizing data using t-sne.Journal of Machine Learning Research, 9:2579–2605, 2008

    LaurensvanderMaatenandGeoffreyHinton. Visualizing data using t-sne.Journal of Machine Learning Research, 9:2579–2605, 2008

  30. [30]

    Metade: Evolving differential evolution by differential evolution

    Minyang Chen, Chenchen Feng, and Ran Cheng. Metade: Evolving differential evolution by differential evolution. IEEE Transactions on Evolutionary Computation, 30(1): 141–155, 2026. doi: 10.1109/TEVC.2025.3541587

  31. [31]

    Lawrence Zitnick, and Piotr Dollár

    Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. Mi- crosoft coco: Common objects in context. InEuropean Conference on Computer Vision, pages 740–755, 2014

  32. [32]

    Lsotb-tir: A large-scale high-diversity thermal infraredobjecttrackingbenchmark

    QiaoLiu,XinLi,ZhenyuHe,ChenglongLi,JunLi,Zikun Zhou, Di Yuan, Jing Li, Kai Yang, Nana Fan, and Feng Zheng. Lsotb-tir: A large-scale high-diversity thermal infraredobjecttrackingbenchmark. InProceedingsofthe 28thACMInternationalConferenceonMultimedia,pages 3847–3856, New York, NY, USA, 2020. Association for Computing Machinery. doi: 10.1145/3394171.3413922. 10

  33. [33]

    Multispectral pedestrian detection: Benchmark dataset and baseline

    SoonminHwang, JaesikPark, NamilKim, YukyungChoi, and In So Kweon. Multispectral pedestrian detection: Benchmark dataset and baseline. InProceedings of the IEEE Conference on Computer Vision and Pattern Recog- nition, pages 1037–1045, Boston, MA, USA, 2015. IEEE. doi: 10.1109/CVPR.2015.7298706

  34. [34]

    FLIR ADAS Dataset v1.3

    FLIR Systems. FLIR ADAS Dataset v1.3. Online, 2018. URL https : / / www . flir . com / oem / adas / adas - dataset- form/. Referenced in RGB-T detection lit- erature as the FLIR thermal dataset; accessed March 2026

  35. [35]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. InPro- ceedingsoftheIEEE/CVFConferenceonComputerVision and Pattern Recognition, pages 26296–26306, 2024

  36. [36]

    Llava-next: Improved reasoning, ocr, and world knowledge

    Haotian Liu, Chunyuan Li, Yuheng Li, Brandon Li, Yuan- han Zhang, Shijie Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge. Project page, 2024. URL https://llava- vl.github.io/ blog/2024-01-30-llava-next/. Model release page

  37. [38]

    URLhttps://arxiv.org/abs/2308.01390

  38. [39]

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023. URLhttps: //arxiv.org/abs/2301.12597

  39. [40]

    InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

    WenliangDai,JunnanLi,DongxuLi,AnthonyMengHuat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung,andStevenC.H.Hoi. Instructblip: Towardsgeneral- purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023. URLhttps: //arxiv.org/abs/2305.06500

  40. [41]

    Assran, Q

    Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jit- sev. Reproducible scaling laws for contrastive language- image learning. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition, pages 2818–2829, Vancouver, BC, Canada, 2023. IEEE...

  41. [42]

    Demysti- fying clip data

    Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, RussellHowes,VasuSharma,Shang-WenLi,GargiGhosh, Luke Zettlemoyer, and Christoph Feichtenhofer. Demysti- fying clip data. arXiv preprint arXiv:2309.16671, 2023. URLhttps://arxiv.org/abs/2309.16671

  42. [43]

    EVA-CLIP: Improved Training Techniques for CLIP at Scale

    Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389, 2023. URL https://arxiv.org/abs/2303.15389

  43. [44]

    Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. InInternational Conference on Learning Repre- sentations, 2022

  44. [45]

    When lighting deceives: Exposing vision- language models’ illumination vulnerability through il- lumination transformation attack

    Hanqing Liu, Shouwei Ruan, Yao Huang, Shiji Zhao, and Xingxing Wei. When lighting deceives: Exposing vision- language models’ illumination vulnerability through il- lumination transformation attack. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025. 11