arxiv: 2604.03117 · v1 · submitted 2026-04-03 · 💻 cs.CV

Recognition: no theorem link

Revealing Physical-World Semantic Vulnerabilities: Universal Adversarial Patches for Infrared Vision-Language Models

Chengyin Hu , Yuxian Dong , Yikun Guo , Xiang Chen , Junqi Wu , Jiahuan Long , Yiwei Wei , Tingsong Jiang

show 1 more author

Wen Yao

Authors on Pith no claims yet

Pith reviewed 2026-05-13 19:39 UTC · model grok-4.3

classification 💻 cs.CV

keywords adversarial patchesinfrared vision-language modelsuniversal attacksphysical-world vulnerabilitiessemantic alignmentrepresentation disruptioncross-modal robustness

0 comments

The pith

A universal curved-grid patch disrupts semantic understanding in infrared vision-language models even after physical deployment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes the Universal Curved-Grid Patch to attack infrared vision-language models by targeting their internal visual representations rather than labels or text prompts. It combines a continuous mesh parameterization with an optimization objective that drives the patch to create subspace departure and topology disruption in the model's embedding space. A sympathetic reader would care because these models are built for reliable perception in low-visibility settings such as night operations or thermal surveillance, where a physically realizable patch could silently degrade cross-modal alignment. The method incorporates meta differential evolution and expectation-over-transformation modeling to support transfer across architectures and real-world conditions.

Core claim

UCGP integrates Curved-Grid Mesh parameterization for continuous, low-frequency patch generation with a unified representation-driven objective that promotes subspace departure, topology disruption, and stealth. Combined with Meta Differential Evolution and EOT-augmented TPS deformation, the resulting patches weaken cross-modal semantic alignment and compromise semantic understanding across diverse IR-VLM architectures while preserving cross-model transferability, cross-dataset generalization, and physical effectiveness.

What carries the argument

Universal Curved-Grid Patch (UCGP) with Curved-Grid Mesh parameterization and a representation-driven objective that directly attacks the visual embedding space instead of labels or prompts.

If this is right

UCGP transfers across different IR-VLM architectures without architecture-specific retraining.
The patches generalize to unseen infrared datasets while keeping attack strength.
Physical deployment remains effective after printing and under natural scene variations.
The attack resists several existing defense strategies aimed at adversarial patches.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Representation-level attacks may prove more general than label-level attacks when models must handle open-ended semantic tasks.
Security evaluations for multimodal systems should include physical patch testing in the infrared domain rather than digital-only checks.
Similar mesh-based parameterization could be tested on other sensor-language combinations such as radar or depth data.

Load-bearing premise

The Curved-Grid Mesh parameterization and unified objective can be reliably optimized to retain effectiveness after physical printing, placement, and real-world imaging variations.

What would settle it

A controlled real-world test in which the printed UCGP is placed on objects and imaged with an actual infrared camera, then fed to multiple IR-VLMs, shows no measurable drop in semantic understanding accuracy relative to clean inputs.

Figures

Figures reproduced from arXiv: 2604.03117 by Chengyin Hu, Jiahuan Long, Junqi Wu, Tingsong Jiang, Wen Yao, Xiang Chen, Yikun Guo, Yiwei Wei, Yuxian Dong.

**Figure 1.** Figure 1: Overview of the attack mechanism. Prior work has shown that carefully designed perturbations can deceive deep visual models in both digital and physical settings[7, 8]. As multimodal perception evolves from closedset recognition to open-ended semantic understanding, the security question also changes: failures no longer stop at wrong labels, but can propagate into generated descriptions and answers. For … view at source ↗

**Figure 2.** Figure 2: Overall framework of UCGP. where 𝑔𝑢,𝑣 ∈ [0, 1] controls cell activation and 𝑑𝑢,𝑣,𝑒 parameterizes the corresponding edge deformation. The role of CGM is to inject a geometric inductive bias into the parameter space, steering the optimizer toward lowfrequency structures that are easier to deploy, attach, and stably capture. As shown in [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: CGM parameterization examples. (a) Mesh topology [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗

**Figure 4.** Figure 4: Experimental setup. (a) Tripod, cold patches, and [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 5.** Figure 5: UCGP-generated adversarial samples and their classi [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: Semantic failures of IR-VLMs in captioning, VQA, and physical deployment under the proposed patch. (A) Captioning [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

**Figure 7.** Figure 7: Overview of the physical attack. (a) Deployment [PITH_FULL_IMAGE:figures/full_fig_p007_7.png] view at source ↗

**Figure 8.** Figure 8: Transfer results over ASR, Δ𝑠target, and 𝑀adv. (a) Cross-dataset transfer. (b) Cross-model transfer. D1–D5 denote Infrared-COCO, LSOTB-TIR, LLVIP, M3FD, and FLIR, respectively [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗

**Figure 9.** Figure 9: Ablation results on key hyperparameters. (a) Pop [PITH_FULL_IMAGE:figures/full_fig_p008_9.png] view at source ↗

**Figure 10.** Figure 10: Radar comparison of BBox ratio and grid count. [PITH_FULL_IMAGE:figures/full_fig_p008_10.png] view at source ↗

read the original abstract

Infrared vision-language models (IR-VLMs) have emerged as a promising paradigm for multimodal perception in low-visibility environments, yet their robustness to adversarial attacks remains largely unexplored. Existing adversarial patch methods are mainly designed for RGB-based models in closed-set settings and are not readily applicable to the open-ended semantic understanding and physical deployment requirements of infrared VLMs. To bridge this gap, we propose Universal Curved-Grid Patch (UCGP), a universal physical adversarial patch framework for IR-VLMs. UCGP integrates Curved-Grid Mesh (CGM) parameterization for continuous, low-frequency, and deployable patch generation with a unified representation-driven objective that promotes subspace departure, topology disruption, and stealth. To improve robustness under real-world deployment and domain shift, we further incorporate Meta Differential Evolution and EOT-augmented TPS deformation modeling. Rather than manipulating labels or prompts, UCGP directly disrupts the visual representation space, weakening cross-modal semantic alignment. Extensive experiments demonstrate that UCGP consistently compromises semantic understanding across diverse IR-VLM architectures while maintaining cross-model transferability, cross-dataset generalization, real-world physical effectiveness, and robustness against defenses. These findings reveal a previously overlooked robustness vulnerability in current infrared multimodal systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper adapts adversarial patches to IR-VLMs with a curved-grid approach, but the physical deployment claims rest on assumptions that need tighter checks.

read the letter

The main point is that UCGP brings a curved-grid mesh parameterization and a representation-focused objective to attack open-ended infrared vision-language models. It moves beyond standard RGB patch methods by targeting subspace departure and topology disruption directly in the visual features rather than labels or prompts. The meta differential evolution optimizer and EOT-augmented TPS deformation are added to handle some real-world shifts. If the experiments hold up, this gives a concrete way to test robustness in a domain that matters for low-visibility perception tasks. The paper does a reasonable job laying out why existing attacks fall short for IR-VLMs and in trying to make the patches universal across models and datasets. That extension is the clearest new piece. The soft spot sits in the physical effectiveness section. The stress-test note is on target: bridging the simulated deformations to uncontrolled IR conditions like thermal drift or emissivity changes is not automatic. If the real-world tests stay narrow or lack detailed metrics on success rate drops under varied conditions, the claims of cross-dataset generalization and defense robustness could overstate what was shown. The abstract promises extensive experiments, so the full results and ablations will decide how much weight the vulnerability finding carries. This is for people working on multimodal robustness or security in non-RGB sensing. A reader who follows adversarial work on VLMs or infrared systems would pick up usable ideas here. The approach is coherent on its own terms and engages the prior literature without obvious internal contradictions, so it deserves a serious referee. I would send it out for review.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes the Universal Curved-Grid Patch (UCGP) framework for generating universal physical adversarial patches targeting infrared vision-language models (IR-VLMs). It introduces Curved-Grid Mesh (CGM) parameterization for continuous low-frequency patch generation combined with a unified representation-driven objective that targets subspace departure, topology disruption, and stealth. Optimization employs Meta Differential Evolution and EOT-augmented TPS deformation modeling to enhance robustness under domain shift. The central claim is that UCGP compromises semantic understanding across diverse IR-VLM architectures while achieving cross-model transferability, cross-dataset generalization, real-world physical effectiveness, and defense robustness, without directly manipulating labels or prompts.

Significance. If the physical and transfer results hold under rigorous validation, the work is significant for exposing a previously unexamined vulnerability in IR-VLMs, which are increasingly relevant for low-visibility multimodal perception in safety-critical domains. The shift from closed-set classification attacks to open-ended semantic disruption via representation-space methods is a useful extension of prior adversarial patch literature. The parameterization and deformation modeling could support reproducible follow-up work if code and exact experimental protocols are released.

major comments (3)

[§5.2] §5.2 (Physical Deployment): The physical-world effectiveness claim rests on EOT-augmented TPS deformation bridging simulation to real IR sensors, yet no quantitative ablation compares attack success rates with and without thermal-drift or emissivity modeling; this directly affects whether the reported robustness generalizes beyond controlled indoor conditions.
[Table 3] Table 3 (Cross-Dataset Results): Reported success rates for cross-dataset generalization lack error bars, statistical significance tests, or per-run variance; without these, the generalization claim cannot be assessed as load-bearing for the overall vulnerability revelation.
[§4.1] §4.1 (Objective Formulation): The unified representation-driven objective combines subspace departure and topology disruption, but the relative weighting between terms is not ablated; this leaves unclear whether the reported effectiveness is driven by the CGM parameterization or by the specific loss combination.

minor comments (2)

[Abstract] Abstract: The abstract asserts 'extensive experiments' and 'robustness against defenses' without summarizing any quantitative metrics, success rates, or ablation highlights; adding a single sentence with key numbers would improve clarity.
[§3.1] Notation: The definition of the Curved-Grid Mesh parameterization in §3.1 uses several free parameters whose ranges are not explicitly bounded in the text, which could hinder reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment point by point below, indicating the revisions we will incorporate to strengthen the paper.

read point-by-point responses

Referee: [§5.2] §5.2 (Physical Deployment): The physical-world effectiveness claim rests on EOT-augmented TPS deformation bridging simulation to real IR sensors, yet no quantitative ablation compares attack success rates with and without thermal-drift or emissivity modeling; this directly affects whether the reported robustness generalizes beyond controlled indoor conditions.

Authors: We acknowledge that while EOT-augmented TPS captures geometric deformations, an explicit quantitative ablation isolating thermal-drift and emissivity effects was not included. In the revised manuscript, we will add this ablation study to §5.2, reporting attack success rates under simulated thermal variations with and without these factors to better substantiate generalization claims beyond controlled conditions. revision: yes
Referee: [Table 3] Table 3 (Cross-Dataset Results): Reported success rates for cross-dataset generalization lack error bars, statistical significance tests, or per-run variance; without these, the generalization claim cannot be assessed as load-bearing for the overall vulnerability revelation.

Authors: We agree that statistical rigor is necessary to support the generalization claims. We will revise Table 3 to include error bars (standard deviations across runs), per-run variance, and statistical significance tests such as paired t-tests between datasets. revision: yes
Referee: [§4.1] §4.1 (Objective Formulation): The unified representation-driven objective combines subspace departure and topology disruption, but the relative weighting between terms is not ablated; this leaves unclear whether the reported effectiveness is driven by the CGM parameterization or by the specific loss combination.

Authors: We recognize that the relative weights were determined via preliminary tuning without a full ablation. In the revised manuscript, we will add an ablation study in §4.1 that varies the weights of the subspace departure and topology disruption terms, reporting their individual contributions and combined effects on performance to clarify the role of each component versus the CGM parameterization. revision: yes

Circularity Check

0 steps flagged

No circularity: UCGP framework integrates standard optimization with new parameterization without self-referential reductions

full rationale

The paper defines UCGP via Curved-Grid Mesh (CGM) parameterization plus a representation-driven objective (subspace departure + topology disruption), then optimizes via Meta Differential Evolution and EOT-augmented TPS. These steps are presented as constructive engineering choices applied to the IR-VLM attack problem, not as quantities fitted to the target metric and then renamed as predictions. No equations equate the claimed physical effectiveness to the inputs by construction, and no load-bearing self-citation chain or uniqueness theorem is invoked to force the result. Experiments are offered as external validation rather than tautological confirmation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

The central claim rests on the effectiveness of the newly introduced CGM parameterization and representation-driven objective; no explicit free parameters or background axioms are stated in the abstract.

invented entities (2)

Universal Curved-Grid Patch (UCGP) no independent evidence
purpose: Generate universal physical adversarial patches for IR-VLMs
Core proposed artifact whose effectiveness is asserted but not independently evidenced in the abstract.
Curved-Grid Mesh (CGM) no independent evidence
purpose: Enable continuous low-frequency patch generation
New parameterization introduced to support the patch framework.

pith-pipeline@v0.9.0 · 5548 in / 1145 out tokens · 35607 ms · 2026-05-13T19:39:46.714495+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LLM-as-Judge Framework for Evaluating Tone-Induced Hallucination in Vision-Language Models
cs.CV 2026-04 unverdicted novelty 7.0

Ghost-100 benchmark shows prompt tone drives hallucination rates and intensities in VLMs, with non-monotonic peaks at intermediate pressure and task-specific differences that aggregate metrics hide.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · cited by 1 Pith paper · 7 internal anchors

[1]

Learning Transferable Visual Models From Natural Language Supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021. URLhttps://arxiv.org/ abs/2103.00020

work page internal anchor Pith review Pith/arXiv arXiv 2021
[2]

Infrared-llava: Enhancing understanding of infrared images in multi-modal large language models

Shixin Jiang, Zerui Chen, Jiafeng Liang, Yanyan Zhao, Ming Liu, and Bing Qin. Infrared-llava: Enhancing understanding of infrared images in multi-modal large language models. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 8573– 8591, 2024. doi: 10.18653/v1/2024.findings-emnlp.501. URL https://aclanthology.org/2024.findings-...

work page doi:10.18653/v1/2024.findings-emnlp.501 2024
[3]

IRGPT: Under- standing real-world infrared image with bi-cross-modal curriculum on large-scale benchmark

Zhe Cao, Jin Zhang, and Ruiheng Zhang. IRGPT: Under- standing real-world infrared image with bi-cross-modal curriculum on large-scale benchmark. arXiv preprint arXiv:2507.14449, 2025. URLhttps://arxiv.org/ abs/2507.14449. Accepted by ICCV 2025

work page arXiv 2025
[4]

RGB-Th-Bench: A dense benchmark for visual-thermal understanding of vision language models

MehdiMoshtaghi,SiavashH.Khajavi,andJoniPajarinen. RGB-Th-Bench: A dense benchmark for visual-thermal understanding of vision language models. arXiv preprint arXiv:2503.19654, 2025. URLhttps://arxiv.org/ abs/2503.19654. arXiv:2503.19654

work page arXiv 2025
[5]

IF-Bench: Benchmarking and enhancing mllms for infrared images with generative visual prompt- ing

Tao Zhang, Yuyang Hong, Yang Xia, Kun Ding, Zeyu Zhang, Ying Wang, Shiming Xiang, and Chunhong Pan. IF-Bench: Benchmarking and enhancing mllms for infrared images with generative visual prompt- ing. https://arxiv.org/abs/2512.09663 , 2025. arXiv:2512.09663

work page arXiv 2025
[6]

ThermEval: A structured bench- markforevaluationofvision-languagemodelsonthermal imagery

Ayush Shrivastava, Kirtan Gangani, Laksh Jain, Mayank Goel, and Nipun Batra. ThermEval: A structured bench- markforevaluationofvision-languagemodelsonthermal imagery. https : / / arxiv . org / abs / 2602 . 14989,

work page
[7]

Synthesizing robust adversarial examples

AnishAthalye,LoganEngstrom,AndrewIlyas,andKevin Kwok. Synthesizing robust adversarial examples. InPro- ceedingsofthe35thInternationalConferenceonMachine Learning, volume 80 ofProceedings of Machine Learn- ing Research, pages 284–293, Stockholm, Sweden, 2018. PMLR. URL https://proceedings.mlr.press/ v80/athalye18b.html

work page 2018
[8]

Robustphysical-worldattackson deep learning visual classification

Kevin Eykholt, Ivan Evtimov, Earlence Fernandes, Bo Li, Amir Rahmati, Chaowei Xiao, Atul Prakash, Tadayoshi Kohno,andDawnSong. Robustphysical-worldattackson deep learning visual classification. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1625–1634, Salt Lake City, UT, USA,

work page
[9]

doi: 10.1109/CVPR.2018.00175

IEEE. doi: 10.1109/CVPR.2018.00175

work page doi:10.1109/cvpr.2018.00175 2018
[10]

Visual adversarial examples jailbreak aligned large language models.ProceedingsoftheAAAIConferenceonArtificial Intelligence, 38(19):21527–21536, 2024

Xiangyu Qi, Kaixuan Huang, Ashwinee Panda, Peter Henderson, Mengdi Wang, and Prateek Mittal. Visual adversarial examples jailbreak aligned large language models.ProceedingsoftheAAAIConferenceonArtificial Intelligence, 38(19):21527–21536, 2024. doi: 10.1609/ aaai.v38i19.30150

work page 2024
[11]

Peng Xie, Yequan Bie, Jianda Mao, Yangqiu Song, Yang Wang, Hao Chen, and Kani Chen. Chain of attack: On the robustness of vision-language models against transfer- basedadversarialattacks.InProceedingsoftheIEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14752–14761, Nashville, TN, USA, 2025. IEEE. doi: 10.1109/CVPR52734.2025.01368. 9

work page doi:10.1109/cvpr52734.2025.01368 2025
[12]

arXiv preprint arXiv:2312.14217, 2024

ChengyinHuandWeiwenShi.Adversarialinfraredcurves: An attack on infrared pedestrian detectors in the physical world. arXiv preprint arXiv:2312.14217, 2024. URL https://arxiv.org/abs/2312.14217

work page arXiv 2024
[13]

Multi-view black-box physical attacks on infrared pedes- trian detectors using adversarial infrared grid

Kalibinuer Tiliwalidi, Chengyin Hu, and Weiwen Shi. Multi-view black-box physical attacks on infrared pedes- trian detectors using adversarial infrared grid. arXiv preprintarXiv:2407.01168,2024. URL https://arxiv. org/abs/2407.01168

work page arXiv 2024
[14]

Review on infrared imaging technology.Sustainability, 14(18):11161, 2022

Fujin Hou, Yan Zhang, Yong Zhou, Mei Zhang, Bin Lv, and Jianqing Wu. Review on infrared imaging technology.Sustainability, 14(18):11161, 2022. doi: 10.3390/su141811161

work page doi:10.3390/su141811161 2022
[15]

In2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)

Xinyu Jia, Chuang Zhu, Minzhen Li, Wenqi Tang, and Wenli Zhou. Llvip: A visible-infrared paired dataset for low-light vision. InProceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pages 3489–3497, Montreal, QC, Canada, 2021. IEEE. doi: 10.1109/ICCVW54120.2021.00389

work page doi:10.1109/iccvw54120.2021.00389 2021
[16]

Revisiting the adversarial robustness of vision language models: a multimodal perspective.arXiv preprint arXiv:2404.19287, 2024

Wanqi Zhou, Shuanghao Bai, Danilo P. Mandic, Qibin Zhao, and Badong Chen. Revisiting the adversarial ro- bustness of vision language models: a multimodal per- spective. arXiv preprint arXiv:2404.19287, 2024. URL https://arxiv.org/abs/2404.19287

work page arXiv 2024
[17]

Explaining and Harnessing Adversarial Examples

IanJ.Goodfellow,JonathonShlens,andChristianSzegedy. Explaining and harnessing adversarial examples. InInter- national Conference on Learning Representations, 2015. URLhttps://arxiv.org/abs/1412.6572

work page internal anchor Pith review Pith/arXiv arXiv 2015
[18]

Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks

Xiaopei Zhu, Zhanhao Hu, Siyuan Huang, Jianmin Li, and Xiaolin Hu. Infrared invisible clothing: Hiding frominfrareddetectorsatmultipleanglesinrealworld. In ProceedingsoftheIEEE/CVFConferenceonComputerVi- sion and Pattern Recognition, 2022. doi: 10.48550/arXiv. 2205.05909. URL https : / / openaccess . thecvf . com / content / CVPR2022 / html / Zhu _ Infrar...

work page internal anchor Pith review doi:10.48550/arxiv 2022
[19]

Physically ad- versarial infrared patches with learnable shapes and lo- cations

Xingxing Wei, Jie Yu, and Yao Huang. Physically ad- versarial infrared patches with learnable shapes and lo- cations. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023. doi: 10.48550/arXiv.2303.13868. URL https://arxiv. org/abs/2303.13868

work page doi:10.48550/arxiv.2303.13868 2023
[20]

arXivpreprintarXiv:2304.10712,2023

Chengyin Hu, Weiwen Shi, Tingsong Jiang, Wen Yao, LingTian,andXiaoqianChen.Adversarialinfraredblocks: Amulti-viewblack-boxattacktothermalinfrareddetectors inphysicalworld. arXivpreprintarXiv:2304.10712,2023. URLhttps://arxiv.org/abs/2304.10712

work page arXiv 2023
[21]

Differential evolution – a simple and efficient heuristic for global optimization over continuous spaces.Journal of Global Optimization, 11(4):341–359, 1997

Rainer Storn and Kenneth Price. Differential evolution – a simple and efficient heuristic for global optimization over continuous spaces.Journal of Global Optimization, 11(4):341–359, 1997. doi: 10.1023/A:1008202821328

work page doi:10.1023/a:1008202821328 1997
[22]

Completely derandomized self-adaptation in evolution strategies

Nikolaus Hansen and Andreas Ostermeier. Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9(2):159–195, 2001. doi: 10.1162/106365601750190398

work page doi:10.1162/106365601750190398 2001
[23]

Black-box adversarial attacks with limited queries and information

Andrew Ilyas, Logan Engstrom, Anish Athalye, and Jessy Lin. Black-box adversarial attacks with limited queries and information. InProceedings of the 35th Interna- tional Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2142– 2151, Stockholm, Sweden, 2018. PMLR. URLhttp: //proceedings.mlr.press/v80/ilyas18a.html

work page 2018
[24]

Square attack: A query-efficient black-box adversarial attack via random search

Maksym Andriushchenko and Francesco Croce. Square attack: A query-efficient black-box adversarial attack via random search. InComputer Vision – ECCV 2020, volume 12368 ofLecture Notes in Computer Science, pages 484–501, Cham, Switzerland, 2020. Springer. doi: 10.1007/978-3-030-58592-1_29

work page doi:10.1007/978-3-030-58592-1_29 2020
[25]

FredL.Bookstein.Principalwarps: Thin-platesplinesand thedecompositionofdeformations.IEEETransactionson PatternAnalysisandMachineIntelligence,11(6):567–585,

work page
[26]

doi: 10.1109/34.24792

work page doi:10.1109/34.24792
[27]

Uni- fied adversarial patch for visible-infrared cross-modal attacks in the physical world.IEEE Transactions on Pat- ternAnalysisandMachineIntelligence,46(4):2348–2363, 2023

Xingxing Wei, Yao Huang, Yitong Sun, and Jie Yu. Uni- fied adversarial patch for visible-infrared cross-modal attacks in the physical world.IEEE Transactions on Pat- ternAnalysisandMachineIntelligence,46(4):2348–2363, 2023

work page 2023
[28]

Hinton and Sam T

Geoffrey E. Hinton and Sam T. Roweis. Stochastic neigh- bor embedding. InAdvances in Neural Information Pro- cessing Systems 15, 2002

work page 2002
[29]

Visualizing data using t-sne.Journal of Machine Learning Research, 9:2579–2605, 2008

LaurensvanderMaatenandGeoffreyHinton. Visualizing data using t-sne.Journal of Machine Learning Research, 9:2579–2605, 2008

work page 2008
[30]

Metade: Evolving differential evolution by differential evolution

Minyang Chen, Chenchen Feng, and Ran Cheng. Metade: Evolving differential evolution by differential evolution. IEEE Transactions on Evolutionary Computation, 30(1): 141–155, 2026. doi: 10.1109/TEVC.2025.3541587

work page doi:10.1109/tevc.2025.3541587 2026
[31]

Lawrence Zitnick, and Piotr Dollár

Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. Mi- crosoft coco: Common objects in context. InEuropean Conference on Computer Vision, pages 740–755, 2014

work page 2014
[32]

Lsotb-tir: A large-scale high-diversity thermal infraredobjecttrackingbenchmark

QiaoLiu,XinLi,ZhenyuHe,ChenglongLi,JunLi,Zikun Zhou, Di Yuan, Jing Li, Kai Yang, Nana Fan, and Feng Zheng. Lsotb-tir: A large-scale high-diversity thermal infraredobjecttrackingbenchmark. InProceedingsofthe 28thACMInternationalConferenceonMultimedia,pages 3847–3856, New York, NY, USA, 2020. Association for Computing Machinery. doi: 10.1145/3394171.3413922. 10

work page doi:10.1145/3394171.3413922 2020
[33]

Multispectral pedestrian detection: Benchmark dataset and baseline

SoonminHwang, JaesikPark, NamilKim, YukyungChoi, and In So Kweon. Multispectral pedestrian detection: Benchmark dataset and baseline. InProceedings of the IEEE Conference on Computer Vision and Pattern Recog- nition, pages 1037–1045, Boston, MA, USA, 2015. IEEE. doi: 10.1109/CVPR.2015.7298706

work page doi:10.1109/cvpr.2015.7298706 2015
[34]

FLIR ADAS Dataset v1.3

FLIR Systems. FLIR ADAS Dataset v1.3. Online, 2018. URL https : / / www . flir . com / oem / adas / adas - dataset- form/. Referenced in RGB-T detection lit- erature as the FLIR thermal dataset; accessed March 2026

work page 2018
[35]

Improved baselines with visual instruction tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. InPro- ceedingsoftheIEEE/CVFConferenceonComputerVision and Pattern Recognition, pages 26296–26306, 2024

work page 2024
[36]

Llava-next: Improved reasoning, ocr, and world knowledge

Haotian Liu, Chunyuan Li, Yuheng Li, Brandon Li, Yuan- han Zhang, Shijie Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge. Project page, 2024. URL https://llava- vl.github.io/ blog/2024-01-30-llava-next/. Model release page

work page 2024
[38]

URLhttps://arxiv.org/abs/2308.01390

work page internal anchor Pith review arXiv
[39]

Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023. URLhttps: //arxiv.org/abs/2301.12597

work page internal anchor Pith review Pith/arXiv arXiv 2023
[40]

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

WenliangDai,JunnanLi,DongxuLi,AnthonyMengHuat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung,andStevenC.H.Hoi. Instructblip: Towardsgeneral- purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023. URLhttps: //arxiv.org/abs/2305.06500

work page internal anchor Pith review Pith/arXiv arXiv 2023
[41]

Assran, Q

Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jit- sev. Reproducible scaling laws for contrastive language- image learning. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition, pages 2818–2829, Vancouver, BC, Canada, 2023. IEEE...

work page doi:10.1109/cvpr52729.2023.00276 2023
[42]

Demysti- fying clip data

Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, RussellHowes,VasuSharma,Shang-WenLi,GargiGhosh, Luke Zettlemoyer, and Christoph Feichtenhofer. Demysti- fying clip data. arXiv preprint arXiv:2309.16671, 2023. URLhttps://arxiv.org/abs/2309.16671

work page arXiv 2023
[43]

EVA-CLIP: Improved Training Techniques for CLIP at Scale

Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389, 2023. URL https://arxiv.org/abs/2303.15389

work page internal anchor Pith review Pith/arXiv arXiv 2023
[44]

Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. InInternational Conference on Learning Repre- sentations, 2022

work page 2022
[45]

When lighting deceives: Exposing vision- language models’ illumination vulnerability through il- lumination transformation attack

Hanqing Liu, Shouwei Ruan, Yao Huang, Shiji Zhao, and Xingxing Wei. When lighting deceives: Exposing vision- language models’ illumination vulnerability through il- lumination transformation attack. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025. 11

work page 2025