
arxiv: 2605.22273 · v1 · pith:ZLZ4RY5Znew · submitted 2026-05-21 · 💻 cs.CV

Exposing Vulnerabilities in Visible-Infrared VLMs: A Unified Geometric Adversarial Framework with Cross-Task Transferability

Pith reviewed 2026-05-22 07:07 UTC · model grok-4.3

classification 💻 cs.CV
keywords adversarial patches · vision-language models · visible-infrared · fractal geometry · cross-task transferability · multimodal attacks · Bezier curves · EOT robustness

The pith

Curved fractal geometry with spiral textures creates adversarial patches that fool visible-infrared vision-language models and transfer to other tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CFGPatch, an adversarial patch framework that combines curved fractal geometry with spiral-based texture distortions to attack vision-language models processing both visible and infrared images. This matters because VIS-IR sensing supports reliable perception in challenging real-world conditions, and stronger attacks highlight where these models remain vulnerable. The design replaces straight edges with Bezier-curved fractal elements for smoother contours and richer variation while adding modality-specific Fraser spirals to mislead texture interpretation, all optimized with expectation over transformation for robustness. Experiments show it outperforms standard patches and that examples trained on zero-shot classification succeed on image captioning and visual question answering.

Core claim

CFGPatch builds on triangular fractal geometry and replaces rigid straight-edged primitives with Bezier-curved elements. This preserves multi-scale fractal self-similarity while introducing smoother contours, richer directional variation, and more flexible shape deformation. The global structure is paired with a modality-specific Fraser-spiral rendering mechanism that injects fine-grained texture distortions and misleading perceptual cues into visible and infrared images, so the coupled design disrupts both shape perception and texture interpretation. Expectation over transformation (EOT) is adopted to improve robustness against common image-level transformations.
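The curved-edge idea can be illustrated with a minimal sketch. It assumes a Sierpinski-style triangular subdivision and quadratic Bezier edges whose control point is offset along the edge normal. The paper's actual construction and control-point parameterization are not given in this summary, so the names `subdivide`, `bezier_edge`, `curved_fractal`, and the `bulge` parameter are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Triangle = Tuple[Point, Point, Point]


def midpoint(a: Point, b: Point) -> Point:
    return ((a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0)


def subdivide(tri: Triangle, depth: int) -> List[Triangle]:
    """Sierpinski-style subdivision: keep the three corner triangles per level."""
    if depth == 0:
        return [tri]
    a, b, c = tri
    ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
    leaves: List[Triangle] = []
    for child in ((a, ab, ca), (ab, b, bc), (ca, bc, c)):
        leaves.extend(subdivide(child, depth - 1))
    return leaves


def bezier_edge(p0: Point, p1: Point, bulge: float, samples: int = 8) -> List[Point]:
    """Quadratic Bezier from p0 to p1 (assumed form, not the paper's).

    The control point is the edge midpoint pushed along the edge normal by
    `bulge` times the edge length, so curvature scales with triangle size.
    """
    mx, my = midpoint(p0, p1)
    dx, dy = p1[0] - p0[0], p1[1] - p0[1]
    ctrl = (mx - dy * bulge, my + dx * bulge)  # (-dy, dx) is normal to the edge
    points: List[Point] = []
    for i in range(samples + 1):
        t = i / samples
        w0, w1, w2 = (1 - t) ** 2, 2 * t * (1 - t), t ** 2
        points.append((w0 * p0[0] + w1 * ctrl[0] + w2 * p1[0],
                       w0 * p0[1] + w1 * ctrl[1] + w2 * p1[1]))
    return points


def curved_fractal(tri: Triangle, depth: int, bulge: float) -> List[List[Point]]:
    """Return one closed curved outline per leaf triangle."""
    outlines: List[List[Point]] = []
    for a, b, c in subdivide(tri, depth):
        # Drop each edge's last point; it is the next edge's first point.
        outline = (bezier_edge(a, b, bulge)[:-1]
                   + bezier_edge(b, c, bulge)[:-1]
                   + bezier_edge(c, a, bulge)[:-1])
        outlines.append(outline)
    return outlines
```

Setting `bulge` to zero recovers the straight-edged fractal. In this sketch the two variants therefore differ by a single parameter, which is one way to frame a straight-versus-curved comparison.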

What carries the argument

The coupling of global curved-fractal geometry with local spiral-based appearance interference in the CFGPatch framework.

If this is right

  • CFGPatch fools VIS-IR VLMs more effectively than standard patch baselines while remaining robust under image transformations.
  • Adversarial samples optimized for zero-shot classification transfer successfully to image captioning and visual question answering.
  • The method demonstrates strong cross-task transferability and generalizability across downstream tasks.
  • Expectation over transformation improves robustness against common image-level transformations.
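The expectation-over-transformation point in the last bullet can be made concrete with a small Monte Carlo sketch. The transformation family (a brightness gain plus a circular shift), the toy loss `total_intensity`, and all function names are assumptions chosen for illustration. The paper's actual transformation set and VLM loss are not specified in this summary.

```python
import random
from typing import Callable, List

Image = List[List[float]]


def paste_patch(image: Image, patch: Image, top: int, left: int) -> Image:
    """Return a copy of `image` with `patch` written at (top, left)."""
    out = [row[:] for row in image]
    for i, patch_row in enumerate(patch):
        for j, value in enumerate(patch_row):
            out[top + i][left + j] = value
    return out


def sample_transform(rng: random.Random) -> Callable[[Image], Image]:
    """Draw one t ~ T: brightness gain plus circular shift (assumed family)."""
    gain = rng.uniform(0.5, 1.5)
    dy, dx = rng.randint(-2, 2), rng.randint(-2, 2)

    def transform(img: Image) -> Image:
        h, w = len(img), len(img[0])
        return [[gain * img[(y - dy) % h][(x - dx) % w] for x in range(w)]
                for y in range(h)]

    return transform


def total_intensity(img: Image) -> float:
    """Toy stand-in for a model loss; a real attack would query the VLM here."""
    return sum(sum(row) for row in img)


def eot_loss(image: Image, patch: Image, loss_fn: Callable[[Image], float],
             n_samples: int, rng: random.Random,
             top: int = 0, left: int = 0) -> float:
    """Monte Carlo estimate of E_{t~T}[ loss_fn(t(image with patch)) ]."""
    patched = paste_patch(image, patch, top, left)
    total = 0.0
    for _ in range(n_samples):
        total += loss_fn(sample_transform(rng)(patched))
    return total / n_samples
```

An optimizer would score each candidate patch by `eot_loss` instead of by the loss on a single untransformed image, which favors patches that stay effective across the sampled transformations.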

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Security evaluations of VIS-IR systems should test against geometric and textural perturbations of this form.
  • If cross-task transfer holds, robustness demonstrated on one task does not guarantee robustness on related tasks.
  • Similar geometric constructions could be examined for exposing weaknesses in other multimodal or cross-modal models.
  • Practical VIS-IR deployments may benefit from defenses tuned to fractal self-similarity and spiral interference patterns.

Load-bearing premise

That coupling curved fractal shapes with spiral texture distortions will disrupt shape and texture perception in VIS-IR VLMs beyond what standard patch methods achieve.

What would settle it

An ablation in which removing either the Bezier curves or the Fraser-spiral component leaves attack success rates essentially unchanged would show that the specific coupling is not responsible for the gains. Conversely, a drop to the level of standard patch baselines when either component is removed would support the coupling claim.

Figures

Figures reproduced from arXiv: 2605.22273 by Chao Li, Chengyin Hu, Fengyu Zhang, Jiahuan Long, Jiaju Han, Jiujiang Guo, Xiang Chen, Yiwei Wei, Yuxian Dong.

Figure 1: Shared curved-fractal construction and VIS–IR Fraser rendering of CFGPatch. Moving from detector-oriented VIS–IR attacks to semantic attacks on VIS–IR VLMs introduces several intertwined challenges. A unified patch should preserve a shared geometric identity across spectra, since using unrelated shapes for visible and infrared inputs would collapse the setting into independent single-modality attacks. In …
Figure 2: Overview of CFGPatch: a shared curved-edge fractal patch for VIS–IR inputs is optimized …
Figure 3: Qualitative classification examples in the VIS–IR setting. From top to bottom: Clean, …
Figure 4: Qualitative examples. (a) Captioning under clean and adversarial VIS–IR inputs. (b) VQA …
Figure 5: Patch variants in component ablation
Figure 6: Fractal-depth ablation results
Figure 8: Prompt template for evaluating semantic consistency in image captioning.
Figure 9: Prompt template for correctness evaluation in visual question answering.
Original abstract

Vision-language models (VLMs) have achieved strong performance across diverse multimodal tasks, but their adversarial robustness in visible-infrared (VIS-IR) scenarios remains underexplored. This gap is critical because VIS-IR sensing is widely used in real-world perception systems to support reliable understanding under challenging imaging conditions. To address this cross-modal threat setting, we propose CFGPatch, a curved-edge fractal geometric adversarial patch framework for attacking VIS-IR VLMs. CFGPatch builds on triangular fractal geometry and replaces rigid straight-edged primitives with Bezier-curved elements, preserving multi-scale fractal self-similarity while introducing smoother contours, richer directional variation, and more flexible shape deformation. In addition, we design a modality-specific Fraser-spiral rendering mechanism to inject fine-grained texture distortions and misleading perceptual cues into visible and infrared images. By coupling global curved-fractal geometry with local spiral-based appearance interference, CFGPatch disrupts both shape perception and texture interpretation. We further adopt expectation over transformation (EOT) to improve robustness against common image-level transformations. Extensive experiments show that CFGPatch effectively fools VIS-IR VLMs and consistently outperforms standard patch baselines in attack effectiveness and robustness. Moreover, adversarial samples optimized for zero-shot classification transfer well to image captioning and visual question answering, demonstrating strong cross-task transferability and generalizability across downstream tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes CFGPatch, a curved-edge fractal geometric adversarial patch for visible-infrared VLMs. It replaces straight-edged triangular fractals with Bezier-curved elements to preserve self-similarity while adding smoother contours and directional variation, and introduces a modality-specific Fraser-spiral rendering mechanism for texture distortions in VIS and IR images. The method couples global geometry with local appearance interference, uses EOT for robustness to transformations, and reports that the resulting patches outperform standard baselines on zero-shot classification while transferring to image captioning and VQA tasks.

Significance. If the empirical gains are shown to stem specifically from the curved-fractal plus spiral coupling rather than parameterization complexity, the work would usefully expose vulnerabilities in cross-modal VIS-IR VLMs and provide a concrete geometric attack framework with demonstrated cross-task transfer. The emphasis on real-world sensing conditions and EOT robustness is a positive step toward practical relevance.

major comments (3)
  1. [§4] §4 (Experiments) and associated tables: the central claim that CFGPatch outperforms baselines due to the curved-fractal + Fraser-spiral coupling requires an ablation that holds the number of control points, boundary smoothness, and optimization budget fixed while varying only the geometric primitives. Without such a control, the reported improvements cannot be attributed to the specific framework rather than increased degrees of freedom.
  2. [§3.2] §3.2 (Fraser-spiral rendering): the description of modality-specific texture injection lacks a quantitative measure (e.g., Fourier spectrum or perceptual metric) showing that the spiral patterns produce distinct disruptions in VIS versus IR channels beyond what a generic high-frequency noise patch would achieve.
  3. [Table 2] Table 2 (cross-task transfer results): the transfer from classification-optimized patches to captioning and VQA is presented without reporting the attack success rate on the source task for the transferred samples, making it impossible to assess whether the observed transfer reflects genuine generalizability or simply weaker target-task performance.
minor comments (2)
  1. [Abstract] The abstract and §1 claim 'extensive experiments' but the provided text does not include error bars, dataset sizes, or exact baseline implementations; these details should be added for reproducibility.
  2. [§3] Notation for the Bezier curve control points and the spiral frequency parameters is introduced without a consolidated table of symbols.
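Major comment 2 asks for a quantitative measure such as a Fourier spectrum. One simple candidate, offered as an illustration and not as the authors' analysis, is a radially binned power spectrum: the share of non-DC spectral power that falls at each integer radial frequency. The function names below are hypothetical. The sketch uses a naive standard-library DFT to stay self-contained; `numpy.fft.fft2` would be the usual choice for real patches.

```python
import cmath
import math
from typing import List

Image = List[List[float]]


def dft(seq: List[complex]) -> List[complex]:
    """Naive O(n^2) discrete Fourier transform of a 1D sequence."""
    n = len(seq)
    return [sum(seq[k] * cmath.exp(-2j * cmath.pi * u * k / n) for k in range(n))
            for u in range(n)]


def dft2(img: Image) -> List[List[complex]]:
    """Separable 2D DFT: transform every row, then every column."""
    h, w = len(img), len(img[0])
    rows = [dft(list(row)) for row in img]
    cols = [dft([rows[y][x] for y in range(h)]) for x in range(w)]
    return [[cols[x][v] for x in range(w)] for v in range(h)]


def radial_power_shares(img: Image) -> List[float]:
    """Share of non-DC power per integer radial frequency (cycles per image)."""
    h, w = len(img), len(img[0])
    mean = sum(sum(row) for row in img) / (h * w)
    spec = dft2([[value - mean for value in row] for row in img])
    n_bins = int(round(math.hypot(h // 2, w // 2))) + 1
    power = [0.0] * n_bins
    for v in range(h):
        fv = v if v <= h // 2 else v - h  # signed vertical frequency
        for u in range(w):
            fu = u if u <= w // 2 else u - w  # signed horizontal frequency
            power[int(round(math.hypot(fu, fv)))] += abs(spec[v][u]) ** 2
    total = sum(power)
    return [p / total for p in power] if total > 0 else power
```

Applied separately to the rendered VIS patch, the rendered IR patch, and a noise patch of matching variance, the three share vectors could be compared bin by bin. Whether the spiral rendering differs from generic high-frequency noise in this measure is an empirical question the sketch does not answer.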

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the insightful comments and suggestions. We address each major comment in detail below and have revised the manuscript to incorporate the recommended improvements where applicable.

Point-by-point responses
  1. Referee: [§4] §4 (Experiments) and associated tables: the central claim that CFGPatch outperforms baselines due to the curved-fractal + Fraser-spiral coupling requires an ablation that holds the number of control points, boundary smoothness, and optimization budget fixed while varying only the geometric primitives. Without such a control, the reported improvements cannot be attributed to the specific framework rather than increased degrees of freedom.

    Authors: We agree that a controlled ablation is necessary to isolate the effect of the curved-fractal geometry from increased parameterization. In the revised manuscript, we have added an ablation study in §4 that maintains fixed control points, boundary smoothness, and optimization budget, varying only the geometric primitives (e.g., straight vs. curved). The results confirm that the performance gains are attributable to the Bezier-curved fractal design. revision: yes

  2. Referee: [§3.2] §3.2 (Fraser-spiral rendering): the description of modality-specific texture injection lacks a quantitative measure (e.g., Fourier spectrum or perceptual metric) showing that the spiral patterns produce distinct disruptions in VIS versus IR channels beyond what a generic high-frequency noise patch would achieve.

    Authors: We appreciate this suggestion for strengthening the analysis. We have incorporated quantitative measures, including Fourier spectrum comparisons and perceptual metrics such as SSIM and LPIPS, in the revised §3.2 to demonstrate the distinct disruptions caused by the modality-specific Fraser-spiral patterns in VIS and IR channels compared to generic high-frequency noise. revision: yes

  3. Referee: [Table 2] Table 2 (cross-task transfer results): the transfer from classification-optimized patches to captioning and VQA is presented without reporting the attack success rate on the source task for the transferred samples, making it impossible to assess whether the observed transfer reflects genuine generalizability or simply weaker target-task performance.

    Authors: This is a valid point for clarifying the transfer results. We have updated Table 2 to include the attack success rates on the source classification task for the patches transferred to captioning and VQA tasks. This additional information helps demonstrate that the transfer reflects genuine cross-task generalizability rather than just weaker performance on target tasks. revision: yes
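The source-task reporting discussed in the third response can be sketched as follows. The per-sample success flags are assumed inputs, and the function name `transfer_report` and its output keys are hypothetical; the paper's own metric definitions are not given in this summary.

```python
from typing import Dict, Sequence


def transfer_report(source_fooled: Sequence[bool],
                    target_fooled: Sequence[bool]) -> Dict[str, float]:
    """Summarize cross-task transfer for the same set of adversarial samples.

    source_fooled[i]: sample i fooled the source task (e.g. classification).
    target_fooled[i]: sample i fooled the target task (e.g. captioning or VQA).
    """
    if len(source_fooled) != len(target_fooled):
        raise ValueError("flag lists must describe the same samples")
    n = len(source_fooled)
    if n == 0:
        raise ValueError("no samples")
    n_source = sum(1 for s in source_fooled if s)
    n_target = sum(1 for t in target_fooled if t)
    n_both = sum(1 for s, t in zip(source_fooled, target_fooled) if s and t)
    return {
        "source_asr": n_source / n,
        "target_asr": n_target / n,
        # Among samples that succeed on the source task, how many also
        # succeed on the target task; undefined if none succeed.
        "transfer_given_source": n_both / n_source if n_source else float("nan"),
    }
```

Reporting `transfer_given_source` next to the raw `target_asr` separates two cases the referee raises: target failures caused by the transferred patch, and target failures that occur regardless of whether the patch worked on the source task.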

Circularity Check

0 steps flagged

No circularity: empirical attack method validated externally

full rationale

The paper proposes CFGPatch by describing design choices (Bezier-curved fractal elements plus Fraser-spiral texture injection, plus EOT) and then reports experimental comparisons against standard patch baselines on VIS-IR VLMs for classification, captioning, and VQA. No equations, first-principles derivations, or fitted parameters are presented whose outputs are then relabeled as predictions. The central claims rest on measured attack success rates and transfer performance rather than any self-referential reduction. Any self-citations (if present in the full text) are not load-bearing because the effectiveness claims are falsifiable via the reported experiments against external baselines. This is a standard empirical contribution with no detectable circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on abstract only; the design relies on geometric modeling choices whose specific parameters and assumptions are not detailed.

pith-pipeline@v0.9.0 · 5802 in / 1033 out tokens · 41755 ms · 2026-05-22T07:07:06.924623+00:00 · methodology



Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages

  1. [1]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine...

  2. [2]

    Scaling up visual and vision-language representation learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In Marina Meila and Tong Zhang, editors,Proceedings of the 38th International Conference on Machine Learning, volume 139 ofProceedings of Machine...

  3. [3]

    Align before fuse: Vision and language representation learning with momentum distillation

    Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language representation learning with momentum distillation. In Marc’Aurelio Ranzato, Alina Beygelzimer, Yann Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors,Advances in Neural Information Processing Systems, v...

  4. [4]

    Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors,Proceedings of the 39th International Conference on Machine Learning, volume 162 ofProceedings...

  5. [5]

    BLIP-2: Bootstrapping language-image pre- training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre- training with frozen image encoders and large language models. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors,Proceedings of the 40th International Conference on Machine Learning, volume 202 o...

  6. [6]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob L Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikołaj Bi´nko...

  7. [7]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors,Advances in Neural Information Processing Systems, volume 36, pages 34892–34916. Curran Associates, Inc., 2023

  8. [8]

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), ...

  9. [9]

    Benchmarking robustness of adaptation methods on pre-trained vision-language models

    Shuo Chen, Jindong Gu, Zhen Han, Yunpu Ma, Philip Torr, and V olker Tresp. Benchmarking robustness of adaptation methods on pre-trained vision-language models. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors,Advances in Neural Information Processing Systems, volume 36, pages 51758–51777. Curran Associate...

  10. [10]

    Chain of attack: On the robustness of vision-language models against transfer-based adversarial attacks

    Peng Xie, Yequan Bie, Jianda Mao, Yangqiu Song, Yang Wang, Hao Chen, and Kani Chen. Chain of attack: On the robustness of vision-language models against transfer-based adversarial attacks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14679–14689, June 2025

  11. [11]

    Analyzing the robustness of vision & language models.IEEE/ACM Transactions on Audio, Speech, and Language Processing, 32: 2751–2763, 2024

    Alexander Shirnin, Nikita Andreev, Sofia Potapova, and Ekaterina Artemova. Analyzing the robustness of vision & language models.IEEE/ACM Transactions on Audio, Speech, and Language Processing, 32: 2751–2763, 2024. doi: 10.1109/TASLP.2024.3399061

  12. [12]

    On evaluating adversarial robustness of large vision-language models

    Yunqing Zhao, Tianyu Pang, Chao Du, Xiao Yang, Chongxuan Li, Ngai-Man (Man) Cheung, and Min Lin. On evaluating adversarial robustness of large vision-language models. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors,Advances in Neural Information Processing Systems, volume 36, pages 54111–54138. Curran As...

  13. [13]

    Enhancing the robustness of vision-language foundation models by alignment perturbation.IEEE Transactions on Information Forensics and Security, 20:7091–7105, 2025

    Cong Zhang, Shuhui Wang, Xiaodan Li, Yao Zhu, Honggang Qi, and Qingming Huang. Enhancing the robustness of vision-language foundation models by alignment perturbation.IEEE Transactions on Information Forensics and Security, 20:7091–7105, 2025. doi: 10.1109/TIFS.2025.3586430

  14. [14]

    Jailbreak vision language models via bi-modal adversarial prompt, 2024

    Zonghao Ying, Aishan Liu, Tianyuan Zhang, Zhengmin Yu, Siyuan Liang, Xianglong Liu, and Dacheng Tao. Jailbreak vision language models via bi-modal adversarial prompt, 2024

  15. [15]

    Huanchen Wang, Wencheng Zhang, Zhiqiang Wang, Zhicong Lu, and Yuxin Ma. Vismodai: Visual analytics for evaluating and improving corruption robustness of vision-language models.IEEE Transactions on Visualization and Computer Graphics, 32(1):615–625, 2026. doi: 10.1109/TVCG.2025.3634257

  16. [16]

    Piafusion: A progressive infrared and visible image fusion network based on illumination aware.Information Fusion, 83-84:79–92, 2022

    Linfeng Tang, Jiteng Yuan, Hao Zhang, Xingyu Jiang, and Jiayi Ma. Piafusion: A progressive infrared and visible image fusion network based on illumination aware.Information Fusion, 83-84:79–92, 2022. ISSN 1566-2535. doi: 10.1016/j.inffus.2022.03.007

  17. [17]

    Object fusion tracking based on visible and infrared images: A comprehensive review.Information Fusion, 63:166–187, 2020

    Xingchen Zhang, Ping Ye, Henry Leung, Ke Gong, and Gang Xiao. Object fusion tracking based on visible and infrared images: A comprehensive review.Information Fusion, 63:166–187, 2020. ISSN 1566-2535. doi: 10.1016/j.inffus.2020.05.002

  18. [18]

    Unified adversarial patch for cross-modal attacks in the physical world

    Xingxing Wei, Yao Huang, Yitong Sun, and Jie Yu. Unified adversarial patch for cross-modal attacks in the physical world. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4445–4454, October 2023

  19. [19]

    Two-stage optimized unified adversarial patch for attacking visible-infrared cross-modal detectors in the physical world.Applied Soft Computing, 171:112818, 2025

    Chengyin Hu, Weiwen Shi, Wen Yao, Tingsong Jiang, Ling Tian, and Wen Li. Two-stage optimized unified adversarial patch for attacking visible-infrared cross-modal detectors in the physical world.Applied Soft Computing, 171:112818, 2025. ISSN 1568-4946. doi: 10.1016/j.asoc.2025.112818

  20. [20]

    Cdupatch: Color-driven universal adversarial patch attack for dual-modal visible- infrared detectors

    Jiahuan Long, Wen Yao, Tingsong Jiang, Jiacheng Hou, Shuai Jia, Junqi Wu, Xiaoya Zhang, Xiaohu Zheng, and Chao Ma. Cdupatch: Color-driven universal adversarial patch attack for dual-modal visible- infrared detectors. InProceedings of the 33rd ACM International Conference on Multimedia, pages 1462–1470, New York, NY , USA, 2025. Association for Computing M...

  21. [21]

    Fine-grained semantically aligned vision-language pre-training

    Juncheng Li, Xin He, Longhui Wei, Long Qian, Linchao Zhu, Lingxi Xie, Yueting Zhuang, Qi Tian, and Siliang Tang. Fine-grained semantically aligned vision-language pre-training. In Sanmi Koyejo, Shakir Mohamed, Anima Agarwal, Danielle Belgrave, Kyunghyun Cho, and Alice Oh, editors,Advances in Neural Information Processing Systems, volume 35, pages 7290–730...

  22. [22]

    Zou, and Tatsunori Hashimoto

    Ian Covert, Tony Sun, James Y . Zou, and Tatsunori Hashimoto. Locality alignment improves vision- language models. InInternational Conference on Learning Representations, 2025

  23. [23]

    Assessing and learning alignment of unimodal vision and language models

    Le Zhang, Qian Yang, and Aishwarya Agrawal. Assessing and learning alignment of unimodal vision and language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14604–14614, June 2025

  24. [24]

    Eberhart

    James Kennedy and Russell C. Eberhart. Particle swarm optimization. InProceedings of ICNN’95 - International Conference on Neural Networks, volume 4, pages 1942–1948, 1995. doi: 10.1109/ICNN. 1995.488968

  25. [25]

    Synthesizing robust adversarial examples

    Anish Athalye, Logan Engstrom, Andrew Ilyas, and Kevin Kwok. Synthesizing robust adversarial examples. In Jennifer Dy and Andreas Krause, editors,Proceedings of the 35th International Conference on Machine Learning, volume 80 ofProceedings of Machine Learning Research, pages 284–293. PMLR, 10–15 Jul 2018

  26. [26]

    Learning to prompt for vision- language models.International Journal of Computer Vision, 130(9):2337–2348, jul 2022

    Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision- language models.International Journal of Computer Vision, 130(9):2337–2348, jul 2022. doi: 10.1007/ s11263-022-01653-1

  27. [27]

    Conditional prompt learning for vision-language models

    Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16816–16825, June 2022

  28. [28]

    Maple: Multi-modal prompt learning

    Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. Maple: Multi-modal prompt learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19113–19122, June 2023. 11

  29. [29]

    Self-regulating prompts: Foundational model adaptation without forgetting

    Muhammad Uzair Khattak, Syed Talal Wasim, Muzammal Naseer, Salman Khan, Ming-Hsuan Yang, and Fahad Shahbaz Khan. Self-regulating prompts: Foundational model adaptation without forgetting. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 15190–15200, October 2023

  30. [30]

    Benchmarking multimodal large language models against image corruptions

    Xinkuan Qiu, Meina Kan, Yongbin Zhou, and Shiguang Shan. Benchmarking multimodal large language models against image corruptions. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9014–9023, October 2025

  31. [31]

    Analysing the robustness of vision-language-models to common corruptions, 2025

    Muhammad Usama, Syeda Aishah Asim, Syed Bilal Ali, Syed Talal Wasim, and Umair Bin Mansoor. Analysing the robustness of vision-language-models to common corruptions, 2025

  32. [32]

    MMT-bench: A comprehensive multimodal benchmark for evaluating large vision-language models towards multitask AGI

    Kaining Ying, Fanqing Meng, Jin Wang, Zhiqian Li, Han Lin, Yue Yang, Hao Zhang, Wenbo Zhang, Yuqi Lin, Shuo Liu, Jiayi Lei, Quanfeng Lu, Runjian Chen, Peng Xu, Renrui Zhang, Haozhe Zhang, Peng Gao, Yali Wang, Yu Qiao, Ping Luo, Kaipeng Zhang, and Wenqi Shao. MMT-bench: A comprehensive multimodal benchmark for evaluating large vision-language models toward...

  33. [33]

    Brown, Dandelion Mané, Aurko Roy, Martín Abadi, and Justin Gilmer

    Tom B. Brown, Dandelion Mané, Aurko Roy, Martín Abadi, and Justin Gilmer. Adversarial patch, 2017

  34. [34]

    Robust physical-world attacks on deep learning visual classification

    Kevin Eykholt, Ivan Evtimov, Earlence Fernandes, Bo Li, Amir Rahmati, Chaowei Xiao, Atul Prakash, Tadayoshi Kohno, and Dawn Song. Robust physical-world attacks on deep learning visual classification. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1625–1634, June 2018

  35. [35]

    Shapeshifter: Robust physical adversarial attack on faster r-cnn object detector

    Shang-Tse Chen, Cory Cornelius, Jason Martin, and Duen Horng (Polo) Chau. Shapeshifter: Robust physical adversarial attack on faster r-cnn object detector. In Michele Berlingerio, Francesco Bonchi, Thomas Gärtner, Neil Hurley, and Georgiana Ifrim, editors,Machine Learning and Knowledge Discovery in Databases, pages 52–68, Cham, 2019. Springer Internationa...

  36. [36]

    DPatch: An adversarial patch attack on object detectors

    Xin Liu, Huanrui Yang, Ziwei Liu, Linghao Song, Hai Li, and Yiran Chen. DPatch: An adversarial patch attack on object detectors. InProceedings of the AAAI Workshop on Artificial Intelligence Safety (SafeAI 2019), volume 2301 ofCEUR Workshop Proceedings. CEUR-WS, 2019

  37. [37]

    Lutz and Elvira Mayordomo

    Jack H. Lutz and Elvira Mayordomo. Dimensions of points in self-similar fractals.SIAM Journal on Computing, 38(3):1080–1112, 2008. doi: 10.1137/070684689

  38. [38]

    Shadows can be dangerous: Stealthy and effective physical-world adversarial attack by natural phenomenon

    Yiqi Zhong, Xianming Liu, Deming Zhai, Junjun Jiang, and Xiangyang Ji. Shadows can be dangerous: Stealthy and effective physical-world adversarial attack by natural phenomenon. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15345–15354, June 2022

  39. [39]

    Natural light can also be dangerous: Traffic sign misinterpretation under adversarial natural light attacks

    Teng-Fang Hsiao, Bo-Lun Huang, Zi-Xiang Ni, Yan-Ting Lin, Hong-Han Shuai, Yung-Hui Li, and Wen- Huang Cheng. Natural light can also be dangerous: Traffic sign misinterpretation under adversarial natural light attacks. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 3915–3924, January 2024

  40. [40]

    When lighting deceives: Exposing vision-language models’ illumination vulnerability through illumination transformation attack

    Hanqing Liu, Shouwei Ruan, Yao Huang, Shiji Zhao, and Xingxing Wei. When lighting deceives: Exposing vision-language models’ illumination vulnerability through illumination transformation attack. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 10485– 10495, October 2025

  41. [41]

    Shouwei Ruan, Hanqing Liu, Yao Huang, Xiaoqi Wang, Caixin Kang, Hang Su, Yinpeng Dong, and Xingxing Wei. Advdreamer unveils: Are vision-language models truly ready for real-world 3d variations? InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 7894–7904, October 2025

  42. [42]

    Infrared-LLaV A: Enhanc- ing understanding of infrared images in multi-modal large language models

    Shixin Jiang, Zerui Chen, Jiafeng Liang, Yanyan Zhao, Ming Liu, and Bing Qin. Infrared-LLaV A: Enhanc- ing understanding of infrared images in multi-modal large language models. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Findings of the Association for Computational Linguistics: EMNLP 2024, pages 8573–8591, Miami, Florida, USA, nov 2024...

  43. [43]

    Irgpt: Understanding real-world infrared image with bi-cross- modal curriculum on large-scale benchmark

    Zhe Cao, Jin Zhang, and Ruiheng Zhang. Irgpt: Understanding real-world infrared image with bi-cross- modal curriculum on large-scale benchmark. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 166–176, October 2025. 12

  44. [44]

    Revealing physical-world semantic vulnerabilities: Universal adversarial patches for infrared vision-language models, 2026

    Chengyin Hu, Yuxian Dong, Yikun Guo, Xiang Chen, Junqi Wu, Jiahuan Long, Yiwei Wei, Tingsong Jiang, and Wen Yao. Revealing physical-world semantic vulnerabilities: Universal adversarial patches for infrared vision-language models, 2026

  45. [45]

    Microsoft COCO: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars, editors, Computer Vision – ECCV 2014, pages 740–755, Cham, 2014. Springer International Publishing. ISBN 978-3-319-10602-1

  46. [46]

    Microsoft coco captions: Data collection and evaluation server, 2015

    Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, and C. Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server, 2015

  47. [47]

    Dr-avit: Toward diverse and realistic aerial visible-to-infrared image translation

    Zonghao Han, Shun Zhang, Yuru Su, Xiaoning Chen, and Shaohui Mei. Dr-avit: Toward diverse and realistic aerial visible-to-infrared image translation. IEEE Transactions on Geoscience and Remote Sensing, 62:1–13, 2024. doi: 10.1109/TGRS.2024.3405989

  48. [48]

    Reproducible scaling laws for contrastive language-image learning

    Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scaling laws for contrastive language-image learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2829, June 2023

  49. [49]

    Demystifying clip data

    Hu Xu, Saining Xie, Xiaoqing Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, and Christoph Feichtenhofer. Demystifying clip data. In International Conference on Learning Representations, 2024

  50. [50]

    Eva-clip: Improved training techniques for clip at scale, 2023

    Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. Eva-clip: Improved training techniques for clip at scale, 2023

  51. [51]

    Llava-next: Improved reasoning, ocr, and world knowledge, January 2024

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024

  52. [52]

    Openflamingo: An open-source framework for training large autoregressive vision-language models, 2023

    Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Wortsman, and Ludwig Schmidt. Openflamingo: An open-source framework for training large autoregressive vision-language models, 2023

  53. [53]

    Instructblip: Towards general-purpose vision-language models with instruction tuning

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems,...

  54. [54]

    LoRA: Low-rank adaptation of large language models

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, 2022

  55. [55]

    Representation learning with contrastive predictive coding, 2018

    Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding, 2018

  56. [56]

    LLMs instead of human judges? a large scale empirical study across 20 NLP evaluation tasks

    Anna Bavaresco, Raffaella Bernardi, Leonardo Bertolazzi, Desmond Elliott, Raquel Fernández, Albert Gatt, Esam Ghaleb, Mario Giulianelli, Michael Hanna, Alexander Koller, Andre Martins, Philipp Mondorf, Vera Neplenbroek, Sandro Pezzelle, Barbara Plank, David Schlangen, Alessandro Suglia, Aditya K Surikuchi, Ece Takmaz, and Alberto Testoni. LLMs instead of ...