Exposing Vulnerabilities in Visible-Infrared VLMs: A Unified Geometric Adversarial Framework with Cross-Task Transferability

Chao Li; Chengyin Hu; Fengyu Zhang; Jiahuan Long; Jiaju Han; Jiujiang Guo; Xiang Chen; Yiwei Wei; Yuxian Dong

arxiv: 2605.22273 · v1 · pith:ZLZ4RY5Znew · submitted 2026-05-21 · 💻 cs.CV

Exposing Vulnerabilities in Visible-Infrared VLMs: A Unified Geometric Adversarial Framework with Cross-Task Transferability

Xiang Chen , Yuxian Dong , Chao Li , Chengyin Hu , Jiaju Han , Fengyu Zhang , Yiwei Wei , Jiahuan Long

show 1 more author

Jiujiang Guo

This is my paper

Pith reviewed 2026-05-22 07:07 UTC · model grok-4.3

classification 💻 cs.CV

keywords adversarial patchesvision-language modelsvisible-infraredfractal geometrycross-task transferabilitymultimodal attacksBeizer curvesEOT robustness

0 comments

The pith

Curved fractal geometry with spiral textures creates adversarial patches that fool visible-infrared vision-language models and transfer to other tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CFGPatch, an adversarial patch framework that combines curved fractal geometry with spiral-based texture distortions to attack vision-language models processing both visible and infrared images. This matters because VIS-IR sensing supports reliable perception in challenging real-world conditions, and stronger attacks highlight where these models remain vulnerable. The design replaces straight edges with Bezier-curved fractal elements for smoother contours and richer variation while adding modality-specific Fraser spirals to mislead texture interpretation, all optimized with expectation over transformation for robustness. Experiments show it outperforms standard patches and that examples trained on zero-shot classification succeed on image captioning and visual question answering.

Core claim

CFGPatch builds on triangular fractal geometry and replaces rigid straight-edged primitives with Bezier-curved elements, preserving multi-scale fractal self-similarity while introducing smoother contours, richer directional variation, and more flexible shape deformation; it pairs this global structure with a modality-specific Fraser-spiral rendering mechanism to inject fine-grained texture distortions and misleading perceptual cues into visible and infrared images, coupling the two to disrupt both shape perception and texture interpretation, and adopts expectation over transformation to improve robustness against common image-level transformations.

What carries the argument

The coupling of global curved-fractal geometry with local spiral-based appearance interference in the CFGPatch framework.

If this is right

CFGPatch fools VIS-IR VLMs more effectively than standard patch baselines while remaining robust under image transformations.
Adversarial samples optimized for zero-shot classification transfer successfully to image captioning and visual question answering.
The method demonstrates strong cross-task transferability and generalizability across downstream tasks.
Expectation over transformation improves robustness against common image-level transformations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Security evaluations of VIS-IR systems should test against geometric and textural perturbations of this form.
Robustness achieved on one task may not extend to related tasks if cross-task transfer holds.
Similar geometric constructions could be examined for exposing weaknesses in other multimodal or cross-modal models.
Practical VIS-IR deployments may benefit from defenses tuned to fractal self-similarity and spiral interference patterns.

Load-bearing premise

That coupling curved fractal shapes with spiral texture distortions will disrupt shape and texture perception in VIS-IR VLMs beyond what standard patch methods achieve.

What would settle it

An experiment in which removing either the Bezier curves or the Fraser-spiral component causes attack success rates to drop to the level of standard patch baselines would show that the specific coupling is not responsible for the gains.

Figures

Figures reproduced from arXiv: 2605.22273 by Chao Li, Chengyin Hu, Fengyu Zhang, Jiahuan Long, Jiaju Han, Jiujiang Guo, Xiang Chen, Yiwei Wei, Yuxian Dong.

**Figure 1.** Figure 1: Shared curved-fractal construction and VIS–IR Fraser rendering of CFGPatch. Moving from detector-oriented VIS–IR attacks to semantic attacks on VIS–IR VLMs introduces several intertwined challenges. A unified patch should preserve a shared geometric identity across spectra, since using unrelated shapes for visible and infrared inputs would collapse the setting into independent single-modality attacks. In … view at source ↗

**Figure 2.** Figure 2: Overview of CFGPatch: a shared curved–edge fractal patch for VIS–IR inputs is optimized [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Qualitative classification examples in the VIS–IR setting. From top to bottom: Clean, [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Qualitative examples. (a) Captioning under clean and adversarial VIS–IR inputs. (b) VQA [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Patch variants in component ablation [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 6.** Figure 6: Fractal-depth ablation results [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗

**Figure 8.** Figure 8: Prompt template for evaluating semantic consistency in image captioning. [PITH_FULL_IMAGE:figures/full_fig_p014_8.png] view at source ↗

**Figure 9.** Figure 9: Prompt template for correctness evaluation in visual question answering. [PITH_FULL_IMAGE:figures/full_fig_p015_9.png] view at source ↗

read the original abstract

Vision-language models (VLMs) have achieved strong performance across diverse multimodal tasks, but their adversarial robustness in visible-infrared (VIS-IR) scenarios remains underexplored. This gap is critical because VIS-IR sensing is widely used in real-world perception systems to support reliable understanding under challenging imaging conditions. To address this cross-modal threat setting, we propose CFGPatch, a curved-edge fractal geometric adversarial patch framework for attacking VIS-IR VLMs. CFGPatch builds on triangular fractal geometry and replaces rigid straight-edged primitives with Bezier-curved elements, preserving multi-scale fractal self-similarity while introducing smoother contours, richer directional variation, and more flexible shape deformation. In addition, we design a modality-specific Fraser-spiral rendering mechanism to inject fine-grained texture distortions and misleading perceptual cues into visible and infrared images. By coupling global curved-fractal geometry with local spiral-based appearance interference, CFGPatch disrupts both shape perception and texture interpretation. We further adopt expectation over transformation (EOT) to improve robustness against common image-level transformations. Extensive experiments show that CFGPatch effectively fools VIS-IR VLMs and consistently outperforms standard patch baselines in attack effectiveness and robustness. Moreover, adversarial samples optimized for zero-shot classification transfer well to image captioning and visual question answering, demonstrating strong cross-task transferability and generalizability across downstream tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CFGPatch introduces curved Bezier fractals plus modality-specific Fraser spirals for VIS-IR VLM attacks and claims cross-task transfer, but the abstract gives no numbers to show the geometry itself drives the gains.

read the letter

The punchline on this paper is that it presents CFGPatch as an adversarial patch method for visible and infrared vision-language models. The approach replaces straight fractal edges with Bezier curves and adds a Fraser-spiral texture renderer tailored to each modality. The authors claim this fools the models more effectively than standard patches and that the attacks transfer from zero-shot classification to captioning and visual question answering. What is actually new seems to be the specific coupling of curved fractal geometry with the spiral-based local interference. Prior patch work has used fractals or spirals separately, but this combination for cross-modal VIS-IR is not something described in the referenced literature. The paper does well to motivate the problem with real-world sensing applications and to test transferability across tasks, which adds to its practical angle. The soft spots are mainly around the evidence. The abstract states that extensive experiments demonstrate outperformance and robustness, yet it provides no quantitative results, baseline comparisons, or dataset specifics. This makes it hard to evaluate the central claim. The stress-test concern holds some weight here: the improvements could come from the increased flexibility in patch shape and rendering parameters rather than the curved-fractal and spiral elements themselves. Without ablations that control for the number of degrees of freedom or optimization steps, the geometric motivation stays unproven. A reader focused on adversarial robustness in multimodal systems, especially those involving thermal or low-light conditions, would get some value from the method description. It could serve as a starting point for further experiments even if the current results need more support. The work shows clear thinking on the attack design and engages with the relevant literature on geometric perturbations and expectation over transformations. It is not incoherent on its own terms. I would bring this to a reading group to discuss the implementation details and potential ablations. I recommend sending it to peer review. The topic is timely and the framework has enough originality to warrant referee input, particularly on strengthening the experimental validation.

Referee Report

3 major / 2 minor

Summary. The paper proposes CFGPatch, a curved-edge fractal geometric adversarial patch for visible-infrared VLMs. It replaces straight-edged triangular fractals with Bezier-curved elements to preserve self-similarity while adding smoother contours and directional variation, and introduces a modality-specific Fraser-spiral rendering mechanism for texture distortions in VIS and IR images. The method couples global geometry with local appearance interference, uses EOT for robustness to transformations, and reports that the resulting patches outperform standard baselines on zero-shot classification while transferring to image captioning and VQA tasks.

Significance. If the empirical gains are shown to stem specifically from the curved-fractal plus spiral coupling rather than parameterization complexity, the work would usefully expose vulnerabilities in cross-modal VIS-IR VLMs and provide a concrete geometric attack framework with demonstrated cross-task transfer. The emphasis on real-world sensing conditions and EOT robustness is a positive step toward practical relevance.

major comments (3)

[§4] §4 (Experiments) and associated tables: the central claim that CFGPatch outperforms baselines due to the curved-fractal + Fraser-spiral coupling requires an ablation that holds the number of control points, boundary smoothness, and optimization budget fixed while varying only the geometric primitives. Without such a control, the reported improvements cannot be attributed to the specific framework rather than increased degrees of freedom.
[§3.2] §3.2 (Fraser-spiral rendering): the description of modality-specific texture injection lacks a quantitative measure (e.g., Fourier spectrum or perceptual metric) showing that the spiral patterns produce distinct disruptions in VIS versus IR channels beyond what a generic high-frequency noise patch would achieve.
[Table 2] Table 2 (cross-task transfer results): the transfer from classification-optimized patches to captioning and VQA is presented without reporting the attack success rate on the source task for the transferred samples, making it impossible to assess whether the observed transfer reflects genuine generalizability or simply weaker target-task performance.

minor comments (2)

[Abstract] The abstract and §1 claim 'extensive experiments' but the provided text does not include error bars, dataset sizes, or exact baseline implementations; these details should be added for reproducibility.
[§3] Notation for the Bezier curve control points and the spiral frequency parameters is introduced without a consolidated table of symbols.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the insightful comments and suggestions. We address each major comment in detail below and have revised the manuscript to incorporate the recommended improvements where applicable.

read point-by-point responses

Referee: [§4] §4 (Experiments) and associated tables: the central claim that CFGPatch outperforms baselines due to the curved-fractal + Fraser-spiral coupling requires an ablation that holds the number of control points, boundary smoothness, and optimization budget fixed while varying only the geometric primitives. Without such a control, the reported improvements cannot be attributed to the specific framework rather than increased degrees of freedom.

Authors: We agree that a controlled ablation is necessary to isolate the effect of the curved-fractal geometry from increased parameterization. In the revised manuscript, we have added an ablation study in §4 that maintains fixed control points, boundary smoothness, and optimization budget, varying only the geometric primitives (e.g., straight vs. curved). The results confirm that the performance gains are attributable to the Bezier-curved fractal design. revision: yes
Referee: [§3.2] §3.2 (Fraser-spiral rendering): the description of modality-specific texture injection lacks a quantitative measure (e.g., Fourier spectrum or perceptual metric) showing that the spiral patterns produce distinct disruptions in VIS versus IR channels beyond what a generic high-frequency noise patch would achieve.

Authors: We appreciate this suggestion for strengthening the analysis. We have incorporated quantitative measures, including Fourier spectrum comparisons and perceptual metrics such as SSIM and LPIPS, in the revised §3.2 to demonstrate the distinct disruptions caused by the modality-specific Fraser-spiral patterns in VIS and IR channels compared to generic high-frequency noise. revision: yes
Referee: [Table 2] Table 2 (cross-task transfer results): the transfer from classification-optimized patches to captioning and VQA is presented without reporting the attack success rate on the source task for the transferred samples, making it impossible to assess whether the observed transfer reflects genuine generalizability or simply weaker target-task performance.

Authors: This is a valid point for clarifying the transfer results. We have updated Table 2 to include the attack success rates on the source classification task for the patches transferred to captioning and VQA tasks. This additional information helps demonstrate that the transfer reflects genuine cross-task generalizability rather than just weaker performance on target tasks. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical attack method validated externally

full rationale

The paper proposes CFGPatch by describing design choices (Bezier-curved fractal elements plus Fraser-spiral texture injection, plus EOT) and then reports experimental comparisons against standard patch baselines on VIS-IR VLMs for classification, captioning, and VQA. No equations, first-principles derivations, or fitted parameters are presented whose outputs are then relabeled as predictions. The central claims rest on measured attack success rates and transfer performance rather than any self-referential reduction. Any self-citations (if present in the full text) are not load-bearing because the effectiveness claims are falsifiable via the reported experiments against external baselines. This is a standard empirical contribution with no detectable circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on abstract only; the design relies on geometric modeling choices whose specific parameters and assumptions are not detailed.

pith-pipeline@v0.9.0 · 5802 in / 1033 out tokens · 41755 ms · 2026-05-22T07:07:06.924623+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean (SphereAdmitsCircleLinking, D=3 forcing) alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

CFGPatch takes triangular fractal geometry as its base and transforms rigid straight-edged primitives into Bezier-curved elements, preserving fractal self-similarity... modality-specific Fraser-spiral rendering mechanism
IndisputableMonolith/Cost/FunctionalEquation.lean (Jcost uniqueness) washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

coupling global curved-fractal geometry with local spiral-based appearance interference

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages

[1]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine...

work page 2021
[2]

Scaling up visual and vision-language representation learning with noisy text supervision

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In Marina Meila and Tong Zhang, editors,Proceedings of the 38th International Conference on Machine Learning, volume 139 ofProceedings of Machine...

work page 2021
[3]

Align before fuse: Vision and language representation learning with momentum distillation

Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language representation learning with momentum distillation. In Marc’Aurelio Ranzato, Alina Beygelzimer, Yann Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors,Advances in Neural Information Processing Systems, v...

work page 2021
[4]

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors,Proceedings of the 39th International Conference on Machine Learning, volume 162 ofProceedings...

work page 2022
[5]

BLIP-2: Bootstrapping language-image pre- training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre- training with frozen image encoders and large language models. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors,Proceedings of the 40th International Conference on Machine Learning, volume 202 o...

work page 2023
[6]

Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob L Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikołaj Bi´nko...

work page 2022
[7]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors,Advances in Neural Information Processing Systems, volume 36, pages 34892–34916. Curran Associates, Inc., 2023

work page 2023
[8]

Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), ...

work page 2024
[9]

Benchmarking robustness of adaptation methods on pre-trained vision-language models

Shuo Chen, Jindong Gu, Zhen Han, Yunpu Ma, Philip Torr, and V olker Tresp. Benchmarking robustness of adaptation methods on pre-trained vision-language models. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors,Advances in Neural Information Processing Systems, volume 36, pages 51758–51777. Curran Associate...

work page 2023
[10]

Chain of attack: On the robustness of vision-language models against transfer-based adversarial attacks

Peng Xie, Yequan Bie, Jianda Mao, Yangqiu Song, Yang Wang, Hao Chen, and Kani Chen. Chain of attack: On the robustness of vision-language models against transfer-based adversarial attacks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14679–14689, June 2025

work page 2025
[11]

Analyzing the robustness of vision & language models.IEEE/ACM Transactions on Audio, Speech, and Language Processing, 32: 2751–2763, 2024

Alexander Shirnin, Nikita Andreev, Sofia Potapova, and Ekaterina Artemova. Analyzing the robustness of vision & language models.IEEE/ACM Transactions on Audio, Speech, and Language Processing, 32: 2751–2763, 2024. doi: 10.1109/TASLP.2024.3399061

work page doi:10.1109/taslp.2024.3399061 2024
[12]

On evaluating adversarial robustness of large vision-language models

Yunqing Zhao, Tianyu Pang, Chao Du, Xiao Yang, Chongxuan Li, Ngai-Man (Man) Cheung, and Min Lin. On evaluating adversarial robustness of large vision-language models. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors,Advances in Neural Information Processing Systems, volume 36, pages 54111–54138. Curran As...

work page 2023
[13]

Enhancing the robustness of vision-language foundation models by alignment perturbation.IEEE Transactions on Information Forensics and Security, 20:7091–7105, 2025

Cong Zhang, Shuhui Wang, Xiaodan Li, Yao Zhu, Honggang Qi, and Qingming Huang. Enhancing the robustness of vision-language foundation models by alignment perturbation.IEEE Transactions on Information Forensics and Security, 20:7091–7105, 2025. doi: 10.1109/TIFS.2025.3586430

work page doi:10.1109/tifs.2025.3586430 2025
[14]

Jailbreak vision language models via bi-modal adversarial prompt, 2024

Zonghao Ying, Aishan Liu, Tianyuan Zhang, Zhengmin Yu, Siyuan Liang, Xianglong Liu, and Dacheng Tao. Jailbreak vision language models via bi-modal adversarial prompt, 2024

work page 2024
[15]

Huanchen Wang, Wencheng Zhang, Zhiqiang Wang, Zhicong Lu, and Yuxin Ma. Vismodai: Visual analytics for evaluating and improving corruption robustness of vision-language models.IEEE Transactions on Visualization and Computer Graphics, 32(1):615–625, 2026. doi: 10.1109/TVCG.2025.3634257

work page doi:10.1109/tvcg.2025.3634257 2026
[16]

Piafusion: A progressive infrared and visible image fusion network based on illumination aware.Information Fusion, 83-84:79–92, 2022

Linfeng Tang, Jiteng Yuan, Hao Zhang, Xingyu Jiang, and Jiayi Ma. Piafusion: A progressive infrared and visible image fusion network based on illumination aware.Information Fusion, 83-84:79–92, 2022. ISSN 1566-2535. doi: 10.1016/j.inffus.2022.03.007

work page doi:10.1016/j.inffus.2022.03.007 2022
[17]

Object fusion tracking based on visible and infrared images: A comprehensive review.Information Fusion, 63:166–187, 2020

Xingchen Zhang, Ping Ye, Henry Leung, Ke Gong, and Gang Xiao. Object fusion tracking based on visible and infrared images: A comprehensive review.Information Fusion, 63:166–187, 2020. ISSN 1566-2535. doi: 10.1016/j.inffus.2020.05.002

work page doi:10.1016/j.inffus.2020.05.002 2020
[18]

Unified adversarial patch for cross-modal attacks in the physical world

Xingxing Wei, Yao Huang, Yitong Sun, and Jie Yu. Unified adversarial patch for cross-modal attacks in the physical world. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4445–4454, October 2023

work page 2023
[19]

Two-stage optimized unified adversarial patch for attacking visible-infrared cross-modal detectors in the physical world.Applied Soft Computing, 171:112818, 2025

Chengyin Hu, Weiwen Shi, Wen Yao, Tingsong Jiang, Ling Tian, and Wen Li. Two-stage optimized unified adversarial patch for attacking visible-infrared cross-modal detectors in the physical world.Applied Soft Computing, 171:112818, 2025. ISSN 1568-4946. doi: 10.1016/j.asoc.2025.112818

work page doi:10.1016/j.asoc.2025.112818 2025
[20]

Cdupatch: Color-driven universal adversarial patch attack for dual-modal visible- infrared detectors

Jiahuan Long, Wen Yao, Tingsong Jiang, Jiacheng Hou, Shuai Jia, Junqi Wu, Xiaoya Zhang, Xiaohu Zheng, and Chao Ma. Cdupatch: Color-driven universal adversarial patch attack for dual-modal visible- infrared detectors. InProceedings of the 33rd ACM International Conference on Multimedia, pages 1462–1470, New York, NY , USA, 2025. Association for Computing M...

work page doi:10.1145/3746027.3755188 2025
[21]

Fine-grained semantically aligned vision-language pre-training

Juncheng Li, Xin He, Longhui Wei, Long Qian, Linchao Zhu, Lingxi Xie, Yueting Zhuang, Qi Tian, and Siliang Tang. Fine-grained semantically aligned vision-language pre-training. In Sanmi Koyejo, Shakir Mohamed, Anima Agarwal, Danielle Belgrave, Kyunghyun Cho, and Alice Oh, editors,Advances in Neural Information Processing Systems, volume 35, pages 7290–730...

work page 2022
[22]

Zou, and Tatsunori Hashimoto

Ian Covert, Tony Sun, James Y . Zou, and Tatsunori Hashimoto. Locality alignment improves vision- language models. InInternational Conference on Learning Representations, 2025

work page 2025
[23]

Assessing and learning alignment of unimodal vision and language models

Le Zhang, Qian Yang, and Aishwarya Agrawal. Assessing and learning alignment of unimodal vision and language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14604–14614, June 2025

work page 2025
[24]

Eberhart

James Kennedy and Russell C. Eberhart. Particle swarm optimization. InProceedings of ICNN’95 - International Conference on Neural Networks, volume 4, pages 1942–1948, 1995. doi: 10.1109/ICNN. 1995.488968

work page doi:10.1109/icnn 1942
[25]

Synthesizing robust adversarial examples

Anish Athalye, Logan Engstrom, Andrew Ilyas, and Kevin Kwok. Synthesizing robust adversarial examples. In Jennifer Dy and Andreas Krause, editors,Proceedings of the 35th International Conference on Machine Learning, volume 80 ofProceedings of Machine Learning Research, pages 284–293. PMLR, 10–15 Jul 2018

work page 2018
[26]

Learning to prompt for vision- language models.International Journal of Computer Vision, 130(9):2337–2348, jul 2022

Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision- language models.International Journal of Computer Vision, 130(9):2337–2348, jul 2022. doi: 10.1007/ s11263-022-01653-1

work page 2022
[27]

Conditional prompt learning for vision-language models

Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16816–16825, June 2022

work page 2022
[28]

Maple: Multi-modal prompt learning

Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. Maple: Multi-modal prompt learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19113–19122, June 2023. 11

work page 2023
[29]

Self-regulating prompts: Foundational model adaptation without forgetting

Muhammad Uzair Khattak, Syed Talal Wasim, Muzammal Naseer, Salman Khan, Ming-Hsuan Yang, and Fahad Shahbaz Khan. Self-regulating prompts: Foundational model adaptation without forgetting. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 15190–15200, October 2023

work page 2023
[30]

Benchmarking multimodal large language models against image corruptions

Xinkuan Qiu, Meina Kan, Yongbin Zhou, and Shiguang Shan. Benchmarking multimodal large language models against image corruptions. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9014–9023, October 2025

work page 2025
[31]

Analysing the robustness of vision-language-models to common corruptions, 2025

Muhammad Usama, Syeda Aishah Asim, Syed Bilal Ali, Syed Talal Wasim, and Umair Bin Mansoor. Analysing the robustness of vision-language-models to common corruptions, 2025

work page 2025
[32]

MMT-bench: A comprehensive multimodal benchmark for evaluating large vision-language models towards multitask AGI

Kaining Ying, Fanqing Meng, Jin Wang, Zhiqian Li, Han Lin, Yue Yang, Hao Zhang, Wenbo Zhang, Yuqi Lin, Shuo Liu, Jiayi Lei, Quanfeng Lu, Runjian Chen, Peng Xu, Renrui Zhang, Haozhe Zhang, Peng Gao, Yali Wang, Yu Qiao, Ping Luo, Kaipeng Zhang, and Wenqi Shao. MMT-bench: A comprehensive multimodal benchmark for evaluating large vision-language models toward...

work page 2024
[33]

Brown, Dandelion Mané, Aurko Roy, Martín Abadi, and Justin Gilmer

Tom B. Brown, Dandelion Mané, Aurko Roy, Martín Abadi, and Justin Gilmer. Adversarial patch, 2017

work page 2017
[34]

Robust physical-world attacks on deep learning visual classification

Kevin Eykholt, Ivan Evtimov, Earlence Fernandes, Bo Li, Amir Rahmati, Chaowei Xiao, Atul Prakash, Tadayoshi Kohno, and Dawn Song. Robust physical-world attacks on deep learning visual classification. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1625–1634, June 2018

work page 2018
[35]

Shapeshifter: Robust physical adversarial attack on faster r-cnn object detector

Shang-Tse Chen, Cory Cornelius, Jason Martin, and Duen Horng (Polo) Chau. Shapeshifter: Robust physical adversarial attack on faster r-cnn object detector. In Michele Berlingerio, Francesco Bonchi, Thomas Gärtner, Neil Hurley, and Georgiana Ifrim, editors,Machine Learning and Knowledge Discovery in Databases, pages 52–68, Cham, 2019. Springer Internationa...

work page 2019
[36]

DPatch: An adversarial patch attack on object detectors

Xin Liu, Huanrui Yang, Ziwei Liu, Linghao Song, Hai Li, and Yiran Chen. DPatch: An adversarial patch attack on object detectors. InProceedings of the AAAI Workshop on Artificial Intelligence Safety (SafeAI 2019), volume 2301 ofCEUR Workshop Proceedings. CEUR-WS, 2019

work page 2019
[37]

Lutz and Elvira Mayordomo

Jack H. Lutz and Elvira Mayordomo. Dimensions of points in self-similar fractals.SIAM Journal on Computing, 38(3):1080–1112, 2008. doi: 10.1137/070684689

work page doi:10.1137/070684689 2008
[38]

Shadows can be dangerous: Stealthy and effective physical-world adversarial attack by natural phenomenon

Yiqi Zhong, Xianming Liu, Deming Zhai, Junjun Jiang, and Xiangyang Ji. Shadows can be dangerous: Stealthy and effective physical-world adversarial attack by natural phenomenon. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15345–15354, June 2022

work page 2022
[39]

Natural light can also be dangerous: Traffic sign misinterpretation under adversarial natural light attacks

Teng-Fang Hsiao, Bo-Lun Huang, Zi-Xiang Ni, Yan-Ting Lin, Hong-Han Shuai, Yung-Hui Li, and Wen- Huang Cheng. Natural light can also be dangerous: Traffic sign misinterpretation under adversarial natural light attacks. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 3915–3924, January 2024

work page 2024
[40]

When lighting deceives: Exposing vision-language models’ illumination vulnerability through illumination transformation attack

Hanqing Liu, Shouwei Ruan, Yao Huang, Shiji Zhao, and Xingxing Wei. When lighting deceives: Exposing vision-language models’ illumination vulnerability through illumination transformation attack. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 10485– 10495, October 2025

work page 2025
[41]

Shouwei Ruan, Hanqing Liu, Yao Huang, Xiaoqi Wang, Caixin Kang, Hang Su, Yinpeng Dong, and Xingxing Wei. Advdreamer unveils: Are vision-language models truly ready for real-world 3d variations? InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 7894–7904, October 2025

work page 2025
[42]

Infrared-LLaV A: Enhanc- ing understanding of infrared images in multi-modal large language models

Shixin Jiang, Zerui Chen, Jiafeng Liang, Yanyan Zhao, Ming Liu, and Bing Qin. Infrared-LLaV A: Enhanc- ing understanding of infrared images in multi-modal large language models. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Findings of the Association for Computational Linguistics: EMNLP 2024, pages 8573–8591, Miami, Florida, USA, nov 2024...

work page doi:10.18653/v1/2024.findings-emnlp.501 2024
[43]

Irgpt: Understanding real-world infrared image with bi-cross- modal curriculum on large-scale benchmark

Zhe Cao, Jin Zhang, and Ruiheng Zhang. Irgpt: Understanding real-world infrared image with bi-cross- modal curriculum on large-scale benchmark. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 166–176, October 2025. 12

work page 2025
[44]

Revealing physical-world semantic vulnerabilities: Universal adversarial patches for infrared vision-language models, 2026

Chengyin Hu, Yuxian Dong, Yikun Guo, Xiang Chen, Junqi Wu, Jiahuan Long, Yiwei Wei, Tingsong Jiang, and Wen Yao. Revealing physical-world semantic vulnerabilities: Universal adversarial patches for infrared vision-language models, 2026

work page 2026
[45]

Lawrence Zitnick

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars, editors,Computer Vision – ECCV 2014, pages 740–755, Cham, 2014. Springer International Publishing. ISBN 978-3-319-10602-1

work page 2014
[46]

Lawrence Zitnick

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, and C. Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server, 2015

work page 2015
[47]

Dr-avit: Toward diverse and realistic aerial visible-to-infrared image translation.IEEE Transactions on Geoscience and Remote Sensing, 62:1–13, 2024

Zonghao Han, Shun Zhang, Yuru Su, Xiaoning Chen, and Shaohui Mei. Dr-avit: Toward diverse and realistic aerial visible-to-infrared image translation.IEEE Transactions on Geoscience and Remote Sensing, 62:1–13, 2024. doi: 10.1109/TGRS.2024.3405989

work page doi:10.1109/tgrs.2024.3405989 2024
[48]

Reproducible scaling laws for contrastive language-image learning

Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scaling laws for contrastive language-image learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2829, June 2023

work page 2023
[49]

Demystifying clip data

Hu Xu, Saining Xie, Xiaoqing Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, and Christoph Feichtenhofer. Demystifying clip data. InInternational Confer- ence on Learning Representations, 2024

work page 2024
[50]

Eva-clip: Improved training techniques for clip at scale, 2023

Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. Eva-clip: Improved training techniques for clip at scale, 2023

work page 2023
[51]

Llava-next: Improved reasoning, ocr, and world knowledge, January 2024

Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024

work page 2024
[52]

Openflamingo: An open-source framework for training large autoregressive vision-language models, 2023

Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Wortsman, and Ludwig Schmidt. Openflamingo: An open-source framework for training large autoregressive vision-language models, 2023

work page 2023
[53]

Instructblip: Towards general-purpose vision-language models with instruction tuning

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors,Advances in Neural Information Processing Systems,...

work page 2023
[54]

Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. InThe Tenth International Conference on Learning Representations, 2022

work page 2022
[55]

Representation learning with contrastive predictive coding, 2018

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding, 2018

work page 2018
[56]

LLMs instead of human judges? a large scale empirical study across 20 NLP evaluation tasks

Anna Bavaresco, Raffaella Bernardi, Leonardo Bertolazzi, Desmond Elliott, Raquel Fernández, Albert Gatt, Esam Ghaleb, Mario Giulianelli, Michael Hanna, Alexander Koller, Andre Martins, Philipp Mondorf, Vera Neplenbroek, Sandro Pezzelle, Barbara Plank, David Schlangen, Alessandro Suglia, Aditya K Surikuchi, Ece Takmaz, and Alberto Testoni. LLMs instead of ...

work page doi:10.18653/v1/2025.acl-short.20 2025

[1] [1]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine...

work page 2021

[2] [2]

Scaling up visual and vision-language representation learning with noisy text supervision

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In Marina Meila and Tong Zhang, editors,Proceedings of the 38th International Conference on Machine Learning, volume 139 ofProceedings of Machine...

work page 2021

[3] [3]

Align before fuse: Vision and language representation learning with momentum distillation

Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language representation learning with momentum distillation. In Marc’Aurelio Ranzato, Alina Beygelzimer, Yann Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors,Advances in Neural Information Processing Systems, v...

work page 2021

[4] [4]

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors,Proceedings of the 39th International Conference on Machine Learning, volume 162 ofProceedings...

work page 2022

[5] [5]

BLIP-2: Bootstrapping language-image pre- training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre- training with frozen image encoders and large language models. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors,Proceedings of the 40th International Conference on Machine Learning, volume 202 o...

work page 2023

[6] [6]

Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob L Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikołaj Bi´nko...

work page 2022

[7] [7]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors,Advances in Neural Information Processing Systems, volume 36, pages 34892–34916. Curran Associates, Inc., 2023

work page 2023

[8] [8]

Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), ...

work page 2024

[9] [9]

Benchmarking robustness of adaptation methods on pre-trained vision-language models

Shuo Chen, Jindong Gu, Zhen Han, Yunpu Ma, Philip Torr, and V olker Tresp. Benchmarking robustness of adaptation methods on pre-trained vision-language models. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors,Advances in Neural Information Processing Systems, volume 36, pages 51758–51777. Curran Associate...

work page 2023

[10] [10]

Chain of attack: On the robustness of vision-language models against transfer-based adversarial attacks

Peng Xie, Yequan Bie, Jianda Mao, Yangqiu Song, Yang Wang, Hao Chen, and Kani Chen. Chain of attack: On the robustness of vision-language models against transfer-based adversarial attacks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14679–14689, June 2025

work page 2025

[11] [11]

Analyzing the robustness of vision & language models.IEEE/ACM Transactions on Audio, Speech, and Language Processing, 32: 2751–2763, 2024

Alexander Shirnin, Nikita Andreev, Sofia Potapova, and Ekaterina Artemova. Analyzing the robustness of vision & language models.IEEE/ACM Transactions on Audio, Speech, and Language Processing, 32: 2751–2763, 2024. doi: 10.1109/TASLP.2024.3399061

work page doi:10.1109/taslp.2024.3399061 2024

[12] [12]

On evaluating adversarial robustness of large vision-language models

Yunqing Zhao, Tianyu Pang, Chao Du, Xiao Yang, Chongxuan Li, Ngai-Man (Man) Cheung, and Min Lin. On evaluating adversarial robustness of large vision-language models. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors,Advances in Neural Information Processing Systems, volume 36, pages 54111–54138. Curran As...

work page 2023

[13] [13]

Enhancing the robustness of vision-language foundation models by alignment perturbation.IEEE Transactions on Information Forensics and Security, 20:7091–7105, 2025

Cong Zhang, Shuhui Wang, Xiaodan Li, Yao Zhu, Honggang Qi, and Qingming Huang. Enhancing the robustness of vision-language foundation models by alignment perturbation.IEEE Transactions on Information Forensics and Security, 20:7091–7105, 2025. doi: 10.1109/TIFS.2025.3586430

work page doi:10.1109/tifs.2025.3586430 2025

[14] [14]

Jailbreak vision language models via bi-modal adversarial prompt, 2024

Zonghao Ying, Aishan Liu, Tianyuan Zhang, Zhengmin Yu, Siyuan Liang, Xianglong Liu, and Dacheng Tao. Jailbreak vision language models via bi-modal adversarial prompt, 2024

work page 2024

[15] [15]

Huanchen Wang, Wencheng Zhang, Zhiqiang Wang, Zhicong Lu, and Yuxin Ma. Vismodai: Visual analytics for evaluating and improving corruption robustness of vision-language models.IEEE Transactions on Visualization and Computer Graphics, 32(1):615–625, 2026. doi: 10.1109/TVCG.2025.3634257

work page doi:10.1109/tvcg.2025.3634257 2026

[16] [16]

Piafusion: A progressive infrared and visible image fusion network based on illumination aware.Information Fusion, 83-84:79–92, 2022

Linfeng Tang, Jiteng Yuan, Hao Zhang, Xingyu Jiang, and Jiayi Ma. Piafusion: A progressive infrared and visible image fusion network based on illumination aware.Information Fusion, 83-84:79–92, 2022. ISSN 1566-2535. doi: 10.1016/j.inffus.2022.03.007

work page doi:10.1016/j.inffus.2022.03.007 2022

[17] [17]

Object fusion tracking based on visible and infrared images: A comprehensive review.Information Fusion, 63:166–187, 2020

Xingchen Zhang, Ping Ye, Henry Leung, Ke Gong, and Gang Xiao. Object fusion tracking based on visible and infrared images: A comprehensive review.Information Fusion, 63:166–187, 2020. ISSN 1566-2535. doi: 10.1016/j.inffus.2020.05.002

work page doi:10.1016/j.inffus.2020.05.002 2020

[18] [18]

Unified adversarial patch for cross-modal attacks in the physical world

Xingxing Wei, Yao Huang, Yitong Sun, and Jie Yu. Unified adversarial patch for cross-modal attacks in the physical world. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4445–4454, October 2023

work page 2023

[19] [19]

Two-stage optimized unified adversarial patch for attacking visible-infrared cross-modal detectors in the physical world.Applied Soft Computing, 171:112818, 2025

Chengyin Hu, Weiwen Shi, Wen Yao, Tingsong Jiang, Ling Tian, and Wen Li. Two-stage optimized unified adversarial patch for attacking visible-infrared cross-modal detectors in the physical world.Applied Soft Computing, 171:112818, 2025. ISSN 1568-4946. doi: 10.1016/j.asoc.2025.112818

work page doi:10.1016/j.asoc.2025.112818 2025

[20] [20]

Cdupatch: Color-driven universal adversarial patch attack for dual-modal visible- infrared detectors

Jiahuan Long, Wen Yao, Tingsong Jiang, Jiacheng Hou, Shuai Jia, Junqi Wu, Xiaoya Zhang, Xiaohu Zheng, and Chao Ma. Cdupatch: Color-driven universal adversarial patch attack for dual-modal visible- infrared detectors. InProceedings of the 33rd ACM International Conference on Multimedia, pages 1462–1470, New York, NY , USA, 2025. Association for Computing M...

work page doi:10.1145/3746027.3755188 2025

[21] [21]

Fine-grained semantically aligned vision-language pre-training

Juncheng Li, Xin He, Longhui Wei, Long Qian, Linchao Zhu, Lingxi Xie, Yueting Zhuang, Qi Tian, and Siliang Tang. Fine-grained semantically aligned vision-language pre-training. In Sanmi Koyejo, Shakir Mohamed, Anima Agarwal, Danielle Belgrave, Kyunghyun Cho, and Alice Oh, editors,Advances in Neural Information Processing Systems, volume 35, pages 7290–730...

work page 2022

[22] [22]

Zou, and Tatsunori Hashimoto

Ian Covert, Tony Sun, James Y . Zou, and Tatsunori Hashimoto. Locality alignment improves vision- language models. InInternational Conference on Learning Representations, 2025

work page 2025

[23] [23]

Assessing and learning alignment of unimodal vision and language models

Le Zhang, Qian Yang, and Aishwarya Agrawal. Assessing and learning alignment of unimodal vision and language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14604–14614, June 2025

work page 2025

[24] [24]

Eberhart

James Kennedy and Russell C. Eberhart. Particle swarm optimization. InProceedings of ICNN’95 - International Conference on Neural Networks, volume 4, pages 1942–1948, 1995. doi: 10.1109/ICNN. 1995.488968

work page doi:10.1109/icnn 1942

[25] [25]

Synthesizing robust adversarial examples

Anish Athalye, Logan Engstrom, Andrew Ilyas, and Kevin Kwok. Synthesizing robust adversarial examples. In Jennifer Dy and Andreas Krause, editors,Proceedings of the 35th International Conference on Machine Learning, volume 80 ofProceedings of Machine Learning Research, pages 284–293. PMLR, 10–15 Jul 2018

work page 2018

[26] [26]

Learning to prompt for vision- language models.International Journal of Computer Vision, 130(9):2337–2348, jul 2022

Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision- language models.International Journal of Computer Vision, 130(9):2337–2348, jul 2022. doi: 10.1007/ s11263-022-01653-1

work page 2022

[27] [27]

Conditional prompt learning for vision-language models

Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16816–16825, June 2022

work page 2022

[28] [28]

Maple: Multi-modal prompt learning

Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. Maple: Multi-modal prompt learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19113–19122, June 2023. 11

work page 2023

[29] [29]

Self-regulating prompts: Foundational model adaptation without forgetting

Muhammad Uzair Khattak, Syed Talal Wasim, Muzammal Naseer, Salman Khan, Ming-Hsuan Yang, and Fahad Shahbaz Khan. Self-regulating prompts: Foundational model adaptation without forgetting. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 15190–15200, October 2023

work page 2023

[30] [30]

Benchmarking multimodal large language models against image corruptions

Xinkuan Qiu, Meina Kan, Yongbin Zhou, and Shiguang Shan. Benchmarking multimodal large language models against image corruptions. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9014–9023, October 2025

work page 2025

[31] [31]

Analysing the robustness of vision-language-models to common corruptions, 2025

Muhammad Usama, Syeda Aishah Asim, Syed Bilal Ali, Syed Talal Wasim, and Umair Bin Mansoor. Analysing the robustness of vision-language-models to common corruptions, 2025

work page 2025

[32] [32]

MMT-bench: A comprehensive multimodal benchmark for evaluating large vision-language models towards multitask AGI

Kaining Ying, Fanqing Meng, Jin Wang, Zhiqian Li, Han Lin, Yue Yang, Hao Zhang, Wenbo Zhang, Yuqi Lin, Shuo Liu, Jiayi Lei, Quanfeng Lu, Runjian Chen, Peng Xu, Renrui Zhang, Haozhe Zhang, Peng Gao, Yali Wang, Yu Qiao, Ping Luo, Kaipeng Zhang, and Wenqi Shao. MMT-bench: A comprehensive multimodal benchmark for evaluating large vision-language models toward...

work page 2024

[33] [33]

Brown, Dandelion Mané, Aurko Roy, Martín Abadi, and Justin Gilmer

Tom B. Brown, Dandelion Mané, Aurko Roy, Martín Abadi, and Justin Gilmer. Adversarial patch, 2017

work page 2017

[34] [34]

Robust physical-world attacks on deep learning visual classification

Kevin Eykholt, Ivan Evtimov, Earlence Fernandes, Bo Li, Amir Rahmati, Chaowei Xiao, Atul Prakash, Tadayoshi Kohno, and Dawn Song. Robust physical-world attacks on deep learning visual classification. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1625–1634, June 2018

work page 2018

[35] [35]

Shapeshifter: Robust physical adversarial attack on faster r-cnn object detector

Shang-Tse Chen, Cory Cornelius, Jason Martin, and Duen Horng (Polo) Chau. Shapeshifter: Robust physical adversarial attack on faster r-cnn object detector. In Michele Berlingerio, Francesco Bonchi, Thomas Gärtner, Neil Hurley, and Georgiana Ifrim, editors,Machine Learning and Knowledge Discovery in Databases, pages 52–68, Cham, 2019. Springer Internationa...

work page 2019

[36] [36]

DPatch: An adversarial patch attack on object detectors

Xin Liu, Huanrui Yang, Ziwei Liu, Linghao Song, Hai Li, and Yiran Chen. DPatch: An adversarial patch attack on object detectors. InProceedings of the AAAI Workshop on Artificial Intelligence Safety (SafeAI 2019), volume 2301 ofCEUR Workshop Proceedings. CEUR-WS, 2019

work page 2019

[37] [37]

Lutz and Elvira Mayordomo

Jack H. Lutz and Elvira Mayordomo. Dimensions of points in self-similar fractals.SIAM Journal on Computing, 38(3):1080–1112, 2008. doi: 10.1137/070684689

work page doi:10.1137/070684689 2008

[38] [38]

Shadows can be dangerous: Stealthy and effective physical-world adversarial attack by natural phenomenon

Yiqi Zhong, Xianming Liu, Deming Zhai, Junjun Jiang, and Xiangyang Ji. Shadows can be dangerous: Stealthy and effective physical-world adversarial attack by natural phenomenon. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15345–15354, June 2022

work page 2022

[39] [39]

Natural light can also be dangerous: Traffic sign misinterpretation under adversarial natural light attacks

Teng-Fang Hsiao, Bo-Lun Huang, Zi-Xiang Ni, Yan-Ting Lin, Hong-Han Shuai, Yung-Hui Li, and Wen- Huang Cheng. Natural light can also be dangerous: Traffic sign misinterpretation under adversarial natural light attacks. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 3915–3924, January 2024

work page 2024

[40] [40]

When lighting deceives: Exposing vision-language models’ illumination vulnerability through illumination transformation attack

Hanqing Liu, Shouwei Ruan, Yao Huang, Shiji Zhao, and Xingxing Wei. When lighting deceives: Exposing vision-language models’ illumination vulnerability through illumination transformation attack. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 10485– 10495, October 2025

work page 2025

[41] [41]

Shouwei Ruan, Hanqing Liu, Yao Huang, Xiaoqi Wang, Caixin Kang, Hang Su, Yinpeng Dong, and Xingxing Wei. Advdreamer unveils: Are vision-language models truly ready for real-world 3d variations? InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 7894–7904, October 2025

work page 2025

[42] [42]

Infrared-LLaV A: Enhanc- ing understanding of infrared images in multi-modal large language models

Shixin Jiang, Zerui Chen, Jiafeng Liang, Yanyan Zhao, Ming Liu, and Bing Qin. Infrared-LLaV A: Enhanc- ing understanding of infrared images in multi-modal large language models. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Findings of the Association for Computational Linguistics: EMNLP 2024, pages 8573–8591, Miami, Florida, USA, nov 2024...

work page doi:10.18653/v1/2024.findings-emnlp.501 2024

[43] [43]

Irgpt: Understanding real-world infrared image with bi-cross- modal curriculum on large-scale benchmark

Zhe Cao, Jin Zhang, and Ruiheng Zhang. Irgpt: Understanding real-world infrared image with bi-cross- modal curriculum on large-scale benchmark. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 166–176, October 2025. 12

work page 2025

[44] [44]

Revealing physical-world semantic vulnerabilities: Universal adversarial patches for infrared vision-language models, 2026

Chengyin Hu, Yuxian Dong, Yikun Guo, Xiang Chen, Junqi Wu, Jiahuan Long, Yiwei Wei, Tingsong Jiang, and Wen Yao. Revealing physical-world semantic vulnerabilities: Universal adversarial patches for infrared vision-language models, 2026

work page 2026

[45] [45]

Lawrence Zitnick

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars, editors,Computer Vision – ECCV 2014, pages 740–755, Cham, 2014. Springer International Publishing. ISBN 978-3-319-10602-1

work page 2014

[46] [46]

Lawrence Zitnick

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, and C. Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server, 2015

work page 2015

[47] [47]

Dr-avit: Toward diverse and realistic aerial visible-to-infrared image translation.IEEE Transactions on Geoscience and Remote Sensing, 62:1–13, 2024

Zonghao Han, Shun Zhang, Yuru Su, Xiaoning Chen, and Shaohui Mei. Dr-avit: Toward diverse and realistic aerial visible-to-infrared image translation.IEEE Transactions on Geoscience and Remote Sensing, 62:1–13, 2024. doi: 10.1109/TGRS.2024.3405989

work page doi:10.1109/tgrs.2024.3405989 2024

[48] [48]

Reproducible scaling laws for contrastive language-image learning

Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scaling laws for contrastive language-image learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2829, June 2023

work page 2023

[49] [49]

Demystifying clip data

Hu Xu, Saining Xie, Xiaoqing Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, and Christoph Feichtenhofer. Demystifying clip data. InInternational Confer- ence on Learning Representations, 2024

work page 2024

[50] [50]

Eva-clip: Improved training techniques for clip at scale, 2023

Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. Eva-clip: Improved training techniques for clip at scale, 2023

work page 2023

[51] [51]

Llava-next: Improved reasoning, ocr, and world knowledge, January 2024

Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024

work page 2024

[52] [52]

Openflamingo: An open-source framework for training large autoregressive vision-language models, 2023

Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Wortsman, and Ludwig Schmidt. Openflamingo: An open-source framework for training large autoregressive vision-language models, 2023

work page 2023

[53] [53]

Instructblip: Towards general-purpose vision-language models with instruction tuning

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors,Advances in Neural Information Processing Systems,...

work page 2023

[54] [54]

Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. InThe Tenth International Conference on Learning Representations, 2022

work page 2022

[55] [55]

Representation learning with contrastive predictive coding, 2018

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding, 2018

work page 2018

[56] [56]

LLMs instead of human judges? a large scale empirical study across 20 NLP evaluation tasks

Anna Bavaresco, Raffaella Bernardi, Leonardo Bertolazzi, Desmond Elliott, Raquel Fernández, Albert Gatt, Esam Ghaleb, Mario Giulianelli, Michael Hanna, Alexander Koller, Andre Martins, Philipp Mondorf, Vera Neplenbroek, Sandro Pezzelle, Barbara Plank, David Schlangen, Alessandro Suglia, Aditya K Surikuchi, Ece Takmaz, and Alberto Testoni. LLMs instead of ...

work page doi:10.18653/v1/2025.acl-short.20 2025