Exposing Vulnerabilities in Visible-Infrared VLMs: A Unified Geometric Adversarial Framework with Cross-Task Transferability
Pith reviewed 2026-05-22 07:07 UTC · model grok-4.3
The pith
Curved fractal geometry with spiral textures creates adversarial patches that fool visible-infrared vision-language models and transfer to other tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CFGPatch builds on triangular fractal geometry and replaces rigid straight-edged primitives with Bezier-curved elements. This preserves multi-scale fractal self-similarity while introducing smoother contours, richer directional variation, and more flexible shape deformation. The method pairs this global structure with a modality-specific Fraser-spiral rendering mechanism that injects fine-grained texture distortions and misleading perceptual cues into visible and infrared images. Coupling the two disrupts both shape perception and texture interpretation, and expectation over transformation (EOT) is adopted to improve robustness against common image-level transformations.
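To make the geometric idea concrete, here is a minimal sketch of a triangular fractal whose straight edges are swapped for quadratic Bezier arcs. It is an illustration only, not the paper's construction: the Sierpinski-style subdivision, the `bulge` parameter, and all function names are assumptions introduced here.

```python
import math

def quad_bezier(p0, p1, p2, steps=8):
    """Sample a quadratic Bezier curve from p0 to p2 with control point p1."""
    pts = []
    for i in range(steps + 1):
        t = i / steps
        a, b, c = (1 - t) ** 2, 2 * (1 - t) * t, t ** 2
        pts.append((a * p0[0] + b * p1[0] + c * p2[0],
                    a * p0[1] + b * p1[1] + c * p2[1]))
    return pts

def curved_edge(p, q, bulge, steps=8):
    """Replace the straight edge p->q by a Bezier arc.

    The control point is the edge midpoint pushed along the edge normal by
    `bulge` times the edge length (hypothetical parameter; 0 keeps it straight).
    """
    mx, my = (p[0] + q[0]) / 2, (p[1] + q[1]) / 2
    dx, dy = q[0] - p[0], q[1] - p[1]
    ctrl = (mx - dy * bulge, my + dx * bulge)
    return quad_bezier(p, ctrl, q, steps)

def midpoint(p, q):
    return ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)

def curved_fractal(a, b, c, depth, bulge=0.2):
    """Sierpinski-style subdivision (assumed); returns closed curved outlines."""
    if depth == 0:
        outline = []
        for p, q in ((a, b), (b, c), (c, a)):
            # Drop each arc's last point; the next arc starts there.
            outline.extend(curved_edge(p, q, bulge)[:-1])
        return [outline]
    ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
    return (curved_fractal(a, ab, ca, depth - 1, bulge)
            + curved_fractal(ab, b, bc, depth - 1, bulge)
            + curved_fractal(ca, bc, c, depth - 1, bulge))

# Three levels of subdivision over an equilateral triangle.
shapes = curved_fractal((0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2), depth=3)
```

Each recursion level triples the number of cells while every cell keeps the same curved outline at a smaller scale, which is the self-similarity the claim refers to. In an attack setting the control-point offsets would presumably be optimization variables rather than a single fixed `bulge`.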
What carries the argument
The coupling of global curved-fractal geometry with local spiral-based appearance interference in the CFGPatch framework.
If this is right
- CFGPatch fools VIS-IR VLMs more effectively than standard patch baselines while remaining robust under image transformations.
- Adversarial samples optimized for zero-shot classification transfer successfully to image captioning and visual question answering.
- The method demonstrates strong cross-task transferability and generalizability across downstream tasks.
- Expectation over transformation improves robustness against common image-level transformations.
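The robustness bullet rests on expectation over transformation, which scores a patch by its average loss over randomly sampled transformations instead of on a single clean rendering. The sketch below is a Monte Carlo estimate of that average under stated assumptions: the transformation set (brightness gain, wrap-around shift, Gaussian noise) and the `toy_score` function are hypothetical stand-ins, not the paper's transformation distribution or attack loss.

```python
import random

def random_transform(patch, rng):
    """Apply one randomly sampled image-level transformation to a 2D patch."""
    gain = rng.uniform(0.8, 1.2)   # brightness change
    shift = rng.randint(-2, 2)     # horizontal translation with wrap-around
    noise_std = 0.02               # sensor-noise standard deviation
    out = []
    for row in patch:
        k = shift % len(row)
        rolled = row[-k:] + row[:-k] if k else list(row)
        out.append([min(1.0, max(0.0, gain * v + rng.gauss(0.0, noise_std)))
                    for v in rolled])
    return out

def eot_objective(patch, score_fn, n_samples=32, seed=0):
    """Estimate E_t[score_fn(t(patch))] by averaging over sampled transforms."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        total += score_fn(random_transform(patch, rng))
    return total / n_samples

def toy_score(patch):
    """Hypothetical stand-in for the attack loss: mean pixel intensity."""
    flat = [v for row in patch for v in row]
    return sum(flat) / len(flat)

patch = [[0.5] * 8 for _ in range(8)]
value = eot_objective(patch, toy_score)
```

An optimizer would maximize or minimize `eot_objective` rather than the score of the untransformed patch, so a patch that only works at one exact brightness or position is penalized.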
Where Pith is reading between the lines
- Security evaluations of VIS-IR systems should test against geometric and textural perturbations of this form.
- If cross-task transfer holds, robustness demonstrated on one task cannot be assumed to carry over to related tasks.
- Similar geometric constructions could be examined for exposing weaknesses in other multimodal or cross-modal models.
- Practical VIS-IR deployments may benefit from defenses tuned to fractal self-similarity and spiral interference patterns.
Load-bearing premise
That coupling curved fractal shapes with spiral texture distortions will disrupt shape and texture perception in VIS-IR VLMs beyond what standard patch methods achieve.
What would settle it
An ablation that removes either the Bezier curves or the Fraser-spiral component. If attack success rates remain at the level of the full method, the specific coupling is not responsible for the gains; if they fall to the level of standard patch baselines, the coupling is what carries them.
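One way to read such an ablation is as a 2x2 factorial design over the two components. The helper below is a sketch under that assumption: it takes attack success rates measured for the four variants and reports each component's gain over the plain baseline plus the interaction term. The function name and dictionary layout are hypothetical, and no values from the paper are assumed.

```python
def coupling_effects(asr):
    """Summarize a 2x2 ablation.

    `asr` maps (curved: bool, spiral: bool) to an attack success rate in [0, 1],
    all measured under the same optimization budget.
    """
    base = asr[(False, False)]
    curve_gain = asr[(True, False)] - base
    spiral_gain = asr[(False, True)] - base
    full_gain = asr[(True, True)] - base
    return {
        "curve_gain": curve_gain,
        "spiral_gain": spiral_gain,
        "full_gain": full_gain,
        # Positive when the two components together beat the sum of their parts.
        "interaction": full_gain - curve_gain - spiral_gain,
    }
```

An interaction near zero would suggest the components contribute independently; a clearly positive interaction would be evidence that the coupling itself matters.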
Original abstract
Vision-language models (VLMs) have achieved strong performance across diverse multimodal tasks, but their adversarial robustness in visible-infrared (VIS-IR) scenarios remains underexplored. This gap is critical because VIS-IR sensing is widely used in real-world perception systems to support reliable understanding under challenging imaging conditions. To address this cross-modal threat setting, we propose CFGPatch, a curved-edge fractal geometric adversarial patch framework for attacking VIS-IR VLMs. CFGPatch builds on triangular fractal geometry and replaces rigid straight-edged primitives with Bezier-curved elements, preserving multi-scale fractal self-similarity while introducing smoother contours, richer directional variation, and more flexible shape deformation. In addition, we design a modality-specific Fraser-spiral rendering mechanism to inject fine-grained texture distortions and misleading perceptual cues into visible and infrared images. By coupling global curved-fractal geometry with local spiral-based appearance interference, CFGPatch disrupts both shape perception and texture interpretation. We further adopt expectation over transformation (EOT) to improve robustness against common image-level transformations. Extensive experiments show that CFGPatch effectively fools VIS-IR VLMs and consistently outperforms standard patch baselines in attack effectiveness and robustness. Moreover, adversarial samples optimized for zero-shot classification transfer well to image captioning and visual question answering, demonstrating strong cross-task transferability and generalizability across downstream tasks.
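The abstract does not spell out how the Fraser-spiral rendering works, so the following is only a rough illustration of what a modality-specific spiral texture could look like. It multiplies log-spiral stripes by concentric ring bands, loosely in the spirit of the Fraser spiral illusion. The formula, the parameter names, and the per-modality settings in `MODALITY_PARAMS` are all assumptions made for this sketch.

```python
import math

def spiral_texture(size, arms, twist, rings, contrast):
    """Return a size x size grayscale texture with values in [0, 1].

    Log-spiral stripes (arms, twist) are multiplied by concentric ring
    bands (rings). Requires size > 1 and 0 <= contrast <= 1.
    """
    c = (size - 1) / 2
    img = []
    for y in range(size):
        row = []
        for x in range(size):
            dx, dy = (x - c) / c, (y - c) / c
            r = math.hypot(dx, dy) + 1e-6      # avoid log(0) at the center
            theta = math.atan2(dy, dx)
            stripes = math.sin(arms * theta + twist * math.log(r))
            bands = math.cos(2 * math.pi * rings * r)
            row.append(0.5 + 0.5 * contrast * stripes * bands)
        img.append(row)
    return img

# Hypothetical per-modality settings; the paper's actual parameters are not given here.
MODALITY_PARAMS = {
    "vis": dict(arms=12, twist=6.0, rings=4, contrast=1.0),
    "ir": dict(arms=6, twist=3.0, rings=2, contrast=0.6),
}

textures = {m: spiral_texture(32, **p) for m, p in MODALITY_PARAMS.items()}
```

The lower contrast and coarser pattern for the infrared entry reflect only a guess that thermal imagery carries less fine texture than visible imagery; whether the paper makes the same choice is not stated.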
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes CFGPatch, a curved-edge fractal geometric adversarial patch for visible-infrared VLMs. It replaces straight-edged triangular fractals with Bezier-curved elements to preserve self-similarity while adding smoother contours and directional variation, and introduces a modality-specific Fraser-spiral rendering mechanism for texture distortions in VIS and IR images. The method couples global geometry with local appearance interference, uses EOT for robustness to transformations, and reports that the resulting patches outperform standard baselines on zero-shot classification while transferring to image captioning and VQA tasks.
Significance. If the empirical gains are shown to stem specifically from the curved-fractal plus spiral coupling rather than parameterization complexity, the work would usefully expose vulnerabilities in cross-modal VIS-IR VLMs and provide a concrete geometric attack framework with demonstrated cross-task transfer. The emphasis on real-world sensing conditions and EOT robustness is a positive step toward practical relevance.
major comments (3)
- [§4] §4 (Experiments) and associated tables: the central claim that CFGPatch outperforms baselines due to the curved-fractal + Fraser-spiral coupling requires an ablation that holds the number of control points, boundary smoothness, and optimization budget fixed while varying only the geometric primitives. Without such a control, the reported improvements cannot be attributed to the specific framework rather than increased degrees of freedom.
- [§3.2] §3.2 (Fraser-spiral rendering): the description of modality-specific texture injection lacks a quantitative measure (e.g., Fourier spectrum or perceptual metric) showing that the spiral patterns produce distinct disruptions in VIS versus IR channels beyond what a generic high-frequency noise patch would achieve.
- [Table 2] Table 2 (cross-task transfer results): the transfer from classification-optimized patches to captioning and VQA is presented without reporting the attack success rate on the source task for the transferred samples, making it impossible to assess whether the observed transfer reflects genuine generalizability or simply weaker target-task performance.
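The second major comment asks for a quantitative measure such as a Fourier spectrum. One simple candidate, sketched here as an assumption and not as the measure the authors use, is the share of non-DC spectral energy above a radial frequency cutoff. Comparing that share between a spiral patch and a generic noise patch, separately for VIS and IR, would give a number to argue over. The implementation uses a plain discrete Fourier transform so that it needs only the standard library; it is intended for small patches.

```python
import cmath
import math

def dft(seq):
    """Naive 1D discrete Fourier transform."""
    n = len(seq)
    return [sum(seq[k] * cmath.exp(-2j * math.pi * u * k / n) for k in range(n))
            for u in range(n)]

def dft2(img):
    """2D DFT by transforming rows, then columns. Returns spectrum[v][u]."""
    rows = [dft(r) for r in img]
    cols = [dft([rows[y][x] for y in range(len(rows))])
            for x in range(len(rows[0]))]
    return [[cols[u][v] for u in range(len(cols))] for v in range(len(rows))]

def high_freq_fraction(img, cutoff=0.25):
    """Share of non-DC spectral energy above `cutoff` (cycles per pixel)."""
    h, w = len(img), len(img[0])
    spec = dft2(img)
    hi = total = 0.0
    for v in range(h):
        for u in range(w):
            if u == 0 and v == 0:
                continue                      # skip the DC term
            fu = min(u, w - u) / w            # fold to [0, 0.5]
            fv = min(v, h - v) / h
            energy = abs(spec[v][u]) ** 2
            total += energy
            if math.hypot(fu, fv) > cutoff:
                hi += energy
    return hi / total if total else 0.0
```

A one-pixel checkerboard yields a fraction close to 1, while a slowly varying cosine yields a fraction close to 0, so the statistic separates fine texture from coarse structure. The cutoff of 0.25 is an arbitrary illustrative default.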
minor comments (2)
- [Abstract] The abstract and §1 claim 'extensive experiments' but the provided text does not include error bars, dataset sizes, or exact baseline implementations; these details should be added for reproducibility.
- [§3] Notation for the Bezier curve control points and the spiral frequency parameters is introduced without a consolidated table of symbols.
Simulated Author's Rebuttal
We thank the referee for the insightful comments and suggestions. We address each major comment in detail below and have revised the manuscript to incorporate the recommended improvements where applicable.
- Referee: [§4] §4 (Experiments) and associated tables: the central claim that CFGPatch outperforms baselines due to the curved-fractal + Fraser-spiral coupling requires an ablation that holds the number of control points, boundary smoothness, and optimization budget fixed while varying only the geometric primitives. Without such a control, the reported improvements cannot be attributed to the specific framework rather than increased degrees of freedom.
  Authors: We agree that a controlled ablation is necessary to isolate the effect of the curved-fractal geometry from increased parameterization. In the revised manuscript, we have added an ablation study in §4 that maintains fixed control points, boundary smoothness, and optimization budget, varying only the geometric primitives (e.g., straight vs. curved). The results confirm that the performance gains are attributable to the Bezier-curved fractal design.
  Revision: yes
- Referee: [§3.2] §3.2 (Fraser-spiral rendering): the description of modality-specific texture injection lacks a quantitative measure (e.g., Fourier spectrum or perceptual metric) showing that the spiral patterns produce distinct disruptions in VIS versus IR channels beyond what a generic high-frequency noise patch would achieve.
  Authors: We appreciate this suggestion for strengthening the analysis. We have incorporated quantitative measures, including Fourier spectrum comparisons and perceptual metrics such as SSIM and LPIPS, in the revised §3.2 to demonstrate the distinct disruptions caused by the modality-specific Fraser-spiral patterns in VIS and IR channels compared to generic high-frequency noise.
  Revision: yes
- Referee: [Table 2] Table 2 (cross-task transfer results): the transfer from classification-optimized patches to captioning and VQA is presented without reporting the attack success rate on the source task for the transferred samples, making it impossible to assess whether the observed transfer reflects genuine generalizability or simply weaker target-task performance.
  Authors: This is a valid point for clarifying the transfer results. We have updated Table 2 to include the attack success rates on the source classification task for the patches transferred to captioning and VQA tasks. This additional information helps demonstrate that the transfer reflects genuine cross-task generalizability rather than just weaker performance on target tasks.
  Revision: yes
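The Table 2 exchange turns on reporting source-task success alongside target-task success for the same samples. A minimal sketch of such a report follows, assuming per-sample boolean outcomes are available. The function name and the returned keys are hypothetical, and the conditional rate is one reasonable way to express transfer, not necessarily the one the authors adopt.

```python
def transfer_report(source_fooled, target_fooled):
    """Summarize cross-task transfer from paired per-sample outcomes.

    source_fooled[i]: the patch fooled the source task (e.g. classification).
    target_fooled[i]: the same patched sample fooled the target task.
    """
    if len(source_fooled) != len(target_fooled):
        raise ValueError("outcome lists must have the same length")
    n = len(source_fooled)
    if n == 0:
        raise ValueError("at least one sample is required")
    src = sum(source_fooled)
    tgt = sum(target_fooled)
    both = sum(1 for s, t in zip(source_fooled, target_fooled) if s and t)
    return {
        "source_asr": src / n,
        "target_asr": tgt / n,
        # Share of source-task successes that also succeed on the target task.
        "transfer_given_source": both / src if src else float("nan"),
    }
```

Reporting `transfer_given_source` next to the two marginal rates separates genuine transfer from a target model that is simply easy to fool on any patched input.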
Circularity Check
No circularity: empirical attack method validated externally
The paper proposes CFGPatch by describing design choices (Bezier-curved fractal elements plus Fraser-spiral texture injection, plus EOT) and then reports experimental comparisons against standard patch baselines on VIS-IR VLMs for classification, captioning, and VQA. No equations, first-principles derivations, or fitted parameters are presented whose outputs are then relabeled as predictions. The central claims rest on measured attack success rates and transfer performance rather than any self-referential reduction. Any self-citations (if present in the full text) are not load-bearing because the effectiveness claims are falsifiable via the reported experiments against external baselines. This is a standard empirical contribution with no detectable circular steps.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean (SphereAdmitsCircleLinking, D=3 forcing): alexander_duality_circle_linking
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: CFGPatch takes triangular fractal geometry as its base and transforms rigid straight-edged primitives into Bezier-curved elements, preserving fractal self-similarity... modality-specific Fraser-spiral rendering mechanism
- IndisputableMonolith/Cost/FunctionalEquation.lean (Jcost uniqueness): washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: coupling global curved-fractal geometry with local spiral-based appearance interference
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.