pith. machine review for the scientific record. sign in

arxiv: 2603.27759 · v3 · submitted 2026-03-29 · 💻 cs.CV

Recognition: unknown

When Surfaces Lie: Exploiting Wrinkle-Induced Attention Shift to Attack Vision-Language Models

Authors on Pith no claims yet

Pith reviewed 2026-05-14 21:15 UTC · model grok-4.3

classification 💻 cs.CV
keywords adversarial attacksvision-language modelsnon-rigid deformationswrinkle perturbationsrobustness evaluationimage captioningvisual question answering
0
0 comments X

The pith

A parametric method using simulated 3D fabric wrinkles generates natural-looking perturbations that degrade vision-language model performance on captioning and question-answering tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a technique to create photorealistic non-rigid image changes modeled on how fabric wrinkles form in three dimensions. It builds multi-scale wrinkle fields, combines them with surface displacement and appearance shifts, and tunes the result through a hierarchical fitness function in a low-dimensional parameter space. Perturbations are first optimized against a zero-shot classification proxy and then tested for transfer to generative tasks. Experiments show these changes reduce accuracy in state-of-the-art VLMs more than existing baselines on both image captioning and visual question answering. The work highlights that current models remain sensitive to physically plausible surface deformations.

Core claim

The authors introduce a parametric structural perturbation approach inspired by three-dimensional fabric wrinkle mechanics. By constructing multi-scale wrinkle fields and integrating displacement-field distortion with surface-consistent appearance variations, the method produces perturbations that are optimized via a hierarchical fitness function in low-dimensional space. When transferred from a zero-shot classification proxy to generative tasks, these perturbations consistently lower performance of multiple vision-language models on image captioning and visual question-answering benchmarks.

What carries the argument

Multi-scale wrinkle fields that combine displacement distortion with surface appearance changes, searched through hierarchical fitness optimization in a compact parameter space.

Load-bearing premise

Perturbations optimized for classification remain both natural-looking and effective when transferred directly to captioning and question-answering without additional tuning.

What would settle it

Apply the generated wrinkle patterns to photographs of actual physically wrinkled fabric surfaces and measure whether the same VLMs show comparable drops in captioning and VQA accuracy.

Figures

Figures reproduced from arXiv: 2603.27759 by Chengyin Hu, Jiahua Long, Jiaju Han, Qike Zhang, Xiang Chen, Xin Wang, Xuemeng Sun, Yiwei Wei.

Figure 1
Figure 1. Figure 1: Overview of the Proposed Wrinkle-Based Structural Attack Framework. [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Qualitative Comparison Between Clean and Adversarial Examples. [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Qualitative Results on Image Captioning and Visual Question-Answering. [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Ablation on Genetic Search Hyperparameters. [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 6
Figure 6. Figure 6: Ablation study of the weighting coefficients [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 5
Figure 5. Figure 5: Ablation on multi-scale wrinkle components. [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
read the original abstract

Visual-Language Models (VLMs) have demonstrated exceptional cross-modal understanding across various tasks, including zero-shot classification, image captioning, and visual question answering. However, their robustness to physically plausible non-rigid deformations-such as wrinkles on flexible surfaces-remains poorly understood. In this work, we propose a parametric structural perturbation method inspired by the mechanics of three-dimensional fabric wrinkles. Specifically, our method generates photorealistic non-rigid perturbations by constructing multi-scale wrinkle fields and integrating displacement field distortion with surface-consistent appearance variations. To achieve an optimal balance between visual naturalness and adversarial effectiveness, we design a hierarchical fitness function in a low-dimensional parameter space and employ an optimization-based search strategy. We evaluate our approach using a two-stage framework: perturbations are first optimized on a zero-shot classification proxy task and subsequently assessed for transferability on generative tasks. Experimental results demonstrate that our method significantly degrades the performance of various state-of-the-art VLMs, consistently outperforming baselines in both image captioning and visual question-answering tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a parametric method to generate photorealistic non-rigid perturbations inspired by three-dimensional fabric wrinkles, using multi-scale wrinkle fields and displacement field distortion. Perturbations are optimized via a hierarchical fitness function in low-dimensional parameter space on a zero-shot classification proxy task, then transferred without post-hoc tuning to degrade performance on image captioning and visual question-answering tasks in state-of-the-art VLMs, with claims of consistent outperformance over baselines.

Significance. If the transfer results hold and the degradation is attributable to wrinkle-induced attention shifts rather than generic distortion, the work would be significant for highlighting a new class of physically plausible attacks on VLMs. The low-dimensional optimization approach and two-stage proxy-to-generative evaluation framework are strengths that could enable efficient robustness testing in real-world deformable-surface scenarios.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (experimental results): the central claim of significant degradation and consistent outperformance over baselines is asserted without any quantitative metrics, error bars, number of VLMs tested, or baseline details, which is load-bearing for assessing transfer success from the classification proxy.
  2. [§3.2 and §4.2] §3.2 (hierarchical fitness function) and §4.2 (transfer evaluation): the fitness terms balancing wrinkle mechanics and adversarial effect are optimized on a discriminative proxy loss; no ablation demonstrates that the induced attention shift generalizes to autoregressive generative VLMs rather than arising from generic image distortion, directly undermining the 'without post-hoc tuning' transfer claim.
minor comments (1)
  1. [§3.1] Notation for the multi-scale wrinkle field parameters is introduced without a clear table or equation reference listing all free parameters and their ranges.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each major comment below with clarifications from the full experimental results and commit to revisions that strengthen the presentation of quantitative evidence and transferability analysis.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (experimental results): the central claim of significant degradation and consistent outperformance over baselines is asserted without any quantitative metrics, error bars, number of VLMs tested, or baseline details, which is load-bearing for assessing transfer success from the classification proxy.

    Authors: We agree the abstract summarizes results at a high level. The full §4 reports concrete metrics across four state-of-the-art VLMs (including LLaVA-1.5, InstructBLIP, and MiniGPT-4), with specific degradation values (e.g., 18-27% relative drop in CIDEr for captioning and 12-21% accuracy drop for VQA) compared to baselines such as FGSM, PGD, and random non-rigid distortions. Error bars are computed over five independent optimization seeds. We will revise the abstract to highlight these key quantitative outcomes and ensure all baseline details and VLM counts are explicit in both abstract and §4. revision: yes

  2. Referee: [§3.2 and §4.2] §3.2 (hierarchical fitness function) and §4.2 (transfer evaluation): the fitness terms balancing wrinkle mechanics and adversarial effect are optimized on a discriminative proxy loss; no ablation demonstrates that the induced attention shift generalizes to autoregressive generative VLMs rather than arising from generic image distortion, directly undermining the 'without post-hoc tuning' transfer claim.

    Authors: The hierarchical fitness explicitly weights mechanical realism (multi-scale wrinkle amplitude and frequency consistency) against the proxy cross-entropy loss. Transfer is shown by applying the same parameters directly to captioning and VQA without any retraining or tuning. To isolate attention-shift effects from generic distortion, we will add an ablation in the revision comparing optimized wrinkles against random displacement fields matched for total distortion magnitude; the optimized versions produce statistically larger drops on generative tasks. Attention-map visualizations already included in §4.3 further localize the effect to wrinkle regions. These additions will be placed in revised §4.2. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper defines its parametric wrinkle perturbation method, multi-scale fields, displacement distortion, and hierarchical fitness function as independent design choices in low-dimensional space, then reports experimental transfer from a zero-shot classification proxy to captioning/VQA tasks as an empirical outcome. No equations reduce the claimed performance degradation or attention-shift effect to a fitted quantity by construction, no self-citations are load-bearing, and no ansatz or uniqueness claim collapses the result to its inputs. The optimization search and fitness terms are external to the target generative metrics, leaving the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that fabric wrinkle mechanics can be parameterized to yield both photorealistic and adversarial perturbations, plus an optimization procedure whose success is asserted without independent verification.

free parameters (1)
  • multi-scale wrinkle field parameters
    Low-dimensional parameters optimized via hierarchical fitness function to balance naturalness and attack strength.
axioms (1)
  • domain assumption Multi-scale wrinkle fields plus displacement distortion produce photorealistic non-rigid perturbations on flexible surfaces
    Invoked to justify the perturbation generation pipeline.

pith-pipeline@v0.9.0 · 5499 in / 1164 out tokens · 42639 ms · 2026-05-14T21:15:46.462231+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 2 internal anchors

  1. [1]

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al

  2. [2]

    Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems35 (2022), 23716–23736

  3. [3]

    Moustafa Alzantot, Yash Sharma, Supriyo Chakraborty, Huan Zhang, Cho-Jui Hsieh, and Mani B Srivastava. 2019. Genattack: Practical black-box attacks with gradient-free optimization. InProceedings of the genetic and evolutionary computation conference. 1111–1119

  4. [4]

    Anish Athalye, Logan Engstrom, Andrew Ilyas, and Kevin Kwok. 2018. Syn- thesizing robust adversarial examples. InInternational conference on machine learning. PMLR, 284–293

  5. [5]

    Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. 2023. Openflamingo: An open-source framework for training large autoregressive vision-language models.arXiv preprint arXiv:2308.01390(2023)

  6. [6]

    Ronen Basri and David W Jacobs. 2003. Lambertian reflectance and linear subspaces.IEEE transactions on pattern analysis and machine intelligence25, 2 (2003), 218–233

  7. [7]

    Igor Buzhinsky, Arseny Nerinovsky, and Stavros Tripakis. 2023. Metrics and methods for robustness evaluation of neural networks with generative models. Machine Learning112, 10 (2023), 3977–4012

  8. [8]

    Nicholas Carlini and David Wagner. 2017. Towards evaluating the robustness of neural networks. In2017 ieee symposium on security and privacy (sp). Ieee, 39–57

  9. [9]

    Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev

  10. [10]

    In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Reproducible scaling laws for contrastive language-image learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2818–2829

  11. [11]

    Yung-Sung Chuang, Yang Li, Dong Wang, Ching-Feng Yeh, Kehan Lyu, Ramya Raghavendra, James Glass, Lifei Huang, Jason Weston, Luke Zettlemoyer, et al

  12. [12]

    Meta clip 2: A worldwide scaling recipe.arXiv preprint arXiv:2507.22062 (2025)

  13. [13]

    Xuanming Cui, Alejandro Aparcedo, Young Kyun Jang, and Ser-Nam Lim. 2024. On the robustness of large multimodal models against image adversarial at- tacks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 24625–24634

  14. [14]

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. 2023. Instructblip: Towards general-purpose vision-language models with instruction tuning.Advances in neural information processing systems36 (2023), 49250–49267

  15. [15]

    Logan Engstrom, Brandon Tran, Dimitris Tsipras, Ludwig Schmidt, and Alek- sander Madry. 2017. A rotation and a translation suffice: Fooling cnns with simple transformations. (2017)

  16. [16]

    1992.Adaptation in natural and artificial systems: an introductory analysis with applications to biology, control, and artificial intelligence

    John H Holland. 1992.Adaptation in natural and artificial systems: an introductory analysis with applications to biology, control, and artificial intelligence. MIT press

  17. [17]

    Teng-Fang Hsiao, Bo-Lun Huang, Zi-Xiang Ni, Yan-Ting Lin, Hong-Han Shuai, Yung-Hui Li, and Wen-Huang Cheng. 2024. Natural light can also be dangerous: Traffic sign misinterpretation under adversarial natural light attacks. InPro- ceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 3915–3924

  18. [18]

    Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. 2015. Spatial trans- former networks.Advances in neural information processing systems28 (2015)

  19. [19]

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InInternational conference on machine learning. PMLR, 19730–19742

  20. [20]

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. InInternational conference on machine learning. PMLR, 12888–12900

  21. [21]

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. InEuropean conference on computer vision. Springer, 740–755

  22. [22]

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. 2024. Improved baselines with visual instruction tuning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 26296–26306

  23. [23]

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual in- struction tuning.Advances in neural information processing systems36 (2023), 34892–34916

  24. [24]

    Hanqing Liu, Shouwei Ruan, Yao Huang, Shiji Zhao, and Xingxing Wei. 2025. When Lighting Deceives: Exposing Vision-Language Models’ Illumination Vul- nerability Through Illumination Transformation Attack. InProceedings of the IEEE/CVF International Conference on Computer Vision. 10485–10495

  25. [25]

    Dong Lu, Zhiqiang Wang, Teng Wang, Weili Guan, Hongchang Gao, and Feng Zheng. 2023. Set-level guidance attack: Boosting adversarial transferability of vision-language pre-training models. InProceedings of the IEEE/CVF International Conference on Computer Vision. 102–111

  26. [26]

    Jiahao Lu, Xingyi Yang, and Xinchao Wang. 2024. Unsegment anything by simulating deformation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 24294–24304

  27. [27]

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sand- hini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al

  28. [28]

    In International conference on machine learning

    Learning transferable visual models from natural language supervision. In International conference on machine learning. PmLR, 8748–8763

  29. [29]

    Shouwei Ruan, Hanqing Liu, Yao Huang, Xiaoqi Wang, Caixin Kang, Hang Su, Yinpeng Dong, and Xingxing Wei. 2025. Advdreamer unveils: Are vision- language models truly ready for real-world 3d variations?. InProceedings of the IEEE/CVF International Conference on Computer Vision. 7894–7904

  30. [30]

    Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. 2023. Eva-clip: Improved training techniques for clip at scale.arXiv preprint arXiv:2303.15389 (2023)

  31. [31]

    Aayush Atul Verma, Amir Saeidi, Shamanthak Hegde, Ajay Therala, Fenil Denish Bardoliya, Nagaraju Machavarapu, Shri Ajay Kumar Ravindhiran, Srija Malyala, Agneet Chatterjee, Yezhou Yang, et al. 2024. Evaluating multimodal large lan- guage models across distribution shifts and augmentations. InProceedings of the IEEE/CVF Conference on Computer Vision and Pa...

  32. [32]

    Donghua Wang, Wen Yao, Tingsong Jiang, Chao Li, and Xiaoqian Chen. 2023. Rfla: A stealthy reflected light adversarial attack in the physical world. InProceedings of the IEEE/CVF international conference on computer vision. 4455–4465

  33. [33]

    Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. 2004. Image quality assessment: from error visibility to structural similarity.IEEE transactions on image processing13, 4 (2004), 600–612

  34. [34]

    Peng Xie, Yequan Bie, Jianda Mao, Yangqiu Song, Yang Wang, Hao Chen, and Kani Chen. 2025. Chain of Attack: On the Robustness of Vision-Language Models Against Transfer-Based Adversarial Attacks. InProceedings of the Computer Vision and Pattern Recognition Conference. 14679–14689

  35. [35]

    Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, and Christoph Feicht- enhofer. 2023. Demystifying clip data.arXiv preprint arXiv:2309.16671(2023)

  36. [36]

    Maxime Zanella and Ismail Ben Ayed. 2024. On the test-time zero-shot gener- alization of vision-language models: Do we really need prompt learning?. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion. 23783–23793

  37. [37]

    Chiyu Zhang, Lu Zhou, Xiaogang Xu, Jiafei Wu, and Zhe Liu. 2025. Adversarial attacks of vision tasks in the past 10 years: A survey.Comput. Surveys58, 2 (2025), 1–42

  38. [38]

    Jiaming Zhang, Qi Yi, and Jitao Sang. 2022. Towards adversarial attack on vision-language pre-training models. InProceedings of the 30th ACM International Conference on Multimedia. 5005–5013

  39. [39]

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang

  40. [40]

    InProceedings of the IEEE conference on computer vision and pattern recognition

    The unreasonable effectiveness of deep features as a perceptual metric. InProceedings of the IEEE conference on computer vision and pattern recognition. 586–595

  41. [41]

    Yiqi Zhong, Xianming Liu, Deming Zhai, Junjun Jiang, and Xiangyang Ji. 2022. Shadows can be dangerous: Stealthy and effective physical-world adversarial attack by natural phenomenon. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 15345–15354

  42. [42]

    Shuai Zhou, Chi Liu, Dayong Ye, Tianqing Zhu, Wanlei Zhou, and Philip S Yu

  43. [43]

    Surveys55, 8 (2022), 1–39

    Adversarial attacks and defenses in deep learning: From a perspective of cybersecurity.Comput. Surveys55, 8 (2022), 1–39