pith. machine review for the scientific record.

arXiv: 2605.04261 · v1 · submitted 2026-05-05 · 💻 cs.CR · cs.LG

Recognition: unknown

Laundering AI Authority with Adversarial Examples

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 17:15 UTC · model grok-4.3

classification 💻 cs.CR cs.LG
keywords: adversarial examples · vision-language models · authority laundering · transfer attacks · content moderation · misinformation · visual robustness · perceptual attacks

The pith

Adversarial perturbations let attackers make vision-language models issue confident responses about the wrong images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that basic adversarial attacks developed against open CLIP models transfer directly to closed production vision-language models. This breaks the assumption that VLMs see the same visual content as human users, allowing an attacker to induce authoritative but false outputs without touching the model's safety rules. The technique operates purely at the perceptual level and works across systems including GPT-5.4, Claude Opus 4.6, Gemini 3, and Grok 4.2. Because VLMs are now used for image fact-checking, content moderation, and product judgments, the attacks open practical routes to amplify misinformation, evade filters, and manipulate recommendations. No new attack methods are needed; techniques known for over a decade already succeed at rates between 22 and 100 percent on hundreds of tested cases.

Core claim

Vision-language models are deployed as trusted authorities on images, yet adversarial examples break the assumption that they perceive the same content as users. An attacker can subtly perturb an image so the VLM produces confident and authoritative responses about the wrong input. Unlike jailbreaks, the attack leaves model alignment intact and works entirely at the perceptual level. Standard attacks against publicly available CLIP models transfer reliably to production VLMs including GPT-5.4, Claude Opus 4.6, Gemini 3, and Grok 4.2. The result is authority laundering that can amplify misinformation, disparage individuals, evade content moderation, and manipulate product recommendations, all while the model responds honestly and within policy to the content it actually perceives.

What carries the argument

Transfer of adversarial perturbations, optimized against open CLIP models, to closed production VLMs, where they induce authoritative responses about mismatched content.
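
A minimal sketch of the kind of attack this refers to, assuming an open CLIP checkpoint as the surrogate: a projected-gradient perturbation pushes the image embedding toward an attacker-chosen caption, and the perturbed image is then submitted unchanged to the closed VLM. The surrogate model, target caption, file path, and the 8/255 budget are illustrative assumptions rather than the paper's configuration; a faithful implementation would also project in pixel space before CLIP's normalization.

```python
# Hedged sketch: targeted PGD against an open CLIP image encoder, maximizing
# cosine similarity between the perturbed image and an attacker-chosen caption.
# Surrogate model, caption, budget, and paths are assumptions for illustration.
import torch
import open_clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
model = model.to(device).eval()
tokenizer = open_clip.get_tokenizer("ViT-B-32")

image = preprocess(Image.open("original.png")).unsqueeze(0).to(device)
caption = tokenizer(["an illustrative attacker-chosen target caption"]).to(device)

with torch.no_grad():
    target = model.encode_text(caption)
    target = target / target.norm(dim=-1, keepdim=True)

eps, alpha, steps = 8 / 255, 1 / 255, 200           # assumed L-inf budget and schedule
delta = torch.zeros_like(image, requires_grad=True)

for _ in range(steps):
    emb = model.encode_image(image + delta)
    emb = emb / emb.norm(dim=-1, keepdim=True)
    loss = -(emb * target).sum()                     # negative cosine similarity to the caption
    loss.backward()
    with torch.no_grad():
        delta -= alpha * delta.grad.sign()           # signed-gradient step toward the target
        delta.clamp_(-eps, eps)                      # project back into the perturbation budget
        delta.grad.zero_()

adv_image = (image + delta).detach()                 # candidate submitted as-is to a closed VLM
```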

If this is right

  • AI fact-checking on social media can be made to endorse false claims about images.
  • Content moderation systems can be bypassed by perturbing images to avoid detection.
  • Product comparison or recommendation tools can be steered toward incorrect visual judgments.
  • Individual reputations can be damaged through adversarial changes to images fed to identity-related queries.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Deployers of VLMs may need to add human review layers or ensemble checks for any visual decision that affects public information.
  • The gap between alignment training and low-level perceptual robustness suggests that future multimodal models could inherit the same exposure.
  • Regulators focused on AI safety might treat visual input robustness as a separate requirement from prompt-level safeguards.
  • Simple image preprocessing or watermarking could be tested as a low-cost mitigation before more expensive retraining.
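
On the last point, a minimal sketch of what low-cost preprocessing could look like, assuming a server-side hook applied before the image reaches the VLM. The JPEG quality and resize range are arbitrary choices, and input transformations of this kind are known to be weak against adaptive attackers, so this is a screening test rather than a defense.

```python
# Illustrative sketch of the "low-cost mitigation" idea above: re-encode and
# lightly resample an image before it reaches the VLM, hoping to disrupt
# pixel-level perturbations. Parameters are assumptions, not recommendations.
import io
import random
from PIL import Image

def preprocess_for_vlm(img: Image.Image, quality: int = 75) -> Image.Image:
    # 1. JPEG re-encoding quantizes away some high-frequency perturbation energy.
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=quality)
    img = Image.open(io.BytesIO(buf.getvalue()))

    # 2. A small random rescale breaks exact pixel alignment with the attack grid.
    scale = random.uniform(0.9, 1.1)
    w, h = img.size
    return img.resize((int(w * scale), int(h * scale)), Image.Resampling.BICUBIC)
```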

Load-bearing premise

The perturbations stay imperceptible or non-obvious to humans while still transferring reliably from open CLIP models to closed production VLMs.

What would settle it

A controlled test in which humans consistently notice the perturbations or a production VLM correctly describes the original image content despite the attack.

Figures

Figures reproduced from arXiv: 2605.04261 by Avital Shafran, Florian Tramèr, Jie Zhang, Pura Peetathawatchai.

Figure 1. Examples of AI authority laundering attacks against production VLMs, spanning four attack families: narrative manipulation, identity manipulation, commercial fraud, and evasion of safety filters. In each case, an adversarial perturbation of the input image causes the model to respond honestly and within policy to a different semantic content than what the user or platform sees, laundering the attacker’s ch…
Figure 2. Fake news made possible through prompt control…
Figure 3. ChatGPT 5.4 Thinking declares the 9/11 attacks to…
Figure 4. Grok 4.2 amplifies conspiracies about Tylenol (a…
Figure 6. Grok misidentifies Musk as the subject of a news ar…
Figure 7. Bypassing NSFW filters via adversarial perturbation.
Figure 8. Adversarial bypass of gender-asymmetric content moderation in Grok. Left: a clothing-removal request on a male…
Figure 9. Some VLMs, such as Nano Banana Pro, refuse im…
Figure 10. Presented with images of two watches, a high-end…
Figure 12. An adversarial example, optimized from random…
Figure 13. Targeted ASR across models as a function of perturbation budget.
Figure 14. Targeted ASR averaged across all models, broken down by the gender (left) and race/ethnicity (right) of source and…
Figure 15. Perturbing a screenshot of a New York Times arti…
Figure 16. Google reverse image search misidentifies an adver…
Figure 17. Presenting Grok with an image of the potentially…
Figure 19. When asked to recommend between two pairs of…
Figure 20. Sabotaging a competitor through adversarial per…
Figure 21. Adversarial versions of photographs of six well-documented historical events, each perturbed to match the text…
Figure 22. Examples in which adversarial manipulation bypasses public-figure protection (Section 5.3) but the generated output…
Original abstract

Vision-language models (VLMs) are increasingly deployed as trusted authorities -- fact-checking images on social media, comparing products, and moderating content. Users implicitly trust that these systems perceive the same visual content as they do. We show that adversarial examples break this assumption, enabling AI authority laundering: an attacker subtly perturbs an image so that the VLM produces confident and authoritative responses about the wrong input. Unlike jailbreaks or prompt injections, our attacks do not compromise model alignment; the attack operates entirely at the perceptual level. We demonstrate that standard attacks against publicly available CLIP models transfer reliably to production VLMs -- including GPT-5.4, Claude Opus 4.6, Gemini 3, and Grok 4.2. Across four attack surfaces, we show that authority laundering can amplify misinformation, disparage individuals, evade content moderation, and manipulate product recommendations. Our attacks have high success rates: in hundreds of attacks targeting identity manipulation and NSFW evasion, we measure success rates of 22-100% across six models. No novel attack algorithm is required: basic techniques known for over a decade suffice, establishing a lower bound on attacker capability that should concern defenders. Our results demonstrate that visual adversarial robustness is now a practical -- and still largely unsolved -- safety problem.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that adversarial examples optimized on public CLIP models transfer reliably to closed production VLMs (including GPT-5.4, Claude Opus 4.6, Gemini 3, and Grok 4.2), enabling 'AI authority laundering' in which the VLM produces confident but incorrect authoritative responses about subtly perturbed images. This is shown across four attack surfaces (identity manipulation, misinformation amplification, content moderation evasion, product recommendation manipulation) using only standard attacks, with success rates of 22-100% measured over hundreds of attacks on six models.

Significance. If the transferability claim holds after adding proper controls, the result would be significant for AI safety: it establishes a practical lower bound on attacker capability against deployed VLMs used for fact-checking and moderation, showing that visual adversarial robustness remains unsolved. The work merits explicit credit for its breadth (multiple production models and attack surfaces) and for using only decade-old techniques rather than introducing new algorithms.

major comments (3)
  1. [Experimental results for identity manipulation and NSFW evasion] In the experimental results for identity manipulation and NSFW evasion (hundreds of attacks section): success rates of 22-100% are presented without baseline comparisons to unperturbed images or random perturbations. This is load-bearing because it leaves open whether outputs reflect perceptual transfer from CLIP or simply the VLMs' variable responses to ambiguous/low-quality inputs.
  2. [Transfer to production VLMs] In the transfer evaluation to production VLMs: success is measured solely by text matching to target 'wrong' descriptions under query-only access, but no controls for prompt phrasing, generation temperature, or multiple samples with error bars are described. This weakens the reliability claim for cross-model transfer (a minimal reporting sketch follows at the end of this report).
  3. [Attack surfaces and imperceptibility claim] In the description of attack surfaces and imperceptibility: the claim that perturbations 'subtly' affect the image while remaining non-obvious to humans lacks any human evaluation, SSIM/LPIPS metrics, or perceptual thresholds, which is central to distinguishing authority laundering from obvious tampering.
minor comments (2)
  1. [Abstract] Abstract: model version strings (GPT-5.4, Claude Opus 4.6) use non-standard notation; clarify whether these refer to specific API snapshots or are illustrative.
  2. [Abstract] Abstract: the statement that 'basic techniques known for over a decade suffice' is repeated in spirit; consolidate to strengthen the lower-bound framing.
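
Major comments 1 and 2 both ask for per-condition reporting with uncertainty. A minimal sketch of what that could look like, assuming boolean outcomes are collected from repeated independent queries for each condition (clean, random noise of matched L-infinity magnitude, adversarial); the criterion for judging a match is an assumption left outside the sketch.

```python
# Sketch of per-condition reporting: attack success rate (ASR) with a Wilson
# binomial confidence interval over repeated independent queries. Outcomes are
# booleans (True = the response matched the target description); how that match
# is judged is an assumption outside this sketch.
import math

def wilson_ci(successes: int, trials: int, z: float = 1.96):
    """95% Wilson score interval for a binomial proportion."""
    p = successes / trials
    denom = 1 + z ** 2 / trials
    center = (p + z ** 2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z ** 2 / (4 * trials ** 2)) / denom
    return center - half, center + half

def report_asr(outcomes):
    """Return the success rate and its confidence interval for one condition."""
    successes, trials = sum(outcomes), len(outcomes)
    lo, hi = wilson_ci(successes, trials)
    return successes / trials, (lo, hi)

# Illustrative use: the same reporting would be produced for clean images and
# for random perturbations of matched magnitude, not only for adversarial ones.
adversarial_outcomes = [True, True, False, True, True, True, False, True, True, True]
rate, (lo, hi) = report_asr(adversarial_outcomes)
print(f"adversarial ASR = {rate:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```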

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their careful reading and constructive feedback, which identifies key areas where additional controls and metrics will strengthen the experimental claims. We address each major comment below and commit to revisions that improve rigor without altering the core findings.

Point-by-point responses
  1. Referee: In the experimental results for identity manipulation and NSFW evasion (hundreds of attacks section): success rates of 22-100% are presented without baseline comparisons to unperturbed images or random perturbations. This is load-bearing because it leaves open whether outputs reflect perceptual transfer from CLIP or simply the VLMs' variable responses to ambiguous/low-quality inputs.

    Authors: We agree that baseline comparisons are essential to isolate the contribution of the transferred adversarial perturbations. In the revised manuscript we will add success rates for clean (unperturbed) images and for images subjected to random perturbations of comparable magnitude. These baselines, which we have already computed, demonstrate substantially lower rates of the target incorrect outputs, confirming that the observed results arise from perceptual transfer rather than model variability on ambiguous inputs. revision: yes

  2. Referee: In the transfer evaluation to production VLMs: success is measured solely by text matching to target 'wrong' descriptions under query-only access, but no controls for prompt phrasing, generation temperature, or multiple samples with error bars are described. This weakens the reliability claim for cross-model transfer.

    Authors: We acknowledge the value of additional controls for robustness. The revised manuscript will describe prompt-phrasing sensitivity checks, report results across multiple temperature settings where the production APIs allow, and include success rates with error bars computed from repeated independent queries. These additions will provide a clearer picture of transfer reliability under query-only access. revision: yes

  3. Referee: In the description of attack surfaces and imperceptibility: the claim that perturbations 'subtly' affect the image while remaining non-obvious to humans lacks any human evaluation, SSIM/LPIPS metrics, or perceptual thresholds, which is central to distinguishing authority laundering from obvious tampering.

    Authors: We concur that quantitative and human-centered evidence is required to support the imperceptibility claim. The revised version will report SSIM and LPIPS values for the perturbations, specify the perceptual thresholds used during attack generation, and include results from a human study in which participants attempt to distinguish original from perturbed images. These elements will more rigorously separate subtle authority laundering from obvious tampering. revision: yes
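
A hedged sketch of the perceptual metrics promised in this response, computing SSIM and LPIPS between a clean image and its perturbed counterpart. The file names are placeholders, and no thresholds from the paper are assumed.

```python
# Sketch of the imperceptibility metrics the authors commit to reporting:
# SSIM and LPIPS between the clean and the adversarially perturbed image.
# File names are placeholders; images must share the same resolution.
import numpy as np
import torch
import lpips                      # pip install lpips
from PIL import Image
from skimage.metrics import structural_similarity

clean = np.array(Image.open("clean.png").convert("RGB"))
adv = np.array(Image.open("adv.png").convert("RGB"))

# SSIM over RGB channels; values near 1.0 mean the images are nearly identical.
ssim_value = structural_similarity(clean, adv, channel_axis=-1, data_range=255)

# LPIPS expects float tensors in [-1, 1] shaped (N, 3, H, W); lower means more similar.
def to_lpips_tensor(a: np.ndarray) -> torch.Tensor:
    return torch.from_numpy(a).permute(2, 0, 1).float().div(127.5).sub(1).unsqueeze(0)

lpips_model = lpips.LPIPS(net="alex")
lpips_value = lpips_model(to_lpips_tensor(clean), to_lpips_tensor(adv)).item()

print(f"SSIM = {ssim_value:.4f}, LPIPS = {lpips_value:.4f}")
```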

Circularity Check

0 steps flagged

Empirical demonstration with no derivations or self-referential reductions

Full rationale

The paper is an empirical study demonstrating transfer of known adversarial attacks from public CLIP models to closed VLMs. No equations, predictions, or first-principles derivations are presented that could reduce to fitted inputs or self-citations. Claims rest on experimental success rates across models and attack surfaces, with no load-bearing self-citation chains or ansatzes. This is a standard non-circular empirical result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that adversarial perturbations developed for CLIP transfer to closed VLMs and that these perturbations can be made imperceptible to humans while altering model outputs.

axioms (1)
  • domain assumption: Adversarial examples exist and transfer across vision models.
    Invoked when stating that standard attacks against CLIP transfer to production VLMs.

pith-pipeline@v0.9.0 · 5542 in / 1244 out tokens · 67057 ms · 2026-05-08T17:15:19.975123+00:00 · methodology



    Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. 2023. Universal and transferable adversarial attacks on aligned language models.arXiv preprint arXiv:2307.15043(2023). Laundering AI Authority with Adversarial Examples Figure 11: Claude Opus 4.6’s response when asked to compare an AI-generated image of a woman (left) ...