pith. machine review for the scientific record.

arXiv: 2605.04261 · v1 · submitted 2026-05-05 · 💻 cs.CR · cs.LG

Recognition: unknown

Laundering AI Authority with Adversarial Examples

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 17:15 UTC · model grok-4.3

classification 💻 cs.CR cs.LG
keywords: adversarial examples · vision-language models · authority laundering · transfer attacks · content moderation · misinformation · visual robustness · perceptual attacks

The pith

Adversarial perturbations let attackers make vision-language models issue confident responses about the wrong images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that basic adversarial attacks developed against open CLIP models transfer directly to closed production vision-language models. This breaks the assumption that VLMs see the same visual content as human users, allowing an attacker to induce authoritative but false outputs without touching the model's safety rules. The technique operates purely at the perceptual level and works across systems including GPT-5.4, Claude Opus 4.6, Gemini 3, and Grok 4.2. Because VLMs are now used for image fact-checking, content moderation, and product judgments, the attacks open practical routes to amplify misinformation, evade filters, and manipulate recommendations. No new attack methods are needed; techniques known for over a decade already succeed at rates between 22 and 100 percent on hundreds of tested cases.

Core claim

Vision-language models are deployed as trusted authorities on images, yet adversarial examples break the assumption that they perceive the same content as users. An attacker can subtly perturb an image so the VLM produces confident and authoritative responses about the wrong input. Unlike jailbreaks, the attack leaves model alignment intact and works entirely at the perceptual level. Standard attacks against publicly available CLIP models transfer reliably to production VLMs including GPT-5.4, Claude Opus 4.6, Gemini 3, and Grok 4.2. The result is authority laundering that can amplify misinformation, disparage individuals, evade content moderation, and manipulate product recommendations, all while the model responds honestly and within policy to the content it actually perceives.

What carries the argument

Transfer of adversarial perturbations, optimized against open CLIP models, to closed production VLMs, where they induce authoritative responses about mismatched content.
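
A minimal sketch of the kind of attack this refers to, assuming an open CLIP checkpoint as the surrogate: a projected-gradient perturbation pushes the image embedding toward an attacker-chosen caption, and the perturbed image is then submitted unchanged to the closed VLM. The surrogate model, target caption, file path, and the 8/255 budget are illustrative assumptions rather than the paper's configuration; a faithful implementation would also project in pixel space before CLIP's normalization.

```python
# Hedged sketch: targeted PGD against an open CLIP image encoder, maximizing
# cosine similarity between the perturbed image and an attacker-chosen caption.
# Surrogate model, caption, budget, and paths are assumptions for illustration.
import torch
import open_clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
model = model.to(device).eval()
tokenizer = open_clip.get_tokenizer("ViT-B-32")

image = preprocess(Image.open("original.png")).unsqueeze(0).to(device)
caption = tokenizer(["an illustrative attacker-chosen target caption"]).to(device)

with torch.no_grad():
    target = model.encode_text(caption)
    target = target / target.norm(dim=-1, keepdim=True)

eps, alpha, steps = 8 / 255, 1 / 255, 200           # assumed L-inf budget and schedule
delta = torch.zeros_like(image, requires_grad=True)

for _ in range(steps):
    emb = model.encode_image(image + delta)
    emb = emb / emb.norm(dim=-1, keepdim=True)
    loss = -(emb * target).sum()                     # negative cosine similarity to the caption
    loss.backward()
    with torch.no_grad():
        delta -= alpha * delta.grad.sign()           # signed-gradient step toward the target
        delta.clamp_(-eps, eps)                      # project back into the perturbation budget
        delta.grad.zero_()

adv_image = (image + delta).detach()                 # candidate submitted as-is to a closed VLM
```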

If this is right

  • AI fact-checking on social media can be made to endorse false claims about images.
  • Content moderation systems can be bypassed by perturbing images to avoid detection.
  • Product comparison or recommendation tools can be steered toward incorrect visual judgments.
  • Individual reputations can be damaged through adversarial changes to images fed to identity-related queries.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Deployers of VLMs may need to add human review layers or ensemble checks for any visual decision that affects public information.
  • The gap between alignment training and low-level perceptual robustness suggests that future multimodal models could inherit the same exposure.
  • Regulators focused on AI safety might treat visual input robustness as a separate requirement from prompt-level safeguards.
  • Simple image preprocessing or watermarking could be tested as a low-cost mitigation before more expensive retraining.
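
On the last point, a minimal sketch of what low-cost preprocessing could look like, assuming a server-side hook applied before the image reaches the VLM. The JPEG quality and resize range are arbitrary choices, and input transformations of this kind are known to be weak against adaptive attackers, so this is a screening test rather than a defense.

```python
# Illustrative sketch of the "low-cost mitigation" idea above: re-encode and
# lightly resample an image before it reaches the VLM, hoping to disrupt
# pixel-level perturbations. Parameters are assumptions, not recommendations.
import io
import random
from PIL import Image

def preprocess_for_vlm(img: Image.Image, quality: int = 75) -> Image.Image:
    # 1. JPEG re-encoding quantizes away some high-frequency perturbation energy.
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=quality)
    img = Image.open(io.BytesIO(buf.getvalue()))

    # 2. A small random rescale breaks exact pixel alignment with the attack grid.
    scale = random.uniform(0.9, 1.1)
    w, h = img.size
    return img.resize((int(w * scale), int(h * scale)), Image.Resampling.BICUBIC)
```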

Load-bearing premise

The perturbations stay imperceptible or non-obvious to humans while still transferring reliably from open CLIP models to closed production VLMs.

What would settle it

A controlled test in which humans consistently notice the perturbations or a production VLM correctly describes the original image content despite the attack.

Figures

Figures reproduced from arXiv: 2605.04261 by Avital Shafran, Florian Tramèr, Jie Zhang, Pura Peetathawatchai.

Figure 1. Examples of AI authority laundering attacks against production VLMs, spanning four attack families: narrative manipulation, identity manipulation, commercial fraud, and evasion of safety filters. In each case, an adversarial perturbation of the input image causes the model to respond honestly and within policy to a different semantic content than what the user or platform sees, laundering the attacker’s ch…
Figure 2. Fake news made possible through prompt control…
Figure 3. ChatGPT 5.4 Thinking declares the 9/11 attacks to…
Figure 4. Grok 4.2 amplifies conspiracies about Tylenol (a…
Figure 6. Grok misidentifies Musk as the subject of a news ar…
Figure 7. Bypassing NSFW filters via adversarial perturbation.
Figure 8. Adversarial bypass of gender-asymmetric content moderation in Grok. Left: a clothing-removal request on a male…
Figure 9. Some VLMs, such as Nano Banana Pro, refuse im…
Figure 10. Presented with images of two watches, a high-end…
Figure 12. An adversarial example, optimized from random…
Figure 13. Targeted ASR across models as a function of perturbation budget.
Figure 14. Targeted ASR averaged across all models, broken down by the gender (left) and race/ethnicity (right) of source and…
Figure 15. Perturbing a screenshot of a New York Times arti…
Figure 16. Google reverse image search misidentifies an adver…
Figure 17. Presenting Grok with an image of the potentially…
Figure 19. When asked to recommend between two pairs of…
Figure 20. Sabotaging a competitor through adversarial per…
Figure 21. Adversarial versions of photographs of six well-documented historical events, each perturbed to match the text…
Figure 22. Examples in which adversarial manipulation bypasses public-figure protection (Section 5.3) but the generated output…
Original abstract

Vision-language models (VLMs) are increasingly deployed as trusted authorities -- fact-checking images on social media, comparing products, and moderating content. Users implicitly trust that these systems perceive the same visual content as they do. We show that adversarial examples break this assumption, enabling AI authority laundering: an attacker subtly perturbs an image so that the VLM produces confident and authoritative responses about the wrong input. Unlike jailbreaks or prompt injections, our attacks do not compromise model alignment; the attack operates entirely at the perceptual level. We demonstrate that standard attacks against publicly available CLIP models transfer reliably to production VLMs -- including GPT-5.4, Claude Opus 4.6, Gemini 3, and Grok 4.2. Across four attack surfaces, we show that authority laundering can amplify misinformation, disparage individuals, evade content moderation, and manipulate product recommendations. Our attacks have high success rates: in hundreds of attacks targeting identity manipulation and NSFW evasion, we measure success rates of 22-100% across six models. No novel attack algorithm is required: basic techniques known for over a decade suffice, establishing a lower bound on attacker capability that should concern defenders. Our results demonstrate that visual adversarial robustness is now a practical -- and still largely unsolved -- safety problem.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that adversarial examples optimized on public CLIP models transfer reliably to closed production VLMs (including GPT-5.4, Claude Opus 4.6, Gemini 3, and Grok 4.2), enabling 'AI authority laundering' in which the VLM produces confident but incorrect authoritative responses about subtly perturbed images. This is shown across four attack surfaces (identity manipulation, misinformation amplification, content moderation evasion, product recommendation manipulation) using only standard attacks, with success rates of 22-100% measured over hundreds of attacks on six models.

Significance. If the transferability claim holds after adding proper controls, the result would be significant for AI safety: it establishes a practical lower bound on attacker capability against deployed VLMs used for fact-checking and moderation, showing that visual adversarial robustness remains unsolved. The work merits explicit credit for its breadth (multiple production models and attack surfaces) and for using only decade-old techniques rather than introducing new algorithms.

major comments (3)
  1. [Experimental results for identity manipulation and NSFW evasion] In the experimental results for identity manipulation and NSFW evasion (hundreds of attacks section): success rates of 22-100% are presented without baseline comparisons to unperturbed images or random perturbations. This is load-bearing because it leaves open whether outputs reflect perceptual transfer from CLIP or simply the VLMs' variable responses to ambiguous/low-quality inputs.
  2. [Transfer to production VLMs] In the transfer evaluation to production VLMs: success is measured solely by text matching to target 'wrong' descriptions under query-only access, but no controls for prompt phrasing, generation temperature, or multiple samples with error bars are described. This weakens the reliability claim for cross-model transfer (a minimal reporting sketch follows at the end of this report).
  3. [Attack surfaces and imperceptibility claim] In the description of attack surfaces and imperceptibility: the claim that perturbations 'subtly' affect the image while remaining non-obvious to humans lacks any human evaluation, SSIM/LPIPS metrics, or perceptual thresholds, which is central to distinguishing authority laundering from obvious tampering.
minor comments (2)
  1. [Abstract] Abstract: model version strings (GPT-5.4, Claude Opus 4.6) use non-standard notation; clarify whether these refer to specific API snapshots or are illustrative.
  2. [Abstract] Abstract: the statement that 'basic techniques known for over a decade suffice' is repeated in spirit; consolidate to strengthen the lower-bound framing.
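
Major comments 1 and 2 both ask for per-condition reporting with uncertainty. A minimal sketch of what that could look like, assuming boolean outcomes are collected from repeated independent queries for each condition (clean, random noise of matched L-infinity magnitude, adversarial); the criterion for judging a match is an assumption left outside the sketch.

```python
# Sketch of per-condition reporting: attack success rate (ASR) with a Wilson
# binomial confidence interval over repeated independent queries. Outcomes are
# booleans (True = the response matched the target description); how that match
# is judged is an assumption outside this sketch.
import math

def wilson_ci(successes: int, trials: int, z: float = 1.96):
    """95% Wilson score interval for a binomial proportion."""
    p = successes / trials
    denom = 1 + z ** 2 / trials
    center = (p + z ** 2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z ** 2 / (4 * trials ** 2)) / denom
    return center - half, center + half

def report_asr(outcomes):
    """Return the success rate and its confidence interval for one condition."""
    successes, trials = sum(outcomes), len(outcomes)
    lo, hi = wilson_ci(successes, trials)
    return successes / trials, (lo, hi)

# Illustrative use: the same reporting would be produced for clean images and
# for random perturbations of matched magnitude, not only for adversarial ones.
adversarial_outcomes = [True, True, False, True, True, True, False, True, True, True]
rate, (lo, hi) = report_asr(adversarial_outcomes)
print(f"adversarial ASR = {rate:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```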

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their careful reading and constructive feedback, which identifies key areas where additional controls and metrics will strengthen the experimental claims. We address each major comment below and commit to revisions that improve rigor without altering the core findings.

Point-by-point responses
  1. Referee: In the experimental results for identity manipulation and NSFW evasion (hundreds of attacks section): success rates of 22-100% are presented without baseline comparisons to unperturbed images or random perturbations. This is load-bearing because it leaves open whether outputs reflect perceptual transfer from CLIP or simply the VLMs' variable responses to ambiguous/low-quality inputs.

    Authors: We agree that baseline comparisons are essential to isolate the contribution of the transferred adversarial perturbations. In the revised manuscript we will add success rates for clean (unperturbed) images and for images subjected to random perturbations of comparable magnitude. These baselines, which we have already computed, demonstrate substantially lower rates of the target incorrect outputs, confirming that the observed results arise from perceptual transfer rather than model variability on ambiguous inputs. revision: yes

  2. Referee: In the transfer evaluation to production VLMs: success is measured solely by text matching to target 'wrong' descriptions under query-only access, but no controls for prompt phrasing, generation temperature, or multiple samples with error bars are described. This weakens the reliability claim for cross-model transfer.

    Authors: We acknowledge the value of additional controls for robustness. The revised manuscript will describe prompt-phrasing sensitivity checks, report results across multiple temperature settings where the production APIs allow, and include success rates with error bars computed from repeated independent queries. These additions will provide a clearer picture of transfer reliability under query-only access. revision: yes

  3. Referee: In the description of attack surfaces and imperceptibility: the claim that perturbations 'subtly' affect the image while remaining non-obvious to humans lacks any human evaluation, SSIM/LPIPS metrics, or perceptual thresholds, which is central to distinguishing authority laundering from obvious tampering.

    Authors: We concur that quantitative and human-centered evidence is required to support the imperceptibility claim. The revised version will report SSIM and LPIPS values for the perturbations, specify the perceptual thresholds used during attack generation, and include results from a human study in which participants attempt to distinguish original from perturbed images. These elements will more rigorously separate subtle authority laundering from obvious tampering. revision: yes
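
A hedged sketch of the perceptual metrics promised in this response, computing SSIM and LPIPS between a clean image and its perturbed counterpart. The file names are placeholders, and no thresholds from the paper are assumed.

```python
# Sketch of the imperceptibility metrics the authors commit to reporting:
# SSIM and LPIPS between the clean and the adversarially perturbed image.
# File names are placeholders; images must share the same resolution.
import numpy as np
import torch
import lpips                      # pip install lpips
from PIL import Image
from skimage.metrics import structural_similarity

clean = np.array(Image.open("clean.png").convert("RGB"))
adv = np.array(Image.open("adv.png").convert("RGB"))

# SSIM over RGB channels; values near 1.0 mean the images are nearly identical.
ssim_value = structural_similarity(clean, adv, channel_axis=-1, data_range=255)

# LPIPS expects float tensors in [-1, 1] shaped (N, 3, H, W); lower means more similar.
def to_lpips_tensor(a: np.ndarray) -> torch.Tensor:
    return torch.from_numpy(a).permute(2, 0, 1).float().div(127.5).sub(1).unsqueeze(0)

lpips_model = lpips.LPIPS(net="alex")
lpips_value = lpips_model(to_lpips_tensor(clean), to_lpips_tensor(adv)).item()

print(f"SSIM = {ssim_value:.4f}, LPIPS = {lpips_value:.4f}")
```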

Circularity Check

0 steps flagged

Empirical demonstration with no derivations or self-referential reductions

Full rationale

The paper is an empirical study demonstrating transfer of known adversarial attacks from public CLIP models to closed VLMs. No equations, predictions, or first-principles derivations are presented that could reduce to fitted inputs or self-citations. Claims rest on experimental success rates across models and attack surfaces, with no load-bearing self-citation chains or ansatzes. This is a standard non-circular empirical result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that adversarial perturbations developed for CLIP transfer to closed VLMs and that these perturbations can be made imperceptible to humans while altering model outputs.

axioms (1)
  • domain assumption: Adversarial examples exist and transfer across vision models.
    Invoked when stating that standard attacks against CLIP transfer to production VLMs.

pith-pipeline@v0.9.0 · 5542 in / 1244 out tokens · 67057 ms · 2026-05-08T17:15:19.975123+00:00 · methodology



    Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. 2023. Universal and transferable adversarial attacks on aligned language models.arXiv preprint arXiv:2307.15043(2023). Laundering AI Authority with Adversarial Examples Figure 11: Claude Opus 4.6’s response when asked to compare an AI-generated image of a woman (left) ...