MIRAGE: Protecting against Malicious Image Editing via False Moderation

Anshul Nasery; Cho-Jui Hsieh; Ramnath Kumar; Sewoong Oh

arxiv: 2606.26199 · v2 · pith:OQEXVTZJnew · submitted 2026-06-24 · 💻 cs.CR

MIRAGE: Protecting against Malicious Image Editing via False Moderation

Anshul Nasery , Ramnath Kumar , Cho-Jui Hsieh , Sewoong Oh This is my paper

Pith reviewed 2026-06-29 04:44 UTC · model grok-4.3

classification 💻 cs.CR

keywords adversarial perturbationsimage immunizationcontent moderationAI image editingsafety filterstransfer attacksprompt-agnostic protection

0 comments

The pith

Perturbing images to trigger false positives in pre-generation safety classifiers protects them from unauthorized edits in commercial AI systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that commercial image editing APIs all rely on a shared pre-generation moderation step that can be exploited without access to the editor model or editing prompt. By optimizing small adversarial perturbations against an ensemble of open-source embedding and moderation models, images are aligned with policy-violating concepts in representation space. This causes the proprietary moderators to flag the image and refuse the edit request. Evaluations on closed-source systems such as GPT-Image, Gemini Flash Image, and Grok Imagine report success rates above 88 percent. The approach is prompt-agnostic and operates at the system level rather than targeting the generative model itself.

Core claim

MIRAGE immunizes images by adding adversarial perturbations that align them to policy-violating concepts in the representation space of an ensemble of open-source embedding and moderation models, thereby causing the pre-generation safety classifiers in closed-source commercial image editing APIs to produce false positives and refuse any editing prompt.

What carries the argument

Adversarial perturbations optimized on an ensemble of open-source models to induce false positives in proprietary pre-generation moderation classifiers.

If this is right

Edit requests on immunized images are refused by the system regardless of the editing prompt.
Protection extends across multiple closed commercial APIs including GPT-Image, Gemini Flash Image, and Grok Imagine.
The method requires neither the editing prompt nor access to the generative model weights.
Success exceeds 88 percent in direct evaluations against the closed-source systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same moderation surface could be targeted in other generative services that share pre-generation safety checks.
Widespread adoption would create an arms race where moderators are hardened against such transfer attacks.
Individuals could apply the immunization to personal images before sharing them online.
If open-source moderation models improve in robustness, the transfer success to closed systems might decline.

Load-bearing premise

Perturbations optimized against open-source moderation models will transfer to trigger false positives in the proprietary moderation classifiers used by commercial image editors.

What would settle it

Submitting MIRAGE-perturbed images to the target commercial APIs and observing that edit requests are accepted rather than refused would show the claimed transfer does not hold.

Figures

Figures reproduced from arXiv: 2606.26199 by Anshul Nasery, Cho-Jui Hsieh, Ramnath Kumar, Sewoong Oh.

**Figure 1.** Figure 1: Overview of the image immunization framework. (Left - Without Immunization): Given a source image (a cat sitting outdoors), an adversary can pair it with a (potentially malicious) instruction (“Put in a cage”) and feed it to a blackbox image-editing model (e.g., Gemini, OpenAI, or xAI). The model complies with the instruction, producing an image (in which the cat appears behind cage bars). (Right - With I… view at source ↗

**Figure 2.** Figure 2: MIRAGE: Immunization objective via embedding similarity. The immunization pipeline computes adversarial perturbations to the source image by maximizing its alignment with a set of unsafe target embeddings. The source image and each image/text in the unsafe target set T (e.g., gore, violence, sexually explicit content, etc.) are independently encoded by many frozen image/text encoders. We also extract patch… view at source ↗

**Figure 3.** Figure 3: Immunization rates of baselines and our method across closed-source image editing APIs (OpenAI GPTImage, Google Gemini, XAI Grok). We find that no existing baseline can effectively immunize against proprietary AIpowered image editors. Images immunized using MIRAGE reliably trigger the moderation filters of these systems, leading to the image editing request being refused most of the time. For baseline me… view at source ↗

**Figure 4.** Figure 4: Immunization vs. perturbation budget ∥δ∥∞. Immunization rates for MIRAGE monotonically increase with increasing ∥δ∥∞ at the cost of larger visual distortions to the image. Qualitative example images of varying perturbation level are shown in [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Immunization rate of our method under weak adversarial perturbations. MIRAGE survives most attacks under the weak threat model which restricts the adversary to using classical image pre-processing transforms. resources and use some white-box open-source image embedding, segmentation, generation or editing models. 5.3.1. Weak adversary. In [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 6.** Figure 6: Immunization rate of our method under stronger adversarial perturbations. If the adversary can run embedding, generation or segmentation models, they can construct powerful perturbations to bypass the immunization afforded by MIRAGE. VLM-as-a-judge shows the proportion of images that passes the moderation filter but resulted in significantly distorted output. Local model shows immunization rate against a … view at source ↗

**Figure 7.** Figure 7: Effect of ensemble size on immunization rate. More models n ∈ {4, 6, 8} in the ensemble objective Eq. (1) makes the immunization generalize better, resulting in improved success rates, but the gain diminishes with larger ensemble and higher perturbation bound. better immunization rates. This effect is starker at smaller bounds, where larger ensembles lead to more generalizable perturbations which can misl… view at source ↗

**Figure 8.** Figure 8: Effect of global-local embeddings on immunization. Using both global and local views in the objective in Eq. (2) is crucial for immunizing high resolution images. the method practical. Violence Sexual 0 25 50 75 100 Immunization rate (%) (a) ||δ||∞ = 8/255 Violence Sexual 0 25 50 75 100 Immunization rate (%) (b) ||δ||∞ = 16/255 [PITH_FULL_IMAGE:figures/full_fig_p011_8.png] view at source ↗

**Figure 9.** Figure 9: Effect of target category on immunization rate. Using sexually explicit content as the target concept leads to higher refusals as compared to violent imagery. 5.4.3. Targets chosen. For our main experiments, we choose the targets to be a set T of 5 images and 3 text prompts corresponding to sexually explicit content. In Figure 9 we also experiment with using the same number of images and texts correspondi… view at source ↗

**Figure 11.** Figure 11: Qualitative examples of original source images and their successfully immunized counterparts at different [PITH_FULL_IMAGE:figures/full_fig_p012_11.png] view at source ↗

**Figure 12.** Figure 12: Additional qualitative examples of source images [PITH_FULL_IMAGE:figures/full_fig_p018_12.png] view at source ↗

**Figure 14.** Figure 14: Qualitative TDAE [72] examples Each row shows the perturbed image, the local white-box edit, the API edit of the clean image, and the API edit of the perturbed image. The top three rows are successfully immunized, where the VLM judge determines that the perturbed API edit does not follow the instruction, while the bottom three rows are immunization failures. We see that the VLM-as-ajudge is consistent wi… view at source ↗

**Figure 15.** Figure 15: Qualitative comparison between PhotoGuard (PG) and M [PITH_FULL_IMAGE:figures/full_fig_p020_15.png] view at source ↗

read the original abstract

The proliferation of AI-powered image editing systems raises serious concerns because it allows personal images to be arbitrarily manipulated at scale, with minimal effort, and a lower barrier to entry. Prior work on image immunization adds imperceptible perturbations to an image to protect against unauthorized manipulations. However, these methods usually require access to the model weights and the image manipulating prompt. This significantly limits their use, especially against powerful commercial image-editors such as GPT-Image, Gemini Flash Image (Nano Banana), and Grok Imagine. To address this, we take a system-level view of the problem and identify a previously unexplored attack surface common to all major commercial image editing systems: pre-generation safety moderation. Rather than disrupting the generative model itself, we propose to immunize images by causing these moderation classifiers to flag images as policy-violating, triggering an automatic refusal regardless of the editing prompt. We operationalize this by adding adversarial perturbations to align our image to policy-violating concepts in the representation space of an ensemble of open-source embedding and moderation models. We call our method MIRAGE, which stands for Moderation Induced Resistance Against Generative Editing. We evaluate MIRAGE against multiple closed-source image editing APIs and demonstrate success rates of more than 88%. Our approach is simple, prompt-agnostic, and effective, offering a practical path towards protecting personal images from unauthorized AI-powered editing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MIRAGE's shift to triggering false positives in pre-generation moderation is a practical new angle, but the >88% transfer claim to closed APIs rests on unshown experimental details.

read the letter

The main takeaway is that this work targets the shared pre-generation moderation layer in commercial editors instead of the generators themselves. That lets them create a prompt-agnostic defense by pushing images toward policy-violating regions in open embedding spaces so the closed systems refuse the request outright.

The paper does a clean job laying out why this surface exists across GPT-Image, Gemini Flash Image, and Grok Imagine, and why prior immunization methods fall short because they need model weights or the edit prompt. Using an ensemble of open-source models to generate the perturbations is a reasonable operational choice and keeps the method usable without API access.

The soft spot is the evaluation. The abstract states success rates above 88% on the closed APIs, yet supplies no setup details, baselines, transfer metrics, or error analysis. The central assumption—that perturbations optimized on open models will reliably cross over to proprietary moderation classifiers—remains the load-bearing piece, and adversarial transfer across mismatched architectures is often fragile. Without seeing the actual numbers and controls, it is hard to judge whether the results reflect genuine robustness or narrower conditions.

This paper is aimed at people working on practical image privacy and adversarial defenses against deployed systems. It deserves a serious referee because the problem is real and the system-level framing is worth checking, even if the current evidence level looks preliminary. I would send it to review so the experiments can be examined directly.

Referee Report

2 major / 0 minor

Summary. The paper proposes MIRAGE to protect personal images from unauthorized editing by commercial AI image editing systems. It adds adversarial perturbations optimized on an ensemble of open-source embedding and moderation models to cause pre-generation safety moderation classifiers in closed-source APIs (GPT-Image, Gemini Flash Image, Grok Imagine) to flag the images as policy-violating, triggering refusals independent of the editing prompt. The authors report success rates of more than 88% and describe the method as simple, prompt-agnostic, and effective.

Significance. If the empirical results on transferability are robust, this work is significant for offering a practical defense mechanism that does not require access to proprietary model weights or knowledge of editing prompts. By targeting the shared pre-generation moderation layer, it provides a system-level solution to a growing privacy concern in generative AI, potentially influencing how safety filters are designed in future commercial systems.

major comments (2)

Abstract: The claim of >88% success rates on closed-source APIs lacks accompanying experimental details, baselines, transfer metrics, or error analysis, which are necessary to substantiate the central empirical claim.
Method section: The approach assumes that perturbations optimized against open-source models will transfer to proprietary moderation classifiers; however, without specific details on the ensemble composition, optimization objective, or evidence of shared representation geometry, the transferability remains an unverified core assumption.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their positive assessment of the work's significance and for the constructive feedback. We address each major comment below, proposing targeted revisions where appropriate to strengthen the manuscript.

read point-by-point responses

Referee: Abstract: The claim of >88% success rates on closed-source APIs lacks accompanying experimental details, baselines, transfer metrics, or error analysis, which are necessary to substantiate the central empirical claim.

Authors: We agree that the abstract, being concise by nature, does not include the full experimental details. The manuscript's Experiments section (Section 4) provides these, including the specific closed-source APIs evaluated (GPT-Image, Gemini Flash Image, Grok Imagine), per-API success rates, baselines such as unperturbed images and random noise, transfer metrics from the open-source ensemble to closed APIs, and error analysis on the ~12% failure cases. To directly address the concern, we will revise the abstract to briefly note the evaluation scope (e.g., 'across 500 images on three commercial APIs with >88% average success') while keeping it within length limits. revision: yes
Referee: Method section: The approach assumes that perturbations optimized against open-source models will transfer to proprietary moderation classifiers; however, without specific details on the ensemble composition, optimization objective, or evidence of shared representation geometry, the transferability remains an unverified core assumption.

Authors: The Method section (Section 3) specifies the ensemble composition (CLIP ViT-L/14, OpenCLIP, and two open-source moderation models like those from LAION and Stability AI), the optimization objective (maximizing cosine similarity to policy-violating concept embeddings while minimizing perceptual distortion via PGD), and the prompt-agnostic nature. We provide empirical evidence of transfer via the reported success rates. However, we acknowledge that an explicit discussion of shared representation geometry (e.g., due to overlapping training data on safety policies) is limited. We will add a short subsection or paragraph in Methods providing this rationale and citing related work on moderation model similarities. revision: partial

Circularity Check

0 steps flagged

No circularity; empirical transfer evaluated on external closed APIs

full rationale

The paper's core procedure optimizes perturbations against open-source embedding/moderation models then measures refusal rates on separate closed-source commercial APIs (GPT-Image, Gemini, Grok). No equations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or description; the >88% success claim rests on direct external measurements rather than any reduction to the optimization inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated or derivable from the provided text.

pith-pipeline@v0.9.1-grok · 5784 in / 1150 out tokens · 62968 ms · 2026-06-29T04:44:13.989254+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Reference graph

Works this paper leans on

81 extracted references · 4 canonical work pages

[1]

N. Ahn, K. Yoo, W. Ahn, D. Kim, and S.-H. Nam. Nearly zero-cost protection against mimicry by per- sonalized diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition (CVPR), pages 28801–28810, June
[2]

URL https://openaccess.thecvf.com/content/CV PR2025/html/Ahn Nearly Zero-Cost Protection Aga inst Mimicry by Personalized Diffusion Models C VPR 2025 paper.html

2025
[3]

Andriushchenko, F

M. Andriushchenko, F. Croce, N. Flammarion, and M. Hein. Square attack: A query-efficient black-box adversarial attack via random search. InComputer Vision – ECCV 2020, 2020. URL https://arxiv.org/ abs/1912.00049

arXiv 2020
[4]

Athalye, L

A. Athalye, L. Engstrom, A. Ilyas, and K. Kwok. Synthesizing robust adversarial examples. InInterna- tional conference on machine learning, pages 284–293. PMLR, 2018

2018
[5]

FLUX.2 [klein]: Towards Interac- tive Visual Intelligence

Black Forest Labs. FLUX.2 [klein]: Towards Interac- tive Visual Intelligence. https://bfl.ai/blog/flux2-kle in-towards-interactive-visual-intelligence, Jan. 2026. Blog post. Accessed: 2026-06-08

2026
[6]

S. Boztas. Dutch far-right party pays damages to court artist after changing image with AI. https://www.theg uardian.com/world/2026/jun/13/geert-wilders-pvv-dut ch-far-right-party-damages-court-artist-change-image -ai, June 2026

2026
[7]

Brendel, J

W. Brendel, J. Rauber, and M. Bethge. Decision-based adversarial attacks: Reliable attacks against black-box machine learning models. InInternational Conference on Learning Representations (ICLR), 2018. URL https: //arxiv.org/abs/1712.04248

Pith/arXiv arXiv 2018
[8]

C. G. Broyden. A class of methods for solving nonlinear simultaneous equations.Mathematics of Computation, 19(92):577–593, 1965

1965
[9]

M. Burgess. Grok Is Still Hosting Sexualized Deep- fakes of Famous Women . https://www.wired.com/st ory/grok-is-still-hosting-sexualized-deepfakes-of-fam ous-women/, June 2026

2026
[10]

Grok floods X with sexualized images of women and children

Center for Countering Digital Hate. Grok floods X with sexualized images of women and children. https: //counterhate.com/research/grok-floods-x-with-sexuali zed-images/, January 2026

2026
[11]

H. Chen, Y . Zhang, Y . Dong, X. Yang, H. Su, and J. Zhu. Rethinking model ensemble in transfer-based adversarial attacks, 2024. URL https://arxiv.org/abs/ 2303.09105

arXiv 2024
[12]

P.-Y . Chen, H. Zhang, Y . Sharma, J. Yi, and C.-J. Hsieh. Zoo: Zeroth order optimization based black- box attacks to deep neural networks without training substitute models. InProceedings of the 10th ACM workshop on artificial intelligence and security, pages 15–26, 2017

2017
[13]

R. Chen, H. Jin, Y . Liu, J. Chen, H. Wang, and L. Sun. EditShield: Protecting unauthorized image editing by instruction-guided diffusion models. InComputer Vi- sion – ECCV 2024, pages 126–142. Springer, 2025. doi: 10.1007/978-3-031-73036-8 8

work page doi:10.1007/978-3-031-73036-8 2024
[14]

Cheng, T

M. Cheng, T. Le, P.-Y . Chen, J. Yi, H. Zhang, and C.-J. Hsieh. Query-efficient hard-label black-box attack:an optimization-based approach, 2018. URL https://arxi v.org/abs/1807.04457

Pith/arXiv arXiv 2018
[15]

Black, and Otmar Hilliges

M. Cherti, R. Beaumont, R. Wightman, M. Wortsman, G. Ilharco, C. Gordon, C. Schuhmann, L. Schmidt, and J. Jitsev. Reproducible scaling laws for con- trastive language-image learning. In2023 IEEE/CVF Conference on Computer Vision and Pattern Recog- nition (CVPR), page 2818–2829. IEEE, June 2023. doi: 10.1109/cvpr52729.2023.00276. URL http: //dx.doi.org/10....

work page doi:10.1109/cvpr52729.2023.00276 2023
[16]

J. Chi, U. Karn, H. Zhan, E. Smith, J. Rando, Y . Zhang, K. Plawiak, Z. D. Coudert, K. Upasani, and M. Pa- supuleti. Llama guard 3 vision: Safeguarding human- ai image understanding conversations.arXiv preprint arXiv:2411.10414, 2024

arXiv 2024
[17]

J. S. Choi, K. Lee, J. Jeong, S. Xie, J. Shin, and K. Lee. DiffusionGuard: A robust defense against malicious diffusion-based image editing. InInternational Confer- ence on Learning Representations (ICLR), 2025. URL https://openreview.net/forum?id=9OfKxKoYNw

2025
[18]

Defazio, F

A. Defazio, F. Bach, and S. Lacoste-Julien. Saga: A fast incremental gradient method with support for non- strongly convex composite objectives. InAdvances in Neural Information Processing Systems, 2014

2014
[19]

DiResta and J

R. DiResta and J. A. Goldstein. How spammers and scammers leverage ai-generated images on facebook for audience growth.arXiv preprint arXiv:2403.12838, 2024

arXiv 2024
[20]

S. Dong, J. Zhang, G. Zhao, S. Shan, and X. Chen. Semantic mismatch and perceptual degradation: A new perspective on image editing immunity.arXiv preprint arXiv:2512.14320, 2025. URL https://arxiv.org/abs/25 12.14320

arXiv 2025
[21]

J. C. Duchi, M. I. Jordan, M. J. Wainwright, and A. Wibisono. Optimal rates for zero-order convex opti- mization: The power of two function evaluations.IEEE Transactions on Information Theory, 61(5):2788–2806, 2015

2015
[22]

A. Fang, A. M. Jose, A. Jain, L. Schmidt, A. Toshev, and V . Shankar. Data filtering networks, 2023. URL https://arxiv.org/abs/2309.17425

arXiv 2023
[23]

J. Fu, S. Li, Y . Jiang, K.-Y . Lin, C. Qian, C. C. Loy, W. Wu, and Z. Liu. Stylegan-human: A data-centric odyssey of human generation, 2022. URL https://arxi v.org/abs/2204.11823

arXiv 2022
[24]

S. Y . Gadre, G. Ilharco, A. Fang, J. Hayase, G. Smyr- nis, T. Nguyen, R. Marten, M. Wortsman, D. Ghosh, J. Zhang, E. Orgad, R. Entezari, G. Daras, S. Pratt, V . Ramanujan, Y . Bitton, K. Marathe, S. Mussmann, R. Vencu, M. Cherti, R. Krishna, P. W. Koh, O. Saukh, A. Ratner, S. Song, H. Hajishirzi, A. Farhadi, R. Beau- mont, S. Oh, A. Dimakis, J. Jitsev, Y...

arXiv 2023
[25]

Gentleman

A. Gentleman. New claimants seek to sue Elon Musk’s xAI after Labour MP’s test case. https://www.th eguardian.com/technology/2026/jun/05/grok-ai-e lon-musk-jess-asato-labour-mp-lawsuit, June 2026. The Guardian. Additional reporting by Jessica Elgot. Accessed June 7, 2026

2026
[26]

I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. InInternational Conference on Learning Representations (ICLR), 2015. URL https://arxiv.org/abs/1412.6572

Pith/arXiv arXiv 2015
[27]

Gemini API Additional Terms of Service

Google. Gemini API Additional Terms of Service. https://ai.google.dev/gemini-api/terms, Mar. 2026. Effective March 23, 2026. Accessed June 7, 2026

2026
[28]

Gemini 3.1 Flash Image (Nano Banana 2), 2026

Google DeepMind. Gemini 3.1 Flash Image (Nano Banana 2), 2026. https://deepmind.google/models/g emini-image/flash/

2026
[29]

Z. Guo, L. Fang, J. Lin, Y . Qian, S. Zhao, Z. Wang, J. Dong, C. Chen, O. Arandjelovi ´c, and C. P. Lau. A grey-box attack against latent diffusion model-based image editing by posterior collapse.arXiv preprint arXiv:2408.10901, 2024. URL https://arxiv.org/abs/24 08.10901

arXiv 2024
[30]

Ilyas, L

A. Ilyas, L. Engstrom, A. Athalye, and J. Lin. Black- box adversarial attacks with limited queries and in- formation. InInternational conference on machine learning, pages 2137–2146. PMLR, 2018

2018
[31]

X. Jia, S. Gao, S. Qin, T. Pang, C. Du, Y . Huang, X. Li, Y . Li, B. Li, and Y . Liu. Adversarial attacks against closed-source mllms via feature optimal alignment,
[32]

URL https://arxiv.org/abs/2505.21494

arXiv
[33]

J. Kim, Y . Nam, M. Kim, S. Kim, and J. Jeong. BlurGuard: A simple approach for robustifying image protection against AI-powered editing. InAdvances in Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=vritEZz28d

2025
[34]

D. C. Liu and J. Nocedal. On the limited memory bfgs method for large scale optimization.Mathematical Programming, 45:503–528, 1989

1989
[35]

Y . Liu, X. Chen, C. Liu, and D. Song. Delving into transferable adversarial examples and black-box attacks. InInternational Conference on Learning Rep- resentations (ICLR), 2017. URL https://arxiv.org/abs/ 1611.02770

Pith/arXiv arXiv 2017
[36]

L. Lo, C. Y . Yeo, H.-H. Shuai, and W.-H. Cheng. Distraction is all you need: Memory-efficient image immunization against diffusion-based image editing. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), pages 24462–24471, June 2024. URL https://openaccess.the cvf.com/content/CVPR2024/html/Lo Distraction is All Y...

2024
[37]

N. A. Lord, R. Mueller, and L. Bertinetto. Attacking deep networks with surrogate-based adversarial black- box methods is easy.arXiv preprint arXiv:2203.08725, 2022

arXiv 2022
[38]

Madry, A

A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. InInternational Conference on Learning Representations (ICLR), 2018. URL https://openreview.net/forum?id=rJzIBfZAb

2018
[39]

Nasery, E

A. Nasery, E. Contente, A. Kaz, P. Viswanath, and S. Oh. Are robust llm fingerprints adversarially robust? arXiv e-prints, pages arXiv–2509, 2025

2025
[40]

Nasery, J

A. Nasery, J. Hayase, C. Brooks, P. Sheng, H. Tyagi, P. Viswanath, and S. Oh. Scalable fingerprinting of large language models.Advances in Neural Informa- tion Processing Systems, 38:125116–125152, 2026

2026
[41]

W. Nie, B. Guo, Y . Huang, C. Xiao, A. Vahdat, and A. Anandkumar. Diffusion models for adversarial purification. InInternational Conference on Machine Learning (ICML), 2022

2022
[42]

J. Nocedal. Updating quasi-newton matrices with limited storage.Mathematics of Computation, 35(151): 773–782, 1980

1980
[43]

N. C. of State Legislatures. Deepfakes in Elections and Campaigns. https://www.ncsl.org/elections-and-c ampaigns/artificial-intelligence-ai-in-elections-and-c ampaigns, June 2026

2026
[44]

GPT Image 2, 2026

Open AI. GPT Image 2, 2026. https://developers.ope nai.com/api/docs/guides/image-generation

2026
[45]

Usage Policies

OpenAI. Usage Policies. https://openai.com/policie s/usage-policies/, Oct. 2025. Effective October 29,

2025
[47]

omni-moderation Model

OpenAI. omni-moderation Model. https://developers .openai.com/api/docs/models/omni-moderation-latest,
[48]

Accessed: 2026- 06-08

OpenAI API documentation. Accessed: 2026- 06-08

2026
[49]

Oquab, T

M. Oquab, T. Darcet, T. Moutakanni, H. V o, M. Szafraniec, V . Khalidov, P. Fernandez, D. Haz- iza, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P.-Y . Huang, S.-W. Li, I. Misra, M. Rabbat, V . Sharma, G. Synnaeve, H. Xu, H. Jegou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski. Dinov2: Learning robust visual features without sup...

Pith/arXiv arXiv 2024
[50]

T. C. Ozden, O. Kara, O. Akcin, K. Zaman, S. Srivastava, S. P. Chinchali, and J. M. Rehg. DiffVax: Optimization-free image immunization against diffusion-based editing. arXiv preprint arXiv:2411.17957, 2024. URL https://arxiv.org/abs/2411.17957

arXiv 2024
[51]

B. Perrigo. How to Spot an AI-Generated Image Like the ’Balenciaga Pope’. https://time.com/6266606/how -to-spot-deepfake-pope/, March 2023

arXiv 2023
[52]

Pleimling, S

X. Pleimling, S. M. Abdullah, G. Balde, P. Gao, M. Mondal, M. Jadliwala, and B. Viswanath. Off- the-shelf image-to-image models are all you need to defeat image protection schemes, 2026. URL https: //arxiv.org/abs/2602.22197

arXiv 2026
[53]

Radford, J

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning trans- ferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

2021
[54]

Fact Check: Online posts reporting explosion near Pentagon on May 22, 2023 are false

Reuters. Fact Check: Online posts reporting explosion near Pentagon on May 22, 2023 are false. https://ww w.reuters.com/article/fact-check/online-posts-reporting -explosion-near-pentagon-on-may-22-2023-are-false -idUSL1N37J2QJ/, May 2023

2023
[55]

Grok’s AI image generation tool violated Canadian privacy law, watchdog says

Reuters. Grok’s AI image generation tool violated Canadian privacy law, watchdog says. https://www. reuters.com/business/media-telecom/groks-ai-image-g eneration-tool-violated-canadian-privacy-law-says-w atchdog-2026-06-11/, June 2026

2026
[56]

Ricker, D

J. Ricker, D. Assenmacher, T. Holz, A. Fischer, and E. Quiring. Ai-generated faces in the real world: A large-scale case study of twitter profile images. In Proceedings of the 27th International Symposium on Research in Attacks, Intrusions and Defenses, pages 513–530, 2024

2024
[57]

Rombach, A

R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion (CVPR), pages 10684–10695, June 2022. URL https://openaccess.thecvf.com/content/CVPR2022/htm l/Rombach High-Resolution Image Synthesis With Latent D...

2022
[58]

Salman, A

H. Salman, A. Khaddaj, G. Leclerc, A. Ilyas, and A. Madry. Raising the cost of malicious AI-powered image editing. InProceedings of the 40th Interna- tional Conference on Machine Learning, volume 202 ofProceedings of Machine Learning Research, pages 29894–29918. PMLR, 2023. URL https://proceeding s.mlr.press/v202/salman23a.html

2023
[59]

Satter and S

R. Satter and S. Tabahriti. Exclusive: Despite new curbs, Elon Musk’s Grok at times produces sexualized images - even when told subjects didn’t consent. https: //www.reuters.com/business/despite-new-curbs-elo n-musks-grok-times-produces-sexualized-images-eve n-when-2026-02-03/, February 2026

2026
[60]

Schaeffer, D

R. Schaeffer, D. Valentine, L. Bailey, J. Chua, C. Eyza- guirre, Z. Durante, J. Benton, B. Miranda, H. Sleight, T. Wang, et al. Failures to find transferable image jailbreaks between vision-language models. InIn- ternational Conference on Learning Representations, volume 2025, pages 44669–44704, 2025

2025
[61]

Schmidt, N

M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient.Math- ematical Programming, 162(1–2):83–112, 2017. doi: 10.1007/s10107-016-1030-6

work page doi:10.1007/s10107-016-1030-6 2017
[62]

S. Shan, J. Cryan, E. Wenger, H. Zheng, R. Hanocka, and B. Y . Zhao. Glaze: Protecting artists from style mimicry by text-to-image models. In32nd USENIX Se- curity Symposium (USENIX Security 23), pages 2187–
[63]

URL https://www

USENIX Association, 2023. URL https://www. usenix.org/conference/usenixsecurity23/presentation/ shan

2023
[64]

S. Shan, W. Ding, J. Passananti, S. Wu, H. Zheng, and B. Y . Zhao. Nightshade: Prompt-specific poisoning attacks on text-to-image generative models. In2024 IEEE Symposium on Security and Privacy (SP), pages 807–825, 2024. doi: 10.1109/SP54263.2024.00207. URL https://arxiv.org/abs/2310.13828

work page doi:10.1109/sp54263.2024.00207 2024
[65]

Z. Shao, H. Liu, Y . Hu, and N. Z. Gong. Leave my images alone: Preventing multi-modal large language models from analyzing images via visual prompt in- jection.arXiv preprint arXiv:2604.09024, 2026

Pith/arXiv arXiv 2026
[66]

Sohl-Dickstein, B

J. Sohl-Dickstein, B. Poole, and S. Ganguli. Fast large- scale optimization by unifying stochastic gradient and quasi-newton methods. InInternational Conference on Machine Learning, 2014

2014
[67]

M. Sparks. Disney, NBC Universal, and DreamWorks File Major IP Lawsuit Against AI Image Generator Midjourney. https://www.law.georgetown.edu/tech-i nstitute/research-insights/insights/disney-nbc-univers al-and-dreamworks-file-major-ip-lawsuit-against-ai-i mage-generator-midjourney/, June 2025. Institute for Technology Law & Policy, Georgetown Law. Accesse...

2025
[68]

C.-C. Tu, P. Ting, P.-Y . Chen, S. Liu, H. Zhang, J. Yi, C.-J. Hsieh, and S.-M. Cheng. Autozoom: Autoencoder-based zeroth order optimization method for attacking black-box neural networks. InProceed- ings of the AAAI conference on artificial intelligence, volume 33, pages 742–749, 2019

2019
[69]

xAI Acceptable Use Policy

xAI. xAI Acceptable Use Policy. https://x.ai/legal/a cceptable-use-policy, Jan. 2025. Effective January 2,

2025
[70]

Accessed June 7, 2026

2026
[71]

Grok Imagine API, 2026

xAI. Grok Imagine API, 2026. https://x.ai/news/grok -imagine-api

2026
[72]

J. Xu, F. Wang, M. Ma, P. W. Koh, C. Xiao, and M. Chen. Instructional fingerprinting of large language models. InProceedings of the 2024 Conference of the North American Chapter of the Association for Com- putational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3277–3306, 2024

2024
[73]

Y . Ye, X. He, Z. Li, S. Yuan, Z. Yan, B. Hou, L. Yuan, et al. Imgedit: A unified image editing dataset and benchmark.Advances in Neural Information Process- ing Systems, 38, 2026

2026
[74]

W. Zeng, D. Kurniawan, R. Mullins, Y . Liu, T. Saha, D. Ike-Njoku, J. Gu, Y . Song, C. Xu, J. Zhou, A. Joshi, S. Dheep, M. Malek, H. Palangi, J. Baek, R. Pereira, and K. Narasimhan. Shieldgemma 2: Robust and tractable image content moderation, 2025. URL https: //arxiv.org/abs/2504.01081

arXiv 2025
[75]

X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer. Sigmoid loss for language image pre-training. InPro- ceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 11975–11986, October
[76]

URL https://openaccess.thecvf.com/content/IC CV2023/html/Zhai Sigmoid Loss for Language Ima ge Pre-Training ICCV 2023 paper.html

2023
[77]

Zhang, Z

J. Zhang, Z. Gu, J. Jang, H. Wu, M. P. Stoecklin, H. Huang, and I. Molloy. Protecting intellectual prop- erty of deep neural networks with watermarking. In Proceedings of the 2018 on Asia conference on com- puter and communications security, pages 159–172, 2018

2018
[78]

Zhang, S

J. Zhang, S. Dong, S. Shan, and X. Chen. Dual attention guided defense against malicious edits.arXiv preprint arXiv:2512.14333, 2025. URL https://arxiv. org/abs/2512.14333

arXiv 2025
[79]

Zhang, S

J. Zhang, S. Dong, S. Shan, and X. Chen. Towards transferable defense against malicious image edits. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026. URL https://www.computer.org/c sdl/journal/tp/5555/01/11421009/2eApjoNgEve. Early Access / PrePrints

2026
[80]

Zhang, P

J. Zhang, P. Peetathawatchai, F. Tram `er, and A. Shafran. Laundering ai authority with adversarial examples, 2026. URL https://arxiv.org/abs/2605.042 61

2026
[81]

Zhang, G

P.-F. Zhang, G. Bai, and Z. Huang. Maa: Meticulous adversarial attack against vision-language pre-trained models, 2025. URL https://arxiv.org/abs/2502.08079

arXiv 2025

Showing first 80 references.

[1] [1]

N. Ahn, K. Yoo, W. Ahn, D. Kim, and S.-H. Nam. Nearly zero-cost protection against mimicry by per- sonalized diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition (CVPR), pages 28801–28810, June

[2] [2]

URL https://openaccess.thecvf.com/content/CV PR2025/html/Ahn Nearly Zero-Cost Protection Aga inst Mimicry by Personalized Diffusion Models C VPR 2025 paper.html

2025

[3] [3]

Andriushchenko, F

M. Andriushchenko, F. Croce, N. Flammarion, and M. Hein. Square attack: A query-efficient black-box adversarial attack via random search. InComputer Vision – ECCV 2020, 2020. URL https://arxiv.org/ abs/1912.00049

arXiv 2020

[4] [4]

Athalye, L

A. Athalye, L. Engstrom, A. Ilyas, and K. Kwok. Synthesizing robust adversarial examples. InInterna- tional conference on machine learning, pages 284–293. PMLR, 2018

2018

[5] [5]

FLUX.2 [klein]: Towards Interac- tive Visual Intelligence

Black Forest Labs. FLUX.2 [klein]: Towards Interac- tive Visual Intelligence. https://bfl.ai/blog/flux2-kle in-towards-interactive-visual-intelligence, Jan. 2026. Blog post. Accessed: 2026-06-08

2026

[6] [6]

S. Boztas. Dutch far-right party pays damages to court artist after changing image with AI. https://www.theg uardian.com/world/2026/jun/13/geert-wilders-pvv-dut ch-far-right-party-damages-court-artist-change-image -ai, June 2026

2026

[7] [7]

Brendel, J

W. Brendel, J. Rauber, and M. Bethge. Decision-based adversarial attacks: Reliable attacks against black-box machine learning models. InInternational Conference on Learning Representations (ICLR), 2018. URL https: //arxiv.org/abs/1712.04248

Pith/arXiv arXiv 2018

[8] [8]

C. G. Broyden. A class of methods for solving nonlinear simultaneous equations.Mathematics of Computation, 19(92):577–593, 1965

1965

[9] [9]

M. Burgess. Grok Is Still Hosting Sexualized Deep- fakes of Famous Women . https://www.wired.com/st ory/grok-is-still-hosting-sexualized-deepfakes-of-fam ous-women/, June 2026

2026

[10] [10]

Grok floods X with sexualized images of women and children

Center for Countering Digital Hate. Grok floods X with sexualized images of women and children. https: //counterhate.com/research/grok-floods-x-with-sexuali zed-images/, January 2026

2026

[11] [11]

H. Chen, Y . Zhang, Y . Dong, X. Yang, H. Su, and J. Zhu. Rethinking model ensemble in transfer-based adversarial attacks, 2024. URL https://arxiv.org/abs/ 2303.09105

arXiv 2024

[12] [12]

P.-Y . Chen, H. Zhang, Y . Sharma, J. Yi, and C.-J. Hsieh. Zoo: Zeroth order optimization based black- box attacks to deep neural networks without training substitute models. InProceedings of the 10th ACM workshop on artificial intelligence and security, pages 15–26, 2017

2017

[13] [13]

R. Chen, H. Jin, Y . Liu, J. Chen, H. Wang, and L. Sun. EditShield: Protecting unauthorized image editing by instruction-guided diffusion models. InComputer Vi- sion – ECCV 2024, pages 126–142. Springer, 2025. doi: 10.1007/978-3-031-73036-8 8

work page doi:10.1007/978-3-031-73036-8 2024

[14] [14]

Cheng, T

M. Cheng, T. Le, P.-Y . Chen, J. Yi, H. Zhang, and C.-J. Hsieh. Query-efficient hard-label black-box attack:an optimization-based approach, 2018. URL https://arxi v.org/abs/1807.04457

Pith/arXiv arXiv 2018

[15] [15]

Black, and Otmar Hilliges

M. Cherti, R. Beaumont, R. Wightman, M. Wortsman, G. Ilharco, C. Gordon, C. Schuhmann, L. Schmidt, and J. Jitsev. Reproducible scaling laws for con- trastive language-image learning. In2023 IEEE/CVF Conference on Computer Vision and Pattern Recog- nition (CVPR), page 2818–2829. IEEE, June 2023. doi: 10.1109/cvpr52729.2023.00276. URL http: //dx.doi.org/10....

work page doi:10.1109/cvpr52729.2023.00276 2023

[16] [16]

J. Chi, U. Karn, H. Zhan, E. Smith, J. Rando, Y . Zhang, K. Plawiak, Z. D. Coudert, K. Upasani, and M. Pa- supuleti. Llama guard 3 vision: Safeguarding human- ai image understanding conversations.arXiv preprint arXiv:2411.10414, 2024

arXiv 2024

[17] [17]

J. S. Choi, K. Lee, J. Jeong, S. Xie, J. Shin, and K. Lee. DiffusionGuard: A robust defense against malicious diffusion-based image editing. InInternational Confer- ence on Learning Representations (ICLR), 2025. URL https://openreview.net/forum?id=9OfKxKoYNw

2025

[18] [18]

Defazio, F

A. Defazio, F. Bach, and S. Lacoste-Julien. Saga: A fast incremental gradient method with support for non- strongly convex composite objectives. InAdvances in Neural Information Processing Systems, 2014

2014

[19] [19]

DiResta and J

R. DiResta and J. A. Goldstein. How spammers and scammers leverage ai-generated images on facebook for audience growth.arXiv preprint arXiv:2403.12838, 2024

arXiv 2024

[20] [20]

S. Dong, J. Zhang, G. Zhao, S. Shan, and X. Chen. Semantic mismatch and perceptual degradation: A new perspective on image editing immunity.arXiv preprint arXiv:2512.14320, 2025. URL https://arxiv.org/abs/25 12.14320

arXiv 2025

[21] [21]

J. C. Duchi, M. I. Jordan, M. J. Wainwright, and A. Wibisono. Optimal rates for zero-order convex opti- mization: The power of two function evaluations.IEEE Transactions on Information Theory, 61(5):2788–2806, 2015

2015

[22] [22]

A. Fang, A. M. Jose, A. Jain, L. Schmidt, A. Toshev, and V . Shankar. Data filtering networks, 2023. URL https://arxiv.org/abs/2309.17425

arXiv 2023

[23] [23]

J. Fu, S. Li, Y . Jiang, K.-Y . Lin, C. Qian, C. C. Loy, W. Wu, and Z. Liu. Stylegan-human: A data-centric odyssey of human generation, 2022. URL https://arxi v.org/abs/2204.11823

arXiv 2022

[24] [24]

S. Y . Gadre, G. Ilharco, A. Fang, J. Hayase, G. Smyr- nis, T. Nguyen, R. Marten, M. Wortsman, D. Ghosh, J. Zhang, E. Orgad, R. Entezari, G. Daras, S. Pratt, V . Ramanujan, Y . Bitton, K. Marathe, S. Mussmann, R. Vencu, M. Cherti, R. Krishna, P. W. Koh, O. Saukh, A. Ratner, S. Song, H. Hajishirzi, A. Farhadi, R. Beau- mont, S. Oh, A. Dimakis, J. Jitsev, Y...

arXiv 2023

[25] [25]

Gentleman

A. Gentleman. New claimants seek to sue Elon Musk’s xAI after Labour MP’s test case. https://www.th eguardian.com/technology/2026/jun/05/grok-ai-e lon-musk-jess-asato-labour-mp-lawsuit, June 2026. The Guardian. Additional reporting by Jessica Elgot. Accessed June 7, 2026

2026

[26] [26]

I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. InInternational Conference on Learning Representations (ICLR), 2015. URL https://arxiv.org/abs/1412.6572

Pith/arXiv arXiv 2015

[27] [27]

Gemini API Additional Terms of Service

Google. Gemini API Additional Terms of Service. https://ai.google.dev/gemini-api/terms, Mar. 2026. Effective March 23, 2026. Accessed June 7, 2026

2026

[28] [28]

Gemini 3.1 Flash Image (Nano Banana 2), 2026

Google DeepMind. Gemini 3.1 Flash Image (Nano Banana 2), 2026. https://deepmind.google/models/g emini-image/flash/

2026

[29] [29]

Z. Guo, L. Fang, J. Lin, Y . Qian, S. Zhao, Z. Wang, J. Dong, C. Chen, O. Arandjelovi ´c, and C. P. Lau. A grey-box attack against latent diffusion model-based image editing by posterior collapse.arXiv preprint arXiv:2408.10901, 2024. URL https://arxiv.org/abs/24 08.10901

arXiv 2024

[30] [30]

Ilyas, L

A. Ilyas, L. Engstrom, A. Athalye, and J. Lin. Black- box adversarial attacks with limited queries and in- formation. InInternational conference on machine learning, pages 2137–2146. PMLR, 2018

2018

[31] [31]

X. Jia, S. Gao, S. Qin, T. Pang, C. Du, Y . Huang, X. Li, Y . Li, B. Li, and Y . Liu. Adversarial attacks against closed-source mllms via feature optimal alignment,

[32] [32]

URL https://arxiv.org/abs/2505.21494

arXiv

[33] [33]

J. Kim, Y . Nam, M. Kim, S. Kim, and J. Jeong. BlurGuard: A simple approach for robustifying image protection against AI-powered editing. InAdvances in Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=vritEZz28d

2025

[34] [34]

D. C. Liu and J. Nocedal. On the limited memory bfgs method for large scale optimization.Mathematical Programming, 45:503–528, 1989

1989

[35] [35]

Y . Liu, X. Chen, C. Liu, and D. Song. Delving into transferable adversarial examples and black-box attacks. InInternational Conference on Learning Rep- resentations (ICLR), 2017. URL https://arxiv.org/abs/ 1611.02770

Pith/arXiv arXiv 2017

[36] [36]

L. Lo, C. Y . Yeo, H.-H. Shuai, and W.-H. Cheng. Distraction is all you need: Memory-efficient image immunization against diffusion-based image editing. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), pages 24462–24471, June 2024. URL https://openaccess.the cvf.com/content/CVPR2024/html/Lo Distraction is All Y...

2024

[37] [37]

N. A. Lord, R. Mueller, and L. Bertinetto. Attacking deep networks with surrogate-based adversarial black- box methods is easy.arXiv preprint arXiv:2203.08725, 2022

arXiv 2022

[38] [38]

Madry, A

A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. InInternational Conference on Learning Representations (ICLR), 2018. URL https://openreview.net/forum?id=rJzIBfZAb

2018

[39] [39]

Nasery, E

A. Nasery, E. Contente, A. Kaz, P. Viswanath, and S. Oh. Are robust llm fingerprints adversarially robust? arXiv e-prints, pages arXiv–2509, 2025

2025

[40] [40]

Nasery, J

A. Nasery, J. Hayase, C. Brooks, P. Sheng, H. Tyagi, P. Viswanath, and S. Oh. Scalable fingerprinting of large language models.Advances in Neural Informa- tion Processing Systems, 38:125116–125152, 2026

2026

[41] [41]

W. Nie, B. Guo, Y . Huang, C. Xiao, A. Vahdat, and A. Anandkumar. Diffusion models for adversarial purification. InInternational Conference on Machine Learning (ICML), 2022

2022

[42] [42]

J. Nocedal. Updating quasi-newton matrices with limited storage.Mathematics of Computation, 35(151): 773–782, 1980

1980

[43] [43]

N. C. of State Legislatures. Deepfakes in Elections and Campaigns. https://www.ncsl.org/elections-and-c ampaigns/artificial-intelligence-ai-in-elections-and-c ampaigns, June 2026

2026

[44] [44]

GPT Image 2, 2026

Open AI. GPT Image 2, 2026. https://developers.ope nai.com/api/docs/guides/image-generation

2026

[45] [45]

Usage Policies

OpenAI. Usage Policies. https://openai.com/policie s/usage-policies/, Oct. 2025. Effective October 29,

2025

[46] [47]

omni-moderation Model

OpenAI. omni-moderation Model. https://developers .openai.com/api/docs/models/omni-moderation-latest,

[47] [48]

Accessed: 2026- 06-08

OpenAI API documentation. Accessed: 2026- 06-08

2026

[48] [49]

Oquab, T

M. Oquab, T. Darcet, T. Moutakanni, H. V o, M. Szafraniec, V . Khalidov, P. Fernandez, D. Haz- iza, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P.-Y . Huang, S.-W. Li, I. Misra, M. Rabbat, V . Sharma, G. Synnaeve, H. Xu, H. Jegou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski. Dinov2: Learning robust visual features without sup...

Pith/arXiv arXiv 2024

[49] [50]

T. C. Ozden, O. Kara, O. Akcin, K. Zaman, S. Srivastava, S. P. Chinchali, and J. M. Rehg. DiffVax: Optimization-free image immunization against diffusion-based editing. arXiv preprint arXiv:2411.17957, 2024. URL https://arxiv.org/abs/2411.17957

arXiv 2024

[50] [51]

B. Perrigo. How to Spot an AI-Generated Image Like the ’Balenciaga Pope’. https://time.com/6266606/how -to-spot-deepfake-pope/, March 2023

arXiv 2023

[51] [52]

Pleimling, S

X. Pleimling, S. M. Abdullah, G. Balde, P. Gao, M. Mondal, M. Jadliwala, and B. Viswanath. Off- the-shelf image-to-image models are all you need to defeat image protection schemes, 2026. URL https: //arxiv.org/abs/2602.22197

arXiv 2026

[52] [53]

Radford, J

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning trans- ferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

2021

[53] [54]

Fact Check: Online posts reporting explosion near Pentagon on May 22, 2023 are false

Reuters. Fact Check: Online posts reporting explosion near Pentagon on May 22, 2023 are false. https://ww w.reuters.com/article/fact-check/online-posts-reporting -explosion-near-pentagon-on-may-22-2023-are-false -idUSL1N37J2QJ/, May 2023

2023

[54] [55]

Grok’s AI image generation tool violated Canadian privacy law, watchdog says

Reuters. Grok’s AI image generation tool violated Canadian privacy law, watchdog says. https://www. reuters.com/business/media-telecom/groks-ai-image-g eneration-tool-violated-canadian-privacy-law-says-w atchdog-2026-06-11/, June 2026

2026

[55] [56]

Ricker, D

J. Ricker, D. Assenmacher, T. Holz, A. Fischer, and E. Quiring. Ai-generated faces in the real world: A large-scale case study of twitter profile images. In Proceedings of the 27th International Symposium on Research in Attacks, Intrusions and Defenses, pages 513–530, 2024

2024

[56] [57]

Rombach, A

R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion (CVPR), pages 10684–10695, June 2022. URL https://openaccess.thecvf.com/content/CVPR2022/htm l/Rombach High-Resolution Image Synthesis With Latent D...

2022

[57] [58]

Salman, A

H. Salman, A. Khaddaj, G. Leclerc, A. Ilyas, and A. Madry. Raising the cost of malicious AI-powered image editing. InProceedings of the 40th Interna- tional Conference on Machine Learning, volume 202 ofProceedings of Machine Learning Research, pages 29894–29918. PMLR, 2023. URL https://proceeding s.mlr.press/v202/salman23a.html

2023

[58] [59]

Satter and S

R. Satter and S. Tabahriti. Exclusive: Despite new curbs, Elon Musk’s Grok at times produces sexualized images - even when told subjects didn’t consent. https: //www.reuters.com/business/despite-new-curbs-elo n-musks-grok-times-produces-sexualized-images-eve n-when-2026-02-03/, February 2026

2026

[59] [60]

Schaeffer, D

R. Schaeffer, D. Valentine, L. Bailey, J. Chua, C. Eyza- guirre, Z. Durante, J. Benton, B. Miranda, H. Sleight, T. Wang, et al. Failures to find transferable image jailbreaks between vision-language models. InIn- ternational Conference on Learning Representations, volume 2025, pages 44669–44704, 2025

2025

[60] [61]

Schmidt, N

M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient.Math- ematical Programming, 162(1–2):83–112, 2017. doi: 10.1007/s10107-016-1030-6

work page doi:10.1007/s10107-016-1030-6 2017

[61] [62]

S. Shan, J. Cryan, E. Wenger, H. Zheng, R. Hanocka, and B. Y . Zhao. Glaze: Protecting artists from style mimicry by text-to-image models. In32nd USENIX Se- curity Symposium (USENIX Security 23), pages 2187–

[62] [63]

URL https://www

USENIX Association, 2023. URL https://www. usenix.org/conference/usenixsecurity23/presentation/ shan

2023

[63] [64]

S. Shan, W. Ding, J. Passananti, S. Wu, H. Zheng, and B. Y . Zhao. Nightshade: Prompt-specific poisoning attacks on text-to-image generative models. In2024 IEEE Symposium on Security and Privacy (SP), pages 807–825, 2024. doi: 10.1109/SP54263.2024.00207. URL https://arxiv.org/abs/2310.13828

work page doi:10.1109/sp54263.2024.00207 2024

[64] [65]

Z. Shao, H. Liu, Y . Hu, and N. Z. Gong. Leave my images alone: Preventing multi-modal large language models from analyzing images via visual prompt in- jection.arXiv preprint arXiv:2604.09024, 2026

Pith/arXiv arXiv 2026

[65] [66]

Sohl-Dickstein, B

J. Sohl-Dickstein, B. Poole, and S. Ganguli. Fast large- scale optimization by unifying stochastic gradient and quasi-newton methods. InInternational Conference on Machine Learning, 2014

2014

[66] [67]

M. Sparks. Disney, NBC Universal, and DreamWorks File Major IP Lawsuit Against AI Image Generator Midjourney. https://www.law.georgetown.edu/tech-i nstitute/research-insights/insights/disney-nbc-univers al-and-dreamworks-file-major-ip-lawsuit-against-ai-i mage-generator-midjourney/, June 2025. Institute for Technology Law & Policy, Georgetown Law. Accesse...

2025

[67] [68]

C.-C. Tu, P. Ting, P.-Y . Chen, S. Liu, H. Zhang, J. Yi, C.-J. Hsieh, and S.-M. Cheng. Autozoom: Autoencoder-based zeroth order optimization method for attacking black-box neural networks. InProceed- ings of the AAAI conference on artificial intelligence, volume 33, pages 742–749, 2019

2019

[68] [69]

xAI Acceptable Use Policy

xAI. xAI Acceptable Use Policy. https://x.ai/legal/a cceptable-use-policy, Jan. 2025. Effective January 2,

2025

[69] [70]

Accessed June 7, 2026

2026

[70] [71]

Grok Imagine API, 2026

xAI. Grok Imagine API, 2026. https://x.ai/news/grok -imagine-api

2026

[71] [72]

J. Xu, F. Wang, M. Ma, P. W. Koh, C. Xiao, and M. Chen. Instructional fingerprinting of large language models. InProceedings of the 2024 Conference of the North American Chapter of the Association for Com- putational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3277–3306, 2024

2024

[72] [73]

Y . Ye, X. He, Z. Li, S. Yuan, Z. Yan, B. Hou, L. Yuan, et al. Imgedit: A unified image editing dataset and benchmark.Advances in Neural Information Process- ing Systems, 38, 2026

2026

[73] [74]

W. Zeng, D. Kurniawan, R. Mullins, Y . Liu, T. Saha, D. Ike-Njoku, J. Gu, Y . Song, C. Xu, J. Zhou, A. Joshi, S. Dheep, M. Malek, H. Palangi, J. Baek, R. Pereira, and K. Narasimhan. Shieldgemma 2: Robust and tractable image content moderation, 2025. URL https: //arxiv.org/abs/2504.01081

arXiv 2025

[74] [75]

X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer. Sigmoid loss for language image pre-training. InPro- ceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 11975–11986, October

[75] [76]

URL https://openaccess.thecvf.com/content/IC CV2023/html/Zhai Sigmoid Loss for Language Ima ge Pre-Training ICCV 2023 paper.html

2023

[76] [77]

Zhang, Z

J. Zhang, Z. Gu, J. Jang, H. Wu, M. P. Stoecklin, H. Huang, and I. Molloy. Protecting intellectual prop- erty of deep neural networks with watermarking. In Proceedings of the 2018 on Asia conference on com- puter and communications security, pages 159–172, 2018

2018

[77] [78]

Zhang, S

J. Zhang, S. Dong, S. Shan, and X. Chen. Dual attention guided defense against malicious edits.arXiv preprint arXiv:2512.14333, 2025. URL https://arxiv. org/abs/2512.14333

arXiv 2025

[78] [79]

Zhang, S

J. Zhang, S. Dong, S. Shan, and X. Chen. Towards transferable defense against malicious image edits. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026. URL https://www.computer.org/c sdl/journal/tp/5555/01/11421009/2eApjoNgEve. Early Access / PrePrints

2026

[79] [80]

Zhang, P

J. Zhang, P. Peetathawatchai, F. Tram `er, and A. Shafran. Laundering ai authority with adversarial examples, 2026. URL https://arxiv.org/abs/2605.042 61

2026

[80] [81]

Zhang, G

P.-F. Zhang, G. Bai, and Z. Huang. Maa: Meticulous adversarial attack against vision-language pre-trained models, 2025. URL https://arxiv.org/abs/2502.08079

arXiv 2025