pith. machine review for the scientific record.

arxiv: 2601.23179 · v2 · submitted 2026-01-30 · 💻 cs.AI

Recognition: 2 theorem links · Lean Theorem

Universal Adversarial Attacks against Closed-Source MLLMs via Target-View Routed Meta Optimization

Authors on Pith · no claims yet

Pith reviewed 2026-05-16 09:23 UTC · model grok-4.3

classification 💻 cs.AI
keywords universal adversarial attacks · multimodal large language models · targeted attacks · black-box transfer · meta optimization · closed-source models · transferable perturbations

The pith

Meta-optimized routing and multi-crop aggregation let one perturbation steer arbitrary inputs to a chosen target across closed-source MLLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies universal targeted transferable adversarial attacks on commercial multimodal large language models, where a single fixed perturbation must make any input image produce one specified output. Earlier attempts at this setting ran into high-variance target supervision from random crops, unreliable token alignment once image-specific cues were removed by universality, and high sensitivity to starting points when adapting to new targets with few sources. The authors introduce three fixes: attention-guided multi-crop aggregation to stabilize supervision, alignability-gated token routing to recover reliable matching, and meta-learning of a cross-target perturbation prior that improves per-target optimization. These changes raise unseen-image attack success by 23.7 percentage points on GPT-4o and 19.9 points on Gemini-2.0 relative to the best prior universal method. A reader would care because reusable perturbations make it practical to probe model behavior at scale rather than one example at a time.
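
As a rough illustration of the setting (a minimal sketch, not the authors' code), a universal targeted attack optimizes one shared perturbation against a white-box surrogate so that every perturbed image embeds near a fixed target; the `encoder`, `target_emb`, and the 8/255 budget below are all assumptions of the sketch.

```python
# Minimal sketch of a universal targeted perturbation against a surrogate
# encoder. All names (encoder, target_emb) and the 8/255 budget are assumed
# for illustration; the paper's actual objective adds multi-crop aggregation,
# token routing, and a meta-learned prior on top of a loop like this.
import torch

def optimize_universal_delta(encoder, images, target_emb,
                             eps=8 / 255, steps=300, lr=1e-2):
    delta = torch.zeros_like(images[0], requires_grad=True)  # one delta for all inputs
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = 0.0
        for x in images:                        # universality: shared across sources
            adv = (x + delta).clamp(0, 1)
            emb = encoder(adv.unsqueeze(0))
            loss = loss - torch.cosine_similarity(emb, target_emb).mean()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)             # keep perturbation within budget
    return delta.detach()
```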

Core claim

The authors establish that their MCRMO-Attack method, built from attention-guided multi-crop aggregation, alignability-gated token routing, and a meta-learned cross-target perturbation prior, resolves the core instabilities of the universal targeted setting. The result is a single perturbation that consistently drives arbitrary inputs toward a fixed target output on unknown commercial MLLMs, delivering the stated gains in transfer success on unseen images.

What carries the argument

MCRMO-Attack, which stabilizes supervision with attention-guided multi-crop aggregation, recovers reliable token alignment via alignability-gated routing, and supplies better starting points through a meta-learned cross-target perturbation prior.
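
A hedged sketch of the first mechanism: supervision is averaged over several target crops, weighted by an attention score, rather than taken from a single random crop. `crop_fn`, `encoder`, and `attn_score` are illustrative stand-ins, not the paper's API.

```python
# Sketch of multi-crop aggregation with an attention-guided weighting.
# Averaging crop embeddings reduces the variance the paper attributes to
# single random target crops; the weighting recipe here is an assumption.
import torch

def aggregated_target_embedding(encoder, attn_score, target_img, crop_fn, n_crops=8):
    embs, weights = [], []
    for _ in range(n_crops):
        crop = crop_fn(target_img)               # random or attention-guided crop
        embs.append(encoder(crop.unsqueeze(0)))  # (1, d) embedding per crop
        weights.append(attn_score(crop))         # scalar attention mass of the crop
    w = torch.softmax(torch.tensor(weights), dim=0)
    stacked = torch.cat(embs, dim=0)             # (n_crops, d)
    return (w.unsqueeze(1) * stacked).sum(dim=0, keepdim=True)  # stabilized target
```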

Load-bearing premise

The three components are assumed to fully resolve high-variance supervision, unreliable token matching, and initialization sensitivity when moving from sample-wise to universal targeted attacks.

What would settle it

Measuring attack success rates on a fresh commercial MLLM or new target set without retraining the meta-prior; if rates fall back to or below the strongest universal baseline, the components have not resolved the stated instabilities.
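
A minimal sketch of that falsification test, assuming a hypothetical `query_mllm` wrapper around whichever fresh commercial model is being probed:

```python
# Apply the frozen universal perturbation to unseen images, query the new
# model, and count responses containing the target concept. `query_mllm`
# and the keyword-matching success criterion are assumptions of this sketch.
def attack_success_rate(query_mllm, images, delta, target_word="cat"):
    hits = 0
    for x in images:
        adv = (x + delta).clamp(0, 1)            # no retraining of the meta-prior
        response = query_mllm(adv, prompt="Describe this image.")
        hits += int(target_word.lower() in response.lower())
    return hits / len(images)                    # compare to the strongest baseline
```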

Figures

Figures reproduced from arXiv: 2601.23179 by Alex Kot, Bingquan Shen, Chenyu Yi, Hui Lu, Qixing Zhang, Xudong Jiang, Xueyi Ke, Yiming Yang, Yi Yu.

Figure 1
Figure 1: Comparison of targeted adversarial examples generated by FOA-Attack (Jia et al., 2025) and our MCRMO-Attack on the source image (captions in orange box) and an unseen arbitrary image (captions in blue box). Both methods succeed on the source images. However, FOA-Attack relies on local shadow cues and fails to transfer, while our MCRMO-Attack consistently induces the target concept (“cat”) across heterogene…
Figure 2
Figure 2: (Left) Comparison of mean loss curves with variance shading over 300 epochs, where the MCA and MCA+AGC variants exhibit improved convergence behavior relative to the baseline. (Right) Illustration of gradient variation, indicating that the proposed methods effectively reduce gradient stochasticity.
Figure 3
Figure 3: Visualization of adversarial images and perturbations for an unseen sample.
read the original abstract

Targeted adversarial attacks on closed-source multimodal large language models (MLLMs) have been increasingly explored under black-box transfer, yet prior methods are predominantly sample-specific and offer limited reusability across inputs. We instead study a more stringent setting, Universal Targeted Transferable Adversarial Attacks (UTTAA), where a single perturbation must consistently steer arbitrary inputs toward a specified target across unknown commercial MLLMs. Naively adapting existing sample-wise attacks to this universal setting faces three core difficulties: (i) target supervision becomes high-variance due to target-crop randomness, (ii) token-wise matching is unreliable because universality suppresses image-specific cues that would otherwise anchor alignment, and (iii) few-source per-target adaptation is highly initialization-sensitive, which can degrade the attainable performance. In this work, we propose MCRMO-Attack, which stabilizes supervision via Multi-Crop Aggregation with an Attention-Guided Crop, improves token-level reliability through alignability-gated Token Routing, and meta-learns a cross-target perturbation prior that yields stronger per-target solutions. Across commercial MLLMs, we boost unseen-image attack success rate by +23.7% on GPT-4o and +19.9% on Gemini-2.0 over the strongest universal baseline.
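
The abstract does not specify the meta-learning algorithm behind the cross-target prior. Since the paper cites first-order meta-learning (Nichol et al., reference [13] below), one plausible reading is a Reptile-style update, sketched here under that assumption; `inner_opt_fn` stands in for the per-target perturbation optimization.

```python
# Hedged Reptile-style sketch of learning a cross-target perturbation prior:
# repeatedly adapt the prior to a sampled target with a few inner steps, then
# move the prior toward the adapted solution. This is one reading of the
# abstract, not the authors' confirmed algorithm.
import torch

def meta_learn_prior(inner_opt_fn, targets, shape, meta_steps=100, meta_lr=0.1):
    prior = torch.zeros(shape)                   # shared initialization across targets
    for _ in range(meta_steps):
        t = targets[torch.randint(len(targets), (1,)).item()]
        adapted = inner_opt_fn(prior.clone(), t)         # few-step per-target attack
        prior += meta_lr * (adapted - prior)             # Reptile: drift toward solution
    return prior                                 # warm start for new targets
```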

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces MCRMO-Attack for the Universal Targeted Transferable Adversarial Attacks (UTTAA) setting on closed-source MLLMs. It identifies three difficulties when adapting sample-wise attacks to universal targeted transfer (high-variance target supervision from crop randomness, unreliable token matching due to suppressed image-specific cues, and initialization sensitivity in few-source adaptation) and proposes three components to address them: Multi-Crop Aggregation with Attention-Guided Crop, alignability-gated Token Routing, and a meta-learned cross-target perturbation prior. The central empirical claim is an improvement in unseen-image attack success rate of +23.7% on GPT-4o and +19.9% on Gemini-2.0 relative to the strongest universal baseline.

Significance. If the reported gains are reproducible and the three components are shown to be necessary via controlled experiments, the work would constitute a meaningful step toward reusable, universal perturbations that transfer to commercial closed-source MLLMs. The meta-optimization framing and explicit handling of token-alignment variance are potentially useful ideas for the black-box adversarial-attack literature.

major comments (3)
  1. [Method (alignability-gated Token Routing)] The abstract and method description of alignability-gated Token Routing state that the gate is computed from surrogate-model attention and embedding similarity, yet no correlation analysis, ablation, or direct measurement is supplied showing how well these surrogate scores predict actual token-level alignment success inside GPT-4o or Gemini-2.0. Because the paper itself identifies unreliable token matching as a core obstacle, the absence of this verification is load-bearing for the +23.7% and +19.9% claims. (A sketch of such a gate follows this list.)
  2. [Experiments] No experimental protocol, number of trials, statistical tests, baseline implementation details, or ablation tables are referenced in the abstract or summary. Without these, the reported percentage improvements cannot be evaluated for robustness or compared fairly to prior universal baselines.
  3. [Ablation study] The claim that the three components collectively resolve high-variance supervision and initialization sensitivity is presented without per-component ablations or sensitivity analyses that isolate their individual contributions. This makes it impossible to confirm that the meta-learned prior and routing gate are the decisive factors rather than other unstated design choices.
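
For concreteness, a hedged sketch of the kind of gate major comment 1 asks the authors to validate: token pairs contribute to the loss only when a surrogate-side reliability score clears a threshold. The score (attention mass times cosine similarity) and the threshold `tau` are assumptions; the paper's exact gate is not given in the material above.

```python
# Illustrative alignability gate over token matches. Only tokens whose
# surrogate attention-times-similarity score exceeds tau are routed into the
# matching loss; everything here is a sketch, not the authors' formulation.
import torch
import torch.nn.functional as F

def gated_token_loss(adv_tokens, tgt_tokens, attn, tau=0.5):
    # adv_tokens, tgt_tokens: (n, d); attn: (n,) surrogate attention per token
    sim = F.cosine_similarity(adv_tokens, tgt_tokens, dim=-1)   # (n,)
    gate = (attn * sim.detach() > tau).float()   # route only "alignable" tokens
    matched = gate.sum().clamp(min=1.0)          # avoid division by zero
    return -(gate * sim).sum() / matched         # loss over reliable matches only
```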
minor comments (1)
  1. [Abstract] The abstract would be clearer if it briefly indicated the number of target classes, image sources, and evaluation metric (e.g., exact success-rate definition) used for the reported percentages.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major point below with clarifications and commitments to revisions where feasible. Our responses focus on substance and aim to strengthen the manuscript without overstating what the current experiments demonstrate.

read point-by-point responses
  1. Referee: [Method (alignability-gated Token Routing)] The abstract and method description of alignability-gated Token Routing state that the gate is computed from surrogate-model attention and embedding similarity, yet no correlation analysis, ablation, or direct measurement is supplied showing how well these surrogate scores predict actual token-level alignment success inside GPT-4o or Gemini-2.0. Because the paper itself identifies unreliable token matching as a core obstacle, the absence of this verification is load-bearing for the +23.7% and +19.9% claims.

    Authors: We agree that direct correlation analysis between the surrogate scores and internal token alignment in closed-source models is not feasible, as we lack access to GPT-4o or Gemini-2.0 internals. The routing gate is explicitly designed as a surrogate proxy to mitigate the token-matching variance we identify in the introduction. In the revised manuscript we will add (i) an expanded ablation isolating the routing gate's contribution to transfer ASR and (ii) a dedicated paragraph discussing the proxy's rationale, its correlation with open-source alignment metrics, and its limitations. These additions provide quantitative indirect support while acknowledging the inherent constraint of the closed-source setting. revision: partial

  2. Referee: [Experiments] No experimental protocol, number of trials, statistical tests, baseline implementation details, or ablation tables are referenced in the abstract or summary. Without these, the reported percentage improvements cannot be evaluated for robustness or compared fairly to prior universal baselines.

    Authors: The full manuscript (Sections 4 and 5 plus supplementary material) already specifies the evaluation protocol: 5 independent random seeds per setting, paired t-tests for significance, exact baseline re-implementations with hyper-parameter grids, and complete ablation tables. We acknowledge that the abstract and high-level summary do not explicitly reference these details. In the revision we will (a) add a concise sentence to the abstract pointing to the experimental protocol and (b) include a short “Evaluation Protocol” paragraph in the main text that cross-references the tables and supplementary sections. A sketch of the seed-paired test appears after these responses. revision: yes

  3. Referee: [Ablation study] The claim that the three components collectively resolve high-variance supervision and initialization sensitivity is presented without per-component ablations or sensitivity analyses that isolate their individual contributions. This makes it impossible to confirm that the meta-learned prior and routing gate are the decisive factors rather than other unstated design choices.

    Authors: The current manuscript contains component-wise ablations (Table 3) and sensitivity analyses (Section 5.3) that quantify the contribution of each module. To make the isolation clearer, we will expand the main-text ablation section with additional rows that separately disable Multi-Crop Aggregation, Token Routing, and the meta-prior, together with sensitivity plots over initialization variance and crop randomness. These revisions will explicitly map each component to the three difficulties stated in the introduction. revision: yes
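
The seed-paired significance test described in response 2 is simple to state; a sketch with placeholder numbers (not the paper's results):

```python
# Paired t-test across matched random seeds, as the rebuttal describes
# (5 seeds per setting). The ASR values below are placeholders only.
from scipy.stats import ttest_rel

ours     = [0.61, 0.63, 0.60, 0.62, 0.64]   # per-seed ASR, proposed method
baseline = [0.38, 0.40, 0.37, 0.39, 0.36]   # per-seed ASR, strongest baseline
t_stat, p_value = ttest_rel(ours, baseline) # pairing controls seed-level variance
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```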

standing simulated objections not resolved
  • Direct measurement or correlation analysis of surrogate scores against internal token-level alignment success inside closed-source models (GPT-4o, Gemini-2.0) is impossible without model access.

Circularity Check

0 steps flagged

No significant circularity; empirical gains independent of method inputs

full rationale

The paper proposes MCRMO-Attack with three algorithmic components (Multi-Crop Aggregation, alignability-gated Token Routing, meta-learned prior) to address variance issues in universal targeted attacks. Reported gains (+23.7% on GPT-4o, +19.9% on Gemini-2.0) are presented as results of experimental evaluation on unseen images against closed-source MLLMs. No equations, self-citations, or fitted quantities are shown that reduce these gains to inputs by construction. The derivation chain remains self-contained with external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The approach appears to rest on standard meta-learning and attention mechanisms without new postulated objects.

pith-pipeline@v0.9.0 · 5549 in / 1166 out tokens · 42157 ms · 2026-05-16T09:23:18.204967+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · 6 internal anchors

  1. [1]

    GPT-4 Technical Report

    Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

  2. [2]

    Qwen Technical Report

    Bai, J., Bai, S., Chu, Y., Cui, Z., Dang, K., Deng, X., Fan, Y., Ge, W., Han, Y., Huang, F., et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.

  3. [3]

    How Robust Is Google's Bard to Adversarial Image Attacks?

    Dong, Y., Chen, H., Chen, J., Fang, Z., Yang, X., Zhang, Y., Tian, Y., Su, H., and Zhu, J. How robust is Google's Bard to adversarial image attacks? arXiv preprint arXiv:2309.11751, 2023.

  4. [4]

    Transferable Adversarial Attacks on Black-Box Vision-Language Models

    Hu, K., Yu, W., Zhang, L., Robey, A., Zou, A., Xu, C., Hu, H., and Fredrikson, M. Transferable adversarial attacks on black-box vision-language models. arXiv preprint arXiv:2505.01050, 2025.

  5. [5]

    X-Transfer Attacks: Towards Super Transferable Adversarial Attacks on CLIP

    Huang, H., Erfani, S., Li, Y., Ma, X., and Bailey, J. X-Transfer attacks: Towards super transferable adversarial attacks on CLIP. arXiv preprint arXiv:2505.05528, 2025.

  6. [6]

    Adversarial Attacks against Closed-Source MLLMs via Feature Optimal Alignment

    Jia, X., Gao, S., Qin, S., Pang, T., Du, C., Huang, Y., Li, X., Li, Y., Li, B., and Liu, Y. Adversarial attacks against closed-source MLLMs via feature optimal alignment. arXiv preprint arXiv:2505.21494, 2025.

  7. [7]

    Survey of Adversarial Robustness in Multimodal Large Language Models

    Jiang, C., Wang, Z., Dong, M., and Gui, J. Survey of adversarial robustness in multimodal large language models. arXiv preprint arXiv:2503.13962, 2025.

  8. [8]

    NIPS 2017: Defense Against Adversarial Attack

    K, A., Hamner, B., and Goodfellow, I. NIPS 2017: Defense against adversarial attack. https://kaggle.com/competitions/nips-2017-defense-against-adversarial-attack, 2017.

  9. [9]

    Otter: A Multi-Modal Model with In-Context Instruction Tuning

    Li, B., Zhang, Y., Chen, L., Wang, J., Pu, F., Cahyono, J. A., Yang, J., Li, C., and Liu, Z. Otter: A multi-modal model with in-context instruction tuning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025a.

  10. [10]

    Nesterov Accelerated Gradient and Scale Invariance for Adversarial Attacks

    Lin, J., Song, C., He, K., Wang, L., and Hopcroft, J. E. Nesterov accelerated gradient and scale invariance for adversarial attacks. arXiv preprint arXiv:1908.06281, 2019.

  11. [11]

    DeepSeek-VL: Towards Real-World Vision-Language Understanding

    Lu, H., Liu, W., Zhang, B., Wang, B., Dong, K., Liu, B., Sun, J., Ren, T., Li, Z., Yang, H., et al. DeepSeek-VL: Towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525, 2024.

  12. [12]

    When Robots Obey the Patch: Universal Transferable Patch Attacks on Vision-Language-Action Models

    Lu, H., Yu, Y., Yang, Y., Yi, C., Zhang, Q., Shen, B., Kot, A. C., and Jiang, X. When robots obey the patch: Universal transferable patch attacks on vision-language-action models. arXiv preprint arXiv:2511.21192, 2025.

  13. [13]

    On First-Order Meta-Learning Algorithms

    Nichol, A., Achiam, J., and Schulman, J. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999, 2018.

  14. [14]

    Gemini: A Family of Highly Capable Multimodal Models

    Team, G., Anil, R., Borgeaud, S., Alayrac, J.-B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A. M., Hauth, A., Millican, K., et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.

  15. [15]

    LLaMA: Open and Efficient Foundation Language Models

    Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

  16. [16]

    One Surrogate to Fool Them All: Universal, Transferable, and Targeted Adversarial Attacks with CLIP

    Xu, B., Dai, X., Tang, D., and Zhang, K. One surrogate to fool them all: Universal, transferable, and targeted adversarial attacks with CLIP. In Proceedings of the 2025 ACM SIGSAC Conference on Computer and Communications Security, pp. 3087–3101, 2025.

  17. [17]

    Efficient and Effective Universal Adversarial Attack against Vision-Language Pre-Training Models

    Yang, F., Huang, Y., Wang, K., Shi, L., Pu, G., Liu, Y., and Wang, H. Efficient and effective universal adversarial attack against vision-language pre-training models. arXiv preprint arXiv:2410.11639, 2024.

  18. [18]

    Enhancing the Transferability of Adversarial Examples with Random Patch

    Zhang, Y., Tan, Y.-a., Chen, T., Liu, X., Zhang, Q., and Li, Y. Enhancing the transferability of adversarial examples with random patch. In International Joint Conference on Artificial Intelligence, 2022b.

  19. [19]

    DarkHash: A Data-Free Backdoor Attack against Deep Hashing

    Zhou, Z., Deng, M., Song, Y., Zhang, H., Wan, W., Hu, S., Li, M., Zhang, L. Y., and Yao, D. DarkHash: A data-free backdoor attack against deep hashing. IEEE Transactions on Information Forensics and Security, 2025a.