arxiv: 2601.21531 · v2 · pith:J2N4UVB7new · submitted 2026-01-29 · 💻 cs.CR · cs.AI · cs.CV

On the Adversarial Robustness of Large Vision-Language Models under Visual Token Compression

Pith reviewed 2026-05-21 15:02 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.CV
keywords adversarial robustness · vision-language models · visual token compression · adversarial attack · model efficiency · robust accuracy

The pith

Visual token compression creates an optimization-inference mismatch that lets standard adversarial attacks underestimate vulnerabilities in large vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that existing encoder-based attacks on large vision-language models (LVLMs) understate robustness weaknesses when visual tokens are compressed for efficiency, because perturbations are optimized on the complete token set while inference runs through the compression step. To close this gap, it introduces the Compression-AliGnEd attack (CAGE), which aligns attack optimization with the compression process without knowing the specific mechanism or token budget in advance. The method combines expected feature disruption, which focuses distortion on tokens likely to survive compression, with rank distortion alignment, which encourages the compressor to retain distorted tokens. Experiments across multiple plug-and-play compression techniques and datasets show that CAGE yields lower robust accuracy than prior methods. The core point is that robustness numbers reported without accounting for compression can be too optimistic for efficient real-world deployments.
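The mismatch can be made concrete with a toy example. The sketch below is illustrative only and is not taken from the paper: it assumes a top-K pruning compressor, and the score values, the distortion values, and the helper name `top_k_indices` are all hypothetical. It compares the average distortion seen by a full-token objective with the average distortion that remains after pruning.

```python
def top_k_indices(scores, k):
    """Indices of the k highest-scoring tokens (stand-in for a pruning compressor)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


def mean(values):
    return sum(values) / len(values)


# Hypothetical per-token importance scores used by the compressor.
scores = [0.9, 0.1, 0.8, 0.2, 0.3, 0.7]
# Hypothetical per-token distortion caused by a perturbation that was
# optimized without regard to which tokens survive.
distortion = [0.1, 0.9, 0.2, 0.8, 0.7, 0.1]

# What a full-token attack measures during optimization.
full_token_distortion = mean(distortion)

# What actually reaches the language model after keeping only 3 tokens.
kept = top_k_indices(scores, k=3)
post_compression_distortion = mean([distortion[i] for i in kept])

print(kept, full_token_distortion, post_compression_distortion)
```

In this toy setting the perturbation looks strong when averaged over all six tokens, but most of it sits on tokens the compressor discards, so far less distortion survives to inference.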

Core claim

Existing attacks cannot fully disclose the robustness vulnerabilities of compressed LVLMs due to an optimization-inference mismatch: perturbations are optimized on the full-token representation while inference is performed through a token-compression bottleneck. The proposed CAGE attack addresses this by combining expected feature disruption, which concentrates distortion on tokens likely to survive across plausible budgets, and rank distortion alignment, which actively aligns token distortions with rank scores to promote retention of highly distorted evidence, achieving consistently lower robust accuracy than baselines without access to the deployed compression mechanism or token budget.

What carries the argument

The Compression-AliGnEd attack (CAGE), which aligns perturbation optimization with unknown compression inference by using expected feature disruption and rank distortion alignment.
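One way to read the two components is sketched below. This is an illustrative interpretation, not the authors' implementation: it assumes tokens are ranked by a per-token score, that a token survives budget K when its rank is below K, that plausible budgets come from a hypothetical uniform list, and that the two terms are simply added. All function names and the weight `lam` are assumptions; in a real attack both the scores and the distortions would depend on the perturbation being optimized.

```python
def ranks_descending(scores):
    """Rank of each token when sorted by score (0 = highest score)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for rank, idx in enumerate(order):
        ranks[idx] = rank
    return ranks


def survival_probability(scores, budgets):
    """Fraction of candidate budgets K under which each token is kept (rank < K)."""
    ranks = ranks_descending(scores)
    return [sum(1 for k in budgets if r < k) / len(budgets) for r in ranks]


def expected_feature_disruption(distortion, scores, budgets):
    """Survival-weighted average of per-token distortion."""
    p = survival_probability(scores, budgets)
    return sum(pi * di for pi, di in zip(p, distortion)) / sum(p)


def rank_distortion_alignment(distortion, scores):
    """Covariance between rank scores and distortion; positive when
    highly distorted tokens also receive high rank scores."""
    n = len(scores)
    mean_s = sum(scores) / n
    mean_d = sum(distortion) / n
    return sum((s - mean_s) * (d - mean_d) for s, d in zip(scores, distortion)) / n


def attack_objective(distortion, scores, budgets, lam=0.005):
    """Toy objective an attacker would maximize (additive form is assumed)."""
    return (expected_feature_disruption(distortion, scores, budgets)
            + lam * rank_distortion_alignment(distortion, scores))


scores = [0.9, 0.1, 0.8, 0.2]       # hypothetical rank scores
budgets = [1, 2, 4]                 # hypothetical plausible token budgets
aligned = [0.9, 0.1, 0.8, 0.2]      # distortion placed on likely survivors
misaligned = [0.1, 0.9, 0.2, 0.8]   # distortion placed on likely-pruned tokens

print(attack_objective(aligned, scores, budgets))
print(attack_objective(misaligned, scores, budgets))
```

Under these assumptions the aligned distortion pattern scores higher on both terms, which matches the intuition that distortion should land on tokens the compressor is likely to keep.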

If this is right

  • Robustness assessments that ignore token compression will produce overly optimistic security estimates for efficient LVLMs.
  • Security evaluation of large vision-language models must incorporate compression-aware attack methods.
  • Defenses for efficient LVLMs need to address vulnerabilities introduced by the compression step itself.
  • Plug-and-play compression mechanisms should be tested for robustness under aligned adversarial optimization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The mismatch effect may appear in other efficiency techniques such as quantization or pruning of non-visual components.
  • Future compression designs could incorporate explicit robustness objectives to reduce the gap that CAGE exploits.
  • Attack methods like CAGE could be adapted to black-box settings where the compression budget varies at inference time.

Load-bearing premise

The assumption that expected feature disruption and rank distortion alignment can align perturbation optimization with unknown compression inference without access to the deployed mechanism or token budget.

What would settle it

Running CAGE against multiple compressed LVLMs and compression methods on standard datasets and finding no consistent reduction in robust accuracy compared with baseline attacks.

Figures

Figures reproduced from arXiv: 2601.21531 by Haibo Hu, Hangcheng Liu, Hao Wang, Li Bai, Qingqing Ye, Tianwei Zhang, Xinwei Zhang.

Figure 1. Comparison between the existing attack and our attack. Darker red indicates tokens with stronger adversarial perturbation. While the existing attack (A) perturbs all visual tokens (all tokens are red), CAGE (B) concentrates the distortion on the surviving tokens (only survivors are red).
  … progress comes with a substantial computational burden: current state-of-the-art models like LLaVA-NeXT (Liu et al., 202…
Figure 2. Average token-level feature gap under VEAttack over 100 samples. We rank vision tokens by adversarial (ADV) attention and plot the average feature gap (1 − cosine) over the top-K tokens. The curve shows that the gap is large on a small number of high-attention tokens and gradually decreases as lower-ranked tokens are included.
  Under our gray-box setting, backpropagating through the full surrogate LVLM is comp…
Figure 3. Overview of CAGE.
  … cross-token interactions, potentially weakening the attack effect that global objectives rely on. Moreover, attack effectiveness depends on budget alignment: attacks are typically strongest when K_attack is comparable to the deployment budget K_model. For example, on the 16-token model, the full-token attack yields 49.7% robust accuracy, whereas the aligned setting (K_attack=16) reduces it…
Figure 4. Conditional robust accuracy (CRA) vs. deployment token budget. We report CRA under the baseline attack and CAGE across three datasets. While CRA under the baseline attack generally increases as the token budget shrinks, CAGE exhibits non-monotonic behavior, indicating that conditional robustness does not vary monotonically with compression.
Figure 5. Attention-based adversarial detection via the top-k CLS-to-token attention mass.
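The quantity plotted in Figure 2 (average feature gap, 1 − cosine, over the top-K tokens ranked by adversarial attention) can be sketched as follows. This is a minimal illustration assuming token features are plain lists of floats with non-zero norm; the function names and the toy inputs are hypothetical and do not come from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def feature_gap_curve(clean_tokens, adv_tokens, adv_attention):
    """Average (1 - cosine) gap over the top-K tokens, for K = 1..N,
    with tokens ranked by attention on the adversarial input."""
    gaps = [1.0 - cosine(c, a) for c, a in zip(clean_tokens, adv_tokens)]
    order = sorted(range(len(gaps)), key=lambda i: adv_attention[i], reverse=True)
    curve, running = [], 0.0
    for k, idx in enumerate(order, start=1):
        running += gaps[idx]
        curve.append(running / k)
    return curve


# Toy data: three 2-dimensional token features before and after an attack.
clean_tokens = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
adv_tokens = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
adv_attention = [0.7, 0.1, 0.2]

curve = feature_gap_curve(clean_tokens, adv_tokens, adv_attention)
print(curve)
```

A curve that falls as K grows, as in this toy case, means the distortion is concentrated on the highest-attention tokens.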
Original abstract

Visual token compression is widely used to accelerate large vision-language models (LVLMs) by pruning or merging visual tokens, yet its adversarial robustness remains unexplored. We show that existing encoder-based attacks cannot fully disclose the robustness vulnerabilities of compressed LVLMs, due to an optimization-inference mismatch: perturbations are optimized on the full-token representation, while inference is performed through a token-compression bottleneck. To address this gap, we propose the Compression-AliGnEd attack (CAGE), which aligns perturbation optimization with compression inference without assuming access to the deployed compression mechanism or its token budget. CAGE combines (i) expected feature disruption, which concentrates distortion on tokens likely to survive across plausible budgets, and (ii) rank distortion alignment, which actively aligns token distortions with rank scores to promote the retention of highly distorted evidence. Across diverse representative plug-and-play compression mechanisms and datasets, our results show that CAGE consistently achieves lower robust accuracy than the baseline. This work highlights that robustness assessments ignoring compression can be overly optimistic, calling for compression-aware security evaluation and defenses for efficient LVLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper examines adversarial robustness in large vision-language models (LVLMs) that apply visual token compression for efficiency. It identifies an optimization-inference mismatch in prior encoder-based attacks, which optimize perturbations on full token representations while inference occurs after compression. To address this, the authors introduce the Compression-AliGnEd attack (CAGE), a mechanism-agnostic method that combines expected feature disruption (to target tokens likely to survive plausible budgets) with rank distortion alignment (to encourage retention of highly distorted tokens). Empirical results across plug-and-play compressors and datasets are reported to show that CAGE yields lower robust accuracy than baselines, implying that compression-agnostic robustness evaluations may be overly optimistic.

Significance. If the central empirical claim holds, the work identifies a practically relevant gap in current robustness assessments for efficient LVLMs and offers a concrete, plug-and-play attack that does not require knowledge of the deployed compressor or token budget. The two heuristic proxies constitute a falsifiable approach to closing the mismatch, and the cross-mechanism, cross-dataset comparisons provide a starting point for compression-aware security evaluation. The absence of machine-checked proofs or parameter-free derivations is expected in an empirical security paper, but the reproducible-attack framing is a positive attribute.

major comments (2)
  1. [Abstract] Abstract: the headline claim that 'CAGE consistently achieves lower robust accuracy than the baseline' across 'diverse representative plug-and-play compression mechanisms and datasets' is presented without any quantitative values, error bars, statistical tests, or description of the experimental protocol; this directly undermines verification of the central empirical result.
  2. [Abstract] Abstract (description of CAGE): the premise that expected feature disruption plus rank distortion alignment will produce perturbations whose effect survives an unknown compression step is load-bearing for the mechanism-agnostic claim, yet the manuscript provides no direct correlation analysis or ablation showing that these proxies predict actual token retention rates under compressors not seen during attack optimization.
minor comments (1)
  1. Notation for 'expected feature disruption' and 'rank distortion alignment' should be introduced with explicit mathematical definitions (e.g., as expectations over token budgets or rank-score inner products) rather than descriptive prose only.
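
A formalization of the kind requested here might look like the following. It is a hedged sketch in hypothetical notation, not the paper's definitions: $d_i(\delta)$ is assumed to be the feature gap of token $i$ (for example $1 - \cos$ between clean and adversarial features), $s_i(\delta)$ its rank score, $\pi$ a distribution over token budgets $K$, and the $\ell_\infty$ constraint and the additive weight $\lambda$ are assumptions.

```latex
% Hypothetical notation; none of these definitions are quoted from the paper.
\[
p_i(\delta) = \Pr_{K \sim \pi}\bigl[\operatorname{rank}_i(\delta) \le K\bigr],
\qquad
\mathcal{L}_{\mathrm{EFD}}(\delta)
  = \frac{\sum_{i=1}^{N} p_i(\delta)\, d_i(\delta)}{\sum_{i=1}^{N} p_i(\delta)},
\]
\[
\mathcal{L}_{\mathrm{RDA}}(\delta)
  = \frac{1}{N} \sum_{i=1}^{N}
    \bigl(s_i(\delta) - \bar{s}(\delta)\bigr)\bigl(d_i(\delta) - \bar{d}(\delta)\bigr),
\qquad
\max_{\|\delta\|_\infty \le \epsilon}\;
  \mathcal{L}_{\mathrm{EFD}}(\delta) + \lambda\, \mathcal{L}_{\mathrm{RDA}}(\delta).
\]
```

Here $\operatorname{rank}_i$ is the 1-indexed position of token $i$ when tokens are sorted by $s_i$ in descending order, and $\bar{s}$, $\bar{d}$ are means over the $N$ tokens.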

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below. Where the comments identify opportunities to strengthen verifiability and supporting evidence, we agree to make targeted revisions.

  1. Referee: [Abstract] Abstract: the headline claim that 'CAGE consistently achieves lower robust accuracy than the baseline' across 'diverse representative plug-and-play compression mechanisms and datasets' is presented without any quantitative values, error bars, statistical tests, or description of the experimental protocol; this directly undermines verification of the central empirical result.

    Authors: We agree that the abstract would benefit from concrete quantitative support and a concise description of the evaluation protocol to allow readers to assess the central claim more readily. In the revised manuscript we will incorporate representative numerical results drawn from the experimental sections (including average robust-accuracy reductions and ranges across mechanisms), note that error bars reflect multiple random seeds, and add a brief statement of the protocol (number of plug-and-play compressors, datasets, and evaluation metrics). These changes will be confined to the abstract and will not alter the technical content.
    revision: yes

  2. Referee: [Abstract] Abstract (description of CAGE): the premise that expected feature disruption plus rank distortion alignment will produce perturbations whose effect survives an unknown compression step is load-bearing for the mechanism-agnostic claim, yet the manuscript provides no direct correlation analysis or ablation showing that these proxies predict actual token retention rates under compressors not seen during attack optimization.

    Authors: We appreciate the referee’s emphasis on direct validation of the two proxy heuristics. While the cross-mechanism, cross-dataset results already demonstrate that CAGE outperforms baselines on compressors and budgets not used during attack generation, we acknowledge that an explicit correlation study would further substantiate the mechanism-agnostic premise. In the revision we will add a dedicated analysis subsection that (i) reports Pearson correlations between the expected-feature-disruption and rank-distortion scores and measured token-retention rates on held-out compressors, and (ii) presents ablation results isolating the contribution of each proxy. These additions will be placed in the experimental section and referenced from the abstract description of CAGE.
    revision: yes
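
The correlation analysis promised in the second response could be set up along the following lines. This is a minimal sketch under stated assumptions, not the authors' protocol: `proxy_scores` stands for any per-token proxy value, `kept_runs` is a hypothetical list of the token-index sets retained by held-out compressors, and both helper functions are illustrative names.

```python
import math


def retention_rates(kept_runs, n_tokens):
    """Fraction of runs in which each token index was retained."""
    return [sum(1 for kept in kept_runs if i in kept) / len(kept_runs)
            for i in range(n_tokens)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences
    (assumes neither sequence is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


# Hypothetical proxy scores for four tokens and the tokens kept in three runs.
proxy_scores = [0.9, 0.3, 0.6, 0.1]
kept_runs = [{0, 2}, {0, 1}, {0, 2}]

rates = retention_rates(kept_runs, n_tokens=4)
r = pearson(proxy_scores, rates)
print(rates, r)
```

A coefficient near 1 on compressors not used during attack optimization would support the claim that the proxies predict retention; a value near 0 would undercut it.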

Circularity Check

0 steps flagged

No significant circularity in empirical attack proposal

The paper introduces the CAGE attack as a heuristic combination of expected feature disruption and rank distortion alignment to bridge optimization-inference mismatch for unknown compression mechanisms. All central claims rest on empirical evaluations across plug-and-play compressors and datasets rather than any closed-form derivation, fitted parameter renamed as prediction, or self-referential definition. No equations reduce the reported robust-accuracy gaps to inputs by construction, and the method is presented as a practical proxy without invoking uniqueness theorems or ansatzes from prior self-citations in a load-bearing way. The work is therefore self-contained as an empirical robustness study.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the empirical effectiveness of CAGE components in bridging the optimization-inference gap; specific implementation details such as how plausible budgets are sampled or how rank scores are computed are not provided in the abstract.

free parameters (1)
  • plausible token budgets
    Expected feature disruption concentrates distortion on tokens likely to survive across plausible budgets, but the distribution or sampling method is unspecified.
axioms (1)
  • domain assumption: Existing encoder-based attacks suffer from an optimization-inference mismatch when token compression is present.
    Invoked in the abstract as the reason standard attacks cannot fully disclose vulnerabilities.

pith-pipeline@v0.9.0 · 5739 in / 1286 out tokens · 124982 ms · 2026-05-21T15:02:04.827908+00:00 · methodology

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · 5 internal anchors

  1. [1]

    Qwen2.5-VL Technical Report

    Bai, L., Ye, Q., Zhang, X., Zhang, S., Liang, Z., Xu, J., and Hu, H. Toward efficient inference attacks: Shadow model sharing via mixture-of-experts. In Advances in Neural Information Processing Systems (NeurIPS), 2025a.
    Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wa...

  2. [2]

    MobileVLM V2: Faster and Stronger Baseline for Vision Language Model

    Chen, L., Zhao, H., Liu, T., Bai, S., Lin, J., Zhou, C., and Chang, B. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. In European Conference on Computer Vision (ECCV), 2024a.
    Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., Li, B., Luo, P., Lu...

  3. [3]

    Efficient deployment of vision-language models on mobile devices: A case study on OnePlus 13R

    Guerrero, P. R., Pan, Y., and Kashyap, S. Efficient deployment of vision-language models on mobile devices: A case study on OnePlus 13R. arXiv preprint arXiv:2507.08505,

  4. [4]

    Attention score is not all you need for token importance indicator in KV cache reduction: Value also matters

    Guo, Z., Kamigaito, H., and Watanabe, T. Attention score is not all you need for token importance indicator in KV cache reduction: Value also matters. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP),

  5. [5]

    Vision-language-action models for autonomous driving: Past, present, and future

    Hu, T., Liu, X., Wang, S., Zhu, Y., Liang, A., Kong, L., Zhao, G., Gong, Z., Cen, J., Huang, Z., Hao, X., Li, L., Song, H., Li, X., Ma, J., Shen, S., Zhu, J., Tao, D., Liu, Z., and Liang, J. Vision-language-action models for autonomous driving: Past, present, and future. arXiv preprint arXiv:2512.16760, 2025.

  6. [6]

    Liu, H., Li, C., Li, Y., Li, B., Zhang, Y., Shen, S., and Lee, Y. J. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge, January 2024a. URL https://llava-vl.github.io/blog/2024-01-30-llava-next/.
    Liu, T., Shi, L., Hong, R., Hu, Y., Yin, Q., and Zhang, L. Multi-stage vision token dropping: Towards efficient multimodal large language model. ar...

  7. [7]

    VEAttack: Downstream-agnostic vision encoder attack against large vision language models

    Mei, H., Wang, Z., You, S., Dong, M., and Xu, C. VEAttack: Downstream-agnostic vision encoder attack against large vision language models. arXiv preprint arXiv:2505.17440,

  8. [8]

    GPT-4 Technical Report

    OpenAI et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774,

  9. [9]

    Gemini: A Family of Highly Capable Multimodal Models

    Team, G., Anil, R., Borgeaud, S., Alayrac, J.-B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A. M., Hauth, A., Millican, K., et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805,

  10. [10]

    InstructTA: Instruction-tuned targeted attack for large vision-language models

    Wang, X., Ji, Z., Ma, P., Li, Z., and Wang, S. InstructTA: Instruction-tuned targeted attack for large vision-language models. arXiv preprint arXiv:2312.01886, 2024a.
    Wang, Y., Liu, C., Qu, Y., Cao, H., Jiang, D., and Xu, L. Break the Visual Perception: Adversarial Attacks Targeting Encoded Visual Tokens of Large Vision-Language Models. In Proceedings ...

  11. [11]

    Mobile-Agent-v3: Fundamental Agents for GUI Automation

    Ye, J., Zhang, X., Xu, H., Liu, H., Wang, J., Zhu, Z., Zheng, Z., Gao, F., Cao, J., Lu, Z., Liao, J., Zheng, Q., Huang, F., Zhou, J., and Yan, M. Mobile-Agent-v3: Fundamental agents for GUI automation. arXiv preprint arXiv:2508.15144,

  12. [12]

    DART: Differentiable dynamic adaptive region tokenizer for vision foundation models

    Yin, S., Yin, K., Liu, Y., Chen, W., and Lin, L. DART: Differentiable dynamic adaptive region tokenizer for vision foundation models. arXiv preprint arXiv:2506.10390,

  13. [13]

    AppAgent: Multimodal agents as smartphone users

    Zhang, C., Yang, Z., Liu, J., Li, Y., Han, Y., Chen, X., Huang, Z., Fu, B., and Yu, G. AppAgent: Multimodal agents as smartphone users. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, 2025a.
    Zhang, Q., Cheng, A., Lu, M., Zhan...

  14. [14]

    ② Outer-LLM approaches perform token selection or aggregation before the main language model computation, treating the LLM as a black box

    ① Inner-LLM approaches (Chen et al., 2024a; Zhang et al., 2025d; Yin et al., 2025; Liu et al., 2024b; Shao et al., 2025), such as FastV (Chen et al., 2024a) and SparseVLM (Zhang et al., 2025d), integrate token compression into the language model’s transformer layers, reducing the effective number of visual tokens processed during decoding. ② Outer-LLM appro...

  15. [15]

    A” and “S

    and autonomous systems (Lübberstedt et al., 2025). Existing adversarial attacks on LVLMs broadly fall into two optimization paradigms. ① End-to-end attacks backpropagate through the entire multimodal pipeline to craft adversarial images, but are often computationally expensive due to large models and long contexts (Schlarmann & Hein, 2023). ② Encoder-based...

  16. [16]

    arXiv 2025 ✗ ✗ S ✗ MustDrop (Liu et al., 2024b) arXiv 2024 A S A ✗ HoliTom (Shao et al.,

  17. [17]

    arXiv 2025 ✗ S ✗ S VisionZip (Yang et al.,

  18. [18]

    CVPR 2025 A S ✗ ✗ VisPruner (Zhang et al., 2025b) ICCV 2025 A+S ✗ ✗ ✗ DivPrune (Alvar et al.,

  19. [19]

    CVPR 2025 S ✗ ✗ ✗ FlowCut (Tong et al.,

  20. [20]

    NeurIPS 2025 A+S ✗ ✗ ✗ PruMerge (Shang et al., 2025) ICCV 2025 A S ✗ ✗ G-Prune (Jiang et al.,

  21. [21]

    AAAI 2025 S ✗ ✗ ✗ HiRED (Arif et al.,

  22. [22]

    For prompt-diverse tasks such as VQA, VEAttack (Mei et al.,

    AAAI 2025 A✗ ✗ ✗ 2024b) perturbs encoded visual tokens to break the vision encoder’s token representations. For prompt-diverse tasks such as VQA, VEAttack (Mei et al.,

  23. [23]

    Compared to open-ended VQA, GQA emphasizes systematic generalization and relational grounding

    focuses on compositional visual reasoning, featuring structured questions that frequently require multi-hop reasoning over objects, relations, and attributes (e.g., spatial relations, comparisons, logical conjunctions). Compared to open-ended VQA, GQA emphasizes systematic generalization and relational grounding. We use GQA to test whether compression-alig...

  24. [24]

    It utilizes vision-encoder attention maps to estimate token importance, identifying and retaining a compact subset of highly informative tokens

    None 28.4 28.4 28.4 6.8 6.8 6.8 17.0 17.0 17.0
    VisPruner (Zhang et al., 2025b) performs saliency-based pruning. It utilizes vision-encoder attention maps to estimate token importance, identifying and retaining a compact subset of highly informative tokens. The method explicitly filters out background redundancy to maximize the semantic density of the pruned...

  25. [25]

    Importantly, most non-zero λ settings are comparable to or better than λ=0, indicating that incorporating RDA is generally beneficial even without delicate tuning

    Overall, moderate values consistently perform best: λ=0.005 yields the lowest accuracy for most budgets (Full/64/16), while λ=0.01 is slightly better in a few cases (128/32) and remains competitive across the board. Importantly, most non-zero λ settings are comparable to or better than λ=0, indicating that incorporating RDA is generally beneficial even wit...

  26. [26]

    peakiness

    shows that the ℓ1 norm of each token’s value vector is a proxy for information strength, which can measure the token importance. This metric ranks tokens by embedding magnitude, based on the intuition that smaller norms correspond to weaker signals and thus convey less visual information through the Value vectors.
    • Attention score: We rank tokens by their...

  27. [27]

    or MobileVLM (Chu et al., 2024)). Since these methods modify the compression mechanism and the resulting token dynamics in fundamentally different ways, their robustness characteristics may differ from the settings studied here, motivating a systematic extension of our analysis to these alternative compression families. Generalization to videos, multi-im...