
arxiv: 2605.28083 · v1 · pith:4KJSOEWCnew · submitted 2026-05-27 · 💻 cs.CV

VLA-Hijack: A Transferable Patch Attack against Vision-Language-Action Models via Visual Proprioception Hijacking

Pith reviewed 2026-06-29 12:56 UTC · model grok-4.3

classification 💻 cs.CV
keywords adversarial patch · vision-language-action · transferable attack · black-box attack · robotic arm localization · proprioceptive injection · VLA models

The pith

VLA models can be attacked by patches that suppress their visual self-location of the real arm and inject a phantom one, enabling black-box transfer across different architectures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies that VLA models must visually locate their own robotic arm before planning motion, and this shared process is the key vulnerability. VLA-Hijack targets it by optimizing two concurrent processes: one suppresses the real arm's visual features via attention guidance, while the other injects the patch as a surrogate phantom embodiment using multimodal anchoring and projection. This severs the connection between true embodiment and the control policy, unlike prior attacks that overfit to one model's action space. Experiments on OpenVLA, UniVLA, and CronusVLA show improved white-box efficiency and new state-of-the-art transfer in black-box cross-architecture and cross-domain settings.

Core claim

By targeting the shared visual self-localization process, VLA-Hijack concurrently optimizes Attention-Guided Proprioceptive Suppression and Multimodal Proprioceptive Injection to sever the semantic relationship between the agent's true embodiment and its control policy, achieving superior optimization efficiency in white-box settings and new SOTA cross-architecture and cross-domain black-box transferability across OpenVLA, UniVLA, and CronusVLA.

What carries the argument

Attention-Guided Proprioceptive Suppression and Multimodal Proprioceptive Injection, which alternate between semantic concept anchoring and visual prototype projection to inhibit real arm features and establish the patch as a phantom embodiment.
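
The alternation named above can be illustrated with a toy sketch. It is not the paper's implementation: the patch is reduced to a single embedding vector, the "semantic anchor" and "visual prototype" are hypothetical fixed target vectors, and each phase simply ascends cosine similarity to one of them. The function names `cosine_grad` and `alternate` are made up for this illustration.

```python
import numpy as np

def cosine(p, t):
    """Cosine similarity between two vectors."""
    return float(p @ t / (np.linalg.norm(p) * np.linalg.norm(t)))

def cosine_grad(p, t):
    """Analytic gradient of cosine(p, t) with respect to p."""
    pn, tn = np.linalg.norm(p), np.linalg.norm(t)
    return t / (pn * tn) - (p @ t) * p / (pn**3 * tn)

def alternate(p0, anchor, prototype, steps=400, lr=0.1):
    """Alternate gradient-ascent steps: even steps pull toward the
    (assumed) semantic anchor, odd steps toward the (assumed) visual
    prototype, instead of summing both objectives in one step."""
    p = np.array(p0, dtype=float)
    for k in range(steps):
        target = anchor if k % 2 == 0 else prototype
        p = p + lr * cosine_grad(p, target)
    return p

# Hypothetical targets and starting point, chosen only for illustration.
anchor = np.array([1.0, 0.0, 0.0])
prototype = np.array([0.0, 1.0, 0.0])
p0 = np.array([0.2, -0.3, 1.0])

p = alternate(p0, anchor, prototype)
print(cosine(p, anchor), cosine(p, prototype))
```

In this toy setting the vector settles near the direction that balances both targets. Whether the real method behaves similarly depends on details the abstract does not give.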

If this is right

  • Previous white-box patch attacks that overfit to specific action outputs become less necessary, as the new method focuses on the common localization step.
  • Black-box attacks no longer need model-specific tuning to achieve high transfer rates across different VLA architectures.
  • Cross-domain transfer becomes feasible without retraining the patch for each new environment or robot setup.
  • Safety-critical deployments of VLA models must now account for visual proprioception as a distinct attack surface.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the visual self-localization step is universal, then hardening only the policy head or language components would leave models exposed.
  • Designers could test whether reducing visual dependence for arm tracking, such as by adding dedicated proprioceptive sensors, reduces vulnerability.
  • The same suppression-injection pattern might apply to other embodied agents that rely on vision to track their own body parts.
  • Future work could measure how much the transfer rate drops when models are trained with explicit arm-localization defenses.

Load-bearing premise

All VLA models must first use visual information to locate their own robotic arm within the environment before planning any motion, and this process is similar enough across architectures to allow transfer via the hijacking method.

What would settle it

A demonstration that a VLA model can generate actions without first visually identifying its own arm's location in the scene, or that the self-localization step differs so much between architectures that the suppression and injection steps fail to transfer.

Figures

Figures reproduced from arXiv: 2605.28083 by Chenzhi Tan, Dingkang Yang, Jingkai Jia, Jiyuan Fu, Kaixun Jiang, Lingyi Hong, Shuyong Gao, Wenqiang Zhang, Xueyao Chen, Zhaoyu Chen.

Figure 1
Figure 1: Unlike prior attacks that leave the real arm unsuppressed, VLA-Hijack hijacks the proprioceptive loop by suppressing the original embodiment and injecting a surrogate identity. To break this transferability bottleneck, we propose a paradigm shift: targeting the universal proprioceptive logic of VLA models rather than their divergent action outputs. Intriguingly, recent adversarial studies [17, 23, 28] …
Figure 2
Figure 2: Visualization of Proprioceptive Conflict.
Figure 3
Figure 3: Overview of the VLA-Hijack Framework. We employ a two-phase optimization loop to generate the adversarial patch: first suppressing the features of the real arm, and then injecting multimodal robotic arm identities into the patch. … Proprioception Loop": the model must first localize its own physical state (the robotic arm) from the current observation, then locate the target object, and finally plan the robo…
Figure 4
Figure 4: Failure Rate (FR) vs. Optimization Steps.
Figure 5
Figure 5: Ablation on Algorithmic Components. We report the transfer Failure Rate (FR) of patches generated from OpenVLA-Spatial and evaluated across four UniVLA tasks (a-d), alongside their overall average (e). Panel titles: (a) Spatial, OpenVLA-Sp.; (b) Bridge V2, OpenVLA-7B; (c) Spatial, UniVLA-Spa.; (d) Object, UniVLA-Obj.; (e) Goal, UniVLA-Goal; (f) Long, UniVLA-Long.
Figure 6
Figure 6: Visualizations of Adversarial Patches. Patches optimized on various surrogate models consistently manifest distinct structural features of physical robotic arms. This degradation occurs because simultaneously enforcing high-level semantic concepts and fine-grained visual prototypes forces the pixels into competing update directions, inducing severe gradient conflicts that prevent convergence to a unified…
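
Figures 4 and 5 report a Failure Rate (FR), which this page does not define. A minimal sketch, assuming FR is the fraction of evaluation episodes in which the attacked policy fails its task, might look like the following. The function `failure_rate` and the boolean episode outcomes are hypothetical.

```python
from typing import Sequence

def failure_rate(successes: Sequence[bool]) -> float:
    """Fraction of episodes that did NOT succeed (assumed FR definition)."""
    if not successes:
        raise ValueError("need at least one episode")
    failures = sum(1 for ok in successes if not ok)
    return failures / len(successes)

# Made-up outcomes for five episodes; True means the task succeeded.
episodes = [True, False, False, True, False]
fr = failure_rate(episodes)  # 3 failures out of 5 episodes
print(fr)
```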
Original abstract

While Vision-Language-Action (VLA) models have emerged as powerful generalist policies, their severe vulnerability to adversarial patches significantly hinders their deployment in safety-critical domains. Moreover, existing patch attacks primarily focus on white-box settings, heavily overfitting to the specific action output space of the target model, which results in poor cross-architecture transferability. To overcome this limitation, we propose VLA-Hijack, a unified adversarial framework that breaks the transferability bottleneck by exploiting a fundamental vulnerability identified in this work: before planning any motion, a VLA model must first use visual information to locate its own robotic arm within the environment. Targeting this shared visual self-localization process, our approach concurrently optimizes Attention-Guided Proprioceptive Suppression to inhibit the real robotic arm's features, and Multimodal Proprioceptive Injection to establish the patch as a surrogate "phantom embodiment". By alternating between semantic concept anchoring and visual prototype projection, VLA-Hijack effectively severs the semantic relationship between the agent's true embodiment and its control policy. Extensive experiments across diverse architectures (OpenVLA, UniVLA, and CronusVLA) demonstrate that VLA-Hijack achieves superior optimization efficiency in white-box settings and sets a new SOTA for cross-architecture and cross-domain black-box transferability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes VLA-Hijack, a unified adversarial patch framework for Vision-Language-Action (VLA) models. It identifies a purported fundamental vulnerability—that VLAs must first use visual information to locate their own robotic arm before any motion planning—and concurrently optimizes Attention-Guided Proprioceptive Suppression (to inhibit real-arm features) and Multimodal Proprioceptive Injection (to establish the patch as a phantom embodiment). By alternating semantic concept anchoring and visual prototype projection, the method severs the link between true embodiment and control policy. Experiments across OpenVLA, UniVLA, and CronusVLA are claimed to show superior white-box optimization efficiency and new SOTA cross-architecture/cross-domain black-box transferability.

Significance. If the shared visual self-localization assumption holds and is mechanistically validated, the work would be significant for exposing transferable vulnerabilities in embodied generalist policies and for advancing black-box attack methods beyond overfitting to specific action spaces. The emphasis on cross-architecture transfer addresses a documented limitation of prior patch attacks on VLAs.

major comments (2)
  1. [Abstract] Abstract: The central claim rests on the assertion that 'before planning any motion, a VLA model must first use visual information to locate its own robotic arm within the environment' as a shared process across architectures. No attention maps, layer-wise ablations, or controlled experiments isolating arm localization from language-conditioned or global scene features are referenced, leaving the mechanistic justification for suppression and injection unverified. If models instead route actions through non-proprioceptive pathways, the transferability results may reflect generic patch optimization rather than hijacking of a common bottleneck.
  2. [Abstract] Abstract: The claim of 'SOTA for cross-architecture and cross-domain black-box transferability' is load-bearing for the contribution, yet the abstract provides no quantitative baselines, error bars, or architecture-specific transfer rates. Without these, it is impossible to assess whether the gains exceed prior methods or arise from the proposed proprioceptive mechanism.
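
One way to make the first comment concrete is to compare how much attention falls on the arm region with and without the patch. The sketch below assumes a 2D attention map has already been extracted from a model, and that a binary arm mask of the same shape is available. Neither the extraction step nor the name `arm_attention_share` comes from the paper, and the arrays are made-up numbers.

```python
import numpy as np

def arm_attention_share(attn, arm_mask):
    """Share of total attention mass that lies inside the arm mask."""
    attn = np.asarray(attn, dtype=float)
    mask = np.asarray(arm_mask, dtype=bool)
    if attn.shape != mask.shape:
        raise ValueError("attention map and mask must have the same shape")
    total = attn.sum()
    if total <= 0:
        raise ValueError("attention map must have positive total mass")
    return float(attn[mask].sum() / total)

# Hypothetical 2x2 attention maps; the arm occupies the bottom-left cell.
mask = np.array([[0, 0], [1, 0]])
clean = np.array([[0.1, 0.1], [0.6, 0.2]])
patched = np.array([[0.5, 0.3], [0.1, 0.1]])

share_clean = arm_attention_share(clean, mask)
share_patched = arm_attention_share(patched, mask)
print(share_clean, share_patched)
```

A drop in this share under the patch would be consistent with suppression of arm features, though it would not by itself rule out the generic-patch explanation the referee raises.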

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback. We address the two major comments point-by-point below. Both concerns can be met by targeted revisions to the abstract and, where appropriate, by referencing existing results from the full manuscript.

  1. Referee: [Abstract] Abstract: The central claim rests on the assertion that 'before planning any motion, a VLA model must first use visual information to locate its own robotic arm within the environment' as a shared process across architectures. No attention maps, layer-wise ablations, or controlled experiments isolating arm localization from language-conditioned or global scene features are referenced, leaving the mechanistic justification for suppression and injection unverified. If models instead route actions through non-proprioceptive pathways, the transferability results may reflect generic patch optimization rather than hijacking of a common bottleneck.

    Authors: The full manuscript supports the shared visual self-localization premise through the consistent cross-architecture transfer gains obtained only when both suppression and injection are applied; prior patch methods that ignore this step show markedly lower transfer. We will revise the abstract to explicitly reference the ablation results on the two components and the three-architecture transfer tables that isolate the contribution of proprioceptive targeting. If the editor requests, we can also add attention-map figures to the supplement. revision: partial

  2. Referee: [Abstract] Abstract: The claim of 'SOTA for cross-architecture and cross-domain black-box transferability' is load-bearing for the contribution, yet the abstract provides no quantitative baselines, error bars, or architecture-specific transfer rates. Without these, it is impossible to assess whether the gains exceed prior methods or arise from the proposed proprioceptive mechanism.

    Authors: We agree that the abstract should contain the key quantitative evidence. We will revise it to report the principal black-box transfer success rates (with standard deviations over repeated trials) for each source-target pair and the corresponding margins over the strongest prior patch baseline. revision: yes
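
The reporting promised in the second response can be sketched as follows: per source-target pair, a mean and sample standard deviation over repeated trials, plus the margin over a baseline. The helper names `summarize` and `margins`, the pair labels, and all numbers are placeholders for illustration, not results from the paper.

```python
from statistics import mean, stdev

def summarize(trials):
    """Map each (source, target) pair to (mean, sample std) of its scores."""
    return {pair: (mean(scores), stdev(scores)) for pair, scores in trials.items()}

def margins(ours, baseline):
    """Difference of mean scores (ours minus baseline) for each pair."""
    return {pair: ours[pair][0] - baseline[pair][0] for pair in ours}

# Placeholder trial scores; each pair needs at least two trials for stdev.
pair = ("surrogate_model", "target_model")
ours_trials = {pair: [0.8, 0.9, 1.0]}
base_trials = {pair: [0.5, 0.6, 0.7]}

ours_summary = summarize(ours_trials)
base_summary = summarize(base_trials)
gap = margins(ours_summary, base_summary)
print(ours_summary[pair], gap[pair])
```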

Circularity Check

0 steps flagged

No circularity: empirical attack framework rests on stated assumption without self-referential derivations or fitted predictions

full rationale

The provided abstract and description contain no equations, parameter-fitting steps, or derivation chains. The central premise—that VLA models must first perform visual self-localization of the robotic arm—is presented as an identified vulnerability rather than derived from prior results or self-citations. The attack components (Attention-Guided Proprioceptive Suppression and Multimodal Proprioceptive Injection) are described as optimization procedures without any indication that outputs reduce to inputs by construction or that uniqueness is imported via author self-citation. This is an empirical proposal whose validity depends on external validation experiments, not internal definitional closure. No load-bearing step matches the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only; full details on parameters and assumptions unavailable. The central claim rests on the domain assumption that visual arm localization is a shared prerequisite step across VLA models.

axioms (1)
  • domain assumption: Before planning any motion, a VLA model must first use visual information to locate its own robotic arm within the environment.
    Explicitly identified in the abstract as the fundamental vulnerability targeted by the attack.

pith-pipeline@v0.9.1-grok · 5804 in / 1243 out tokens · 35636 ms · 2026-06-29T12:56:20.246130+00:00 · methodology


Reference graph

Works this paper leans on

38 extracted references · 27 canonical work pages · 11 internal anchors

  1. [1]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., Jakubczak, S., Jones, T., Ke, L., Levine, S., Li-Bell, A., Mothukuri, M., Nair, S., Pertsch, K., Shi, L.X., Tanner, J., Vuong, Q., Walling, A., Wang, H., Zhilinsky, U.: π0: A vision-language-action flow model for general robot control. CoRR abs/2...

  2. [2]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Chen, X., Choromanski, K., Ding, T., Driess, D., Dubey, A., Finn, C., Florence, P., Fu, C., Arenas, M.G., Gopalakrishnan, K., Han, K., Hausman, K., Herzog, A., Hsu, J., Ichter, B., Irpan, A., Joshi, N.J., Julian, R., Kalashnikov, D., Kuang, Y., Leal, I., Lee, L., Lee, T.E., Levine, S., Lu, Y., Michalewski...

  3. [3]

    UniVLA: Learning to Act Anywhere with Task-centric Latent Actions

    Bu, Q., Yang, Y., Cai, J., Gao, S., Ren, G., Yao, M., Luo, P., Li, H.: Univla: Learning to act anywhere with task-centric latent actions. CoRR abs/2505.06111 (2025)

  4. [4]

    SAM 3: Segment Anything with Concepts

    Carion, N., Gustafson, L., Hu, Y., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala, K.V., Khedr, H., Huang, A., Lei, J., Ma, T., Guo, B., Kalla, A., Marks, M., Greer, J., Wang, M., Sun, P., Rädle, R., Afouras, T., Mavroudi, E., Xu, K., Wu, T., Zhou, Y., Momeni, L., Hazra, R., Ding, S., Vaze, S., Porcher, F., Li, F., Li, S., Kamath, A., Cheng, H.K., Doll...

  5. [5]

    Query-efficient Decision-based Black-box Patch Attack

    Chen, Z., Li, B., Wu, S., Ding, S., Zhang, W.: Query-efficient decision-based black-box patch attack. IEEE Trans. Inf. Forensics Secur. 18, 5522–5536 (2023)

  6. [6]

    PG-Attack: A Precision-guided Adversarial Attack Framework against Vision Foundation Models for Autonomous Driving

    Fu, J., Chen, Z., Jiang, K., Guo, H., Gao, S., Zhang, W.: Pg-attack: A precision-guided adversarial attack framework against vision foundation models for autonomous driving. CoRR abs/2407.13111 (2024)

  7. [7]

    Improving Adversarial Transferability of Visual-Language Pre-training Models through Collaborative Multimodal Interaction

    Fu, J., Chen, Z., Jiang, K., Guo, H., Wang, J., Gao, S., Zhang, W.: Improving adversarial transferability of visual-language pre-training models through collaborative multimodal interaction. CoRR abs/2403.10883 (2024)

  8. [8]

    LingoLoop Attack: Trapping MLLMs via Linguistic Context and State Entrapment into Endless Loops

    Fu, J., Jiang, K., Hong, L., Li, J., Guo, H., Yang, D., Chen, Z., Zhang, W.: Lingoloop attack: Trapping mllms via linguistic context and state entrapment into endless loops. CoRR abs/2506.14493 (2025)

  9. [9]

    State Backdoor: Towards Stealthy Real-world Poisoning Attack on Vision-Language-Action Model in State Space

    Guo, J., Jiang, W., Lin, Y., Liu, Y., Zhang, R., Lu, G., Chen, A., Han, X., Li, H., Niyato, D.: State backdoor: Towards stealthy real-world poisoning attack on vision-language-action model in state space. CoRR abs/2601.04266 (2026)

  10. [10]

    Efficient Decision-based Black-box Patch Attacks on Video Recognition

    Jiang, K., Chen, Z., Huang, H., Wang, J., Yang, D., Li, B., Wang, Y., Zhang, W.: Efficient decision-based black-box patch attacks on video recognition. In: IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023. pp. 4356–4366. IEEE (2023)

  11. [11]

    OpenVLA: An Open-source Vision-Language-Action Model

    Kim, M.J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E.P., Sanketi, P.R., Vuong, Q., Kollar, T., Burchfiel, B., Tedrake, R., Sadigh, D., Levine, S., Liang, P., Finn, C.: Openvla: An open-source vision-language-action model. In: Agrawal, P., Kroemer, O., Burgard, W. (eds.) Conference on Robot Learning, 6-9 Nove...

  12. [12]

    CronusVLA: Transferring Latent Motion across Time for Multi-frame Prediction in Manipulation

    Li, H., Yang, S., Chen, Y., Tian, Y., Yang, X., Chen, X., Wang, H., Wang, T., Zhao, F., Lin, D., Pang, J.: Cronusvla: Transferring latent motion across time for multi-frame prediction in manipulation. CoRR abs/2506.19816 (2025)

  13. [13]

    What Matters in Building Vision-Language-Action Models for Generalist Robots

    Li, X., Li, P., Liu, M., Wang, D., Liu, J., Kang, B., Ma, X., Kong, T., Zhang, H., Liu, H.: Towards generalist robot policies: What matters in building vision-language-action models. CoRR abs/2412.14058 (2024)

  14. [14]

    Vision-Language Foundation Models as Effective Robot Imitators

    Li, X., Liu, M., Zhang, H., Yu, C., Xu, J., Wu, H., Cheang, C., Jing, Y., Zhang, W., Liu, H., Li, H., Kong, T.: Vision-language foundation models as effective robot imitators. In: The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net (2024)

  15. [15]

    When Attention Betrays: Erasing Backdoor Attacks in Robotic Policies by Reconstructing Visual Tokens

    Li, X., Fu, P., Huang, W., Pan, N., Yang, S., Zhao, K., Wan, G., Li, M., Xuan, J., Li, M.: When attention betrays: Erasing backdoor attacks in robotic policies by reconstructing visual tokens. arXiv preprint arXiv:2602.03153 (2026)

  16. [16]

    LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning

    Liu, B., Zhu, Y., Gao, C., Feng, Y., Liu, Q., Zhu, Y., Stone, P.: LIBERO: benchmarking knowledge transfer for lifelong robot learning. In: Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S. (eds.) Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, Ne...

  17. [17]

    When Robots Obey the Patch: Universal Transferable Patch Attacks on Vision-Language-Action Models

    Lu, H., Yu, Y., Yang, Y., Yi, C., Zhang, Q., Shen, B., Kot, A.C., Jiang, X.: When robots obey the patch: Universal transferable patch attacks on vision-language-action models. CoRR abs/2511.21192 (2025)

  18. [18]

    Phantom Menace: Exploring and Enhancing the Robustness of VLA Models against Physical Sensor Attacks

    Lu, X., Chen, J., Xiao, S., Jin, Z., Chen, Z., Yu, H., Qian, B., Zhou, R., Ji, X., Xu, W.: Phantom menace: Exploring and enhancing the robustness of VLA models against physical sensor attacks. CoRR abs/2511.10008 (2025)

  19. [19]

    Evaluating the Robustness of Semantic Segmentation for Autonomous Driving against Real-world Adversarial Patch Attacks

    Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2022, Waikoloa, HI, USA, January 3-8, 2022. pp. 2826–2835. IEEE (2022)

  20. [20]

    SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model

    Qu, D., Song, H., Chen, Q., Yao, Y., Ye, X., Ding, Y., Wang, Z., Gu, J., Zhao, B., Wang, D., Li, X.: Spatialvla: Exploring spatial representations for visual-language-action model. CoRR abs/2501.15830 (2025)

  21. [21]

    Cross-shaped Adversarial Patch Attack

    Ran, Y., Wang, W., Li, M., Li, L., Wang, Y., Li, J.: Cross-shaped adversarial patch attack. IEEE Trans. Circuits Syst. Video Technol. 34(4), 2289–2303 (2024)

  22. [22]

    BridgeData V2: A Dataset for Robot Learning at Scale

    Walke, H.R., Black, K., Zhao, T.Z., Vuong, Q., Zheng, C., Hansen-Estruch, P., He, A.W., Myers, V., Kim, M.J., Du, M., Lee, A., Fang, K., Finn, C., Levine, S.: Bridgedata V2: A dataset for robot learning at scale. In: Tan, J., Toussaint, M., Darvish, K. (eds.) Conference on Robot Learning, CoRL 2023, 6-9 November 2023, Atlanta, GA, USA. Proceedings of Mach...

  23. [23]

    Exploring the Adversarial Vulnerabilities of Vision-Language-Action Models in Robotics

    Wang, T., Liu, D., Liang, J.C., Yang, W., Wang, Q., Han, C., Luo, J., Tang, R.: Exploring the adversarial vulnerabilities of vision-language-action models in robotics. CoRR abs/2411.13587 (2024)

  24. [24]

    AdvEDM: Fine-grained Adversarial Attack against VLM-based Embodied Agents

    Wang, Y., Zhang, H., Pan, H., Zhou, Z., Wang, X., Guo, P., Xue, L., Hu, S., Li, M., Zhang, L.Y.: Advedm: Fine-grained adversarial attack against vlm-based embodied agents. CoRR abs/2509.16645 (2025)

  25. [25]

    DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control

    Wen, J., Zhu, Y., Li, J., Tang, Z., Shen, C., Feng, F.: Dexvla: Vision-language model with plug-in diffusion expert for general robot control. CoRR abs/2502.05855 (2025)

  26. [26]

    TinyVLA: Toward Fast, Data-efficient Vision-Language-Action Models for Robotic Manipulation

    Wen, J., Zhu, Y., Li, J., Zhu, M., Tang, Z., Wu, K., Xu, Z., Liu, N., Cheng, R., Shen, C., Peng, Y., Feng, F., Tang, J.: Tinyvla: Toward fast, data-efficient vision-language-action models for robotic manipulation. IEEE Robotics Autom. Lett. 10(4), 3988–3995 (2025)

  27. [27]

    SilentDrift: Exploiting Action Chunking for Stealthy Backdoor Attacks on Vision-Language-Action Models

    Xu, B., Shang, Y., Wang, B., Ferrara, E.: Silentdrift: Exploiting action chunking for stealthy backdoor attacks on vision-language-action models. CoRR abs/2601.14323 (2026)

  28. [28]

    Model-agnostic Adversarial Attack and Defense for Vision-Language-Action Models

    Xu, H., Koh, Y.S., Huang, S., Zhou, Z., Wang, D., Sakuma, J., Zhang, J.: Model-agnostic adversarial attack and defense for vision-language-action models. CoRR abs/2510.13237 (2025)

  29. [29]

    TabVLA: Targeted Backdoor Attacks on Vision-Language-Action Models

    Xu, Z., Zheng, X., Ma, X., Jiang, Y.: Tabvla: Targeted backdoor attacks on vision-language-action models. CoRR abs/2510.10932 (2025)

  30. [30]

    PatchAttack: A Black-box Texture-based Attack with Reinforcement Learning

    Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J. (eds.) Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XXVI. Lecture Notes in Computer Science, vol. 12371, pp. 6...

  31. [31]

    InstructVLA: Vision-Language-Action Instruction Tuning from Understanding to Manipulation

    Yang, S., Li, H., Chen, Y., Wang, B., Tian, Y., Wang, T., Wang, H., Zhao, F., Liao, Y., Pang, J.: Instructvla: Vision-language-action instruction tuning from understanding to manipulation. CoRR abs/2507.17520 (2025)

  32. [32]

    SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning

    Zhang, B., Zhang, Y., Ji, J., Lei, Y., Dai, J., Chen, Y., Yang, Y.: Safevla: Towards safety alignment of vision-language-action model via constrained learning. arXiv preprint arXiv:2503.03480 (2025)

  33. [33]

    Attention-guided Patch-wise Sparse Adversarial Attacks on Vision-Language-Action Models

    Zhang, N., Tao, W., Xiao, X., Sun, Q., Zheng, Y., Mo, W., Wang, P., Zhang, N.: Attention-guided patch-wise sparse adversarial attacks on vision-language-action models. CoRR abs/2511.21663 (2025)

  34. [34]

    On Evaluating Adversarial Robustness of Large Vision-Language Models

    Zhao, Y., Pang, T., Du, C., Yang, X., Li, C., Cheung, N., Lin, M.: On evaluating adversarial robustness of large vision-language models. In: Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S. (eds.) Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, Ne...

  35. [35]

    Inject Once Survive Later: Backdooring Vision-Language-Action Models to Persist through Downstream Fine-tuning

    Zhou, J., Wei, Y., Zhen, R., Zhao, B., Xia, X., Shao, R., Su, X., Yang, S.: Inject once survive later: Backdooring vision-language-action models to persist through downstream fine-tuning. arXiv preprint arXiv:2602.00500 (2026)

  36. [36]

    BadVLA: Towards Backdoor Attacks on Vision-Language-Action Models via Objective-decoupled Optimization

    Zhou, X., Tie, G., Zhang, G., Wang, H., Zhou, P., Sun, L.: Badvla: Towards backdoor attacks on vision-language-action models via objective-decoupled optimization. CoRR abs/2505.16640 (2025)

  37. [37]

    Goal-oriented Backdoor Attack against Vision-Language-Action Models via Physical Objects

    Zhou, Z., Xiao, Z., Xu, H., Sun, J., Wang, D., Zhang, J.: Goal-oriented backdoor attack against vision-language-action models via physical objects. CoRR abs/2510.09269 (2025)

  38. [38]

    ObjectVLA: End-to-end Open-world Object Manipulation without Demonstration

    Zhu, M., Zhu, Y., Li, J., Zhou, Z., Wen, J., Liu, X., Shen, C., Peng, Y., Feng, F.: Objectvla: End-to-end open-world object manipulation without demonstration. CoRR abs/2502.19250 (2025)