Partially Observable Adversarial Patch Attacks on Vision-Language-Action Models in Robotics

Keke Tang; Mingliang Han; Tianyu Hao; Xiaofei Wang; Yi Yang; Yun-Bo Zhao

arxiv: 2606.03556 · v1 · pith:EAL27CMYnew · submitted 2026-06-02 · 💻 cs.RO

Xiaofei Wang, Mingliang Han, Tianyu Hao, Yi Yang, Yun-Bo Zhao, Keke Tang

Pith reviewed 2026-06-28 09:58 UTC · model grok-4.3

classification 💻 cs.RO

keywords: adversarial patches · vision-language-action models · robotics · partial observability · adversarial attacks · attention maps · action trajectories

The pith

Adversarial patches generated from a short trajectory prefix can still cause long-horizon failures in vision-language-action robot models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that vision-language-action models remain vulnerable to fixed adversarial patches even when the attacker sees only an initial segment of the robot's path rather than the full trajectory. It introduces a two-phase method that first identifies key image regions using the model's attention maps tied to the task instruction, then tunes the patch to break object recognition and make planned action paths curve more sharply. If this holds, it shows that realistic partial-observability constraints do not protect these models, allowing sustained drops in task completion over extended robot operations in both simulated and physical settings.

Core claim

The paper claims that under a threat model limited to a short prefix of the trajectory, a single fixed patch can be produced via attention-map localization of instruction-critical regions followed by optimization that simultaneously disrupts target-object semantic grounding and raises action-trajectory curvature, thereby producing compounding perception and control errors that persist across subsequent frames and measurably lower task success rates in simulation and real-world robot trials.

What carries the argument

The two-phase framework that localizes the patch with attention maps corresponding to the full instruction and then optimizes it to break semantic grounding while increasing action curvature.
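The localization phase can be pictured with a minimal sketch, under assumptions the source does not state: attention maps from the prefix frames are averaged, and the patch is placed where a square window captures the most attention mass. The names `mean_attention` and `localize_patch`, the averaging step, and the max-sum window rule are illustrative choices, not the authors' procedure.

```python
from typing import List, Tuple

Grid = List[List[float]]


def mean_attention(maps: List[Grid]) -> Grid:
    """Average several same-sized attention maps (assumed: one per prefix frame)."""
    n = len(maps)
    rows, cols = len(maps[0]), len(maps[0][0])
    return [[sum(m[r][c] for m in maps) / n for c in range(cols)]
            for r in range(rows)]


def localize_patch(attn: Grid, patch: int) -> Tuple[int, int]:
    """Return (row, col) of the top-left corner of the patch x patch window
    with the largest total attention. Ties go to the first window found."""
    rows, cols = len(attn), len(attn[0])
    if patch > rows or patch > cols:
        raise ValueError("patch is larger than the attention map")

    # Summed-area table: sat[r][c] holds the sum of attn[0:r][0:c].
    sat = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        for c in range(cols):
            sat[r + 1][c + 1] = (attn[r][c] + sat[r][c + 1]
                                 + sat[r + 1][c] - sat[r][c])

    best, best_pos = float("-inf"), (0, 0)
    for r in range(rows - patch + 1):
        for c in range(cols - patch + 1):
            # Window sum in O(1) from the four table corners.
            total = (sat[r + patch][c + patch] - sat[r][c + patch]
                     - sat[r + patch][c] + sat[r][c])
            if total > best:
                best, best_pos = total, (r, c)
    return best_pos
```

The summed-area table keeps the search linear in the number of map cells, which matters little for a coarse attention grid but avoids recomputing overlapping window sums.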

If this is right

Task success rates fall substantially when the fixed patch is applied throughout execution.
Disruptions compound over long horizons even though the patch never changes after the prefix.
Failures arise from simultaneous breaks in object perception and in the smoothness of generated actions.
The attack succeeds without requiring access to the complete future trajectory.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Robot safety testing may need to include early-trajectory patch injection as a standard evaluation case.
Attention-map inspection during deployment could serve as an early-warning signal for ongoing attacks.
The same localization-plus-curvature approach might transfer to other multimodal control models that rely on visual-language grounding.

Load-bearing premise

The attacker is restricted to a short initial segment of the trajectory yet must produce one unchanging patch that affects every later frame.

What would settle it

A controlled trial in which patches generated from the short-prefix method produce no measurable drop in task success rates compared with clean runs, either in simulation or on physical robots, would falsify the central claim.
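One way to run that comparison is sketched below, assuming each trial is recorded as a binary success flag. The one-sided two-proportion z-test is an illustrative choice, not a procedure taken from the paper, and the function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def success_rate(outcomes: Sequence[int]) -> float:
    """Fraction of trials that succeeded (1 = success, 0 = failure)."""
    return sum(outcomes) / len(outcomes)


def two_proportion_z(clean: Sequence[int],
                     patched: Sequence[int]) -> Tuple[float, float]:
    """One-sided two-proportion z-test.

    H0: clean and patched runs share one success rate.
    H1: the clean success rate is higher.
    Returns (z, p_value).
    """
    n1, n2 = len(clean), len(patched)
    p1, p2 = success_rate(clean), success_rate(patched)
    pooled = (sum(clean) + sum(patched)) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    if se == 0.0:
        # Every trial had the same outcome, so there is no evidence of a drop.
        return 0.0, 1.0
    z = (p1 - p2) / se
    # Upper-tail probability of the standard normal: 1 - Phi(z).
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))
    return z, p_value
```

A large p-value on adequately sized samples would be the "no measurable drop" outcome described above. With few trials per condition, an exact test would be the safer choice.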

Figures

Figures reproduced from arXiv: 2606.03556 by Keke Tang, Mingliang Han, Tianyu Hao, Xiaofei Wang, Yi Yang, Yun-Bo Zhao.

**Figure 2.** Overview of the proposed attack framework under partial observability. In Stage 1, we perform attention-based patch localization to identify a …

**Figure 3.** Snapshots of OpenVLA executions on four LIBERO tasks under different attack settings. The top row shows benign execution, the middle row shows …

**Figure 4.** Attack success rate (ASR, %) of our attack framework with random …

**Figure 5.** Effect of loss weights on ASR and nASR. Left: varying λ_traj with λ_sem=1.0; right: varying λ_sem with λ_traj=12.0.

b) Effect of semantic and trajectory losses: To assess the contribution of the two loss terms, we ablate each component individually. Tab. IV reports ASR and nASR across LIBERO suites at K=10 and 30: removing either L_sem or L_traj degrades performance, while the full model achieves the best resul…

**Figure 6.** Visual grounding maps of instruction nouns before and after our …

**Figure 7.** Histogram of angular deviations (degrees) at the first frame after …

**Figure 8.** Snapshots of real-world VLA task executions on a ROKAE xMate ER7 …

Original abstract

Vision-language-action (VLA) models are gaining attention in robotics, yet their robustness to adversarial attacks remains largely unexplored. Existing work shows that adversarial patches can mislead VLA-based robots but assumes full access to the entire execution trajectory, an unrealistic requirement in practice. We address this limitation by formulating a partially observable threat model, where the adversary can exploit only a short prefix of the trajectory to generate a fixed patch applied to all subsequent frames. Under this setting, we propose a two-phase framework. First, we localize the patch using the model's attention maps to identify visually critical regions that correspond to the full instruction. Then, we optimize the patch to disrupt the semantic grounding of target objects and increase the curvature of action trajectories, thereby compounding failures in both perception and control. Extensive experiments in simulation and real-world robotic environments show that our method sustains adversarial effects under partial observability, inducing long-horizon disruptions and significantly reducing task success rates.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The new partial-observability threat model is the main contribution, but the curvature optimization step looks inconsistent with the short-prefix constraint and no numbers are shown.

The main thing to know is that the paper defines a threat model where the attacker sees only a short prefix of the robot's trajectory and must output one fixed patch that then runs for the rest of the task. They split the attack into attention-based patch placement followed by optimization that targets both object grounding and action-trajectory curvature.

That threat model is a reasonable step beyond the full-trajectory access used in earlier patch work on VLAs. The two-phase structure is clearly described in the abstract and gives a concrete way to generate the patch without seeing future frames.

The soft spot is the curvature term. Curvature is a property of the full action sequence, so any loss that penalizes it normally needs either repeated forward passes on later observations or an unrolled simulation. Both exceed the information available from a short prefix. The abstract gives no indication of how they compute or differentiate that loss without extra access, so the central claim that the attack works under the stated threat model is not yet supported.
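To make the concern concrete, here is a minimal sketch of one possible curvature proxy: the mean turning angle between consecutive displacement vectors of an end-effector path. The abstract does not define the paper's curvature term, so this definition and the names `turning_angles` and `mean_curvature` are assumptions made for illustration.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def turning_angles(points: Sequence[Point]) -> List[float]:
    """Angles in radians between consecutive displacement vectors.

    Each angle needs three successive points, so a path with fewer than
    three points yields no angles at all.
    """
    angles = []
    for a, b, c in zip(points, points[1:], points[2:]):
        u = [q - p for p, q in zip(a, b)]
        v = [q - p for p, q in zip(b, c)]
        nu = math.sqrt(sum(x * x for x in u))
        nv = math.sqrt(sum(x * x for x in v))
        if nu == 0.0 or nv == 0.0:
            continue  # direction is undefined for a stationary step
        cos = sum(x * y for x, y in zip(u, v)) / (nu * nv)
        # Clamp to guard against rounding just outside [-1, 1].
        angles.append(math.acos(max(-1.0, min(1.0, cos))))
    return angles


def mean_curvature(points: Sequence[Point]) -> float:
    """Mean turning angle along the path, or 0.0 if it is undefined."""
    angles = turning_angles(points)
    return sum(angles) / len(angles) if angles else 0.0
```

Even this simple proxy needs at least three consecutive positions per term, which is why the number of steps available to the attacker, and where they come from, decides whether the loss fits inside the stated threat model.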

No quantitative results, baselines, or success-rate numbers appear in the abstract, which leaves the practical impact unclear.

This is for people working on robot safety and adversarial robustness for deployed VLAs. A reader who wants to see how partial observability changes attack design will find the framing useful, but anyone looking for verified effect sizes will come away empty.

I would send it to peer review. The threat-model idea is worth referee scrutiny even if the optimization details need to be fixed.

Referee Report

2 major / 1 minor

Summary. The paper claims that existing adversarial patch attacks on VLA models assume full trajectory access, which is unrealistic. It introduces a partially observable threat model limited to a short prefix of the trajectory for generating a fixed patch applied to subsequent frames. A two-phase framework is proposed: attention-map localization of the patch to critical regions, followed by optimization to disrupt semantic grounding of target objects and increase action-trajectory curvature to compound long-horizon failures. Extensive simulation and real-world experiments are said to show sustained adversarial effects and significantly reduced task success rates under this constraint.

Significance. If the central claim holds under the stated threat model, the work is significant for identifying practical vulnerabilities in emerging VLA-based robotic systems and for proposing a concrete attack construction that respects partial observability. The attention-based localization combined with curvature-driven optimization represents a technical contribution that could inform both attack and defense research in embodied AI.

major comments (2)

[Abstract] Abstract and threat-model description: the optimization objective of increasing the curvature of action trajectories is a long-horizon property of the closed-loop action sequence. It is unclear how this loss can be computed or differentiated from only a short prefix without additional forward passes, unrolled simulation, or future-state queries. This directly affects whether the reported success under partial observability is consistent with the stated threat model.
[Abstract] Abstract: the claim of 'significantly reducing task success rates' and 'extensive experiments' is asserted without any quantitative metrics, baselines, error bars, or statistical details. Because the central empirical claim cannot be evaluated from the provided text, the strength of evidence for the partial-observability result remains unevaluable.

minor comments (1)

[Abstract] Abstract would be strengthened by inclusion of at least one key quantitative result (e.g., success-rate drop and comparison to full-observability baseline) to allow readers to gauge effect size immediately.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the detailed review. We appreciate the opportunity to clarify the consistency of our threat model and to strengthen the presentation of our empirical results. We address each major comment below and will make the necessary revisions to the manuscript.

Referee: [Abstract] Abstract and threat-model description: the optimization objective of increasing the curvature of action trajectories is a long-horizon property of the closed-loop action sequence. It is unclear how this loss can be computed or differentiated from only a short prefix without additional forward passes, unrolled simulation, or future-state queries. This directly affects whether the reported success under partial observability is consistent with the stated threat model.

Authors: We thank the referee for highlighting this important point. In our two-phase framework, the curvature loss is approximated during optimization by performing a limited number of unrolled forward passes through the VLA model itself, using the model's predicted actions to simulate subsequent states within the known prefix context. This does not require access to real future states from the environment or additional trajectory data beyond the initial prefix, thereby preserving the partial observability constraint. We will revise the manuscript to explicitly describe this model-based unrolling procedure in the methods section to ensure clarity. (Revision: yes)
Referee: [Abstract] Abstract: the claim of 'significantly reducing task success rates' and 'extensive experiments' is asserted without any quantitative metrics, baselines, error bars, or statistical details. Because the central empirical claim cannot be evaluated from the provided text, the strength of evidence for the partial-observability result remains unevaluable.

Authors: We agree that the abstract would benefit from including key quantitative results to better support our claims. The full paper contains detailed experimental results with metrics, baselines, and statistical analysis in Sections 4 and 5. In the revised manuscript, we will update the abstract to include specific quantitative findings, such as the reduction in task success rates along with mention of baselines and error bars. (Revision: yes)
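The model-based unrolling described in the first response can be sketched as follows. This is a hypothetical illustration only: `policy` stands in for the VLA model, `step` stands in for whatever surrogate the attacker uses to predict the next state, and states are abstract vectors rather than camera images. The rebuttal does not specify any of these details.

```python
from typing import Callable, List, Sequence

State = Sequence[float]
Action = Sequence[float]


def unroll(policy: Callable[[State], Action],
           step: Callable[[State, Action], State],
           start: State,
           horizon: int) -> List[List[float]]:
    """Roll the policy forward `horizon` steps from `start`.

    Each predicted action is fed to `step` (a surrogate transition supplied
    by the attacker) instead of being executed in the real environment, so
    no true future observation is ever queried.
    """
    states = [list(start)]
    for _ in range(horizon):
        action = policy(states[-1])
        states.append(list(step(states[-1], action)))
    return states
```

Whether such a rollout respects the threat model depends on what `step` is allowed to know. A surrogate fitted only to the observed prefix stays within it, while a simulator of the true scene would not.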

Circularity Check

0 steps flagged

No circularity: new threat model and two-phase framework presented as independent construction

The paper formulates a partially observable threat model (short prefix only) and proposes a two-phase attack (attention-based localization then optimization for semantic disruption and trajectory curvature) without any equations, fitted parameters, or self-citations that reduce the central claims to their own inputs by construction. No derivation chain is claimed that loops back; the method is explicitly positioned as addressing a limitation of prior full-trajectory work. This matches the default expectation of a self-contained new construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations, free parameters, or new entities are described in the abstract; the contribution is an empirical attack method.

pith-pipeline@v0.9.1-grok · 5702 in / 908 out tokens · 18878 ms · 2026-06-28T09:58:48.823411+00:00 · methodology

Reference graph

Works this paper leans on

26 extracted references · 1 canonical work pages · 1 internal anchor

[1] M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, et al., “OpenVLA: An open-source vision-language-action model,” in CoRL, 2024.

[2] K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al., “π0: A vision-language-action flow model for general robot control,” in RSS, 2025.

[3] J. Bjorck, F. Castañeda, N. Cherniadev, X. Da, R. Ding, L. Fan, Y. Fang, D. Fox, F. Hu, S. Huang, et al., “GR00T N1: An open foundation model for generalist humanoid robots,” arXiv preprint arXiv:2503.14734, 2025.

[4] X. Shi, D. Li, P. Zhao, Q. Tian, Y. Tian, Q. Long, C. Zhu, J. Song, F. Qiao, L. Song, et al., “Are we ready for service robots? The OpenLORIS-Scene datasets for lifelong SLAM,” in ICRA, 2020, pp. 3139–3145.

[5] K. R. Guerin, C. Lea, C. Paxton, and G. D. Hager, “A framework for end-user instruction of a robot assistant for manufacturing,” in ICRA, 2015, pp. 6167–6174.

[6] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” in ICLR, 2014.

[7] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” in ICLR, 2015.

[8] T. Wang, C. Han, J. C. Liang, W. Yang, D. Liu, L. X. Zhang, Q. Wang, J. Luo, and R. Tang, “Exploring the adversarial vulnerabilities of vision-language-action models in robotics,” in ICCV, 2025.

[9] J.-P. A. Yaacoub, H. N. Noura, O. Salman, and A. Chehab, “Robotics cyber security: Vulnerabilities, attacks, countermeasures, and recommendations,” International Journal of Information Security, vol. 21, no. 1, pp. 115–158, 2022.

[10] Á. M. Guerrero-Higueras, N. DeCastro-García, and V. Matellán, “Detection of cyber-attacks to indoor real time localization systems for autonomous robots,” Robotics and Autonomous Systems, vol. 99, pp. 75–83, 2018.

[11] B. Dieber, B. Breiling, S. Taurer, S. Kacianka, S. Rass, and P. Schartner, “Security for the robot operating system,” Robotics and Autonomous Systems, vol. 98, pp. 192–203, 2017.

[12] G. Mazzeo and M. Staffa, “TROS: Protecting humanoids ROS from privileged attackers,” International Journal of Social Robotics, vol. 12, no. 3, pp. 827–841, 2020.

[13] W. Wu, F. Pierazzi, Y. Du, and M. Brandão, “Characterizing physical adversarial attacks on robot motion planners,” in ICRA, 2024, pp. 14319–14325.

[14] N. W. Alharthi and M. Brandão, “Physical and digital adversarial attacks on grasp quality networks,” in ICRA, 2024, pp. 1907–1912.

[15] D. Wang, D. Tseng, P. Li, Y. Jiang, M. Guo, M. Danielczuk, J. Mahler, J. Ichnowski, and K. Goldberg, “Adversarial grasp objects,” in CASE, 2019, pp. 241–248.

[16] X. Wang, M. Han, T. Hao, C. Li, Y. Zhao, and K. Tang, “AdvGrasp: Adversarial attacks on robotic grasping from a physical perspective,” in IJCAI, 2025, pp. 547–555.

[17] B. Chen, W. Wang, P. Sikorski, and T. Zhu, “Adversary is on the road: Attacks on visual SLAM using unnoticeable adversarial patch,” in USENIX Security, 2024, pp. 6345–6362.

[18] E. K. Jones, A. Robey, A. Zou, Z. Ravichandran, G. J. Pappas, H. Hassani, M. Fredrikson, and J. Z. Kolter, “Adversarial attacks on robotic vision language action models,” in RSS Workshop, 2025.

[19] H. Zhang, C. Zhu, X. Wang, Z. Zhou, C. Yin, M. Li, L. Xue, Y. Wang, S. Hu, A. Liu, et al., “BadRobot: Jailbreaking embodied LLM agents in the physical world,” in ICLR, 2025.

[20] S. Liu, J. Chen, S. Ruan, H. Su, and Z. Yin, “Exploring the robustness of decision-level through adversarial attacks on LLM-based embodied models,” in ACM MM, 2024, pp. 8120–8128.

[21] X. Zhou, G. Tie, G. Zhang, H. Wang, P. Zhou, and L. Sun, “BadVLA: Towards backdoor attacks on vision-language-action models via objective-decoupled optimization,” in NeurIPS, 2025.

[22] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” in ICLR, 2021.

[23] T. Flash and N. Hogan, “The coordination of arm movements: An experimentally confirmed mathematical model,” The Journal of Neuroscience, vol. 5, no. 7, pp. 1688–1703, 1985.

[24] A. Gasparetto and V. Zanotto, “A technique for time-jerk optimal planning of robot trajectories,” Robotics and Computer-Integrated Manufacturing, vol. 24, no. 3, pp. 415–426, 2008.

[25] B. Liu, Y. Zhu, C. Gao, Y. Feng, Q. Liu, Y. Zhu, and P. Stone, “LIBERO: Benchmarking knowledge transfer for lifelong robot learning,” NeurIPS, vol. 36, pp. 44776–44791, 2023.

[26] H. Song, D. Qu, Y. Yao, Q. Chen, X. Ye, Q. Lv, X. Gao, G. Ren, M. Yao, B. Zhao, D. Wang, and X. Li, “Hume: Introducing system-2 thinking in visual-language-action model,” in CVPR, 2026.
