pith. machine review for the scientific record. sign in

arxiv: 2603.24935 · v2 · submitted 2026-03-26 · 💻 cs.RO

Recognition: no theorem link

SABER: A Stealthy Agentic Black-Box Attack Framework for Vision-Language-Action Models

Authors on Pith no claims yet

Pith reviewed 2026-05-15 01:02 UTC · model grok-4.3

classification 💻 cs.RO
keywords vision-language-action modelsblack-box adversarial attacksinstruction editingReAct agentLIBERO benchmarkrobotics robustnessGRPO trainingred-teaming
0
0 comments X

The pith

SABER shows that small plausible edits to robot instructions can reduce task success by over 20 percent across multiple VLA models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SABER as a black-box method that trains a ReAct agent with GRPO to automatically create minimal adversarial edits to natural-language instructions for vision-language-action models. These edits use character-, token-, and prompt-level tools within a strict budget to cause robots to fail tasks, take longer paths, and violate more constraints. Experiments on the LIBERO benchmark across six state-of-the-art models confirm the edits work while using fewer tool calls and character changes than direct GPT baselines. A reader would care because the work demonstrates that current VLA systems can be degraded by subtle, human-like instruction changes that require no internal model access.

Core claim

SABER uses a GRPO-trained ReAct attacker to generate small, plausible adversarial instruction edits using character-, token-, and prompt-level tools under a bounded edit budget that induces targeted behavioral degradation, including task failure, unnecessarily long execution, and increased constraint violations. On the LIBERO benchmark across six state-of-the-art VLA models, SABER reduces task success by 20.6 percent, increases action-sequence length by 55 percent, and raises constraint violations by 33 percent, while requiring 21.1 percent fewer tool calls and 54.7 percent fewer character edits than strong GPT-based baselines.

What carries the argument

The GRPO-trained ReAct attacker, which autonomously selects and applies a sequence of editing tools at character, token, and prompt levels to stay within a bounded edit budget while producing targeted changes in downstream robot behavior.

If this is right

  • Bounded instruction edits are sufficient to produce large drops in task success for current VLA models.
  • Robot action sequences become substantially longer and violate more constraints when instructions receive these edits.
  • An agentic black-box pipeline can red-team multiple VLA models more efficiently than non-agentic LLM prompting.
  • The same editing approach works across different models without any per-model retraining or tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • VLA training pipelines may need explicit robustness objectives against small instruction perturbations to prevent easy degradation.
  • The agentic editing method could be repurposed to generate diverse training examples that improve instruction following under noise.
  • Similar bounded-edit attackers might expose comparable weaknesses in other language-conditioned embodied systems beyond robotics.

Load-bearing premise

The GRPO-trained ReAct attacker produces edits that remain plausible and minimal while still causing targeted behavioral degradation across diverse VLA models without model-specific tuning.

What would settle it

Running SABER-generated edits on held-out VLA models on LIBERO tasks and finding no measurable drop in success rate, no increase in action length, or edits that exceed the budget or lose plausibility.

Figures

Figures reproduced from arXiv: 2603.24935 by Amrit Singh Bedi, Dinesh Manocha, Guangyao Shi, Qingzi Wang, Xiyang Wu, Zongxia Li.

Figure 1
Figure 1. Figure 1: SABER: An agent-centric black-box pipeline for stealthy, automated instruction-based attacks on VLAs. VLA models for robot manipulation are expected to achieve high task success, efficient action planning and execution, and safe behavior under physical constraints. However, even small instruction perturbations can induce VLA malfunctions. SABER (Dashed Box) applies stealthy edits to manipulation instructio… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of SABER. For each LIBERO task, we maintain two contrastive rollouts under a frozen target VLA. A clean baseline rollout (Green Box) is first executed and cached as reference. For the attack rollout, the instruction is passed to a red-team agent (Red Box), which uses an LLM backbone to reason over the instruction and available tools, then performs multi-turn FIND→APPLY edits in a ReAct-style loop.… view at source ↗
Figure 3
Figure 3. Figure 3: Two-stage training procedure. We cold-start by caching clean baseline rollouts from target VLAs (Orange) and collecting initial attack trajectories with a frozen red-team agent (Red) via lightweight random exploration over tool-calling chains. These rollouts form the cold-start dataset for SFT before GRPO training. We then perform agentic RL in interactive scenarios, where the red-team agent attacks target… view at source ↗
read the original abstract

Vision-language-action (VLA) models enable robots to follow natural-language instructions grounded in visual observations, but the instruction channel also introduces a critical vulnerability: small textual perturbations can alter downstream robot behavior. Systematic robustness evaluation therefore requires a black-box attacker that can generate minimal yet effective instruction edits across diverse VLA models. To this end, we present SABER, an agent-centric approach for automatically generating instruction-based adversarial attacks on VLA models under bounded edit budgets. SABER uses a GRPO-trained ReAct attacker to generate small, plausible adversarial instruction edits using character-, token-, and prompt-level tools under a bounded edit budget that induces targeted behavioral degradation, including task failure, unnecessarily long execution, and increased constraint violations. On the LIBERO benchmark across six state-of-the-art VLA models, SABER reduces task success by 20.6%, increases action-sequence length by 55%, and raises constraint violations by 33%, while requiring 21.1% fewer tool calls and 54.7% fewer character edits than strong GPT-based baselines. These results show that small, plausible instruction edits are sufficient to substantially degrade robot execution, and that an agentic black-box pipeline offers a practical, scalable, and adaptive approach for red-teaming robotic foundation models. The codebase is publicly available at https://github.com/wuxiyang1996/SABER.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces SABER, an agentic black-box attack framework for vision-language-action (VLA) models that uses a GRPO-trained ReAct attacker to generate minimal, plausible adversarial edits to natural-language instructions via character-, token-, and prompt-level tools under bounded edit budgets. These edits are shown to induce targeted behavioral degradation (task failure, longer action sequences, and increased constraint violations) on the LIBERO benchmark across six state-of-the-art VLA models, outperforming GPT-based baselines in efficiency (21.1% fewer tool calls, 54.7% fewer character edits) while achieving 20.6% lower success rates, 55% longer sequences, and 33% more violations. The work positions this as a scalable red-teaming approach for robotic foundation models and releases the codebase publicly.

Significance. If the empirical results hold under proper statistical validation, the work is significant for robotics and AI safety as it provides a practical, adaptive black-box method to expose vulnerabilities in instruction-grounded VLA models without model-specific tuning. The public codebase release supports reproducibility and extension by the community.

major comments (3)
  1. [Abstract] Abstract: the headline performance deltas (20.6% success drop, 55% longer sequences, 33% more violations) are reported without error bars, statistical tests, or variance across tasks/runs, which is load-bearing for the central claim of consistent cross-model degradation.
  2. [Abstract] Abstract and experimental section: no ablation removing GRPO training, no human plausibility ratings or semantic-similarity metrics for the generated edits, and no per-model breakdown or variance, leaving the assumption that the ReAct policy produces minimal yet effective zero-shot transfers unverified.
  3. [Abstract] Abstract: details on enforcement of the bounded edit budget, baseline implementation (specific GPT prompts and tool usage), and how action-sequence length and constraint violations were measured are absent, undermining reproducibility and the fairness of the efficiency comparison.
minor comments (1)
  1. [Abstract] Abstract: specify the exact six VLA models evaluated to improve clarity for readers unfamiliar with the LIBERO benchmark.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential significance of SABER as a scalable red-teaming approach for VLA models. We agree that several elements in the abstract and experimental reporting require strengthening for rigor and reproducibility. We will make the requested additions and clarifications in the revised manuscript. Point-by-point responses follow.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the headline performance deltas (20.6% success drop, 55% longer sequences, 33% more violations) are reported without error bars, statistical tests, or variance across tasks/runs, which is load-bearing for the central claim of consistent cross-model degradation.

    Authors: We agree that error bars, statistical tests, and variance reporting are essential to support the central claims. The reported figures are means aggregated over 100+ episodes per model across the LIBERO suite, but we omitted explicit variance and significance in the abstract. In the revision we will add standard deviations to the headline numbers, state that differences are statistically significant (paired t-tests, p<0.05), and include a new table in the experimental section showing per-model and per-task means with variances. revision: yes

  2. Referee: [Abstract] Abstract and experimental section: no ablation removing GRPO training, no human plausibility ratings or semantic-similarity metrics for the generated edits, and no per-model breakdown or variance, leaving the assumption that the ReAct policy produces minimal yet effective zero-shot transfers unverified.

    Authors: We acknowledge these omissions weaken verification of the GRPO-trained ReAct component. We will add (1) an ablation comparing the trained ReAct agent against an untrained ReAct baseline and a random-edit baseline, (2) human plausibility ratings collected from 12 annotators on a 5-point Likert scale for naturalness and semantic fidelity, together with BERTScore and sentence-embedding cosine similarity metrics, and (3) a per-model breakdown table reporting success rate, sequence length, violations, and standard deviations for all six VLA models. revision: yes

  3. Referee: [Abstract] Abstract: details on enforcement of the bounded edit budget, baseline implementation (specific GPT prompts and tool usage), and how action-sequence length and constraint violations were measured are absent, undermining reproducibility and the fairness of the efficiency comparison.

    Authors: We agree that these implementation details are necessary for reproducibility. In the revised experimental section we will specify: the exact bounded-edit budget (maximum 5 character/token/prompt edits with early stopping when the ReAct agent decides the instruction is sufficiently adversarial), the full GPT-4 baseline prompts and tool-calling protocol, the precise definition of action-sequence length (number of executed actions until success, failure, or timeout), and how constraint violations are counted using the LIBERO benchmark’s built-in safety checkers. These additions will also clarify the efficiency comparison. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical attack results on held-out benchmarks

full rationale

The paper describes an empirical attack framework (SABER) that trains a GRPO ReAct policy and measures its effects on task success, sequence length, and constraint violations across six VLA models on the LIBERO benchmark. No derivation chain, equations, or first-principles predictions are presented that reduce to their own inputs by construction. Performance numbers are direct experimental outcomes on held-out tasks rather than fitted quantities renamed as predictions. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling appear in the provided text. The central claims rest on observable metrics and baseline comparisons, not on definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract provides no explicit free parameters or invented entities; the GRPO training step likely involves standard RL hyperparameters that are not detailed here.

axioms (1)
  • domain assumption ReAct-style agent planning can be effectively trained with GRPO to produce instruction edits
    Central to the attacker design but not derived in the abstract

pith-pipeline@v0.9.0 · 5568 in / 1150 out tokens · 39675 ms · 2026-05-15T01:02:45.067264+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 9 internal anchors

  1. [1]

    OpenVLA: An Open-Source Vision-Language-Action Model

    M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi,et al., “Openvla: An open- source vision-language-action model,”arXiv preprint arXiv:2406.09246, 2024

  2. [2]

    π∗ 0.6: a vla that learns from experience,

    P. Intelligence, “π∗ 0.6: a vla that learns from experience,” 2025

  3. [3]

    arXiv preprint arXiv:2601.18692 (2026)

    W. Wu, F. Lu, Y. Wang, S. Yang, S. Liu, F. Wang, Q. Zhu, H. Sun, Y. Wang, S. Ma,et al., “A pragmatic vla foundation model,”arXiv preprint arXiv:2601.18692, 2026

  4. [4]

    Rt-2: Vision-language-action models transfer web knowledge to robotic control,

    B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid,et al., “Rt-2: Vision-language-action models transfer web knowledge to robotic control,” inConference on Robot Learning, pp. 2165–2183, PMLR, 2023

  5. [5]

    Jailbreaking llm-controlled robots,

    A. Robey, Z. Ravichandran, V. Kumar, H. Hassani, and G. J. Pappas, “Jailbreaking llm-controlled robots,” in2025 IEEE International Conference on Robotics and Automation (ICRA), pp. 11948–11956, IEEE, 2025

  6. [6]

    On the vulnerability of llm/vlm- controlled robotics,

    X. Wu, S. Chakraborty, R. Xian, J. Liang, T. Guan, F. Liu, B. M. Sadler, D. Manocha, and A. S. Bedi, “On the vulnerability of llm/vlm- controlled robotics,” in2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1914–1921, IEEE, 2025

  7. [7]

    Adversarial attacks on robotic vision language action models,

    E. K. Jones, A. Robey, A. Zou, Z. Ravichandran, G. J. Pappas, H. Has- sani, M. Fredrikson, and J. Z. Kolter, “Adversarial attacks on robotic vision language action models,”arXiv preprint arXiv:2506.03350, 2025

  8. [8]

    Poex: Towards policy executable jailbreak attacks against the llm-based robots,

    X. Lu, Z. Huang, X. Li, C. Zhang, W. Xu,et al., “Poex: Towards policy executable jailbreak attacks against the llm-based robots,”arXiv preprint arXiv:2412.16633, 2024

  9. [9]

    Freezevla: Action-freezing attacks against vision-language-action models,

    X. Wang, J. Li, Z. Weng, Y. Wang, Y. Gao, T. Pang, C. Du, Y. Teng, Y. Wang, Z. Wu,et al., “Freezevla: Action-freezing attacks against vision-language-action models,”arXiv preprint arXiv:2509.19870, 2025

  10. [10]

    Manipulation facing threats: Evaluating physical vulnerabilities in end-to-end vision language action models,

    H. Cheng, E. Xiao, Y. Wang, C. Yu, M. Sun, Q. Zhang, J. Cao, Y. Guo, N. Liu, K. Xu,et al., “Manipulation facing threats: Evaluating physical vulnerabilities in end-to-end vision language action models,”arXiv preprint arXiv:2409.13174, 2024

  11. [11]

    Exploring the adversarial vulnerabilities of vision- language-action models in robotics,

    T. Wang, C. Han, J. Liang, W. Yang, D. Liu, L. X. Zhang, Q. Wang, J. Luo, and R. Tang, “Exploring the adversarial vulnerabilities of vision- language-action models in robotics,” inProceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6948–6958, 2025

  12. [12]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    A. Zou, Z. Wang, N. Carlini, M. Nasr, J. Z. Kolter, and M. Fredrikson, “Universal and transferable adversarial attacks on aligned language models,”arXiv preprint arXiv:2307.15043, 2023

  13. [13]

    Ad- vPrompter: Fast Adaptive Adversarial Prompting for LLMs

    A. Paulus, A. Zharmagambetov, C. Guo, B. Amos, and Y. Tian, “Advprompter: Fast adaptive adversarial prompting for llms,”arXiv preprint arXiv:2404.16873, 2024

  14. [14]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu,et al., “Deepseekmath: Pushing the limits of mathematical reasoning in open language models,”arXiv preprint arXiv:2402.03300, 2024

  15. [15]

    Libero: Benchmarking knowledge transfer for lifelong robot learning,

    B. Liu, Y. Zhu, C. Gao, Y. Feng, Q. Liu, Y. Zhu, and P. Stone, “Libero: Benchmarking knowledge transfer for lifelong robot learning,”Advances in Neural Information Processing Systems, vol. 36, pp. 44776–44791, 2023

  16. [16]

    Jailbreaking black box large language models in twenty queries,

    P. Chao, A. Robey, E. Dobriban, H. Hassani, G. J. Pappas, and E. Wong, “Jailbreaking black box large language models in twenty queries,” in 2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), pp. 23–42, IEEE, 2025

  17. [17]

    Amplegcg: Learning a universal and transferable generative model of adversarial suffixes for jailbreaking both open and closed llms,

    Z. Liao and H. Sun, “Amplegcg: Learning a universal and transferable generative model of adversarial suffixes for jailbreaking both open and closed llms,”arXiv preprint arXiv:2404.07921, 2024

  18. [18]

    AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

    X. Liu, N. Xu, M. Chen, and C. Xiao, “Autodan: Generating stealthy jailbreak prompts on aligned large language models,”arXiv preprint arXiv:2310.04451, 2023

  19. [19]

    Deciphering the chaos: Enhancing jailbreak attacks via adversarial prompt translation,

    Q. Li, X. Yang, W. Zuo, and Y. Guo, “Deciphering the chaos: Enhancing jailbreak attacks via adversarial prompt translation,”arXiv preprint arXiv:2410.11317, 2024

  20. [20]

    Autogen: Enabling next-gen llm applications via multi-agent conversations,

    Q. Wu, G. Bansal, J. Zhang, Y. Wu, B. Li, E. Zhu, L. Jiang, X. Zhang, S. Zhang, J. Liu,et al., “Autogen: Enabling next-gen llm applications via multi-agent conversations,” inFirst conference on language modeling, 2024

  21. [21]

    Overthink: Slowdown attacks on reasoning llms,

    A. Kumar, J. Roh, A. Naseh, M. Karpinska, M. Iyyer, A. Houmansadr, and E. Bagdasarian, “Overthink: Slowdown attacks on reasoning llms,” arXiv preprint arXiv:2502.02542, 2025

  22. [22]

    π0.5: a vision-language-action model with open-world generalization,

    P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai,et al., “π0.5: a vision-language-action model with open-world generalization,” 2025

  23. [23]

    Embodied Red Teaming for Auditing Robotic Foundation Models,

    S. Karnik, Z.-W. Hong, N. Abhangi, Y.-C. Lin, T.-H. Wang, C. Dupuy, R. Gupta, and P. Agrawal, “Embodied red teaming for auditing robotic foundation models,”arXiv preprint arXiv:2411.18676, 2024

  24. [24]

    Attackvla: Benchmarking adversarial and backdoor attacks on vision- language-action models,

    J. Li, Y. Zhao, X. Zheng, Z. Xu, Y. Li, X. Ma, and Y.-G. Jiang, “Attackvla: Benchmarking adversarial and backdoor attacks on vision- language-action models,”arXiv preprint arXiv:2511.12149, 2025

  25. [25]

    When alignment fails: Multimodal adversarial attacks on vision-language-action models,

    Y. Yan, Y. Xie, Y. Zhang, L. Lyu, H. Wang, and Y. Jin, “When alignment fails: Multimodal adversarial attacks on vision-language-action models,” arXiv preprint arXiv:2511.16203, 2025

  26. [26]

    Bad- vla: Towards backdoor attacks on vision-language-action models via objective-decoupled optimization,

    X. Zhou, G. Tie, G. Zhang, H. Wang, P. Zhou, and L. Sun, “Bad- vla: Towards backdoor attacks on vision-language-action models via objective-decoupled optimization,”arXiv preprint arXiv:2505.16640, 2025

  27. [27]

    React: Synergizing reasoning and acting in language models,

    S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao, “React: Synergizing reasoning and acting in language models,” inThe eleventh international conference on learning representations, 2022

  28. [28]

    Mm-zero: Self-evolving multi-model vision language models from zero data,

    Z. Li, H. Du, C. Huang, X. Wu, L. Yu, Y. He, J. Xie, X. Wu, Z. Liu, J. Zhang,et al., “Mm-zero: Self-evolving multi-model vision language models from zero data,”arXiv preprint arXiv:2603.09206, 2026

  29. [29]

    Toolformer: Language models can teach themselves to use tools,

    T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, and T. Scialom, “Toolformer: Language models can teach themselves to use tools,”Advances in neural information processing systems, vol. 36, pp. 68539–68551, 2023

  30. [30]

    Agent lightning: Train any ai agents with reinforcement learning,

    X. Luo, Y. Zhang, Z. He, Z. Wang, S. Zhao, D. Li, L. K. Qiu, and Y. Yang, “Agent lightning: Train any ai agents with reinforcement learning,”arXiv preprint arXiv:2508.03680, 2025

  31. [31]

    Agentrl: Scaling agentic reinforcement learning with a multi-turn, multi-task framework,

    H. Zhang, X. Liu, B. Lv, X. Sun, B. Jing, I. L. Iong, Z. Hou, Z. Qi, H. Lai, Y. Xu,et al., “Agentrl: Scaling agentic reinforcement learning with a multi-turn, multi-task framework,”arXiv preprint arXiv:2510.04206, 2025

  32. [32]

    arXiv preprint arXiv:2511.14460 , year=

    M. Cheng, J. Ouyang, S. Yu, R. Yan, Y. Luo, Z. Liu, D. Wang, Q. Liu, and E. Chen, “Agent-r1: Training powerful llm agents with end-to-end reinforcement learning,”arXiv preprint arXiv:2511.14460, 2025

  33. [33]

    Self-Rewarding Vision-Language Model via Reasoning Decomposition

    Z. Li, W. Yu, C. Huang, R. Liu, Z. Liang, F. Liu, J. Che, D. Yu, J. Boyd-Graber, H. Mi,et al., “Self-rewarding vision-language model via reasoning decomposition,”arXiv preprint arXiv:2508.19652, 2025

  34. [34]

    Reinforcement learning- driven llm agent for automated attacks on llms,

    X. Wang, J. Peng, K. Xu, H. Yao, and T. Chen, “Reinforcement learning- driven llm agent for automated attacks on llms,” inProceedings of the fifth workshop on privacy in natural language processing, pp. 170–177, 2024

  35. [35]

    Art: Agent reinforcement trainer,

    B. Hilton, K. Corbitt, D. Corbitt, S. Gandhi, A. William, B. Kovalevskyi, and A. Jones, “Art: Agent reinforcement trainer,” 2025

  36. [36]

    Lora: Low-rank adaptation of large language models.,

    E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen,et al., “Lora: Low-rank adaptation of large language models.,” Iclr, vol. 1, no. 2, p. 3, 2022

  37. [37]

    Mm-zero: Self-evolving multi-model vision language models from zero data,

    Z. Li, H. Du, C. Huang, X. Wu, L. Yu, Y. He, J. Xie, X. Wu, Z. Liu, J. Zhang, and F. Liu, “Mm-zero: Self-evolving multi-model vision language models from zero data,” 2026

  38. [38]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    J. Bjorck, F. Castañeda, N. Cherniadev, X. Da, R. Ding, L. Fan, Y. Fang, D. Fox, F. Hu, S. Huang,et al., “Gr00t n1: An open foundation model for generalist humanoid robots,”arXiv preprint arXiv:2503.14734, 2025

  39. [39]

    X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model

    J. Zheng, J. Li, Z. Wang, D. Liu, X. Kang, Y. Feng, Y. Zheng, J. Zou, Y. Chen, J. Zeng,et al., “X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model,”arXiv preprint arXiv:2510.10274, 2025

  40. [40]

    InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy

    X. Chen, Y. Chen, Y. Fu, N. Gao, J. Jia, W. Jin, H. Li, Y. Mu, J. Pang, Y. Qiao,et al., “Internvla-m1: A spatially guided vision-language-action framework for generalist robot policy,”arXiv preprint arXiv:2510.13778, 2025

  41. [41]

    DeepThinkVLA: Enhancing Reasoning Capability of Vision-Language-Action Models

    C. Yin, Y. Lin, W. Xu, S. Tam, X. Zeng, Z. Liu, and Z. Yin, “Deepthinkvla: Enhancing reasoning capability of vision-language- action models,”arXiv preprint arXiv:2511.15669, 2025

  42. [42]

    Qwen2.5 technical report,

    Qwen, “Qwen2.5 technical report,” 2025

  43. [43]

    GPT-5 mini Model (gpt-5-mini),

    OpenAI, “GPT-5 mini Model (gpt-5-mini),” 2026