pith. machine review for the scientific record.

arxiv: 2507.01925 · v1 · submitted 2025-07-02 · 💻 cs.RO

Recognition: 2 theorem links · Lean Theorem

A Survey on Vision-Language-Action Models: An Action Tokenization Perspective

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 14:03 UTC · model grok-4.3

classification 💻 cs.RO
keywords vision-language-action models · action tokenization · robotics · embodied AI · multimodal models · survey · foundation models · action generation

The pith

Vision-language-action models can be unified under one framework: a chain of action tokens carries vision and language inputs through to executable actions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that VLA models, despite appearing diverse, share a common structure where vision and language inputs are handled by modules to produce a sequence of action tokens. These tokens gradually incorporate more specific and physical details until they result in executable robot actions. The authors identify that the main way models differ is in the form of these action tokens, which they group into eight categories including language descriptions, trajectories, and raw actions. This view helps organize the field and points out where each approach succeeds or falls short, which matters for building better robotic systems that understand and act on the world.

Core claim

Current VLA models can be unified under a single framework: vision and language inputs are processed by a series of VLA modules, producing a chain of action tokens that progressively encode more grounded and actionable information, ultimately generating executable actions. The primary design choice distinguishing VLA models lies in how action tokens are formulated, categorized into language description, code, affordance, trajectory, goal state, latent representation, raw action, and reasoning.
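
To make the framework concrete, here is a minimal schematic sketch (not code from the paper) of the eight token categories and the progressive chain, in which each hypothetical module refines the token chain toward executable raw actions; all names and signatures are invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable, List


class ActionTokenType(Enum):
    """The eight action-token categories identified by the survey."""
    LANGUAGE_DESCRIPTION = auto()
    CODE = auto()
    AFFORDANCE = auto()
    TRAJECTORY = auto()
    GOAL_STATE = auto()
    LATENT_REPRESENTATION = auto()
    RAW_ACTION = auto()
    REASONING = auto()


@dataclass
class ActionToken:
    """One link in the chain: a token type plus its payload."""
    token_type: ActionTokenType
    payload: object  # e.g. a sentence, a waypoint list, a latent vector


# A VLA "module" is modeled here as any callable that refines the current
# token chain, optionally attending to the visual observation.
VLAModule = Callable[[object, List[ActionToken]], List[ActionToken]]


def run_vla_chain(vision: object, instruction: str,
                  modules: List[VLAModule]) -> List[ActionToken]:
    """Thread inputs through a chain of modules that ends in raw actions."""
    tokens = [ActionToken(ActionTokenType.LANGUAGE_DESCRIPTION, instruction)]
    for module in modules:
        tokens = module(vision, tokens)  # each step should be more grounded
    assert tokens[-1].token_type is ActionTokenType.RAW_ACTION, (
        "the chain should terminate in executable raw actions")
    return tokens
```

Under this lens, a language-planning system and an end-to-end policy differ mainly in how many of these intermediate token types are made explicit.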

What carries the argument

The chain of action tokens produced by VLA modules, which encodes increasingly grounded information to generate executable actions.

If this is right

  • This categorization allows systematic comparison of different VLA approaches.
  • It reveals strengths and limitations specific to each action token type.
  • It identifies underexplored directions for advancing VLA models.
  • It provides guidance toward developing general-purpose robotic intelligence.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • New hybrid VLA systems could combine different action token types to leverage their individual strengths; a minimal sketch follows this list.
  • The action token perspective might extend to other areas of AI like planning or decision making in non-physical domains.
  • Targeted experiments could test whether certain token types scale better with model size or data.
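
As a purely illustrative sketch of that first extension (nothing here comes from the paper; all signatures are hypothetical), a hybrid pipeline might chain reasoning tokens into trajectory tokens before a low-level controller emits raw actions:

```python
from typing import Callable, List, Tuple

# Hypothetical component signatures, invented for illustration only.
ReasoningFn = Callable[[str], str]                                  # instruction -> reasoning text
TrajectoryFn = Callable[[str], List[Tuple[float, float, float]]]    # reasoning -> 3D waypoints
ControllerFn = Callable[[List[Tuple[float, float, float]]], List[List[float]]]  # waypoints -> motor commands


def hybrid_policy(instruction: str,
                  reason: ReasoningFn,
                  plan: TrajectoryFn,
                  control: ControllerFn) -> List[List[float]]:
    """Compose reasoning, trajectory, and raw-action token types in one pipeline."""
    thought = reason(instruction)   # reasoning tokens (language space)
    waypoints = plan(thought)       # trajectory tokens (geometric space)
    return control(waypoints)       # raw action tokens (executable commands)
```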

Load-bearing premise

The assumption that the primary distinguishing feature of VLA models is their choice of action token formulation rather than other aspects of their architecture or training.

What would settle it

Discovery of a major VLA model whose design cannot be explained as generating a progressive chain of action tokens from vision and language inputs.

Original abstract

The remarkable advancements of vision and language foundation models in multimodal understanding, reasoning, and generation have sparked growing efforts to extend such intelligence to the physical world, fueling the flourishing of vision-language-action (VLA) models. Despite seemingly diverse approaches, we observe that current VLA models can be unified under a single framework: vision and language inputs are processed by a series of VLA modules, producing a chain of action tokens that progressively encode more grounded and actionable information, ultimately generating executable actions. We further determine that the primary design choice distinguishing VLA models lies in how action tokens are formulated, which can be categorized into language description, code, affordance, trajectory, goal state, latent representation, raw action, and reasoning. However, there remains a lack of comprehensive understanding regarding action tokens, significantly impeding effective VLA development and obscuring future directions. Therefore, this survey aims to categorize and interpret existing VLA research through the lens of action tokenization, distill the strengths and limitations of each token type, and identify areas for improvement. Through this systematic review and analysis, we offer a synthesized outlook on the broader evolution of VLA models, highlight underexplored yet promising directions, and contribute guidance for future research, hoping to bring the field closer to general-purpose intelligence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper surveys Vision-Language-Action (VLA) models, claiming they can be unified under a single framework in which vision and language inputs are processed by a series of VLA modules to produce a chain of action tokens that progressively encode more grounded and actionable information, ultimately yielding executable actions. It identifies the formulation of action tokens as the key distinguishing design choice and categorizes them into eight types (language description, code, affordance, trajectory, goal state, latent representation, raw action, and reasoning). The survey analyzes strengths and limitations of each type, reviews existing literature through this lens, and provides an outlook on future directions for general-purpose embodied intelligence.

Significance. If the unification framework holds without excessive post-hoc interpretation, the survey could provide a valuable organizational lens for a fast-growing field, helping researchers compare tokenization strategies and identify gaps. Its systematic categorization and distillation of trade-offs across token types represent a constructive contribution beyond simple enumeration of papers, particularly if it surfaces falsifiable predictions about which token types scale best to complex tasks.

major comments (2)
  1. [Abstract and §1 (Introduction)] The claim that VLA models universally produce 'a chain of action tokens that progressively encode more grounded and actionable information' risks being an interpretive overlay. Models using direct end-to-end mapping (e.g., diffusion policies or single-pass regression to raw actions) often lack explicit intermediate token stages in their architecture. The survey should identify specific counter-examples from the literature and clarify whether the progressive chain is an observed architectural property or a taxonomy imposed by the authors.
  2. [Taxonomy section, likely §3 or §4] The eight-category breakdown is presented as exhaustive, yet boundary cases such as hybrid models combining 'reasoning' with 'trajectory' or 'latent representation' are not explicitly handled. The paper should provide a decision procedure or table showing how each cited model is assigned to a primary token type, and discuss whether any prominent VLA works (e.g., recent RT-series or OpenVLA variants) require additional categories.
minor comments (2)
  1. [Figure captions and §2 (Background)] Ensure that diagrams illustrating the 'chain of action tokens' explicitly label which components correspond to the proposed VLA modules versus standard vision/language encoders, to prevent readers from conflating the framework with existing transformer pipelines.
  2. [Throughout] Standardize notation for 'action token' versus 'action output' so that readers can distinguish the intermediate representations from final executable commands.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our survey. The comments highlight important points regarding the framing of our unification framework and the robustness of the taxonomy. We address each major comment below and indicate the revisions planned for the next version of the manuscript.

Point-by-point responses
  1. Referee: [Abstract and §1 (Introduction)] The claim that VLA models universally produce 'a chain of action tokens that progressively encode more grounded and actionable information' risks being an interpretive overlay. Models using direct end-to-end mapping (e.g., diffusion policies or single-pass regression to raw actions) often lack explicit intermediate token stages in their architecture. The survey should identify specific counter-examples from the literature and clarify whether the progressive chain is an observed architectural property or a taxonomy imposed by the authors.

    Authors: We appreciate this careful reading. Our framework is presented as a unifying conceptual lens based on observed information flow in VLA models rather than a claim of universal explicit architectural stages. To strengthen this, we will revise the Abstract and Section 1 to explicitly note that some models (e.g., certain diffusion policies and direct regression approaches) operate with more implicit progression. We will cite specific counter-examples from the literature and clarify the distinction between explicit token chains in modular designs and the progressive grounding that can be implicit in end-to-end models. This revision will reduce any risk of over-interpretation while preserving the organizational value of the perspective. revision: yes

  2. Referee: [Taxonomy section, likely §3 or §4] The eight-category breakdown is presented as exhaustive, yet boundary cases such as hybrid models combining 'reasoning' with 'trajectory' or 'latent representation' are not explicitly handled. The paper should provide a decision procedure or table showing how each cited model is assigned to a primary token type, and discuss whether any prominent VLA works (e.g., recent RT-series or OpenVLA variants) require additional categories.

    Authors: We agree that making the assignment process more transparent will improve the taxonomy's rigor. In the revised manuscript, we will add a decision table or flowchart in the taxonomy section that specifies criteria for assigning each model to its primary token type, with explicit handling of hybrids (e.g., by prioritizing the dominant actionable output). We will also review and discuss recent works such as the RT-series and OpenVLA variants to confirm their placement or note any boundary considerations. These additions will address potential gaps without requiring new categories at this stage. revision: yes
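
As a hedged illustration of what such a decision procedure could look like once encoded (the rules and feature names below are invented for this sketch, not taken from the paper), the assignment might read as an ordered rule list that gives priority to the dominant actionable output:

```python
from enum import Enum, auto


class PrimaryTokenType(Enum):
    LANGUAGE_DESCRIPTION = auto()
    CODE = auto()
    AFFORDANCE = auto()
    TRAJECTORY = auto()
    GOAL_STATE = auto()
    LATENT_REPRESENTATION = auto()
    RAW_ACTION = auto()
    REASONING = auto()


def assign_primary_token_type(model: dict) -> PrimaryTokenType:
    """Ordered rules: hybrids fall back to their dominant actionable output."""
    dominant = model.get("dominant_actionable_output")  # e.g. "trajectory"
    if dominant is not None:
        return PrimaryTokenType[dominant.upper()]
    if model.get("emits_executable_code"):
        return PrimaryTokenType.CODE
    if model.get("predicts_goal_image"):
        return PrimaryTokenType.GOAL_STATE
    # Default: end-to-end policies that regress actions directly.
    return PrimaryTokenType.RAW_ACTION


# Example with a hypothetical feature description, not a claim about any real model:
assign_primary_token_type({"dominant_actionable_output": "trajectory"})
```

Making the rules ordered keeps hybrid assignments deterministic and auditable, which is what a published decision table would need.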

Circularity Check

0 steps flagged

Survey unification is a post-hoc taxonomic lens with no circular reduction to inputs or self-citations

Full rationale

This is a survey paper that proposes an observational framework for unifying VLA models via action tokenization and categorizes existing literature into token types (language description, code, affordance, etc.). The central claim is presented as an observation rather than a derivation from equations, fitted parameters, or prior self-work. No self-definitional loops, predictions that reduce to fits, or load-bearing self-citations appear in the abstract or described structure. The framework functions as a classification scheme applied to prior models, not a result forced by construction or imported uniqueness theorems. The derivation chain is therefore self-contained as a review without circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The survey rests on the domain assumption that action tokenization is the central distinguishing axis across VLA models; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: All current VLA models process vision and language inputs through modules that ultimately produce executable actions via action tokens.
    Stated in the abstract as the unifying observation that enables the categorization.

pith-pipeline@v0.9.0 · 5583 in / 1202 out tokens · 28506 ms · 2026-05-17T14:03:08.728329+00:00 · methodology

discussion (0)


Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. FlowHijack: A Dynamics-Aware Backdoor Attack on Flow-Matching Vision-Language-Action Models

    cs.CV 2026-03 unverdicted novelty 8.0

    FlowHijack is the first dynamics-aware backdoor attack on flow-matching VLAs that achieves high success rates with stealthy triggers while preserving benign performance and making malicious actions kinematically indis...

  2. CoRAL: Contact-Rich Adaptive LLM-based Control for Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    CoRAL lets LLMs act as adaptive cost designers for motion planners while using VLM priors and online identification to handle unknown physics, achieving over 50% higher success rates than baselines in unseen contact-r...

  3. Towards Multi-Object Nonprehensile Transportation via Shared Teleoperation: A Framework Based on Virtual Object Model Predictive Control

    cs.RO 2026-04 unverdicted novelty 7.0

    The virtual object MPC framework enables stable shared teleoperation for transporting up to nine objects, cutting sliding distance by 72.45% and eliminating tip-overs compared to baseline.

  4. QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models

    cs.LG 2026-02 unverdicted novelty 7.0

    QuantVLA is the first post-training quantization framework for VLA models that quantizes the diffusion transformer action head and reports higher task success rates than full-precision baselines with roughly 70% memor...

  5. RL-VLA$^3$: A Flexible and Asynchronous Reinforcement Learning Framework for VLA Training

    cs.AI 2026-02 unverdicted novelty 7.0

    RL-VLA³ is an asynchronous RL framework for VLA training that delivers up to 85.2% higher throughput than synchronous baselines while preserving identical sample efficiency and scaling to 256 GPUs.

  6. D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models

    cs.AI 2026-05 unverdicted novelty 6.0

    D-VLA introduces plane decoupling and a swimlane asynchronous pipeline to achieve high-concurrency RL training and linear scalability for billion- to trillion-parameter vision-language-action models.

  7. D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models

    cs.AI 2026-05 unverdicted novelty 6.0

    D-VLA uses plane decoupling and a swimlane pipeline to deliver higher throughput and linear speedup than prior RL frameworks when training billion- and trillion-parameter VLA models on benchmarks like LIBERO.

  8. Embodied Interpretability: Linking Causal Understanding to Generalization in Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    Interventional attribution via ISS and NMR diagnoses causal misalignment in VLA policies and predicts their generalization performance across manipulation tasks.

  9. CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors

    cs.RO 2026-04 unverdicted novelty 6.0

    CorridorVLA improves VLA models by using predicted sparse anchors to impose explicit spatial corridors on action trajectories, yielding 3.4-12.4% success rate gains on LIBERO-Plus with GR00T-Corr reaching 83.21%.

  10. E-VLA: Event-Augmented Vision-Language-Action Model for Dark and Blurred Scenes

    cs.CV 2026-04 conditional novelty 6.0

    E-VLA integrates event streams directly into VLA models via lightweight fusion, raising Pick-Place success from 0% to 60-90% at 20 lux and from 0% to 20-25% under severe motion blur.

  11. Red-Teaming Vision-Language-Action Models via Quality Diversity Prompt Generation for Robust Robot Policies

    cs.RO 2026-03 unverdicted novelty 6.0

    Q-DIG applies quality diversity optimization with vision-language models to generate diverse adversarial instructions that reveal VLA robot failures and enable robustness improvements via fine-tuning.

  12. SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning

    cs.RO 2025-09 conditional novelty 6.0

    SimpleVLA-RL applies tailored reinforcement learning to VLA models, reaching SoTA on LIBERO, outperforming π₀ on RoboTwin, and surpassing SFT in real-world tasks while reducing data needs and identifying a 'pushcut' p...

  13. Sword: Style-Robust World Models as Simulators via Dynamic Latent Bootstrapping for VLA Policy Post-Training

    cs.CV 2026-05 unverdicted novelty 5.0

    Sword improves world model simulators for VLA policies by disentangling visual style from dynamics and bootstrapping latents for better consistency, outperforming baselines on LIBERO in generalization and RL post-trai...

  14. CoRAL: Contact-Rich Adaptive LLM-based Control for Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 5.0

    CoRAL lets LLMs design objective functions for robot motion planners and uses vision-language models plus real-time identification to adapt to unknown physical properties, raising success rates by over 50 percent on n...

  15. ReconVLA: An Uncertainty-Guided and Failure-Aware Vision-Language-Action Framework for Robotic Control

    cs.RO 2026-04 unverdicted novelty 5.0

    ReconVLA enhances pretrained vision-language-action robotic policies with conformal prediction for uncertainty estimation and failure detection without retraining.

  16. DA-PTQ: Drift-Aware Post-Training Quantization for Efficient Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 5.0

    DA-PTQ quantizes VLAs by compensating cross-space distortions and allocating mixed precision to minimize motion errors and kinematic drift in trajectories.

  17. From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data

    cs.RO 2026-04 accept novelty 5.0

    A survey introduces an interface-centric taxonomy for video-to-control methods in robotic manipulation and identifies the robotics integration layer as the central open challenge.

  18. GeoPredict: Leveraging Predictive Kinematics and 3D Gaussian Geometry for Precise VLA Manipulation

    cs.CV 2025-12 unverdicted novelty 5.0

    GeoPredict improves VLA manipulation accuracy by adding predictive kinematic trajectories and 3D Gaussian workspace geometry as training-time depth-rendering supervision.

Reference graph

Works this paper leans on

299 extracted references · 299 canonical work pages · cited by 16 Pith papers · 58 internal anchors

  1. [1]

    On the Opportunities and Risks of Foundation Models

    Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models.arXiv preprint arXiv:2108.07258, 2021

  2. [2]

    A comprehensive survey on pretrained foundation models: A history from bert to chatgpt

    Ce Zhou, Qian Li, Chen Li, Jun Yu, Yixin Liu, Guangjing Wang, Kai Zhang, Cheng Ji, Qiben Yan, Lifang He, et al. A comprehensive survey on pretrained foundation models: A history from bert to chatgpt. International Journal of Machine Learning and Cybernetics, pages 1–65, 2024

  3. [3]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  4. [4]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  5. [5]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

  6. [6]

    Emerging properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021

  7. [7]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel HAZIZA, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Jegou, Julien Mairal, Patrick Lab...

  8. [8]

    URL https://openreview.net/forum?id=a68SUt6zFt

    ISSN 2835-8856. URL https://openreview.net/forum?id=a68SUt6zFt. Featured Certification

  9. [9]

    Segment anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. InInternational Conference on Computer Vision (ICCV), pages 4015–4026, 2023

  10. [10]

    SAM 2: Segment Anything in Images and Videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024

  11. [11]

    GPT-4o System Card

    OpenAI. Gpt-4o system card, 2024. URLhttps://arxiv.org/abs/2410.21276

  12. [12]

    Gemini 2.5: Our most intelligent ai model, 2025

    Gemini team. Gemini 2.5: Our most intelligent ai model, 2025. URL https://blog.google/ technology/google-deepmind/gemini-model-thinking-updates-march-2025

  13. [13]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025

  14. [14]

    Do as i can, not as i say: Grounding language in robotic affordances

    brian ichter, Anthony Brohan, Yevgen Chebotar, Chelsea Finn, Karol Hausman, Alexander Herzog, Daniel Ho, Julian Ibarz, Alex Irpan, Eric Jang, Ryan Julian, Dmitry Kalashnikov, Sergey Levine, Yao Lu, Carolina Parada, Kanishka Rao, Pierre Sermanet, Alexander T Toshev, Vincent Vanhoucke, Fei Xia, Ted Xiao, Peng Xu, Mengyuan Yan, Noah Brown, Michael Ahn, Omar ...

  15. [16]

    Code as policies: Language model programs for embodied control

    Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for embodied control. In2023 IEEE International Conference on Robotics and Automation (ICRA), pages 9493–9500. IEEE, 2023

  16. [17]

    Voxposer: Composable 3d value maps for robotic manipulation with language models

    Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu, and Li Fei-Fei. Voxposer: Composable 3d value maps for robotic manipulation with language models. In Jie Tan, Marc Toussaint, and Kourosh Darvish, editors, Proceedings of The 7th Conference on Robot Learning, volume 229 of Proceedings of Machine Learning Research, pages 540–562. PMLR, 06–09 Nov 2023. URL http...

  17. [18]

    RT-trajectory: Robotic task generalization via hindsight trajectory sketches

    Jiayuan Gu, Sean Kirmani, Paul Wohlhart, Yao Lu, Montserrat Gonzalez Arenas, Kanishka Rao, Wenhao Yu, Chuyuan Fu, Keerthana Gopalakrishnan, Zhuo Xu, Priya Sundaresan, Peng Xu, Hao Su, Karol Hausman, Chelsea Finn, Quan Vuong, and Ted Xiao. RT-trajectory: Robotic task generalization via hindsight trajectory sketches. InThe Twelfth International Conference o...

  18. [19]

    AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

    Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xu Huang, Shu Jiang, et al. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems.arXiv preprint arXiv:2503.06669, 2025

  19. [20]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Oriol Vinyals, Jack W. Rae, and Laurent Sifre. Traini...

  20. [21]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020

  21. [22]

    OpenVLA: An open-source vision-language-action model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. OpenVLA: An open-source vision-language-action model. In 8th Annual Conference on Robot Learni...

  22. [23]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control, 2024. URL https://arxiv.org/abs/2410.24164, 2024

  23. [24]

    RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

    Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation.arXiv preprint arXiv:2410.07864, 2024

  24. [25]

    Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models

    Lucy Xiaoyang Shi, Brian Ichter, Michael Equi, Liyiming Ke, Karl Pertsch, Quan Vuong, James Tanner, Anna Walling, Haohuan Wang, Niccolo Fusai, Adrian Li-Bell, Danny Driess, Lachy Groom, Sergey Levine, and Chelsea Finn. Hi robot: Open-ended instruction following with hierarchical vision-language-action models. CoRR, abs/2502.19417, February 2025. URL https:...

  25. [26]

    Gemini Robotics: Bringing AI into the Physical World

    Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Arenas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzijl, et al. Gemini robotics: Bringing ai into the physical world.arXiv preprint arXiv:2503.20020, 2025

  26. [27]

    Navgpt-2: Unleashing navigational reasoning capability for large vision-language models

    Gengze Zhou, Yicong Hong, Zun Wang, Xin Eric Wang, and Qi Wu. Navgpt-2: Unleashing navigational reasoning capability for large vision-language models. InEuropean Conference on Computer Vision, pages 260–278. Springer, 2024

  27. [28]

    Racevla: Vla-based racing drone navigation with human-like behaviour.arXiv preprint arXiv:2503.02572, 2025

    Valerii Serpiva, Artem Lykov, Artyom Myshlyaev, Muhammad Haris Khan, Ali Alridha Abdulkarim, Oleg Sautenkov, and Dzmitry Tsetserukou. Racevla: Vla-based racing drone navigation with human-like behaviour.arXiv preprint arXiv:2503.02572, 2025

  28. [29]

    Covla: Comprehensive vision-language-action dataset for autonomous driving

    Hidehisa Arai, Keita Miwa, Kento Sasaki, Kohei Watanabe, Yu Yamaguchi, Shunsuke Aoki, and Issei Yamamoto. Covla: Comprehensive vision-language-action dataset for autonomous driving. InProceedings of the Winter Conference on Applications of Computer Vision (WACV), pages 1933–1943, February 2025

  29. [30]

    EMMA: End-to-End Multimodal Model for Autonomous Driving

    Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, et al. Emma: End-to-end multimodal model for autonomous driving. arXiv preprint arXiv:2410.23262, 2024

  30. [31]

    DriveVLM: The convergence of autonomous driving and large vision-language models

    Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Yang Wang, Zhiyong Zhao, Kun Zhan, Peng Jia, XianPeng Lang, and Hang Zhao. DriveVLM: The convergence of autonomous driving and large vision-language models. In 8th Annual Conference on Robot Learning, 2024. URL https://openreview.net/forum?id=928V4Umlys

  31. [32]

    RT-H: Action Hierarchies Using Language

    Suneel Belkhale, Tianli Ding, Ted Xiao, Pierre Sermanet, Quan Vuong, Jonathan Tompson, Yevgen Chebotar, Debidatta Dwibedi, and Dorsa Sadigh. Rt-h: Action hierarchies using language. In https://arxiv.org/abs/2403.01823, 2024

  32. [33]

    DexGraspVLA: A vision-language-action framework towards general dexterous grasping

    Yifan Zhong, Xuchuan Huang, Ruochong Li, Ceyao Zhang, Yitao Liang, Yaodong Yang, and Yuanpei Chen. Dexgraspvla: A vision-language-action framework towards general dexterous grasping.arXiv preprint arXiv:2502.20900, 2025

  33. [34]

    Cot-vla: Visual chain-of-thought reasoning for vision-language-action models

    Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, Ankur Handa, Tsung-Yi Lin, Gordon Wetzstein, Ming-Yu Liu, and Donglai Xiang. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models. InProceedings of the Computer Vision and Pattern Recognition Conference (CVPR),...

  34. [35]

    Progprompt: Generating situated robot task plans using large language models. 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 11523–11530, 2022

    Ishika Singh, Valts Blukis, Arsalan Mousavian, Ankit Goyal, Danfei Xu, Jonathan Tremblay, Dieter Fox, Jesse Thomason, and Animesh Garg. Progprompt: Generating situated robot task plans using large language models. 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 11523–11530, 2022. URL https://api.semanticscholar.org/CorpusID:252519594

  35. [36]

    ReKep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation

    Wenlong Huang, Chen Wang, Yunzhu Li, Ruohan Zhang, and Li Fei-Fei. ReKep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation. In 2nd CoRL Workshop on Learning Effective Abstractions for Planning, 2024. URL https://openreview.net/forum?id=ZGbWq3VqrO

  36. [37]

    Rt-trajectory: Robotic task generalization via hindsight trajectory sketches

    Jiayuan Gu, Sean Kirmani, Paul Wohlhart, Yao Lu, Montserrat Gonzalez Arenas, Kanishka Rao, Wenhao Yu, Chuyuan Fu, Keerthana Gopalakrishnan, Zhuo Xu, et al. Rt-trajectory: Robotic task generalization via hindsight trajectory sketches. InThe Twelfth International Conference on Learning Representations, 2024

  37. [38]

    Any-point Trajectory Modeling for Policy Learning

    Chuan Wen, Xingyu Lin, John Ian Reyes So, Kai Chen, Qi Dou, Yang Gao, and Pieter Abbeel. Any-point Trajectory Modeling for Policy Learning. In Proceedings of Robotics: Science and Systems, Delft, Netherlands, July 2024. doi: 10.15607/RSS.2024.XX.092

  38. [39]

    3D-VLA: A 3D vision-language-action generative world model

    Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du, Yining Hong, and Chuang Gan. 3D-VLA: A 3D vision-language-action generative world model. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Proceedings of the 41st International Conference on Machine Learning ...

  39. [40]

    Latent action pretraining from videos

    Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Se June Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Yuchen Lin, Lars Liden, Kimin Lee, Jianfeng Gao, Luke Zettlemoyer, Dieter Fox, and Minjoon Seo. Latent action pretraining from videos. InCoRL 2024 Workshop on Whole-body Control and Bimanual Manipulation: Applications in Humanoid...

  40. [41]

    Robotic control via embodied chain-of-thought reasoning

    Michał Zawalski, William Chen, Karl Pertsch, Oier Mees, Chelsea Finn, and Sergey Levine. Robotic control via embodied chain-of-thought reasoning. In8th Annual Conference on Robot Learning, 2024. URL https://openreview.net/forum?id=S70MgnIA0v

  41. [42]

    Action-free reasoning for policy generalization

    Jaden Clark, Suvir Mirchandani, Dorsa Sadigh, and Suneel Belkhale. Action-free reasoning for policy generalization. arXiv preprint arXiv:2502.03729, 2025

  42. [43]

    Attention is all you need.Advances in neural information processing systems, 30, 2017

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017

  43. [44]

    Bert: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pages 4171–4186, 2019

  44. [45]

    Universal Sentence Encoder

    Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, et al. Universal sentence encoder.arXiv preprint arXiv:1803.11175, 2018

  45. [46]

    Multilingual Universal Sentence Encoder for Semantic Retrieval

    Yinfei Yang, Daniel Cer, Amin Ahmad, Mandy Guo, Jax Law, Noah Constant, Gustavo Hernandez Abrego, Steve Yuan, Chris Tar, Yun-Hsuan Sung, et al. Multilingual universal sentence encoder for semantic retrieval. arXiv preprint arXiv:1907.04307, 2019

  46. [47]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020

  47. [48]

    Improving language understanding by generative pre-training

    Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018

  48. [49]

    Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019

  49. [50]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

  50. [51]

    The bitter lesson.Incomplete Ideas (blog), 13(1):38, 2019

    Richard Sutton. The bitter lesson.Incomplete Ideas (blog), 13(1):38, 2019

  51. [52]

    Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

  52. [53]

    AI Alignment: A Comprehensive Survey

    Jiaming Ji, Tianyi Qiu, Boyuan Chen, Borong Zhang, Hantao Lou, Kaile Wang, Yawen Duan, Zhonghao He, Jiayi Zhou, Zhaowei Zhang, et al. Ai alignment: A comprehensive survey. arXiv preprint arXiv:2310.19852, 2023

  53. [54]

    Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

  54. [55]

    Alphaevolve: A coding agent for scientific and algorithmic discovery

    Alexander Novikov, Ngân Vu, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco JR Ruiz, Abbas Mehrabian, et al. Alphaevolve: A coding agent for scientific and algorithmic discovery. Technical report, Google DeepMind, 2025

  55. [56]

    googleapis ..., 2025

    URL https://storage.googleapis..., 2025.

  56. [57]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023

  57. [58]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023

  58. [59]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al- Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  59. [60]

    Gemma: Open Models Based on Gemini Research and Technology

    Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on gemini research and technology.arXiv preprint arXiv:2403.08295, 2024

  60. [61]

    Gemma 2: Improving Open Language Models at a Practical Size

    Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size.arXiv preprint arXiv:2408.00118, 2024

  61. [62]

    Gemma 3 Technical Report

    Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. Gemma 3 technical report.arXiv preprint arXiv:2503.19786, 2025

  62. [63]

    Mistral 7B

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b, 2023. URL https://arxiv...

  63. [64]

    Prefix-tuning: Optimizing continuous prompts for generation

    Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors,Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 458...

  64. [65]

    GPT understands, too

    Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. GPT understands, too. AI Open, 5:208–215, 2024

  65. [66]

    P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks

    Xiao Liu, Kaixuan Ji, Yicheng Fu, Weng Tam, Zhengxiao Du, Zhilin Yang, and Jie Tang. P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors,Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 6...

  66. [67]

    The power of scale for parameter-efficient prompt tuning

    Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3045–3059, Online and Punta Cana, Dominican Republic, November 2021. Association fo...

  67. [68]

    Parameter-efficient transfer learning for nlp

    Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In International conference on machine learning, pages 2790–2799. PMLR, 2019

  68. [69]

    Qlora: Efficient finetuning of quantized llms

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. Advances in neural information processing systems, 36:10088–10115, 2023

  69. [70]

    Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.

  70. [71]

    Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research, 23(120):1–39, 2022

    William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research, 23(120):1–39, 2022

  71. [72]

    Mixtral of Experts

    Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024

  72. [73]

    Mamba: Linear-time sequence modeling with selective state spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. InFirst Conference on Language Modeling, 2024

  73. [74]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024

  74. [75]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  75. [76]

    Scaling distributed machine learning with the parameter server

    Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. Scaling distributed machine learning with the parameter server. In OSDI, pages 583–598. USENIX Association, 2014

  76. [77]

    Large scale distributed deep networks.Advances in neural information processing systems, 25, 2012

    Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Marc’aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, et al. Large scale distributed deep networks.Advances in neural information processing systems, 25, 2012

  77. [78]

    Pipedream: Generalized pipeline parallelism for dnn training

    Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R Devanur, Gregory R Ganger, Phillip B Gibbons, and Matei Zaharia. Pipedream: Generalized pipeline parallelism for dnn training. InProceedings of the 27th ACM symposium on operating systems principles, pages 1–15, 2019

  78. [79]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism.arXiv preprint arXiv:1909.08053, 2019

  79. [80]

    Backpropagation applied to handwritten zip code recognition.Neural computation, 1(4):541–551, 1989

    Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1(4):541–551, 1989

  80. [81]

    Imagenet classification with deep convolutional neural networks.Advances in neural information processing systems, 25, 2012

    Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks.Advances in neural information processing systems, 25, 2012

Showing first 80 references.