pith. machine review for the scientific record.

arxiv: 2507.01925 · v1 · submitted 2025-07-02 · 💻 cs.RO

Recognition: 2 theorem links · Lean Theorem

A Survey on Vision-Language-Action Models: An Action Tokenization Perspective

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 14:03 UTC · model grok-4.3

classification 💻 cs.RO
keywords vision-language-action models · action tokenization · robotics · embodied AI · multimodal models · survey · foundation models · action generation

The pith

Vision-language-action models can be unified under one framework: a chain of action tokens carries vision and language inputs through to executable actions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that VLA models, despite appearing diverse, share a common structure where vision and language inputs are handled by modules to produce a sequence of action tokens. These tokens gradually incorporate more specific and physical details until they result in executable robot actions. The authors identify that the main way models differ is in the form of these action tokens, which they group into eight categories including language descriptions, trajectories, and raw actions. This view helps organize the field and points out where each approach succeeds or falls short, which matters for building better robotic systems that understand and act on the world.

Core claim

Current VLA models can be unified under a single framework: vision and language inputs are processed by a series of VLA modules, producing a chain of action tokens that progressively encode more grounded and actionable information, ultimately generating executable actions. The primary design choice distinguishing VLA models lies in how action tokens are formulated, categorized into language description, code, affordance, trajectory, goal state, latent representation, raw action, and reasoning.
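
To make the framework concrete, here is a minimal schematic sketch (not code from the paper) of the eight token categories and the progressive chain, in which each hypothetical module refines the token chain toward executable raw actions; all names and signatures are invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable, List


class ActionTokenType(Enum):
    """The eight action-token categories identified by the survey."""
    LANGUAGE_DESCRIPTION = auto()
    CODE = auto()
    AFFORDANCE = auto()
    TRAJECTORY = auto()
    GOAL_STATE = auto()
    LATENT_REPRESENTATION = auto()
    RAW_ACTION = auto()
    REASONING = auto()


@dataclass
class ActionToken:
    """One link in the chain: a token type plus its payload."""
    token_type: ActionTokenType
    payload: object  # e.g. a sentence, a waypoint list, a latent vector


# A VLA "module" is modeled here as any callable that refines the current
# token chain, optionally attending to the visual observation.
VLAModule = Callable[[object, List[ActionToken]], List[ActionToken]]


def run_vla_chain(vision: object, instruction: str,
                  modules: List[VLAModule]) -> List[ActionToken]:
    """Thread inputs through a chain of modules that ends in raw actions."""
    tokens = [ActionToken(ActionTokenType.LANGUAGE_DESCRIPTION, instruction)]
    for module in modules:
        tokens = module(vision, tokens)  # each step should be more grounded
    assert tokens[-1].token_type is ActionTokenType.RAW_ACTION, (
        "the chain should terminate in executable raw actions")
    return tokens
```

Under this lens, a language-planning system and an end-to-end policy differ mainly in how many of these intermediate token types are made explicit.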

What carries the argument

The chain of action tokens produced by VLA modules, which encodes increasingly grounded information to generate executable actions.

If this is right

  • This categorization allows systematic comparison of different VLA approaches.
  • It reveals strengths and limitations specific to each action token type.
  • It identifies underexplored directions for advancing VLA models.
  • It provides guidance toward developing general-purpose robotic intelligence.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • New hybrid VLA systems could combine different action token types to leverage their individual strengths; a minimal sketch follows this list.
  • The action token perspective might extend to other areas of AI like planning or decision making in non-physical domains.
  • Targeted experiments could test whether certain token types scale better with model size or data.
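
As a purely illustrative sketch of that first extension (nothing here comes from the paper; all signatures are hypothetical), a hybrid pipeline might chain reasoning tokens into trajectory tokens before a low-level controller emits raw actions:

```python
from typing import Callable, List, Tuple

# Hypothetical component signatures, invented for illustration only.
ReasoningFn = Callable[[str], str]                                  # instruction -> reasoning text
TrajectoryFn = Callable[[str], List[Tuple[float, float, float]]]    # reasoning -> 3D waypoints
ControllerFn = Callable[[List[Tuple[float, float, float]]], List[List[float]]]  # waypoints -> motor commands


def hybrid_policy(instruction: str,
                  reason: ReasoningFn,
                  plan: TrajectoryFn,
                  control: ControllerFn) -> List[List[float]]:
    """Compose reasoning, trajectory, and raw-action token types in one pipeline."""
    thought = reason(instruction)   # reasoning tokens (language space)
    waypoints = plan(thought)       # trajectory tokens (geometric space)
    return control(waypoints)       # raw action tokens (executable commands)
```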

Load-bearing premise

The assumption that the primary distinguishing feature of VLA models is their choice of action token formulation rather than other aspects of their architecture or training.

What would settle it

Discovery of a major VLA model whose design cannot be explained as generating a progressive chain of action tokens from vision and language inputs.

Original abstract

The remarkable advancements of vision and language foundation models in multimodal understanding, reasoning, and generation have sparked growing efforts to extend such intelligence to the physical world, fueling the flourishing of vision-language-action (VLA) models. Despite seemingly diverse approaches, we observe that current VLA models can be unified under a single framework: vision and language inputs are processed by a series of VLA modules, producing a chain of action tokens that progressively encode more grounded and actionable information, ultimately generating executable actions. We further determine that the primary design choice distinguishing VLA models lies in how action tokens are formulated, which can be categorized into language description, code, affordance, trajectory, goal state, latent representation, raw action, and reasoning. However, there remains a lack of comprehensive understanding regarding action tokens, significantly impeding effective VLA development and obscuring future directions. Therefore, this survey aims to categorize and interpret existing VLA research through the lens of action tokenization, distill the strengths and limitations of each token type, and identify areas for improvement. Through this systematic review and analysis, we offer a synthesized outlook on the broader evolution of VLA models, highlight underexplored yet promising directions, and contribute guidance for future research, hoping to bring the field closer to general-purpose intelligence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper surveys Vision-Language-Action (VLA) models, claiming they can be unified under a single framework in which vision and language inputs are processed by a series of VLA modules to produce a chain of action tokens that progressively encode more grounded and actionable information, ultimately yielding executable actions. It identifies the formulation of action tokens as the key distinguishing design choice and categorizes them into eight types (language description, code, affordance, trajectory, goal state, latent representation, raw action, and reasoning). The survey analyzes strengths and limitations of each type, reviews existing literature through this lens, and provides an outlook on future directions for general-purpose embodied intelligence.

Significance. If the unification framework holds without excessive post-hoc interpretation, the survey could provide a valuable organizational lens for a fast-growing field, helping researchers compare tokenization strategies and identify gaps. Its systematic categorization and distillation of trade-offs across token types represent a constructive contribution beyond simple enumeration of papers, particularly if it surfaces falsifiable predictions about which token types scale best to complex tasks.

major comments (2)
  1. [Abstract and §1 (Introduction)] The claim that VLA models universally produce 'a chain of action tokens that progressively encode more grounded and actionable information' risks being an interpretive overlay. Models using direct end-to-end mapping (e.g., diffusion policies or single-pass regression to raw actions) often lack explicit intermediate token stages in their architecture. The survey should identify specific counter-examples from the literature and clarify whether the progressive chain is an observed architectural property or a taxonomy imposed by the authors.
  2. [Taxonomy section, likely §3 or §4] The eight-category breakdown is presented as exhaustive, yet boundary cases such as hybrid models combining 'reasoning' with 'trajectory' or 'latent representation' are not explicitly handled. The paper should provide a decision procedure or table showing how each cited model is assigned to a primary token type, and discuss whether any prominent VLA works (e.g., recent RT-series or OpenVLA variants) require additional categories.
minor comments (2)
  1. [Figure captions and §2 (Background)] Ensure that diagrams illustrating the 'chain of action tokens' explicitly label which components correspond to the proposed VLA modules versus standard vision/language encoders, to prevent readers from conflating the framework with existing transformer pipelines.
  2. [Throughout] Standardize notation for 'action token' versus 'action output' so that readers can distinguish the intermediate representations from final executable commands.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our survey. The comments highlight important points regarding the framing of our unification framework and the robustness of the taxonomy. We address each major comment below and indicate the revisions planned for the next version of the manuscript.

Point-by-point responses
  1. Referee: [Abstract and §1 (Introduction)] The claim that VLA models universally produce 'a chain of action tokens that progressively encode more grounded and actionable information' risks being an interpretive overlay. Models using direct end-to-end mapping (e.g., diffusion policies or single-pass regression to raw actions) often lack explicit intermediate token stages in their architecture. The survey should identify specific counter-examples from the literature and clarify whether the progressive chain is an observed architectural property or a taxonomy imposed by the authors.

    Authors: We appreciate this careful reading. Our framework is presented as a unifying conceptual lens based on observed information flow in VLA models rather than a claim of universal explicit architectural stages. To strengthen this, we will revise the Abstract and Section 1 to explicitly note that some models (e.g., certain diffusion policies and direct regression approaches) operate with more implicit progression. We will cite specific counter-examples from the literature and clarify the distinction between explicit token chains in modular designs and the progressive grounding that can be implicit in end-to-end models. This revision will reduce any risk of over-interpretation while preserving the organizational value of the perspective. revision: yes

  2. Referee: [Taxonomy section, likely §3 or §4] The eight-category breakdown is presented as exhaustive, yet boundary cases such as hybrid models combining 'reasoning' with 'trajectory' or 'latent representation' are not explicitly handled. The paper should provide a decision procedure or table showing how each cited model is assigned to a primary token type, and discuss whether any prominent VLA works (e.g., recent RT-series or OpenVLA variants) require additional categories.

    Authors: We agree that making the assignment process more transparent will improve the taxonomy's rigor. In the revised manuscript, we will add a decision table or flowchart in the taxonomy section that specifies criteria for assigning each model to its primary token type, with explicit handling of hybrids (e.g., by prioritizing the dominant actionable output). We will also review and discuss recent works such as the RT-series and OpenVLA variants to confirm their placement or note any boundary considerations. These additions will address potential gaps without requiring new categories at this stage. revision: yes
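
As a hedged illustration of what such a decision procedure could look like once encoded (the rules and feature names below are invented for this sketch, not taken from the paper), the assignment might read as an ordered rule list that gives priority to the dominant actionable output:

```python
from enum import Enum, auto


class PrimaryTokenType(Enum):
    LANGUAGE_DESCRIPTION = auto()
    CODE = auto()
    AFFORDANCE = auto()
    TRAJECTORY = auto()
    GOAL_STATE = auto()
    LATENT_REPRESENTATION = auto()
    RAW_ACTION = auto()
    REASONING = auto()


def assign_primary_token_type(model: dict) -> PrimaryTokenType:
    """Ordered rules: hybrids fall back to their dominant actionable output."""
    dominant = model.get("dominant_actionable_output")  # e.g. "trajectory"
    if dominant is not None:
        return PrimaryTokenType[dominant.upper()]
    if model.get("emits_executable_code"):
        return PrimaryTokenType.CODE
    if model.get("predicts_goal_image"):
        return PrimaryTokenType.GOAL_STATE
    # Default: end-to-end policies that regress actions directly.
    return PrimaryTokenType.RAW_ACTION


# Example with a hypothetical feature description, not a claim about any real model:
assign_primary_token_type({"dominant_actionable_output": "trajectory"})
```

Making the rules ordered keeps hybrid assignments deterministic and auditable, which is what a published decision table would need.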

Circularity Check

0 steps flagged

Survey unification is a post-hoc taxonomic lens with no circular reduction to inputs or self-citations

Full rationale

This is a survey paper that proposes an observational framework for unifying VLA models via action tokenization and categorizes existing literature into token types (language description, code, affordance, etc.). The central claim is presented as an observation rather than a derivation from equations, fitted parameters, or prior self-work. No self-definitional loops, predictions that reduce to fits, or load-bearing self-citations appear in the abstract or described structure. The framework functions as a classification scheme applied to prior models, not a result forced by construction or imported uniqueness theorems. The derivation chain is therefore self-contained as a review without circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The survey rests on the domain assumption that action tokenization is the central distinguishing axis across VLA models; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: All current VLA models process vision and language inputs through modules that ultimately produce executable actions via action tokens.
    Stated in the abstract as the unifying observation that enables the categorization.

pith-pipeline@v0.9.0 · 5583 in / 1202 out tokens · 28506 ms · 2026-05-17T14:03:08.728329+00:00 · methodology

discussion (0)


Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. FlowHijack: A Dynamics-Aware Backdoor Attack on Flow-Matching Vision-Language-Action Models

    cs.CV 2026-03 unverdicted novelty 8.0

    FlowHijack is the first dynamics-aware backdoor attack on flow-matching VLAs that achieves high success rates with stealthy triggers while preserving benign performance and making malicious actions kinematically indis...

  2. CoRAL: Contact-Rich Adaptive LLM-based Control for Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    CoRAL lets LLMs act as adaptive cost designers for motion planners while using VLM priors and online identification to handle unknown physics, achieving over 50% higher success rates than baselines in unseen contact-r...

  3. Towards Multi-Object Nonprehensile Transportation via Shared Teleoperation: A Framework Based on Virtual Object Model Predictive Control

    cs.RO 2026-04 unverdicted novelty 7.0

    The virtual object MPC framework enables stable shared teleoperation for transporting up to nine objects, cutting sliding distance by 72.45% and eliminating tip-overs compared to baseline.

  4. QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models

    cs.LG 2026-02 unverdicted novelty 7.0

    QuantVLA is the first post-training quantization framework for VLA models that quantizes the diffusion transformer action head and reports higher task success rates than full-precision baselines with roughly 70% memor...

  5. RL-VLA$^3$: A Flexible and Asynchronous Reinforcement Learning Framework for VLA Training

    cs.AI 2026-02 unverdicted novelty 7.0

    RL-VLA³ is an asynchronous RL framework for VLA training that delivers up to 85.2% higher throughput than synchronous baselines while preserving identical sample efficiency and scaling to 256 GPUs.

  6. D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models

    cs.AI 2026-05 unverdicted novelty 6.0

    D-VLA introduces plane decoupling and a swimlane asynchronous pipeline to achieve high-concurrency RL training and linear scalability for billion- to trillion-parameter vision-language-action models.

  7. D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models

    cs.AI 2026-05 unverdicted novelty 6.0

    D-VLA uses plane decoupling and a swimlane pipeline to deliver higher throughput and linear speedup than prior RL frameworks when training billion- and trillion-parameter VLA models on benchmarks like LIBERO.

  8. Embodied Interpretability: Linking Causal Understanding to Generalization in Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    Interventional attribution via ISS and NMR diagnoses causal misalignment in VLA policies and predicts their generalization performance across manipulation tasks.

  9. CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors

    cs.RO 2026-04 unverdicted novelty 6.0

    CorridorVLA improves VLA models by using predicted sparse anchors to impose explicit spatial corridors on action trajectories, yielding 3.4-12.4% success rate gains on LIBERO-Plus with GR00T-Corr reaching 83.21%.

  10. E-VLA: Event-Augmented Vision-Language-Action Model for Dark and Blurred Scenes

    cs.CV 2026-04 conditional novelty 6.0

    E-VLA integrates event streams directly into VLA models via lightweight fusion, raising Pick-Place success from 0% to 60-90% at 20 lux and from 0% to 20-25% under severe motion blur.

  11. Red-Teaming Vision-Language-Action Models via Quality Diversity Prompt Generation for Robust Robot Policies

    cs.RO 2026-03 unverdicted novelty 6.0

    Q-DIG applies quality diversity optimization with vision-language models to generate diverse adversarial instructions that reveal VLA robot failures and enable robustness improvements via fine-tuning.

  12. SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning

    cs.RO 2025-09 conditional novelty 6.0

    SimpleVLA-RL applies tailored reinforcement learning to VLA models, reaching SoTA on LIBERO, outperforming π₀ on RoboTwin, and surpassing SFT in real-world tasks while reducing data needs and identifying a 'pushcut' p...

  13. Sword: Style-Robust World Models as Simulators via Dynamic Latent Bootstrapping for VLA Policy Post-Training

    cs.CV 2026-05 unverdicted novelty 5.0

    Sword improves world model simulators for VLA policies by disentangling visual style from dynamics and bootstrapping latents for better consistency, outperforming baselines on LIBERO in generalization and RL post-trai...

  14. CoRAL: Contact-Rich Adaptive LLM-based Control for Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 5.0

    CoRAL lets LLMs design objective functions for robot motion planners and uses vision-language models plus real-time identification to adapt to unknown physical properties, raising success rates by over 50 percent on n...

  15. ReconVLA: An Uncertainty-Guided and Failure-Aware Vision-Language-Action Framework for Robotic Control

    cs.RO 2026-04 unverdicted novelty 5.0

    ReconVLA enhances pretrained vision-language-action robotic policies with conformal prediction for uncertainty estimation and failure detection without retraining.

  16. DA-PTQ: Drift-Aware Post-Training Quantization for Efficient Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 5.0

    DA-PTQ quantizes VLAs by compensating cross-space distortions and allocating mixed precision to minimize motion errors and kinematic drift in trajectories.

  17. From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data

    cs.RO 2026-04 accept novelty 5.0

    A survey introduces an interface-centric taxonomy for video-to-control methods in robotic manipulation and identifies the robotics integration layer as the central open challenge.

  18. GeoPredict: Leveraging Predictive Kinematics and 3D Gaussian Geometry for Precise VLA Manipulation

    cs.CV 2025-12 unverdicted novelty 5.0

    GeoPredict improves VLA manipulation accuracy by adding predictive kinematic trajectories and 3D Gaussian workspace geometry as training-time depth-rendering supervision.

Reference graph

Works this paper leans on

299 extracted references · 299 canonical work pages · cited by 16 Pith papers · 58 internal anchors

  1. [1]

    On the Opportunities and Risks of Foundation Models

    Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models.arXiv preprint arXiv:2108.07258, 2021

  2. [2]

    A comprehensive survey on pretrained foundation models: A history from bert to chatgpt

    Ce Zhou, Qian Li, Chen Li, Jun Yu, Yixin Liu, Guangjing Wang, Kai Zhang, Cheng Ji, Qiben Yan, Lifang He, et al. A comprehensive survey on pretrained foundation models: A history from bert to chatgpt. International Journal of Machine Learning and Cybernetics, pages 1–65, 2024

  3. [3]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  4. [4]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  5. [5]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

  6. [6]

    Emerging properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021

  7. [7]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel HAZIZA, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Jegou, Julien Mairal, Patrick Lab...

  8. [8]

    URL https://openreview.net/forum?id=a68SUt6zFt

    ISSN 2835-8856. URL https://openreview.net/forum?id=a68SUt6zFt. Featured Certification

  9. [9]

    Segment anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. InInternational Conference on Computer Vision (ICCV), pages 4015–4026, 2023

  10. [10]

    SAM 2: Segment Anything in Images and Videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024

  11. [11]

    GPT-4o System Card

    OpenAI. Gpt-4o system card, 2024. URLhttps://arxiv.org/abs/2410.21276

  12. [12]

    Gemini 2.5: Our most intelligent ai model, 2025

    Gemini team. Gemini 2.5: Our most intelligent ai model, 2025. URL https://blog.google/ technology/google-deepmind/gemini-model-thinking-updates-march-2025

  13. [13]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025

  14. [14]

    Do as i can, not as i say: Grounding language in robotic affordances

    brian ichter, Anthony Brohan, Yevgen Chebotar, Chelsea Finn, Karol Hausman, Alexander Herzog, Daniel Ho, Julian Ibarz, Alex Irpan, Eric Jang, Ryan Julian, Dmitry Kalashnikov, Sergey Levine, Yao Lu, Carolina Parada, Kanishka Rao, Pierre Sermanet, Alexander T Toshev, Vincent Vanhoucke, Fei Xia, Ted Xiao, Peng Xu, Mengyuan Yan, Noah Brown, Michael Ahn, Omar ...

  15. [16]

    Code as policies: Language model programs for embodied control

    Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for embodied control. In2023 IEEE International Conference on Robotics and Automation (ICRA), pages 9493–9500. IEEE, 2023

  16. [17]

    Voxposer: Composable 3d value maps for robotic manipulation with language models

    Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu, and Li Fei-Fei. Voxposer: Composable 3d value maps for robotic manipulation with language models. In Jie Tan, Marc Toussaint, and Kourosh Darvish, editors, Proceedings of The 7th Conference on Robot Learning, volume 229 of Proceedings of Machine Learning Research, pages 540–562. PMLR, 06–09 Nov 2023. URL http...

  17. [18]

    RT-trajectory: Robotic task generalization via hindsight trajectory sketches

    Jiayuan Gu, Sean Kirmani, Paul Wohlhart, Yao Lu, Montserrat Gonzalez Arenas, Kanishka Rao, Wenhao Yu, Chuyuan Fu, Keerthana Gopalakrishnan, Zhuo Xu, Priya Sundaresan, Peng Xu, Hao Su, Karol Hausman, Chelsea Finn, Quan Vuong, and Ted Xiao. RT-trajectory: Robotic task generalization via hindsight trajectory sketches. InThe Twelfth International Conference o...

  18. [19]

    AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

    Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xu Huang, Shu Jiang, et al. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems.arXiv preprint arXiv:2503.06669, 2025

  19. [20]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Oriol Vinyals, Jack W. Rae, and Laurent Sifre. Traini...

  20. [21]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020

  21. [22]

    OpenVLA: An open-source vision-language-action model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. OpenVLA: An open-source vision-language-action model. In 8th Annual Conference on Robot Learni...

  22. [23]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control, 2024. URL https://arxiv.org/abs/2410.24164, 2024

  23. [24]

    RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

    Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation.arXiv preprint arXiv:2410.07864, 2024

  24. [25]

    Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models

    Lucy Xiaoyang Shi, Brian Ichter, Michael Equi, Liyiming Ke, Karl Pertsch, Quan Vuong, James Tanner, Anna Walling, Haohuan Wang, Niccolo Fusai, Adrian Li-Bell, Danny Driess, Lachy Groom, Sergey Levine, and Chelsea Finn. Hi robot: Open-ended instruction following with hierarchical vision-language-action models. CoRR, abs/2502.19417, February 2025. URL https:...

  25. [26]

    Gemini Robotics: Bringing AI into the Physical World

    Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Arenas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzijl, et al. Gemini robotics: Bringing ai into the physical world.arXiv preprint arXiv:2503.20020, 2025

  26. [27]

    Navgpt-2: Unleashing navigational reasoning capability for large vision-language models

    Gengze Zhou, Yicong Hong, Zun Wang, Xin Eric Wang, and Qi Wu. Navgpt-2: Unleashing navigational reasoning capability for large vision-language models. InEuropean Conference on Computer Vision, pages 260–278. Springer, 2024

  27. [28]

    Racevla: Vla-based racing drone navigation with human-like behaviour.arXiv preprint arXiv:2503.02572, 2025

    Valerii Serpiva, Artem Lykov, Artyom Myshlyaev, Muhammad Haris Khan, Ali Alridha Abdulkarim, Oleg Sautenkov, and Dzmitry Tsetserukou. Racevla: Vla-based racing drone navigation with human-like behaviour.arXiv preprint arXiv:2503.02572, 2025

  28. [29]

    Covla: Comprehensive vision-language-action dataset for autonomous driving

    Hidehisa Arai, Keita Miwa, Kento Sasaki, Kohei Watanabe, Yu Yamaguchi, Shunsuke Aoki, and Issei Yamamoto. Covla: Comprehensive vision-language-action dataset for autonomous driving. InProceedings of the Winter Conference on Applications of Computer Vision (WACV), pages 1933–1943, February 2025

  29. [30]

    EMMA: End-to-End Multimodal Model for Autonomous Driving

    Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, et al. Emma: End-to-end multimodal model for autonomous driving. arXiv preprint arXiv:2410.23262, 2024

  30. [31]

    DriveVLM: The convergence of autonomous driving and large vision-language models

    Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Yang Wang, Zhiyong Zhao, Kun Zhan, Peng Jia, XianPeng Lang, and Hang Zhao. DriveVLM: The convergence of autonomous driving and large vision-language models. In 8th Annual Conference on Robot Learning, 2024. URL https://openreview.net/forum?id=928V4Umlys

  31. [32]

    RT-H: Action Hierarchies Using Language

    Suneel Belkhale, Tianli Ding, Ted Xiao, Pierre Sermanet, Quan Vuong, Jonathan Tompson, Yevgen Chebotar, Debidatta Dwibedi, and Dorsa Sadigh. Rt-h: Action hierarchies using language. In https://arxiv.org/abs/2403.01823, 2024

  32. [33]

    DexGraspVLA: A vision-language-action framework towards general dexterous grasping

    Yifan Zhong, Xuchuan Huang, Ruochong Li, Ceyao Zhang, Yitao Liang, Yaodong Yang, and Yuanpei Chen. Dexgraspvla: A vision-language-action framework towards general dexterous grasping.arXiv preprint arXiv:2502.20900, 2025

  33. [34]

    Cot-vla: Visual chain-of-thought reasoning for vision-language-action models

    Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, Ankur Handa, Tsung-Yi Lin, Gordon Wetzstein, Ming-Yu Liu, and Donglai Xiang. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models. InProceedings of the Computer Vision and Pattern Recognition Conference (CVPR),...

  34. [35]

    Progprompt: Generating situated robot task plans using large language models. 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 11523–11530, 2022

    Ishika Singh, Valts Blukis, Arsalan Mousavian, Ankit Goyal, Danfei Xu, Jonathan Tremblay, Dieter Fox, Jesse Thomason, and Animesh Garg. Progprompt: Generating situated robot task plans using large language models. 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 11523–11530, 2022. URL https://api.semanticscholar.org/CorpusID:252519594

  35. [36]

    ReKep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation

    Wenlong Huang, Chen Wang, Yunzhu Li, Ruohan Zhang, and Li Fei-Fei. ReKep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation. In 2nd CoRL Workshop on Learning Effective Abstractions for Planning, 2024. URL https://openreview.net/forum?id=ZGbWq3VqrO

  36. [37]

    Rt-trajectory: Robotic task generalization via hindsight trajectory sketches

    Jiayuan Gu, Sean Kirmani, Paul Wohlhart, Yao Lu, Montserrat Gonzalez Arenas, Kanishka Rao, Wenhao Yu, Chuyuan Fu, Keerthana Gopalakrishnan, Zhuo Xu, et al. Rt-trajectory: Robotic task generalization via hindsight trajectory sketches. InThe Twelfth International Conference on Learning Representations, 2024

  37. [38]

    Any-point Trajectory Modeling for Policy Learning

    Chuan Wen, Xingyu Lin, John Ian Reyes So, Kai Chen, Qi Dou, Yang Gao, and Pieter Abbeel. Any-point Trajectory Modeling for Policy Learning. In Proceedings of Robotics: Science and Systems, Delft, Netherlands, July 2024. doi: 10.15607/RSS.2024.XX.092

  38. [39]

    3D-VLA: A 3D vision-language-action generative world model

    Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du, Yining Hong, and Chuang Gan. 3D-VLA: A 3D vision-language-action generative world model. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Proceedings of the 41st International Conference on Machine Learning ...

  39. [40]

    Latent action pretraining from videos

    Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Se June Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Yuchen Lin, Lars Liden, Kimin Lee, Jianfeng Gao, Luke Zettlemoyer, Dieter Fox, and Minjoon Seo. Latent action pretraining from videos. InCoRL 2024 Workshop on Whole-body Control and Bimanual Manipulation: Applications in Humanoid...

  40. [41]

    Robotic control via embodied chain-of-thought reasoning

    Michał Zawalski, William Chen, Karl Pertsch, Oier Mees, Chelsea Finn, and Sergey Levine. Robotic control via embodied chain-of-thought reasoning. In8th Annual Conference on Robot Learning, 2024. URL https://openreview.net/forum?id=S70MgnIA0v

  41. [42]

    Action-free reasoning for policy generalization

    Jaden Clark, Suvir Mirchandani, Dorsa Sadigh, and Suneel Belkhale. Action-free reasoning for policy generalization. arXiv preprint arXiv:2502.03729, 2025

  42. [43]

    Attention is all you need.Advances in neural information processing systems, 30, 2017

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017

  43. [44]

    Bert: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pages 4171–4186, 2019

  44. [45]

    Universal Sentence Encoder

    Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, et al. Universal sentence encoder.arXiv preprint arXiv:1803.11175, 2018

  45. [46]

    Multilingual Universal Sentence Encoder for Semantic Retrieval

    Yinfei Yang, Daniel Cer, Amin Ahmad, Mandy Guo, Jax Law, Noah Constant, Gustavo Hernandez Abrego, Steve Yuan, Chris Tar, Yun-Hsuan Sung, et al. Multilingual universal sentence encoder for semantic retrieval. arXiv preprint arXiv:1907.04307, 2019

  46. [47]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020

  47. [48]

    Improving language understanding by generative pre-training

    Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018

  48. [49]

    Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019

  49. [50]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

  50. [51]

    The bitter lesson.Incomplete Ideas (blog), 13(1):38, 2019

    Richard Sutton. The bitter lesson.Incomplete Ideas (blog), 13(1):38, 2019

  51. [52]

    Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

  52. [53]

    AI Alignment: A Comprehensive Survey

    Jiaming Ji, Tianyi Qiu, Boyuan Chen, Borong Zhang, Hantao Lou, Kaile Wang, Yawen Duan, Zhonghao He, Jiayi Zhou, Zhaowei Zhang, et al. Ai alignment: A comprehensive survey. arXiv preprint arXiv:2310.19852, 2023

  53. [54]

    Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

  54. [55]

    Alphaevolve: A coding agent for scientific and algorithmic discovery

    Alexander Novikov, Ngân Vu, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco JR Ruiz, Abbas Mehrabian, et al. Alphaevolve: A coding agent for scientific and algorithmic discovery. Technical report, Google DeepMind, 2025

  55. [56]

    googleapis ..., 2025

    URL https://storage.googleapis..., 2025.

  56. [57]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023

  57. [58]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023

  58. [59]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al- Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  59. [60]

    Gemma: Open Models Based on Gemini Research and Technology

    Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on gemini research and technology.arXiv preprint arXiv:2403.08295, 2024

  60. [61]

    Gemma 2: Improving Open Language Models at a Practical Size

    Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size.arXiv preprint arXiv:2408.00118, 2024

  61. [62]

    Gemma 3 Technical Report

    Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. Gemma 3 technical report.arXiv preprint arXiv:2503.19786, 2025

  62. [63]

    Mistral 7B

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b, 2023. URL https://arxiv...

  63. [64]

    Prefix-tuning: Optimizing continuous prompts for generation

    Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors,Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 458...

  64. [65]

    GPT understands, too

    Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. GPT understands, too. AI Open, 5:208–215, 2024

  65. [66]

    P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks

    Xiao Liu, Kaixuan Ji, Yicheng Fu, Weng Tam, Zhengxiao Du, Zhilin Yang, and Jie Tang. P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors,Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 6...

  66. [67]

    The power of scale for parameter-efficient prompt tuning

    Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3045–3059, Online and Punta Cana, Dominican Republic, November 2021. Association fo...

  67. [68]

    Parameter-efficient transfer learning for nlp

    Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In International conference on machine learning, pages 2790–2799. PMLR, 2019

  68. [69]

    Qlora: Efficient finetuning of quantized llms

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. Advances in neural information processing systems, 36:10088–10115, 2023

  69. [70]

    Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.

  70. [71]

    Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research, 23(120):1–39, 2022

    William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research, 23(120):1–39, 2022

  71. [72]

    Mixtral of Experts

    Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024

  72. [73]

    Mamba: Linear-time sequence modeling with selective state spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. InFirst Conference on Language Modeling, 2024

  73. [74]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024

  74. [75]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  75. [76]

    Scaling distributed machine learning with the parameter server

    Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. Scaling distributed machine learning with the parameter server. In OSDI, pages 583–598. USENIX Association, 2014

  76. [77]

    Large scale distributed deep networks.Advances in neural information processing systems, 25, 2012

    Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Marc’aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, et al. Large scale distributed deep networks.Advances in neural information processing systems, 25, 2012

  77. [78]

    Pipedream: Generalized pipeline parallelism for dnn training

    Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R Devanur, Gregory R Ganger, Phillip B Gibbons, and Matei Zaharia. Pipedream: Generalized pipeline parallelism for dnn training. InProceedings of the 27th ACM symposium on operating systems principles, pages 1–15, 2019

  78. [79]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism.arXiv preprint arXiv:1909.08053, 2019

  79. [80]

    Backpropagation applied to handwritten zip code recognition.Neural computation, 1(4):541–551, 1989

    Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1(4):541–551, 1989

  80. [81]

    Imagenet classification with deep convolutional neural networks.Advances in neural information processing systems, 25, 2012

    Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks.Advances in neural information processing systems, 25, 2012

Showing first 80 references.