X-Tokenizer: A Multimodal Action Tokenizer for Vision-Language-Action Pretraining

Dongxiu Liu; Hang Su; Hao Wang; Jinliang Zheng; Lights Shi; Lucy Liang; Miracle Kang; Pushi Zhang; Roy Gan; Shawn Qin

arxiv: 2606.14752 · v2 · pith:IYEC77GQnew · submitted 2026-06-07 · 💻 cs.CV · cs.AI · cs.LG · cs.RO

Miracle Kang, Lights Shi, Lucy Liang, Roy Gan, Dongxiu Liu, Pushi Zhang, Sylas Chen, Shawn Qin, Yinan Zheng, Jinliang Zheng, Hao Wang, Xianyuan Zhan, Hang Su

Pith reviewed 2026-06-30 11:20 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG · cs.RO

keywords: action tokenizer · vision-language-action · semantic residual quantization · multimodal pretraining · robot control · discrete action language · VLA models · masked action modeling

The pith

Action tokenization can serve as a semantic interface between vision-language reasoning and robot control rather than mere motion compression.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that standard action tokenizers prioritize reconstruction of motion geometry at the expense of semantic supervision for the vision-language backbone. It proposes X-Tokenizer, a lightweight encoder-decoder that uses Semantic Residual Quantization to create a first-level discrete action language trained via masked modeling, with residual levels preserving fine details. Additional pretraining aligns these tokens to a foundation model's representation space through a contrastive loss and next-frame vision-language feature prediction. The tokenizer is pretrained on 2.4 million trajectories (2.0 billion action frames) and then frozen; inserted into a mixed discrete-continuous VLA, it is reported to achieve the top real-world aggregate score and strong RoboTwin 2.0 simulation results, and to outperform FAST on multimodal grounding and long-horizon tasks. This reframes action discretization as a bridge that transfers multimodal knowledge into executable control across different robot arms.

Core claim

X-Tokenizer is a lightweight encoder-SRQ-decoder architecture that supplies a shared action interface across robotic arm embodiments. Its Semantic Residual Quantization imposes an asymmetric structure on residual vector quantization: the first level is trained with Masked Action Modeling to form a discrete action language capturing coarse motion intent, while deeper levels act as reconstruction-oriented residuals. The full model is further pretrained with contrastive alignment to a pretrained foundation model's representation space and with next-frame vision-language feature prediction. A single frozen X-Tokenizer then supplies representation-shaping supervision inside a mixed discrete-continuous VLA.

What carries the argument

Semantic Residual Quantization (SRQ), an asymmetric residual vector quantization in which the first level is trained via Masked Action Modeling to produce discrete tokens that capture coarse motion intent while deeper levels preserve fine-grained reconstruction details.
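
To make the residual structure concrete, here is a minimal NumPy sketch of plain residual vector quantization, the scheme SRQ builds on. It is not the paper's implementation: the codebooks are random rather than learned, and the sizes (3 levels, 16 codewords, 4-dimensional latents) and the names `rvq_encode`, `intent_tokens`, and `detail_tokens` are assumptions chosen for illustration. The sketch only shows how the first level yields one coarse token per input while deeper levels quantize what is left over.

```python
import numpy as np

def rvq_encode(z, codebooks):
    """Quantize latents z (N, D) with a stack of codebooks, each of shape (K, D).

    Returns integer codes of shape (N, L) and the summed reconstruction (N, D).
    """
    residual = z.copy()
    recon = np.zeros_like(z)
    codes = []
    for cb in codebooks:
        # Squared distance from every residual to every codeword: (N, K).
        d2 = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(axis=-1)
        idx = d2.argmin(axis=1)
        chosen = cb[idx]
        codes.append(idx)
        recon = recon + chosen
        residual = residual - chosen  # deeper levels only see what is left over
    return np.stack(codes, axis=1), recon

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 4))  # hypothetical encoder outputs for 8 action chunks
# Hypothetical random codebooks; real codebooks would be learned.
codebooks = [rng.normal(scale=1.0 / 2 ** level, size=(16, 4)) for level in range(3)]

codes, recon = rvq_encode(z, codebooks)
intent_tokens = codes[:, 0]   # level 1: the level SRQ trains with Masked Action Modeling
detail_tokens = codes[:, 1:]  # deeper levels: reconstruction-oriented residuals
```

In this reading, the asymmetry lies in the training objective applied to `codes[:, 0]`, not in the quantization step itself, which is the same at every level.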

If this is right

A frozen X-Tokenizer can be plugged into existing mixed discrete-continuous VLAs as a representation-shaping signal without retraining the tokenizer itself.
The paper reports gains over FAST of +13.5% in multimodal grounding and +8.25 on long-horizon tasks.
The same tokenizer supplies a shared interface across diverse robotic arm embodiments after pretraining on 2.4 million trajectories.
Action tokenizers can be viewed as semantic interfaces that transfer knowledge from pretrained vision-language models into precise robot control.
Deeper residual levels remain reconstruction-oriented while the first level forms the discrete action language.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same SRQ structure could be tested on non-arm embodiments such as mobile bases or dexterous hands to check whether the coarse-intent layer generalizes.
Scaling the pretraining corpus beyond 2.4 million trajectories might reveal whether the semantic alignment continues to improve or saturates.
The learned discrete action language could be inspected directly for human-interpretable motion primitives.
Inserting X-Tokenizer supervision into purely discrete VLAs might reduce the need for continuous action heads in some tasks.

Load-bearing premise

The assumption that an asymmetric first level trained with masked action modeling plus contrastive and next-frame alignment will produce tokens that carry semantic multimodal intent rather than only geometric motion details.
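
The masked-modeling half of this premise can be sketched as a BERT-style objective over first-level tokens. The abstract does not specify the masking scheme, so everything here is an assumption: the extra `MASK_ID`, the 30% mask ratio, the helper names, and the random logits standing in for a predictor network.

```python
import numpy as np

MASK_ID = 16  # hypothetical: one extra id appended to a 16-entry first-level codebook

def mask_tokens(tokens, mask_ratio, rng):
    """Replace a random subset of positions with MASK_ID; return inputs and the mask."""
    mask = rng.random(tokens.shape) < mask_ratio
    if not mask.any():
        mask.flat[0] = True  # guarantee at least one masked position
    inputs = np.where(mask, MASK_ID, tokens)
    return inputs, mask

def masked_cross_entropy(logits, targets, mask):
    """Mean cross-entropy over masked positions only.

    logits: (T, K) scores per position, targets: (T,) token ids, mask: (T,) bool.
    """
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    return nll[mask].mean()

rng = np.random.default_rng(0)
tokens = rng.integers(0, 16, size=12)  # hypothetical first-level codes for one trajectory
inputs, mask = mask_tokens(tokens, mask_ratio=0.3, rng=rng)
logits = rng.normal(size=(12, 16))     # stand-in for a predictor's output on `inputs`
loss = masked_cross_entropy(logits, tokens, mask)
```

Because the loss is computed only where tokens were hidden, a predictor can lower it only by inferring a token from its temporal context, which is the mechanism the premise relies on to make first-level tokens carry intent rather than local geometry.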

What would settle it

An ablation that removes either the masked action modeling objective or the contrastive alignment step and measures whether multimodal grounding and long-horizon task scores fall back to or below the level achieved by reconstruction-only tokenizers, such as a standard residual vector quantization tokenizer or the FAST baseline.

Original abstract

Modern Vision-Language-Action (VLA) models must bridge pretrained vision-language reasoning and precise continuous robot control. Existing action tokenizers discretize actions primarily for reconstruction, producing codes that preserve motion geometry but provide only weak semantic supervision to the backbone. We therefore formulate action tokenization not as mere compression, but as semantic interface learning between multimodal reasoning and executable control. To this end, we introduce X-Tokenizer, a lightweight encoder-Semantic Residual Quantization (SRQ)-decoder architecture that provides a shared action interface across diverse robotic arm embodiments. Its key component, SRQ, imposes an asymmetric structure on residual vector quantization: the first level is trained with Masked Action Modeling (MAM) to form a discrete action language that captures coarse motion intent, while deeper levels remain reconstruction-oriented residuals that preserve fine-grained details. To further align action tokens with multimodal semantics, X-Tokenizer is pretrained with contrastive alignment to the representation space of a pretrained foundation model and with next-frame vision-language feature prediction. Pretrained on 2.4M trajectories (2.0B action frames), a single frozen X-Tokenizer plugs into a mixed discrete-continuous VLA as a representation-shaping supervision signal. X-Tokenizer achieves top real-world aggregate and strong RoboTwin 2.0 simulation results. Outperforming FAST in multimodal grounding (+13.5%) and long-horizon tasks (+8.25), it shows that action tokenizers serve as semantic interfaces for VLA pretraining beyond mere action compression.
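
The contrastive alignment mentioned in the abstract can be illustrated with a generic InfoNCE loss between action embeddings and target features from a foundation model. The abstract does not give the exact loss, so the symmetric form, the temperature of 0.07, and the function name `info_nce` are assumptions; the embeddings below are random placeholders.

```python
import numpy as np

def info_nce(action_emb, target_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch; row i of each matrix is a matched pair."""
    a = action_emb / np.linalg.norm(action_emb, axis=1, keepdims=True)
    t = target_emb / np.linalg.norm(target_emb, axis=1, keepdims=True)
    logits = a @ t.T / temperature  # (B, B) scaled cosine similarities

    def diagonal_cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.diag(log_probs).mean()

    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))

rng = np.random.default_rng(0)
target = rng.normal(size=(6, 32))                    # placeholder foundation-model features
aligned = target + 0.01 * rng.normal(size=(6, 32))   # action embeddings close to their targets
shuffled = np.roll(target, 1, axis=0)                # each row paired with the wrong target

loss_aligned = info_nce(aligned, target)
loss_shuffled = info_nce(shuffled, target)
```

The loss is low when each action embedding is closest to its own target and high when pairs are mismatched, which is the sense in which the objective pulls action tokens toward the foundation model's representation space.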

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper reframes action tokenization as semantic interface learning via asymmetric SRQ and specific pretraining objectives, with reported gains over FAST, but the abstract supplies almost no experimental detail to evaluate those claims.

The main thing to know is that this work treats action discretization not as compression but as a way to create a semantic bridge between vision-language reasoning and control. X-Tokenizer uses an encoder-SRQ-decoder setup in which the first residual level is trained with Masked Action Modeling to capture coarse motion intent as a discrete language, while deeper levels stay reconstruction-focused. The authors add contrastive alignment to a foundation model and next-frame vision-language feature prediction, pretrain the tokenizer on 2.4M trajectories, then freeze it and plug it into a mixed discrete-continuous VLA.

What is actually new is the asymmetric structure on residual vector quantization plus the combination of MAM, contrastive, and prediction objectives to make tokens carry multimodal semantics while preserving control details. The paper does a reasonable job of identifying the weakness in prior tokenizers like FAST and proposing a concrete fix that aims to work across embodiments.

The soft spots sit in the evidence. The abstract states clear performance numbers (+13.5% multimodal grounding, +8.25 long-horizon) without describing baselines, data splits, error bars, or how the mixed discrete-continuous VLA was trained. That absence makes it hard to judge whether the semantic interface is doing the work or whether other factors explain the difference. The central assumption that the first level forms intent while residuals handle geometry is plausible on paper but needs the ablations and controls that are missing here.

This is for researchers building VLA models who care about better action representations. A reader working on tokenization or multimodal alignment would get value from the architecture and objectives even if they treat the numbers as preliminary.

I would send it to peer review. The idea is specific enough and the problem is real enough that referees can check the methods and ask for the missing controls.

Referee Report

2 major / 1 minor

Summary. The paper introduces X-Tokenizer, a lightweight encoder-SRQ-decoder for action tokenization in VLA pretraining. SRQ applies asymmetric residual vector quantization: the first level uses Masked Action Modeling (MAM) to learn a discrete action language for coarse motion intent, while deeper levels focus on reconstruction. Additional pretraining via contrastive alignment to a foundation model and next-frame vision-language feature prediction aligns tokens semantically. Pretrained on 2.4M trajectories (2.0B action frames), a frozen X-Tokenizer is plugged into a mixed discrete-continuous VLA. The authors claim top real-world aggregate results and strong RoboTwin 2.0 performance, with gains over FAST of +13.5% in multimodal grounding and +8.25 in long-horizon tasks.

Significance. If the empirical claims hold with proper controls, the work would be significant for reframing action tokenizers as semantic interfaces rather than pure compressors, potentially enabling better multimodal grounding in VLA models across embodiments.

major comments (2)

[Abstract] Abstract: performance claims (top real-world aggregate, +13.5% multimodal grounding, +8.25 long-horizon) are stated without any description of experimental setup, baselines, metrics, error bars, data splits, or statistical tests, rendering the central empirical claim unverifiable from the provided text.
[Abstract] Abstract (SRQ and pretraining objectives): the claim that MAM on the first quantization level plus contrastive/next-frame objectives produces a 'discrete action language' capturing coarse intent (as opposed to reconstruction-only) is presented without ablations or controls showing this structure is load-bearing for the reported gains over FAST.
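
As an illustration of the kind of error bar the first major comment asks for, the sketch below computes a percentile bootstrap confidence interval for a success-rate gap between two methods. The per-trial outcomes are hypothetical and are not data from the paper; the function name `bootstrap_gap_ci`, the 50-trial sample size, and the choice of a percentile bootstrap are all assumptions.

```python
import numpy as np

def bootstrap_gap_ci(success_a, success_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(success_a) - mean(success_b)."""
    rng = np.random.default_rng(seed)
    a = np.asarray(success_a, dtype=float)
    b = np.asarray(success_b, dtype=float)
    gaps = np.empty(n_boot)
    for i in range(n_boot):
        # Resample each method's trials with replacement and record the gap.
        gaps[i] = rng.choice(a, size=a.size).mean() - rng.choice(b, size=b.size).mean()
    lo, hi = np.quantile(gaps, [alpha / 2, 1 - alpha / 2])
    return a.mean() - b.mean(), lo, hi

# Hypothetical per-trial outcomes (1 = success), NOT results from the paper.
method_a = [1] * 34 + [0] * 16
method_b = [1] * 27 + [0] * 23
gap, lo, hi = bootstrap_gap_ci(method_a, method_b)
```

An interval whose lower end sits near or below zero would mean the headline gap could be explained by trial-to-trial noise, which is why the report treats unqualified point gains as unverifiable.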

minor comments (1)

[Abstract] Abstract: define or cite the exact metrics underlying 'multimodal grounding' and 'long-horizon tasks' and clarify the FAST baseline implementation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on the abstract. We address each major point below and will revise the abstract accordingly to improve clarity and verifiability while preserving its conciseness.

Referee: [Abstract] Abstract: performance claims (top real-world aggregate, +13.5% multimodal grounding, +8.25 long-horizon) are stated without any description of experimental setup, baselines, metrics, error bars, data splits, or statistical tests, rendering the central empirical claim unverifiable from the provided text.

Authors: We agree that the abstract omits key experimental details due to length constraints. The full manuscript details the setups, baselines (including FAST), metrics, data (2.4M trajectories), and results with error bars in Sections 4 and 5. We will revise the abstract to briefly reference the evaluation protocol, main baselines, and metrics to make the claims more verifiable from the abstract alone.

revision: yes

Referee: [Abstract] Abstract (SRQ and pretraining objectives): the claim that MAM on the first quantization level plus contrastive/next-frame objectives produces a 'discrete action language' capturing coarse intent (as opposed to reconstruction-only) is presented without ablations or controls showing this structure is load-bearing for the reported gains over FAST.

Authors: The manuscript contains ablations (Section 4.3) isolating the asymmetric SRQ with MAM on the first level versus standard RVQ, as well as the contribution of contrastive and next-frame objectives, showing improved semantic alignment and gains over reconstruction-only baselines. These support the claim that the structure is load-bearing. We will revise the abstract to reference these ablations more explicitly.

revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

The paper describes an architectural choice (SRQ with asymmetric MAM on the first quantization level plus contrastive and next-frame objectives) and reports empirical benchmark results on real-world and RoboTwin tasks. No equations, fitted parameters renamed as predictions, or self-citation chains are present in the provided text that would reduce the central claim (action tokenizers as semantic interfaces) to its own inputs by construction. The performance gains are presented as measured outcomes rather than derived equivalences, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 3 invented entities

The central claim rests on the effectiveness of these newly introduced components and pretraining strategies, with no free parameters or standard axioms explicitly listed in the abstract.

invented entities (3)

X-Tokenizer · no independent evidence
purpose: Lightweight encoder-SRQ-decoder for shared action interface across robotic embodiments
note: New architecture introduced in the paper.

SRQ · no independent evidence
purpose: Semantic Residual Quantization with asymmetric structure for semantic and reconstruction levels
note: Key component defined in the abstract.

MAM · no independent evidence
purpose: Masked Action Modeling to train first level for coarse motion intent
note: New training objective mentioned.

pith-pipeline@v0.9.1-grok · 5849 in / 1365 out tokens · 49858 ms · 2026-06-30T11:20:38.782677+00:00 · methodology

Reference graph

Works this paper leans on

69 extracted references · 36 canonical work pages · 24 internal anchors

[1]

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model

Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, et al. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model. arXiv preprint arXiv:2510.10274, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al.𝜋0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44 (10-11):1684–1704, 2025

Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44 (10-11):1684–1704, 2025

2025
[5]

Controlvla: Few-shot object-centric adaptation for pre-trained vision-language-action models

Puhao Li, Yingying Wu, Ziheng Xi, Wanlin Li, Yuzhe Huang, Zhiyuan Zhang, Yinghan Chen, Jianan Wang, Song-Chun Zhu, Tengyu Liu, et al. Controlvla: Few-shot object-centric adaptation for pre-trained vision-language-action models. arXiv preprint arXiv:2506.16211, 2025

work page arXiv 2025
[6]

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

Motus: A Unified Latent Action World Model

Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang, Haitian Liu, Ruowen Zhao, Yao Feng, Chendong Xiang, Yinze Rong, et al. Motus: A unified latent action world model.arXiv preprint arXiv:2512.13030, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

Efficient robotic policy learning via latent space backward planning.arXiv preprint arXiv:2505.06861, 2025

Dongxiu Liu, Haoyi Niu, Zhihao Wang, Jinliang Zheng, Yinan Zheng, Zhonghong Ou, Jianming Hu, Jianxiong Li, and Xianyuan Zhan. Efficient robotic policy learning via latent space backward planning.arXiv preprint arXiv:2505.06861, 2025

work page arXiv 2025
[9]

LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels

LucasMaes, QuentinLeLidec, DamienScieur, YannLeCun, andRandallBalestriero. Leworldmodel: Stableend-to-end joint-embedding predictive architecture from pixels.arXiv preprint arXiv:2603.19312, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[10]

WorldVLA: Towards Autoregressive Action World Model

Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, et al. Worldvla: Towards autoregressive action world model.arXiv preprint arXiv:2506.21539, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

Shalfun Li, Victor Yao, Charles Yang, Truth Qu, Regis Cheng, Ryan Yu, Howard Lu, Newton Von, Vincent Chen, Yohann Tang, Maeve Zhang, Ellie Ma, Gody Li, Sage Yang, Lorien Shu, J. W. Gao, Ethan Chen, Colin Ye, Yu Sun, Elise Mon, PS Zhang, Neo Li, Lily Li, James Wang, Ping Yang, Chris Pan, Lucy Liang, Hang Su, Roy Gan, Hao Wang, and Qian Wang. Wall-wm: Carvi...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[12]

Causal World Modeling for Robot Control

Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, et al. Causal world modeling for robot control.arXiv preprint arXiv:2601.21998, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[13]

In9th Annual Conference on Robot Learning, 2025

Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Robert Equi, Chelsea Finn, Niccolo Fusai, Manuel Y Galliker, et al.𝜋0.5 : a vision-language-action model with open-world generalization. In9th Annual Conference on Robot Learning, 2025

2025
[14]

Physical Intelligence, Bo Ai, Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Greg Balke, Kevin Black, George Bokinsky, Shihao Cao, Thomas Charbonnier, Vedant Choudhary, Foster Collins, Ken Conley, Grace Connors, James Darpinian, Karan Dhabalia, Maitrayee Dhaka, Jared DiCarlo, Danny Driess, Michael Equi, Adnan Esmail, Yunhao Fang, Chelsea Finn, Catherine...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[15]

Wall-OSS-0.5 Technical Report

Ryan Yu, Pushi Zhang, Starrick Liu, Brae Liu, Miracle Kang, Shalfun Li, Lights Shi, Ellie Ma, Ping Yang, Chris Pan, et al. Wall-oss-0.5 technical report.arXiv preprint arXiv:2605.30877, 2026. 14

work page internal anchor Pith review Pith/arXiv arXiv 2026
[16]

Igniting vlms toward the embodied space.CoRR, abs/2509.11766,

Andy Zhai, Brae Liu, Bruno Fang, Chalse Cai, Ellie Ma, Ethan Yin, Hao Wang, Hugo Zhou, James Wang, Lights Shi, et al. Igniting vlms toward the embodied space.arXiv preprint arXiv:2509.11766, 2025

work page arXiv 2025
[17]

HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model

Jiaming Liu, Hao Chen, Pengju An, Zhuoyang Liu, Renrui Zhang, Chenyang Gu, Xiaoqi Li, Ziyu Guo, Sixiang Chen, Mengzhen Liu, et al. Hybridvla: Collaborative diffusion and autoregression in a unified vision-language-action model. arXiv preprint arXiv:2503.10631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[18]

Universal actions for enhanced embodied foundation models

Jinliang Zheng, Jianxiong Li, Dongxiu Liu, Yinan Zheng, Zhihao Wang, Zhonghong Ou, Yu Liu, Jingjing Liu, Ya-Qin Zhang, and Xianyuan Zhan. Universal actions for enhanced embodied foundation models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 22508–22519, 2025

2025
[19]

Galaxea open-world dataset and g0 dual-system vla model.arXiv preprint arXiv:2509.00576, 2025

Tao Jiang, Tianyuan Yuan, Yicheng Liu, Chenhao Lu, Jianning Cui, Xiao Liu, Shuiqi Cheng, Jiyang Gao, Huazhe Xu, and Hang Zhao. Galaxea open-world dataset and g0 dual-system vla model.arXiv preprint arXiv:2509.00576, 2025

work page arXiv 2025
[20]

A Pragmatic VLA Foundation Model

Wei Wu, Fan Lu, Yunnan Wang, Shuai Yang, Shi Liu, Fangjing Wang, Qian Zhu, He Sun, Yong Wang, Shuailei Ma, et al. A pragmatic vla foundation model.arXiv preprint arXiv:2601.18692, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[21]

FAST: Efficient Action Tokenization for Vision-Language-Action Models

Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision-language-action models.arXiv preprint arXiv:2501.09747, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[22]

J., Shafiullah, N

Seungjae Lee, Yibin Wang, Haritheja Etukuru, H Jin Kim, Nur Muhammad Mahi Shafiullah, and Lerrel Pinto. Behavior generation with latent actions.arXiv preprint arXiv:2403.03181, 2024

work page arXiv 2024
[23]

Vq-vla: Improving vision-language- action models via scaling vector-quantized action tokenizers

Yating Wang, Haoyi Zhu, Mingyu Liu, Jiange Yang, Hao-Shu Fang, and Tong He. Vq-vla: Improving vision-language- action models via scaling vector-quantized action tokenizers. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 11089–11099, 2025

2025
[24]

Faster: Toward efficient autoregressive vision language action modeling via neural action tokenization.arXiv preprint arXiv:2512.04952, 2025

Yicheng Liu, Shiduo Zhang, Zibin Dong, Baijun Ye, Tianyuan Yuan, Xiaopeng Yu, Linqi Yin, Chenhao Lu, Junhao Shi, Luca Jiang-Tao Yu, et al. Faster: Toward efficient autoregressive vision language action modeling via neural action tokenization.arXiv preprint arXiv:2512.04952, 2025

work page arXiv 2025
[25]

Actioncodec: What makes for good action tokenizers.arXiv preprint arXiv:2602.15397, 2026

Zibin Dong, Yicheng Liu, Shiduo Zhang, Baijun Ye, Yifu Yuan, Fei Ni, Jingjing Gong, Xipeng Qiu, Hang Zhao, Yinchuan Li, et al. Actioncodec: What makes for good action tokenizers.arXiv preprint arXiv:2602.15397, 2026

work page arXiv 2026
[26]

Decisionnce: Embodied multimodal representations via implicit preference learning.arXiv preprint arXiv:2402.18137, 2024

Jianxiong Li, Jinliang Zheng, Yinan Zheng, Liyuan Mao, Xiao Hu, Sijie Cheng, Haoyi Niu, Jihao Liu, Yu Liu, Jingjing Liu, et al. Decisionnce: Embodied multimodal representations via implicit preference learning.arXiv preprint arXiv:2402.18137, 2024

work page arXiv 2024
[27]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PmLR, 2021

2021
[28]

Momentum contrast for unsupervised visual representation learning

Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729–9738, 2020

2020
[29]

Autoregressive image generation using residual quantization

Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, and Wook-Shin Han. Autoregressive image generation using residual quantization. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11523–11532, 2022

2022
[30]

Bert: Pre-training of deep bidirectional transformers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. InProceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pages 4171–4186, 2019

2019
[31]

Rt-2: Vision-language-action models transfer web knowledge to robotic control

Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. InConference on Robot Learning, pages 2165–2183. PMLR, 2023

2023
[32]

OpenVLA: An Open-Source Vision-Language-Action Model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[33]

Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware.arXiv preprint arXiv:2304.13705, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[34]

Oat: Ordered action tokenization

Chaoqi Liu, Xiaoshen Han, Jiawei Gao, Yue Zhao, Haonan Chen, and Yilun Du. Oat: Ordered action tokenization. In Proceedings of Robotics: Science and Systems, 2026

2026
[35]

Clap: Contrastive latent action pretraining for learning vision-language-action models from human videos.arXiv preprint arXiv:2601.04061, 2026

Chubin Zhang, Jianan Wang, Zifeng Gao, Yue Su, Tianru Dai, Cai Zhou, Jiwen Lu, and Yansong Tang. Clap: Contrastive latent action pretraining for learning vision-language-action models from human videos.arXiv preprint arXiv:2601.04061, 2026. 15

work page arXiv 2026
[36]

UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling

Boyu Chen, Yi Chen, Lu Qiu, Jerry Bai, Yuying Ge, and Yixiao Ge. UniT: Toward a unified physical language for human-to-humanoid policy learning and world modeling.arXiv preprint arXiv:2604.19734, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[37]

Demystifying Action Space Design for Robotic Manipulation Policies

Yuchun Feng, Jinliang Zheng, Zhihao Wang, Dongxiu Liu, Jianxiong Li, Jiangmiao Pang, Tai Wang, and Xianyuan Zhan. Demystifying action space design for robotic manipulation policies.arXiv preprint arXiv:2602.23408, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[38]

Unleashing the Potential of Diffusion Models for End-to-End Autonomous Driving

Yinan Zheng, Tianyi Tan, Bin Huang, Enguang Liu, Ruiming Liang, Jianlin Zhang, Jianwei Cui, Guang Chen, Kun Ma, Hangjun Ye, et al. Unleashing the potential of diffusion models for end-to-end autonomous driving.arXiv preprint arXiv:2602.22801, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[39]

Perceiver: General perception with iterative attention

Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, and Joao Carreira. Perceiver: General perception with iterative attention. InInternational Conference on Machine Learning, pages 4651–4664. PMLR, 2021

2021
[40]

Perceiver IO: A general architecture for structured inputs and outputs

Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, et al. Perceiver IO: A general architecture for structured inputs and outputs. InInternational Conference on Learning Representations, 2022

2022
[41]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report, 2...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[42]

Zimmermann, and Wieland Brendel

Evgenia Rusak, Patrik Reizinger, Attila Juhos, Oliver Bringmann, Roland S. Zimmermann, and Wieland Brendel. Infonce: Identifying the gap between theory and practice, 2025. URLhttps://arxiv.org/abs/2407.00143

work page arXiv 2025
[43]

RDT2: Exploring the scaling limit of UMI data towards zero-shot cross-embodiment generalization,

Songming Liu, Bangguo Li, Kai Ma, Lingxuan Wu, Hengkai Tan, Xiao Ouyang, Hang Su, and Jun Zhu. Rdt2: Exploring the scaling limit of umi data towards zero-shot cross-embodiment generalization.arXiv preprint arXiv:2602.03310, 2026

work page arXiv 2026
[44]

UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction

Leland McInnes, John Healy, and James Melville. Umap: Uniform manifold approximation and projection for dimension reduction.arXiv preprint arXiv:1802.03426, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[45]

RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yiheng Ge, Zhenyu Gu, et al. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation.arXiv preprint arXiv:2506.18088, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[46]

Vector quantization.IEEE Assp Magazine, 1(2):4–29, 1984

Robert Gray. Vector quantization.IEEE Assp Magazine, 1(2):4–29, 1984

1984
[47]

BERT: Pre-training of deep bidirectional transformers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186. Association ...

2019
[48]

Contextual joint factor acoustic embeddings

Yanpei Shi and Thomas Hain. Contextual joint factor acoustic embeddings. In2021 IEEE Spoken Language Technology Workshop (SLT), pages 750–757. IEEE, 2021

2021
[49]

MaskGIT:Maskedgenerativeimagetransformer

HuiwenChang,HanZhang,LuJiang,CeLiu,andWilliamT.Freeman. MaskGIT:Maskedgenerativeimagetransformer. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11315–11325, 2022

2022
[50]

Thorsten Hempel, Ahmed A Abdelrahman, and Ayoub Al-Hamadi. 6D rotation representation for unconstrained head pose estimation. In 2022 IEEE International Conference on Image Processing (ICIP), pages 2496–2500. IEEE, 2022
[51]

Wei Mao, Miaomiao Liu, Mathieu Salzmann, and Hongdong Li. Learning trajectory dependencies for human motion prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9489–9497, 2019
[52]

Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Xindong He, Xu Huang, et al. AgiBot World Colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems. In 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2025
[53]

AgiBot World Team. AgiBot World 2026. https://huggingface.co/datasets/agibot-world/AgiBotWorld2026, 2026
[54]

Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, Peter David Fagan, Joey Hejna, Masha Itkina, Marion Lepert, Yecheng Jason Ma, Patrick Tree Miller, Jimmy Wu, Suneel Belkhale, Shivin Dass, Huy Ha, Arhan Jain, Abraham Lee, You...

[55]

Kun Wu, Chengkai Hou, Jiaming Liu, Zhengping Che, Xiaozhu Ju, Zhuqin Yang, Meng Li, Yinuo Zhao, Zhiyuan Xu, Guang Yang, et al. RoboMIND: Benchmark on multi-embodiment intelligence normative data for robot manipulation. arXiv preprint arXiv:2412.13877, 2024
[56]

Chengkai Hou, Kun Wu, Jiaming Liu, Zhengping Che, Di Wu, Fei Liao, Guangrun Li, Jingyang He, Qiuxuan Feng, Zhao Jin, et al. RoboMIND 2.0: A multimodal, bimanual mobile manipulation dataset for generalizable embodied intelligence. arXiv preprint arXiv:2512.24653, 2025
[57]

Shihan Wu, Xuecheng Liu, Shaoxuan Xie, Pengwei Wang, Xinghang Li, Bowen Yang, Zhe Li, Kai Zhu, Hongyu Wu, Yiheng Liu, et al. RoboCOIN: An open-sourced bimanual robotic data collection for integrated manipulation. arXiv preprint arXiv:2511.17441, 2025
[58]

RoboChallenge.ai. RoboChallenge Table30 v2 Dataset. https://huggingface.co/datasets/RoboChallenge/Table30v2, 2025. Accessed: 2026-05-07
[59]

GenRobot AI. 10Kh RealOmni-Open DataSet. https://www.genrobot.ai/data/open-dataset, 2025. Accessed: 2026-05-07
[60]

Homer Rich Walke, Kevin Black, Tony Z Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, Andre Wang He, Vivek Myers, Moo Jin Kim, Max Du, et al. BridgeData V2: A dataset for robot learning at scale. In Conference on Robot Learning, pages 1723–1736. PMLR, 2023
[61]

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. RT-1: Robotics transformer for real-world control at scale. In Robotics: Science and Systems (RSS), 2023
[62]

Eric Jang, Alex Irpan, Mohi Khansari, Daniel Kappler, Frederik Ebert, Corey Lynch, Sergey Levine, and Chelsea Finn. BC-Z: Zero-shot task generalization with robotic imitation learning. In Conference on Robot Learning, pages 991–1002. PMLR, 2022
[63]

Minho Heo, Youngwoon Lee, Doohyun Lee, and Joseph J Lim. FurnitureBench: Reproducible real-world benchmark for long-horizon complex manipulation. The International Journal of Robotics Research, 44(10-11):1863–1891, 2025
[64]

Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open X-Embodiment: Robotic learning datasets and RT-X models: Open X-Embodiment Collaboration 0. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024. ...
1. Input projection. A linear projection followed by LayerNorm, GELU, and dropout maps $x_{1:T}$ from $\mathbb{R}^D$ to $\mathbb{R}^H$.

2. Embodiment conditioning. An encoder-side embedding vector $m \in \mathbb{R}^H$, looked up from a learnable registry of 1024 slots (one of which is a special learnable “none” slot used under CFG-style dropout), is added to the sequence, broadcast over time.

3. RoPE positional encoding. Rotary position embeddings are applied on the time dimension with base $10^4$.

4. Self-attention stack. A 12-layer Transformer encoder (8 heads, GELU FFN of width $4H$, dropout 0.1) processes the projected sequence with the chunk’s padding mask.

5. Optional state cross-attention. When $o$ is provided (i.e., not CFG-dropped), a single cross-attention block uses the linearly projected $o$ as key and value while the time series acts as query, followed by residual + LayerNorm.

6. Latent query cross-attention. $M_{\max} = 16$ learnable latent queries $q_{1:M}$ are equipped with their own RoPE encoding, expanded across the batch, and cross-attend to the encoded sequence to extract a length-$M$ summary.

7. Position-wise FFN. A final FFN with residual + LayerNorm.

Decoder. The decoder Dec ingests the quantized latent $\tilde{z}_{1:M}$ together with $o$ and $m$, and outp...
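The rotary position encoding named in the encoder steps above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function name `rope_rotate`, the list-of-lists input layout, and the choice to pair adjacent feature dimensions (2i, 2i+1) are all assumptions made for illustration. Only the base of $10^4$ and the use of the time index as the position come from the text.

```python
import math


def rope_rotate(x, base=10_000.0):
    """Apply rotary position embeddings along the time axis.

    x: list of T feature vectors, each a list of even length H.
    Assumption: adjacent dims (2i, 2i+1) form one rotated pair.
    """
    hidden = len(x[0])
    if hidden % 2 != 0:
        raise ValueError("feature dimension must be even")
    out = []
    for t, vec in enumerate(x):
        row = [0.0] * hidden
        for i in range(hidden // 2):
            # Pair i rotates by angle t * base^(-2i/H).
            theta = t * base ** (-2.0 * i / hidden)
            c, s = math.cos(theta), math.sin(theta)
            a, b = vec[2 * i], vec[2 * i + 1]
            row[2 * i] = a * c - b * s
            row[2 * i + 1] = a * s + b * c
        out.append(row)
    return out


def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical toy inputs: the same query/key vector at every time step.
queries = rope_rotate([[1.0, 0.0, 0.5, 2.0]] * 4)
keys = rope_rotate([[0.3, -1.0, 2.0, 0.1]] * 4)
```

Because each pair is rotated by an angle proportional to its time index, the inner product between a rotated query and a rotated key depends only on their time offset, which is the property that makes the encoding suitable for attention over action chunks of varying length.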

