arxiv: 2606.14752 · v2 · pith:IYEC77GQnew · submitted 2026-06-07 · 💻 cs.CV · cs.AI · cs.LG · cs.RO

X-Tokenizer: A Multimodal Action Tokenizer for Vision-Language-Action Pretraining

Pith reviewed 2026-06-30 11:20 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG · cs.RO
keywords action tokenizer · vision-language-action · semantic residual quantization · multimodal pretraining · robot control · discrete action language · VLA models · masked action modeling

The pith

Action tokenization can serve as a semantic interface between vision-language reasoning and robot control rather than mere motion compression.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that standard action tokenizers prioritize reconstruction of motion geometry at the expense of semantic supervision for the vision-language backbone. It proposes X-Tokenizer, a lightweight encoder-decoder that uses Semantic Residual Quantization to create a first-level discrete action language trained via masked modeling, with residual levels preserving fine details. Additional pretraining aligns these tokens to a foundation model's representation space through contrastive loss and next-frame vision-language prediction. When a frozen X-Tokenizer is inserted into mixed discrete-continuous VLAs and pretrained on 2.4 million trajectories, it delivers stronger real-world aggregate scores and simulation results than prior tokenizers. This reframes action discretization as a bridge that transfers multimodal knowledge into executable control across different robot arms.

Core claim

X-Tokenizer is a lightweight encoder-SRQ-decoder architecture that supplies a shared action interface across robotic embodiments. Its Semantic Residual Quantization imposes an asymmetric structure on residual vector quantization: the first level is trained with Masked Action Modeling to form a discrete action language capturing coarse motion intent, while deeper levels act as reconstruction-oriented residuals. The full model is further pretrained with contrastive alignment to a pretrained foundation model's representation space and with next-frame vision-language feature prediction. A single frozen X-Tokenizer then supplies representation-shaping supervision inside a mixed discrete-continuous VLA.

What carries the argument

Semantic Residual Quantization (SRQ), an asymmetric residual vector quantization in which the first level is trained via Masked Action Modeling to produce discrete tokens that capture coarse motion intent while deeper levels preserve fine-grained reconstruction details.
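The residual structure that SRQ builds on can be sketched in a few lines of NumPy. This is a minimal illustration of plain residual vector quantization, not the paper's implementation: the codebooks are random stand-ins, the sizes `D`, `K`, and `LEVELS` are hypothetical, and the Masked Action Modeling objective that makes the first level "semantic" is not shown.

```python
import numpy as np

def residual_quantize(z, codebooks):
    """Quantize z level by level; each level encodes what the previous levels left over."""
    residual = z.copy()
    recon = np.zeros_like(z)
    codes = []
    for cb in codebooks:                          # cb has shape (K, D)
        dists = np.sum((cb - residual) ** 2, axis=1)
        idx = int(np.argmin(dists))               # nearest codeword to the current residual
        codes.append(idx)
        recon = recon + cb[idx]
        residual = residual - cb[idx]
    return codes, recon

rng = np.random.default_rng(0)
D, K, LEVELS = 8, 32, 4                           # hypothetical sizes
# Random stand-in codebooks; a trained tokenizer would learn these.
codebooks = [rng.normal(scale=0.5 ** level, size=(K, D)) for level in range(LEVELS)]

z = rng.normal(size=D)                            # one latent action vector
codes, recon = residual_quantize(z, codebooks)

coarse = codebooks[0][codes[0]]                   # first-level token alone: the coarse part
print("codes:", codes)
print("coarse-only error:", np.linalg.norm(z - coarse))
print("all-level error:  ", np.linalg.norm(z - recon))
```

In this picture, `codes[0]` plays the role of the discrete action token described above, and the remaining entries are the reconstruction-oriented residual codes.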

If this is right

  • A frozen X-Tokenizer can be plugged into existing mixed discrete-continuous VLAs as a representation-shaping signal without retraining the tokenizer itself.
  • The resulting tokens outperform FAST by 13.5 percent in multimodal grounding and by 8.25 points in long-horizon tasks.
  • The same tokenizer supplies a shared interface across diverse robotic arm embodiments after pretraining on 2.4 million trajectories.
  • Action tokenizers can be viewed as semantic interfaces that transfer knowledge from pretrained vision-language models into precise robot control.
  • Deeper residual levels remain reconstruction-oriented while the first level forms the discrete action language.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same SRQ structure could be tested on non-arm embodiments such as mobile bases or dexterous hands to check whether the coarse-intent layer generalizes.
  • Scaling the pretraining corpus beyond 2.4 million trajectories might reveal whether the semantic alignment continues to improve or saturates.
  • The learned discrete action language could be inspected directly for human-interpretable motion primitives.
  • Inserting X-Tokenizer supervision into purely discrete VLAs might reduce the need for continuous action heads in some tasks.

Load-bearing premise

The assumption that an asymmetric first level trained with masked action modeling plus contrastive and next-frame alignment will produce tokens that carry semantic multimodal intent rather than only geometric motion details.
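To make the masked-modeling part of this premise concrete, the sketch below shows a BERT-style corruption step over a sequence of first-level token ids. It is an assumption-laden illustration: the sentinel `MASK_ID`, the masking ratio, and the function name `mask_tokens` are hypothetical, and the summary does not state how the paper actually samples or replaces masked positions.

```python
import numpy as np

MASK_ID = -1  # hypothetical sentinel for a masked position

def mask_tokens(tokens, mask_ratio, rng):
    """Hide a random subset of token ids and return what a model would have to predict."""
    tokens = np.asarray(tokens)
    n_mask = max(1, int(round(mask_ratio * tokens.size)))
    positions = np.sort(rng.choice(tokens.size, size=n_mask, replace=False))
    corrupted = tokens.copy()
    corrupted[positions] = MASK_ID
    targets = tokens[positions]                   # original ids at the masked positions
    return corrupted, positions, targets

rng = np.random.default_rng(0)
tokens = rng.integers(0, 32, size=16)             # stand-in first-level token ids
corrupted, positions, targets = mask_tokens(tokens, mask_ratio=0.5, rng=rng)
print("input :", corrupted)
print("labels:", dict(zip(positions.tolist(), targets.tolist())))
```

A predictor trained on such pairs has to infer hidden tokens from their neighbours, which is the sense in which the first level is pushed toward context-dependent structure rather than pure reconstruction.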

What would settle it

An ablation that removes either the masked action modeling objective or the contrastive alignment step and measures whether multimodal grounding and long-horizon task scores fall back to or below the level achieved by a standard residual vector quantization tokenizer or by the FAST baseline.

Original abstract

Modern Vision-Language-Action (VLA) models must bridge pretrained vision-language reasoning and precise continuous robot control. Existing action tokenizers discretize actions primarily for reconstruction, producing codes that preserve motion geometry but provide only weak semantic supervision to the backbone. We therefore formulate action tokenization not as mere compression, but as semantic interface learning between multimodal reasoning and executable control. To this end, we introduce X-Tokenizer, a lightweight encoder-Semantic Residual Quantization (SRQ)-decoder architecture that provides a shared action interface across diverse robotic arm embodiments. Its key component, SRQ, imposes an asymmetric structure on residual vector quantization: the first level is trained with Masked Action Modeling (MAM) to form a discrete action language that captures coarse motion intent, while deeper levels remain reconstruction-oriented residuals that preserve fine-grained details. To further align action tokens with multimodal semantics, X-Tokenizer is pretrained with contrastive alignment to the representation space of a pretrained foundation model and with next-frame vision-language feature prediction. Pretrained on 2.4M trajectories (2.0B action frames), a single frozen X-Tokenizer plugs into a mixed discrete-continuous VLA as a representation-shaping supervision signal. X-Tokenizer achieves top real-world aggregate and strong RoboTwin 2.0 simulation results. Outperforming FAST in multimodal grounding (+13.5%) and long-horizon tasks (+8.25), it shows that action tokenizers serve as semantic interfaces for VLA pretraining beyond mere action compression.
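The contrastive alignment mentioned in the abstract can be illustrated with an InfoNCE-style loss between action embeddings and foundation-model embeddings. The abstract does not specify the exact loss, so the InfoNCE form, the temperature value, and the function name `info_nce` below are assumptions made for illustration only.

```python
import numpy as np

def info_nce(action_emb, target_emb, temperature=0.07):
    """Mean InfoNCE loss; row i of action_emb is the positive match for row i of target_emb."""
    a = action_emb / np.linalg.norm(action_emb, axis=1, keepdims=True)
    t = target_emb / np.linalg.norm(target_emb, axis=1, keepdims=True)
    logits = a @ t.T / temperature                      # (B, B) cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True) # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
target = rng.normal(size=(16, 32))                      # stand-in foundation-model features
aligned = target + 0.01 * rng.normal(size=(16, 32))     # action features close to their targets
unrelated = rng.normal(size=(16, 32))                   # action features with no relation

loss_aligned = info_nce(aligned, target)
loss_unrelated = info_nce(unrelated, target)
print(loss_aligned, loss_unrelated)
```

Minimizing such a loss pulls each action embedding toward its paired multimodal feature and away from the other samples in the batch.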

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces X-Tokenizer, a lightweight encoder-SRQ-decoder for action tokenization in VLA pretraining. SRQ applies asymmetric residual vector quantization: the first level uses Masked Action Modeling (MAM) to learn a discrete action language for coarse motion intent, while deeper levels focus on reconstruction. Additional pretraining via contrastive alignment to a foundation model and next-frame vision-language prediction aligns tokens semantically. Pretrained on 2.4M trajectories (2.0B action frames), a frozen X-Tokenizer is plugged into mixed discrete-continuous VLAs, claiming top real-world aggregate results and strong RoboTwin 2.0 performance, outperforming FAST by +13.5% in multimodal grounding and +8.25 in long-horizon tasks.

Significance. If the empirical claims hold with proper controls, the work would be significant for reframing action tokenizers as semantic interfaces rather than pure compressors, potentially enabling better multimodal grounding in VLA models across embodiments.

major comments (2)
  1. [Abstract] Abstract: performance claims (top real-world aggregate, +13.5% multimodal grounding, +8.25 long-horizon) are stated without any description of experimental setup, baselines, metrics, error bars, data splits, or statistical tests, rendering the central empirical claim unverifiable from the provided text.
  2. [Abstract] Abstract (SRQ and pretraining objectives): the claim that MAM on the first quantization level plus contrastive/next-frame objectives produces a 'discrete action language' capturing coarse intent (as opposed to reconstruction-only) is presented without ablations or controls showing this structure is load-bearing for the reported gains over FAST.
minor comments (1)
  1. [Abstract] Abstract: define or cite the exact metrics underlying 'multimodal grounding' and 'long-horizon tasks' and clarify the FAST baseline implementation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on the abstract. We address each major point below and will revise the abstract accordingly to improve clarity and verifiability while preserving its conciseness.

Point-by-point responses
  1. Referee: [Abstract] Abstract: performance claims (top real-world aggregate, +13.5% multimodal grounding, +8.25 long-horizon) are stated without any description of experimental setup, baselines, metrics, error bars, data splits, or statistical tests, rendering the central empirical claim unverifiable from the provided text.

    Authors: We agree that the abstract omits key experimental details due to length constraints. The full manuscript details the setups, baselines (including FAST), metrics, data (2.4M trajectories), and results with error bars in Sections 4 and 5. We will revise the abstract to briefly reference the evaluation protocol, main baselines, and metrics to make the claims more verifiable from the abstract alone. revision: yes

  2. Referee: [Abstract] Abstract (SRQ and pretraining objectives): the claim that MAM on the first quantization level plus contrastive/next-frame objectives produces a 'discrete action language' capturing coarse intent (as opposed to reconstruction-only) is presented without ablations or controls showing this structure is load-bearing for the reported gains over FAST.

    Authors: The manuscript contains ablations (Section 4.3) isolating the asymmetric SRQ with MAM on the first level versus standard RVQ, as well as the contribution of contrastive and next-frame objectives, showing improved semantic alignment and gains over reconstruction-only baselines. These support the claim that the structure is load-bearing. We will revise the abstract to reference these ablations more explicitly. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

The paper describes an architectural choice (SRQ with asymmetric MAM on the first quantization level plus contrastive and next-frame objectives) and reports empirical benchmark results on real-world and RoboTwin tasks. No equations, fitted parameters renamed as predictions, or self-citation chains are present in the provided text that would reduce the central claim (action tokenizers as semantic interfaces) to its own inputs by construction. The performance gains are presented as measured outcomes rather than derived equivalences, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 3 invented entities

The central claim rests on the effectiveness of these newly introduced components and pretraining strategies, with no free parameters or standard axioms explicitly listed in the abstract.

invented entities (3)
  • X-Tokenizer no independent evidence
    purpose: Lightweight encoder-SRQ-decoder for shared action interface across robotic embodiments
    New architecture introduced in the paper.
  • SRQ no independent evidence
    purpose: Semantic Residual Quantization with asymmetric structure for semantic and reconstruction levels
    Key component defined in the abstract.
  • MAM no independent evidence
    purpose: Masked Action Modeling to train first level for coarse motion intent
    New training objective mentioned.

pith-pipeline@v0.9.1-grok · 5849 in / 1365 out tokens · 49858 ms · 2026-06-30T11:20:38.782677+00:00 · methodology

Reference graph

Works this paper leans on

69 extracted references · 36 canonical work pages · 24 internal anchors

  1. [1]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025

  2. [2]

    X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model

    Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, et al. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model. arXiv preprint arXiv:2510.10274, 2025

  3. [3]

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

  4. [4]

    Diffusion policy: Visuomotor policy learning via action diffusion

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

  5. [5]

    Controlvla: Few-shot object-centric adaptation for pre-trained vision-language-action models

    Puhao Li, Yingying Wu, Ziheng Xi, Wanlin Li, Yuzhe Huang, Zhiyuan Zhang, Yinghan Chen, Jianan Wang, Song-Chun Zhu, Tengyu Liu, et al. Controlvla: Few-shot object-centric adaptation for pre-trained vision-language-action models. arXiv preprint arXiv:2506.16211, 2025

  6. [6]

    V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985, 2025

  7. [7]

    Motus: A Unified Latent Action World Model

    Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang, Haitian Liu, Ruowen Zhao, Yao Feng, Chendong Xiang, Yinze Rong, et al. Motus: A unified latent action world model. arXiv preprint arXiv:2512.13030, 2025

  8. [8]

    Efficient robotic policy learning via latent space backward planning

    Dongxiu Liu, Haoyi Niu, Zhihao Wang, Jinliang Zheng, Yinan Zheng, Zhonghong Ou, Jianming Hu, Jianxiong Li, and Xianyuan Zhan. Efficient robotic policy learning via latent space backward planning. arXiv preprint arXiv:2505.06861, 2025

  9. [9]

    LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels

    Lucas Maes, Quentin Le Lidec, Damien Scieur, Yann LeCun, and Randall Balestriero. Leworldmodel: Stable end-to-end joint-embedding predictive architecture from pixels. arXiv preprint arXiv:2603.19312, 2026

  10. [10]

    WorldVLA: Towards Autoregressive Action World Model

    Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, et al. Worldvla: Towards autoregressive action world model. arXiv preprint arXiv:2506.21539, 2025

  11. [11]

    Shalfun Li, Victor Yao, Charles Yang, Truth Qu, Regis Cheng, Ryan Yu, Howard Lu, Newton Von, Vincent Chen, Yohann Tang, Maeve Zhang, Ellie Ma, Gody Li, Sage Yang, Lorien Shu, J. W. Gao, Ethan Chen, Colin Ye, Yu Sun, Elise Mon, PS Zhang, Neo Li, Lily Li, James Wang, Ping Yang, Chris Pan, Lucy Liang, Hang Su, Roy Gan, Hao Wang, and Qian Wang. Wall-wm: Carvi...

  12. [12]

    Causal World Modeling for Robot Control

    Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, et al. Causal world modeling for robot control. arXiv preprint arXiv:2601.21998, 2026

  13. [13]

    π0.5: A vision-language-action model with open-world generalization

    Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Robert Equi, Chelsea Finn, Niccolo Fusai, Manuel Y Galliker, et al. π0.5: A vision-language-action model with open-world generalization. In 9th Annual Conference on Robot Learning, 2025

  14. [14]

    Physical Intelligence, Bo Ai, Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Greg Balke, Kevin Black, George Bokinsky, Shihao Cao, Thomas Charbonnier, Vedant Choudhary, Foster Collins, Ken Conley, Grace Connors, James Darpinian, Karan Dhabalia, Maitrayee Dhaka, Jared DiCarlo, Danny Driess, Michael Equi, Adnan Esmail, Yunhao Fang, Chelsea Finn, Catherine...

  15. [15]

    Wall-OSS-0.5 Technical Report

    Ryan Yu, Pushi Zhang, Starrick Liu, Brae Liu, Miracle Kang, Shalfun Li, Lights Shi, Ellie Ma, Ping Yang, Chris Pan, et al. Wall-oss-0.5 technical report. arXiv preprint arXiv:2605.30877, 2026

  16. [16]

    Igniting vlms toward the embodied space

    Andy Zhai, Brae Liu, Bruno Fang, Chalse Cai, Ellie Ma, Ethan Yin, Hao Wang, Hugo Zhou, James Wang, Lights Shi, et al. Igniting vlms toward the embodied space. arXiv preprint arXiv:2509.11766, 2025

  17. [17]

    HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model

    Jiaming Liu, Hao Chen, Pengju An, Zhuoyang Liu, Renrui Zhang, Chenyang Gu, Xiaoqi Li, Ziyu Guo, Sixiang Chen, Mengzhen Liu, et al. Hybridvla: Collaborative diffusion and autoregression in a unified vision-language-action model. arXiv preprint arXiv:2503.10631, 2025

  18. [18]

    Universal actions for enhanced embodied foundation models

    Jinliang Zheng, Jianxiong Li, Dongxiu Liu, Yinan Zheng, Zhihao Wang, Zhonghong Ou, Yu Liu, Jingjing Liu, Ya-Qin Zhang, and Xianyuan Zhan. Universal actions for enhanced embodied foundation models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 22508–22519, 2025

  19. [19]

    Galaxea open-world dataset and g0 dual-system vla model

    Tao Jiang, Tianyuan Yuan, Yicheng Liu, Chenhao Lu, Jianning Cui, Xiao Liu, Shuiqi Cheng, Jiyang Gao, Huazhe Xu, and Hang Zhao. Galaxea open-world dataset and g0 dual-system vla model. arXiv preprint arXiv:2509.00576, 2025

  20. [20]

    A Pragmatic VLA Foundation Model

    Wei Wu, Fan Lu, Yunnan Wang, Shuai Yang, Shi Liu, Fangjing Wang, Qian Zhu, He Sun, Yong Wang, Shuailei Ma, et al. A pragmatic vla foundation model. arXiv preprint arXiv:2601.18692, 2026

  21. [21]

    FAST: Efficient Action Tokenization for Vision-Language-Action Models

    Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747, 2025

  22. [22]

    Behavior generation with latent actions

    Seungjae Lee, Yibin Wang, Haritheja Etukuru, H Jin Kim, Nur Muhammad Mahi Shafiullah, and Lerrel Pinto. Behavior generation with latent actions. arXiv preprint arXiv:2403.03181, 2024

  23. [23]

    Vq-vla: Improving vision-language-action models via scaling vector-quantized action tokenizers

    Yating Wang, Haoyi Zhu, Mingyu Liu, Jiange Yang, Hao-Shu Fang, and Tong He. Vq-vla: Improving vision-language-action models via scaling vector-quantized action tokenizers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11089–11099, 2025

  24. [24]

    Faster: Toward efficient autoregressive vision language action modeling via neural action tokenization

    Yicheng Liu, Shiduo Zhang, Zibin Dong, Baijun Ye, Tianyuan Yuan, Xiaopeng Yu, Linqi Yin, Chenhao Lu, Junhao Shi, Luca Jiang-Tao Yu, et al. Faster: Toward efficient autoregressive vision language action modeling via neural action tokenization. arXiv preprint arXiv:2512.04952, 2025

  25. [25]

    Actioncodec: What makes for good action tokenizers

    Zibin Dong, Yicheng Liu, Shiduo Zhang, Baijun Ye, Yifu Yuan, Fei Ni, Jingjing Gong, Xipeng Qiu, Hang Zhao, Yinchuan Li, et al. Actioncodec: What makes for good action tokenizers. arXiv preprint arXiv:2602.15397, 2026

  26. [26]

    Decisionnce: Embodied multimodal representations via implicit preference learning

    Jianxiong Li, Jinliang Zheng, Yinan Zheng, Liyuan Mao, Xiao Hu, Sijie Cheng, Haoyi Niu, Jihao Liu, Yu Liu, Jingjing Liu, et al. Decisionnce: Embodied multimodal representations via implicit preference learning. arXiv preprint arXiv:2402.18137, 2024

  27. [27]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021

  28. [28]

    Momentum contrast for unsupervised visual representation learning

    Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738, 2020

  29. [29]

    Autoregressive image generation using residual quantization

    Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, and Wook-Shin Han. Autoregressive image generation using residual quantization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11523–11532, 2022

  30. [30]

    Bert: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019

  31. [31]

    Rt-2: Vision-language-action models transfer web knowledge to robotic control

    Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023

  32. [32]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024

  33. [33]

    Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

    Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023

  34. [34]

    Oat: Ordered action tokenization

    Chaoqi Liu, Xiaoshen Han, Jiawei Gao, Yue Zhao, Haonan Chen, and Yilun Du. Oat: Ordered action tokenization. In Proceedings of Robotics: Science and Systems, 2026

  35. [35]

    Clap: Contrastive latent action pretraining for learning vision-language-action models from human videos

    Chubin Zhang, Jianan Wang, Zifeng Gao, Yue Su, Tianru Dai, Cai Zhou, Jiwen Lu, and Yansong Tang. Clap: Contrastive latent action pretraining for learning vision-language-action models from human videos. arXiv preprint arXiv:2601.04061, 2026

  36. [36]

    UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling

    Boyu Chen, Yi Chen, Lu Qiu, Jerry Bai, Yuying Ge, and Yixiao Ge. UniT: Toward a unified physical language for human-to-humanoid policy learning and world modeling. arXiv preprint arXiv:2604.19734, 2026

  37. [37]

    Demystifying Action Space Design for Robotic Manipulation Policies

    Yuchun Feng, Jinliang Zheng, Zhihao Wang, Dongxiu Liu, Jianxiong Li, Jiangmiao Pang, Tai Wang, and Xianyuan Zhan. Demystifying action space design for robotic manipulation policies. arXiv preprint arXiv:2602.23408, 2026

  38. [38]

    Unleashing the Potential of Diffusion Models for End-to-End Autonomous Driving

    Yinan Zheng, Tianyi Tan, Bin Huang, Enguang Liu, Ruiming Liang, Jianlin Zhang, Jianwei Cui, Guang Chen, Kun Ma, Hangjun Ye, et al. Unleashing the potential of diffusion models for end-to-end autonomous driving. arXiv preprint arXiv:2602.22801, 2026

  39. [39]

    Perceiver: General perception with iterative attention

    Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, and Joao Carreira. Perceiver: General perception with iterative attention. In International Conference on Machine Learning, pages 4651–4664. PMLR, 2021

  40. [40]

    Perceiver IO: A general architecture for structured inputs and outputs

    Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, et al. Perceiver IO: A general architecture for structured inputs and outputs. In International Conference on Learning Representations, 2022

  41. [41]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report, 2...

  42. [42]

    Infonce: Identifying the gap between theory and practice

    Evgenia Rusak, Patrik Reizinger, Attila Juhos, Oliver Bringmann, Roland S. Zimmermann, and Wieland Brendel. Infonce: Identifying the gap between theory and practice, 2025. URL https://arxiv.org/abs/2407.00143

  43. [43]

    RDT2: Exploring the scaling limit of UMI data towards zero-shot cross-embodiment generalization

    Songming Liu, Bangguo Li, Kai Ma, Lingxuan Wu, Hengkai Tan, Xiao Ouyang, Hang Su, and Jun Zhu. Rdt2: Exploring the scaling limit of umi data towards zero-shot cross-embodiment generalization. arXiv preprint arXiv:2602.03310, 2026

  44. [44]

    UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction

    Leland McInnes, John Healy, and James Melville. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018

  45. [45]

    RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

    Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yiheng Ge, Zhenyu Gu, et al. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation. arXiv preprint arXiv:2506.18088, 2025

  46. [46]

    Vector quantization

    Robert Gray. Vector quantization. IEEE ASSP Magazine, 1(2):4–29, 1984

  47. [47]

    BERT: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186. Association ...

  48. [48]

    Contextual joint factor acoustic embeddings

    Yanpei Shi and Thomas Hain. Contextual joint factor acoustic embeddings. In 2021 IEEE Spoken Language Technology Workshop (SLT), pages 750–757. IEEE, 2021

  49. [49]

    MaskGIT: Masked generative image transformer

    Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T. Freeman. MaskGIT: Masked generative image transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11315–11325, 2022

  50. [50]

    6d rotation representation for unconstrained head pose estimation

    Thorsten Hempel, Ahmed A Abdelrahman, and Ayoub Al-Hamadi. 6d rotation representation for unconstrained head pose estimation. In 2022 IEEE International Conference on Image Processing (ICIP), pages 2496–2500. IEEE, 2022

  51. [51]

    Learning trajectory dependencies for human motion prediction

    Wei Mao, Miaomiao Liu, Mathieu Salzmann, and Hongdong Li. Learning trajectory dependencies for human motion prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9489–9497, 2019

  52. [52]

    Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems

    Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Xindong He, Xu Huang, et al. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems. In 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2025

  53. [53]

    Agibot world 2026

    AgiBot World Team. Agibot world 2026. https://huggingface.co/datasets/agibot-world/AgiBotWorld2026, 2026

  54. [54]

    Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, Peter David Fagan, Joey Hejna, Masha Itkina, Marion Lepert, Yecheng Jason Ma, Patrick Tree Miller, Jimmy Wu, Suneel Belkhale, Shivin Dass, Huy Ha, Arhan Jain, Abraham Lee, You...

  55. [55]

    RoboMIND: Benchmark on Multi-embodiment Intelligence Normative Data for Robot Manipulation

    Kun Wu, Chengkai Hou, Jiaming Liu, Zhengping Che, Xiaozhu Ju, Zhuqin Yang, Meng Li, Yinuo Zhao, Zhiyuan Xu, Guang Yang, et al. Robomind: Benchmark on multi-embodiment intelligence normative data for robot manipulation. arXiv preprint arXiv:2412.13877, 2024

  56. [56]

    Robomind 2.0: A multimodal, bimanual mobile manipulation dataset for generalizable embodied intelligence

    Chengkai Hou, Kun Wu, Jiaming Liu, Zhengping Che, Di Wu, Fei Liao, Guangrun Li, Jingyang He, Qiuxuan Feng, Zhao Jin, et al. Robomind 2.0: A multimodal, bimanual mobile manipulation dataset for generalizable embodied intelligence. arXiv preprint arXiv:2512.24653, 2025

  57. [57]

    RoboCOIN: An Open-Sourced Bimanual Robotic Data Collection for Integrated Manipulation

    Shihan Wu, Xuecheng Liu, Shaoxuan Xie, Pengwei Wang, Xinghang Li, Bowen Yang, Zhe Li, Kai Zhu, Hongyu Wu, Yiheng Liu, et al. Robocoin: An open-sourced bimanual robotic data collection for integrated manipulation. arXiv preprint arXiv:2511.17441, 2025

  58. [58]

    RoboChallenge Table30 v2 Dataset

    RoboChallenge.ai. RoboChallenge Table30 v2 Dataset. https://huggingface.co/datasets/RoboChallenge/Table30v2, 2025. Accessed: 2026-05-07

  59. [59]

    10Kh RealOmni-Open DataSet

    GenRobot AI. 10Kh RealOmni-Open DataSet. https://www.genrobot.ai/data/open-dataset, 2025. Accessed: 2026-05-07

  60. [60]

    Bridgedata v2: A dataset for robot learning at scale

    Homer Rich Walke, Kevin Black, Tony Z Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, Andre Wang He, Vivek Myers, Moo Jin Kim, Max Du, et al. Bridgedata v2: A dataset for robot learning at scale. In Conference on Robot Learning, pages 1723–1736. PMLR, 2023

  61. [61]

    RT-1: Robotics transformer for real-world control at scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. RT-1: Robotics transformer for real-world control at scale. In Robotics: Science and Systems (RSS), 2023

  62. [62]

    Bc-z: Zero-shot task generalization with robotic imitation learning

    Eric Jang, Alex Irpan, Mohi Khansari, Daniel Kappler, Frederik Ebert, Corey Lynch, Sergey Levine, and Chelsea Finn. Bc-z: Zero-shot task generalization with robotic imitation learning. In Conference on Robot Learning, pages 991–1002. PMLR, 2022

  63. [63]

    Furniturebench: Reproducible real-world benchmark for long-horizon complex manipulation

    Minho Heo, Youngwoon Lee, Doohyun Lee, and Joseph J Lim. Furniturebench: Reproducible real-world benchmark for long-horizon complex manipulation. The International Journal of Robotics Research, 44(10-11):1863–1891, 2025

  64. [64]

    Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration

    Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024. ...

  65. [65]

    Input projection. A linear projection followed by LayerNorm, GELU, and dropout maps x_{1:T} from ℝ^D to ℝ^H.

  66. [66]

    Embodiment conditioning. An encoder-side embedding vector m ∈ ℝ^H, looked up from a learnable registry of 1024 slots (one of which is a special learnable “none” slot used under CFG-style dropout), is added, broadcast over time.

    3. RoPE positional encoding. Rotary position embeddings on the time dimension with base 10^4.

  67. [67]

    Self-attention stack. A 12-layer Transformer encoder (8 heads, GELU FFN of width 4H, dropout 0.1) processes the projected sequence with the chunk’s padding mask.

  68. [68]

    Optional state cross-attention. When o is provided (i.e., not CFG-dropped), a single cross-attention block uses the linearly projected o as key and value while the time series acts as query, followed by residual + LayerNorm.

  69. [69]

    Latent query cross-attention. M_max = 16 learnable latent queries q_{1:M} are equipped with their own RoPE encoding, expanded across the batch, and cross-attend to the encoded sequence to extract a length-M summary.

    7. Position-wise FFN. A final FFN with residual + LayerNorm.

    Decoder. The decoder Dec ingests the quantized latent z̃_{1:M} together with o and m, and outp...
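The latent query cross-attention step in the encoder description above can be sketched as follows. This is a simplified, single-head illustration under stated assumptions: it omits the learned key/value/query projections, the RoPE encoding, and the residual + LayerNorm, and the sizes `T` and `H` and the function name `latent_cross_attention` are hypothetical. Only the number of latent queries (16) comes from the text.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def latent_cross_attention(queries, sequence, pad_mask=None):
    """queries: (M, H) latents; sequence: (T, H) encoded frames; pad_mask: (T,), True = valid."""
    H = queries.shape[-1]
    scores = queries @ sequence.T / np.sqrt(H)         # (M, T) scaled dot products
    if pad_mask is not None:
        scores = np.where(pad_mask[None, :], scores, -1e9)  # block attention to padding
    weights = softmax(scores, axis=-1)
    return weights @ sequence, weights                 # (M, H) summary, (M, T) attention

rng = np.random.default_rng(0)
M, T, H = 16, 50, 32                                   # M from the text; T and H are stand-ins
queries = rng.normal(size=(M, H))                      # would be learnable parameters
sequence = rng.normal(size=(T, H))                     # output of the self-attention stack
pad_mask = np.arange(T) < 40                           # last 10 frames are padding

summary, weights = latent_cross_attention(queries, sequence, pad_mask)
print(summary.shape)                                   # (16, 32)
```

The point of the design is that a variable-length action chunk of `T` frames is compressed into a fixed number of `M` latent slots, which is what the quantizer then operates on.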