SA-VLA: State-aware tokenizer for improving Vision-Language-Action Models' performance

Chunpu Xu; Jiayue Kang; Tengyue Jiang; Yao Mu

arxiv: 2606.30113 · v1 · pith:DBARYGEBnew · submitted 2026-06-29 · 💻 cs.RO · cs.AI

Pith reviewed 2026-06-30 05:37 UTC · model grok-4.3

keywords vision-language-action · action tokenization · state-aware tokenizer · robot manipulation · discrete policies · proprioceptive conditioning · sim-to-real transfer

The pith

Conditioning action tokens on robot proprioceptive state allows each discrete code to represent multiple continuous actions and raises VLA success rates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard discrete action tokenizers assign one fixed continuous prototype to each code, yet during manipulation the same token often requires different motor commands depending on joint configuration, object pose, and contact state. SA-VLA adds a state-injection step that conditions reconstruction on current proprioception, either through cross-attention between state and action features or through a lightweight adapter that predicts action-wise modulation factors. The adapter formulation lets a finite codebook cover a wider family of state-dependent actions while remaining compatible with autoregressive LLM policies. On 12 RoboTwin tasks, average success rises from 0.29 to 0.56 over the strongest tokenizer baseline, and on three real-world zero-shot transfers it rises from 0.15 to 0.33.

Core claim

The central claim is that state-conditioned action decoding in a VQ tokenizer expands the effective support of each discrete code by making reconstruction depend on the robot's current proprioceptive state, thereby reducing the compression gap between discrete tokens and continuous controls without altering the model interface.
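One way to write the claim down, as a sketch rather than the paper's own notation: every symbol here is an assumption introduced for illustration ($e_k$ is the codebook vector of token $k$, $D$ the tokenizer decoder, $s$ the proprioceptive state, $m_\theta$ the state adapter, $\odot$ an element-wise product), and the multiplicative form is one plausible reading of "modulation factors".

```latex
\begin{aligned}
\text{state-agnostic decoding:}\quad & \hat{a} = D(e_k) \\
\text{state-aware decoding:}\quad & \hat{a} = D(e_k) \odot m_\theta(s) \\
\text{effective support of code } k\text{:}\quad & \mathcal{A}_k = \{\, D(e_k) \odot m_\theta(s) \;:\; s \in \mathcal{S} \,\}
\end{aligned}
```

Under state-agnostic decoding, $\mathcal{A}_k$ collapses to the single point $D(e_k)$; under state-aware decoding it becomes a set indexed by state, while the number of codes stays the same.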

What carries the argument

Lightweight state adapter that predicts action-wise modulation factors to condition continuous action reconstruction on proprioceptive state.
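A minimal sketch of what such an adapter could look like. The abstract says only that the adapter predicts action-wise modulation factors, so the multiplicative form, the single linear layer, the `1 + tanh` parameterization, and every name below (`StateAdapterDecoder`, `codebook`, `W`, `b`) are hypothetical illustrations, not the authors' implementation.

```python
import numpy as np


class StateAdapterDecoder:
    """Toy VQ decoder whose output is modulated by proprioceptive state.

    Assumption: the modulation is an element-wise scale in (0, 2) predicted
    from the state by one linear layer. The paper may use a different form.
    """

    def __init__(self, num_codes, action_dim, state_dim, seed=0):
        rng = np.random.default_rng(seed)
        # One fixed continuous prototype per discrete code.
        self.codebook = rng.normal(size=(num_codes, action_dim))
        # Hypothetical adapter weights mapping state -> per-dimension scale.
        self.W = rng.normal(scale=0.1, size=(state_dim, action_dim))
        self.b = np.zeros(action_dim)

    def decode_fixed(self, token):
        """State-agnostic decoding: a token always maps to the same action."""
        return self.codebook[token]

    def decode(self, token, state):
        """State-aware decoding: the prototype is rescaled by the state."""
        scale = 1.0 + np.tanh(state @ self.W + self.b)  # values in (0, 2)
        return self.codebook[token] * scale


decoder = StateAdapterDecoder(num_codes=8, action_dim=4, state_dim=6)
state_a = np.zeros(6)            # zero state and zero bias give scale 1
state_b = np.linspace(-1, 1, 6)  # a different joint configuration
action_a = decoder.decode(3, state_a)
action_b = decoder.decode(3, state_b)
```

In this toy, token 3 decodes to its plain prototype under `state_a` and to a rescaled action under `state_b`, which is the sense in which one code can stand for a family of actions.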

If this is right

The state adapter integrates into existing LLM-based VLA policies with only minimal interface changes and supports both autoregressive and parallel token decoding.
Average success across the 12 RoboTwin manipulation tasks rises from 0.29 to 0.56.
Zero-shot sim-to-real average success on three real-world tasks rises from 0.15 to 0.33.
Each discrete token can now represent a family of state-dependent actions while the codebook size and autoregressive structure remain unchanged.
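The two headline averages can be restated as absolute and relative gains. The short calculation below uses only the figures quoted above; the helper name `gains` is illustrative.

```python
def gains(baseline, improved):
    """Return (absolute gain, relative gain) for a pair of success rates."""
    absolute = improved - baseline
    relative = absolute / baseline
    return absolute, relative


# Averages quoted in the abstract.
robotwin_abs, robotwin_rel = gains(0.29, 0.56)
real_abs, real_rel = gains(0.15, 0.33)

print(f"RoboTwin: +{robotwin_abs:.2f} absolute, {robotwin_rel:.0%} relative")
# RoboTwin: +0.27 absolute, 93% relative
print(f"Sim-to-real: +{real_abs:.2f} absolute, {real_rel:.0%} relative")
# Sim-to-real: +0.18 absolute, 120% relative
```

The relative gain is larger on the real-world tasks, but it is computed from a lower baseline and from only three tasks.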

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same state-modulation idea could be tested in non-robotics sequence models where context changes the meaning of a token.
Smaller codebooks paired with state adapters might reach performance levels previously requiring larger codebooks.
Proprioceptive state may be an under-used conditioning signal in other discrete policy architectures beyond VLA.

Load-bearing premise

The measured gains are produced by the state-conditioning mechanism itself rather than by any differences in training procedure or implementation details between SA-VLA and the baseline tokenizers.

What would settle it

Train the strongest baseline tokenizer with the exact same data, optimizer schedule, and hyperparameters used for SA-VLA but without any state input, then compare success rates on the same 12 RoboTwin tasks.
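A sketch of how that controlled comparison could be encoded so the two runs provably differ in a single switch. Every field name and value in `TokenizerRun` is a placeholder chosen for illustration; none comes from the paper.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class TokenizerRun:
    """Hypothetical run config; all fields and defaults are placeholders."""
    codebook_size: int = 512
    learning_rate: float = 1e-4
    train_steps: int = 100_000
    seed: int = 0
    dataset: str = "robotwin-12-tasks"
    use_state: bool = False  # the only switch the ablation should flip


def differing_fields(a, b):
    """List the config fields whose values differ between two runs."""
    da, db = asdict(a), asdict(b)
    return [key for key in da if da[key] != db[key]]


baseline = TokenizerRun()
state_aware = replace(baseline, use_state=True)

# A clean ablation differs in exactly one field.
assert differing_fields(baseline, state_aware) == ["use_state"]
```

Deriving the second config from the first with `replace` makes accidental drift in the shared hyperparameters harder to introduce than writing two configs by hand.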

Figures

Figures reproduced from arXiv: 2606.30113 by Chunpu Xu, Jiayue Kang, Tengyue Jiang, Yao Mu.

**Figure 2.** VLA architecture effectively reformulates the discrete VQ-VAE [11] into a regression problem, enabling the generation of a continuous action space while preserving the benefits of a fixed codebook. To delineate modality boundaries in the input tokens, we introduce special functional tokens: $t_{bos}$/$t_{eos}$ for text, $s_{bos}$/$s_{eos}$ for state, $i_{bos}$/$i_{eos}$ for image, and $a_{bos}$/$a_{eos}$ for action. During training, we s…
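The boundary tokens named in this caption can be illustrated with a small sequence builder. The literal token spellings (`<t_bos>` and so on), the modality order, and the helper names are assumptions; the caption names the token pairs but the excerpt does not show how the sequence is laid out.

```python
# Boundary tokens named in the Figure 2 caption; the string spellings are assumed.
BOUNDARIES = {
    "text": ("<t_bos>", "<t_eos>"),
    "state": ("<s_bos>", "<s_eos>"),
    "image": ("<i_bos>", "<i_eos>"),
    "action": ("<a_bos>", "<a_eos>"),
}


def wrap(modality, tokens):
    """Surround one modality's tokens with its begin/end markers."""
    bos, eos = BOUNDARIES[modality]
    return [bos, *tokens, eos]


def build_sequence(text, image, state, action):
    """Concatenate wrapped segments. The text-image-state-action order is assumed."""
    return (
        wrap("text", text)
        + wrap("image", image)
        + wrap("state", state)
        + wrap("action", action)
    )


seq = build_sequence(
    text=["pick", "bottle"],
    image=["img_0", "img_1"],
    state=["s_0"],
    action=["a_17", "a_4"],
)
```

Because every segment is bracketed, a decoder can locate the action span by searching for `<a_bos>` and `<a_eos>` regardless of how long the other segments are.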

**Figure 3.** Some of the simulation tasks.

B.2 Visualization on real world scenes. In the real-robot experiments, we selected three tasks, namely Click Bell, Place Container Plate, and Pick Diverse Bottles, and took a partial segment of each scenario.

**Figure 4.** Three task scenarios.

Table C.1: Performance on generalization tasks

| Task Name | SR (success rate) |
| --- | --- |
| Shake Bottle Horizontally | 0.95 |
| Place Empty Cup | 0.25 |
| Handover Mic | 0.10 |
| Place Container Plate | 0.16 |

C.2 Ablations on tokenization granularity. We observed examples where different actions were mapped to identical tokens by the state-agnostic tokenizer, whereas the state-aware tokenizer mapped them to distinct tokens, as illustrate…

**Figure 5.** Visualization on tokenization granularity.

Original abstract

Discrete action tokenization provides a compact interface for autoregressive VLA policies, but accurately recovering continuous robot actions from discrete codes remains challenging. Existing tokenizers typically map each discrete code to a fixed continuous action prototype, ignoring the robot's current proprioceptive state. This limitation is particularly pronounced in manipulation, where the same action token may require different continuous controls under different joint configurations, object poses, and contact conditions. We therefore propose SA-VLA, a state-aware action tokenizer that conditions action decoding on robot state. We study two state-injection mechanisms for VQ-based action tokenization: cross-attention between state and action features, and a lightweight state adapter that predicts action-wise modulation factors for state-conditioned action modulation and reconstruction. The adapter formulation expands the effective support of a finite codebook by allowing each discrete token to represent a family of state-dependent continuous actions, while preserving the efficiency and compatibility of discrete action modeling. Integrated into an LLM-based VLA policy, SA-VLA supports both autoregressive and parallel action-token decoding with minimal changes to the model interface. On 12 RoboTwin manipulation tasks, SA-VLA improves the average success rate from 0.29 to 0.56 over the strongest tokenizer baseline. In zero-shot sim-to-real experiments on three real-world tasks, it further improves average success from 0.15 to 0.33 over the strongest tokenizer baseline. These results demonstrate that state-conditioned action decoding is a simple and effective mechanism for reducing the compression gap in discrete VLA policies.
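The first of the two mechanisms, cross-attention between state and action features, can be sketched in a few lines of NumPy. This is a generic single-head cross-attention under stated assumptions: action features act as queries, state features as keys and values, and a residual connection is added. The abstract does not specify any of these choices, and the function and variable names are illustrative.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(action_feats, state_feats, Wq, Wk, Wv):
    """Single-head cross-attention: action features attend to state features."""
    q = action_feats @ Wq  # (T_a, d) queries from action features
    k = state_feats @ Wk   # (T_s, d) keys from state features
    v = state_feats @ Wv   # (T_s, d) values from state features
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (T_a, T_s)
    # Residual connection (an assumption) keeps the original action features.
    return action_feats + weights @ v, weights


rng = np.random.default_rng(0)
d = 16
action_feats = rng.normal(size=(5, d))  # 5 action positions
state_feats = rng.normal(size=(3, d))   # 3 state feature vectors
Wq, Wk, Wv = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))
fused, weights = cross_attention(action_feats, state_feats, Wq, Wk, Wv)
```

Each row of `weights` sums to one, so every action position receives a convex combination of state values before decoding.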

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

State conditioning in the tokenizer is a reasonable idea, but the gains aren't isolated from other factors yet.

The main point is that SA-VLA conditions the VQ action tokenizer on robot state through either cross-attention or a lightweight state adapter that predicts modulation factors. The adapter lets each discrete code stand for a range of state-dependent continuous actions without breaking the discrete interface for the downstream VLA policy.

The adapter design is the part that holds up on its own terms. Existing tokenizers map codes to fixed prototypes and ignore proprioception, which creates a real compression issue in manipulation where the same token can map to different controls depending on joint angles or contacts. Allowing modulation expands the effective coverage of a fixed codebook while keeping training and inference efficient. The paper keeps the changes minimal so the rest of an LLM-based VLA can stay the same for both autoregressive and parallel decoding.

The evaluation is the weaker part. The abstract reports average success rising from 0.29 to 0.56 on twelve RoboTwin tasks and from 0.15 to 0.33 on three real-world zero-shot transfers, but it gives no error bars, no ablation that holds codebook size, training schedule, and VLA integration fixed while toggling only the state path, and no confirmation that the baselines received identical optimization. The confound concern therefore remains open: the deltas could trace to unreported implementation differences rather than to the state injection. Without those controls, the central claim stays provisional.

This is aimed at researchers already working on discrete tokenization inside robotics VLAs. Someone focused on action modeling would get value from the adapter formulation if the full paper supplies the missing ablations. It deserves a serious referee who can check the methods section and ask for those controls before any stronger claims are accepted.

Referee Report

2 major / 2 minor

Summary. The paper proposes SA-VLA, a state-aware action tokenizer for VLA models that conditions discrete action decoding on the robot's proprioceptive state. It introduces two mechanisms—cross-attention between state and action features, and a lightweight state adapter that predicts modulation factors—to allow each discrete token to represent a family of state-dependent continuous actions. The method is integrated into an LLM-based VLA policy supporting autoregressive and parallel decoding. On 12 RoboTwin manipulation tasks, it reports raising average success rate from 0.29 to 0.56 versus the strongest tokenizer baseline; in zero-shot sim-to-real transfer on three real-world tasks, it reports raising average success from 0.15 to 0.33.

Significance. If the gains are shown to stem specifically from state conditioning rather than uncontrolled variables, the work would be significant for discrete VLA policies. It directly targets the compression gap in action tokenization for contact-rich manipulation, where identical tokens must map to different controls under varying joint configurations and contacts. The adapter formulation preserves discrete modeling efficiency while expanding effective codebook support, and the minimal interface changes make it compatible with existing autoregressive VLAs.

major comments (2)

[Abstract / Experiments] Abstract (and Experiments section): the central empirical claim attributes the 0.29→0.56 (RoboTwin) and 0.15→0.33 (sim-to-real) gains to the state-injection mechanisms, yet no ablation is described that holds codebook size, training schedule, VLA integration, optimizer, and data augmentations fixed while toggling only the cross-attention path or state adapter. Because the adapter alters the reconstruction objective and cross-attention changes feature flow, any unmentioned hyper-parameter or implementation mismatch could fully explain the deltas; this is load-bearing for the claim that state conditioning itself is the effective mechanism.
[Abstract] Abstract: success rates are reported as point estimates with no error bars, standard deviations, or number of evaluation seeds. Without these, it is impossible to assess whether the reported improvements are statistically reliable or sensitive to random seeds, which is required to support the conclusion that state-aware tokenization reduces the compression gap.
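To make the second comment concrete, here is a sketch of one standard way to attach uncertainty to a success rate: the Wilson score interval for a binomial proportion. The trial counts are hypothetical, picked only so the rates match the quoted averages; the abstract does not state how many episodes were run, and treating a multi-task average as a single binomial is itself a simplification.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


# Hypothetical counts chosen only to match the quoted rates; the paper's
# actual trial counts are not given in the abstract.
low_b, high_b = wilson_interval(29, 100)  # 0.29 baseline
low_s, high_s = wilson_interval(56, 100)  # 0.56 SA-VLA
```

With fewer trials per condition the intervals widen, which is why the number of evaluation episodes and seeds matters for judging whether the gap is reliable.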

minor comments (2)

[Abstract] Abstract: the phrase 'strongest tokenizer baseline' is undefined; the manuscript should name the specific baseline tokenizer(s), their codebook sizes, and training details for reproducibility.
The manuscript should clarify whether the state adapter is trained jointly with the VLA policy or only with the tokenizer, and whether the modulation factors are applied at inference time only or also during policy training.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below and commit to revisions that directly strengthen the empirical claims.

Referee: [Abstract / Experiments] Abstract (and Experiments section): the central empirical claim attributes the 0.29→0.56 (RoboTwin) and 0.15→0.33 (sim-to-real) gains to the state-injection mechanisms, yet no ablation is described that holds codebook size, training schedule, VLA integration, optimizer, and data augmentations fixed while toggling only the cross-attention path or state adapter. Because the adapter alters the reconstruction objective and cross-attention changes feature flow, any unmentioned hyper-parameter or implementation mismatch could fully explain the deltas; this is load-bearing for the claim that state conditioning itself is the effective mechanism.

Authors: We agree that isolating the contribution of the state-injection mechanisms (cross-attention and state adapter) while holding codebook size, training schedule, VLA integration, optimizer, and data augmentations fixed is necessary to support the central claim. The current manuscript reports comparisons to baseline tokenizers but does not contain such a controlled ablation. In the revised manuscript we will add this ablation, explicitly toggling only the state-conditioning components. revision: yes
Referee: [Abstract] Abstract: success rates are reported as point estimates with no error bars, standard deviations, or number of evaluation seeds. Without these, it is impossible to assess whether the reported improvements are statistically reliable or sensitive to random seeds, which is required to support the conclusion that state-aware tokenization reduces the compression gap.

Authors: We agree that point estimates alone are insufficient to demonstrate statistical reliability. Although the underlying experiments were performed with multiple random seeds, the manuscript reports only averages. In the revised version we will report standard deviations, include error bars in figures, and state the number of evaluation seeds used for both the RoboTwin and sim-to-real results. revision: yes
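The promised reporting format can be sketched with the standard library. The per-seed values below are placeholders for illustration, not results from the paper, and `summarize` is a hypothetical helper.

```python
from statistics import mean, stdev


def summarize(per_seed_rates):
    """Mean and sample standard deviation of per-seed success rates."""
    if len(per_seed_rates) < 2:
        raise ValueError("need at least two seeds to estimate spread")
    return mean(per_seed_rates), stdev(per_seed_rates)


# Placeholder per-seed values, NOT results from the paper.
placeholder_rates = [0.40, 0.50, 0.60]
avg, spread = summarize(placeholder_rates)
report = f"{avg:.2f} ± {spread:.2f} (n={len(placeholder_rates)} seeds)"
```

Reporting the seed count next to the spread lets a reader judge how much weight the standard deviation itself can bear.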

Circularity Check

0 steps flagged

No circularity: empirical performance claims rest on external task benchmarks

The manuscript presents an empirical proposal for a state-aware action tokenizer (cross-attention and state adapter) and validates it via success-rate comparisons on 12 RoboTwin tasks and three sim-to-real tasks. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text that would reduce the reported deltas (0.29→0.56 or 0.15→0.33) to quantities defined by the method itself. The central claims therefore remain independent of the input data and are evaluated against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the approach implicitly assumes that state features can be fused without destabilizing the discrete codebook or the downstream LLM policy.

pith-pipeline@v0.9.1-grok · 5812 in / 1225 out tokens · 14084 ms · 2026-06-30T05:37:45.982100+00:00 · methodology

Reference graph

Works this paper leans on

34 extracted references · 21 canonical work pages · 18 internal anchors

[1] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu, et al. RT-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022.
[2] M. J. Kim, C. Finn, and P. Liang. Fine-tuning vision-language-action models: Optimizing speed and success. arXiv preprint arXiv:2502.19645, 2025.
[3] H. Li, Y. Zuo, J. Yu, Y. Zhang, Z. Yang, K. Zhang, X. Zhu, Y. Zhang, T. Chen, G. Cui, et al. SimpleVLA-RL: Scaling VLA training via reinforcement learning. arXiv preprint arXiv:2509.09674, 2025.
[4] C.-L. Cheang, G. Chen, Y. Jing, T. Kong, H. Li, Y. Li, Y. Liu, H. Wu, J. Xu, Y. Yang, et al. GR-2: A generative video-language-action model with web-scale knowledge for robot manipulation. arXiv preprint arXiv:2410.06158, 2024.
[5] M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, et al. OpenVLA: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024.
[6] Y. Wang, H. Zhu, M. Liu, J. Yang, H.-S. Fang, and T. He. VQ-VLA: Improving vision-language-action models via scaling vector-quantized action tokenizers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11089–11099, 2025.
[7] K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al. $\pi_0$: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024.
[8] S. Bai, Y. Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025.
[9] T. Shiba. The compression gap: Why discrete tokenization limits vision-language-action model scaling. arXiv preprint arXiv:2604.03191, 2026.
[10] K. Pertsch, K. Stachowicz, B. Ichter, D. Driess, S. Nair, Q. Vuong, O. Mees, C. Finn, and S. Levine. FAST: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747, 2025.
[11] A. Van Den Oord, O. Vinyals, et al. Neural discrete representation learning. Advances in Neural Information Processing Systems, 30, 2017.
[12] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
[13] Y. Liu, P. Ding, T. Jiang, X. Wang, W. Song, M. Lin, H. Zhao, H. Zhang, Z. Zhuang, W. Zhao, et al. MMaDA-VLA: Large diffusion vision-language-action model with unified multi-modal instruction and generation. arXiv preprint arXiv:2603.25406, 2026.
[14] H. Liang, X. Chen, B. Wang, M. Chen, Y. Liu, Y. Zhang, Z. Chen, T. Yang, Y. Chen, J. Pang, et al. MM-ACT: Learn from multimodal parallel generation to act. arXiv preprint arXiv:2512.00975, 2025.
[15] B. Team. Being-H0.7: A latent world-action model from egocentric videos, 2026.
[16] Q. Zhao, Y. Lu, M. J. Kim, Z. Fu, Z. Zhang, Y. Wu, Z. Li, Q. Ma, S. Han, C. Finn, et al. CoT-VLA: Visual chain-of-thought reasoning for vision-language-action models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1702–1713, 2025.
[17] Z. Liu, J. Liu, H. Chen, J. Yu, Z. Guo, C. Hou, C. Gu, X. Mi, R. Zhang, K. Wu, et al. Last {0}: Latent spatio-temporal chain-of-thought for robotic vision-language-action model. arXiv preprint arXiv:2601.05248, 2026.
[18] J. Zhai, H. Shi, S. Guo, K. Yang, and K. Wang. E-VLA: Event-augmented vision-language-action model for dark and blurred scenes. arXiv preprint arXiv:2604.04834, 2026.
[19] S. Yang, X. Hu, Q. Wu, and D. Yang. VAEVQ: Enhancing discrete visual tokenization through variational modeling. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 11703–11711, 2026.
[20] H. Chen, Z. Wang, X. Li, X. Sun, F. Chen, J. Liu, J. Wang, B. Raj, Z. Liu, and E. Barsoum. SoftVQ-VAE: Efficient 1-dimensional continuous tokenizer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 28358–28370, 2025.
[21] Y. Zhu, B. Li, Y. Xin, Z. Xia, and L. Xu. Addressing representation collapse in vector quantized models with one linear layer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22968–22977, 2025.
[22] S. Elfwing, E. Uchibe, and K. Doya. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks, 107:3–11, 2018.
[23] Q. Bu, Y. Yang, J. Cai, S. Gao, G. Ren, M. Yao, P. Luo, and H. Li. UniVLA: Learning to act anywhere with task-centric latent actions. arXiv preprint arXiv:2505.06111, 2025.
[24] J. Cen, C. Yu, H. Yuan, Y. Jiang, S. Huang, J. Guo, X. Li, Y. Song, H. Luo, F. Wang, et al. WorldVLA: Towards autoregressive action world model. arXiv preprint arXiv:2506.21539, 2025.
[25] X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023.
[26] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[27] T. Z. Zhao, V. Kumar, S. Levine, and C. Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023.
[28] T. Chen, Z. Chen, B. Chen, Z. Cai, Y. Liu, Z. Li, Q. Liang, X. Lin, Y. Ge, Z. Gu, et al. RoboTwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation. arXiv preprint arXiv:2506.18088, 2025.
[29] Y. Deng and D. Zhou. HumanNet: Scaling human-centric video learning to one million hours. arXiv preprint arXiv:2605.06747, 2026.
[30] A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jain, et al. Open X-Embodiment: Robotic learning datasets and RT-X models: Open X-Embodiment collaboration. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024.
[31] G. A. Team. GEN-0: Embodied foundation models that scale with physical interaction. Generalist AI Blog, 2025. https://generalistai.com/blog/nov-04-2025-GEN-0
[32] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
[33] J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
[34] Z. Zhang, J. Pang, Z. Yang, K. Li, M. Liao, S. Zhang, G. Chi, J. Guo, H.-a. Gao, M. Shi, et al. Dexora: Open-source VLA for high-DoF bimanual dexterity. arXiv preprint arXiv:2605.18722, 2026.

Appendix A Detailed Training And Evaluation Recipes
A.1 Settings for Data Collection in Simulator
We conducted experiments in the RoboTwin simulator using two Pipe...
