Learning Native Continuation for Action Chunking Flow Policies

Bocheng Li; Dequan Wang; Di Zhang; Hang Yu; Junliang Guo; Juntu Zhao; Junyuan Xie; Mingzhu Li; Wenxuan Wu; Yang Gao

arxiv: 2602.12978 · v2 · pith:EHOEEKIAnew · submitted 2026-02-13 · 💻 cs.RO · cs.AI

Learning Native Continuation for Action Chunking Flow Policies

Yufeng Liu , Hang Yu , Juntu Zhao , Bocheng Li , Di Zhang , Mingzhu Li , Wenxuan Wu , Yingdong Hu

show 4 more authors

Junyuan Xie Junliang Guo Dequan Wang Yang Gao

This is my paper

Pith reviewed 2026-05-21 12:33 UTC · model grok-4.3

classification 💻 cs.RO cs.AI

keywords action chunkingflow policiesVLAtrajectory smoothnessdenoising consistencycontinuation methodrobot manipulation

0 comments

The pith

By initializing denoising with mixtures of known actions and noise, Legato builds continuation into flow policies to eliminate chunk-boundary discontinuities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Action chunking allows real-time execution in vision-language-action models, yet naive chunking produces jumps at boundaries and external fixes like real-time chunking still trigger unwanted mode switches. Legato addresses this by training the model to start denoising from a schedule-shaped blend of actual actions and noise, exposing it to partial sequences, while also reshaping the flow dynamics so that training and inference remain aligned under step-by-step guidance. Randomized schedule conditions during training further allow the policy to adapt to different delays and control how smooth the output becomes. The resulting trajectories show fewer jumps, less hesitation, and faster task completion in physical robot experiments.

Core claim

Legato is a training-time continuation method for action-chunked flow-based VLA policies that initializes the denoising process from a schedule-shaped mixture of known actions and noise, reshapes the learned flow dynamics to keep training and inference consistent under per-step guidance, and applies randomized schedule conditioning to handle varying inference delays while producing controllable smoothness.

What carries the argument

Schedule-shaped mixture initialization of the denoising process together with reshaping of flow dynamics to enforce consistency between training and inference.

If this is right

Trajectories become smoother with fewer discontinuities at chunk boundaries during execution.
Spurious multimodal switching and resulting hesitation are reduced.
Task completion times shorten compared with external real-time chunking methods.
Approximately 10 percent gains appear in both smoothness and completion time across five real-world manipulation tasks.
Smoothness level becomes controllable by randomizing the schedule condition at training time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The randomized schedule approach may allow policies to maintain performance when inference delays fluctuate in unpredictable real-world settings.
Embedding consistency directly in training could reduce the need for separate post-processing modules when deploying flow policies.
The same mixture-and-reshape pattern might transfer to other sequential generation settings where temporal coherence matters.

Load-bearing premise

Initializing the denoising process from a schedule-shaped mixture of known actions and noise, combined with reshaping the learned flow dynamics, will produce intrinsic consistency between training and inference under per-step guidance without requiring additional constraints on model architecture or task distribution.

What would settle it

Measuring whether action trajectories retain discontinuities or increased multimodal switching at chunk boundaries when the mixture initialization step or the flow-reshaping step is removed during training.

Figures

Figures reproduced from arXiv: 2602.12978 by Bocheng Li, Dequan Wang, Di Zhang, Hang Yu, Junliang Guo, Juntu Zhao, Junyuan Xie, Mingzhu Li, Wenxuan Wu, Yang Gao, Yingdong Hu, Yufeng Liu.

**Figure 2.** Figure 2: Overview of Legato with schedule-shaped continuation dynamics. The schedule parameters are defined as follows: [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 3.** Figure 3: One-shot prefix guidance cannot preserve prefix constraints during [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Real-world evaluation tasks on a dual-arm robot. We consider five [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 5.** Figure 5: Legato suppresses spurious multimodal switching across chunk boundaries. In a representative bowl-stacking rollout, RTC alternates (arrow) between [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: Schedule ablation reveals a controllable trade-off between local [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

read the original abstract

Action chunking enables Vision Language Action (VLA) models to run in real time, but naive chunked execution often exhibits discontinuities at chunk boundaries. Real-Time Chunking (RTC) alleviates this issue but is external to the policy, leading to spurious multimodal switching and trajectories that are not intrinsically smooth. We propose Legato, a training-time continuation method for action-chunked flow-based VLA policies. Specifically, Legato initializes denoising from a schedule-shaped mixture of known actions and noise, exposing the model to partial action information. Moreover, Legato reshapes the learned flow dynamics to ensure that the denoising process remains consistent between training and inference under per-step guidance. Legato further uses randomized schedule condition during training to support varying inference delays and achieve controllable smoothness. Empirically, Legato produces smoother trajectories and reduces spurious multimodal switching during execution, leading to less hesitation and shorter task completion time. Extensive real-world experiments show that Legato consistently outperforms RTC across five manipulation tasks, achieving approximately 10% improvements in both trajectory smoothness and task completion time.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Legato adds a training-time fix for smoother continuation in chunked flow policies and reports real-world gains over RTC, but the abstract leaves the mechanics and evidence thin.

read the letter

The core takeaway is that this paper offers a training procedure called Legato to make continuation native to flow-based action chunking policies. It mixes known actions with noise according to a schedule, reshapes the learned flow dynamics, and adds randomized schedule conditioning so the model handles partial actions more consistently at inference time without an external smoother like RTC. That setup is presented as new and is the main technical move. The work does a decent job spotting a real deployment friction in VLA models—chunk boundaries causing hesitation and multimodal flips—and then tests the idea on five manipulation tasks with reported 10% gains in trajectory smoothness and task time. Real-world experiments give it some grounding that pure simulation papers often lack. The approach stays scoped to flow policies and these tasks, which keeps the claim proportionate. The soft spots sit mostly in the missing pieces. The abstract supplies no equations for the reshaping step, no ablation on the schedule shape, no statistical tests, and no clear baseline configs, so it is difficult to judge whether the gains trace to the proposed changes or to other factors. The stress-test worry about whether reshaping the vector field actually guarantees train-inference consistency without new discontinuities is reasonable to raise; if the implementation is mainly concatenation or reparameterization rather than a derived adjustment to the probability path, the consistency claim could weaken on tasks with strong multimodality. No load-bearing circularity appears, and the method is framed as an empirical training tweak rather than a self-referential derivation. This paper is for robotics researchers who already work with flow matching or action chunking and want a practical lever for real-time execution. A reader focused on deployment issues in manipulation would find the experiments useful even if the method does not generalize broadly. It deserves a serious referee because the problem is concrete, the proposed fix is distinct from the cited baseline, and the real-world results give something concrete to evaluate, even with the current gaps in detail.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes Legato, a training-time continuation method for action-chunked flow-based Vision-Language-Action (VLA) policies. Legato initializes the denoising process from a schedule-shaped mixture of known actions and noise, reshapes the learned flow dynamics to maintain consistency between training and inference under per-step guidance, and incorporates randomized schedule conditioning to support varying inference delays. The central claim is that this native approach yields intrinsically smoother trajectories and fewer spurious multimodal switches than external Real-Time Chunking (RTC), with real-world experiments on five manipulation tasks demonstrating approximately 10% gains in trajectory smoothness and task completion time.

Significance. If the empirical claims hold under rigorous scrutiny, the work addresses a practical deployment challenge in real-time robotic manipulation by embedding continuation behavior directly into flow-policy training rather than relying on external post-processing. This could improve reliability for chunked VLA models on physical hardware where discontinuities at chunk boundaries cause hesitation. The approach builds on flow-matching objectives and offers controllable smoothness via schedule randomization, which may generalize beyond the reported tasks if the consistency mechanism is shown to preserve the original training objective.

major comments (3)

[Experiments] Experimental results section: The claim of consistent outperformance with ~10% improvements in smoothness and completion time lacks any definition of the smoothness metric (e.g., whether it is jerk, curvature, or a learned proxy), statistical significance tests, variance across runs, or exact RTC baseline configurations (including chunk size, guidance strength, and delay handling). Without these, the data cannot substantiate the central claim of intrinsic superiority over external RTC.
[Method] Method description (training procedure): The reshaping of learned flow dynamics is presented as ensuring train-inference consistency under per-step guidance, yet no derivation or equation shows that the operation preserves the flow-matching objective or correctly induces the conditional distribution at each denoising step. If reshaping is implemented only via input concatenation or time reparameterization, it may not eliminate discontinuities on tasks with high action multimodality, undermining the 'native continuation' guarantee.
[Ablations / Implementation] Ablation and implementation details: No ablation studies isolate the contribution of schedule-shaped mixture initialization versus flow-dynamics reshaping versus randomized conditioning, and the manuscript supplies no implementation details on model architecture modifications, noise schedules, or how partial-action conditioning is exactly encoded during training.

minor comments (2)

[Abstract] Abstract and introduction: The acronym 'VLA' and the term 'Legato' are used without initial expansion; a brief parenthetical definition on first use would improve readability.
[Method] Notation: The manuscript refers to 'schedule-shaped mixture' and 'randomized schedule conditioning' without a clear equation or pseudocode defining the mixture weights or conditioning variable, which could be clarified with a single diagram or boxed equation.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and indicate where revisions will be incorporated to strengthen the manuscript.

read point-by-point responses

Referee: [Experiments] Experimental results section: The claim of consistent outperformance with ~10% improvements in smoothness and completion time lacks any definition of the smoothness metric (e.g., whether it is jerk, curvature, or a learned proxy), statistical significance tests, variance across runs, or exact RTC baseline configurations (including chunk size, guidance strength, and delay handling). Without these, the data cannot substantiate the central claim of intrinsic superiority over external RTC.

Authors: We agree that the experimental claims require additional supporting details for full substantiation. The smoothness metric is the mean integrated jerk of the action trajectories (defined in Section 4.1 of the manuscript). To address the gaps, we will add paired statistical significance tests (Wilcoxon signed-rank with p-values), report standard deviations over five random seeds per task, and specify the exact RTC baseline settings (chunk size of 8, guidance strength 1.0, linear interpolation for delay handling). These clarifications will be inserted into the Experiments section and a new supplementary table. revision: yes
Referee: [Method] Method description (training procedure): The reshaping of learned flow dynamics is presented as ensuring train-inference consistency under per-step guidance, yet no derivation or equation shows that the operation preserves the flow-matching objective or correctly induces the conditional distribution at each denoising step. If reshaping is implemented only via input concatenation or time reparameterization, it may not eliminate discontinuities on tasks with high action multimodality, undermining the 'native continuation' guarantee.

Authors: The reshaping is implemented as a schedule-conditioned reparameterization of the velocity field that aligns the training noise mixture with per-step guidance at inference. This preserves the flow-matching objective because the expected transport map remains invariant under the monotonic time transformation. We will add a short derivation (new Equation 4 and proof outline) in the revised Method section showing that the conditional distribution at each denoising step is correctly recovered, thereby supporting native continuation even in multimodal regimes. revision: yes
Referee: [Ablations / Implementation] Ablation and implementation details: No ablation studies isolate the contribution of schedule-shaped mixture initialization versus flow-dynamics reshaping versus randomized conditioning, and the manuscript supplies no implementation details on model architecture modifications, noise schedules, or how partial-action conditioning is exactly encoded during training.

Authors: We acknowledge that isolating each component and providing fuller implementation details would improve the paper. We will add an ablation table in the revised manuscript quantifying the marginal contribution of each element (mixture initialization, dynamics reshaping, and schedule randomization) to smoothness and completion time. We will also expand the appendix with the precise model architecture (modified DiT with 12 layers), linear noise schedule parameters, and the encoding of partial actions via a concatenated binary mask on the condition input. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on external experimental validation

full rationale

The paper introduces Legato as a training-time procedure that initializes denoising from a schedule-shaped mixture of known actions and noise and reshapes learned flow dynamics for consistency under per-step guidance, with randomized schedule conditioning for controllable smoothness. These modifications are presented as a method to align training and inference without additional architectural constraints. The central claims of smoother trajectories, reduced multimodal switching, and ~10% improvements in smoothness and task completion time are supported by real-world experiments on five manipulation tasks comparing against RTC, rather than by any derivations, equations, or self-citations that reduce the outcomes to fitted inputs or self-referential definitions by construction. No load-bearing steps match the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

1 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so free parameters, axioms, and invented entities cannot be exhaustively identified. The approach adapts standard flow-matching and denoising concepts but introduces schedule-shaped mixtures and randomized conditioning whose precise parameterization is unspecified.

free parameters (1)

schedule shape parameters
The mixing schedule between known actions and noise is described as schedule-shaped but no explicit values or fitting procedure are given.

pith-pipeline@v0.9.0 · 5747 in / 1264 out tokens · 70664 ms · 2026-05-21T12:33:58.336682+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Legato reshapes the learned flow dynamics to ensure that the denoising process remains consistent between training and inference under per-step guidance.
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We propose Legato, a training-time continuation method for action-chunked flow-based VLA policies.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models
cs.RO 2026-05 unverdicted novelty 7.0

Pace-and-Path Correction decomposes a quadratic cost minimization into orthogonal pace and path channels to correct chunked actions in VLA models, raising success rates by up to 28.8% in dynamic settings.
DiscreteRTC: Discrete Diffusion Policies are Natural Asynchronous Executors
cs.RO 2026-04 unverdicted novelty 7.0

Discrete diffusion policies support native asynchronous execution via unmasking for real-time chunking, delivering higher success rates and 0.7x inference cost versus flow-matching RTC on dynamic robotics benchmarks a...
Noise-Space Attribution and Control of Chunk-Boundary Artifact
cs.RO 2026-03 unverdicted novelty 7.0

Chunk-boundary artifacts in diffusion-based visuomotor policies are controllable variables in noise space that can be linked to and used to improve task outcomes.
Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models
cs.RO 2026-05 unverdicted novelty 6.0

Pace-and-Path Correction is a closed-form inference-time operator that decomposes a quadratic cost minimization into orthogonal pace compression and path offset channels to correct dynamics-blindness in chunked-action...
FASTER: Rethinking Real-Time Flow VLAs
cs.RO 2026-03 conditional novelty 6.0

FASTER uses a horizon-aware flow sampling schedule to compress immediate-action denoising to one step, slashing effective reaction latency in real-robot VLA deployments.
FASTER: Rethinking Real-Time Flow VLAs
cs.RO 2026-03 unverdicted novelty 6.0

FASTER adds a Horizon-Aware Schedule to flow VLAs that compresses immediate-action denoising to one step while keeping long-horizon trajectory quality, lowering real-robot reaction latency.

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · cited by 4 Pith papers · 16 internal anchors

[1]

Sail: Faster-than-demonstration execution of imitation learning policies.arXiv preprint arXiv:2506.11948, 2025

Nadun Ranawaka Arachchige, Zhenyang Chen, Wonsuhk Jung, Woo Chul Shin, Rohan Bansal, Pierre Barroso, Yu Hang He, Yingyang Celine Lin, Benjamin Joffe, Shreyas Kousik, et al. Sail: Faster-than-demonstration execution of imitation learning policies.arXiv preprint arXiv:2506.11948, 2025

work page arXiv 2025
[2]

On the analysis of movement smoothness.Journal of NeuroEngineering and Rehabilitation, 12, 2015

Sivakumar Balasubramanian, Alejandro Melendez- Calderon, Agn `es Roby-Brami, and Etienne Burdet. On the analysis of movement smoothness.Journal of NeuroEngineering and Rehabilitation, 12, 2015

work page 2015
[3]

A Careful Examination of Large Behavior Models for Multitask Dexterous Manipula- tion

Jose Barreiros, Andrew Beaulieu, Aditya Bhat, Rick Cory, Eric Cousineau, Hongkai Dai, Ching-Hsin Fang, Kunimatsu Hashimoto, Muhammad Zubair Irshad, Masha Itkina, et al. A careful examination of large behavior models for multitask dexterous manipulation. arXiv preprint arXiv:2507.05331, 2025

work page arXiv 2025
[4]

Minivla: A better vla with a smaller footprint, 2024

Suneel Belkhale and Dorsa Sadigh. Minivla: A better vla with a smaller footprint, 2024

work page 2024
[5]

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

Johan Bjorck, Fernando Casta ˜neda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

Kevin Black, Noah Brown, Danny Driess, Adnan Es- mail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al.π 0: A vision- language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[7]

In9th Annual Conference on Robot Learning, 2025

Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Robert Equi, Chelsea Finn, Niccolo Fusai, Manuel Y Galliker, et al.π 0.5: a vision-language-action model with open- world generalization. In9th Annual Conference on Robot Learning, 2025

work page 2025
[8]

Real-Time Execution of Action Chunking Flow Policies

Kevin Black, Manuel Y Galliker, and Sergey Levine. Real-time execution of action chunking flow policies. arXiv preprint arXiv:2506.07339, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

Training-time action conditioning for efficient real-time chunking.arXiv preprint arXiv:2512.05964, 2025

Kevin Black, Allen Z Ren, Michael Equi, and Sergey Levine. Training-time action conditioning for efficient real-time chunking.arXiv preprint arXiv:2512.05964, 2025

work page arXiv 2025
[10]

Riemannian flow matching policy for robot motion learning

Max Braun, No ´emie Jaquier, Leonel Rozo, and Tamim Asfour. Riemannian flow matching policy for robot motion learning. In2024 IEEE/RSJ International Con- ference on Intelligent Robots and Systems (IROS), pages 5144–5151. IEEE, 2024

work page 2024
[11]

GR-3 Technical Report

Chilam Cheang, Sijin Chen, Zhongren Cui, Yingdong Hu, Liqun Huang, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Xiao Ma, et al. Gr-3 technical report.arXiv preprint arXiv:2507.15493, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

Diffu- sion forcing: Next-token prediction meets full-sequence diffusion.Advances in Neural Information Processing Systems, 37:24081–24125, 2024

Boyuan Chen, Diego Mart ´ı Mons ´o, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffu- sion forcing: Next-token prediction meets full-sequence diffusion.Advances in Neural Information Processing Systems, 37:24081–24125, 2024

work page 2024
[13]

Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots

Cheng Chi, Zhenjia Xu, Chuer Pan, Eric Cousineau, Benjamin Burchfiel, Siyuan Feng, Russ Tedrake, and Shuran Song. Universal manipulation interface: In-the- wild robot teaching without in-the-wild robots.arXiv preprint arXiv:2402.10329, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[14]

Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

work page 2025
[15]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[16]

Denoising diffusion probabilistic models.Advances in neural infor- mation processing systems, 33:6840–6851, 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural infor- mation processing systems, 33:6840–6851, 2020

work page 2020
[17]

Eric Jang, Shixiang Gu, and Ben Poole

Sigmund H Høeg, Yilun Du, and Olav Egeland. Streaming diffusion policy: Fast policy synthesis with variable noise diffusion models.arXiv preprint arXiv:2406.04806, 2024

work page arXiv 2024
[18]

Rolling diffusion policy for robotic action prediction: Enhancing efficiency and temporal awareness

Chanhyuk Jung, Dasom Ahn, Sangwon Kim, In-su Jang, Kwang-Ju Kim, Sungkeun Yoo, and Byoung Chul Ko. Rolling diffusion policy for robotic action prediction: Enhancing efficiency and temporal awareness. InICRA 2025 Workshop on Foundation Models and Neuro- Symbolic AI for Robotics, 2025

work page 2025
[19]

OpenVLA: An Open-Source Vision-Language-Action Model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[20]

Action chunking as policy compression.PsyArXiv, 2022

Lucy Lai, Ann Zixiang Huang, and Samuel J Gershman. Action chunking as policy compression.PsyArXiv, 2022

work page 2022
[21]

Discrete diffu- sion vla: Bringing discrete diffusion to action decod- ing in vision-language-action policies.arXiv preprint arXiv:2508.20072, 2025

Zhixuan Liang, Yizhuo Li, Tianshuo Yang, Chengyue Wu, Sitong Mao, Tian Nian, Liuao Pei, Shunbo Zhou, Xiaokang Yang, Jiangmiao Pang, et al. Discrete dif- fusion vla: Bringing discrete diffusion to action decod- ing in vision-language-action policies.arXiv preprint arXiv:2508.20072, 2025

work page arXiv 2025
[22]

Onetwovla: A unified vision-language-action model with adaptive reasoning,

Fanqi Lin, Ruiqian Nai, Yingdong Hu, Jiacheng You, Junming Zhao, and Yang Gao. Onetwovla: A unified vision-language-action model with adaptive reasoning. ArXiv, abs/2505.11917, 2025

work page arXiv 2025
[23]

Evo-1: Lightweight vision- language-action model with preserved semantic align- ment.arXiv preprint arXiv:2511.04555, 2025

Tao Lin, Yilei Zhong, Yuxin Du, Jingjing Zhang, Jiting Liu, Yinxinyu Chen, Encheng Gu, Ziyan Liu, Hongyi Cai, Yanwen Zou, et al. Evo-1: Lightweight vision- language-action model with preserved semantic align- ment.arXiv preprint arXiv:2511.04555, 2025

work page arXiv 2025
[24]

Flow Matching for Generative Modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maxim- ilian Nickel, and Matt Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[25]

RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation.arXiv preprint arXiv:2410.07864, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[26]

Bidirectional decoding: Improving action chunking via closed-loop resampling.arXiv preprint arXiv:2408.17355, 2024

Yuejiang Liu, Jubayer Ibn Hamid, Annie Xie, Yoonho Lee, Maximilian Du, and Chelsea Finn. Bidirectional decoding: Improving action chunking via closed-loop resampling.arXiv preprint arXiv:2408.17355, 2024

work page arXiv 2024
[27]

Imitating human behaviour with dif- fusion models.arXiv preprint arXiv:2301.10677, 2023

Tim Pearce, Tabish Rashid, Anssi Kanervisto, Dave Bignell, Mingfei Sun, Raluca Georgescu, Sergio Valcar- cel Macua, Shan Zheng Tan, Ida Momennejad, Katja Hofmann, et al. Imitating human behaviour with dif- fusion models.arXiv preprint arXiv:2301.10677, 2023

work page arXiv 2023
[28]

FAST: Efficient Action Tokenization for Vision-Language-Action Models

Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokeniza- tion for vision-language-action models.arXiv preprint arXiv:2501.09747, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[29]

Ashwini Pokle, Matthew Muckley, Ricky T. Q. Chen, and Brian Karrer. Training-free linear image inverses via flows.Trans. Mach. Learn. Res., 2024, 2023

work page 2024
[30]

Eo-1: Interleaved vision- text-action pretraining for general robot control.arXiv preprint arXiv:2508.21112, 2025

Delin Qu, Haoming Song, Qizhi Chen, Zhaoqing Chen, Xianqiang Gao, Xinyi Ye, Qi Lv, Modi Shi, Guanghui Ren, Cheng Ruan, et al. Eo-1: Interleaved vision- text-action pretraining for general robot control.arXiv preprint arXiv:2508.21112, 2025

work page arXiv 2025
[31]

SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

Mustafa Shukor, Dana Aubakirova, Francesco Ca- puano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, An- dres Marafioti, et al. Smolvla: A vision-language-action model for affordable and efficient robotics.arXiv preprint arXiv:2506.01844, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[32]

Pseudoinverse-guided diffusion models for in- verse problems

Jiaming Song, Arash Vahdat, Morteza Mardani, and Jan Kautz. Pseudoinverse-guided diffusion models for in- verse problems. InInternational Conference on Learning Representations, 2023

work page 2023
[33]

Gemini Robotics: Bringing AI into the Physical World

Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Arenas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzijl, et al. Gemini robotics: Bringing ai into the physical world.arXiv preprint arXiv:2503.20020, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[34]

Octo: An Open-Source Generalist Robot Policy

Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy.arXiv preprint arXiv:2405.12213, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[35]

Vq-vla: Improving vision-language-action models via scaling vector-quantized action tokenizers.arXiv preprint arXiv:2507.01016,

Yating Wang, Haoyi Zhu, Mingyu Liu, Jiange Yang, Hao- Shu Fang, and Tong He. Vq-vla: Improving vision- language-action models via scaling vector-quantized ac- tion tokenizers.ArXiv, abs/2507.01016, 2025

work page arXiv 2025
[36]

dvla: Diffusion vision-language-action model with multimodal chain-of-thought.arXiv preprint arXiv:2509.25681, 2025

Junjie Wen, Minjie Zhu, Jiaming Liu, Zhiyuan Liu, Yicun Yang, Linfeng Zhang, Shanghang Zhang, Yichen Zhu, and Yi Xu. dvla: Diffusion vision-language-action model with multimodal chain-of-thought.arXiv preprint arXiv:2509.25681, 2025

work page arXiv 2025
[37]

Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation

Junjie Wen, Yichen Zhu, Jinming Li, Minjie Zhu, Zhibin Tang, Kun Wu, Zhiyuan Xu, Ning Liu, Ran Cheng, Chaomin Shen, et al. Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation. IEEE Robotics and Automation Letters, 2025

work page 2025
[38]

Llada-vla: Vision language dif- fusion action models.arXiv preprint arXiv:2509.06932, 2025

Yuqing Wen, Hebei Li, Kefan Gu, Yucheng Zhao, Tiancai Wang, and Xiaoyan Sun. Llada-vla: Vision language dif- fusion action models.arXiv preprint arXiv:2509.06932, 2025

work page arXiv 2025
[39]

Twinbrainvla: Un- leashing the potential of generalist vlms for embodied tasks via asymmetric mixture-of-transformers.arXiv preprint arXiv:2601.14133, 2026

Bin Yu, Shijie Lian, Xiaopeng Lin, Yuliang Wei, Zhao- long Shen, Changti Wu, Yuzhuo Miao, Xinming Wang, Bailing Wang, Cong Huang, et al. Twinbrainvla: Un- leashing the potential of generalist vlms for embodied tasks via asymmetric mixture-of-transformers.arXiv preprint arXiv:2601.14133, 2026

work page arXiv 2026
[40]

Point what you mean: Visually grounded instruction policy,

Hang Yu, Juntu Zhao, Yufeng Liu, Kaiyu Li, Cheng Ma, Di Zhang, Yingdong Hu, Guang Chen, Junyuan Xie, Jun- liang Guo, et al. Point what you mean: Visually grounded instruction policy.arXiv preprint arXiv:2512.18933, 2025

work page arXiv 2025
[41]

Dreamvla: a vision-language-action model dreamed with comprehen- sive world knowledge

Juntu Zhao, Wenbo Lu, Di Zhang, Yufeng Liu, Yushen Liang, Tianluo Zhang, Yifeng Cao, Junyuan Xie, Ying- dong Hu, Shengjie Wang, et al. Do you need propri- oceptive states in visuomotor policies?arXiv preprint arXiv:2509.18644, 2025

work page arXiv 2025
[42]

Cot-vla: Visual chain-of-thought reasoning for vision- language-action models.2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1702–1713, 2025

Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, Ankur Handa, Ming-Yu Liu, Donglai Xiang, Gordon Wetzstein, and Tsung-Yi Lin. Cot-vla: Visual chain-of-thought reasoning for vision- language-action models.2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pag...

work page 2025
[43]

Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware.arXiv preprint arXiv:2304.13705, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[44]

3D-VLA: A 3D Vision-Language-Action Generative World Model

Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du, Yining Hong, and Chuang Gan. 3d- vla: A 3d vision-language-action generative world model. arXiv preprint arXiv:2403.09631, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[45]

TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies

Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daum ´e III, Andrey Kolobov, Furong Huang, and Jianwei Yang. Tracevla: Visual trace prompting en- hances spatial-temporal awareness for generalist robotic policies.arXiv preprint arXiv:2412.10345, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[46]

Rt-2: Vision-language- action models transfer web knowledge to robotic control

Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language- action models transfer web knowledge to robotic control. InConference on Robot Learning, pages 2165–2183. PMLR, 2023. APPENDIX A. Task Details We evaluate all methods on five real-world manipulation task...

work page 2023

[1] [1]

Sail: Faster-than-demonstration execution of imitation learning policies.arXiv preprint arXiv:2506.11948, 2025

Nadun Ranawaka Arachchige, Zhenyang Chen, Wonsuhk Jung, Woo Chul Shin, Rohan Bansal, Pierre Barroso, Yu Hang He, Yingyang Celine Lin, Benjamin Joffe, Shreyas Kousik, et al. Sail: Faster-than-demonstration execution of imitation learning policies.arXiv preprint arXiv:2506.11948, 2025

work page arXiv 2025

[2] [2]

On the analysis of movement smoothness.Journal of NeuroEngineering and Rehabilitation, 12, 2015

Sivakumar Balasubramanian, Alejandro Melendez- Calderon, Agn `es Roby-Brami, and Etienne Burdet. On the analysis of movement smoothness.Journal of NeuroEngineering and Rehabilitation, 12, 2015

work page 2015

[3] [3]

A Careful Examination of Large Behavior Models for Multitask Dexterous Manipula- tion

Jose Barreiros, Andrew Beaulieu, Aditya Bhat, Rick Cory, Eric Cousineau, Hongkai Dai, Ching-Hsin Fang, Kunimatsu Hashimoto, Muhammad Zubair Irshad, Masha Itkina, et al. A careful examination of large behavior models for multitask dexterous manipulation. arXiv preprint arXiv:2507.05331, 2025

work page arXiv 2025

[4] [4]

Minivla: A better vla with a smaller footprint, 2024

Suneel Belkhale and Dorsa Sadigh. Minivla: A better vla with a smaller footprint, 2024

work page 2024

[5] [5]

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

Johan Bjorck, Fernando Casta ˜neda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[6] [6]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

Kevin Black, Noah Brown, Danny Driess, Adnan Es- mail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al.π 0: A vision- language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[7] [7]

In9th Annual Conference on Robot Learning, 2025

Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Robert Equi, Chelsea Finn, Niccolo Fusai, Manuel Y Galliker, et al.π 0.5: a vision-language-action model with open- world generalization. In9th Annual Conference on Robot Learning, 2025

work page 2025

[8] [8]

Real-Time Execution of Action Chunking Flow Policies

Kevin Black, Manuel Y Galliker, and Sergey Levine. Real-time execution of action chunking flow policies. arXiv preprint arXiv:2506.07339, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[9] [9]

Training-time action conditioning for efficient real-time chunking.arXiv preprint arXiv:2512.05964, 2025

Kevin Black, Allen Z Ren, Michael Equi, and Sergey Levine. Training-time action conditioning for efficient real-time chunking.arXiv preprint arXiv:2512.05964, 2025

work page arXiv 2025

[10] [10]

Riemannian flow matching policy for robot motion learning

Max Braun, No ´emie Jaquier, Leonel Rozo, and Tamim Asfour. Riemannian flow matching policy for robot motion learning. In2024 IEEE/RSJ International Con- ference on Intelligent Robots and Systems (IROS), pages 5144–5151. IEEE, 2024

work page 2024

[11] [11]

GR-3 Technical Report

Chilam Cheang, Sijin Chen, Zhongren Cui, Yingdong Hu, Liqun Huang, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Xiao Ma, et al. Gr-3 technical report.arXiv preprint arXiv:2507.15493, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[12] [12]

Diffu- sion forcing: Next-token prediction meets full-sequence diffusion.Advances in Neural Information Processing Systems, 37:24081–24125, 2024

Boyuan Chen, Diego Mart ´ı Mons ´o, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffu- sion forcing: Next-token prediction meets full-sequence diffusion.Advances in Neural Information Processing Systems, 37:24081–24125, 2024

work page 2024

[13] [13]

Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots

Cheng Chi, Zhenjia Xu, Chuer Pan, Eric Cousineau, Benjamin Burchfiel, Siyuan Feng, Russ Tedrake, and Shuran Song. Universal manipulation interface: In-the- wild robot teaching without in-the-wild robots.arXiv preprint arXiv:2402.10329, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[14] [14]

Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

work page 2025

[15] [15]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[16] [16]

Denoising diffusion probabilistic models.Advances in neural infor- mation processing systems, 33:6840–6851, 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural infor- mation processing systems, 33:6840–6851, 2020

work page 2020

[17] [17]

Eric Jang, Shixiang Gu, and Ben Poole

Sigmund H Høeg, Yilun Du, and Olav Egeland. Streaming diffusion policy: Fast policy synthesis with variable noise diffusion models.arXiv preprint arXiv:2406.04806, 2024

work page arXiv 2024

[18] [18]

Rolling diffusion policy for robotic action prediction: Enhancing efficiency and temporal awareness

Chanhyuk Jung, Dasom Ahn, Sangwon Kim, In-su Jang, Kwang-Ju Kim, Sungkeun Yoo, and Byoung Chul Ko. Rolling diffusion policy for robotic action prediction: Enhancing efficiency and temporal awareness. InICRA 2025 Workshop on Foundation Models and Neuro- Symbolic AI for Robotics, 2025

work page 2025

[19] [19]

OpenVLA: An Open-Source Vision-Language-Action Model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[20] [20]

Action chunking as policy compression.PsyArXiv, 2022

Lucy Lai, Ann Zixiang Huang, and Samuel J Gershman. Action chunking as policy compression.PsyArXiv, 2022

work page 2022

[21] [21]

Discrete diffu- sion vla: Bringing discrete diffusion to action decod- ing in vision-language-action policies.arXiv preprint arXiv:2508.20072, 2025

Zhixuan Liang, Yizhuo Li, Tianshuo Yang, Chengyue Wu, Sitong Mao, Tian Nian, Liuao Pei, Shunbo Zhou, Xiaokang Yang, Jiangmiao Pang, et al. Discrete dif- fusion vla: Bringing discrete diffusion to action decod- ing in vision-language-action policies.arXiv preprint arXiv:2508.20072, 2025

work page arXiv 2025

[22] [22]

Onetwovla: A unified vision-language-action model with adaptive reasoning,

Fanqi Lin, Ruiqian Nai, Yingdong Hu, Jiacheng You, Junming Zhao, and Yang Gao. Onetwovla: A unified vision-language-action model with adaptive reasoning. ArXiv, abs/2505.11917, 2025

work page arXiv 2025

[23] [23]

Evo-1: Lightweight vision- language-action model with preserved semantic align- ment.arXiv preprint arXiv:2511.04555, 2025

Tao Lin, Yilei Zhong, Yuxin Du, Jingjing Zhang, Jiting Liu, Yinxinyu Chen, Encheng Gu, Ziyan Liu, Hongyi Cai, Yanwen Zou, et al. Evo-1: Lightweight vision- language-action model with preserved semantic align- ment.arXiv preprint arXiv:2511.04555, 2025

work page arXiv 2025

[24] [24]

Flow Matching for Generative Modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maxim- ilian Nickel, and Matt Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[25] [25]

RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation.arXiv preprint arXiv:2410.07864, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[26] [26]

Bidirectional decoding: Improving action chunking via closed-loop resampling.arXiv preprint arXiv:2408.17355, 2024

Yuejiang Liu, Jubayer Ibn Hamid, Annie Xie, Yoonho Lee, Maximilian Du, and Chelsea Finn. Bidirectional decoding: Improving action chunking via closed-loop resampling.arXiv preprint arXiv:2408.17355, 2024

work page arXiv 2024

[27] [27]

Imitating human behaviour with dif- fusion models.arXiv preprint arXiv:2301.10677, 2023

Tim Pearce, Tabish Rashid, Anssi Kanervisto, Dave Bignell, Mingfei Sun, Raluca Georgescu, Sergio Valcar- cel Macua, Shan Zheng Tan, Ida Momennejad, Katja Hofmann, et al. Imitating human behaviour with dif- fusion models.arXiv preprint arXiv:2301.10677, 2023

work page arXiv 2023

[28] [28]

FAST: Efficient Action Tokenization for Vision-Language-Action Models

Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokeniza- tion for vision-language-action models.arXiv preprint arXiv:2501.09747, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[29] [29]

Ashwini Pokle, Matthew Muckley, Ricky T. Q. Chen, and Brian Karrer. Training-free linear image inverses via flows.Trans. Mach. Learn. Res., 2024, 2023

work page 2024

[30] [30]

Eo-1: Interleaved vision- text-action pretraining for general robot control.arXiv preprint arXiv:2508.21112, 2025

Delin Qu, Haoming Song, Qizhi Chen, Zhaoqing Chen, Xianqiang Gao, Xinyi Ye, Qi Lv, Modi Shi, Guanghui Ren, Cheng Ruan, et al. Eo-1: Interleaved vision- text-action pretraining for general robot control.arXiv preprint arXiv:2508.21112, 2025

work page arXiv 2025

[31] [31]

SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

Mustafa Shukor, Dana Aubakirova, Francesco Ca- puano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, An- dres Marafioti, et al. Smolvla: A vision-language-action model for affordable and efficient robotics.arXiv preprint arXiv:2506.01844, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[32] [32]

Pseudoinverse-guided diffusion models for in- verse problems

Jiaming Song, Arash Vahdat, Morteza Mardani, and Jan Kautz. Pseudoinverse-guided diffusion models for in- verse problems. InInternational Conference on Learning Representations, 2023

work page 2023

[33] [33]

Gemini Robotics: Bringing AI into the Physical World

Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Arenas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzijl, et al. Gemini robotics: Bringing ai into the physical world.arXiv preprint arXiv:2503.20020, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[34] [34]

Octo: An Open-Source Generalist Robot Policy

Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy.arXiv preprint arXiv:2405.12213, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[35] [35]

Vq-vla: Improving vision-language-action models via scaling vector-quantized action tokenizers.arXiv preprint arXiv:2507.01016,

Yating Wang, Haoyi Zhu, Mingyu Liu, Jiange Yang, Hao- Shu Fang, and Tong He. Vq-vla: Improving vision- language-action models via scaling vector-quantized ac- tion tokenizers.ArXiv, abs/2507.01016, 2025

work page arXiv 2025

[36] [36]

dvla: Diffusion vision-language-action model with multimodal chain-of-thought.arXiv preprint arXiv:2509.25681, 2025

Junjie Wen, Minjie Zhu, Jiaming Liu, Zhiyuan Liu, Yicun Yang, Linfeng Zhang, Shanghang Zhang, Yichen Zhu, and Yi Xu. dvla: Diffusion vision-language-action model with multimodal chain-of-thought.arXiv preprint arXiv:2509.25681, 2025

work page arXiv 2025

[37] [37]

Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation

Junjie Wen, Yichen Zhu, Jinming Li, Minjie Zhu, Zhibin Tang, Kun Wu, Zhiyuan Xu, Ning Liu, Ran Cheng, Chaomin Shen, et al. Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation. IEEE Robotics and Automation Letters, 2025

work page 2025

[38] [38]

Llada-vla: Vision language dif- fusion action models.arXiv preprint arXiv:2509.06932, 2025

Yuqing Wen, Hebei Li, Kefan Gu, Yucheng Zhao, Tiancai Wang, and Xiaoyan Sun. Llada-vla: Vision language dif- fusion action models.arXiv preprint arXiv:2509.06932, 2025

work page arXiv 2025

[39] [39]

Twinbrainvla: Un- leashing the potential of generalist vlms for embodied tasks via asymmetric mixture-of-transformers.arXiv preprint arXiv:2601.14133, 2026

Bin Yu, Shijie Lian, Xiaopeng Lin, Yuliang Wei, Zhao- long Shen, Changti Wu, Yuzhuo Miao, Xinming Wang, Bailing Wang, Cong Huang, et al. Twinbrainvla: Un- leashing the potential of generalist vlms for embodied tasks via asymmetric mixture-of-transformers.arXiv preprint arXiv:2601.14133, 2026

work page arXiv 2026

[40] [40]

Point what you mean: Visually grounded instruction policy,

Hang Yu, Juntu Zhao, Yufeng Liu, Kaiyu Li, Cheng Ma, Di Zhang, Yingdong Hu, Guang Chen, Junyuan Xie, Jun- liang Guo, et al. Point what you mean: Visually grounded instruction policy.arXiv preprint arXiv:2512.18933, 2025

work page arXiv 2025

[41] [41]

Dreamvla: a vision-language-action model dreamed with comprehen- sive world knowledge

Juntu Zhao, Wenbo Lu, Di Zhang, Yufeng Liu, Yushen Liang, Tianluo Zhang, Yifeng Cao, Junyuan Xie, Ying- dong Hu, Shengjie Wang, et al. Do you need propri- oceptive states in visuomotor policies?arXiv preprint arXiv:2509.18644, 2025

work page arXiv 2025

[42] [42]

Cot-vla: Visual chain-of-thought reasoning for vision- language-action models.2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1702–1713, 2025

Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, Ankur Handa, Ming-Yu Liu, Donglai Xiang, Gordon Wetzstein, and Tsung-Yi Lin. Cot-vla: Visual chain-of-thought reasoning for vision- language-action models.2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pag...

work page 2025

[43] [43]

Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware.arXiv preprint arXiv:2304.13705, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[44] [44]

3D-VLA: A 3D Vision-Language-Action Generative World Model

Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du, Yining Hong, and Chuang Gan. 3d- vla: A 3d vision-language-action generative world model. arXiv preprint arXiv:2403.09631, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[45] [45]

TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies

Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daum ´e III, Andrey Kolobov, Furong Huang, and Jianwei Yang. Tracevla: Visual trace prompting en- hances spatial-temporal awareness for generalist robotic policies.arXiv preprint arXiv:2412.10345, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[46] [46]

Rt-2: Vision-language- action models transfer web knowledge to robotic control

Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language- action models transfer web knowledge to robotic control. InConference on Robot Learning, pages 2165–2183. PMLR, 2023. APPENDIX A. Task Details We evaluate all methods on five real-world manipulation task...

work page 2023