arxiv: 2605.13778 · v1 · submitted 2026-05-13 · 💻 cs.RO · cs.CV

Recognition: unknown

Realtime-VLA FLASH: Speculative Inference Framework for Diffusion-based VLAs

Jiahui Niu , Kefan Gu , Yucheng Zhao , Shengwen Liang , Tiancai Wang , Xing Hu , Ying Wang , Huawei Li

Authors on Pith no claims yet

Pith reviewed 2026-05-14 17:48 UTC · model grok-4.3

classification 💻 cs.RO cs.CV

keywords speculative inferencediffusion VLAreal-time roboticslatency reductionLIBERO benchmarkdraft model verificationembodied action generation

0 comments

The pith

A lightweight draft model with parallel verification lets diffusion VLAs replan actions at 19.1 ms average latency instead of 58 ms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Realtime-VLA FLASH, a framework that runs most replanning steps through a fast draft model rather than the full diffusion pipeline. The main model's Action Expert checks the draft outputs in parallel, and only falls back to full inference when the check fails. On the LIBERO benchmark this substitution cuts task-level latency by a factor of three while leaving success rates nearly unchanged, and the same pipeline works on a physical conveyor-belt sorting task. A reader would care because diffusion VLAs are accurate but too slow for real-time robot control; removing most of the expensive steps makes them practical.

Core claim

By generating candidate actions from a lightweight draft model and verifying them in parallel with the main model's Action Expert, most full diffusion inference rounds can be skipped. A phase-aware fallback restores the full pipeline when verification fails. The resulting system replaces many 58 ms full inferences with 7.8 ms speculative rounds, lowering average task latency to 19.1 ms on LIBERO while preserving task performance and demonstrating the same benefit on real-world conveyor sorting.

What carries the argument

Speculative inference pipeline: lightweight draft model plus parallel Action Expert verification and phase-aware fallback to full diffusion.

If this is right

Average inference latency on LIBERO drops from 58 ms to 19.1 ms per replanning step.
Task success rates remain essentially unchanged across the benchmark suites.
Speculative rounds run as fast as 7.8 ms, enabling higher-frequency replanning.
The same pipeline transfers to real-world conveyor-belt sorting without retraining.
The method applies to any diffusion-based VLA that exposes an Action Expert for verification.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same draft-plus-verification pattern could be applied to other slow generative policies in robotics, such as autoregressive transformers.
Combining FLASH with model quantization or caching might push latency even lower on edge hardware.
Frequent low-latency replanning could reduce the need for separate motion planners in dynamic scenes.

Load-bearing premise

The draft model produces outputs close enough to the main model that full inference is triggered infrequently enough to yield net speedup without hurting task success.

What would settle it

A test run in which the draft model disagrees with the Action Expert on more than half the steps, causing fallback frequency to rise and measured latency to stay near 58 ms or task success to drop.

Figures

Figures reproduced from arXiv: 2605.13778 by Huawei Li, Jiahui Niu, Kefan Gu, Shengwen Liang, Tiancai Wang, Xing Hu, Ying Wang, Yucheng Zhao.

**Figure 1.** Figure 1: Overview of Realtime-VLA FLASH. (a) Standard synchronous dVLA inference runs the full pipeline at each replanning round, causing stale action updates and failure in latency-critical manipulation. (b) Realtime-VLA FLASH accelerates replanning, allowing the robot to react in time and complete the grasp. (c) FLASH introduces speculative inference for dVLAs: a lightweight draft model proposes a continuous acti… view at source ↗

**Figure 2.** Figure 2: Roofline analysis of π0 [1] inference on NVIDIA RTX 4090D. 145.79 is the balanced arithmetic intensity point. We further profile these three stages on an NVIDIA RTX 4090D [11] and summarize the results with a roofline analysis in [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Realtime-VLA FLASH Framework. (a) Realtime-VLA FLASH uses two inference paths: the original full path (Full Path) and a lightweight speculative path (Flash Path). The Full Path performs Image Encoder, VLM prefill, and Action Denoise, while the flash path still encodes the current images but skips VLM prefill to draft a candidate action chunk for verification. (b) Draft model architecture. The draft model u… view at source ↗

**Figure 4.** Figure 4: Parallel verification. (a) π0 [1] generates an action chunk with flow matching through 10- step sequential denoising. (b) Realtime-VLA FLASH reconstructs endpoints from selected denoising timesteps in parallel and checks an action-by-action distance threshold within the chunk, yielding the accepted prefix. (c) If reconstructed endpoints deviate beyond the threshold, no prefix is accepted and the flash path… view at source ↗

**Figure 5.** Figure 5: Phase-aware fallback on a LIBERO-Spatial task trajectory. 3D trajectories for a bowl-to-plate task. (a) Without phase-aware fallback, the flash path drifts during final placement and fails near the plate edge. (b) With fallback, final placement returns to the full path and succeeds. 6 [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: Key-frame visualization of [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

**Figure 7.** Figure 7: Conveyor-belt sorting demo [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗

**Figure 8.** Figure 8: Real-world robot platform used for conveyor-belt sorting. [PITH_FULL_IMAGE:figures/full_fig_p015_8.png] view at source ↗

read the original abstract

Diffusion-based vision-language-action models (dVLAs) are promising for embodied intelligence but are fundamentally limited in real-time deployment by the high latency of full inference. We propose Realtime-VLA FLASH, a speculative inference framework that eliminates most full inference calls during replanning by introducing a lightweight draft model with parallel verification via the main model's Action Expert and a phase-aware fallback mechanism that reverts to the full inference pipeline when needed. This design enables low-latency, high-frequency replanning without sacrificing reliability. Experiments show that on LIBERO, FLASH largely preserves task performance by replacing many 58.0 ms full-inference rounds with speculative rounds as fast as 7.8 ms, lowering task-level average inference latency to 19.1 ms (3.04x speedup). We additionally demonstrate effectiveness on real-world conveyor-belt sorting, highlighting its practical impact for latency-critical embodied tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

FLASH adapts speculative decoding to diffusion VLAs via draft model plus Action Expert verification and phase-aware fallback, claiming a clean 3x latency drop on LIBERO and real hardware, but the acceptance-rate data needed to back the net speedup is missing from the abstract.

read the letter

The core contribution is a practical speculative inference pipeline for diffusion-based VLAs. It runs a lightweight draft model to propose actions, verifies them in parallel with the main model's Action Expert, and only triggers full inference on phase-aware fallback. On LIBERO this replaces most 58 ms rounds with 7.8 ms speculative ones, bringing average task latency to 19.1 ms for a reported 3.04x speedup, and the same pattern is shown on a real conveyor-belt sorting task. That engineering pattern is the new piece; prior speculative work exists for language models, but the specific draft-plus-parallel-verification-plus-phase-fallback combination for diffusion VLAs is not standard in the literature I know. The real-world demo is also a plus, since most VLA papers stay in simulation. The numbers are direct wall-clock measurements rather than fitted parameters, which keeps the claim grounded. The main gap is the missing breakdown on acceptance frequency and fallback rate. The speedup only materializes if the draft is accepted most of the time and undetected errors do not accumulate into task failures. The abstract gives no per-task acceptance statistics, no error bars, and no ablation on how often phase-aware fallback actually fires. Without those, it is hard to judge whether the preserved success rates are robust or just lucky on the tested tasks. Minor issues include the lack of detailed baseline comparisons beyond the full-inference case and no mention of statistical tests. This paper is aimed at robotics practitioners who need to run diffusion VLAs at control rates on physical hardware. A reader working on deployment latency would get immediate value from the framework description and the reported timings. It is worth sending to peer review because the problem is real, the method is straightforward to reproduce, and the experiments already include hardware results. A referee can ask for the acceptance-rate tables and still find the work useful even if revisions are needed.

Circularity Check

0 steps flagged

No circularity: empirical latency claims rest on direct wall-clock measurements

full rationale

The paper proposes a speculative inference framework and validates it through direct experimental timing on LIBERO tasks. Reported figures (58.0 ms full inference, 7.8 ms speculative rounds, 19.1 ms average, 3.04x speedup) are obtained from wall-clock measurements of the implemented pipeline versus baseline, not from any equation that reduces a prediction to a fitted input or self-referential definition. No derivation chain, uniqueness theorem, or ansatz is invoked that collapses to prior self-citation or renaming of known results. The central performance claim is therefore independently falsifiable by re-running the timing experiments on the same benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so no explicit free parameters, axioms, or invented entities are stated; the approach implicitly assumes the draft model produces sufficiently accurate proposals most of the time.

pith-pipeline@v0.9.0 · 5473 in / 1100 out tokens · 33813 ms · 2026-05-14T17:48:44.641499+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

35 extracted references · 28 canonical work pages · 13 internal anchors

[1]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control.ArXiv, abs/2410.24164, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

Y ., and Levine, S

Kevin Black, Manuel Y Galliker, and Sergey Levine. Real-time execution of action chunking flow policies.arXiv preprint arXiv:2506.07339, 2025

work page arXiv 2025
[3]

arXiv preprint arXiv:2512.05964 (2025)

Kevin Black, Allen Z Ren, Michael Equi, and Sergey Levine. Training-time action conditioning for efficient real-time chunking.arXiv preprint arXiv:2512.05964, 2025

work page arXiv 2025
[4]

Mean-flow based one-step vision-language-action

Yang Chen, Xiaoguang Ma, and Bin Zhao. Mean-flow based one-step vision-language-action. arXiv preprint arXiv:2603.01469, 2026

work page arXiv 2026
[5]

Diffusion policy: Visuomotor policy learning via action diffusion

Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, page 02783649241273668, 2023

2023
[6]

Accelerated diffusion models via speculative sampling.arXiv preprint arXiv:2501.05370, 2025

Valentin De Bortoli, Alexandre Galashov, Arthur Gretton, and Arnaud Doucet. Accelerated diffusion models via speculative sampling.arXiv preprint arXiv:2501.05370, 2025

work page arXiv 2025
[7]

One Step Diffusion via Shortcut Models

Kevin Frans, Danijar Hafner, Sergey Levine, and Pieter Abbeel. One step diffusion via shortcut models.arXiv preprint arXiv:2410.12557, 2024

work page internal anchor Pith review arXiv 2024
[8]

Mean Flows for One-step Generative Modeling

Zhengyang Geng, Mingyang Deng, Xingjian Bai, J Zico Kolter, and Kaiming He. Mean flows for one-step generative modeling.arXiv preprint arXiv:2505.13447, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

Diffusion models are secretly exchangeable: Parallelizing ddpms via autospeculation.arXiv preprint arXiv:2505.03983, 2025

Hengyuan Hu, Aniket Das, Dorsa Sadigh, and Nima Anari. Diffusion models are secretly exchangeable: Parallelizing ddpms via autospeculation.arXiv preprint arXiv:2505.03983, 2025

work page arXiv 2025
[10]

Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y . Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsc...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

How fast can i run my vla? demystifying vla inference performance with vla-perf.arXiv preprint arXiv:2602.18397, 2026

Wenqi Jiang, Jason Clemons, Karu Sankaralingam, and Christos Kozyrakis. How fast can i run my vla? demystifying vla inference performance with vla-perf.arXiv preprint arXiv:2602.18397, 2026

work page arXiv 2026
[12]

EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty

Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle: Speculative sampling requires rethinking feature uncertainty.arXiv preprint arXiv:2401.15077, 2024

work page internal anchor Pith review arXiv 2024
[13]

Eagle-2: Faster inference of language models with dynamic draft trees

Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle-2: Faster inference of language models with dynamic draft trees. InProceedings of the 2024 conference on empirical methods in natural language processing, pages 7421–7432, 2024

2024
[14]

arXiv preprint arXiv:2503.01840 (2025) 5 16 Z

Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle-3: Scaling up inference acceleration of large language models via training-time test.arXiv preprint arXiv:2503.01840, 2025

work page arXiv 2025
[15]

arXiv preprint arXiv:2511.04555 (2025)

Tao Lin, Yilei Zhong, Yuxin Du, Jingjing Zhang, Jiting Liu, Yinxinyu Chen, Encheng Gu, Ziyan Liu, Hongyi Cai, Yanwen Zou, et al. Evo-1: Lightweight vision-language-action model with preserved semantic alignment.arXiv preprint arXiv:2511.04555, 2025. 10

work page arXiv 2025
[16]

Flow Matching for Generative Modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[17]

Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776–44791, 2023

Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776–44791, 2023

2023
[18]

arXiv preprint arXiv:2602.03310 (2026)

Songming Liu, Bangguo Li, Kai Ma, Lingxuan Wu, Hengkai Tan, Xiao Ouyang, Hang Su, and Jun Zhu. Rdt2: Exploring the scaling limit of umi data towards zero-shot cross-embodiment generalization.arXiv preprint arXiv:2602.03310, 2026

work page arXiv 2026
[19]

arXiv preprint arXiv:2510.26742 (2025)

Yunchao Ma, Yizhuang Zhou, Yunhuan Yang, Tiancai Wang, and Haoqiang Fan. Running vlas at real-time speed.arXiv preprint arXiv:2510.26742, 2025

work page arXiv 2025
[20]

GR00T N1: An open foundation model for generalist humanoid robots

NVIDIA, Johan Bjorck, Nikita Cherniadev Fernando Castañeda, Xingye Da, Runyu Ding, Linxi "Jim" Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, Joel Jang, Zhenyu Jiang, Jan Kautz, Kaushil Kundalia, Lawrence Lao, Zhiqi Li, Zongyu Lin, Kevin Lin, Guilin Liu, Edith Llontop, Loic Magne, Ajay Mandlekar, Avnish Narayan, Soroush Nasiriany, Scott Reed, You L...

2025
[21]

Gr00t n1.6: An im- proved open foundation model for generalist humanoid robots

NVIDIA GEAR Team, Allison Azzolini, Johan Bjorck, Valts Blukis, et al. Gr00t n1.6: An im- proved open foundation model for generalist humanoid robots. https://research.nvidia. com/labs/gear/gr00t-n1_6/, December 2025

2025
[22]

SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, Andres Marafioti, et al. Smolvla: A vision-language-action model for affordable and efficient robotics.arXiv preprint arXiv:2506.01844, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[23]

Gemma: Open Models Based on Gemini Research and Technology

Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on gemini research and technology.arXiv preprint arXiv:2403.08295, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[24]

Gemma 2: Improving Open Language Models at a Practical Size

Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size.arXiv preprint arXiv:2408.00118, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[25]

arXiv preprint arXiv:2509.05614 (2025)

Hanzhen Wang, Jiaming Xu, Yushun Xiang, Jiayi Pan, Yongkang Zhou, Yong-Lu Li, and Guohao Dai. Specprune-vla: Accelerating vision-language-action models via action-aware self-speculative pruning.arXiv preprint arXiv:2509.05614, 2025

work page arXiv 2025
[26]

Spec-vla: speculative decoding for vision-language-action models with relaxed accep- tance

Songsheng Wang, Rucheng Yu, Zhihang Yuan, Chao Yu, Feng Gao, Yu Wang, and Derek F Wong. Spec-vla: speculative decoding for vision-language-action models with relaxed accep- tance. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 26916–26928, 2025

2025
[27]

One-step diffusion policy: Fast visuomotor policies via diffusion distillation.arXiv preprint arXiv:2410.21257, 2024

Zhendong Wang, Zhaoshuo Li, Ajay Mandlekar, Zhenjia Xu, Jiaojiao Fan, Yashraj Narang, Linxi Fan, Yuke Zhu, Yogesh Balaji, Mingyuan Zhou, et al. One-step diffusion policy: Fast visuomotor policies via diffusion distillation.arXiv preprint arXiv:2410.21257, 2024

work page arXiv 2024
[28]

Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation.IEEE Robotics and Automation Letters, 2025

Junjie Wen, Yichen Zhu, Jinming Li, Minjie Zhu, Zhibin Tang, Kun Wu, Zhiyuan Xu, Ning Liu, Ran Cheng, Chaomin Shen, et al. Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation.IEEE Robotics and Automation Letters, 2025

2025
[29]

arXiv preprint arXiv:2511.05642 (2025)

Justin Williams, Kishor Datta Gupta, Roy George, and Mrinmoy Sarkar. Lite vla: Efficient vision-language-action control on cpu-bound edge robots.arXiv preprint arXiv:2511.05642, 2025. 11

work page arXiv 2025
[30]

Realtime- vla v2: Learning to run vlas fast, smooth, and accurate.arXiv preprint arXiv:2603.26360, 2026

Chen Yang, Yucheng Hu, Yunchao Ma, Yunhuan Yang, Jing Tan, and Haoqiang Fan. Realtime- vla v2: Learning to run vlas fast, smooth, and accurate.arXiv preprint arXiv:2603.26360, 2026

work page arXiv 2026
[31]

Effi- cientvla: Training-free acceleration and compression for vision-language-action models.arXiv preprint arXiv:2506.10100,

Yantai Yang, Yuhao Wang, Zichen Wen, Luo Zhongwei, Chang Zou, Zhipeng Zhang, Chuan Wen, and Linfeng Zhang. Efficientvla: Training-free acceleration and compression for vision- language-action models.arXiv preprint arXiv:2506.10100, 2025

work page arXiv 2025
[32]

QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models

Jingxuan Zhang, Yunta Hsieh, Zhongwei Wang, Haokun Lin, Xin Wang, Ziqi Wang, Yingtie Lei, and Mi Zhang. Quantvla: Scale-calibrated post-training quantization for vision-language-action models.arXiv preprint arXiv:2602.20309, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[33]

Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware.arXiv preprint arXiv:2304.13705, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[34]

KERV: Kinematic-Rectified Speculative Decoding for Embodied VLA Models

Zihao Zheng, Zhihao Mao, Maoliang Li, Jiayu Chen, Xinhao Sun, Zhaobo Zhang, Donggang Cao, Hong Mei, and Xiang Chen. Kerv: Kinematic-rectified speculative decoding for embodied vla models.arXiv preprint arXiv:2603.01581, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[35]

HeiSD: Hybrid Speculative Decoding for Embodied Vision-Language-Action Models with Kinematic Awareness

Zihao Zheng, Zhihao Mao, Sicheng Tian, Maoliang Li, Jiayu Chen, Xinhao Sun, Zhaobo Zhang, Xuanzhe Liu, Donggang Cao, Hong Mei, et al. Heisd: Hybrid speculative decoding for embodied vision-language-action models with kinematic awareness.arXiv preprint arXiv:2603.17573, 2026. A Draft Model Details A.1 Architecture This section expands the draft architectur...

work page internal anchor Pith review Pith/arXiv arXiv 2026