pith. machine review for the scientific record.

arxiv: 2604.25050 · v1 · submitted 2026-04-27 · 💻 cs.RO

Recognition: unknown

DiscreteRTC: Discrete Diffusion Policies are Natural Asynchronous Executors

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 02:27 UTC · model grok-4.3

classification 💻 cs.RO
keywords: discrete diffusion · asynchronous execution · real-time chunking · robotic action policies · action inpainting · diffusion models · dynamic manipulation

The pith

Discrete diffusion policies act as natural asynchronous executors because iterative unmasking makes inpainting native to them.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Physical robots must generate actions while the environment keeps changing, so any pause between action chunks risks failure on dynamic tasks. Asynchronous execution solves this by committing to an initial chunk and then inpainting the rest as the robot moves. Continuous flow-matching policies support inpainting only through extra inference-time corrections that demand fine-tuning and added computation. Discrete diffusion policies instead generate by repeatedly unmasking tokens, so freezing committed actions and continuing to unmask the open ones is simply the model's normal behavior. This native support removes the need for fine-tuning, enables early stopping for cheaper adaptive guidance, and yields higher success rates on simulated and real dynamic manipulation tasks.

Core claim

Discrete diffusion policies generate actions by iteratively unmasking, which makes real-time chunking their native operation rather than an added correction. Freezing already-committed action chunks and unmasking the remainder produces consistent continuations without any task-specific fine-tuning or external guidance modules. Early stopping during the unmasking process supplies adaptive guidance at lower inference cost than generating a full new sequence from scratch.
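To make the mechanism concrete, here is a minimal sketch of one asynchronous inference cycle, assuming a confidence-ranked unmasking schedule and a stand-in denoiser; the names (`inpaint_chunk`, `MASK`, the token layout) are illustrative, not the paper's implementation. The frozen prefix is never resampled, so inpainting is literally the ordinary unmasking loop with some tokens pre-filled.

```python
import numpy as np

MASK = -1      # sentinel id for a masked action token (illustrative)
VOCAB = 256    # action-token vocabulary size (illustrative)
H = 16         # tokens per action chunk (illustrative)

def denoiser(tokens, rng):
    """Stand-in for the policy network: per-token logits.
    A real policy would condition on observations; random logits
    keep the sketch self-contained and runnable."""
    return rng.standard_normal((len(tokens), VOCAB))

def inpaint_chunk(committed_prefix, rng, early_stop=None, per_step=2):
    """One asynchronous inference cycle: freeze the committed prefix,
    iteratively unmask the open suffix. With early_stop set, tokens at
    or beyond that index may stay masked until the next cycle."""
    tokens = np.full(H, MASK, dtype=np.int64)
    c = len(committed_prefix)
    tokens[:c] = committed_prefix            # frozen, never resampled
    boundary = H if early_stop is None else early_stop
    steps = 0
    while any(tokens[i] == MASK for i in range(c, boundary)):
        logits = denoiser(tokens, rng)
        conf = logits.max(axis=1)
        masked = [i for i in range(c, H) if tokens[i] == MASK]
        # unmask the most confident still-masked positions
        for i in sorted(masked, key=lambda i: -conf[i])[:per_step]:
            tokens[i] = int(logits[i].argmax())
        steps += 1
    return tokens, steps

rng = np.random.default_rng(0)
prev_chunk = rng.integers(0, VOCAB, size=H)
# Commit the tail of the previous chunk, then inpaint around it.
full, k_full = inpaint_chunk(prev_chunk[-4:], rng)
part, k_early = inpaint_chunk(prev_chunk[-4:], rng, early_stop=12)
print(k_early, "unmasking steps with early stop vs", k_full, "without")
```

Because the frozen tokens enter the denoiser as ordinary context, inpainting reuses the exact generation loop, which is the sense in which the paper can claim zero extra code for the inpainting step.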

What carries the argument

Iterative unmasking process in discrete diffusion policies, which directly supports inpainting of open action chunks while executing committed ones.

If this is right

  • Implementation requires zero extra code for the inpainting step because it reuses the existing unmasking loop.
  • Inference cost falls to roughly 0.7 times the cost of generating a full action sequence from scratch (a toy decomposition of this ratio follows the list).
  • Real-world dynamic pick success rate rises by about 50 percent relative to flow-matching-based real-time chunking.
  • Early stopping during unmasking supplies adaptive guidance without separate heuristic modules.
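The sketch below is our back-of-envelope reading of where a ~0.7x figure could come from, assuming compute scales with the number of tokens the unmasking loop must resolve before the next inference cycle. The paper reports 0.7x as an empirical measurement; the chunk length, prefix depth, and boundary below are invented for illustration.

```python
H = 16          # tokens per action chunk (illustrative)
committed = 2   # frozen prefix copied from the previous chunk (illustrative)
boundary = 13   # early-stop boundary: tokens >= 13 wait for the next cycle

tokens_from_scratch = H                    # full generation resolves all H tokens
tokens_async = boundary - committed        # inpainting resolves only the open, pre-boundary ones
print(tokens_async / tokens_from_scratch)  # 11/16 ≈ 0.69
```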

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same native inpainting property could let discrete diffusion policies handle variable-length action horizons on the fly without retraining.
  • Because inpainting is built in, the approach may transfer more readily to new robot hardware or sensor suites than methods that rely on post-training corrections.
  • Combining early-stopping guidance with online data collection might further close the gap between offline training and live deployment.

Load-bearing premise

A discrete diffusion policy trained on standard offline data will keep producing high-quality, consistent continuations when inpainting around frozen committed action chunks in changing environments.
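The rebuttal below argues this premise is plausible a priori: the masked discrete-diffusion training objective corrupts random token subsets, and "committed prefix + masked suffix" is one such subset. A minimal sketch of that masking step, assuming a uniform mask-rate schedule (the schedule choice is ours, not the paper's):

```python
import numpy as np

MASK = -1  # sentinel id, as in the inpainting sketch above

def random_mask(actions, rng):
    """Training-time corruption: mask a random subset of action tokens.
    The loss is computed only at masked positions."""
    rate = rng.uniform(0.0, 1.0)             # sampled mask ratio
    mask = rng.random(actions.shape) < rate
    return np.where(mask, MASK, actions), mask

rng = np.random.default_rng(0)
actions = rng.integers(0, 256, size=16)
corrupted, mask = random_mask(actions, rng)
# A prefix-frozen inference query is just the pattern mask[:c] == all False,
# so the policy has seen (a superset of) this conditioning during training.
print(corrupted)
```

What the sketch cannot show is whether continuation quality holds when the world state shifts between inference cycles; that is exactly what the experiment below would test.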

What would settle it

A controlled experiment testing whether success rates or action smoothness degrade when DiscreteRTC switches from synchronous to asynchronous mode on the same dynamic pick or manipulation tasks: a sharp drop would contradict the claim that inpainting around frozen chunks is native and consistent, while parity or improvement would support it.
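A minimal harness for that experiment, under loud assumptions: `run_episode` is a hypothetical stand-in whose base success probabilities are invented, and a real study would plug in the DiscreteRTC policy with the Kinetix or UR5e setups described in the figures.

```python
import random

def run_episode(mode: str, seed: int) -> bool:
    """Hypothetical episode runner; returns True on task success.
    The base rates below are placeholders, not results from the paper."""
    rng = random.Random(f"{mode}-{seed}")
    base = {"sync": 0.55, "async": 0.80}[mode]
    return rng.random() < base

def success_rate(mode: str, trials: int = 2048) -> float:
    # 2048 trials per data point mirrors the Kinetix evaluation scale
    return sum(run_episode(mode, s) for s in range(trials)) / trials

for mode in ("sync", "async"):
    print(mode, round(success_rate(mode), 3))
```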

Figures

Figures reproduced from arXiv: 2604.25050 by Chenfeng Xu, Chensheng Peng, Chen Tang, Kaiwen Hong, Katherine Driggs-Campbell, Masayoshi Tomizuka, Pengcheng Wang.

Figure 1. Async Execution with discrete diffusion policies solving dynamic manipulation. Gray rectangles and blocks represent the action chunks and the actions. Yellow and green cubes represent the masked and unmasked action tokens. During each inference cycle, discrete diffusion policies copy the tail of the last action chunk as the committed prefix and inpaint upon it by simply forwarding itself. Compared with flow…

Figure 2. RTC with flow-matching head. Color represents the noise level, where green stands for the clear action and yellow stands for pure noise. The flow-matching head is ill-suited for RTC because (a) during pre-training, the base policy is not trained on inpainting tasks; (b) to acquire this capability, a specially designed fine-tuning stage is required; (c) at inference time, RTC relies on heuristic guidance…

Figure 3. RTC with discrete diffusion head. Color represents the masking status, where green stands for the unmasked token and yellow stands for the masked token. The discrete diffusion head is naturally suited for RTC because (a) during pre-training, the base policy is already trained on inpainting tasks; (b) consequently, no inpainting-specific fine-tuning is required; (c) at inference time, early stopping from the…

Figure 4. Experimental Results in Kinetix. Throughput is the number of tasks completed by the policy every 256 steps. Left: average solve rate and throughput across all environments with different inference delays; Right: solve rates for each task with different inference delays. The execution horizon follows s = max(1, d) and each data point represents 2048 trials. To ensure valid execution before the next inference…

Figure 5. Extended Experimental Results in Kinetix. Left: required iterative steps for each inpainting inference of different policy architectures in Kinetix with s = max(1, d); Right: average solve rates of extended variants in Kinetix. The evaluation setup is the same as…

Figure 6. Unmasking Trajectory Sample with Natural Schedule Inference. Green blocks denote unmasked action tokens, yellow blocks denote masked tokens, and the red rectangle marks the early-stop boundary beyond which tokens do not need to be unmasked before the next inference. In practice, the natural schedule works as expected compared with the simple hard-mask approach. In this section, we show how the inappropriate…

Figure 7. Detailed Main Results in Kinetix. The evaluation setup is the same as…

Figure 8. Fine-tuning Ablation in Kinetix. The evaluation setup is the same as…

Figure 9. Dynamic Pick-and-Place Real-world Setup, Hardware and Data. We use a single UR5e arm with a Robotiq gripper and a wrist-mounted RGB camera. Demonstrations are recorded at 500 Hz via the FastUMI pipeline. Each action is a 10D vector [∆x, ∆y, ∆z, rot6d(6), gripper], with translational dimensions normalized to [−1, 1] via min-max scaling, rotation dimensions left unnormalized, and the gripper binarized to {0, 1}…
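The Figure 9 action encoding is concrete enough to sketch. Below, `pos_min` and `pos_max` are hypothetical workspace bounds (the caption specifies min-max scaling but not the statistics), and `rot6d` follows the common convention of flattening the first two rotation-matrix columns.

```python
import numpy as np

def encode_action(dpos, rot6d, gripper, pos_min=-0.05, pos_max=0.05):
    """Assemble the 10D action [dx, dy, dz, rot6d(6), gripper] per Figure 9:
    translation min-max scaled to [-1, 1], rotation left unnormalized,
    gripper binarized to {0, 1}. Bounds are illustrative, not the paper's."""
    dpos = 2.0 * (np.asarray(dpos, dtype=float) - pos_min) / (pos_max - pos_min) - 1.0
    grip = 1.0 if gripper > 0.5 else 0.0
    return np.concatenate([dpos, np.asarray(rot6d, dtype=float), [grip]])

# Identity rotation in the 6D representation: first two columns of I_3.
rot6d_identity = np.eye(3)[:, :2].T.ravel()
a = encode_action([0.01, -0.02, 0.0], rot6d_identity, gripper=0.9)
assert a.shape == (10,)
```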
Original abstract

Unlike chatbots, physical AI must act while the world keeps evolving. The inter-chunk pause of synchronous executors is therefore fatal for dynamic tasks, no matter how fast the inference is. Asynchronous execution -- thinking while acting -- is thus a structural requirement, and real-time chunking (RTC) makes it viable by recasting chunk transitions as inpainting: freezing committed actions and consistently generating the remainder. However, RTC with a flow-matching policy is structurally suboptimal: its inpainting comes from inference-time corrections rather than the base policy, yielding little pre-training benefit while requiring task-specific fine-tuning and heuristic guidance, plus extra computation that inflates latency. In this work, we observe that discrete diffusion policies, which generate actions by iteratively unmasking, are natural asynchronous executors that resolve all of these limitations at once: they are fine-tuning free, since inpainting is their native operation, while early stopping further provides adaptive guidance and reduces inference cost. We propose DiscreteRTC, which replaces external corrections with native unmasking, and show on dynamic simulated benchmarks and real-world dynamic manipulation tasks that it achieves higher success rates than continuous RTC and other baselines. In summary, DiscreteRTC is simpler to implement, with 0 lines of code for async inpainting; faster at inference, with only 0.7x the computation of generating actions from scratch; and better at execution, with a 50% higher success rate on a real-world dynamic pick task compared with flow-matching-based RTC. More visualizations are at https://outsider86.github.io/DiscreteRTCSite/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that discrete diffusion policies generate actions via iterative unmasking and are therefore natural asynchronous executors for real-time chunking (RTC). By treating chunk transitions as native inpainting, DiscreteRTC avoids the inference-time corrections, task-specific fine-tuning, and extra compute required by flow-matching RTC; early stopping further supplies adaptive guidance. Empirical results on dynamic simulated benchmarks and real-world manipulation tasks are reported to show higher success rates (50% improvement in a real pick task) and lower inference cost (0.7x compute) relative to continuous RTC and other baselines.

Significance. If the empirical claims hold under rigorous testing, the work would be significant for robotics and control: it supplies a structurally simpler mechanism for asynchronous execution that directly exploits the generative process of discrete diffusion, eliminating the need for external corrections or fine-tuning while simultaneously lowering latency. This could reduce the engineering overhead for deploying policies in dynamic physical environments where synchronous chunking is inadequate.

major comments (2)
  1. [Abstract] The reported gains (50% higher success rate in the real pick task, 0.7x compute) are presented without error bars, statistical tests, or a detailed experimental protocol. Because these numbers are the primary evidence for the superiority of DiscreteRTC over flow-matching RTC, the absence of statistical rigor undermines the central empirical claim.
  2. [Abstract and Experiments] The skeptic concern about distribution shift is not addressed: a policy trained on complete offline trajectories may not produce consistent, high-quality inpainted continuations when a prefix of actions is committed and the world state evolves during execution. Without ablations that explicitly test partial-commitment conditioning or measure degradation under dynamic conditions, the assertion that inpainting is 'fine-tuning free' and native remains vulnerable.
minor comments (1)
  1. [Abstract] The informal phrasing '0 lines of code for async inpainting' should be replaced by a precise statement of what implementation changes (if any) are required to enable the inpainting behavior.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address the concerns about statistical rigor and distribution shift by committing to revisions that add error bars, statistical details, and targeted ablations while preserving the core claims supported by our dynamic benchmarks and real-world results.

Point-by-point responses
  1. Referee: [Abstract] The reported gains (50% higher success rate in the real pick task, 0.7x compute) are presented without error bars, statistical tests, or a detailed experimental protocol. Because these numbers are the primary evidence for the superiority of DiscreteRTC over flow-matching RTC, the absence of statistical rigor undermines the central empirical claim.

    Authors: We agree that the abstract requires greater statistical transparency to support the central claims. In the revised manuscript, we will add error bars to the reported success rates and compute metrics, specify the number of trials (e.g., 20 independent runs for the real pick task), and include references to statistical tests such as paired t-tests for significance between DiscreteRTC and flow-matching RTC (a sketch of this test appears after these responses). The full experimental protocol, including environment details, hyperparameters, and evaluation procedures, is already provided in Section 4 and the appendix; we will add an explicit cross-reference in the abstract. revision: yes

  2. Referee: [Abstract and Experiments] The skeptic concern about distribution shift is not addressed: a policy trained on complete offline trajectories may not produce consistent, high-quality inpainted continuations when a prefix of actions is committed and the world state evolves during execution. Without ablations that explicitly test partial-commitment conditioning or measure degradation under dynamic conditions, the assertion that inpainting is 'fine-tuning free' and native remains vulnerable.

    Authors: We appreciate this valid concern about potential distribution shift in asynchronous settings. Discrete diffusion policies are trained with a random masking objective that naturally includes partial action sequences, enabling native inpainting without fine-tuning or external corrections. Our dynamic simulated benchmarks and real-world manipulation tasks already evaluate performance under evolving world states with sequential action commitment, where DiscreteRTC outperforms baselines. To directly address partial-commitment conditioning, we will add an ablation in the revised paper that measures success rates, consistency of generated continuations, and any performance degradation across varying prefix commitment lengths in simulated dynamic environments. revision: yes
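A sketch of the significance test the rebuttal commits to, using `scipy.stats.ttest_rel` for the paired comparison it names; the per-run outcomes below are placeholders, not reported data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_runs = 20                                   # e.g., 20 independent real-robot runs
discrete_rtc = rng.binomial(1, 0.80, n_runs)  # placeholder per-run success outcomes
flow_rtc = rng.binomial(1, 0.55, n_runs)      # placeholder per-run success outcomes

t_stat, p_value = stats.ttest_rel(discrete_rtc, flow_rtc)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

For binary per-run outcomes, a McNemar or bootstrap test would arguably be more appropriate; the paired t-test appears here only because the rebuttal names it.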

Circularity Check

0 steps flagged

No circularity: the claim follows directly from the discrete diffusion generative mechanism

Full rationale

The paper's core assertion—that discrete diffusion policies serve as natural asynchronous executors because inpainting (freezing committed actions and unmasking the rest) is their native operation—derives immediately from the standard iterative unmasking process in discrete diffusion models, without any equations, fitted parameters, or self-citations reducing the result to a constructed input. The abstract and described mechanism present this as an inherent property of the architecture, with empirical benchmarks (success rates, inference cost) serving as external validation rather than definitional support. No load-bearing steps match the enumerated circularity patterns; the derivation remains self-contained against the model's generative definition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The claim rests on the standard generative properties of discrete diffusion models (iterative unmasking) and the assumption that these properties transfer to real-time inpainting without additional training; no new free parameters, axioms, or invented entities are introduced for the core argument.

pith-pipeline@v0.9.0 · 5595 in / 1167 out tokens · 85170 ms · 2026-05-08T02:27:34.437853+00:00 · methodology

