pith. machine review for the scientific record.

arxiv: 2604.25050 · v1 · submitted 2026-04-27 · 💻 cs.RO

Recognition: unknown

DiscreteRTC: Discrete Diffusion Policies are Natural Asynchronous Executors

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 02:27 UTC · model grok-4.3

classification 💻 cs.RO
keywords: discrete diffusion · asynchronous execution · real-time chunking · robotic action policies · action inpainting · diffusion models · dynamic manipulation

The pith

Discrete diffusion policies act as natural asynchronous executors because iterative unmasking makes inpainting native to them.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Physical robots must generate actions while the environment keeps changing, so any pause between action chunks risks failure on dynamic tasks. Asynchronous execution solves this by committing to an initial chunk and then inpainting the rest as the robot moves. Continuous flow-matching policies support inpainting only through extra inference-time corrections that demand fine-tuning and added computation. Discrete diffusion policies instead generate by repeatedly unmasking tokens, so freezing committed actions and continuing to unmask the open ones is simply the model's normal behavior. This native support removes the need for fine-tuning, enables early stopping for cheaper adaptive guidance, and yields higher success rates on simulated and real dynamic manipulation tasks.

Core claim

Discrete diffusion policies generate actions by iteratively unmasking, which makes real-time chunking their native operation rather than an added correction. Freezing already-committed action chunks and unmasking the remainder produces consistent continuations without any task-specific fine-tuning or external guidance modules. Early stopping during the unmasking process supplies adaptive guidance at lower inference cost than generating a full new sequence from scratch.
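To make the mechanism concrete, here is a minimal sketch of one asynchronous inference cycle, assuming a confidence-ranked unmasking schedule and a stand-in denoiser; the names (`inpaint_chunk`, `MASK`, the token layout) are illustrative, not the paper's implementation. The frozen prefix is never resampled, so inpainting is literally the ordinary unmasking loop with some tokens pre-filled.

```python
import numpy as np

MASK = -1      # sentinel id for a masked action token (illustrative)
VOCAB = 256    # action-token vocabulary size (illustrative)
H = 16         # tokens per action chunk (illustrative)

def denoiser(tokens, rng):
    """Stand-in for the policy network: per-token logits.
    A real policy would condition on observations; random logits
    keep the sketch self-contained and runnable."""
    return rng.standard_normal((len(tokens), VOCAB))

def inpaint_chunk(committed_prefix, rng, early_stop=None, per_step=2):
    """One asynchronous inference cycle: freeze the committed prefix,
    iteratively unmask the open suffix. With early_stop set, tokens at
    or beyond that index may stay masked until the next cycle."""
    tokens = np.full(H, MASK, dtype=np.int64)
    c = len(committed_prefix)
    tokens[:c] = committed_prefix            # frozen, never resampled
    boundary = H if early_stop is None else early_stop
    steps = 0
    while any(tokens[i] == MASK for i in range(c, boundary)):
        logits = denoiser(tokens, rng)
        conf = logits.max(axis=1)
        masked = [i for i in range(c, H) if tokens[i] == MASK]
        # unmask the most confident still-masked positions
        for i in sorted(masked, key=lambda i: -conf[i])[:per_step]:
            tokens[i] = int(logits[i].argmax())
        steps += 1
    return tokens, steps

rng = np.random.default_rng(0)
prev_chunk = rng.integers(0, VOCAB, size=H)
# Commit the tail of the previous chunk, then inpaint around it.
full, k_full = inpaint_chunk(prev_chunk[-4:], rng)
part, k_early = inpaint_chunk(prev_chunk[-4:], rng, early_stop=12)
print(k_early, "unmasking steps with early stop vs", k_full, "without")
```

Because the frozen tokens enter the denoiser as ordinary context, inpainting reuses the exact generation loop, which is the sense in which the paper can claim zero extra code for the inpainting step.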

What carries the argument

Iterative unmasking process in discrete diffusion policies, which directly supports inpainting of open action chunks while executing committed ones.

If this is right

  • Implementation requires zero extra code for the inpainting step because it reuses the existing unmasking loop.
  • Inference cost falls to roughly 0.7 times the cost of generating a full action sequence from scratch (a toy decomposition of this ratio follows the list).
  • Real-world dynamic pick success rate rises by about 50 percent relative to flow-matching-based real-time chunking.
  • Early stopping during unmasking supplies adaptive guidance without separate heuristic modules.
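The sketch below is our back-of-envelope reading of where a ~0.7x figure could come from, assuming compute scales with the number of tokens the unmasking loop must resolve before the next inference cycle. The paper reports 0.7x as an empirical measurement; the chunk length, prefix depth, and boundary below are invented for illustration.

```python
H = 16          # tokens per action chunk (illustrative)
committed = 2   # frozen prefix copied from the previous chunk (illustrative)
boundary = 13   # early-stop boundary: tokens >= 13 wait for the next cycle

tokens_from_scratch = H                    # full generation resolves all H tokens
tokens_async = boundary - committed        # inpainting resolves only the open, pre-boundary ones
print(tokens_async / tokens_from_scratch)  # 11/16 ≈ 0.69
```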

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same native inpainting property could let discrete diffusion policies handle variable-length action horizons on the fly without retraining.
  • Because inpainting is built in, the approach may transfer more readily to new robot hardware or sensor suites than methods that rely on post-training corrections.
  • Combining early-stopping guidance with online data collection might further close the gap between offline training and live deployment.

Load-bearing premise

A discrete diffusion policy trained on standard offline data will keep producing high-quality, consistent continuations when inpainting around frozen committed action chunks in changing environments.
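The rebuttal below argues this premise is plausible a priori: the masked discrete-diffusion training objective corrupts random token subsets, and "committed prefix + masked suffix" is one such subset. A minimal sketch of that masking step, assuming a uniform mask-rate schedule (the schedule choice is ours, not the paper's):

```python
import numpy as np

MASK = -1  # sentinel id, as in the inpainting sketch above

def random_mask(actions, rng):
    """Training-time corruption: mask a random subset of action tokens.
    The loss is computed only at masked positions."""
    rate = rng.uniform(0.0, 1.0)             # sampled mask ratio
    mask = rng.random(actions.shape) < rate
    return np.where(mask, MASK, actions), mask

rng = np.random.default_rng(0)
actions = rng.integers(0, 256, size=16)
corrupted, mask = random_mask(actions, rng)
# A prefix-frozen inference query is just the pattern mask[:c] == all False,
# so the policy has seen (a superset of) this conditioning during training.
print(corrupted)
```

What the sketch cannot show is whether continuation quality holds when the world state shifts between inference cycles; that is exactly what the experiment below would test.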

What would settle it

A controlled experiment testing whether success rates or action smoothness degrade when DiscreteRTC switches from synchronous to asynchronous mode on the same dynamic pick or manipulation tasks: a sharp drop would contradict the claim that inpainting around frozen chunks is native and consistent, while parity or improvement would support it.
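A minimal harness for that experiment, under loud assumptions: `run_episode` is a hypothetical stand-in whose base success probabilities are invented, and a real study would plug in the DiscreteRTC policy with the Kinetix or UR5e setups described in the figures.

```python
import random

def run_episode(mode: str, seed: int) -> bool:
    """Hypothetical episode runner; returns True on task success.
    The base rates below are placeholders, not results from the paper."""
    rng = random.Random(f"{mode}-{seed}")
    base = {"sync": 0.55, "async": 0.80}[mode]
    return rng.random() < base

def success_rate(mode: str, trials: int = 2048) -> float:
    # 2048 trials per data point mirrors the Kinetix evaluation scale
    return sum(run_episode(mode, s) for s in range(trials)) / trials

for mode in ("sync", "async"):
    print(mode, round(success_rate(mode), 3))
```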

Figures

Figures reproduced from arXiv: 2604.25050 by Chenfeng Xu, Chensheng Peng, Chen Tang, Kaiwen Hong, Katherine Driggs-Campbell, Masayoshi Tomizuka, Pengcheng Wang.

Figure 1. Async Execution with discrete diffusion policies solving dynamic manipulation. Gray rectangles and blocks represent the action chunks and the actions. Yellow and green cubes represent the masked and unmasked action tokens. During each inference cycle, discrete diffusion policies copy the tail of the last action chunk as the committed prefix and inpaint upon it by simply forwarding itself. Compared with flow…

Figure 2. RTC with flow-matching head. Color represents the noise level, where green stands for the clear action and yellow stands for pure noise. The flow-matching head is ill-suited for RTC because (a) during pre-training, the base policy is not trained on inpainting tasks; (b) to acquire this capability, a specially designed fine-tuning stage is required; (c) at inference time, RTC relies on heuristic guidance…

Figure 3. RTC with discrete diffusion head. Color represents the masking status, where green stands for the unmasked token and yellow stands for the masked token. The discrete diffusion head is naturally suited for RTC because (a) during pre-training, the base policy is already trained on inpainting tasks; (b) consequently, no inpainting-specific fine-tuning is required; (c) at inference time, early stopping from the…

Figure 4. Experimental Results in Kinetix. Throughput is the number of tasks completed by the policy every 256 steps. Left: average solve rate and throughput across all environments with different inference delays; Right: solve rates for each task with different inference delays. The execution horizon follows s = max(1, d) and each data point represents 2048 trials. To ensure valid execution before the next inference…

Figure 5. Extended Experimental Results in Kinetix. Left: required iterative steps for each inpainting inference of different policy architectures in Kinetix with s = max(1, d); Right: average solve rates of extended variants in Kinetix. The evaluation setup is the same as…

Figure 6. Unmasking Trajectory Sample with Natural Schedule Inference. Green blocks denote unmasked action tokens, yellow blocks denote masked tokens, and the red rectangle marks the early-stop boundary beyond which tokens do not need to be unmasked before the next inference. In practice, the natural schedule works as expected compared with the simple hard-mask approach. In this section, we show how the inappropriate…

Figure 7. Detailed Main Results in Kinetix. The evaluation setup is the same as…

Figure 8. Fine-tuning Ablation in Kinetix. The evaluation setup is the same as…

Figure 9. Dynamic Pick-and-Place Real-world Setup, Hardware and Data. We use a single UR5e arm with a Robotiq gripper and a wrist-mounted RGB camera. Demonstrations are recorded at 500 Hz via the FastUMI pipeline. Each action is a 10D vector [∆x, ∆y, ∆z, rot6d(6), gripper], with translational dimensions normalized to [−1, 1] via min-max scaling, rotation dimensions left unnormalized, and the gripper binarized to {0, 1}…
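The Figure 9 action encoding is concrete enough to sketch. Below, `pos_min` and `pos_max` are hypothetical workspace bounds (the caption specifies min-max scaling but not the statistics), and `rot6d` follows the common convention of flattening the first two rotation-matrix columns.

```python
import numpy as np

def encode_action(dpos, rot6d, gripper, pos_min=-0.05, pos_max=0.05):
    """Assemble the 10D action [dx, dy, dz, rot6d(6), gripper] per Figure 9:
    translation min-max scaled to [-1, 1], rotation left unnormalized,
    gripper binarized to {0, 1}. Bounds are illustrative, not the paper's."""
    dpos = 2.0 * (np.asarray(dpos, dtype=float) - pos_min) / (pos_max - pos_min) - 1.0
    grip = 1.0 if gripper > 0.5 else 0.0
    return np.concatenate([dpos, np.asarray(rot6d, dtype=float), [grip]])

# Identity rotation in the 6D representation: first two columns of I_3.
rot6d_identity = np.eye(3)[:, :2].T.ravel()
a = encode_action([0.01, -0.02, 0.0], rot6d_identity, gripper=0.9)
assert a.shape == (10,)
```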
Original abstract

Unlike chatbots, physical AI must act while the world keeps evolving. The inter-chunk pause of synchronous executors is therefore fatal for dynamic tasks, no matter how fast the inference is. Asynchronous execution -- thinking while acting -- is thus a structural requirement, and real-time chunking (RTC) makes it viable by recasting chunk transitions as inpainting: freezing committed actions and consistently generating the remainder. However, RTC with a flow-matching policy is structurally suboptimal: its inpainting comes from inference-time corrections rather than the base policy, yielding little pre-training benefit while requiring task-specific fine-tuning and heuristic guidance, plus extra computation that inflates latency. In this work, we observe that discrete diffusion policies, which generate actions by iteratively unmasking, are natural asynchronous executors that resolve all of these limitations at once: they are fine-tuning free, since inpainting is their native operation, while early stopping further provides adaptive guidance and reduces inference cost. We propose DiscreteRTC, which replaces external corrections with native unmasking, and show on dynamic simulated benchmarks and real-world dynamic manipulation tasks that it achieves higher success rates than continuous RTC and other baselines. In summary, DiscreteRTC is simpler to implement, with 0 lines of code for async inpainting; faster at inference, with only 0.7x the computation of generating actions from scratch; and better at execution, with a 50% higher success rate on a real-world dynamic pick task compared with flow-matching-based RTC. More visualizations are at https://outsider86.github.io/DiscreteRTCSite/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that discrete diffusion policies generate actions via iterative unmasking and are therefore natural asynchronous executors for real-time chunking (RTC). By treating chunk transitions as native inpainting, DiscreteRTC avoids the inference-time corrections, task-specific fine-tuning, and extra compute required by flow-matching RTC; early stopping further supplies adaptive guidance. Empirical results on dynamic simulated benchmarks and real-world manipulation tasks are reported to show higher success rates (50% improvement in a real pick task) and lower inference cost (0.7x compute) relative to continuous RTC and other baselines.

Significance. If the empirical claims hold under rigorous testing, the work would be significant for robotics and control: it supplies a structurally simpler mechanism for asynchronous execution that directly exploits the generative process of discrete diffusion, eliminating the need for external corrections or fine-tuning while simultaneously lowering latency. This could reduce the engineering overhead for deploying policies in dynamic physical environments where synchronous chunking is inadequate.

major comments (2)
  1. [Abstract] The reported gains (50% higher success rate in the real pick task, 0.7x compute) are presented without error bars, statistical tests, or a detailed experimental protocol. Because these numbers are the primary evidence for the superiority of DiscreteRTC over flow-matching RTC, the absence of statistical rigor undermines the central empirical claim.
  2. [Abstract and Experiments] The skeptic concern about distribution shift is not addressed: a policy trained on complete offline trajectories may not produce consistent, high-quality inpainted continuations when a prefix of actions is committed and the world state evolves during execution. Without ablations that explicitly test partial-commitment conditioning or measure degradation under dynamic conditions, the assertion that inpainting is 'fine-tuning free' and native remains vulnerable.
minor comments (1)
  1. [Abstract] The informal phrasing '0 lines of code for async inpainting' should be replaced by a precise statement of what implementation changes (if any) are required to enable the inpainting behavior.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address the concerns about statistical rigor and distribution shift by committing to revisions that add error bars, statistical details, and targeted ablations while preserving the core claims supported by our dynamic benchmarks and real-world results.

Point-by-point responses
  1. Referee: [Abstract] The reported gains (50% higher success rate in the real pick task, 0.7x compute) are presented without error bars, statistical tests, or a detailed experimental protocol. Because these numbers are the primary evidence for the superiority of DiscreteRTC over flow-matching RTC, the absence of statistical rigor undermines the central empirical claim.

    Authors: We agree that the abstract requires greater statistical transparency to support the central claims. In the revised manuscript, we will add error bars to the reported success rates and compute metrics, specify the number of trials (e.g., 20 independent runs for the real pick task), and include references to statistical tests such as paired t-tests for significance between DiscreteRTC and flow-matching RTC (a sketch of this test appears after these responses). The full experimental protocol, including environment details, hyperparameters, and evaluation procedures, is already provided in Section 4 and the appendix; we will add an explicit cross-reference in the abstract. revision: yes

  2. Referee: [Abstract and Experiments] The skeptic concern about distribution shift is not addressed: a policy trained on complete offline trajectories may not produce consistent, high-quality inpainted continuations when a prefix of actions is committed and the world state evolves during execution. Without ablations that explicitly test partial-commitment conditioning or measure degradation under dynamic conditions, the assertion that inpainting is 'fine-tuning free' and native remains vulnerable.

    Authors: We appreciate this valid concern about potential distribution shift in asynchronous settings. Discrete diffusion policies are trained with a random masking objective that naturally includes partial action sequences, enabling native inpainting without fine-tuning or external corrections. Our dynamic simulated benchmarks and real-world manipulation tasks already evaluate performance under evolving world states with sequential action commitment, where DiscreteRTC outperforms baselines. To directly address partial-commitment conditioning, we will add an ablation in the revised paper that measures success rates, consistency of generated continuations, and any performance degradation across varying prefix commitment lengths in simulated dynamic environments. revision: yes
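A sketch of the significance test the rebuttal commits to, using `scipy.stats.ttest_rel` for the paired comparison it names; the per-run outcomes below are placeholders, not reported data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_runs = 20                                   # e.g., 20 independent real-robot runs
discrete_rtc = rng.binomial(1, 0.80, n_runs)  # placeholder per-run success outcomes
flow_rtc = rng.binomial(1, 0.55, n_runs)      # placeholder per-run success outcomes

t_stat, p_value = stats.ttest_rel(discrete_rtc, flow_rtc)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

For binary per-run outcomes, a McNemar or bootstrap test would arguably be more appropriate; the paired t-test appears here only because the rebuttal names it.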

Circularity Check

0 steps flagged

No circularity: the claim follows directly from the discrete diffusion generative mechanism

Full rationale

The paper's core assertion—that discrete diffusion policies serve as natural asynchronous executors because inpainting (freezing committed actions and unmasking the rest) is their native operation—derives immediately from the standard iterative unmasking process in discrete diffusion models, without any equations, fitted parameters, or self-citations reducing the result to a constructed input. The abstract and described mechanism present this as an inherent property of the architecture, with empirical benchmarks (success rates, inference cost) serving as external validation rather than definitional support. No load-bearing steps match the enumerated circularity patterns; the derivation remains self-contained against the model's generative definition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The claim rests on the standard generative properties of discrete diffusion models (iterative unmasking) and the assumption that these properties transfer to real-time inpainting without additional training; no new free parameters, axioms, or invented entities are introduced for the core argument.

pith-pipeline@v0.9.0 · 5595 in / 1167 out tokens · 85170 ms · 2026-05-08T02:27:34.437853+00:00 · methodology

