DiscreteRTC: Discrete Diffusion Policies are Natural Asynchronous Executors
Pith reviewed 2026-05-08 02:27 UTC · model grok-4.3
The pith
Discrete diffusion policies act as natural asynchronous executors because iterative unmasking makes inpainting native to them.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Discrete diffusion policies generate actions by iteratively unmasking, which makes real-time chunking their native operation rather than an added correction. Freezing already-committed action chunks and unmasking the remainder produces consistent continuations without any task-specific fine-tuning or external guidance modules. Early stopping during the unmasking process supplies adaptive guidance at lower inference cost than generating a full new sequence from scratch.
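To make the mechanism concrete, here is a minimal sketch of confidence-based iterative unmasking with inpainting, in the spirit of MaskGIT-style samplers. The policy interface, the MASK sentinel, and the unmasking schedule are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

MASK = -1  # sentinel id for not-yet-decided action tokens (illustrative)

def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def unmask_chunk(policy, obs, actions, n_steps=10, stop_after=None):
    """Iteratively fill MASK positions. Committed (non-MASK) tokens are
    never resampled, so chunk-transition inpainting reuses this loop
    unchanged: freeze the executed prefix, unmask the rest."""
    actions = actions.copy()
    for step in range(n_steps):
        if stop_after is not None and step >= stop_after:
            break  # early stopping: accept current best guesses below
        masked = actions == MASK
        if not masked.any():
            return actions
        probs = softmax(policy(obs, actions))      # assumed (horizon, vocab) logits
        conf, picks = probs.max(-1), probs.argmax(-1)
        # unmask a fixed share of the remaining masked positions per step,
        # highest-confidence first
        k = max(1, int(np.ceil(masked.sum() / (n_steps - step))))
        idx = np.argsort(np.where(masked, -conf, np.inf))[:k]
        actions[idx] = picks[idx]
    masked = actions == MASK
    if not masked.any():
        return actions
    # anything still masked after early stopping takes the argmax guess
    probs = softmax(policy(obs, actions))
    return np.where(masked, probs.argmax(-1), actions)
```

Freezing a committed prefix then amounts to building the input template, e.g. `np.concatenate([committed, np.full(H - len(committed), MASK)])`; no new sampling logic is required, which is presumably what the abstract's "0 lines of code" claim refers to.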
What carries the argument
The iterative unmasking process of discrete diffusion policies, which directly supports inpainting of the open portion of an action chunk while the committed portion executes.
If this is right
- Implementation requires zero extra code for the inpainting step because it reuses the existing unmasking loop (see the executor sketch after this list).
- Inference cost falls to roughly 0.7 times the cost of generating a full action sequence from scratch.
- Real-world dynamic pick success rate rises by about 50 percent relative to flow-matching-based real-time chunking.
- Early stopping during unmasking supplies adaptive guidance without separate heuristic modules.
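The executor sketch referenced above: a hedged rendering of the asynchronous cycle built on `unmask_chunk` from the earlier sketch. The sequential structure, the assumption that inference latency is covered by d control steps, and all interface names are illustrative, not the authors' code.

```python
def run_rtc(policy, get_obs, send_action, H=16, d=4, n_steps=10):
    """One replanning cycle per d control steps (requires 2 * d <= H).
    The next chunk overlaps the current plan; the overlap that may
    execute before generation finishes is frozen, and only the tail
    is unmasked."""
    chunk = unmask_chunk(policy, get_obs(), np.full(H, MASK), n_steps)
    while True:  # run until the task layer stops us
        for a in chunk[:d]:
            send_action(a)  # committed actions keep the robot moving ...
        # ... while, conceptually in parallel, the next chunk (the horizon
        # shifted by d) is inpainted: its first d slots are frozen to the
        # old plan, the remaining H - d slots start masked.
        template = np.concatenate([chunk[d:2 * d], np.full(H - d, MASK)])
        chunk = unmask_chunk(policy, get_obs(), template, n_steps)
```

Note the contrast with flow-matching RTC implied by the review: the freeze needs no guidance term here, because non-MASK tokens are simply never resampled.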
Where Pith is reading between the lines
- The same native inpainting property could let discrete diffusion policies handle variable-length action horizons on the fly without retraining.
- Because inpainting is built in, the approach may transfer more readily to new robot hardware or sensor suites than methods that rely on post-training corrections.
- Combining early-stopping guidance with online data collection might further close the gap between offline training and live deployment.
Load-bearing premise
A discrete diffusion policy trained on standard offline data will keep producing high-quality, consistent continuations when inpainting around frozen committed action chunks in changing environments.
What would settle it
A controlled experiment showing that success rates or action smoothness drop sharply when DiscreteRTC switches from synchronous to asynchronous mode in the same dynamic pick or manipulation tasks.
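A hedged sketch of the synchronous arm of that comparison (`run_rtc` above would be the asynchronous arm); the environment interface and success signal are assumptions made for illustration.

```python
def run_sync(policy, env, H=16, n_steps=10):
    """Synchronous baseline: generate a full chunk, execute it, pause,
    replan. The world keeps evolving during the inter-chunk pause, which
    is the failure mode the settling experiment would probe."""
    obs, done, success = env.reset(), False, False  # assumed interface
    while not done:
        chunk = unmask_chunk(policy, obs, np.full(H, MASK), n_steps)
        for a in chunk:
            obs, done, success = env.step(a)
            if done:
                break
    return success
```

Success rates over matched seeds in the two modes could then be compared directly on the same dynamic pick or manipulation tasks.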
Original abstract
Unlike chatbots, physical AI must act while the world keeps evolving. The inter-chunk pause of synchronous executors is therefore fatal for dynamic tasks, regardless of how fast the inference is. Asynchronous execution -- thinking while acting -- is thus a structural requirement, and real-time chunking (RTC) makes it viable by recasting chunk transitions as inpainting: freezing committed actions and consistently generating the remainder. However, RTC with a flow-matching policy is structurally suboptimal: its inpainting comes from inference-time corrections rather than the base policy, yielding little pre-training benefit while demanding task-specific fine-tuning, heuristic guidance, and extra computation that inflates latency. In this work, we observe that discrete diffusion policies, which generate actions by iteratively unmasking, are natural asynchronous executors that resolve all of these limitations at once: they are fine-tuning free, since inpainting is their native operation, while early stopping further provides adaptive guidance and reduces inference cost. We propose DiscreteRTC, which replaces external corrections with native unmasking, and show on dynamic simulated benchmarks and real-world dynamic manipulation tasks that it achieves higher success rates than continuous RTC and other baselines. In summary, DiscreteRTC is simpler to implement, with 0 lines of code for async inpainting; faster at inference, with only 0.7x the computation of generating actions from scratch; and better at execution, with a 50% higher success rate in a real-world dynamic pick task than flow-matching-based RTC. More visualizations are at https://outsider86.github.io/DiscreteRTCSite/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that discrete diffusion policies generate actions via iterative unmasking and are therefore natural asynchronous executors for real-time chunking (RTC). By treating chunk transitions as native inpainting, DiscreteRTC avoids the inference-time corrections, task-specific fine-tuning, and extra compute required by flow-matching RTC; early stopping further supplies adaptive guidance. Empirical results on dynamic simulated benchmarks and real-world manipulation tasks are reported to show higher success rates (50% improvement in a real pick task) and lower inference cost (0.7x compute) relative to continuous RTC and other baselines.
Significance. If the empirical claims hold under rigorous testing, the work would be significant for robotics and control: it supplies a structurally simpler mechanism for asynchronous execution that directly exploits the generative process of discrete diffusion, eliminating the need for external corrections or fine-tuning while simultaneously lowering latency. This could reduce the engineering overhead for deploying policies in dynamic physical environments where synchronous chunking is inadequate.
Major comments (2)
- [Abstract] The reported gains (50% higher success rate in the real pick task, 0.7x compute) are presented without error bars, statistical tests, or a detailed experimental protocol. Because these numbers are the primary evidence for the superiority of DiscreteRTC over flow-matching RTC, the absence of statistical rigor is load-bearing for the central empirical claim.
- [Abstract and Experiments] The skeptic's concern about distribution shift is not addressed: a policy trained on complete offline trajectories may not produce consistent, high-quality inpainted continuations when a prefix of actions is committed and the world state evolves during execution. Without ablations that explicitly test partial-commitment conditioning or measure degradation under dynamic conditions, the assertion that inpainting is 'fine-tuning free' and native remains vulnerable.
Minor comments (1)
- [Abstract] The informal phrasing '0 lines of code for async inpainting' should be replaced by a precise statement of what implementation changes, if any, are required to enable the inpainting behavior.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our work. We address the concerns about statistical rigor and distribution shift by committing to revisions that add error bars, statistical details, and targeted ablations while preserving the core claims supported by our dynamic benchmarks and real-world results.
Point-by-point responses
Referee: [Abstract] The reported gains (50% higher success rate in the real pick task, 0.7x compute) are presented without error bars, statistical tests, or a detailed experimental protocol. Because these numbers are the primary evidence for the superiority of DiscreteRTC over flow-matching RTC, the absence of statistical rigor is load-bearing for the central empirical claim.
Authors: We agree that the abstract requires greater statistical transparency to support the central claims. In the revised manuscript, we will add error bars to the reported success rates and compute metrics, specify the number of trials (e.g., 20 independent runs for the real pick task), and report statistical tests such as paired t-tests for significance between DiscreteRTC and flow-matching RTC. The full experimental protocol, including environment details, hyperparameters, and evaluation procedures, is already provided in Section 4 and the appendix; we will add an explicit cross-reference in the abstract.
Revision: yes
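For concreteness, a minimal sketch of the significance test this response commits to, assuming binary per-trial success outcomes on matched episodes; names are illustrative. For paired binary data, McNemar's exact test would be a reasonable companion to the paired t-test.

```python
import numpy as np
from scipy import stats

def compare_success(success_ours, success_baseline):
    """success_*: equal-length 0/1 arrays, one entry per matched trial.
    Returns the success-rate gap alongside the paired t-test the
    rebuttal proposes."""
    diff = np.mean(success_ours) - np.mean(success_baseline)
    t, p = stats.ttest_rel(success_ours, success_baseline)
    return diff, t, p  # report the gap together with its significance
```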
Referee: [Abstract and Experiments] The skeptic's concern about distribution shift is not addressed: a policy trained on complete offline trajectories may not produce consistent, high-quality inpainted continuations when a prefix of actions is committed and the world state evolves during execution. Without ablations that explicitly test partial-commitment conditioning or measure degradation under dynamic conditions, the assertion that inpainting is 'fine-tuning free' and native remains vulnerable.
Authors: We appreciate this valid concern about potential distribution shift in asynchronous settings. Discrete diffusion policies are trained with a random masking objective that naturally includes partial action sequences, enabling native inpainting without fine-tuning or external corrections. Our dynamic simulated benchmarks and real-world manipulation tasks already evaluate performance under evolving world states with sequential action commitment, where DiscreteRTC outperforms baselines. To directly address partial-commitment conditioning, we will add an ablation in the revised paper that measures success rates, consistency of generated continuations, and any performance degradation across varying prefix commitment lengths in simulated dynamic environments.
Revision: yes
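The load-bearing step in this response is that random masking at training time already covers the prefix-committed conditional used at deployment. A minimal sketch of that objective, reusing MASK from the earlier sketch and assuming tokenized action chunks; the ignore-index convention (-100) is illustrative.

```python
def masked_training_example(actions, rng):
    """Sample one training example for a masked discrete diffusion policy.
    Because the mask pattern is random, 'everything but a committed
    prefix' is itself a pattern seen during training, which is exactly
    the conditional DiscreteRTC queries at chunk transitions."""
    ratio = rng.uniform()                          # masking level for this example
    mask = rng.uniform(size=actions.shape) < ratio
    corrupted = np.where(mask, MASK, actions)      # model input
    targets = np.where(mask, actions, -100)        # loss only on masked slots
    return corrupted, targets
```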
Circularity Check
No circularity: claim follows directly from discrete diffusion generative mechanism
Full rationale
The paper's core assertion—that discrete diffusion policies serve as natural asynchronous executors because inpainting (freezing committed actions and unmasking the rest) is their native operation—derives immediately from the standard iterative unmasking process in discrete diffusion models, with no equations, fitted parameters, or self-citations that would reduce the result to a constructed input. The abstract and the described mechanism present this as an inherent property of the architecture, with empirical benchmarks (success rates, inference cost) serving as external validation rather than definitional support. No load-bearing step matches the enumerated circularity patterns; the derivation is self-contained given the model's generative definition.