Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies

Chengyue Wu; Jiangmiao Pang; Liuao Pei; Ping Luo; Shunbo Zhou; Sitong Mao; Tian Nian; Tianshuo Yang; Xiaokang Yang; Yao Mu

arXiv: 2508.20072 · v4 · pith:DOUVTC4Enew · submitted 2025-08-27 · cs.CV · cs.LG · cs.RO

Zhixuan Liang, Yizhuo Li, Tianshuo Yang, Chengyue Wu, Sitong Mao, Liuao Pei, Tian Nian, Shunbo Zhou, Xiaokang Yang, Jiangmiao Pang, Yao Mu, Ping Luo

classification: cs.CV · cs.LG · cs.RO

keywords: diffusion · discrete · decoding · action · method · vision-language · achieves · actions

Vision-Language-Action (VLA) models adapt large vision-language backbones to map images and instructions into robot actions. However, prevailing VLAs either generate actions autoregressively in a fixed left-to-right order, which performs poorly, or attach separate diffusion heads outside the backbone, which fragments information pathways and hinders unified, scalable architectures. Instead, we present Discrete Diffusion VLA, which discretizes action chunks and models them with discrete diffusion, retaining progressive refinement inside a unified transformer backbone. Our method achieves an adaptive decoding order that resolves high-confidence action elements before harder ones and employs secondary re-masking to revisit uncertain predictions, enabling robust error correction. This design preserves pretrained vision-language priors, supports parallel decoding, and improves efficiency. Discrete Diffusion VLA achieves 96.4% average success on LIBERO, 71.2% visual matching on SimplerEnv-Fractal, and 54.2% overall on SimplerEnv-Bridge. On out-of-distribution tests of LIBERO-Goal, our method exhibits only 0.8% language degradation versus 8.0% for parallel decoding, and 20.4% vision degradation versus 29.0% for continuous diffusion, demonstrating strong retention of pretrained vision-language capabilities. We also conduct two real-robot evaluations on the AgileX Cobot Magic platform to show the method's effectiveness.
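
The adaptive decoding order and secondary re-masking described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: the random-logit stand-in for the transformer (`toy_logits`), the `MASK` sentinel, the linear unmasking schedule, and the `remask_threshold` value are all assumptions chosen only to make the control flow concrete.

```python
import numpy as np

MASK = -1  # hypothetical sentinel for a still-masked action token


def toy_logits(tokens, rng, vocab_size):
    """Stand-in for the VLA backbone (assumption): random logits per position."""
    return rng.normal(size=(tokens.shape[0], vocab_size))


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def decode_chunk(num_tokens=8, vocab_size=16, num_steps=4,
                 remask_threshold=0.1, seed=0):
    """Confidence-ordered parallel unmasking with re-masking (illustrative)."""
    rng = np.random.default_rng(seed)
    tokens = np.full(num_tokens, MASK, dtype=np.int64)
    positions = np.arange(num_tokens)

    for step in range(1, num_steps + 1):
        probs = softmax(toy_logits(tokens, rng, vocab_size))
        pred = probs.argmax(axis=-1)   # most likely token at each position
        conf = probs.max(axis=-1)      # confidence of that prediction

        # Secondary re-masking: a committed token whose probability under
        # the current pass is low gets masked again so it can be revisited.
        committed = tokens != MASK
        committed_conf = probs[positions, np.where(committed, tokens, 0)]
        tokens[committed & (committed_conf < remask_threshold)] = MASK

        # Assumed linear schedule: number of tokens committed after this step.
        target = int(np.ceil(num_tokens * step / num_steps))
        masked_idx = np.flatnonzero(tokens == MASK)
        need = target - (num_tokens - masked_idx.size)
        if need > 0:
            # Adaptive order: fill the most confident masked positions first.
            chosen = masked_idx[np.argsort(-conf[masked_idx])][:need]
            tokens[chosen] = pred[chosen]

    return tokens


if __name__ == "__main__":
    print(decode_chunk())
```

Because the final step's target equals the chunk length, every position is committed when the loop ends, so the returned chunk contains no masked tokens. A real policy would replace `toy_logits` with a forward pass conditioned on images and the instruction.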

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR
cs.LG 2026-05 conditional novelty 7.0

Pion modifies Muon's Newton-Schulz iterations into a controllable high-pass filter that anchors dominant singular values at 1 while suppressing noisy tails, outperforming Muon and AdamW in VLA and RLVR regimes.
BlockVLA: Accelerating Autoregressive VLA via Block Diffusion Finetuning
cs.RO 2026-05 unverdicted novelty 7.0

BlockVLA accelerates autoregressive VLA models by 3.3x using block diffusion finetuning, with faster training convergence and better early performance on long-horizon robotic tasks.
Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models
cs.RO 2026-05 unverdicted novelty 7.0

Pace-and-Path Correction decomposes a quadratic cost minimization into orthogonal pace and path channels to correct chunked actions in VLA models, raising success rates by up to 28.8% in dynamic settings.
Discrete Langevin-Inspired Posterior Sampling
cs.LG 2026-05 unverdicted novelty 7.0

ΔLPS is a gradient-guided discrete posterior sampler for inverse problems that works with masked or uniform discrete diffusion priors and outperforms prior discrete methods on image restoration tasks.
Being-H0.7: A Latent World-Action Model from Egocentric Videos
cs.RO 2026-04 unverdicted novelty 7.0

Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
DiscreteRTC: Discrete Diffusion Policies are Natural Asynchronous Executors
cs.RO 2026-04 unverdicted novelty 7.0

Discrete diffusion policies support native asynchronous execution via unmasking for real-time chunking, delivering higher success rates and 0.7x inference cost versus flow-matching RTC on dynamic robotics benchmarks a...
GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization
cs.RO 2026-05 unverdicted novelty 6.0

GuidedVLA improves VLA success rates by manually supervising separate attention heads in the action decoder with auxiliary signals for task-relevant factors.
Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models
cs.RO 2026-05 unverdicted novelty 6.0

Pace-and-Path Correction is a closed-form inference-time operator that decomposes a quadratic cost minimization into orthogonal pace compression and path offset channels to correct dynamics-blindness in chunked-action...
PriorVLA: Prior-Preserving Adaptation for Vision-Language-Action Models
cs.RO 2026-05 unverdicted novelty 6.0

PriorVLA preserves pretrained priors in VLA models through a frozen Prior Expert and trained Adaptation Expert, delivering better robot manipulation performance than full fine-tuning with only 25% of the parameter updates.
dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model
cs.RO 2026-04 unverdicted novelty 6.0

A discrete diffusion model tokenizes multimodal robotic data and uses a progress token to predict future states and task completion for scalable policy evaluation.
Temporal Difference Calibration in Sequential Tasks: Application to Vision-Language-Action Models
cs.RO 2026-04 unverdicted novelty 6.0

Temporal difference calibration aligns uncertainty estimates in vision-language-action models with their value functions for better sequential performance.
Universal Pose Pretraining for Generalizable Vision-Language-Action Policies
cs.CV 2026-02 unverdicted novelty 6.0

Pose-VLA uses a decoupled two-stage pre-training with discrete pose tokens to extract universal 3D spatial priors from 3D datasets and robotic trajectories, achieving 79.5% success on RoboTwin 2.0 and 96.0% on LIBERO.
Learning Native Continuation for Action Chunking Flow Policies
cs.RO 2026-02 unverdicted novelty 6.0

Legato trains flow-based VLA policies with schedule-shaped action-noise mixtures and randomized conditions to achieve smoother trajectories and ~10% faster task completion than real-time chunking across five real-worl...
ABot-M0: VLA Foundation Model for Robotic Manipulation with Action Manifold Learning
cs.CV 2026-02 unverdicted novelty 6.0

ABot-M0 unifies heterogeneous robot data into a 6-million-trajectory dataset and introduces Action Manifold Learning to predict stable actions on a low-dimensional manifold using a DiT backbone.
AsyncVLA: Asynchronous Flow Matching for Vision-Language-Action Models
cs.RO 2025-11 unverdicted novelty 6.0

AsyncVLA adds asynchronous flow matching and a confidence rater to VLA models so they can generate actions on flexible schedules and selectively refine low-confidence tokens before execution.
Learning Action Manifold with Multi-view Latent Priors for Robotic Manipulation
cs.RO 2026-05 unverdicted novelty 5.0

The method uses multi-view diffusion priors and action manifold learning to resolve depth ambiguity and improve action prediction in VLA robotic manipulation models, reporting higher success rates than baselines on LI...
World-Value-Action Model: Implicit Planning for Vision-Language-Action Systems
cs.RO 2026-04 unverdicted novelty 5.0

The World-Value-Action model enables implicit planning for VLA systems by performing inference over a learned latent representation of high-value future trajectories instead of direct action prediction.
Causal World Modeling for Robot Control
cs.CV 2026-01 unverdicted novelty 5.0

LingBot-VA combines video world modeling with policy learning via Mixture-of-Transformers, closed-loop rollouts, and asynchronous inference to improve robot manipulation in simulation and real settings.