Transport Discrepancy as a Reliability Signal for Vision-Language-Action Models

Chaoyi Xu; Hao Luo; Haoqi Yuan; Qin Jin; Sipeng Zheng; Wanpeng Zhang; Ye Wang; Yicheng Feng; Zongqing Lu

arxiv: 2512.01715 · v2 · pith:KM3WXRCXnew · submitted 2025-12-01 · 💻 cs.RO

classification 💻 cs.RO

keywords action · gate · backbone · chunks · cost · discrepancy · distribution · drift

Vision-language-action (VLA) models that generate continuous action chunks via flow matching lack an internal signal for judging whether a given prediction is reliable. Distribution shift and long-horizon rollouts can push backbone representations away from the region the action head decodes reliably, yet the policy has no mechanism to detect or react to this drift. We observe that the cost of transporting observation features to the action representation in a shared feature space rises precisely when such drift occurs, providing a per-step reliability estimate without extra supervision. Building on this observation, we propose DiG (Discrepancy Gate), a lightweight plug-in module for flow-matching VLA policies. DiG computes a sliced Wasserstein transport cost between backbone features and the action expert's own input projection, maps it through an exponential gate, and uses the gate to modulate both a residual feature refinement and the training loss. At inference time, the gate enables DiG-Refine, an iterative refinement process that corrects action chunks before execution. Experiments on both simulation and real-world scenarios show that DiG consistently improves success rates, with the largest gains under distribution shift and on long-horizon tasks.
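
The transport-cost-plus-gate idea in the abstract can be illustrated with a minimal sketch. It assumes two equally sized sets of feature vectors, a Monte Carlo estimate of the squared sliced 2-Wasserstein distance, and a gate of the form exp(-cost / temperature). The function names, the `temperature` and `n_projections` parameters, and the direction of the gate (higher cost gives a lower gate value) are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def sliced_wasserstein(x, y, n_projections=64, seed=0):
    """Estimate the squared sliced 2-Wasserstein distance between two
    point sets x and y, both of shape (n, d) with the same n.

    Hypothetical helper for illustration; not the paper's implementation.
    """
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    # Draw random directions and normalize them onto the unit sphere.
    theta = rng.normal(size=(n_projections, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    # Project both sets onto every direction: shape (n, n_projections).
    px = np.sort(x @ theta.T, axis=0)
    py = np.sort(y @ theta.T, axis=0)
    # In 1-D, optimal transport pairs sorted samples, so the cost per
    # direction is the mean squared gap between sorted projections.
    return float(np.mean((px - py) ** 2))


def exponential_gate(cost, temperature=1.0):
    """Map a non-negative transport cost to (0, 1].

    Assumed form: a gate near 1 means low discrepancy, near 0 means
    high discrepancy.
    """
    return float(np.exp(-cost / temperature))


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    backbone_feats = rng.normal(size=(32, 8))   # stand-in for backbone features
    action_feats = backbone_feats + 0.5         # stand-in for a drifted projection
    cost = sliced_wasserstein(backbone_feats, action_feats)
    print("cost:", cost, "gate:", exponential_gate(cost))
```

Under these assumptions, identical feature sets yield zero cost and a gate of exactly 1, and the gate decays toward 0 as the two sets move apart. How the gate then scales a residual refinement or the training loss is left out here because the abstract does not specify it.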

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

PearlVLA: Progressive Embodied Action-Plan Refinement in Latent Space
cs.RO 2026-06 unverdicted novelty 7.0

PearlVLA achieves SOTA on LIBERO by separating VLM representations into visual grounding and an iterative latent plan branch refined via world model queries and RefineNet with process-reward RL.
Human-Centric Transferable Tactile Pre-Training for Dexterous Robotic Manipulation
cs.RO 2026-07 unverdicted novelty 6.0

Introduces H-Tac human tactile-action dataset and TTP pre-training that unifies spaces and predicts future tactile signals to improve robotic dexterous manipulation transfer.
LARA: Latent Action Representation Alignment for Vision-Language-Action Models
cs.CV 2026-06 unverdicted novelty 6.0

LARA jointly optimizes LAM and VLA models via representation alignment to improve robotic manipulation performance using human videos.
Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation
cs.RO 2026-05 unverdicted novelty 6.0

Anchor-Centric Adaptation escapes the diversity trap by prioritizing repeated demonstrations at core anchors over broad coverage, yielding higher success rates under fixed data budgets in robotic manipulation.
LARA: Latent Action Representation Alignment for Vision-Language-Action Models
cs.CV 2026-06 unverdicted novelty 5.0

LARA jointly optimizes LAM and VLA models via representation alignment, reporting average gains of ~10%, ~5%, and ~15% on simulation and real robotic manipulation tasks.