QDepth-VLA: Quantized Depth Prediction as Auxiliary Supervision for Vision-Language-Action Models

Dongbin Zhao; Haoran Li; Mingcai Zhou; Yixuan Li; Yuhui Chen; Zhengtao Zhang

arxiv: 2510.14836 · v3 · pith:ACXLWRPLnew · submitted 2025-10-16 · 💻 cs.CV · cs.RO

Yixuan Li, Yuhui Chen, Mingcai Zhou, Haoran Li, Zhengtao Zhang, Dongbin Zhao

keywords: depth, models, qdepth-vla, tasks, auxiliary, manipulation, prediction, quantized

Spatial perception and reasoning are crucial for Vision-Language-Action (VLA) models to accomplish fine-grained manipulation tasks. However, existing approaches often lack the ability to understand and reason over the essential 3D structures necessary for precise control. To address this limitation, we propose QDepth-VLA, a general framework that augments VLA models with an auxiliary depth prediction task. A dedicated depth expert is designed to predict quantized latent tokens of depth maps obtained from a VQ-VAE encoder, enabling the model to learn depth-aware representations that capture critical geometric cues. Experimental results on the simulation benchmarks and real-world tasks demonstrate that QDepth-VLA yields strong spatial reasoning and competitive performance on manipulation tasks.
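
The abstract describes supervising a depth expert with quantized latent tokens of depth maps. A minimal sketch of what such token-level supervision could look like follows. It is illustrative only: the abstract does not state the loss, the codebook size, or the architecture, so the cross-entropy objective, the toy codebook, and the names `quantize` and `token_cross_entropy` are all assumptions, and plain NumPy arrays stand in for a trained VQ-VAE encoder and for the depth expert's outputs.

```python
import numpy as np

def quantize(latents, codebook):
    """Return, for each latent vector, the index of its nearest codebook entry.

    latents:  (N, D) array of encoder outputs (hypothetical stand-in).
    codebook: (K, D) array of VQ codebook vectors (hypothetical stand-in).
    """
    # Squared Euclidean distance between every latent and every code: (N, K).
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def token_cross_entropy(logits, targets):
    """Mean cross-entropy between predicted token logits (N, K) and target indices (N,).

    Assumed auxiliary objective; the abstract does not specify the loss.
    """
    # Numerically stable log-softmax over the codebook dimension.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

# Toy codebook with 8 well-separated codes of dimension 4.
codebook = np.arange(32, dtype=float).reshape(8, 4)

# Pretend depth-map latents lying near codes 3, 0, and 5.
latents = codebook[[3, 0, 5]] + 0.1
targets = quantize(latents, codebook)

# Uninformative predictions (all-zero logits) versus confident, correct ones.
uniform_logits = np.zeros((3, 8))
confident_logits = 10.0 * np.eye(8)[targets]

uniform_loss = token_cross_entropy(uniform_logits, targets)
confident_loss = token_cross_entropy(confident_logits, targets)
```

In a full model, a term like this would presumably be added to the action loss with some weighting; that combination is likewise not described in the abstract.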

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

EvoScene-VLA: Evolving Scene Beliefs Inside the Action Decoder for Chunked Robot Control
cs.RO 2026-05 conditional novelty 7.0

EvoScene-VLA maintains an action-updated scene prior across control chunks in VLA policies, raising success rates on RoboTwin tasks from 87.2% to 89.1% fixed and 86.1% to 88.5% randomized while outperforming baselines...

UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models
cs.RO 2026-02 unverdicted novelty 7.0

UniLACT improves VLA models by adding depth-aware unified latent action pretraining that outperforms RGB-only baselines on seen and unseen manipulation tasks.