QDepth-VLA: Quantized Depth Prediction as Auxiliary Supervision for Vision-Language-Action Models

Dongbin Zhao; Haoran Li; Mingcai Zhou; Yixuan Li; Yuhui Chen; Zhengtao Zhang

arxiv: 2510.14836 · v3 · pith:ACXLWRPLnew · submitted 2025-10-16 · 💻 cs.CV · cs.RO

keywords depth · models · qdepth-vla · tasks · auxiliary · manipulation · prediction · quantized

Abstract

Spatial perception and reasoning are crucial for Vision-Language-Action (VLA) models to accomplish fine-grained manipulation tasks. However, existing approaches often lack the ability to understand and reason over the essential 3D structures necessary for precise control. To address this limitation, we propose QDepth-VLA, a general framework that augments VLA models with an auxiliary depth prediction task. A dedicated depth expert is designed to predict quantized latent tokens of depth maps obtained from a VQ-VAE encoder, enabling the model to learn depth-aware representations that capture critical geometric cues. Experimental results on the simulation benchmarks and real-world tasks demonstrate that QDepth-VLA yields strong spatial reasoning and competitive performance on manipulation tasks.
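
The auxiliary objective described in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not specify the loss, the loss weight, or the network shapes, so everything here is an assumption made for illustration: the function names `quantize`, `token_cross_entropy`, and `total_loss`, the use of cross-entropy over codebook indices, and the `aux_weight` value are hypothetical and should not be read as the authors' implementation. The sketch shows only the general idea: continuous depth latents are mapped to their nearest VQ-VAE codebook entries, and the resulting token indices serve as classification targets for a depth prediction head whose loss is added to the action loss.

```python
import numpy as np


def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry.

    latents:  (N, D) continuous depth latents (hypothetical encoder output)
    codebook: (K, D) VQ-VAE code vectors
    returns:  (N,) integer token indices
    """
    # Squared Euclidean distance between every latent and every code: (N, K)
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)


def token_cross_entropy(logits, targets):
    """Mean cross-entropy between logits (N, K) and target indices (N,)."""
    # Subtract the row maximum before exponentiating for numerical stability
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())


def total_loss(action_loss, depth_logits, depth_tokens, aux_weight=0.1):
    """Action loss plus a weighted depth-token loss (aux_weight is assumed)."""
    return action_loss + aux_weight * token_cross_entropy(depth_logits, depth_tokens)


# Toy example: two codes in a 2-D latent space
codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
latents = np.array([[0.1, -0.1], [0.9, 1.2]])
tokens = quantize(latents, codebook)   # nearest-code indices used as targets
depth_logits = np.zeros((2, 2))        # stand-in for a depth head's output
loss = total_loss(1.0, depth_logits, tokens, aux_weight=0.5)
```

In this toy setup the depth targets are discrete indices rather than raw depth values, so the auxiliary head solves a classification problem over the codebook; whether the paper uses this exact formulation is not stated in the abstract.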

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

EvoScene-VLA: Evolving Scene Beliefs Inside the Action Decoder for Chunked Robot Control
cs.RO · 2026-05 · conditional · novelty 7.0
EvoScene-VLA maintains an action-updated scene prior across control chunks in VLA policies, raising success rates on RoboTwin tasks from 87.2% to 89.1% fixed and 86.1% to 88.5% randomized while outperforming baselines...

UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models
cs.RO · 2026-02 · unverdicted · novelty 7.0
UniLACT improves VLA models by adding depth-aware unified latent action pretraining that outperforms RGB-only baselines on seen and unseen manipulation tasks.

3DThinkVLA: Endowing Vision-Language-Action Models with Latent 3D Priors via 3D-Thinking-Guided Co-training
cs.CV · 2026-06 · unverdicted · novelty 6.0
A 3D-thinking-guided co-training method disentangles geometry perception and spatial reasoning to inject latent 3D priors into VLA models via adapters, achieving SOTA on manipulation benchmarks while running on 2D ima...

GeoAlign: Beyond Semantics with State-Guided Spatial Alignment in VLA Models
cs.RO · 2026-06 · unverdicted · novelty 5.0
GeoAlign post-trains an RGB geometry branch on robot RGB-D data to produce GEP features that are queried by proprioceptive state to generate phase-dependent geometry tokens, yielding 99.0% on LIBERO, 85.3% on SimplerE...

OASIS: Observation-Action Space Alignment via SE(3) Trajectory Prediction for Robotic Manipulation
cs.RO · 2026-05 · unverdicted · novelty 5.0
OASIS improves robotic manipulation success and generalization by predicting camera-frame SE(3) end-effector trajectories to condition the action decoder on pose-supervised states.

Evo-Depth: A Lightweight Depth-Enhanced Vision-Language-Action Model
cs.CV · 2026-05 · unverdicted · novelty 4.0
Evo-Depth is a compact VLA model using a lightweight implicit depth encoder from RGB views plus progressive alignment to boost manipulation performance without added hardware.