arxiv: 2510.14836 · v3 · pith:ACXLWRPLnew · submitted 2025-10-16 · 💻 cs.CV · cs.RO

QDepth-VLA: Quantized Depth Prediction as Auxiliary Supervision for Vision-Language-Action Models

classification 💻 cs.CV cs.RO
keywords depth, models, qdepth-vla, tasks, auxiliary, manipulation, prediction, quantized

Spatial perception and reasoning are crucial for Vision-Language-Action (VLA) models to accomplish fine-grained manipulation tasks. However, existing approaches often lack the ability to understand and reason over the 3D structures needed for precise control. To address this limitation, we propose QDepth-VLA, a general framework that augments VLA models with an auxiliary depth prediction task. A dedicated depth expert is designed to predict quantized latent tokens of depth maps obtained from a VQ-VAE encoder, enabling the model to learn depth-aware representations that capture critical geometric cues. Experimental results on simulation benchmarks and real-world tasks demonstrate that QDepth-VLA yields strong spatial reasoning and competitive performance on manipulation tasks.
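
The auxiliary objective described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state the loss or its weighting, so the cross-entropy term, the `depth_weight` value, and every name below (`quantize`, `cross_entropy`, `total_loss`) are assumptions made for illustration. The sketch shows the general idea of turning depth latents into discrete codebook indices and supervising a predictor on those indices alongside an action loss.

```python
import math

def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry
    (the standard VQ-VAE nearest-neighbour lookup)."""
    tokens = []
    for z in latents:
        dists = [sum((a - b) ** 2 for a, b in zip(z, c)) for c in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens

def cross_entropy(logits, targets):
    """Mean cross-entropy between rows of logits and integer target tokens."""
    total = 0.0
    for row, t in zip(logits, targets):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[t]
    return total / len(targets)

def total_loss(action_loss, depth_logits, depth_tokens, depth_weight=0.1):
    """Action loss plus a weighted auxiliary depth-token loss.
    depth_weight is a hypothetical hyperparameter, not taken from the paper."""
    return action_loss + depth_weight * cross_entropy(depth_logits, depth_tokens)

# Toy data: a 3-entry codebook in 2-D and three depth latents.
codebook = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
latents = [[0.9, 0.1], [-0.8, 0.2], [0.1, 1.2]]
tokens = quantize(latents, codebook)

# Stand-in for the depth expert's output: one row of logits per latent.
depth_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 2.0], [0.0, 2.0, 0.0]]
loss = total_loss(action_loss=1.0, depth_logits=depth_logits, depth_tokens=tokens)
```

In this toy setup the discrete tokens act as classification targets, so the depth branch is trained like a token predictor rather than a per-pixel regressor; whether QDepth-VLA uses exactly this formulation is not stated in the abstract.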

This paper has not been read by Pith yet.

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. EvoScene-VLA: Evolving Scene Beliefs Inside the Action Decoder for Chunked Robot Control

    cs.RO 2026-05 conditional novelty 7.0

    EvoScene-VLA maintains an action-updated scene prior across control chunks in VLA policies, raising success rates on RoboTwin tasks from 87.2% to 89.1% fixed and 86.1% to 88.5% randomized while outperforming baselines...

  2. UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models

    cs.RO 2026-02 unverdicted novelty 7.0

    UniLACT improves VLA models by adding depth-aware unified latent action pretraining that outperforms RGB-only baselines on seen and unseen manipulation tasks.

  3. 3DThinkVLA: Endowing Vision-Language-Action Models with Latent 3D Priors via 3D-Thinking-Guided Co-training

    cs.CV 2026-06 unverdicted novelty 6.0

    A 3D-thinking-guided co-training method disentangles geometry perception and spatial reasoning to inject latent 3D priors into VLA models via adapters, achieving SOTA on manipulation benchmarks while running on 2D ima...

  4. GeoAlign: Beyond Semantics with State-Guided Spatial Alignment in VLA Models

    cs.RO 2026-06 unverdicted novelty 5.0

    GeoAlign post-trains an RGB geometry branch on robot RGB-D data to produce GEP features that are queried by proprioceptive state to generate phase-dependent geometry tokens, yielding 99.0% on LIBERO, 85.3% on SimplerE...

  5. OASIS: Observation-Action Space Alignment via SE(3) Trajectory Prediction for Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 5.0

    OASIS improves robotic manipulation success and generalization by predicting camera-frame SE(3) end-effector trajectories to condition the action decoder on pose-supervised states.

  6. Evo-Depth: A Lightweight Depth-Enhanced Vision-Language-Action Model

    cs.CV 2026-05 unverdicted novelty 4.0

    Evo-Depth is a compact VLA model using a lightweight implicit depth encoder from RGB views plus progressive alignment to boost manipulation performance without added hardware.