Evo-Depth: A Lightweight Depth-Enhanced Vision-Language-Action Model

Bing Cheng; Bo Zhao; Gen Li; Hongyi Cai; Jiting Liu; Junchi Yan; Kai Ye; Mingkang Dong; Nuobei Zhu; Tao Lin

arxiv: 2605.14950 · v1 · pith:HLMHWWYZnew · submitted 2026-05-14 · 💻 cs.CV · cs.RO


Tao Lin, Yuxin Du, Jiting Liu, Nuobei Zhu, Yunhe Li, Yuqian Fu, Yinxinyu Chen, Hongyi Cai, Zewei Ye, Bing Cheng, Kai Ye, Yiran Mao, Yilei Zhong, Mingkang Dong, Junchi Yan, Gen Li, Bo Zhao


classification cs.CV, cs.RO

keywords evo-depth, depth, spatial, depth-enhanced, lightweight, models, often, representations


Vision-Language-Action models have emerged as a promising paradigm for robotic manipulation by unifying perception, language grounding, and action generation. However, they often struggle in scenarios requiring precise spatial understanding, as current VLA models primarily rely on 2D visual representations that lack depth information and detailed spatial relationships. While recent approaches incorporate explicit 3D inputs such as depth maps or point clouds to address this issue, they often increase system complexity, require additional sensors, and remain vulnerable to sensing noise and reconstruction errors. Another line of work explores implicit 3D-aware spatial modeling directly from RGB observations without extra sensors, but it often relies on large geometry foundation models, resulting in higher training and deployment costs. To address these challenges, we propose Evo-Depth, a lightweight depth-enhanced VLA framework that enhances spatially grounded manipulation without relying on additional sensing hardware or compromising deployment efficiency. Evo-Depth employs a lightweight Implicit Depth Encoding Module to extract compact depth features from multi-view RGB images. These features are incorporated into vision-language representations through a Spatial Enhancement Module via depth-aware modulation, enabling efficient spatial-semantic enhancement. A Progressive Alignment Training strategy is further introduced to align the resulting depth-enhanced representations with downstream action learning. With only 0.9B parameters, Evo-Depth achieves superior performance across four simulation benchmarks. In real-world experiments, Evo-Depth attains the highest average success rate while also exhibiting the smallest model size, lowest GPU memory usage, and highest inference frequency among compared methods.
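The abstract says depth features are injected into vision-language representations "via depth-aware modulation" but does not give the exact form. One common way to realize such modulation is a FiLM-style scale and shift conditioned on the depth feature. The sketch below illustrates that idea only. The function names, the residual `1 + gamma` form, and the single-vector depth feature are all assumptions for illustration, not the paper's implementation.

```python
def linear(x, weight, bias):
    """Apply y = W x + b to one vector x; weight is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def depth_aware_modulation(tokens, depth_feature, gamma_params, beta_params):
    """Hypothetical FiLM-style modulation (assumed, not from the paper).

    tokens:        list of vision-language token vectors, each of size D
    depth_feature: one compact depth vector of size C
    gamma_params:  (weight, bias) projecting C -> D, producing the scale
    beta_params:   (weight, bias) projecting C -> D, producing the shift
    """
    gamma = linear(depth_feature, *gamma_params)
    beta = linear(depth_feature, *beta_params)
    # Residual form: zero projections leave the tokens unchanged, so the
    # module starts as an identity map on the pretrained representation.
    return [[t * (1.0 + g) + b for t, g, b in zip(token, gamma, beta)]
            for token in tokens]


# Toy usage: two 2-dimensional tokens and a 3-dimensional depth feature.
tokens = [[1.0, 2.0], [3.0, 4.0]]
depth_feature = [0.5, -0.5, 1.0]
zero_proj = ([[0.0] * 3 for _ in range(2)], [0.0, 0.0])
unchanged = depth_aware_modulation(tokens, depth_feature, zero_proj, zero_proj)
```

Zero-initializing the two projections is a common design choice for this kind of module: the model then begins from its original vision-language features and learns how much depth signal to mix in during training.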


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LA4VLA: Learning to Act without Seeing via Language-Action Pretraining
cs.RO 2026-06 unverdicted novelty 6.0

LA4VLA pretrains on language-action pairs from decomposed demonstrations to create reusable action priors, yielding up to 45 percentage point gains in real-world VLA success rates when mixed with standard training.

GIVE: Grounding Human Gestures in Vision-Language-Action Models
cs.RO 2026-06 unverdicted novelty 5.0

GIVE improves pre-trained VLA models for robotic tasks by incorporating gestures via visual skeleton overlays and semantic descriptions, yielding 40% higher object recognition accuracy and 80% higher task success in r...