StereoVLA: Enhancing Vision-Language-Action Models with Stereo Vision

Heming Cui; He Wang; Jiayi Su; Mi Yan; Shengliang Deng; Wenhao Zhang; Xiaoguang Zhao; Yitao Zeng; Yixin Zheng; Zhizheng Zhang

arxiv: 2512.21970 · v2 · pith:YC224JSVnew · submitted 2025-12-26 · 💻 cs.RO

Shengliang Deng, Mi Yan, Yixin Zheng, Jiayi Su, Wenhao Zhang, Yitao Zeng, Xiaoguang Zhao, Heming Cui, Zhizheng Zhang, He Wang

keywords: geometric, stereovla, cues, models, semantic, spatial, camera, estimation

Abstract

While Vision-Language-Action (VLA) models excel in generalist manipulation, they often lack fine-grained spatial awareness and show limited viewpoint robustness. This limitation largely stems from the reliance on pretrained RGB encoders, which lack explicit geometric cues and prioritize semantic alignment over geometric representation. We argue that effective visual representations for VLA models must jointly encode both semantic and geometric information. In this paper, we introduce StereoVLA, the first VLA model to incorporate rich geometric cues from large-scale synthetic stereo data. StereoVLA employs a Geometric-and-Semantic (GeoSem) vision encoder that extracts geometric cues from subtle stereo-view disparities for precise spatial perception, while simultaneously capturing semantic features from pixel observations to support language-conditioned manipulation. Additionally, we introduce two synergistic co-training objectives: Interaction-Region Depth Estimation for precise spatial reasoning, and Camera Parameter Estimation to implicitly align perception and action coordinate systems. Compared with baselines that employ various input modalities, StereoVLA achieves a 33.4% absolute gain in success rate in real-world experiments and demonstrates robustness to near-hemispheric camera perspectives. Project page: https://shengliangd.github.io/StereoVLA-Webpage.
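
The abstract's point that stereo-view disparities carry geometric cues can be illustrated with the textbook relation for a rectified pinhole stereo pair, Z = f * B / d. The sketch below is not the paper's GeoSem encoder or its depth-estimation objective; the function name `disparity_to_depth` and the rig parameters (600 px focal length, 6 cm baseline) are hypothetical values chosen only for illustration.

```python
def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Depth in metres of a point seen by a rectified stereo pair.

    Uses the standard pinhole relation Z = f * B / d, where f is the focal
    length in pixels, B the baseline in metres, and d the disparity in pixels.
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point at finite depth")
    return focal_px * baseline_m / disparity_px


# Hypothetical rig (assumed values, not taken from the paper).
FOCAL_PX = 600.0
BASELINE_M = 0.06

# A one-pixel change in disparity shifts the estimated depth at this range.
near = disparity_to_depth(60.0, FOCAL_PX, BASELINE_M)
far = disparity_to_depth(59.0, FOCAL_PX, BASELINE_M)
depth_shift_m = far - near
```

With these assumed numbers, a disparity of 60 px corresponds to a depth of 0.6 m, and a single pixel of disparity error moves the estimate by roughly a centimetre, which is why small disparities matter for precise manipulation.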

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MolmoAct2: Action Reasoning Models for Real-world Deployment
cs.RO 2026-05 unverdicted novelty 7.0

MolmoAct2 delivers an open VLA model with new specialized components, datasets, and techniques that outperforms baselines on benchmarks while releasing all weights, code, and data for real-world robot use.
GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization
cs.RO 2026-05 unverdicted novelty 6.0

GuidedVLA improves VLA success rates by manually supervising separate attention heads in the action decoder with auxiliary signals for task-relevant factors.
MolmoAct2: Action Reasoning Models for Real-world Deployment
cs.RO 2026-05 unverdicted novelty 6.0

MolmoAct2 is an open VLA model that outperforms baselines like Pi-05 on 7 benchmarks and whose backbone surpasses GPT-5 on 13 embodied-reasoning tasks through new datasets, specialized training, and architecture chang...
E-VLA: Event-Augmented Vision-Language-Action Model for Dark and Blurred Scenes
cs.CV 2026-04 conditional novelty 6.0

E-VLA integrates event streams directly into VLA models via lightweight fusion, raising Pick-Place success from 0% to 60-90% at 20 lux and from 0% to 20-25% under severe motion blur.
Event-VLA: Action-Conditioned Event Fusion for Robust Vision-Language-Action Model
cs.CV 2026-06 unverdicted novelty 5.0

Event-VLA integrates event streams into VLA models through action-conditioned gated cross-attention to maintain performance in normal light while improving success rates under low-light and near-dark conditions.