StarVLA-$\alpha$: Reducing Complexity in Vision-Language-Action Systems

2026 · cs.RO · arXiv 2604.11757

6 Pith papers cite this work. Polarity classification is still indexing.

abstract

Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for building general-purpose robotic agents. However, the VLA landscape remains highly fragmented and complex, as existing approaches vary substantially in architectures, training data, embodiment configurations, and benchmark-specific engineering. In this work, we introduce StarVLA-$\alpha$, a simple yet strong baseline designed to study VLA design choices under controlled conditions. StarVLA-$\alpha$ deliberately minimizes architectural and pipeline complexity to reduce experimental confounders and enable systematic analysis. Specifically, we re-evaluate several key design axes, including action modeling strategies, robot-specific pretraining, and interface engineering. Across unified multi-benchmark training on LIBERO, SimplerEnv, RoboTwin, and RoboCasa, the same simple baseline remains highly competitive, indicating that a strong VLM backbone combined with minimal design is already sufficient to achieve strong performance without relying on additional architectural complexity or engineering tricks. Notably, our single generalist model outperforms $\pi_{0.5}$ by 20\% on the public real-world RoboChallenge benchmark. We expect StarVLA-$\alpha$ to serve as a solid starting point for future research in the VLA regime. Code will be released at https://github.com/starVLA/starVLA.

representative citing papers

EBench: Elemental Diagnosis of Generalist Mobile Manipulation Policies

cs.RO · 2026-06-16 · unverdicted · novelty 7.0

EBench is a benchmark that evaluates generalist mobile manipulation policies on 26 tasks across 5 capability and 4 generalization dimensions, revealing distinct capability profiles among models with similar success rates.

DiscreteRTC: Discrete Diffusion Policies are Natural Asynchronous Executors

cs.RO · 2026-04-27 · unverdicted · novelty 7.0

Discrete diffusion policies act as natural asynchronous executors for robotics by treating action generation as iterative unmasking, yielding higher success rates and lower computation than flow-matching real-time chunking in dynamic tasks.

VLAFlow: A Unified Training Framework for Vision-Language-Action Models via Co-training and Future Latent Alignment

cs.CV · 2026-07-02 · unverdicted · novelty 6.0

VLAFlow shows that combining language-supervised co-training with future latent alignment produces the most stable transfer performance for vision-language-action models trained on mixed robot data.

3DThinkVLA: Endowing Vision-Language-Action Models with Latent 3D Priors via 3D-Thinking-Guided Co-training

cs.CV · 2026-06-03 · unverdicted · novelty 6.0

A 3D-thinking-guided co-training method disentangles geometry perception and spatial reasoning to inject latent 3D priors into VLA models via adapters, achieving SOTA on manipulation benchmarks while running on 2D images only.

SPARC: Reliable Spatial Annotations from Robot Demonstrations at Scale

cs.RO · 2026-06-11 · unverdicted · novelty 5.0

SPARC generates reliable spatial annotations for robot demonstrations by leveraging spatio-temporal task structure, outperforming detection baselines on localization accuracy while retaining more samples and enabling competitive model performance without manual annotations.

Efficient-WAM: A 1B-Parameter World-Action Model with Low-Cost Future Imagination

cs.RO · 2026-06-08 · unverdicted · novelty 5.0

Efficient-WAM delivers 30x lower latency than prior WAMs at 100 ms per chunk while keeping competitive manipulation performance by treating coarse future video as guidance rather than high-fidelity output.

citing papers explorer

EBench: Elemental Diagnosis of Generalist Mobile Manipulation Policies · cs.RO · 2026-06-16 · unverdicted · none · ref 17 · internal anchor
DiscreteRTC: Discrete Diffusion Policies are Natural Asynchronous Executors · cs.RO · 2026-04-27 · unverdicted · none · ref 38 · internal anchor
SPARC: Reliable Spatial Annotations from Robot Demonstrations at Scale · cs.RO · 2026-06-11 · unverdicted · none · ref 68 · internal anchor
Efficient-WAM: A 1B-Parameter World-Action Model with Low-Cost Future Imagination · cs.RO · 2026-06-08 · unverdicted · none · ref 36 · internal anchor
