Pith · machine review for the scientific record

arXiv: 2604.05014 · v1 · submitted 2026-04-06 · 💻 cs.RO · cs.AI · cs.CV

Recognition: no theorem link

StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:57 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.CV
keywords vision-language-action models · modular codebase · embodied agents · reproducible training · benchmark integration · open-source framework · cross-embodiment learning
0 comments

The pith

A modular backbone-action-head architecture unifies fragmented VLA research under one shared abstraction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents an open codebase that tackles the incompatibility of existing vision-language-action methods by defining a single structure where any supported backbone can pair with any supported action head. This structure keeps training strategies such as cross-embodiment learning and multimodal co-training identical across combinations. The same interface also wires together several major benchmarks for both simulation and real-robot use. The authors demonstrate that straightforward, single-benchmark training recipes built on this structure already reach or exceed earlier published results on multiple tasks when using either vision-language-model or world-model backbones.

Core claim

StarVLA establishes a modular backbone-action-head architecture under a shared abstraction that supports independent swapping of components, from VLM and world-model backbones to various action paradigms, while providing reusable training strategies and integrated benchmarks that enable simple, reproducible recipes to achieve competitive performance.

What carries the argument

The modular backbone-action-head architecture with shared abstraction that lets backbone and action head be swapped independently while preserving training and evaluation interfaces.
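
To make the abstraction concrete, here is a minimal sketch of what such a backbone-action-head composition could look like in Python. All names here (`Backbone`, `ActionHead`, `VLAPolicy`, `encode`, `decode`, and the commented constructors) are illustrative assumptions, not StarVLA's actual API.

```python
from typing import Any, Protocol, Sequence

class Backbone(Protocol):
    """Maps raw observations and a language instruction to shared latents."""
    def encode(self, images: Any, instruction: str) -> Any: ...

class ActionHead(Protocol):
    """Decodes shared latents into an action chunk."""
    def decode(self, latents: Any) -> Sequence[float]: ...

class VLAPolicy:
    """Composes any Backbone with any ActionHead behind one interface.

    Because training and evaluation only see this class, swapping either
    component requires no change to the surrounding pipeline.
    """
    def __init__(self, backbone: Backbone, head: ActionHead) -> None:
        self.backbone = backbone
        self.head = head

    def act(self, images: Any, instruction: str) -> Sequence[float]:
        latents = self.backbone.encode(images, instruction)
        return self.head.decode(latents)

# Swapping is then a one-line change at construction time, e.g.:
# policy = VLAPolicy(QwenVLBackbone(), FlowMatchingHead())  # VLM backbone
# policy = VLAPolicy(CosmosBackbone(), FlowMatchingHead())  # world-model backbone
```

Structural typing via `Protocol` would let third-party components plug in without inheriting from a framework base class; whether StarVLA uses protocols, abstract base classes, or a registry is not specified in the abstract.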

If this is right

  • New VLA variants can be prototyped by exchanging only the backbone or only the action head without rewriting training or evaluation code.
  • The same training recipes and evaluation protocols apply uniformly to both vision-language-model and world-model approaches.
  • Reproduction of prior methods and direct comparison across benchmarks become possible from a single codebase.
  • Unified interfaces for simulation and real-robot deployment reduce the engineering cost of moving between environments; one way such an interface could look is sketched below.
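
A hedged sketch of a unified evaluation interface follows; the `Env` protocol, the registry, and the `evaluate` signature are assumptions for illustration, and only the benchmark names come from the paper.

```python
from typing import Callable, Dict, Protocol, Sequence, Tuple

class Env(Protocol):
    """Common surface an environment exposes, whether simulated or real."""
    def reset(self) -> Tuple[object, str]: ...  # returns (observation, instruction)
    def step(self, action: Sequence[float]) -> Tuple[object, bool]: ...  # (obs, done)
    def success(self) -> bool: ...

# Hypothetical registry; in the paper's terms the keys would cover LIBERO,
# SimplerEnv, RoboTwin 2.0, RoboCasa-GR1, and BEHAVIOR-1K.
BENCHMARKS: Dict[str, Callable[[], Env]] = {}

def evaluate(policy, benchmark: str, episodes: int = 50) -> float:
    """Runs one policy against one registered benchmark; returns success rate.

    Because the loop only touches the Env protocol, the same code can serve
    simulation and real-robot deployment.
    """
    env = BENCHMARKS[benchmark]()
    successes = 0
    for _ in range(episodes):
        obs, instruction = env.reset()
        done = False
        while not done:
            action = policy.act(obs, instruction)  # any VLAPolicy-style object
            obs, done = env.step(action)
        successes += int(env.success())
    return successes / episodes
```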

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The framework could shorten the time from idea to working agent by removing repeated data-pipeline and evaluation work.
  • Direct head-to-head testing of VLM-based versus world-model-based agents on identical tasks becomes feasible for the first time.
  • Extensions that add new action paradigms or new benchmarks would automatically inherit the existing training and evaluation machinery.

Load-bearing premise

The shared modular abstraction preserves performance and compatibility when different backbones and action heads are swapped in.

What would settle it

A controlled swap of a new backbone into the framework that produces benchmark scores materially below the scores reported for that same backbone in its original non-modular implementation.
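
As a sketch, that test could be scored with a check like the one below; the 5-point margin is an illustrative assumption, not a threshold the paper defines.

```python
def materially_below(framework_score: float, native_score: float,
                     margin_points: float = 5.0) -> bool:
    """True if the framework-ported backbone scores more than `margin_points`
    success-rate points below its original, non-modular implementation.

    The margin is an illustrative assumption; a real test should average over
    seeds and account for run-to-run variance before flagging a regression.
    """
    return (native_score - framework_score) > margin_points

# e.g. native implementation: 92.0% success; the same backbone swapped into
# the framework: 84.5% -> materially_below(84.5, 92.0) is True, which would
# undercut the load-bearing premise above.
```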

Figures

Figures reproduced from arXiv: 2604.05014 by StarVLA Community.

Figure 1
Figure 1. Conceptual view of the unified VLA formulation adopted in StarVLA. A policy… [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Overview of four representative approaches for adapting Vision-Language Models into Vision… [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Overview of the StarVLA framework. We present a unified and modular pipeline that connects… [PITH_FULL_IMAGE:figures/full_fig p007_3.png] view at source ↗
Figure 4
Figure 4. Perception–action co-optimization dynamics under different co-training strategies (reproduced… [PITH_FULL_IMAGE:figures/full_fig_p015_4.png] view at source ↗
Figure 5
Figure 5. Per-step latency and throughput on a single 8-GPU node. Left: step latency as a function of per-GPU batch size for our method on A100 and H200, compared with LingBot-VLA and Dexbotic (both on 8×H200). Right: training throughput and GPU utilization on 8×A100 across batch sizes. view at source ↗
Figure 6
Figure 6. Multi-node scaling efficiency. Left: per-step latency rises noticeably from 8 to 32 GPUs due to inter-node communication overhead, then plateaus between 64 and 256 GPUs. Right: measured sample throughput versus ideal linear scaling; parallel efficiency stabilizes around 79–80% beyond 32 GPUs. [PITH_FULL_IMAGE:figures/full_fig_p017_6.png] view at source ↗
read the original abstract

Building generalist embodied agents requires integrating perception, language understanding, and action, which are core capabilities addressed by Vision-Language-Action (VLA) approaches based on multimodal foundation models, including recent advances in vision-language models and world models. Despite rapid progress, VLA methods remain fragmented across incompatible architectures, codebases, and evaluation protocols, hindering principled comparison and reproducibility. We present StarVLA, an open-source codebase for VLA research. StarVLA addresses these challenges in three aspects. First, it provides a modular backbone–action-head architecture that supports both VLM backbones (e.g., Qwen-VL) and world-model backbones (e.g., Cosmos) alongside representative action-decoding paradigms, all under a shared abstraction in which backbone and action head can each be swapped independently. Second, it provides reusable training strategies, including cross-embodiment learning and multimodal co-training, that apply consistently across supported paradigms. Third, it integrates major benchmarks, including LIBERO, SimplerEnv, RoboTwin 2.0, RoboCasa-GR1, and BEHAVIOR-1K, through a unified evaluation interface that supports both simulation and real-robot deployment. StarVLA also ships simple, fully reproducible single-benchmark training recipes that, despite minimal data engineering, already match or surpass prior methods on multiple benchmarks with both VLM and world-model backbones. To the best of our knowledge, StarVLA is one of the most comprehensive open-source VLA frameworks available, and we expect it to lower the barrier for reproducing existing methods and prototyping new ones. StarVLA is being actively maintained and expanded; we will update this report as the project evolves. The code and documentation are available at https://github.com/starVLA/starVLA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces StarVLA, an open-source codebase for Vision-Language-Action (VLA) model development. It proposes a modular backbone–action-head architecture supporting independent swaps of VLM backbones (e.g., Qwen-VL) and world-model backbones (e.g., Cosmos) along with representative action-decoding paradigms under a shared abstraction; provides reusable training strategies such as cross-embodiment learning and multimodal co-training; integrates benchmarks including LIBERO, SimplerEnv, RoboTwin 2.0, RoboCasa-GR1, and BEHAVIOR-1K through a unified evaluation interface; and ships simple, fully reproducible single-benchmark training recipes that are claimed to match or surpass prior methods on these benchmarks despite minimal data engineering.

Significance. A well-maintained, modular, and reproducible codebase with integrated benchmarks and explicit training recipes would meaningfully lower barriers to entry and enable principled comparisons in the fragmented VLA field. The explicit release of code, documentation, and single-benchmark recipes is a concrete strength that supports direct reproduction and extension.

major comments (1)
  1. [Abstract] The central claim that 'simple, fully reproducible single-benchmark training recipes ... already match or surpass prior methods on multiple benchmarks with both VLM and world-model backbones' is load-bearing for the paper's contribution but is presented without any accompanying experimental tables, ablation studies, success-rate comparisons, or training curves. This leaves unverified whether the shared modular abstraction preserves representational capacity and training dynamics relative to native implementations of the cited backbones.
minor comments (1)
  1. The description of the unified evaluation interface would benefit from an explicit API-level example or pseudocode showing how backbone and action-head swaps are performed without hidden compatibility adjustments.
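
As an editorial illustration of the kind of example the minor comment requests, here is a minimal config-driven sketch; the factory tables, config keys, and key values are hypothetical, not drawn from the StarVLA repository.

```python
from typing import Any, Callable, Dict, Tuple

# Hypothetical factory tables keyed by config strings; a framework would
# populate them from the modules implementing each component.
BACKBONE_FACTORIES: Dict[str, Callable[[], Any]] = {}
HEAD_FACTORIES: Dict[str, Callable[[], Any]] = {}

def build_policy(config: Dict[str, str]) -> Tuple[Any, Any]:
    """Resolves a backbone and an action head from a flat config.

    Swapping components means editing two strings; the trainer and the
    evaluation loop are untouched, with no hidden compatibility shims.
    """
    backbone = BACKBONE_FACTORIES[config["backbone"]]()
    head = HEAD_FACTORIES[config["action_head"]]()
    return backbone, head

# Usage with hypothetical keys:
#   build_policy({"backbone": "qwen_vl", "action_head": "flow_matching"})
#   build_policy({"backbone": "cosmos",  "action_head": "flow_matching"})
```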

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the value of a modular, reproducible VLA codebase. We address the single major comment below.

read point-by-point responses
  1. Referee: [Abstract] The central claim that 'simple, fully reproducible single-benchmark training recipes ... already match or surpass prior methods on multiple benchmarks with both VLM and world-model backbones' is load-bearing for the paper's contribution but is presented without any accompanying experimental tables, ablation studies, success-rate comparisons, or training curves. This leaves unverified whether the shared modular abstraction preserves representational capacity and training dynamics relative to native implementations of the cited backbones.

    Authors: We agree that the performance claim is central and requires explicit empirical support within the manuscript. In the revised version we will add a dedicated experimental section containing success-rate tables on LIBERO, SimplerEnv, RoboTwin 2.0, RoboCasa-GR1 and BEHAVIOR-1K, direct comparisons against the original native implementations of the cited backbones, and ablations that isolate the effect of the shared modular interface. Training curves will also be included to document convergence behavior. These additions will verify that the abstraction does not degrade representational capacity or training dynamics. revision: yes

Circularity Check

0 steps flagged

No derivation chain or fitted predictions; codebase and benchmark presentation only

full rationale

The manuscript describes a modular software framework (backbone-action-head abstraction, training strategies, unified benchmark interface) and states that its provided recipes match or surpass prior methods on external benchmarks. No equations, first-principles derivations, parameter fitting, or predictions appear in the text. Performance claims are empirical reports of running the released code against public benchmarks, with no self-referential reduction of outputs to inputs by construction. The work is self-contained as an engineering artifact release; any circularity would require hidden data or hyperparameter advantages, which is a reproducibility concern rather than a definitional or derivation circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on standard assumptions about software modularity and benchmark validity rather than new fitted parameters or invented physical entities.

axioms (1)
  • domain assumption: A modular backbone-action-head abstraction can support independent component swapping without loss of functionality or performance
    Invoked in the description of the shared abstraction for VLM and world-model backbones.

pith-pipeline@v0.9.0 · 5625 in / 1223 out tokens · 56923 ms · 2026-05-10T18:57:44.642245+00:00 · methodology

discussion (0)


Forward citations

Cited by 16 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. RotVLA: Rotational Latent Action for Vision-Language-Action Model

    cs.RO 2026-05 unverdicted novelty 7.0

    RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.

  2. LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models

    cs.AI 2026-05 unverdicted novelty 7.0

    LoopVLA adds recurrent refinement and learned sufficiency estimation to VLA models, cutting parameters 45% and raising throughput 1.7x while matching baseline task success on LIBERO and VLA-Arena.

  3. Towards Backdoor-Based Ownership Verification for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 7.0

    GuardVLA embeds a stealthy backdoor watermark in VLAs via secret messages in visual data and uses a swap-and-detect mechanism for post-release ownership verification that preserves task performance.

  4. VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts

    cs.RO 2026-05 unverdicted novelty 7.0

    VLA-GSE improves VLA adaptation by initializing generalized shared experts and specialized routed experts via spectral decomposition of the backbone, outperforming full fine-tuning and other PEFT methods on robotic be...

  5. Being-H0.7: A Latent World-Action Model from Egocentric Videos

    cs.RO 2026-04 unverdicted novelty 7.0

    Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

  6. DiscreteRTC: Discrete Diffusion Policies are Natural Asynchronous Executors

    cs.RO 2026-04 unverdicted novelty 7.0

    Discrete diffusion policies support native asynchronous execution via unmasking for real-time chunking, delivering higher success rates and 0.7x inference cost versus flow-matching RTC on dynamic robotics benchmarks a...

  7. Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond

    cs.AI 2026-04 unverdicted novelty 7.0

    Proposes a levels x laws taxonomy for world models in AI agents, defining L1-L3 capabilities across physical, digital, social, and scientific regimes while reviewing over 400 works to outline a roadmap for advanced ag...

  8. Pelican-Unified 1.0: A Unified Embodied Intelligence Model for Understanding, Reasoning, Imagination and Action

    cs.RO 2026-05 unverdicted novelty 6.0

    Pelican-Unified 1.0 trains a single VLM plus Unified Future Generator to jointly optimize understanding, reasoning, future video prediction, and action generation, reporting top-tier scores on VLM, WorldArena, and Rob...

  9. Geometry Guided Self-Consistency for Physical AI

    cs.RO 2026-05 unverdicted novelty 6.0

    KeyStone improves task success rates in diffusion-based physical AI models by up to 13.3% by sampling K trajectories in parallel, clustering them in action space, and returning the medoid of the largest cluster.

  10. Long-Horizon Manipulation via Trace-Conditioned VLA Planning

    cs.RO 2026-04 unverdicted novelty 6.0

    LoHo-Manip enables robust long-horizon robot manipulation by using a receding-horizon VLM manager to output progress-aware subtask sequences and 2D visual traces that condition a VLA executor for automatic replanning.

  11. Learning Action Manifold with Multi-view Latent Priors for Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 5.0

    The method uses multi-view diffusion priors and action manifold learning to resolve depth ambiguity and improve action prediction in VLA robotic manipulation models, reporting higher success rates than baselines on LI...

  12. VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts

    cs.RO 2026-05 unverdicted novelty 5.0

    VLA-GSE uses spectral decomposition of the VLA backbone to create generalized and specialized experts, enabling effective robot task adaptation while updating only 2.51% of parameters and achieving 81.2% zero-shot suc...

  13. VLA Foundry: A Unified Framework for Training Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 5.0

    VLA Foundry provides a single training stack for VLA models and releases open models that match prior closed-source performance or outperform baselines on multi-task manipulation in simulation.

  14. RLDX-1 Technical Report

    cs.RO 2026-05 unverdicted novelty 4.0

    RLDX-1 achieves 86.8% success on complex ALLEX humanoid manipulation tasks where prior VLAs reach only around 40%.

  15. RLDX-1 Technical Report

    cs.RO 2026-05 unverdicted novelty 4.0

    RLDX-1 outperforms frontier VLAs such as π0.5 and GR00T N1.6 on dexterous manipulation benchmarks, reaching 86.8% success on ALLEX humanoid tasks versus around 40% for the baselines.

  16. JoyAI-RA 0.1: A Foundation Model for Robotic Autonomy

    cs.RO 2026-04 unverdicted novelty 4.0

    JoyAI-RA is a multi-source pretrained VLA model that claims to bridge human-to-robot embodiment gaps via data unification and outperforms prior methods on generalization-heavy robotic tasks.

Reference graph

Works this paper leans on

3 extracted references · 3 canonical work pages · cited by 14 Pith papers · 2 internal anchors

  1. [1]

    V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    Assran, M., Bardes, A., Fan, D., Garrido, Q., Howes, R., Komeili, M., Muckley, M., Rizvi, A., Roberts, C., Sinha, K., et al. (2025).

  2. [2]

    Ctrl-World: A Controllable Generative World Model for Robot Manipulation

    Guo, Y., Shi, L. X., Chen, J., and Finn, C. (2025). arXiv preprint arXiv:2510.10125.

  3. [3]

    Visual Instruction Tuning

    Liu, H., Li, C., Wu, Q., and Lee, Y. J. (2023). CoRR, abs/2304.08485.