{"work":{"id":"e9be5436-00f8-4b43-b82a-ff154145e079","openalex_id":null,"doi":null,"arxiv_id":"2604.05014","raw_key":null,"title":"StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing","authors":null,"authors_text":null,"year":2026,"venue":"cs.RO","abstract":"Building generalist embodied agents requires integrating perception, language understanding, and action, which are core capabilities addressed by Vision-Language-Action (VLA) approaches based on multimodal foundation models, including recent advances in vision-language models and world models. Despite rapid progress, VLA methods remain fragmented across incompatible architectures, codebases, and evaluation protocols, hindering principled comparison and reproducibility. We present StarVLA, an open-source codebase for VLA research. StarVLA addresses these challenges in three aspects. First, it provides a modular backbone–action-head architecture that supports both VLM backbones (e.g., Qwen-VL) and world-model backbones (e.g., Cosmos) alongside representative action-decoding paradigms, all under a shared abstraction in which backbone and action head can each be swapped independently. Second, it provides reusable training strategies, including cross-embodiment learning and multimodal co-training, that apply consistently across supported paradigms. Third, it integrates major benchmarks, including LIBERO, SimplerEnv, RoboTwin 2.0, RoboCasa-GR1, and BEHAVIOR-1K, through a unified evaluation interface that supports both simulation and real-robot deployment. StarVLA also ships simple, fully reproducible single-benchmark training recipes that, despite minimal data engineering, already match or surpass prior methods on multiple benchmarks with both VLM and world-model backbones. To our best knowledge, StarVLA is one of the most comprehensive open-source VLA frameworks available, and we expect it to lower the barrier for reproducing existing methods and prototyping new ones. 
StarVLA is being actively maintained and expanded; we will update this report as the project evolves. The code and documentation are available at https://github.com/starVLA/starVLA.","external_url":"https://arxiv.org/abs/2604.05014","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-22T09:46:22.423054+00:00","pith_arxiv_id":"2604.05014","created_at":"2026-05-09T06:05:35.050928+00:00","updated_at":"2026-06-05T21:23:00.469572+00:00","title_quality_ok":true,"display_title":"StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing","render_title":"StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing"},"hub":{"state":{"work_id":"e9be5436-00f8-4b43-b82a-ff154145e079","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":15,"external_cited_by_count":null,"distinct_field_count":2,"first_pith_cited_at":"2026-04-21T17:51:51+00:00","last_pith_cited_at":"2026-05-14T18:11:47+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-09T17:55:13.313804+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":6},{"context_role":"baseline","n":2},{"context_role":"method","n":2}],"polarity_counts":[{"context_polarity":"background","n":5},{"context_polarity":"baseline","n":2},{"context_polarity":"use_method","n":2},{"context_polarity":"unclear","n":1}],"runs":{},"summary":{},"graph":{},"authors":[]}}