{"work":{"id":"1f46210b-e0d5-42ba-9d2b-29f3b33f07b9","openalex_id":null,"doi":null,"arxiv_id":"2512.22539","raw_key":null,"title":"VLA-Arena: An Open-Source Framework for Benchmarking Vision-Language-Action Models","authors":null,"authors_text":"Borong Zhang, Jiahao Li, Jiachen Shen, Yishuai Cai, Yuhao Zhang, Yuanpei Chen, Juntao Dai, Jiaming Ji, and Yaodong Yang","year":2025,"venue":"cs.RO","abstract":"While Vision-Language-Action models (VLAs) are rapidly advancing towards generalist robot policies, it remains difficult to quantitatively understand their limits and failure modes. To address this, we introduce a comprehensive benchmark called VLA-Arena. We propose a novel structured task design framework to quantify difficulty across three orthogonal axes: (1) Task Structure, (2) Language Command, and (3) Visual Observation. This allows us to systematically design tasks with fine-grained difficulty levels, enabling a precise measurement of model capability frontiers. For Task Structure, VLA-Arena's 170 tasks are grouped into four dimensions: Safety, Distractor, Extrapolation, and Long Horizon. Each task is designed with three difficulty levels (L0-L2), with fine-tuning performed exclusively on L0 to assess general capability. Orthogonal to this, language (W0-W4) and visual (V0-V4) perturbations can be applied to any task to enable a decoupled analysis of robustness. Our extensive evaluation of state-of-the-art VLAs reveals several critical limitations, including a strong tendency toward memorization over generalization, asymmetric robustness, a lack of consideration for safety constraints, and an inability to compose learned skills for long-horizon tasks. To foster research addressing these challenges and ensure reproducibility, we provide the complete VLA-Arena framework, including an end-to-end toolchain from task definition to automated evaluation and the VLA-Arena-S/M/L datasets for fine-tuning. Our benchmark, data, models, and leaderboard are available at https://vla-arena.github.io.","external_url":"https://arxiv.org/abs/2512.22539","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-07-01T14:25:46.223186+00:00","pith_arxiv_id":"2512.22539","created_at":"2026-05-10T15:05:32.332806+00:00","updated_at":"2026-07-01T14:25:46.223186+00:00","title_quality_ok":true,"display_title":"Vla-arena: An open-source framework for benchmarking vision-language-action models","render_title":"Vla-arena: An open-source framework for benchmarking vision-language-action models"},"hub":{"state":{"work_id":"1f46210b-e0d5-42ba-9d2b-29f3b33f07b9","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":12,"external_cited_by_count":null,"distinct_field_count":3,"first_pith_cited_at":"2026-04-13T17:25:41+00:00","last_pith_cited_at":"2026-06-25T14:19:36+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-07-01T21:11:36.406210+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":4},{"context_role":"dataset","n":2},{"context_role":"baseline","n":1}],"polarity_counts":[{"context_polarity":"background","n":4},{"context_polarity":"use_dataset","n":2},{"context_polarity":"baseline","n":1}],"runs":{},"summary":{},"graph":{},"authors":[]}}