{"work":{"id":"e35c8c6d-977d-4af1-963a-766ba98703ce","openalex_id":null,"doi":null,"arxiv_id":"2510.13626","raw_key":null,"title":"LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models","authors":null,"authors_text":"Senyu Fei, Siyin Wang, Junhao Shi, Zihao Dai, Jikun Cai, Pengfang Qian","year":2025,"venue":"cs.RO","abstract":"Visual-Language-Action (VLA) models report impressive success rates on robotic manipulation benchmarks, yet these results may mask fundamental weaknesses in robustness. We perform a systematic vulnerability analysis by introducing controlled perturbations across seven dimensions: objects layout, camera viewpoints, robot initial states, language instructions, light conditions, background textures and sensor noise. We comprehensively analyzed multiple state-of-the-art models and revealed consistent brittleness beneath apparent competence. Our analysis exposes critical weaknesses: models exhibit extreme sensitivity to perturbation factors, including camera viewpoints and robot initial states, with performance dropping from 95% to below 30% under modest perturbations. Surprisingly, models are largely insensitive to language variations, with further experiments revealing that models tend to ignore language instructions completely. 
Our findings challenge the assumption that high benchmark scores equate to true competency and highlight the need for evaluation practices that assess reliability under realistic variation.","external_url":"https://arxiv.org/abs/2510.13626","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-06-30T05:04:20.388686+00:00","pith_arxiv_id":"2510.13626","created_at":"2026-05-09T06:05:34.834864+00:00","updated_at":"2026-06-30T05:04:20.388686+00:00","title_quality_ok":true,"display_title":"LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models","render_title":"LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models"},"hub":{"state":{"work_id":"e35c8c6d-977d-4af1-963a-766ba98703ce","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":56,"external_cited_by_count":null,"distinct_field_count":5,"first_pith_cited_at":"2025-11-24T10:22:28+00:00","last_pith_cited_at":"2026-06-29T17:48:01+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-30T05:39:30.969076+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":15},{"context_role":"dataset","n":7},{"context_role":"baseline","n":3},{"context_role":"method","n":1}],"polarity_counts":[{"context_polarity":"background","n":12},{"context_polarity":"use_dataset","n":7},{"context_polarity":"baseline","n":3},{"context_polarity":"unclear","n":3},{"context_polarity":"use_method","n":1}],"runs":{"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T18:09:30.934963+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"$\\pi_0$: A Vision-Language-Action Flow Model for General Robot 
Control","work_id":"f790abdc-a796-482f-a40d-f8ee035ecfc2","shared_citers":20},{"title":"Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success","work_id":"04f46bb3-4346-47e8-bf09-c75d91f96e87","shared_citers":16},{"title":"$\\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization","work_id":"d1ad7304-d09a-49bc-809e-846439f6aff9","shared_citers":15},{"title":"FAST: Efficient Action Tokenization for Vision-Language-Action Models","work_id":"83a8f966-6cfa-4f21-81f3-87440aae238f","shared_citers":15},{"title":"OpenVLA: An Open-Source Vision-Language-Action Model","work_id":"3e7e65c5-5aed-4fe9-8414-2092bcb31cc7","shared_citers":15},{"title":"GR00T N1: An Open Foundation Model for Generalist Humanoid Robots","work_id":"e2db69c7-ee8a-4cb7-a761-7b8de1dfcf97","shared_citers":14},{"title":"RT-1: Robotics Transformer for Real-World Control at Scale","work_id":"e11bda85-8531-46bc-a07f-d0ade3643ab1","shared_citers":14},{"title":"WorldVLA: Towards Autoregressive Action World Model","work_id":"d8c0c873-b2fc-44a5-a0c8-0d4a698783fb","shared_citers":12},{"title":"Internvla-m1: A spatially guided vision- language-action framework for generalist robot policy","work_id":"8a11d29e-4bf8-4a9c-a97e-d87e7350dd9c","shared_citers":9},{"title":"Nora: A small open-sourced generalist vision language action model for embodied tasks","work_id":"1e89a464-b414-4d5c-a974-b2cb8be33053","shared_citers":9},{"title":"Octo: An Open-Source Generalist Robot Policy","work_id":"f9ca0722-8855-48c3-a27a-0eefb7e19253","shared_citers":9},{"title":"SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics","work_id":"0c5e9314-5fa7-4613-ad12-605a71d561d2","shared_citers":9},{"title":"CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation","work_id":"4b158d3e-3dff-4412-85cd-baa879465a5e","shared_citers":8},{"title":"Qwen3-VL Technical 
Report","work_id":"1fe243aa-e3c0-4da6-b391-4cbcfc88d5c0","shared_citers":8},{"title":"UniVLA: Learning to Act Anywhere with Task-centric Latent Actions","work_id":"e05d654d-db73-48f6-9318-381b6798bac9","shared_citers":8},{"title":"X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model","work_id":"13faca8d-e96d-4e6c-a441-9f2683d11934","shared_citers":8},{"title":"DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset","work_id":"13253de2-3d89-415c-8c2f-3adb25d4c337","shared_citers":7},{"title":"Evaluating Real-World Robot Manipulation Policies in Simulation","work_id":"7f4ca6cb-1b94-454c-9623-b52441b74b61","shared_citers":7},{"title":"Gemini Robotics: Bringing AI into the Physical World","work_id":"f7c5ce10-8364-4fbe-964f-2802b81c3a98","shared_citers":7},{"title":"Interactive post-training for vision-language- action models","work_id":"1ad0b2af-71bb-415b-b955-e3350f1a1ae8","shared_citers":7},{"title":"RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation","work_id":"9b985126-4a2f-4bdf-b014-2a7524ec634e","shared_citers":7},{"title":"A Survey on Vision-Language-Action Models for Embodied AI","work_id":"9492fb3d-d667-4892-81bb-b2878f12ff0c","shared_citers":6},{"title":"Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning","work_id":"3d63039f-41b0-4a31-af31-6fc10f5c1b1b","shared_citers":6},{"title":"F1: A vision-language-action model bridg- ing understanding and generation to actions","work_id":"0557ae67-ef52-4cba-a579-208458bc2bbc","shared_citers":6}],"time_series":[{"n":33,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T18:09:30.993756+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical 
Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T18:10:07.458463+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models","claims":[{"claim_text":"Visual-Language-Action (VLA) models report impressive success rates on robotic manipulation benchmarks, yet these results may mask fundamental weaknesses in robustness. We perform a systematic vulnerability analysis by introducing controlled perturbations across seven dimensions: objects layout, camera viewpoints, robot initial states, language instructions, light conditions, background textures and sensor noise. We comprehensively analyzed multiple state-of-the-art models and revealed consistent brittleness beneath apparent competence. Our analysis exposes critical weaknesses: models exhibit ","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"However, standard VLA models do not explicitly model world dynamics; they learn direct observation-to-action mappings without predicting how the environment changes under intervention [4]. This absence of predictive physical reasoning limits their generalization, where anticipating future states is essential. Equipping embodied policy models with world modeling capabilities thus emerges as a natural direction [5]. A growing body of recent work has begun integrating world models into the embo","claim_type":"background","confidence":0.8,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models because it crossed a citation-hub threshold. 
Current citing contexts most often use it as background evidence (1 contexts).","role_counts":[{"n":1,"context_role":"background"}]},"error":null,"updated_at":"2026-05-14T18:10:19.419107+00:00"}},"summary":{"title":"LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models","claims":[{"claim_text":"Visual-Language-Action (VLA) models report impressive success rates on robotic manipulation benchmarks, yet these results may mask fundamental weaknesses in robustness. We perform a systematic vulnerability analysis by introducing controlled perturbations across seven dimensions: objects layout, camera viewpoints, robot initial states, language instructions, light conditions, background textures and sensor noise. We comprehensively analyzed multiple state-of-the-art models and revealed consistent brittleness beneath apparent competence. Our analysis exposes critical weaknesses: models exhibit ","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"However, standard VLA models do not explicitly model world dynamics; they learn direct observation-to-action mappings without predicting how the environment changes under intervention [4]. This absence of predictive physical reasoning limits their generalization, where anticipating future states is essential. Equipping embodied policy models with world modeling capabilities thus emerges as a natural direction [5]. A growing body of recent work has begun integrating world models into the embo","claim_type":"background","confidence":0.8,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models because it crossed a citation-hub threshold. 
Current citing contexts most often use it as background evidence (1 contexts).","role_counts":[{"n":1,"context_role":"background"}]},"graph":{"co_cited":[{"title":"$\\pi_0$: A Vision-Language-Action Flow Model for General Robot Control","work_id":"f790abdc-a796-482f-a40d-f8ee035ecfc2","shared_citers":20},{"title":"Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success","work_id":"04f46bb3-4346-47e8-bf09-c75d91f96e87","shared_citers":16},{"title":"$\\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization","work_id":"d1ad7304-d09a-49bc-809e-846439f6aff9","shared_citers":15},{"title":"FAST: Efficient Action Tokenization for Vision-Language-Action Models","work_id":"83a8f966-6cfa-4f21-81f3-87440aae238f","shared_citers":15},{"title":"OpenVLA: An Open-Source Vision-Language-Action Model","work_id":"3e7e65c5-5aed-4fe9-8414-2092bcb31cc7","shared_citers":15},{"title":"GR00T N1: An Open Foundation Model for Generalist Humanoid Robots","work_id":"e2db69c7-ee8a-4cb7-a761-7b8de1dfcf97","shared_citers":14},{"title":"RT-1: Robotics Transformer for Real-World Control at Scale","work_id":"e11bda85-8531-46bc-a07f-d0ade3643ab1","shared_citers":14},{"title":"WorldVLA: Towards Autoregressive Action World Model","work_id":"d8c0c873-b2fc-44a5-a0c8-0d4a698783fb","shared_citers":12},{"title":"Internvla-m1: A spatially guided vision- language-action framework for generalist robot policy","work_id":"8a11d29e-4bf8-4a9c-a97e-d87e7350dd9c","shared_citers":9},{"title":"Nora: A small open-sourced generalist vision language action model for embodied tasks","work_id":"1e89a464-b414-4d5c-a974-b2cb8be33053","shared_citers":9},{"title":"Octo: An Open-Source Generalist Robot Policy","work_id":"f9ca0722-8855-48c3-a27a-0eefb7e19253","shared_citers":9},{"title":"SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics","work_id":"0c5e9314-5fa7-4613-ad12-605a71d561d2","shared_citers":9},{"title":"CogACT: A Foundational Vision-Language-Action Model for 
Synergizing Cognition and Action in Robotic Manipulation","work_id":"4b158d3e-3dff-4412-85cd-baa879465a5e","shared_citers":8},{"title":"Qwen3-VL Technical Report","work_id":"1fe243aa-e3c0-4da6-b391-4cbcfc88d5c0","shared_citers":8},{"title":"UniVLA: Learning to Act Anywhere with Task-centric Latent Actions","work_id":"e05d654d-db73-48f6-9318-381b6798bac9","shared_citers":8},{"title":"X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model","work_id":"13faca8d-e96d-4e6c-a441-9f2683d11934","shared_citers":8},{"title":"DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset","work_id":"13253de2-3d89-415c-8c2f-3adb25d4c337","shared_citers":7},{"title":"Evaluating Real-World Robot Manipulation Policies in Simulation","work_id":"7f4ca6cb-1b94-454c-9623-b52441b74b61","shared_citers":7},{"title":"Gemini Robotics: Bringing AI into the Physical World","work_id":"f7c5ce10-8364-4fbe-964f-2802b81c3a98","shared_citers":7},{"title":"Interactive post-training for vision-language- action models","work_id":"1ad0b2af-71bb-415b-b955-e3350f1a1ae8","shared_citers":7},{"title":"RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation","work_id":"9b985126-4a2f-4bdf-b014-2a7524ec634e","shared_citers":7},{"title":"A Survey on Vision-Language-Action Models for Embodied AI","work_id":"9492fb3d-d667-4892-81bb-b2878f12ff0c","shared_citers":6},{"title":"Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning","work_id":"3d63039f-41b0-4a31-af31-6fc10f5c1b1b","shared_citers":6},{"title":"F1: A vision-language-action model bridg- ing understanding and generation to actions","work_id":"0557ae67-ef52-4cba-a579-208458bc2bbc","shared_citers":6}],"time_series":[{"n":33,"year":2026}],"dependency_candidates":[]},"authors":[]}}