{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2025:QJ7SX2AM3DJ4RZAJ7H3PPKJHSJ","short_pith_number":"pith:QJ7SX2AM","schema_version":"1.0","canonical_sha256":"827f2be80cd8d3c8e409f9f6f7a927924061d8248c61b892f6dcb7847bbe717b","source":{"kind":"arxiv","id":"2510.03827","version":1},"attestation_state":"computed","paper":{"title":"LIBERO-PRO: Towards Robust and Fair Evaluation of Vision-Language-Action Models Beyond Memorization","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Vision-Language-Action models achieve over 90 percent on standard benchmarks yet drop to zero percent when objects, instructions or environments are perturbed.","cross_cats":["cs.RO"],"primary_cat":"cs.CV","authors_text":"Duanfeng Chu, Guiyao Tie, Guowen Zhang, Lichao Sun, Pan Zhou, Xueyang Zhou, Yangming Xu, Yongchao Chen","submitted_at":"2025-10-04T14:56:40Z","abstract_excerpt":"LIBERO has emerged as a widely adopted benchmark for evaluating Vision-Language-Action (VLA) models; however, its current training and evaluation settings are problematic, often leading to inflated performance estimates and preventing fair model comparison. To address these issues, we introduce LIBERO-PRO, an extended LIBERO benchmark that systematically evaluates model performance under reasonable perturbations across four dimensions: manipulated objects, initial states, task instructions, and environments. Experimental results reveal that, although existing models achieve over 90% accuracy u"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2510.03827","kind":"arxiv","version":1},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.CV","submitted_at":"2025-10-04T14:56:40Z","cross_cats_sorted":["cs.RO"],"title_canon_sha256":"ac47bff84c98b8cbd8254dad859366420c75b1086ba3196334414e34e088aaed","abstract_canon_sha256":"be511cad227cf3b223756032a6a653287c4f6fcade091d65bcff23d40138b13a"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:14.844099Z","signature_b64":"3SldZL2nOwmHhlLKpCsnABBgBoGHcSTk73ARDtYxC2O2ipigx+Wie8cetsNTW9i7gsxboApyAAsWu6In+TviDw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"827f2be80cd8d3c8e409f9f6f7a927924061d8248c61b892f6dcb7847bbe717b","last_reissued_at":"2026-05-17T23:38:14.843571Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:14.843571Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"LIBERO-PRO: Towards Robust and Fair Evaluation of Vision-Language-Action Models Beyond Memorization","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Vision-Language-Action models achieve over 90 percent on standard benchmarks yet drop to zero percent when objects, instructions or environments are perturbed.","cross_cats":["cs.RO"],"primary_cat":"cs.CV","authors_text":"Duanfeng Chu, Guiyao Tie, Guowen Zhang, Lichao Sun, Pan Zhou, Xueyang Zhou, Yangming Xu, Yongchao Chen","submitted_at":"2025-10-04T14:56:40Z","abstract_excerpt":"LIBERO has emerged as a widely adopted benchmark for evaluating Vision-Language-Action (VLA) models; however, its current training and evaluation settings are problematic, often leading to inflated performance estimates and preventing fair model comparison. To address these issues, we introduce LIBERO-PRO, an extended LIBERO benchmark that systematically evaluates model performance under reasonable perturbations across four dimensions: manipulated objects, initial states, task instructions, and environments. Experimental results reveal that, although existing models achieve over 90% accuracy u"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"although existing models achieve over 90% accuracy under the standard LIBERO evaluation, their performance collapses to 0.0% under our generalized setting. This discrepancy exposes the models' reliance on rote memorization of action sequences and environment layouts from the training set, rather than genuine task understanding or environmental perception.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The specific perturbations chosen across the four dimensions constitute fair tests of generalization and comprehension rather than introducing unrelated difficulties that no model could reasonably handle.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"LIBERO-PRO shows VLA models collapse from over 90% to 0% accuracy under perturbations in objects, states, instructions, and environments, exposing memorization instead of genuine comprehension.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Vision-Language-Action models achieve over 90 percent on standard benchmarks yet drop to zero percent when objects, instructions or environments are perturbed.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"f3c9776a32567ba064ad66a58a78a13feeb6a780f0ab2058b5bc0a01b1d9d6dc"},"source":{"id":"2510.03827","kind":"arxiv","version":1},"verdict":{"id":"09f1216c-ab34-4323-9a9c-8647414b8514","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T06:15:03.919808Z","strongest_claim":"although existing models achieve over 90% accuracy under the standard LIBERO evaluation, their performance collapses to 0.0% under our generalized setting. This discrepancy exposes the models' reliance on rote memorization of action sequences and environment layouts from the training set, rather than genuine task understanding or environmental perception.","one_line_summary":"LIBERO-PRO shows VLA models collapse from over 90% to 0% accuracy under perturbations in objects, states, instructions, and environments, exposing memorization instead of genuine comprehension.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The specific perturbations chosen across the four dimensions constitute fair tests of generalization and comprehension rather than introducing unrelated difficulties that no model could reasonably handle.","pith_extraction_headline":"Vision-Language-Action models achieve over 90 percent on standard benchmarks yet drop to zero percent when objects, instructions or environments are perturbed."},"references":{"count":25,"sample":[{"doi":"","year":null,"title":"$\\pi_0$: A Vision-Language-Action Flow Model for General Robot Control","work_id":"f790abdc-a796-482f-a40d-f8ee035ecfc2","ref_index":1,"cited_arxiv_id":"2410.24164","is_internal_anchor":true},{"doi":"","year":null,"title":"UniVLA: Learning to Act Anywhere with Task-centric Latent Actions","work_id":"e05d654d-db73-48f6-9318-381b6798bac9","ref_index":2,"cited_arxiv_id":"2505.06111","is_internal_anchor":true},{"doi":"","year":null,"title":"WorldVLA: Towards Autoregressive Action World Model","work_id":"d8c0c873-b2fc-44a5-a0c8-0d4a698783fb","ref_index":3,"cited_arxiv_id":"2506.21539","is_internal_anchor":true},{"doi":"","year":null,"title":"arXiv preprint arXiv:2506.08440 , year=","work_id":"4d5bf60f-37d6-49f4-b277-81ca7233cb16","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Irving Fang, Juexiao Zhang, Shengbang Tong, and Chen Feng","work_id":"50ad5e19-0d60-4bb4-b3fc-01ecdee5d7a4","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":25,"snapshot_sha256":"0a5de65afed22fbbae28d8c9a5c7d8b357df8b93eee731fb59f2cc0045f1e9b6","internal_anchors":13},"formal_canon":{"evidence_count":3,"snapshot_sha256":"c4f4d9e0b8d2da1151c68f1ff7a6aca42c8a20f9f9c0db27530c79c492ec449c"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2510.03827","created_at":"2026-05-17T23:38:14.843657+00:00"},{"alias_kind":"arxiv_version","alias_value":"2510.03827v1","created_at":"2026-05-17T23:38:14.843657+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2510.03827","created_at":"2026-05-17T23:38:14.843657+00:00"},{"alias_kind":"pith_short_12","alias_value":"QJ7SX2AM3DJ4","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"QJ7SX2AM3DJ4RZAJ","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"QJ7SX2AM","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":20,"internal_anchor_count":20,"sample":[{"citing_arxiv_id":"2605.21414","citing_title":"PointACT: Vision-Language-Action Models with Multi-Scale Point-Action Interaction","ref_index":75,"is_internal_anchor":true},{"citing_arxiv_id":"2602.13193","citing_title":"Steerable Vision-Language-Action Policies for Embodied Reasoning and Hierarchical Control","ref_index":75,"is_internal_anchor":true},{"citing_arxiv_id":"2603.13966","citing_title":"vla-eval: A Unified Evaluation Harness for Vision-Language-Action Models","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2604.09860","citing_title":"RoboLab: A High-Fidelity Simulation Benchmark for Analysis of Task Generalist Policies","ref_index":35,"is_internal_anchor":true},{"citing_arxiv_id":"2603.22003","citing_title":"VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models","ref_index":48,"is_internal_anchor":true},{"citing_arxiv_id":"2603.22126","citing_title":"ROBOGATE: Adaptive Failure Discovery for Safe Robot Policy Deployment via Two-Stage Boundary-Focused Sampling","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2604.04161","citing_title":"Adaptive Action Chunking at Inference-time for Vision-Language-Action Models","ref_index":50,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12090","citing_title":"World Action Models: The Next Frontier in Embodied AI","ref_index":231,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12236","citing_title":"TMRL: Diffusion Timestep-Modulated Pretraining Enables Exploration for Efficient Policy Finetuning","ref_index":54,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11205","citing_title":"The Scaling Law of Evaluation Failure: Why Simple Averaging Collapses Under Data Sparsity and Item Difficulty Gaps, and How Item Response Theory Recovers Ground Truth Across Domains","ref_index":28,"is_internal_anchor":true},{"citing_arxiv_id":"2604.26689","citing_title":"Atomic-Probe Governance for Skill Updates in Compositional Robot Policies","ref_index":34,"is_internal_anchor":true},{"citing_arxiv_id":"2604.26689","citing_title":"Atomic-Probe Governance for Skill Updates in Compositional Robot Policies","ref_index":34,"is_internal_anchor":true},{"citing_arxiv_id":"2604.23775","citing_title":"Vision-Language-Action Safety: Threats, Challenges, Evaluations, and Mechanisms","ref_index":100,"is_internal_anchor":true},{"citing_arxiv_id":"2604.23121","citing_title":"Breaking Lock-In: Preserving Steerability under Low-Data VLA Post-Training","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2605.06481","citing_title":"OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation","ref_index":99,"is_internal_anchor":true},{"citing_arxiv_id":"2604.18000","citing_title":"Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2604.09860","citing_title":"RoboLab: A High-Fidelity Simulation Benchmark for Analysis of Task Generalist Policies","ref_index":35,"is_internal_anchor":true},{"citing_arxiv_id":"2604.05595","citing_title":"Uncovering Linguistic Fragility in Vision-Language-Action Models via Diversity-Aware Red Teaming","ref_index":37,"is_internal_anchor":true},{"citing_arxiv_id":"2604.11751","citing_title":"Grounded World Model for Semantically Generalizable Planning","ref_index":70,"is_internal_anchor":true},{"citing_arxiv_id":"2604.17706","citing_title":"OmniVLA-RL: A Vision-Language-Action Model with Spatial Understanding and Online RL","ref_index":56,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":3,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/QJ7SX2AM3DJ4RZAJ7H3PPKJHSJ","json":"https://pith.science/pith/QJ7SX2AM3DJ4RZAJ7H3PPKJHSJ.json","graph_json":"https://pith.science/api/pith-number/QJ7SX2AM3DJ4RZAJ7H3PPKJHSJ/graph.json","events_json":"https://pith.science/api/pith-number/QJ7SX2AM3DJ4RZAJ7H3PPKJHSJ/events.json","paper":"https://pith.science/paper/QJ7SX2AM"},"agent_actions":{"view_html":"https://pith.science/pith/QJ7SX2AM3DJ4RZAJ7H3PPKJHSJ","download_json":"https://pith.science/pith/QJ7SX2AM3DJ4RZAJ7H3PPKJHSJ.json","view_paper":"https://pith.science/paper/QJ7SX2AM","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2510.03827&json=true","fetch_graph":"https://pith.science/api/pith-number/QJ7SX2AM3DJ4RZAJ7H3PPKJHSJ/graph.json","fetch_events":"https://pith.science/api/pith-number/QJ7SX2AM3DJ4RZAJ7H3PPKJHSJ/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/QJ7SX2AM3DJ4RZAJ7H3PPKJHSJ/action/timestamp_anchor","attest_storage":"https://pith.science/pith/QJ7SX2AM3DJ4RZAJ7H3PPKJHSJ/action/storage_attestation","attest_author":"https://pith.science/pith/QJ7SX2AM3DJ4RZAJ7H3PPKJHSJ/action/author_attestation","sign_citation":"https://pith.science/pith/QJ7SX2AM3DJ4RZAJ7H3PPKJHSJ/action/citation_signature","submit_replication":"https://pith.science/pith/QJ7SX2AM3DJ4RZAJ7H3PPKJHSJ/action/replication_record"}},"created_at":"2026-05-17T23:38:14.843657+00:00","updated_at":"2026-05-17T23:38:14.843657+00:00"}