{"work":{"id":"0575c08a-597a-4420-9cbd-e33ef136c900","openalex_id":null,"doi":null,"arxiv_id":"2505.03233","raw_key":null,"title":"GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data","authors":null,"authors_text":"Shengliang Deng, Mi Yan, Songlin Wei, Haixin Ma, Yuxin Yang, Jiayi Chen","year":2025,"venue":"cs.RO","abstract":"Embodied foundation models are gaining increasing attention for their zero-shot generalization, scalability, and adaptability to new tasks through few-shot post-training. However, existing models rely heavily on real-world data, which is costly and labor-intensive to collect. Synthetic data offers a cost-effective alternative, yet its potential remains largely underexplored. To bridge this gap, we explore the feasibility of training Vision-Language-Action models entirely with large-scale synthetic action data. We curate SynGrasp-1B, a billion-frame robotic grasping dataset generated in simulation with photorealistic rendering and extensive domain randomization. Building on this, we present GraspVLA, a VLA model pretrained on large-scale synthetic action data as a foundational model for grasping tasks. GraspVLA integrates autoregressive perception tasks and flow-matching-based action generation into a unified Chain-of-Thought process, enabling joint training on synthetic action data and Internet semantics data. This design helps mitigate sim-to-real gaps and facilitates the transfer of learned actions to a broader range of Internet-covered objects, achieving open-vocabulary generalization in grasping. Extensive evaluations across real-world and simulation benchmarks demonstrate GraspVLA's advanced zero-shot generalizability and few-shot adaptability to specific human preferences. \nWe will release SynGrasp-1B dataset and pre-trained weights to benefit the community.","external_url":"https://arxiv.org/abs/2505.03233","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-21T13:24:11.161446+00:00","pith_arxiv_id":"2505.03233","created_at":"2026-05-10T13:25:26.784240+00:00","updated_at":"2026-05-21T13:24:11.161446+00:00","title_quality_ok":true,"display_title":"GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data","render_title":"GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data"},"hub":{"state":{"work_id":"0575c08a-597a-4420-9cbd-e33ef136c900","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":18,"external_cited_by_count":null,"distinct_field_count":2,"first_pith_cited_at":"2025-06-22T16:26:53+00:00","last_pith_cited_at":"2026-05-18T17:50:32+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-05-27T21:07:58.335376+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":8},{"context_role":"baseline","n":1},{"context_role":"dataset","n":1}],"polarity_counts":[{"context_polarity":"background","n":9},{"context_polarity":"baseline","n":1}],"runs":{},"summary":{},"graph":{},"authors":[]}}