{"work":{"id":"04f46bb3-4346-47e8-bf09-c75d91f96e87","openalex_id":null,"doi":null,"arxiv_id":"2502.19645","raw_key":null,"title":"Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success","authors":null,"authors_text":"Moo Jin Kim, Chelsea Finn, Percy Liang","year":2025,"venue":"cs.RO","abstract":"Recent vision-language-action models (VLAs) build upon pretrained vision-language models and leverage diverse robot datasets to demonstrate strong task execution, language following ability, and semantic generalization. Despite these successes, VLAs struggle with novel robot setups and require fine-tuning to achieve good performance, yet how to most effectively fine-tune them is unclear given many possible strategies. In this work, we study key VLA adaptation design choices such as different action decoding schemes, action representations, and learning objectives for fine-tuning, using OpenVLA as our representative base model. Our empirical analysis informs an Optimized Fine-Tuning (OFT) recipe that integrates parallel decoding, action chunking, a continuous action representation, and a simple L1 regression-based learning objective to altogether improve inference efficiency, policy performance, and flexibility in the model's input-output specifications. We propose OpenVLA-OFT, an instantiation of this recipe, which sets a new state of the art on the LIBERO simulation benchmark, significantly boosting OpenVLA's average success rate across four task suites from 76.5% to 97.1% while increasing action generation throughput by 26$\\times$. In real-world evaluations, our fine-tuning recipe enables OpenVLA to successfully execute dexterous, high-frequency control tasks on a bimanual ALOHA robot and outperform other VLAs ($\\pi_0$ and RDT-1B) fine-tuned using their default recipes, as well as strong imitation learning policies trained from scratch (Diffusion Policy and ACT) by up to 15% (absolute) in average success rate. 
We release code for OFT and pretrained model checkpoints at https://openvla-oft.github.io/.","external_url":"https://arxiv.org/abs/2502.19645","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-19T16:37:39.942867+00:00","pith_arxiv_id":"2502.19645","created_at":"2026-05-09T06:30:42.687012+00:00","updated_at":"2026-05-19T16:37:39.942867+00:00","title_quality_ok":true,"display_title":"Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success","render_title":"Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success"},"hub":{"state":{"work_id":"04f46bb3-4346-47e8-bf09-c75d91f96e87","tier":"super_hub","tier_reason":"100+ Pith inbound or 10,000+ external citations","pith_inbound_count":112,"external_cited_by_count":null,"distinct_field_count":9,"first_pith_cited_at":"2025-02-09T11:25:56+00:00","last_pith_cited_at":"2026-05-14T18:11:47+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-05-20T02:21:44.743385+00:00","tier_text":"super_hub"},"tier":"super_hub","role_counts":[{"context_role":"background","n":33},{"context_role":"baseline","n":10},{"context_role":"method","n":6},{"context_role":"other","n":3}],"polarity_counts":[{"context_polarity":"background","n":33},{"context_polarity":"baseline","n":10},{"context_polarity":"use_method","n":6},{"context_polarity":"unclear","n":3}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success","claims":[{"claim_text":"Recent vision-language-action models (VLAs) build upon pretrained vision-language models and leverage diverse robot datasets to demonstrate strong task execution, language following ability, and semantic generalization. 
Despite these successes, VLAs struggle with novel robot setups and require fine-tuning to achieve good performance, yet how to most effectively fine-tune them is unclear given many possible strategies. In this work, we study key VLA adaptation design choices such as different action decoding schemes, action representations, and learning objectives for fine-tuning, using OpenVLA","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"Foster, G. Lam, P. Sanketi et al., \"Openvla: An open-source vision-language-action model,\" arXiv preprint arXiv:2406.09246, 2024. [7] K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter et al., \"π0: A vision-language-action flow model for general robot control,\" arXiv preprint arXiv:2410.24164, 2024. [8] M. J. Kim, C. Finn, and P. Liang, \"Fine-tuning vision-language-action models: Optimizing speed and success,\" arXiv preprint arXiv:2502.19645, 20","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"CoT-VLA [90] 87.5 91.6 87.6 69.0 81.1 ThinkAct [28] 88.3 91.4 87.1 70.9 84.4 SmolVLA [67] 93.0 94.0 91.0 77.0 88.8 CronusVLA [37] 97.3 99.6 96.9 94.0 97.0 FLOWER [62] 97.1 96.7 95.6 93.5 95.7 GR00T-N1 [6] 94.4 97.6 93.0 90.6 93.9 π0 [7] 96.8 98.8 95.8 85.2 94.1 π0+FAST [57] 96.4 96.8 88.6 60.2 85.5 OpenVLA [34] 84.7 88.4 79.2 53.7 76.5 OpenVLA-OFT [32] 97.6 98.4 97.9 94.5 97.1 DD-VLA [44] 97.2 98.6 97.4 92.0 96.3 UniVLA [78] 95.4 98.8 93.6 94.0 95.4 X-VLA [93] 98.2 98.6 97.8 97.6 98.1 LingBot-V A(","claim_type":"baseline","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"/ Light / Background / Language / Sensor noise). Geo Avg = mean of Camera/Robot/Layout (the geometric axes per [19]). Baseline rows are from the zero-shot evaluation of [94] or each method's original report. Best results are in bold, and second-best are underlined. 
Method Cam Robot Layout Geo Avg Light BG Lang Noise Avg Vision-Language-Action Models OpenVLA-OFT [33] 56.4 31.9 74.2 54.2 88.7 93.3 79.5 75.8 69.6 π0 [3] 13.8 6.0 68.9 29.6 85.0 81.4 58.8 79.0 53.6 π0.5 [60] 75.4 77.5 85.7 79.5 96.9 94.6 85.6","claim_type":"baseline","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"data and robot trajectories by representing robot actions as text tokens, yielding strong semantic generalization and improved performance on novel objects and instructions. OpenVLA [130] advances this paradigm toward open and scalable robot learning, improving performance while remaining practical for efficient fine-tuning and deployment. OpenVLA-OFT [131] proposes an optimized fine-tuning recipe with parallel decoding, action chunking, continuous action representations, and an L1 regression ","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"and deep learning [42, 51, 60-63, 73, 84, 91, 93, 95, 96] have advanced robotic manipulation by bridging visual perception with language. RT-2 [6] introduced VLA modeling with pre-trained VLMs, and RT-X [71] scaled it for better generalization. OpenVLA [31], trained on 970K robot demonstrations, released a 7B-parameter model, while OpenVLA-OFT [32] optimized inference with action chunking. These single-system models [50, 66, 80, 85, 88] sparked numerous subsequent studies. In contrast, dual-","claim_type":"background","confidence":0.85,"evidence_strength":"citation_context"},{"claim_text":"(VLM) to generate driving actions, improving the adaptability and scalability of end-to-end systems across diverse driving scenarios. 
However, the existing VLA models still face two challenges: 1) High action generation latency with autoregressive decoding. How to bridge the vision, reasoning, and action space is the core question of the VLA model. Although directly generating action tokens within VLM [34,81] simplifies the model structure and unifies the reasoning and planning, it requires autoregressive ","claim_type":"background","confidence":0.85,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (12 contexts).","role_counts":[{"n":12,"context_role":"background"},{"n":3,"context_role":"baseline"},{"n":2,"context_role":"method"}]},"error":null,"updated_at":"2026-05-16T07:08:42.445746+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"243ae73b-9e68-445f-b88e-fa0b28f4c914","orcid":null,"display_name":"Moo Jin Kim"},{"id":"379e406e-0cbc-4ede-b9dd-9a76a16a6da8","orcid":null,"display_name":"Chelsea Finn"},{"id":"1151f7b5-4762-4a80-b79b-147a68616c43","orcid":null,"display_name":"Percy Liang"}]},"error":null,"updated_at":"2026-05-16T07:08:42.441274+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T07:28:13.089323+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"$\\pi_0$: A Vision-Language-Action Flow Model for General Robot Control","work_id":"f790abdc-a796-482f-a40d-f8ee035ecfc2","shared_citers":55},{"title":"OpenVLA: An Open-Source Vision-Language-Action Model","work_id":"3e7e65c5-5aed-4fe9-8414-2092bcb31cc7","shared_citers":43},{"title":"$\\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization","work_id":"d1ad7304-d09a-49bc-809e-846439f6aff9","shared_citers":34},{"title":"FAST: Efficient 
Action Tokenization for Vision-Language-Action Models","work_id":"83a8f966-6cfa-4f21-81f3-87440aae238f","shared_citers":29},{"title":"GR00T N1: An Open Foundation Model for Generalist Humanoid Robots","work_id":"e2db69c7-ee8a-4cb7-a761-7b8de1dfcf97","shared_citers":28},{"title":"RT-1: Robotics Transformer for Real-World Control at Scale","work_id":"e11bda85-8531-46bc-a07f-d0ade3643ab1","shared_citers":28},{"title":"Octo: An Open-Source Generalist Robot Policy","work_id":"f9ca0722-8855-48c3-a27a-0eefb7e19253","shared_citers":25},{"title":"RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation","work_id":"12319725-bc7d-4c32-a229-ad270a7460bc","shared_citers":20},{"title":"SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model","work_id":"592041b3-3ca2-4836-8dd4-f8095d8a692b","shared_citers":20},{"title":"WorldVLA: Towards Autoregressive Action World Model","work_id":"d8c0c873-b2fc-44a5-a0c8-0d4a698783fb","shared_citers":20},{"title":"CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation","work_id":"4b158d3e-3dff-4412-85cd-baa879465a5e","shared_citers":19},{"title":"SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics","work_id":"0c5e9314-5fa7-4613-ad12-605a71d561d2","shared_citers":19},{"title":"UniVLA: Learning to Act Anywhere with Task-centric Latent Actions","work_id":"e05d654d-db73-48f6-9318-381b6798bac9","shared_citers":18},{"title":"Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware","work_id":"6fe159e0-fa73-481a-88d4-4719c15140be","shared_citers":17},{"title":"LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models","work_id":"e35c8c6d-977d-4af1-963a-766ba98703ce","shared_citers":16},{"title":"RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation","work_id":"9b985126-4a2f-4bdf-b014-2a7524ec634e","shared_citers":16},{"title":"Flow Matching 
for Generative Modeling","work_id":"6edb71c4-5d64-40af-a394-9757ea051a36","shared_citers":15},{"title":"RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control","work_id":"ff438a8a-8003-4fae-9131-acd418b3597b","shared_citers":15},{"title":"X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model","work_id":"13faca8d-e96d-4e6c-a441-9f2683d11934","shared_citers":15},{"title":"Dexvla: Vision-language model with plug-in diffusion expert for general robot control","work_id":"3564a757-5726-4b2a-a28e-114a4a467dfb","shared_citers":12},{"title":"GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation","work_id":"843ab5eb-2815-4db8-b3bc-890b23fa5ffa","shared_citers":12},{"title":"Qwen3-VL Technical Report","work_id":"1fe243aa-e3c0-4da6-b391-4cbcfc88d5c0","shared_citers":12},{"title":"DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset","work_id":"13253de2-3d89-415c-8c2f-3adb25d4c337","shared_citers":11},{"title":"AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems","work_id":"f797e9ec-510f-43a7-8a0c-18009ce332e5","shared_citers":10}],"time_series":[{"n":2,"year":2025},{"n":76,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T07:38:00.600937+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T07:27:55.978474+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"Fine-Tuning Vision-Language-Action Models: Optimizing Speed and 
Success","claims":[{"claim_text":"Recent vision-language-action models (VLAs) build upon pretrained vision-language models and leverage diverse robot datasets to demonstrate strong task execution, language following ability, and semantic generalization. Despite these successes, VLAs struggle with novel robot setups and require fine-tuning to achieve good performance, yet how to most effectively fine-tune them is unclear given many possible strategies. In this work, we study key VLA adaptation design choices such as different action decoding schemes, action representations, and learning objectives for fine-tuning, using OpenVLA","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"Foster, G. Lam, P. Sanketi et al., \"Openvla: An open-source vision-language-action model,\" arXiv preprint arXiv:2406.09246, 2024. [7] K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter et al., \"π0: A vision-language-action flow model for general robot control,\" arXiv preprint arXiv:2410.24164, 2024. [8] M. J. Kim, C. Finn, and P. 
Liang, \"Fine-tuning vision-language-action models: Optimizing speed and success,\" arXiv preprint arXiv:2502.19645, 20","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"CoT-VLA [90] 87.5 91.6 87.6 69.0 81.1 ThinkAct [28] 88.3 91.4 87.1 70.9 84.4 SmolVLA [67] 93.0 94.0 91.0 77.0 88.8 CronusVLA [37] 97.3 99.6 96.9 94.0 97.0 FLOWER [62] 97.1 96.7 95.6 93.5 95.7 GR00T-N1 [6] 94.4 97.6 93.0 90.6 93.9 π0 [7] 96.8 98.8 95.8 85.2 94.1 π0+FAST [57] 96.4 96.8 88.6 60.2 85.5 OpenVLA [34] 84.7 88.4 79.2 53.7 76.5 OpenVLA-OFT [32] 97.6 98.4 97.9 94.5 97.1 DD-VLA [44] 97.2 98.6 97.4 92.0 96.3 UniVLA [78] 95.4 98.8 93.6 94.0 95.4 X-VLA [93] 98.2 98.6 97.8 97.6 98.1 LingBot-V A(","claim_type":"baseline","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"/ Light / Background / Language / Sensor noise). Geo Avg = mean of Camera/Robot/Layout (the geometric axes per [19]). Baseline rows are from the zero-shot evaluation of [94] or each method's original report. Best results are in bold, and second-best are underlined. Method Cam Robot Layout Geo Avg Light BG Lang Noise Avg Vision-Language-Action Models OpenVLA-OFT [33] 56.4 31.9 74.2 54.2 88.7 93.3 79.5 75.8 69.6 π0 [3] 13.8 6.0 68.9 29.6 85.0 81.4 58.8 79.0 53.6 π0.5 [60] 75.4 77.5 85.7 79.5 96.9 94.6 85.6","claim_type":"baseline","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"data and robot trajectories by representing robot actions as text tokens, yielding strong semantic generalization and improved performance on novel objects and instructions. OpenVLA [130] advances this paradigm toward open and scalable robot learning, improving performance while remaining practical for efficient fine-tuning and deployment. 
OpenVLA-OFT [131] proposes an optimized fine-tuning recipe with parallel decoding, action chunking, continuous action representations, and an L1 regression ","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"and deep learning [42, 51, 60-63, 73, 84, 91, 93, 95, 96] have advanced robotic manipulation by bridging visual perception with language. RT-2 [6] introduced VLA modeling with pre-trained VLMs, and RT-X [71] scaled it for better generalization. OpenVLA [31], trained on 970K robot demonstrations, released a 7B-parameter model, while OpenVLA-OFT [32] optimized inference with action chunking. These single-system models [50, 66, 80, 85, 88] sparked numerous subsequent studies. In contrast, dual-","claim_type":"background","confidence":0.85,"evidence_strength":"citation_context"},{"claim_text":"(VLM) to generate driving actions, improving the adaptability and scalability of end-to-end systems across diverse driving scenarios. However, the existing VLA models still face two challenges: 1) High action generation latency with autoregressive decoding. How to bridge the vision, reasoning, and action space is the core question of the VLA model. Although directly generating action tokens within VLM [34,81] simplifies the model structure and unifies the reasoning and planning, it requires autoregressive ","claim_type":"background","confidence":0.85,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success because it crossed a citation-hub threshold. 
Current citing contexts most often use it as background evidence (12 contexts).","role_counts":[{"n":12,"context_role":"background"},{"n":3,"context_role":"baseline"},{"n":2,"context_role":"method"}]},"error":null,"updated_at":"2026-05-16T07:08:42.448650+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success","claims":[{"claim_text":"Recent vision-language-action models (VLAs) build upon pretrained vision-language models and leverage diverse robot datasets to demonstrate strong task execution, language following ability, and semantic generalization. Despite these successes, VLAs struggle with novel robot setups and require fine-tuning to achieve good performance, yet how to most effectively fine-tune them is unclear given many possible strategies. In this work, we study key VLA adaptation design choices such as different action decoding schemes, action representations, and learning objectives for fine-tuning, using OpenVLA","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T07:37:56.422965+00:00"}},"summary":{"title":"Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success","claims":[{"claim_text":"Recent vision-language-action models (VLAs) build upon pretrained vision-language models and leverage diverse robot datasets to demonstrate strong task execution, language following ability, and semantic generalization. Despite these successes, VLAs struggle with novel robot setups and require fine-tuning to achieve good performance, yet how to most effectively fine-tune them is unclear given many possible strategies. 
In this work, we study key VLA adaptation design choices such as different action decoding schemes, action representations, and learning objectives for fine-tuning, using OpenVLA","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"$\\pi_0$: A Vision-Language-Action Flow Model for General Robot Control","work_id":"f790abdc-a796-482f-a40d-f8ee035ecfc2","shared_citers":55},{"title":"OpenVLA: An Open-Source Vision-Language-Action Model","work_id":"3e7e65c5-5aed-4fe9-8414-2092bcb31cc7","shared_citers":43},{"title":"$\\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization","work_id":"d1ad7304-d09a-49bc-809e-846439f6aff9","shared_citers":34},{"title":"FAST: Efficient Action Tokenization for Vision-Language-Action Models","work_id":"83a8f966-6cfa-4f21-81f3-87440aae238f","shared_citers":29},{"title":"GR00T N1: An Open Foundation Model for Generalist Humanoid Robots","work_id":"e2db69c7-ee8a-4cb7-a761-7b8de1dfcf97","shared_citers":28},{"title":"RT-1: Robotics Transformer for Real-World Control at Scale","work_id":"e11bda85-8531-46bc-a07f-d0ade3643ab1","shared_citers":28},{"title":"Octo: An Open-Source Generalist Robot Policy","work_id":"f9ca0722-8855-48c3-a27a-0eefb7e19253","shared_citers":25},{"title":"RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation","work_id":"12319725-bc7d-4c32-a229-ad270a7460bc","shared_citers":20},{"title":"SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model","work_id":"592041b3-3ca2-4836-8dd4-f8095d8a692b","shared_citers":20},{"title":"WorldVLA: Towards Autoregressive Action World Model","work_id":"d8c0c873-b2fc-44a5-a0c8-0d4a698783fb","shared_citers":20},{"title":"CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic 
Manipulation","work_id":"4b158d3e-3dff-4412-85cd-baa879465a5e","shared_citers":19},{"title":"SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics","work_id":"0c5e9314-5fa7-4613-ad12-605a71d561d2","shared_citers":19},{"title":"UniVLA: Learning to Act Anywhere with Task-centric Latent Actions","work_id":"e05d654d-db73-48f6-9318-381b6798bac9","shared_citers":18},{"title":"Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware","work_id":"6fe159e0-fa73-481a-88d4-4719c15140be","shared_citers":17},{"title":"LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models","work_id":"e35c8c6d-977d-4af1-963a-766ba98703ce","shared_citers":16},{"title":"RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation","work_id":"9b985126-4a2f-4bdf-b014-2a7524ec634e","shared_citers":16},{"title":"Flow Matching for Generative Modeling","work_id":"6edb71c4-5d64-40af-a394-9757ea051a36","shared_citers":15},{"title":"RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control","work_id":"ff438a8a-8003-4fae-9131-acd418b3597b","shared_citers":15},{"title":"X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model","work_id":"13faca8d-e96d-4e6c-a441-9f2683d11934","shared_citers":15},{"title":"Dexvla: Vision-language model with plug-in diffusion expert for general robot control","work_id":"3564a757-5726-4b2a-a28e-114a4a467dfb","shared_citers":12},{"title":"GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation","work_id":"843ab5eb-2815-4db8-b3bc-890b23fa5ffa","shared_citers":12},{"title":"Qwen3-VL Technical Report","work_id":"1fe243aa-e3c0-4da6-b391-4cbcfc88d5c0","shared_citers":12},{"title":"DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset","work_id":"13253de2-3d89-415c-8c2f-3adb25d4c337","shared_citers":11},{"title":"AgiBot World Colosseo: A Large-scale Manipulation 
Platform for Scalable and Intelligent Embodied Systems","work_id":"f797e9ec-510f-43a7-8a0c-18009ce332e5","shared_citers":10}],"time_series":[{"n":2,"year":2025},{"n":76,"year":2026}],"dependency_candidates":[]},"authors":[{"id":"379e406e-0cbc-4ede-b9dd-9a76a16a6da8","orcid":null,"display_name":"Chelsea Finn","source":"manual","import_confidence":0.72},{"id":"243ae73b-9e68-445f-b88e-fa0b28f4c914","orcid":null,"display_name":"Moo Jin Kim","source":"manual","import_confidence":0.72},{"id":"1151f7b5-4762-4a80-b79b-147a68616c43","orcid":null,"display_name":"Percy Liang","source":"manual","import_confidence":0.72}]}}