{"work":{"id":"f790abdc-a796-482f-a40d-f8ee035ecfc2","openalex_id":null,"doi":null,"arxiv_id":"2410.24164","raw_key":null,"title":"$\\pi_0$: A Vision-Language-Action Flow Model for General Robot Control","authors":null,"authors_text":"Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn","year":2024,"venue":"cs.LG","abstract":"Robot learning holds tremendous promise to unlock the full potential of flexible, general, and dexterous robot systems, as well as to address some of the deepest questions in artificial intelligence. However, bringing robot learning to the level of generality required for effective real-world systems faces major obstacles in terms of data, generalization, and robustness. In this paper, we discuss how generalist robot policies (i.e., robot foundation models) can address these challenges, and how we can design effective generalist robot policies for complex and highly dexterous tasks. We propose a novel flow matching architecture built on top of a pre-trained vision-language model (VLM) to inherit Internet-scale semantic knowledge. We then discuss how this model can be trained on a large and diverse dataset from multiple dexterous robot platforms, including single-arm robots, dual-arm robots, and mobile manipulators. We evaluate our model in terms of its ability to perform tasks in zero shot after pre-training, follow language instructions from people and from a high-level VLM policy, and its ability to acquire new skills via fine-tuning. Our results cover a wide variety of tasks, such as laundry folding, table cleaning, and assembling boxes.","external_url":"https://arxiv.org/abs/2410.24164","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-25T08:20:31.800669+00:00","pith_arxiv_id":"2410.24164","created_at":"2026-05-09T05:50:25.642664+00:00","updated_at":"2026-05-25T08:20:31.800669+00:00","title_quality_ok":true,"display_title":"$\\pi_0$: A Vision-Language-Action Flow Model for General Robot Control","render_title":"$\\pi_0$: A Vision-Language-Action Flow Model for General Robot Control"},"hub":{"state":{"work_id":"f790abdc-a796-482f-a40d-f8ee035ecfc2","tier":"super_hub","tier_reason":"100+ Pith inbound or 10,000+ external citations","pith_inbound_count":386,"external_cited_by_count":null,"distinct_field_count":9,"first_pith_cited_at":"2024-05-23T01:43:54+00:00","last_pith_cited_at":"2026-05-22T17:08:37+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-05-29T22:30:34.352705+00:00","tier_text":"super_hub"},"tier":"super_hub","role_counts":[{"context_role":"background","n":119},{"context_role":"baseline","n":25},{"context_role":"method","n":13},{"context_role":"dataset","n":1},{"context_role":"other","n":1}],"polarity_counts":[{"context_polarity":"background","n":114},{"context_polarity":"baseline","n":25},{"context_polarity":"use_method","n":12},{"context_polarity":"unclear","n":6},{"context_polarity":"support","n":1},{"context_polarity":"use_dataset","n":1}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"$\\pi_0$: A Vision-Language-Action Flow Model for General Robot Control","claims":[{"claim_text":"Robot learning holds tremendous promise to unlock the full potential of flexible, general, and dexterous robot systems, as well as to address some of the deepest questions in artificial intelligence. However, bringing robot learning to the level of generality required for effective real-world systems faces major obstacles in terms of data, generalization, and robustness. In this paper, we discuss how generalist robot policies (i.e., robot foundation models) can address these challenges, and how we can design effective generalist robot policies for complex and highly dexterous tasks. We propose","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks $\\pi_0$: A Vision-Language-Action Flow Model for General Robot Control because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-13T20:03:28.620273+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"06730c35-b7c3-40f7-9cf7-d777f261f66c","orcid":null,"display_name":"Kevin Black"},{"id":"657aa3e6-af8f-48bd-aeb3-c63cc3bedc60","orcid":null,"display_name":"Noah Brown"},{"id":"247490b1-4a54-4544-be15-3af387b7907a","orcid":null,"display_name":"Danny Driess"},{"id":"fc6c4ec4-ac85-473e-b531-b9dd6bdd7863","orcid":null,"display_name":"Adnan Esmail"},{"id":"eb31f572-d6ed-432c-a48f-757c1f70148f","orcid":null,"display_name":"Michael Equi"},{"id":"379e406e-0cbc-4ede-b9dd-9a76a16a6da8","orcid":null,"display_name":"Chelsea Finn"}]},"error":null,"updated_at":"2026-05-13T20:03:28.617554+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-13T19:53:40.182925+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"OpenVLA: An Open-Source Vision-Language-Action Model","work_id":"3e7e65c5-5aed-4fe9-8414-2092bcb31cc7","shared_citers":95},{"title":"$\\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization","work_id":"d1ad7304-d09a-49bc-809e-846439f6aff9","shared_citers":89},{"title":"GR00T N1: An Open Foundation Model for Generalist Humanoid Robots","work_id":"e2db69c7-ee8a-4cb7-a761-7b8de1dfcf97","shared_citers":66},{"title":"RT-1: Robotics Transformer for Real-World Control at Scale","work_id":"e11bda85-8531-46bc-a07f-d0ade3643ab1","shared_citers":61},{"title":"Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success","work_id":"04f46bb3-4346-47e8-bf09-c75d91f96e87","shared_citers":54},{"title":"Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware","work_id":"6fe159e0-fa73-481a-88d4-4719c15140be","shared_citers":48},{"title":"Octo: An Open-Source Generalist Robot Policy","work_id":"f9ca0722-8855-48c3-a27a-0eefb7e19253","shared_citers":45},{"title":"RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control","work_id":"ff438a8a-8003-4fae-9131-acd418b3597b","shared_citers":42},{"title":"FAST: Efficient Action Tokenization for Vision-Language-Action Models","work_id":"83a8f966-6cfa-4f21-81f3-87440aae238f","shared_citers":38},{"title":"DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset","work_id":"13253de2-3d89-415c-8c2f-3adb25d4c337","shared_citers":36},{"title":"RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation","work_id":"12319725-bc7d-4c32-a229-ad270a7460bc","shared_citers":35},{"title":"RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation","work_id":"9b985126-4a2f-4bdf-b014-2a7524ec634e","shared_citers":35},{"title":"Flow Matching for Generative Modeling","work_id":"6edb71c4-5d64-40af-a394-9757ea051a36","shared_citers":34},{"title":"SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics","work_id":"0c5e9314-5fa7-4613-ad12-605a71d561d2","shared_citers":28},{"title":"SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model","work_id":"592041b3-3ca2-4836-8dd4-f8095d8a692b","shared_citers":26},{"title":"$\\pi^{*}_{0.6}$: a VLA That Learns From Experience","work_id":"7c1b3355-694a-44c6-880f-631e897e1713","shared_citers":24},{"title":"CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation","work_id":"4b158d3e-3dff-4412-85cd-baa879465a5e","shared_citers":24},{"title":"GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation","work_id":"843ab5eb-2815-4db8-b3bc-890b23fa5ffa","shared_citers":23},{"title":"Wan: Open and Advanced Large-Scale Video Generative Models","work_id":"ad3ebc3b-4224-46c9-b61d-bcf135da0a7c","shared_citers":23},{"title":"AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems","work_id":"f797e9ec-510f-43a7-8a0c-18009ce332e5","shared_citers":22},{"title":"DINOv2: Learning Robust Visual Features without Supervision","work_id":"26b304e5-b54a-4f26-be7e-83299eca52e4","shared_citers":22},{"title":"LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning","work_id":"662203ad-084f-42c4-8e60-977b3173755b","shared_citers":22},{"title":"WorldVLA: Towards Autoregressive Action World Model","work_id":"d8c0c873-b2fc-44a5-a0c8-0d4a698783fb","shared_citers":22},{"title":"Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations","work_id":"62dbe235-8473-4190-8686-17e7437de50f","shared_citers":21}],"time_series":[{"n":11,"year":2025},{"n":189,"year":2026}]},"error":null,"updated_at":"2026-05-13T20:03:28.229729+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"fixed":1,"items":[{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-13T19:53:39.204867+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"$\\pi_0$: A Vision-Language-Action Flow Model for General Robot Control","claims":[{"claim_text":"Robot learning holds tremendous promise to unlock the full potential of flexible, general, and dexterous robot systems, as well as to address some of the deepest questions in artificial intelligence. However, bringing robot learning to the level of generality required for effective real-world systems faces major obstacles in terms of data, generalization, and robustness. In this paper, we discuss how generalist robot policies (i.e., robot foundation models) can address these challenges, and how we can design effective generalist robot policies for complex and highly dexterous tasks. We propose","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks $\\pi_0$: A Vision-Language-Action Flow Model for General Robot Control because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-13T19:53:40.187479+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"$\\pi_0$: A Vision-Language-Action Flow Model for General Robot Control","claims":[{"claim_text":"Robot learning holds tremendous promise to unlock the full potential of flexible, general, and dexterous robot systems, as well as to address some of the deepest questions in artificial intelligence. However, bringing robot learning to the level of generality required for effective real-world systems faces major obstacles in terms of data, generalization, and robustness. In this paper, we discuss how generalist robot policies (i.e., robot foundation models) can address these challenges, and how we can design effective generalist robot policies for complex and highly dexterous tasks. We propose","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks $\\pi_0$: A Vision-Language-Action Flow Model for General Robot Control because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-13T19:53:40.185439+00:00"}},"summary":{"title":"$\\pi_0$: A Vision-Language-Action Flow Model for General Robot Control","claims":[{"claim_text":"Robot learning holds tremendous promise to unlock the full potential of flexible, general, and dexterous robot systems, as well as to address some of the deepest questions in artificial intelligence. However, bringing robot learning to the level of generality required for effective real-world systems faces major obstacles in terms of data, generalization, and robustness. In this paper, we discuss how generalist robot policies (i.e., robot foundation models) can address these challenges, and how we can design effective generalist robot policies for complex and highly dexterous tasks. We propose","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks $\\pi_0$: A Vision-Language-Action Flow Model for General Robot Control because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"OpenVLA: An Open-Source Vision-Language-Action Model","work_id":"3e7e65c5-5aed-4fe9-8414-2092bcb31cc7","shared_citers":95},{"title":"$\\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization","work_id":"d1ad7304-d09a-49bc-809e-846439f6aff9","shared_citers":89},{"title":"GR00T N1: An Open Foundation Model for Generalist Humanoid Robots","work_id":"e2db69c7-ee8a-4cb7-a761-7b8de1dfcf97","shared_citers":66},{"title":"RT-1: Robotics Transformer for Real-World Control at Scale","work_id":"e11bda85-8531-46bc-a07f-d0ade3643ab1","shared_citers":61},{"title":"Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success","work_id":"04f46bb3-4346-47e8-bf09-c75d91f96e87","shared_citers":54},{"title":"Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware","work_id":"6fe159e0-fa73-481a-88d4-4719c15140be","shared_citers":48},{"title":"Octo: An Open-Source Generalist Robot Policy","work_id":"f9ca0722-8855-48c3-a27a-0eefb7e19253","shared_citers":45},{"title":"RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control","work_id":"ff438a8a-8003-4fae-9131-acd418b3597b","shared_citers":42},{"title":"FAST: Efficient Action Tokenization for Vision-Language-Action Models","work_id":"83a8f966-6cfa-4f21-81f3-87440aae238f","shared_citers":38},{"title":"DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset","work_id":"13253de2-3d89-415c-8c2f-3adb25d4c337","shared_citers":36},{"title":"RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation","work_id":"12319725-bc7d-4c32-a229-ad270a7460bc","shared_citers":35},{"title":"RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation","work_id":"9b985126-4a2f-4bdf-b014-2a7524ec634e","shared_citers":35},{"title":"Flow Matching for Generative Modeling","work_id":"6edb71c4-5d64-40af-a394-9757ea051a36","shared_citers":34},{"title":"SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics","work_id":"0c5e9314-5fa7-4613-ad12-605a71d561d2","shared_citers":28},{"title":"SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model","work_id":"592041b3-3ca2-4836-8dd4-f8095d8a692b","shared_citers":26},{"title":"$\\pi^{*}_{0.6}$: a VLA That Learns From Experience","work_id":"7c1b3355-694a-44c6-880f-631e897e1713","shared_citers":24},{"title":"CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation","work_id":"4b158d3e-3dff-4412-85cd-baa879465a5e","shared_citers":24},{"title":"GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation","work_id":"843ab5eb-2815-4db8-b3bc-890b23fa5ffa","shared_citers":23},{"title":"Wan: Open and Advanced Large-Scale Video Generative Models","work_id":"ad3ebc3b-4224-46c9-b61d-bcf135da0a7c","shared_citers":23},{"title":"AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems","work_id":"f797e9ec-510f-43a7-8a0c-18009ce332e5","shared_citers":22},{"title":"DINOv2: Learning Robust Visual Features without Supervision","work_id":"26b304e5-b54a-4f26-be7e-83299eca52e4","shared_citers":22},{"title":"LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning","work_id":"662203ad-084f-42c4-8e60-977b3173755b","shared_citers":22},{"title":"WorldVLA: Towards Autoregressive Action World Model","work_id":"d8c0c873-b2fc-44a5-a0c8-0d4a698783fb","shared_citers":22},{"title":"Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations","work_id":"62dbe235-8473-4190-8686-17e7437de50f","shared_citers":21}],"time_series":[{"n":11,"year":2025},{"n":189,"year":2026}]},"authors":[{"id":"fc6c4ec4-ac85-473e-b531-b9dd6bdd7863","orcid":null,"display_name":"Adnan Esmail","source":"manual","import_confidence":0.72},{"id":"379e406e-0cbc-4ede-b9dd-9a76a16a6da8","orcid":null,"display_name":"Chelsea Finn","source":"manual","import_confidence":0.72},{"id":"247490b1-4a54-4544-be15-3af387b7907a","orcid":null,"display_name":"Danny Driess","source":"manual","import_confidence":0.72},{"id":"06730c35-b7c3-40f7-9cf7-d777f261f66c","orcid":null,"display_name":"Kevin Black","source":"manual","import_confidence":0.72},{"id":"eb31f572-d6ed-432c-a48f-757c1f70148f","orcid":null,"display_name":"Michael Equi","source":"manual","import_confidence":0.72},{"id":"657aa3e6-af8f-48bd-aeb3-c63cc3bedc60","orcid":null,"display_name":"Noah Brown","source":"manual","import_confidence":0.72}]}}