{"work":{"id":"ff438a8a-8003-4fae-9131-acd418b3597b","openalex_id":null,"doi":null,"arxiv_id":"2307.15818","raw_key":null,"title":"RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control","authors":null,"authors_text":"Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski","year":2023,"venue":"cs.RO","abstract":"We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control to boost generalization and enable emergent semantic reasoning. Our goal is to enable a single end-to-end trained model to both learn to map robot observations to actions and enjoy the benefits of large-scale pretraining on language and vision-language data from the web. To this end, we propose to co-fine-tune state-of-the-art vision-language models on both robotic trajectory data and Internet-scale vision-language tasks, such as visual question answering. In contrast to other approaches, we propose a simple, general recipe to achieve this goal: in order to fit both natural language responses and robotic actions into the same format, we express the actions as text tokens and incorporate them directly into the training set of the model in the same way as natural language tokens. We refer to such category of models as vision-language-action models (VLA) and instantiate an example of such a model, which we call RT-2. Our extensive evaluation (6k evaluation trials) shows that our approach leads to performant robotic policies and enables RT-2 to obtain a range of emergent capabilities from Internet-scale training. This includes significantly improved generalization to novel objects, the ability to interpret commands not present in the robot training data (such as placing an object onto a particular number or icon), and the ability to perform rudimentary reasoning in response to user commands (such as picking up the smallest or largest object, or the one closest to another object). We further show that incorporating chain of thought reasoning allows RT-2 to perform multi-stage semantic reasoning, for example figuring out which object to pick up for use as an improvised hammer (a rock), or which type of drink is best suited for someone who is tired (an energy drink).","external_url":"https://arxiv.org/abs/2307.15818","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-25T07:35:27.921753+00:00","pith_arxiv_id":"2307.15818","created_at":"2026-05-09T06:50:41.654340+00:00","updated_at":"2026-05-25T07:35:27.921753+00:00","title_quality_ok":true,"display_title":"RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control","render_title":"RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control"},"hub":{"state":{"work_id":"ff438a8a-8003-4fae-9131-acd418b3597b","tier":"super_hub","tier_reason":"100+ Pith inbound or 10,000+ external citations","pith_inbound_count":161,"external_cited_by_count":null,"distinct_field_count":8,"first_pith_cited_at":"2023-09-05T17:56:20+00:00","last_pith_cited_at":"2026-05-22T12:35:28+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-05-31T15:42:39.262778+00:00","tier_text":"super_hub"},"tier":"super_hub","role_counts":[{"context_role":"background","n":47},{"context_role":"method","n":6},{"context_role":"baseline","n":4},{"context_role":"other","n":2},{"context_role":"extension","n":1}],"polarity_counts":[{"context_polarity":"background","n":46},{"context_polarity":"use_method","n":5},{"context_polarity":"baseline","n":4},{"context_polarity":"unclear","n":4},{"context_polarity":"extend","n":1}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control","claims":[{"claim_text":"We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control to boost generalization and enable emergent semantic reasoning. Our goal is to enable a single end-to-end trained model to both learn to map robot observations to actions and enjoy the benefits of large-scale pretraining on language and vision-language data from the web. To this end, we propose to co-fine-tune state-of-the-art vision-language models on both robotic trajectory data and Internet-scale vision-language tasks, such as visual question answering. In contrast","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilin- sky.π 0: A vision-language-action flow model for general robot control, 2026. URL https://arxiv.org/abs/ 2410.24164. [8] Anthony Brohan, Noah Brown, Justice Carbajal, Yev- gen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava","claim_type":"other","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"Multi-view aggregationStereo fusion is performed independently per camera view. The resulting embeddings are then concatenated (denoted as∥) with low-dimensional proprioceptive featuress t as the input to the policy backbone to predict robot actions: zobs = [z t,1∥zt,2∥ · · · ∥zt,N ∥st ].(3) 3.2 STEREOPOLICY-DP: Diffusion Policy with StereoPolicy For imitation learning, we primarily adopt a diffusion policy [8] as the policy backbone. The policy predicts a sequence of continuous actionsa t:t+H−1","claim_type":"method","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"efficient robotic manipulation, as illustrated on the right. 1. Introduction Vision-Language-Action (VLA) models [4, 7, 31, 32, 38, 39, 45, 53, 64, 97] have made significant progress in en- abling robots to perform complex tasks by integrating vi- arXiv:2605.05126v1 [cs.RO] 6 May 2026 sual perception and language instructions. Representative works such as RT-2 [6], Octo [68], OpenVLA [31], and π-series [4, 17, 27, 57] highlight the potential of the VLA paradigm in bridging the gap between 2D per","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"sources, we begin by utilizing a pre-trained vision-language model (VLM) to import Internet-scale experience. By basing our model on a VLM, we inherit the general knowledge, semantic reasoning, and problem-solving abilities of language- and vision-language models. We then further train our model to incorporate robot actions, turning it into a vision-language- action (VLA) model [7]. In order to make it feasible to utilize a variety of diverse robot data sources, we employcross- embodiment traini","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"alist robot policy learning. All the details and codes are open- sourced. I. I NTRODUCTION Generalist robot policies that are capable of interacting with the physical environment, adapting to various embodiments, and performing complex tasks have been a long-standing pursuit in robotics [6, 3, 16, 8, 65]. Recent advances in Vision-Language-Action (VLA) models [7, 30, 5, 33] show a promising paradigm in building such generalist policy by fine- tuning the pre-trained Vision-Language Models (VLMs) ","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"driven embodied AI research. In recent years, the field has converged on a powerful paradigm: Vision- Language-Action (VLA) models that repurpose pretrained vision-language backbones as generalist robot policies. By formulating action generation as conditional token prediction atop internet-scale visual and lin- guistic representations, VLA models such as RT-2 [ 1], OpenVLA [ 2], and 𝜋0 [3] have demonstrated striking generalizationifollowing novel language instructions, manipulating previously u","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (15 contexts).","role_counts":[{"n":15,"context_role":"background"},{"n":5,"context_role":"method"},{"n":1,"context_role":"baseline"},{"n":1,"context_role":"extension"},{"n":1,"context_role":"other"}]},"error":null,"updated_at":"2026-05-16T06:08:39.501566+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"9e0bdc26-0bc7-4e92-93ac-9b2a7f12c819","orcid":null,"display_name":"Anthony Brohan"},{"id":"657aa3e6-af8f-48bd-aeb3-c63cc3bedc60","orcid":null,"display_name":"Noah Brown"},{"id":"322522a9-d4c8-4ec9-9e57-fea78173ea02","orcid":null,"display_name":"Justice Carbajal"},{"id":"dbfa9c9c-12ca-4878-9d7d-10db6dc427ed","orcid":null,"display_name":"Yevgen Chebotar"},{"id":"3c910835-0332-416e-87ba-3e49dd8c1b91","orcid":null,"display_name":"Xi Chen"},{"id":"324ff8e5-4dc9-4067-b02b-eb392058e7c6","orcid":null,"display_name":"Krzysztof Choromanski"}]},"error":null,"updated_at":"2026-05-16T06:08:40.162228+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T08:07:50.224171+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"$\\pi_0$: A Vision-Language-Action Flow Model for General Robot Control","work_id":"f790abdc-a796-482f-a40d-f8ee035ecfc2","shared_citers":43},{"title":"OpenVLA: An Open-Source Vision-Language-Action Model","work_id":"3e7e65c5-5aed-4fe9-8414-2092bcb31cc7","shared_citers":40},{"title":"RT-1: Robotics Transformer for Real-World Control at Scale","work_id":"e11bda85-8531-46bc-a07f-d0ade3643ab1","shared_citers":33},{"title":"Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware","work_id":"6fe159e0-fa73-481a-88d4-4719c15140be","shared_citers":23},{"title":"Octo: An Open-Source Generalist Robot Policy","work_id":"f9ca0722-8855-48c3-a27a-0eefb7e19253","shared_citers":23},{"title":"Open X-Embodiment: Robotic Learning Datasets and RT-X Models","work_id":"62f0fb6c-e6ae-4dc4-95a4-d9dd64b240e8","shared_citers":22},{"title":"$\\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization","work_id":"d1ad7304-d09a-49bc-809e-846439f6aff9","shared_citers":21},{"title":"DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset","work_id":"13253de2-3d89-415c-8c2f-3adb25d4c337","shared_citers":20},{"title":"PaLM-E: An Embodied Multimodal Language Model","work_id":"5b99811a-1d93-47e2-9d59-f4045a0b74a2","shared_citers":20},{"title":"GR00T N1: An Open Foundation Model for Generalist Humanoid Robots","work_id":"e2db69c7-ee8a-4cb7-a761-7b8de1dfcf97","shared_citers":19},{"title":"Do As I Can, Not As I Say: Grounding Language in Robotic Affordances","work_id":"037320f1-b0a9-4cbe-a639-bfb25409ce71","shared_citers":16},{"title":"Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success","work_id":"04f46bb3-4346-47e8-bf09-c75d91f96e87","shared_citers":15},{"title":"LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning","work_id":"662203ad-084f-42c4-8e60-977b3173755b","shared_citers":15},{"title":"RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation","work_id":"12319725-bc7d-4c32-a229-ad270a7460bc","shared_citers":14},{"title":"DINOv2: Learning Robust Visual Features without Supervision","work_id":"26b304e5-b54a-4f26-be7e-83299eca52e4","shared_citers":12},{"title":"FAST: Efficient Action Tokenization for Vision-Language-Action Models","work_id":"83a8f966-6cfa-4f21-81f3-87440aae238f","shared_citers":12},{"title":"GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation","work_id":"843ab5eb-2815-4db8-b3bc-890b23fa5ffa","shared_citers":12},{"title":"3D-VLA: A 3D Vision-Language-Action Generative World Model","work_id":"aebf924c-e761-437e-9cee-f1ccc2e427bd","shared_citers":11},{"title":"Bridge Data: Boosting Generalization of Robotic Skills with Cross-Domain Datasets","work_id":"59e728c0-b6ca-4759-a8f4-02b981f2220f","shared_citers":10},{"title":"CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation","work_id":"4b158d3e-3dff-4412-85cd-baa879465a5e","shared_citers":10},{"title":"RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation","work_id":"9b985126-4a2f-4bdf-b014-2a7524ec634e","shared_citers":10},{"title":"Diffusion Policy: Visuomotor Policy Learning via Action Diffusion","work_id":"2dce18e6-f07a-4f57-8a81-e71c3e6a293c","shared_citers":9},{"title":"Gemini: A Family of Highly Capable Multimodal Models","work_id":"83f7c85b-3f11-450f-ac0c-64d9745220b2","shared_citers":9},{"title":"Mastering Diverse Domains through World Models","work_id":"6aeb260f-8c7c-4f9c-b98b-067cd7c59acd","shared_citers":9}],"time_series":[{"n":2,"year":2023},{"n":9,"year":2024},{"n":8,"year":2025},{"n":56,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T08:07:54.636728+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T08:07:45.964987+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control","claims":[{"claim_text":"We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control to boost generalization and enable emergent semantic reasoning. Our goal is to enable a single end-to-end trained model to both learn to map robot observations to actions and enjoy the benefits of large-scale pretraining on language and vision-language data from the web. To this end, we propose to co-fine-tune state-of-the-art vision-language models on both robotic trajectory data and Internet-scale vision-language tasks, such as visual question answering. In contrast","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilin- sky.π 0: A vision-language-action flow model for general robot control, 2026. URL https://arxiv.org/abs/ 2410.24164. [8] Anthony Brohan, Noah Brown, Justice Carbajal, Yev- gen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava","claim_type":"other","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"Multi-view aggregationStereo fusion is performed independently per camera view. The resulting embeddings are then concatenated (denoted as∥) with low-dimensional proprioceptive featuress t as the input to the policy backbone to predict robot actions: zobs = [z t,1∥zt,2∥ · · · ∥zt,N ∥st ].(3) 3.2 STEREOPOLICY-DP: Diffusion Policy with StereoPolicy For imitation learning, we primarily adopt a diffusion policy [8] as the policy backbone. The policy predicts a sequence of continuous actionsa t:t+H−1","claim_type":"method","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"efficient robotic manipulation, as illustrated on the right. 1. Introduction Vision-Language-Action (VLA) models [4, 7, 31, 32, 38, 39, 45, 53, 64, 97] have made significant progress in en- abling robots to perform complex tasks by integrating vi- arXiv:2605.05126v1 [cs.RO] 6 May 2026 sual perception and language instructions. Representative works such as RT-2 [6], Octo [68], OpenVLA [31], and π-series [4, 17, 27, 57] highlight the potential of the VLA paradigm in bridging the gap between 2D per","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"sources, we begin by utilizing a pre-trained vision-language model (VLM) to import Internet-scale experience. By basing our model on a VLM, we inherit the general knowledge, semantic reasoning, and problem-solving abilities of language- and vision-language models. We then further train our model to incorporate robot actions, turning it into a vision-language- action (VLA) model [7]. In order to make it feasible to utilize a variety of diverse robot data sources, we employcross- embodiment traini","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"alist robot policy learning. All the details and codes are open- sourced. I. I NTRODUCTION Generalist robot policies that are capable of interacting with the physical environment, adapting to various embodiments, and performing complex tasks have been a long-standing pursuit in robotics [6, 3, 16, 8, 65]. Recent advances in Vision-Language-Action (VLA) models [7, 30, 5, 33] show a promising paradigm in building such generalist policy by fine- tuning the pre-trained Vision-Language Models (VLMs) ","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"driven embodied AI research. In recent years, the field has converged on a powerful paradigm: Vision- Language-Action (VLA) models that repurpose pretrained vision-language backbones as generalist robot policies. By formulating action generation as conditional token prediction atop internet-scale visual and lin- guistic representations, VLA models such as RT-2 [ 1], OpenVLA [ 2], and 𝜋0 [3] have demonstrated striking generalizationifollowing novel language instructions, manipulating previously u","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (15 contexts).","role_counts":[{"n":15,"context_role":"background"},{"n":5,"context_role":"method"},{"n":1,"context_role":"baseline"},{"n":1,"context_role":"extension"},{"n":1,"context_role":"other"}]},"error":null,"updated_at":"2026-05-16T06:08:39.498849+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control","claims":[{"claim_text":"We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control to boost generalization and enable emergent semantic reasoning. Our goal is to enable a single end-to-end trained model to both learn to map robot observations to actions and enjoy the benefits of large-scale pretraining on language and vision-language data from the web. To this end, we propose to co-fine-tune state-of-the-art vision-language models on both robotic trajectory data and Internet-scale vision-language tasks, such as visual question answering. In contrast","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T08:07:41.201686+00:00"}},"summary":{"title":"RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control","claims":[{"claim_text":"We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control to boost generalization and enable emergent semantic reasoning. Our goal is to enable a single end-to-end trained model to both learn to map robot observations to actions and enjoy the benefits of large-scale pretraining on language and vision-language data from the web. To this end, we propose to co-fine-tune state-of-the-art vision-language models on both robotic trajectory data and Internet-scale vision-language tasks, such as visual question answering. In contrast","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"$\\pi_0$: A Vision-Language-Action Flow Model for General Robot Control","work_id":"f790abdc-a796-482f-a40d-f8ee035ecfc2","shared_citers":43},{"title":"OpenVLA: An Open-Source Vision-Language-Action Model","work_id":"3e7e65c5-5aed-4fe9-8414-2092bcb31cc7","shared_citers":40},{"title":"RT-1: Robotics Transformer for Real-World Control at Scale","work_id":"e11bda85-8531-46bc-a07f-d0ade3643ab1","shared_citers":33},{"title":"Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware","work_id":"6fe159e0-fa73-481a-88d4-4719c15140be","shared_citers":23},{"title":"Octo: An Open-Source Generalist Robot Policy","work_id":"f9ca0722-8855-48c3-a27a-0eefb7e19253","shared_citers":23},{"title":"Open X-Embodiment: Robotic Learning Datasets and RT-X Models","work_id":"62f0fb6c-e6ae-4dc4-95a4-d9dd64b240e8","shared_citers":22},{"title":"$\\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization","work_id":"d1ad7304-d09a-49bc-809e-846439f6aff9","shared_citers":21},{"title":"DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset","work_id":"13253de2-3d89-415c-8c2f-3adb25d4c337","shared_citers":20},{"title":"PaLM-E: An Embodied Multimodal Language Model","work_id":"5b99811a-1d93-47e2-9d59-f4045a0b74a2","shared_citers":20},{"title":"GR00T N1: An Open Foundation Model for Generalist Humanoid Robots","work_id":"e2db69c7-ee8a-4cb7-a761-7b8de1dfcf97","shared_citers":19},{"title":"Do As I Can, Not As I Say: Grounding Language in Robotic Affordances","work_id":"037320f1-b0a9-4cbe-a639-bfb25409ce71","shared_citers":16},{"title":"Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success","work_id":"04f46bb3-4346-47e8-bf09-c75d91f96e87","shared_citers":15},{"title":"LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning","work_id":"662203ad-084f-42c4-8e60-977b3173755b","shared_citers":15},{"title":"RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation","work_id":"12319725-bc7d-4c32-a229-ad270a7460bc","shared_citers":14},{"title":"DINOv2: Learning Robust Visual Features without Supervision","work_id":"26b304e5-b54a-4f26-be7e-83299eca52e4","shared_citers":12},{"title":"FAST: Efficient Action Tokenization for Vision-Language-Action Models","work_id":"83a8f966-6cfa-4f21-81f3-87440aae238f","shared_citers":12},{"title":"GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation","work_id":"843ab5eb-2815-4db8-b3bc-890b23fa5ffa","shared_citers":12},{"title":"3D-VLA: A 3D Vision-Language-Action Generative World Model","work_id":"aebf924c-e761-437e-9cee-f1ccc2e427bd","shared_citers":11},{"title":"Bridge Data: Boosting Generalization of Robotic Skills with Cross-Domain Datasets","work_id":"59e728c0-b6ca-4759-a8f4-02b981f2220f","shared_citers":10},{"title":"CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation","work_id":"4b158d3e-3dff-4412-85cd-baa879465a5e","shared_citers":10},{"title":"RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation","work_id":"9b985126-4a2f-4bdf-b014-2a7524ec634e","shared_citers":10},{"title":"Diffusion Policy: Visuomotor Policy Learning via Action Diffusion","work_id":"2dce18e6-f07a-4f57-8a81-e71c3e6a293c","shared_citers":9},{"title":"Gemini: A Family of Highly Capable Multimodal Models","work_id":"83f7c85b-3f11-450f-ac0c-64d9745220b2","shared_citers":9},{"title":"Mastering Diverse Domains through World Models","work_id":"6aeb260f-8c7c-4f9c-b98b-067cd7c59acd","shared_citers":9}],"time_series":[{"n":2,"year":2023},{"n":9,"year":2024},{"n":8,"year":2025},{"n":56,"year":2026}],"dependency_candidates":[]},"authors":[{"id":"9e0bdc26-0bc7-4e92-93ac-9b2a7f12c819","orcid":null,"display_name":"Anthony Brohan","source":"manual","import_confidence":0.72},{"id":"322522a9-d4c8-4ec9-9e57-fea78173ea02","orcid":null,"display_name":"Justice Carbajal","source":"manual","import_confidence":0.72},{"id":"324ff8e5-4dc9-4067-b02b-eb392058e7c6","orcid":null,"display_name":"Krzysztof Choromanski","source":"manual","import_confidence":0.72},{"id":"657aa3e6-af8f-48bd-aeb3-c63cc3bedc60","orcid":null,"display_name":"Noah Brown","source":"manual","import_confidence":0.72},{"id":"3c910835-0332-416e-87ba-3e49dd8c1b91","orcid":null,"display_name":"Xi Chen","source":"manual","import_confidence":0.72},{"id":"dbfa9c9c-12ca-4878-9d7d-10db6dc427ed","orcid":null,"display_name":"Yevgen Chebotar","source":"manual","import_confidence":0.72}]}}