{"work":{"id":"5b99811a-1d93-47e2-9d59-f4045a0b74a2","openalex_id":null,"doi":null,"arxiv_id":"2303.03378","raw_key":null,"title":"PaLM-E: An Embodied Multimodal Language Model","authors":null,"authors_text":"Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter","year":2023,"venue":"cs.LG","abstract":"Large language models excel at a wide range of complex tasks. However, enabling general inference in the real world, e.g., for robotics problems, raises the challenge of grounding. We propose embodied language models to directly incorporate real-world continuous sensor modalities into language models and thereby establish the link between words and percepts. Input to our embodied language model are multi-modal sentences that interleave visual, continuous state estimation, and textual input encodings. We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks including sequential robotic manipulation planning, visual question answering, and captioning. Our evaluations show that PaLM-E, a single large embodied multimodal model, can address a variety of embodied reasoning tasks, from a variety of observation modalities, on multiple embodiments, and further, exhibits positive transfer: the model benefits from diverse joint training across internet-scale language, vision, and visual-language domains. 
Our largest model, PaLM-E-562B with 562B parameters, in addition to being trained on robotics tasks, is a visual-language generalist with state-of-the-art performance on OK-VQA, and retains generalist language capabilities with increasing scale.","external_url":"https://arxiv.org/abs/2303.03378","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-25T08:40:33.076948+00:00","pith_arxiv_id":"2303.03378","created_at":"2026-05-09T06:50:41.589082+00:00","updated_at":"2026-06-05T21:23:00.469572+00:00","title_quality_ok":true,"display_title":"PaLM-E: An Embodied Multimodal Language Model","render_title":"PaLM-E: An Embodied Multimodal Language Model"},"hub":{"state":{"work_id":"5b99811a-1d93-47e2-9d59-f4045a0b74a2","tier":"super_hub","tier_reason":"100+ Pith inbound or 10,000+ external citations","pith_inbound_count":139,"external_cited_by_count":null,"distinct_field_count":10,"first_pith_cited_at":"2023-03-20T18:31:47+00:00","last_pith_cited_at":"2026-05-21T14:48:35+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-24T09:51:41.481987+00:00","tier_text":"super_hub"},"tier":"super_hub","role_counts":[{"context_role":"background","n":55}],"polarity_counts":[{"context_polarity":"background","n":54},{"context_polarity":"support","n":1}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"PaLM-E: An Embodied Multimodal Language Model","claims":[{"claim_text":"Large language models excel at a wide range of complex tasks. However, enabling general inference in the real world, e.g., for robotics problems, raises the challenge of grounding. We propose embodied language models to directly incorporate real-world continuous sensor modalities into language models and thereby establish the link between words and percepts. 
Input to our embodied language model are multi-modal sentences that interleave visual, continuous state estimation, and textual input encodings. We train these encodings end-to-end, in conjunction with a pre-trained large language model, f","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"Building upon this foundation, recent advances in multimodal modeling have augmented LLMs with visual encoders, giving rise to large vision-language models (VLMs) capable of jointly reasoning over text and visual inputs [3]. These models have enabled a new class of deployment scenarios, including autonomous driving [4], [5], embodied agents [6], [7], interactive assistants [8], [9], and video question-answering [10], [11]. While early VLM studies primarily centered on static image-text pairs, ","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"may choose the image tagging model [7,16,19] and check if the predicted tag list contains \"people\". If \"people\" exists, we may select the celebrity model [20] to further understand whether a celebrity appears and who he/she is. One research direction is to combine the vision and language modules as one end-to-end model, such as Flamingo [2], PaLM-E [10], to provide a dialogue-based experience to the end user. That is, the user can use natural language to interact with the model around the image ","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"In the field of IR, traditional search engines are challenged by the new information seeking way through AI chatbots (i.e., ChatGPT), and New Bing 3 presents an initial attempt that enhances the search results based on LLMs. In the field of CV, the researchers try to develop ChatGPT-like vision-language models that can better serve multimodal dialogues [42-45], and GPT-4 [46] has supported multimodal input by integrating the visual information. 
This new wave of technology would potentially lead","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"This interface provides important flexibility: the input sequence can include demonstrations for a new task, enabling few-shot, in-context learning [3] or multi-round multimodal chatbot interactions. Evaluations suggest that autoregressive vision-language models can be performant foundation models [5]: models like Flamingo [3], CM3 [1], Kosmos-1 [12], PALM-E [8], and multimodal GPT-4 [28] generalize well across diverse vision-language tasks. arXiv:2308.01390v2 [cs.CV] 7 Aug 2023 Outpu","claim_type":"background","confidence":0.85,"evidence_strength":"citation_context"},{"claim_text":"robotics systems leverage large foundation models for high-level planning, affordance grounding, program synthesis, and spatial-constraint reasoning [1, 23, 24, 38, 56]. Models such as PaLM-E, VIMA, and Gato further extend this idea by integrating language, vision, and embodied observations into general-purpose architectures for planning and interactive decision making [14, 29, 50]. Robotics-oriented VQA, progress-reasoning, and embodied reasoning datasets further push this direction toward gr","claim_type":"background","confidence":0.85,"evidence_strength":"citation_context"},{"claim_text":"ideas with foundation models have been studied in other topics such as image classification [14]. Flamingo [2] can be viewed as the GPT-3 moment in the multimodal domain, due to its strong performance on zero-shot task transfer and in-context-learning. Other LMMs trained on image-text pairs include BLIP-2 [28], FROMAGe [24], and KOSMOS-1 [20]. PaLM-E [13] is an LMM for embodied AI. 
Based on the recent \"best\" open-source LLM LLaMA, OpenFlamingo [ 5] and LLaMA-Adapter [59] are open-source effort","claim_type":"background","confidence":0.85,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks PaLM-E: An Embodied Multimodal Language Model because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (22 contexts).","role_counts":[{"n":22,"context_role":"background"}]},"error":null,"updated_at":"2026-05-17T05:39:33.408250+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"247490b1-4a54-4544-be15-3af387b7907a","orcid":null,"display_name":"Danny Driess"},{"id":"bb693ea4-d2ec-4e98-abd9-408775f8657d","orcid":null,"display_name":"Fei Xia"},{"id":"c7a2e88c-b190-424c-8280-00125cb50e94","orcid":null,"display_name":"Mehdi S. M. Sajjadi"},{"id":"ca38b8de-e45b-4789-989c-37db1f8de19f","orcid":null,"display_name":"Corey Lynch"},{"id":"d02e9bea-d45d-4246-8e52-818baefa7a0b","orcid":null,"display_name":"Aakanksha Chowdhery"},{"id":"9ddcb952-44b8-421e-b082-34b6c7ac1d39","orcid":null,"display_name":"Brian Ichter"}]},"error":null,"updated_at":"2026-05-17T05:39:38.076599+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T11:19:50.118156+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control","work_id":"ff438a8a-8003-4fae-9131-acd418b3597b","shared_citers":20},{"title":"RT-1: Robotics Transformer for Real-World Control at Scale","work_id":"e11bda85-8531-46bc-a07f-d0ade3643ab1","shared_citers":19},{"title":"$\\pi_0$: A Vision-Language-Action Flow Model for General Robot Control","work_id":"f790abdc-a796-482f-a40d-f8ee035ecfc2","shared_citers":18},{"title":"OpenVLA: An Open-Source Vision-Language-Action 
Model","work_id":"3e7e65c5-5aed-4fe9-8414-2092bcb31cc7","shared_citers":18},{"title":"Do As I Can, Not As I Say: Grounding Language in Robotic Affordances","work_id":"037320f1-b0a9-4cbe-a639-bfb25409ce71","shared_citers":17},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":14},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":11},{"title":"Open X-Embodiment: Robotic Learning Datasets and RT-X Models","work_id":"62f0fb6c-e6ae-4dc4-95a4-d9dd64b240e8","shared_citers":11},{"title":"BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models","work_id":"63d03f4d-15f4-4583-8286-913c19f02294","shared_citers":10},{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":10},{"title":"$\\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization","work_id":"d1ad7304-d09a-49bc-809e-846439f6aff9","shared_citers":9},{"title":"VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models","work_id":"5a5edf95-2538-4e2b-8dfa-da39cec89f22","shared_citers":9},{"title":"DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset","work_id":"13253de2-3d89-415c-8c2f-3adb25d4c337","shared_citers":8},{"title":"Language is not all you need: Aligning perception with language models","work_id":"2a1e0563-79f5-4521-8293-b8b1aebf7cee","shared_citers":8},{"title":"PaLM: Scaling Language Modeling with Pathways","work_id":"a94f3ef7-2c49-4445-93fe-6ec16aafd966","shared_citers":8},{"title":"3D-VLA: A 3D Vision-Language-Action Generative World Model","work_id":"aebf924c-e761-437e-9cee-f1ccc2e427bd","shared_citers":7},{"title":"Gemini: A Family of Highly Capable Multimodal Models","work_id":"83f7c85b-3f11-450f-ac0c-64d9745220b2","shared_citers":7},{"title":"Inner Monologue: Embodied Reasoning through Planning with Language 
Models","work_id":"f6e5e4a1-e34b-4602-a7ad-df0c6103a4d0","shared_citers":7},{"title":"Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware","work_id":"6fe159e0-fa73-481a-88d4-4719c15140be","shared_citers":7},{"title":"Octo: An Open-Source Generalist Robot Policy","work_id":"f9ca0722-8855-48c3-a27a-0eefb7e19253","shared_citers":7},{"title":"On the Opportunities and Risks of Foundation Models","work_id":"a18039e9-928d-47c9-a836-32656a71bf71","shared_citers":7},{"title":"Visual Instruction Tuning","work_id":"68be622d-a6dc-4a13-82de-e3054a3dc509","shared_citers":7},{"title":"Bridge Data: Boosting Generalization of Robotic Skills with Cross-Domain Datasets","work_id":"59e728c0-b6ca-4759-a8f4-02b981f2220f","shared_citers":6},{"title":"GR00T N1: An Open Foundation Model for Generalist Humanoid Robots","work_id":"e2db69c7-ee8a-4cb7-a761-7b8de1dfcf97","shared_citers":6}],"time_series":[{"n":12,"year":2023},{"n":6,"year":2024},{"n":2,"year":2025},{"n":42,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T11:19:48.242753+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T11:19:45.905170+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"PaLM-E: An Embodied Multimodal Language Model","claims":[{"claim_text":"Large language models excel at a wide range of complex tasks. However, enabling general inference in the real world, e.g., for robotics problems, raises the challenge of grounding. 
We propose embodied language models to directly incorporate real-world continuous sensor modalities into language models and thereby establish the link between words and percepts. Input to our embodied language model are multi-modal sentences that interleave visual, continuous state estimation, and textual input encodings. We train these encodings end-to-end, in conjunction with a pre-trained large language model, f","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"Building upon this foundation, recent advances in multimodal modeling have augmented LLMs with visual encoders, giving rise to large vision-language models (VLMs) capable of jointly reasoning over text and visual inputs [3]. These models have enabled a new class of deployment scenarios, including autonomous driving [4], [5], embodied agents [6], [7], interactive assistants [8], [9], and video question-answering [10], [11]. While early VLM studies primarily centered on static image-text pairs, ","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"may choose the image tagging model [7,16,19] and check if the predicted tag list contains \"people\". If \"people\" exists, we may select the celebrity model [20] to further understand whether a celebrity appears and who he/she is. One research direction is to combine the vision and language modules as one end-to-end model, such as Flamingo [2], PaLM-E [10], to provide a dialogue-based experience to the end user. That is, the user can use natural language to interact with the model around the image ","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"In the field of IR, traditional search engines are challenged by the new information seeking way through AI chatbots (i.e., ChatGPT), and New Bing 3 presents an initial attempt that enhances the search results based on LLMs. 
In the field of CV, the researchers try to develop ChatGPT-like vision-language models that can better serve multimodal dialogues [42-45], and GPT-4 [46] has supported multimodal input by integrating the visual information. This new wave of technology would potentially lead","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"This interface provides important flexibility: the input sequence can include demonstrations for a new task, enabling few-shot, in-context learning [3] or multi-round multimodal chatbot interactions. Evaluations suggest that autoregressive vision-language models can be performant foundation models [5]: models like Flamingo [3], CM3 [1], Kosmos-1 [12], PALM-E [8], and multimodal GPT-4 [28] generalize well across diverse vision-language tasks. arXiv:2308.01390v2 [cs.CV] 7 Aug 2023 Outpu","claim_type":"background","confidence":0.85,"evidence_strength":"citation_context"},{"claim_text":"robotics systems leverage large foundation models for high-level planning, affordance grounding, program synthesis, and spatial-constraint reasoning [1, 23, 24, 38, 56]. Models such as PaLM-E, VIMA, and Gato further extend this idea by integrating language, vision, and embodied observations into general-purpose architectures for planning and interactive decision making [14, 29, 50]. Robotics-oriented VQA, progress-reasoning, and embodied reasoning datasets further push this direction toward gr","claim_type":"background","confidence":0.85,"evidence_strength":"citation_context"},{"claim_text":"ideas with foundation models have been studied in other topics such as image classification [14]. Flamingo [2] can be viewed as the GPT-3 moment in the multimodal domain, due to its strong performance on zero-shot task transfer and in-context-learning. Other LMMs trained on image-text pairs include BLIP-2 [28], FROMAGe [24], and KOSMOS-1 [20]. PaLM-E [13] is an LMM for embodied AI. 
Based on the recent \"best\" open-source LLM LLaMA, OpenFlamingo [ 5] and LLaMA-Adapter [59] are open-source effort","claim_type":"background","confidence":0.85,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks PaLM-E: An Embodied Multimodal Language Model because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (22 contexts).","role_counts":[{"n":22,"context_role":"background"}]},"error":null,"updated_at":"2026-05-17T05:39:38.080784+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"PaLM-E: An Embodied Multimodal Language Model","claims":[{"claim_text":"Large language models excel at a wide range of complex tasks. However, enabling general inference in the real world, e.g., for robotics problems, raises the challenge of grounding. We propose embodied language models to directly incorporate real-world continuous sensor modalities into language models and thereby establish the link between words and percepts. Input to our embodied language model are multi-modal sentences that interleave visual, continuous state estimation, and textual input encodings. We train these encodings end-to-end, in conjunction with a pre-trained large language model, f","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks PaLM-E: An Embodied Multimodal Language Model because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T11:19:48.245415+00:00"}},"summary":{"title":"PaLM-E: An Embodied Multimodal Language Model","claims":[{"claim_text":"Large language models excel at a wide range of complex tasks. However, enabling general inference in the real world, e.g., for robotics problems, raises the challenge of grounding. We propose embodied language models to directly incorporate real-world continuous sensor modalities into language models and thereby establish the link between words and percepts. 
Input to our embodied language model are multi-modal sentences that interleave visual, continuous state estimation, and textual input encodings. We train these encodings end-to-end, in conjunction with a pre-trained large language model, f","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks PaLM-E: An Embodied Multimodal Language Model because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control","work_id":"ff438a8a-8003-4fae-9131-acd418b3597b","shared_citers":20},{"title":"RT-1: Robotics Transformer for Real-World Control at Scale","work_id":"e11bda85-8531-46bc-a07f-d0ade3643ab1","shared_citers":19},{"title":"$\\pi_0$: A Vision-Language-Action Flow Model for General Robot Control","work_id":"f790abdc-a796-482f-a40d-f8ee035ecfc2","shared_citers":18},{"title":"OpenVLA: An Open-Source Vision-Language-Action Model","work_id":"3e7e65c5-5aed-4fe9-8414-2092bcb31cc7","shared_citers":18},{"title":"Do As I Can, Not As I Say: Grounding Language in Robotic Affordances","work_id":"037320f1-b0a9-4cbe-a639-bfb25409ce71","shared_citers":17},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":14},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":11},{"title":"Open X-Embodiment: Robotic Learning Datasets and RT-X Models","work_id":"62f0fb6c-e6ae-4dc4-95a4-d9dd64b240e8","shared_citers":11},{"title":"BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models","work_id":"63d03f4d-15f4-4583-8286-913c19f02294","shared_citers":10},{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":10},{"title":"$\\pi_{0.5}$: a Vision-Language-Action Model with Open-World 
Generalization","work_id":"d1ad7304-d09a-49bc-809e-846439f6aff9","shared_citers":9},{"title":"VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models","work_id":"5a5edf95-2538-4e2b-8dfa-da39cec89f22","shared_citers":9},{"title":"DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset","work_id":"13253de2-3d89-415c-8c2f-3adb25d4c337","shared_citers":8},{"title":"Language is not all you need: Aligning perception with language models","work_id":"2a1e0563-79f5-4521-8293-b8b1aebf7cee","shared_citers":8},{"title":"PaLM: Scaling Language Modeling with Pathways","work_id":"a94f3ef7-2c49-4445-93fe-6ec16aafd966","shared_citers":8},{"title":"3D-VLA: A 3D Vision-Language-Action Generative World Model","work_id":"aebf924c-e761-437e-9cee-f1ccc2e427bd","shared_citers":7},{"title":"Gemini: A Family of Highly Capable Multimodal Models","work_id":"83f7c85b-3f11-450f-ac0c-64d9745220b2","shared_citers":7},{"title":"Inner Monologue: Embodied Reasoning through Planning with Language Models","work_id":"f6e5e4a1-e34b-4602-a7ad-df0c6103a4d0","shared_citers":7},{"title":"Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware","work_id":"6fe159e0-fa73-481a-88d4-4719c15140be","shared_citers":7},{"title":"Octo: An Open-Source Generalist Robot Policy","work_id":"f9ca0722-8855-48c3-a27a-0eefb7e19253","shared_citers":7},{"title":"On the Opportunities and Risks of Foundation Models","work_id":"a18039e9-928d-47c9-a836-32656a71bf71","shared_citers":7},{"title":"Visual Instruction Tuning","work_id":"68be622d-a6dc-4a13-82de-e3054a3dc509","shared_citers":7},{"title":"Bridge Data: Boosting Generalization of Robotic Skills with Cross-Domain Datasets","work_id":"59e728c0-b6ca-4759-a8f4-02b981f2220f","shared_citers":6},{"title":"GR00T N1: An Open Foundation Model for Generalist Humanoid 
Robots","work_id":"e2db69c7-ee8a-4cb7-a761-7b8de1dfcf97","shared_citers":6}],"time_series":[{"n":12,"year":2023},{"n":6,"year":2024},{"n":2,"year":2025},{"n":42,"year":2026}],"dependency_candidates":[]},"authors":[{"id":"d02e9bea-d45d-4246-8e52-818baefa7a0b","orcid":null,"display_name":"Aakanksha Chowdhery","source":"manual","import_confidence":0.72},{"id":"9ddcb952-44b8-421e-b082-34b6c7ac1d39","orcid":null,"display_name":"Brian Ichter","source":"manual","import_confidence":0.72},{"id":"ca38b8de-e45b-4789-989c-37db1f8de19f","orcid":null,"display_name":"Corey Lynch","source":"manual","import_confidence":0.72},{"id":"247490b1-4a54-4544-be15-3af387b7907a","orcid":null,"display_name":"Danny Driess","source":"manual","import_confidence":0.72},{"id":"bb693ea4-d2ec-4e98-abd9-408775f8657d","orcid":null,"display_name":"Fei Xia","source":"manual","import_confidence":0.72},{"id":"c7a2e88c-b190-424c-8280-00125cb50e94","orcid":null,"display_name":"Mehdi S. M. Sajjadi","source":"manual","import_confidence":0.72}]}}