PaLM-E: An Embodied Multimodal Language Model

Aakanksha Chowdhery, Brian Ichter, Corey Lynch, Danny Driess, Fei Xia, Mehdi S. M. Sajjadi · 2023 · cs.LG · arXiv 2303.03378

Canonical reference. 98% of citing Pith papers cite this work as background.

189 Pith papers citing it

Background 98% of classified citations

abstract

Large language models excel at a wide range of complex tasks. However, enabling general inference in the real world, e.g., for robotics problems, raises the challenge of grounding. We propose embodied language models to directly incorporate real-world continuous sensor modalities into language models and thereby establish the link between words and percepts. Input to our embodied language model are multi-modal sentences that interleave visual, continuous state estimation, and textual input encodings. We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks including sequential robotic manipulation planning, visual question answering, and captioning. Our evaluations show that PaLM-E, a single large embodied multimodal model, can address a variety of embodied reasoning tasks, from a variety of observation modalities, on multiple embodiments, and further, exhibits positive transfer: the model benefits from diverse joint training across internet-scale language, vision, and visual-language domains. Our largest model, PaLM-E-562B with 562B parameters, in addition to being trained on robotics tasks, is a visual-language generalist with state-of-the-art performance on OK-VQA, and retains generalist language capabilities with increasing scale.
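The "multi-modal sentences" described in the abstract can be pictured as a single sequence in which text-token embeddings and encoded continuous observations sit side by side. The following is a minimal sketch of that interleaving only, not the paper's implementation: `embed_token`, `project_observation`, `build_multimodal_sentence`, the toy width `EMBED_DIM`, and the chunking "encoder" are all hypothetical stand-ins for learned components.

```python
from typing import List, Sequence, Union

EMBED_DIM = 4  # toy embedding width (assumption); real models are far wider

Vector = List[float]
Segment = Union[str, Sequence[float]]


def embed_token(token: str) -> Vector:
    """Hypothetical stand-in for a learned word-embedding lookup."""
    seed = sum(ord(c) for c in token)
    return [((seed * (i + 1)) % 7) / 7.0 for i in range(EMBED_DIM)]


def project_observation(features: Sequence[float]) -> List[Vector]:
    """Hypothetical encoder for a continuous observation.

    Splits the features into EMBED_DIM-sized chunks (zero-padded) so that
    each chunk has the same width as a token embedding.
    """
    vectors = []
    for start in range(0, len(features), EMBED_DIM):
        chunk = list(features[start:start + EMBED_DIM])
        chunk += [0.0] * (EMBED_DIM - len(chunk))
        vectors.append(chunk)
    return vectors


def build_multimodal_sentence(segments: Sequence[Segment]) -> List[Vector]:
    """Interleave text and observation segments into one embedding sequence."""
    sequence: List[Vector] = []
    for segment in segments:
        if isinstance(segment, str):
            # Text: one vector per whitespace-separated token.
            sequence.extend(embed_token(tok) for tok in segment.split())
        else:
            # Continuous input: one or more vectors in the same space.
            sequence.extend(project_observation(segment))
    return sequence


image_features = [0.1, 0.5, 0.9, 0.3, 0.7, 0.2]  # made-up visual features
robot_state = [0.0, 1.0, 0.5]                    # made-up state estimate

sentence = build_multimodal_sentence(
    ["Given", image_features, "and state", robot_state,
     "what should the robot do next?"]
)
print(len(sentence), "vectors of width", EMBED_DIM)
```

Because every element of the resulting sequence has the same width, a language model that consumes token embeddings could in principle consume it unchanged, which is the property the abstract relies on when it speaks of training the encodings end-to-end with a pre-trained language model.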

citation-role summary

background 55

citation-polarity summary

background 54 support 1

authors

Aakanksha Chowdhery, Brian Ichter, Corey Lynch, Danny Driess, Fei Xia, Mehdi S. M. Sajjadi

representative citing papers

From Prompt to Physical Actuation: Holistic Threat Modeling of LLM-Enabled Robotic Systems

cs.CR · 2026-04-29 · unverdicted · novelty 8.0

A unified threat model for LLM-enabled robots reveals three cross-boundary attack chains from user input to unsafe physical actuation due to missing validations and unmediated crossings.

MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?

cs.CV · 2024-08-23 · conditional · novelty 8.0

MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.

Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution

cs.CL · 2023-09-28 · unverdicted · novelty 8.0

Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.

EgoSafetyBench: A Diagnostic Egocentric Video Benchmark for Evaluating Embodied VLMs as Runtime Safety Guards

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

EgoSafetyBench shows VLMs reliably spot hazard-containing videos but miss specific contextual hazards and are degraded by misleading in-scene text.

Adapting Generalist Robot Policies with Semantic Reinforcement Learning

cs.RO · 2026-06-30 · unverdicted · novelty 7.0

SARL optimizes language prompt inputs to generalist vision-language-action policies through online RL to solve complex long-horizon tasks by composing existing skills.

Revisiting Parameter Redundancy in Vision-Language-Action Models: Insights from VLM-to-VLA Adaptation

cs.RO · 2026-06-30 · unverdicted · novelty 7.0

VLA models adapted from VLMs can be pruned by 12-30% via a multi-module joint scheme based on divergence signals while retaining ~90% of their performance on LIBERO without post-pruning recovery, whereas standard pruning criteria collapse.

Trajectory-Level Redirection Attacks on Vision-Language-Action Models

cs.RO · 2026-06-11 · unverdicted · novelty 7.0

A prompt-only attack called command-preserving trajectory redirection can steer VLA robot behavior to attacker-chosen physical outcomes while the text still appears to match the intended task.

LIBERO-Occ: Evaluating and Improving Vision-Language-Action Models under Scene-Induced Occlusion via Viewpoint Imagination

cs.CV · 2026-06-09 · unverdicted · novelty 7.0

Introduces LIBERO-Occ benchmark showing VLA performance drop under occlusion and Viewpoint Imagination method that generates complementary views to improve robustness without extra hardware.

ReCoVLA: VLM-Guided Reward Compilation for Failure Recovery in Vision-Language-Action Policies

cs.RO · 2026-06-08 · unverdicted · novelty 7.0

ReCoVLA improves VLA policy reliability by using a VLM as a semantic reward selector to train residual recovery policies in simulation, raising average success from 36.7% to 66.7% in sim and achieving 61.7% in zero-shot sim-to-real physical tests.

Colosseum V2: Benchmarking Generalization for Vision Language Action Models

cs.RO · 2026-05-26 · unverdicted · novelty 7.0

Introduces Colosseum V2 benchmark for evaluating VLA model generalization in robotic manipulation with 28 tasks, revealing limitations in current methods and sim-real correlations.

VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

VGenST-Bench is a new video benchmark for MLLM spatio-temporal reasoning built via generative synthesis, a multi-agent pipeline with human oversight, a 3x2x2 taxonomy, and hierarchical tasks separating perception from reasoning.

Aero-World: Action-Conditioned Aerial Video Generation from Inertial Controls

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

Aero-World adapts a pretrained latent diffusion transformer for action-conditioned aerial video generation by injecting inertial action tokens and using a frozen latent-space Physics Probe for inertial consistency supervision during LoRA finetuning, with a new AeroBench benchmark showing improved AA

PRISM: Planning and Reasoning with Intent in Simulated Embodied Environments

cs.RO · 2026-05-12 · unverdicted · novelty 7.0

PRISM is a tiered benchmark with 300 human-verified tasks across five photorealistic apartments that diagnoses embodied agent failures in basic ability, reasoning ability, and long-horizon ability using an agent-agnostic API.

ECHO: Continuous Hierarchical Memory for Vision-Language-Action Models

cs.RO · 2026-05-09 · unverdicted · novelty 7.0

ECHO organizes VLA experiences into a hierarchical memory tree in hyperbolic space via autoencoder and entailment constraints, delivering a 12.8% success-rate gain on LIBERO-Long over the pi0 baseline.

KinDER: A Physical Reasoning Benchmark for Robot Learning and Planning

cs.RO · 2026-04-28 · unverdicted · novelty 7.0

KinDER is a new open-source benchmark that demonstrates substantial gaps in current robot learning and planning methods for handling physical constraints.

AeroBridge-TTA: Test-Time Adaptive Language-Conditioned Control for UAVs

cs.RO · 2026-04-21 · unverdicted · novelty 7.0

AeroBridge-TTA achieves +22 pt average gains on out-of-distribution UAV dynamics mismatches by updating a latent state online from observed transitions in a language-conditioned policy.

Using large language models for embodied planning introduces systematic safety risks

cs.AI · 2026-04-20 · unverdicted · novelty 7.0

LLM planners for robots often produce dangerous plans even when planning succeeds, with safety awareness staying flat as model scale improves planning ability.

Watching Movies Like a Human: Egocentric Emotion Understanding for Embodied Companions

cs.CV · 2026-04-17 · conditional · novelty 7.0

Creates the first egocentric screen-view movie emotion benchmark and demonstrates that cinematic models drop sharply in Macro-F1 on realistic robot-like viewing conditions while domain-specific training improves robustness.

Mosaic: Cross-Modal Clustering for Efficient Video Understanding

cs.PF · 2026-04-11 · unverdicted · novelty 7.0

Mosaic uses cross-modal clusters as the unit for KVCache organization in VLMs to achieve up to 1.38x speedup in streaming long-video understanding.

How Far Are Large Multimodal Models from Human-Level Spatial Action? A Benchmark for Goal-Oriented Embodied Navigation in Urban Airspace

cs.AI · 2026-04-09 · unverdicted · novelty 7.0

Large multimodal models display emerging but limited spatial action capabilities in goal-oriented urban 3D navigation, remaining far from human-level performance with errors diverging rapidly after critical decision points.

KITE: Keyframe-Indexed Tokenized Evidence for VLM-Based Robot Failure Analysis

cs.RO · 2026-04-08 · unverdicted · novelty 7.0

KITE is a training-free method that uses keyframe-indexed tokenized evidence including BEV schematics to enhance VLM performance on robot failure detection, identification, localization, explanation, and correction.

LongTail Driving Scenarios with Reasoning Traces: The KITScenes LongTail Dataset

cs.CV · 2026-03-24 · unverdicted · novelty 7.0

KITScenes LongTail supplies multimodal driving data and multilingual expert reasoning traces to benchmark models on rare scenarios beyond basic safety metrics.

AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models

cs.RO · 2026-03-10 · unverdicted · novelty 7.0

AR-VLA introduces a standalone autoregressive action expert with long-lived memory that generates context-aware continuous actions for VLAs, replacing chunk-based heads with smoother trajectories and maintained task success.

Mema: Memory-Augmented Adapter for Enhanced Vision-Language Understanding

cs.CV · 2026-02-28 · unverdicted · novelty 7.0

Mema adds a stateful memory module to vision encoders that accumulates hierarchical visual features across layers and selectively injects portions back via feedback to preserve fine-grained cues, yielding consistent gains on multimodal benchmarks.

citing papers explorer

Showing 15 of 15 citing papers after filters.

Using large language models for embodied planning introduces systematic safety risks cs.AI · 2026-04-20 · unverdicted · none · ref 38 · internal anchor
LLM planners for robots often produce dangerous plans even when planning succeeds, with safety awareness staying flat as model scale improves planning ability.
How Far Are Large Multimodal Models from Human-Level Spatial Action? A Benchmark for Goal-Oriented Embodied Navigation in Urban Airspace cs.AI · 2026-04-09 · unverdicted · none · ref 6 · internal anchor
Large multimodal models display emerging but limited spatial action capabilities in goal-oriented urban 3D navigation, remaining far from human-level performance with errors diverging rapidly after critical decision points.
Learning Interactive Real-World Simulators cs.AI · 2023-10-09 · conditional · none · ref 18 · internal anchor
UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.
Voyager: An Open-Ended Embodied Agent with Large Language Models cs.AI · 2023-05-25 · unverdicted · none · ref 59 · internal anchor
Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more unique items and 15.3x faster milestone unlocks than prior methods while generalizing skills
LLM+P: Empowering Large Language Models with Optimal Planning Proficiency cs.AI · 2023-04-22 · accept · none · ref 36 · internal anchor
LLM+P lets LLMs solve planning problems optimally by converting them to PDDL for classical planners and back to natural language.
Efficient Skill Grounding via Code Refactoring with Small Language Models cs.AI · 2026-06-06 · unverdicted · none · ref 66 · internal anchor
RECENT decouples skill semantics from embodiment-specific bindings via code refactoring to let small language models achieve skill grounding performance matching large language model baselines.
Personalizing Embodied Multimodal Large Language Model Agents over Long-term User Interactions cs.AI · 2026-05-25 · unverdicted · none · ref 5 · internal anchor
POLAR organizes prior interactions into a multimodal knowledge graph with semantic and episodic memory to improve personalized embodied task execution across multiple MLLM backbones.
Capabilities of Gemini Models in Medicine cs.AI · 2024-04-29 · unverdicted · none · ref 26 · internal anchor
Med-Gemini sets new records on 10 of 14 medical benchmarks including 91.1% on MedQA-USMLE, beats GPT-4V by 44.5% on multimodal tasks, and surpasses humans on medical text summarization.
Cognitive Architectures for Language Agents cs.AI · 2023-09-05 · accept · none · ref 20 · internal anchor
CoALA is a modular cognitive architecture for language agents that organizes memory components, action spaces for internal and external interaction, and a generalized decision-making loop to support more systematic development of capable agents.
Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory cs.AI · 2023-05-25 · conditional · none · ref 4 · internal anchor
GITM uses LLMs to generate action plans from text knowledge and memory, enabling agents to complete long-horizon Minecraft tasks at much higher success rates than prior RL methods.
FADE: Mitigating Hallucinations by Reducing Language-Prior Dominance in Large Vision-Language Models cs.AI · 2026-06-28 · unverdicted · none · ref 11 · 2 links · internal anchor
FADE attenuates FFN outputs at critical layers in LVLMs to curb language-prior dominance and cut hallucinations, shown effective on POPE, CHAIR, and MME across three models.
Self-Evolving Cognitive Framework via Causal World Modeling for Embodied Scientific Intelligence cs.AI · 2026-06-21 · unverdicted · none · ref 6 · internal anchor
Proposes a self-evolving cognitive framework integrating causal world modeling, intervention-driven reasoning, and continual refinement for embodied scientific intelligence.
Sensorimotor Self-Recognition in Multimodal Large Language Model-Driven Robots cs.AI · 2025-05-25 · unverdicted · none · ref 19 · internal anchor
Multimodal LLMs in robots develop self-identification and predictive awareness through sensorimotor loops, with structural equation modeling linking sensory integration to dimensions of the minimal self.
Agent AI: Surveying the Horizons of Multimodal Interaction cs.AI · 2024-01-07 · unverdicted · none · ref 119 · internal anchor
The paper defines Agent AI as interactive multimodal systems that perceive grounded data and generate embodied actions, arguing this approach can mitigate hallucinations in foundation models.
Physically Native World Models: A Hamiltonian Perspective on Generative World Modeling cs.AI · 2026-05-01 · unreviewed · ref 5 · 2 links · internal anchor
