A unified threat model for LLM-enabled robots reveals three cross-boundary attack chains from user input to unsafe physical actuation due to missing validations and unmediated crossings.
super hub Canonical reference
PaLM-E: An Embodied Multimodal Language Model
Canonical reference. 98% of citing Pith papers cite this work as background.
abstract
Large language models excel at a wide range of complex tasks. However, enabling general inference in the real world, e.g., for robotics problems, raises the challenge of grounding. We propose embodied language models to directly incorporate real-world continuous sensor modalities into language models and thereby establish the link between words and percepts. Input to our embodied language model are multi-modal sentences that interleave visual, continuous state estimation, and textual input encodings. We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks including sequential robotic manipulation planning, visual question answering, and captioning. Our evaluations show that PaLM-E, a single large embodied multimodal model, can address a variety of embodied reasoning tasks, from a variety of observation modalities, on multiple embodiments, and further, exhibits positive transfer: the model benefits from diverse joint training across internet-scale language, vision, and visual-language domains. Our largest model, PaLM-E-562B with 562B parameters, in addition to being trained on robotics tasks, is a visual-language generalist with state-of-the-art performance on OK-VQA, and retains generalist language capabilities with increasing scale.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract Large language models excel at a wide range of complex tasks. However, enabling general inference in the real world, e.g., for robotics problems, raises the challenge of grounding. We propose embodied language models to directly incorporate real-world continuous sensor modalities into language models and thereby establish the link between words and percepts. Input to our embodied language model are multi-modal sentences that interleave visual, continuous state estimation, and textual input encodings. We train these encodings end-to-end, in conjunction with a pre-trained large language model, f
authors
co-cited works
roles
background 55representative citing papers
MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.
Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.
EgoSafetyBench shows VLMs reliably spot hazard-containing videos but miss specific contextual hazards and are degraded by misleading in-scene text.
SARL optimizes language prompt inputs to generalist vision-language-action policies through online RL to solve complex long-horizon tasks by composing existing skills.
VLA models from VLM adaptation can be pruned 12-30% via multi-module joint scheme based on divergence signals while keeping ~90% performance on LIBERO without post-pruning recovery, unlike standard criteria that collapse.
A prompt-only attack called command-preserving trajectory redirection can steer VLA robot behavior to attacker-chosen physical outcomes while the text still appears to match the intended task.
Introduces LIBERO-Occ benchmark showing VLA performance drop under occlusion and Viewpoint Imagination method that generates complementary views to improve robustness without extra hardware.
ReCoVLA improves VLA policy reliability by using a VLM as a semantic reward selector to train residual recovery policies in simulation, raising average success from 36.7% to 66.7% in sim and achieving 61.7% in zero-shot sim-to-real physical tests.
Introduces Colosseum V2 benchmark for evaluating VLA model generalization in robotic manipulation with 28 tasks, revealing limitations in current methods and sim-real correlations.
VGenST-Bench is a new video benchmark for MLLM spatio-temporal reasoning built via generative synthesis, a multi-agent pipeline with human oversight, a 3x2x2 taxonomy, and hierarchical tasks separating perception from reasoning.
Aero-World adapts a pretrained latent diffusion transformer for action-conditioned aerial video generation by injecting inertial action tokens and using a frozen latent-space Physics Probe for inertial consistency supervision during LoRA finetuning, with a new AeroBench benchmark showing improved AA
PRISM is a tiered benchmark with 300 human-verified tasks across five photorealistic apartments that diagnoses embodied agent failures in basic ability, reasoning ability, and long-horizon ability using an agent-agnostic API.
ECHO organizes VLA experiences into a hierarchical memory tree in hyperbolic space via autoencoder and entailment constraints, delivering a 12.8% success-rate gain on LIBERO-Long over the pi0 baseline.
KinDER is a new open-source benchmark that demonstrates substantial gaps in current robot learning and planning methods for handling physical constraints.
AeroBridge-TTA achieves +22 pt average gains on out-of-distribution UAV dynamics mismatches by updating a latent state online from observed transitions in a language-conditioned policy.
LLM planners for robots often produce dangerous plans even when planning succeeds, with safety awareness staying flat as model scale improves planning ability.
Creates the first egocentric screen-view movie emotion benchmark and demonstrates that cinematic models drop sharply in Macro-F1 on realistic robot-like viewing conditions while domain-specific training improves robustness.
Mosaic uses cross-modal clusters as the unit for KVCache organization in VLMs to achieve up to 1.38x speedup in streaming long-video understanding.
Large multimodal models display emerging but limited spatial action capabilities in goal-oriented urban 3D navigation, remaining far from human-level performance with errors diverging rapidly after critical decision points.
KITE is a training-free method that uses keyframe-indexed tokenized evidence including BEV schematics to enhance VLM performance on robot failure detection, identification, localization, explanation, and correction.
KITScenes LongTail supplies multimodal driving data and multilingual expert reasoning traces to benchmark models on rare scenarios beyond basic safety metrics.
AR-VLA introduces a standalone autoregressive action expert with long-lived memory that generates context-aware continuous actions for VLAs, replacing chunk-based heads with smoother trajectories and maintained task success.
Mema adds a stateful memory module to vision encoders that accumulates hierarchical visual features across layers and selectively injects portions back via feedback to preserve fine-grained cues, yielding consistent gains on multimodal benchmarks.
citing papers explorer
-
Adapting Generalist Robot Policies with Semantic Reinforcement Learning
SARL optimizes language prompt inputs to generalist vision-language-action policies through online RL to solve complex long-horizon tasks by composing existing skills.
-
Revisiting Parameter Redundancy in Vision-Language-Action Models: Insights from VLM-to-VLA Adaptation
VLA models from VLM adaptation can be pruned 12-30% via multi-module joint scheme based on divergence signals while keeping ~90% performance on LIBERO without post-pruning recovery, unlike standard criteria that collapse.
-
Trajectory-Level Redirection Attacks on Vision-Language-Action Models
A prompt-only attack called command-preserving trajectory redirection can steer VLA robot behavior to attacker-chosen physical outcomes while the text still appears to match the intended task.
-
ReCoVLA: VLM-Guided Reward Compilation for Failure Recovery in Vision-Language-Action Policies
ReCoVLA improves VLA policy reliability by using a VLM as a semantic reward selector to train residual recovery policies in simulation, raising average success from 36.7% to 66.7% in sim and achieving 61.7% in zero-shot sim-to-real physical tests.
-
Colosseum V2: Benchmarking Generalization for Vision Language Action Models
Introduces Colosseum V2 benchmark for evaluating VLA model generalization in robotic manipulation with 28 tasks, revealing limitations in current methods and sim-real correlations.
-
PRISM: : Planning and Reasoning with Intent in Simulated Embodied Environments
PRISM is a tiered benchmark with 300 human-verified tasks across five photorealistic apartments that diagnoses embodied agent failures in basic ability, reasoning ability, and long-horizon ability using an agent-agnostic API.
-
ECHO: Continuous Hierarchical Memory for Vision-Language-Action Models
ECHO organizes VLA experiences into a hierarchical memory tree in hyperbolic space via autoencoder and entailment constraints, delivering a 12.8% success-rate gain on LIBERO-Long over the pi0 baseline.
-
KinDER: A Physical Reasoning Benchmark for Robot Learning and Planning
KinDER is a new open-source benchmark that demonstrates substantial gaps in current robot learning and planning methods for handling physical constraints.
-
AeroBridge-TTA: Test-Time Adaptive Language-Conditioned Control for UAVs
AeroBridge-TTA achieves +22 pt average gains on out-of-distribution UAV dynamics mismatches by updating a latent state online from observed transitions in a language-conditioned policy.
-
KITE: Keyframe-Indexed Tokenized Evidence for VLM-Based Robot Failure Analysis
KITE is a training-free method that uses keyframe-indexed tokenized evidence including BEV schematics to enhance VLM performance on robot failure detection, identification, localization, explanation, and correction.
-
AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models
AR-VLA introduces a standalone autoregressive action expert with long-lived memory that generates context-aware continuous actions for VLAs, replacing chunk-based heads with smoother trajectories and maintained task success.
-
PhysMem: Scaling Test-Time Memory for Embodied Physical Reasoning
PhysMem enables VLM-based robot planners to learn and verify physical properties through test-time interaction and hypothesis testing, raising success on a brick insertion task from 23% to 76%.
-
UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models
UniLACT improves VLA models by adding depth-aware unified latent action pretraining that outperforms RGB-only baselines on seen and unseen manipulation tasks.
-
ST-BiBench: Benchmarking Multi-Stream Multimodal Coordination in Bimanual Embodied Tasks for MLLMs
ST-BiBench reveals a coordination paradox in which MLLMs show strong high-level strategic reasoning yet fail at fine-grained 16-dimensional bimanual action synthesis and multi-stream fusion.
-
Large Video Planner Enables Generalizable Robot Control
A video foundation model trained on human demonstrations generates zero-shot plans that convert to executable robot actions on novel scenes and tasks.
-
RT-H: Action Hierarchies Using Language
RT-H learns robot policies by first predicting language motions as an intermediate representation and then mapping those plus the high-level task to actions, yielding more robust multi-task performance and the ability to learn from language interventions.
-
Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models
SuSIE uses a finetuned InstructPix2Pix diffusion model to propose subgoal images that guide a low-level goal-conditioned policy, achieving SOTA zero-shot performance on CALVIN and real-world manipulation.
-
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
VoxPoser uses LLMs to compose 3D value maps via VLM interaction for model-based synthesis of robust robot trajectories on open-set language-specified manipulation tasks.
-
Path Planning in Physically Viable World Models
A physically viable world model augments 3D Gaussian splats with physics simulation to assess robot route feasibility under simulated terrain changes like flooding, revealing failures not visible in static maps.
-
Automating the Design of Embodied AgentArchitectures
Automated architecture search for embodied agents produces directional success-rate gains on vision-language and manipulation tasks while exposing limits from simulation noise and incomplete credit assignment.
-
Trust Your Instincts: Confidence-Driven Test-Time RL for Vision-Language-Action Models
T^2VLA is a test-time reinforcement learning framework for VLAs that uses internal confidence to define intrinsic rewards via similarity to high-confidence expert demonstrations and a dual-expert bootstrapping mechanism.
-
Improving Vision-Language-Action Model Fine-Tuning with Structured Stage and Keyframe Supervision
StaKe adds lightweight auxiliary heads for manipulation stage identification and next-gripper-transition keyframe prediction to VLA fine-tuning, reporting relative success rate gains of 14% in bimanual simulation and 56% on single-arm real-robot tasks.
-
SSI-Policy: Learning Structured Scene Interfaces for Vision-Language Robotic Manipulation
SSI-Policy uses an RGB-only Structured Scene Interface to improve LIBERO benchmark performance by nearly 15% with only 10 demonstrations per task compared to prior methods.
-
FlowDPG: Deterministic Policy Gradient on Flow Matching Policies for Real-World Manipulation
FlowDPG distills critic gradients into flow matching velocity fields to enable BPTT-free DDPG-style policy improvement and reports 92% success on a real-world dual-arm AirPods assembly task.
-
Robot Critics that Sweat the Small Stuff
Fine-tuning VLMs with pairwise progress supervision from policy rollouts improves fine-grained failure detection and boosts robot manipulation success by 11% real-world and 5.9% in simulation.
-
BIT-Nav: Brain-Inspired Trajectory Memory for Embodied Navigation
BIT-Nav augments VLMs with a Bi-GRU trajectory embedding projected as one memory token to supply structured motion history at constant token cost.
-
Vesta: A Generalist Embodied Reasoning Model
Vesta is a unified embodied generalist model that outperforms specialist baselines by over 20% on average and improves real-world robotic task success by over 35%.
-
Guava: An Effective and Universal Harness for Embodied Manipulation
Guava harness enables 4B open-source models to achieve performance comparable to frontier models on embodied manipulation tasks by distilling capabilities from under 2K simulation trajectories using three identified design principles.
-
CT-VAM: A Cerebello-Thalamic-Inspired Vision-Action Model for Efficient Visuomotor Control
CT-VAM is a 68M-parameter cerebello-thalamic-inspired model that achieves competitive LIBERO success rates with lower inference latency than larger VLA models by using a stream-separated attention decoder called TARS.
-
When Video Misreads: Closed-Loop Distillation of Reading Heuristics for Exploratory Manipulation Trace QA
Closed-Loop Trace Distillation distills one-line natural-language prompts from labeled training traces to improve VLM accuracy on predicting minimal-success action chains in Exploratory Manipulation Trace QA by 0.38-0.47 across simulator and real-robot tasks.
-
TempoVLA: Learning Speed-Controllable Vision-Language-Action Policies
TempoVLA learns a single VLA policy with controllable execution speed via variable-speed trajectory augmentation and explicit speed conditioning.
-
Auditing LLM-Governed Social Robots with Culture-Specific Moral Gradients
Introduces a gradient-based multilingual audit framework for LLM moral decisions in robot assistance scenarios and reports persistent culturally asymmetric gradient tracking failures not fixed by prompting.
-
RoboSemanticBench: Diagnosing Semantic Grounding in Action Prediction for VLA Models
RoboSemanticBench reveals that representative VLA models grasp blocks successfully but select the semantically correct answer at near-random rates, indicating a gap between backbone semantics and action prediction.
-
Continuous Reasoning for Vision-Language-Action
Continuous Reasoning for VLA introduces a shared Gaussian latent for continuous thoughts, trained with self-verification to improve action prediction on LIBERO-PRO and real robots.
-
Turning Video Models into Generalist Robot Policies
Decouples action-free video world models from embodiment-specific IDMs using Jacobian-based translation to achieve zero-shot cross-embodiment robot policies.
-
How to Instruct Your Robot: Dense Language Annotations Power Robot Policy Learning
DeMiAn re-annotates robot and egocentric videos with VLM-generated dense labels across motion, scene, pose, and reasoning aspects, then uses a learned instructor to boost policy success by 5 points on RoboCasa over task-only baselines.
-
BioProVLA-Agent: An Affordable, Protocol-Driven, Vision-Enhanced VLA-Enabled Embodied Multi-Agent System with Closed-Loop-Capable Reasoning for Biological Laboratory Manipulation
Presents BioProVLA-Agent, a protocol-driven VLA-enabled multi-agent system for embodied biological manipulation with visual state verification and AugSmolVLA augmentation for robustness in wet-lab conditions.
-
Affordance Agent Harness: Verification-Gated Skill Orchestration
Affordance Agent Harness is a verification-gated orchestration system that unifies skills via an evidence store, episodic memory priors, an adaptive router, and a self-consistency verifier to improve accuracy-cost tradeoffs in open-world affordance grounding.
-
An LLM-Driven Closed-Loop Autonomous Learning Framework for Robots Facing Uncovered Tasks in Open Environments
Robots autonomously convert LLM-guided experiences into a reusable local method library, reducing average execution time from 7.7772s to 6.7779s and LLM calls per task from 1.0 to 0.2 in repeated-task experiments.
-
Navigating the Clutter: Waypoint-Based Bi-Level Planning for Multi-Robot Systems
Waypoint-based bi-level planning with curriculum RLVR improves multi-robot task success rates in dense-obstacle benchmarks over motion-agnostic and VLA baselines.
-
EmbodiedGovBench: A Benchmark for Governance, Recovery, and Upgrade Safety in Embodied Agent Systems
EmbodiedGovBench is a new benchmark framework that measures embodied agent systems on seven governance dimensions including policy adherence, recovery success, and upgrade safety.
-
A Physical Agentic Loop for Language-Guided Grasping with Execution-State Monitoring
A physical agentic loop with execution-state monitoring improves robustness of language-guided grasping over open-loop execution by converting noisy telemetry into discrete outcome events that trigger retries or user escalation.
-
World Action Models are Zero-shot Policies
DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment transfer with 10-30 minutes of data.
-
Real2Sim via Active Perception with Behavior Trees Automatically Generated by VLMs
An intent-driven Real2Sim framework uses VLMs for semantic task decomposition to identify missing physical parameters and generates reactive behavior trees to acquire them via contact-rich robotic interactions on a Franka Panda arm.
-
PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation
PALM improves long-horizon robotic manipulation success by distilling affordance representations for object interaction and predicting within-subtask progress in a VLA model.
-
SPEAR-1: Scaling Beyond Robot Demonstrations via 3D Understanding
SPEAR-1 combines a 3D-enriched VLM with embodied control to match or exceed existing robotic foundation models using 20 times fewer robot demonstrations.
-
AsyncVLA: Asynchronous Flow Matching for Vision-Language-Action Models
AsyncVLA adds asynchronous flow matching and a confidence rater to VLA models so they can generate actions on flexible schedules and selectively refine low-confidence tokens before execution.
-
GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data
GraspVLA shows that pretraining a grasping model on a billion synthetic action frames enables zero-shot open-vocabulary performance and sim-to-real transfer.
-
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
GR00T N1 is a new open VLA foundation model for humanoid robots that outperforms imitation learning baselines in simulation and shows strong performance on real-world bimanual manipulation tasks.
-
Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models
A hierarchical VLA architecture lets robots follow complex instructions and situated feedback by separating high-level reasoning from low-level control.