super hub Canonical reference

PaLM-E: An Embodied Multimodal Language Model

Aakanksha Chowdhery, Brian Ichter, Corey Lynch, Danny Driess, Fei Xia, Mehdi S. M. Sajjadi · 2023 · cs.LG · arXiv 2303.03378

Canonical reference. 98% of citing Pith papers cite this work as background.

205 Pith papers citing it

Background 98% of classified citations

open full Pith review browse 205 citing papers more from Aakanksha Chowdhery arXiv PDF

abstract

Large language models excel at a wide range of complex tasks. However, enabling general inference in the real world, e.g., for robotics problems, raises the challenge of grounding. We propose embodied language models to directly incorporate real-world continuous sensor modalities into language models and thereby establish the link between words and percepts. Input to our embodied language model are multi-modal sentences that interleave visual, continuous state estimation, and textual input encodings. We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks including sequential robotic manipulation planning, visual question answering, and captioning. Our evaluations show that PaLM-E, a single large embodied multimodal model, can address a variety of embodied reasoning tasks, from a variety of observation modalities, on multiple embodiments, and further, exhibits positive transfer: the model benefits from diverse joint training across internet-scale language, vision, and visual-language domains. Our largest model, PaLM-E-562B with 562B parameters, in addition to being trained on robotics tasks, is a visual-language generalist with state-of-the-art performance on OK-VQA, and retains generalist language capabilities with increasing scale.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 55

citation-polarity summary

background 54 support 1

claims ledger

abstract Large language models excel at a wide range of complex tasks. However, enabling general inference in the real world, e.g., for robotics problems, raises the challenge of grounding. We propose embodied language models to directly incorporate real-world continuous sensor modalities into language models and thereby establish the link between words and percepts. Input to our embodied language model are multi-modal sentences that interleave visual, continuous state estimation, and textual input encodings. We train these encodings end-to-end, in conjunction with a pre-trained large language model, f

authors

Aakanksha Chowdhery Brian Ichter Corey Lynch Danny Driess Fei Xia Mehdi S. M. Sajjadi

co-cited works

representative citing papers

From Prompt to Physical Actuation: Holistic Threat Modeling of LLM-Enabled Robotic Systems

cs.CR · 2026-04-29 · unverdicted · novelty 8.0

A unified threat model for LLM-enabled robots reveals three cross-boundary attack chains from user input to unsafe physical actuation due to missing validations and unmediated crossings.

MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?

cs.CV · 2024-08-23 · conditional · novelty 8.0

MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.

Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution

cs.CL · 2023-09-28 · unverdicted · novelty 8.0

Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.

EgoSafetyBench: A Diagnostic Egocentric Video Benchmark for Evaluating Embodied VLMs as Runtime Safety Guards

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

EgoSafetyBench shows VLMs reliably spot hazard-containing videos but miss specific contextual hazards and are degraded by misleading in-scene text.

Adapting Generalist Robot Policies with Semantic Reinforcement Learning

cs.RO · 2026-06-30 · unverdicted · novelty 7.0

SARL optimizes language prompt inputs to generalist vision-language-action policies through online RL to solve complex long-horizon tasks by composing existing skills.

Revisiting Parameter Redundancy in Vision-Language-Action Models: Insights from VLM-to-VLA Adaptation

cs.RO · 2026-06-30 · unverdicted · novelty 7.0

VLA models from VLM adaptation can be pruned 12-30% via multi-module joint scheme based on divergence signals while keeping ~90% performance on LIBERO without post-pruning recovery, unlike standard criteria that collapse.

Trajectory-Level Redirection Attacks on Vision-Language-Action Models

cs.RO · 2026-06-11 · unverdicted · novelty 7.0

A prompt-only attack called command-preserving trajectory redirection can steer VLA robot behavior to attacker-chosen physical outcomes while the text still appears to match the intended task.

LIBERO-Occ: Evaluating and Improving Vision-Language-Action Models under Scene-Induced Occlusion via Viewpoint Imagination

cs.CV · 2026-06-09 · unverdicted · novelty 7.0

Introduces LIBERO-Occ benchmark showing VLA performance drop under occlusion and Viewpoint Imagination method that generates complementary views to improve robustness without extra hardware.

ReCoVLA: VLM-Guided Reward Compilation for Failure Recovery in Vision-Language-Action Policies

cs.RO · 2026-06-08 · unverdicted · novelty 7.0

ReCoVLA improves VLA policy reliability by using a VLM as a semantic reward selector to train residual recovery policies in simulation, raising average success from 36.7% to 66.7% in sim and achieving 61.7% in zero-shot sim-to-real physical tests.

Colosseum V2: Benchmarking Generalization for Vision Language Action Models

cs.RO · 2026-05-26 · unverdicted · novelty 7.0

Introduces Colosseum V2 benchmark for evaluating VLA model generalization in robotic manipulation with 28 tasks, revealing limitations in current methods and sim-real correlations.

VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

VGenST-Bench is a new video benchmark for MLLM spatio-temporal reasoning built via generative synthesis, a multi-agent pipeline with human oversight, a 3x2x2 taxonomy, and hierarchical tasks separating perception from reasoning.

Aero-World: Action-Conditioned Aerial Video Generation from Inertial Controls

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

Aero-World adapts a pretrained latent diffusion transformer for action-conditioned aerial video generation by injecting inertial action tokens and using a frozen latent-space Physics Probe for inertial consistency supervision during LoRA finetuning, with a new AeroBench benchmark showing improved AA

PRISM: : Planning and Reasoning with Intent in Simulated Embodied Environments

cs.RO · 2026-05-12 · unverdicted · novelty 7.0

PRISM is a tiered benchmark with 300 human-verified tasks across five photorealistic apartments that diagnoses embodied agent failures in basic ability, reasoning ability, and long-horizon ability using an agent-agnostic API.

ECHO: Continuous Hierarchical Memory for Vision-Language-Action Models

cs.RO · 2026-05-09 · unverdicted · novelty 7.0

ECHO organizes VLA experiences into a hierarchical memory tree in hyperbolic space via autoencoder and entailment constraints, delivering a 12.8% success-rate gain on LIBERO-Long over the pi0 baseline.

KinDER: A Physical Reasoning Benchmark for Robot Learning and Planning

cs.RO · 2026-04-28 · unverdicted · novelty 7.0

KinDER is a new open-source benchmark that demonstrates substantial gaps in current robot learning and planning methods for handling physical constraints.

AeroBridge-TTA: Test-Time Adaptive Language-Conditioned Control for UAVs

cs.RO · 2026-04-21 · unverdicted · novelty 7.0

AeroBridge-TTA achieves +22 pt average gains on out-of-distribution UAV dynamics mismatches by updating a latent state online from observed transitions in a language-conditioned policy.

Using large language models for embodied planning introduces systematic safety risks

cs.AI · 2026-04-20 · unverdicted · novelty 7.0

LLM planners for robots often produce dangerous plans even when planning succeeds, with safety awareness staying flat as model scale improves planning ability.

Watching Movies Like a Human: Egocentric Emotion Understanding for Embodied Companions

cs.CV · 2026-04-17 · conditional · novelty 7.0

Creates the first egocentric screen-view movie emotion benchmark and demonstrates that cinematic models drop sharply in Macro-F1 on realistic robot-like viewing conditions while domain-specific training improves robustness.

Mosaic: Cross-Modal Clustering for Efficient Video Understanding

cs.PF · 2026-04-11 · unverdicted · novelty 7.0

Mosaic uses cross-modal clusters as the unit for KVCache organization in VLMs to achieve up to 1.38x speedup in streaming long-video understanding.

How Far Are Large Multimodal Models from Human-Level Spatial Action? A Benchmark for Goal-Oriented Embodied Navigation in Urban Airspace

cs.AI · 2026-04-09 · unverdicted · novelty 7.0

Large multimodal models display emerging but limited spatial action capabilities in goal-oriented urban 3D navigation, remaining far from human-level performance with errors diverging rapidly after critical decision points.

KITE: Keyframe-Indexed Tokenized Evidence for VLM-Based Robot Failure Analysis

cs.RO · 2026-04-08 · unverdicted · novelty 7.0

KITE is a training-free method that uses keyframe-indexed tokenized evidence including BEV schematics to enhance VLM performance on robot failure detection, identification, localization, explanation, and correction.

LongTail Driving Scenarios with Reasoning Traces: The KITScenes LongTail Dataset

cs.CV · 2026-03-24 · unverdicted · novelty 7.0

KITScenes LongTail supplies multimodal driving data and multilingual expert reasoning traces to benchmark models on rare scenarios beyond basic safety metrics.

AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models

cs.RO · 2026-03-10 · unverdicted · novelty 7.0

AR-VLA introduces a standalone autoregressive action expert with long-lived memory that generates context-aware continuous actions for VLAs, replacing chunk-based heads with smoother trajectories and maintained task success.

Mema: Memory-Augmented Adapter for Enhanced Vision-Language Understanding

cs.CV · 2026-02-28 · unverdicted · novelty 7.0

Mema adds a stateful memory module to vision encoders that accumulates hierarchical visual features across layers and selectively injects portions back via feedback to preserve fine-grained cues, yielding consistent gains on multimodal benchmarks.

citing papers explorer

Showing 50 of 89 citing papers after filters.

Adapting Generalist Robot Policies with Semantic Reinforcement Learning cs.RO · 2026-06-30 · unverdicted · none · ref 44 · internal anchor
SARL optimizes language prompt inputs to generalist vision-language-action policies through online RL to solve complex long-horizon tasks by composing existing skills.
Revisiting Parameter Redundancy in Vision-Language-Action Models: Insights from VLM-to-VLA Adaptation cs.RO · 2026-06-30 · unverdicted · none · ref 5 · internal anchor
VLA models from VLM adaptation can be pruned 12-30% via multi-module joint scheme based on divergence signals while keeping ~90% performance on LIBERO without post-pruning recovery, unlike standard criteria that collapse.
Trajectory-Level Redirection Attacks on Vision-Language-Action Models cs.RO · 2026-06-11 · unverdicted · none · ref 11 · internal anchor
A prompt-only attack called command-preserving trajectory redirection can steer VLA robot behavior to attacker-chosen physical outcomes while the text still appears to match the intended task.
ReCoVLA: VLM-Guided Reward Compilation for Failure Recovery in Vision-Language-Action Policies cs.RO · 2026-06-08 · unverdicted · none · ref 13 · internal anchor
ReCoVLA improves VLA policy reliability by using a VLM as a semantic reward selector to train residual recovery policies in simulation, raising average success from 36.7% to 66.7% in sim and achieving 61.7% in zero-shot sim-to-real physical tests.
Colosseum V2: Benchmarking Generalization for Vision Language Action Models cs.RO · 2026-05-26 · unverdicted · none · ref 29 · internal anchor
Introduces Colosseum V2 benchmark for evaluating VLA model generalization in robotic manipulation with 28 tasks, revealing limitations in current methods and sim-real correlations.
PRISM: : Planning and Reasoning with Intent in Simulated Embodied Environments cs.RO · 2026-05-12 · unverdicted · none · ref 5 · internal anchor
PRISM is a tiered benchmark with 300 human-verified tasks across five photorealistic apartments that diagnoses embodied agent failures in basic ability, reasoning ability, and long-horizon ability using an agent-agnostic API.
ECHO: Continuous Hierarchical Memory for Vision-Language-Action Models cs.RO · 2026-05-09 · unverdicted · none · ref 18 · internal anchor
ECHO organizes VLA experiences into a hierarchical memory tree in hyperbolic space via autoencoder and entailment constraints, delivering a 12.8% success-rate gain on LIBERO-Long over the pi0 baseline.
KinDER: A Physical Reasoning Benchmark for Robot Learning and Planning cs.RO · 2026-04-28 · unverdicted · none · ref 44 · internal anchor
KinDER is a new open-source benchmark that demonstrates substantial gaps in current robot learning and planning methods for handling physical constraints.
AeroBridge-TTA: Test-Time Adaptive Language-Conditioned Control for UAVs cs.RO · 2026-04-21 · unverdicted · none · ref 2 · internal anchor
AeroBridge-TTA achieves +22 pt average gains on out-of-distribution UAV dynamics mismatches by updating a latent state online from observed transitions in a language-conditioned policy.
KITE: Keyframe-Indexed Tokenized Evidence for VLM-Based Robot Failure Analysis cs.RO · 2026-04-08 · unverdicted · none · ref 2 · internal anchor
KITE is a training-free method that uses keyframe-indexed tokenized evidence including BEV schematics to enhance VLM performance on robot failure detection, identification, localization, explanation, and correction.
AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models cs.RO · 2026-03-10 · unverdicted · none · ref 12 · internal anchor
AR-VLA introduces a standalone autoregressive action expert with long-lived memory that generates context-aware continuous actions for VLAs, replacing chunk-based heads with smoother trajectories and maintained task success.
PhysMem: Scaling Test-Time Memory for Embodied Physical Reasoning cs.RO · 2026-02-23 · unverdicted · none · ref 17 · internal anchor
PhysMem enables VLM-based robot planners to learn and verify physical properties through test-time interaction and hypothesis testing, raising success on a brick insertion task from 23% to 76%.
UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models cs.RO · 2026-02-23 · unverdicted · none · ref 27 · internal anchor
UniLACT improves VLA models by adding depth-aware unified latent action pretraining that outperforms RGB-only baselines on seen and unseen manipulation tasks.
ST-BiBench: Benchmarking Multi-Stream Multimodal Coordination in Bimanual Embodied Tasks for MLLMs cs.RO · 2026-02-09 · unverdicted · none · ref 36 · internal anchor
ST-BiBench reveals a coordination paradox in which MLLMs show strong high-level strategic reasoning yet fail at fine-grained 16-dimensional bimanual action synthesis and multi-stream fusion.
Large Video Planner Enables Generalizable Robot Control cs.RO · 2025-12-17 · conditional · none · ref 24 · internal anchor
A video foundation model trained on human demonstrations generates zero-shot plans that convert to executable robot actions on novel scenes and tasks.
RT-H: Action Hierarchies Using Language cs.RO · 2024-03-04 · conditional · none · ref 17 · internal anchor
RT-H learns robot policies by first predicting language motions as an intermediate representation and then mapping those plus the high-level task to actions, yielding more robust multi-task performance and the ability to learn from language interventions.
Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models cs.RO · 2023-10-16 · conditional · none · ref 15 · internal anchor
SuSIE uses a finetuned InstructPix2Pix diffusion model to propose subgoal images that guide a low-level goal-conditioned policy, achieving SOTA zero-shot performance on CALVIN and real-world manipulation.
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models cs.RO · 2023-07-12 · unverdicted · none · ref 68 · internal anchor
VoxPoser uses LLMs to compose 3D value maps via VLM interaction for model-based synthesis of robust robot trajectories on open-set language-specified manipulation tasks.
Path Planning in Physically Viable World Models cs.RO · 2026-07-01 · unverdicted · none · ref 25 · internal anchor
A physically viable world model augments 3D Gaussian splats with physics simulation to assess robot route feasibility under simulated terrain changes like flooding, revealing failures not visible in static maps.
Automating the Design of Embodied AgentArchitectures cs.RO · 2026-06-29 · unverdicted · none · ref 19 · internal anchor
Automated architecture search for embodied agents produces directional success-rate gains on vision-language and manipulation tasks while exposing limits from simulation noise and incomplete credit assignment.
Trust Your Instincts: Confidence-Driven Test-Time RL for Vision-Language-Action Models cs.RO · 2026-06-29 · unverdicted · none · ref 11 · internal anchor
T^2VLA is a test-time reinforcement learning framework for VLAs that uses internal confidence to define intrinsic rewards via similarity to high-confidence expert demonstrations and a dual-expert bootstrapping mechanism.
Improving Vision-Language-Action Model Fine-Tuning with Structured Stage and Keyframe Supervision cs.RO · 2026-06-25 · unverdicted · none · ref 6 · internal anchor
StaKe adds lightweight auxiliary heads for manipulation stage identification and next-gripper-transition keyframe prediction to VLA fine-tuning, reporting relative success rate gains of 14% in bimanual simulation and 56% on single-arm real-robot tasks.
SSI-Policy: Learning Structured Scene Interfaces for Vision-Language Robotic Manipulation cs.RO · 2026-06-25 · unverdicted · none · ref 40 · 2 links · internal anchor
SSI-Policy uses an RGB-only Structured Scene Interface to improve LIBERO benchmark performance by nearly 15% with only 10 demonstrations per task compared to prior methods.
FlowDPG: Deterministic Policy Gradient on Flow Matching Policies for Real-World Manipulation cs.RO · 2026-06-21 · unverdicted · none · ref 20 · internal anchor
FlowDPG distills critic gradients into flow matching velocity fields to enable BPTT-free DDPG-style policy improvement and reports 92% success on a real-world dual-arm AirPods assembly task.
Robot Critics that Sweat the Small Stuff cs.RO · 2026-06-19 · unverdicted · none · ref 3 · internal anchor
Fine-tuning VLMs with pairwise progress supervision from policy rollouts improves fine-grained failure detection and boosts robot manipulation success by 11% real-world and 5.9% in simulation.
BIT-Nav: Brain-Inspired Trajectory Memory for Embodied Navigation cs.RO · 2026-06-19 · unverdicted · none · ref 1 · internal anchor
BIT-Nav augments VLMs with a Bi-GRU trajectory embedding projected as one memory token to supply structured motion history at constant token cost.
Vesta: A Generalist Embodied Reasoning Model cs.RO · 2026-06-18 · unverdicted · none · ref 27 · internal anchor
Vesta is a unified embodied generalist model that outperforms specialist baselines by over 20% on average and improves real-world robotic task success by over 35%.
Guava: An Effective and Universal Harness for Embodied Manipulation cs.RO · 2026-06-16 · unverdicted · none · ref 2 · internal anchor
Guava harness enables 4B open-source models to achieve performance comparable to frontier models on embodied manipulation tasks by distilling capabilities from under 2K simulation trajectories using three identified design principles.
CT-VAM: A Cerebello-Thalamic-Inspired Vision-Action Model for Efficient Visuomotor Control cs.RO · 2026-06-08 · unverdicted · none · ref 3 · internal anchor
CT-VAM is a 68M-parameter cerebello-thalamic-inspired model that achieves competitive LIBERO success rates with lower inference latency than larger VLA models by using a stream-separated attention decoder called TARS.
When Video Misreads: Closed-Loop Distillation of Reading Heuristics for Exploratory Manipulation Trace QA cs.RO · 2026-06-07 · unverdicted · none · ref 14 · internal anchor
Closed-Loop Trace Distillation distills one-line natural-language prompts from labeled training traces to improve VLM accuracy on predicting minimal-success action chains in Exploratory Manipulation Trace QA by 0.38-0.47 across simulator and real-robot tasks.
TempoVLA: Learning Speed-Controllable Vision-Language-Action Policies cs.RO · 2026-06-04 · unverdicted · none · ref 21 · internal anchor
TempoVLA learns a single VLA policy with controllable execution speed via variable-speed trajectory augmentation and explicit speed conditioning.
Auditing LLM-Governed Social Robots with Culture-Specific Moral Gradients cs.RO · 2026-06-02 · unverdicted · none · ref 14 · internal anchor
Introduces a gradient-based multilingual audit framework for LLM moral decisions in robot assistance scenarios and reports persistent culturally asymmetric gradient tracking failures not fixed by prompting.
RoboSemanticBench: Diagnosing Semantic Grounding in Action Prediction for VLA Models cs.RO · 2026-06-01 · unverdicted · none · ref 8 · internal anchor
RoboSemanticBench reveals that representative VLA models grasp blocks successfully but select the semantically correct answer at near-random rates, indicating a gap between backbone semantics and action prediction.
Continuous Reasoning for Vision-Language-Action cs.RO · 2026-05-29 · unverdicted · none · ref 3 · internal anchor
Continuous Reasoning for VLA introduces a shared Gaussian latent for continuous thoughts, trained with self-verification to improve action prediction on LIBERO-PRO and real robots.
Turning Video Models into Generalist Robot Policies cs.RO · 2026-05-27 · unverdicted · none · ref 29 · internal anchor
Decouples action-free video world models from embodiment-specific IDMs using Jacobian-based translation to achieve zero-shot cross-embodiment robot policies.
How to Instruct Your Robot: Dense Language Annotations Power Robot Policy Learning cs.RO · 2026-05-16 · unverdicted · none · ref 10 · internal anchor
DeMiAn re-annotates robot and egocentric videos with VLM-generated dense labels across motion, scene, pose, and reasoning aspects, then uses a learned instructor to boost policy success by 5 points on RoboCasa over task-only baselines.
BioProVLA-Agent: An Affordable, Protocol-Driven, Vision-Enhanced VLA-Enabled Embodied Multi-Agent System with Closed-Loop-Capable Reasoning for Biological Laboratory Manipulation cs.RO · 2026-05-08 · unverdicted · none · ref 25 · 2 links · internal anchor
Presents BioProVLA-Agent, a protocol-driven VLA-enabled multi-agent system for embodied biological manipulation with visual state verification and AugSmolVLA augmentation for robustness in wet-lab conditions.
Affordance Agent Harness: Verification-Gated Skill Orchestration cs.RO · 2026-05-01 · unverdicted · none · ref 16 · 2 links · internal anchor
Affordance Agent Harness is a verification-gated orchestration system that unifies skills via an evidence store, episodic memory priors, an adaptive router, and a self-consistency verifier to improve accuracy-cost tradeoffs in open-world affordance grounding.
An LLM-Driven Closed-Loop Autonomous Learning Framework for Robots Facing Uncovered Tasks in Open Environments cs.RO · 2026-04-24 · unverdicted · none · ref 4 · internal anchor
Robots autonomously convert LLM-guided experiences into a reusable local method library, reducing average execution time from 7.7772s to 6.7779s and LLM calls per task from 1.0 to 0.2 in repeated-task experiments.
Navigating the Clutter: Waypoint-Based Bi-Level Planning for Multi-Robot Systems cs.RO · 2026-04-22 · unverdicted · none · ref 106 · internal anchor
Waypoint-based bi-level planning with curriculum RLVR improves multi-robot task success rates in dense-obstacle benchmarks over motion-agnostic and VLA baselines.
EmbodiedGovBench: A Benchmark for Governance, Recovery, and Upgrade Safety in Embodied Agent Systems cs.RO · 2026-04-13 · unverdicted · none · ref 2 · internal anchor
EmbodiedGovBench is a new benchmark framework that measures embodied agent systems on seven governance dimensions including policy adherence, recovery success, and upgrade safety.
A Physical Agentic Loop for Language-Guided Grasping with Execution-State Monitoring cs.RO · 2026-04-08 · unverdicted · none · ref 8 · internal anchor
A physical agentic loop with execution-state monitoring improves robustness of language-guided grasping over open-loop execution by converting noisy telemetry into discrete outcome events that trigger retries or user escalation.
World Action Models are Zero-shot Policies cs.RO · 2026-02-17 · unverdicted · none · ref 23 · internal anchor
DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment transfer with 10-30 minutes of data.
Real2Sim via Active Perception with Behavior Trees Automatically Generated by VLMs cs.RO · 2026-01-13 · unverdicted · none · ref 7 · internal anchor
An intent-driven Real2Sim framework uses VLMs for semantic task decomposition to identify missing physical parameters and generates reactive behavior trees to acquire them via contact-rich robotic interactions on a Franka Panda arm.
PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation cs.RO · 2026-01-11 · unverdicted · none · ref 23 · internal anchor
PALM improves long-horizon robotic manipulation success by distilling affordance representations for object interaction and predicting within-subtask progress in a VLA model.
SPEAR-1: Scaling Beyond Robot Demonstrations via 3D Understanding cs.RO · 2025-11-21 · unverdicted · none · ref 11 · internal anchor
SPEAR-1 combines a 3D-enriched VLM with embodied control to match or exceed existing robotic foundation models using 20 times fewer robot demonstrations.
AsyncVLA: Asynchronous Flow Matching for Vision-Language-Action Models cs.RO · 2025-11-18 · unverdicted · none · ref 15 · internal anchor
AsyncVLA adds asynchronous flow matching and a confidence rater to VLA models so they can generate actions on flexible schedules and selectively refine low-confidence tokens before execution.
GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data cs.RO · 2025-05-06 · unverdicted · none · ref 25 · internal anchor
GraspVLA shows that pretraining a grasping model on a billion synthetic action frames enables zero-shot open-vocabulary performance and sim-to-real transfer.
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots cs.RO · 2025-03-18 · unverdicted · none · ref 26 · internal anchor
GR00T N1 is a new open VLA foundation model for humanoid robots that outperforms imitation learning baselines in simulation and shows strong performance on real-world bimanual manipulation tasks.
Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models cs.RO · 2025-02-26 · unverdicted · none · ref 10 · internal anchor
A hierarchical VLA architecture lets robots follow complex instructions and situated feedback by separating high-level reasoning from low-level control.

PaLM-E: An Embodied Multimodal Language Model

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer