World Action Models are Zero-shot Policies

George Kurian, Kaiyuan Zheng, Seonghyeon Ye, Shenyuan Gao, Sihyun Yu, Yunhao Ge · 2026 · cs.RO · arXiv 2602.15922

Canonical reference. 90% of citing Pith papers cite this work as background.

138 Pith papers citing it

Background 90% of classified citations

abstract

State-of-the-art Vision-Language-Action (VLA) models excel at semantic generalization but struggle to generalize to unseen physical motions in novel environments. We introduce DreamZero, a World Action Model (WAM) built upon a pretrained video diffusion backbone. Unlike VLAs, WAMs learn physical dynamics by predicting future world states and actions, using video as a dense representation of how the world evolves. By jointly modeling video and action, DreamZero learns diverse skills effectively from heterogeneous robot data without relying on repetitive demonstrations. This results in over 2x improvement in generalization to new tasks and environments compared to state-of-the-art VLAs in real robot experiments. Crucially, through model and system optimizations, we enable a 14B autoregressive video diffusion model to perform real-time closed-loop control at 7Hz. Finally, we demonstrate two forms of cross-embodiment transfer: video-only demonstrations from other robots or humans yield a relative improvement of over 42% on unseen task performance with just 10-20 minutes of data. More surprisingly, DreamZero enables few-shot embodiment adaptation, transferring to a new embodiment with only 30 minutes of play data while retaining zero-shot generalization.

citation-role summary

background 28 · baseline 1 · dataset 1

citation-polarity summary

background 27 · unclear 2 · baseline 1

authors

George Kurian, Kaiyuan Zheng, Seonghyeon Ye, Shenyuan Gao, Sihyun Yu, Yunhao Ge

representative citing papers

HABIT: Human-Aware Behavior and Interaction Training Dataset for Robot Manipulation

cs.RO · 2026-06-30 · unverdicted · novelty 8.0

HABIT is a large-scale robot demonstration dataset for human-present environments that elicits spatiotemporal synchronization, yielding, and gesture grounding behaviors absent from robot-only training data.

Embodied.cpp: A Portable Inference Runtime of Embodied AI Models on Heterogeneous Robots

cs.RO · 2026-07-02 · unverdicted · novelty 7.0

Embodied.cpp introduces a portable C++ inference runtime with modular layers for deploying VLA and WAM models on heterogeneous robots, reporting 100% and 91% task success on two models plus memory reduction on a WAM benchmark.

Pondering the Way: Spatial-perceiving World Action Model for Embodied Navigation

cs.RO · 2026-06-29 · unverdicted · novelty 7.0

SWAM jointly generates intermediate RGB-D sequences and action trajectories from monocular RGB start/goal observations for embodied navigation.

Improving Robotic Generalist Policies via Flow Reversal Steering

cs.RO · 2026-06-11 · unverdicted · novelty 7.0

Flow Reversal Steering steers flow matching generalist policies by reversing suboptimal actions to nearby better modes, enabling improved zero-shot control, quick distillation, and RL bootstrapping in robotic manipulation.

FTP-1: A Generalist Foundation Tactile Policy Across Tactile Sensors for Contact-Rich Manipulation

cs.RO · 2026-06-11 · unverdicted · novelty 7.0

FTP-1 is the first foundation tactile policy pretrained on ~3000 hours of data from 26 sources across 21 sensors that improves performance on seen setups by 17.2% and transfers to unseen sensors with 31% success rate gain.

UMI-Bench 1.0: An Open and Reproducible Real-World Benchmark for Tabletop Robotic Manipulation with UMI Data

cs.RO · 2026-06-09 · unverdicted · novelty 7.0

UMI-Bench 1.0 is presented as the first open benchmark dedicated to reproducible real-world evaluation of Universal Manipulation Interface policies.

ActProbe: Action-Space Probe for Early Failure Detection of Generative Robot Policies

cs.RO · 2026-06-07 · unverdicted · novelty 7.0

ActProbe is an action-space detector that uses temporal consistency error and action chunk magnitude from policy outputs, mapped via LSTM-MLP, to predict failures earlier than baselines across policies and real-robot tasks.

VoLo: A Physical Orchestrator for Open-Vocabulary Long-Horizon Manipulation

cs.RO · 2026-06-05 · unverdicted · novelty 7.0

VoLoAgent uses a VLM to steer heterogeneous robot capabilities as interruptible tools for long-horizon manipulation and introduces the RoboVoLo benchmark, claiming substantial outperformance over single VLA/VLM or tool-based systems with real-robot validation.

ActionMap: Robot Policy Learning via Voxel Action Heatmap

cs.RO · 2026-06-05 · unverdicted · novelty 7.0

ActionMap introduces a voxel heatmap action head for VLA models that improves policy learning by exploiting geometric structure in the action space.

Dream.exe: Can Video Generation Models Dream Executable Robot Manipulation?

cs.CV · 2026-06-03 · unverdicted · novelty 7.0

Dream.exe evaluates 8 video generation models on 101 manipulation tasks by converting generated videos into executable robot trajectories in a simulator, finding measurable success rates that visual metrics do not predict.

World Models as Group Actions

cs.CV · 2026-05-23 · unverdicted · novelty 7.0

Formalizes video world models as group actions on states and uses latent regularization with synthesized supervision to enforce consistency, introducing GAC and GAR metrics that improve structural correctness in SOTA models.

Point Tracking Improves World Action Models

cs.RO · 2026-05-22 · unverdicted · novelty 7.0

JOPAT jointly models pixels, point tracks, and actions in a diffusion transformer and reports gains over pixel-only baselines on long-horizon robot tasks with occlusion and off-screen motion.

WorldVLN: Autoregressive World Action Model for Aerial Vision-Language Navigation

cs.RO · 2026-05-15 · unverdicted · novelty 7.0

WorldVLN proposes the first autoregressive world action model for aerial vision-language navigation that predicts short-horizon latent world states, decodes them to waypoints in closed loop, and uses two-stage training with Action-aware GRPO to achieve over 12% success-rate gains on benchmarks plus zero…

CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL

cs.CV · 2026-05-14 · conditional · novelty 7.0

CreFlow combines LTL compositional rewards with credit-aware NFT and corrective reflow losses in online RL to improve embodied video diffusion models, raising downstream task success by 23.8 percentage points on eight bimanual manipulation tasks.

From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation

cs.RO · 2026-05-12 · unverdicted · novelty 7.0

MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.

ACWM-Phys: Investigating Generalized Physical Interaction in Action-Conditioned Video World Models

cs.CV · 2026-05-09 · unverdicted · novelty 7.0 · 2 refs

ACWM-Phys is a controllable simulator benchmark with in- and out-of-distribution protocols for evaluating action-conditioned world models across rigid, kinematic, deformable, and particle dynamics.

NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models

cs.RO · 2026-05-08 · unverdicted · novelty 7.0

NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.

OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

cs.RO · 2026-05-07 · unverdicted · novelty 7.0

OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.

EA-WM: Event-Aware Generative World Model with Structured Kinematic-to-Visual Action Fields

cs.CV · 2026-05-07 · unverdicted · novelty 7.0

EA-WM generates more accurate robot world rollouts by projecting actions as structured visual fields in camera space and using event-aware bidirectional fusion to better capture interaction dynamics.

Being-H0.7: A Latent World-Action Model from Egocentric Videos

cs.RO · 2026-04-30 · unverdicted · novelty 7.0

Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

π₀.₇: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities

cs.LG · 2026-04-16 · unverdicted · novelty 7.0

π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.

ViVa: A Video-Generative Value Model for Robot Reinforcement Learning

cs.RO · 2026-04-09 · unverdicted · novelty 7.0

ViVa turns a video generator into a value model for robot RL that jointly forecasts future states and task value, yielding better performance on real-world box assembly when integrated with RECAP.

MoRight: Motion Control Done Right

cs.CV · 2026-04-08 · unverdicted · novelty 7.0

MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply causality in video generation.

Action Images: End-to-End Policy Learning via Multiview Video Generation

cs.CV · 2026-04-07 · unverdicted · novelty 7.0

Action Images turn robot arm motions into interpretable multiview pixel videos, letting video backbones serve as zero-shot policies for end-to-end robot learning.

citing papers explorer

Showing 50 of 138 citing papers after filters.

HABIT: Human-Aware Behavior and Interaction Training Dataset for Robot Manipulation cs.RO · 2026-06-30 · unverdicted · none · ref 22 · internal anchor
HABIT is a large-scale robot demonstration dataset for human-present environments that elicits spatiotemporal synchronization, yielding, and gesture grounding behaviors absent from robot-only training data.
Embodied.cpp: A Portable Inference Runtime of Embodied AI Models on Heterogeneous Robots cs.RO · 2026-07-02 · unverdicted · none · ref 30 · internal anchor
Embodied.cpp introduces a portable C++ inference runtime with modular layers for deploying VLA and WAM models on heterogeneous robots, reporting 100% and 91% task success on two models plus memory reduction on a WAM benchmark.
Pondering the Way: Spatial-perceiving World Action Model for Embodied Navigation cs.RO · 2026-06-29 · unverdicted · none · ref 47 · internal anchor
SWAM jointly generates intermediate RGB-D sequences and action trajectories from monocular RGB start/goal observations for embodied navigation.
Improving Robotic Generalist Policies via Flow Reversal Steering cs.RO · 2026-06-11 · unverdicted · none · ref 4 · internal anchor
Flow Reversal Steering steers flow matching generalist policies by reversing suboptimal actions to nearby better modes, enabling improved zero-shot control, quick distillation, and RL bootstrapping in robotic manipulation.
FTP-1: A Generalist Foundation Tactile Policy Across Tactile Sensors for Contact-Rich Manipulation cs.RO · 2026-06-11 · unverdicted · none · ref 55 · internal anchor
FTP-1 is the first foundation tactile policy pretrained on ~3000 hours of data from 26 sources across 21 sensors that improves performance on seen setups by 17.2% and transfers to unseen sensors with 31% success rate gain.
UMI-Bench 1.0: An Open and Reproducible Real-World Benchmark for Tabletop Robotic Manipulation with UMI Data cs.RO · 2026-06-09 · unverdicted · none · ref 19 · internal anchor
UMI-Bench 1.0 is presented as the first open benchmark dedicated to reproducible real-world evaluation of Universal Manipulation Interface policies.
ActProbe: Action-Space Probe for Early Failure Detection of Generative Robot Policies cs.RO · 2026-06-07 · unverdicted · none · ref 8 · internal anchor
ActProbe is an action-space detector that uses temporal consistency error and action chunk magnitude from policy outputs, mapped via LSTM-MLP, to predict failures earlier than baselines across policies and real-robot tasks.
VoLo: A Physical Orchestrator for Open-Vocabulary Long-Horizon Manipulation cs.RO · 2026-06-05 · unverdicted · none · ref 74 · internal anchor
VoLoAgent uses a VLM to steer heterogeneous robot capabilities as interruptible tools for long-horizon manipulation and introduces the RoboVoLo benchmark, claiming substantial outperformance over single VLA/VLM or tool-based systems with real-robot validation.
ActionMap: Robot Policy Learning via Voxel Action Heatmap cs.RO · 2026-06-05 · unverdicted · none · ref 38 · internal anchor
ActionMap introduces a voxel heatmap action head for VLA models that improves policy learning by exploiting geometric structure in the action space.
Dream.exe: Can Video Generation Models Dream Executable Robot Manipulation? cs.CV · 2026-06-03 · unverdicted · none · ref 38 · internal anchor
Dream.exe evaluates 8 video generation models on 101 manipulation tasks by converting generated videos into executable robot trajectories in a simulator, finding measurable success rates that visual metrics do not predict.
World Models as Group Actions cs.CV · 2026-05-23 · unverdicted · none · ref 55 · internal anchor
Formalizes video world models as group actions on states and uses latent regularization with synthesized supervision to enforce consistency, introducing GAC and GAR metrics that improve structural correctness in SOTA models.
Point Tracking Improves World Action Models cs.RO · 2026-05-22 · unverdicted · none · ref 26 · internal anchor
JOPAT jointly models pixels, point tracks, and actions in a diffusion transformer and reports gains over pixel-only baselines on long-horizon robot tasks with occlusion and off-screen motion.
WorldVLN: Autoregressive World Action Model for Aerial Vision-Language Navigation cs.RO · 2026-05-15 · unverdicted · none · ref 47 · internal anchor
WorldVLN proposes the first autoregressive world action model for aerial vision-language navigation that predicts short-horizon latent world states, decodes them to waypoints in closed loop, and uses two-stage training with Action-aware GRPO to achieve over 12% success-rate gains on benchmarks plus zero…
CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL cs.CV · 2026-05-14 · conditional · none · ref 40 · internal anchor
CreFlow combines LTL compositional rewards with credit-aware NFT and corrective reflow losses in online RL to improve embodied video diffusion models, raising downstream task success by 23.8 percentage points on eight bimanual manipulation tasks.
From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation cs.RO · 2026-05-12 · unverdicted · none · ref 52 · internal anchor
MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.
ACWM-Phys: Investigating Generalized Physical Interaction in Action-Conditioned Video World Models cs.CV · 2026-05-09 · unverdicted · none · ref 35 · 2 links · internal anchor
ACWM-Phys is a controllable simulator benchmark with in- and out-of-distribution protocols for evaluating action-conditioned world models across rigid, kinematic, deformable, and particle dynamics.
NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models cs.RO · 2026-05-08 · unverdicted · none · ref 3 · internal anchor
NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.
OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation cs.RO · 2026-05-07 · unverdicted · none · ref 89 · internal anchor
OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
EA-WM: Event-Aware Generative World Model with Structured Kinematic-to-Visual Action Fields cs.CV · 2026-05-07 · unverdicted · none · ref 28 · internal anchor
EA-WM generates more accurate robot world rollouts by projecting actions as structured visual fields in camera space and using event-aware bidirectional fusion to better capture interaction dynamics.
Being-H0.7: A Latent World-Action Model from Egocentric Videos cs.RO · 2026-04-30 · unverdicted · none · ref 12 · internal anchor
Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
π₀.₇: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities cs.LG · 2026-04-16 · unverdicted · none · ref 26 · internal anchor
π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.
ViVa: A Video-Generative Value Model for Robot Reinforcement Learning cs.RO · 2026-04-09 · unverdicted · none · ref 53 · internal anchor
ViVa turns a video generator into a value model for robot RL that jointly forecasts future states and task value, yielding better performance on real-world box assembly when integrated with RECAP.
MoRight: Motion Control Done Right cs.CV · 2026-04-08 · unverdicted · none · ref 85 · internal anchor
MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply causality in video generation.
Action Images: End-to-End Policy Learning via Multiview Video Generation cs.CV · 2026-04-07 · unverdicted · none · ref 65 · internal anchor
Action Images turn robot arm motions into interpretable multiview pixel videos, letting video backbones serve as zero-shot policies for end-to-end robot learning.
VT-WAM: Visual-Tactile World Action Model for Contact-Rich Manipulation cs.RO · 2026-07-02 · unverdicted · none · ref 27 · internal anchor
VT-WAM jointly predicts visual futures, tactile deformation, and actions via flow matching with Asymmetric MoT attention and contact-gated AVTAG, reporting 71.67% success on six real-world contact-rich tasks.
ACID: Action Consistency via Inverse Dynamics for Planning with World Models cs.RO · 2026-07-02 · unverdicted · none · ref 16 · internal anchor
ACID improves decision-time planning in world models by adding per-step action consistency residuals from an inverse dynamics model to the planning cost via an adaptive weight, yielding better performance with less compute across manipulation and navigation tasks.
VLAFlow: A Unified Training Framework for Vision-Language-Action Models via Co-training and Future Latent Alignment cs.CV · 2026-07-02 · unverdicted · none · ref 41 · internal anchor
VLAFlow shows that combining language-supervised co-training with future latent alignment produces the most stable transfer performance for vision-language-action models trained on mixed robot data.
ROSA: A Robotics Foundation Model Serving System for Robot Factories cs.RO · 2026-07-01 · unverdicted · none · ref 47 · internal anchor
ROSA introduces shared GPU-pool serving, robotics-aware abstractions for multi-model pipelines, and factory-productivity scheduling that improves output by up to 12.06x over dedicated per-robot systems.
RoboWorld: Fast and Reliable Neural Simulators for Generalist Robot Policy Evaluation cs.RO · 2026-07-01 · unverdicted · none · ref 48 · internal anchor
RoboWorld introduces an automated pipeline using autoregressive video world models and task-progress VLM scoring, plus Step Forcing for long-horizon stability, to achieve high correlation with real robot policy evaluation.
ABot-M0.5: Unified Mobility-and-Manipulation World Action Model cs.CV · 2026-07-01 · unverdicted · none · ref 73 · internal anchor
ABot-M0.5 proposes a unified mobility-and-manipulation world action model using three alignment strategies that achieves state-of-the-art performance on mobile and fine-grained manipulation benchmarks.
The Speedup Paradox: Rethinking Inference Speed-Quality Trade-off in Embodied Tasks cs.RO · 2026-06-26 · unverdicted · none · ref 11 · 2 links · internal anchor
TISED decomposes inference optimization effects on embodied tasks and identifies paradoxical outcomes where faster per-step inference can increase task completion time on static tasks or raise success rates on dynamic tasks.
SimFoundry: Modular and Automated Scene Generation for Policy Learning and Evaluation cs.RO · 2026-06-26 · unverdicted · none · ref 101 · internal anchor
SimFoundry automates zero-shot real-to-sim scene generation from video, producing digital twins and cousins that enable policy training with 0.911 mean Pearson correlation to real-world results and 17-40% success gains from variations.
Translation as a Bridging Action: Transferring Manipulation Skills from Humans to Robots cs.RO · 2026-06-26 · unverdicted · none · ref 64 · internal anchor
A relative wrist translation bridging action with a vision-language-action model using interleaved tokens and attention masking transfers human manipulation skills to robots more effectively than 6DoF actions.
LocalNav: Distilling Frontier VLMs and Embodied RL for On-Device Object Goal Navigation cs.RO · 2026-06-26 · unverdicted · none · ref 16 · internal anchor
Distillation from frontier VLMs plus E-RLVR regularization produces a 4B local model that achieves 34.5% SR on OVON while cutting inference latency by 82.8%.
DIM-WAM: World-Action Modeling with Diverse Historical Event Memory cs.RO · 2026-06-26 · unverdicted · none · ref 13 · internal anchor
DiM-WAM is a memory-augmented world-action model that integrates multi-scale historical events and global task progress to improve long-horizon robot manipulation performance.
ImageWAM: Do World Action Models Really Need Video Generation, or Just Image Editing? cs.CV · 2026-06-17 · unverdicted · none · ref 2 · internal anchor
ImageWAM shows image editing models can replace video generation in world action models, delivering better performance with 6x lower FLOPs and 4x lower latency by using edit-derived KV caches as compact context.
Mem-World: Memory-Augmented Action-Conditioned World Models for Persistent Robot Manipulation cs.CV · 2026-06-17 · unverdicted · none · ref 23 · internal anchor
Mem-World augments world models with W-VMem, a wrist-view-centered surfel memory, to generate persistent action-conditioned video rollouts that improve policy evaluation correlation by 14.5% and raise task success from 58% to 72%.
SC3-Eval: Evaluating Robot Foundation Models via Self-Consistent Video Generation cs.RO · 2026-06-17 · unverdicted · none · ref 16 · 2 links · internal anchor
SC3-Eval enforces three consistencies on a video model to produce policy rollouts that correlate 0.929 with real-world performance across seven vision-language-action policies and reproduce observed failure modes.
AnnotateAnything: Automatic Annotation of 3D Assets for Robot Manipulation cs.RO · 2026-06-16 · unverdicted · none · ref 4 · internal anchor
AnnotateAnything converts passive 3D assets into manipulation-ready assets by combining vision-language reasoning for semantics with parallel physics pipelines for executable action annotations such as grasps and articulations.
T-Rex: Tactile-Reactive Dexterous Manipulation cs.RO · 2026-06-15 · unverdicted · none · ref 53 · internal anchor
T-Rex introduces a large tactile dataset and MoT architecture that achieves over 30% higher success rates than baselines on 12 tasks requiring force control and deformable object handling.
Geometric Action Model for Robot Policy Learning cs.RO · 2026-06-15 · unverdicted · none · ref 23 · internal anchor
GAM splits a geometric foundation model to enable language-conditioned future geometry prediction and action decoding for robot policies, claiming superior performance on manipulation benchmarks.
RepWAM: World Action Modeling with Representation Visual-Action Tokenizers cs.CV · 2026-06-11 · unverdicted · none · ref 36 · internal anchor
RepWAM introduces representation visual-action tokenizers to pretrain world action models that jointly model future visual states and latent actions under instructions for improved robot manipulation.
MaskWAM: Unifying Mask Prompting and Prediction for World-Action Models cs.CV · 2026-06-11 · unverdicted · none · ref 57 · internal anchor
MaskWAM unifies mask prompting and prediction in world-action models via Mixture of Transformers to improve robotic policy generalization on language-ambiguous tasks.
Learning to Assist: Collaborative VLAs for Implicit Human-Robot Collaboration cs.RO · 2026-06-10 · conditional · none · ref 13 · internal anchor
VLA models with inference-time steering mitigate action leakage in implicit human-robot collaboration, supporting longer horizons and yielding faster, more reliable assembly than shorter-horizon baselines in a 16-person study.
Next Forcing: Causal World Modeling with Multi-Chunk Prediction cs.CV · 2026-06-09 · unverdicted · none · ref 62 · internal anchor
Next Forcing augments video generation models with auxiliary multi-chunk prediction modules to achieve faster training convergence, higher accuracy at high frame rates, and 2x faster inference on world modeling benchmarks.
Hierarchical Policies from Verbal and Egocentric Human Signals for Natural Human-Robot Interaction cs.RO · 2026-06-09 · unverdicted · none · ref 15 · internal anchor
EDITH combines egocentric vision and gaze from smart glasses with language in a hierarchical policy to let robots interpret brief nonverbal human intent and reduce user effort in interactive tasks.
Dream-Tac: A Unified Tactile World Action Model for Contact-Rich Robot Manipulation cs.RO · 2026-06-07 · unverdicted · none · ref 56 · internal anchor
Dream-Tac unifies visual and tactile signals in a world action model using contact-gated fusion and attention bias, reporting 31.7% average action accuracy gains on six manipulation tasks.
SIMPLE: Simulation-Based Policy Learning and Evaluation for Humanoid Loco-manipulation cs.RO · 2026-06-06 · unverdicted · none · ref 3 · internal anchor
SIMPLE is a new large-scale simulation benchmark for humanoid loco-manipulation that integrates accurate dynamics and photorealistic rendering and demonstrates policy transfer from simulation to physical robots.
DRIFT: A Residual Flow Adapter for Decoding Continuous Outputs in Vision-Language Models cs.CV · 2026-06-04 · unverdicted · none · ref 62 · internal anchor
DRIFT adapts pretrained VLMs to continuous decoding via a base predictor plus residual flow matching, outperforming regression and generative baselines on grounding and robotic control tasks.
Flash-WAM: Modality-Aware Distillation for World Action Models cs.LG · 2026-06-03 · unverdicted · none · ref 39 · internal anchor
Flash-WAM introduces modality-specific consistency parametrizations to distill joint video-action diffusion models to single-step inference, delivering 23x speedup with preserved benchmark performance.