OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

2026 · cs.RO · arXiv 2605.06481

4 Pith papers cite this work. Polarity classification is still indexing.

abstract

World Action Models (WAMs) enhance Vision-Language-Action policies by jointly predicting scene evolution and robot actions, but existing methods usually represent the predicted world as holistic images, video tokens, or global latents. These representations are difficult for an action decoder to address when an instruction refers to a particular object, especially under scene shifts where object identity is entangled with context. We propose OA-WAM, an Object-Addressable World Action Model for robust robot manipulation. OA-WAM decomposes each frame into N+1 slot states, with one robot slot and N object slots. Each slot contains a persistent address vector and a time-varying content vector, and is fused with text, image, proprioception, and past-action tokens in a block-causal sequence. A world head predicts next-frame slot states, while a flow-matching action head decodes a 16-step continuous action chunk in the same forward pass. Addressability is enforced by routing cross-slot attention through address-only keys and resetting the address slice at every transformer layer, separating which object to act on from what that object currently is without adding extra tokens. OA-WAM matches strong VLA and WAM baselines on LIBERO (97.8%) and SimplerEnv (79.3%), reaches state-of-the-art performance on the most relevant LIBERO-Plus geometric axes, and remains competitive on the seven-axis aggregate. A causal slot-intervention test yields a swap-binding cosine of 0.87, versus at most 0.09 for holistic baselines. These results suggest that addressable object states provide an effective interface for robust world-action modeling under scene perturbations.
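
The addressing mechanism described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the slot layout (address slice first, content slice after), the choice to compute queries and values from the full slot state, the residual update, and all names (`address_keyed_layer`, `w_q`, `w_k`, `w_v`, the dimensions) are assumptions made for illustration. Only two ideas come from the abstract: keys are computed from the address slice alone, and the address slice is reset at every layer.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def address_keyed_layer(slots, address, w_q, w_k, w_v):
    """One hypothetical cross-slot attention layer.

    slots:   (S, D) slot states; the first d_addr columns are the address slice.
    address: (S, d_addr) persistent address vectors.
    """
    d_addr = address.shape[1]
    slots = slots.copy()
    slots[:, :d_addr] = address          # reset the address slice on entry
    q = slots @ w_q                      # queries from the full state (assumption)
    k = slots[:, :d_addr] @ w_k          # keys from the address slice only
    v = slots @ w_v                      # values from the full state (assumption)
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    out = slots + attn @ v               # residual update (assumption)
    out[:, :d_addr] = address            # reset again so the layer cannot rewrite identity
    return out, attn

# Toy setup: 1 robot slot + 3 object slots, with made-up dimensions.
rng = np.random.default_rng(0)
S, d_addr, d_content = 4, 8, 16
D = d_addr + d_content
address = rng.normal(size=(S, d_addr))
content = rng.normal(size=(S, d_content))
slots = np.concatenate([address, content], axis=1)
w_q = rng.normal(size=(D, d_addr)) * 0.1
w_k = rng.normal(size=(d_addr, d_addr)) * 0.1
w_v = rng.normal(size=(D, D)) * 0.1

out, attn = address_keyed_layer(slots, address, w_q, w_k, w_v)

# Change what slot 2 currently is (its content) without changing its address.
perturbed = slots.copy()
perturbed[2, d_addr:] += 5.0
out_p, attn_p = address_keyed_layer(perturbed, address, w_q, w_k, w_v)

# How the other slots attend to slot 2 is unchanged, because keys ignore content.
others = [i for i in range(S) if i != 2]
same_routing = np.allclose(attn[others], attn_p[others])
```

In this toy setup, `same_routing` is true: perturbing the content of one slot changes that slot's own query and value, but not how the remaining slots route attention toward it. That is one way to read the abstract's separation of which object to act on from what that object currently is.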

representative citing papers

OneVLA: A Unified Framework for Embodied Tasks

cs.RO · 2026-05-31 · unverdicted · novelty 6.0

OneVLA is a unified VLA model using a shared action head and multi-stage progressive training with CoT fine-tuning that reports state-of-the-art results on both navigation and manipulation in simulation and real-world settings.

FutureNav: Unified World-Action Modeling for Vision-and-Language Navigation

cs.RO · 2026-06-29 · unverdicted · novelty 5.0

FutureNav proposes a 4B-scale VLM that jointly optimizes action prediction, inverse/forward dynamics, and future state generation for VLN and reports SOTA results on multiple benchmarks.

SANTS: A State-Adaptive Scheduler for World Action Models

cs.RO · 2026-05-27 · unverdicted · novelty 5.0

SANTS adaptively chooses denoising depth in video-based robot action diffusion policies using a state-dependent stopping hazard and noise ratio, trained via downstream action reward to reduce latency.

Bridge-WA: Predicting Where and How the World Changes for Robotic Action

cs.RO · 2026-07-02 · unverdicted · novelty 4.0

Bridge-WA introduces a lightweight distillation-based world-action model that uses future-change priors to improve robotic task success and robustness without deployment-time dense rollouts.

citing papers explorer

OneVLA: A Unified Framework for Embodied Tasks

cs.RO · 2026-05-31 · unverdicted · none · ref 31 · internal anchor

FutureNav: Unified World-Action Modeling for Vision-and-Language Navigation

cs.RO · 2026-06-29 · unverdicted · none · ref 54 · internal anchor

SANTS: A State-Adaptive Scheduler for World Action Models

cs.RO · 2026-05-27 · unverdicted · none · ref 29 · internal anchor

Bridge-WA: Predicting Where and How the World Changes for Robotic Action

cs.RO · 2026-07-02 · unverdicted · none · ref 29 · internal anchor
