WALL-WM: Carving World Action Modeling at the Event Joints

Charles Yang; Chris Pan; Colin Ye; Elise Mon; Ellie Ma; Ethan Chen; Gody Li; Hang Su; Hao Wang; Howard Lu

arxiv: 2606.01955 · v1 · pith:COEFE7A3new · submitted 2026-06-01 · 💻 cs.RO · cs.CV


Shalfun Li, Victor Yao, Charles Yang, Truth Qu, Regis Cheng, Ryan Yu, Howard Lu, Newton Von, Vincent Chen, Yohann Tang, Maeve Zhang, Ellie Ma, Gody Li, Sage Yang, Lorien Shu, J.W. Gao, Ethan Chen, Colin Ye, Yu Sun, Elise Mon, PS Zhang, Neo Li, Lily Li, James Wang, Ping Yang, Chris Pan, Lucy Liang, Hang Su, Roy Gan, Hao Wang, Qian Wang



keywords: wall-wm, action, events, fixed-length, learning, pretraining, chunk-centric, chunks


WALL-WM is a World Action Model that shifts video-action learning from chunk-centric optimization to event-grounded Vision-Language-Action pretraining, using semantically coherent action events as the atomic unit of learning. Existing WAMs commonly initialize from multimodal or video foundation models and then optimize fixed-length action chunks conditioned directly on the current observation and instruction. Although convenient, this chunk-centric formulation creates a fundamental granularity mismatch. Language describes semantic goals and events, vision evolves through continuous scene dynamics, and actions operate at control-level timescales; forcing all three into the same fixed-length prediction window turns VLA training into short-horizon correlation fitting. WALL-WM addresses this mismatch by organizing both supervision and data around semantic events. Specifically, it pairs event-grounded VLA pretraining with a data ecosystem built from event-level captions and cluster-balanced sampling, enabling scalable learning over diverse behaviors, scenes, and task structures. From the same event-pretrained backbone, WALL-WM supports two complementary inference modes. The event mode consumes next-event descriptions and enables variable-length execution chunks, while the unified mode uses a VLM with Staircase Decoding to condition conventional fixed-length chunk inference while preserving a gradient-continuous VLA path. Together with Muon-optimizer-based large-scale pretraining infrastructure, WALL-WM provides a practical scale-up recipe for general-purpose WAMs. Experiments show that WALL-WM generalizes broadly across language, scenes, and tasks, achieving state-of-the-art performance in large-scale real-world generalization evaluation.
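The contrast the abstract draws between fixed-length chunks and event-level units, and the idea of cluster-balanced sampling, can be illustrated with a small toy sketch. This is not the authors' implementation: the function names, the `(caption, start, end)` annotation format, and the "pick a cluster uniformly, then an item within it" rule are all assumptions made here for illustration only.

```python
import random
from collections import defaultdict


def fixed_length_chunks(actions, size):
    """Chunk-centric view: cut the trajectory into windows of `size` steps."""
    return [actions[i:i + size] for i in range(0, len(actions), size)]


def event_chunks(actions, events):
    """Event-grounded view: cut the trajectory at annotated event boundaries.

    `events` is a hypothetical annotation format: (caption, start, end),
    with `end` exclusive. Chunks may therefore have different lengths.
    """
    return [(caption, actions[start:end]) for caption, start, end in events]


def cluster_balanced_sample(items, cluster_of, k, rng):
    """Assumed sampling rule: choose a cluster uniformly, then an item in it.

    Rare clusters are drawn as often as common ones, regardless of size.
    """
    by_cluster = defaultdict(list)
    for item in items:
        by_cluster[cluster_of(item)].append(item)
    clusters = sorted(by_cluster)
    return [rng.choice(by_cluster[rng.choice(clusters)]) for _ in range(k)]


# Toy trajectory of 10 control steps with made-up event captions.
actions = list(range(10))
events = [
    ("reach for cup", 0, 3),
    ("grasp cup", 3, 5),
    ("place cup on shelf", 5, 10),
]

fixed = fixed_length_chunks(actions, 4)
grounded = event_chunks(actions, events)

# Toy dataset: cluster "a" has four items, cluster "b" has one.
items = ["a1", "a2", "a3", "a4", "b1"]
samples = cluster_balanced_sample(items, lambda s: s[0], 1000, random.Random(0))
```

In this toy case the fixed-length windows cut across the "grasp cup" event, while the event chunks keep each captioned segment whole at lengths 3, 2, and 5. The balanced sampler draws the single-item cluster about as often as the four-item one.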


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MemoryWAM: Efficient World Action Modeling with Persistent Memory
cs.RO 2026-06 unverdicted novelty 4.0

MemoryWAM is a world action model with a hybrid memory design using recent frames, anchor frames, and gist tokens for efficient long-horizon robotic manipulation.

World Action Models: A Survey
cs.RO 2026-06 unverdicted novelty 3.0

A survey that clarifies boundaries and organizes World Action Models by generation requirements and predictive substrates, identifying a trend toward generating less of the future.