Unifying Object-Centric World Models and Diffusion Policy: A Hierarchical Framework for Multi-Stage Robotic Tasks

Farshad Khorrami; Prashanth Krishnamurthy; Raktim Gautam Goswami; Yann LeCun

arxiv: 2606.08775 · v1 · pith:4OQDERXGnew · submitted 2026-06-07 · 💻 cs.RO · cs.AI

Unifying Object-Centric World Models and Diffusion Policy: A Hierarchical Framework for Multi-Stage Robotic Tasks

Raktim Gautam Goswami , Prashanth Krishnamurthy , Yann LeCun , Farshad Khorrami This is my paper

Pith reviewed 2026-06-27 18:06 UTC · model grok-4.3

classification 💻 cs.RO cs.AI

keywords world modelsdiffusion policyrobotic manipulationhierarchical controlobject-centric representationsmulti-stage tasksmodel predictive control

0 comments

The pith

A hierarchical framework pairs an object-centric world model for subgoal planning with a diffusion policy for execution to handle multi-stage robotic tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces WorldDP as a two-level system for robotic manipulation. A high-level world model acts as a transition function inside model predictive control to generate feasible subgoals at runtime. Object-centric representations let the model plan with respect to individual entities rather than the full scene. These subgoals are then passed to a low-level diffusion policy that executes the motions efficiently. Tests across multiple robotics benchmarks show the combined system outperforms prior single-level approaches on sequential tasks.

Core claim

The paper claims that using an object-centric world model to optimize subgoals during runtime and feeding those subgoals to a diffusion policy for low-level control produces better results on multi-stage manipulation than either component alone or existing baselines.

What carries the argument

The hierarchical WorldDP structure, in which the high-level world model optimizes feasible subgoals via object-centric representations and the low-level diffusion policy reaches them.

If this is right

The approach enables sequential planning with respect to each decoupled object rather than the entire environment.
Coupling physically grounded planning from the world model with efficient execution from the diffusion policy improves multi-stage task success.
Object-centric decoupling aids both dynamics learning and runtime subgoal selection in manipulation settings.
The framework extends world-model MPC beyond single-stage reaching or grasping to full sequential tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same subgoal-planning pattern could be tested in non-robotic sequential control domains such as game playing or process scheduling.
Replacing the diffusion policy with another low-level controller would isolate whether the performance gain comes mainly from the world-model layer.
Running the same hierarchy on physical hardware rather than simulation would reveal whether the object-centric representations transfer without additional tuning.

Load-bearing premise

The high-level world model can optimize for feasible subgoals at runtime using object-centric representations that decouple environmental entities.

What would settle it

A direct comparison on the same multi-stage benchmarks where WorldDP shows no consistent advantage over the strongest single-stage baselines would falsify the performance claim.

Figures

Figures reproduced from arXiv: 2606.08775 by Farshad Khorrami, Prashanth Krishnamurthy, Raktim Gautam Goswami, Yann LeCun.

**Figure 1.** Figure 1: Overview of WorldDP, a hierarchical framework for multi-stage robotic manipulation. (a) At test-time, we use our object-centric world model within a particle filter to optimize latent action sequences, employing a structured, object-based loss to find optimal subgoals. (b) A low-level, goal-conditioned diffusion policy (DP) then sequentially tracks and executes these subgoals to solve the task. While exis… view at source ↗

**Figure 2.** Figure 2: Object-Centric Encoder Training with SAM2 Guidance. The DINOv2-encoded image patches zk and initialized slots s¯k are refined by a Slot Corrector into latent slots sk. A Slot Decoder then maps these slots to predicted masks mˆ k and per-slot reconstructions, which are aggregated to reconstruct zˆk. The model is trained using dual objectives: a reconstruction loss (between zk and zˆk) and a mask segmentatio… view at source ↗

**Figure 3.** Figure 3: WorldDP’s dynamics model. Given the object-centric states sk1 and an action sequence at timestep k1, we next define our dynamics model fθ. 1 We adopt the Conditional Diffusion Transformer (CDiT) architecture used in Goswami et al. (2025). However, unlike Goswami et al. (2025), which operates on patch-level states of shape (P, d), our dynamics model takes states sk ∈ R N×d as input to explicitly process th… view at source ↗

**Figure 4.** Figure 4: Example Start and Goal images from each of the robotic tasks are shown. 4 Experiments Our experiments address the following key questions: (a) How does WorldDP perform in multi-stage robotics tasks compared to existing methods? (b) How does unifying the world model’s planning with a diffusion policy in WorldDP impact performance? (c) What is the impact of the object-centric representation? (d) Which framew… view at source ↗

**Figure 5.** Figure 5: Ablation Studies: Removing either the low-level Diffusion Policy (w/o DP) or the ObjectCentric Encoder (w/o OCE) degrades success rates compared to our full WorldDP framework [PITH_FULL_IMAGE:figures/full_fig_p011_5.png] view at source ↗

**Figure 7.** Figure 7: Task Execution Example. Given the initial and target states, our framework decomposes the task into sequential planning phases. Step 1 optimizes subgoals (1A, 1B) for the buttons which are executed by the Diffusion Policy (DP) (represented by orange arrows). Step 2 takes the new obvervation as input and optimizes subgoals (2A, 2B) for the drawer and window, subsequently executed by the DP [PITH_FULL_IMAGE… view at source ↗

**Figure 6.** Figure 6: Ablations on use of Particle Filter (PF) and Contact Predictor (CP). Which components are crucial for WorldDP’s success? While the hierarchical unification of world-model planning with DP’s low-level execution and object-centric representations forms the backbone of WorldDP, we also evaluate the individual contributions of the PF optimizer and the Contact Predictor (CP) by testing two variants: “w/o PF” … view at source ↗

**Figure 8.** Figure 8: Object-Centric Encoding Visualization. Left: image; Right: pred. masks from OCE emb [PITH_FULL_IMAGE:figures/full_fig_p012_8.png] view at source ↗

**Figure 9.** Figure 9: Open-Loop Trajectory Rollouts. Given the initial state and an action sequence, we show predicted future states over a 4-second horizon. Latent states are decoded into images for visualization. Predicted frames are subsampled in the figure due to space constraints. OCE predictions. We provide qualitative examples of the predicted slot masks mˆ k in [PITH_FULL_IMAGE:figures/full_fig_p012_9.png] view at source ↗

**Figure 10.** Figure 10: Task Execution Examples. Given the initial and target states, our framework decomposes the task into sequential planning phases, with each finding a set of subgoals, which are reached using the diffusion policy. (a) (b) (c) [PITH_FULL_IMAGE:figures/full_fig_p019_10.png] view at source ↗

**Figure 11.** Figure 11: Object-Centric Encoding Visualization for Cube-Single. Left: image; Right: predicted masks from OCE embeddings (a) (b) (c) [PITH_FULL_IMAGE:figures/full_fig_p019_11.png] view at source ↗

**Figure 12.** Figure 12: Object-Centric Encoding Visualization for Cube-Triple. Left: image; Right: predicted masks from OCE embeddings 19 [PITH_FULL_IMAGE:figures/full_fig_p019_12.png] view at source ↗

**Figure 13.** Figure 13: Object-Centric Encoding Visualization for Scene-Single. Left: image; Right: predicted masks from OCE embeddings (a) Cube-Single (b) Cube-Triple (c) Scene-Single [PITH_FULL_IMAGE:figures/full_fig_p020_13.png] view at source ↗

**Figure 14.** Figure 14: Open-Loop Trajectory Rollouts. Given the initial state and an action sequence, we show future states over a 4-second horizon. Latent states are decoded into images for visualization. Predicted frames are subsampled in the figure due to space constraints. 20 [PITH_FULL_IMAGE:figures/full_fig_p020_14.png] view at source ↗

read the original abstract

Visual world models have shown great potential in learning complex system dynamics. Recent advancements leverage these models as transition functions within Model Predictive Control (MPC) frameworks to solve various control tasks. When applied to robotics, however, they are limited to single-stage tasks such as reaching or grasping, and struggle with multi-stage ones that demand complex sequential planning. In this work, we introduce WorldDP, a world model framework designed for multi-stage robotic manipulation. Our hierarchical approach utilizes a high-level world model as a transition function to optimize for feasible subgoals during runtime, which are subsequently reached by a low-level Diffusion Policy. To further aid in learning dynamics and planning, we incorporate object-centric representations that decouple environmental entities and enable us to plan sequentially with respect to each. Evaluated across several robotics benchmarks, WorldDP consistently outperforms existing baselines, validating that coupling the world model's physically grounded planning with diffusion policy's efficient execution yields superior multi-stage performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

WorldDP layers an object-centric world model for subgoal planning over a diffusion policy for execution, but the abstract gives no numbers or ablations and the decoupling may not handle physical interactions.

read the letter

The paper's main move is to name WorldDP as a hierarchical setup: a high-level object-centric world model acts as the transition function inside MPC-style planning to pick feasible subgoals at runtime, which a low-level diffusion policy then executes. This targets the known limit that pure world-model controllers have mostly handled single-stage tasks like reaching or grasping.

What the work does cleanly is identify that gap and sketch a practical split that reuses two existing pieces—object-centric representations for modularity and diffusion policies for good execution. Planning sequentially per object is a reasonable way to keep the high-level search tractable.

The soft spots are straightforward. The abstract states that WorldDP “consistently outperforms existing baselines” across robotics benchmarks, yet supplies no metrics, error bars, task list, or ablation results. Without those, the central performance claim cannot be checked. The stress-test concern also lands: object-centric decoupling treats entities as independent enough to plan one at a time, but many multi-stage tasks involve tight physical coupling (stacking, insertion, assembly). The description gives no interaction terms, joint state, or constraint mechanism to keep subgoals feasible under those couplings, so the handoff to the diffusion policy could break.

This is for people already working on world models or diffusion policies in manipulation who want a named hierarchical template. A reader looking for a fully worked-out solution with strong evidence will find the current version thin. It is worth sending to peer review so the experiments can be examined and the interaction issue can be tested directly.

Referee Report

2 major / 0 minor

Summary. The paper introduces WorldDP, a hierarchical framework for multi-stage robotic manipulation. A high-level world model with object-centric representations serves as a transition function to optimize feasible subgoals at runtime; these subgoals are then executed by a low-level Diffusion Policy. The central claim is that this coupling of physically grounded planning and efficient execution yields consistent outperformance over baselines on robotics benchmarks.

Significance. If the empirical claims hold, the work could advance hierarchical model-based control in robotics by extending world models beyond single-stage tasks through object-centric sequential planning. The proposed unification addresses a recognized limitation in applying MPC-style world models to complex manipulation.

major comments (2)

[Abstract] Abstract: the claim that WorldDP 'consistently outperforms existing baselines' supplies no metrics, baselines, error bars, dataset details, or ablation results, rendering the central empirical claim impossible to assess.
[Abstract] Abstract: the assertion that object-centric representations 'decouple environmental entities and enable us to plan sequentially with respect to each' provides no mechanism (interaction terms, joint latent state, or constraint layer) to preserve subgoal feasibility under physical couplings between objects, which is load-bearing for multi-stage tasks such as assembly or stacking.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and will make revisions to strengthen the abstract and clarify key aspects of the framework.

read point-by-point responses

Referee: [Abstract] Abstract: the claim that WorldDP 'consistently outperforms existing baselines' supplies no metrics, baselines, error bars, dataset details, or ablation results, rendering the central empirical claim impossible to assess.

Authors: We agree that the abstract, as a high-level summary, omits specific quantitative details. The full manuscript reports these results in the experiments section, including success rates, baselines (e.g., standard diffusion policies and MPC variants), error bars across multiple seeds, and ablations on the hierarchical components. We will revise the abstract to incorporate concise key metrics and dataset references to make the empirical claims more self-contained. revision: yes
Referee: [Abstract] Abstract: the assertion that object-centric representations 'decouple environmental entities and enable us to plan sequentially with respect to each' provides no mechanism (interaction terms, joint latent state, or constraint layer) to preserve subgoal feasibility under physical couplings between objects, which is load-bearing for multi-stage tasks such as assembly or stacking.

Authors: The abstract statement summarizes the object-centric design, but the full paper details the mechanism in the high-level world model (Section 3), where object-centric latents are processed with explicit interaction terms in the transition function to model physical couplings and ensure feasible subgoals. We will add a short clarifying phrase to the abstract (and expand the introduction) to reference this interaction modeling, addressing the concern about physical feasibility in coupled tasks. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical hierarchical framework without derivational reductions

full rationale

The paper describes WorldDP as a hierarchical system using a high-level world model for runtime subgoal optimization via object-centric representations, followed by low-level diffusion policy execution. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. All claims rest on empirical benchmark evaluations rather than any self-referential construction, making the approach self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities can be extracted.

pith-pipeline@v0.9.1-grok · 5710 in / 972 out tokens · 19352 ms · 2026-06-27T18:06:16.642205+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

35 extracted references · 15 linked inside Pith

[1]

V-jepa 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985,

Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985,

Pith/arXiv arXiv
[2]

Revisiting feature prediction for learning visual representations from video.arXiv preprint arXiv:2404.08471,

Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual representations from video.arXiv preprint arXiv:2404.08471,

Pith/arXiv arXiv
[3]

Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734,

Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734,

Pith/arXiv arXiv
[4]

Worldvla: Towards autoregressive action world model.arXiv preprint arXiv:2506.21539,

Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, et al. Worldvla: Towards autoregressive action world model.arXiv preprint arXiv:2506.21539,

Pith/arXiv arXiv
[5]

Simple hierarchical planning with diffusion.arXiv preprint arXiv:2401.02644,

Chang Chen, Fei Deng, Kenji Kawaguchi, Caglar Gulcehre, and Sungjin Ahn. Simple hierarchical planning with diffusion.arXiv preprint arXiv:2401.02644,

arXiv
[6]

Chain-of-goals hierarchical policy for long-horizon offline goal-conditioned rl.arXiv preprint arXiv:2602.03389,

Jinwoo Choi, Sang-Hyun Lee, and Seung-Woo Seo. Chain-of-goals hierarchical policy for long-horizon offline goal-conditioned rl.arXiv preprint arXiv:2602.03389,

Pith/arXiv arXiv
[7]

Slot structured world models

Jonathan Collu, Riccardo Majellaro, Aske Plaat, and Thomas M Moerland. Slot structured world models. arXiv preprint arXiv:2402.03326,

arXiv
[8]

Unsupervised image representation learning with deep latent particles.arXiv preprint arXiv:2205.15821,

13 Tal Daniel and Aviv Tamar. Unsupervised image representation learning with deep latent particles.arXiv preprint arXiv:2205.15821,

arXiv
[9]

Latent particle world models: Self-supervised object-centric stochastic dynamics modeling.arXiv preprint arXiv:2603.04553,

Tal Daniel, Carl Qi, Dan Haramati, Amir Zadeh, Chuan Li, Aviv Tamar, Deepak Pathak, and David Held. Latent particle world models: Self-supervised object-centric stochastic dynamics modeling.arXiv preprint arXiv:2603.04553,

arXiv
[10]

Dynamics learning with cascaded variational inference for multi-step manipulation.arXiv preprint arXiv:1910.13395,

Kuan Fang, Yuke Zhu, Animesh Garg, Silvio Savarese, and Li Fei-Fei. Dynamics learning with cascaded variational inference for multi-step manipulation.arXiv preprint arXiv:1910.13395,

arXiv 1910
[11]

World models can leverage human videos for dexterous manipulation.arXiv preprint arXiv:2512.13644,

Raktim Gautam Goswami, Amir Bar, David Fan, Tsung-Yen Yang, Gaoyue Zhou, Prashanth Krishnamurthy, Michael Rabbat, Farshad Khorrami, and Yann LeCun. World models can leverage human videos for dexterous manipulation.arXiv preprint arXiv:2512.13644,

arXiv
[12]

World models.arXiv preprint arXiv:1803.10122, 2(3):440,

David Ha and Jürgen Schmidhuber. World models.arXiv preprint arXiv:1803.10122, 2(3):440,

Pith/arXiv arXiv
[13]

Td-mpc2: Scalable, robust world models for continuous control

Nick Hansen, Hao Su, and Xiaolong Wang. Td-mpc2: Scalable, robust world models for continuous control. InInternational Conference on Learning Representations, volume 2024, pp. 47376–47405,

2024
[14]

Hierarchical entity- centric reinforcement learning with factored subgoal diffusion.arXiv preprint arXiv:2602.02722,

Dan Haramati, Carl Qi, Tal Daniel, Amy Zhang, Aviv Tamar, and George Konidaris. Hierarchical entity- centric reinforcement learning with factored subgoal diffusion.arXiv preprint arXiv:2602.02722,

arXiv
[15]

World model for robot learning: A comprehensive survey.arXiv preprint arXiv:2605.00080,

Bohan Hou, Gen Li, Jindou Jia, Tuo An, Xinying Guo, Sicong Leng, Haoran Geng, Yanjie Ze, Tatsuya Harada, Philip Torr, et al. World model for robot learning: A comprehensive survey.arXiv preprint arXiv:2605.00080,

Pith/arXiv arXiv
[16]

Openvla: An open-source vision-language- action model.arXiv preprint arXiv:2406.09246,

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language- action model.arXiv preprint arXiv:2406.09246,

Pith/arXiv arXiv
[17]

A path towards autonomous machine intelligence version 0.9

Yann LeCun. A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27.Open Review, 62 (1):1–62,

2022
[18]

Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101,

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101,

Pith/arXiv arXiv
[19]

stable-worldmodel-v1: Reproducible world modeling research and evaluation.arXiv preprint arXiv:2602.08968, 2026a

Lucas Maes, Quentin Le Lidec, Dan Haramati, Nassim Massaudi, Damien Scieur, Yann LeCun, and Randall Balestriero. stable-worldmodel-v1: Reproducible world modeling research and evaluation.arXiv preprint arXiv:2602.08968, 2026a. Lucas Maes, Quentin Le Lidec, Damien Scieur, Yann LeCun, and Randall Balestriero. Leworldmodel: Stable end-to-end joint-embedding ...

arXiv
[20]

Sold: Slot object-centric latent dynamics models for relational manipulation learning from pixels.arXiv preprint arXiv:2410.08822,

Malte Mosbach, Jan Niklas Ewertz, Angel Villar-Corrales, and Sven Behnke. Sold: Slot object-centric latent dynamics models for relational manipulation learning from pixels.arXiv preprint arXiv:2410.08822,

arXiv
[21]

Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193,

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193,

Pith/arXiv arXiv
[22]

Ogbench: Benchmarking offline goal- conditioned rl

Seohong Park, Kevin Frans, Benjamin Eysenbach, and Sergey Levine. Ogbench: Benchmarking offline goal- conditioned rl. InInternational Conference on Learning Representations, volume 2025, pp. 94937–94982,

2025
[23]

pi0.5: a vision-language-action model with open-world generalization.arXiv preprint arXiv:2504.16054,

Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. pi0.5: a vision-language-action model with open-world generalization.arXiv preprint arXiv:2504.16054,

Pith/arXiv arXiv
[24]

Parallel stochastic gradient-based planning for world models.arXiv preprint arXiv:2602.00475,

Michael Psenka, Michael Rabbat, Aditi Krishnapriyan, Yann LeCun, and Amir Bar. Parallel stochastic gradient-based planning for world models.arXiv preprint arXiv:2602.00475,

arXiv
[25]

Ec-diffuser: Multi-object manipulation via entity-centric behavior generation

Carl Qi, Dan Haramati, Tal Daniel, Aviv Tamar, and Amy Zhang. Ec-diffuser: Multi-object manipulation via entity-centric behavior generation. InInternational Conference on Learning Representations, volume 2025, pp. 74835–74858,

2025
[26]

Sam 2: Segment anything in images and videos

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. In International Conference on Learning Representations, volume 2025, pp. 28085–28128,

2025
[27]

What drives success in physical planning with joint-embedding predictive world models?arXiv preprint arXiv:2512.24497,

Basile Terver, Tsung-Yen Yang, Jean Ponce, Adrien Bardes, and Yann LeCun. What drives success in physical planning with joint-embedding predictive world models?arXiv preprint arXiv:2512.24497,

Pith/arXiv arXiv
[28]

Slotformer: Unsupervised visual dynamics simulation with object-centric models.arXiv preprint arXiv:2210.05861,

Ziyi Wu, Nikita Dvornik, Klaus Greff, Thomas Kipf, and Animesh Garg. Slotformer: Unsupervised visual dynamics simulation with object-centric models.arXiv preprint arXiv:2210.05861,

arXiv
[29]

World action models are zero-shot policies.arXiv preprint arXiv:2602.15922,

Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, et al. World action models are zero-shot policies.arXiv preprint arXiv:2602.15922,

Pith/arXiv arXiv
[30]

Hierarchical planning with latent world models

15 Wancong Zhang, Basile Terver, Artem Zholus, Soham Chitnis, Harsh Sutaria, Mido Assran, Randall Balestriero, Amir Bar, Adrien Bardes, Yann LeCun, et al. Hierarchical planning with latent world models. arXiv preprint arXiv:2604.03208,

Pith/arXiv arXiv
[31]

Dino-wm: World models on pre-trained visual features enable zero-shot planning.arXiv preprint arXiv:2411.04983,

Gaoyue Zhou, Hengkai Pan, Yann LeCun, and Lerrel Pinto. Dino-wm: World models on pre-trained visual features enable zero-shot planning.arXiv preprint arXiv:2411.04983,

Pith/arXiv arXiv
[32]

Table 4: Hyperparameter configurations for the Object-Centric Encoder

with an initial learning rate of10−3, which is decayed to10−6via a one-cycle learning rate scheduler with cosine annealing (Smith & Topin, 2019). Table 4: Hyperparameter configurations for the Object-Centric Encoder. Slot Corrector Slot Decoder Dataset Num Slots Slot Dim. Num. iter. Hidden Dim. Layers Hidden Dim. Cube-Single 3 64 3 128 3 384 Cube-Triple 5...

2019
[33]

A.3 Diffusion Policy

with an initial learning rate of10−4, which is decayed to10−7via a one-cycle learning rate scheduler with cosine annealing (Smith & Topin, 2019). A.3 Diffusion Policy. Apart from the input dimension, which scales with the slot dimension and the number of slots, all other parameters of the diffusion policy transformer remain constant across datasets. Speci...

2019
[34]

A.4 Contact Predictor The contact predictor consists of two linear layers separated by a SiLU activation function (Elfwing et al., 2018)

with an initial learning rate of10−4and a weight decay of10−6. A.4 Contact Predictor The contact predictor consists of two linear layers separated by a SiLU activation function (Elfwing et al., 2018). It takes the mean of the DINOv2 patch-level representations as input and outputs a vector corre- sponding to the number of manipulable objects in the enviro...

2018
[35]

B Tversky Loss As described in Section 3.1, to supervise predicted masks during object-centric encoder training, we employ the Tversky loss function (Salehi et al., 2017)

with an initial learning rate of10−3, which decays to10−6via a one-cycle learning rate scheduler with cosine annealing (Smith & Topin, 2019). B Tversky Loss As described in Section 3.1, to supervise predicted masks during object-centric encoder training, we employ the Tversky loss function (Salehi et al., 2017). Given predicted (ˆmk) and ground-truth (mk)...

2019

[1] [1]

V-jepa 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985,

Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985,

Pith/arXiv arXiv

[2] [2]

Revisiting feature prediction for learning visual representations from video.arXiv preprint arXiv:2404.08471,

Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual representations from video.arXiv preprint arXiv:2404.08471,

Pith/arXiv arXiv

[3] [3]

Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734,

Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734,

Pith/arXiv arXiv

[4] [4]

Worldvla: Towards autoregressive action world model.arXiv preprint arXiv:2506.21539,

Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, et al. Worldvla: Towards autoregressive action world model.arXiv preprint arXiv:2506.21539,

Pith/arXiv arXiv

[5] [5]

Simple hierarchical planning with diffusion.arXiv preprint arXiv:2401.02644,

Chang Chen, Fei Deng, Kenji Kawaguchi, Caglar Gulcehre, and Sungjin Ahn. Simple hierarchical planning with diffusion.arXiv preprint arXiv:2401.02644,

arXiv

[6] [6]

Chain-of-goals hierarchical policy for long-horizon offline goal-conditioned rl.arXiv preprint arXiv:2602.03389,

Jinwoo Choi, Sang-Hyun Lee, and Seung-Woo Seo. Chain-of-goals hierarchical policy for long-horizon offline goal-conditioned rl.arXiv preprint arXiv:2602.03389,

Pith/arXiv arXiv

[7] [7]

Slot structured world models

Jonathan Collu, Riccardo Majellaro, Aske Plaat, and Thomas M Moerland. Slot structured world models. arXiv preprint arXiv:2402.03326,

arXiv

[8] [8]

Unsupervised image representation learning with deep latent particles.arXiv preprint arXiv:2205.15821,

13 Tal Daniel and Aviv Tamar. Unsupervised image representation learning with deep latent particles.arXiv preprint arXiv:2205.15821,

arXiv

[9] [9]

Latent particle world models: Self-supervised object-centric stochastic dynamics modeling.arXiv preprint arXiv:2603.04553,

Tal Daniel, Carl Qi, Dan Haramati, Amir Zadeh, Chuan Li, Aviv Tamar, Deepak Pathak, and David Held. Latent particle world models: Self-supervised object-centric stochastic dynamics modeling.arXiv preprint arXiv:2603.04553,

arXiv

[10] [10]

Dynamics learning with cascaded variational inference for multi-step manipulation.arXiv preprint arXiv:1910.13395,

Kuan Fang, Yuke Zhu, Animesh Garg, Silvio Savarese, and Li Fei-Fei. Dynamics learning with cascaded variational inference for multi-step manipulation.arXiv preprint arXiv:1910.13395,

arXiv 1910

[11] [11]

World models can leverage human videos for dexterous manipulation.arXiv preprint arXiv:2512.13644,

Raktim Gautam Goswami, Amir Bar, David Fan, Tsung-Yen Yang, Gaoyue Zhou, Prashanth Krishnamurthy, Michael Rabbat, Farshad Khorrami, and Yann LeCun. World models can leverage human videos for dexterous manipulation.arXiv preprint arXiv:2512.13644,

arXiv

[12] [12]

World models.arXiv preprint arXiv:1803.10122, 2(3):440,

David Ha and Jürgen Schmidhuber. World models.arXiv preprint arXiv:1803.10122, 2(3):440,

Pith/arXiv arXiv

[13] [13]

Td-mpc2: Scalable, robust world models for continuous control

Nick Hansen, Hao Su, and Xiaolong Wang. Td-mpc2: Scalable, robust world models for continuous control. InInternational Conference on Learning Representations, volume 2024, pp. 47376–47405,

2024

[14] [14]

Hierarchical entity- centric reinforcement learning with factored subgoal diffusion.arXiv preprint arXiv:2602.02722,

Dan Haramati, Carl Qi, Tal Daniel, Amy Zhang, Aviv Tamar, and George Konidaris. Hierarchical entity- centric reinforcement learning with factored subgoal diffusion.arXiv preprint arXiv:2602.02722,

arXiv

[15] [15]

World model for robot learning: A comprehensive survey.arXiv preprint arXiv:2605.00080,

Bohan Hou, Gen Li, Jindou Jia, Tuo An, Xinying Guo, Sicong Leng, Haoran Geng, Yanjie Ze, Tatsuya Harada, Philip Torr, et al. World model for robot learning: A comprehensive survey.arXiv preprint arXiv:2605.00080,

Pith/arXiv arXiv

[16] [16]

Openvla: An open-source vision-language- action model.arXiv preprint arXiv:2406.09246,

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language- action model.arXiv preprint arXiv:2406.09246,

Pith/arXiv arXiv

[17] [17]

A path towards autonomous machine intelligence version 0.9

Yann LeCun. A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27.Open Review, 62 (1):1–62,

2022

[18] [18]

Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101,

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101,

Pith/arXiv arXiv

[19] [19]

stable-worldmodel-v1: Reproducible world modeling research and evaluation.arXiv preprint arXiv:2602.08968, 2026a

Lucas Maes, Quentin Le Lidec, Dan Haramati, Nassim Massaudi, Damien Scieur, Yann LeCun, and Randall Balestriero. stable-worldmodel-v1: Reproducible world modeling research and evaluation.arXiv preprint arXiv:2602.08968, 2026a. Lucas Maes, Quentin Le Lidec, Damien Scieur, Yann LeCun, and Randall Balestriero. Leworldmodel: Stable end-to-end joint-embedding ...

arXiv

[20] [20]

Sold: Slot object-centric latent dynamics models for relational manipulation learning from pixels.arXiv preprint arXiv:2410.08822,

Malte Mosbach, Jan Niklas Ewertz, Angel Villar-Corrales, and Sven Behnke. Sold: Slot object-centric latent dynamics models for relational manipulation learning from pixels.arXiv preprint arXiv:2410.08822,

arXiv

[21] [21]

Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193,

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193,

Pith/arXiv arXiv

[22] [22]

Ogbench: Benchmarking offline goal- conditioned rl

Seohong Park, Kevin Frans, Benjamin Eysenbach, and Sergey Levine. Ogbench: Benchmarking offline goal- conditioned rl. InInternational Conference on Learning Representations, volume 2025, pp. 94937–94982,

2025

[23] [23]

pi0.5: a vision-language-action model with open-world generalization.arXiv preprint arXiv:2504.16054,

Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. pi0.5: a vision-language-action model with open-world generalization.arXiv preprint arXiv:2504.16054,

Pith/arXiv arXiv

[24] [24]

Parallel stochastic gradient-based planning for world models.arXiv preprint arXiv:2602.00475,

Michael Psenka, Michael Rabbat, Aditi Krishnapriyan, Yann LeCun, and Amir Bar. Parallel stochastic gradient-based planning for world models.arXiv preprint arXiv:2602.00475,

arXiv

[25] [25]

Ec-diffuser: Multi-object manipulation via entity-centric behavior generation

Carl Qi, Dan Haramati, Tal Daniel, Aviv Tamar, and Amy Zhang. Ec-diffuser: Multi-object manipulation via entity-centric behavior generation. InInternational Conference on Learning Representations, volume 2025, pp. 74835–74858,

2025

[26] [26]

Sam 2: Segment anything in images and videos

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. In International Conference on Learning Representations, volume 2025, pp. 28085–28128,

2025

[27] [27]

What drives success in physical planning with joint-embedding predictive world models?arXiv preprint arXiv:2512.24497,

Basile Terver, Tsung-Yen Yang, Jean Ponce, Adrien Bardes, and Yann LeCun. What drives success in physical planning with joint-embedding predictive world models?arXiv preprint arXiv:2512.24497,

Pith/arXiv arXiv

[28] [28]

Slotformer: Unsupervised visual dynamics simulation with object-centric models.arXiv preprint arXiv:2210.05861,

Ziyi Wu, Nikita Dvornik, Klaus Greff, Thomas Kipf, and Animesh Garg. Slotformer: Unsupervised visual dynamics simulation with object-centric models.arXiv preprint arXiv:2210.05861,

arXiv

[29] [29]

World action models are zero-shot policies.arXiv preprint arXiv:2602.15922,

Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, et al. World action models are zero-shot policies.arXiv preprint arXiv:2602.15922,

Pith/arXiv arXiv

[30] [30]

Hierarchical planning with latent world models

15 Wancong Zhang, Basile Terver, Artem Zholus, Soham Chitnis, Harsh Sutaria, Mido Assran, Randall Balestriero, Amir Bar, Adrien Bardes, Yann LeCun, et al. Hierarchical planning with latent world models. arXiv preprint arXiv:2604.03208,

Pith/arXiv arXiv

[31] [31]

Dino-wm: World models on pre-trained visual features enable zero-shot planning.arXiv preprint arXiv:2411.04983,

Gaoyue Zhou, Hengkai Pan, Yann LeCun, and Lerrel Pinto. Dino-wm: World models on pre-trained visual features enable zero-shot planning.arXiv preprint arXiv:2411.04983,

Pith/arXiv arXiv

[32] [32]

Table 4: Hyperparameter configurations for the Object-Centric Encoder

with an initial learning rate of10−3, which is decayed to10−6via a one-cycle learning rate scheduler with cosine annealing (Smith & Topin, 2019). Table 4: Hyperparameter configurations for the Object-Centric Encoder. Slot Corrector Slot Decoder Dataset Num Slots Slot Dim. Num. iter. Hidden Dim. Layers Hidden Dim. Cube-Single 3 64 3 128 3 384 Cube-Triple 5...

2019

[33] [33]

A.3 Diffusion Policy

with an initial learning rate of10−4, which is decayed to10−7via a one-cycle learning rate scheduler with cosine annealing (Smith & Topin, 2019). A.3 Diffusion Policy. Apart from the input dimension, which scales with the slot dimension and the number of slots, all other parameters of the diffusion policy transformer remain constant across datasets. Speci...

2019

[34] [34]

A.4 Contact Predictor The contact predictor consists of two linear layers separated by a SiLU activation function (Elfwing et al., 2018)

with an initial learning rate of10−4and a weight decay of10−6. A.4 Contact Predictor The contact predictor consists of two linear layers separated by a SiLU activation function (Elfwing et al., 2018). It takes the mean of the DINOv2 patch-level representations as input and outputs a vector corre- sponding to the number of manipulable objects in the enviro...

2018

[35] [35]

B Tversky Loss As described in Section 3.1, to supervise predicted masks during object-centric encoder training, we employ the Tversky loss function (Salehi et al., 2017)

with an initial learning rate of10−3, which decays to10−6via a one-cycle learning rate scheduler with cosine annealing (Smith & Topin, 2019). B Tversky Loss As described in Section 3.1, to supervise predicted masks during object-centric encoder training, we employ the Tversky loss function (Salehi et al., 2017). Given predicted (ˆmk) and ground-truth (mk)...

2019