arxiv: 2604.03208 · v1 · submitted 2026-04-03 · 💻 cs.LG

Recognition: 2 theorem links


Hierarchical Planning with Latent World Models

Wancong Zhang, Basile Terver, Artem Zholus, Soham Chitnis, Harsh Sutaria, Mido Assran, Randall Balestriero, Amir Bar, Adrien Bardes, Yann LeCun, Nicolas Ballas


Pith reviewed 2026-05-13 20:10 UTC · model grok-4.3


keywords hierarchical planning · latent world models · model predictive control · zero-shot control · long-horizon planning · robotic manipulation · multi-scale dynamics


The pith

Learning latent world models at multiple temporal scales and planning hierarchically across them enables reliable long-horizon control with far less online computation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Model predictive control using a single learned world model accumulates prediction errors over long horizons and faces an exponentially expanding action search space. The paper trains separate latent world models for short, medium, and long time horizons and performs planning by choosing coarse actions at the longest scale before refining them at finer scales. This hierarchical structure lets an agent reach a distant goal from only a final specification. On a real robot the method achieves 70 percent success on non-greedy pick-and-place tasks where a flat world-model planner scores zero percent. The same approach raises success rates and cuts planning-time compute by up to four times in simulated pushing and maze-navigation environments.

Core claim

Training latent world models at multiple temporal scales and executing hierarchical planning across those scales lets agents solve long-horizon embodied control problems more reliably and with substantially lower inference-time cost than flat planning. The hierarchical planner reaches 70 percent success on real-robot pick-and-place using only a final goal image, while a single-level model reaches zero percent. Across physics-based simulations the method improves success on push manipulation and maze navigation while requiring up to four times less planning compute. The abstraction works as a modular layer on top of diverse latent world-model architectures.

What carries the argument

A hierarchy of latent world models, each trained to predict dynamics at a distinct temporal scale, with planning that optimizes coarse actions at long scales before refining them at shorter scales.
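The coarse-to-fine idea can be sketched on a toy problem. The block below is a minimal illustration, not the paper's method: latent states are 2-D grid cells rather than learned embeddings, both "world models" are hand-written functions, the planner is exhaustive search rather than a sampling-based optimizer, and every name (`fine_model`, `coarse_model`, `K`, `hierarchical_plan`) is hypothetical. The energy is an L1 distance to the target, chosen to echo the goal-distance energy quoted from the paper elsewhere on this page.

```python
from itertools import product

# Hypothetical toy setup: latent states are 2-D grid cells, not learned embeddings.
MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1), (0, 0)]  # right, left, up, down, stay
K = 3  # primitive steps summarised by one macro-action (assumed)


def fine_model(z, a):
    """Short-scale model: one primitive action shifts the state by one cell."""
    return (z[0] + a[0], z[1] + a[1])


def coarse_model(z, m):
    """Long-scale model: one macro-action predicts the state K steps ahead."""
    return (z[0] + K * m[0], z[1] + K * m[1])


def energy(z, target):
    """L1 distance between a predicted state and the target state."""
    return abs(target[0] - z[0]) + abs(target[1] - z[1])


def rollout(model, z, seq):
    """Apply a sequence of actions through the given model."""
    for u in seq:
        z = model(z, u)
    return z


def plan(model, z, target, horizon):
    """Exhaustive search for the action sequence whose rollout ends closest to target."""
    return min(
        product(MOVES, repeat=horizon),
        key=lambda seq: energy(rollout(model, z, seq), target),
    )


def hierarchical_plan(z, goal, macro_horizon):
    """Choose macro-actions with the coarse model, then refine each into K primitives."""
    actions = []
    for m in plan(coarse_model, z, goal, macro_horizon):
        subgoal = coarse_model(z, m)          # state the coarse model expects next
        segment = plan(fine_model, z, subgoal, K)
        actions.extend(segment)
        z = rollout(fine_model, z, segment)   # continue from the refined prediction
    return actions, z


actions, final_state = hierarchical_plan((0, 0), (6, 3), macro_horizon=3)
print(final_state, len(actions))  # (6, 3) 9
```

The fine-scale planner only ever searches `K` steps toward a subgoal handed down from the coarse level, which is the structural point of the hierarchy. A receding-horizon controller would replan after executing part of the plan; this sketch plans once to stay short.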

If this is right

Zero-shot control on real non-greedy robotic tasks becomes feasible using only a final goal specification.
Planning-time compute drops by a factor of up to four in simulation, while success rates increase in both real and simulated domains.
The method functions as a modular planning layer compatible with many existing latent world-model architectures.
Long-horizon reasoning is possible without the exponential growth in search space that limits flat model-predictive control.
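The last point can be made concrete with a counting argument. The sketch below assumes an exhaustive planner, a discrete action set, and a macro-action set of the same size as the primitive set. None of these come from the paper, whose "up to four times" figure is a measured compute saving rather than a candidate count. The function names are hypothetical.

```python
def flat_candidates(branching: int, horizon: int) -> int:
    """Action sequences an exhaustive flat planner would have to score."""
    return branching ** horizon


def hierarchical_candidates(branching: int, horizon: int, steps_per_macro: int) -> int:
    """Sequences scored when macro-actions are planned first, then refined one by one."""
    if horizon % steps_per_macro:
        raise ValueError("horizon must be a multiple of steps_per_macro")
    n_macros = horizon // steps_per_macro
    high_level = branching ** n_macros                    # one search over macro-actions
    low_level = n_macros * branching ** steps_per_macro   # one short search per macro
    return high_level + low_level


for horizon in (6, 9, 12):
    print(horizon, flat_candidates(5, horizon), hierarchical_candidates(5, horizon, 3))
# 6 15625 275
# 9 1953125 500
# 12 244140625 1125
```

With five actions and three primitive steps per macro-action, the flat count grows as 5^T while the hierarchical count grows as 5^(T/3) plus a linear term. The gap widens quickly with horizon in this idealized setting.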

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same multi-scale hierarchy could be applied to other sequential decision domains such as video-game planning or long-term scheduling.
If the coarsest-scale model remains accurate, the approach may scale to horizons orders of magnitude longer than those tested.
Lower planning cost could make model-based control practical on embedded hardware with limited onboard compute.

Load-bearing premise

The multi-scale models must predict future states accurately enough that planning across scales reduces rather than compounds long-horizon prediction error.
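A back-of-the-envelope model shows why this premise carries the weight. The sketch assumes a purely multiplicative error: every model call overshoots by a fixed relative factor, and errors compound across chained calls. That error model, the horizon, and the error rates are all illustrative assumptions, not measurements from the paper.

```python
def relative_error(per_call_error: float, n_calls: int) -> float:
    """Relative error after chaining n_calls predictions, each off by the same factor."""
    return (1.0 + per_call_error) ** n_calls - 1.0


HORIZON = 48          # primitive steps to the goal (assumed)
STEPS_PER_MACRO = 8   # primitive steps covered by one coarse prediction (assumed)
N_MACROS = HORIZON // STEPS_PER_MACRO

fine_only = relative_error(0.02, HORIZON)      # 2% error per fine-scale call
coarse_good = relative_error(0.02, N_MACROS)   # coarse model equally accurate per call
coarse_bad = relative_error(0.40, N_MACROS)    # coarse model much less accurate per call

for name, err in [
    ("fine only", fine_only),
    ("coarse, accurate", coarse_good),
    ("coarse, inaccurate", coarse_bad),
]:
    print(f"{name:>20}: {err:.2f}")
```

Under these assumptions the coarse rollout beats the fine rollout only while its per-call error stays below about 17 percent (since 1.02^8 ≈ 1.17). Past that point the hierarchy compounds error faster than flat planning, which is the failure mode the premise rules out.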

What would settle it

A controlled long-horizon experiment in which the hierarchical planner produces lower task success or higher planning time than a well-tuned single-scale planner.

Original abstract

Model predictive control (MPC) with learned world models has emerged as a promising paradigm for embodied control, particularly for its ability to generalize zero-shot when deployed in new environments. However, learned world models often struggle with long-horizon control due to the accumulation of prediction errors and the exponentially growing search space. In this work, we address these challenges by learning latent world models at multiple temporal scales and performing hierarchical planning across these scales, enabling long-horizon reasoning while substantially reducing inference-time planning complexity. Our approach serves as a modular planning abstraction that applies across diverse latent world-model architectures and domains. We demonstrate that this hierarchical approach enables zero-shot control on real-world non-greedy robotic tasks, achieving a 70% success rate on pick-&-place using only a final goal specification, compared to 0% for a single-level world model. In addition, across physics-based simulated environments including push manipulation and maze navigation, hierarchical planning achieves higher success while requiring up to 4x less planning-time compute.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper gets real-robot gains from hierarchical planning over multi-scale latent models, but lacks the error breakdowns needed to confirm it fixes long-horizon accumulation rather than working around it.


The main point is that training latent world models at several temporal scales and planning across them lets MPC handle longer tasks with less compute at inference. The authors report 70% success on real pick-and-place using only a goal specification, against 0% for the single-scale version, plus higher success and up to 4x less planning-time compute in simulated push and maze settings. The modular abstraction claim is also practical, since it is meant to sit on top of different world-model architectures without major rewrites.

What works here is the direct comparison to flat baselines and the zero-shot real-world transfer on non-greedy behavior, which is still rare enough to be useful for embodied control work. The compute savings are straightforward to appreciate if they hold.

The soft spot is the missing analysis of prediction error. The abstract gives no per-scale accuracy numbers, no ablation on joint versus separate training, and no check on whether coarse plans actually improve or degrade fine-scale rollouts over full horizons. If the coarser models carry their own bias, the hierarchy could be masking single-level failures instead of reducing compounding error. The number of scales and their horizons also look like free parameters that may need tuning.

This is for researchers working on model-based control or hierarchical RL who already run into the long-horizon wall in MPC. A reader who wants concrete numbers on real hardware and ideas for cutting planning cost would get value from the experiments.

I would send it for peer review. The empirical results address a real bottleneck and are worth referee time, even if the methods will need tighter error analysis to strengthen the central claim.

Referee Report

3 major / 1 minor

Summary. The paper proposes learning latent world models at multiple temporal scales and performing hierarchical planning across them to enable long-horizon model predictive control while reducing inference-time compute. It claims this modular approach yields zero-shot real-world success on non-greedy robotic pick-and-place (70% vs 0% for single-level baselines) and higher success rates with up to 4x less planning compute in simulated push-manipulation and maze-navigation tasks.

Significance. If the central empirical claims hold after proper validation, the work would be significant for embodied control and RL. It offers a practical, architecture-agnostic way to scale planning in learned dynamics models without exponential search costs, directly addressing error accumulation in long-horizon MPC. The reported real-robot zero-shot results and compute savings would be impactful if reproducible.

major comments (3)

[Abstract and Section 3] The headline claim that multi-scale latent models can be composed hierarchically without compounding prediction errors (rather than masking single-level failures) is load-bearing but unsupported by direct evidence; no per-level rollout error metrics, horizon-wise accuracy comparisons, or propagation analysis from coarse to fine scales are reported.
[Section 4 (Experiments)] The 70% vs 0% real-robot success rates and simulated gains lack ablations on joint vs separate training of scales, number of trials, variance, or controls isolating hierarchy from other implementation details; without these the improvements cannot be confidently attributed to the proposed mechanism.
[Methods] The description of how coarse-scale plans constrain or refine fine-scale rollouts does not include any measurement of how approximation errors at higher temporal scales affect long-horizon accuracy at lower scales, leaving the weakest assumption untested.
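The request for trial counts and variance in the second comment can be illustrated with a confidence interval on a success rate. The sketch below computes a Wilson score interval. The trial count of 10 is a placeholder chosen for illustration only: neither the abstract nor this page states how many trials were run, so the printed intervals say nothing about the paper's actual uncertainty.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Placeholder trial count: the page does not state how many trials were run.
N_TRIALS = 10
hier_low, hier_high = wilson_interval(7, N_TRIALS)   # 70% success
flat_low, flat_high = wilson_interval(0, N_TRIALS)   # 0% success
print(f"hierarchical: [{hier_low:.3f}, {hier_high:.3f}]")
print(f"flat:         [{flat_low:.3f}, {flat_high:.3f}]")
```

With this hypothetical count the two intervals do not overlap, yet the hierarchical interval still spans roughly 40% to 89%. That spread is why the referee asks for the real number of trials before the gap is attributed to the mechanism.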

minor comments (1)

[Notation] Throughout, the precise definition of temporal scales, their horizons, and the interface between planning levels would benefit from an explicit equation or pseudocode block for reproducibility.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and describe the revisions we will incorporate to strengthen the manuscript.


Referee: [Abstract and Section 3] The headline claim that multi-scale latent models can be composed hierarchically without compounding prediction errors (rather than masking single-level failures) is load-bearing but unsupported by direct evidence; no per-level rollout error metrics, horizon-wise accuracy comparisons, or propagation analysis from coarse to fine scales are reported.

Authors: We agree that direct per-level error metrics and propagation analysis would provide stronger support. In the revised manuscript we will add these measurements in Section 3, including horizon-wise prediction accuracy at each scale and an explicit comparison of error accumulation between hierarchical and flat rollouts.
revision: yes

Referee: [Section 4 (Experiments)] The 70% vs 0% real-robot success rates and simulated gains lack ablations on joint vs separate training of scales, number of trials, variance, or controls isolating hierarchy from other implementation details; without these the improvements cannot be confidently attributed to the proposed mechanism.

Authors: We will expand Section 4 with the requested ablations: joint versus separate training of the scales, the exact number of trials performed, standard deviations on success rates, and additional controls that isolate the hierarchical planning component from other implementation choices.
revision: yes

Referee: [Methods] The description of how coarse-scale plans constrain or refine fine-scale rollouts does not include any measurement of how approximation errors at higher temporal scales affect long-horizon accuracy at lower scales, leaving the weakest assumption untested.

Authors: We will augment the Methods section with quantitative results that measure the effect of coarse-scale approximation error on fine-scale long-horizon accuracy. This will include controlled experiments that deliberately degrade coarse-scale predictions and report the resulting impact on overall task performance.
revision: yes

Circularity Check

0 steps flagged

No circularity detected; empirical claims rest on experimental comparisons


The paper's core contribution is an empirical demonstration of hierarchical planning over multi-scale latent world models, validated through success rates (70% real-robot pick-and-place vs 0% single-level) and compute reductions (up to 4x) in simulation environments. No load-bearing equations, fitted parameters renamed as predictions, or self-citation chains reduce the central result to its inputs by construction. The approach is presented as a modular abstraction applicable across architectures, with performance measured against independent baselines rather than derived tautologically from definitions or prior author work.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Abstract-only review yields minimal ledger entries; approach rests on standard domain assumptions of learnable latent dynamics rather than new postulates.

free parameters (1)

number of temporal scales and their horizons
How many scales to use and their relative time resolutions are design choices, likely tuned on data.

axioms (1)

domain assumption: Latent world models can be trained to predict dynamics reliably at multiple distinct temporal resolutions
Invoked implicitly as the foundation for hierarchical planning to work.

pith-pipeline@v0.9.0 · 5504 in / 1293 out tokens · 32595 ms · 2026-05-13T20:10:10.840309+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

learning latent world models at multiple temporal scales and performing hierarchical planning across these scales... high-level planner optimizes macro-actions... low-level planner optimizes primitive actions... $E_2(\hat{l}_{1:H}; z_1, z_g) \triangleq \lVert z_g - P^{(2)}(\hat{l}_{1:H}; z_1) \rVert_1$

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

8-tick period... three spatial dimensions... J(x)=½(x+x⁻¹)−1

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Latent State Design for World Models under Sufficiency Constraints
cs.AI · 2026-05 · unverdicted · novelty 7.0

World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.
