JEDI: Joint Embedding Diffusion World Model for Online Model-Based Reinforcement Learning
Recognition: 2 theorem links · Lean Theorem
Pith reviewed 2026-05-14 19:32 UTC · model grok-4.3
The pith
JEDI trains an end-to-end latent diffusion world model by learning predictive latents directly from the diffusion denoising loss inside a JEPA framework.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
JEDI is the first online end-to-end latent diffusion world model. It learns its latent space directly from the diffusion denoising loss within a JEPA framework, using denoising to learn and predict future latents rather than relying on reconstruction and pretrained models. The paper supplies a theoretical motivation: conventional JEPA objectives induce a predictive information bottleneck, while conditional diffusion denoising admits a closely related predictive-compression decomposition. Empirically, JEDI remains competitive on Atari100k, outperforms the separately trained latent baseline where directly comparable, and, relative to the pixel diffusion baseline, delivers 43 percent lower VRAM use, over three times faster world-model sampling, and 2.5 times faster training.
What carries the argument
The Joint Embedding Diffusion (JEDI) objective that replaces reconstruction with conditional diffusion denoising inside a joint-embedding predictive architecture to jointly learn and forecast future latents.
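The objective described above can be sketched in a few lines. Everything below (the linear encoder, the single noise level, the placeholder denoiser) is an illustrative stand-in under our own assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(obs, weights):
    # Toy linear encoder standing in for the paper's image encoder.
    return obs @ weights

def forward_diffuse(z, alpha_bar):
    # DDPM-style forward process: z_noisy = sqrt(a)*z + sqrt(1-a)*eps.
    eps = rng.standard_normal(z.shape)
    return np.sqrt(alpha_bar) * z + np.sqrt(1.0 - alpha_bar) * eps, eps

def jedi_style_loss(obs_t, obs_next, w_online, w_target, denoiser):
    # Joint-embedding diffusion sketch: the denoiser, conditioned on the
    # current latent, predicts the noise added to the latent of the NEXT
    # observation (in the paper, an EMA target with stop-gradient);
    # no pixel reconstruction appears anywhere in the loss.
    z_t = encoder(obs_t, w_online)        # conditioning latent
    z_next = encoder(obs_next, w_target)  # target latent to be denoised
    z_noisy, eps = forward_diffuse(z_next, alpha_bar=0.5)
    eps_hat = denoiser(z_noisy, z_t)      # predicted noise
    return float(np.mean((eps_hat - eps) ** 2))

# Toy usage: a shared random projection and a trivial zero denoiser.
w = rng.standard_normal((8, 4))
obs_t = rng.standard_normal(8)
obs_next = rng.standard_normal(8)
loss = jedi_style_loss(obs_t, obs_next, w, w,
                       lambda z_noisy, cond: np.zeros_like(z_noisy))
```

In the real method both the encoder and the denoiser would be trained networks; the point of the sketch is only that the denoising error on future latents is the sole representation-learning signal.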
If this is right
- World-model sampling runs over three times faster than pixel diffusion while using 43 percent less VRAM.
- End-to-end training removes the need for a separate pretrained encoder, allowing the entire pipeline to optimize for downstream planning.
- The learned latents produce a different profile of task performance than pixel-space diffusion, showing that the representation itself changes behavior.
- The predictive-compression decomposition supplies a route to replace reconstruction objectives in other latent world-model architectures.
Where Pith is reading between the lines
- The same denoising-plus-prediction structure could be tested in continuous-control domains where stochastic dynamics dominate.
- If the information-bottleneck argument holds, replacing the JEPA predictor with a diffusion head might improve sample efficiency in other self-supervised representation learners.
- The observed shift in task-level performance suggests that hybrid latent-pixel models could combine the speed of JEDI with the fidelity of pixel diffusion on the hardest games.
Load-bearing premise
That the diffusion denoising loss, when placed inside the JEPA predictive loop, produces latents that are both sufficiently compressed and free of the predictive information bottleneck created by conventional JEPA training.
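Read against the classic information-bottleneck objective [56], the premise can be stated schematically. The notation below (latent $z_t$, observation $o_t$, future latent $z_{t+1}$, trade-off weight $\beta$) is ours for illustration, not the paper's:

```latex
% Predictive information bottleneck (schematic form, after Tishby et al. [56]):
% keep z_t maximally predictive of the future latent while compressing away
% the rest of the observation.
\min_{\theta}\; I(z_t;\, o_t) \;-\; \beta\, I(z_t;\, z_{t+1})
```

The paper's claim, as summarized here, is that conventional JEPA training implicitly optimizes something of this shape, and that conditional denoising decomposes into analogous predictive and compression terms.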
What would settle it
A head-to-head comparison on Atari100k in which the separately trained latent baseline is given the same total compute budget as JEDI and still underperforms, or a direct measurement showing that JEDI latents retain less mutual information with future states than a standard JEPA encoder.
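The second test can be made concrete with a simple plug-in mutual-information estimator; the histogram approach and variable names below are illustrative choices, not the paper's protocol:

```python
import numpy as np

def mutual_information(x, y, bins=16):
    # Histogram plug-in estimator of I(X;Y) in nats. A crude but concrete
    # version of the proposed check: compare how much information JEDI
    # latents vs. standard JEPA latents retain about future states.
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)  # marginal over y
    p_y = p_xy.sum(axis=0, keepdims=True)  # marginal over x
    mask = p_xy > 0                        # avoid log(0) terms
    return float((p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask])).sum())

# Toy usage: a latent strongly coupled to the future state vs. one that is not.
rng = np.random.default_rng(0)
future = rng.standard_normal(10_000)
informative_latent = future + 0.1 * rng.standard_normal(10_000)
uninformative_latent = rng.standard_normal(10_000)
mi_high = mutual_information(informative_latent, future)
mi_low = mutual_information(uninformative_latent, future)
```

For real high-dimensional latents a neural estimator would be needed; the plug-in version only illustrates what "retains less mutual information" would mean operationally.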
read the original abstract
Diffusion world models have recently become competitive for online model-based reinforcement learning, but current approaches expose a tension: pixel diffusion is effective but computationally expensive while the latest latent diffusion approach improves efficiency yet performs subpar. The latter also relies on separately trained latents rather than the end-to-end world-model objectives that have driven much of modern MBRL progress. In particular, JEPA-style predictive representation learning has emerged as an especially promising direction for world modeling and MBRL. Concurrently, diffusion-style objectives have gained traction across multiple domains, with iterative refinement as a promising approach for multimodal and stochastic targets. Taken together, these trends motivate Joint Embedding DIffusion (JEDI), the first online end-to-end latent diffusion world model. JEDI learns its latent space directly from the diffusion denoising loss with a JEPA framework, using denoising to learn and predict future latents rather than relying on reconstruction and pretrained models. We provide a theoretical motivation showing that conventional JEPA objectives induce a predictive information bottleneck, and that conditional diffusion denoising admits a closely related predictive-compression decomposition. Empirically, JEDI is competitive on Atari100k and outperforms the baseline with seperately trained latents where directly comparable. Relative to the pixel diffusion baseline, JEDI uses 43% less VRAM, over 3$\times$ faster world-model sampling, and 2.5$\times$ faster training. JEDI also exhibits a markedly different task-level performance profile from the pixel baseline, suggesting that end-to-end predictive latents change more than compute alone.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces JEDI as the first online end-to-end latent diffusion world model for model-based reinforcement learning. It learns latent spaces directly from the diffusion denoising loss inside a JEPA framework rather than using reconstruction or pretrained encoders, motivated by a claimed theoretical result that standard JEPA objectives induce a predictive information bottleneck while conditional diffusion denoising yields an analogous predictive-compression decomposition. Empirically, JEDI reports competitive Atari100k scores, outperforms a separately-trained-latent baseline where directly compared, and achieves substantial efficiency gains (43% less VRAM, >3× faster world-model sampling, 2.5× faster training) relative to a pixel-diffusion baseline, along with a distinct task-level performance profile.
Significance. If the unshown decomposition is valid and the Atari100k results prove robust, the work would meaningfully advance efficient online MBRL by demonstrating that diffusion objectives can be used for end-to-end predictive latent learning. The reported efficiency improvements and the observation of a qualitatively different performance profile from pixel baselines would be valuable contributions to the design of scalable world models.
major comments (3)
- [§3] Theoretical motivation: the central claim that conditional diffusion denoising admits a predictive-compression decomposition that avoids the JEPA bottleneck is asserted without derivation steps. The manuscript does not show how the decomposition follows once the stochastic forward process, conditioning on past latents, and finite denoising steps are taken into account; this derivation is load-bearing for the justification of end-to-end latent training.
- [§4] Experiments: Atari100k results are presented without error bars, ablation tables, or a complete experimental protocol (e.g., number of seeds, exact hyper-parameter matching to the separately-trained baseline). The claim that JEDI “outperforms the baseline with separately trained latents” and exhibits a “markedly different task-level performance profile” therefore cannot be assessed for statistical reliability.
- [§4.2] Baseline comparisons: efficiency numbers (43% VRAM reduction, 3× sampling speedup) are reported relative to a pixel-diffusion baseline, yet the manuscript does not detail whether the latent baseline used identical architecture depth, optimizer settings, or training horizon; without these controls it is unclear whether observed gains are attributable to the end-to-end diffusion objective or to other implementation differences.
minor comments (2)
- [Abstract] “seperately” is a typo and should read “separately.”
- [Notation] Ensure that the symbols for latent variables, diffusion time steps, and conditioning variables are defined once and used consistently across equations and figures.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and have revised the manuscript accordingly to include the full theoretical derivation and complete experimental details.
read point-by-point responses
- Referee [§3] (theoretical motivation): the central claim that conditional diffusion denoising admits a predictive-compression decomposition that avoids the JEPA bottleneck is asserted without derivation steps. The manuscript does not show how the decomposition follows once the stochastic forward process, conditioning on past latents, and finite denoising steps are taken into account; this derivation is load-bearing for the justification of end-to-end latent training.
Authors: We agree that the original derivation steps were insufficiently explicit. In the revised manuscript we expand Section 3 with a complete step-by-step derivation: starting from the stochastic forward process q(z_t | z_{t-1}), conditioning the reverse process on past latents, and taking the finite-step denoising objective, we obtain an explicit decomposition into a predictive term (future latent forecasting via the score function) and a compression term (information bottleneck on the latent representation). This decomposition is shown to be analogous to but strictly weaker than the JEPA bottleneck, thereby justifying end-to-end training directly from the diffusion loss. revision: yes
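For orientation, a derivation along these lines would start from the standard conditional denoising objective of Ho et al. [39]. The notation below is ours; the paper's exact equation is not shown in the visible text:

```latex
% Conditional denoising loss on the target latent z_{t+1}, conditioned on the
% current latent z_t and action a_t; \bar{\alpha}_k is the noise schedule at
% diffusion step k (standard DDPM form, not necessarily the paper's equation).
\mathcal{L}_{\mathrm{denoise}}
  = \mathbb{E}_{k,\,\epsilon}\Big\|\,\epsilon -
    \epsilon_\theta\big(\sqrt{\bar{\alpha}_k}\, z_{t+1}
      + \sqrt{1-\bar{\alpha}_k}\,\epsilon,\; k,\; z_t,\; a_t\big)\Big\|^2
```

The promised decomposition would then have to split this expectation into a term that forecasts $z_{t+1}$ from $z_t$ and a term that bounds the information the latent carries about the raw observation.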
- Referee [§4] (experiments): Atari100k results are presented without error bars, ablation tables, or a complete experimental protocol (e.g., number of seeds, exact hyper-parameter matching to the separately-trained baseline). The claim that JEDI “outperforms the baseline with separately trained latents” and exhibits a “markedly different task-level performance profile” therefore cannot be assessed for statistical reliability.
Authors: We have revised Section 4 and the appendix to report all Atari100k scores with error bars computed over 5 independent random seeds. A new ablation table directly compares end-to-end JEDI against the separately-trained latent baseline under identical hyperparameters. The experimental protocol is now fully specified (5 seeds, exact optimizer settings, training horizon, and hyperparameter matching), allowing statistical assessment of the reported outperformance and distinct task-level profile. revision: yes
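The seed-level protocol described here pairs naturally with the aggregate statistics of Agarwal et al. [44]. A minimal sketch of the interquartile mean with a percentile-bootstrap confidence interval (the scores below are made up for illustration):

```python
import numpy as np

def iqm(scores):
    # Interquartile mean: mean of the middle 50% of scores, the aggregate
    # recommended for small-sample RL benchmarks by Agarwal et al. [44].
    s = np.sort(np.asarray(scores, dtype=float))
    n = len(s)
    lo, hi = n // 4, n - n // 4
    return s[lo:hi].mean()

def bootstrap_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    # Percentile bootstrap confidence interval for the IQM.
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    stats = [iqm(rng.choice(scores, size=len(scores), replace=True))
             for _ in range(n_boot)]
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])

# Toy usage with hypothetical normalized scores from 5 seeds.
runs = [0.42, 0.55, 0.48, 0.61, 0.50]
point = iqm(runs)
lo, hi = bootstrap_ci(runs)
```

With 26 games and 5 seeds the resampling would normally be stratified per game; the sketch keeps a single score list for brevity.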
- Referee [§4.2] (baseline comparisons): efficiency numbers (43% VRAM reduction, 3× sampling speedup) are reported relative to a pixel-diffusion baseline, yet the manuscript does not detail whether the latent baseline used identical architecture depth, optimizer settings, or training horizon; without these controls it is unclear whether observed gains are attributable to the end-to-end diffusion objective or to other implementation differences.
Authors: We have added an explicit controls table in the revised Section 4.2 and appendix confirming that the latent baseline (and all other comparisons) used identical architecture depth, Adam optimizer settings (learning rate 1e-4), and training horizon as JEDI. The pixel-diffusion baseline differs solely in operating on pixels rather than latents. With these matched controls, the reported efficiency gains (43% VRAM, >3× sampling, 2.5× training) are attributable to the latent diffusion formulation and end-to-end objective. revision: yes
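Speedup claims of this kind also hinge on a consistent timing protocol. A minimal wall-clock harness, with arbitrary stand-in workloads rather than the actual world models, could look like:

```python
import time
import numpy as np

def mean_step_time(step_fn, n_warmup=3, n_iters=20):
    # Average wall-clock seconds per call; warm-up iterations are excluded
    # so one-off allocation costs do not contaminate the comparison.
    for _ in range(n_warmup):
        step_fn()
    t0 = time.perf_counter()
    for _ in range(n_iters):
        step_fn()
    return (time.perf_counter() - t0) / n_iters

# Toy usage: a small "latent-space" workload vs. a larger "pixel-space" one.
latent_step = lambda: np.linalg.norm(np.ones((64, 64)) @ np.ones((64, 64)))
pixel_step = lambda: np.linalg.norm(np.ones((256, 256)) @ np.ones((256, 256)))
speedup = mean_step_time(pixel_step) / mean_step_time(latent_step)
```

On a GPU one would additionally synchronize before reading the clock and record peak memory; the sketch shows only the shape of a fair protocol, not the paper's measurement code.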
Circularity Check
No significant circularity detected in derivation chain
full rationale
The paper asserts a theoretical motivation that conventional JEPA induces a predictive information bottleneck while conditional diffusion denoising admits a predictive-compression decomposition, yet the visible text (abstract and context) contains no equations, self-referential definitions, or reductions that equate outputs to inputs by construction. Empirical claims rest on direct comparisons to baselines with separately trained latents and pixel-diffusion models, which are independent measurements rather than fitted quantities renamed as predictions. No self-citations are used to import uniqueness theorems or smuggle ansatzes; the efficiency and performance results (VRAM, speed, Atari100k scores) are externally falsifiable benchmarks. The derivation chain is therefore self-contained against external evaluation.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: conventional JEPA objectives induce a predictive information bottleneck.
invented entities (1)
- JEDI latent diffusion world model (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  The relation between the paper passage and the cited Recognition theorem is ambiguous. Passage: "We provide a theoretical motivation showing that conventional JEPA objectives induce a predictive information bottleneck, and that conditional diffusion denoising admits a closely related predictive-compression decomposition."
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Dyna, an integrated architecture for learning, planning, and reacting.ACM Sigart Bulletin, 2(4):160–163, 1991
Richard S Sutton. Dyna, an integrated architecture for learning, planning, and reacting.ACM Sigart Bulletin, 2(4):160–163, 1991
1991
-
[2]
David Ha and Jürgen Schmidhuber. World models.arXiv preprint arXiv:1803.10122, 2(3), 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[3]
Dream to Control: Learning Behaviors by Latent Imagination
Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination.arXiv preprint arXiv:1912.01603, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1912
-
[4]
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, Ruoxi Chen, Zhengqing Yuan, Yue Huang, Hanchi Sun, Jianfeng Gao, et al. Sora: A review on background, technology, limitations, and opportunities of large vision models.arXiv preprint arXiv:2402.17177, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[5]
Genie 2: A large-scale foundation world model.URL: https://deepmind
J Parker-Holder, P Ball, J Bruce, V Dasagi, K Holsheimer, C Kaplanis, A Moufarek, G Scully, J Shar, J Shi, et al. Genie 2: A large-scale foundation world model.URL: https://deepmind. google/discover/blog/genie-2-a-large-scale-foundation-world-model, 2024
2024
-
[6]
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Do- minik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[7]
Diffusion for world modeling: Visual details matter in atari.Advances in Neural Information Processing Systems, 37:58757–58791, 2024
Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos J Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in atari.Advances in Neural Information Processing Systems, 37:58757–58791, 2024
2024
-
[8]
Lior Cohen, Ofir Nabati, Kaixin Wang, Navdeep Kumar, and Shie Mannor. Horizon imagination: Efficient on-policy rollout in diffusion world models.arXiv preprint arXiv:2602.08032, 2026
-
[9]
Learning latent dynamics for planning from pixels
Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. InInternational conference on machine learning, pages 2555–2565. PMLR, 2019
2019
-
[10]
Mastering Atari with Discrete World Models
Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models.arXiv preprint arXiv:2010.02193, 2020
work page internal anchor Pith review arXiv 2010
-
[11]
Mastering Diverse Domains through World Models
Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models.arXiv preprint arXiv:2301.04104, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[12]
Storm: Efficient stochastic transformer based world models for reinforcement learning.Advances in Neural Information Processing Systems, 36:27147–27166, 2023
Weipu Zhang, Gang Wang, Jian Sun, Yetian Yuan, and Gao Huang. Storm: Efficient stochastic transformer based world models for reinforcement learning.Advances in Neural Information Processing Systems, 36:27147–27166, 2023
2023
-
[13]
Jan Robine, Marc Höftmann, Tobias Uelwer, and Stefan Harmeling. Transformer-based world models are happy with 100k interactions.arXiv preprint arXiv:2303.07109, 2023
-
[14]
Discovering predictable classifications.Neural Computation, 5(4):625–635, 1993
Jürgen Schmidhuber and Daniel Prelinger. Discovering predictable classifications.Neural Computation, 5(4):625–635, 1993
1993
-
[15]
Bootstrap your own latent-a new approach to self-supervised learning.Advances in neural information processing systems, 33:21271–21284, 2020
Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning.Advances in neural information processing systems, 33:21271–21284, 2020
2020
-
[16]
A path towards autonomous machine intelligence version 0.9
Yann LeCun et al. A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27. Open Review, 62(1):1–62, 2022
2022
-
[17]
Self-supervised learning from images with a joint- embedding predictive architecture
Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint- embedding predictive architecture. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15619–15629, 2023. 10
2023
-
[18]
Revisiting feature prediction for learning visual repre- sentations from video, 2024
Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual repre- sentations from video, 2024
2024
-
[19]
Temporal difference learning for model predictive control.arXiv preprint arXiv:2203.04955,
Nicklas Hansen, Xiaolong Wang, and Hao Su. Temporal difference learning for model predictive control.arXiv preprint arXiv:2203.04955, 2022
-
[20]
TD-MPC2: Scalable, Robust World Models for Continuous Control
Nicklas Hansen, Hao Su, and Xiaolong Wang. Td-mpc2: Scalable, robust world models for continuous control.arXiv preprint arXiv:2310.16828, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[21]
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[22]
Temporal straightening for latent planning.arXiv preprint arXiv:2603.12231, 2026
Ying Wang, Oumayma Bounou, Gaoyue Zhou, Randall Balestriero, Tim GJ Rudner, Yann LeCun, and Mengye Ren. Temporal straightening for latent planning.arXiv preprint arXiv:2603.12231, 2026
-
[23]
LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels
Lucas Maes, Quentin Le Lidec, Damien Scieur, Yann LeCun, and Randall Balestriero. Leworld- model: Stable end-to-end joint-embedding predictive architecture from pixels.arXiv preprint arXiv:2603.19312, 2026
work page internal anchor Pith review arXiv 2026
-
[24]
Diffusion policy: Visuomotor policy learning via action diffusion
Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 44(10-11):1684–1704, 2025
2025
-
[25]
Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning
Zhendong Wang, Jonathan J Hunt, and Mingyuan Zhou. Diffusion policies as an expressive policy class for offline reinforcement learning.arXiv preprint arXiv:2208.06193, 2022
work page internal anchor Pith review arXiv 2022
-
[26]
Kevin Frans, Seohong Park, Pieter Abbeel, and Sergey Levine. Diffusion guidance is a controllable policy improvement operator.arXiv preprint arXiv:2505.23458, 2025
-
[27]
Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution
Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion modeling by estimating the ratios of the data distribution.arXiv preprint arXiv:2310.16834, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[28]
Diffusion beats autoregressive in data-constrained settings.arXiv preprint arXiv:2507.15857,
Mihir Prabhudesai, Mengning Wu, Amir Zadeh, Katerina Fragkiadaki, and Deepak Pathak. Diffusion beats autoregressive in data-constrained settings.arXiv preprint arXiv:2507.15857, 2025
-
[29]
Diffusion-lm improves controllable text generation.Advances in neural information processing systems, 35:4328–4343, 2022
Xiang Li, John Thickstun, Ishaan Gulrajani, Percy S Liang, and Tatsunori B Hashimoto. Diffusion-lm improves controllable text generation.Advances in neural information processing systems, 35:4328–4343, 2022
2022
-
[30]
Omer Belhasin, Shelly Golan, Ran El-Yaniv, and Michael Elad. Advancing image classification with discrete diffusion classification modeling.arXiv preprint arXiv:2511.20263, 2025
-
[31]
arXiv preprint arXiv:1612.00410 , year=
Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. Deep variational information bottleneck.arXiv preprint arXiv:1612.00410, 2016
-
[33]
Opening the Black Box of Deep Neural Networks via Information
Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information.arXiv preprint arXiv:1703.00810, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[34]
Model-Based Reinforcement Learning for Atari
Lukasz Kaiser, Mohammad Babaeizadeh, Piotr Milos, Blazej Osinski, Roy H Campbell, Konrad Czechowski, Dumitru Erhan, Chelsea Finn, Piotr Kozakowski, Sergey Levine, et al. Model- based reinforcement learning for atari.arXiv preprint arXiv:1903.00374, 2019
-
[35]
Mikel Malagón, Josu Ceberio, and Jose A Lozano. Craftium: Bridging flexibility and efficiency for rich 3d single-and multi-agent environments.arXiv preprint arXiv:2407.03969, 2024. 11
-
[36]
Self-supervised information bottleneck for deep multi-view subspace clustering.IEEE Transactions on Image Processing, 32:1555–1567, 2023
Shiye Wang, Changsheng Li, Yanming Li, Ye Yuan, and Guoren Wang. Self-supervised information bottleneck for deep multi-view subspace clustering.IEEE Transactions on Image Processing, 32:1555–1567, 2023
2023
-
[37]
To compress or not to compress—self-supervised learning and information theory: A review.Entropy, 26(3):252, 2024
Ravid Shwartz Ziv and Yann LeCun. To compress or not to compress—self-supervised learning and information theory: A review.Entropy, 26(3):252, 2024
2024
-
[38]
Emerging properties in self-supervised vision transformers
Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021
2021
-
[39]
Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020
2020
-
[40]
An information theory perspective on variance-invariance-covariance regularization.Advances in neural information processing systems, 36:33965–33998, 2023
Ravid Shwartz-Ziv, Randall Balestriero, Kenji Kawaguchi, Tim GJ Rudner, and Yann LeCun. An information theory perspective on variance-invariance-covariance regularization.Advances in neural information processing systems, 36:33965–33998, 2023
2023
-
[41]
Lejepa: Provable and scalable self-supervised learning without the heuristics, 2025
Randall Balestriero and Yann LeCun. Lejepa: Provable and scalable self-supervised learning without the heuristics.arXiv preprint arXiv:2511.08544, 2025
-
[42]
Elucidating the design space of diffusion-based generative models.Advances in neural information processing systems, 35: 26565–26577, 2022
Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models.Advances in neural information processing systems, 35: 26565–26577, 2022
2022
-
[43]
Simple statistical gradient-following algorithms for connectionist reinforce- ment learning.Machine learning, 8(3):229–256, 1992
Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforce- ment learning.Machine learning, 8(3):229–256, 1992
1992
-
[44]
Deep reinforcement learning at the edge of the statistical precipice
Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron C Courville, and Marc Belle- mare. Deep reinforcement learning at the edge of the statistical precipice. InAdvances in Neural Information Processing Systems, volume 34, pages 29304–29320, 2021
2021
-
[45]
Weipu Zhang, Adam Jelley, Trevor McInroe, and Amos Storkey. Objects matter: object-centric world models improve reinforcement learning in visually complex environments.arXiv preprint arXiv:2501.16443, 2025
-
[46]
Transformers are sample efficient world models.arXiv preprint arXiv:2209.00588, 2022
Vincent Micheli, Eloi Alonso, and François Fleuret. Transformers are sample-efficient world models.arXiv preprint arXiv:2209.00588, 2022
-
[47]
Maxime Burchi and Radu Timofte. Learning transformer-based world models with contrastive predictive coding.arXiv preprint arXiv:2503.04416, 2025
-
[48]
How not to lie with statistics: the correct way to summarize benchmark results.Communications of the ACM, 29(3):218–221, 1986
Philip J Fleming and John J Wallace. How not to lie with statistics: the correct way to summarize benchmark results.Communications of the ACM, 29(3):218–221, 1986
1986
-
[49]
OECD publishing, 2008
Joint Research Centre.Handbook on constructing composite indicators: methodology and user guide. OECD publishing, 2008
2008
-
[50]
Information retrieval: theory and practice
C Van Rijsbergen. Information retrieval: theory and practice. InProceedings of the joint IBM/University of Newcastle upon tyne seminar on data base systems, volume 79, pages 1–14. Butterworth-Heinemann Oxford, UK, 1979
1979
-
[51]
Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm
David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm.arXiv preprint arXiv:1712.01815, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[52]
Mastering atari, go, chess and shogi by planning with a learned model.Nature, 588(7839):604–609, 2020
Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Si- mon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering atari, go, chess and shogi by planning with a learned model.Nature, 588(7839):604–609, 2020
2020
-
[53]
Mastering atari games with limited data.Advances in neural information processing systems, 34:25476–25488, 2021
Weirui Ye, Shaohuai Liu, Thanard Kurutach, Pieter Abbeel, and Yang Gao. Mastering atari games with limited data.Advances in neural information processing systems, 34:25476–25488, 2021. 12
2021
-
[54]
Zhao-Han Peng, Shaohui Li, Zhi Li, Shulan Ruan, Yu Liu, and You He. From observations to events: Event-aware world model for reinforcement learning.arXiv preprint arXiv:2601.19336, 2026
-
[55]
Simulus: Combining Improvements in Sample-Efficient World Model Agents
Lior Cohen, Kaixin Wang, Bingyi Kang, Uri Gadot, and Shie Mannor. Uncovering untapped potential in sample-efficient world model agents.arXiv preprint arXiv:2502.11537, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[56]
The information bottleneck method
Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. arXiv preprint physics/0004057, 2000
work page internal anchor Pith review Pith/arXiv arXiv 2000
-
[57]
Generative modeling by estimating gradients of the data distribution.Advances in neural information processing systems, 32, 2019
Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution.Advances in neural information processing systems, 32, 2019
2019
-
[58]
High- resolution image synthesis with latent diffusion models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High- resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022
2022
-
[59]
Scaling rectified flow trans- formers for high-resolution image synthesis
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow trans- formers for high-resolution image synthesis. InForty-first international conference on machine learning, 2024
2024
-
[60]
Genie: Generative interactive environments
Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. InForty-first International Conference on Machine Learning, 2024
2024
-
[61]
Score-based generative modeling in latent space
Arash Vahdat, Karsten Kreis, and Jan Kautz. Score-based generative modeling in latent space. Advances in neural information processing systems, 34:11287–11302, 2021
2021
-
[62]
Planning with Diffusion for Flexible Behavior Synthesis
Michael Janner, Yilun Du, Joshua B Tenenbaum, and Sergey Levine. Planning with diffusion for flexible behavior synthesis.arXiv preprint arXiv:2205.09991, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[63]
Policy-guided diffusion.arXiv preprint arXiv:2404.06356, 2024
Matthew Thomas Jackson, Michael Tryfan Matthews, Cong Lu, Benjamin Ellis, Shimon Whiteson, and Jakob Foerster. Policy-guided diffusion.arXiv preprint arXiv:2404.06356, 2024
-
[64]
Synthetic experience replay
Cong Lu, Philip Ball, Yee Whye Teh, and Jack Parker-Holder. Synthetic experience replay. Advances in Neural Information Processing Systems, 36:46323–46344, 2023
2023
-
[65]
Zihan Ding, Amy Zhang, Yuandong Tian, and Qinqing Zheng. Diffusion world model: Fu- ture modeling beyond step-by-step rollout for offline reinforcement learning.arXiv preprint arXiv:2402.03570, 2024
-
[66] Grace Luo, Lisa Dunlap, Dong Huk Park, Aleksander Holynski, and Trevor Darrell. Diffusion hyperfeatures: Searching through time and space for semantic correspondence. Advances in Neural Information Processing Systems, 36:47500–47510, 2023.
[67] Zhongqi Yue, Jiankun Wang, Qianru Sun, Lei Ji, Eric I Chang, Hanwang Zhang, et al. Exploring diffusion time-steps for unsupervised representation learning. arXiv preprint arXiv:2401.11430, 2024.
[68] Xingyi Yang and Xinchao Wang. Diffusion model as representation learner. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 18938–18949, 2023.
[69] Zijian Zhang, Zhou Zhao, and Zhijie Lin. Unsupervised representation learning from pre-trained diffusion probabilistic models. Advances in Neural Information Processing Systems, 35:22117–22130, 2022.
[70] Kushagra Pandey, Avideep Mukherjee, Piyush Rai, and Abhishek Kumar. DiffuseVAE: Efficient, controllable and high-fidelity generation from low-dimensional latents. arXiv preprint arXiv:2201.00308, 2022.
[71] Beatrix MG Nielsen, Anders Christensen, Andrea Dittadi, and Ole Winther. DiffEnc: Variational diffusion with a learned encoder. arXiv preprint arXiv:2310.19789, 2023.
[72] Dmitry Baranchuk, Ivan Rubachev, Andrey Voynov, Valentin Khrulkov, and Artem Babenko. Label-efficient semantic segmentation with diffusion models. arXiv preprint arXiv:2112.03126, 2021.
[73] Weilai Xiang, Hongyu Yang, Di Huang, and Yunhong Wang. Denoising diffusion autoencoders are unified self-supervised learners. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15802–15812, 2023.
[74] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, pages 234–241. Springer, 2015.

A Technical appendices and supplementary material
Equivalently, the joint distribution factorizes as

$$p_\psi(x_1, x_2, z^0_1, z^{0:T}_2) = p(z^0_1)\, p(x_1 \mid z^0_1)\, p(z^T_2) \prod_{t=1}^{T} p_\psi(z^{t-1}_2 \mid z^t_2, z^0_1)\, p(x_2 \mid z^0_2). \tag{26}$$

Here, $p_\psi(z^{0:T}_2 \mid z^0_1) := p(z^T_2) \prod_{t=1}^{T} p_\psi(z^{t-1}_2 \mid z^t_2, z^0_1)$ is the conditional reverse diffusion path. Therefore, the induced stochastic JEPA predictor from $z^0_1$ to $z^0_2$ is

$$p_\psi(z^0_2 \mid z^0_1) = \int p(z^T_2) \prod_{t=1}^{T} p_\psi(z^{t-1}_2 \mid z^t_2, z^0_1)\, dz^{1:T}_2. \tag{27}$$

Thus, the conditional diffusion model is a multi-step stochastic JEPA predictor. We introduce the amortized variational posterior

$$q_\phi(z^0_1, z^{0:T}_2 \mid x_1, x_2) = q_\phi(z^0_1 \mid x_1)\, q_\phi(z^0_2 \mid x_2)\, q(z^{1:T}_2 \mid z^0_2), \tag{28}$$

where $q(z^{1:T}_2 \mid z^0_2)$ is the fixed forward diffusion process.

For notational compactness, for a fixed pair $(x_1, x_2)$, define $q_1(z^0_1) := q_\phi(z^0_1 \mid x_1)$, $q^0_2(z^0_2) := q_\phi(z^0_2 \mid x_2)$, $q^{0:T}_2(z^{0:T}_2) := q^0_2(z^0_2)\, q(z^{1:T}_2 \mid z^0_2)$, and $q_{12}(z^0_1, z^{0:T}_2) := q_1(z^0_1)\, q^{0:T}_2(z^{0:T}_2)$.

Variational decomposition of the joint likelihood. We begin with the marginal log-likelihood of the two views. Since $q_{12}$ is a normalized distribution over $(z^0_1, z^{0:T}_2)$, we may write

$$\log p_\psi(x_1, x_2) = \int q_{12}(z^0_1, z^{0:T}_2) \log p_\psi(x_1, x_2)\, dz^0 \dots$$
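The predictor induced by Eq. (27) is just a conditional reverse-diffusion chain: start from the prior over $z^T_2$ and apply $T$ learned denoising transitions, each conditioned on the context latent $z^0_1$. A minimal sketch of that sampling loop, with a hypothetical `toy_denoise_step` standing in for the learned transition $p_\psi(z^{t-1}_2 \mid z^t_2, z^0_1)$ (all names and dynamics here are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def reverse_diffusion_predict(z0_1, denoise_step, T=50, dim=8):
    """Draw z^0_2 ~ p_psi(. | z^0_1) per Eq. (27): sample z^T_2 from the
    standard-normal prior, then apply T conditional denoising transitions."""
    z_t = rng.standard_normal(dim)                 # z^T_2 ~ p(z^T_2)
    for t in range(T, 0, -1):
        mean, std = denoise_step(z_t, t, z0_1)     # p_psi(z^{t-1}_2 | z^t_2, z^0_1)
        z_t = mean + std * rng.standard_normal(dim)
    return z_t                                     # one draw of z^0_2

def toy_denoise_step(z_t, t, z0_1):
    """Hypothetical stand-in for the learned transition: interpolate toward
    the conditioning latent, with noise that shrinks as t -> 0."""
    alpha = 1.0 / (1.0 + t)
    return (1 - alpha) * z_t + alpha * z0_1, 0.1 * t / 50

context = rng.standard_normal(8)                   # a context latent z^0_1
sample = reverse_diffusion_predict(context, toy_denoise_step)
print(sample.shape)
```

Because each transition adds fresh noise, repeated calls with the same `context` give different draws of $z^0_2$, which is exactly what makes the predictor stochastic rather than a deterministic JEPA regressor.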