Recognition: 3 theorem links
· Lean TheoremLeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels
Pith reviewed 2026-05-15 04:03 UTC · model grok-4.3
The pith
LeWorldModel trains the first stable end-to-end JEPA from raw pixels using only two loss terms.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LeWM is the first JEPA that trains stably end-to-end from raw pixels using only two loss terms: a next-embedding prediction loss and a regularizer enforcing Gaussian-distributed latent embeddings. This reduces tunable loss hyperparameters from six to one compared to the only existing end-to-end alternative. With ~15M parameters trainable on a single GPU in a few hours, LeWM plans up to 48x faster than foundation-model-based world models while remaining competitive across diverse 2D and 3D control tasks. Beyond control, we show that LeWM's latent space encodes meaningful physical structure through probing of physical quantities. Surprise evaluation confirms that the model reliably detectsphys
What carries the argument
The Gaussian regularizer on latent embeddings, which keeps representations from collapsing by enforcing a Gaussian distribution during end-to-end training from pixels.
If this is right
- World-model training becomes feasible with only one tunable hyperparameter instead of six.
- Models with 15 million parameters can be trained on a single GPU and still produce competitive policies.
- Planning speed improves by up to 48 times relative to larger foundation-model world models.
- Latent embeddings can be probed to recover physical quantities such as positions and velocities.
- Surprise signals in the latent space reliably flag physically implausible transitions.
Where Pith is reading between the lines
- The same two-term recipe may generalize to video prediction or robotic manipulation domains beyond the current control benchmarks.
- If the Gaussian constraint preserves physical structure, it could serve as a lightweight prior for other latent-space predictive models.
- Removing the need for pre-trained encoders opens the door to fully self-supervised world-model learning on raw sensor streams.
- Faster planning combined with physical interpretability could enable real-time model-based control on embedded hardware.
Load-bearing premise
The Gaussian regularizer alone is sufficient to prevent representation collapse across diverse 2D and 3D control tasks without auxiliary supervision or pre-trained encoders.
What would settle it
Training LeWM on a new suite of control tasks without the Gaussian regularizer and observing immediate representation collapse would falsify the claim that the regularizer alone guarantees stability.
read the original abstract
Joint Embedding Predictive Architectures (JEPAs) offer a compelling framework for learning world models in compact latent spaces, yet existing methods remain fragile, relying on complex multi-term losses, exponential moving averages, pre-trained encoders, or auxiliary supervision to avoid representation collapse. In this work, we introduce LeWorldModel (LeWM), the first JEPA that trains stably end-to-end from raw pixels using only two loss terms: a next-embedding prediction loss and a regularizer enforcing Gaussian-distributed latent embeddings. This reduces tunable loss hyperparameters from six to one compared to the only existing end-to-end alternative. With ~15M parameters trainable on a single GPU in a few hours, LeWM plans up to 48x faster than foundation-model-based world models while remaining competitive across diverse 2D and 3D control tasks. Beyond control, we show that LeWM's latent space encodes meaningful physical structure through probing of physical quantities. Surprise evaluation confirms that the model reliably detects physically implausible events.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces LeWorldModel (LeWM), a Joint-Embedding Predictive Architecture (JEPA) that claims to be the first to train stably end-to-end from raw pixels using only two loss terms: a next-embedding prediction loss and a regularizer enforcing Gaussian-distributed latent embeddings. This reduces tunable loss hyperparameters from six to one. The ~15M-parameter model trains on a single GPU in hours, plans up to 48x faster than foundation-model baselines, performs competitively on diverse 2D and 3D control tasks, encodes physical quantities in its latent space (via probing), and detects implausible events through surprise evaluation.
Significance. If the empirical claims hold, the work would be significant for simplifying JEPA training in self-supervised world-model learning, removing reliance on multi-term losses, EMAs, or pretrained encoders. The single-hyperparameter design and efficiency could broaden accessibility for control applications, while the physical-structure probing offers a concrete advance beyond task performance metrics.
major comments (3)
- Abstract: The claim that the Gaussian regularizer alone suffices to prevent representation collapse (the weakest assumption) is load-bearing for the 'first stable two-loss JEPA' assertion, yet no formulation of the regularizer, its weight schedule, or embedding statistics (variance, mode coverage) across tasks is provided; without this, it is impossible to verify whether it replaces the auxiliary terms used in prior JEPAs.
- Experiments section: No ablation table or figure isolates the effect of the Gaussian regularizer versus the prediction loss alone, nor reports the single tunable hyperparameter value per task; this undermines the reduction-from-six-to-one claim, especially given that prior work required additional terms precisely because simpler regularizers often led to collapse on similar 2D/3D benchmarks.
- Results: The competitive performance and 48x planning speedup are stated without reference to specific baseline tables, error bars, or statistical tests; the abstract-only presentation leaves the soundness of these quantitative claims unverifiable.
minor comments (1)
- Abstract: The ~15M parameter count and single-GPU training time should be tied to a specific model diagram or experimental-setup paragraph for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which highlight important aspects of clarity and verifiability. We address each major point below and will incorporate revisions to strengthen the manuscript.
read point-by-point responses
-
Referee: Abstract: The claim that the Gaussian regularizer alone suffices to prevent representation collapse (the weakest assumption) is load-bearing for the 'first stable two-loss JEPA' assertion, yet no formulation of the regularizer, its weight schedule, or embedding statistics (variance, mode coverage) across tasks is provided; without this, it is impossible to verify whether it replaces the auxiliary terms used in prior JEPAs.
Authors: We agree that explicit details on the regularizer are necessary to support the stability claim. In the revised manuscript, we will add the precise mathematical formulation of the Gaussian regularizer (including its implementation as a KL-divergence term to a standard normal), the weighting schedule used during training, and quantitative embedding statistics (mean variance, effective mode coverage, and collapse metrics) across all 2D and 3D tasks. These additions will allow direct verification that the two-loss formulation suffices without auxiliary terms. revision: yes
-
Referee: Experiments section: No ablation table or figure isolates the effect of the Gaussian regularizer versus the prediction loss alone, nor reports the single tunable hyperparameter value per task; this undermines the reduction-from-six-to-one claim, especially given that prior work required additional terms precisely because simpler regularizers often led to collapse on similar 2D/3D benchmarks.
Authors: We acknowledge this gap in the experimental presentation. The revised version will include a new ablation table and accompanying figure that directly compares training with only the prediction loss against the full two-loss objective (prediction + Gaussian regularizer). We will also tabulate the single tunable hyperparameter value used for each task and environment, along with sensitivity analysis showing stability across a narrow range around the reported value. This will substantiate the hyperparameter reduction claim. revision: yes
-
Referee: Results: The competitive performance and 48x planning speedup are stated without reference to specific baseline tables, error bars, or statistical tests; the abstract-only presentation leaves the soundness of these quantitative claims unverifiable.
Authors: We will update the results section to explicitly reference the relevant baseline comparison tables (currently in the supplementary material but now moved to the main text), include error bars computed over multiple random seeds, and add statistical significance tests (e.g., paired t-tests) for the reported performance metrics and planning speedups. The abstract will be revised to point to these tables, ensuring all quantitative claims are directly verifiable from the main paper. revision: yes
Circularity Check
No circularity: empirical claims without derivation chain
full rationale
The manuscript introduces LeWorldModel as an empirical architecture that trains end-to-end from pixels using a next-embedding prediction loss plus a Gaussian regularizer on latent embeddings. No equations, formal derivations, or proof steps are presented that would allow any claimed prediction or result to reduce by construction to fitted inputs, self-citations, or ansatzes. The central assertions (stable training, reduced hyperparameters, competitive performance on 2D/3D tasks, and physical structure in latents) are supported solely by experimental outcomes rather than any self-referential mathematical structure. This leaves the work self-contained against external benchmarks with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
free parameters (1)
- single tunable loss hyperparameter
Lean theorems connected to this paper
-
LawOfExistencedefect_zero_iff_one echoes?
echoesECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
a regularizer enforcing Gaussian-distributed latent embeddings, promoting feature diversity... to prevent trivial collapse
-
CostJcost_nonneg echoes?
echoesECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
SIGReg regularization term enforces Gaussian-distributed latent embeddings
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 21 Pith papers
-
JEDI: Joint Embedding Diffusion World Model for Online Model-Based Reinforcement Learning
JEDI is the first online end-to-end latent diffusion world model that trains latents from denoising loss rather than reconstruction, achieving competitive Atari100k results with 43% less VRAM and over 3x faster sampli...
-
ProteinJEPA: Latent prediction complements protein language models
Masked-position MLM plus JEPA latent prediction outperforms MLM-only pretraining on 10-11 of 16 downstream tasks for 35M-150M protein models while JEPA alone fails.
-
AGWM: Affordance-Grounded World Models for Environments with Compositional Prerequisites
AGWM improves world model accuracy in compositional environments by learning an explicit DAG of action affordance prerequisites to handle dynamic executability.
-
Render, Don't Decode: Weight-Space World Models with Latent Structural Disentanglement
NOVA represents world states as INR weights for decoder-free rendering, compactness, and unsupervised disentanglement of background, foreground, and motion in video world models.
-
Latent State Design for World Models under Sufficiency Constraints
World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.
-
3D-Anchored Lookahead Planning for Persistent Robotic Scene Memory via World-Model-Based MCTS
3D-ALP achieves 0.65 success on memory-dependent 5-step robotic reach tasks versus near-zero for reactive baselines by anchoring MCTS planning to a persistent 3D camera-to-world frame.
-
Pelican-Unified 1.0: A Unified Embodied Intelligence Model for Understanding, Reasoning, Imagination and Action
Pelican-Unified 1.0 trains a single VLM plus Unified Future Generator to jointly optimize understanding, reasoning, future video prediction, and action generation, reporting top-tier scores on VLM, WorldArena, and Rob...
-
Do multimodal models imagine electric sheep?
Fine-tuning VLMs to output action sequences for puzzles causes emergent internal visual representations that improve performance when integrated into reasoning.
-
Predictive but Not Plannable: RC-aux for Latent World Models
RC-aux corrects spatiotemporal mismatch in reconstruction-free latent world models by adding multi-horizon prediction and reachability supervision, improving planning performance on goal-conditioned pixel-control tasks.
-
Render, Don't Decode: Weight-Space World Models with Latent Structural Disentanglement
NOVA represents scene states as INR weights for analytical rendering without decoders and achieves structural disentanglement of content and dynamics in video world models.
-
AeroJEPA: Learning Semantic Latent Representations for Scalable 3D Aerodynamic Field Modeling
AeroJEPA applies joint-embedding predictive learning to produce scalable, semantically organized latent representations for 3D aerodynamic fields that support both field reconstruction and downstream design tasks.
-
Learning to Theorize the World from Observation
NEO induces compositional latent programs as world theories from observations and executes them to enable explanation-driven generalization.
-
Information bottleneck for learning the phase space of dynamics from high-dimensional experimental data
DySIB recovers a two-dimensional representation matching the phase space of a physical pendulum from high-dimensional video data by maximizing predictive mutual information in latent space.
-
Sonata: A Hybrid World Model for Inertial Kinematics under Clinical Data Scarcity
Sonata is a small hybrid world model pre-trained to predict future IMU states that outperforms autoregressive baselines on clinical discrimination, fall-risk prediction, and cross-cohort transfer while fitting on-devi...
-
IntentScore: Intent-Conditioned Action Evaluation for Computer-Use Agents
IntentScore learns intent-conditioned action scores from offline GUI trajectories and raises task success by 6.9 points on an unseen agent and environment.
-
Metriplector: From Field Theory to Neural Architecture
Metriplector treats neural computation as coupled metriplectic field dynamics whose stress-energy tensor readout achieves competitive results on vision, control, Sudoku, language modeling, and pathfinding with small p...
-
Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms
Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.
-
Is the Future Compatible? Diagnosing Dynamic Consistency in World Action Models
Action-state consistency in World Action Models distinguishes successful from failed imagined futures and supports value-free selection of better rollouts via consensus among predictions.
-
ST-Gen4D: Embedding 4D Spatiotemporal Cognition into World Model for 4D Generation
ST-Gen4D uses a world model that fuses global appearance and local dynamic graphs into a 4D cognition representation to guide consistent 4D Gaussian generation.
-
Detecting is Easy, Adapting is Hard: Local Expert Growth for Visual Model-Based Reinforcement Learning under Distribution Shift
JEPA-Indexed Local Expert Growth adds local action corrections for detected shift clusters and yields statistically significant OOD gains on four shift conditions while keeping in-distribution performance intact.
-
World Model for Robot Learning: A Comprehensive Survey
A comprehensive survey that organizes the literature on world models in robot learning, their roles in policy learning, planning, simulation, and video-based generation, with connections to navigation, driving, datase...
Reference graph
Works this paper leans on
-
[1]
Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies.Journal of Machine Learning Research, 17(39):1–40, 2016
work page 2016
-
[2]
David Ha and Jürgen Schmidhuber. World models.arXiv preprint arXiv:1803.10122, 2(3), 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[3]
Transformers are sample-efficient world models
Vincent Micheli, Eloi Alonso, and François Fleuret. Transformers are sample-efficient world models. InThe Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=vhFu1Acb0xb
work page 2023
-
[4]
Training Agents Inside of Scalable World Models
Danijar Hafner, Wilson Yan, and Timothy Lillicrap. Training agents inside of scalable world models, 2025. URLhttps://arxiv.org/abs/2509.24527
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[5]
A path towards autonomous machine intelligence version 0.9
Yann LeCun. A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27. Open Review, 62(1):1–62, 2022
work page 2022
-
[6]
Diffusion for world modeling: Visual details matter in atari
Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in atari. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=NadTwTODgC
work page 2024
-
[7]
Efficient world models with context- aware tokenization
Vincent Micheli, Eloi Alonso, and François Fleuret. Efficient world models with context- aware tokenization. InForty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=BiWIERWBFX
work page 2024
-
[8]
Oasis: A universe in a transformer
Decart, Julian Quevedo, Quinn McIntyre, Spruce Campbell, Xinlei Chen, and Robert Wachen. Oasis: A universe in a transformer. 2024. URLhttps://oasis-model.github.io/
work page 2024
-
[9]
Genie: Generative interactive environments, 2024
Jake Bruce, Michael Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, Yusuf Aytar, Sarah Bechtle, Feryal Behbahani, Stephanie Chan, Nicolas Heess, Lucy Gonzalez, Simon Osindero, Sherjil Ozair, Scott Reed, Jingwei Zhang, Konrad Zolna, Jeff Clune, Nando de Freitas, Satinder Si...
-
[10]
Team HunyuanWorld. Hunyuanworld 1.0: Generating immersive, explorable, and interactive 3d worlds from words or pixels.arXiv preprint, 2025
work page 2025
-
[11]
Worldgym: World model as an environment for policy evaluation, 2025
Julian Quevedo, Ansh Kumar Sharma, Yixiang Sun, Varad Suryavanshi, Percy Liang, and Sherry Yang. Worldgym: World model as an environment for policy evaluation, 2025. URL https://arxiv.org/abs/2506.00613. 12
-
[12]
Self-supervised learning from images with a joint- embedding predictive architecture
Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint- embedding predictive architecture. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15619–15629, 2023
work page 2023
-
[13]
V-jepa: Latent video prediction for visual representation learning
Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mido Assran, and Nicolas Ballas. V-jepa: Latent video prediction for visual representation learning. 2023
work page 2023
-
[14]
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[15]
Brain-JEPA: Brain dynamics foundation model with gradient positioning and spatiotemporal masking
Zijian Dong, Li Ruilin, Yilei Wu, Thuan Tinh Nguyen, Joanna Su Xian Chong, Fang Ji, Nathanael Ren Jie Tong, Christopher Li Hsian Chen, and Juan Helen Zhou. Brain-JEPA: Brain dynamics foundation model with gradient positioning and spatiotemporal masking. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://open...
work page 2024
-
[16]
Echojepa: A latent predictive foundation model for echocardiography,
Alif Munim, Adibvafa Fallahpour, Teodora Szasz, Ahmadreza Attarpour, River Jiang, Brana Sooriyakanthan, Maala Sooriyakanthan, Heather Whitney, Jeremy Slivnick, Barry Rubin, Wendy Tsang, and Bo Wang. Echojepa: A latent predictive foundation model for echocardiography,
- [17]
-
[18]
Dual perspectives on non- contrastive self-supervised learning
Jean Ponce, Basile Terver, Martial Hebert, and Michael Arbel. Dual perspectives on non- contrastive self-supervised learning. InThe Fourteenth International Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id=f5MC1G6XhB
work page 2026
-
[19]
Dino-wm: World models on pre-trained visual features enable zero-shot planning
Gaoyue Zhou, Hengkai Pan, Yann LeCun, and Lerrel Pinto. Dino-wm: World models on pre-trained visual features enable zero-shot planning. InProceedings of the 42nd International Conference on Machine Learning (ICML 2025), 2025
work page 2025
-
[20]
Raktim Gautam Goswami, Prashanth Krishnamurthy, Yann LeCun, and Farshad Khorrami. Osvi-wm: One-shot visual imitation for unseen tasks using world-model-guided trajectory generation, 2025. URLhttps://arxiv.org/abs/2505.20425
-
[21]
Causal- jepa: Learning world models through object-level latent interventions, 2026
Heejeong Nam, Quentin Le Lidec, Lucas Maes, Yann LeCun, and Randall Balestriero. Causal- jepa: Learning world models through object-level latent interventions, 2026. URL https: //arxiv.org/abs/2602.11389
-
[22]
Joint embedding predictive architectures focus on slow features, 2022
Vlad Sobal, Jyothir S V , Siddhartha Jalagam, Nicolas Carion, Kyunghyun Cho, and Yann LeCun. Joint embedding predictive architectures focus on slow features, 2022. URL https: //arxiv.org/abs/2211.10831
-
[23]
Vlad Sobal, Wancong Zhang, Kyunghyun Cho, Randall Balestriero, Tim G. J. Rudner, and Yann LeCun. Stress-testing offline reward-free reinforcement learning: A case for planning with latent dynamics models. In7th Robot Learning Workshop: Towards Robots with Human-Level Abilities, 2025. URLhttps://openreview.net/forum?id=jON7H6A9UU
work page 2025
-
[24]
VICReg: Variance-invariance-covariance regular- ization for self-supervised learning
Adrien Bardes, Jean Ponce, and Yann LeCun. VICReg: Variance-invariance-covariance regular- ization for self-supervised learning. InInternational Conference on Learning Representations,
-
[25]
URLhttps://openreview.net/forum?id=xm6YD62D1Ub
-
[26]
Randall Balestriero and Yann LeCun. Contrastive and non-contrastive self-supervised learn- ing recover global and local spectral embedding methods.Advances in Neural Information Processing Systems, 35:26671–26685, 2022
work page 2022
-
[27]
Lejepa: Provable and scalable self-supervised learning without the heuristics, 2025
Randall Balestriero and Yann LeCun. Lejepa: Provable and scalable self-supervised learning without the heuristics, 2025. URLhttps://arxiv.org/abs/2511.08544
-
[28]
David Ha and Jürgen Schmidhuber. Recurrent world models facilitate policy evolution.Ad- vances in neural information processing systems, 31, 2018. 13
work page 2018
-
[29]
Dream to control: Learning behaviors by latent imagination
Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. InInternational Conference on Learning Representa- tions, 2020. URLhttps://openreview.net/forum?id=S1lOTC4tDS
work page 2020
-
[30]
Mastering Atari with Discrete World Models
Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models.arXiv preprint arXiv:2010.02193, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[31]
Mastering Diverse Domains through World Models
Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models.arXiv preprint arXiv:2301.04104, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[32]
J Testud, J Richalet, A Rault, and J Papon. Model predictive heuristic control: Applications to industial processes.Automatica, 14(5):413–428, 1978
work page 1978
-
[33]
Temporal difference learning for model predictive control
Nicklas Hansen, Xiaolong Wang, and Hao Su. Temporal difference learning for model predictive control. InInternational Conference on Machine Learning (ICML), 2022
work page 2022
-
[34]
TD-MPC2: Scalable, robust world models for continuous control
Nicklas Hansen, Hao Su, and Xiaolong Wang. TD-MPC2: Scalable, robust world models for continuous control. InThe Twelfth International Conference on Learning Representations, 2024. URLhttps://openreview.net/forum?id=Oxh5CstDJU
work page 2024
-
[35]
Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, and Yann LeCun. Navigation world models, 2025. URLhttps://arxiv.org/abs/2412.03572
-
[36]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[37]
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift, 2015. URLhttps://arxiv.org/abs/1502.03167
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[38]
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization.arXiv preprint arXiv:1607.06450, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[39]
Scalable diffusion models with transformers
William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023
work page 2023
-
[40]
Thomas W Epps and Lawrence B Pulley. A test for normality based on the empirical character- istic function.Biometrika, 70(3):723–726, 1983
work page 1983
-
[41]
Harald Cramér and Herman Wold. Some theorems on distribution functions.Journal of the London Mathematical Society, 1(4):290–294, 1936
work page 1936
-
[42]
Springer Science & Business Media, 2004
Reuven Y Rubinstein and Dirk P Kroese.The cross-entropy method: a unified approach to combinatorial optimization, Monte-Carlo simulation and machine learning. Springer Science & Business Media, 2004
work page 2004
-
[43]
Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V . V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel HAZIZA, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Jegou, Julien Mairal, Patrick L...
work page 2024
-
[44]
Perceptual straightening of natural videos.Nature neuroscience, 22(6):984–991, 2019
Olivier J Hénaff, Robbe LT Goris, and Eero P Simoncelli. Perceptual straightening of natural videos.Nature neuroscience, 22(6):984–991, 2019
work page 2019
-
[45]
The violation-of-expectation paradigm: A conceptual overview.Psychological Review, 131(3):716, 2024
Francesco Margoni, Luca Surian, and Renée Baillargeon. The violation-of-expectation paradigm: A conceptual overview.Psychological Review, 131(3):716, 2024
work page 2024
-
[46]
Quentin Garrido, Nicolas Ballas, Mahmoud Assran, Adrien Bardes, Laurent Najman, Michael Rabbat, Emmanuel Dupoux, and Yann LeCun. Intuitive physics understanding emerges from self-supervised pretraining on natural videos.arXiv preprint arXiv:2502.11831, 2025. 14
-
[47]
Intphys 2: Benchmarking intuitive physics understanding in complex synthetic environments, 2025
Florian Bordes, Quentin Garrido, Justine T Kao, Adina Williams, Michael Rabbat, and Em- manuel Dupoux. Intphys 2: Benchmarking intuitive physics understanding in complex synthetic environments, 2025. URLhttps://arxiv.org/abs/2506.09849
-
[48]
Offline Reinforcement Learning with Implicit Q-Learning
Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit q-learning.arXiv preprint arXiv:2110.06169, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[49]
OGBench: Bench- marking offline goal-conditioned RL
Seohong Park, Kevin Frans, Benjamin Eysenbach, and Sergey Levine. OGBench: Bench- marking offline goal-conditioned RL. InThe Thirteenth International Conference on Learning Representations, 2025. URLhttps://openreview.net/forum?id=M992mjgKzI
work page 2025
-
[50]
Learning to reach goals via iterated supervised learning.arXiv preprint arXiv:1912.06088, 2019
Dibya Ghosh, Abhishek Gupta, Ashwin Reddy, Justin Fu, Coline Devin, Benjamin Eysenbach, and Sergey Levine. Learning to reach goals via iterated supervised learning.arXiv preprint arXiv:1912.06088, 2019
-
[51]
stable-pretraining- v1: Foundation model research made simple, 2025
Randall Balestriero, Hugues Van Assel, Sami BuGhanem, and Lucas Maes. stable-pretraining- v1: Foundation model research made simple, 2025. URL https://arxiv.org/abs/2511. 19484
work page 2025
-
[52]
stable-worldmodel-v1: Reproducible world modeling research and evaluation, 2026
Lucas Maes, Quentin Le Lidec, Dan Haramati, Nassim Massaudi, Damien Scieur, Yann LeCun, and Randall Balestriero. stable-worldmodel-v1: Reproducible world modeling research and evaluation, 2026. URLhttps://arxiv.org/abs/2602.08968
-
[53]
PyTorch: An Imperative Style, High-Performance Deep Learning Library
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performa...
work page internal anchor Pith review Pith/arXiv arXiv 2019
-
[54]
Gymnasium: A Standard Interface for Reinforcement Learning Environments
Mark Towers, Ariel Kwiatkowski, Jordan Terry, John U Balis, Gianluca De Cola, Tristan Deleu, Manuel Goulão, Andreas Kallinteris, Markus Krimmel, Arjun KG, et al. Gymnasium: A standard interface for reinforcement learning environments.arXiv preprint arXiv:2407.17032, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[55]
Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, et al. Deepmind control suite. arXiv preprint arXiv:1801.00690, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[56]
AI-generated video detection via perceptual straightening
Christian Internò, Robert Geirhos, Markus Olhofer, Sunny Liu, Barbara Hammer, and David Klindt. AI-generated video detection via perceptual straightening. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview. net/forum?id=LsmUgStXby
work page 2025
-
[57]
Temporal straightening for latent planning.arXiv preprint arXiv:2603.12231, 2026
Ying Wang, Oumayma Bounou, Gaoyue Zhou, Randall Balestriero, Tim GJ Rudner, Yann LeCun, and Mengye Ren. Temporal straightening for latent planning.arXiv preprint arXiv:2603.12231, 2026. A SIGReg SIGReg proposes to match the distribution of embeddings towards the isotropic Gaussian target distribution. Achieving that match in high-dimension is gracefully d...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.