
arxiv: 2605.26379 · v1 · pith:2KHSEK2Znew · submitted 2026-05-25 · 📊 stat.ML · cs.LG

When Does LeJEPA Learn a World Model?

Pith reviewed 2026-06-29 20:06 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords: LeJEPA · linear identifiability · world models · Gaussian regularization · latent variables · representation learning · self-supervised learning · additive noise transitions

The pith

LeJEPA linearly recovers the world's latent variables from nonlinear observations precisely when the latents are Gaussian.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proves that LeJEPA, using alignment plus Gaussian regularization, achieves linear identifiability of latent variables under stationary additive-noise dynamics. It shows the Gaussian is the only distribution in this class that permits the guarantee, because alignment penalizes every nonlinear degree of freedom via spectral decomposition while the converse excludes all alternatives. A reader would care because such identifiability is required for reliable planning and compositional generalization in learned world models. The work also establishes an approximate version that degrades gracefully and links orthogonal identifiability to optimal latent-space planning. Experiments span low-dimensional cases to 1024-dimensional robotic control tasks.

Core claim

LeJEPA (alignment plus Gaussian regularization) linearly recovers the world's latent variables from nonlinear observations in worlds where latents evolve under stationary additive-noise transitions. The central result is that the Gaussian is the unique latent distribution for which this linear identifiability holds. The forward direction follows from a spectral decomposition in which alignment strictly penalizes nonlinearity, rendering the linear map optimal; the converse rules out every non-Gaussian alternative. An approximate identifiability result is also proved, and linear orthogonal identifiability is shown to enable optimal latent-space planning.
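As a rough illustration of what "alignment plus Gaussian regularization" could look like in code, the NumPy sketch below combines a positive-pair alignment term with a Gaussianity penalty on random one-dimensional projections. The function names, the characteristic-function penalty, and the default weight `lam` are assumptions made for this example; the penalty is a simplified stand-in and is not claimed to match the SIGReg regularizer used in the paper.

```python
import numpy as np

def alignment_loss(h_a, h_b):
    """Mean squared distance between embeddings of positive pairs."""
    return float(np.mean(np.sum((h_a - h_b) ** 2, axis=1)))

def gaussian_penalty(h, num_directions=16, seed=0):
    """Stand-in Gaussianity penalty (hypothetical, not the paper's SIGReg).

    Projects h onto random unit directions and compares the empirical
    characteristic function of each projection with that of N(0, 1).
    """
    rng = np.random.default_rng(seed)
    dirs = rng.standard_normal((h.shape[1], num_directions))
    dirs /= np.linalg.norm(dirs, axis=0, keepdims=True)
    proj = h @ dirs                                   # (n, num_directions)
    t = np.linspace(-3.0, 3.0, 17)                    # evaluation points
    ecf = np.exp(1j * proj[:, :, None] * t).mean(axis=0)
    target = np.exp(-0.5 * t ** 2)                    # CF of N(0, 1)
    return float(np.mean(np.abs(ecf - target) ** 2))

def lejepa_style_loss(h_a, h_b, lam=0.05):
    """Alignment plus lam times the averaged Gaussianity penalty."""
    reg = 0.5 * (gaussian_penalty(h_a) + gaussian_penalty(h_b))
    return alignment_loss(h_a, h_b) + lam * reg
```

A collapsed embedding (all zeros) minimizes the alignment term but is heavily penalized by the Gaussianity term, which is the role the regularizer plays in the summary above.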

What carries the argument

LeJEPA's alignment objective with Gaussian regularization, which enforces linear identifiability through spectral decomposition that penalizes nonlinearity.
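One standard setting in which alignment penalizes each degree of nonlinearity more strongly is a scalar Gaussian AR(1) pair, where the normalized Hermite polynomial of degree k has expected squared pair difference 2(1 − ρ^k). The sketch below checks this numerically. It assumes this specific scalar setting and Hermite basis for illustration only; the paper's actual spectral decomposition may be stated differently, and `hermite_alignment_cost` is a hypothetical helper.

```python
import math
import numpy as np
from numpy.polynomial.hermite_e import hermeval

def hermite_alignment_cost(degree, rho, num_samples=200_000, seed=0):
    """Monte Carlo estimate of E[(f(z) - f(z'))^2] for a unit-variance
    probabilists' Hermite polynomial f of the given degree, where
    z ~ N(0, 1) and z' = rho * z + sqrt(1 - rho^2) * noise."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(num_samples)
    z_next = rho * z + np.sqrt(1.0 - rho ** 2) * rng.standard_normal(num_samples)
    coeffs = np.zeros(degree + 1)
    coeffs[degree] = 1.0                          # select He_degree
    scale = math.sqrt(math.factorial(degree))     # Var[He_k(z)] = k!
    f = hermeval(z, coeffs) / scale
    f_next = hermeval(z_next, coeffs) / scale
    return float(np.mean((f - f_next) ** 2))

# Compare estimates with the closed form 2 * (1 - rho**k).
for k in (1, 2, 3):
    print(k, hermite_alignment_cost(k, rho=0.8), 2 * (1 - 0.8 ** k))
```

In this toy setting the linear component (k = 1) is the cheapest to align, which matches the intuition that alignment favors the linear map among unit-variance features.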

If this is right

  • Linear identifiability supports reliable planning directly in the recovered latent space.
  • The guarantee applies across a broad class of worlds with stationary additive-noise transitions.
  • An approximate version of the result allows the guarantee to degrade gracefully with distribution mismatch.
  • Orthogonal linear identifiability enables optimal latent-space planning.
  • The theory converts an empirical recipe into a mathematical guarantee for world-model structure recovery.
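The link between orthogonal identifiability and planning can be illustrated with a small geometric check: if embeddings are an orthogonal transform of the latents, a straight line in embedding space maps back to a straight line in latent space, and distances are preserved. The sketch below assumes an exact orthogonal map h = zQ and uses hypothetical helper names; it illustrates only this geometric fact, not the paper's planning result.

```python
import numpy as np

def random_orthogonal(dim, seed=0):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q

def straight_line(start, goal, num_points=5):
    """Evenly spaced points on the segment from start to goal."""
    w = np.linspace(0.0, 1.0, num_points)[:, None]
    return (1.0 - w) * start + w * goal

Q = random_orthogonal(3)
z_start = np.array([1.0, 0.0, -1.0])
z_goal = np.array([0.0, 2.0, 0.5])

# Plan in embedding space, then map the path back to latent space.
path_in_h = straight_line(z_start @ Q, z_goal @ Q)
path_back_in_z = path_in_h @ Q.T
```

With a non-orthogonal or nonlinear map, the path mapped back to latent space would in general be stretched or curved, so shortest paths in the embedding would no longer be shortest paths in the latent space.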

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Designers of other self-supervised objectives for world models could incorporate similar Gaussian regularization to target identifiability.
  • Testing whether learned latents in deployed models are approximately Gaussian could serve as a practical diagnostic for planning reliability.
  • Extensions to non-stationary or multiplicative noise transitions would require new proof techniques beyond the current spectral argument.
  • The result connects to broader questions in causal representation learning about when nonlinear observations can be inverted to linear latent structure.
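The Gaussianity diagnostic suggested in the second bullet could be prototyped along the following lines. This is a minimal sketch that inspects skewness and excess kurtosis of random one-dimensional projections; the function name, the number of directions, and any thresholds applied to its output are assumptions for illustration, and moment checks are only a coarse proxy for a proper normality test.

```python
import numpy as np

def projection_moment_gaps(h, num_directions=32, seed=0):
    """Return (max |skewness|, max |excess kurtosis|) over random
    unit-norm projections of the embeddings h. Both are close to zero
    when the embedding distribution is close to Gaussian."""
    rng = np.random.default_rng(seed)
    dirs = rng.standard_normal((h.shape[1], num_directions))
    dirs /= np.linalg.norm(dirs, axis=0, keepdims=True)
    proj = (h - h.mean(axis=0)) @ dirs
    proj /= proj.std(axis=0, keepdims=True)       # standardize each projection
    skew = np.mean(proj ** 3, axis=0)
    excess_kurt = np.mean(proj ** 4, axis=0) - 3.0
    return float(np.max(np.abs(skew))), float(np.max(np.abs(excess_kurt)))

rng = np.random.default_rng(0)
gauss_gaps = projection_moment_gaps(rng.standard_normal((20_000, 2)))
uniform_gaps = projection_moment_gaps(rng.uniform(-1.0, 1.0, (20_000, 2)))
```

On this toy data the uniform embeddings show a clearly larger kurtosis gap than the Gaussian ones, which is the kind of signal such a diagnostic would look for.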

Load-bearing premise

The latent variables evolve under stationary additive-noise transitions.
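A concrete member of this class is a discrete-time Ornstein-Uhlenbeck (AR(1)) process with Gaussian noise, where the next latent is a scaled copy of the current one plus independent noise. The sketch below assumes this specific linear form and a standard-normal stationary marginal; the paper's class of transitions may be broader, and `simulate_ar1` is a hypothetical helper.

```python
import numpy as np

def simulate_ar1(num_steps, dim, rho, seed=0):
    """Simulate z[t] = rho * z[t-1] + sqrt(1 - rho^2) * eps[t] with
    eps[t] ~ N(0, I). Starting from N(0, I) keeps the marginal
    distribution equal to N(0, I) at every step (stationarity)."""
    rng = np.random.default_rng(seed)
    z = np.empty((num_steps, dim))
    z[0] = rng.standard_normal(dim)
    noise_scale = np.sqrt(1.0 - rho ** 2)
    for t in range(1, num_steps):
        z[t] = rho * z[t - 1] + noise_scale * rng.standard_normal(dim)
    return z

z = simulate_ar1(num_steps=20_000, dim=2, rho=0.9)
marginal_var = z.var(axis=0)                          # should be near 1
lag1_corr = np.corrcoef(z[:-1, 0], z[1:, 0])[0, 1]    # should be near rho
```

Consecutive states `(z[t], z[t+1])` from such a trajectory play the role of positive pairs, and ρ controls how much signal the alignment term receives.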

What would settle it

A non-Gaussian latent distribution under stationary additive-noise transitions where LeJEPA nevertheless achieves exact linear identifiability would falsify the uniqueness claim.
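Checking such a counterexample requires a way to measure linear identifiability. The figure captions report a linear R² (h → z) and an orthogonality error ∥Q̂ᵀQ̂ − I∥_F / √n; the sketch below gives one plausible way to compute both. It assumes Q̂ is the least-squares linear map from latents to embeddings and that n is the latent dimension, neither of which is confirmed by the text reproduced here.

```python
import numpy as np

def linear_r2(h, z):
    """R^2 of the best affine least-squares map from embeddings h to latents z."""
    h1 = np.column_stack([h, np.ones(len(h))])        # add intercept column
    coef, *_ = np.linalg.lstsq(h1, z, rcond=None)
    ss_res = np.sum((z - h1 @ coef) ** 2)
    ss_tot = np.sum((z - z.mean(axis=0)) ** 2)
    return float(1.0 - ss_res / ss_tot)

def orthogonality_error(h, z):
    """||Q^T Q - I||_F / sqrt(n) for the least-squares Q with h ~ z Q.
    Assumes h and z have the same dimension n (an assumption of this sketch)."""
    zc = z - z.mean(axis=0)
    hc = h - h.mean(axis=0)
    q, *_ = np.linalg.lstsq(zc, hc, rcond=None)
    n = q.shape[0]
    return float(np.linalg.norm(q.T @ q - np.eye(n)) / np.sqrt(n))

rng = np.random.default_rng(0)
z = rng.standard_normal((5000, 2))
theta = 0.7
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta), np.cos(theta)]])
h_rotated = z @ rotation          # linearly and orthogonally identified
h_warped = z ** 3                 # invertible but nonlinear
```

A rotated embedding scores R² near 1 with near-zero orthogonality error, while the cubic warp recovers the latents only partially under a linear readout.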

Figures

Figures reproduced from arXiv: 2605.26379 by David Klindt, Randall Balestriero, Yann LeCun.

Figure 1: LeJEPA learns the World Model. (left) The world has independent Gaussian latent variables. (center) An unknown nonlinear process scrambles them into the data we observe. (right) LeJEPA [2] recovers the latent variables up to rotation. We prove this is the unique optimum. Code, Lean proofs, and demo: https://github.com/klindtlab/lejepa-identifiability.
Figure 2: LeJEPA Theory Illustration. (left) The world has clean latent structure (Gaussian, disentangled) with correlated positive pairs. (center) An unknown nonlinear process produces the observations we actually see, scrambling the latent structure. (right) LeJEPA trains a representation with two objectives: pull positive pairs together (attract) and keep the embedding distribution Gaussian (SIGReg) to prevent c…
Figure 3: 2D Simulations. Points colored by the polar angle and radius of the ground-truth latent variables z ∼ N(0, I_2) (like …
Figure 4: Experimental Results. (a) Bound Verification. SIGReg runs across grid, 2D, scaling, and gennorm α=2 lie below the diagonal, confirming Thm. 3. Two near-zero outliers reflect finite-sample noise. (b) Gaussian Optimality. Linear recovery, R² (h → z), peaks at Gaussian, illustrating Thm. 2. SIGReg’s Gaussianization of h is more robust to non-Gaussian latent variable distributions than whitening. (c) Control cos…
Figure 5: Linear Identifiability Enables Latent-Space Planning. Interpolation in each encoder’s latent space between fixed start and goal frames, decoded by nearest-neighbor retrieval (…
Figure 6: Grid search over the regularization weight
Figure 7: Linear identifiability across the gennorm family for all four 2D mixings. R² (h → z) on a fixed evaluation grid as a function of latent shape α (α=1 Laplace, α=2 Gaussian). All three methods peak at α = 2 as predicted by Thm. 2; SIGReg and InfoNCE, which constrain h beyond second moments, retain a wider plateau than VICReg for heavy-tailed latents. Mean ± std over 3 seeds.
Figure 8: Orthogonality error across the gennorm family for all four 2D mixings. ∥Q̂ᵀQ̂ − I∥_F / √n on the same fixed grid (log scale). VICReg and SIGReg dip sharply near α = 2 where their constraints align with the latent distribution; InfoNCE remains roughly flat, reflecting weaker control over the linear map under fixed kernel width. Mean ± std over 3 seeds.
Figure 9: Decomposition of the recovery error into its two sources.
Figure 10: Losses predict linear identifiability. Each point is one trained encoder from one of our three experiments (2D illustrations, grid search, scaling), zoomed to the converged regime (R² > 0.9). Top left: Total loss (alignment + λ SIGReg) correlates with linear R² (h → z). Top right: Alignment loss alone is predictive of identifiability quality. Bottom left: SIGReg loss vs. R². Bottom right: SIGReg and whit…
Figure 11: DMC Reacher. The latent state z = (z_0, z_1) consists of two joint angles (shoulder and wrist) that fully determine the arm configuration. The nonlinear mixing g is the MuJoCo rendering pipeline producing 64 × 64 pixel observations.
Figure 12: Reacher trajectory latent distributions across temporal strides δ. Left: Stationary marginal of (z_0, z_1); shoulder is broad, wrist is nearly bimodal. Top row: 2D transition differences z_{t+δ} − z_t, with the per-dimension R² of the best-tuned encoder. Bottom row: Autocorrelation scatter z_t vs. z_{t+δ}, per dimension, with Pearson ρ annotated. Small δ: ρ ≈ 1, transition is trivial, alignment carries no signal. L…
Figure 13: Identifiability requires both approximate Gaussianity and non-trivial autocorrelation. For each temporal stride δ, we plot SIGReg (z-scored, averaged over random projections and subsamples, error bars show one standard deviation) against the Pearson autocorrelation ρ of the transition-difference distribution, colored by the corresponding identifiability R² (see …
Figure 14: Left: OU identifiability vs. ρ for different λ values. R² increases monotonically, with all λ values converging at high ρ. Higher λ (stronger SIGReg) helps at low ρ where the alignment signal is weak. Right: Gaussian (OU) vs. trajectory data at matched autocorrelation ρ. At the same ρ, Gaussian latents achieve substantially higher R², directly validating the converse theorem: non-Gaussian marginals reduc…
Figure 15: Per-dimension identifiability. Left: OU condition; shoulder and wrist are recovered symmetrically, consistent with the isotropic transition. Right: Trajectory condition; massive asymmetry: the wrist (R² ≈ 0 at δ = 1) recovers only at larger δ where temporal variation provides learning signal; the shoulder is consistently easier but degrades at large δ due to wrapping beyond ±π. Per-dimension ρ values are…
Figure 16: Left: λ robustness in the OU condition. For high ρ, identifiability is stable across λ; for low ρ, stronger regularization (λ = 5 × 10⁻²) compensates for the weak alignment signal. At ρ = 0.99, the highest λ degrades slightly as SIGReg begins to dominate alignment. Right: Orthogonality error decreases monotonically with ρ, consistent with the approximate bound (Thm. 3).
Figure 17: Planning: embedding and straight-line paths. Columns: true θ-space (left), Gaussian/OU encoder (center), Trajectory encoder (right). Top row: scatter of eval-set embeddings, colored by true θ-space polar angle. The Gaussian encoder is an approximate rotation of the true latent; the Trajectory encoder is visibly warped. Middle row: three example trajectories that are straight in the true joint space, rend…
Original abstract

A representation that scrambles the true degrees of freedom of the world cannot support reliable planning or compositional generalization. We prove that LeJEPA (alignment plus Gaussian regularization) linearly recovers the world's latent variables from nonlinear observations, a property known as linear identifiability, in a broad class of worlds where latents evolve under stationary, additive-noise transitions. Our main result is that among all such worlds, the Gaussian is the unique latent distribution for which this guarantee holds. The forward direction rests on a spectral decomposition in which each degree of nonlinearity is strictly penalized by alignment, making the linear map the optimum; the converse rules out every non-Gaussian alternative. We further prove an approximate identifiability result where the guarantee degrades gracefully, and show that linear, orthogonal identifiability enables optimal latent-space planning. We validate the theory with experiments ranging from 2D examples to 1024-dimensional latents, including distributional ablations and pixel-based robotic control. Our theory turns an empirically successful recipe into a mathematical guarantee, providing the foundation for building World Models that provably recover the structure of the world.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper claims to prove that LeJEPA (alignment plus Gaussian regularization) linearly recovers latent variables from nonlinear observations under stationary additive-noise transitions, with Gaussian as the unique latent distribution enabling this linear identifiability. The forward direction uses a spectral decomposition penalizing nonlinearity, the converse rules out non-Gaussians, an approximate version is shown, and linear identifiability is linked to optimal planning; experiments from 2D to 1024D latents plus robotic control support the claims.

Significance. If the central proof holds, the result supplies a mathematical guarantee that converts an empirical recipe into a foundation for world models with provable structure recovery, which would be a notable advance in representation learning for planning and generalization. The combination of spectral argument, uniqueness converse, approximate extension, and scaling experiments to high dimensions is a strength.

minor comments (3)
  1. [Abstract] The phrasing 'among all such worlds, the Gaussian is the unique latent distribution' would benefit from an explicit qualifier that uniqueness holds within the stationary additive-noise class stated in the setup.
  2. The experimental section should include a table or appendix listing exact hyperparameters, random seeds, and precise metrics (e.g., recovery error norms) for the 1024-dimensional and robotic-control runs to support reproducibility claims.
  3. Notation: ensure the alignment loss and the spectral penalty term are defined with consistent symbols before they are used in the main theorem statement.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our manuscript and the recommendation for minor revision. We appreciate the recognition that the combination of the spectral argument, uniqueness result, approximate extension, and scaling experiments constitutes a strength, and that the result could provide a foundation for world models with provable structure recovery.

Circularity Check

0 steps flagged

No significant circularity


The paper presents a mathematical proof of linear identifiability for LeJEPA under stationary additive-noise latent transitions, relying on a forward spectral decomposition that penalizes nonlinearity and a converse establishing Gaussian uniqueness. No steps reduce by construction to fitted parameters, self-referential definitions, or load-bearing self-citations; the derivation is self-contained within the stated class of worlds and does not rename known results or smuggle ansatzes via prior work. This matches the default expectation for a proof-based paper with independent mathematical content.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption of stationary additive-noise transitions for the class of worlds considered; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • Domain assumption: Latents evolve under stationary, additive-noise transitions.
    This defines the broad class of worlds for which the linear identifiability guarantee is proven, as stated in the abstract.



Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. A Generalization Theory for JEPA-Based World Models

    cs.LG 2026-06 unverdicted novelty 8.0

    The paper formulates JEPA pretraining as conditional spectral graph learning equivalent to low-rank factorization of an action-conditioned co-occurrence matrix and derives a finite-sample generalization bound connecti...

  2. Identifiability Without Gaussianity: Symbolic World Models and Near-Infinite Temporal Consistency

    stat.ML 2026-06 unverdicted novelty 7.0 partial

    PGSA achieves exact linear identifiability and near-infinite temporal consistency for non-Gaussian regimes via symbolic causal grounding, with four theorems formalized in Lean 4.

  3. Information Lattice Learning as Probabilistic Graphical Model Structure Learning

    cs.LG 2026-06 unverdicted novelty 6.0

    ILL rules on PMFs are marginal laws on deterministic quotient variables; the resulting constraint sets define log-linear factor graphs whose factors are indexed by learned abstractions, positioning ILL as interpretabl...

  4. Exact equivariance, kept through training, buys zero-shot generalisation across the symmetry group

    cs.LG 2026-06 unverdicted novelty 6.0

    Exact equivariance preserved through training makes prediction and closed-loop errors invariant across the symmetry group, enabling zero-shot generalization from a data slice to the full orbit.

  5. Dual-Channel Grounded World Modeling (DCGWM): Structural Prevention of Objective Interference Collapse via Heterogeneous External Grounding with Inward-Only Gradient Flow

    cs.LG 2026-06 unverdicted novelty 4.0

    Proposes DCGWM architecture that partitions latent space into physical and behavioral subspaces with isolated gradient flows to structurally prevent objective interference collapse in grounded JEPA world models.

Reference graph

Works this paper leans on

105 extracted references · 4 canonical work pages · cited by 5 Pith papers

  1. [1]

    A path towards autonomous machine intelligence, 2022

    Yann LeCun. A path towards autonomous machine intelligence, 2022. URL https://openreview.net/forum?id=BZ5a1r-kVsf. (Cited on pages 1, 2, 2, and 2.)

  2. [2]

    LeJEPA: Provable and scalable self-supervised learning without the heuristics. arXiv preprint arXiv:2511.08544, 2025

    Randall Balestriero and Yann LeCun. LeJEPA: Provable and scalable self-supervised learning without the heuristics. arXiv preprint arXiv:2511.08544, 2025. (Cited on pages 1, 1, 2, 2, 4, 7, and 33.)

  3. [3]

    VICReg: Variance-invariance-covariance regularization for self-supervised learning. CoRR, abs/2105.04906, 2021

    Adrien Bardes, Jean Ponce, and Yann LeCun. VICReg: Variance-invariance-covariance regularization for self-supervised learning. CoRR, abs/2105.04906, 2021. URL https://arxiv.org/abs/2105.04906. (Cited on pages 1, 2, and 7.)

  4. [4]

    Self-supervised learning from images with a joint-embedding predictive architecture

    Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15619–15629, 2023. (Cited on pages 1 and 2.)

  5. [5]

    Revisiting feature prediction for learning visual representations from video. arXiv preprint arXiv:2404.08471, 2024

    Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual representations from video. arXiv preprint arXiv:2404.08471, 2024. (Cited on pages 1 and 2.)

  6. [6]

    V-JEPA 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985, 2025

    Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Mojtaba Komeili, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, et al. V-JEPA 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985, 2025. (Cited on pages 1 and 2.)

  7. [7]

    Vlad Sobal, Wancong Zhang, Kyunghyun Cho, Randall Balestriero, Tim G. J. Rudner, and Yann LeCun. Stress-testing offline reward-free reinforcement learning: A case for planning with latent dynamics models. In 7th Robot Learning Workshop: Towards Robots with Human-Level Abilities, 2025. URL https://openreview.net/forum?id=jON7H6A9UU. (Cited on page 1.)

  8. [8]

    DINO-WM: World models on pre-trained visual features enable zero-shot planning

    Gaoyue Zhou, Hengkai Pan, Yann LeCun, and Lerrel Pinto. DINO-WM: World models on pre-trained visual features enable zero-shot planning. In Proceedings of the 42nd International Conference on Machine Learning (ICML 2025), 2025. (Cited on page 1.)

  9. [9]

    LeWorldModel: Stable end-to-end joint-embedding predictive architecture from pixels. arXiv preprint arXiv:2603.19312, 2026

    Lucas Maes, Quentin Le Lidec, Damien Scieur, Yann LeCun, and Randall Balestriero. LeWorldModel: Stable end-to-end joint-embedding predictive architecture from pixels. arXiv preprint arXiv:2603.19312, 2026. (Cited on pages 1, 2, 2, 2, 6, 7, 8, 9, 26, and 35.)

  10. [10]

    Toward causal representation learning. Proceedings of the IEEE, 109(5):612–634, 2021

    Bernhard Schölkopf, Francesco Locatello, Stefan Bauer, Nan Rosemary Ke, Nal Kalchbrenner, Anirudh Goyal, and Yoshua Bengio. Toward causal representation learning. Proceedings of the IEEE, 109(5):612–634, 2021. (Cited on page 2.)

  11. [11]

    Identifiability of latent-variable and structural-equation models: from linear to nonlinear. Annals of the Institute of Statistical Mathematics, 76(1):1–33, 2024

    Aapo Hyvärinen, Ilyes Khemakhem, and Ricardo Monti. Identifiability of latent-variable and structural-equation models: from linear to nonlinear. Annals of the Institute of Statistical Mathematics, 76(1):1–33, 2024. (Cited on pages 2 and 3.)

  12. [12]

    Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644, 2016

    Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644, 2016. (Cited on page 2.)

  13. [13]

    A model for analogical reasoning. Cognitive Psychology, 5(1):1–28, July 1973

    David E Rumelhart and Adele A Abrahamson. A model for analogical reasoning. Cognitive Psychology, 5(1):1–28, July 1973. ISSN 0010-0285. doi: 10.1016/0010-0285(73)90023-6. URL https://www.sciencedirect.com/science/article/pii/0010028573900236. (Cited on page 2.)

  14. [14]

    Learning distributed representations of concepts

    Geoffrey E Hinton. Learning distributed representations of concepts. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 8, 1986. (Cited on page 2.)

  15. [15]

    Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artificial intelligence, 46(1-2):159–216, 1990

    Paul Smolensky. Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artificial intelligence, 46(1-2):159–216, 1990. URL https://www.sciencedirect.com/science/article/pii/000437029090007M. Publisher: Elsevier. (Cited on pages 2 and 9.)

  16. [16]

    Linear Algebraic Structure of Word Senses, with Applications to Polysemy. Transactions of the Association for Computational Linguistics, 6:483–495, December 2018

    Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. Linear Algebraic Structure of Word Senses, with Applications to Polysemy. Transactions of the Association for Computational Linguistics, 6:483–495, December 2018. ISSN 2307-387X. doi: 10.1162/tacl_a_00034. URL https://direct.mit.edu/tacl/article/43451. (Cited on page 2.)

  17. [17]

    The Linear Representation Hypothesis and the Geometry of Large Language Models, July 2024

    Kiho Park, Yo Joong Choe, and Victor Veitch. The Linear Representation Hypothesis and the Geometry of Large Language Models, July 2024. URL http://arxiv.org/abs/2311.03658. arXiv:2311.03658 [cs, stat]. (Cited on page 2.)

  18. [18]

    From superposition to sparse codes: interpretable representations in neural networks. arXiv preprint arXiv:2503.01824, 2025

    David Klindt, Charles O’Neill, Patrik Reizinger, Harald Maurer, and Nina Miolane. From superposition to sparse codes: interpretable representations in neural networks. arXiv preprint arXiv:2503.01824, 2025. (Cited on pages 2, 9, and 9.)

  19. [19]

    Stop probing, start coding: Why linear probes and sparse autoencoders fail at compositional generalisation, 2026

    Vitória Barin Pacela, Shruti Joshi, Isabela Camacho, Simon Lacoste-Julien, and David Klindt. Stop probing, start coding: Why linear probes and sparse autoencoders fail at compositional generalisation, 2026. URL https://arxiv.org/abs/2603.28744. (Cited on page 2.)

  20. [20]

    Exploring simple siamese representation learning

    Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15750–15758, 2021. (Cited on page 2 and 2.)

  21. [21]

    Bootstrap your own latent: A new approach to self-supervised learning

    Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. In Advances in Neural Information Processing Systems, 2020. (Cited on page 2 and 2.)

  22. [22]

    Emerging properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9650–9660, 2021. (Cited on page 2 and 2.)

  23. [23]

    Information theory and statistical mechanics. Physical Review, 106(4):620,

    Edwin T Jaynes. Information theory and statistical mechanics. Physical Review, 106(4):620,

  24. [24]

    (Cited on pages 2, 4, and 8.)

  25. [25]

    A simple framework for contrastive learning of visual representations

    Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pages 1597–1607. PMLR, 2020. (Cited on page 2.)

  26. [26]

    Momentum contrast for unsupervised visual representation learning

    Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738, 2020. (Cited on page 2.)

  27. [27]

    Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018

    Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018. (Cited on pages 2 and 7.)

  28. [28]

    Barlow twins: Self-supervised learning via redundancy reduction. International Conference on Machine Learning,

    Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. Barlow twins: Self-supervised learning via redundancy reduction. International Conference on Machine Learning,

  29. [29]

    DINOv3. arXiv preprint arXiv:2508.10104, 2025

    Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, ...

  30. [30]

    Understanding contrastive representation learning through alignment and uniformity on the hypersphere

    Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In International conference on machine learning, pages 9929–9939. PMLR, 2020. (Cited on page 2.)

  31. [31]

    Rethinking negative pairs in code search

    Haochen Li, Xin Zhou, Luu Anh Tuan, and Chunyan Miao. Rethinking negative pairs in code search. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12760–12774, 2023. (Cited on page 2.)

  32. [32]

    On the importance of gaussianizing representations

    Daniel Eftekhari and Vardan Papyan. On the importance of gaussianizing representations. arXiv preprint arXiv:2505.00685, 2025. (Cited on page 2.)

  33. [33]

    InfoNCE induces Gaussian distribution

    Roy Betser, Eyal Gofer, Meir Yossef Levi, and Guy Gilboa. InfoNCE induces Gaussian distribution. In International Conference on Learning Representations, 2026. (Cited on page 2.)

  34. [34]

    The Nature of Explanation. Cambridge University Press, 1943

    Kenneth J W Craik. The Nature of Explanation. Cambridge University Press, 1943. (Cited on page 2.)

  35. [35]

    Cognitive maps in rats and men. Psychological Review, 55(4):189–208,

    Edward C Tolman. Cognitive maps in rats and men. Psychological Review, 55(4):189–208,

  36. [36]

    Mental Models: Towards a Cognitive Science of Language, Inference, and Consciousness. Harvard University Press, 1983

    Philip N Johnson-Laird. Mental Models: Towards a Cognitive Science of Language, Inference, and Consciousness. Harvard University Press, 1983. (Cited on page 2.)

  37. [37]

    An internal model for sensorimotor integration. Science, 269(5232):1880–1882, 1995

    Daniel M Wolpert, Zoubin Ghahramani, and Michael I Jordan. An internal model for sensorimotor integration. Science, 269(5232):1880–1882, 1995. (Cited on page 2.)

  38. [38]

    Perceptions as hypotheses. Philosophical Transactions of the Royal Society of London

    Richard L Gregory. Perceptions as hypotheses. Philosophical Transactions of the Royal Society of London. B, Biological Sciences, 290(1038):181–197, 1980. (Cited on page 2.)

  39. [39]

    The free-energy principle: a unified brain theory? Nature Reviews Neuroscience, 11(2):127–138, 2010

    Karl Friston. The free-energy principle: a unified brain theory? Nature Reviews Neuroscience, 11(2):127–138, 2010. (Cited on page 2.)

  40. [40]

    Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015

    Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015. (Cited on page 2.)

  41. [41]

    Every good regulator of a system must be a model of that system. International Journal of Systems Science, 1(2):89–97, 1970

    Roger C Conant and W Ross Ashby. Every good regulator of a system must be a model of that system. International Journal of Systems Science, 1(2):89–97, 1970. (Cited on page 2.)

  42. [42]

    The internal model principle of control theory. Automatica, 12(5):457–465, 1976

    B A Francis and W M Wonham. The internal model principle of control theory. Automatica, 12(5):457–465, 1976. (Cited on page 2.)

  43. [43]

    Dynamic Programming. Princeton University Press, 1957

    Richard Bellman. Dynamic Programming. Princeton University Press, 1957. (Cited on page 2.)

  44. [44]

    A new approach to linear filtering and prediction problems. Transactions of the ASME – Journal of Basic Engineering, 82(Series D):35–45, 1960

    Rudolph E Kalman. A new approach to linear filtering and prediction problems. Transactions of the ASME – Journal of Basic Engineering, 82(Series D):35–45, 1960. (Cited on page 2.)

  45. [45]

    Neural networks for self-learning control systems

    Derrick H Nguyen and Bernard Widrow. Neural networks for self-learning control systems. IEEE Control systems magazine, 10(3):18–23, 1990. (Cited on pages 2 and 26.)

  46. [46]

    Making the world differentiable: On using fully recurrent self-supervised neural networks for dynamic reinforcement learning and planning in non-stationary environments

    Jürgen Schmidhuber. Making the world differentiable: On using fully recurrent self-supervised neural networks for dynamic reinforcement learning and planning in non-stationary environments. Technical Report FKI-126-90, Institut für Informatik, Technische Universität München,

  47. [47]

    Integrated architectures for learning, planning, and reacting based on approximating dynamic programming

    Richard S Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning, pages 216–224, 1990. (Cited on page 2.)

  48. [48]

    Embed to control: A locally linear latent dynamics model for control from raw images

    Manuel Watter, Jost Tobias Springenberg, Joschka Boedecker, and Martin Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In Advances in Neural Information Processing Systems, 2015. (Cited on page 2.)

  49. [49]

    Action-conditional video prediction using deep networks in Atari games

    Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard L Lewis, and Satinder Singh. Action-conditional video prediction using deep networks in Atari games. In Advances in Neural Information Processing Systems, 2015. (Cited on page 2.)

  50. [50]

    Unsupervised learning for physical interaction through video prediction

    Chelsea Finn, Ian Goodfellow, and Sergey Levine. Unsupervised learning for physical interaction through video prediction. In Advances in Neural Information Processing Systems, 2016. (Cited on page 2.)

  51. [51]

    World models. arXiv preprint arXiv:1803.10122, 2(3):440, 2018

    David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2(3):440, 2018. (Cited on pages 2 and 26.)

  52. [52]

    Learning latent dynamics for planning from pixels

    Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. In International Conference on Machine Learning, pages 2555–2565. PMLR, 2019. (Cited on page 2.)

  53. [53]

    Mastering Atari, Go, chess and shogi by planning with a learned model. Nature, 588(7839):604–609, 2020

    Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, Timothy Lillicrap, and David Silver. Mastering Atari, Go, chess and shogi by planning with a learned model. Nature, 588(7839):604–609, 2020. (Cited on page 2.)

  54. [54]

    Mastering diverse control tasks through world models. Nature, 2025

    Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse control tasks through world models. Nature, 2025. (Cited on page 2.)

  55. [55]

    Video generation models as world simulators

    Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Wing Yin Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. OpenAI Technical Report, 2024. URL https://openai.com/index/video-generation-models-as-world-simulators/. (Cited on page 2.)

  56. [56]

    Genie: Generative interactive environments

    Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, Yusuf Aytar, Sarah Maria Elisabeth Bechtle, Feryal Behbahani, Stephanie C Y Chan, Nicolas Heess, Lucy Gonzalez, Simon Osindero, Sherjil Ozair, Scott Reed, Jingwei Zhang, Konrad Zolna, Jeff Clune, Nando...

  57. [57]

    Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks, 12(3):429–439, 1999

    Aapo Hyvärinen and Petteri Pajunen. Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks, 12(3):429–439, 1999. (Cited on page 2.)

  58. [58]

    Challenging common assumptions in the unsupervised learning of disentangled representations

    Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Rätsch, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. In International Conference on Machine Learning, 2019. (Cited on page 2.)

  59. [59]

    Unsupervised feature extraction by time-contrastive learning and nonlinear ICA

    Aapo Hyvärinen and Hiroshi Morioka. Unsupervised feature extraction by time-contrastive learning and nonlinear ICA. In Advances in Neural Information Processing Systems, 2016. (Cited on page 2.)

  60. [60]

    Nonlinear ICA of temporally dependent stationary sources

    Aapo Hyvärinen and Hiroshi Morioka. Nonlinear ICA of temporally dependent stationary sources. In International Conference on Artificial Intelligence and Statistics, 2017. (Cited on page 2.)

  61. [61]

    Towards nonlinear disentanglement in natural data with temporal sparse coding

    David Klindt, Lukas Schott, Yash Sharma, Ivan Ustyuzhaninov, Wieland Brendel, Matthias Bethge, and Dylan Paiton. Towards nonlinear disentanglement in natural data with temporal sparse coding. In International Conference on Learning Representations, 2021. (Cited on page 2.)

  62. [62]

    Variational autoencoders and nonlinear ICA: A unifying framework

    Ilyes Khemakhem, Diederik Kingma, Ricardo Monti, and Aapo Hyvärinen. Variational autoencoders and nonlinear ICA: A unifying framework. In International Conference on Artificial Intelligence and Statistics, 2020. (Cited on page 2.)

  63. [63]

    Nonlinear ICA Using Auxiliary Variables and Generalized Contrastive Learning

    Aapo Hyvärinen, Hiroaki Sasaki, and Richard Turner. Nonlinear ICA Using Auxiliary Variables and Generalized Contrastive Learning. In Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, pages 859–868. PMLR, April 2019. URL https://proceedings.mlr.press/v89/hyvarinen19a.html. (Cited on page 2.)

  64. [64]

    Contrastive learning inverts the data generating process

    Roland S Zimmermann, Yash Sharma, Steffen Schneider, Matthias Bethge, and Wieland Brendel. Contrastive learning inverts the data generating process. In International Conference on Machine Learning, 2021. (Cited on page 2.)

  65. [65]

    Self-supervised learning with data augmentations provably isolates content from style

    Julius von Kügelgen, Yash Sharma, Luigi Gresele, Wieland Brendel, Bernhard Schölkopf, Michel Besserve, and Francesco Locatello. Self-supervised learning with data augmentations provably isolates content from style. In Advances in Neural Information Processing Systems,

  66. [66]

    Disentanglement via mechanism sparsity regularization: A new principle for nonlinear ICA

    Sébastien Lachapelle, Pau Rodriguez, Yash Sharma, Katie E Everett, Rémi Le Priol, Alexandre Lacoste, and Simon Lacoste-Julien. Disentanglement via mechanism sparsity regularization: A new principle for nonlinear ICA. In Conference on Causal Learning and Reasoning, 2022. (Cited on page 2.)

  67. [67]

    Interventional causal representation learning

    Kartik Ahuja, Divyat Mahajan, Yixin Wang, and Yoshua Bengio. Interventional causal representation learning. In International Conference on Machine Learning, 2023. (Cited on pages 2, 9, and 26.)

  68. [68]

    Learning linear causal representations from interventions under general nonlinear mixing. Advances in Neural Information Processing Systems, 36:45419–45462, 2023

    Simon Buchholz, Goutham Rajendran, Elan Rosenfeld, Bryon Aragam, Bernhard Schölkopf, and Pradeep Ravikumar. Learning linear causal representations from interventions under general nonlinear mixing. Advances in Neural Information Processing Systems, 36:45419–45462, 2023. (Cited on pages 2, 9, and 26.)

  70. [70]

    Cross-entropy is all you need to invert the data generating process

    Patrik Reizinger, Alice Bizeul, Attila Juhos, Julia E Vogt, Randall Balestriero, Wieland Brendel, and David Klindt. Cross-entropy is all you need to invert the data generating process. arXiv preprint arXiv:2410.21869, 2024. (Cited on page 2.)

  71. [71]

    On linear identifiability of learned representations

    Geoffrey Roeder, Luke Metz, and Durk Kingma. On linear identifiability of learned representations. In International Conference on Machine Learning, pages 9030–9039. PMLR, 2021. (Cited on page 2.)

  72. [72]

    Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4):715–770, 2002

    Laurenz Wiskott and Terrence J Sejnowski. Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4):715–770, 2002. (Cited on pages 2, 28, and 29.)

  73. [73]

    Joint embedding predictive architectures focus on slow features. arXiv preprint arXiv:2211.10831, 2022

    Vlad Sobal, S V Jyothir, Siddhartha Jalagam, Nicolas Carion, Kyunghyun Cho, and Yann LeCun. Joint embedding predictive architectures focus on slow features. arXiv preprint arXiv:2211.10831, 2022. (Cited on pages 2, 5, and 28.)

  74. [74]

    An extension of slow feature analysis for nonlinear blind source separation. Journal of Machine Learning Research, 15:921–947, 2014

    Henning Sprekeler, Tiziano Zito, and Laurenz Wiskott. An extension of slow feature analysis for nonlinear blind source separation. Journal of Machine Learning Research, 15:921–947, 2014. (Cited on pages 2, 5, 22, 28, 29, 30, and 31.)

  76. [76]

    Contrastive and non-contrastive self-supervised learning recover global and local spectral embedding methods, 2022

    Randall Balestriero and Yann LeCun. Contrastive and non-contrastive self-supervised learning recover global and local spectral embedding methods, 2022. URL https://arxiv.org/abs/2205.11508. (Cited on pages 2 and 30.)

  77. [77]

    Robustness of nonlinear representation learning

    Simon Buchholz and Bernhard Schölkopf. Robustness of nonlinear representation learning. arXiv preprint arXiv:2503.15355, 2025. (Cited on pages 2, 6, and 26.)

  78. [78]

    When does closeness in distribution imply representational similarity? an identifiability perspective. arXiv preprint arXiv:2506.03784, 2025

    Beatrix MG Nielsen, Emanuele Marconato, Andrea Dittadi, and Luigi Gresele. When does closeness in distribution imply representational similarity? an identifiability perspective. arXiv preprint arXiv:2506.03784, 2025. (Cited on page 2.)

  79. [79]

    Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013

    Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013. (Cited on page 3.)

  80. [80]

    On the theory of the Brownian motion. Physical Review, 36(5):823–841, 1930

    George E Uhlenbeck and Leonard S Ornstein. On the theory of the Brownian motion. Physical Review, 36(5):823–841, 1930. (Cited on page 4.)

Showing first 80 references.