When Does LeJEPA Learn a World Model?

David Klindt; Randall Balestriero; Yann LeCun

arxiv: 2605.26379 · v1 · pith:2KHSEK2Znew · submitted 2026-05-25 · 📊 stat.ML · cs.LG

Pith reviewed 2026-06-29 20:06 UTC · model grok-4.3

keywords LeJEPA · linear identifiability · world models · Gaussian regularization · latent variables · representation learning · self-supervised learning · additive noise transitions

The pith

LeJEPA linearly recovers the world's latent variables from nonlinear observations precisely when the latents are Gaussian.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proves that LeJEPA, using alignment plus Gaussian regularization, achieves linear identifiability of latent variables under stationary additive-noise dynamics. It shows the Gaussian is the only distribution in this class that permits the guarantee, because alignment penalizes every nonlinear degree of freedom via spectral decomposition while the converse excludes all alternatives. A reader would care because such identifiability is required for reliable planning and compositional generalization in learned world models. The work also establishes an approximate version that degrades gracefully and links orthogonal identifiability to optimal latent-space planning. Experiments span low-dimensional cases to 1024-dimensional robotic control tasks.

Core claim

LeJEPA (alignment plus Gaussian regularization) linearly recovers the world's latent variables from nonlinear observations in worlds where latents evolve under stationary additive-noise transitions. The central result is that the Gaussian is the unique latent distribution for which this linear identifiability holds. The forward direction follows from a spectral decomposition in which alignment strictly penalizes nonlinearity, rendering the linear map optimal; the converse rules out every non-Gaussian alternative. An approximate identifiability result is also proved, and linear orthogonal identifiability is shown to enable optimal latent-space planning.
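To make the two-term recipe above concrete, here is a minimal Python sketch of an "alignment plus Gaussian regularization" loss. It rests on assumptions made for illustration: the function names, the weight `lam`, and the moment-matching penalty on random one-dimensional projections are hypothetical stand-ins chosen here. The penalty is not the paper's SIGReg statistic, and none of this is the authors' implementation.

```python
import numpy as np

def alignment_loss(h_a, h_b):
    """Mean squared distance between embeddings of positive pairs."""
    return float(np.mean(np.sum((h_a - h_b) ** 2, axis=1)))

def gaussianity_penalty(h, num_projections=64, seed=0):
    """Illustrative stand-in for a Gaussian regularizer (NOT SIGReg).

    Projects embeddings onto random unit directions and penalizes the gap
    between the first four moments of each projection and those of N(0, 1).
    """
    rng = np.random.default_rng(seed)
    dirs = rng.standard_normal((h.shape[1], num_projections))
    dirs /= np.linalg.norm(dirs, axis=0, keepdims=True)
    proj = h @ dirs                      # shape (n_samples, num_projections)
    mean = proj.mean(axis=0)
    centered = proj - mean
    var = (centered ** 2).mean(axis=0)
    third = (centered ** 3).mean(axis=0)
    fourth = (centered ** 4).mean(axis=0)
    # N(0, 1) targets: mean 0, variance 1, third moment 0, fourth moment 3.
    gap = mean ** 2 + (var - 1.0) ** 2 + third ** 2 + (fourth - 3.0) ** 2
    return float(gap.mean())

def lejepa_style_loss(h_a, h_b, lam=0.05):
    """Alignment plus lam times the average penalty over both views."""
    reg = 0.5 * (gaussianity_penalty(h_a) + gaussianity_penalty(h_b))
    return alignment_loss(h_a, h_b) + lam * reg
```

In this sketch a collapsed embedding (all zeros) has zero alignment loss but a large penalty, which shows why alignment alone is not enough and a distributional term is needed.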

What carries the argument

LeJEPA's alignment objective with Gaussian regularization, which enforces linear identifiability through spectral decomposition that penalizes nonlinearity.

If this is right

Linear identifiability supports reliable planning directly in the recovered latent space.
The guarantee applies across a broad class of worlds with stationary additive-noise transitions.
An approximate version of the result allows the guarantee to degrade gracefully with distribution mismatch.
Orthogonal linear identifiability enables optimal latent-space planning.
The theory converts an empirical recipe into a mathematical guarantee for world-model structure recovery.
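The planning consequence listed above can be illustrated with a small Python sketch. It assumes a setup resembling the one described in the Figure 5 caption (interpolation in latent space, decoded by nearest-neighbor retrieval). The function name `plan_by_interpolation` and the retrieval bank `h_bank` are hypothetical names introduced here, not the authors' code.

```python
import numpy as np

def plan_by_interpolation(h_start, h_goal, h_bank, num_waypoints=5):
    """Straight-line plan in embedding space, decoded by nearest neighbor.

    h_start, h_goal: embeddings of the start and goal observations, shape (d,).
    h_bank: embeddings of a bank of known observations, shape (n, d).
    Returns the bank index of the nearest embedding for each waypoint.
    """
    alphas = np.linspace(0.0, 1.0, num_waypoints)[:, None]
    waypoints = (1.0 - alphas) * h_start + alphas * h_goal
    # Pairwise distances between waypoints and bank entries.
    dists = np.linalg.norm(waypoints[:, None, :] - h_bank[None, :, :], axis=2)
    return dists.argmin(axis=1)
```

If the encoder is an orthogonal linear map of the true latents, distances and straight lines are preserved, so a straight path in embedding space retrieves the same states as a straight path in the true latent space. A warped encoder offers no such guarantee.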

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Designers of other self-supervised objectives for world models could incorporate similar Gaussian regularization to target identifiability.
Testing whether learned latents in deployed models are approximately Gaussian could serve as a practical diagnostic for planning reliability.
Extensions to non-stationary or multiplicative noise transitions would require new proof techniques beyond the current spectral argument.
The result connects to broader questions in causal representation learning about when nonlinear observations can be inverted to linear latent structure.
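The Gaussianity diagnostic suggested in the list above could be prototyped along the following lines. This is a sketch under assumptions: the function name, the use of random one-dimensional projections, and the Kolmogorov-Smirnov distance to a standard normal are illustrative choices made here, not a procedure taken from the paper.

```python
import math
import numpy as np

# Standard normal CDF, vectorized over arrays via math.erf.
_normal_cdf = np.vectorize(lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0))))

def projection_ks_distance(h, num_projections=32, seed=0):
    """Average KS distance to N(0, 1) over random 1D projections of h.

    Each projection is standardized first, so the score reflects shape only.
    Values near zero suggest approximately Gaussian projections.
    Assumes no projection has zero variance.
    """
    rng = np.random.default_rng(seed)
    n, d = h.shape
    dirs = rng.standard_normal((d, num_projections))
    dirs /= np.linalg.norm(dirs, axis=0, keepdims=True)
    proj = h @ dirs
    proj = (proj - proj.mean(axis=0)) / proj.std(axis=0)
    proj.sort(axis=0)                       # sort each projection independently
    cdf = _normal_cdf(proj)
    upper = np.arange(1, n + 1)[:, None] / n  # empirical CDF just after each point
    lower = np.arange(0, n)[:, None] / n      # empirical CDF just before each point
    ks = np.maximum(np.abs(upper - cdf), np.abs(cdf - lower)).max(axis=0)
    return float(ks.mean())
```

A threshold for "approximately Gaussian" would have to be calibrated per sample size. In high dimensions random projections of many non-Gaussian distributions look close to Gaussian, so a low score is weak evidence at best.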

Load-bearing premise

The latent variables evolve under stationary additive-noise transitions.
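One concrete member of this class is a Gaussian AR(1) process, a discrete-time analogue of the Ornstein-Uhlenbeck process. The Python sketch below simulates it. The figure captions mention an OU condition, but whether the paper parameterizes it exactly this way is an assumption, and the function name and arguments are hypothetical.

```python
import numpy as np

def simulate_ar1_latents(num_steps, dim, rho, seed=0):
    """Simulate z_{t+1} = rho * z_t + sqrt(1 - rho^2) * eps_t, eps_t ~ N(0, I).

    The noise is additive and independent of z_t, and the noise scale is
    chosen so that N(0, I) is the stationary marginal for any |rho| < 1.
    """
    rng = np.random.default_rng(seed)
    z = np.empty((num_steps, dim))
    z[0] = rng.standard_normal(dim)        # start in the stationary distribution
    noise_scale = np.sqrt(1.0 - rho ** 2)
    for t in range(1, num_steps):
        z[t] = rho * z[t - 1] + noise_scale * rng.standard_normal(dim)
    return z
```

Consecutive states `(z[t], z[t + 1])` can then serve as positive pairs. Their correlation is `rho`, the autocorrelation quantity that several figure captions vary.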

What would settle it

A non-Gaussian latent distribution under stationary additive-noise transitions where LeJEPA nevertheless achieves exact linear identifiability would falsify the uniqueness claim.

Figures

Figures reproduced from arXiv: 2605.26379 by David Klindt, Randall Balestriero, Yann LeCun.

**Figure 1.** LeJEPA learns the World Model. (left) The world has independent Gaussian latent variables. (center) An unknown nonlinear process scrambles them into the data we observe. (right) LeJEPA [2] recovers the latent variables up to rotation. We prove this is the unique optimum. Code, Lean proofs, and demo: https://github.com/klindtlab/lejepa-identifiability.

**Figure 2.** LeJEPA Theory Illustration. (left) The world has clean latent structure (Gaussian, disentangled) with correlated positive pairs. (center) An unknown nonlinear process produces the observations we actually see, scrambling the latent structure. (right) LeJEPA trains a representation with two objectives: pull positive pairs together (attract) and keep the embedding distribution Gaussian (SIGReg) to prevent c…

**Figure 3.** 2D Simulations. Points colored by the polar angle and radius of the ground-truth latent variables z ∼ N(0, I₂) (like …

**Figure 4.** Experimental Results. (a) Bound Verification. SIGReg runs across grid, 2D, scaling, and gennorm α=2 lie below the diagonal, confirming Thm. 3. Two near-zero outliers reflect finite-sample noise. (b) Gaussian Optimality. Linear recovery, R² (h → z), peaks at Gaussian, illustrating Thm. 2. SIGReg's Gaussianization of h is more robust to non-Gaussian latent variable distributions than whitening. (c) Control cos…

**Figure 5.** Linear Identifiability Enables Latent-Space Planning. Interpolation in each encoder's latent space between fixed start and goal frames, decoded by nearest-neighbor retrieval (…

**Figure 6.** Grid search over the regularization weight

**Figure 7.** Linear identifiability across the gennorm family for all four 2D mixings. R² (h → z) on a fixed evaluation grid as a function of latent shape α (α=1 Laplace, α=2 Gaussian). All three methods peak at α = 2 as predicted by Thm. 2; SIGReg and InfoNCE, which constrain h beyond second moments, retain a wider plateau than VICReg for heavy-tailed latents. Mean ± std over 3 seeds.

**Figure 8.** Orthogonality error across the gennorm family for all four 2D mixings. ∥Q̂⊤Q̂ − I∥_F / √n on the same fixed grid (log scale). VICReg and SIGReg dip sharply near α = 2 where their constraints align with the latent distribution; InfoNCE remains roughly flat, reflecting weaker control over the linear map under fixed kernel width. Mean ± std over 3 seeds.

**Figure 9.** Decomposition of the recovery error into its two sources.

**Figure 10.** Losses predict linear identifiability. Each point is one trained encoder from one of our three experiments (2D illustrations, grid search, scaling), zoomed to the converged regime (R² > 0.9). Top left: Total loss (alignment + λ SIGReg) correlates with linear R² (h → z). Top right: Alignment loss alone is predictive of identifiability quality. Bottom left: SIGReg loss vs. R². Bottom right: SIGReg and whit…

**Figure 11.** DMC Reacher. The latent state z = (z₀, z₁) consists of two joint angles (shoulder and wrist) that fully determine the arm configuration. The nonlinear mixing g is the MuJoCo rendering pipeline producing 64 × 64 pixel observations.

**Figure 12.** Reacher trajectory latent distributions across temporal strides δ. Left: Stationary marginal of (z₀, z₁); shoulder is broad, wrist is nearly bimodal. Top row: 2D transition differences z_{t+δ} − z_t, with the per-dimension R² of the best-tuned encoder. Bottom row: Autocorrelation scatter z_t vs. z_{t+δ}, per dimension, with Pearson ρ annotated. Small δ: ρ ≈ 1, transition is trivial, alignment carries no signal. L…

**Figure 13.** Identifiability requires both approximate Gaussianity and non-trivial autocorrelation. For each temporal stride δ, we plot SIGReg (z-scored, averaged over random projections and subsamples, error bars show one standard deviation) against the Pearson autocorrelation ρ of the transition-difference distribution, colored by the corresponding identifiability R² (see …

**Figure 14.** Left: OU identifiability vs. ρ for different λ values. R² increases monotonically, with all λ values converging at high ρ. Higher λ (stronger SIGReg) helps at low ρ where the alignment signal is weak. Right: Gaussian (OU) vs. trajectory data at matched autocorrelation ρ. At the same ρ, Gaussian latents achieve substantially higher R², directly validating the converse theorem: non-Gaussian marginals reduc…

**Figure 15.** Per-dimension identifiability. Left: OU condition; shoulder and wrist are recovered symmetrically, consistent with the isotropic transition. Right: Trajectory condition; massive asymmetry: the wrist (R² ≈ 0 at δ = 1) recovers only at larger δ where temporal variation provides learning signal; the shoulder is consistently easier but degrades at large δ due to wrapping beyond ±π. Per-dimension ρ values are…

**Figure 16.** Left: λ robustness in the OU condition. For high ρ, identifiability is stable across λ; for low ρ, stronger regularization (λ = 5 × 10⁻²) compensates for the weak alignment signal. At ρ = 0.99, the highest λ degrades slightly as SIGReg begins to dominate alignment. Right: Orthogonality error decreases monotonically with ρ, consistent with the approximate bound (Thm. 3).

**Figure 17.** Planning: embedding and straight-line paths. Columns: true θ-space (left), Gaussian/OU encoder (center), Trajectory encoder (right). Top row: scatter of eval-set embeddings, colored by true θ-space polar angle. The Gaussian encoder is an approximate rotation of the true latent; the Trajectory encoder is visibly warped. Middle row: three example trajectories that are straight in the true joint space, rend…
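Two quantities recur in the captions above: the linear recovery score R² (h → z) and the orthogonality error ∥Q̂⊤Q̂ − I∥_F / √n. The Python sketch below shows one plausible way to compute both. It makes assumptions that the captions do not confirm: Q̂ is taken to be the least-squares linear map from the true latents z to the embeddings h, n is taken to be the latent dimension, and the function names are hypothetical.

```python
import numpy as np

def linear_r2(h, z):
    """Average R^2 of a least-squares linear (affine) fit from h to z."""
    design = np.column_stack([h, np.ones(len(h))])   # add intercept column
    coef, *_ = np.linalg.lstsq(design, z, rcond=None)
    residual = z - design @ coef
    ss_res = np.sum(residual ** 2, axis=0)
    ss_tot = np.sum((z - z.mean(axis=0)) ** 2, axis=0)
    return float(np.mean(1.0 - ss_res / ss_tot))

def orthogonality_error(h, z):
    """||Q^T Q - I||_F / sqrt(n), with Q the least-squares map from z to h.

    Assumes h and z have the same dimension n, so that Q is square.
    """
    z_c = z - z.mean(axis=0)
    h_c = h - h.mean(axis=0)
    q_hat, *_ = np.linalg.lstsq(z_c, h_c, rcond=None)
    n = z.shape[1]
    return float(np.linalg.norm(q_hat.T @ q_hat - np.eye(n)) / np.sqrt(n))
```

Under these definitions an embedding that is an exact rotation of z scores R² = 1 with orthogonality error near zero, while a coordinate-wise nonlinear warp such as h = z³ lowers R² even though it keeps all the information about z.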

read the original abstract

A representation that scrambles the true degrees of freedom of the world cannot support reliable planning or compositional generalization. We prove that LeJEPA (alignment plus Gaussian regularization) linearly recovers the world's latent variables from nonlinear observations, a property known as linear identifiability, in a broad class of worlds where latents evolve under stationary, additive-noise transitions. Our main result is that among all such worlds, the Gaussian is the unique latent distribution for which this guarantee holds. The forward direction rests on a spectral decomposition in which each degree of nonlinearity is strictly penalized by alignment, making the linear map the optimum; the converse rules out every non-Gaussian alternative. We further prove an approximate identifiability result where the guarantee degrades gracefully, and show that linear, orthogonal identifiability enables optimal latent-space planning. We validate the theory with experiments ranging from 2D examples to 1024-dimensional latents, including distributional ablations and pixel-based robotic control. Our theory turns an empirically successful recipe into a mathematical guarantee, providing the foundation for building World Models that provably recover the structure of the world.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LeJEPA gets linear identifiability of latents only under Gaussian distributions, via a spectral argument that looks new.

The main thing to know is that this paper claims to prove linear identifiability for LeJEPA (alignment plus Gaussian regularization) and shows that the Gaussian is the unique latent distribution that delivers the guarantee, for worlds whose latents follow stationary additive-noise transitions.

The forward direction uses a spectral decomposition where alignment strictly penalizes each degree of nonlinearity, so the linear map becomes the optimum. The converse rules out non-Gaussian cases. They also give an approximate version that degrades gracefully and tie the result to optimal latent-space planning. Those pieces are new relative to earlier empirical work on JEPA-style methods. The experiments, running from 2D toy cases up to 1024-dimensional latents with pixel-based robotic control and distributional ablations, supply concrete checks on the theory.

The central assumption, stationary additive-noise transitions, is stated clearly, but it restricts the class of worlds covered: real dynamics often have multiplicative noise or non-stationarity, so the result applies only within that class. The approximate identifiability claim would benefit from explicit error bounds in the full text. The proof sketch shows no obvious circularity, and the citations stay on relevant identifiability literature rather than self-referential loops.

This paper is aimed at researchers who want theoretical grounding for world models that support planning and generalization. Readers working on representation learning or self-supervised methods will get the most out of the uniqueness result and the planning corollary. It is worth sending to peer review because the claim is specific, the argument outline is coherent, and the empirical support is present. The paper shows no obvious flaw on its own terms, so referees can concentrate on checking the derivations and the scope.

Referee Report

0 major / 3 minor

Summary. The paper claims to prove that LeJEPA (alignment plus Gaussian regularization) linearly recovers latent variables from nonlinear observations under stationary additive-noise transitions, with Gaussian as the unique latent distribution enabling this linear identifiability. The forward direction uses a spectral decomposition penalizing nonlinearity, the converse rules out non-Gaussians, an approximate version is shown, and linear identifiability is linked to optimal planning; experiments from 2D to 1024D latents plus robotic control support the claims.

Significance. If the central proof holds, the result supplies a mathematical guarantee that converts an empirical recipe into a foundation for world models with provable structure recovery, which would be a notable advance in representation learning for planning and generalization. The combination of spectral argument, uniqueness converse, approximate extension, and scaling experiments to high dimensions is a strength.

minor comments (3)

[Abstract] The phrasing 'among all such worlds, the Gaussian is the unique latent distribution' would benefit from an explicit qualifier that uniqueness holds within the stationary additive-noise class stated in the setup.
The experimental section should include a table or appendix listing exact hyperparameters, random seeds, and precise metrics (e.g., recovery error norms) for the 1024-dimensional and robotic-control runs to support reproducibility claims.
Notation: ensure the definition of the alignment loss and the spectral penalty term are introduced with consistent symbols before their use in the main theorem statement.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our manuscript and the recommendation for minor revision. We appreciate the recognition that the combination of the spectral argument, uniqueness result, approximate extension, and scaling experiments constitutes a strength, and that the result could provide a foundation for world models with provable structure recovery.

Circularity Check

0 steps flagged

No significant circularity

The paper presents a mathematical proof of linear identifiability for LeJEPA under stationary additive-noise latent transitions, relying on a forward spectral decomposition that penalizes nonlinearity and a converse establishing Gaussian uniqueness. No steps reduce by construction to fitted parameters, self-referential definitions, or load-bearing self-citations; the derivation is self-contained within the stated class of worlds and does not rename known results or smuggle ansatzes via prior work. This matches the default expectation for a proof-based paper with independent mathematical content.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption of stationary additive-noise transitions for the class of worlds considered; no free parameters or invented entities are introduced in the abstract.

axioms (1)

Domain assumption: latents evolve under stationary, additive-noise transitions.
This defines the broad class of worlds for which the linear identifiability guarantee is proven, as stated in the abstract.

pith-pipeline@v0.9.1-grok · 5721 in / 1328 out tokens · 37554 ms · 2026-06-29T20:06:06.346100+00:00 · methodology

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

A Generalization Theory for JEPA-Based World Models
cs.LG 2026-06 unverdicted novelty 8.0

The paper formulates JEPA pretraining as conditional spectral graph learning equivalent to low-rank factorization of an action-conditioned co-occurrence matrix and derives a finite-sample generalization bound connecti...

Identifiability Without Gaussianity: Symbolic World Models and Near-Infinite Temporal Consistency
stat.ML 2026-06 unverdicted novelty 7.0 partial

PGSA achieves exact linear identifiability and near-infinite temporal consistency for non-Gaussian regimes via symbolic causal grounding, with four theorems formalized in Lean 4.

Information Lattice Learning as Probabilistic Graphical Model Structure Learning
cs.LG 2026-06 unverdicted novelty 6.0

ILL rules on PMFs are marginal laws on deterministic quotient variables; the resulting constraint sets define log-linear factor graphs whose factors are indexed by learned abstractions, positioning ILL as interpretabl...

Exact equivariance, kept through training, buys zero-shot generalisation across the symmetry group
cs.LG 2026-06 unverdicted novelty 6.0

Exact equivariance preserved through training makes prediction and closed-loop errors invariant across the symmetry group, enabling zero-shot generalization from a data slice to the full orbit.

Dual-Channel Grounded World Modeling (DCGWM): Structural Prevention of Objective Interference Collapse via Heterogeneous External Grounding with Inward-Only Gradient Flow
cs.LG 2026-06 unverdicted novelty 4.0

Proposes DCGWM architecture that partitions latent space into physical and behavioral subspaces with isolated gradient flows to structurally prevent objective interference collapse in grounded JEPA world models.

Reference graph

Works this paper leans on

105 extracted references · 4 canonical work pages · cited by 5 Pith papers

[1] Yann LeCun. A path towards autonomous machine intelligence, 2022. URL https://openreview.net/forum?id=BZ5a1r-kVsf. (Cited on pages 1 and 2.)
[2] Randall Balestriero and Yann LeCun. LeJEPA: Provable and scalable self-supervised learning without the heuristics. arXiv preprint arXiv:2511.08544, 2025. (Cited on pages 1, 2, 4, 7, and 33.)
[3] Adrien Bardes, Jean Ponce, and Yann LeCun. VICReg: Variance-invariance-covariance regularization for self-supervised learning. CoRR, abs/2105.04906, 2021. URL https://arxiv.org/abs/2105.04906. (Cited on pages 1, 2, and 7.)
[4] Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15619–15629, 2023. (Cited on pages 1 and 2.)
[5] Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual representations from video. arXiv preprint arXiv:2404.08471, 2024. (Cited on pages 1 and 2.)
[6] Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Mojtaba Komeili, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, et al. V-JEPA 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985, 2025. (Cited on pages 1 and 2.)
[7] Vlad Sobal, Wancong Zhang, Kyunghyun Cho, Randall Balestriero, Tim G. J. Rudner, and Yann LeCun. Stress-testing offline reward-free reinforcement learning: A case for planning with latent dynamics models. In 7th Robot Learning Workshop: Towards Robots with Human-Level Abilities, 2025. URL https://openreview.net/forum?id=jON7H6A9UU. (Cited on page 1.)
[8] Gaoyue Zhou, Hengkai Pan, Yann LeCun, and Lerrel Pinto. DINO-WM: World models on pre-trained visual features enable zero-shot planning. In Proceedings of the 42nd International Conference on Machine Learning (ICML 2025), 2025. (Cited on page 1.)
[9] Lucas Maes, Quentin Le Lidec, Damien Scieur, Yann LeCun, and Randall Balestriero. LeWorldModel: Stable end-to-end joint-embedding predictive architecture from pixels. arXiv preprint arXiv:2603.19312, 2026. (Cited on pages 1, 2, 6, 7, 8, 9, 26, and 35.)
[10] Bernhard Schölkopf, Francesco Locatello, Stefan Bauer, Nan Rosemary Ke, Nal Kalchbrenner, Anirudh Goyal, and Yoshua Bengio. Toward causal representation learning. Proceedings of the IEEE, 109(5):612–634, 2021. (Cited on page 2.)
[11] Aapo Hyvärinen, Ilyes Khemakhem, and Ricardo Monti. Identifiability of latent-variable and structural-equation models: from linear to nonlinear. Annals of the Institute of Statistical Mathematics, 76(1):1–33, 2024. (Cited on pages 2 and 3.)
[12] Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644, 2016. (Cited on page 2.)
[13] David E Rumelhart and Adele A Abrahamson. A model for analogical reasoning. Cognitive Psychology, 5(1):1–28, July 1973. ISSN 0010-0285. doi: 10.1016/0010-0285(73)90023-6. URL https://www.sciencedirect.com/science/article/pii/0010028573900236. (Cited on page 2.)
[14] Geoffrey E Hinton. Learning distributed representations of concepts. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 8, 1986. (Cited on page 2.)
[15] Paul Smolensky. Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artificial Intelligence, 46(1-2):159–216, 1990. URL https://www.sciencedirect.com/science/article/pii/000437029090007M. Publisher: Elsevier. (Cited on pages 2 and 9.)
[16] Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. Linear Algebraic Structure of Word Senses, with Applications to Polysemy. Transactions of the Association for Computational Linguistics, 6:483–495, December 2018. ISSN 2307-387X. doi: 10.1162/tacl_a_00034. URL https://direct.mit.edu/tacl/article/43451. (Cited on page 2.)
[17] Kiho Park, Yo Joong Choe, and Victor Veitch. The Linear Representation Hypothesis and the Geometry of Large Language Models, July 2024. URL http://arxiv.org/abs/2311.03658. arXiv:2311.03658 [cs, stat]. (Cited on page 2.)
[18] David Klindt, Charles O'Neill, Patrik Reizinger, Harald Maurer, and Nina Miolane. From superposition to sparse codes: interpretable representations in neural networks. arXiv preprint arXiv:2503.01824, 2025. (Cited on pages 2 and 9.)
[19] Vitória Barin Pacela, Shruti Joshi, Isabela Camacho, Simon Lacoste-Julien, and David Klindt. Stop probing, start coding: Why linear probes and sparse autoencoders fail at compositional generalisation, 2026. URL https://arxiv.org/abs/2603.28744. (Cited on page 2.)
[20] Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15750–15758, 2021. (Cited on page 2.)
[21] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. In Advances in Neural Information Processing Systems, 2020. (Cited on page 2.)
[22] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9650–9660, 2021. (Cited on page 2.)
[23] Edwin T Jaynes. Information theory and statistical mechanics. Physical Review, 106(4):620,
[24] (Cited on pages 2, 4, and 8.)
[25] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pages 1597–1607. PMLR, 2020. (Cited on page 2.)
[26] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738, 2020. (Cited on page 2.)
[27] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018. (Cited on pages 2 and 7.)
[28] Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. Barlow twins: Self-supervised learning via redundancy reduction. International Conference on Machine Learning,
[29] Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, ... DINOv3. arXiv preprint arXiv:2508.10104, 2025.
[30] Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In International Conference on Machine Learning, pages 9929–9939. PMLR, 2020. (Cited on page 2.)
[31] Haochen Li, Xin Zhou, Luu Anh Tuan, and Chunyan Miao. Rethinking negative pairs in code search. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12760–12774, 2023. (Cited on page 2.)
[32] Daniel Eftekhari and Vardan Papyan. On the importance of gaussianizing representations. arXiv preprint arXiv:2505.00685, 2025. (Cited on page 2.)
[33] Roy Betser, Eyal Gofer, Meir Yossef Levi, and Guy Gilboa. InfoNCE induces Gaussian distribution. In International Conference on Learning Representations, 2026. (Cited on page 2.)
[34] Kenneth J W Craik. The Nature of Explanation. Cambridge University Press, 1943. (Cited on page 2.)
[35] Edward C Tolman. Cognitive maps in rats and men. Psychological Review, 55(4):189–208,
[36] Philip N Johnson-Laird. Mental Models: Towards a Cognitive Science of Language, Inference, and Consciousness. Harvard University Press, 1983. (Cited on page 2.)
[37] Daniel M Wolpert, Zoubin Ghahramani, and Michael I Jordan. An internal model for sensorimotor integration. Science, 269(5232):1880–1882, 1995. (Cited on page 2.)
[38] Richard L Gregory. Perceptions as hypotheses. Philosophical Transactions of the Royal Society of London. B, Biological Sciences, 290(1038):181–197, 1980. (Cited on page 2.)
[39] Karl Friston. The free-energy principle: a unified brain theory? Nature Reviews Neuroscience, 11(2):127–138, 2010. (Cited on page 2.)
[40] Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015. (Cited on page 2.)
[41] Roger C Conant and W Ross Ashby. Every good regulator of a system must be a model of that system. International Journal of Systems Science, 1(2):89–97, 1970. (Cited on page 2.)
[42]

The internal model principle of control theory.Automatica, 12(5):457–465, 1976

B A Francis and W M Wonham. The internal model principle of control theory.Automatica, 12(5):457–465, 1976. (Cited on page 2.)

1976
[43]

Princeton University Press, 1957

Richard Bellman.Dynamic Programming. Princeton University Press, 1957. (Cited on page 2.)

1957
[44]

Rudolph E Kalman. A new approach to linear filtering and prediction problems. Transactions of the ASME – Journal of Basic Engineering, 82(Series D):35–45, 1960. (Cited on page 2.)

[45]

Derrick H Nguyen and Bernard Widrow. Neural networks for self-learning control systems. IEEE Control Systems Magazine, 10(3):18–23, 1990. (Cited on pages 2 and 26.)

[46]

Jürgen Schmidhuber. Making the world differentiable: On using fully recurrent self-supervised neural networks for dynamic reinforcement learning and planning in non-stationary environments. Technical Report FKI-126-90, Institut für Informatik, Technische Universität München,

[47]

Richard S Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning, pages 216–224, 1990. (Cited on page 2.)

[48]

Manuel Watter, Jost Tobias Springenberg, Joschka Boedecker, and Martin Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In Advances in Neural Information Processing Systems, 2015. (Cited on page 2.)

[49]

Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard L Lewis, and Satinder Singh. Action-conditional video prediction using deep networks in Atari games. In Advances in Neural Information Processing Systems, 2015. (Cited on page 2.)

[50]

Chelsea Finn, Ian Goodfellow, and Sergey Levine. Unsupervised learning for physical interaction through video prediction. In Advances in Neural Information Processing Systems, 2016. (Cited on page 2.)

[51]

David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2(3):440, 2018. (Cited on pages 2 and 26.)

[52]

Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. In International Conference on Machine Learning, pages 2555–2565. PMLR, 2019. (Cited on page 2.)

[53]

Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, Timothy Lillicrap, and David Silver. Mastering Atari, Go, chess and shogi by planning with a learned model. Nature, 588(7839):604–609, 2020. (Cited on page 2.)

[54]

Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse control tasks through world models. Nature, 2025. (Cited on page 2.)

[55]

Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Wing Yin Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. OpenAI Technical Report, 2024. URL https://openai.com/index/video-generation-models-as-world-simulators/. (Cited on page 2.)

[56]

Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, Yusuf Aytar, Sarah Maria Elisabeth Bechtle, Feryal Behbahani, Stephanie C Y Chan, Nicolas Heess, Lucy Gonzalez, Simon Osindero, Sherjil Ozair, Scott Reed, Jingwei Zhang, Konrad Zolna, Jeff Clune, Nando...

[57]

Aapo Hyvärinen and Petteri Pajunen. Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks, 12(3):429–439, 1999. (Cited on page 2.)

[58]

Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Rätsch, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. In International Conference on Machine Learning, 2019. (Cited on page 2.)

[59]

Aapo Hyvärinen and Hiroshi Morioka. Unsupervised feature extraction by time-contrastive learning and nonlinear ICA. In Advances in Neural Information Processing Systems, 2016. (Cited on page 2.)

[60]

Aapo Hyvärinen and Hiroshi Morioka. Nonlinear ICA of temporally dependent stationary sources. In International Conference on Artificial Intelligence and Statistics, 2017. (Cited on page 2.)

[61]

David Klindt, Lukas Schott, Yash Sharma, Ivan Ustyuzhaninov, Wieland Brendel, Matthias Bethge, and Dylan Paiton. Towards nonlinear disentanglement in natural data with temporal sparse coding. In International Conference on Learning Representations, 2021. (Cited on page 2.)

[62]

Ilyes Khemakhem, Diederik Kingma, Ricardo Monti, and Aapo Hyvärinen. Variational autoencoders and nonlinear ICA: A unifying framework. In International Conference on Artificial Intelligence and Statistics, 2020. (Cited on page 2.)

[63]

Aapo Hyvarinen, Hiroaki Sasaki, and Richard Turner. Nonlinear ICA Using Auxiliary Variables and Generalized Contrastive Learning. In Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, pages 859–868. PMLR, April 2019. URL https://proceedings.mlr.press/v89/hyvarinen19a.html. (Cited on page 2.)

[64]

Roland S Zimmermann, Yash Sharma, Steffen Schneider, Matthias Bethge, and Wieland Brendel. Contrastive learning inverts the data generating process. In International Conference on Machine Learning, 2021. (Cited on page 2.)

[65]

Julius von Kügelgen, Yash Sharma, Luigi Gresele, Wieland Brendel, Bernhard Schölkopf, Michel Besserve, and Francesco Locatello. Self-supervised learning with data augmentations provably isolates content from style. In Advances in Neural Information Processing Systems,

[66]

Sébastien Lachapelle, Pau Rodriguez, Yash Sharma, Katie E Everett, Rémi Le Priol, Alexandre Lacoste, and Simon Lacoste-Julien. Disentanglement via mechanism sparsity regularization: A new principle for nonlinear ICA. In Conference on Causal Learning and Reasoning, 2022. (Cited on page 2.)

[67]

Kartik Ahuja, Divyat Mahajan, Yixin Wang, and Yoshua Bengio. Interventional causal representation learning. In International Conference on Machine Learning, 2023. (Cited on pages 2, 9, and 26.)

[68]

Simon Buchholz, Goutham Rajendran, Elan Rosenfeld, Bryon Aragam, Bernhard Schölkopf, and Pradeep Ravikumar. Learning linear causal representations from interventions under general nonlinear mixing. Advances in Neural Information Processing Systems, 36:45419–45462,

[69]

(Cited on pages 2, 9, and 26.)

[70]

Patrik Reizinger, Alice Bizeul, Attila Juhos, Julia E Vogt, Randall Balestriero, Wieland Brendel, and David Klindt. Cross-entropy is all you need to invert the data generating process. arXiv preprint arXiv:2410.21869, 2024. (Cited on page 2.)

[71]

Geoffrey Roeder, Luke Metz, and Durk Kingma. On linear identifiability of learned representations. In International Conference on Machine Learning, pages 9030–9039. PMLR, 2021. (Cited on page 2.)

[72]

Laurenz Wiskott and Terrence J Sejnowski. Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4):715–770, 2002. (Cited on pages 2, 28, and 29.)

[73]

Vlad Sobal, S V Jyothir, Siddhartha Jalagam, Nicolas Carion, Kyunghyun Cho, and Yann LeCun. Joint embedding predictive architectures focus on slow features. arXiv preprint arXiv:2211.10831, 2022. (Cited on pages 2, 5, and 28.)

[74]

Henning Sprekeler, Tiziano Zito, and Laurenz Wiskott. An extension of slow feature analysis for nonlinear blind source separation. Journal of Machine Learning Research, 15:921–947,

[75]

(Cited on pages 2, 5, 5, 22, 28, 29, 29, 29, 29, 30, 30, 30, 30, 30, and 31.)

[76]

Randall Balestriero and Yann LeCun. Contrastive and non-contrastive self-supervised learning recover global and local spectral embedding methods, 2022. URL https://arxiv.org/abs/2205.11508. (Cited on pages 2 and 30.)

[77]

Simon Buchholz and Bernhard Schölkopf. Robustness of nonlinear representation learning. arXiv preprint arXiv:2503.15355, 2025. (Cited on pages 2, 6, and 26.)

[78]

Beatrix MG Nielsen, Emanuele Marconato, Andrea Dittadi, and Luigi Gresele. When does closeness in distribution imply representational similarity? An identifiability perspective. arXiv preprint arXiv:2506.03784, 2025. (Cited on page 2.)

[79]

Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013. (Cited on page 3 and 3.)

[80]

George E Uhlenbeck and Leonard S Ornstein. On the theory of the Brownian motion. Physical Review, 36(5):823–841, 1930. (Cited on page 4.)



[31] [31]

Rethinking negative pairs in code search

Haochen Li, Xin Zhou, Luu Anh Tuan, and Chunyan Miao. Rethinking negative pairs in code search. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12760–12774, 2023. (Cited on page 2.)

2023

[32] [32]

On the importance of gaussianizing representations

Daniel Eftekhari and Vardan Papyan. On the importance of gaussianizing representations. arXiv preprint arXiv:2505.00685, 2025. (Cited on page 2.)

arXiv 2025

[33] [33]

InfoNCE induces Gaussian dis- tribution

Roy Betser, Eyal Gofer, Meir Yossef Levi, and Guy Gilboa. InfoNCE induces Gaussian dis- tribution. InInternational Conference on Learning Representations, 2026. (Cited on page 2.)

2026

[34] [34]

Cambridge University Press, 1943

Kenneth J W Craik.The Nature of Explanation. Cambridge University Press, 1943. (Cited on page 2.)

1943

[35] [35]

Cognitive maps in rats and men.Psychological Review, 55(4):189–208,

Edward C Tolman. Cognitive maps in rats and men.Psychological Review, 55(4):189–208,

[36] [36]

Harvard University Press, 1983

Philip N Johnson-Laird.Mental Models: Towards a Cognitive Science of Language, Inference, and Consciousness. Harvard University Press, 1983. (Cited on page 2.)

1983

[37] [37]

An internal model for sensori- motor integration.Science, 269(5232):1880–1882, 1995

Daniel M Wolpert, Zoubin Ghahramani, and Michael I Jordan. An internal model for sensori- motor integration.Science, 269(5232):1880–1882, 1995. (Cited on page 2.)

1995

[38] [38]

Perceptions as hypotheses.Philosophical Transactions of the Royal Society of London

Richard L Gregory. Perceptions as hypotheses.Philosophical Transactions of the Royal Society of London. B, Biological Sciences, 290(1038):181–197, 1980. (Cited on page 2.)

1980

[39] [39]

The free-energy principle: a unified brain theory?Nature Reviews Neuroscience, 11(2):127–138, 2010

Karl Friston. The free-energy principle: a unified brain theory?Nature Reviews Neuroscience, 11(2):127–138, 2010. (Cited on page 2.)

2010

[40] [40]

Human-level concept learning through probabilistic program induction.Science, 350(6266):1332–1338, 2015

Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction.Science, 350(6266):1332–1338, 2015. (Cited on page 2.)

2015

[41] [41]

Every good regulator of a system must be a model of that system.International Journal of Systems Science, 1(2):89–97, 1970

Roger C Conant and W Ross Ashby. Every good regulator of a system must be a model of that system.International Journal of Systems Science, 1(2):89–97, 1970. (Cited on page 2.)

1970

[42] [42]

The internal model principle of control theory.Automatica, 12(5):457–465, 1976

B A Francis and W M Wonham. The internal model principle of control theory.Automatica, 12(5):457–465, 1976. (Cited on page 2.)

1976

[43] [43]

Princeton University Press, 1957

Richard Bellman.Dynamic Programming. Princeton University Press, 1957. (Cited on page 2.)

1957

[44] [44]

A new approach to linear filtering and prediction problems.Transactions of the ASME – Journal of Basic Engineering, 82(Series D):35–45, 1960

Rudolph E Kalman. A new approach to linear filtering and prediction problems.Transactions of the ASME – Journal of Basic Engineering, 82(Series D):35–45, 1960. (Cited on page 2.)

1960

[45] [45]

Neural networks for self-learning control systems

Derrick H Nguyen and Bernard Widrow. Neural networks for self-learning control systems. IEEE Control systems magazine, 10(3):18–23, 1990. (Cited on pages 2 and 26.)

1990

[46] [46]

Making the world differentiable: On using fully recurrent self-supervised neural networks for dynamic reinforcement learning and planning in non-stationary environ- ments

J ¨urgen Schmidhuber. Making the world differentiable: On using fully recurrent self-supervised neural networks for dynamic reinforcement learning and planning in non-stationary environ- ments. Technical Report FKI-126-90, Institut f¨ur Informatik, Technische Universit¨at M¨unchen,

[47] [47]

Integrated architectures for learning, planning, and reacting based on ap- proximating dynamic programming

Richard S Sutton. Integrated architectures for learning, planning, and reacting based on ap- proximating dynamic programming. InProceedings of the Seventh International Conference on Machine Learning, pages 216–224, 1990. (Cited on page 2.)

1990

[48] [48]

Embed to control: A locally linear latent dynamics model for control from raw images

Manuel Watter, Jost Tobias Springenberg, Joschka Boedecker, and Martin Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. InAdvances in Neural Information Processing Systems, 2015. (Cited on page 2.) 12

2015

[49] [49]

Action- conditional video prediction using deep networks in Atari games

Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard L Lewis, and Satinder Singh. Action- conditional video prediction using deep networks in Atari games. InAdvances in Neural In- formation Processing Systems, 2015. (Cited on page 2.)

2015

[50] [50]

Unsupervised learning for physical interac- tion through video prediction

Chelsea Finn, Ian Goodfellow, and Sergey Levine. Unsupervised learning for physical interac- tion through video prediction. InAdvances in Neural Information Processing Systems, 2016. (Cited on page 2.)

2016

[51] [51]

World models.arXiv preprint arXiv:1803.10122, 2(3): 440, 2018

David Ha and J ¨urgen Schmidhuber. World models.arXiv preprint arXiv:1803.10122, 2(3): 440, 2018. (Cited on pages 2 and 26.)

Pith/arXiv arXiv 2018

[52] [52]

Learning latent dynamics for planning from pixels

Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. InInternational Confer- ence on Machine Learning, pages 2555–2565. PMLR, 2019. (Cited on page 2.)

2019

[53] [53]

Mastering Atari, Go, chess and shogi by planning with a learned model.Nature, 588(7839):604–609, 2020

Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, Timothy Lillicrap, and David Silver. Mastering Atari, Go, chess and shogi by planning with a learned model.Nature, 588(7839):604–609, 2020. (Cited on page 2.)

2020

[54] [54]

Mastering diverse control tasks through world models.Nature, 2025

Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse control tasks through world models.Nature, 2025. (Cited on page 2.)

2025

[55] [55]

Video generation models as world simulators

Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Wing Yin Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. OpenAI Technical Report, 2024. URLhttps: //openai.com/index/video-generation-models-as-world-simulators/. (Cited on page 2.)

2024

[56] [56]

Genie: Generative interactive envi- ronments

Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, Yusuf Aytar, Sarah Maria Elisabeth Bechtle, Feryal Behbahani, Stephanie C Y Chan, Nicolas Heess, Lucy Gon- zalez, Simon Osindero, Sherjil Ozair, Scott Reed, Jingwei Zhang, Konrad Zolna, Jeff Clune, Nando...

2024

[57] [57]

Nonlinear independent component analysis: Existence and uniqueness results.Neural Networks, 12(3):429–439, 1999

Aapo Hyv ¨arinen and Petteri Pajunen. Nonlinear independent component analysis: Existence and uniqueness results.Neural Networks, 12(3):429–439, 1999. (Cited on page 2.)

1999

[58] [58]

Challenging common assumptions in the unsupervised learn- ing of disentangled representations

Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar R ¨atsch, Sylvain Gelly, Bernhard Sch¨olkopf, and Olivier Bachem. Challenging common assumptions in the unsupervised learn- ing of disentangled representations. InInternational Conference on Machine Learning, 2019. (Cited on page 2.)

2019

[59] [59]

Unsupervised feature extraction by time-contrastive learning and nonlinear ICA

Aapo Hyv ¨arinen and Hiroshi Morioka. Unsupervised feature extraction by time-contrastive learning and nonlinear ICA. InAdvances in Neural Information Processing Systems, 2016. (Cited on page 2.)

2016

[60] [60]

Nonlinear ICA of temporally dependent stationary sources

Aapo Hyv ¨arinen and Hiroshi Morioka. Nonlinear ICA of temporally dependent stationary sources. InInternational Conference on Artificial Intelligence and Statistics, 2017. (Cited on page 2.)

2017

[61] [61]

Towards nonlinear disentanglement in natural data with temporal sparse coding

David Klindt, Lukas Schott, Yash Sharma, Ivan Ustyuzhaninov, Wieland Brendel, Matthias Bethge, and Dylan Paiton. Towards nonlinear disentanglement in natural data with temporal sparse coding. In International Conference on Learning Representations, 2021. (Cited on page 2.)

2021

[62] [62]

Variational autoencoders and nonlinear ICA: A unifying framework

Ilyes Khemakhem, Diederik Kingma, Ricardo Monti, and Aapo Hyvärinen. Variational autoencoders and nonlinear ICA: A unifying framework. In International Conference on Artificial Intelligence and Statistics, 2020. (Cited on page 2.)

2020

[63] [63]

Nonlinear ICA Using Auxiliary Variables and Generalized Contrastive Learning

Aapo Hyvärinen, Hiroaki Sasaki, and Richard Turner. Nonlinear ICA Using Auxiliary Variables and Generalized Contrastive Learning. In Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, pages 859–868. PMLR, April 2019. URL https://proceedings.mlr.press/v89/hyvarinen19a.html. (Cited on page 2.)

2019

[64] [64]

Contrastive learning inverts the data generating process

Roland S Zimmermann, Yash Sharma, Steffen Schneider, Matthias Bethge, and Wieland Brendel. Contrastive learning inverts the data generating process. In International Conference on Machine Learning, 2021. (Cited on page 2.)

2021

[65] [65]

Self-supervised learning with data augmentations provably isolates content from style

Julius von Kügelgen, Yash Sharma, Luigi Gresele, Wieland Brendel, Bernhard Schölkopf, Michel Besserve, and Francesco Locatello. Self-supervised learning with data augmentations provably isolates content from style. In Advances in Neural Information Processing Systems,

[66] [66]

Disentanglement via mechanism sparsity regularization: A new principle for nonlinear ICA

Sébastien Lachapelle, Pau Rodriguez, Yash Sharma, Katie E Everett, Rémi Le Priol, Alexandre Lacoste, and Simon Lacoste-Julien. Disentanglement via mechanism sparsity regularization: A new principle for nonlinear ICA. In Conference on Causal Learning and Reasoning, 2022. (Cited on page 2.)

2022

[67] [67]

Interventional causal representation learning

Kartik Ahuja, Divyat Mahajan, Yixin Wang, and Yoshua Bengio. Interventional causal representation learning. In International Conference on Machine Learning, 2023. (Cited on pages 2, 9, and 26.)

2023

[68] [68]

Learning linear causal representations from interventions under general nonlinear mixing. Advances in Neural Information Processing Systems, 36:45419–45462,

Simon Buchholz, Goutham Rajendran, Elan Rosenfeld, Bryon Aragam, Bernhard Schölkopf, and Pradeep Ravikumar. Learning linear causal representations from interventions under general nonlinear mixing. Advances in Neural Information Processing Systems, 36:45419–45462,

[69] [69]

(Cited on pages 2, 9, and 26.)

[70] [70]

Cross-entropy is all you need to invert the data generating process

Patrik Reizinger, Alice Bizeul, Attila Juhos, Julia E Vogt, Randall Balestriero, Wieland Brendel, and David Klindt. Cross-entropy is all you need to invert the data generating process. arXiv preprint arXiv:2410.21869, 2024. (Cited on page 2.)

arXiv 2024

[71] [71]

On linear identifiability of learned representations

Geoffrey Roeder, Luke Metz, and Durk Kingma. On linear identifiability of learned representations. In International Conference on Machine Learning, pages 9030–9039. PMLR, 2021. (Cited on page 2.)

2021

[72] [72]

Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4):715–770, 2002

Laurenz Wiskott and Terrence J Sejnowski. Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4):715–770, 2002. (Cited on pages 2, 28, and 29.)

2002

[73] [73]

Joint embedding predictive architectures focus on slow features. arXiv preprint arXiv:2211.10831, 2022

Vlad Sobal, S V Jyothir, Siddhartha Jalagam, Nicolas Carion, Kyunghyun Cho, and Yann LeCun. Joint embedding predictive architectures focus on slow features. arXiv preprint arXiv:2211.10831, 2022. (Cited on pages 2, 5, and 28.)

arXiv 2022

[74] [74]

An extension of slow feature analysis for nonlinear blind source separation. Journal of Machine Learning Research, 15:921–947,

Henning Sprekeler, Tiziano Zito, and Laurenz Wiskott. An extension of slow feature analysis for nonlinear blind source separation. Journal of Machine Learning Research, 15:921–947,

[75] [75]

(Cited on pages 2, 5, 22, 28, 29, 30, and 31.)

[76] [76]

Contrastive and non-contrastive self-supervised learning recover global and local spectral embedding methods, 2022

Randall Balestriero and Yann LeCun. Contrastive and non-contrastive self-supervised learning recover global and local spectral embedding methods, 2022. URL https://arxiv.org/abs/2205.11508. (Cited on pages 2 and 30.)

arXiv 2022

[77] [77]

Robustness of nonlinear representation learning

Simon Buchholz and Bernhard Schölkopf. Robustness of nonlinear representation learning. arXiv preprint arXiv:2503.15355, 2025. (Cited on pages 2, 6, and 26.)

arXiv 2025

[78] [78]

When does closeness in distribution imply representational similarity? an identifiability perspective. arXiv preprint arXiv:2506.03784, 2025

Beatrix MG Nielsen, Emanuele Marconato, Andrea Dittadi, and Luigi Gresele. When does closeness in distribution imply representational similarity? an identifiability perspective. arXiv preprint arXiv:2506.03784, 2025. (Cited on page 2.)

arXiv 2025

[79] [79]

Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013

Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013. (Cited on page 3.)

2013

[80] [80]

On the theory of the Brownian motion. Physical Review, 36(5):823–841, 1930

George E Uhlenbeck and Leonard S Ornstein. On the theory of the Brownian motion. Physical Review, 36(5):823–841, 1930. (Cited on page 4.)

1930