PEIRA: Learning Predictive Encoders through Inter-View Regressor Alignment

Basile Terver; Jean Ponce; Michael Arbel

arxiv: 2605.17671 · v1 · pith:UWKJ7EHVnew · submitted 2026-05-17 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-20 13:37 UTC · model grok-4.3

keywords self-supervised learning · non-contrastive SSL · canonical correlation analysis · representation learning · stability analysis · predictive encoders · JEPA

The pith

PEIRA defines a non-contrastive SSL objective as the trace of the optimal linear regressor between views to guarantee stable nontrivial equilibria aligned with canonical correlation subspaces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Non-contrastive self-supervised learning methods such as SimSiam, BYOL, I-JEPA and DINO perform well yet lack a clear objective and are therefore hard to analyze. The paper first studies the dynamics of a regularized linear regressor variant of the Joint Embedding Predictive Architecture that predicts one view's representation from the other. Stable non-collapsed equilibria turn out to coincide with leading nonlinear canonical correlation subspaces, although collapsed solutions can also be attractors. PEIRA is then introduced by taking the trace of this optimal regressor as an explicit loss; the resulting dynamics admit only nontrivial global minimizers as stable points, recovering the same subspaces while regularization selects the effective dimension. ImageNet-1K and CIFAR-10 experiments show performance on par with VICReg and LeJEPA.

Core claim

PEIRA defines a non-contrastive SSL objective through the trace of the optimal linear regressor that predicts representations of one view from the other. Its only stable equilibria are the nontrivial global minimizers, which recover the same leading nonlinear canonical correlation subspaces recovered by the earlier stability analysis, with regularization selecting the effective dimension of the learned representations.
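
The quantity at the center of this claim can be illustrated with a minimal NumPy sketch. It assumes the "optimal linear regressor" is a ridge regressor computed from centered features; the function name `regressor_trace`, the toy data, and the value of `lam` are illustrative and not taken from the paper, and the sign and scaling of the actual PEIRA loss are not specified here.

```python
import numpy as np

def regressor_trace(U, V, lam):
    """Trace of the ridge regressor predicting V from U.

    U, V: (n, d) arrays of centered view representations (assumed shape).
    lam:  ridge regularization strength (hypothetical parameter name).
    """
    n, d = U.shape
    cov_uu = U.T @ U / n
    cov_uv = U.T @ V / n
    # Closed-form ridge solution W = (cov_uu + lam * I)^{-1} cov_uv
    W = np.linalg.solve(cov_uu + lam * np.eye(d), cov_uv)
    return float(np.trace(W))

rng = np.random.default_rng(0)
shared = rng.standard_normal((2000, 4))           # content common to both views
U = shared + 0.1 * rng.standard_normal((2000, 4))
V = shared + 0.1 * rng.standard_normal((2000, 4))
U -= U.mean(0)
V -= V.mean(0)

aligned = regressor_trace(U, V, lam=0.1)          # strongly correlated views
collapsed = regressor_trace(np.zeros_like(U), np.zeros_like(V), lam=0.1)
print(aligned, collapsed)
```

In this toy setting the collapsed (all-zero) representation yields a zero trace while correlated views yield a trace close to the feature dimension, which is the intuition behind using the trace to separate collapsed from nontrivial solutions.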

What carries the argument

The trace of the optimal linear regressor between inter-view representations, used as an explicit objective that forces alignment with canonical correlation subspaces and prevents collapse.

If this is right

Stability analysis of the regressor-based predictor explains why self-distillation avoids collapse without explicit negative samples.
Regularization alone can control the effective dimensionality of the representation without extra hyperparameters.
PEIRA supplies a theoretically grounded loss that replaces heuristic self-distillation in non-contrastive pipelines.
The same regressor-trace objective can be applied to other predictive architectures that map multiple views.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Extending the linear-regressor analysis to kernel or deep versions of canonical correlation analysis could produce nonlinear variants of PEIRA.
The stability guarantees may help diagnose collapse modes observed in other non-contrastive methods on different data modalities.
Replacing the linear regressor with a small neural network inside the objective might preserve the theoretical properties while increasing expressivity.

Load-bearing premise

The analyzed variant of the Joint Embedding Predictive Architecture that uses a regularized linear regressor to predict representations across views accurately captures the essential dynamics of practical non-contrastive SSL methods such as SimSiam, BYOL, I-JEPA or DINO.

What would settle it

Compute the leading nonlinear canonical correlation subspaces on a dataset independently, train PEIRA, and check whether the learned representations align with those subspaces or collapse when regularization strength is varied.
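
One way to quantify the alignment check described above is through principal angles between the learned subspace and a reference subspace. The sketch below is a generic NumPy implementation; the helper name `principal_angles` and the toy matrices are assumptions for illustration, and obtaining the reference nonlinear CCA subspace is outside its scope.

```python
import numpy as np

def principal_angles(A, B):
    """Principal angles (radians, ascending) between the column spaces of A and B."""
    Qa, _ = np.linalg.qr(A)   # orthonormal basis of span(A)
    Qb, _ = np.linalg.qr(B)   # orthonormal basis of span(B)
    sigma = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    # Singular values are cosines of the principal angles; clip for safety.
    return np.arccos(np.clip(sigma, -1.0, 1.0))

rng = np.random.default_rng(0)
basis = rng.standard_normal((10, 3))
# Same subspace expressed in a different basis: angles should be near zero.
same = principal_angles(basis, basis @ rng.standard_normal((3, 3)))

e = np.eye(4)
# Orthogonal coordinate planes: angles should equal pi / 2.
orth = principal_angles(e[:, :2], e[:, 2:])
print(same, orth)
```

Angles near zero indicate that the two subspaces coincide, regardless of the basis chosen inside each subspace, which matters because learned representations are only identified up to an invertible linear map.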

Figures

Figures reproduced from arXiv: 2605.17671 by Basile Terver, Jean Ponce, Michael Arbel.

**Figure 1.** Empirical dynamics of PEIRA on CIFAR-10, on a single training run. Left: alignment $\alpha_i = e_i^\top N_{U,V} e_i / (\|e_i\| \, \|N_{U,V} e_i\|) \in [0, 1]$ of the top eigenvectors $e_i$ of the signal $\Sigma_{U,V}$ with respect to the noise $N_{U,V}$ during training. $\alpha_i = 1$ implies $e_i$ is also an eigenvector of $N_{U,V}$. Both matrices gradually align their eigenvectors during training. Middle: spectrum of $\Sigma_{U,V}$ during training develops additional sig…
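
The alignment score in the Figure 1 caption can be computed as in the following sketch. It assumes both matrices are symmetric positive semidefinite so that the score lies in [0, 1]; the function name `eigvec_alignment` and the toy matrices are hypothetical, and how the paper builds the signal and noise matrices is not reproduced here.

```python
import numpy as np

def eigvec_alignment(signal, noise, k):
    """alpha_i = e_i^T N e_i / (||e_i|| ||N e_i||) for the top-k eigenvectors e_i of `signal`."""
    _, eigvecs = np.linalg.eigh(signal)        # eigenvalues in ascending order
    top = eigvecs[:, ::-1][:, :k]              # top-k eigenvectors as columns
    Ne = noise @ top
    num = np.einsum("ij,ij->j", top, Ne)       # e_i^T N e_i for each column
    den = np.linalg.norm(top, axis=0) * np.linalg.norm(Ne, axis=0)
    return num / den

signal = np.diag([3.0, 2.0, 1.0])
shared_noise = np.diag([0.5, 0.2, 0.1])        # same eigenvectors as `signal`
alpha_shared = eigvec_alignment(signal, shared_noise, k=3)

rng = np.random.default_rng(0)
M = rng.standard_normal((3, 3))
generic_noise = M @ M.T                        # random PSD matrix, generic eigenvectors
alpha_generic = eigvec_alignment(signal, generic_noise, k=3)
print(alpha_shared, alpha_generic)
```

By the Cauchy-Schwarz inequality the score equals 1 exactly when the noise matrix maps an eigenvector to a positive multiple of itself, which matches the caption's reading of the score.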

**Figure 2.** Study of PEIRA regularization on ImageNet-1K. ResNet-50, 100 epochs, frozen PEIRA configuration of Table G.1; 1–3 random seeds per cell. Left: sensitivity of the final online linear-probe top-1 accuracy to the regularizer $\lambda$; each dot is one sweep cell, pooling over $\eta_{\min}$, optimizer weight decay, and gradient clipping. Middle: backbone entropy-based effective rank [Garrido et al., 2023a] versus epoch, color…
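
For readers unfamiliar with the entropy-based effective rank mentioned in the Figure 2 caption, here is a minimal sketch: the exponential of the Shannon entropy of the normalized singular values of a feature matrix. The function name, the `eps` constant, and the toy inputs are assumptions; the exact normalization used in the paper's plots may differ.

```python
import numpy as np

def effective_rank(Z, eps=1e-12):
    """Entropy-based effective rank of a feature matrix Z of shape (n_samples, dim).

    Assumes Z is not identically zero.
    """
    sigma = np.linalg.svd(Z, compute_uv=False)
    p = sigma / sigma.sum() + eps              # normalized singular values
    return float(np.exp(-(p * np.log(p)).sum()))

# Eight orthonormal feature directions with equal energy: effective rank near 8.
full = effective_rank(np.eye(16)[:, :8])

# Rank-one features (every sample identical): effective rank near 1.
collapsed = effective_rank(np.outer(np.ones(16), np.ones(8)))
print(full, collapsed)
```

A value close to the embedding dimension indicates that variance is spread across all directions, while a value near 1 signals dimensional collapse.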

Abstract

Non-contrastive self-supervised learning (SSL) is an effective framework for predictive representation learning, but popular (and in practice effective) methods such as SimSiam, BYOL, I-JEPA or DINO, which rely on a form of self-distillation to train a teacher-student network, remain poorly understood as they typically do not minimize a well-defined objective. We analyze the dynamics of a variant of the Joint Embedding Predictive Architecture (JEPA) using a regularized linear regressor to predict the learned representations of two views of the data from one another, and fully characterize its stability: non-collapsed stable equilibria align with leading nonlinear canonical correlation subspaces, while collapsed equilibria may also be stable attractors. Motivated by this result, we introduce PEIRA, a non-contrastive SSL method with an explicit objective defined through the trace of the optimal linear regressor. We show that its only stable equilibria are nontrivial global minimizers and recover the same canonical correlation subspaces, with regularization selecting the effective dimension. Experiments on ImageNet-1K and CIFAR-10 show PEIRA is competitive with VICReg and LeJEPA baselines, and qualitative empirical results support the theory.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

PEIRA gives a stability analysis for a linear-regressor JEPA variant and defines an explicit objective from the trace of the optimal predictor, but the empirical link to CCA subspaces stays qualitative.

The main thing to know is that the paper analyzes the dynamics of a JEPA-style setup that uses a regularized linear regressor to predict representations across views, then introduces PEIRA as an objective built directly from the trace of that optimal regressor. It claims the only stable equilibria are the nontrivial ones that recover nonlinear canonical correlation subspaces, with regularization controlling the dimension.

Referee Report

2 major / 2 minor

Summary. The paper analyzes the dynamics of a regularized linear regressor variant of the Joint Embedding Predictive Architecture (JEPA) for non-contrastive SSL, fully characterizing its stability to show that non-collapsed stable equilibria align with leading nonlinear canonical correlation subspaces. It then introduces PEIRA, which explicitly defines an objective as the trace of the optimal linear regressor between two views, proving that its only stable equilibria are nontrivial global minimizers recovering the same CCA subspaces (with regularization selecting effective dimension). Experiments report competitive downstream accuracy on ImageNet-1K and CIFAR-10 versus VICReg and LeJEPA, with qualitative results supporting the theory.

Significance. If the stability analysis and equilibrium characterization hold, the work offers a concrete theoretical bridge between popular non-contrastive methods (SimSiam, BYOL, I-JEPA, DINO) and CCA, explaining collapse avoidance and providing an explicit, optimizable objective. The explicit regressor-trace formulation and the claim that regularization controls dimension are notable strengths that could inform future SSL design. Competitive empirical results indicate practical relevance, though stronger direct validation of the CCA-subspace recovery would increase impact.

major comments (2)

[§4] §4 (stability analysis of the JEPA variant): the proof that only nontrivial global minimizers are stable assumes the linear regressor is optimal at each step, but the derivation does not explicitly bound the deviation when the encoder is updated simultaneously; this leaves open whether the claimed stability carries over to the joint optimization used in PEIRA.
[§5] §5 (experiments): downstream accuracy on ImageNet-1K and CIFAR-10 is reported, but no quantitative metric (e.g., principal-angle distance or average canonical correlation with the top-k nonlinear CCA directions computed on a controlled synthetic dataset) is given to test whether SGD trajectories actually reach the analyzed CCA equilibria rather than succeeding via unrelated mechanisms.

minor comments (2)

[Abstract] The abstract states that 'qualitative empirical results support the theory' without naming the visualizations or datasets used; a one-sentence clarification would improve readability.
[§2] Notation for the two views (x, x') and the encoder outputs (z, z') is introduced gradually; consolidating the definitions in §2 would reduce forward references.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive review. We appreciate the recognition of the theoretical contributions linking non-contrastive SSL to CCA and the identification of areas where additional clarification and validation would strengthen the manuscript. We address each major comment below and describe the revisions we will incorporate.

Referee: [§4] §4 (stability analysis of the JEPA variant): the proof that only nontrivial global minimizers are stable assumes the linear regressor is optimal at each step, but the derivation does not explicitly bound the deviation when the encoder is updated simultaneously; this leaves open whether the claimed stability carries over to the joint optimization used in PEIRA.

Authors: We thank the referee for this precise observation. The analysis in Section 4 characterizes the equilibria and stability under the assumption that the linear regressor is at its optimum for fixed encoders, corresponding to the exact objective whose critical points we analyze. In PEIRA the loss is defined directly as the trace of this optimal regressor (computable in closed form for the linear case), so the gradient with respect to the encoders is well-defined. In the practical joint optimization we approximate the optimum via a few inner steps or small learning-rate regimes. We will revise Section 4 to include a brief discussion of the approximation error under standard small-step-size conditions for the encoders, showing that the stability conclusions carry over in the continuous-time limit. This addresses the concern without altering the core claims.

revision: yes

Referee: [§5] §5 (experiments): downstream accuracy on ImageNet-1K and CIFAR-10 is reported, but no quantitative metric (e.g., principal-angle distance or average canonical correlation with the top-k nonlinear CCA directions computed on a controlled synthetic dataset) is given to test whether SGD trajectories actually reach the analyzed CCA equilibria rather than succeeding via unrelated mechanisms.

Authors: We agree that a direct quantitative test of CCA-subspace recovery would provide stronger empirical corroboration of the theory. The reported results emphasize practical downstream performance together with qualitative visualizations that are consistent with the predicted alignment. In the revised manuscript we will add controlled synthetic experiments (data generated from known nonlinear CCA structures) and report quantitative metrics including principal angles and average canonical correlations between the learned subspaces and the top-k ground-truth nonlinear CCA directions. These additions will appear in a new subsection of the experiments and directly address whether optimization reaches the analyzed equilibria.

revision: yes
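
The average canonical correlation metric discussed in this exchange could be computed along the following lines. This is a sketch of plain linear CCA between two feature matrices, used as a stand-in for the nonlinear CCA of the paper; the function name, the small `reg` term, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def canonical_correlations(X, Y, reg=1e-8):
    """Canonical correlations (descending) between feature matrices X and Y of shape (n, d)."""
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    n = X.shape[0]
    cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    cxy = X.T @ Y / n
    # Whiten the cross-covariance with Cholesky factors: T = Lx^{-1} cxy Ly^{-T}
    Lx = np.linalg.cholesky(cxx)
    Ly = np.linalg.cholesky(cyy)
    T = np.linalg.solve(Lx, cxy)
    T = np.linalg.solve(Ly, T.T).T
    return np.linalg.svd(T, compute_uv=False)

rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 3))
A = np.array([[2.0, 1.0, 0.0], [0.0, 1.0, 1.0], [1.0, 0.0, 3.0]])  # invertible map
corr_linear = canonical_correlations(X, X @ A)                      # same subspace
corr_indep = canonical_correlations(X, rng.standard_normal((5000, 3)))
print(corr_linear.mean(), corr_indep.mean())
```

Features related by an invertible linear map give correlations near 1, while independent features give correlations near 0, so the mean of the returned vector serves as a basis-independent agreement score.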

Circularity Check

1 steps flagged

Objective defined via trace of optimal regressor between representations carries moderate definitional dependence but central CCA equivalence holds independently

specific steps

self definitional [Section 3 (PEIRA objective definition)]
"we introduce PEIRA, a non-contrastive SSL method with an explicit objective defined through the trace of the optimal linear regressor"

The loss is the trace of the regressor that is optimal for the very representations being optimized, so any equilibrium analysis begins from a quantity that is already a direct function of the current encoder outputs rather than an external target.

full rationale

The paper defines PEIRA's objective explicitly as the trace of the optimal linear regressor between the two view representations. This makes the loss value depend directly on the current features by construction, which is the source of the noted moderate burden. However, the stability analysis then derives that equilibria correspond to nonlinear CCA subspaces without reducing the CCA claim itself to a tautology or self-citation; the derivation uses standard linear algebra on the regressor coefficients. No load-bearing self-citation or imported uniqueness theorem is required for the core result. The assumption that the linear-regressor JEPA variant captures practical SSL dynamics is stated as an approximation rather than proven by construction. Overall the derivation chain remains self-contained against external CCA benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on modeling non-contrastive SSL dynamics via a regularized linear regressor and on the definition of the objective as the trace of that regressor; no new physical entities are introduced.

free parameters (1)

regularization parameter
Controls selection of effective representation dimension in the stability analysis and PEIRA objective.

axioms (1)

domain assumption: The variant of JEPA with a regularized linear regressor models the essential dynamics of self-distillation-based non-contrastive SSL methods.
Invoked to motivate the stability analysis and the design of PEIRA.

Reference graph

Works this paper leans on

[1] G. Andrew, R. Arora, J. Bilmes, and K. Livescu. Deep canonical correlation analysis. In International Conference on Machine Learning, pages 1247--1255. PMLR, 2013.

[2] M. Arbel and J. Mairal. Amortized implicit differentiation for stochastic bilevel optimization. In International Conference on Learning Representations (ICLR), Nov. 2022. URL https://hal.archives-ouvertes.fr/hal-03455458. Working paper or preprint.

[3] M. Assran, Q. Duval, I. Misra, P. Bojanowski, P. Vincent, M. Rabbat, Y. LeCun, and N. Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15619--15629, 2023.

[4] M. Assran, A. Bardes, D. Fan, Q. Garrido, R. Howes, M. Muckley, A. Rizvi, C. Roberts, K. Sinha, A. Zholus, et al. V-JEPA 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985, 2025.

[5] R. Balestriero and Y. LeCun. Contrastive and non-contrastive self-supervised learning recover global and local spectral embedding methods. Advances in Neural Information Processing Systems, 35:26671--26685, 2022.

[6] R. Balestriero and Y. LeCun. LeJEPA: Provable and scalable self-supervised learning without the heuristics, 2025. URL https://arxiv.org/abs/2511.08544

[7] I. Barbălat. Systèmes d'équations différentielles d'oscillations non linéaires. Revue de Mathématiques Pures et Appliquées, 4:267--270, 1959.

[8] A. Bardes, J. Ponce, and Y. LeCun. VICReg: Variance-invariance-covariance regularization for self-supervised learning. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=xm6YD62D1Ub

[9] A. Bardes, Q. Garrido, J. Ponce, X. Chen, M. Rabbat, Y. LeCun, M. Assran, and N. Ballas. Revisiting feature prediction for learning visual representations from video. Transactions on Machine Learning Research, 2024.

[10] R. Bhatia. Matrix analysis. Springer Science & Business Media, 2013.
[11] J. Bolte, S. Sabach, and M. Teboulle. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Mathematical Programming, 146(1--2):459--494, 2014.

[12] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin. Emerging properties in self-supervised vision transformers. arXiv preprint arXiv:2104.14294, 2021.

[13] X. Chang, T. Xiang, and T. M. Hospedales. Scalable and effective deep CCA via soft decorrelation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1488--1497, 2018.

[14] J. Chapman, L. Wells, and A. L. Aguila. Unconstrained stochastic CCA: Unifying multiview and self-supervised learning. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=PHLVmV88Zy

[15] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pages 1597--1607. PMLR, 2020.

[16] X. Chen and K. He. Exploring simple siamese representation learning. In Proc. CVPR, 2021.

[17] R. Chill. On the Łojasiewicz--Simon gradient inequality. Journal of Functional Analysis, 201(2):572--601, 2003.

[18] R. Chill, A. Haraux, and M. A. Jendoubi. Applications of the Łojasiewicz-Simon gradient inequality to gradient-like evolution equations. Anal. Appl. (Singap.), 7(4):351--372, 2009.

[19] J. B. Conway. A course in functional analysis. Springer, 2019.

[20] M. Dagréou, P. Ablin, S. Vaiter, and T. Moreau. A framework for bilevel optimization that enables stochastic and global variance reduction algorithms. Advances in Neural Information Processing Systems, 35:26698--26710, 2022.
[21] C. De Sa, C. Re, and K. Olukotun. Global convergence of stochastic gradient descent for some non-convex matrix problems. In International Conference on Machine Learning, pages 2332--2341. PMLR, 2015.

[22] A. Ermolov, A. Siarohin, E. Sangineto, and N. Sebe. Whitening for self-supervised representation learning. In International Conference on Machine Learning, pages 3015--3024. PMLR, 2021.

[23] P. M. Feehan and M. Maridakis. Łojasiewicz--Simon gradient inequalities for analytic and Morse--Bott functions on Banach spaces. Journal für die reine und angewandte Mathematik (Crelles Journal), 2020(765):35--67, 2020.

[24] K. Fukumizu, F. R. Bach, and A. Gretton. Statistical consistency of kernel canonical correlation analysis. J. Mach. Learn. Res., 8:361--383, May 2007. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=1248659.1248673

[25] A. Galashov, N. Da Costa, L. Xu, P. Hennig, and A. Gretton. Closed-form last layer optimization. arXiv preprint arXiv:2510.04606, 2025.

[26] Q. Garrido, R. Balestriero, L. Najman, and Y. LeCun. RankMe: Assessing the downstream performance of pretrained self-supervised representations by their rank. In Proceedings of the 40th International Conference on Machine Learning, ICML'23. JMLR.org, 2023a.

[27] Q. Garrido, Y. Chen, A. Bardes, L. Najman, and Y. LeCun. On the duality between contrastive and non-contrastive self-supervised learning. In The Eleventh International Conference on Learning Representations, 2023b. URL https://openreview.net/forum?id=kDEL91Dufpa

[28] I. Gohberg, S. Goldberg, and M. A. Kaashoek. Spectral Theory of Compact Self Adjoint Operators, pages 171--191. Birkhäuser Basel, Basel, 2003. ISBN 978-3-0348-7980-4. doi:10.1007/978-3-0348-7980-4_4. URL https://doi.org/10.1007/978-3-0348-7980-4_4

[29] R. Goyal, S. E. Kahou, V. Michalski, J. Materzyńska, S. Westphal, H. Kim, V. Haenel, I. Fruend, P. Yianilos, M. Mueller-Freitag, F. Hoppe, C. Thurau, I. Bax, and R. Memisevic. The "something something" video database for learning and evaluating visual common sense, 2017. URL https://arxiv.org/abs/1706.04261

[30] R. D. Grigorieff. A note on von Neumann's trace inequality. Mathematische Nachrichten, 151(1):327--328, 1991.
[31] J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. Richemond, E. Buchatskaya, C. Doersch, B. Avila Pires, Z. Guo, M. Gheshlaghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning. Advances in Neural Information Processing Systems, 33:21271--21284, 2020.

[32] P. Habets. Stabilité asymptotique pour des problèmes de perturbations singulières. In Stability Problems, pages 2--18. Springer, 1974.

[33] D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor. Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16(12):2639--2664, 2004. doi:10.1162/0899766042321814

[34] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick. Momentum contrast for unsupervised visual representation learning. In Proc. CVPR, 2020.

[35] J. Hestness, S. Narang, N. Ardalani, G. Diamos, H. Jun, H. Kianinejad, M. M. A. Patwary, Y. Yang, and Y. Zhou. Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409, 2017.

[36] L. Huang, Y. Ni, X. Weng, R. M. Anwer, S. Khan, M.-H. Yang, and F. S. Khan. Understanding whitening loss in self-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):9479--9492, 2024.

[37] M. Huh, B. Cheung, T. Wang, and P. Isola. The platonic representation hypothesis. arXiv preprint arXiv:2405.07987, 2024.

[38] Y. Jia, F. Nie, and C. Zhang. Trace ratio problem revisited. IEEE Transactions on Neural Networks, 20(4):729--735, 2009.

[39] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.

[40] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman, and A. Zisserman. The kinetics human action video dataset, 2017. URL https://arxiv.org/abs/1705.06950
[41] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.

[42] H. O. Lancaster. The structure of bivariate distributions. The Annals of Mathematical Statistics, 29(3):719--736, 1958.

[43] Y. LeCun. A path towards autonomous machine intelligence. OpenReview, Jun 2022.

[44] J. D. Lee, M. Simchowitz, M. I. Jordan, and B. Recht. Gradient descent only converges to minimizers. In Conference on Learning Theory, pages 1246--1257. PMLR, 2016.

[45] E. Littwin, O. Saremi, M. Advani, V. Thilak, P. Nakkiran, C. Huang, and J. Susskind. How JEPA avoids noisy features: The implicit bias of deep linear self distillation networks. Advances in Neural Information Processing Systems, 37:91300--91336, 2024.

[46] S. Łojasiewicz. Sur les trajectoires du gradient d'une fonction analytique. Seminari di geometria, 1983:115--117, 1982.

[47] Q. Lyu, X. Fu, W. Wang, and S. Lu. Understanding latent correlation-based multiview learning and self-supervision: An identifiability perspective. arXiv preprint arXiv:2106.07115, 2021.

[48] T. Michaeli, W. Wang, and K. Livescu. Nonparametric canonical correlation analysis. arXiv preprint arXiv:1511.04839, 2015.

[49] D. Morales-Brotons, T. Vogels, and H. Hendrikx. Exponential moving average of weights in deep learning: Dynamics and benefits. Transactions on Machine Learning Research, 2024. URL https://openreview.net/forum?id=2M9CUnYnBA

[50] Y. Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2003.
[51] M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.

[52] J. Ponce, B. Terver, M. Hebert, and M. Arbel. Dual perspectives on non-contrastive self-supervised learning. In International Conference on Learning Representations (ICLR), 2026.

[53] B. Shi, W. Su, and M. I. Jordan. On learning rates and Schrödinger operators. Journal of Machine Learning Research, 24(379):1--53, 2023.

[54] O. Siméoni, H. V. Vo, M. Seitzer, F. Baldassarre, M. Oquab, C. Jose, V. Khalidov, M. Szafraniec, S. Yi, M. Ramamonjisoa, et al. DINOv3. arXiv preprint arXiv:2508.10104, 2025.

[55] V. Sobal, W. Zhang, K. Cho, R. Balestriero, T. G. Rudner, and Y. LeCun. Learning from reward-free offline data: A case for planning with latent dynamics models. arXiv preprint arXiv:2502.14819, 2025.

[56] L. Sun, S. Ji, and J. Ye. A least squares formulation for a class of generalized eigenvalue problems in machine learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 977--984, 2009.

[57] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. The MIT Press, 2 edition, 2018. ISBN 9780262039246.

[58] Y. Tang, Z. D. Guo, P. H. Richemond, B. A. Pires, Y. Chandak, R. Munos, M. Rowland, M. G. Azar, C. Le Lan, C. Lyle, et al. Understanding self-predictive learning for reinforcement learning. In International Conference on Machine Learning, pages 33632--33656. PMLR, 2023.

[59] B. Terver, R. Balestriero, M. Dervishi, D. Fan, Q. Garrido, T. Nagarajan, K. Sinha, W. Zhang, M. Rabbat, Y. LeCun, and A. Bar. A lightweight library for energy-based joint-embedding predictive architectures. In ICLR 2026 Workshop on World Models, 2026a.

[60] B. Terver, T.-Y. Yang, J. Ponce, A. Bardes, and Y. LeCun. What drives success in physical planning with joint-embedding predictive world models?, 2026b. URL https://arxiv.org/abs/2512.24497
[61] G. Teschl. Ordinary differential equations and dynamical systems. Graduate Studies in Mathematics, 140:08854--8019, 2000.

[62] Y. Tian, X. Chen, and S. Ganguli. Understanding self-supervised learning dynamics without contrastive pairs. In International Conference on Machine Learning, pages 10268--10278. PMLR, 2021.

[63] T. T. Truong. Some convergent results for backtracking gradient descent method on Banach spaces, 2020.

[64] H. Wang, S. Yan, D. Xu, X. Tang, and T. Huang. Trace ratio vs. ratio trace for dimensionality reduction. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1--8. IEEE, 2007.

[65] M. Wang, E. X. Fang, and H. Liu. Stochastic compositional gradient descent: Algorithms for minimizing compositions of expected-value functions. Mathematical Programming, 161(1--2):419--449, 2017.

[66] Z. Wen and Y. Li. The mechanism of prediction head in non-contrastive self-supervised learning. In NeurIPS, 2022.

[67] L. Xu, Y. Chen, S. Srinivasan, N. de Freitas, A. Doucet, and A. Gretton. Learning deep features in instrumental variable regression. arXiv preprint arXiv:2010.07154, 2020.

[68] Y. You, I. Gitman, and B. Ginsburg. Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888, 2017.

[69] J. Zbontar, L. Jing, I. Misra, Y. LeCun, and S. Deny. Barlow Twins: Self-supervised learning via redundancy reduction. In International Conference on Machine Learning, pages 12310--12320. PMLR, 2021.

[70] H. Zhang, Q. Wu, J. Yan, D. Wipf, and P. S. Yu. From canonical correlation analysis to self-supervised graph neural networks. Advances in Neural Information Processing Systems, 34:76--89, 2021.


[12] [12]

Emerging Properties in Self-Supervised Vision Transformers

M. Caron, H. Touvron, I. Misra, H. J \'e gou, J. Mairal, P. Bojanowski, and A. Joulin. Emerging properties in self-supervised vision transformers. arXiv preprint arXiv:2104.14294, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[13] [13]

Chang, T

X. Chang, T. Xiang, and T. M. Hospedales. Scalable and effective deep cca via soft decorrelation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1488--1497, 2018

work page 2018

[14] [14]

Chapman, L

J. Chapman, L. Wells, and A. L. Aguila. Unconstrained stochastic CCA : Unifying multiview and self-supervised learning. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=PHLVmV88Zy

work page 2024

[15] [15]

T. Chen, S. Kornblith, M. Norouzi, and G. Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597--1607. PmLR, 2020

work page 2020

[16] [16]

Chen and K

X. Chen and K. He. Exploring simple siamese representation learning. In Proc. CVPR, 2021

work page 2021

[17] [17]

R. Chill. On the ojasiewicz--simon gradient inequality. Journal of Functional Analysis, 201 0 (2): 0 572--601, 2003

work page 2003

[18] [18]

Chill, A

R. Chill, A. Haraux, and M. A. Jendoubi. Applications of the lojasiewicz-simon gradient inequality to gradient-like evolution equations. Anal. Appl.(Singap.), 7 0 (4): 0 351--372, 2009

work page 2009

[19] [19]

J. B. Conway. A course in functional analysis. Springer, 2019

work page 2019

[20] [20]

Dagr \'e ou, P

M. Dagr \'e ou, P. Ablin, S. Vaiter, and T. Moreau. A framework for bilevel optimization that enables stochastic and global variance reduction algorithms. Advances in Neural Information Processing Systems, 35: 0 26698--26710, 2022

work page 2022

[21] [21]

De Sa, C

C. De Sa, C. Re, and K. Olukotun. Global convergence of stochastic gradient descent for some non-convex matrix problems. In International conference on machine learning, pages 2332--2341. PMLR, 2015

work page 2015

[22] [22]

Ermolov, A

A. Ermolov, A. Siarohin, E. Sangineto, and N. Sebe. Whitening for self-supervised representation learning. In International conference on machine learning, pages 3015--3024. PMLR, 2021

work page 2021

[23] [23]

P. M. Feehan and M. Maridakis. ojasiewicz--simon gradient inequalities for analytic and morse--bott functions on banach spaces. Journal f \"u r die reine und angewandte Mathematik (Crelles Journal) , 2020 0 (765): 0 35--67, 2020

work page 2020

[24] [24]

Fukumizu, F

K. Fukumizu, F. R. Bach, and A. Gretton. Statistical consistency of kernel canonical correlation analysis. J. Mach. Learn. Res., 8: 0 361--383, May 2007. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=1248659.1248673

work page arXiv 2007

[25] [25]

Closed-Form Last Layer Optimization

A. Galashov, N. Da Costa, L. Xu, P. Hennig, and A. Gretton. Closed-form last layer optimization. arXiv preprint arXiv:2510.04606, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[26] [26]

Garrido, R

Q. Garrido, R. Balestriero, L. Najman, and Y. LeCun. Rankme: assessing the downstream performance of pretrained self-supervised representations by their rank. In Proceedings of the 40th International Conference on Machine Learning, ICML'23. JMLR.org, 2023 a

work page 2023

[27] [27]

Garrido, Y

Q. Garrido, Y. Chen, A. Bardes, L. Najman, and Y. LeCun. On the duality between contrastive and non-contrastive self-supervised learning. In The Eleventh International Conference on Learning Representations, 2023 b . URL https://openreview.net/forum?id=kDEL91Dufpa

work page 2023

[28] [28]

Gohberg, S

I. Gohberg, S. Goldberg, and M. A. Kaashoek. Spectral Theory of Compact Self Adjoint Operators, pages 171--191. Birkh \"a user Basel, Basel, 2003. ISBN 978-3-0348-7980-4. doi:10.1007/978-3-0348-7980-4_4. URL https://doi.org/10.1007/978-3-0348-7980-4_4

work page doi:10.1007/978-3-0348-7980-4_4 2003

[29] [29]

The "something something" video database for learning and evaluating visual common sense

R. Goyal, S. E. Kahou, V. Michalski, J. Materzyńska, S. Westphal, H. Kim, V. Haenel, I. Fruend, P. Yianilos, M. Mueller-Freitag, F. Hoppe, C. Thurau, I. Bax, and R. Memisevic. The "something something" video database for learning and evaluating visual common sense, 2017. URL https://arxiv.org/abs/1706.04261

work page internal anchor Pith review Pith/arXiv arXiv 2017

[30] [30]

R. D. Grigorieff. A note on von neumann's trace inequalitv. Mathematische Nachrichten, 151 0 (1): 0 327--328, 1991

work page 1991

[31] [31]

Grill, F

J.-B. Grill, F. Strub, F. Altch \'e , C. Tallec, P. Richemond, E. Buchatskaya, C. Doersch, B. Avila Pires, Z. Guo, M. Gheshlaghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning. Advances in neural information processing systems, 33: 0 21271--21284, 2020

work page 2020

[32] [32]

P. Habets. Stabilit \'e asymptotique pour des probl \`e mes de perturbations singuli \`e res. In Stability Problems, pages 2--18. Springer, 1974

work page 1974

[33] [33]

D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor. Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16 0 (12): 0 2639--2664, 2004. doi:10.1162/0899766042321814

work page doi:10.1162/0899766042321814 2004

[34] [34]

K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick. Momentum contrast for unsupervised visual representation learning. In Proc. CVPR, 2020

work page 2020

[35] [35]

Deep Learning Scaling is Predictable, Empirically

J. Hestness, S. Narang, N. Ardalani, G. Diamos, H. Jun, H. Kianinejad, M. M. A. Patwary, Y. Yang, and Y. Zhou. Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[36] [36]

Huang, Y

L. Huang, Y. Ni, X. Weng, R. M. Anwer, S. Khan, M.-H. Yang, and F. S. Khan. Understanding whitening loss in self-supervised learning. IEEE transactions on pattern analysis and machine intelligence, 46 0 (12): 0 9479--9492, 2024

work page 2024

[37] [37]

M. Huh, B. Cheung, T. Wang, and P. Isola. The platonic representation hypothesis. arXiv preprint arXiv:2405.07987, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[38] [38]

Y. Jia, F. Nie, and C. Zhang. Trace ratio problem revisited. IEEE Transactions on Neural Networks, 20 0 (4): 0 729--735, 2009

work page 2009

[39] [39]

Scaling Laws for Neural Language Models

J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001

[40] [40]

W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman, and A. Zisserman. The kinetics human action video dataset, 2017. URL https://arxiv.org/abs/1705.06950

work page internal anchor Pith review Pith/arXiv arXiv 2017

[41] [41]

Krizhevsky and G

A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009

work page 2009

[42] [42]

H. O. Lancaster. The structure of bivariate distributions. The Annals of Mathematical Statistics, 29 0 (3): 0 719--736, 1958

work page 1958

[43] [43]

Y. LeCun. A path towards autonomous machine intelligence. OpenReview, Jun 2022

work page 2022

[44] [44]

J. D. Lee, M. Simchowitz, M. I. Jordan, and B. Recht. Gradient descent only converges to minimizers. In Conference on Learning Theory, pages 1246--1257. PMLR, 2016

work page 2016

[45] [45]

Littwin, O

E. Littwin, O. Saremi, M. Advani, V. Thilak, P. Nakkiran, C. Huang, and J. Susskind. How jepa avoids noisy features: The implicit bias of deep linear self distillation networks. Advances in Neural Information Processing Systems, 37: 0 91300--91336, 2024

work page 2024

[46] [46]

Lojasiewicz

S. Lojasiewicz. Sur les trajectoires du gradient d'une fonction analytique. Seminari di geometria, 1983: 0 115--117, 1982

work page 1983

[47] [47]

Q. Lyu, X. Fu, W. Wang, and S. Lu. Understanding latent correlation-based multiview learning and self-supervision: An identifiability perspective. arXiv preprint arXiv:2106.07115, 2021

work page arXiv 2021

[48] [48]

Nonparametric Canonical Correlation Analysis

T. Michaeli, W. Wang, and K. Livescu. Nonparametric canonical correlation analysis. arXiv preprint arXiv:1511.04839, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[49] [49]

Morales-Brotons, T

D. Morales-Brotons, T. Vogels, and H. Hendrikx. Exponential moving average of weights in deep learning: Dynamics and benefits. Transactions on Machine Learning Research, 2024. URL https://openreview.net/forum?id=2M9CUnYnBA

work page 2024

[50] [50]

Nesterov

Y. Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2003

work page 2003

[51] [51]

DINOv2: Learning Robust Visual Features without Supervision

M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[52] [52]

Ponce, B

J. Ponce, B. Terver, M. Hebert, and M. Arbel. Dual perspectives on non-contrastive self-supervised learning. In International Conference on Learning Representations (ICLR), 2026

work page 2026

[53] [53]

B. Shi, W. Su, and M. I. Jordan. On learning rates and schr \"o dinger operators. Journal of Machine Learning Research, 24 0 (379): 0 1--53, 2023

work page 2023

[54] [54]

DINOv3

O. Sim \'e oni, H. V. Vo, M. Seitzer, F. Baldassarre, M. Oquab, C. Jose, V. Khalidov, M. Szafraniec, S. Yi, M. Ramamonjisoa, et al. DINOv3 . arXiv preprint arXiv:2508.10104, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[55] [55]

Learning from reward-free offline data: A case for planning with latent dynamics models.arXiv preprint arXiv:2502.14819, 2025

V. Sobal, W. Zhang, K. Cho, R. Balestriero, T. G. Rudner, and Y. LeCun. Learning from reward-free offline data: A case for planning with latent dynamics models. arXiv preprint arXiv:2502.14819, 2025

work page arXiv 2025

[56] [56]

L. Sun, S. Ji, and J. Ye. A least squares formulation for a class of generalized eigenvalue problems in machine learning. In Proceedings of the 26th annual international conference on machine learning, pages 977--984, 2009

work page 2009

[57] [57]

R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. The MIT Press, 2 edition, 2018. ISBN 9780262039246

work page 2018

[58] [58]

Y. Tang, Z. D. Guo, P. H. Richemond, B. A. Pires, Y. Chandak, R. Munos, M. Rowland, M. G. Azar, C. Le Lan, C. Lyle, et al. Understanding self-predictive learning for reinforcement learning. In International Conference on Machine Learning, pages 33632--33656. PMLR, 2023

work page 2023

[59] [59]

Terver, R

B. Terver, R. Balestriero, M. Dervishi, D. Fan, Q. Garrido, T. Nagarajan, K. Sinha, W. Zhang, M. Rabbat, Y. LeCun, and A. Bar. A lightweight library for energy-based joint-embedding predictive architectures. In ICLR 2026 Workshop on World Models, 2026 a

work page 2026

[60] [60]

What Drives Success in Physical Planning with Joint-Embedding Predictive World Models?

B. Terver, T.-Y. Yang, J. Ponce, A. Bardes, and Y. LeCun. What drives success in physical planning with joint-embedding predictive world models?, 2026 b . URL https://arxiv.org/abs/2512.24497

work page internal anchor Pith review Pith/arXiv arXiv 2026

[61] [61]

G. Teschl. Ordinary differential equations and dynamical systems. Graduate Studies in Mathematics, 140: 0 08854--8019, 2000

work page 2000

[62] [62]

Y. Tian, X. Chen, and S. Ganguli. Understanding self-supervised learning dynamics without contrastive pairs. In International Conference on Machine Learning, pages 10268--10278. PMLR, 2021

work page 2021

[63] [63]

T. T. Truong. Some convergent results for backtracking gradient descent method on banach spaces, 2020

work page 2020

[64] [64]

H. Wang, S. Yan, D. Xu, X. Tang, and T. Huang. Trace ratio vs. ratio trace for dimensionality reduction. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1--8. IEEE, 2007

work page 2007

[65] [65]

M. Wang, E. X. Fang, and H. Liu. Stochastic compositional gradient descent: algorithms for minimizing compositions of expected-value functions. Mathematical Programming, 161 0 (1--2): 0 419--449, 2017

work page 2017

[66] [66]

Wen and Y

Z. Wen and Y. Li. The mechanism of prediction head in non-contrastive self-supervised learning. In NeurIPS, 2022

work page 2022

[67] [67]

L. Xu, Y. Chen, S. Srinivasan, N. de Freitas, A. Doucet, and A. Gretton. Learning deep features in instrumental variable regression. arXiv preprint arXiv:2010.07154, 2020

work page arXiv 2010

[68] [68]

Y. You, I. Gitman, and B. Ginsburg. Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[69] [69]

Zbontar, L

J. Zbontar, L. Jing, I. Misra, Y. LeCun, and S. Deny. Barlow twins: Self-supervised learning via redundancy reduction. In International Conference on Machine Learning, pages 12310--12320. PMLR, 2021

work page 2021

[70] [70]

Zhang, Q

H. Zhang, Q. Wu, J. Yan, D. Wipf, and P. S. Yu. From canonical correlation analysis to self-supervised graph neural networks. Advances in Neural Information Processing Systems, 34: 0 76--89, 2021

work page 2021