pith. machine review for the scientific record.

arxiv: 2511.08544 · v3 · submitted 2025-11-11 · 💻 cs.LG · cs.AI · cs.CV · stat.ML

Recognition: 3 theorem links


LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 07:18 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CV · stat.ML
keywords self-supervised learning · JEPA · isotropic Gaussian · regularization · representation learning · ViT · ImageNet · predictive architectures

The pith

LeJEPA shows that self-supervised embeddings reach minimal downstream prediction risk when constrained to an isotropic Gaussian distribution via sketched regularization added to the JEPA loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper first derives that an isotropic Gaussian is the optimal distribution for embeddings in Joint-Embedding Predictive Architectures to reduce prediction risk on downstream tasks. It then proposes Sketched Isotropic Gaussian Regularization (SIGReg) as a practical way to enforce this distribution. When SIGReg is combined with the standard JEPA predictive loss, the resulting LeJEPA objective requires only one trade-off hyperparameter, runs in linear time and memory, and eliminates common training heuristics such as stop-gradients and teacher-student setups. Experiments across more than ten datasets and sixty architectures confirm stable performance, including 79 percent top-1 accuracy on ImageNet-1k pretraining followed by linear evaluation of a frozen ViT-H/14 backbone.
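The single-objective form described above can be sketched in a few lines. This is an illustration only: the `isotropy_penalty` below uses the full empirical covariance for clarity, whereas the paper's SIGReg works through sketched one-dimensional projections to stay linear in time and memory, and all names and the value of `lam` are illustrative rather than taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def jepa_predictive_loss(pred, target):
    # Mean squared error between predicted and target embeddings.
    return float(np.mean((pred - target) ** 2))

def isotropy_penalty(z):
    # Crude stand-in for SIGReg: push the embedding mean to zero and the
    # covariance toward the identity. This uses the full covariance
    # (quadratic in dimension); SIGReg's sketched projections avoid that.
    mu = z.mean(axis=0)
    cov = np.cov(z, rowvar=False)
    return float(mu @ mu + np.sum((cov - np.eye(z.shape[1])) ** 2))

def lejepa_objective(pred, target, z, lam=0.1):
    # The single trade-off hyperparameter lam balances prediction vs. isotropy.
    return jepa_predictive_loss(pred, target) + lam * isotropy_penalty(z)

z = rng.standard_normal((4000, 4))             # near-isotropic embeddings
pred = z + 0.01 * rng.standard_normal(z.shape)  # near-perfect predictor
loss = lejepa_objective(pred, z, z, lam=0.1)
```

A near-isotropic batch incurs almost no penalty, while stretching one embedding axis makes the penalty dominate, which is the behavior the regularizer is meant to exploit.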

Core claim

Identifying the isotropic Gaussian as the distribution that minimizes downstream prediction risk, and introducing SIGReg to enforce it, lets the JEPA predictive loss be augmented into LeJEPA: a single-objective training method that is theoretically grounded, linearly scalable, and free of ad hoc heuristics while remaining stable across architectures and data domains.

What carries the argument

Sketched Isotropic Gaussian Regularization (SIGReg) approximates the constraint that embeddings follow an isotropic Gaussian distribution when added to the JEPA predictive loss.
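A minimal sketch of the "sketched" idea: if embeddings are isotropic Gaussian, every one-dimensional projection onto a unit vector is standard normal, so Gaussianity can be scored direction-by-direction with a univariate test at cost linear in the number of samples. The statistic below is a Cramér–von Mises-style distance between each projection's empirical CDF and the standard normal CDF; the paper's specific univariate test and number of directions may differ, so treat this as an assumption-laden stand-in.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(1)

def sketched_gaussianity_stat(z, num_directions=16):
    # Project embeddings onto random unit vectors and score each 1D
    # projection against N(0, 1); average the per-direction distances.
    n, d = z.shape
    dirs = rng.standard_normal((num_directions, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    total = 0.0
    for u in dirs:
        proj = np.sort(z @ u)
        ecdf = (np.arange(1, n + 1) - 0.5) / n
        ncdf = np.array([0.5 * (1.0 + erf(x / sqrt(2.0))) for x in proj])
        total += float(np.mean((ecdf - ncdf) ** 2))
    return total / num_directions

z_iso = rng.standard_normal((2000, 16))   # isotropic Gaussian embeddings
z_skew = np.abs(z_iso)                    # clearly non-Gaussian marginals
stat_iso = sketched_gaussianity_stat(z_iso)
stat_skew = sketched_gaussianity_stat(z_skew)
```

The statistic is near zero for an isotropic Gaussian batch and visibly larger for the skewed one, which is what lets a differentiable version of it serve as a training penalty.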

If this is right

  • Only one trade-off hyperparameter controls the entire training process
  • Time and memory scale linearly with dataset size
  • Training remains stable without stop-gradients, teacher-student networks, or learning-rate schedulers
  • The method works across ResNets, ViTs, and ConvNets on multiple domains
  • ImageNet-1k pretraining followed by linear evaluation of a frozen ViT-H/14 reaches 79 percent top-1 accuracy

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The single-objective form could simplify large-scale distributed pretraining pipelines by removing many implementation choices
  • Similar sketched regularization might extend to other embedding-based objectives beyond JEPA
  • The Gaussian optimality result may link to broader information-theoretic views of representation learning
  • Implementation requiring roughly fifty lines of code suggests the method is immediately usable in standard frameworks

Load-bearing premise

The derivation that the isotropic Gaussian distribution minimizes downstream prediction risk must hold under the stated conditions on the embedding space and loss.
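The premise can be probed with a toy experiment (not the paper's proof): fix a linear downstream head with ridge regularization and compare test risk when the same latent signal is exposed through an isotropic embedding versus an ill-conditioned one. With a fixed ridge penalty, the squashed signal direction is over-shrunk, so the anisotropic embedding pays extra bias; the dimensions, scales, and regularization strength below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

def ridge_test_risk(A, n_train=200, n_test=5000, lam=1.0, noise=0.1):
    # Latent z is isotropic Gaussian; the linear probe only sees h = z @ A.T.
    # The downstream target depends on the last latent coordinate.
    d = A.shape[1]
    def make(n):
        z = rng.standard_normal((n, d))
        y = z[:, -1] + noise * rng.standard_normal(n)
        return z @ A.T, y
    H, y = make(n_train)
    w = np.linalg.solve(H.T @ H + lam * np.eye(H.shape[1]), H.T @ y)
    H_test, y_test = make(n_test)
    return float(np.mean((H_test @ w - y_test) ** 2))

risk_iso = ridge_test_risk(np.eye(2))               # isotropic embedding
risk_aniso = ridge_test_risk(np.diag([10.0, 0.1]))  # squashed signal axis
```

With an unregularized head and enough data the two parameterizations would be equivalent, so the gap here depends on the fixed-penalty probe; this is exactly why the precise conditions on the predictor class matter for the formal claim.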

What would settle it

If LeJEPA requires extra heuristics or underperforms competitive baselines on a large new dataset where the optimality conditions are met, the claimed benefits of the Gaussian target and SIGReg would not generalize.

read the original abstract

Learning manipulable representations of the world and its dynamics is central to AI. Joint-Embedding Predictive Architectures (JEPAs) offer a promising blueprint, but lack of practical guidance and theory has led to ad-hoc R&D. We present a comprehensive theory of JEPAs and instantiate it in {\bf LeJEPA}, a lean, scalable, and theoretically grounded training objective. First, we identify the isotropic Gaussian as the optimal distribution that JEPAs' embeddings should follow to minimize downstream prediction risk. Second, we introduce a novel objective--{\bf Sketched Isotropic Gaussian Regularization} (SIGReg)--to constrain embeddings to reach that ideal distribution. Combining the JEPA predictive loss with SIGReg yields LeJEPA with numerous theoretical and practical benefits: (i) single trade-off hyperparameter, (ii) linear time and memory complexity, (iii) stability across hyper-parameters, architectures (ResNets, ViTs, ConvNets) and domains, (iv) heuristics-free, e.g., no stop-gradient, no teacher-student, no hyper-parameter schedulers, and (v) distributed training-friendly implementation requiring only $\approx$50 lines of code. Our empirical validation covers 10+ datasets, 60+ architectures, all with varying scales and domains. As an example, using imagenet-1k for pretraining and linear evaluation with frozen backbone, LeJEPA reaches 79\% with a ViT-H/14. We hope that the simplicity and theory-friendly ecosystem offered by LeJEPA will reestablish self-supervised pre-training as a core pillar of AI research (\href{https://github.com/rbalestr-lab/lejepa}{GitHub repo}).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims to deliver a comprehensive theory of Joint-Embedding Predictive Architectures (JEPAs) by identifying the isotropic Gaussian as the unique distribution that minimizes downstream prediction risk, then instantiating this via Sketched Isotropic Gaussian Regularization (SIGReg) to produce LeJEPA: a single-hyperparameter objective with linear time/memory, no stop-gradients or teacher-student mechanisms, and stable performance across 60+ architectures and 10+ datasets. The empirical highlight is 79% top-1 linear-evaluation accuracy on ImageNet-1k using a ViT-H/14 backbone.

Significance. If the optimality derivation is rigorous and the empirical claims hold under the stated conditions, the work would be significant for replacing heuristic-heavy SSL pipelines with a provably motivated, implementation-light alternative. The single trade-off parameter, distributed-training compatibility, and broad architecture coverage are attractive; the open GitHub repo further strengthens reproducibility.

major comments (2)
  1. [Theory] Optimality derivation: the claim that the isotropic Gaussian uniquely minimizes downstream risk under the JEPA loss is load-bearing for SIGReg and for the 'provable' and 'heuristics-free' assertions, yet the provided sketch appears to assume a linear downstream head and independence between embedding covariances and targets. These assumptions must be stated explicitly, with precise conditions on the data distribution and predictor class; otherwise the risk-reduction guarantee does not follow for general nonlinear heads.
  2. [§4] Empirical validation: the 79% ImageNet-1k result with ViT-H/14 is concrete, but the stability claim across 60+ architectures requires an ablation table showing performance variance when the single trade-off hyperparameter is varied by ±50% and when the sketching dimension in SIGReg is reduced. Without these controls, the 'linear complexity' and 'no scheduler' benefits cannot be isolated from implementation details.
minor comments (2)
  1. [Abstract and §3] The phrase 'parameter-free' for the Gaussian target is imprecise once the trade-off hyperparameter and sketching dimension are introduced; clarify the exact count of free parameters.
  2. [Tables/Figures] Figure captions and Table 1 should state explicitly whether accuracies are top-1 or top-5 and whether the backbone is frozen.
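The requested control is easy to specify. A hypothetical sweep grid for the ablation (the base value and sketching dimensions are placeholders, not values from the paper) would look like:

```python
# Hypothetical ablation grid for the referee's request; base_lam and the
# sketching dimensions are illustrative placeholders.
base_lam = 0.1
lam_values = [base_lam * f for f in (0.5, 0.75, 1.0, 1.25, 1.5)]  # ±50% around base
sketch_dims = [256, 128, 64, 32]   # progressively reduced sketching dimension
grid = [(lam, m) for lam in lam_values for m in sketch_dims]
```

One pretraining run per grid cell, reported as linear-evaluation accuracy, would be enough to show whether performance is flat in both directions.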

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below, providing clarifications on the theoretical assumptions and committing to additional empirical controls in the revision.

read point-by-point responses
  1. Referee: [Theory] Optimality derivation: the claim that the isotropic Gaussian uniquely minimizes downstream risk under the JEPA loss is load-bearing for SIGReg and for the 'provable' and 'heuristics-free' assertions, yet the provided sketch appears to assume a linear downstream head and independence between embedding covariances and targets. These assumptions must be stated explicitly, with precise conditions on the data distribution and predictor class; otherwise the risk-reduction guarantee does not follow for general nonlinear heads.

    Authors: We agree that the optimality derivation relies on a linear downstream head and assumes independence between embedding covariances and targets. These conditions will be stated explicitly in the revised theory section, together with the precise requirements on the data distribution (bounded second moments) and the predictor class (linear functions). Under these assumptions the isotropic Gaussian is the unique minimizer of downstream risk. For general nonlinear heads the strict guarantee does not follow from the current analysis; we will add a remark acknowledging this limitation while noting that the empirical results across diverse architectures remain consistent with the proposed objective. revision: partial

  2. Referee: [§4] Empirical validation: the 79% ImageNet-1k result with ViT-H/14 is concrete, but the stability claim across 60+ architectures requires an ablation table showing performance variance when the single trade-off hyperparameter is varied by ±50% and when the sketching dimension in SIGReg is reduced. Without these controls, the 'linear complexity' and 'no scheduler' benefits cannot be isolated from implementation details.

    Authors: We accept that stronger controls are needed to isolate the claimed benefits. In the revised manuscript we will add an ablation table in §4 that reports linear-evaluation accuracy for the trade-off hyperparameter varied by ±50% and for reduced sketching dimensions, using a representative subset of the 60+ architectures (including ResNets and ViTs). The full set of results will be summarized with references to the new table, thereby supporting the stability, linear-complexity, and scheduler-free claims. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in the derivation chain

full rationale

The paper presents a first-principles derivation identifying the isotropic Gaussian as the distribution minimizing downstream prediction risk under its stated conditions on embeddings and loss, then introduces SIGReg as a new constraint to enforce it. None of the quoted steps reduces, by the paper's own equations, to a fitted input renamed as a prediction, a self-definitional loop, or a load-bearing self-citation chain. The central claim (the LeJEPA objective) retains independent content from the new regularization rather than collapsing to its inputs by construction. This is the expected non-finding for a theory-driven contribution with external empirical validation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method depends on one explicit trade-off hyperparameter and the domain assumption that isotropic Gaussian embeddings minimize downstream risk; no new particles or dimensions are postulated.

free parameters (1)
  • trade-off hyperparameter
    Single scalar balancing the predictive JEPA loss against SIGReg; its value is chosen once per run rather than fitted per dataset.
axioms (1)
  • domain assumption: the isotropic Gaussian is the optimal distribution for JEPAs' embeddings to minimize downstream prediction risk.
    Identified as the first theoretical step before introducing SIGReg.

pith-pipeline@v0.9.0 · 5622 in / 1326 out tokens · 24985 ms · 2026-05-16T07:18:53.098609+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation washburn_uniqueness_aczel echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    we identify the isotropic Gaussian as the optimal distribution that JEPAs' embeddings should follow to minimize downstream prediction risk... SIGReg... to constrain embeddings to reach that ideal distribution

  • Foundation.LogicAsFunctionalEquation rcl_polynomial_closure_theorem echoes


    Combining the JEPA predictive loss with SIGReg yields LeJEPA with... heuristics-free... single trade-off hyperparameter... linear time and memory complexity

  • Foundation.LogicAsFunctionalEquation operative_to_laws_of_logic refines

    Relation between the paper passage and the cited Recognition theorem.

    finite pairwise polynomial closure... RCL family

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. JEDI: Joint Embedding Diffusion World Model for Online Model-Based Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    JEDI is the first online end-to-end latent diffusion world model that trains latents from denoising loss rather than reconstruction, achieving competitive Atari100k results with 43% less VRAM and over 3x faster sampli...

  2. ProteinJEPA: Latent prediction complements protein language models

    cs.LG 2026-05 unverdicted novelty 7.0

    Masked-position MLM plus JEPA latent prediction outperforms MLM-only pretraining on 10-11 of 16 downstream tasks for 35M-150M protein models while JEPA alone fails.

  3. Joint Embedding Variational Bayes

    cs.LG 2026-02 unverdicted novelty 7.0

    VJE is a new variational non-contrastive SSL method that models target embeddings with a directional-radial Student-t distribution to enable structured uncertainty estimation directly in the learned representation space.

  4. HEPA: A Self-Supervised Horizon-Conditioned Event Predictive Architecture for Time Series

    cs.LG 2026-05 unverdicted novelty 6.0

    HEPA combines self-supervised JEPA pretraining on time series representations with horizon-conditioned finetuning to predict rare events via survival CDFs, outperforming PatchTST, iTransformer, MAE, and Chronos-2 on a...

  5. HEPA: A Self-Supervised Horizon-Conditioned Event Predictive Architecture for Time Series

    cs.LG 2026-05 unverdicted novelty 6.0

    HEPA combines JEPA self-supervised pretraining with horizon-conditioned fine-tuning to predict rare events in multivariate time series as a monotonic survival distribution, outperforming PatchTST, iTransformer, MAE, a...

  6. Latent Geometry Beyond Search: Amortizing Planning in World Models

    cs.RO 2026-05 unverdicted novelty 6.0

    In regularized latent spaces of world models, planning can be amortized into a goal-conditioned inverse dynamics model that matches CEM performance at 100-130x lower per-decision cost.

  7. Predictive but Not Plannable: RC-aux for Latent World Models

    cs.LG 2026-05 unverdicted novelty 6.0

    RC-aux corrects spatiotemporal mismatch in reconstruction-free latent world models by adding multi-horizon prediction and reachability supervision, improving planning performance on goal-conditioned pixel-control tasks.

  8. AeroJEPA: Learning Semantic Latent Representations for Scalable 3D Aerodynamic Field Modeling

    cs.LG 2026-05 unverdicted novelty 6.0

    AeroJEPA applies joint-embedding predictive learning to produce scalable, semantically organized latent representations for 3D aerodynamic fields that support both field reconstruction and downstream design tasks.

  9. Understanding Self-Supervised Learning via Latent Distribution Matching

    cs.LG 2026-05 unverdicted novelty 6.0

    Self-supervised learning is cast as latent distribution matching that aligns representations to a model while enforcing uniformity, unifying multiple SSL families and proving identifiability for predictive variants ev...

  10. Why Self-Supervised Encoders Want to Be Normal

    cs.IT 2026-04 unverdicted novelty 6.0

    Self-supervised encoders prefer isotropic Gaussian latent states because the Information Bottleneck, recast as rate-distortion over the predictive manifold, makes these states optimal for target-neutral representations.

  11. Information bottleneck for learning the phase space of dynamics from high-dimensional experimental data

    physics.data-an 2026-04 unverdicted novelty 6.0

    DySIB recovers a two-dimensional representation matching the phase space of a physical pendulum from high-dimensional video data by maximizing predictive mutual information in latent space.

  12. Self-supervised pretraining for an iterative image size agnostic vision transformer

    cs.CV 2026-04 unverdicted novelty 6.0

    A sequential-to-global SSL method based on DINO pretrains iterative foveal-inspired vision transformers to achieve competitive ImageNet-1K performance with constant compute regardless of input resolution.

  13. Sonata: A Hybrid World Model for Inertial Kinematics under Clinical Data Scarcity

    cs.LG 2026-04 unverdicted novelty 6.0

    Sonata is a small hybrid world model pre-trained to predict future IMU states that outperforms autoregressive baselines on clinical discrimination, fall-risk prediction, and cross-cohort transfer while fitting on-devi...

  14. Infrastructure-Centric World Models: Bridging Temporal Depth and Spatial Breadth for Roadside Perception

    cs.CV 2026-04 unverdicted novelty 6.0

    Infrastructure-centric world models use roadside sensors' temporal depth to complement vehicle spatial breadth for better traffic simulation and prediction.

  15. REZE: Representation Regularization for Domain-adaptive Text Embedding Pre-finetuning

    cs.CL 2026-04 unverdicted novelty 6.0

    REZE controls representation shifts in contrastive pre-finetuning of text embeddings via eigenspace decomposition of anchor-positive pairs and adaptive soft-shrinkage on task-variant directions.

  16. LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels

    cs.LG 2026-03 unverdicted novelty 6.0

    LeWM is the first end-to-end trainable JEPA from pixels that uses only two loss terms for stable training and fast planning on 2D/3D control tasks.

  17. PEPR: Privileged Event-based Predictive Regularization for Domain Generalization

    cs.CV 2026-02 unverdicted novelty 6.0

    PEPR reframes learning with privileged event data as predicting latent event features from RGB to improve domain generalization in object detection and segmentation without direct cross-modal alignment.

  18. MultiMedVision: Multi-Modal Medical Vision Framework

    cs.CV 2026-05 unverdicted novelty 5.0

    A unified Sparse Vision Transformer learns joint 2D/3D medical image representations via self-supervision and achieves competitive AUROC on chest X-ray and CT benchmarks with 5x less data than modality-specific models.

  19. Efficient Hierarchical Implicit Flow Q-learning for Offline Goal-conditioned Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 5.0

    Proposes mean flow policies and LeJEPA loss to overcome Gaussian policy limits and weak subgoal generation in hierarchical offline GCRL, reporting strong results on OGBench state and pixel tasks.

  20. Position: agentic AI orchestration should be Bayes-consistent

    cs.AI 2026-05 unverdicted novelty 4.0

    Agentic AI orchestration should apply Bayesian principles for belief maintenance, updating from interactions, and utility-based action selection.

  21. JEPAMatch: Geometric Representation Shaping for Semi-Supervised Learning

    cs.LG 2026-04 unverdicted novelty 4.0

    JEPAMatch augments FlexMatch with LeJEPA-derived latent regularization to produce better-structured representations, yielding higher accuracy and faster convergence on CIFAR-100, STL-10, and Tiny-ImageNet.

Reference graph

Works this paper leans on

144 extracted references · 144 canonical work pages · cited by 20 Pith papers · 11 internal anchors

  1. [1]

    Biometrika , volume=

    An analysis of variance test for normality (complete samples) , author=. Biometrika , volume=. 1965 , publisher=

  2. [2]

    Biometrika , volume=

    Goodness-of-fit tests on a circle , author=. Biometrika , volume=. 1961 , publisher=

  3. [3]

    2010 , publisher=

    Digital nets and sequences: discrepancy theory and quasi--Monte Carlo integration , author=. 2010 , publisher=

  4. [4]

    NA , year=

    Characteristic functions , author=. NA , year=

  5. [5]

    Econometric reviews , volume=

    Empirical characteristic function estimation and its applications , author=. Econometric reviews , volume=. 2004 , publisher=

  6. [6]

    Les Fonctions quasi analytiques: le

    Carleman, Torsten , year=. Les Fonctions quasi analytiques: le

  7. [7]

    The annals of Statistics , pages=

    The empirical characteristic function and its applications , author=. The annals of Statistics , pages=. 1977 , publisher=

  8. [8]

    1987 , publisher=

    Real and complex analysis , author=. 1987 , publisher=

  9. [9]

    2, 2022-06-27 , author=

    A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27 , author=. Open Review , volume=

  10. [10]

    Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

    Momentum contrast for unsupervised visual representation learning , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

  11. [11]

    International conference on machine learning , pages=

    A simple framework for contrastive learning of visual representations , author=. International conference on machine learning , pages=. 2020 , organization=

  12. [12]

    International conference on machine learning , pages=

    Whitening for self-supervised representation learning , author=. International conference on machine learning , pages=. 2021 , organization=

  13. [13]

    VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning

    Vicreg: Variance-invariance-covariance regularization for self-supervised learning , author=. arXiv preprint arXiv:2105.04906 , year=

  14. [14]

    1981 , publisher=

    Probability, statistics, and truth , author=. 1981 , publisher=

  15. [15]

    IEEE transactions on neural networks and learning systems , volume=

    Efficient kNN classification with different numbers of nearest neighbors , author=. IEEE transactions on neural networks and learning systems , volume=. 2017 , publisher=

  16. [16]

    Theory of Probability & Its Applications , volume=

    On estimating regression , author=. Theory of Probability & Its Applications , volume=. 1964 , publisher=

  17. [17]

    Journal of Mathematical Imaging and Vision , volume=

    Sliced and radon wasserstein barycenters of measures , author=. Journal of Mathematical Imaging and Vision , volume=. 2015 , publisher=

  18. [18]

    2019 , eprint=

    Averaging Weights Leads to Wider Optima and Better Generalization , author=. 2019 , eprint=

  19. [19]

    Uncertainty in artificial intelligence , pages=

    Sliced score matching: A scalable approach to density and score estimation , author=. Uncertainty in artificial intelligence , pages=. 2020 , organization=

  20. [20]

    Software Impacts , volume=

    The t-digest: Efficient estimates of distributions , author=. Software Impacts , volume=. 2021 , publisher=

  21. [21]

    Computing Extremely Accurate Quantiles Using t-Digests

    Computing extremely accurate quantiles using t-digests , author=. arXiv preprint arXiv:1902.04023 , year=

  22. [22]

    arXiv preprint arXiv:1908.10693 , year=

    Ddsketch: A fast and fully-mergeable quantile sketch with relative-error guarantees , author=. arXiv preprint arXiv:1908.10693 , year=

  23. [23]

    Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units , pages=

    Comparison based sorting for systems with multiple GPUs , author=. Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units , pages=

  24. [24]

    Proceedings of the 2022 International Conference on Management of Data , pages=

    Evaluating multi-GPU sorting with modern interconnects , author=. Proceedings of the 2022 International Conference on Management of Data , pages=

  25. [25]

    Advances in Neural Information Processing Systems , volume=

    Energy-based sliced wasserstein distance , author=. Advances in Neural Information Processing Systems , volume=

  26. [26]

    Smooth regression analysis , author=. Sankhy. 1964 , publisher=

  27. [27]

    2019 international conference on intelligent computing and control systems (ICCS) , pages=

    A brief review of nearest neighbor algorithm for learning and classification , author=. 2019 international conference on intelligent computing and control systems (ICCS) , pages=. 2019 , organization=

  28. [28]

    2010 seventh international conference on fuzzy systems and knowledge discovery , volume=

    An adaptive k-nearest neighbor algorithm , author=. 2010 seventh international conference on fuzzy systems and knowledge discovery , volume=. 2010 , organization=

  29. [29]

    2006 , publisher=

    Pattern recognition and machine learning , author=. 2006 , publisher=

  30. [30]

    SIAM journal on matrix analysis and applications , volume=

    Tikhonov regularization and total least squares , author=. SIAM journal on matrix analysis and applications , volume=. 1999 , publisher=

  31. [31]

    Neural computation , volume=

    Training with noise is equivalent to Tikhonov regularization , author=. Neural computation , volume=. 1995 , publisher=

  32. [32]

    Big data , volume=

    Effects of distance measure choice on k-nearest neighbor classifier performance: a review , author=. Big data , volume=. 2019 , publisher=

  33. [33]

    International Conference on Machine Learning , pages=

    An alternative softmax operator for reinforcement learning , author=. International Conference on Machine Learning , pages=. 2017 , organization=

  34. [34]

    International Journal of Mathematical and Statistical Sciences , volume=

    Nonparametric entropy estimation: An overview , author=. International Journal of Mathematical and Statistical Sciences , volume=. 1997 , publisher=

  35. [35]

    2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003

    A new class of entropy estimators for multi-dimensional densities , author=. 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings.(ICASSP'03). , volume=. 2003 , organization=

  36. [36]

    Annals of the Institute of Statistical Mathematics , volume=

    Estimation of entropy and other functionals of a multivariate density , author=. Annals of the Institute of Statistical Mathematics , volume=. 1989 , publisher=

  37. [37]

    2018 , publisher=

    Density estimation for statistics and data analysis , author=. 2018 , publisher=

  38. [38]

    Vision Transformers Need Registers

    Vision transformers need registers , author=. arXiv preprint arXiv:2309.16588 , year=

  39. [39]

    DINOv3

    Dinov3 , author=. arXiv preprint arXiv:2508.10104 , year=

  40. [40]

    Advances in neural information processing systems , volume=

    Bootstrap your own latent-a new approach to self-supervised learning , author=. Advances in neural information processing systems , volume=

  41. [41]

    arXiv preprint arXiv:2308.00566 , year=

    Stochastic positional embeddings improve masked image modeling , author=. arXiv preprint arXiv:2308.00566 , year=

  42. [42]

    The Journal of Machine Learning Research , volume=

    Hilbert space embeddings and metrics on probability measures , author=. The Journal of Machine Learning Research , volume=. 2010 , publisher=

  43. [43]

    International conference on machine learning , pages=

    A kernel test of goodness of fit , author=. International conference on machine learning , pages=. 2016 , organization=

  44. [44]

    The journal of machine learning research , volume=

    A kernel two-sample test , author=. The journal of machine learning research , volume=. 2012 , publisher=

  45. [45]

    A. N. Kolmogorov , title =. Giornale dell'Istituto Italiano degli Attuari , volume =

  46. [46]

    1990 , publisher=

    How to test normality and other distributional assumptions , author=. 1990 , publisher=

  47. [47]

    goodness of fit

    Asymptotic theory of certain" goodness of fit" criteria based on stochastic processes , author=. The annals of mathematical statistics , pages=. 1952 , publisher=

  48. [48]

    Scandinavian Actuarial Journal , volume=

    On the composition of elementary errors: First paper: Mathematical deductions , author=. Scandinavian Actuarial Journal , volume=. 1928 , publisher=

  49. [49]

    Journal of the London Mathematical Society , volume=

    Some theorems on distribution functions , author=. Journal of the London Mathematical Society , volume=. 1936 , publisher=

  50. [50]

    Advances in Neural Information Processing Systems , volume=

    Implicit variance regularization in non-contrastive SSL , author=. Advances in Neural Information Processing Systems , volume=

  51. [51]

    DINOv2: Learning Robust Visual Features without Supervision

    Dinov2: Learning robust visual features without supervision , author=. arXiv preprint arXiv:2304.07193 , year=

  52. [52]

    arXiv preprint arXiv:2504.01017 , year=

    Scaling language-free visual representation learning , author=. arXiv preprint arXiv:2504.01017 , year=

  53. [53]

    Advances in neural information processing systems , volume=

    Big self-supervised models are strong semi-supervised learners , author=. Advances in neural information processing systems , volume=

  54. [54]

    Proceedings of the ieee/cvf International Conference on computer vision , pages=

    Scaling and benchmarking self-supervised visual representation learning , author=. Proceedings of the ieee/cvf International Conference on computer vision , pages=

  55. [55]

    Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

    Efficient Self-Supervised Learning for Earth Observation via Dynamic Dataset Curation , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

  56. [56]

    arXiv preprint arXiv:2405.15613 , year=

    Automatic data curation for self-supervised learning: A clustering-based approach , author=. arXiv preprint arXiv:2405.15613 , year=

  57. [57] Matrix information theory for self-supervised learning. arXiv preprint arXiv:2305.17326, 2023.

  58. [58] X. Liu et al. Self-supervised learning: Generative or contrastive. IEEE Transactions on Knowledge and Data Engineering, 2021.

  59. [59] R. Shwartz-Ziv, R. Balestriero, and Y. LeCun. What do we maximize in self-supervised learning? arXiv preprint arXiv:2207.10081, 2022.

  60. [60] R. Shwartz-Ziv and Y. LeCun. To compress or not to compress—self-supervised learning and information theory: A review. Entropy, 2024.

  61. [61] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 2010.

  62. [62] G. Van Horn et al. The iNaturalist species classification and detection dataset. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.

  63. [63] V. Papyan, X. Y. Han, and D. L. Donoho. Prevalence of neural collapse during the terminal phase of deep learning training. Proceedings of the National Academy of Sciences, 2020.

  64. [64] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling. Semi-supervised learning with deep generative models. Advances in Neural Information Processing Systems, 2014.

  65. [65] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 1986.

  66. [66] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 2015.

  67. [67] A. Khazatsky et al. DROID: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945, 2024.

  68. [68] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin. Emerging properties in self-supervised vision transformers. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021.

  69. [69] M. Assran et al. Self-supervised learning from images with a joint-embedding predictive architecture. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.

  70. [70] How JEPA avoids noisy features: The implicit bias of deep linear self distillation networks. Advances in Neural Information Processing Systems.

  71. [71] J. S. Bruner and L. Postman. On the perception of incongruity: A paradigm. Journal of Personality, 1949.

  72. [72] H. von Helmholtz. Handbook of Physiological Optics. Voss, Leipzig.

  73. [73] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah. Signature verification using a "siamese" time delay neural network. Advances in Neural Information Processing Systems, 1993.

  74. [74] X. Wang, H. Fan, Y. Tian, D. Kihara, and X. Chen. On the importance of asymmetry for siamese representation learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.

  75. [75] X. Chen, S. Xie, and K. He. An empirical study of training self-supervised vision transformers. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021.

  76. [76] Y. Tian, X. Chen, and S. Ganguli. Understanding self-supervised learning dynamics without contrastive pairs. International Conference on Machine Learning, 2021.

  77. [77] T. M. Cover. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Transactions on Electronic Computers, 1965.

  78. [78] L. Jing, P. Vincent, Y. LeCun, and Y. Tian. Understanding dimensional collapse in contrastive self-supervised learning. arXiv preprint arXiv:2110.09348, 2021.

  79. [79] H. von Helmholtz. Handbuch der physiologischen Optik. Voss, Leipzig, 1867.

  80. [80] R. L. Gregory. Perceptions as hypotheses. Philosophical Transactions of the Royal Society of London. B, Biological Sciences, 1980.

Showing first 80 references.