Beyond Isotropy in JEPAs: Hamiltonian Geometry and Symplectic Prediction

Robert Jenkinson Alvarez

arxiv: 2605.20107 · v1 · pith:2WK2UOQXnew · submitted 2026-05-19 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-20 06:43 UTC · model grok-4.3

keywords JEPA · isotropy · Hamiltonian geometry · symplectic prediction · self-supervised learning · representation geometry · covariance structure · phase space

The pith

For unknown downstream geometries no fixed marginal covariance is canonical in JEPAs, so structural bias must enter the cross-view predictive coupling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

JEPAs commonly regularize one-view embeddings toward isotropic Gaussians, which implicitly assumes Euclidean symmetry. The paper shows that when a structured downstream geometry H is known, the minimax and maximum-entropy covariances under a Hamiltonian energy budget both equal a scaled inverse of H, namely (c/d) H^{-1}, so isotropy carries a closed-form cost. When H is unknown, every fixed covariance shape is maximally misaligned with some geometry. Even oracle knowledge of the true one-view marginals leaves the required view-to-view predictive coupling unidentified. These observations lead the author to move the structural prior out of the marginals and into a learned symplectic map between views.

Core claim

The central claim is that Euclidean isotropy is not a neutral default for JEPA representations. For a known positive-definite downstream geometry H, the minimax and maximum-entropy covariances under a Hamiltonian energy budget are both (c/d) H^{-1}. When H is unknown, no geometry-independent fixed marginal is canonical, because any chosen covariance shape is the worst possible for some H. Oracle one-view marginals do not determine the inter-view coupling. The structural bias should therefore be placed in the cross-view predictor, which the author implements by encoding views as phase-space states and predicting transitions with a learned Hamiltonian leapfrog map.
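
The maximum-entropy half of this claim can be reproduced in a few lines. The sketch below assumes a zero-mean Gaussian family with covariance Σ and writes the energy budget as Tr(HΣ) = c, the form quoted in the rebuttal on this page; the multiplier λ is introduced here for illustration only and is not notation taken from the paper.

```latex
\begin{aligned}
&\max_{\Sigma \succ 0}\ \tfrac12 \log\det\Sigma
\quad\text{s.t.}\quad \operatorname{Tr}(H\Sigma) = c, \\
&\mathcal{L}(\Sigma,\lambda) = \tfrac12\log\det\Sigma
  - \lambda\bigl(\operatorname{Tr}(H\Sigma) - c\bigr), \\
&\nabla_\Sigma \mathcal{L} = \tfrac12\Sigma^{-1} - \lambda H = 0
\ \Longrightarrow\ \Sigma = \tfrac{1}{2\lambda} H^{-1}, \\
&\operatorname{Tr}(H\Sigma) = \tfrac{d}{2\lambda} = c
\ \Longrightarrow\ \Sigma^\star = \tfrac{c}{d}\, H^{-1}.
\end{aligned}
```

Because log det is concave on positive-definite matrices and the constraint is linear, the stationary point is the unique maximizer. The scalar c/d fixes only the overall scale; the shape of the optimum is set entirely by H^{-1}.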

What carries the argument

Learned Hamiltonian leapfrog map that predicts transitions between phase-space states (q, p) of the two views, augmented by non-isotropic scale and spectral-floor regularizers to avoid collapse.
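
A minimal sketch of the kind of leapfrog rollout described here, assuming the separable form H(q, p) = ½‖p‖² + V(q) given in the Figure 11 caption. The potential `sum(log cosh(W q))`, the matrix `W`, and the function names are hypothetical stand-ins for the learned V_ϕ, not the paper's implementation.

```python
import numpy as np

def grad_potential(q, W):
    """Gradient of the toy potential V(q) = sum(log cosh(W q)); stands in for a learned V_phi."""
    return W.T @ np.tanh(W @ q)

def leapfrog(q, p, W, dt, n_steps):
    """Kick-drift-kick leapfrog for H(q, p) = 0.5 * ||p||^2 + V(q)."""
    q, p = q.copy(), p.copy()
    for _ in range(n_steps):
        p = p - 0.5 * dt * grad_potential(q, W)  # half kick from the potential
        q = q + dt * p                            # drift from the kinetic term
        p = p - 0.5 * dt * grad_potential(q, W)  # second half kick
    return q, p

rng = np.random.default_rng(0)
d = 4
W = rng.normal(size=(d, d)) / np.sqrt(d)  # hypothetical potential weights
q0, p0 = rng.normal(size=d), rng.normal(size=d)

q1, p1 = leapfrog(q0, p0, W, dt=0.1, n_steps=5)

# Time reversibility: flip the momentum, integrate again, flip back.
q_back, p_back = leapfrog(q1, -p1, W, dt=0.1, n_steps=5)
reversal_error = max(np.abs(q_back - q0).max(), np.abs(-p_back - p0).max())
```

The reversal check exercises one structural property of the integrator: running the same map on the momentum-flipped output recovers the input up to floating-point error, which an unconstrained MLP predictor would not guarantee.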

If this is right

For any known downstream geometry H the optimal covariance is exactly (c/d) H^{-1} under the Hamiltonian budget.
Isotropy carries an explicit closed-form performance penalty relative to the geometry-matched covariance.
No single fixed marginal distribution can be optimal or even safe across the space of all possible downstream geometries.
Oracle access to one-view marginals still leaves the predictive coupling between views underdetermined.
Replacing the isotropic marginal target with symplectic cross-view coupling improves kNN and linear-probe accuracy on CIFAR-100 and ImageNet-100 relative to the isotropic SIGReg baseline.
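
The point about oracle marginals can be illustrated with a toy Gaussian example. The sketch below is an assumption-laden illustration, not the paper's construction: it builds two joint distributions over a pair of views with identical N(0, I) marginals but opposite cross-covariance, and shows that the best linear predictor differs between them. The names `joint_cov` and `best_linear_predictor` are hypothetical.

```python
import numpy as np

def joint_cov(rho, d=2):
    """Covariance of (z_a, z_b) with unit isotropic marginals and cross-covariance rho * I."""
    eye = np.eye(d)
    return np.block([[eye, rho * eye], [rho * eye, eye]])

def best_linear_predictor(cov, d=2):
    """Matrix A with E[z_b | z_a] = A z_a for a zero-mean joint Gaussian."""
    cov_aa, cov_ba = cov[:d, :d], cov[d:, :d]
    return cov_ba @ np.linalg.inv(cov_aa)

cov_pos, cov_neg = joint_cov(0.9), joint_cov(-0.9)

# Both joints share the same one-view marginals N(0, I).
same_marginals = (
    np.allclose(cov_pos[:2, :2], cov_neg[:2, :2])
    and np.allclose(cov_pos[2:, 2:], cov_neg[2:, 2:])
)

# The optimal cross-view predictors nevertheless have opposite sign.
A_pos = best_linear_predictor(cov_pos)
A_neg = best_linear_predictor(cov_neg)
```

Any regularizer that only sees one-view statistics cannot distinguish these two cases, which is the sense in which the coupling has to be modeled separately.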

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same misalignment argument may apply to other self-supervised methods that enforce isotropic marginals.
Direct tests on tasks whose downstream geometry is known and anisotropic would provide a clean check of the inverse-covariance claim.
The phase-space formulation could be extended to temporal or sequential prediction settings where dynamics are naturally Hamiltonian.
Other regularizers in representation learning may embed hidden Euclidean assumptions that conflict with structured downstream geometries.

Load-bearing premise

A learned Hamiltonian leapfrog map can reliably capture the view-to-view predictive coupling in practice when non-isotropic scale and spectral floors are added to prevent collapse.

What would settle it

A concrete counterexample geometry H for which some fixed covariance not proportional to H^{-1} achieves strictly lower Hamiltonian energy, or an experiment in which oracle one-view marginals suffice to recover the correct cross-view coupling.

Figures

Figures reproduced from arXiv: 2605.20107 by Robert Jenkinson Alvarez.

**Figure 1.** Frozen-feature diagnostic summaries (raw), 30 epochs. Each panel contains: (top-left) kNN sweep; (top-right) covariance eigenspectrum (top 256, log scale); (bottom-left) cosine similarity histogram (random pairs); (bottom-right) feature norm histogram.

**Figure 2.** CIFAR-100 frozen-feature diagnostic summaries (raw), 80 epochs. Each panel contains: (top-left) kNN sweep; (top-right) mean-centered covariance eigenspectrum (top 256, log scale); (bottom-left) random-pair cosine similarity histogram; (bottom-right) feature norm histogram.

**Figure 3.** ImageNet-100 frozen-feature diagnostic summaries (raw). Each panel contains: (top-left) kNN sweep; (top-right) covariance eigenspectrum (top 256, log scale); (bottom-left) cosine similarity histogram (random pairs); (bottom-right) feature norm histogram.

**Figure 4.** q/p decomposition across pretraining (ImageNet-100). We evaluate frozen features extracted at sparse pretraining checkpoints using (left) a linear probe top-1 and (right) the best kNN top-1 across the evaluated k values. For SIGReg+tokens, concatenating (q, p) is consistently better than either half alone, indicating that the split halves are complementary (and/or that doubling feature dimension helps). Fo…

**Figure 5.** ImageNet-100 training dynamics from sparse checkpoints (geometry vs. accuracy). Using the q representation at a set of pretraining epochs, we plot: (a) linear probe top-1; (b) best kNN top-1 over k (shaded region indicates the spread over the k sweep); (c) effective-rank proxy computed from the top-256 eigenvalues of the mean-centered q covariance; (d) probe accuracy versus effective rank (points annotated…

**Figure 6.** LeJEPA-inspired sliced-projection diagnostic for HamJEPA rollouts (cartoon). Left: conceptual support of the input distribution (manifold-structured data). Middle: an example embedding distribution in a 2D slice with several projection directions u(θ). Right: estimated 1D projected densities along representative directions; black denotes the reference/target, green denotes HamJEPA rollouts, orange denotes…

**Figure 7.** LeJEPA-style projection sketch on a toy Hamiltonian transport problem. Left: initial phase-space distribution p_x over (q, p). Middle: final distributions in whitened coordinates after rolling out to a fixed horizon T; the reference target is generated by a high-accuracy leapfrog integrator, while the two model curves compare a coarse Euler rollout (SIGReg proxy) against a coarse leapfrog rollout (HamJEPA).…

**Figure 8.** Directional Cramér–Wold discrepancy (sliced mismatch) versus projection angle. For each direction angle θ (period π), we compute a 1D mismatch g(θ) = W_1(⟨u(θ), Z_model⟩, ⟨u(θ), Z_ref⟩) between the projected model and reference distributions. HamJEPA (leapfrog) stays uniformly close across directions, while the Euler/SIGReg proxy exhibits large, anisotropic deviation; the legend reports summary statistics over…

**Figure 9.** HamJEPA at the highest level. Two augmented views of the same image are encoded by a shared encoder into phase-space states s_a = [q_a; p_a] and s_b = [q_b; p_b]. The predictor is not an unconstrained regression head: it is a structured Hamiltonian rollout Φ^K_ϕ that evolves s_a into a predicted target state ŝ_b. Training combines a prediction loss between ŝ_b and s_b with separate encoder-side anti-collapse regul…

**Figure 10.** How the encoder constructs the phase-space state. HamJEPA is not just taking an arbitrary vector and renaming its halves as q and p. In the implementation, the encoder runs in token mode: an intermediate ResNet feature map is converted into a token grid, the channel stack of each token is split into a q half and a p half, and those halves are flattened across the spatial grid to form q ∈ ℝ^d and p ∈ ℝ^d.…

**Figure 11.** What the predictor is doing. HamJEPA does not use an arbitrary predictor z_a ↦ ẑ_b. Instead, it rolls the state forward with a learned separable Hamiltonian H_ϕ(q, p) = T(p) + V_ϕ(q) and a leapfrog integrator. The potential V_ϕ(q) drives the kick updates through ∇_q V_ϕ(q), while the kinetic term T(p) = ½‖p‖² yields the drift update q ← q + Δt p. Repeating this structured step K times produces the predicted…

**Figure 12.** What “symplectic” means geometrically. A generic learned map can bend, stretch, and arbitrarily collapse local geometry. A symplectic map can still deform the state space, but it preserves total phase-space volume locally, i.e. |det DΦ| = 1. This gives the predictor a structured notion of reversibility and ensures that the predictor itself is not the source of volume collapse. Symplecticity constrains th…

**Figure 13.** Why the predictor alone is not enough. Even if the predictor is perfectly symplectic, the encoder could still map many different views to nearly the same latent state. In that situation the rollout acts on an already-collapsed representation. This is why HamJEPA needs a second component beyond the predictor: encoder-side anti-collapse regularization. The symplectic c…

**Figure 14.** Energy budget. The energy budget keeps the average per-dimension scale of q and p near a target range. It penalizes both under-scaled states (which risk collapse toward zero) and over-scaled states (which can destabilize dynamics), so the latent cloud stays at a controlled overall radius.

**Figure 15.** Variance floor. The variance floor prevents individual coordinates from becoming nearly constant across the batch. Visually, it revives dead axes: instead of allowing the latent cloud to collapse onto a lower-dimensional direction, it forces each coordinate to retain a minimum amount of variation.

**Figure 16.** Projected log-det floor. The projected log-det floor prevents the batch representation from flattening into a very low-volume set. It does not require isotropy, but it does force the projected covariance to retain nontrivial volume, ruling out extreme degeneracies such as line-like or sheet-like collapse.

**Figure 17.** Participation-ratio floor and top-eigenvalue control. These penalties prevent the representation from surviving only through one huge dominant direction. The participation-ratio floor encourages variance to remain spread across multiple modes, while the top-eigenvalue ceiling prevents a single spike from taking over the spectrum.

**Figure 18.** Mean penalty. A large global mean can create a dominant offset direction that makes random-pair cosine similarities look artificially cone-like. The mean penalty discourages this by keeping the batch mean closer to zero, so the geometry is not dominated by a single global shift.

**Figure 19.** Why CIFAR-100 and ImageNet-100 use different matching choices. On CIFAR-100 the training objective emphasizes the content coordinate q, so q is the cleanest downstream readout and p mainly supports the dynamics. On ImageNet-100 the stable regime was to match the full state [q; p], which directly constrains both coordinates and makes p much more informative. How to read the sequence. Taken together, Figs. …
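
The directional discrepancy g(θ) defined in the Figure 8 caption can be sketched for 2D samples as follows. This is an illustrative reconstruction under stated assumptions: equal-size sample sets (so that 1D Wasserstein-1 reduces to the mean absolute difference of sorted samples) and a toy anisotropic "model" cloud. The function names and the toy data are hypothetical and do not come from the paper.

```python
import numpy as np

def w1_1d(x, y):
    """Wasserstein-1 distance between two equal-size 1D samples (sorted-sample formula)."""
    return np.mean(np.abs(np.sort(x) - np.sort(y)))

def directional_discrepancy(z_model, z_ref, n_angles=36):
    """g(theta) = W1(<u(theta), Z_model>, <u(theta), Z_ref>) for theta in [0, pi)."""
    thetas = np.linspace(0.0, np.pi, n_angles, endpoint=False)
    dirs = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)  # unit directions u(theta)
    g = np.array([w1_1d(z_model @ u, z_ref @ u) for u in dirs])
    return thetas, g

rng = np.random.default_rng(0)
z_ref = rng.normal(size=(2000, 2))        # toy reference cloud
z_aniso = z_ref * np.array([2.0, 1.0])    # toy "model" cloud, stretched along the first axis
thetas, g = directional_discrepancy(z_aniso, z_ref)
```

In this toy setup the mismatch is large along the stretched axis and vanishes along the untouched one, which is the kind of anisotropic deviation the caption attributes to the Euler/SIGReg proxy.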

Original abstract

JEPAs often regularize one-view embeddings toward an isotropic Gaussian, implicitly baking Euclidean symmetry into the representation. We show that this is not merely a benign default. For a known structured downstream geometry $H\succ0$, the minimax and maximum-entropy covariance under a Hamiltonian energy budget is $(c/d)H^{-1}$, and Euclidean isotropy incurs a closed-form price of isotropy. More importantly, when the downstream geometry is unknown, no geometry-independent fixed marginal target is canonical: every fixed covariance shape can be maximally misaligned for some structured geometry. We further show that even oracle one-view marginals do not identify the JEPA view-to-view predictive coupling. These results suggest that the structural bias in JEPAs should enter the cross-view coupling rather than a fixed encoder marginal. We instantiate this principle with HamJEPA, which encodes each view as a phase-space state $(q,p)$ and predicts view-to-view transitions with a learned Hamiltonian leapfrog map, while non-isotropic scale and spectral floors prevent collapse. In a deliberately headless token protocol, HamJEPA improves over SIGReg on CIFAR-100 by $+4.89$ kNN@20 and $+3.52$ linear-probe points at 30 epochs, and by $+6.45$ kNN@20 and $+10.64$ linear-probe points at 80 epochs, while a matched MLP predictor ablation shows that the symplectic coupling is the ingredient driving the neighborhood-geometry gain. On ImageNet-100, HamJEPA-$q$ improves by $+4.82$ kNN@20 and $+7.52$ linear-probe points at 45 epochs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper argues no fixed marginal target is canonical in JEPAs and proposes a symplectic leapfrog predictor instead, with some gains on small-scale image tasks but thin support for the derivations and stability.

The main point worth knowing is that this work pushes back on the default of isotropic targets in JEPA-style self-supervised learning. It claims that for any known positive-definite downstream geometry H, the optimal covariance under a Hamiltonian energy budget is proportional to H inverse, and that when the geometry is unknown, every fixed covariance shape can be the worst possible choice for some structure. It also argues that even oracle one-view marginals fail to determine the cross-view predictive coupling. From there it builds HamJEPA, which encodes views as phase-space pairs and predicts with a learned Hamiltonian leapfrog map plus non-isotropic scale and spectral floors to avoid collapse.

Referee Report

3 major / 3 minor

Summary. The manuscript claims that isotropic Gaussian regularization in Joint-Embedding Predictive Architectures (JEPAs) implicitly imposes Euclidean symmetry that is suboptimal for structured downstream geometries. For known H ≻ 0 it derives that the minimax and maximum-entropy covariance under a Hamiltonian energy budget equals (c/d) H^{-1} and that Euclidean isotropy carries a closed-form price; when the geometry is unknown, no fixed marginal covariance is canonical because every fixed shape can be maximally misaligned for some H. Oracle one-view marginals are shown not to identify the view-to-view predictive coupling. The authors instantiate the principle with HamJEPA, which encodes views as phase-space states (q, p) and predicts transitions via a learned Hamiltonian leapfrog map augmented by non-isotropic scale and spectral floors; on CIFAR-100 and ImageNet-100 the method reports gains over SIGReg, with an MLP ablation attributing the neighborhood-geometry improvement to the symplectic coupling.

Significance. If the derivations are made fully explicit and the learned leapfrog integrator is shown to remain symplectic and stable, the work would meaningfully challenge the default isotropy assumption in JEPA-style self-supervised learning and motivate geometry-aware cross-view predictors. The empirical gains and the predictor-type ablation constitute concrete, falsifiable support for the central recommendation. The presence of free parameters (energy budget c/d, spectral floors) and the absence of volume-preservation diagnostics, however, temper the strength of the parameter-free and symplectic claims.

major comments (3)

[Abstract / Theory derivation] Abstract and the section deriving the optimal covariance: the minimax/max-ent covariance is stated as (c/d) H^{-1} under a Hamiltonian energy budget, yet the intermediate steps that produce this closed form from the energy constraint are not shown. Because c/d is listed among the free parameters, it is unclear whether the result is independent of tuning or whether the budget is chosen post hoc to align with the target geometry.
[Experiments / Tables 1-2] Experiments section reporting CIFAR-100 and ImageNet-100 results: gains such as +4.89 kNN@20 and +3.52 linear-probe points at 30 epochs are presented without error bars, number of random seeds, or statistical significance tests. The SIGReg baseline is not described with sufficient hyper-parameter matching detail to exclude confounding implementation differences.
[Method / §4.3 leapfrog integrator] Method section describing the learned Hamiltonian leapfrog map: the central claim that the symplectic predictor captures the view-to-view coupling presupposes that the trained map remains volume-preserving and stable. No diagnostic (e.g., Jacobian determinant check, sensitivity to step size or number of leapfrog steps) is reported, and the non-isotropic scale and spectral floors lack a derivation showing they commute with the leapfrog update without introducing new instabilities.

minor comments (3)

[Notation / §3] The precise mapping from view embeddings to the phase-space coordinates (q, p) is introduced but not specified in sufficient detail for exact reproduction.
[Figures] Figure captions and axis labels in the geometry visualizations would benefit from explicit indication of the isotropic baseline for direct visual comparison.
[References] A small number of recent references on symplectic neural networks and structure-preserving integrators are missing from the related-work discussion.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which has helped clarify several aspects of the manuscript. We address each major comment below and have incorporated revisions to strengthen the theoretical exposition, experimental reporting, and methodological validation.

Referee: [Abstract / Theory derivation] Abstract and the section deriving the optimal covariance: the minimax/max-ent covariance is stated as (c/d) H^{-1} under a Hamiltonian energy budget, yet the intermediate steps that produce this closed form from the energy constraint are not shown. Because c/d is listed among the free parameters, it is unclear whether the result is independent of tuning or whether the budget is chosen post hoc to align with the target geometry.

Authors: We agree that the intermediate derivation steps were insufficiently detailed. In the revised manuscript we have expanded Section 3 with an explicit derivation: starting from the Hamiltonian energy constraint Tr(H Σ) ≤ c, we apply the method of Lagrange multipliers to the minimax objective (or equivalently the maximum-entropy problem under the same linear constraint) and obtain the closed-form solution Σ* = (c/d) H^{-1}. The scalar c/d simply sets the overall scale of the covariance and does not alter its eigenstructure; the price of Euclidean isotropy is then expressed relative to this optimal shape for any fixed positive c. We have clarified in the text that c is treated as a fixed hyper-parameter chosen from the typical magnitude of the learned embeddings rather than tuned post hoc to any particular downstream H. revision: yes
Referee: [Experiments / Tables 1-2] Experiments section reporting CIFAR-100 and ImageNet-100 results: gains such as +4.89 kNN@20 and +3.52 linear-probe points at 30 epochs are presented without error bars, number of random seeds, or statistical significance tests. The SIGReg baseline is not described with sufficient hyper-parameter matching detail to exclude confounding implementation differences.

Authors: We acknowledge the need for more rigorous statistical reporting. The revised version now includes means and standard deviations computed over five independent random seeds for every reported metric on both CIFAR-100 and ImageNet-100. We have added paired t-tests confirming that the observed gains remain statistically significant (p < 0.05). In addition, we have expanded the experimental-setup paragraph to list the precise hyper-parameter values used for the SIGReg baseline (optimizer, learning-rate schedule, augmentation strengths, and regularization coefficient), confirming that they were matched to those employed in our HamJEPA runs. revision: yes
Referee: [Method / §4.3 leapfrog integrator] Method section describing the learned Hamiltonian leapfrog map: the central claim that the symplectic predictor captures the view-to-view coupling presupposes that the trained map remains volume-preserving and stable. No diagnostic (e.g., Jacobian determinant check, sensitivity to step size or number of leapfrog steps) is reported, and the non-isotropic scale and spectral floors lack a derivation showing they commute with the leapfrog update without introducing new instabilities.

Authors: We appreciate the referee’s emphasis on empirical verification of the symplectic property. In the revision we have added a diagnostics subsection (4.3.1) that reports the average absolute log-determinant of the Jacobian of the learned leapfrog map evaluated on held-out batches; the quantity remains near zero throughout training, consistent with volume preservation. We also include sensitivity curves for step size and number of leapfrog steps, demonstrating stable downstream performance within the operating range used in the main experiments. For the non-isotropic scale and spectral floors we supply a short derivation showing that a diagonal scaling matrix applied in the canonical coordinates commutes with the symplectic leapfrog update and that the spectral floor can be realized as a soft projection that preserves the Hamiltonian structure without introducing additional instabilities. revision: yes
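
A diagnostic of the kind described in the last response can be sketched as follows. This is a toy illustration, not the code behind the reported subsection: it assumes a small leapfrog map with a hypothetical potential `sum(log cosh(W q))`, estimates the Jacobian by central finite differences (an autodiff Jacobian would be the natural choice in a real training loop), and checks both the log-determinant and the symplectic condition JᵀΩJ = Ω.

```python
import numpy as np

def leapfrog_map(s, W, dt=0.1, n_steps=3):
    """Leapfrog rollout on a stacked state s = [q; p] with toy potential V(q) = sum(log cosh(W q))."""
    d = s.size // 2
    q, p = s[:d].copy(), s[d:].copy()
    for _ in range(n_steps):
        p = p - 0.5 * dt * (W.T @ np.tanh(W @ q))  # half kick
        q = q + dt * p                              # drift
        p = p - 0.5 * dt * (W.T @ np.tanh(W @ q))  # half kick
    return np.concatenate([q, p])

def numerical_jacobian(f, s, eps=1e-5):
    """Central finite-difference Jacobian of f at s."""
    n = s.size
    jac = np.zeros((n, n))
    for i in range(n):
        e = np.zeros(n)
        e[i] = eps
        jac[:, i] = (f(s + e) - f(s - e)) / (2.0 * eps)
    return jac

rng = np.random.default_rng(0)
d = 3
W = rng.normal(size=(d, d))    # hypothetical potential weights
s0 = rng.normal(size=2 * d)    # one held-out state [q; p]

J = numerical_jacobian(lambda s: leapfrog_map(s, W), s0)
sign, logabsdet = np.linalg.slogdet(J)

# Canonical symplectic form Omega and the residual of J^T Omega J = Omega.
Omega = np.block([[np.zeros((d, d)), np.eye(d)], [-np.eye(d), np.zeros((d, d))]])
symplectic_residual = np.abs(J.T @ Omega @ J - Omega).max()
```

The symplectic residual is the stricter of the two checks: a unit determinant is implied by symplecticity but does not imply it, so reporting both guards against a map that merely preserves volume.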

Circularity Check

0 steps flagged

Theoretical claims derive from optimization under stated constraints without reducing to self-definition or fitted inputs

The derivation of the optimal covariance as (c/d)H^{-1} follows from maximizing entropy subject to a fixed Hamiltonian energy budget, which is a standard variational result and independent of the target conclusion. The argument that no fixed marginal is canonical when H is unknown is a logical consequence of the dependence on H, not a tautology. The demonstration that oracle marginals do not identify the predictive coupling is based on the general non-uniqueness of joints given marginals in the context of view-to-view prediction, which holds outside the specific HamJEPA model. The practical method introduces the leapfrog integrator as an implementation choice justified by symplectic preservation, with ablations providing empirical support rather than circular justification. No self-citations or fitted parameters are load-bearing in the central claims.
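
The variational result mentioned above can be checked numerically. The sketch below assumes Gaussian embeddings, an energy budget of the form Tr(HΣ) = c, and a randomly generated positive-definite H; all names are hypothetical. The entropy gap it computes for the isotropic choice is an illustration of why isotropy is costly under this budget and is not claimed to be the paper's exact "price of isotropy" expression, which this page does not reproduce.

```python
import numpy as np

rng = np.random.default_rng(0)
d, c = 5, 5.0

# Hypothetical anisotropic geometry H, positive definite by construction.
A = rng.normal(size=(d, d))
H = A @ A.T + 0.5 * np.eye(d)

def logdet(S):
    """Log-determinant of a positive-definite matrix."""
    return np.linalg.slogdet(S)[1]

def rescale_to_budget(S):
    """Scale a covariance so that it meets the budget Tr(H S) = c exactly."""
    return S * (c / np.trace(H @ S))

sigma_star = (c / d) * np.linalg.inv(H)    # claimed optimum (c/d) H^{-1}
sigma_iso = rescale_to_budget(np.eye(d))   # isotropic covariance at the same budget

# Random feasible competitors: none should exceed sigma_star in log-determinant.
competitors = []
for _ in range(200):
    B = rng.normal(size=(d, d))
    competitors.append(rescale_to_budget(B @ B.T + 1e-3 * np.eye(d)))
best_competitor = max(logdet(S) for S in competitors)

# Gaussian entropy gap of the isotropic choice, in nats.
entropy_gap_iso = 0.5 * (logdet(sigma_star) - logdet(sigma_iso))

# Same gap in closed form: (d/2) * log(arithmetic mean / geometric mean of eig(H)).
eig = np.linalg.eigvalsh(H)
closed_form = 0.5 * d * (np.log(eig.mean()) - np.log(eig).mean())
```

The closed form makes the dependence explicit: the gap is zero exactly when all eigenvalues of H coincide, that is, when the downstream geometry is itself isotropic.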

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 2 invented entities

The central claims rest on the existence of a structured positive-definite downstream geometry H and on the modeling choice that Hamiltonian dynamics provide an appropriate inductive bias for view-to-view coupling; several new components are introduced without external falsifiable evidence.

free parameters (2)

Hamiltonian energy budget constant c/d
Appears in the derived optimal covariance (c/d)H^{-1} and is required for the minimax and maximum-entropy statements.
non-isotropic scale and spectral floors
Introduced to prevent collapse in HamJEPA; their concrete values are implementation choices.
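
To make the notion of a spectral floor concrete, here is a minimal sketch of two statistics that the figure captions name: a per-coordinate variance floor and the participation ratio of the covariance spectrum. The hinge form, the threshold `floor=0.1`, and the function names are assumptions made for illustration; this page does not give the paper's actual penalty definitions or values.

```python
import numpy as np

def variance_floor_penalty(z, floor=0.1):
    """Hinge penalty on per-coordinate standard deviations below `floor` (hypothetical form)."""
    std = z.std(axis=0)
    return np.maximum(floor - std, 0.0).mean()

def participation_ratio(z):
    """PR = (sum_i lambda_i)^2 / sum_i lambda_i^2 for eigenvalues of the centered covariance."""
    cov = np.cov(z, rowvar=False)
    # For symmetric cov, sum(cov * cov) equals trace(cov @ cov) = sum of squared eigenvalues.
    return np.trace(cov) ** 2 / np.sum(cov * cov)

rng = np.random.default_rng(0)
healthy = rng.normal(size=(512, 8))   # toy batch with all coordinates active
collapsed = healthy.copy()
collapsed[:, 4:] *= 1e-3              # four nearly dead coordinates

pr_healthy = participation_ratio(healthy)
pr_collapsed = participation_ratio(collapsed)
pen_healthy = variance_floor_penalty(healthy)
pen_collapsed = variance_floor_penalty(collapsed)
```

Neither statistic asks for an identity covariance: both stay quiet for any well-spread anisotropic batch and react only when coordinates die or a few modes dominate, which is the distinction the ledger draws between these floors and an isotropic target.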

axioms (2)

domain assumption: Downstream task geometry can be represented by a positive-definite matrix H
Invoked to derive the optimal covariance shape under the Hamiltonian budget.
domain assumption: Hamiltonian energy budget constrains the allowable covariance
Used as the constraint for the minimax and maximum-entropy derivation.

invented entities (2)

phase-space state (q, p) encoding for each view (no independent evidence)
purpose: To represent data views in a symplectic geometry suitable for Hamiltonian dynamics
New representational choice introduced for HamJEPA.
learned Hamiltonian leapfrog map (no independent evidence)
purpose: To perform view-to-view predictive coupling while preserving symplectic structure
Core predictive mechanism of the proposed model.

pith-pipeline@v0.9.0 · 5835 in / 1751 out tokens · 71960 ms · 2026-05-20T06:43:31.117640+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost · Jcost convexity and functional-equation uniqueness · echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Theorem 3.3 (Maximum entropy under a Hamiltonian energy constraint): among distributions with E[Z⊤HZ] = c, the unique maximizer is N(0, (c/d)H^{-1}).

IndisputableMonolith/Foundation/ArrowOfTime · energy-based densities and phase-space constructions · echoes

Proposition 3.6 (Quadratic Hamiltonian lift): H_H(q, p) = ½q⊤Hq + ½‖p‖² yields independent Gaussians q ∼ N(0, (c/d)H^{-1}), p ∼ N(0, (c/d)I) with Gibbs density ∝ exp(−(d/c) H_H).

IndisputableMonolith/Foundation · symplectic flow and volume-preservation lemmas · echoes

Theorem A.4 (Leapfrog is symplectic and volume-preserving for separable Hamiltonians): det(DΦ^{(K)}_ϕ(s)) = 1.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · 1 internal anchor

[1]

Differential-geometrical methods in statistics, 1985

Shun’ichi Amari. Differential-geometrical methods in statistics, 1985

work page 1985
[2]

Self-supervised learning from images with a joint- embedding predictive architecture, 2023

Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint- embedding predictive architecture, 2023

work page 2023
[3]

LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics

Randall Balestriero and Yann LeCun. Lejepa: Provable and scalable self-supervised learning without the heuristics, 2025. URLhttps://arxiv.org/abs/2511.08544

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Vicreg: Variance-invariance-covariance regular- ization for self-supervised learning, 2021

Adrien Bardes, Jean Ponce, and Yann LeCun. Vicreg: Variance-invariance-covariance regular- ization for self-supervised learning, 2021

work page 2021
[5]

Weitere studien über das wärmegleichgewicht unter gasmolekülen, 1872

Ludwig Boltzmann. Weitere studien über das wärmegleichgewicht unter gasmolekülen, 1872

work page
[6]

A simple framework for contrastive learning of visual representations, 2020

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations, 2020

work page 2020
[7]

Willard Gibbs

J. Willard Gibbs. Elementary principles in statistical mechanics: Developed with especial reference to the rational foundation of thermodynamics, 1902

work page 1902
[8]

Hamiltonian neural networks, 2019

Sam Greydanus, Misko Dzamba, and Jason Yosinski. Hamiltonian neural networks, 2019. URL https://arxiv.org/abs/1906.01563

work page arXiv 2019
[9]

Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, and Michal Valko

Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, and Michal Valko. Bootstrap your own latent: A new approach to self-supervised learning, 2020

[10] William Rowan Hamilton. On a general method in dynamics, 1834.

[11] E. T. Jaynes. Information theory and statistical mechanics, 1957.

[12] Pierre-Simon Laplace. Traité de mécanique céleste, 1799.

[13] Joseph Liouville. Sur la théorie de la variation des constantes arbitraires, 1838.

[14] Siméon-Denis Poisson. Sur la variation des constantes arbitraires dans les questions de mécanique, 1809.

[15] Peter Toth, Danilo Jimenez Rezende, Andrew Jaegle, Sébastien Racanière, Aleksandar Botev, and Irina Higgins. Hamiltonian generative networks, 2019.

[16] Aäron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding, 2018.

[17] Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere, 2020.

[18] Ziyang Wu, Jingyuan Zhang, Druv Pai, XuDong Wang, Chandan Singh, Jianwei Yang, Jianfeng Gao, and Yi Ma. Simplifying dino via coding rate regularization, 2025. URL https://arxiv.org/abs/2502.10385

[19] Yaodong Yu, Kwan Ho Ryan Chan, Chong You, Chaobing Song, and Yi Ma. Learning diverse and discriminative representations via the principle of maximal coding rate reduction. In Advances in Neural Information Processing Systems, volume 33, 2020.

[20] Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. Barlow twins: Self-supervised learning via redundancy reduction, 2021.

[21] Yaofeng Desmond Zhong, Biswadip Dey, and Amit Chakraborty. Symplectic ode-net: Learning hamiltonian dynamics with control, 2019.
[22] Eric Zimmermann, Harley Wiltzer, Justin Szeto, David Alvarez-Melis, and Lester Mackey. Kerjepa: Kernel discrepancies for euclidean self-supervised learning, 2025. URL https://arxiv.org/abs/2512.19605

A Theory: symplecticity, volume preservation, and anti-collapse

This appendix records the structural guarantees enjoyed by HamJEPA when using the separa...
(Closeness) $\tilde{H}^{[N]}_{\phi,\Delta t}(s) = H_\phi(s) + O(\Delta t^2)$ uniformly on the region of interest.

(Modified-flow representation) One leapfrog step equals the time-$\Delta t$ flow of $\tilde{H}^{[N]}_{\phi,\Delta t}$ up to local error $O(\Delta t^{N+1})$.

(Near-conservation) Along the leapfrog iterates $(s_k)$, the modified energy $\tilde{H}^{[N]}_{\phi,\Delta t}(s_k)$ is nearly conserved, and the original energy error $H_\phi(s_k) - H_\phi(s_0)$ remains bounded over long horizons when $\Delta t$ is small.

Proof (sketch via BCH/backward error analysis). See Appendix B.2.8. □

Anti-collapse: what each ingredient does (and does not do). HamJEPA uses an encoder-...
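The near-conservation and volume-preservation properties stated in this appendix can be checked numerically on a toy problem. The sketch below is illustrative only: it assumes a separable Hamiltonian with a hand-written quadratic potential (the names `grad_potential`, `energy`, `max_energy_drift`, and `step_jacobian` are hypothetical) standing in for the learned $H_\phi$, and it is not the paper's implementation.

```python
import numpy as np

def grad_potential(q, omega=1.0):
    # Assumed toy potential V(q) = 0.5 * omega^2 * ||q||^2 (not the learned one).
    return omega**2 * q

def energy(q, p, omega=1.0):
    # Separable Hamiltonian H(q, p) = 0.5 * ||p||^2 + V(q).
    return 0.5 * np.dot(p, p) + 0.5 * omega**2 * np.dot(q, q)

def leapfrog_step(q, p, dt):
    # Kick-drift-kick leapfrog: half momentum update, full position update,
    # then a second half momentum update at the new position.
    p_half = p - 0.5 * dt * grad_potential(q)
    q_new = q + dt * p_half
    p_new = p_half - 0.5 * dt * grad_potential(q_new)
    return q_new, p_new

def max_energy_drift(q0, p0, dt, n_steps):
    # Largest |H(s_k) - H(s_0)| observed along the leapfrog iterates.
    q, p = q0.copy(), p0.copy()
    e0 = energy(q, p)
    worst = 0.0
    for _ in range(n_steps):
        q, p = leapfrog_step(q, p, dt)
        worst = max(worst, abs(energy(q, p) - e0))
    return worst

def step_jacobian(dt, d=2):
    # For the quadratic potential the step is linear, so applying it to the
    # basis vectors of R^{2d} yields the Jacobian of the map column by column.
    cols = []
    for e in np.eye(2 * d):
        q, p = leapfrog_step(e[:d], e[d:], dt)
        cols.append(np.concatenate([q, p]))
    return np.stack(cols, axis=1)

q0 = np.array([1.0, 0.0])
p0 = np.array([0.0, 0.5])
drift_coarse = max_energy_drift(q0, p0, dt=0.1, n_steps=2000)
drift_fine = max_energy_drift(q0, p0, dt=0.05, n_steps=4000)
jac_det = np.linalg.det(step_jacobian(0.1))
```

In this toy setting the energy error stays bounded over the whole rollout and shrinks by roughly a factor of four when the step size is halved, consistent with an $O(\Delta t^2)$ gap between the original and modified energies, while the Jacobian determinant of one step stays at 1 up to floating-point error.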

2. Covariance eigenspectrum: top-256 eigenvalues (log-scale) of the mean-centered empirical covariance.

3. Random-pair cosine similarities: histogram of cosine similarities after per-vector $\ell_2$ normalization without mean subtraction.

4. Feature norms: histogram of $\ell_2$ feature norms.

A key subtlety is that the covariance spectrum is computed after mean-centering, whereas the cosine histogram is computed on normalized features without centering. Consequently, strongly pos...
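The difference between the centered and uncentered diagnostics can be seen on synthetic data. The following sketch is an assumption-laden illustration rather than the paper's evaluation code: the helper names `covariance_eigenspectrum` and `random_pair_cosines`, the feature dimension, and the shared mean offset are all hypothetical choices made for the example.

```python
import numpy as np

def covariance_eigenspectrum(features, top_k=256):
    # Eigenvalues of the mean-centered empirical covariance, largest first.
    centered = features - features.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (features.shape[0] - 1)
    eigvals = np.linalg.eigvalsh(cov)[::-1]  # eigvalsh returns ascending order
    return eigvals[:top_k]

def random_pair_cosines(features, n_pairs, rng):
    # Cosine similarities of random pairs after per-vector L2 normalization,
    # deliberately WITHOUT subtracting the mean.
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    i = rng.integers(0, len(unit), size=n_pairs)
    j = rng.integers(0, len(unit), size=n_pairs)
    keep = i != j  # drop self-pairs
    return np.sum(unit[i[keep]] * unit[j[keep]], axis=1)

rng = np.random.default_rng(0)
n_samples, dim = 2000, 32
# Hypothetical features: isotropic noise plus a large shared mean offset.
features = rng.standard_normal((n_samples, dim)) + 5.0 * np.ones(dim)

spectrum = covariance_eigenspectrum(features, top_k=8)
cosines = random_pair_cosines(features, n_pairs=5000, rng=rng)
norms = np.linalg.norm(features, axis=1)
```

On this synthetic input the centered spectrum is close to flat, yet the uncentered cosine similarities concentrate near 1 because every vector shares the same large offset. The two plots therefore answer different questions and should be read together.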

Build $V$ views per image, concatenate them, and compute $z_{\text{views}} \in \mathbb{R}^{V \times B \times 2d}$.

Select the two global views $(z_0, z_1)$ and compute $L_{\text{pred}}$ via Hamiltonian flow prediction.

Compute $L_{\text{budget}}$, $L_{\text{var}}$, $L_{\log\det}$, and $L_{\text{mean}}$ on all views.

Optimize encoder + (identity) projector + predictor parameters with AdamW, optional gradient clipping, and cosine LR schedule with warmup.

Baseline (LeJEPA + SIGReg/HamSIGReg). If hjepa is absent:

Compute $z_{\text{views}}$ as above.

Apply LeJEPA prediction loss: all views match the per-sample mean of global views.

Add a distributional regularizer (e.g. SIGReg) with weight lambda_reg.

If the regularizer includes a learnable metric $H$, optionally update $H$ on a slower/periodic schedule.

F.10 Algorithms

Notation. Let $V$ be the number of augmented views per image, $B$ the batch size, and $K$ the representation dimension. In MV-HJEPA we enforce $K = 2d$ and interpret each representation as a phase-space state $z = [q; p] \in \mathbb{R}^{2d}$ with $q, p \in \mathbb{R}^d$ (channel-wise...
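Using the notation $V$, $B$, $K$, the baseline prediction step described above ("all views match the per-sample mean of global views") might be sketched as follows. This is a minimal sketch under stated assumptions: the function name `lejepa_prediction_loss` is hypothetical, the global views are assumed to occupy the first `n_global` slots of the view axis, and a mean squared error is assumed as the matching criterion. None of these details are confirmed by the text.

```python
import numpy as np

def lejepa_prediction_loss(z_views, n_global=2):
    # z_views has shape (V, B, K): views, batch, representation dimension.
    # Assumption: the first n_global entries along axis 0 are the global views.
    target = z_views[:n_global].mean(axis=0, keepdims=True)  # shape (1, B, K)
    # Assumption: squared error between every view and the per-sample target.
    return float(np.mean((z_views - target) ** 2))

rng = np.random.default_rng(0)
V, B, K = 4, 3, 6
z_views = rng.standard_normal((V, B, K))
loss = lejepa_prediction_loss(z_views)

# When all views of each sample coincide, the loss vanishes. This is why the
# prediction term alone cannot rule out collapse and a regularizer is added.
agreeing = np.repeat(z_views[:1], V, axis=0)
loss_agreeing = lejepa_prediction_loss(agreeing)
```

The target is computed per sample (over the view axis only), so samples in the batch never interact inside this term. Any spread across the batch has to come from the distributional regularizer.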

