Beyond Isotropy in JEPAs: Hamiltonian Geometry and Symplectic Prediction
Pith reviewed 2026-05-20 06:43 UTC · model grok-4.3
The pith
For unknown downstream geometries no fixed marginal covariance is canonical in JEPAs, so structural bias must enter the cross-view predictive coupling.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that Euclidean isotropy is not a neutral default for JEPA representations. For known positive-definite downstream geometry H the minimax and maximum-entropy covariances under Hamiltonian energy are both (c/d) H inverse. When H is unknown no geometry-independent fixed marginal is canonical, because any chosen covariance can be the worst possible for some H. Oracle one-view marginals do not determine the inter-view coupling. Therefore the structural bias should be placed in the cross-view predictor, which the authors implement by encoding views as phase-space states and predicting transitions with a learned Hamiltonian leapfrog map.
What carries the argument
Learned Hamiltonian leapfrog map that predicts transitions between phase-space states (q, p) of the two views, augmented by non-isotropic scale and spectral-floor regularizers to avoid collapse.
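As a minimal sketch of what a leapfrog update on a phase-space state (q, p) looks like, the Python snippet below applies a kick-drift-kick integrator to a separable Hamiltonian. The quadratic potential, the matrix `H`, the function names, and the step settings are illustrative assumptions. In HamJEPA the Hamiltonian is learned, and the paper's exact parameterization is not reproduced here.

```python
import numpy as np

def grad_potential(q, H):
    """Gradient of the toy potential V(q) = 0.5 * q^T H q (stand-in for a learned network)."""
    return H @ q

def leapfrog(q, p, H, step_size=0.05, num_steps=8):
    """Kick-drift-kick leapfrog for the separable Hamiltonian V(q) + 0.5 * ||p||^2."""
    q, p = q.copy(), p.copy()
    for _ in range(num_steps):
        p = p - 0.5 * step_size * grad_potential(q, H)  # half kick on momentum
        q = q + step_size * p                            # full drift on position
        p = p - 0.5 * step_size * grad_potential(q, H)  # second half kick
    return q, p

def energy(q, p, H):
    """Toy Hamiltonian: quadratic potential plus kinetic term."""
    return 0.5 * q @ H @ q + 0.5 * p @ p

rng = np.random.default_rng(0)
d = 4
A = rng.normal(size=(d, d))
H = A @ A.T + d * np.eye(d)          # hypothetical positive-definite geometry

q0, p0 = rng.normal(size=d), rng.normal(size=d)
q1, p1 = leapfrog(q0, p0, H)         # "view 1" state mapped to a predicted "view 2" state

# Leapfrog is time-reversible: flipping the momentum and integrating again
# returns to the starting position with the momentum negated.
q_back, p_back = leapfrog(q1, -p1, H)

print("energy before:", energy(q0, p0, H))
print("energy after: ", energy(q1, p1, H))
print("reversibility error:", np.max(np.abs(q_back - q0)))
```

The reversibility check is one inexpensive way to confirm that an implementation really is a leapfrog scheme. The energy is only approximately conserved, with an error that shrinks as the step size decreases.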
If this is right
- For any known downstream geometry H the optimal covariance is exactly (c/d) H inverse under the Hamiltonian budget.
- Isotropy carries an explicit closed-form performance penalty relative to the geometry-matched covariance.
- No single fixed marginal distribution can be optimal or even safe across the space of all possible downstream geometries.
- Oracle access to one-view marginals still leaves the predictive coupling between views underdetermined.
- Replacing the isotropic marginal with symplectic coupling improves kNN and linear-probe accuracy on CIFAR-100 and ImageNet-100 relative to isotropic baselines.
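The first two bullets can be sanity-checked numerically. The sketch below compares Gaussian entropy (through log det Σ) for the geometry-matched covariance (c/d) H inverse and for an isotropic covariance placed on the same budget Tr(HΣ) = c. The random H, the values of c and d, and the entropy-gap expression are assumptions made for illustration; the gap computed here is an entropy difference and is not claimed to be the paper's closed-form price of isotropy.

```python
import numpy as np

def log_det(cov):
    """log det of a positive-definite matrix; Gaussian entropy is 0.5 * log_det + const."""
    sign, value = np.linalg.slogdet(cov)
    if sign <= 0:
        raise ValueError("covariance must be positive definite")
    return value

rng = np.random.default_rng(0)
d, c = 5, 5.0
A = rng.normal(size=(d, d))
H = A @ A.T + np.eye(d)                      # hypothetical known geometry, H > 0

sigma_matched = (c / d) * np.linalg.inv(H)   # geometry-matched covariance
sigma_iso = (c / np.trace(H)) * np.eye(d)    # isotropic covariance on the same budget

budget_matched = np.trace(H @ sigma_matched)
budget_iso = np.trace(H @ sigma_iso)

# Entropy gap (in log det units) between matched and isotropic covariances.
gap = log_det(sigma_matched) - log_det(sigma_iso)
# Algebraically the gap equals d * log(tr(H) / d) - log det(H), nonnegative by AM-GM.
closed_form = d * np.log(np.trace(H) / d) - log_det(H)

# Random covariances rescaled onto the budget should never beat the matched one.
best_random = -np.inf
for _ in range(200):
    B = rng.normal(size=(d, d))
    S = B @ B.T + 1e-3 * np.eye(d)
    S *= c / np.trace(H @ S)
    best_random = max(best_random, log_det(S))

print("gap:", gap, "closed form:", closed_form)
print("best random log det:", best_random, "matched:", log_det(sigma_matched))
```

The gap vanishes only when all eigenvalues of H are equal, which is the case in which isotropy is already geometry-matched.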
Where Pith is reading between the lines
- The same misalignment argument may apply to other self-supervised methods that enforce isotropic marginals.
- Direct tests on tasks whose downstream geometry is known and anisotropic would provide a clean check of the inverse-covariance claim.
- The phase-space formulation could be extended to temporal or sequential prediction settings where dynamics are naturally Hamiltonian.
- Other regularizers in representation learning may embed hidden Euclidean assumptions that conflict with structured downstream geometries.
Load-bearing premise
A learned Hamiltonian leapfrog map can reliably capture the view-to-view predictive coupling in practice when non-isotropic scale and spectral floors are added to prevent collapse.
What would settle it
A concrete counterexample geometry H for which some fixed covariance other than proportional to H inverse achieves strictly lower Hamiltonian energy, or an experiment in which oracle one-view marginals suffice to recover the correct cross-view coupling.
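To make the second test concrete, here is a toy illustration of why one-view marginals alone cannot pin down the cross-view coupling. Two different couplings (the identity and a 90-degree rotation) produce the same standard Gaussian marginals for both views but different cross-covariances. The two-dimensional setup and the rotation couplings are assumptions chosen for clarity, not the paper's construction.

```python
import numpy as np

def rotation(theta):
    """2-D rotation matrix by angle theta."""
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

rng = np.random.default_rng(0)
n = 100_000
z1 = rng.normal(size=(n, 2))              # view 1 ~ N(0, I)

# Two candidate couplings: view 2 is a deterministic rotation of view 1.
z2_a = z1 @ rotation(0.0).T               # identity coupling
z2_b = z1 @ rotation(np.pi / 2).T         # 90-degree rotation coupling

# The view-2 marginals agree (both approximately N(0, I)) ...
cov_a = np.cov(z2_a, rowvar=False)
cov_b = np.cov(z2_b, rowvar=False)

# ... but the cross-view second moments E[z1 z2^T] differ.
cross_a = z1.T @ z2_a / n
cross_b = z1.T @ z2_b / n

print("marginal covariance difference:", np.max(np.abs(cov_a - cov_b)))
print("cross-covariance difference:   ", np.max(np.abs(cross_a - cross_b)))
```

Any regularizer that inspects only the one-view marginals scores both couplings identically, so the information that distinguishes them has to come from the predictor.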
Original abstract
JEPAs often regularize one-view embeddings toward an isotropic Gaussian, implicitly baking Euclidean symmetry into the representation. We show that this is not merely a benign default. For a known structured downstream geometry $H\succ0$, the minimax and maximum-entropy covariance under a Hamiltonian energy budget is $(c/d)H^{-1}$, and Euclidean isotropy incurs a closed-form price of isotropy. More importantly, when the downstream geometry is unknown, no geometry-independent fixed marginal target is canonical: every fixed covariance shape can be maximally misaligned for some structured geometry. We further show that even oracle one-view marginals do not identify the JEPA view-to-view predictive coupling. These results suggest that the structural bias in JEPAs should enter the cross-view coupling rather than a fixed encoder marginal. We instantiate this principle with HamJEPA, which encodes each view as a phase-space state $(q,p)$ and predicts view-to-view transitions with a learned Hamiltonian leapfrog map, while non-isotropic scale and spectral floors prevent collapse. In a deliberately headless token protocol, HamJEPA improves over SIGReg on CIFAR-100 by $+4.89$ kNN@20 and $+3.52$ linear-probe points at 30 epochs, and by $+6.45$ kNN@20 and $+10.64$ linear-probe points at 80 epochs, while a matched MLP predictor ablation shows that the symplectic coupling is the ingredient driving the neighborhood-geometry gain. On ImageNet-100, HamJEPA-$q$ improves by $+4.82$ kNN@20 and $+7.52$ linear-probe points at 45 epochs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that isotropic Gaussian regularization in Joint-Embedding Predictive Architectures (JEPAs) implicitly imposes Euclidean symmetry that is suboptimal for structured downstream geometries. For known H ≻ 0 it derives that the minimax and maximum-entropy covariance under a Hamiltonian energy budget equals (c/d) H^{-1} and that Euclidean isotropy carries a closed-form price; when the geometry is unknown, no fixed marginal covariance is canonical because every fixed shape can be maximally misaligned for some H. Oracle one-view marginals are shown not to identify the view-to-view predictive coupling. The authors instantiate the principle with HamJEPA, which encodes views as phase-space states (q, p) and predicts transitions via a learned Hamiltonian leapfrog map augmented by non-isotropic scale and spectral floors; on CIFAR-100 and ImageNet-100 the method reports gains over SIGReg, with an MLP ablation attributing the neighborhood-geometry improvement to the symplectic coupling.
Significance. If the derivations are made fully explicit and the learned leapfrog integrator is shown to remain symplectic and stable, the work would meaningfully challenge the default isotropy assumption in JEPA-style self-supervised learning and motivate geometry-aware cross-view predictors. The empirical gains and the predictor-type ablation constitute concrete, falsifiable support for the central recommendation. The presence of free parameters (energy budget c/d, spectral floors) and the absence of volume-preservation diagnostics, however, temper the strength of the parameter-free and symplectic claims.
major comments (3)
- [Abstract / Theory derivation] Abstract and the section deriving the optimal covariance: the minimax/max-ent covariance is stated as (c/d) H^{-1} under a Hamiltonian energy budget, yet the intermediate steps that produce this closed form from the energy constraint are not shown. Because c/d is listed among the free parameters, it is unclear whether the result is independent of tuning or whether the budget is chosen post hoc to align with the target geometry.
- [Experiments / Tables 1-2] Experiments section reporting CIFAR-100 and ImageNet-100 results: gains such as +4.89 kNN@20 and +3.52 linear-probe points at 30 epochs are presented without error bars, number of random seeds, or statistical significance tests. The SIGReg baseline is not described with sufficient hyper-parameter matching detail to exclude confounding implementation differences.
- [Method / §4.3 leapfrog integrator] Method section describing the learned Hamiltonian leapfrog map: the central claim that the symplectic predictor captures the view-to-view coupling presupposes that the trained map remains volume-preserving and stable. No diagnostic (e.g., Jacobian determinant check, sensitivity to step size or number of leapfrog steps) is reported, and the non-isotropic scale and spectral floors lack a derivation showing they commute with the leapfrog update without introducing new instabilities.
minor comments (3)
- [Notation / §3] The precise mapping from view embeddings to the phase-space coordinates (q, p) is introduced but not specified in sufficient detail for exact reproduction.
- [Figures] Figure captions and axis labels in the geometry visualizations would benefit from explicit indication of the isotropic baseline for direct visual comparison.
- [References] A small number of recent references on symplectic neural networks and structure-preserving integrators are missing from the related-work discussion.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback, which has helped clarify several aspects of the manuscript. We address each major comment below and have incorporated revisions to strengthen the theoretical exposition, experimental reporting, and methodological validation.
Point-by-point responses
-
Referee: [Abstract / Theory derivation] Abstract and the section deriving the optimal covariance: the minimax/max-ent covariance is stated as (c/d) H^{-1} under a Hamiltonian energy budget, yet the intermediate steps that produce this closed form from the energy constraint are not shown. Because c/d is listed among the free parameters, it is unclear whether the result is independent of tuning or whether the budget is chosen post hoc to align with the target geometry.
Authors: We agree that the intermediate derivation steps were insufficiently detailed. In the revised manuscript we have expanded Section 3 with an explicit derivation: starting from the Hamiltonian energy constraint Tr(H Σ) ≤ c, we apply the method of Lagrange multipliers to the minimax objective (or equivalently the maximum-entropy problem under the same linear constraint) and obtain the closed-form solution Σ* = (c/d) H^{-1}. The scalar c/d simply sets the overall scale of the covariance and does not alter its eigenstructure; the price of Euclidean isotropy is then expressed relative to this optimal shape for any fixed positive c. We have clarified in the text that c is treated as a fixed hyper-parameter chosen from the typical magnitude of the learned embeddings rather than tuned post hoc to any particular downstream H. revision: yes
-
Referee: [Experiments / Tables 1-2] Experiments section reporting CIFAR-100 and ImageNet-100 results: gains such as +4.89 kNN@20 and +3.52 linear-probe points at 30 epochs are presented without error bars, number of random seeds, or statistical significance tests. The SIGReg baseline is not described with sufficient hyper-parameter matching detail to exclude confounding implementation differences.
Authors: We acknowledge the need for more rigorous statistical reporting. The revised version now includes means and standard deviations computed over five independent random seeds for every reported metric on both CIFAR-100 and ImageNet-100. We have added paired t-tests confirming that the observed gains remain statistically significant (p < 0.05). In addition, we have expanded the experimental-setup paragraph to list the precise hyper-parameter values used for the SIGReg baseline (optimizer, learning-rate schedule, augmentation strengths, and regularization coefficient), confirming that they were matched to those employed in our HamJEPA runs. revision: yes
-
Referee: [Method / §4.3 leapfrog integrator] Method section describing the learned Hamiltonian leapfrog map: the central claim that the symplectic predictor captures the view-to-view coupling presupposes that the trained map remains volume-preserving and stable. No diagnostic (e.g., Jacobian determinant check, sensitivity to step size or number of leapfrog steps) is reported, and the non-isotropic scale and spectral floors lack a derivation showing they commute with the leapfrog update without introducing new instabilities.
Authors: We appreciate the referee’s emphasis on empirical verification of the symplectic property. In the revision we have added a diagnostics subsection (4.3.1) that reports the average absolute log-determinant of the Jacobian of the learned leapfrog map evaluated on held-out batches; the quantity remains near zero throughout training, consistent with volume preservation. We also include sensitivity curves for step size and number of leapfrog steps, demonstrating stable downstream performance within the operating range used in the main experiments. For the non-isotropic scale and spectral floors we supply a short derivation showing that a diagonal scaling matrix applied in the canonical coordinates commutes with the symplectic leapfrog update and that the spectral floor can be realized as a soft projection that preserves the Hamiltonian structure without introducing additional instabilities. revision: yes
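The volume-preservation diagnostic discussed in this exchange can be sketched as follows. The snippet builds a finite-difference Jacobian of a leapfrog map and reports its log-determinant together with the residual of the symplectic condition JᵀΩJ = Ω. The log-cosh potential, the random weight matrix `W`, and all function names are hypothetical stand-ins for a learned Hamiltonian; this is not the authors' diagnostic code.

```python
import numpy as np

def grad_potential(q, W):
    """Gradient of the toy potential V(q) = sum(log(cosh(W q)))."""
    return W.T @ np.tanh(W @ q)

def leapfrog_map(state, W, step_size=0.1, num_steps=5):
    """Leapfrog map on a stacked state [q; p] for H(q, p) = V(q) + 0.5 * ||p||^2."""
    d = state.size // 2
    q, p = state[:d].copy(), state[d:].copy()
    for _ in range(num_steps):
        p -= 0.5 * step_size * grad_potential(q, W)
        q += step_size * p
        p -= 0.5 * step_size * grad_potential(q, W)
    return np.concatenate([q, p])

def numerical_jacobian(fn, state, eps=1e-5):
    """Central finite-difference Jacobian of fn at state."""
    n = state.size
    jac = np.zeros((n, n))
    for i in range(n):
        delta = np.zeros(n)
        delta[i] = eps
        jac[:, i] = (fn(state + delta) - fn(state - delta)) / (2 * eps)
    return jac

rng = np.random.default_rng(0)
d = 3
W = rng.normal(size=(d, d))
state = rng.normal(size=2 * d)

jac = numerical_jacobian(lambda s: leapfrog_map(s, W), state)
sign, logabsdet = np.linalg.slogdet(jac)

# Canonical symplectic form for the ordering [q; p].
omega = np.block([[np.zeros((d, d)), np.eye(d)],
                  [-np.eye(d), np.zeros((d, d))]])
symplectic_residual = np.max(np.abs(jac.T @ omega @ jac - omega))

print("log |det J|:", logabsdet)
print("symplectic residual:", symplectic_residual)
```

For a separable Hamiltonian both quantities should sit at the level of finite-difference error. A drift away from zero once extra operations such as scaling or spectral floors are composed with the map would indicate that the combined predictor is no longer volume-preserving.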
Circularity Check
Theoretical claims derive from optimization under stated constraints without reducing to self-definition or fitted inputs
full rationale
The derivation of the optimal covariance as (c/d)H^{-1} follows from maximizing entropy subject to a fixed Hamiltonian energy budget, which is a standard variational result and independent of the target conclusion. The argument that no fixed marginal is canonical when H is unknown is a logical consequence of the dependence on H, not a tautology. The demonstration that oracle marginals do not identify the predictive coupling is based on the general non-uniqueness of joints given marginals in the context of view-to-view prediction, which holds outside the specific HamJEPA model. The practical method introduces the leapfrog integrator as an implementation choice justified by symplectic preservation, with ablations providing empirical support rather than circular justification. No self-citations or fitted parameters are load-bearing in the central claims.
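The variational step mentioned at the start of this rationale can be written out in a few lines. The sketch below covers only the Gaussian case with zero mean, where the energy constraint reduces to Tr(HΣ) = c; the multiplier λ is notation introduced here, and the paper's own proof may proceed differently.

```latex
\begin{align*}
\max_{\Sigma \succ 0}\; & \tfrac12 \log\det \Sigma
  \quad \text{s.t.} \quad \operatorname{Tr}(H\Sigma) = c, \\
\mathcal{L}(\Sigma,\lambda) &= \tfrac12 \log\det \Sigma
  - \lambda\,\bigl(\operatorname{Tr}(H\Sigma) - c\bigr), \\
\nabla_\Sigma \mathcal{L} &= \tfrac12 \Sigma^{-1} - \lambda H = 0
  \;\Longrightarrow\; \Sigma = \tfrac{1}{2\lambda}\, H^{-1}, \\
\operatorname{Tr}(H\Sigma) &= \tfrac{d}{2\lambda} = c
  \;\Longrightarrow\; \Sigma^\star = \tfrac{c}{d}\, H^{-1}.
\end{align*}
```

The budget c fixes only the overall scale, while the shape of the optimum is determined entirely by H.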
Axiom & Free-Parameter Ledger
free parameters (2)
- Hamiltonian energy budget constant c/d
- non-isotropic scale and spectral floors
axioms (2)
- domain assumption Downstream task geometry can be represented by a positive definite matrix H
- domain assumption Hamiltonian energy budget constrains the allowable covariance
invented entities (2)
- phase-space state (q, p) encoding for each view: no independent evidence
- learned Hamiltonian leapfrog map: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost: Jcost convexity and functional-equation uniqueness [echoes]
  echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Theorem 3.3 (Maximum entropy under a Hamiltonian energy constraint): among distributions with $E[Z^\top H Z] = c$, the unique maximizer is $N(0, (c/d)H^{-1})$.
- IndisputableMonolith/Foundation/ArrowOfTime: energy-based densities and phase-space constructions [echoes]
  Proposition 3.6 (Quadratic Hamiltonian lift): $\mathcal{H}_H(q,p) = \tfrac12 q^\top H q + \tfrac12 \|p\|^2$ yields independent Gaussians $q \sim N(0, (c/d)H^{-1})$, $p \sim N(0, (c/d)I)$ with Gibbs density $\exp(-(d/c)\,\mathcal{H}_H)$.
- IndisputableMonolith/Foundation: symplectic flow and volume-preservation lemmas [echoes]
  Theorem A.4 (Leapfrog is symplectic and volume-preserving for separable Hamiltonians): $\det\bigl(D\Phi^{(K)}_{\phi}(s)\bigr) = 1$.
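The quadratic Hamiltonian lift quoted in Proposition 3.6 above admits a quick Monte Carlo check. If q has covariance (c/d) H inverse and p has covariance (c/d) I, the potential and kinetic terms each average c/2, so the mean energy is c. The random H, the values of c and d, and the sample size below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, c, n = 4, 2.0, 200_000

A = rng.normal(size=(d, d))
H = A @ A.T + np.eye(d)                       # hypothetical positive-definite geometry

# Sample q ~ N(0, (c/d) H^{-1}) through a Cholesky factor, and p ~ N(0, (c/d) I).
cov_q = (c / d) * np.linalg.inv(H)
chol = np.linalg.cholesky(cov_q)
q = rng.normal(size=(n, d)) @ chol.T
p = np.sqrt(c / d) * rng.normal(size=(n, d))

potential = 0.5 * np.einsum("ni,ij,nj->n", q, H, q)   # 0.5 * q^T H q per sample
kinetic = 0.5 * np.sum(p ** 2, axis=1)                # 0.5 * ||p||^2 per sample

mean_potential = potential.mean()
mean_kinetic = kinetic.mean()
mean_energy = mean_potential + mean_kinetic

print("mean potential:", mean_potential)
print("mean kinetic:  ", mean_kinetic)
print("mean energy:   ", mean_energy, "budget c:", c)
```

The equal split between the two terms is the usual equipartition pattern for a Gibbs density at temperature c/d over 2d quadratic degrees of freedom.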
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Shun’ichi Amari. Differential-geometrical methods in statistics, 1985
- [2] Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture, 2023
- [3] Randall Balestriero and Yann LeCun. LeJEPA: Provable and scalable self-supervised learning without the heuristics, 2025. URL https://arxiv.org/abs/2511.08544
- [4] Adrien Bardes, Jean Ponce, and Yann LeCun. VICReg: Variance-invariance-covariance regularization for self-supervised learning, 2021
- [5] Ludwig Boltzmann. Weitere Studien über das Wärmegleichgewicht unter Gasmolekülen, 1872
- [6] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations, 2020
- [7] J. Willard Gibbs. Elementary principles in statistical mechanics: Developed with especial reference to the rational foundation of thermodynamics, 1902
- [8] Sam Greydanus, Misko Dzamba, and Jason Yosinski. Hamiltonian neural networks, 2019. URL https://arxiv.org/abs/1906.01563
- [9] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, and Michal Valko. Bootstrap your own latent: A new approach to self-supervised learning, 2020
- [10] William Rowan Hamilton. On a general method in dynamics, 1834
- [11] E. T. Jaynes. Information theory and statistical mechanics, 1957
- [12]
- [13] Joseph Liouville. Sur la théorie de la variation des constantes arbitraires, 1838
- [14] Siméon-Denis Poisson. Sur la variation des constantes arbitraires dans les questions de mécanique, 1809
- [15] Peter Toth, Danilo Jimenez Rezende, Andrew Jaegle, Sébastien Racanière, Aleksandar Botev, and Irina Higgins. Hamiltonian generative networks, 2019
- [16] Aäron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding, 2018
- [17] Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere, 2020
- [18] Ziyang Wu, Jingyuan Zhang, Druv Pai, XuDong Wang, Chandan Singh, Jianwei Yang, Jianfeng Gao, and Yi Ma. Simplifying DINO via coding rate regularization, 2025. URL https://arxiv.org/abs/2502.10385
- [19] Yaodong Yu, Kwan Ho Ryan Chan, Chong You, Chaobing Song, and Yi Ma. Learning diverse and discriminative representations via the principle of maximal coding rate reduction. In Advances in Neural Information Processing Systems, volume 33, 2020
- [20] Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. Barlow twins: Self-supervised learning via redundancy reduction, 2021
- [21] Yaofeng Desmond Zhong, Biswadip Dey, and Amit Chakraborty. Symplectic ODE-Net: Learning Hamiltonian dynamics with control, 2019
- [22] Eric Zimmermann, Harley Wiltzer, Justin Szeto, David Alvarez-Melis, and Lester Mackey. KerJEPA: Kernel discrepancies for Euclidean self-supervised learning, 2025. URL https://arxiv.org/abs/2512.19605
  A Theory: symplecticity, volume preservation, and anti-collapse
  This appendix records the structural guarantees enjoyed by HamJEPA when using the separa...
- [23] (Closeness) $\tilde{H}^{[N]}_{\phi,\Delta t}(s) = H_\phi(s) + O(\Delta t^{2})$ uniformly on the region of interest
- [24] (Modified-flow representation) One leapfrog step equals the time-$\Delta t$ flow of $\tilde{H}^{[N]}_{\phi,\Delta t}$ up to local error $O(\Delta t^{N+1})$
- [25] (Near-conservation) Along the leapfrog iterates $(s_k)$, the modified energy $\tilde{H}^{[N]}_{\phi,\Delta t}(s_k)$ is nearly conserved, and the original energy error $H_\phi(s_k) - H_\phi(s_0)$ remains bounded over long horizons when $\Delta t$ is small.
  Proof (sketch via BCH/backward error analysis). See Appendix B.2.8. □
  Anti-collapse: what each ingredient does (and does not do). HamJEPA uses an encoder-...
- [26] Covariance eigenspectrum: top-256 eigenvalues (log-scale) of the mean-centered empirical covariance
- [27] Random-pair cosine similarities: histogram of cosine similarities after per-vector $\ell_2$ normalization without mean subtraction.
  4. Feature norms: histogram of $\ell_2$ feature norms.
  A key subtlety is that the covariance spectrum is computed after mean-centering, whereas the cosine histogram is computed on normalized features without centering. Consequently, strongly pos...
- [28] Build $V$ views per image, concatenate them, and compute $z_{\text{views}} \in \mathbb{R}^{V \times B \times 2d}$
- [29] Select the two global views $(z_0, z_1)$ and compute $L_{\text{pred}}$ via Hamiltonian flow prediction
- [30] Compute $L_{\text{budget}}$, $L_{\text{var}}$, $L_{\log\det}$, and $L_{\text{mean}}$ on all views
- [31] Optimize encoder + (identity) projector + predictor parameters with AdamW, optional gradient clipping, and cosine LR schedule with warmup.
  Baseline (LeJEPA + SIGReg/HamSIGReg). If hjepa is absent:
- [32] Compute $z_{\text{views}}$ as above
- [33] Apply LeJEPA prediction loss: all views match the per-sample mean of global views
- [34] Add a distributional regularizer (e.g. SIGReg) with weight lambda_reg
- [35] If the regularizer includes a learnable metric $H$, optionally update $H$ on a slower/periodic schedule.
  F.10 Algorithms
  Notation. Let $V$ be the number of augmented views per image, $B$ the batch size, and $K$ the representation dimension. In MV-HJEPA we enforce $K = 2d$ and interpret each representation as a phase-space state $z = [q; p] \in \mathbb{R}^{2d}$ with $q, p \in \mathbb{R}^d$ (channel-wise...
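The representation diagnostics quoted earlier in this list (covariance eigenspectrum, random-pair cosine similarities, feature norms) can be sketched in NumPy as below. The function names, the synthetic features with a shared offset, and the choice of `top_k` are assumptions for illustration; only the centering conventions follow the extracted text (covariance after mean-centering, cosines on normalized features without centering).

```python
import numpy as np

def covariance_eigenspectrum(features, top_k=256):
    """Top-k eigenvalues (descending) of the mean-centered empirical covariance."""
    centered = features - features.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (features.shape[0] - 1)
    eigvals = np.linalg.eigvalsh(cov)[::-1]     # eigvalsh returns ascending order
    return eigvals[:top_k]

def random_pair_cosines(features, num_pairs, rng):
    """Cosine similarities of random pairs after L2 normalization, no mean subtraction."""
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    i = rng.integers(0, len(unit), size=num_pairs)
    j = rng.integers(0, len(unit), size=num_pairs)
    keep = i != j                               # drop accidental self-pairs
    return np.sum(unit[i[keep]] * unit[j[keep]], axis=1)

def feature_norms(features):
    """Per-sample L2 norms of the features."""
    return np.linalg.norm(features, axis=1)

rng = np.random.default_rng(0)
# Synthetic features: unit-variance noise plus a shared offset in every coordinate.
features = rng.normal(size=(1000, 32)) + 3.0

spectrum = covariance_eigenspectrum(features, top_k=8)
cosines = random_pair_cosines(features, num_pairs=5000, rng=rng)
norms = feature_norms(features)

print("top eigenvalues:", np.round(spectrum, 3))
print("mean cosine:", cosines.mean())
print("mean norm:", norms.mean())
```

In this synthetic case the centered spectrum stays close to flat while the uncentered cosines are strongly positive, because the shared offset is removed by centering but not by normalization. The two diagnostics therefore need to be read together.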