LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics
74 Pith papers cite this work.
abstract
Learning manipulable representations of the world and its dynamics is central to AI. Joint-Embedding Predictive Architectures (JEPAs) offer a promising blueprint, but lack of practical guidance and theory has led to ad-hoc R&D. We present a comprehensive theory of JEPAs and instantiate it in LeJEPA, a lean, scalable, and theoretically grounded training objective. First, we identify the isotropic Gaussian as the optimal distribution that JEPAs' embeddings should follow to minimize downstream prediction risk. Second, we introduce a novel objective, Sketched Isotropic Gaussian Regularization (SIGReg), to constrain embeddings to reach that ideal distribution. Combining the JEPA predictive loss with SIGReg yields LeJEPA with numerous theoretical and practical benefits: (i) single trade-off hyperparameter, (ii) linear time and memory complexity, (iii) stability across hyper-parameters, architectures (ResNets, ViTs, ConvNets) and domains, (iv) heuristics-free, e.g., no stop-gradient, no teacher-student, no hyper-parameter schedulers, and (v) distributed training-friendly implementation requiring only ≈50 lines of code. Our empirical validation covers 10+ datasets, 60+ architectures, all with varying scales and domains. As an example, using ImageNet-1k for pretraining and linear evaluation with frozen backbone, LeJEPA reaches 79% with a ViT-H/14. We hope that the simplicity and theory-friendly ecosystem offered by LeJEPA will reestablish self-supervised pre-training as a core pillar of AI research (GitHub repo: https://github.com/rbalestr-lab/lejepa).
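To make the abstract's recipe concrete, here is a minimal sketch of how a predictive loss and a sketched Gaussian regularizer might be combined under a single trade-off weight. It is an illustration under stated assumptions, not the LeJEPA implementation: the function names, the weighting `(1 - lam) * prediction + lam * regularizer`, and the default `lam` are hypothetical, and the per-direction statistic used here (matching the mean to 0 and the variance to 1 on random 1-D projections) is a simple stand-in for whatever test statistic SIGReg actually uses.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a direction uniformly on the unit sphere (normalized Gaussian draw)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def sketched_gaussian_penalty(embeddings, num_directions=64, seed=0):
    """Stand-in regularizer (assumption, not the paper's statistic).

    Projects the embeddings onto random 1-D directions ("sketches") and
    penalizes each projection for deviating from mean 0 and variance 1.
    Cost is linear in the number of embeddings for a fixed number of directions.
    """
    rng = random.Random(seed)
    n = len(embeddings)
    dim = len(embeddings[0])
    total = 0.0
    for _ in range(num_directions):
        u = random_unit_vector(dim, rng)
        proj = [sum(a * b for a, b in zip(z, u)) for z in embeddings]
        mean = sum(proj) / n
        var = sum((p - mean) ** 2 for p in proj) / n
        total += mean ** 2 + (var - 1.0) ** 2
    return total / num_directions


def prediction_loss(pred, target):
    """Mean squared error between predicted and target embeddings."""
    n = len(pred)
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, t)) for p, t in zip(pred, target)
    ) / n


def combined_loss(pred, target, lam=0.05):
    """Single trade-off hyperparameter `lam` (hypothetical name and weighting)."""
    reg = 0.5 * (sketched_gaussian_penalty(pred) + sketched_gaussian_penalty(target))
    return (1.0 - lam) * prediction_loss(pred, target) + lam * reg
```

In this sketch a fully collapsed batch (all embeddings identical) has zero projected variance and is penalized, while a batch drawn from a standard Gaussian scores near zero, which is the qualitative behavior the abstract attributes to constraining embeddings toward an isotropic Gaussian.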
years
2026: 74
representative citing papers
A spiked signal-plus-noise model yields separation ratios that partition multimodal problems into four regimes where alignment, prediction, both, or neither succeed.
Cross-trajectory negative sampling in contrastive predictive objectives causes encoding of slow noise over dynamics; intra-trajectory sampling eliminates the shortcut and recovers dynamical variables even under strong noise.
citing papers explorer
- A Generalization Theory for JEPA-Based World Models
The paper formulates JEPA pretraining as conditional spectral graph learning equivalent to low-rank factorization of an action-conditioned co-occurrence matrix and derives a finite-sample generalization bound connecting pretraining error to downstream planning regret.
- When Does LeJEPA Learn a World Model?
LeJEPA achieves linear identifiability of latent variables uniquely when the latents are Gaussian in worlds with stationary additive-noise transitions.
- LeVLJEPA: End-to-End Vision-Language Pretraining Without Negatives
LeVLJEPA is the first non-contrastive vision-language pretraining method that learns via cross-modal prediction without negatives, producing stronger dense features than contrastive baselines on VQA and segmentation tasks.
- FlexTab: A Flexible Encoder-Decoder Architecture for In-Context Learning Across Diverse Tabular Tasks
FlexTab shows a shared encoder with task-specific decoders trained on unlabeled tables can achieve SOTA on classification, regression, anomaly detection and entity matching while staying competitive on relational entity classification.
- Equilibrium World Models
Equilibrium World Models are a deep-learning solver that enforces exact equilibrium conditions on broad model-generated state distributions to globally solve dynamic stochastic models featuring rare disasters, binding constraints, and counterfactual states.
- SkyJEPA: Learning Long-Horizon World Models for Zero-Shot Sim-to-Real Control of Quadrotors
SkyJEPA learns long-horizon latent dynamics for quadrotors via JEPA plus a physics prober, enabling zero-shot sim-to-real control with sampling-based MPC and automated sim data generation.
- S-JEPA : Soft Clustering Anchors for Self-Supervised Speech Representation Learning
S-JEPA uses soft GMM posteriors in a JEPA framework for self-supervised speech learning, achieving the lowest WER among models below 90M parameters without offline re-clustering.
- Identifiability Without Gaussianity: Symbolic World Models and Near-Infinite Temporal Consistency
PGSA achieves exact linear identifiability and near-infinite temporal consistency for non-Gaussian regimes via symbolic causal grounding, with four theorems formalized in Lean 4.
- A Unifying Framework for Concept-Based Representational Similarity
A unifying framework decomposes concept alignment into instance-wise and distributional translation and concept consistency, introduces the InterVenchA benchmark, and shows that joint optimization via CoSAE recovers strong alignment even with 0.1% paired data.
- A Unifying View of Attention Sinks: Two Algorithms, Two Solutions
Attention sinks reflect either adaptive nop or broadcast mechanisms, with distinct traces, synthetic diagnostics, and complementary interventions via gating plus registers.
- Exact equivariance, kept through training, buys zero-shot generalisation across the symmetry group
Exact equivariance preserved through training renders one-step relMSE invariant across the symmetry group, enabling zero-shot generalization from a restricted training slice.
- UR-JEPA: Uniform Rectifiability as a Regularizer for Joint-Embedding Predictive Architectures
UR-JEPA applies uniform rectifiability regularization via a smoothed Carleson square function to JEPA training, producing embeddings with a PCA spectral drop of 4-5 orders of magnitude at dimension 20-25 and lower seed variance than Gaussian regularization on Inet10, Galaxy10, and EuroSAT.
- PEIRA: Learning Predictive Encoders through Inter-View Regressor Alignment
PEIRA learns predictive encoders by optimizing the trace of the optimal inter-view linear regressor, with only nontrivial global minimizers as stable equilibria that recover leading nonlinear canonical correlation subspaces.
- JEDI: Joint Embedding Diffusion World Model for Online Model-Based Reinforcement Learning
JEDI is the first online end-to-end latent diffusion world model that trains latents from denoising loss rather than reconstruction, achieving competitive Atari100k results with 43% less VRAM and over 3x faster sampling than pixel diffusion baselines.
- HEPA: A Self-Supervised Horizon-Conditioned Event Predictive Architecture for Time Series
HEPA pretrains via horizon-conditioned JEPA on unlabeled data then fine-tunes only the predictor for event survival CDFs, outperforming PatchTST, iTransformer, MAE and Chronos-2 on at least 10 of 14 benchmarks with fixed hyperparameters, an order of magnitude fewer tuned parameters and less labeled
- ProteinJEPA: Latent prediction complements protein language models
Masked-position MLM plus JEPA latent prediction outperforms MLM-only pretraining on 10-11 of 16 downstream tasks for 35M-150M protein models while JEPA alone fails.
- Statistical Consistency and Generalization of Contrastive Representation Learning
The paper proves statistical consistency of contrastive loss to optimal ranking via an AUC criterion and derives generalization bounds O(1/m + 1/sqrt(n)) for supervised and O(1/sqrt(m) + 1/sqrt(n)) for self-supervised CRL that explain benefits of large negative sets.
- Joint Embedding Variational Bayes
VJE is a new variational non-contrastive SSL method that models target embeddings with a directional-radial Student-t distribution to enable structured uncertainty estimation directly in the learned representation space.
- ACID: Action Consistency via Inverse Dynamics for Planning with World Models
ACID improves decision-time planning in world models by adding per-step action consistency residuals from an inverse dynamics model to the planning cost via an adaptive weight, yielding better performance with less compute across manipulation and navigation tasks.
- Delta-JEPA: Learning Action-Sensitive World Models via Latent Difference Decoding
Delta-JEPA augments latent forward prediction with a Latent Difference Action Decoder that reconstructs actions from embedding displacements, yielding action-sensitive world models that improve planning on four visual continuous-control tasks over JEPA baselines.
- ScaleAware-JEPA: Latent Representation for Discovery in Multiscale Physical Fields
ScaleAware-JEPA combines Constrained Diffusion Decomposition with a scale-tied JEPA objective to learn label-free latent coordinates that recover coherent morphology in multiscale fields such as MHD turbulence and interstellar gas.
- Domain-Informed Multi-View Self-Distillation for Astronomical Light-Curve Representation Learning with JEPA
A JEPA-based model with domain-informed multi-view self-distillation learns light-curve representations that outperform hand-crafted features on 15 of 16 StarEmbed metrics and adapts competitively to other irregular time-series datasets.
- Fast LeWorldModel
Fast-LeWM uses action-prefix encoding and parallel latent prediction to replace sequential rollout, improving success rates and cutting planning time in LeWorldModel tasks.
- Black-Box Continual Learning for Vision-Language Models
Introduces the Black-CL black-box benchmark and the BETA textual-prototype method, which matches or exceeds white-box continual learning performance on ten datasets using 0.05M parameters.
- Multi-Modal Contrastive Learning for Implicit Earth Embeddings via Location Tying
MELT and SALT match the two-modality SATCLIP baseline on four downstream tasks but show no consistent gains from extra modalities, indicating the location encoder itself is the bottleneck.
- Sensorimotor World Models: Perception for Action via Inverse Dynamics
SMWM trains end-to-end latent world models from offline reward-free data using inverse dynamics regularization to prevent collapse and align states with controllable actions for planning.
- Expanding SPHERE-JEPA: A Family of Statistical Regularizers for the Hypersphere
Derives deterministic MMD, KSD, and KL objectives with rotationally invariant kernels on the hypersphere, yielding more stable SSL training and dataset-dependent geometry in learned representations.
- TacForeSight: Force-Guided Tactile World Model for Contact-Rich Manipulation
TacForeSight trains a force-conditioned tactile world model to predict latent dynamics and uses those predictions as anticipatory priors inside a visuo-tactile policy for real-time contact-rich manipulation.
- ω-EVA: Envision, Verify, and Act with Latent Interactive World Models
ω-EVA is a three-stage latent world model framework that trains action-conditioned dynamics, a language-conditioned flow policy, and a tri-branch refiner to improve embodied action generation in simulation.
- DALE-CT: Depth-Aware Foundation Models for Computed Tomography
DALE-CT, a 2D LeJEPA model with depth-aware dual supervision, reaches 0.833 Macro AUROC on multi-abnormality detection in CT and approaches 3D SOTA performance using less data and no textual supervision.
- Who Needs Labels? Adapting Vision Foundation Models With the Metadata You Already Have
FINO adapts vision foundation models to scientific domains via metadata-guided self-supervised learning and outperforms both unsupervised domain adaptation and fully supervised methods without using task labels for the backbone.
- Video-Mirai: Autoregressive Video Diffusion Models Need Foresight
A training method that distills non-causal future targets into causal video diffusion states to boost long-horizon consistency without changing inference architecture or cost.
- VISReg: Variance-Invariance-Sketching Regularization for JEPA training
VISReg replaces covariance in VICReg-style objectives with sliced-Wasserstein sketching for JEPA training, claiming better OOD performance and resilience to collapse.
- STEP: Learning STructured Embeddings for Progressive Time Series
STEP embeds progressive time series into a manifold between orthogonal prototypes so that polar angle tracks irreversible state progression and radius tracks mode via self-supervised contrastive learning.
- Uncovering the Latent Potential of Deep Intermediate Representations
Introduces LOES, a constructive spectral method to select task-discriminative subspaces from intermediate layer embeddings, and GeoReg for enforcing simplicial class geometry during fine-tuning, with reported gains increasing with model depth across modalities.
- SpectralEarth-FM: Bringing Hyperspectral Imagery into Multimodal Earth Observation Pretraining
SpectralEarth-FM is a multisensor hierarchical transformer pretrained on a 40TB co-located HSI-MSI-SAR dataset using a JEPA-style objective and reports state-of-the-art results on hyperspectral and standard EO benchmarks.
- Beyond Isotropy in JEPAs: Hamiltonian Geometry and Symplectic Prediction
Fixed isotropic marginals in JEPAs can be maximally misaligned with unknown structured geometries, and HamJEPA using symplectic Hamiltonian leapfrog maps improves kNN and linear-probe performance on CIFAR-100 and ImageNet-100.
- LACE: Latent Visual Representation for Cross-Embodiment Learning
LACE aligns human-robot visual features via semantic distribution matching on corresponding body parts plus Gram loss, yielding 65% better zero-shot policy transfer than baseline DINO.
- Crys-JEPA: Accelerating Crystal Discovery via Embedding Screening and Generative Refinement
Crys-JEPA introduces a joint embedding predictive architecture that creates an energy-aware latent space, enabling embedding-based stability screening and a refinement pipeline that yields up to 72.7% gains on the V.S.U.N. metric for crystal generation.
- Latent Geometry Beyond Search: Amortizing Planning in World Models
A Goal-Conditioned Inverse Dynamics Model amortizes planning in pretrained world model latents, matching or exceeding CEM in seven of eight settings at 100-130x lower per-decision cost.
- Predictive but Not Plannable: RC-aux for Latent World Models
RC-aux corrects spatiotemporal mismatch in reconstruction-free latent world models by adding multi-horizon prediction and reachability supervision, improving planning performance on goal-conditioned pixel-control tasks.
- AeroJEPA: Learning Semantic Latent Representations for Scalable 3D Aerodynamic Field Modeling
AeroJEPA applies joint-embedding predictive learning to produce scalable, semantically organized latent representations for 3D aerodynamic fields that support both field reconstruction and downstream design tasks.
- Why Self-Supervised Encoders Want to Be Normal
Self-supervised encoders prefer isotropic Gaussian latent states because the Information Bottleneck, recast as rate-distortion over the predictive manifold, makes these states optimal for target-neutral representations.
- Information bottleneck for learning the phase space of dynamics from high-dimensional experimental data
DySIB recovers the two-dimensional phase space of a physical pendulum from experimental video by optimizing a symmetric information bottleneck objective entirely in latent space.
- Self-supervised pretraining for an iterative image size agnostic vision transformer
A sequential-to-global SSL method based on DINO pretrains iterative foveal-inspired vision transformers to achieve competitive ImageNet-1K performance with constant compute regardless of input resolution.
- Sonata: A Hybrid World Model for Inertial Kinematics under Clinical Data Scarcity
Sonata is a small hybrid world model pre-trained to predict future IMU states that outperforms autoregressive baselines on clinical discrimination, fall-risk prediction, and cross-cohort transfer while fitting on-device wearables.
- Infrastructure-Centric World Models: Bridging Temporal Depth and Spatial Breadth for Roadside Perception
Infrastructure-centric world models use roadside sensors' temporal depth to complement vehicle spatial breadth for better traffic simulation and prediction.
- REZE: Representation Regularization for Domain-adaptive Text Embedding Pre-finetuning
REZE controls representation shifts in contrastive pre-finetuning of text embeddings via eigenspace decomposition of anchor-positive pairs and adaptive soft-shrinkage on task-variant directions.
- PEPR: Privileged Event-based Predictive Regularization for Domain Generalization
PEPR reframes learning with privileged event data as predicting latent event features from RGB to improve domain generalization in object detection and segmentation without direct cross-modal alignment.
- LeNEPA: No-Augmentation Next-Latent Prediction for Time-Series Representation Learning
LeNEPA proposes a no-augmentation next-latent prediction recipe that maintains frozen-probe performance across ECG and synthetic diagnostic time-series datasets under fixed-recipe conditions where a tuned JEPA baseline degrades.