Representation Without Reward: A JEPA Audit for LLM Fine-Tuning

Biswa Sengupta

arXiv: 2605.15394 · v1 · pith:PYE5AVJQnew · submitted 2026-05-14 · 💻 cs.LG · cs.AI · stat.ML


Pith reviewed 2026-05-19 16:23 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML

keywords JEPA · LLM fine-tuning · hidden-state geometry · task coupling · auxiliary objectives · LoRA · representation learning · regex generation


The pith

JEPA-style auxiliaries change LLM hidden-state geometry but leave task accuracy unchanged on language-to-regex generation

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether joint-embedding predictive architectures improve large language models by training them to predict latent representations rather than direct outputs. It applies twenty-two auxiliary objectives, including trajectory regularisers, distributional constraints, and a decoder-visible JEPA loss, to a Llama-3.2-1B model fine-tuned on natural language to regex conversion. Several auxiliaries alter hidden-state curvature, anisotropy, variance, and gradient directions, yet none produce exact-match accuracy gains that survive multiple-comparison correction. The null result persists when the decoder-visible construction is replicated with full fine-tuning at five seeds. The authors conclude that hidden-state representation improvements and decoded task performance remain weakly coupled in this regime.

Core claim

In LLM fine-tuning for natural-language-to-regex generation, auxiliary objectives intended to shape hidden-state geometry produce measurable shifts in representation statistics and gradient alignment, including the first positive cosine with cross-entropy observed for a decoder-visible JEPA construction, yet none deliver task accuracy improvements that survive Bonferroni or Holm-Bonferroni correction. Exact-match scores stay inside seed noise for both LoRA and full-parameter regimes. The findings therefore establish a weak coupling between hidden-state representation work and decoded-task accuracy, reframing JEPA evaluation around the question of when useful geometry becomes visible as task-

What carries the argument

The decoder-visible JEPA objective constructed to lie in cross-entropy's positive cone, tested as one of twenty-two auxiliaries for whether induced hidden-state changes reach the language-model head and improve exact-match accuracy.
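One way to read "lies in cross-entropy's positive cone" is that the auxiliary gradient has a positive inner product (equivalently, a positive cosine) with the cross-entropy gradient. The sketch below illustrates that reading only; the helper name `cosine` and the toy gradient vectors are hypothetical and are not values or code from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch: undefined cosine reported as 0
    return dot / (norm_u * norm_v)

# Hypothetical toy gradients, for illustration only.
grad_ce = [1.0, 0.0, -1.0]            # cross-entropy gradient
grad_aux_aligned = [0.5, 0.2, -0.1]   # auxiliary that points "with" cross-entropy
grad_aux_conflict = [-1.0, 0.3, 0.4]  # auxiliary that points "against" it

cos_aligned = cosine(grad_ce, grad_aux_aligned)    # positive
cos_conflict = cosine(grad_ce, grad_aux_conflict)  # negative
```

A positive cosine only says the two objectives agree locally on a descent direction; as the audit reports, that agreement did not translate into exact-match gains.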

Load-bearing premise

The natural-language-to-regex generation task with an exact-match metric is representative enough that a null result generalizes to weak coupling between hidden geometry and task signal across LLM fine-tuning.

What would settle it

A statistically significant exact-match gain from the decoder-visible JEPA auxiliary on a second generation task such as text-to-SQL, after the same multiple-testing corrections, would falsify the weak-coupling claim.

Figures

Figures reproduced from arXiv: 2605.15394 by Biswa Sengupta.

**Figure 1.** The auxiliary loss splices into the standard LoRA fine-tuning pipeline at the final-layer hidden states, in parallel with the language-model head.

**Figure 2.** Visibility of each loss inside the assistant span. $h_{L-1}$ predicts EOS directly; $h_{L-2}$ feeds it via self-attention. Geometric auxiliaries see only the EOS-clipped span $\{h_0, \dots, h_{L-3}\}$; cross-entropy and the decoder-visible margin hinge see the full span and supervise EOS.

**Figure 3.** Data-efficiency curve on NL-RX-TURK. Mean ± one standard deviation across three seeds at each training-data fraction; significance markers are paired Welch's t-tests of each auxiliary against the matched baseline cell (∗: $p_{\text{paired}} < 0.10$, ∗∗: $p_{\text{paired}} < 0.05$).

**Figure 4.** Decoder-visible JEPA architecture. Interior $(t, t+k)$ pairs for $k \in \{2, \dots, K\}$ on the EOS-clipped span ($k = 1$ omitted because it duplicates CE) feed a residual MLP predictor $q_\phi(h_t, k)$, with stop-gradient on the target $h_{t+k}$; both are projected through the shared frozen LM head $W$ and the KL is computed in distribution space. The margin hinge consumes $W h_t$ directly at supervised positions. No gradient flow…
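The construction in the Figure 4 caption can be sketched in plain Python. This is a toy illustration under stated assumptions, not the paper's implementation: the KL direction (target relative to prediction), the averaging over pairs, the identity stand-in for the residual MLP predictor, and all function names are hypothetical. Stop-gradient cannot be expressed without an autodiff framework, so it appears only as a comment.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def project(W, h):
    """Apply the (frozen) LM head: logits = W h, with W of shape vocab x hidden."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def kl(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def interior_pairs(span_len, K, margin=2):
    """(t, t+k) pairs for k in 2..K on the span with its last `margin` positions clipped."""
    clipped = span_len - margin
    return [(t, t + k) for k in range(2, K + 1) for t in range(clipped - k)]

def jepa_kl(hidden, W, predictor, K, margin=2):
    """Mean KL between target and predicted next-token distributions (assumed form)."""
    pairs = interior_pairs(len(hidden), K, margin)
    if not pairs:
        return 0.0
    total = 0.0
    for t, u in pairs:
        target = softmax(project(W, hidden[u]))  # stop-gradient in real training
        pred = softmax(project(W, predictor(hidden[t], u - t)))
        total += kl(target, pred)
    return total / len(pairs)

# Toy inputs: 6 hidden states of width 2, a 3-token vocabulary.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden = [[0.1 * t, 1.0 - 0.1 * t] for t in range(6)]
identity = lambda h, k: h  # hypothetical stand-in for the residual MLP predictor
loss = jepa_kl(hidden, W, identity, K=3)
```

Computing the loss after the LM head, rather than directly between hidden vectors, is what makes the objective "decoder-visible": differences that the head ignores contribute nothing to it.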

Original abstract

Joint-embedding predictive architectures (JEPAs) propose that a model should learn more useful abstractions when trained to predict latent representations rather than observed outputs. For autoregressive language-model fine-tuning the principle entails a stricter requirement: the induced hidden-state geometry must reach the language-model head \emph{and} improve the decoded task metric. We test that requirement under a fixed Llama-3.2-1B-Instruct LoRA harness on natural-language-to-regex generation, comparing twenty-two training-time auxiliaries across trajectory-shape regularisation, distributional constraints, predictor/target asymmetry, Fisher-metric Jacobi residuals, and a decoder-visible JEPA objective constructed to lie in cross-entropy's positive cone. The empirical answer is a structured null: several auxiliaries clear single-cell paired $\alpha = 0.10$ without correction (T3-Local at $\Delta = +2.53$~pp, $p = 0.003$ being the strongest), but none survives Bonferroni or Holm--Bonferroni at the relevant family-wise threshold, even though many change curvature, anisotropy, variance, and gradient direction. Decoder-visible JEPA yields the first positive auxiliary--cross-entropy gradient cosine in the study, yet exact match remains inside seed noise; a full-fine-tuning replication of the same auxiliary at $n = 5$ seeds reproduces the null on both benchmarks (TURK: $\Delta = +0.04$~pp, $p_{\text{paired}} = 0.96$; SYNTH: $\Delta = +0.52$~pp, $p_{\text{paired}} = 0.28$), so the null is robust across LoRA and full fine-tuning for the decoder-visible construction. Hidden-state representation work and decoded-task accuracy are therefore weakly coupled in this regime; we accordingly reframe LLM-domain JEPA evaluation as a coupling problem, in which the operative question is under which metrics useful hidden geometry becomes decoder-visible task signal.
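The gap between "clears an uncorrected threshold" and "survives family-wise correction" can be made concrete with a short sketch of the Bonferroni and Holm-Bonferroni procedures. The family size of 22 and the family-wise level of 0.05 used below are assumptions for illustration; this page does not state how the paper defines its families. All p-values other than 0.003 are hypothetical filler.

```python
def bonferroni(p_values, alpha):
    """Reject each hypothesis whose p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def holm(p_values, alpha):
    """Holm step-down procedure; returns reject flags in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # every remaining (larger) p-value is retained
    return reject

# Assumed family: 22 tests, strongest p = 0.003, the rest hypothetical.
p_values = [0.003] + [0.2] * 21
clears_uncorrected = p_values[0] <= 0.10      # True
survives_holm = any(holm(p_values, 0.05))     # False: 0.003 > 0.05 / 22
survives_bonf = any(bonferroni(p_values, 0.05))
```

Holm's first step applies the same bar as Bonferroni, so if the smallest p-value fails it, nothing in the family is rejected. The outcome is sensitive to both the family size and the chosen level, so the paper's own family definition should be consulted for the actual thresholds.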

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper delivers a controlled null on 22 auxiliaries for LLM fine-tuning on one narrow task, with the decoder-visible JEPA showing gradient alignment but no accuracy lift, yet the jump to a general weak-coupling reframing rests on limited evidence.


The core finding is straightforward: under a fixed LoRA harness on Llama-3.2-1B for natural-language-to-regex generation, none of the twenty-two auxiliaries, including trajectory regularisers, distributional constraints, and a decoder-visible JEPA, produces a reliable gain in exact-match accuracy once multiple-testing corrections are applied. Several move hidden-state properties like curvature or gradient direction, and the JEPA variant even yields the first positive auxiliary-to-cross-entropy cosine in the study, but exact match stays inside seed noise. A full fine-tuning replication of the decoder-visible auxiliary at five seeds confirms the same null on both the TURK and SYNTH benchmarks.

That is a clean empirical result with transparent paired tests, seed reporting, and Bonferroni/Holm controls. The work is new in its scale and in constructing a decoder-visible JEPA that sits in the positive cone of cross-entropy. The statistical handling and replication are the parts that hold up best.

The soft spot is the scope. The task outputs are short, highly constrained formal strings scored by exact match, which discards partial or semantic credit. If the coupling between hidden geometry and decoder-visible signal changes under execution accuracy, open-vocabulary generation, or reasoning tasks where representation quality affects outputs more continuously, the claim that hidden-state work and task accuracy are weakly coupled in this regime does not automatically extend. The paper frames the result as a coupling problem for future JEPA evaluation, but that reframing would be stronger with at least one additional task or metric.

This is the kind of audit that people working on auxiliary objectives or representation learning for LLMs will want to see. Readers who value controlled experiments with explicit statistical safeguards will find it useful even if they disagree with how far the null travels. It deserves a serious referee because the design is falsifiable, the controls are present, and the null itself is worth checking against broader setups.

Referee Report

1 major / 0 minor

Summary. This paper audits Joint-Embedding Predictive Architectures (JEPAs) for autoregressive LLM fine-tuning by testing whether 22 training-time auxiliaries (trajectory regularisers, distributional constraints, predictor/target asymmetry, Fisher-metric Jacobi residuals, and a decoder-visible JEPA objective) improve hidden-state geometry in a way that reaches the language-model head and raises decoded-task accuracy. Experiments use a fixed Llama-3.2-1B-Instruct LoRA harness on natural-language-to-regex generation with an exact-match metric. Results show a structured null: nominal gains (e.g., T3-Local at +2.53 pp, p = 0.003) fail Bonferroni/Holm correction; decoder-visible JEPA produces the first positive auxiliary–cross-entropy gradient cosine yet yields no task improvement. The null for the decoder-visible auxiliary replicates under full fine-tuning on the TURK and SYNTH benchmarks. The authors conclude that hidden-state representation work and decoded accuracy are weakly coupled in this regime and reframe LLM-domain JEPA evaluation as a coupling problem.

Significance. If the null holds, the work supplies controlled evidence that JEPA-style auxiliaries can alter hidden-state curvature, anisotropy, and gradient direction without producing decoder-visible task gains on this benchmark. The statistical design (paired tests, seed variation, Bonferroni/Holm correction) and the full-fine-tuning replication are clear strengths that make the reported null reliable within the studied setup. The result usefully separates representation learning from task-signal transmission in autoregressive fine-tuning.
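The paired design credited here compares each auxiliary with the baseline seed by seed, so the test runs on per-seed differences instead of on two independent groups. The sketch below computes a plain paired t statistic with the standard library; the per-seed scores are hypothetical, and the exact test variant used in the paper is not specified on this page. A p-value would additionally need the Student t distribution (for example from SciPy), which is omitted here.

```python
import math
from statistics import mean, stdev

def paired_t(aux_scores, base_scores):
    """Paired t statistic and degrees of freedom for seed-matched scores."""
    if len(aux_scores) != len(base_scores) or len(aux_scores) < 2:
        raise ValueError("need at least two seed-matched pairs")
    diffs = [a - b for a, b in zip(aux_scores, base_scores)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0.0:
        raise ValueError("zero variance in paired differences")
    n = len(diffs)
    return mean(diffs) / (sd / math.sqrt(n)), n - 1

# Hypothetical exact-match scores (percent) for five matched seeds.
aux = [61.0, 62.5, 60.0, 63.0, 61.5]
base = [60.5, 62.0, 60.5, 62.0, 61.5]
t_stat, dof = paired_t(aux, base)
```

With only a handful of seeds the degrees of freedom are small, which is one reason a mean gain of a fraction of a point can sit comfortably inside seed noise.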

major comments (1)

[Abstract] The reframing of LLM-domain JEPA evaluation as a 'coupling problem' is presented as following from the observed null. The null is robust for the reported NL-to-regex task and exact-match metric (including the full-fine-tuning replication), but the extension to a general reframing assumes this narrow, constrained-output, 0/1-metric setup is representative of regimes in which representation geometry more continuously affects generation quality. A brief discussion of scope or a second task would strengthen the claim.
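The 0/1 nature of exact match can be illustrated with a small contrast against a softer, behaviour-based score. The sketch is illustrative only: `sample_agreement` is a hypothetical metric that is not used in the paper, the regexes use Python `re` syntax (the benchmark's own regex dialect may differ), and agreement on a few sample strings is not proof of equivalence.

```python
import re

def exact_match(pred, gold):
    """1 if the two regex strings are identical after stripping whitespace, else 0."""
    return int(pred.strip() == gold.strip())

def sample_agreement(pred, gold, samples):
    """Fraction of sample strings on which both regexes give the same full-match verdict."""
    if not samples:
        raise ValueError("samples must be non-empty")
    try:
        p, g = re.compile(pred), re.compile(gold)
    except re.error:
        return 0.0  # an uncompilable regex gets no credit in this sketch
    same = sum((p.fullmatch(s) is None) == (g.fullmatch(s) is None) for s in samples)
    return same / len(samples)

gold = r"[0-9]+"
pred = r"\d+"
samples = ["7", "42", "a1", "", "123abc"]

em = exact_match(pred, gold)                     # 0: the strings differ
agree = sample_agreement(pred, gold, samples)    # 1.0 on these samples
```

A prediction that behaves like the reference on every probe still scores zero under exact match, which is the kind of partial or semantic credit the referee suggests a second metric could capture.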

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on scope. We address the concern below and will revise the manuscript to clarify the intended domain of the reframing.


Referee: [Abstract] The reframing of LLM-domain JEPA evaluation as a 'coupling problem' is presented as following from the observed null. The null is robust for the reported NL-to-regex task and exact-match metric (including the full-fine-tuning replication), but the extension to a general reframing assumes this narrow, constrained-output, 0/1-metric setup is representative of regimes in which representation geometry more continuously affects generation quality. A brief discussion of scope or a second task would strengthen the claim.

Authors: We agree that the reframing should be explicitly scoped to the studied regime rather than presented as fully general. The manuscript already qualifies the setting as natural-language-to-regex generation under exact-match evaluation and demonstrates robustness via the full-fine-tuning replication on both TURK and SYNTH. To address the referee's point directly, we will revise the abstract to state that the coupling problem is identified 'in this regime' and add a short paragraph in the discussion section noting that the weak coupling between hidden-state geometry and decoded accuracy may not hold under open-ended generation or continuous quality metrics. We do not add a second task at this stage because the current experimental harness (fixed Llama-3.2-1B LoRA, 22 auxiliaries, paired seed design, multiple-testing correction) is already resource-intensive; the added scope discussion will make the boundary conditions of the claim transparent without overclaiming generality.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical null-result study with interpretive reframing


The paper reports results from controlled fine-tuning experiments comparing 22 training auxiliaries on a natural-language-to-regex task under fixed LoRA and full fine-tuning regimes. The central claim of weak coupling between hidden-state geometry and decoded-task accuracy follows from the observed structured null on exact-match metrics (none surviving multiple-testing correction, with decoder-visible JEPA also null). This is an empirical conclusion from external benchmarks rather than any self-referential equation, fitted parameter renamed as prediction, or self-citation chain. The reframing as a coupling problem is an interpretive step based on the null findings and does not reduce to quantities defined inside the study by construction. No load-bearing self-citations, uniqueness theorems, or ansatzes are present in the derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the chosen regex task and exact-match metric are adequate proxies for decoder-visible utility, plus standard assumptions about LoRA fine-tuning dynamics and the validity of paired t-tests under seed variation. No new entities are postulated.

axioms (1)

Domain assumption: The natural-language-to-regex task with exact-match accuracy is representative of broader LLM fine-tuning regimes for testing representation-task coupling. The paper extrapolates from this single task to the general claim of weak coupling.


