arxiv: 2605.15394 · v1 · pith:PYE5AVJQnew · submitted 2026-05-14 · 💻 cs.LG · cs.AI · stat.ML

Representation Without Reward: A JEPA Audit for LLM Fine-Tuning

Pith reviewed 2026-05-19 16:23 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML
keywords JEPA · LLM fine-tuning · hidden-state geometry · task coupling · auxiliary objectives · LoRA · representation learning · regex generation

The pith

JEPA-style auxiliaries change LLM hidden-state geometry but leave task accuracy unchanged on language-to-regex generation

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether joint-embedding predictive architectures improve large language models by training them to predict latent representations rather than direct outputs. It applies twenty-two auxiliary objectives, including trajectory regularisers, distributional constraints, and a decoder-visible JEPA loss, to a Llama-3.2-1B model fine-tuned on natural language to regex conversion. Several auxiliaries alter hidden-state curvature, anisotropy, variance, and gradient directions, yet none produce exact-match accuracy gains that survive multiple-comparison correction. The null result persists when the decoder-visible construction is replicated with full fine-tuning at five seeds. The authors conclude that hidden-state representation improvements and decoded task performance remain weakly coupled in this regime.

Core claim

In LLM fine-tuning for natural-language-to-regex generation, auxiliary objectives intended to shape hidden-state geometry produce measurable shifts in representation statistics and gradient alignment, including the first positive cosine with cross-entropy observed for a decoder-visible JEPA construction, yet none deliver task accuracy improvements that survive Bonferroni or Holm-Bonferroni correction. Exact-match scores stay inside seed noise for both LoRA and full-parameter regimes. The findings therefore establish a weak coupling between hidden-state representation work and decoded-task accuracy, reframing JEPA evaluation around the question of when useful geometry becomes visible as task-

What carries the argument

The decoder-visible JEPA objective constructed to lie in cross-entropy's positive cone, tested as one of twenty-two auxiliaries for whether induced hidden-state changes reach the language-model head and improve exact-match accuracy.
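
One way to read "lies in cross-entropy's positive cone" is that the auxiliary gradient has a positive inner product with the cross-entropy gradient, which is what a positive gradient cosine measures. The sketch below illustrates that check under this reading only. The vectors are made-up stand-ins for flattened parameter gradients and `cosine` is a hypothetical helper, not code from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy stand-ins for flattened gradients (made-up numbers, not from the paper).
grad_ce = [0.5, -1.0, 0.25]
grad_aux_aligned = [0.4, -0.6, 0.0]     # roughly the same direction as grad_ce
grad_aux_conflict = [-0.5, 0.8, 0.1]    # roughly the opposite direction

# Positive cosine: to first order, a step along -grad_aux also lowers cross-entropy.
print(cosine(grad_ce, grad_aux_aligned) > 0)
# Negative cosine: the auxiliary update pulls against cross-entropy.
print(cosine(grad_ce, grad_aux_conflict) < 0)
```

A positive cosine is only a first-order statement about the loss. As the summary stresses, it does not by itself imply a gain in exact-match accuracy.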

Load-bearing premise

The natural-language-to-regex generation task with exact-match metric is representative enough that a null result generalizes to weak coupling between hidden geometry and task signal across LLM fine-tuning.
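
To make the 0/1 nature of the metric concrete, here is a minimal sketch of string-level exact match. It assumes the metric compares whitespace-stripped strings. The paper may normalise outputs differently, and the example patterns are hypothetical.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions that equal their reference string exactly."""
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


# Hypothetical outputs: the first prediction accepts the same strings as its
# reference but is written differently, so it still scores 0 under exact match.
preds = ["[ab]c", "x+"]
refs = ["(a|b)c", "x+"]
print(exact_match_accuracy(preds, refs))  # 0.5
```

Under such a metric a representation change has to flip an output from wrong to exactly right before it registers at all.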

What would settle it

A statistically significant exact-match gain from the decoder-visible JEPA auxiliary on a second generation task such as text-to-SQL, after the same multiple-testing corrections, would falsify the weak-coupling claim.

Figures

Figures reproduced from arXiv: 2605.15394 by Biswa Sengupta.

Figure 1: The auxiliary loss splices into the standard LoRA fine-tuning pipeline at the final-layer hidden states, in parallel with the language-model head.
Table II: The hypothesis map: eighteen training-time auxiliaries plus one inference-time intervention.
# | Name | Class | Eq.
STP | Semantic Tube Prediction | 1st-order attractor | (7)
T1 | Curvature-Aware Tube | 2nd-order attractor | (9)
T2 | Riemannian-Metric Tube | metric-cosine a… |
Figure 2: Visibility of each loss inside the assistant span. $h_{L-1}$ predicts EOS directly; $h_{L-2}$ feeds it via self-attention. Geometric auxiliaries see only the EOS-clipped span $\{h_0, \dots, h_{L-3}\}$; cross-entropy and the decoder-visible margin hinge see the full span and supervise EOS.
… baseline produced 4.2–35.8%. We therefore clip the right end of the assistant span by margin = 2 tokens before passing it to any geometr…
Figure 3: Data-efficiency curve on NL-RX-TURK. Mean ± one standard deviation across three seeds at each training-data fraction; significance markers are paired Welch's t-tests of each auxiliary against the matched baseline cell (∗: $p_{\text{paired}} < 0.10$, ∗∗: $p_{\text{paired}} < 0.05$).
… structured-null reading: the original STP claim is framed as a sample-efficiency gain rather than an asymptotic-exact-match gain, and a small-data l…
Figure 4: Decoder-visible JEPA architecture. Interior $(t, t+k)$ pairs for $k \in \{2, \dots, K\}$ on the EOS-clipped span ($k = 1$ omitted because it duplicates CE) feed a residual MLP predictor $q_\phi(h_t, k)$, with stop-gradient on the target $h_{t+k}$; both are projected through the shared frozen LM head $W$ and the KL is computed in distribution space. The margin hinge consumes $W h_t$ directly at supervised positions. No gradient flow…
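
The Figure 2 and Figure 4 captions together describe a concrete recipe: clip the last two positions of the assistant span, form interior $(t, t+k)$ pairs for $k = 2, \dots, K$, project predictor output and target through the LM head, and take a KL in distribution space. The following is a forward-pass-only sketch of that recipe in plain Python. Several pieces are assumptions rather than the paper's choices: the KL direction (target against prediction), the averaging over pairs, and the `predictor` function, which is a hypothetical linear stand-in for the residual MLP $q_\phi$. With no autodiff involved, the stop-gradient is represented only by treating the target as a constant.

```python
import math


def matvec(W, h):
    """Multiply matrix W (list of rows) by vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]


def kl(p, q):
    """KL(p || q) for two strictly positive distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def predictor(h, k, A):
    """Hypothetical residual predictor: h + k * (A h). A stand-in for the MLP."""
    return [x + k * d for x, d in zip(h, matvec(A, h))]


def decoder_visible_kl(hidden, W, A, K, margin=2):
    """Mean KL over interior (t, t+k) pairs, k = 2..K, on the EOS-clipped span."""
    span = hidden[:len(hidden) - margin]       # drop the last `margin` positions
    total, count = 0.0, 0
    for k in range(2, K + 1):                  # k = 1 is skipped, as in the caption
        for t in range(len(span) - k):
            # Target is a constant here (the stop-gradient side).
            target = softmax(matvec(W, span[t + k]))
            pred = softmax(matvec(W, predictor(span[t], k, A)))
            total += kl(target, pred)          # assumed direction: KL(target || pred)
            count += 1
    return total / count if count else 0.0
```

For a span of six hidden states with `margin=2` and `K=3`, the loop visits three pairs: two for $k = 2$ and one for $k = 3$.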
Original abstract

Joint-embedding predictive architectures (JEPAs) propose that a model should learn more useful abstractions when trained to predict latent representations rather than observed outputs. For autoregressive language-model fine-tuning the principle entails a stricter requirement: the induced hidden-state geometry must reach the language-model head and improve the decoded task metric. We test that requirement under a fixed Llama-3.2-1B-Instruct LoRA harness on natural-language-to-regex generation, comparing twenty-two training-time auxiliaries across trajectory-shape regularisation, distributional constraints, predictor/target asymmetry, Fisher-metric Jacobi residuals, and a decoder-visible JEPA objective constructed to lie in cross-entropy's positive cone. The empirical answer is a structured null: several auxiliaries clear single-cell paired $\alpha = 0.10$ without correction (T3-Local at $\Delta = +2.53$ pp, $p = 0.003$ being the strongest), but none survives Bonferroni or Holm-Bonferroni at the relevant family-wise threshold, even though many change curvature, anisotropy, variance, and gradient direction. Decoder-visible JEPA yields the first positive auxiliary–cross-entropy gradient cosine in the study, yet exact match remains inside seed noise; a full-fine-tuning replication of the same auxiliary at $n = 5$ seeds reproduces the null on both benchmarks (TURK: $\Delta = +0.04$ pp, $p_{\text{paired}} = 0.96$; SYNTH: $\Delta = +0.52$ pp, $p_{\text{paired}} = 0.28$), so the null is robust across LoRA and full fine-tuning for the decoder-visible construction. Hidden-state representation work and decoded-task accuracy are therefore weakly coupled in this regime; we accordingly reframe LLM-domain JEPA evaluation as a coupling problem, in which the operative question is under which metrics useful hidden geometry becomes decoder-visible task signal.
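
The abstract's central statistical move is that an uncorrected $p = 0.003$ does not survive family-wise correction. A minimal sketch of the two named procedures follows. The family size of 22 and the level $\alpha = 0.05$ are assumptions chosen for illustration, since this page does not state the family or threshold the paper uses, and every p-value other than 0.003 is made-up filler.

```python
def bonferroni(p_values, alpha):
    """Reject hypothesis i when p_i <= alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm_bonferroni(p_values, alpha):
    """Step-down Holm procedure; returns one reject flag per input p-value."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The (rank + 1)-th smallest p-value is tested at alpha / (m - rank).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, every larger p-value is retained
    return reject


# Hypothetical family: the reported p = 0.003 plus 21 made-up filler p-values.
p_family = [0.003] + [0.20 + 0.03 * i for i in range(21)]
print(bonferroni(p_family, alpha=0.05)[0])       # False: threshold is 0.05 / 22, about 0.00227
print(holm_bonferroni(p_family, alpha=0.05)[0])  # False: the smallest p-value faces the same threshold
```

With the same 22 tests at $\alpha = 0.10$ the threshold would be about 0.0045 and $p = 0.003$ would pass, so the verdict depends on the family size and level actually used in the paper.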

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. This paper audits Joint-Embedding Predictive Architectures (JEPAs) for autoregressive LLM fine-tuning by testing whether 22 training auxiliaries (trajectory regularizers, distributional constraints, predictor/target asymmetry, Fisher-metric Jacobi residuals, and a decoder-visible JEPA objective) improve hidden-state geometry in a way that reaches the language-model head and raises decoded-task accuracy. Experiments use a fixed Llama-3.2-1B-Instruct LoRA harness on natural-language-to-regex generation with exact-match metric. Results show a structured null: nominal gains (e.g., T3-Local at +2.53 pp, p=0.003) fail Bonferroni/Holm correction; decoder-visible JEPA produces the first positive auxiliary–cross-entropy gradient cosine yet yields no task improvement. The null replicates under full fine-tuning (TURK and SYNTH benchmarks). The authors conclude that hidden-state representation work and decoded accuracy are weakly coupled in this regime and reframe LLM-domain JEPA evaluation as a coupling problem.

Significance. If the null holds, the work supplies controlled evidence that JEPA-style auxiliaries can alter hidden-state curvature, anisotropy, and gradient direction without producing decoder-visible task gains on this benchmark. The statistical design (paired tests, seed variation, Bonferroni/Holm correction) and the full-fine-tuning replication are clear strengths that make the reported null reliable within the studied setup. The result usefully separates representation learning from task-signal transmission in autoregressive fine-tuning.
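
For readers unfamiliar with the paired design mentioned here, the sketch below computes a paired t statistic from per-seed scores. It is an illustration only: the scores are made-up numbers, not results from the paper, and it uses the ordinary paired form rather than whatever variant the authors ran.

```python
import math
import statistics


def paired_t_statistic(aux_scores, base_scores):
    """t statistic of per-seed differences (aux - baseline) and its degrees of freedom."""
    diffs = [a - b for a, b in zip(aux_scores, base_scores)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)           # sample standard deviation
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1


# Made-up exact-match percentages for five seeds; not taken from the paper.
baseline = [61.0, 63.5, 62.0, 60.5, 64.0]
auxiliary = [61.5, 63.0, 62.5, 60.0, 64.5]
t_stat, dof = paired_t_statistic(auxiliary, baseline)
print(round(t_stat, 3), dof)
```

Turning the statistic into a p-value needs the Student t distribution with `dof` degrees of freedom, for example via `scipy.stats.ttest_rel`. A small mean difference relative to the seed-to-seed spread of the differences gives a small statistic, which is the sense in which a gain sits "inside seed noise".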

major comments (1)
  1. [Abstract] Abstract: the reframing of LLM-domain JEPA evaluation as a 'coupling problem' is presented as following from the observed null. The null is robust for the reported NL-to-regex task and exact-match metric (including the full-fine-tuning replication), but the extension to a general reframing assumes this narrow, constrained-output, 0/1-metric setup is representative of regimes in which representation geometry more continuously affects generation quality. A brief discussion of scope or a second task would strengthen the claim.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on scope. We address the concern below and will revise the manuscript to clarify the intended domain of the reframing.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the reframing of LLM-domain JEPA evaluation as a 'coupling problem' is presented as following from the observed null. The null is robust for the reported NL-to-regex task and exact-match metric (including the full-fine-tuning replication), but the extension to a general reframing assumes this narrow, constrained-output, 0/1-metric setup is representative of regimes in which representation geometry more continuously affects generation quality. A brief discussion of scope or a second task would strengthen the claim.

    Authors: We agree that the reframing should be explicitly scoped to the studied regime rather than presented as fully general. The manuscript already qualifies the setting as natural-language-to-regex generation under exact-match evaluation and demonstrates robustness via the full-fine-tuning replication on both TURK and SYNTH. To address the referee's point directly, we will revise the abstract to state that the coupling problem is identified 'in this regime' and add a short paragraph in the discussion section noting that the weak coupling between hidden-state geometry and decoded accuracy may not hold under open-ended generation or continuous quality metrics. We do not add a second task at this stage because the current experimental harness (fixed Llama-3.2-1B LoRA, 22 auxiliaries, paired seed design, multiple-testing correction) is already resource-intensive; the added scope discussion will make the boundary conditions of the claim transparent without overclaiming generality.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical null-result study with interpretive reframing

full rationale

The paper reports results from controlled fine-tuning experiments comparing 22 training auxiliaries on a natural-language-to-regex task under fixed LoRA and full fine-tuning regimes. The central claim of weak coupling between hidden-state geometry and decoded-task accuracy follows from the observed structured null on exact-match metrics (none surviving multiple-testing correction, with decoder-visible JEPA also null). This is an empirical conclusion from external benchmarks rather than any self-referential equation, fitted parameter renamed as prediction, or self-citation chain. The reframing as a coupling problem is an interpretive step based on the null findings and does not reduce to quantities defined inside the study by construction. No load-bearing self-citations, uniqueness theorems, or ansatzes are present in the derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the chosen regex task and exact-match metric are adequate proxies for decoder-visible utility, plus standard assumptions about LoRA fine-tuning dynamics and the validity of paired t-tests under seed variation. No new entities are postulated.

axioms (1)
  • Domain assumption: The natural-language-to-regex task with exact-match accuracy is representative of broader LLM fine-tuning regimes for testing representation-task coupling.
    The paper extrapolates from this single task to the general claim of weak coupling.

pith-pipeline@v0.9.0 · 5892 in / 1403 out tokens · 32205 ms · 2026-05-19T16:23:48.105202+00:00 · methodology

