
arxiv: 2605.03724 · v1 · submitted 2026-05-05 · 💻 cs.LG · cs.AI

Rethinking the Rank Threshold for LoRA Fine-Tuning

Pith reviewed 2026-05-07 16:45 UTC · model grok-4.3

keywords: LoRA, fine-tuning, rank threshold, neural tangent kernel, binary classification, local minima, cross-entropy loss

The pith

For binary classification, LoRA fine-tuning requires only rank one to avoid spurious local minima under squared-error and cross-entropy losses in the neural tangent kernel regime.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reexamines a prior condition requiring LoRA rank r to satisfy r(r+1)/2 > K N to prevent bad minima during fine-tuning of models like RoBERTa. It replaces this with a looser bound based on the actual dimension of the LoRA update space and shows that cross-entropy loss removes any rank restriction via the Polyak-Łojasiewicz inequality. A complexity bound further indicates that rank one is optimal for binary problems where the bias term saturates. This matters for efficient adaptation of large language models, as it suggests using far fewer trainable parameters than previously thought necessary for binary tasks. Experiments on GLUE binary datasets confirm rank one performs comparably to rank 12.
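The gap between the two counts can be seen from a standard dimension calculation. The following is a sketch under assumed notation, not the paper's derivation: the factor names B and A and the symbol Sym_r are introduced here for illustration, and reading the earlier count as the dimension of symmetric r × r matrices is an interpretation of the phrase "symmetric Sard-form count".

```latex
% LoRA update with assumed factor names B, A
\Delta W = BA, \qquad B \in \mathbb{R}^{m \times r}, \quad A \in \mathbb{R}^{r \times n}

% Trainable parameters versus dimension of the rank-r matrix manifold.
% The r^2 term removes the redundancy (B, A) \mapsto (BG, G^{-1}A), G \in GL(r).
\#\text{params} = r(m+n), \qquad
\dim\{X \in \mathbb{R}^{m \times n} : \operatorname{rank} X = r\} = r(m+n) - r^2

% Dimension of symmetric r x r matrices (the form of the earlier count)
\dim \mathrm{Sym}_r = \frac{r(r+1)}{2}

% At r = 1 the two counts are
r(m+n) - r^2 = m + n - 1 \qquad \text{versus} \qquad \frac{r(r+1)}{2} = 1
```

At r = 1 the non-symmetric count already scales with the layer width m + n, whereas the symmetric count equals one, which is why the former can exceed a multiple of KN at much smaller rank.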

Core claim

Three results together establish that the prescribed LoRA rank reduces to r = 1 for binary classification. A non-symmetric count of the LoRA manifold dimension gives the weaker condition r(m + n) - r² > C* · K N with C* ≈ 1.35, which holds at r = 1. The Polyak-Łojasiewicz inequality eliminates the rank threshold for cross-entropy loss. A Rademacher complexity bound predicts rank-one variance optimality when the bias is saturated, as occurs for K = 2 but not higher.
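The two rank conditions can be compared numerically. The following is a minimal sketch, not the paper's code: the function names are hypothetical, and the example values m = n = 768 (the RoBERTa-base hidden size), K = 2, and N = 32 are illustrative assumptions rather than the paper's stated canonical setup.

```python
from typing import Optional


def min_rank_symmetric(K: int, N: int) -> int:
    """Smallest rank r with r(r+1)/2 > K*N (the earlier symmetric count)."""
    r = 1
    # r(r+1) is always even, so integer division is exact here.
    while r * (r + 1) // 2 <= K * N:
        r += 1
    return r


def min_rank_nonsymmetric(
    m: int, n: int, K: int, N: int, c_star: float = 1.35
) -> Optional[int]:
    """Smallest rank r <= min(m, n) with r(m+n) - r^2 > c_star*K*N.

    Returns None if no admissible rank satisfies the condition.
    c_star = 1.35 is the approximate constant quoted in the summary.
    """
    for r in range(1, min(m, n) + 1):
        if r * (m + n) - r * r > c_star * K * N:
            return r
    return None


if __name__ == "__main__":
    # Illustrative values only (assumptions, not taken from the paper).
    m, n, K, N = 768, 768, 2, 32
    print("symmetric condition, minimal rank:", min_rank_symmetric(K, N))
    print("non-symmetric condition, minimal rank:",
          min_rank_nonsymmetric(m, n, K, N))
```

Under the symmetric condition the smallest admissible rank is 12 exactly when 66 ≤ KN < 78, since 11 · 12 / 2 = 66 and 12 · 13 / 2 = 78. Under the non-symmetric condition, r = 1 suffices whenever m + n - 1 > C* · KN.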

What carries the argument

The dimension of the non-symmetric LoRA manifold and the Polyak-Łojasiewicz inequality applied to cross-entropy loss in the NTK regime.
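For reference, the Polyak-Łojasiewicz (PL) inequality in its standard textbook form is given below. This is a sketch of the general statement, not the paper's theorem: the symbols f, θ, μ, and L are generic, and this summary does not say whether the paper uses a global or a local version.

```latex
% PL inequality with constant \mu > 0, where f^* = \inf_\theta f(\theta)
\frac{1}{2}\,\|\nabla f(\theta)\|^2 \;\ge\; \mu\,\bigl(f(\theta) - f^*\bigr)
\qquad \text{for all } \theta

% Consequence: every stationary point attains the infimum
\nabla f(\theta) = 0 \;\Longrightarrow\; f(\theta) = f^*

% For L-smooth f, gradient descent with step size 1/L converges linearly
f(\theta_t) - f^* \;\le\; \Bigl(1 - \frac{\mu}{L}\Bigr)^{t}\,\bigl(f(\theta_0) - f^*\bigr)
```

Because the inequality rules out stationary points with suboptimal loss, a loss that satisfies it has no spurious local minima regardless of how the parameter count compares with KN, which is consistent with the claim that no rank threshold is needed in that setting.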

If this is right

  • The sufficient rank for avoiding spurious minima drops to one for binary classification.
  • Cross-entropy loss removes any rank threshold requirement entirely.
  • Rank one achieves variance optimality in binary cases due to bias saturation.
  • Rank one matches higher ranks empirically on binary GLUE tasks, while on multi-class MNLI the optimal rank shifts above one.
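The variance claim is stated in terms of Rademacher complexity. The standard definition is recalled below as a sketch; the function class 𝓕 that the paper bounds (presumably rank-r LoRA updates) and the resulting bound are not reproduced in this summary, so the symbols here are generic.

```latex
% Empirical Rademacher complexity of a function class F on a sample
% S = (x_1, ..., x_N), with \sigma_i i.i.d. uniform on {-1, +1}
\hat{\mathfrak{R}}_S(\mathcal{F})
  = \mathbb{E}_{\sigma}\!\left[\,\sup_{f \in \mathcal{F}}
      \frac{1}{N} \sum_{i=1}^{N} \sigma_i\, f(x_i)\right]
```

One reading of "bias saturation" is that increasing the rank no longer reduces approximation error, so only the complexity term changes with r and the smallest rank minimizes it.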

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar rank reductions might apply to other parameter-efficient methods if their manifold dimensions are calculated precisely.
  • Testing the approach on regression tasks or with different losses could reveal when rank one remains sufficient.
  • The shift in optimal rank for multi-class tasks suggests investigating task-specific rank selection rules.

Load-bearing premise

The binary classification guarantees depend on the neural tangent kernel regime holding during fine-tuning with standard assumptions on the model and data.

What would settle it

Finding a binary classification task and encoder where rank-one LoRA fine-tuning converges to a solution with significantly higher loss than rank 12, indicating a spurious minimum despite the dimension condition being met.

Figures

Figures reproduced from arXiv: 2605.03724 by Juneyoung Park.

Figure 1: Test accuracy as a function of LoRA rank, on RoBERTa-base with…

Figure 2: Robustness across train size N and LoRA layer index on RoBERTa-base SST-2. (a, b) Train size sweep at layer 11: surface and line slices. (c, d) LoRA layer sweep at N = 32: surface and line slices. The dashed vertical line in (b, d) marks the existing rank threshold r = 12. The rank pattern is preserved across both axes.
… N moderate (64 per class, N = 192) the gap is similar. The K > 2 regime thus exhibits a…

Figure 3: Non-trivial training-fit regime on RoBERTa-base SST-2 with NTK Jacobians projected to…
Original abstract

A recent landscape analysis of LoRA fine-tuning in the neural tangent kernel regime establishes a sufficient condition $r(r+1)/2 > KN$ on the LoRA rank $r$ for the absence of spurious local minima under squared-error loss, prescribing $r \geq 12$ on canonical few-shot RoBERTa setups. The condition is stated for general output dimension $K$, so its sharpness in any particular regime, and its practical implication for the cross-entropy loss actually used in fine-tuning, are open. We give three results that together reduce the prescribed rank to $r = 1$ for binary classification in this regime. First, replacing the symmetric Sard-form count with the non-symmetric LoRA manifold dimension yields a strictly weaker capacity requirement, $r(m+n) - r^2 > C^* \cdot KN$ with $C^* \approx 1.35$ under Gaussian-iid features, satisfied at $r = 1$ on canonical setups. Second, in the cross-entropy setting the Polyak--\L{}ojasiewicz inequality removes the rank threshold entirely. Third, a Rademacher-complexity bound predicts rank-one variance optimality precisely when the bias term is saturated, which is the case for binary classification but not for $K > 2$. Empirically, across four GLUE-style binary tasks, three encoder architectures, and at scale on RoBERTa-large, rank one is competitive with the existing prescription $r = 12$; on multi-class MNLI the optimal rank shifts above one, also as predicted. The binary-regime guarantees are conditional on standard NTK assumptions; the multi-class extension is left to future work.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper claims that previous landscape analysis prescribing LoRA rank r ≥ 12 for absence of spurious minima under squared-error loss in the NTK regime can be relaxed to r = 1 for binary classification. It derives three results: (1) a weaker capacity bound r(m + n) - r² > C* · KN with C* ≈ 1.35 under Gaussian-iid features using the non-symmetric LoRA manifold dimension instead of the symmetric Sard count; (2) the Polyak-Łojasiewicz inequality eliminates any rank threshold under cross-entropy loss; (3) a Rademacher complexity bound shows rank-one variance optimality precisely when the bias term is saturated, which holds for binary but not multi-class settings. Experiments on four GLUE-style binary tasks across three encoders (including RoBERTa-large) show r = 1 competitive with r = 12, while multi-class MNLI requires higher rank as predicted. All guarantees are conditional on NTK assumptions.

Significance. If the theoretical reductions and empirical competitiveness hold, the result would have substantial practical impact by enabling more parameter-efficient LoRA fine-tuning for binary tasks without sacrificing performance or landscape guarantees. The work strengthens understanding of LoRA geometry and loss-specific behavior, and the explicit Gaussian-iid derivation plus falsifiable multi-class prediction are strengths. However, the reliance on NTK regime and feature assumptions limits immediate broad applicability.

major comments (2)
  1. [Section 3 (capacity bound derivation)] The first result (weaker capacity bound r(m + n) - r² > C* · KN): the claim that r = 1 suffices on canonical m, n, K, N setups depends on C* ≈ 1.35 derived under the Gaussian-iid feature model. Real pre-trained encoder features (e.g., RoBERTa on GLUE) exhibit correlations and non-Gaussian structure; the paper must show either that the effective C* remains below the threshold or provide a robustness argument, as this is load-bearing for transferring the absence-of-spurious-minima guarantee to r = 1.
  2. [Section 4 (cross-entropy analysis)] The second result (PL inequality for cross-entropy): while the Polyak-Łojasiewicz inequality is invoked to remove the rank threshold, the manuscript must explicitly state the conditions under which PL holds for the LoRA-parameterized model in the NTK regime and confirm it applies to the actual fine-tuning loss and architecture, as this step is central to the binary-regime claim.
minor comments (3)
  1. [Experiments] The empirical section summarizes results without error bars, standard deviations, or full experimental protocol (e.g., learning rates, number of runs, exact task splits), which weakens assessment of whether r = 1 is statistically competitive with r = 12.
  2. [Introduction / Section 2] Notation for manifold dimension and the constant C* should be introduced with a brief reminder of the prior symmetric count r(r + 1)/2 > KN to improve readability for readers familiar with the referenced landscape analysis.
  3. [Abstract and Experiments] The abstract states results 'at scale on RoBERTa-large' but the main text should include a dedicated table or figure with exact metrics for that scale to support the claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and recommendation. We respond point-by-point to the major comments below, with revisions proposed where they strengthen the manuscript without altering its core claims.

  1. Referee: [Section 3 (capacity bound derivation)] The first result (weaker capacity bound r(m + n) - r² > C* · KN): the claim that r = 1 suffices on canonical m, n, K, N setups depends on C* ≈ 1.35 derived under the Gaussian-iid feature model. Real pre-trained encoder features (e.g., RoBERTa on GLUE) exhibit correlations and non-Gaussian structure; the paper must show either that the effective C* remains below the threshold or provide a robustness argument, as this is load-bearing for transferring the absence-of-spurious-minima guarantee to r = 1.

    Authors: We agree that the capacity bound relies on the Gaussian-iid feature model, a standard assumption in NTK analyses but not identical to real encoder features. We will revise Section 3 to add an explicit robustness discussion: the derived C* ≈ 1.35 is conservative, the bound is sufficient for r=1 on canonical setups, and our experiments across real encoders (including RoBERTa-large on GLUE binary tasks) empirically confirm competitive performance at r=1. A full non-Gaussian analysis remains future work, but the combination of theory under standard assumptions and empirical validation supports transferability of the r=1 claim. revision: partial

  2. Referee: [Section 4 (cross-entropy analysis)] The second result (PL inequality for cross-entropy): while the Polyak-Łojasiewicz inequality is invoked to remove the rank threshold, the manuscript must explicitly state the conditions under which PL holds for the LoRA-parameterized model in the NTK regime and confirm it applies to the actual fine-tuning loss and architecture, as this step is central to the binary-regime claim.

    Authors: We thank the referee for this clarification request. We will revise Section 4 to explicitly state the conditions: in the NTK regime the LoRA model is effectively linear with positive-definite Hessian away from minima, and for binary cross-entropy with saturated bias (per our third result) the PL inequality holds with mu > 0 independent of rank. We will add a paragraph confirming these conditions apply directly to the LoRA parameterization, cross-entropy loss, and fine-tuning architectures used in the experiments. revision: yes

Circularity Check

0 steps flagged

Derivations rely on first-principles manifold dimension and standard inequalities

full rationale

The paper's three results are derived from explicit replacements (symmetric Sard count to non-symmetric LoRA manifold dimension), the Polyak-Łojasiewicz inequality for cross-entropy, and a Rademacher complexity bound under the stated bias-saturation condition. The constant C* is presented as an approximation computed under an explicit Gaussian-iid feature model rather than fitted to the target GLUE data or the final claims. No load-bearing step reduces by construction to a fitted parameter, self-citation chain, or ansatz imported from the authors' prior work; the referenced landscape analysis is treated as external prior art. Empirical results are reported separately as validation and do not serve as inputs to the theoretical bounds.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on the neural tangent kernel regime, Gaussian-iid feature assumptions for the capacity calculation, and the binary output setting that saturates the bias term.

free parameters (1)
  • C^* ≈ 1.35
    Approximate multiplier in the capacity inequality r(m+n)-r^2 > C^* KN derived under Gaussian-iid features.
axioms (2)
  • domain assumption Standard NTK assumptions hold for the fine-tuning dynamics
    Explicitly stated as the condition under which the binary-regime guarantees apply.
  • domain assumption Input features are Gaussian and iid
    Used to obtain the numerical value of C^* in the manifold-dimension bound.

pith-pipeline@v0.9.0 · 5599 in / 1457 out tokens · 49947 ms · 2026-05-07T16:45:25.643085+00:00 · methodology

