arXiv: 2602.04998 · v2 · pith:MD5N3MWVnew · submitted 2026-02-04 · 💻 cs.LG · cs.AI · cs.CL

Learning Rate Matters: Vanilla LoRA May Suffice for LLM Fine-tuning

Pith reviewed 2026-05-21 13:11 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords LoRA · LLM fine-tuning · learning rate · hyperparameter tuning · parameter-efficient methods · Hessian eigenvalue · large language models

The pith

When learning rates are properly tuned, vanilla LoRA achieves performance comparable to more elaborate variants across diverse LLM fine-tuning tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper re-examines nine LoRA variants against the basic version on tasks such as mathematical reasoning, commonsense reasoning, code generation, and instruction following. It runs wide searches over learning rate, batch size, rank, and training length at several model sizes. Each variant turns out to prefer its own learning rate range, yet once those rates are chosen carefully the final accuracies sit within 1-2 percent of one another. The work therefore treats vanilla LoRA as a competitive baseline and argues that earlier reported gains may have come from mismatched training settings rather than from the modifications themselves. A second-order check ties the differing optimal rates to changes in the largest Hessian eigenvalue.

Core claim

The paper establishes that different LoRA methods exhibit distinct optimal learning rate ranges, but after proper tuning across multiple tasks including mathematical reasoning, commonsense reasoning, code generation, and instruction following, their peak performances converge to within 1-2% of each other. This holds at various model scales, with only minor differences tied to the adaptation rank. A second-order analysis connects these optimal ranges to variations in the largest Hessian eigenvalue.
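The distinction between comparing methods at one shared learning rate and comparing their tuned peaks can be sketched in a few lines. This is a minimal illustration, not the paper's evaluation code: the method names, learning rates, and scores below are made-up toy values, and `peak`, `peak_spread`, and `gap_at_shared_lr` are hypothetical helper names.

```python
# Toy accuracy-vs-learning-rate sweeps. Every number here is invented for
# illustration; none of them come from the paper.
sweeps = {
    "vanilla_lora": {1e-5: 55.0, 1e-4: 68.0, 1e-3: 71.0},
    "variant_a":    {1e-5: 66.0, 1e-4: 70.5, 1e-3: 52.0},
    "variant_b":    {1e-5: 60.0, 1e-4: 71.5, 1e-3: 63.0},
}


def peak(sweep):
    """Return (best_lr, best_score) for one method's sweep."""
    best_lr = max(sweep, key=sweep.get)
    return best_lr, sweep[best_lr]


def peak_spread(sweeps):
    """Per-method peaks and the gap between the best and worst peak."""
    peaks = {name: peak(sweep) for name, sweep in sweeps.items()}
    scores = [score for _, score in peaks.values()]
    return peaks, max(scores) - min(scores)


def gap_at_shared_lr(sweeps, lr):
    """Gap between methods when all of them are forced to use the same lr."""
    scores = [sweep[lr] for sweep in sweeps.values()]
    return max(scores) - min(scores)


peaks, spread = peak_spread(sweeps)
for name, (lr, score) in peaks.items():
    print(f"{name}: best lr={lr:g}, peak={score}")
print("spread of tuned peaks:", spread)
print("gap at shared lr=1e-4:", gap_at_shared_lr(sweeps, 1e-4))
```

In this toy data the tuned peaks differ by 1.0 point while the gap at a single shared learning rate is 3.5 points, which is the kind of discrepancy the core claim describes.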

What carries the argument

The systematic hyperparameter search over learning rate, batch size, rank, and training duration, together with the attribution of learning-rate preferences to differences in the largest Hessian eigenvalue.
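One way to see why the largest Hessian eigenvalue would govern the usable learning rate is the classical result for gradient descent on a quadratic loss: the iteration is stable only when the learning rate is below 2/λ_max. The sketch below estimates λ_max by power iteration on a toy diagonal Hessian and checks that bound. It is an assumption-laden illustration rather than the paper's procedure: the matrix `H`, the function names, and the step counts are all hypothetical, and for a real model the Hessian-vector product would come from automatic differentiation instead of an explicit matrix.

```python
import numpy as np


def largest_eigenvalue(hvp, dim, iters=200, seed=0):
    """Estimate the largest-magnitude Hessian eigenvalue by power iteration.

    hvp(v) must return the Hessian-vector product H @ v.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)
    eig = 0.0
    for _ in range(iters):
        hv = hvp(v)
        eig = float(v @ hv)  # Rayleigh quotient for the current unit vector
        norm = np.linalg.norm(hv)
        if norm == 0.0:
            break
        v = hv / norm
    return eig


# Toy quadratic loss 0.5 * x^T H x with a known spectrum (hypothetical values).
H = np.diag([10.0, 3.0, 0.5])
lam_max = largest_eigenvalue(lambda v: H @ v, dim=3)
max_stable_lr = 2.0 / lam_max  # classical stability bound for a quadratic


def run_gd(lr, steps=100):
    """Run plain gradient descent on the toy loss and return the final norm."""
    x = np.ones(3)
    for _ in range(steps):
        x = x - lr * (H @ x)
    return float(np.linalg.norm(x))


print("estimated lambda_max:", lam_max)
print("stability bound 2/lambda_max:", max_stable_lr)
print("final norm just below the bound:", run_gd(0.19))
print("final norm just above the bound:", run_gd(0.21))
```

Under this reading, a method whose loss surface has a larger top eigenvalue would be expected to tolerate only smaller learning rates, which is consistent with the attribution described above.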

Load-bearing premise

That the conducted searches over learning rate, batch size, rank, and training duration are broad enough to reveal each method's true best performance on the chosen tasks and models.

What would settle it

Finding that even after wider or differently structured hyperparameter searches one or more variants still exceed vanilla LoRA by more than 2 percent on the same tasks would undermine the central result.

Original abstract

Low-Rank Adaptation (LoRA) is the prevailing approach for efficient large language model (LLM) fine-tuning. Building on this paradigm, recent studies have proposed alternative initialization strategies, architectural modifications, and optimization adjustments, reporting substantial improvements over vanilla LoRA. However, these gains are often demonstrated under fixed or narrowly tuned hyperparameter settings, despite the known sensitivity of neural networks to training configurations. In this work, we systematically re-evaluate nine representative LoRA variants alongside vanilla LoRA through extensive hyperparameter searches over learning rate, batch size, rank, and training duration. Across tasks spanning mathematical reasoning, commonsense reasoning, code generation, and instruction following at diverse model scales, we find that different LoRA methods favor distinct learning rate ranges. Crucially, once learning rates are properly tuned, all methods achieve similar peak performance (within 1-2%), with only subtle rank-dependent behaviors. These results suggest that vanilla LoRA remains a competitive baseline and that improvements reported under a single training configuration may not reflect consistent methodological advantages. Finally, a second-order analysis attributes the differing optimal learning rate ranges to variations in the largest Hessian eigenvalue, aligning with classical learning theories.
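For reference, the vanilla LoRA parameterization that the variants modify can be written as follows. The abstract does not restate it, so the symbols here ($W_0$, $A$, $B$, $r$, $\alpha$) follow the usual convention from the original LoRA formulation and are assumptions as far as this paper's notation is concerned.

```latex
h = W_0 x + \Delta W x = W_0 x + \frac{\alpha}{r} B A x,
\qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r \ll \min(d, k)
```

Here $W_0$ is the frozen pretrained weight and only $A$ and $B$ are trained. In the standard setup $B$ is initialized to zero, so $\Delta W = 0$ at the start of fine-tuning. The variants discussed in the abstract change the initialization, the architecture, or the optimization of this update.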

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript re-evaluates nine LoRA variants against vanilla LoRA for LLM fine-tuning by performing extensive hyperparameter searches over learning rate, batch size, rank, and training duration across tasks in mathematical reasoning, commonsense reasoning, code generation, and instruction following at various model scales. The key finding is that different methods prefer distinct learning rate ranges, but with proper tuning, all achieve similar peak performance within 1-2%, implying that vanilla LoRA may suffice and that prior reported gains could be due to suboptimal configurations. A Hessian eigenvalue analysis is provided to explain the LR preferences.

Significance. Should the empirical equivalence hold after thorough verification, this would be a notable contribution to the LLM fine-tuning literature by cautioning against overclaiming methodological improvements without comprehensive hyperparameter optimization. It promotes vanilla LoRA as a strong baseline and aligns empirical observations with classical optimization theory through second-order analysis. The breadth of tasks and scales adds robustness to the conclusions.

major comments (2)
  1. [Hyperparameter Search Methodology] The description of the hyperparameter searches lacks quantitative details on the ranges explored for learning rate per method, grid resolution, batch sizes tested, and the number of trials conducted. Since the central claim of 1-2% performance parity relies on these searches having identified the true optima for each variant, insufficient coverage could lead to underestimation of some methods' peaks and thus an artifactual equivalence.
  2. [Results and Analysis] The reported performance similarities (within 1-2%) should be accompanied by statistical measures, such as standard deviations across multiple seeds or p-values, to establish whether the remaining differences are distinguishable from experimental noise, particularly given the sensitivity of fine-tuning to random initialization.
minor comments (1)
  1. [Abstract] Clarify whether the 1-2% similarity refers to relative improvement or absolute accuracy/F1 scores, and specify the exact metrics used for each task.
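The coverage concern in the first major comment can be made concrete with a small sketch: build a log-spaced learning rate grid, then flag any method whose best score lands on the first or last grid point, since that suggests the true optimum may lie outside the searched range. The grid bounds, resolution, and scores below are hypothetical, and `log_grid` and `optimum_on_edge` are illustrative names rather than anything from the paper.

```python
import math


def log_grid(low, high, points_per_decade):
    """Learning rates spaced evenly in log10 between low and high (inclusive)."""
    decades = math.log10(high) - math.log10(low)
    n = round(decades * points_per_decade) + 1
    if n <= 1:
        return [low]
    step = decades / (n - 1)
    return [10 ** (math.log10(low) + i * step) for i in range(n)]


def optimum_on_edge(scores):
    """True if the best score sits at the first or last grid point."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return best in (0, len(scores) - 1)


grid = log_grid(1e-5, 1e-3, points_per_decade=2)
print([f"{lr:.2e}" for lr in grid])

# Hypothetical scores aligned with the grid above, one list per method.
interior_peak = [60.0, 66.0, 70.0, 68.0, 55.0]
edge_peak = [58.0, 62.0, 66.0, 69.0, 71.0]
print("interior peak flagged:", optimum_on_edge(interior_peak))
print("edge peak flagged:", optimum_on_edge(edge_peak))
```

Reporting the grid bounds, the points per decade, and whether any optimum sat on an edge would give readers the quantitative detail the referee asks for.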

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and have revised the manuscript accordingly to improve clarity and rigor.

Point-by-point responses
  1. Referee: [Hyperparameter Search Methodology] The description of the hyperparameter searches lacks quantitative details on the ranges explored for learning rate per method, grid resolution, batch sizes tested, and the number of trials conducted. Since the central claim of 1-2% performance parity relies on these searches having identified the true optima for each variant, insufficient coverage could lead to underestimation of some methods' peaks and thus an artifactual equivalence.

    Authors: We agree that additional quantitative details on the hyperparameter search procedure are necessary to support the central claims. In the revised manuscript, we will expand the methods section and add an appendix table that specifies the learning rate ranges tested for each LoRA variant, the grid resolutions employed, the batch sizes evaluated, and the total number of trials per method. This will allow readers to evaluate the coverage of our searches. revision: yes

  2. Referee: [Results and Analysis] The reported performance similarities (within 1-2%) should be accompanied by statistical measures, such as standard deviations across multiple seeds or p-values, to establish whether the remaining differences are distinguishable from experimental noise, particularly given the sensitivity of fine-tuning to random initialization.

    Authors: We acknowledge the value of statistical measures for confirming that observed differences fall within experimental variability. Due to the high computational cost of LLM fine-tuning, our main results used single seeds. In the revision we will add standard deviations computed over multiple seeds for a representative subset of configurations across tasks and scales, along with a discussion of variability. We will also note the resource constraints that limited full multi-seed evaluation for every trial. revision: partial
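A minimal sketch of the multi-seed summary discussed in this exchange is given below: per-method mean and sample standard deviation, plus Welch's t statistic for the gap between two methods. The seed scores are invented for illustration, the function names are hypothetical, and nothing here reflects the statistics the authors will actually report.

```python
import math
from statistics import mean, stdev


def summarize(scores):
    """Mean and sample standard deviation across seeds."""
    return mean(scores), stdev(scores)


def welch_t(a, b):
    """Welch's t statistic for two independent samples with unequal variances.

    Assumes at least two seeds per method and non-zero variance overall.
    """
    mean_a, std_a = summarize(a)
    mean_b, std_b = summarize(b)
    std_err = math.sqrt(std_a**2 / len(a) + std_b**2 / len(b))
    return (mean_a - mean_b) / std_err


# Hypothetical accuracies over three seeds; not results from the paper.
vanilla = [70.0, 71.0, 72.0]
variant = [71.0, 72.0, 73.0]

for name, scores in (("vanilla", vanilla), ("variant", variant)):
    m, s = summarize(scores)
    print(f"{name}: {m:.2f} +/- {s:.2f} over {len(scores)} seeds")
print("Welch t statistic:", round(welch_t(vanilla, variant), 4))
```

Turning the statistic into a p-value requires the t distribution with Welch-Satterthwaite degrees of freedom; `scipy.stats.ttest_ind(a, b, equal_var=False)` performs that test if SciPy is available.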

Circularity Check

0 steps flagged

Empirical hyperparameter study shows no circularity

full rationale

The paper performs direct experimental comparisons of LoRA variants under extensive hyperparameter searches over learning rate, batch size, rank, and duration across multiple tasks and model scales. The central claim of similar peak performance (within 1-2%) after tuning is supported by these empirical results rather than any derivation. The second-order Hessian-eigenvalue analysis is described as post-hoc alignment with classical learning theories, not a self-contained derivation or fitted input renamed as prediction. No self-definitional steps, load-bearing self-citations, or ansatz smuggling are present in the abstract or described methodology.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work is almost entirely empirical; it invokes the known sensitivity of neural nets to hyperparameters as background and treats the Hessian analysis as alignment with existing theory rather than introducing new free parameters, axioms, or entities.

free parameters (1)
  • optimal learning rate per method
    Each LoRA variant receives its own tuned learning rate discovered via search; these values are not derived from first principles but selected to maximize observed performance.
axioms (1)
  • [domain assumption] Neural networks are sensitive to training configurations such as learning rate
    Invoked in the opening paragraph to motivate the re-evaluation.

pith-pipeline@v0.9.0 · 5749 in / 1244 out tokens · 68484 ms · 2026-05-21T13:11:25.925121+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CeRA: Overcoming the Linear Ceiling of Low-Rank Adaptation via Capacity Expansion

    cs.LG · 2026-02 · unverdicted · novelty 6.0

    CeRA overcomes LoRA's linear ceiling by injecting non-linear SiLU gating and dropout, outperforming high-rank LoRA on complex math reasoning with 1/8 the parameters.