Learning Rate Matters: Vanilla LoRA May Suffice for LLM Fine-tuning

Ching-Yun Ko; Mi-Yen Yeh; Pin-Yu Chen; Yu-Ang Lee

arXiv: 2602.04998 · v2 · pith:MD5N3MWVnew · submitted 2026-02-04 · 💻 cs.LG · cs.AI · cs.CL

Yu-Ang Lee, Ching-Yun Ko, Pin-Yu Chen, Mi-Yen Yeh

Pith reviewed 2026-05-21 13:11 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL

keywords LoRA · LLM fine-tuning · learning rate · hyperparameter tuning · parameter-efficient methods · Hessian eigenvalue · large language models

The pith

When learning rates are properly tuned, vanilla LoRA achieves performance comparable to more elaborate variants across diverse LLM fine-tuning tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper re-examines nine LoRA variants against the basic version on tasks such as mathematical reasoning, commonsense reasoning, code generation, and instruction following. It runs wide searches over learning rate, batch size, rank, and training length at several model sizes. Each variant turns out to prefer its own learning rate range, yet once those rates are chosen carefully the final accuracies sit within 1-2 percent of one another. The work therefore treats vanilla LoRA as a competitive baseline and argues that earlier reported gains may have come from mismatched training settings rather than from the modifications themselves. A second-order check ties the differing optimal rates to changes in the largest Hessian eigenvalue.

Core claim

The paper establishes that different LoRA methods exhibit distinct optimal learning rate ranges, but after proper tuning across multiple tasks including mathematical reasoning, commonsense reasoning, code generation, and instruction following, their peak performances converge to within 1-2% of each other. This holds at various model scales, with only minor differences tied to the adaptation rank. A second-order analysis connects these optimal ranges to variations in the largest Hessian eigenvalue.
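
The link between step size and the largest Hessian eigenvalue can be illustrated with the textbook quadratic case. What follows is a sketch of that classical argument, not the paper's own derivation; the quadratic loss and the symbols $\theta^\star$ and $e_t$ are assumptions introduced here for illustration.

```latex
% Sketch: gradient descent on a quadratic loss with Hessian H (assumed positive definite).
L(\theta) = \tfrac{1}{2}\,(\theta - \theta^\star)^\top H\,(\theta - \theta^\star)

% One gradient step with learning rate \eta, written for the error e_t = \theta_t - \theta^\star:
\theta_{t+1} = \theta_t - \eta\,\nabla L(\theta_t)
\quad\Longrightarrow\quad
e_{t+1} = (I - \eta H)\,e_t

% Along the eigenvector of H with eigenvalue \lambda_i, the error is scaled by (1 - \eta\lambda_i) per step:
e^{(i)}_{t+1} = (1 - \eta\lambda_i)\,e^{(i)}_t

% The iteration contracts in every direction iff |1 - \eta\lambda_i| < 1 for all i, i.e.
0 < \eta < \frac{2}{\lambda_{\max}(H)}
```

Under this simplification the largest stable learning rate scales as $1/\lambda_{\max}(H)$, so a method that raises the top eigenvalue would be expected to prefer a smaller step.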

What carries the argument

The systematic hyperparameter search over learning rate, batch size, rank, and training duration, together with the attribution of learning-rate preferences to differences in the largest Hessian eigenvalue.
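
One common way to estimate the largest Hessian eigenvalue without forming the Hessian is power iteration on Hessian-vector products. The block below is a minimal sketch of that general technique on a toy two-parameter loss; it is not the paper's code, and `hessian_vector_product`, `largest_hessian_eigenvalue`, and `toy_grad` are hypothetical names chosen for illustration.

```python
import math
import random


def hessian_vector_product(grad_fn, theta, v, eps=1e-4):
    """Approximate H(theta) @ v by a central finite difference of the gradient."""
    plus = grad_fn([t + eps * x for t, x in zip(theta, v)])
    minus = grad_fn([t - eps * x for t, x in zip(theta, v)])
    return [(p - m) / (2 * eps) for p, m in zip(plus, minus)]


def largest_hessian_eigenvalue(grad_fn, theta, iters=200, seed=0):
    """Power iteration: returns the eigenvalue of largest magnitude.

    For a positive semi-definite Hessian this equals lambda_max.
    """
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in theta]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    eig = 0.0
    for _ in range(iters):
        hv = hessian_vector_product(grad_fn, theta, v)
        eig = sum(a * b for a, b in zip(v, hv))  # Rayleigh quotient v^T H v
        norm = math.sqrt(sum(x * x for x in hv))
        if norm == 0.0:
            break
        v = [x / norm for x in hv]
    return eig


def toy_grad(theta):
    """Gradient of the toy loss 0.5 * (4*x^2 + y^2); its Hessian is diag(4, 1)."""
    return [4.0 * theta[0], 1.0 * theta[1]]


lam_max = largest_hessian_eigenvalue(toy_grad, [0.3, -0.7])
```

For a real model the gradient function would come from automatic differentiation rather than a closed form, and the finite-difference product would normally be replaced by an exact Hessian-vector product.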

Load-bearing premise

That the conducted searches over learning rate, batch size, rank, and training duration are broad enough to reveal each method's true best performance on the chosen tasks and models.
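
A simple sanity check on search breadth is whether each method's best learning rate sits at the edge of its grid, since an edge optimum suggests the true peak may lie outside the searched range. The sketch below assumes sweep results stored as a mapping from learning rate to score; the method names and numbers are invented for illustration and are not results from the paper.

```python
def peak_on_boundary(scores_by_lr):
    """Return (best_lr, on_boundary) for one method's learning-rate sweep."""
    lrs = sorted(scores_by_lr)
    best_lr = max(lrs, key=lambda lr: scores_by_lr[lr])
    return best_lr, best_lr in (lrs[0], lrs[-1])


# Hypothetical sweep results (accuracy in percent), invented for illustration only.
sweeps = {
    "method_a": {1e-5: 61.0, 1e-4: 68.5, 1e-3: 66.0},
    "method_b": {1e-5: 60.2, 1e-4: 65.1, 1e-3: 68.9},
}

boundary_report = {name: peak_on_boundary(s) for name, s in sweeps.items()}
```

In this made-up example `method_b` peaks at the largest learning rate tried, so its grid would need extending before its reported peak could be trusted.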

What would settle it

Finding that even after wider or differently structured hyperparameter searches one or more variants still exceed vanilla LoRA by more than 2 percent on the same tasks would undermine the central result.
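
The falsification criterion above can be written as a one-line check over tuned peak scores. This is a sketch under two assumptions that the page does not confirm: that the 2 percent margin is measured in absolute percentage points, and that each method's peak has already been taken over its own search. The names and scores are hypothetical.

```python
def exceeds_baseline(peaks, baseline="vanilla", margin=2.0):
    """List methods whose tuned peak beats the baseline by more than `margin` points."""
    base = peaks[baseline]
    return sorted(
        name for name, score in peaks.items()
        if name != baseline and score - base > margin
    )


# Hypothetical tuned peaks (accuracy in percent), invented for illustration only.
peaks = {"vanilla": 68.5, "method_a": 69.3, "method_b": 71.0}
flagged = exceeds_baseline(peaks)
```

An empty list would be consistent with the central result on that task; any flagged method would count against it.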

Original abstract

Low-Rank Adaptation (LoRA) is the prevailing approach for efficient large language model (LLM) fine-tuning. Building on this paradigm, recent studies have proposed alternative initialization strategies, architectural modifications, and optimization adjustments, reporting substantial improvements over vanilla LoRA. However, these gains are often demonstrated under fixed or narrowly tuned hyperparameter settings, despite the known sensitivity of neural networks to training configurations. In this work, we systematically re-evaluate nine representative LoRA variants alongside vanilla LoRA through extensive hyperparameter searches over learning rate, batch size, rank, and training duration. Across tasks spanning mathematical reasoning, commonsense reasoning, code generation, and instruction following at diverse model scales, we find that different LoRA methods favor distinct learning rate ranges. Crucially, once learning rates are properly tuned, all methods achieve similar peak performance (within 1-2%), with only subtle rank-dependent behaviors. These results suggest that vanilla LoRA remains a competitive baseline and that improvements reported under a single training configuration may not reflect consistent methodological advantages. Finally, a second-order analysis attributes the differing optimal learning rate ranges to variations in the largest Hessian eigenvalue, aligning with classical learning theories.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Tuning learning rates per method makes most LoRA variants reach similar peaks, so many claimed gains look like tuning artifacts rather than real advances.

The key point here is that vanilla LoRA matches the performance of nine other variants once each gets its own learning-rate search. The paper ran those searches across math, commonsense, code, and instruction tasks at multiple scales, and the peaks land within 1-2% of each other. That is the main empirical result worth noting. They also show that the variants prefer different LR ranges and tie the difference to the largest Hessian eigenvalue, which lines up with older theory on step-size limits.

The work is useful because it directly tests the common practice of reporting improvements under a single fixed schedule. The multi-task, multi-scale setup gives the comparison some breadth. The Hessian analysis is post-hoc but at least connects the observation to existing learning-rate theory instead of leaving it as a pure black-box finding.

The soft spot is the lack of detail on how fine-grained the LR grids actually were and whether the tested ranges were wide enough for every variant. If an optimum sat outside the searched interval or between coarse points, the reported peak would be an underestimate and the apparent equivalence could shrink or disappear. The abstract does not give grid resolution or per-method boundaries, so that part stays only partially verifiable.

This paper is aimed at practitioners who fine-tune LLMs and at researchers who propose new adaptation methods. Anyone who has to pick a baseline or evaluate a new variant will find the reminder about hyperparameter fairness directly useful. It is worth sending to peer review; the question is practical and the experimental scope is reasonable, even if the search details need tightening in revision.

Referee Report

2 major / 1 minor

Summary. The manuscript re-evaluates nine LoRA variants against vanilla LoRA for LLM fine-tuning by performing extensive hyperparameter searches over learning rate, batch size, rank, and training duration across tasks in mathematical reasoning, commonsense reasoning, code generation, and instruction following at various model scales. The key finding is that different methods prefer distinct learning rate ranges, but with proper tuning, all achieve similar peak performance within 1-2%, implying that vanilla LoRA may suffice and that prior reported gains could be due to suboptimal configurations. A Hessian eigenvalue analysis is provided to explain the LR preferences.

Significance. Should the empirical equivalence hold after thorough verification, this would be a notable contribution to the LLM fine-tuning literature by cautioning against overclaiming methodological improvements without comprehensive hyperparameter optimization. It promotes vanilla LoRA as a strong baseline and aligns empirical observations with classical optimization theory through second-order analysis. The breadth of tasks and scales adds robustness to the conclusions.

major comments (2)

[Hyperparameter Search Methodology] The description of the hyperparameter searches lacks quantitative details on the ranges explored for learning rate per method, grid resolution, batch sizes tested, and the number of trials conducted. Since the central claim of 1-2% performance parity relies on these searches having identified the true optima for each variant, insufficient coverage could lead to underestimation of some methods' peaks and thus an artifactual equivalence.
[Results and Analysis] The reported performance similarities (within 1-2%) should be accompanied by statistical measures such as standard deviations across multiple seeds or p-values to confirm that the differences are not within experimental noise, particularly given the sensitivity of fine-tuning to random initialization.
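
The second major comment can be made concrete with a small multi-seed summary. The sketch below uses a deliberately crude rule (gap in means no larger than the larger per-method standard deviation) as a stand-in for a proper significance test; the function names and seed scores are hypothetical and do not come from the paper.

```python
import statistics


def summarize(scores):
    """Mean and sample standard deviation over seeds."""
    return statistics.mean(scores), statistics.stdev(scores)


def gap_within_noise(scores_a, scores_b):
    """Crude heuristic: is the gap in means no larger than the larger seed std?"""
    mean_a, std_a = summarize(scores_a)
    mean_b, std_b = summarize(scores_b)
    return abs(mean_a - mean_b) <= max(std_a, std_b)


# Hypothetical per-seed accuracies (percent), invented for illustration only.
seeds_a = [68.1, 68.9, 68.5]
seeds_b = [69.0, 69.8, 69.4]
seeds_c = [68.0, 70.0, 69.0]

a_vs_b = gap_within_noise(seeds_a, seeds_b)  # tight seeds, gap stands out
a_vs_c = gap_within_noise(seeds_a, seeds_c)  # noisy seeds, gap is swallowed
```

With only a handful of seeds this heuristic is weak evidence either way, which is why the comment asks for reported standard deviations or p-values.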

minor comments (1)

[Abstract] Clarify whether the 1-2% similarity refers to relative improvement or absolute accuracy/F1 scores, and specify the exact metrics used for each task.
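
The distinction raised in this comment matters numerically: the same absolute gap corresponds to different relative gaps depending on the score level. A minimal arithmetic sketch, with the choice of the lower score as the reference for the relative gap being an assumption made here:

```python
def absolute_gap(score_a, score_b):
    """Gap in percentage points."""
    return abs(score_a - score_b)


def relative_gap(score_a, score_b):
    """Gap as a percentage of the lower score (reference choice is an assumption)."""
    return abs(score_a - score_b) / min(score_a, score_b) * 100.0


# Illustrative scores only: a 1-point absolute gap at two different accuracy levels.
low_abs, low_rel = absolute_gap(40.0, 41.0), relative_gap(40.0, 41.0)
high_abs, high_rel = absolute_gap(80.0, 81.0), relative_gap(80.0, 81.0)
```

Here a 1-point gap is 2.5% relative at an accuracy of 40 but 1.25% relative at 80, so "within 1-2%" reads differently under the two conventions.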

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and have revised the manuscript accordingly to improve clarity and rigor.

Referee: [Hyperparameter Search Methodology] The description of the hyperparameter searches lacks quantitative details on the ranges explored for learning rate per method, grid resolution, batch sizes tested, and the number of trials conducted. Since the central claim of 1-2% performance parity relies on these searches having identified the true optima for each variant, insufficient coverage could lead to underestimation of some methods' peaks and thus an artifactual equivalence.

Authors: We agree that additional quantitative details on the hyperparameter search procedure are necessary to support the central claims. In the revised manuscript, we will expand the methods section and add an appendix table that specifies the learning rate ranges tested for each LoRA variant, the grid resolutions employed, the batch sizes evaluated, and the total number of trials per method. This will allow readers to evaluate the coverage of our searches.
revision: yes

Referee: [Results and Analysis] The reported performance similarities (within 1-2%) should be accompanied by statistical measures such as standard deviations across multiple seeds or p-values to confirm that the differences are not within experimental noise, particularly given the sensitivity of fine-tuning to random initialization.

Authors: We acknowledge the value of statistical measures for confirming that observed differences fall within experimental variability. Due to the high computational cost of LLM fine-tuning, our main results used single seeds. In the revision we will add standard deviations computed over multiple seeds for a representative subset of configurations across tasks and scales, along with a discussion of variability. We will also note the resource constraints that limited full multi-seed evaluation for every trial.
revision: partial

Circularity Check

0 steps flagged

Empirical hyperparameter study shows no circularity

The paper performs direct experimental comparisons of LoRA variants under extensive hyperparameter searches over learning rate, batch size, rank, and duration across multiple tasks and model scales. The central claim of similar peak performance (within 1-2%) after tuning is supported by these empirical results rather than any derivation. The second-order Hessian-eigenvalue analysis is described as post-hoc alignment with classical learning theories, not a self-contained derivation or fitted input renamed as prediction. No self-definitional steps, load-bearing self-citations, or ansatz smuggling are present in the abstract or described methodology.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work is almost entirely empirical; it invokes the known sensitivity of neural nets to hyperparameters as background and treats the Hessian analysis as alignment with existing theory rather than introducing new free parameters, axioms, or entities.

free parameters (1)

optimal learning rate per method
Each LoRA variant receives its own tuned learning rate discovered via search; these values are not derived from first principles but selected to maximize observed performance.

axioms (1)

domain assumption: Neural networks are sensitive to training configurations such as learning rate
Invoked in the opening paragraph to motivate the re-evaluation.

pith-pipeline@v0.9.0 · 5749 in / 1244 out tokens · 68484 ms · 2026-05-21T13:11:25.925121+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "optimal learning rate η* ∝ 1/λ_max(H(θ)) … aligning with classical learning theories"

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "different LoRA methods favor distinct learning rate ranges … all methods achieve similar peak performance (within 1-2%)"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CeRA: Overcoming the Linear Ceiling of Low-Rank Adaptation via Capacity Expansion
cs.LG 2026-02 unverdicted novelty 6.0

CeRA overcomes LoRA's linear ceiling by injecting non-linear SiLU gating and dropout, outperforming high-rank LoRA on complex math reasoning with 1/8 the parameters.