Learning Rate Matters: Vanilla LoRA May Suffice for LLM Fine-tuning
Pith reviewed 2026-05-21 13:11 UTC · model grok-4.3
The pith
When learning rates are properly tuned, vanilla LoRA achieves performance comparable to more elaborate variants across diverse LLM fine-tuning tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that different LoRA methods exhibit distinct optimal learning rate ranges, but after proper tuning across multiple tasks including mathematical reasoning, commonsense reasoning, code generation, and instruction following, their peak performances converge to within 1-2% of each other. This holds at various model scales, with only minor differences tied to the adaptation rank. A second-order analysis connects these optimal ranges to variations in the largest Hessian eigenvalue.
What carries the argument
The systematic hyperparameter search over learning rate, batch size, rank, and training duration, together with the attribution of learning-rate preferences to differences in the largest Hessian eigenvalue.
Load-bearing premise
That the conducted searches over learning rate, batch size, rank, and training duration are broad enough to reveal each method's true best performance on the chosen tasks and models.
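One inexpensive check on this premise is whether a method's best learning rate lands on the boundary of its grid, which would suggest the true optimum lies outside the searched range. The Python sketch below is illustrative only: `sweep`, `toy_evaluate`, and the `PEAK_LR` values are hypothetical stand-ins for real fine-tuning runs, not the paper's code or results.

```python
import math

def sweep(evaluate, methods, learning_rates):
    """Return {method: (best_lr, best_score, at_grid_edge)} for a 1-D learning-rate sweep."""
    results = {}
    for method in methods:
        scores = [(evaluate(method, lr), lr) for lr in learning_rates]
        best_score, best_lr = max(scores)
        # A best value on the grid boundary hints that the range is too narrow.
        at_edge = best_lr in (min(learning_rates), max(learning_rates))
        results[method] = (best_lr, best_score, at_edge)
    return results

# Hypothetical toy model: each method peaks at a different learning rate
# but reaches the same maximum score. These numbers are made up.
PEAK_LR = {"vanilla": 1e-4, "variant": 1e-3}

def toy_evaluate(method, lr):
    distance = math.log10(lr) - math.log10(PEAK_LR[method])
    return 70.0 - 5.0 * distance ** 2

grid = [10 ** e for e in (-5, -4.5, -4, -3.5, -3, -2.5, -2)]
results = sweep(toy_evaluate, PEAK_LR, grid)
for method, (best_lr, best_score, at_edge) in results.items():
    print(method, best_lr, round(best_score, 2), "edge" if at_edge else "interior")
```

In this toy setting the two methods differ by five points when both are run at a shared learning rate of 1e-4, yet tie once each is tuned separately, which mirrors the kind of artifact the paper attributes to single-configuration comparisons.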
What would settle it
The central result would be undermined if, after wider or differently structured hyperparameter searches, one or more variants still exceeded vanilla LoRA by more than 2% on the same tasks.
original abstract
Low-Rank Adaptation (LoRA) is the prevailing approach for efficient large language model (LLM) fine-tuning. Building on this paradigm, recent studies have proposed alternative initialization strategies, architectural modifications, and optimization adjustments, reporting substantial improvements over vanilla LoRA. However, these gains are often demonstrated under fixed or narrowly tuned hyperparameter settings, despite the known sensitivity of neural networks to training configurations. In this work, we systematically re-evaluate nine representative LoRA variants alongside vanilla LoRA through extensive hyperparameter searches over learning rate, batch size, rank, and training duration. Across tasks spanning mathematical reasoning, commonsense reasoning, code generation, and instruction following at diverse model scales, we find that different LoRA methods favor distinct learning rate ranges. Crucially, once learning rates are properly tuned, all methods achieve similar peak performance (within 1-2%), with only subtle rank-dependent behaviors. These results suggest that vanilla LoRA remains a competitive baseline and that improvements reported under a single training configuration may not reflect consistent methodological advantages. Finally, a second-order analysis attributes the differing optimal learning rate ranges to variations in the largest Hessian eigenvalue, aligning with classical learning theories.
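The classical result that the abstract alludes to can be illustrated on a quadratic loss, where gradient descent is stable only when the learning rate stays below 2/λ_max of the Hessian. The Python sketch below is a textbook toy, not the paper's second-order analysis: the matrices `H_flat` and `H_sharp` are hypothetical, and power iteration is used here as one simple way to estimate the largest eigenvalue.

```python
import math

def matvec(H, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(h * x for h, x in zip(row, v)) for row in H]

def largest_eigenvalue(H, iters=200):
    """Estimate the largest eigenvalue of a symmetric PSD matrix by power iteration."""
    v = [1.0] * len(H)
    lam = 0.0
    for _ in range(iters):
        w = matvec(H, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
        lam = sum(a * b for a, b in zip(v, matvec(H, v)))  # Rayleigh quotient
    return lam

def final_loss(H, lr, steps=100):
    """Run gradient descent on L(theta) = 0.5 * theta^T H theta; return the final loss."""
    theta = [1.0] * len(H)
    for _ in range(steps):
        grad = matvec(H, theta)
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return 0.5 * sum(t * g for t, g in zip(theta, matvec(H, theta)))

# Two hypothetical loss landscapes whose curvature differs by a factor of 10.
H_flat = [[2.0, 0.5], [0.5, 1.0]]
H_sharp = [[20.0, 5.0], [5.0, 10.0]]

for name, H in [("flat", H_flat), ("sharp", H_sharp)]:
    lam_max = largest_eigenvalue(H)
    print(name, "lambda_max:", round(lam_max, 3), "limit 2/lambda_max:", round(2 / lam_max, 4))
    for lr in (0.05, 0.5):
        print("  lr", lr, "final loss", final_loss(H, lr))
```

A learning rate of 0.5 converges on the flatter landscape but diverges on the sharper one, while 0.05 converges on both, so the usable learning-rate range shifts with the largest eigenvalue.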
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript re-evaluates nine LoRA variants against vanilla LoRA for LLM fine-tuning by performing extensive hyperparameter searches over learning rate, batch size, rank, and training duration across tasks in mathematical reasoning, commonsense reasoning, code generation, and instruction following at various model scales. The key finding is that different methods prefer distinct learning rate ranges, but with proper tuning, all achieve similar peak performance within 1-2%, implying that vanilla LoRA may suffice and that prior reported gains could be due to suboptimal configurations. A Hessian eigenvalue analysis is provided to explain the LR preferences.
Significance. Should the empirical equivalence hold after thorough verification, this would be a notable contribution to the LLM fine-tuning literature by cautioning against overclaiming methodological improvements without comprehensive hyperparameter optimization. It promotes vanilla LoRA as a strong baseline and aligns empirical observations with classical optimization theory through second-order analysis. The breadth of tasks and scales adds robustness to the conclusions.
major comments (2)
- [Hyperparameter Search Methodology] The description of the hyperparameter searches lacks quantitative details on the ranges explored for learning rate per method, grid resolution, batch sizes tested, and the number of trials conducted. Since the central claim of 1-2% performance parity relies on these searches having identified the true optima for each variant, insufficient coverage could lead to underestimation of some methods' peaks and thus an artifactual equivalence.
- [Results and Analysis] The reported performance similarities (within 1-2%) should be accompanied by statistical measures such as standard deviations across multiple seeds or p-values to confirm that the differences are not within experimental noise, particularly given the sensitivity of fine-tuning to random initialization.
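The kind of statistical check requested here can be sketched with the Python standard library. The per-seed accuracies below are placeholders invented for illustration, not results from the paper, and `summarize` and `welch_t` are hypothetical helper names.

```python
import math
import statistics

def summarize(scores):
    """Return (mean, sample standard deviation) of per-seed scores."""
    return statistics.mean(scores), statistics.stdev(scores)

def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    mean_a, sd_a = summarize(a)
    mean_b, sd_b = summarize(b)
    standard_error = math.sqrt(sd_a ** 2 / len(a) + sd_b ** 2 / len(b))
    return (mean_a - mean_b) / standard_error

# Placeholder accuracies (percent) over five seeds; NOT data from the paper.
vanilla_lora = [70.1, 71.3, 69.8, 70.9, 70.4]
variant = [71.0, 70.2, 71.6, 70.5, 71.2]

for name, scores in [("vanilla", vanilla_lora), ("variant", variant)]:
    mean, sd = summarize(scores)
    print(f"{name}: mean={mean:.2f} sd={sd:.2f}")
print(f"Welch t = {welch_t(vanilla_lora, variant):.2f}")
```

As a rough heuristic rather than a formal test, an absolute t value well below 2 with this few seeds suggests the gap between the two methods cannot be separated from seed-to-seed variation.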
minor comments (1)
- [Abstract] Clarify whether the 1-2% similarity refers to relative improvement or absolute accuracy/F1 scores, and specify the exact metrics used for each task.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment below and have revised the manuscript accordingly to improve clarity and rigor.
point-by-point responses
- Referee: [Hyperparameter Search Methodology] The description of the hyperparameter searches lacks quantitative details on the ranges explored for learning rate per method, grid resolution, batch sizes tested, and the number of trials conducted. Since the central claim of 1-2% performance parity relies on these searches having identified the true optima for each variant, insufficient coverage could lead to underestimation of some methods' peaks and thus an artifactual equivalence.
  Authors: We agree that additional quantitative details on the hyperparameter search procedure are necessary to support the central claims. In the revised manuscript, we will expand the methods section and add an appendix table that specifies the learning rate ranges tested for each LoRA variant, the grid resolutions employed, the batch sizes evaluated, and the total number of trials per method. This will allow readers to evaluate the coverage of our searches.
  Revision: yes
- Referee: [Results and Analysis] The reported performance similarities (within 1-2%) should be accompanied by statistical measures such as standard deviations across multiple seeds or p-values to confirm that the differences are not within experimental noise, particularly given the sensitivity of fine-tuning to random initialization.
  Authors: We acknowledge the value of statistical measures for confirming that observed differences fall within experimental variability. Due to the high computational cost of LLM fine-tuning, our main results used single seeds. In the revision we will add standard deviations computed over multiple seeds for a representative subset of configurations across tasks and scales, along with a discussion of variability. We will also note the resource constraints that limited full multi-seed evaluation for every trial.
  Revision: partial
Circularity Check
Empirical hyperparameter study shows no circularity
full rationale
The paper performs direct experimental comparisons of LoRA variants under extensive hyperparameter searches over learning rate, batch size, rank, and duration across multiple tasks and model scales. The central claim of similar peak performance (within 1-2%) after tuning is supported by these empirical results rather than any derivation. The second-order Hessian-eigenvalue analysis is described as post-hoc alignment with classical learning theories, not a self-contained derivation or fitted input renamed as prediction. No self-definitional steps, load-bearing self-citations, or ansatz smuggling are present in the abstract or described methodology.
Axiom & Free-Parameter Ledger
free parameters (1)
- optimal learning rate per method
axioms (1)
- domain assumption: Neural networks are sensitive to training configurations such as learning rate
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  "optimal learning rate η* ∝ 1/λ_max(H(θ)) … aligning with classical learning theories"
- IndisputableMonolith/Foundation/BranchSelection.lean, branch_selection (unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  "different LoRA methods favor distinct learning rate ranges … all methods achieve similar peak performance (within 1-2%)"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- CeRA: Overcoming the Linear Ceiling of Low-Rank Adaptation via Capacity Expansion
  CeRA overcomes LoRA's linear ceiling by injecting non-linear SiLU gating and dropout, outperforming high-rank LoRA on complex math reasoning with 1/8 the parameters.