Bi-LoRA: Efficient Sharpness-Aware Minimization for Fine-Tuning Large-Scale Models

Tao Li; Xiaolin Huang; Yuhang Liu; Zhehao Huang; Zuopeng Yang

arxiv: 2508.19564 · v2 · submitted 2025-08-27 · 💻 cs.LG · cs.AI

Yuhang Liu, Tao Li, Zhehao Huang, Zuopeng Yang, Xiaolin Huang

Pith reviewed 2026-05-18 21:23 UTC · model grok-4.3

keywords Bi-LoRA · Sharpness-Aware Minimization · Low-Rank Adaptation · parameter-efficient fine-tuning · flat minima · generalization · large-scale models · adversarial perturbation

The pith

Bi-LoRA adds an auxiliary LoRA module to model SAM perturbations separately from task adaptation, enabling flat minima in large-model fine-tuning without doubled memory or compute.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that applying SAM directly to LoRA parameters confines sharpness optimization to a narrow low-rank subspace and therefore fails to reach sufficiently flat minima. Bi-LoRA solves this by maintaining a primary LoRA module that performs ordinary gradient descent for task adaptation while an auxiliary LoRA module performs gradient ascent to capture loss-landscape sharpness. Because the two modules operate in parallel, the method keeps memory close to standard LoRA levels and completes perturbation and update in a single pass rather than doubling training steps. If correct, the approach would let practitioners apply SAM-style regularization to billion-parameter models on modest hardware and data budgets. Experiments across tasks and architectures are presented as evidence that the resulting minima generalize better than both plain LoRA and direct SAM-LoRA combinations.

Core claim

Bi-LoRA introduces an auxiliary low-rank adaptation module that explicitly models the adversarial weight perturbations required by Sharpness-Aware Minimization. The primary LoRA module continues to adapt to the downstream task through standard gradient descent, while the auxiliary module identifies directions of steepest loss increase through gradient ascent. This separation removes the subspace restriction that occurs when SAM is applied only to existing LoRA weights and simultaneously allows perturbation and optimization to occur in one forward-backward pass.
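The single-pass mechanism described above can be illustrated with a toy sketch. It assumes an additive composition W0 + B1·A1 + B2·A2, a squared-error loss on one sample, plain SGD steps, and a rescaling that keeps the auxiliary product within a Frobenius-norm radius rho. None of these details are confirmed by the text here, and every function name and hyperparameter is hypothetical.

```python
import random

def matmul(P, Q):
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]

def transpose(P):
    return [list(col) for col in zip(*P)]

def add(P, Q, scale=1.0):
    # Element-wise P + scale * Q.
    return [[p + scale * q for p, q in zip(rp, rq)] for rp, rq in zip(P, Q)]

def fro_norm(P):
    return sum(v * v for row in P for v in row) ** 0.5

def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def loss_and_grad(W, x, y):
    # Toy loss 0.5 * ||W x - y||^2 and its gradient with respect to W.
    residual = [sum(w * xi for w, xi in zip(row, x)) - yi for row, yi in zip(W, y)]
    loss = 0.5 * sum(r * r for r in residual)
    grad = [[r * xi for xi in x] for r in residual]
    return loss, grad

def bi_lora_step(W0, B1, A1, B2, A2, x, y, lr=0.05, lr_aux=0.05, rho=0.05):
    # One forward-backward pass at the perturbed weights (assumed additive form).
    W = add(add(W0, matmul(B1, A1)), matmul(B2, A2))
    loss, G = loss_and_grad(W, x, y)
    # Chain rule for a product B A: dL/dB = G A^T, dL/dA = B^T G.
    gB1, gA1 = matmul(G, transpose(A1)), matmul(transpose(B1), G)
    gB2, gA2 = matmul(G, transpose(A2)), matmul(transpose(B2), G)
    # Primary module: gradient descent on the task loss.
    B1, A1 = add(B1, gB1, -lr), add(A1, gA1, -lr)
    # Auxiliary module: gradient ascent on the same loss, same backward pass.
    B2, A2 = add(B2, gB2, lr_aux), add(A2, gA2, lr_aux)
    # Assumption: rescale both factors so that ||B2 A2||_F stays within rho.
    norm = fro_norm(matmul(B2, A2))
    if norm > rho:
        s = (rho / norm) ** 0.5
        B2 = [[s * v for v in row] for row in B2]
        A2 = [[s * v for v in row] for row in A2]
    return loss, B1, A1, B2, A2
```

In this sketch the perturbation used at a given step was produced by the previous step's backward pass, which is one way to read "perturbation and optimization in one forward-backward pass"; the paper's exact schedule may differ.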

What carries the argument

The auxiliary LoRA module, which performs gradient ascent to represent SAM-style adversarial perturbations independently of the primary task-adapting LoRA module.

If this is right

Fine-tuning of large models can incorporate SAM-style flat-minima regularization while using memory comparable to standard LoRA.
Training time remains close to single-pass optimization instead of the doubled cost of conventional SAM.
Generalization improves on downstream tasks with limited data across multiple model architectures.
The method extends the applicability of sharpness-aware optimization beyond the scale where full-model SAM is feasible.
Broader sharpness is captured because perturbations are no longer confined to the primary LoRA subspace.
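The memory and compute consequences listed above come down to simple counting. The following rough sketch assumes a single hypothetical 4096 x 4096 weight matrix and rank 8 for both modules (neither figure comes from the paper), and counts only trainable parameters and forward-backward passes per step. The Bi-LoRA pass count reflects the paper's claim, not a measurement.

```python
def lora_params(d_out, d_in, rank):
    # A rank-r adapter stores B (d_out x r) and A (r x d_in).
    return rank * (d_out + d_in)

def summarize(d_out=4096, d_in=4096, rank=8, aux_rank=8):
    # All sizes are illustrative assumptions, not values from the paper.
    full = d_out * d_in
    primary = lora_params(d_out, d_in, rank)
    auxiliary = lora_params(d_out, d_in, aux_rank)
    return {
        "full_matrix_params": full,
        "lora_trainable": primary,
        "bi_lora_trainable": primary + auxiliary,
        "bi_lora_fraction_of_full": (primary + auxiliary) / full,
        # Conventional SAM needs two forward-backward passes per step;
        # the single pass for Bi-LoRA is the paper's claim.
        "passes_per_step": {"LoRA": 1, "SAM": 2, "Bi-LoRA": 1},
    }

if __name__ == "__main__":
    for key, value in summarize().items():
        print(key, value)
```

Under these assumed sizes the two adapters together hold under one percent of the parameters of the full matrix, which is the sense in which memory stays "comparable to standard LoRA".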

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same primary-auxiliary decoupling could be tested with other parameter-efficient methods such as adapter layers or prefix tuning.
In extremely low-data regimes the auxiliary module might serve as an additional regularizer that reduces overfitting more reliably than single-module methods.
Sharing a subset of weights between primary and auxiliary modules could further reduce parameter overhead while preserving the separation of concerns.
The approach suggests a general pattern for embedding perturbation-based regularizers into any low-rank update scheme without recomputing full-model gradients.

Load-bearing premise

An auxiliary low-rank module can faithfully represent the adversarial perturbations needed for SAM without introducing interference or subspace mismatch that would reduce effective sharpness.

What would settle it

A direct comparison on the same large model and dataset where Bi-LoRA produces minima whose sharpness (measured by the maximum loss increase within a neighborhood) is no lower than that achieved by full-parameter SAM or whose downstream generalization is no better than plain LoRA fine-tuning.
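The sharpness measure named in this test, the maximum loss increase within a neighborhood, can be approximated numerically. The sketch below is one possible estimator and not the paper's evaluation protocol: it samples random directions on a sphere of radius rho and keeps the worst increase, which gives a lower bound on the true maximum. The function names and the toy quadratic losses are hypothetical.

```python
import math
import random

def sharpness_estimate(loss_fn, w, rho, n_samples=256, seed=0):
    # Lower-bound estimate of max over ||e|| <= rho of loss(w + e) - loss(w),
    # using random directions scaled to length rho.
    rng = random.Random(seed)
    base = loss_fn(w)
    worst = 0.0
    for _ in range(n_samples):
        direction = [rng.gauss(0.0, 1.0) for _ in w]
        norm = math.sqrt(sum(v * v for v in direction))
        candidate = [wi + rho * v / norm for wi, v in zip(w, direction)]
        worst = max(worst, loss_fn(candidate) - base)
    return worst

def quadratic(curvatures):
    # Toy loss 0.5 * sum_i c_i * w_i^2 with its minimum at the origin.
    def loss(w):
        return 0.5 * sum(c * wi * wi for c, wi in zip(curvatures, w))
    return loss

# Two minima at the same point: one with a high-curvature direction, one without.
sharp = sharpness_estimate(quadratic([100.0, 1.0]), [0.0, 0.0], rho=0.1)
flat = sharpness_estimate(quadratic([1.0, 1.0]), [0.0, 0.0], rho=0.1)
```

Random sampling becomes a weak estimator in high dimensions, so a real comparison would more likely use a gradient-based inner maximization. The point here is only to make the quantity being compared concrete.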

Original abstract

Fine-tuning large-scale pre-trained models with limited data presents significant challenges for generalization. While Sharpness-Aware Minimization (SAM) has proven effective in improving generalization by seeking flat minima, its substantial extra memory and computation overhead make it impractical for large models. Integrating SAM with parameter-efficient fine-tuning methods like Low-Rank Adaptation (LoRA) is a promising direction. However, we find that directly applying SAM to LoRA parameters limits the sharpness optimization to a restricted subspace, hindering its effectiveness. To address this limitation, we propose Bi-directional Low-Rank Adaptation (Bi-LoRA), which introduces an auxiliary LoRA module to model SAM's adversarial weight perturbations. It decouples SAM's weight perturbations from LoRA optimization: the primary LoRA module adapts to specific tasks via standard gradient descent, while the auxiliary module captures the sharpness of the loss landscape through gradient ascent. Such dual-module design enables Bi-LoRA to capture broader sharpness for achieving flatter minima while remaining memory-efficient. Another important benefit is that the dual design allows for simultaneous optimization and perturbation, eliminating SAM's doubled training costs. Extensive experiments across diverse tasks and architectures demonstrate Bi-LoRA's efficiency and effectiveness in enhancing generalization.
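For reference, the standard SAM formulation that the abstract builds on can be written out as below. The first three lines are the usual min-max objective, its first-order inner solution, and the resulting two-pass update. The last line is only an assumed notation for the dual-module design (frozen W_0, primary factors B_1, A_1, auxiliary factors B_2, A_2), since the abstract does not state how the modules are composed.

```latex
% Standard SAM objective for generic weights w and radius \rho
\[ \min_{w} \; \max_{\|\epsilon\|_2 \le \rho} \; L(w + \epsilon) \]

% First-order solution of the inner maximization
\[ \hat{\epsilon}(w) = \rho \, \frac{\nabla_w L(w)}{\|\nabla_w L(w)\|_2} \]

% SAM update: the gradient is taken at the perturbed point,
% which costs a second forward-backward pass per step
\[ w \leftarrow w - \eta \, \nabla_w L(w) \Big|_{w + \hat{\epsilon}(w)} \]

% Assumed notation for the dual-module design (not stated in the abstract)
\[ W = W_0 + B_1 A_1 + B_2 A_2 \]
```

The second forward-backward pass in the update is the "doubled training cost" the abstract refers to.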

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Bi-LoRA splits the adapter into primary and auxiliary low-rank modules to run SAM without doubling cost, but the auxiliary module's subspace may still limit how much real sharpness it captures.

The core move is straightforward: keep one LoRA for normal task gradients and add a second low-rank module that does the ascent step for SAM perturbations. This decouples the two so you avoid both the memory spike and the extra forward-backward pass that normally comes with SAM on large models. The abstract makes a clear case that applying SAM directly to a single LoRA keeps the perturbation inside an already tiny subspace, which is the right problem to name. Experiments across tasks and architectures are presented as evidence that the dual setup improves generalization while staying efficient, which is the practical payoff they are after. That part looks like a useful engineering step on top of existing LoRA and SAM work.

The soft spot is exactly the one the stress-test flags. Because the auxiliary module is still low-rank, its perturbations live in a restricted column space. Nothing in the description shows that this space aligns with the directions that actually control curvature in the full loss landscape. If it does not, the effective neighborhood explored for sharpness minimization stays smaller than full-parameter SAM, and the flatter-minima claim weakens. The paper would need to demonstrate that the gains survive this restriction, either through ablations on rank or direct comparison of the loss surfaces.

This is aimed at people who fine-tune large models and want SAM-style regularization without the usual overhead. It is a concrete design with a real use case, so it deserves a serious referee even if the subspace question needs more scrutiny in review.

Referee Report

2 major / 3 minor

Summary. The paper proposes Bi-LoRA, a bi-directional low-rank adaptation method that augments standard LoRA with an auxiliary low-rank module. The primary module adapts to tasks via gradient descent while the auxiliary module models SAM-style adversarial perturbations via gradient ascent, with the goal of achieving flatter minima, broader sharpness capture, memory efficiency, and elimination of SAM's doubled forward/backward passes during fine-tuning of large models.

Significance. If the auxiliary module successfully approximates the necessary perturbations without substantial subspace mismatch, the approach could enable practical sharpness-aware fine-tuning at scale, addressing a key barrier to applying SAM to large models. The simultaneous optimization enabled by the dual-module design is a clear efficiency contribution.

major comments (2)

[Method description (around the Bi-LoRA formulation)] The central claim that Bi-LoRA captures 'broader sharpness' for flatter minima (abstract and method description) rests on the auxiliary LoRA faithfully representing the rho-scaled gradient-ascent perturbation. Because the auxiliary module is itself rank-r (r << d), its perturbations are confined to a low-dimensional column space; no analysis, bound, or alignment argument is given showing that this space overlaps sufficiently with the dominant curvature directions of the full-parameter loss, which directly undermines the 'broader' and 'flatter' assertions relative to full-model SAM.
[Experimental section (results tables)] Experiments demonstrate gains over LoRA and direct SAM+LoRA baselines, yet no ablation isolates the contribution of the auxiliary perturbation from other design choices (e.g., extra parameters or optimization schedule). A controlled comparison on a smaller model where full SAM is feasible would be needed to quantify how much effective sharpness is actually recovered.
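The subspace concern in the first major comment can be made concrete with a small measurement: given a fixed factor B, how much of a full-parameter gradient G lies in the column space of B? The sketch below is a hypothetical diagnostic, not something the paper reports. It orthonormalizes the columns of B with Gram-Schmidt and returns the share of the squared Frobenius norm of G that the projection retains.

```python
def gram_schmidt(columns):
    # Orthonormal basis for the span of the given column vectors.
    basis = []
    for col in columns:
        v = list(col)
        for q in basis:
            dot = sum(a * b for a, b in zip(v, q))
            v = [a - dot * b for a, b in zip(v, q)]
        norm = sum(a * a for a in v) ** 0.5
        if norm > 1e-12:
            basis.append([a / norm for a in v])
    return basis

def captured_fraction(B, G):
    # Share of ||G||_F^2 that lies in the column space of B, i.e.
    # ||Q Q^T G||_F^2 / ||G||_F^2 for an orthonormal basis Q of col(B).
    basis = gram_schmidt(list(zip(*B)))
    total = sum(g * g for row in G for g in row)
    captured = 0.0
    for g_col in zip(*G):
        for q in basis:
            captured += sum(a * b for a, b in zip(q, g_col)) ** 2
    return captured / total

# Illustrative case: d = 8 rows, rank r = 2, and a gradient spread evenly
# over all rows. B spans only the first two coordinate directions.
d, k, r = 8, 4, 2
B = [[1.0 if i == j else 0.0 for j in range(r)] for i in range(d)]
G = [[1.0] * k for _ in range(d)]
fraction = captured_fraction(B, G)
```

In this evenly spread case the retained share is r/d, so a small rank captures little of the gradient unless the subspace is well aligned. Because the auxiliary factors are trained, their column space can move during optimization, so a snapshot like this would need to be tracked over training to say anything about the referee's question.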

minor comments (3)

Clarify the precise simultaneous update equations for the two modules and whether any gradient stopping or scaling is applied between them to prevent interference.
Provide a memory and FLOPs breakdown (or wall-clock timing) that quantifies the claimed elimination of SAM's doubled training costs.
Add a reference to prior work on low-rank approximations of adversarial perturbations or curvature estimation to situate the auxiliary-module design.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments. We address each major comment point by point below, indicating planned revisions where appropriate.

Referee: The central claim that Bi-LoRA captures 'broader sharpness' for flatter minima (abstract and method description) rests on the auxiliary LoRA faithfully representing the rho-scaled gradient-ascent perturbation. Because the auxiliary module is itself rank-r (r << d), its perturbations are confined to a low-dimensional column space; no analysis, bound, or alignment argument is given showing that this space overlaps sufficiently with the dominant curvature directions of the full-parameter loss, which directly undermines the 'broader' and 'flatter' assertions relative to full-model SAM.

Authors: We acknowledge that the original manuscript provides no formal bound or alignment analysis between the auxiliary LoRA subspace and the dominant curvature directions of the full-parameter loss. The Bi-LoRA design intentionally decouples perturbation modeling into an independent low-rank module so that adversarial directions are not restricted to the primary task-adaptation subspace used by direct SAM+LoRA. Empirical improvements over the SAM+LoRA baseline support the view that this separation captures sharpness more broadly in practice. In the revision we will add a dedicated discussion of subspace complementarity together with visualizations of perturbation directions to clarify the claim.
Revision: partial

Referee: Experiments demonstrate gains over LoRA and direct SAM+LoRA baselines, yet no ablation isolates the contribution of the auxiliary perturbation from other design choices (e.g., extra parameters or optimization schedule). A controlled comparison on a smaller model where full SAM is feasible would be needed to quantify how much effective sharpness is actually recovered.

Authors: We agree that isolating the auxiliary module's contribution and providing a direct comparison against full SAM on smaller models would strengthen the experimental section. The current tables already include comparisons against both LoRA and SAM applied to LoRA parameters, but additional controlled ablations and smaller-model experiments are feasible and will be included in the revised manuscript to quantify the recovered sharpness.
Revision: yes

Circularity Check

0 steps flagged

No circularity; Bi-LoRA is an architectural design validated by experiments

The paper proposes Bi-LoRA as a dual LoRA module structure that decouples task adaptation (primary module via gradient descent) from sharpness capture (auxiliary module via gradient ascent). This is presented as a direct engineering solution to the subspace restriction observed when applying SAM directly to LoRA parameters. No equations, derivations, or first-principles results are shown that reduce by construction to fitted inputs, self-definitions, or self-citation chains. Claims of broader sharpness, memory efficiency, and eliminated doubled costs rest on the structural choice plus empirical results across tasks and architectures, making the contribution self-contained without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The approach rests on the domain assumption that SAM's flat-minima benefit transfers when perturbations are modeled in a separate low-rank subspace, plus the ad-hoc design choice of an auxiliary module whose effectiveness is not independently justified in the provided abstract.

axioms (2)

[domain assumption] SAM improves generalization by seeking flat minima in the loss landscape.
Standard premise in sharpness-aware optimization literature invoked to motivate the method.
[ad hoc to paper] An auxiliary low-rank module can independently capture adversarial perturbations without subspace mismatch.
Core design assumption introduced to justify the dual-module architecture.

invented entities (1)

auxiliary LoRA module [no independent evidence]
purpose: To model SAM's adversarial weight perturbations separately from task adaptation
New component proposed in the paper to enable decoupling; no independent evidence provided in abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel

Relation between the paper passage and the cited Recognition theorem: unclear

Bi-LoRA decouples SAM's weight perturbations from LoRA optimization: the primary LoRA module adapts to specific tasks via standard gradient descent, while the auxiliary module captures the sharpness of the loss landscape through gradient ascent

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability

Relation between the paper passage and the cited Recognition theorem: unclear

Proposition 3.1 (Perturbation Space of LoRA-SAM). The effective weight perturbation in LoRA-SAM can be decomposed into two terms: $BB^\top(\nabla_W L)$ and $(\nabla_W L)A^\top A$
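The two terms in the quoted proposition follow from perturbing both LoRA factors along their own gradients. The numerical check below is a sketch under stated assumptions: an unscaled LoRA product B·A, a generic step size c standing in for SAM's normalized radius, and random matrices in place of a real gradient G. It verifies that the exact perturbation of the product equals c times the sum of the two terms, plus a second-order remainder. Function names are hypothetical.

```python
import random

def matmul(P, Q):
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]

def transpose(P):
    return [list(col) for col in zip(*P)]

def combine(P, Q, alpha=1.0, beta=1.0):
    # Element-wise alpha * P + beta * Q.
    return [[alpha * p + beta * q for p, q in zip(rp, rq)] for rp, rq in zip(P, Q)]

def max_abs(P):
    return max(abs(v) for row in P for v in row)

def lora_sam_terms(B, A, G, c):
    # Gradients of the loss with respect to the factors, given G = dL/dW.
    grad_B = matmul(G, transpose(A))   # G A^T
    grad_A = matmul(transpose(B), G)   # B^T G
    # Exact change in the product when both factors take an ascent step of size c.
    B_pert = combine(B, grad_B, 1.0, c)
    A_pert = combine(A, grad_A, 1.0, c)
    exact = combine(matmul(B_pert, A_pert), matmul(B, A), 1.0, -1.0)
    # First-order part: c * (B B^T G + G A^T A), the two terms of the proposition.
    term_B = matmul(matmul(B, transpose(B)), G)
    term_A = matmul(G, matmul(transpose(A), A))
    first_order = combine(term_B, term_A, c, c)
    # Second-order remainder: c^2 * G A^T B^T G.
    remainder = combine(matmul(grad_B, grad_A), matmul(grad_B, grad_A), c * c, 0.0)
    return exact, first_order, remainder

rng = random.Random(0)
d, r, k = 4, 2, 3
B = [[rng.gauss(0.0, 1.0) for _ in range(r)] for _ in range(d)]
A = [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(r)]
G = [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(d)]
exact, first_order, remainder = lora_sam_terms(B, A, G, c=1e-3)
gap = max_abs(combine(exact, combine(first_order, remainder), 1.0, -1.0))
```

Both first-order terms pass the gradient through a projection-like factor built from B or A, which is the algebraic form of the "restricted subspace" argument: the perturbation can only reach directions in the column space of B or the row space of A.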

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.