arxiv: 2605.07111 · v2 · pith:7X2WC3QOnew · submitted 2026-05-08 · 💻 cs.CL · cs.AI

Beyond LoRA vs. Full Fine-Tuning: Gradient-Guided Optimizer Routing for LLM Adaptation

Pith reviewed 2026-05-20 23:46 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords: LLM fine-tuning, LoRA, full fine-tuning, optimizer routing, mixture of experts, parameter-efficient adaptation, model adaptation, gradient routing

The pith

Dynamic routing at the optimizer level lets LLM adaptation draw on full fine-tuning or LoRA as needed during training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Neither full fine-tuning nor LoRA alone is optimal for every task because some require the full plasticity of updating every weight while others benefit from the regularization that low-rank updates provide. The paper introduces MoLF to route updates between these two regimes at the optimizer level so that exact gradients reach both the full-weight path and the LoRA path at every step. This setup supports stable training while removing the need to commit in advance to one static architecture. The memory-efficient variant restricts routing to a pair of LoRA experts and still improves on earlier adaptive LoRA methods across the tested tasks and models.

Core claim

MoLF is a unified framework that routes updates between full fine-tuning and LoRA at the optimizer level to ensure exact gradient signals are available to both experts throughout training, producing stable dynamics. Performance either improves on or stays within 1.5% of the better of FFT and LoRA across SQL, Medical QA, and Counterfactual Knowledge tasks on Gemma-3-1B, Qwen2.5-1.5B, and Qwen2.5-3B. MoLF-Efficient freezes base weights and routes only among LoRA experts of varying rank, outperforming prior adaptive LoRA approaches by up to 20% on Fact and 9% on Med and SQL.
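To give a rough sense of why freezing the base weights and routing only among LoRA experts saves memory, the sketch below counts trainable parameters for a single linear projection. It assumes the standard LoRA factorization (B is d_out x r, A is r x d_in). The layer shape and the two ranks are hypothetical values chosen for illustration, not settings from the paper.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Trainable parameters when the whole weight matrix is updated."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters of one LoRA expert: B (d_out x r) plus A (r x d_in)."""
    return rank * (d_in + d_out)


# Hypothetical layer shape and expert ranks (assumptions, not from the paper).
d_in, d_out = 2048, 2048
ranks = (8, 64)

full = full_params(d_in, d_out)
pair = sum(lora_params(d_in, d_out, r) for r in ranks)

print(f"full-weight expert: {full} trainable parameters")
print(f"pair of LoRA experts: {pair} trainable parameters")
print(f"ratio: {pair / full:.4f}")
```

Under these assumed shapes the pair of LoRA experts trains about 7% as many parameters as the full-weight expert for that layer. Optimizer state scales with the trainable parameter count, so the saving carries over to it as well.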

What carries the argument

A gradient-guided router operating at the optimizer level. At each step it decides whether to apply the update through the full-weight expert or the LoRA expert, while both experts continue to receive unmodified gradients.
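One way to see why both experts can receive exact gradients is to write out the chain rule for a single projection. The worked derivation below is standard calculus, not a reproduction of the paper's own derivation. It assumes one full-weight matrix $W$, one LoRA expert with factors $B$ and $A$, a scalar LoRA scale $s$ (a hypothetical symbol), and an upstream gradient $g$.

```latex
\begin{aligned}
y &= W x + s\, B A x, \qquad g = \frac{\partial \mathcal{L}}{\partial y},\\
\frac{\partial \mathcal{L}}{\partial W} &= g\, x^{\top},\\
\frac{\partial \mathcal{L}}{\partial B} &= s\, g\, (A x)^{\top},\\
\frac{\partial \mathcal{L}}{\partial A} &= s\, B^{\top} g\, x^{\top}.
\end{aligned}
```

All three gradients come from the same $g$ and $x$ in a single backward pass. Deciding afterwards, at the optimizer, which of them to apply therefore does not alter the gradient values themselves.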

If this is right

  • MoLF either improves on or stays within 1.5% of the better static method (FFT or LoRA) across all tested tasks and model sizes from 1B to 3B parameters.
  • MoLF-Efficient improves over prior adaptive LoRA methods by up to 20% on fact tasks and 9% on medical and SQL tasks.
  • Both full-weight and LoRA paths receive exact gradients for the entire training run rather than approximate signals.
  • The same routing mechanism supports continuous navigation between high-plasticity and regularized regimes without upfront architecture choice.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The routing logic could be tested with other pairs of adaptation methods, such as LoRA paired with prefix tuning, to see whether similar gains appear.
  • The frequency of router decisions might serve as a diagnostic signal for when a task truly requires full plasticity versus low-rank updates.
  • Extending the approach to models larger than 3B parameters would show whether routing patterns shift with scale or data entropy.

Load-bearing premise

That routing updates at the optimizer level will maintain stable training dynamics and deliver exact gradient signals to both the full and LoRA paths without introducing new instabilities.

What would settle it

A run on the same tasks and models where MoLF performance drops more than 1.5% behind the stronger of FFT and LoRA on any setting, or where training loss exhibits instability spikes absent from the pure baselines.
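The performance half of this criterion can be sketched as a small check over a results table. The sketch assumes higher scores are better and reads "1.5%" as absolute percentage points by default. The source does not say whether the margin is absolute or relative, so a `relative` flag is included. All names and scores below are made-up placeholders, not results from the paper.

```python
def within_margin(molf: float, fft: float, lora: float,
                  margin: float = 1.5, relative: bool = False) -> bool:
    """True if MoLF beats or stays within `margin` of the better baseline."""
    best = max(fft, lora)
    gap = best - molf  # negative when MoLF is ahead
    if relative:
        gap = 100.0 * gap / best  # gap as a percentage of the better baseline
    return gap <= margin


def violations(results: dict, margin: float = 1.5, relative: bool = False) -> list:
    """Return the (task, model) keys where the claim would be contradicted."""
    return [
        key
        for key, (molf, fft, lora) in results.items()
        if not within_margin(molf, fft, lora, margin, relative)
    ]


# Placeholder scores in the order (MoLF, FFT, LoRA); purely illustrative.
results = {
    ("task_a", "model_x"): (70.2, 71.0, 68.4),
    ("task_b", "model_y"): (55.0, 57.1, 56.0),
}

print(violations(results))
```

Any non-empty output would count against the claim under the chosen reading of the margin. The instability half of the criterion needs the training-loss curves and is not covered here.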

Figures

Figures reproduced from arXiv: 2605.07111 by Boxun Li, Haozhan Tang, Kevin Kuo, Virginia Smith, Xinyin Zhang, Xiuqi Zhu.

Figure 1: Our empirical evaluations reveal a structural trade-off in fine-tuning: FFT excels on ...
Figure 2: Overview of MoLF Framework. Adjacent text from Section 4.2 (MoLF Architecture and Inference): Structurally, MoLF unifies FFT and LoRA by formulating each linear projection as an unconditional superposition of expert pathways. For a given input activation $x$, the ungated forward pass evaluates: $y = W_{\text{base}} x + \sum_{i=1}^{N} \frac{\alpha_i}{\sqrt{r_i}} B_i \ldots$
Figure 3: MoLF-E vs. adaptive PEFT baselines across three tasks and three models. MoLF-E (blue) ...
Figure 4: MoLF routing dynamics over training. Each heatmap row tracks one module's structural ...
Figure 5: Aggregate router decisions over training. Bars represent the percentage of modules ...
Figure 6: Per-module router decisions over training. Rows index the modules in parameter order ...
Figure 7: Rank ablation of MoLF-E: performance versus the first LoRA expert's rank ...
Abstract

Recent literature on fine-tuning Large Language Models highlights a fundamental debate. While Full Fine-Tuning (FFT) provides the representational plasticity required for high-entropy knowledge injection, Low-Rank Adaptation (LoRA) can match or surpass FFT performance because many tasks only require updates in a low-rank space and benefit from LoRA's additional regularization. Through empirical evaluation across diverse tasks (SQL, Medical QA, and Counterfactual Knowledge) and varying language models (Gemma-3-1B, Qwen2.5-1.5B, and Qwen2.5-3B), we verify both trends and demonstrate that relying solely on either static architecture is structurally limited. To address this challenge, we propose a Mixture of LoRA and Full (MoLF) Fine-Tuning, a unified framework that enables continuous navigation between both training regimes. MoLF dynamically routes updates between FFT and LoRA at the optimizer level to ensure that exact gradient signals are available to both experts throughout training, yielding stable training dynamics. For memory-constrained environments, we also introduce MoLF-Efficient, which freezes base weights and only routes updates among a pair of LoRA experts of potentially varying rank. Our evaluations show that MoLF either improves on or stays within $1.5\%$ of the better of FFT and LoRA across all settings, while MoLF-Efficient outperforms prior adaptive LoRA approaches by up to $20\%$ on Fact and $9\%$ on Med and SQL. Our code is open-sourced at https://github.com/11785T23/molf.git.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes MoLF (Mixture of LoRA and Full Fine-Tuning), a unified framework for LLM adaptation that dynamically routes updates between Full Fine-Tuning (FFT) and LoRA at the optimizer level. This enables continuous navigation between the high-plasticity FFT regime and the regularized LoRA regime. Through experiments on SQL, Medical QA, and Counterfactual Knowledge tasks using Gemma-3-1B, Qwen2.5-1.5B, and Qwen2.5-3B models, the authors claim MoLF either improves upon or stays within 1.5% of the better of FFT and LoRA. They also introduce MoLF-Efficient, which routes between LoRA experts of varying rank while freezing base weights, and report it outperforms prior adaptive LoRA methods by up to 20% on Fact and 9% on Med/SQL. Code is open-sourced.

Significance. If the optimizer-level routing mechanism truly supplies unmodified, independent gradient signals to both FFT and LoRA paths without interference or new instabilities, the work would provide a practical bridge between the two dominant fine-tuning paradigms, reducing the need for static architecture choice based on task entropy. The open-sourced code is a clear strength that enables external verification of the routing rule and training dynamics. The empirical scope across multiple models and tasks adds value, though the significance is tempered by the current lack of implementation specifics needed to confirm the core stability and performance claims.

major comments (2)
  1. Abstract: The central claim that MoLF 'dynamically routes updates between FFT and LoRA at the optimizer level to ensure that exact gradient signals are available to both experts throughout training' is load-bearing for the navigation-between-regimes argument and the reported performance within 1.5% of the better baseline, yet the abstract (and by extension the method description) provides no concrete routing rule, such as per-layer selection, gradient masking, dual optimizers, or any form of combination. Without this, it is impossible to verify whether the 'exact gradient' guarantee holds or whether blending occurs.
  2. Evaluation section: Performance claims (MoLF within 1.5% of the better baseline; MoLF-Efficient gains of up to 20% on Fact and 9% on Med/SQL) are presented without details on hyperparameter choices, number of random seeds, statistical significance tests, or train/validation/test splits. This directly affects the reliability of the cross-model and cross-task comparisons that underpin the main empirical contribution.
minor comments (1)
  1. Abstract: The expansion of the MoLF acronym is given but could be introduced more explicitly on first use to improve immediate readability for readers unfamiliar with the hybrid setup.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point-by-point below. Revisions have been made to improve clarity on the routing mechanism and to add missing experimental details.

  1. Referee: Abstract: The central claim that MoLF 'dynamically routes updates between FFT and LoRA at the optimizer level to ensure that exact gradient signals are available to both experts throughout training' is load-bearing for the navigation-between-regimes argument and the reported performance within 1.5% of the better baseline, yet the abstract (and by extension the method description) provides no concrete routing rule, such as per-layer selection, gradient masking, dual optimizers, or any form of combination. Without this, it is impossible to verify whether the 'exact gradient' guarantee holds or whether blending occurs.

    Authors: We agree that the abstract would benefit from a concise description of the routing rule. The full method (Section 3) specifies dual optimizers with per-layer gradient masking: gradients are computed separately for the FFT path and the LoRA path, then masked so that each optimizer receives unmodified signals from its assigned parameters with no cross-path blending. A lightweight router (gradient-norm threshold) decides the per-layer allocation at each step. To make this immediately verifiable from the abstract, we have revised it to read: '...using dual optimizers with per-layer gradient masking to supply unmodified signals to both paths.' The method section already contains the precise implementation; the abstract change ensures the central claim is supported at first reading.
    revision: yes
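The mechanism described in this simulated response can be sketched in a few lines. This illustrates what the response says, which may differ from the paper's actual routing rule. The direction of the threshold (large full-weight gradient norm selects the full path), the flat-list parameters, and the use of plain SGD in place of two real optimizers are all assumptions made for the sketch. Every name in it is hypothetical.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a flat gradient vector."""
    return math.sqrt(sum(v * v for v in vec))


def route(full_grad, threshold):
    """Hypothetical rule: a large full-weight gradient norm selects the full path."""
    return "full" if l2_norm(full_grad) > threshold else "lora"


def masked_step(params, grads, threshold, lr_full, lr_lora):
    """Apply one update per layer to the selected path only.

    The gradient of the selected path is used as computed (no blending or
    rescaling); the other path is masked out and left untouched this step.
    """
    decisions = {}
    for layer, g in grads.items():
        choice = route(g["full"], threshold)
        decisions[layer] = choice
        lr = lr_full if choice == "full" else lr_lora
        params[layer][choice] = [
            p - lr * gi for p, gi in zip(params[layer][choice], g[choice])
        ]
    return decisions


# Toy two-layer state; the numbers are arbitrary.
params = {
    "layer0": {"full": [1.0, 1.0], "lora": [0.5]},
    "layer1": {"full": [2.0, 2.0], "lora": [0.1]},
}
grads = {
    "layer0": {"full": [3.0, 4.0], "lora": [1.0]},
    "layer1": {"full": [0.3, 0.4], "lora": [2.0]},
}

decisions = masked_step(params, grads, threshold=1.0, lr_full=0.1, lr_lora=0.5)
print(decisions)
print(params)
```

In this toy run the first layer has a full-weight gradient norm of 5.0 and is routed to the full path, while the second has a norm of 0.5 and is routed to the LoRA path. In each layer the unselected parameters are unchanged after the step.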

  2. Referee: Evaluation section: Performance claims (MoLF within 1.5% of the better baseline; MoLF-Efficient gains of up to 20% on Fact and 9% on Med/SQL) are presented without details on hyperparameter choices, number of random seeds, statistical significance tests, or train/validation/test splits. This directly affects the reliability of the cross-model and cross-task comparisons that underpin the main empirical contribution.

    Authors: We acknowledge that the original submission omitted several reproducibility details. In the revised manuscript we have expanded the Evaluation section (and added an appendix table) to report: (i) exact hyperparameter grids and final values for each model-task pair, (ii) results averaged over three random seeds with standard deviations, (iii) paired t-tests confirming statistical significance of the reported gains, and (iv) explicit train/validation/test splits for SQL, MedQA, and Counterfactual tasks. These additions directly address the reliability concern while preserving the original performance numbers.
    revision: yes
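For readers who want to see what a paired t-test over three seeds involves, here is a minimal standard-library sketch. The per-seed scores are made-up placeholders, not numbers from the paper. The critical value 4.303 is the two-sided 5% threshold for 2 degrees of freedom.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Return (t, degrees of freedom) for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    return statistics.mean(diffs) / (sd / math.sqrt(n)), n - 1


T_CRIT_DF2 = 4.303  # two-sided, alpha = 0.05, 2 degrees of freedom

# Placeholder per-seed scores (three seeds each); purely illustrative.
method = [71.0, 72.0, 73.5]
baseline = [70.0, 70.5, 71.0]

t, df = paired_t_statistic(method, baseline)
print(f"t = {t:.3f}, df = {df}, significant at 5%: {abs(t) > T_CRIT_DF2}")
```

With these placeholder scores the method wins on every seed, yet t is about 3.78 and falls short of the 5% threshold. That shows how little statistical power three seeds provide.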

Circularity Check

0 steps flagged

No circularity: empirical proposal without derivation chain

The paper is an empirical proposal that introduces MoLF as a dynamic routing framework between FFT and LoRA at the optimizer level, validated through experiments on SQL, Medical QA, and Counterfactual Knowledge tasks across Gemma-3-1B, Qwen2.5-1.5B, and Qwen2.5-3B models. No mathematical derivation, first-principles result, or predictive equation is claimed that reduces by construction to fitted parameters, self-citations, or renamed inputs. Performance claims (within 1.5% of the better baseline, or up to 20% gains for MoLF-Efficient) rest on direct experimental comparison rather than any self-referential reduction. The open-source code link supplies an external reproducibility check, confirming the work is self-contained against benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on empirical performance across three tasks and three models plus the untested premise that optimizer-level routing preserves exact gradients and stable dynamics; no free parameters or invented physical entities are introduced.

axioms (1)
  • Domain assumption: Some tasks require high-entropy representational changes while others benefit from low-rank regularization.
    Stated in the opening paragraph as the basis for the FFT-LoRA debate.
invented entities (1)
  • MoLF routing mechanism (no independent evidence)
    purpose: Dynamically decide per update whether to apply full or LoRA gradients
    New framework introduced to navigate between the two regimes.

pith-pipeline@v0.9.0 · 5836 in / 1307 out tokens · 51096 ms · 2026-05-20T23:46:19.344749+00:00 · methodology
