
arxiv: 2510.23818 · v2 · pith:DQ4PIA4Vnew · submitted 2025-10-27 · 💻 cs.LG

ScaLoRA: Optimally Scaled Low-Rank Adaptation for Efficient High-Rank Fine-Tuning

Pith reviewed 2026-05-18 03:41 UTC · model grok-4.3

classification 💻 cs.LG
keywords low-rank adaptation · LoRA · fine-tuning · large language models · parameter-efficient tuning · optimal scaling · high-rank approximation · convergence

The pith

Optimally scaling the columns of each low-rank update lets successive increments accumulate into a high-rank weight change that approximates full fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a way to build high-rank model updates by adding together many low-rank increments, each scaled in an optimal way. Instead of fixing the scale once, the method computes a fresh scaling factor for every update that minimizes the loss at that step. Because the scaling has a simple closed-form solution, the optimizer can keep running without restarts while the accumulated change tracks the full-rank fine-tuning surface more closely. Tests on language models up to 12 billion parameters show faster convergence and higher accuracy on understanding, reasoning, and math tasks compared with prior LoRA variants.
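
As a rough illustration of why summing low-rank increments can yield a high-rank change, the sketch below adds random rank-r products and tracks the rank of the running total. It is a toy under stated assumptions, not the paper's procedure: the factors are random Gaussian matrices rather than trained adapters, and the names `accumulate_increments`, `B`, and `A` are hypothetical.

```python
import numpy as np

def accumulate_increments(m, n, r, steps, seed=0):
    """Sum `steps` random rank-r increments and record the rank after each one."""
    rng = np.random.default_rng(seed)
    total = np.zeros((m, n))
    ranks = []
    for _ in range(steps):
        B = rng.standard_normal((m, r))  # hypothetical left factor (m x r)
        A = rng.standard_normal((r, n))  # hypothetical right factor (r x n)
        total += B @ A                   # one low-rank increment
        ranks.append(int(np.linalg.matrix_rank(total)))
    return total, ranks

total, ranks = accumulate_increments(m=32, n=24, r=2, steps=6)
print(ranks)  # rank of the running sum after each increment
```

With generic factors the rank grows by r per increment until it reaches min(m, n); whether trained increments behave this way is an empirical question the sketch does not address.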

Core claim

The per-update optimal low-rank matrix is formed by scaling the columns of the base low-rank factors so that the loss decrease is maximized at every step; this scaling admits an analytical expression, and the resulting sequence of increments can be summed without resetting the optimizer while still approximating the loss landscape of full-rank fine-tuning.

What carries the argument

Analytical column-wise scaling of the low-rank matrix at each update step, chosen to minimize the immediate loss and enable seamless accumulation toward a high-rank update.
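
To make "analytical column-wise scaling" concrete, here is a minimal sketch on a toy quadratic loss, 0.5 * ||W - W_target||_F^2, where the best per-column scales of a fixed low-rank product have a closed form through ordinary least squares. This is an assumption-laden stand-in, not the paper's derivation: the loss, the function names, and the random factors are all hypothetical.

```python
import numpy as np

def optimal_column_scaling(B, A, R):
    """Closed-form s minimizing ||R - B @ diag(s) @ A||_F^2 (toy quadratic loss)."""
    # B @ diag(s) @ A = sum_i s_i * outer(B[:, i], A[i, :]), so this is linear
    # least squares in s with Gram matrix (B^T B) * (A A^T), taken elementwise.
    gram = (B.T @ B) * (A @ A.T)
    rhs = np.einsum("mi,mn,in->i", B, R, A)  # rhs[i] = B[:, i]^T R A[i, :]
    return np.linalg.solve(gram, rhs)

def toy_loss(W, W_target):
    """Hypothetical quadratic loss used only for this illustration."""
    return 0.5 * np.linalg.norm(W - W_target) ** 2

rng = np.random.default_rng(0)
m, n, r = 16, 12, 3
W = rng.standard_normal((m, n))
W_target = rng.standard_normal((m, n))
B = rng.standard_normal((m, r))
A = rng.standard_normal((r, n))

s = optimal_column_scaling(B, A, W_target - W)
loss_unscaled = toy_loss(W + B @ A, W_target)      # scale fixed to 1
loss_scaled = toy_loss(W + (B * s) @ A, W_target)  # B * s scales column i of B by s[i]
print(loss_unscaled, loss_scaled)
```

On this toy problem the scaled update can never do worse than the unscaled one, because s = 1 is one of the candidates the least-squares solve considers. For a real network loss the same guarantee would hold only as far as the local model is accurate.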

If this is right

  • The method delivers measurable accuracy improvements over existing LoRA variants on natural language understanding, commonsense reasoning, and mathematical problem solving.
  • Convergence occurs in fewer steps for models ranging from small to 12 billion parameters.
  • No optimizer restart is required when switching to the optimally scaled low-rank increments.
  • The closed-form scaling removes the need for extra hyper-parameter search at each update.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same scaling logic could be applied to other low-rank families such as prefix tuning or adapter modules.
  • If the analytical form generalizes, training-time compute for very large models could be further reduced by skipping full-matrix gradient steps entirely.
  • Longer training runs on downstream tasks might reveal whether the accumulated high-rank updates improve generalization beyond what standard LoRA achieves.

Load-bearing premise

Successive optimally scaled low-rank increments can be accumulated without restarting the optimizer and still stay close to the loss surface of full-rank fine-tuning.

What would settle it

If replacing the analytical scaling with any other fixed or learned factor erases the reported gains in convergence speed or final accuracy on the same 12-billion-parameter models and tasks, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2510.23818 by Georgios B. Giannakis, Xiaodong Yang, Yilang Zhang, Yiwei Cai.

Figure 1: Visualization of linear regression on synthetic data.
Figure 2: Visualization on the RTE dataset with DebertaV3-base.
Figure 3: Overhead comparison using LLaMA3-8B.

Original abstract

As large language models (LLMs) continue to scale in size, the computational overhead has become a major bottleneck for task-specific fine-tuning. While low-rank adaptation (LoRA) effectively curtails this cost by confining the weight updates to a low-dimensional subspace, such a restriction can hinder effectiveness and slow convergence. This contribution deals with these limitations by accumulating progressively a high-rank weight update from consecutive low-rank increments. Specifically, the per update optimal low-rank matrix is identified to minimize the loss function and closely approximate full fine-tuning. To endow efficient and seamless optimization without restarting, this optimal choice is formed by appropriately scaling the columns of the original low-rank matrix. Rigorous performance guarantees reveal that the optimal scaling can be found analytically. Extensive numerical tests with popular LLMs scaling up to 12 billion parameters demonstrate a consistent performance gain and fast convergence relative to state-of-the-art LoRA variants on diverse tasks including natural language understanding, commonsense reasoning, and mathematical problem solving.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes ScaLoRA, a method to accumulate progressively higher-rank weight updates during LLM fine-tuning by identifying an analytically optimal scaling vector for the columns of each low-rank increment. This scaling is derived to minimize a local loss approximation at each step, allowing seamless optimizer continuation without restart while approximating full fine-tuning trajectories. The paper asserts rigorous performance guarantees for the closed-form scaling and reports consistent gains in convergence speed and task performance versus prior LoRA variants on models up to 12B parameters across NLU, commonsense reasoning, and mathematical tasks.

Significance. If the analytical optimality derivation is correct and the local quadratic approximation remains sufficiently accurate across successive updates, ScaLoRA would provide a principled, low-overhead route to effective high-rank adaptation. This could meaningfully narrow the performance gap between parameter-efficient methods and full fine-tuning while preserving the computational advantages of low-rank updates.

major comments (2)
  1. [Abstract] Abstract (paragraph on per-update optimal low-rank matrix): The central claim that successive optimally scaled increments can be accumulated without optimizer restart while closely approximating the full fine-tuning loss surface rests on an unverified assumption that the local quadratic model (or equivalent stationarity condition) remains valid after optimizer state updates; no Hessian tracking, curvature monitoring, or multi-step deviation analysis is described to confirm this.
  2. [Abstract] Abstract: The assertion of 'rigorous performance guarantees' and an 'analytical' solution for optimal scaling lacks any derivation steps, explicit assumptions, or error bounds, which is load-bearing for the optimality claim and prevents verification that the scaling does not reduce to a post-hoc fit.
minor comments (1)
  1. [Abstract] Numerical results summary would benefit from error bars, ablation details on scaling vector computation, and explicit comparison of effective rank achieved versus baseline LoRA variants.
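
The effective-rank comparison requested in the minor comment could be computed along the following lines. This is a sketch under assumptions: it uses the entropy-based effective rank (one common definition among several) next to a thresholded numerical rank, and it runs on random matrices rather than on any weights from the paper. The names `effective_rank` and `numerical_rank` are hypothetical.

```python
import numpy as np

def effective_rank(M, eps=1e-12):
    """Entropy-based effective rank: exp of the Shannon entropy of the
    normalized singular values."""
    sv = np.linalg.svd(M, compute_uv=False)
    p = sv / (sv.sum() + eps)
    p = p[p > eps]  # drop numerically zero singular values
    return float(np.exp(-(p * np.log(p)).sum()))

def numerical_rank(M, rel_tol=1e-6):
    """Count singular values above rel_tol times the largest one."""
    sv = np.linalg.svd(M, compute_uv=False)
    return int((sv > rel_tol * sv[0]).sum())

rng = np.random.default_rng(0)
m, n, r = 40, 30, 4
# Single rank-r update versus the sum of four independent rank-r updates.
single = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
accumulated = sum(
    rng.standard_normal((m, r)) @ rng.standard_normal((r, n)) for _ in range(4)
)
print(numerical_rank(single), effective_rank(single))
print(numerical_rank(accumulated), effective_rank(accumulated))
```

The entropy-based measure is sensitive to how evenly the singular values are spread, so reporting it next to the thresholded count gives a fuller picture than either number alone.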

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing clarifications on the theoretical claims while indicating revisions that will be incorporated to strengthen the manuscript.

  1. Referee: [Abstract] Abstract (paragraph on per-update optimal low-rank matrix): The central claim that successive optimally scaled increments can be accumulated without optimizer restart while closely approximating the full fine-tuning loss surface rests on an unverified assumption that the local quadratic model (or equivalent stationarity condition) remains valid after optimizer state updates; no Hessian tracking, curvature monitoring, or multi-step deviation analysis is described to confirm this.

    Authors: The derivation in the manuscript establishes per-update optimality under a local quadratic approximation of the loss, with the scaling chosen to minimize that local model while allowing the optimizer state (momentum and second-moment estimates) to continue uninterrupted. The abstract summarizes the outcome rather than the multi-step justification. We agree that explicit verification of the approximation's validity over successive steps would strengthen the presentation. In the revised manuscript we will add a dedicated subsection with empirical curvature monitoring (via gradient-norm ratios and local Hessian diagonal estimates) and quantitative deviation analysis between the quadratic model and observed loss changes across fine-tuning trajectories. revision: yes

  2. Referee: [Abstract] Abstract: The assertion of 'rigorous performance guarantees' and an 'analytical' solution for optimal scaling lacks any derivation steps, explicit assumptions, or error bounds, which is load-bearing for the optimality claim and prevents verification that the scaling does not reduce to a post-hoc fit.

    Authors: The abstract is intentionally concise, but Section 3 of the manuscript contains the full analytical derivation: the scaling vector is obtained in closed form by setting the gradient of the local quadratic loss approximation to zero, under the explicit assumptions of twice-differentiability of the loss and a diagonal Hessian approximation for computational tractability. Error bounds are stated in terms of the Taylor remainder. We acknowledge that these elements are not visible from the abstract alone. We will revise the abstract to include a brief outline of the key derivation steps, the main assumptions, and a reference to the detailed proof and bounds in the main text. revision: yes
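
The two responses above mention a stationarity condition under a diagonal-Hessian quadratic model and a deviation analysis between that model and the observed loss. A generic sketch of both ideas follows. It is not taken from the manuscript: the toy loss, the values of `g` and `d`, and the function names are hypothetical, and the ratio computed at the end is similar in spirit to the actual-over-predicted reduction ratio used in trust-region methods.

```python
import numpy as np

def diag_newton_step(grad, hess_diag):
    """Stationary point of m(s) = grad @ s + 0.5 * s @ (hess_diag * s)."""
    return -grad / hess_diag

def model_deviation(loss_fn, s0, grad, hess_diag):
    """Return (predicted, observed) loss change for the diagonal-quadratic step."""
    step = diag_newton_step(grad, hess_diag)
    predicted = float(grad @ step + 0.5 * step @ (hess_diag * step))
    observed = float(loss_fn(s0 + step) - loss_fn(s0))
    return predicted, observed

# Hypothetical toy loss: a separable quadratic plus a quartic term, so the
# quadratic model around s0 = 0 is deliberately inexact.
d = np.array([1.0, 2.0, 4.0])   # diagonal curvature at s0 (assumed known)
g = np.array([0.5, -1.0, 2.0])  # gradient at s0 (assumed known)

def loss_fn(s):
    return float(g @ s + 0.5 * s @ (d * s) + 0.1 * np.sum(s ** 4))

s0 = np.zeros(3)
predicted, observed = model_deviation(loss_fn, s0, g, d)
ratio = observed / predicted  # close to 1 when the quadratic model is accurate
print(predicted, observed, ratio)
```

A ratio that drifts away from 1 over successive updates would be one possible signal that the local model no longer describes the loss well.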

Circularity Check

0 steps flagged

Analytical derivation of column scaling is self-contained and independent of fitted inputs or self-citation chains


The paper's core step identifies an optimal scaling vector for low-rank factors by minimizing a local loss approximation (via second-order Taylor expansion or stationarity condition) and then accumulates these increments. This is a direct mathematical derivation from the stated quadratic model rather than a post-hoc fit renamed as prediction or a self-referential definition. No load-bearing uniqueness theorem, ansatz smuggled via prior self-citation, or renaming of known empirical patterns is invoked; the guarantees follow from the closed-form stationarity condition under the local model. The successive-update validity is an empirical modeling assumption, not a circularity in the derivation itself. The result remains falsifiable against full fine-tuning trajectories and does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests on the existence of an analytical per-step scaling that minimizes loss without restart; no explicit free parameters or new entities are introduced in the abstract.

axioms (1)
  • domain assumption: Successive low-rank updates can be scaled to approximate the loss-minimizing high-rank direction at each step.
    The abstract states that the optimal low-rank matrix is identified to minimize the loss and approximate full fine-tuning.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Low-Rank Adaptation Redux for Large Models

    cs.LG 2026-04 unverdicted novelty 3.0

    An overview revisits LoRA variants by categorizing advances in architectural design, efficient optimization, and applications while linking them to classical signal processing tools for principled fine-tuning.
