arXiv: 2308.03303 · v3 · submitted 2023-08-07 · 💻 cs.CL

LoRA-FA: Efficient and Effective Low Rank Representation Fine-tuning

Pith reviewed 2026-05-24 08:01 UTC · model grok-4.3

classification 💻 cs.CL
keywords LoRA · parameter-efficient fine-tuning · low-rank adaptation · gradient correction · large language models · fine-tuning efficiency

The pith

LoRA-FA freezes the down-projection matrix and uses gradient corrections to match full fine-tuning with reduced memory.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that LoRA possesses an asymmetric collapsible structure allowing its low-rank update to be expressed as a single-layer linear regression. This permits freezing one factor, the projection-down matrix A, while training only the projection-up matrix B. Closed-form gradient corrections are derived to align the low-rank gradient with the full gradient. Experiments across GLUE, GSM8K, MT-Bench and HumanEval show performance comparable to full fine-tuning and other PEFT methods, alongside lower activation memory and compute.

Core claim

LoRA's update admits an asymmetric collapsible structure that reformulates the low-rank modification to the weight matrix as a single-layer linear regression; consequently one of the two LoRA factors can be frozen without loss of expressivity. LoRA-FA therefore freezes the projection-down matrix A and optimizes only the projection-up matrix B, while closed-form gradient corrections minimize the difference between the induced low-rank gradient and the full gradient.
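
As a rough illustration of the core claim, the sketch below trains only the projection-up matrix B while the projection-down matrix A stays at its initial value. Everything concrete here is an assumption made for the example and not taken from the paper: the layer sizes, the Gaussian initialization of A, the squared-error toy objective, and plain gradient descent. The closed-form gradient corrections are not included, since their exact form is not given on this page.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 16, 8, 4                       # hypothetical layer sizes and rank

W0 = rng.normal(size=(d_out, d_in))             # pretrained weight, frozen
A = rng.normal(size=(r, d_in)) / np.sqrt(d_in)  # projection-down, frozen after init
B = np.zeros((d_out, r))                        # projection-up, the only trained factor


def forward(x, B):
    """Compute y = (W0 + B A) x for a batch x of shape (n, d_in)."""
    return x @ W0.T + (x @ A.T) @ B.T


def loss(x, y, B):
    """Half the mean squared error over the batch."""
    return 0.5 * np.mean(np.sum((forward(x, B) - y) ** 2, axis=1))


def grad_B(x, y, B):
    """Gradient of the loss with respect to B only; A receives no gradient."""
    z = x @ A.T                                  # (n, r) low-rank activations
    err = forward(x, B) - y                      # (n, d_out)
    return err.T @ z / x.shape[0]                # (d_out, r)


# Toy target whose update lies in the row space of A, so it is reachable via B alone.
x = rng.normal(size=(64, d_in))
W_target = W0 + rng.normal(size=(d_out, r)) @ A
y = x @ W_target.T

A_before = A.copy()
loss_start = loss(x, y, B)
for _ in range(1000):
    B -= 0.1 * grad_B(x, y, B)                   # plain gradient descent on B
loss_end = loss(x, y, B)
print(loss_start, loss_end)
```

The target here is deliberately constructed inside the row space of A, which is why training B alone can fit it; a target outside that subspace would leave a residual.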

What carries the argument

The asymmetric collapsible structure of LoRA updates: the low-rank update is reformulated as a single-layer linear regression, which is what allows one factor to be frozen.

If this is right

  • LoRA-FA achieves comparable performance to Full-FT on GLUE, GSM8K, MT-Bench, and HumanEval
  • LoRA-FA reduces activation memory consumption during fine-tuning
  • LoRA-FA lowers computational workload in fine-tuning
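
One plausible source of the activation-memory saving follows from the chain rule for y = (W0 + BA)x: the gradient of B needs only the r-dimensional activation Ax, whereas the gradient of A needs the full layer input x. The back-of-envelope sketch below counts just those saved tensors for a single adapted layer. The function name, the sizes, and the 2-byte element width are hypothetical, and the count ignores everything else a real framework keeps in memory.

```python
def adapter_saved_activation_bytes(batch, seq_len, d_in, rank, train_A, bytes_per_elem=2):
    """Bytes of activations kept for the adapter's weight gradients in one layer."""
    tokens = batch * seq_len
    elems = tokens * rank          # A x, needed for dL/dB = g (A x)^T
    if train_A:
        elems += tokens * d_in     # x, needed for dL/dA = B^T g x^T
    return elems * bytes_per_elem


# Hypothetical configuration: batch 8, sequence length 1024, hidden size 4096, rank 16.
lora = adapter_saved_activation_bytes(8, 1024, 4096, 16, train_A=True)
lora_fa = adapter_saved_activation_bytes(8, 1024, 4096, 16, train_A=False)
print(lora, lora_fa, lora / lora_fa)
```

Under these assumptions the ratio is (d_in + r) / r, so the saving grows as the rank shrinks relative to the hidden size.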

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The gradient correction technique might be adapted to other low-rank adaptation variants.
  • Freezing one matrix could simplify implementation in distributed training setups.
  • Further memory reductions may allow fine-tuning on consumer hardware for models larger than those tested.

Load-bearing premise

The low-rank modification to the weight matrix can be reformulated as a single-layer linear regression so that freezing one LoRA factor loses no expressivity.

What would settle it

Measuring whether LoRA-FA without the closed-form gradient corrections reaches full fine-tuning accuracy on GSM8K would test whether the corrections are required for the claimed performance.

Figures

Figures reproduced from arXiv: 2308.03303 by Bo Li, Lin Zhang, Longteng Zhang, Shaohuai Shi, Xiaowen Chu.

Figure 1: The illustration of (a) full-parameter fine-tuning (FT), (b) LoRA, and (c) LoRA-FA.
Figure 2: Convergence comparison among full-parameter fine-tuning (FT), LoRA, and LoRA-FA
Figure 3: GPU memory footprint (GB) comparison under different rank sizes for fine-tuning
Figure 4: Fine-tuning performance comparison between LoRA and LoRA-FA under different ranks

Original abstract

Fine-tuning large language models (LLMs) is crucial for improving their performance on downstream tasks, but full-parameter fine-tuning (Full-FT) is computationally expensive and memory-intensive. Parameter-efficient fine-tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA), address this by optimizing only a small subset of parameters. However, LoRA may underperform Full-FT in certain scenarios due to the intrinsic limitations of its low-rank gradients. In this work, we reveal an asymmetric, collapsible structure in LoRA's update: the low-rank modification to W can be reformulated as a single-layer linear regression, implying that one of the LoRA factors can be frozen without sacrificing expressivity. Leveraging this insight, we introduce LoRA-FA, which freezes the projection-down matrix A and trains only the projection-up matrix B. We further close the gap to Full-FT by deriving closed-form gradient corrections that minimize the discrepancy between the induced low-rank gradient and the full gradient. Through extensive experiments on diverse benchmarks, including GLUE, GSM8K, MT-Bench, and HumanEval, we demonstrate that LoRA-FA consistently achieves comparable performance to existing PEFT methods and Full-FT. Experiments on system efficiency show that LoRA-FA significantly reduces activation memory consumption and computational workload in fine-tuning. Our code is available at https://github.com/huggingface/peft.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes LoRA-FA, a PEFT variant of LoRA that freezes the down-projection matrix A and trains only the up-projection B after reformulating the low-rank update as a single-layer linear regression; it further derives closed-form gradient corrections to reduce the discrepancy with full fine-tuning gradients. Experiments across GLUE, GSM8K, MT-Bench, and HumanEval report performance comparable to Full-FT and prior PEFT methods, together with reduced activation memory and compute.

Significance. If the gradient-correction derivation is correct and the empirical gains hold under controlled ablations, the method could supply a lower-memory alternative to standard LoRA. The closed-form corrections and the public code release are positive features. The central justification that freezing A incurs no expressivity loss, however, appears to rest on an incorrect claim about the representable function class.

major comments (2)
  1. [Abstract] Abstract (and the paragraph on asymmetric collapsible structure): the statement that the linear-regression reformulation 'implies that one of the LoRA factors can be frozen without sacrificing expressivity' is incorrect. With A (r × d_in) fixed, every achievable ΔW = BA has row space contained in the fixed r-dimensional row space of A; jointly optimizing A permits any r-dimensional row space. The set of representable rank-≤r matrices is therefore strictly smaller. The closed-form gradient corrections address only the back-propagation discrepancy and do not enlarge this function class. This claim is load-bearing for the decision to freeze A.
  2. [§3] §3 (derivation of gradient corrections): the manuscript must explicitly state whether the corrections are derived under the assumption that A is already frozen or whether they are intended to compensate for the restricted row space. If the former, the corrections cannot restore the expressivity lost by freezing A; a concrete counter-example (e.g., a target update whose optimal row space lies outside span(A)) should be provided or the claim revised.
minor comments (2)
  1. [Table 1] Table 1 and Figure 3: clarify whether the reported memory numbers include the cost of storing the frozen A matrix or only the trainable B; also state the precise rank r and initialization used for all compared methods.
  2. [§4.2] §4.2: the GLUE results would benefit from reporting standard deviation across at least three random seeds rather than single-run numbers.
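
The row-space argument in major comment 1 can be checked numerically with a small counter-example of the kind major comment 2 asks for. The matrices below are hypothetical and chosen by hand: a frozen rank-1 A that spans the first input axis, and a rank-1 target update whose row space is the second axis. The best B for the frozen A is found by least squares in the Frobenius norm, which is an assumption about the fitting criterion.

```python
import numpy as np

# Frozen A spans only the first coordinate axis of a 2-d input (r = 1).
A = np.array([[1.0, 0.0]])              # shape (r, d_in) = (1, 2)

# Rank-1 target update whose row space is the second axis, orthogonal to A.
dW_target = np.array([[0.0, 1.0],
                      [0.0, 2.0]])      # shape (d_out, d_in) = (2, 2)

# Best B for fixed A: minimize ||B A - dW_target||_F.
# lstsq solves A^T B^T ~= dW_target^T in the least-squares sense.
B_T, *_ = np.linalg.lstsq(A.T, dW_target.T, rcond=None)
B = B_T.T
residual_frozen = np.linalg.norm(B @ A - dW_target)

# With both factors free, the same target is matched exactly at rank 1.
B_free = np.array([[1.0], [2.0]])
A_free = np.array([[0.0, 1.0]])
residual_free = np.linalg.norm(B_free @ A_free - dW_target)

print(B.ravel(), residual_frozen, residual_free)
```

With A frozen, the best achievable B is zero and the whole target is left as residual, while a jointly chosen rank-1 pair reproduces the target exactly. This is an extreme case; a randomly initialized A would usually capture part of a target rather than none of it.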

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and substantive review. The two major comments correctly identify an overstatement in our original claims about expressivity. We address each point below and will revise the manuscript accordingly.

  1. Referee: [Abstract] Abstract (and the paragraph on asymmetric collapsible structure): the statement that the linear-regression reformulation 'implies that one of the LoRA factors can be frozen without sacrificing expressivity' is incorrect. With A (r × d_in) fixed, every achievable ΔW = BA has row space contained in the fixed r-dimensional row space of A; jointly optimizing A permits any r-dimensional row space. The set of representable rank-≤r matrices is therefore strictly smaller. The closed-form gradient corrections address only the back-propagation discrepancy and do not enlarge this function class. This claim is load-bearing for the decision to freeze A.

    Authors: We agree that the referee's analysis is correct: fixing A restricts the row space of ΔW = BA, so the representable function class is strictly smaller than when both factors are optimized. The linear-regression reformulation was intended only to show that, for any chosen A, the optimal B can be solved in closed form within that subspace; it does not imply equivalence of expressivity. We will revise the abstract and the relevant paragraph to remove the phrase 'without sacrificing expressivity' and instead describe the approach as freezing A to enable efficient optimization of B within the induced subspace, with gradient corrections improving alignment to full fine-tuning within that constraint. revision: yes

  2. Referee: [§3] §3 (derivation of gradient corrections): the manuscript must explicitly state whether the corrections are derived under the assumption that A is already frozen or whether they are intended to compensate for the restricted row space. If the former, the corrections cannot restore the expressivity lost by freezing A; a concrete counter-example (e.g., a target update whose optimal row space lies outside span(A)) should be provided or the claim revised.

    Authors: The gradient corrections are derived under the assumption that A is already frozen; they minimize the discrepancy between the low-rank gradient (computed with fixed A) and the full gradient, but they operate entirely within the row space fixed by A and cannot enlarge that space. We will add an explicit statement in §3 clarifying this assumption and will align the surrounding discussion with the revised expressivity wording. Because we are revising the claim rather than maintaining it, a counter-example is not required. revision: yes
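
The statement that, for a chosen A, the optimal B can be solved in closed form can be written out as an ordinary least-squares problem. The derivation below is a sketch under assumptions not stated on this page: a Frobenius-norm objective, a hypothetical target update \Delta W^\star, and a frozen A with full row rank.

```latex
% Assumptions: Frobenius-norm fit to a hypothetical target \Delta W^\star,
% with A \in \mathbb{R}^{r \times d_{\mathrm{in}}} frozen and of full row rank.
\begin{aligned}
B^\star &= \arg\min_{B}\ \lVert B A - \Delta W^\star \rVert_F^2
         = \Delta W^\star A^\top (A A^\top)^{-1}, \\
B^\star A &= \Delta W^\star P_A, \qquad P_A = A^\top (A A^\top)^{-1} A, \\
\lVert B^\star A - \Delta W^\star \rVert_F &= \lVert \Delta W^\star (I - P_A) \rVert_F .
\end{aligned}
```

Here P_A is the orthogonal projector onto the row space of A, so the residual is exactly the part of the target that lies outside that row space. Replacing \Delta W^\star with the full gradient gives the analogous best in-subspace approximation of the gradient; whether that coincides with the paper's closed-form correction is not stated in this review.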

Circularity Check

0 steps flagged

No circularity; derivation and validation are independent of inputs

The paper's core steps are (1) a linear-algebra reformulation of the LoRA update as single-layer regression, (2) a derived closed-form gradient correction, and (3) empirical benchmarking on GLUE/GSM8K/etc. None of these reduce by construction to their own fitted values or to self-citations. The expressivity claim is presented as a direct consequence of the reformulation (not a renamed input), and performance numbers are external measurements rather than quantities fed back into the method. This is the common case of a self-contained derivation validated externally; no load-bearing step collapses to a tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the reformulation of the LoRA update as a linear regression problem and on the assumption that the resulting gradient corrections can be computed in closed form without additional fitted constants.

axioms (1)
  • domain assumption: The low-rank modification to W can be reformulated as a single-layer linear regression.
    This premise is invoked to justify freezing one LoRA factor without loss of expressivity.

pith-pipeline@v0.9.0 · 5790 in / 1144 out tokens · 57975 ms · 2026-05-24T08:01:55.878837+00:00 · methodology

Forward citations

Cited by 14 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Crowded in B-Space: Calibrating Shared Directions for LoRA Merging

    cs.CL 2026-04 unverdicted novelty 7.0

    Pico reduces LoRA merge interference by calibrating over-shared directions in the B matrix before merging, yielding 3.4-8.3 point accuracy gains and sometimes beating joint training.

  2. FIT: A Large-Scale Dataset for Fit-Aware Virtual Try-On

    cs.CV 2026-04 unverdicted novelty 7.0

    FIT is a large-scale dataset of 1.13M try-on triplets with exact size data plus a synthetic generation pipeline that enables training of virtual try-on models capable of depicting realistic garment fit including ill-f...

  3. LoRA-DA: Data-Aware Initialization for Low-Rank Adaptation via Asymptotic Analysis

    cs.LG 2025-10 conditional novelty 7.0

    LoRA-DA derives an optimal data-aware LoRA initialization by solving an optimization problem from asymptotic analysis of parameter discrepancy using Fisher-gradient bias and Fisher-information variance terms.

  4. Universal Reasoner: A Single, Composable Plug-and-Play Reasoner for Frozen LLMs

    cs.AI 2025-05 unverdicted novelty 7.0

    UniR is a composable reasoning module trained with verifiable rewards and added to frozen LLMs via logit summation, enabling modular composition and weak-to-strong generalization across tasks and model sizes.

  5. GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

    cs.LG 2024-03 conditional novelty 7.0

    GaLore performs full-parameter LLM training with up to 65.5% less optimizer memory by projecting gradients onto a low-rank subspace at each step, matching full-rank performance on LLaMA pre-training and RoBERTa fine-tuning.

  6. HELLoRA: Hot Experts Layer-Level Low-Rank Adaptation for Mixture-of-Experts Models

    cs.LG 2026-05 unverdicted novelty 6.0

    HELLoRA selectively applies LoRA adapters to hot experts in MoE layers, using as little as 15.7% of standard LoRA parameters while improving accuracy by 9.2% on OlMoE across math, code, and alignment tasks.

  7. S2FT: Parameter-Efficient Fine-Tuning in Sparse Spectrum Domain

    cs.CV 2026-05 unverdicted novelty 6.0

    S2FT replaces the sparse-spectrum assumption of prior Fourier PEFT with a learned rearrangement that maps a pre-estimated weight change into a domain where few spectral coefficients suffice.

  8. Dr. Post-Training: A Data Regularization Perspective on LLM Post-Training

    cs.LG 2026-05 unverdicted novelty 6.0

    Dr. Post-Training reframes general data as a data-induced regularizer for LLM post-training updates, yielding a family of methods that outperform data-selection baselines on SFT, RLHF, and RLVR tasks.

  9. Foundation models for discovering robust biomarkers of neurological disorders from dynamic functional connectivity

    q-bio.NC 2026-04 conditional novelty 6.0

    RE-CONFIRM shows that standard fine-tuning of foundation models fails to recover known regional hubs in neurological disorders, while Hub-LoRA recovers them and outperforms custom models.

  10. TLoRA: Task-aware Low Rank Adaptation of Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    TLoRA jointly optimizes LoRA initialization via task-data SVD and sensitivity-driven rank allocation, delivering stronger results than standard LoRA across NLU, reasoning, math, code, and chat tasks while using fewer ...

  11. MLorc: Momentum Low-rank Compression for Memory Efficient Large Language Model Adaptation

    cs.LG 2025-06 conditional novelty 6.0

    MLorc compresses optimizer momentum with low-rank methods to enable memory-efficient full fine-tuning of LLMs, outperforming LoRA and GaLore while matching full-parameter performance at small ranks.

  12. GWT: Scalable Optimizer State Compression for Large Language Model Training

    cs.LG 2025-01 unverdicted novelty 6.0

    GWT projects gradients into wavelet subspaces to compress optimizer states for memory-efficient LLM training while claiming performance parity with full-rank updates.

  13. DP-FlogTinyLLM: Differentially private federated log anomaly detection using Tiny LLMs

    cs.CR 2026-04 unverdicted novelty 4.0

    DP-FLogTinyLLM combines federated learning, differential privacy, and LoRA-tuned tiny LLMs to match centralized log anomaly detection performance on Thunderbird and BGL datasets while preserving privacy.

  14. Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey

    cs.LG 2024-03 accept novelty 4.0

    A comprehensive survey of PEFT algorithms for large models, covering their performance, overhead, applications, and real-world system implementations.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · cited by 14 Pith papers · 4 internal anchors

  1. [1]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

  2. [2]

    @esa (Ref

    \@ifxundefined[1] #1\@undefined \@firstoftwo \@secondoftwo \@ifnum[1] #1 \@firstoftwo \@secondoftwo \@ifx[1] #1 \@firstoftwo \@secondoftwo [2] @ #1 \@temptokena #2 #1 @ \@temptokena \@ifclassloaded agu2001 natbib The agu2001 class already includes natbib coding, so you should not add it explicitly Type <Return> for now, but then later remove the command n...

  3. [3]

    \@lbibitem[] @bibitem@first@sw\@secondoftwo \@lbibitem[#1]#2 \@extra@b@citeb \@ifundefined br@#2\@extra@b@citeb \@namedef br@#2 \@nameuse br@#2\@extra@b@citeb \@ifundefined b@#2\@extra@b@citeb @num @parse #2 @tmp #1 NAT@b@open@#2 NAT@b@shut@#2 \@ifnum @merge>\@ne @bibitem@first@sw \@firstoftwo \@ifundefined NAT@b*@#2 \@firstoftwo @num @NAT@ctr \@secondoft...

  4. [4]

    @open @close @open @close and [1] URL: #1 \@ifundefined chapter * \@mkboth \@ifxundefined @sectionbib * \@mkboth * \@mkboth\@gobbletwo \@ifclassloaded amsart * \@ifclassloaded amsbook * \@ifxundefined @heading @heading NAT@ctr thebibliography [1] @ \@biblabel @NAT@ctr \@bibsetup #1 @NAT@ctr @ @openbib .11em \@plus.33em \@minus.07em 4000 4000 `\.\@m @bibit...

  5. [5]

    Falcon-40b: an open large language model with state-of-the-art performance, 2023

    Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Merouane Debbah, Etienne Goffinet, Daniel Heslow, Julien Launay, Quentin Malartic, et al. Falcon-40b: an open large language model with state-of-the-art performance, 2023

  6. [6]

    Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, and Eric Chu et al

    Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, and Eric Chu et al. Palm 2 technical report, 2023

  7. [7]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022

  8. [8]

    B it F it: Simple parameter-efficient fine-tuning for transformer-based masked language-models

    Elad Ben Zaken, Yoav Goldberg, and Shauli Ravfogel. B it F it: Simple parameter-efficient fine-tuning for transformer-based masked language-models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp.\ 1--9, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi:10.1865...

  9. [9]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 0 1877--1901, 2020

  10. [10]

    One-for-all: Generalized lora for parameter-efficient fine-tuning, 2023

    Arnav Chavan, Zhuang Liu, Deepak Gupta, Eric Xing, and Zhiqiang Shen. One-for-all: Generalized lora for parameter-efficient fine-tuning, 2023

  11. [11]

    Training Deep Nets with Sublinear Memory Cost

    Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016

  12. [12]

    Gonzalez, Ion Stoica, and Eric P

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90\ quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/

  13. [13]

    Scaling Instruction-Finetuned Language Models

    Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean,...

  14. [14]

    Unimax: Fairer and more effective language sampling for large-scale multilingual pretraining

    Hyung Won Chung, Xavier Garcia, Adam Roberts, Yi Tay, Orhan Firat, Sharan Narang, and Noah Constant. Unimax: Fairer and more effective language sampling for large-scale multilingual pretraining. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=kXwdL1cWOAi

  15. [15]

    Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018

  16. [16]

    Flashattention: Fast and memory-efficient exact attention with io-awareness

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher R \'e . Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35: 0 16344--16359, 2022

  17. [17]

    Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. Llm. int8 (): 8-bit matrix multiplication for transformers at scale. Advances in neural information processing systems, 2022 a

  18. [18]

    8-bit optimizers via block-wise quantization

    Tim Dettmers, Mike Lewis, Sam Shleifer, and Luke Zettlemoyer. 8-bit optimizers via block-wise quantization. 9th International Conference on Learning Representations, ICLR, 2022 b

  19. [19]

    QLoRA: Efficient Finetuning of Quantized LLMs

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. arXiv preprint arXiv:2305.14314, 2023

  20. [20]

    BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT : Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) , pp.\ 4171--4186, Minneapo...

  21. [21]

    Open llm leaderboard

    Beeching Edward, Fourrier Clémentine, Habib Nathan, Han Sheon, Lambert Nathan, Rajani Nazneen, Sanseviero Omar, Tunstall Lewis, and Wolf Thomas. Open llm leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard, 2023

  22. [22]

    Leo Gao, Jonathan Tow, Stella Biderman, Sid Black, and Anthony et al. DiPofi. A framework for few-shot language model evaluation, September 2021. URL https://doi.org/10.5281/zenodo.5371628

  23. [23]

    PPT : Pre-trained prompt tuning for few-shot learning

    Yuxian Gu, Xu Han, Zhiyuan Liu, and Minlie Huang. PPT : Pre-trained prompt tuning for few-shot learning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 8410--8423, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi:10.18653/v1/2022.acl-long.576. URL https://ac...

  24. [24]

    Deberta: Decoding-enhanced bert with disentangled attention

    Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. Deberta: Decoding-enhanced bert with disentangled attention. In International Conference on Learning Representations, 2020

  25. [25]

    Measuring massive multitask language understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021

  26. [26]

    Unnatural instructions: Tuning language models with (almost) no human labor.arXiv preprint arXiv:2212.09689,

    Or Honovich, Thomas Scialom, Omer Levy, and Timo Schick. Unnatural instructions: Tuning language models with (almost) no human labor, 2022. URL https://arxiv.org/abs/2212.09689

  27. [27]

    Parameter-efficient transfer learning for nlp

    Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning, pp.\ 2790--2799. PMLR, 2019

  28. [28]

    Lo RA : Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lo RA : Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=nZeVKeeFYf9

  29. [29]

    Lorahub: Efficient cross-task generalization via dynamic lora composition, 2023

    Chengsong Huang, Qian Liu, Bill Yuchen Lin, Tianyu Pang, Chao Du, and Min Lin. Lorahub: Efficient cross-task generalization via dynamic lora composition, 2023

  30. [30]

    Quantization and training of neural networks for efficient integer-arithmetic-only inference, 2017

    Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference, 2017

  31. [31]

    Gonzalez

    Paras Jain, Ajay Jain, Aniruddha Nrusimha, Amir Gholami, Pieter Abbeel, Kurt Keutzer, Ion Stoica, and Joseph E. Gonzalez. Checkmate: Breaking the memory wall with optimal tensor rematerialization, 2020

  32. [32]

    Bert: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin Ming-Wei Chang Kenton and Lee Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pp.\ 4171--4186, 2019

  33. [33]

    Semantic sentence matching with densely-connected recurrent and co-attentive information

    Seonhoon Kim, Inho Kang, and Nojun Kwak. Semantic sentence matching with densely-connected recurrent and co-attentive information. Proceedings of the AAAI Conference on Artificial Intelligence, 33 0 (01): 0 6586--6593, Jul. 2019. doi:10.1609/aaai.v33i01.33016586. URL https://ojs.aaai.org/index.php/AAAI/article/view/4627

  34. [34]

    Reducing activation recomputation in large transformer models

    Vijay Anand Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catanzaro. Reducing activation recomputation in large transformer models. Proceedings of Machine Learning and Systems, 5, 2023

  35. [35]

    The power of scale for parameter-efficient prompt tuning

    Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.\ 3045--3059, 2021

  36. [36]

    Prefix-tuning: Optimizing continuous prompts for generation

    Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp.\ 4582--4597, 2021

  37. [37]

    Stack more layers differently: High-rank training through low-rank updates, 2023

    Vladislav Lialin, Namrata Shivagunde, Sherin Muckatira, and Anna Rumshisky. Stack more layers differently: High-rank training through low-rank updates, 2023

  38. [38]

    Truthfulqa: Measuring how models mimic human falsehoods, 2022

    Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods, 2022

  39. [39]

    Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning, 2022

    Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, and Colin Raffel. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning, 2022

  40. [40]

    Gpt understands, too, 2021

    Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. Gpt understands, too, 2021

  41. [41]

    Roberta: A robustly optimized bert pretraining approach, 2019

    Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach, 2019

  42. [42]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. International Conference on Learning Representations, 2017

  43. [43]

    Gpt-4 technical report, 2023

    OpenAI. Gpt-4 technical report, 2023

  44. [44]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35: 0 27730--27744, 2022

  45. [45]

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21 0 (140): 0 1--67, 2020. URL http://jmlr.org/papers/v21/20-074.html

  46. [46]

    Zero: Memory optimizations toward training trillion parameter models

    Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pp.\ 1--16. IEEE, 2020

  47. [47]

    Zero-offload: Democratizing billion-scale model training, 2021

    Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He. Zero-offload: Democratizing billion-scale model training, 2021

  48. [48]

    Megatron-lm: Training multi-billion parameter language models using model parallelism, 2020

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism, 2020

  49. [49]

    Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model, 2022

    Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, Elton Zhang, Rewon Child, Reza Yazdani Aminabadi, Julie Bernauer, Xia Song, Mohammad Shoeybi, Yuxiong He, Michael Houston, Saurabh Tiwary, and Bryan Catanzaro. Using deepspeed and megatron to ...

  50. [50]

    Recursive deep models for semantic compositionality over a sentiment treebank

    Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1631--1642, Seattle, Washington, USA, October 2013. Association for C...

  51. [51]

    Stanford alpaca: An instruction-following llama model

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023

  52. [52]

    Llama: Open and efficient foundation language models, 2023a

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models, 2023a

  53. [53]

    Llama 2: Open foundation and fine-tuned chat models, 2023b

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, and Shruti Bhosale et al. Llama 2: Open foundation and fine-tuned chat models, 2023b

  54. [54]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

  55. [55]

    Powersgd: Practical low-rank gradient compression for distributed optimization

    Thijs Vogels, Sai Praneeth Karimireddy, and Martin Jaggi. Powersgd: Practical low-rank gradient compression for distributed optimization. Advances in Neural Information Processing Systems, 32, 2019

  56. [56]

    Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations, 2019

  57. [57]

    Self-instruct: Aligning language model with self generated instructions, 2022a

    Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language model with self generated instructions, 2022a

  58. [58]

    Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, et al. Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 5085--5109, Abu Dhabi, United Arab Emirates, December 2022b. Association for Computa...

  59. [59]

    Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. Neural network acceptability judgments. Trans. Assoc. Comput. Linguistics, 7: 625--641, 2019

  60. [60]

    Finetuned language models are zero-shot learners

    Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2021

  61. [61]

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-art...

  62. [62]

    mT5: A massively multilingual pre-trained text-to-text transformer

    Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 483--498, Online, ...

  63. [63]

    ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models

    Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, and Colin Raffel. ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models. Transactions of the Association for Computational Linguistics, 10: 291--306, 03 2022. ISSN 2307-387X. doi:10.1162/tacl_a_00461. URL https://doi.org/10.1162/tacl_a_00461

  64. [64]

    Hellaswag: Can a machine really finish your sentence?, 2019

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence?, 2019

  65. [65]

    Evaluation and optimization of gradient compression for distributed deep learning

    Lin Zhang, Longteng Zhang, Shaohuai Shi, Xiaowen Chu, and Bo Li. Evaluation and optimization of gradient compression for distributed deep learning. 2023 IEEE 43rd International Conference on Distributed Computing Systems, 2023a

  66. [66]

    Adaptive budget allocation for parameter-efficient fine-tuning

    Qingru Zhang, Minshuo Chen, Alexander Bukharin, Pengcheng He, Yu Cheng, Weizhu Chen, and Tuo Zhao. Adaptive budget allocation for parameter-efficient fine-tuning. In The Eleventh International Conference on Learning Representations, 2023b

  67. [67]

    Pytorch fsdp: Experiences on scaling fully sharded data parallel, 2023

    Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, Alban Desmaison, Can Balioglu, Bernard Nguyen, Geeta Chauhan, Yuchen Hao, and Shen Li. Pytorch fsdp: Experiences on scaling fully sharded data parallel, 2023
