LoRA-FA: Efficient and Effective Low Rank Representation Fine-tuning

Bo Li; Lin Zhang; Longteng Zhang; Shaohuai Shi; Xiaowen Chu

arxiv: 2308.03303 · v3 · submitted 2023-08-07 · 💻 cs.CL

Pith reviewed 2026-05-24 08:01 UTC · model grok-4.3

classification 💻 cs.CL

keywords LoRA · parameter-efficient fine-tuning · low-rank adaptation · gradient correction · large language models · fine-tuning efficiency

The pith

LoRA-FA freezes the down-projection matrix and uses gradient corrections to match full fine-tuning with reduced memory.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that LoRA possesses an asymmetric collapsible structure allowing its low-rank update to be expressed as a single-layer linear regression. This permits freezing one factor, the projection-down matrix A, while training only the projection-up matrix B. Closed-form gradient corrections are derived to align the low-rank gradient with the full gradient. Experiments across GLUE, GSM8K, MT-Bench and HumanEval show performance comparable to full fine-tuning and other PEFT methods, alongside lower activation memory and compute.

Core claim

LoRA's update admits an asymmetric collapsible structure that reformulates the low-rank modification to the weight matrix as a single-layer linear regression; consequently one of the two LoRA factors can be frozen without loss of expressivity. LoRA-FA therefore freezes the projection-down matrix A and optimizes only the projection-up matrix B, while closed-form gradient corrections minimize the difference between the induced low-rank gradient and the full gradient.
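
The setup described above can be sketched in a few lines. This is a minimal illustration in plain Python lists, not the authors' implementation: the class name `LoRAFALinear`, the Gaussian initialisation of A, the zero initialisation of B, and the plain SGD step are all assumptions borrowed from common LoRA practice rather than details given on this page.

```python
import random

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

class LoRAFALinear:
    """Hypothetical layer y = W0 x + B (A x) with W0 and A frozen; only B is trained."""

    def __init__(self, W0, rank, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W0), len(W0[0])
        self.W0 = W0  # frozen pretrained weight, shape d_out x d_in
        # Frozen random down-projection A, shape rank x d_in (assumed init, never updated).
        self.A = [[rng.gauss(0.0, 1.0 / d_in ** 0.5) for _ in range(d_in)]
                  for _ in range(rank)]
        # Trainable up-projection B, shape d_out x rank, zero so the layer starts at W0.
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        self.z = matvec(self.A, x)  # rank-sized activation cached for the B gradient
        base = matvec(self.W0, x)
        delta = matvec(self.B, self.z)
        return [b + d for b, d in zip(base, delta)]

    def sgd_step(self, grad_y, lr):
        # dL/dB = grad_y z^T; A and W0 receive no update.
        for i, g in enumerate(grad_y):
            for j, zj in enumerate(self.z):
                self.B[i][j] -= lr * g * zj

layer = LoRAFALinear(W0=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], rank=1)
x = [1.0, 2.0, 3.0]
y0 = layer.forward(x)                       # equals W0 x because B starts at zero
A_before = [row[:] for row in layer.A]
layer.sgd_step(grad_y=[1.0, -1.0], lr=0.1)  # grad_y is a made-up upstream gradient
```

After the step only `layer.B` has changed; `layer.A` is identical to `A_before`, which is the sense in which one factor is frozen.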

What carries the argument

The asymmetric collapsible structure of LoRA updates: reformulated as a single-layer linear regression, the update allows one factor to be frozen.

If this is right

LoRA-FA achieves comparable performance to Full-FT on GLUE, GSM8K, MT-Bench, and HumanEval
LoRA-FA reduces activation memory consumption during fine-tuning
LoRA-FA lowers computational workload in fine-tuning
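
A back-of-envelope count suggests where the activation-memory saving could come from. The sketch below assumes a single adapted linear layer with a frozen base weight: the gradient of B needs only the rank-sized projection Ax, while the gradient of A additionally needs the layer input x. The function name and the sizes are hypothetical, and activations kept by other parts of the model (attention, normalisation, and so on) are ignored.

```python
def adapter_activation_elements(tokens, d_in, rank, freeze_a):
    """Activation elements the adapter must keep for its backward pass (one layer)."""
    # The gradient of B needs z = A x: `rank` values per token.
    kept = tokens * rank
    if not freeze_a:
        # The gradient of A additionally needs the input x: `d_in` values per token.
        kept += tokens * d_in
    return kept

tokens, d_in, rank = 2048, 4096, 8  # hypothetical sizes, not taken from the paper
lora = adapter_activation_elements(tokens, d_in, rank, freeze_a=False)
lora_fa = adapter_activation_elements(tokens, d_in, rank, freeze_a=True)
ratio = lora / lora_fa  # (d_in + rank) / rank under these assumptions
```

Under these assumptions the ratio depends only on `d_in` and `rank`, so the relative saving grows as the rank shrinks. Measured end-to-end numbers, such as those in Figure 3, will differ because they include everything else held in GPU memory.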

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The gradient correction technique might be adapted to other low-rank adaptation variants.
Freezing one matrix could simplify implementation in distributed training setups.
Further memory reductions may allow fine-tuning on consumer hardware for models larger than those tested.

Load-bearing premise

The low-rank modification to the weight matrix can be reformulated as a single-layer linear regression so that freezing one LoRA factor loses no expressivity.

What would settle it

Measuring whether LoRA-FA without the closed-form gradient corrections reaches full fine-tuning accuracy on the GSM8K benchmark would test if the corrections are required for the claimed performance.

Figures

Figures reproduced from arXiv: 2308.03303 by Bo Li, Lin Zhang, Longteng Zhang, Shaohuai Shi, Xiaowen Chu.

**Figure 2.** Convergence comparison among full-parameter fine-tuning (FT), LoRA, and LoRA-FA.

**Figure 3.** GPU memory footprint (GB) comparison under different rank sizes for fine-tuning.

**Figure 4.** Fine-tuning performance comparison between LoRA and LoRA-FA under different ranks.

Original abstract

Fine-tuning large language models (LLMs) is crucial for improving their performance on downstream tasks, but full-parameter fine-tuning (Full-FT) is computationally expensive and memory-intensive. Parameter-efficient fine-tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA), address this by optimizing only a small subset of parameters. However, LoRA may underperform Full-FT in certain scenarios due to the intrinsic limitations of its low-rank gradients. In this work, we reveal an asymmetric, collapsible structure in LoRA's update: the low-rank modification to W can be reformulated as a single-layer linear regression, implying that one of the LoRA factors can be frozen without sacrificing expressivity. Leveraging this insight, we introduce LoRA-FA, which freezes the projection-down matrix A and trains only the projection-up matrix B. We further close the gap to Full-FT by deriving closed-form gradient corrections that minimize the discrepancy between the induced low-rank gradient and the full gradient. Through extensive experiments on diverse benchmarks, including GLUE, GSM8K, MT-Bench, and HumanEval, we demonstrate that LoRA-FA consistently achieves comparable performance to existing PEFT methods and Full-FT. Experiments on system efficiency show that LoRA-FA significantly reduces activation memory consumption and computational workload in fine-tuning. Our code is available at https://github.com/huggingface/peft.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LoRA-FA freezes A from the start and adds closed-form gradient corrections, but the no-expressivity-loss argument does not hold up under the row-space constraint.

LoRA-FA freezes the down-projection A and trains only B, then derives closed-form corrections to reduce the gap between the low-rank gradient and the full gradient. That specific combination is the concrete addition over standard LoRA. The experiments run the method on GLUE, GSM8K, MT-Bench, and HumanEval and report performance close to other PEFT baselines and full fine-tuning, plus measurable drops in activation memory and compute. Those system numbers are the part that could matter for people actually running fine-tuning jobs on limited hardware. The code release in the peft library also makes it straightforward to test.

The soft spot is the central justification. The abstract treats the low-rank update as a single-layer linear regression and concludes that one factor can be frozen without losing expressivity. Linear algebra shows the opposite: once A is fixed, every achievable update BA has its row space contained in the row space of that fixed A. Joint training of A and B can reach any r-dimensional row space. The gradient corrections address only the back-propagation mismatch; they do not enlarge the set of representable matrices. If the full paper contains a derivation that resolves this restriction, it needs to be examined directly. Otherwise the claim rests on the empirical results alone.

The work is aimed at practitioners already using LoRA who want a memory-light variant. It is coherent enough on its own terms to deserve a serious referee, even if the theory section requires tightening.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes LoRA-FA, a PEFT variant of LoRA that freezes the down-projection matrix A and trains only the up-projection B after reformulating the low-rank update as a single-layer linear regression; it further derives closed-form gradient corrections to reduce the discrepancy with full fine-tuning gradients. Experiments across GLUE, GSM8K, MT-Bench, and HumanEval report performance comparable to Full-FT and prior PEFT methods, together with reduced activation memory and compute.

Significance. If the gradient-correction derivation is correct and the empirical gains hold under controlled ablations, the method could supply a lower-memory alternative to standard LoRA. The closed-form corrections and the public code release are positive features. The central justification that freezing A incurs no expressivity loss, however, appears to rest on an incorrect claim about the representable function class.

major comments (2)

[Abstract] Abstract (and the paragraph on asymmetric collapsible structure): the statement that the linear-regression reformulation 'implies that one of the LoRA factors can be frozen without sacrificing expressivity' is incorrect. With A (r × d_in) fixed, every achievable ΔW = BA has row space contained in the fixed r-dimensional row space of A; jointly optimizing A permits any r-dimensional row space. The set of representable rank-≤r matrices is therefore strictly smaller. The closed-form gradient corrections address only the back-propagation discrepancy and do not enlarge this function class. This claim is load-bearing for the decision to freeze A.
[§3] §3 (derivation of gradient corrections): the manuscript must explicitly state whether the corrections are derived under the assumption that A is already frozen or whether they are intended to compensate for the restricted row space. If the former, the corrections cannot restore the expressivity lost by freezing A; a concrete counter-example (e.g., a target update whose optimal row space lies outside span(A)) should be provided or the claim revised.
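
The kind of counter-example requested in the second major comment can be made concrete with a tiny case. The matrices below are invented for illustration (they are not from the paper): A is frozen with a row space spanned by the first two coordinate axes, and the target update points along the third.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

# Frozen A (r = 2, d_in = 3): its row space is the plane spanned by e1 and e2.
A = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]

# Rank-1 target update whose row direction is e3, outside that plane.
target = [[0.0, 0.0, 1.0],
          [0.0, 0.0, 2.0]]

def frozen_a_error(B):
    """Squared Frobenius error ||B A - target||^2 for a given B."""
    BA = matmul(B, A)
    return sum((p - t) ** 2 for rp, rt in zip(BA, target) for p, t in zip(rp, rt))

# Whatever B is, the third column of B A is zero, so the error is at least 1 + 4 = 5.
for B in ([[0.0, 0.0], [0.0, 0.0]], [[3.0, -1.0], [0.5, 7.0]]):
    assert all(row[2] == 0.0 for row in matmul(B, A))
    assert frozen_a_error(B) >= 5.0

# With A trainable, a rank-1 factorisation reaches the same target exactly.
A_free = [[0.0, 0.0, 1.0]]
B_free = [[1.0], [2.0]]
assert matmul(B_free, A_free) == target
```

The target has rank 1, well within the rank budget r = 2, yet it is unreachable once A is fixed. That is the gap between "rank at most r" and "rank at most r with row space inside span(A)".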

minor comments (2)

[Table 1] Table 1 and Figure 3: clarify whether the reported memory numbers include the cost of storing the frozen A matrix or only the trainable B; also state the precise rank r and initialization used for all compared methods.
[§4.2] §4.2: the GLUE results would benefit from reporting standard deviation across at least three random seeds rather than single-run numbers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and substantive review. The two major comments correctly identify an overstatement in our original claims about expressivity. We address each point below and will revise the manuscript accordingly.

Referee: [Abstract] Abstract (and the paragraph on asymmetric collapsible structure): the statement that the linear-regression reformulation 'implies that one of the LoRA factors can be frozen without sacrificing expressivity' is incorrect. With A (r × d_in) fixed, every achievable ΔW = BA has row space contained in the fixed r-dimensional row space of A; jointly optimizing A permits any r-dimensional row space. The set of representable rank-≤r matrices is therefore strictly smaller. The closed-form gradient corrections address only the back-propagation discrepancy and do not enlarge this function class. This claim is load-bearing for the decision to freeze A.

Authors: We agree that the referee's analysis is correct: fixing A restricts the row space of ΔW = BA, so the representable function class is strictly smaller than when both factors are optimized. The linear-regression reformulation was intended only to show that, for any chosen A, the optimal B can be solved in closed form within that subspace; it does not imply equivalence of expressivity. We will revise the abstract and the relevant paragraph to remove the phrase 'without sacrificing expressivity' and instead describe the approach as freezing A to enable efficient optimization of B within the induced subspace, with gradient corrections improving alignment to full fine-tuning within that constraint. revision: yes
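
The closed-form statement in this response can be written out as a short worked equation. The symbols are illustrative and not taken from the paper: $\Delta W^\star$ stands for a hypothetical target update, $P_A$ for the orthogonal projector onto the row space of A, and A is assumed to have full row rank so that $A A^\top$ is invertible.

```latex
\begin{aligned}
B^\star &= \arg\min_{B}\ \lVert BA - \Delta W^\star \rVert_F^2
         = \Delta W^\star A^\top (A A^\top)^{-1}, \\
B^\star A &= \Delta W^\star P_A, \qquad P_A = A^\top (A A^\top)^{-1} A, \\
\lVert B^\star A - \Delta W^\star \rVert_F &= \lVert \Delta W^\star (I - P_A) \rVert_F .
\end{aligned}
```

The first line follows from setting the gradient $2(BA - \Delta W^\star)A^\top$ to zero. The last line is zero only when every row of $\Delta W^\star$ already lies in the row space of A, which matches the authors' concession that the solution is optimal within the induced subspace and not beyond it.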
Referee: [§3] §3 (derivation of gradient corrections): the manuscript must explicitly state whether the corrections are derived under the assumption that A is already frozen or whether they are intended to compensate for the restricted row space. If the former, the corrections cannot restore the expressivity lost by freezing A; a concrete counter-example (e.g., a target update whose optimal row space lies outside span(A)) should be provided or the claim revised.

Authors: The gradient corrections are derived under the assumption that A is already frozen; they minimize the discrepancy between the low-rank gradient (computed with fixed A) and the full gradient, but they operate entirely within the row space fixed by A and cannot enlarge that space. We will add an explicit statement in §3 clarifying this assumption and will align the surrounding discussion with the revised expressivity wording. Because we are revising the claim rather than maintaining it, a counter-example is not required. revision: yes
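
One least-squares reading of such a correction is sketched below. It is an assumption about the general form, not the paper's formula, which this page does not reproduce: the plain chain-rule gradient for B is G Aᵀ, and the corrected gradient G Aᵀ(A Aᵀ)⁻¹ is chosen so that the induced change in W is as close as possible to the full gradient G. The matrices and helper names are hypothetical.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def inv2(M):
    """Inverse of a 2x2 matrix (enough for the rank-2 example below)."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def frob_gap(X, Y):
    """Squared Frobenius distance between two matrices."""
    return sum((x - y) ** 2 for rx, ry in zip(X, Y) for x, y in zip(rx, ry))

A = [[2.0, 0.0, 0.0],
     [1.0, 1.0, 0.0]]    # frozen down-projection, r = 2, d_in = 3 (made up)
G = [[1.0, 2.0, 3.0],
     [0.0, 1.0, -1.0]]   # hypothetical full gradient dL/dW

At = transpose(A)
raw_gB = matmul(G, At)                               # chain rule: dL/dB = G A^T
corrected_gB = matmul(raw_gB, inv2(matmul(A, At)))   # G A^T (A A^T)^{-1}

# A step on B changes W along g_B A; compare that direction with G itself.
raw_gap = frob_gap(matmul(raw_gB, A), G)
corrected_gap = frob_gap(matmul(corrected_gB, A), G)
```

In this example the corrected gradient closes part of the gap but not all of it. What remains is the third column of G, which lies outside the row space of A and so cannot be matched by any gradient for B, consistent with the authors' statement that the corrections operate entirely within that row space.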

Circularity Check

0 steps flagged

No circularity; derivation and validation are independent of inputs

The paper's core steps are (1) a linear-algebra reformulation of the LoRA update as single-layer regression, (2) a derived closed-form gradient correction, and (3) empirical benchmarking on GLUE/GSM8K/etc. None of these reduce by construction to their own fitted values or to self-citations. The expressivity claim is presented as a direct consequence of the reformulation (not a renamed input), and performance numbers are external measurements rather than quantities fed back into the method. This is the common case of a self-contained derivation validated externally; no load-bearing step collapses to a tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the reformulation of the LoRA update as a linear regression problem and on the assumption that the resulting gradient corrections can be computed in closed form without additional fitted constants.

axioms (1)

Domain assumption: The low-rank modification to W can be reformulated as a single-layer linear regression.
This premise is invoked to justify freezing one LoRA factor without loss of expressivity.

Forward citations

Cited by 14 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Crowded in B-Space: Calibrating Shared Directions for LoRA Merging
cs.CL · 2026-04 · unverdicted · novelty 7.0

Pico reduces LoRA merge interference by calibrating over-shared directions in the B matrix before merging, yielding 3.4-8.3 point accuracy gains and sometimes beating joint training.

FIT: A Large-Scale Dataset for Fit-Aware Virtual Try-On
cs.CV · 2026-04 · unverdicted · novelty 7.0

FIT is a large-scale dataset of 1.13M try-on triplets with exact size data plus a synthetic generation pipeline that enables training of virtual try-on models capable of depicting realistic garment fit including ill-f...

LoRA-DA: Data-Aware Initialization for Low-Rank Adaptation via Asymptotic Analysis
cs.LG · 2025-10 · conditional · novelty 7.0

LoRA-DA derives an optimal data-aware LoRA initialization by solving an optimization problem from asymptotic analysis of parameter discrepancy using Fisher-gradient bias and Fisher-information variance terms.

Universal Reasoner: A Single, Composable Plug-and-Play Reasoner for Frozen LLMs
cs.AI · 2025-05 · unverdicted · novelty 7.0

UniR is a composable reasoning module trained with verifiable rewards and added to frozen LLMs via logit summation, enabling modular composition and weak-to-strong generalization across tasks and model sizes.

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
cs.LG · 2024-03 · conditional · novelty 7.0

GaLore performs full-parameter LLM training with up to 65.5% less optimizer memory by projecting gradients onto a low-rank subspace at each step, matching full-rank performance on LLaMA pre-training and RoBERTa fine-tuning.

HELLoRA: Hot Experts Layer-Level Low-Rank Adaptation for Mixture-of-Experts Models
cs.LG · 2026-05 · unverdicted · novelty 6.0

HELLoRA selectively applies LoRA adapters to hot experts in MoE layers, using as little as 15.7% of standard LoRA parameters while improving accuracy by 9.2% on OlMoE across math, code, and alignment tasks.

S2FT: Parameter-Efficient Fine-Tuning in Sparse Spectrum Domain
cs.CV · 2026-05 · unverdicted · novelty 6.0

S2FT replaces the sparse-spectrum assumption of prior Fourier PEFT with a learned rearrangement that maps a pre-estimated weight change into a domain where few spectral coefficients suffice.

Dr. Post-Training: A Data Regularization Perspective on LLM Post-Training
cs.LG · 2026-05 · unverdicted · novelty 6.0

Dr. Post-Training reframes general data as a data-induced regularizer for LLM post-training updates, yielding a family of methods that outperform data-selection baselines on SFT, RLHF, and RLVR tasks.

Foundation models for discovering robust biomarkers of neurological disorders from dynamic functional connectivity
q-bio.NC · 2026-04 · conditional · novelty 6.0

RE-CONFIRM shows that standard fine-tuning of foundation models fails to recover known regional hubs in neurological disorders, while Hub-LoRA recovers them and outperforms custom models.

TLoRA: Task-aware Low Rank Adaptation of Large Language Models
cs.CL · 2026-04 · unverdicted · novelty 6.0

TLoRA jointly optimizes LoRA initialization via task-data SVD and sensitivity-driven rank allocation, delivering stronger results than standard LoRA across NLU, reasoning, math, code, and chat tasks while using fewer ...

MLorc: Momentum Low-rank Compression for Memory Efficient Large Language Model Adaptation
cs.LG · 2025-06 · conditional · novelty 6.0

MLorc compresses optimizer momentum with low-rank methods to enable memory-efficient full fine-tuning of LLMs, outperforming LoRA and GaLore while matching full-parameter performance at small ranks.

GWT: Scalable Optimizer State Compression for Large Language Model Training
cs.LG · 2025-01 · unverdicted · novelty 6.0

GWT projects gradients into wavelet subspaces to compress optimizer states for memory-efficient LLM training while claiming performance parity with full-rank updates.

DP-FlogTinyLLM: Differentially private federated log anomaly detection using Tiny LLMs
cs.CR · 2026-04 · unverdicted · novelty 4.0

DP-FLogTinyLLM combines federated learning, differential privacy, and LoRA-tuned tiny LLMs to match centralized log anomaly detection performance on Thunderbird and BGL datasets while preserving privacy.

Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey
cs.LG · 2024-03 · accept · novelty 4.0

A comprehensive survey of PEFT algorithms for large models, covering their performance, overhead, applications, and real-world system implementations.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · cited by 14 Pith papers · 4 internal anchors

[1]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

work page
[2]

@esa (Ref

\@ifxundefined[1] #1\@undefined \@firstoftwo \@secondoftwo \@ifnum[1] #1 \@firstoftwo \@secondoftwo \@ifx[1] #1 \@firstoftwo \@secondoftwo [2] @ #1 \@temptokena #2 #1 @ \@temptokena \@ifclassloaded agu2001 natbib The agu2001 class already includes natbib coding, so you should not add it explicitly Type <Return> for now, but then later remove the command n...

work page
[3]

\@lbibitem[] @bibitem@first@sw\@secondoftwo \@lbibitem[#1]#2 \@extra@b@citeb \@ifundefined br@#2\@extra@b@citeb \@namedef br@#2 \@nameuse br@#2\@extra@b@citeb \@ifundefined b@#2\@extra@b@citeb @num @parse #2 @tmp #1 NAT@b@open@#2 NAT@b@shut@#2 \@ifnum @merge>\@ne @bibitem@first@sw \@firstoftwo \@ifundefined NAT@b*@#2 \@firstoftwo @num @NAT@ctr \@secondoft...

work page
[4]

@open @close @open @close and [1] URL: #1 \@ifundefined chapter * \@mkboth \@ifxundefined @sectionbib * \@mkboth * \@mkboth\@gobbletwo \@ifclassloaded amsart * \@ifclassloaded amsbook * \@ifxundefined @heading @heading NAT@ctr thebibliography [1] @ \@biblabel @NAT@ctr \@bibsetup #1 @NAT@ctr @ @openbib .11em \@plus.33em \@minus.07em 4000 4000 `\.\@m @bibit...

work page
[5]

Falcon-40b: an open large language model with state-of-the-art performance, 2023

Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Merouane Debbah, Etienne Goffinet, Daniel Heslow, Julien Launay, Quentin Malartic, et al. Falcon-40b: an open large language model with state-of-the-art performance, 2023

work page 2023
[6]

Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, and Eric Chu et al

Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, and Eric Chu et al. Palm 2 technical report, 2023

work page 2023
[7]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[8]

B it F it: Simple parameter-efficient fine-tuning for transformer-based masked language-models

Elad Ben Zaken, Yoav Goldberg, and Shauli Ravfogel. B it F it: Simple parameter-efficient fine-tuning for transformer-based masked language-models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp.\ 1--9, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi:10.1865...

work page doi:10.18653/v1/2022.acl-short.1 2022
[9]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 0 1877--1901, 2020

work page 1901
[10]

One-for-all: Generalized lora for parameter-efficient fine-tuning, 2023

Arnav Chavan, Zhuang Liu, Deepak Gupta, Eric Xing, and Zhiqiang Shen. One-for-all: Generalized lora for parameter-efficient fine-tuning, 2023

work page 2023
[11]

Training Deep Nets with Sublinear Memory Cost

Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[12]

Gonzalez, Ion Stoica, and Eric P

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90\ quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/

work page 2023
[13]

Scaling Instruction-Finetuned Language Models

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean,...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[14]

Unimax: Fairer and more effective language sampling for large-scale multilingual pretraining

Hyung Won Chung, Xavier Garcia, Adam Roberts, Yi Tay, Orhan Firat, Sharan Narang, and Noah Constant. Unimax: Fairer and more effective language sampling for large-scale multilingual pretraining. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=kXwdL1cWOAi

work page 2023
[15]

Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018

work page 2018
[16]

Flashattention: Fast and memory-efficient exact attention with io-awareness

Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher R \'e . Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35: 0 16344--16359, 2022

work page 2022
[17]

Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. Llm. int8 (): 8-bit matrix multiplication for transformers at scale. Advances in neural information processing systems, 2022 a

work page 2022
[18]

8-bit optimizers via block-wise quantization

Tim Dettmers, Mike Lewis, Sam Shleifer, and Luke Zettlemoyer. 8-bit optimizers via block-wise quantization. 9th International Conference on Learning Representations, ICLR, 2022 b

work page 2022
[19]

QLoRA: Efficient Finetuning of Quantized LLMs

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. arXiv preprint arXiv:2305.14314, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[20]

BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT : Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) , pp.\ 4171--4186, Minneapo...

work page doi:10.18653/v1/n19-1423 2019
[21]

Open llm leaderboard

Beeching Edward, Fourrier Clémentine, Habib Nathan, Han Sheon, Lambert Nathan, Rajani Nazneen, Sanseviero Omar, Tunstall Lewis, and Wolf Thomas. Open llm leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard, 2023

work page 2023
[22]

Leo Gao, Jonathan Tow, Stella Biderman, Sid Black, and Anthony et al. DiPofi. A framework for few-shot language model evaluation, September 2021. URL https://doi.org/10.5281/zenodo.5371628

work page doi:10.5281/zenodo.5371628 2021
[23]

PPT : Pre-trained prompt tuning for few-shot learning

Yuxian Gu, Xu Han, Zhiyuan Liu, and Minlie Huang. PPT : Pre-trained prompt tuning for few-shot learning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 8410--8423, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi:10.18653/v1/2022.acl-long.576. URL https://ac...

work page doi:10.18653/v1/2022.acl-long.576 2022
[24]

Deberta: Decoding-enhanced bert with disentangled attention

Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. Deberta: Decoding-enhanced bert with disentangled attention. In International Conference on Learning Representations, 2020

work page 2020
[25]

Measuring massive multitask language understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021

work page 2021
[26]

Unnatural instructions: Tuning language models with (almost) no human labor.arXiv preprint arXiv:2212.09689,

Or Honovich, Thomas Scialom, Omer Levy, and Timo Schick. Unnatural instructions: Tuning language models with (almost) no human labor, 2022. URL https://arxiv.org/abs/2212.09689

work page arXiv 2022
[27]

Parameter-efficient transfer learning for nlp

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning, pp.\ 2790--2799. PMLR, 2019

work page 2019
[28]

Lo RA : Low-rank adaptation of large language models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lo RA : Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=nZeVKeeFYf9

work page 2022
[29]

Lorahub: Efficient cross-task generalization via dynamic lora composition, 2023

Chengsong Huang, Qian Liu, Bill Yuchen Lin, Tianyu Pang, Chao Du, and Min Lin. Lorahub: Efficient cross-task generalization via dynamic lora composition, 2023

work page 2023
[30]

Quantization and training of neural networks for efficient integer-arithmetic-only inference, 2017

Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference, 2017

work page 2017
[31]

Gonzalez

Paras Jain, Ajay Jain, Aniruddha Nrusimha, Amir Gholami, Pieter Abbeel, Kurt Keutzer, Ion Stoica, and Joseph E. Gonzalez. Checkmate: Breaking the memory wall with optimal tensor rematerialization, 2020

work page 2020
[32]

Bert: Pre-training of deep bidirectional transformers for language understanding

Jacob Devlin Ming-Wei Chang Kenton and Lee Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pp.\ 4171--4186, 2019

work page 2019
[33]

Semantic sentence matching with densely-connected recurrent and co-attentive information

Seonhoon Kim, Inho Kang, and Nojun Kwak. Semantic sentence matching with densely-connected recurrent and co-attentive information. Proceedings of the AAAI Conference on Artificial Intelligence, 33 0 (01): 0 6586--6593, Jul. 2019. doi:10.1609/aaai.v33i01.33016586. URL https://ojs.aaai.org/index.php/AAAI/article/view/4627

work page doi:10.1609/aaai.v33i01.33016586 2019
[34]

Reducing activation recomputation in large transformer models

Vijay Anand Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catanzaro. Reducing activation recomputation in large transformer models. Proceedings of Machine Learning and Systems, 5, 2023

work page 2023
[35]

The power of scale for parameter-efficient prompt tuning

Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.\ 3045--3059, 2021

work page 2021
[36]

Prefix-tuning: Optimizing continuous prompts for generation

Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp.\ 4582--4597, 2021

work page 2021
[37]

Stack more layers differently: High-rank training through low-rank updates, 2023

Vladislav Lialin, Namrata Shivagunde, Sherin Muckatira, and Anna Rumshisky. Stack more layers differently: High-rank training through low-rank updates, 2023

work page 2023
[38]

Truthfulqa: Measuring how models mimic human falsehoods, 2022

Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods, 2022

work page 2022
[39]

Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning, 2022

Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, and Colin Raffel. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning, 2022

work page 2022
[40]

Gpt understands, too, 2021

Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. Gpt understands, too, 2021

work page 2021
[41]

Roberta: A robustly optimized bert pretraining approach, 2019

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach, 2019

work page 2019
[42]

Decoupled weight decay regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. International Conference on Learning Representations, 2017

work page 2017
[43]

Gpt-4 technical report, 2023

OpenAI. Gpt-4 technical report, 2023

work page 2023
[44]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730--27744, 2022

work page 2022
[45]

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1--67, 2020. URL http://jmlr.org/papers/v21/20-074.html

work page 2020
[46]

Zero: Memory optimizations toward training trillion parameter models

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1--16. IEEE, 2020

work page 2020
[47]

Zero-offload: Democratizing billion-scale model training, 2021

Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He. Zero-offload: Democratizing billion-scale model training, 2021

work page 2021
[48]

Megatron-lm: Training multi-billion parameter language models using model parallelism, 2020

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism, 2020

work page 2020
[49]

Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model, 2022

Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, Elton Zhang, Rewon Child, Reza Yazdani Aminabadi, Julie Bernauer, Xia Song, Mohammad Shoeybi, Yuxiong He, Michael Houston, Saurabh Tiwary, and Bryan Catanzaro. Using deepspeed and megatron to ...

work page 2022
[50]

Recursive deep models for semantic compositionality over a sentiment treebank

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1631--1642, Seattle, Washington, USA, October 2013. Association for C...

work page 2013
[51]

Stanford alpaca: An instruction-following llama model

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023

work page 2023
[52]

Llama: Open and efficient foundation language models, 2023a

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models, 2023a

work page 2023
[53]

Llama 2: Open foundation and fine-tuned chat models, 2023b

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models, 2023b

work page 2023
[54]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

work page 2017
[55]

Powersgd: Practical low-rank gradient compression for distributed optimization

Thijs Vogels, Sai Praneeth Karimireddy, and Martin Jaggi. Powersgd: Practical low-rank gradient compression for distributed optimization. Advances in Neural Information Processing Systems, 32, 2019

work page 2019
[56]

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations, 2019

work page 2019
[57]

Self-instruct: Aligning language model with self generated instructions, 2022a

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language model with self generated instructions, 2022a

work page 2022
[58]

Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, et al. Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 5085--5109, Abu Dhabi, United Arab Emirates, December 2022b. Association for Computa...

work page 2022
[59]

Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. Neural network acceptability judgments. Trans. Assoc. Comput. Linguistics, 7:625--641, 2019

work page 2019
[60]

Finetuned language models are zero-shot learners

Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2021

work page 2021
[61]

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-art...

work page 2020
[62]

mT5: A massively multilingual pre-trained text-to-text transformer

Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 483--498, Online, ...

work page doi:10.18653/v1/2021.naacl-main.41 2021
[63]

ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models

Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, and Colin Raffel. ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models. Transactions of the Association for Computational Linguistics, 10:291--306, 03 2022. ISSN 2307-387X. doi:10.1162/tacl_a_00461. URL https://doi.org/10.1162/tacl_a_00461

work page doi:10.1162/tacl_a_00461 2022
[64]

Hellaswag: Can a machine really finish your sentence?, 2019

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence?, 2019

work page 2019
[65]

Evaluation and optimization of gradient compression for distributed deep learning

Lin Zhang, Longteng Zhang, Shaohuai Shi, Xiaowen Chu, and Bo Li. Evaluation and optimization of gradient compression for distributed deep learning. 2023 IEEE 43rd International Conference on Distributed Computing Systems, 2023a

work page 2023
[66]

Adaptive budget allocation for parameter-efficient fine-tuning

Qingru Zhang, Minshuo Chen, Alexander Bukharin, Pengcheng He, Yu Cheng, Weizhu Chen, and Tuo Zhao. Adaptive budget allocation for parameter-efficient fine-tuning. In The Eleventh International Conference on Learning Representations, 2023b

work page 2023
[67]

Pytorch fsdp: Experiences on scaling fully sharded data parallel, 2023

Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, Alban Desmaison, Can Balioglu, Bernard Nguyen, Geeta Chauhan, Yuchen Hao, and Shen Li. Pytorch fsdp: Experiences on scaling fully sharded data parallel, 2023

work page 2023

[5]

Falcon-40b: an open large language model with state-of-the-art performance, 2023

Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Merouane Debbah, Etienne Goffinet, Daniel Heslow, Julien Launay, Quentin Malartic, et al. Falcon-40b: an open large language model with state-of-the-art performance, 2023

work page 2023

[6]

Palm 2 technical report, 2023

Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, and Eric Chu et al. Palm 2 technical report, 2023

work page 2023

[7]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022

work page 2022

[8]

BitFit: Simple parameter-efficient fine-tuning for transformer-based masked language-models

Elad Ben Zaken, Yoav Goldberg, and Shauli Ravfogel. BitFit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 1--9, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi:10.1865...

work page doi:10.18653/v1/2022.acl-short.1 2022

[9]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877--1901, 2020

work page 2020

[10]

One-for-all: Generalized lora for parameter-efficient fine-tuning, 2023

Arnav Chavan, Zhuang Liu, Deepak Gupta, Eric Xing, and Zhiqiang Shen. One-for-all: Generalized lora for parameter-efficient fine-tuning, 2023

work page 2023

[11]

Training Deep Nets with Sublinear Memory Cost

Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016

work page 2016

[12]

Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/

work page 2023

[13]

Scaling Instruction-Finetuned Language Models

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean,...

work page 2022

[14]

Unimax: Fairer and more effective language sampling for large-scale multilingual pretraining

Hyung Won Chung, Xavier Garcia, Adam Roberts, Yi Tay, Orhan Firat, Sharan Narang, and Noah Constant. Unimax: Fairer and more effective language sampling for large-scale multilingual pretraining. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=kXwdL1cWOAi

work page 2023

[15]

Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018

work page 2018

[16]

Flashattention: Fast and memory-efficient exact attention with io-awareness

Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344--16359, 2022

work page 2022

[17]

Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. LLM.int8(): 8-bit matrix multiplication for transformers at scale. Advances in neural information processing systems, 2022a

work page 2022

[18]

8-bit optimizers via block-wise quantization

Tim Dettmers, Mike Lewis, Sam Shleifer, and Luke Zettlemoyer. 8-bit optimizers via block-wise quantization. 9th International Conference on Learning Representations, ICLR, 2022b

work page 2022

[19]

QLoRA: Efficient Finetuning of Quantized LLMs

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. arXiv preprint arXiv:2305.14314, 2023

work page 2023

[20]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171--4186, Minneapo...

work page doi:10.18653/v1/n19-1423 2019

[21]

Open llm leaderboard

Edward Beeching, Clémentine Fourrier, Nathan Habib, Sheon Han, Nathan Lambert, Nazneen Rajani, Omar Sanseviero, Lewis Tunstall, and Thomas Wolf. Open llm leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard, 2023

work page 2023

[22]

Leo Gao, Jonathan Tow, Stella Biderman, Sid Black, Anthony DiPofi, et al. A framework for few-shot language model evaluation, September 2021. URL https://doi.org/10.5281/zenodo.5371628

work page doi:10.5281/zenodo.5371628 2021

[23]

PPT: Pre-trained prompt tuning for few-shot learning

Yuxian Gu, Xu Han, Zhiyuan Liu, and Minlie Huang. PPT: Pre-trained prompt tuning for few-shot learning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 8410--8423, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi:10.18653/v1/2022.acl-long.576. URL https://ac...

work page doi:10.18653/v1/2022.acl-long.576 2022

[24]

Deberta: Decoding-enhanced bert with disentangled attention

Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. Deberta: Decoding-enhanced bert with disentangled attention. In International Conference on Learning Representations, 2020

work page 2020

[25]

Measuring massive multitask language understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021

work page 2021

[26]

Unnatural instructions: Tuning language models with (almost) no human labor

Or Honovich, Thomas Scialom, Omer Levy, and Timo Schick. Unnatural instructions: Tuning language models with (almost) no human labor, 2022. URL https://arxiv.org/abs/2212.09689

work page arXiv 2022

[27]

Parameter-efficient transfer learning for nlp

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning, pp. 2790--2799. PMLR, 2019

work page 2019

[28]

LoRA: Low-rank adaptation of large language models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=nZeVKeeFYf9

work page 2022

[29]

Lorahub: Efficient cross-task generalization via dynamic lora composition, 2023

Chengsong Huang, Qian Liu, Bill Yuchen Lin, Tianyu Pang, Chao Du, and Min Lin. Lorahub: Efficient cross-task generalization via dynamic lora composition, 2023

work page 2023

[30]

Quantization and training of neural networks for efficient integer-arithmetic-only inference, 2017

Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference, 2017

work page 2017

[31]

Checkmate: Breaking the memory wall with optimal tensor rematerialization, 2020

Paras Jain, Ajay Jain, Aniruddha Nrusimha, Amir Gholami, Pieter Abbeel, Kurt Keutzer, Ion Stoica, and Joseph E. Gonzalez. Checkmate: Breaking the memory wall with optimal tensor rematerialization, 2020

work page 2020

[32]

Bert: Pre-training of deep bidirectional transformers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pp. 4171--4186, 2019

work page 2019

[33]

Semantic sentence matching with densely-connected recurrent and co-attentive information

Seonhoon Kim, Inho Kang, and Nojun Kwak. Semantic sentence matching with densely-connected recurrent and co-attentive information. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):6586--6593, Jul. 2019. doi:10.1609/aaai.v33i01.33016586. URL https://ojs.aaai.org/index.php/AAAI/article/view/4627

work page doi:10.1609/aaai.v33i01.33016586 2019
