BEFT: Bias-Efficient Fine-Tuning of Language Models in Low-Data Regimes

Amir Aminifar; Ananth Balashankar; Baichuan Huang

arXiv: 2509.15974 · v2 · submitted 2025-09-19 · cs.CL · cs.AI · cs.LG

Pith reviewed 2026-05-18 15:29 UTC · model grok-4.3

classification cs.CL · cs.AI · cs.LG

keywords bias fine-tuning · parameter-efficient fine-tuning · low-data regimes · attention projections · large language models · value bias · downstream performance

The pith

Fine-tuning the value bias in large language models leads to better downstream performance than tuning query or key biases in low-data regimes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper explores the effects of fine-tuning individual bias terms within the attention mechanisms of LLMs. It establishes that fine-tuning the value bias b_v tends to produce superior results on downstream tasks when data is scarce. This holds for a variety of model architectures and sizes. A reader would care because it points to a minimal-change strategy for adapting powerful models without extensive data or resources.

Core claim

Directly fine-tuning b_v generally leads to higher downstream performance in low-data regimes, in comparison to b_q and b_k. This unique property is evaluated extensively across encoder-only and decoder-only LLMs up to 6.7B parameters, including bias-free models, providing evidence for the effectiveness of this choice across various downstream tasks.

What carries the argument

The bias vector b_v in the value projection of the transformer attention layers: updating it alone produces the reported performance gains over the other bias choices.

If this is right

Updating only b_v supplies a parameter-efficient adaptation route suited to low-data conditions.
The benefit appears in both encoder-only and decoder-only transformer architectures.
The pattern persists in models as large as 6.7 billion parameters and in models that originally lack biases.
The approach maintains competitive task accuracy while changing far fewer parameters than full fine-tuning.
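
To make the parameter-efficiency point concrete, the sketch below picks out which parameters would stay trainable under bias-only tuning and estimates the trainable fraction. The suffix patterns (`value.bias`, `v_proj.bias`, and so on), the helper names, and the toy parameter list are assumptions for illustration and are not taken from the paper or its released code; the 110M total is only the commonly quoted rough size of BERT-base.

```python
# Hypothetical suffix patterns for attention biases; real checkpoints may
# name these parameters differently, so treat the table as an assumption.
BIAS_SUFFIXES = {
    "q": ("query.bias", "q_proj.bias"),
    "k": ("key.bias", "k_proj.bias"),
    "v": ("value.bias", "v_proj.bias"),
}


def select_trainable(param_names, target="v"):
    """Return the names that stay trainable; everything else would be frozen."""
    suffixes = BIAS_SUFFIXES[target]
    return [name for name in param_names if name.endswith(suffixes)]


def trainable_fraction(num_layers, hidden_size, total_params):
    """Fraction of parameters updated when one bias vector per layer is tuned.

    Assumes each layer contributes a single bias of length hidden_size.
    """
    return num_layers * hidden_size / total_params


# Toy parameter list for a 2-layer encoder (illustrative names only).
names = [
    f"encoder.layer.{i}.attention.self.{proj}.{kind}"
    for i in range(2)
    for proj in ("query", "key", "value")
    for kind in ("weight", "bias")
]

selected = select_trainable(names, target="v")

# Rough BERT-base-like shape: 12 layers, hidden size 768, ~110M parameters.
fraction = trainable_fraction(12, 768, 110_000_000)
```

Under these assumptions, tuning one bias vector per layer touches 12 × 768 = 9,216 values, which is well under 0.01% of the model.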

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Practitioners facing very small datasets might default to tuning b_v when using bias-only adaptation.
The result hints that value projections may carry more task-specific information than query or key projections during low-data updates.
The same selective bias update could be tested as a lightweight addition inside other efficient-tuning recipes such as adapters.

Load-bearing premise

The performance advantage is attributable specifically to the choice of which bias vector to update rather than to differences in optimization dynamics or other experimental controls.

What would settle it

A set of controlled runs that match learning rates, optimizers, and all other settings exactly yet show equal performance when tuning b_q or b_k instead of b_v would falsify the central claim.
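
One way to set up such matched runs is to derive every configuration from a single base object and vary only the tuned bias and the seed. The sketch below is a minimal illustration; `RunConfig`, its fields, and all default values are hypothetical placeholders, not the paper's actual hyperparameters.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical run description; every default is a placeholder."""
    target_bias: str              # "q", "k", or "v"
    learning_rate: float = 1e-3
    optimizer: str = "adamw"
    epochs: int = 10
    batch_size: int = 32
    max_grad_norm: float = 1.0
    seed: int = 0


def matched_runs(base, targets=("q", "k", "v"), seeds=(0, 1, 2)):
    """Expand one base config into runs that vary only target_bias and seed."""
    return [replace(base, target_bias=t, seed=s) for t in targets for s in seeds]


def differs_only_in_target(a, b):
    """True if two configs agree on every field except target_bias."""
    da, db = asdict(a), asdict(b)
    return all(da[key] == db[key] for key in da if key != "target_bias")


runs = matched_runs(RunConfig(target_bias="v"))
```

Because the dataclass is frozen and each run is produced with `replace`, a setting such as the learning rate cannot drift between the b_q, b_k, and b_v arms without the base config itself changing.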

Figures

Figures reproduced from arXiv: 2509.15974 by Amir Aminifar, Ananth Balashankar, Baichuan Huang.

**Figure 1.** Importance ranking and accuracy (%) of fine-tuning different bias terms (query b_q, key b_k, and value b_v) using various bias-selection approaches on the SST-2 dataset with BERT_BASE (low-data: 1000 training samples). We expect higher-ranked bias terms to achieve higher accuracy. Our approach precisely and dynamically selects the particular bias terms to be effectively fine-tuned, compared to Magnitude (Za…

**Figure 2.** Our bias-efficient approach jointly considers both …

**Figure 3.** Approaches to select different bias terms for fine-tuning: (a) the magnitude of bias change before and after fine-tuning. Different fine-tuned biases situated on the green rhombus exhibit the same Magnitude value; (b) the empirical Fisher information before fine-tuning. Different gradients situated on the yellow circle result in the same Fisher value; (c) our bias-efficient approach, which circumvents the …

**Figure 4.** Importance ranking and downstream performance of fine-tuning different bias terms using various bias-selection …

Original abstract

Fine-tuning the bias terms of large language models (LLMs) has the potential to achieve unprecedented parameter efficiency while maintaining competitive performance, particularly in low-data regimes. However, the link between fine-tuning different bias terms (i.e., $\boldsymbol{b}_q$, $\boldsymbol{b}_k$, and $\boldsymbol{b}_v$ in the query, key, or value projections) and downstream performance remains largely unclear to date. In this paper, we investigate the link between fine-tuning $\boldsymbol{b}_q$, $\boldsymbol{b}_k$, and $\boldsymbol{b}_v$ with the performance of the downstream task. Our key finding is that directly fine-tuning $\boldsymbol{b}_v$ generally leads to higher downstream performance in low-data regimes, in comparison to $\boldsymbol{b}_q$ and $\boldsymbol{b}_k$. We extensively evaluate this unique property across a wide range of LLMs spanning encoder-only and decoder-only architectures up to 6.7B parameters (including bias-free LLMs). Our results provide strong evidence for the effectiveness of directly fine-tuning $\boldsymbol{b}_v$ across various downstream tasks. The implementation code is available at https://github.com/whubaichuan/BEFT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript investigates fine-tuning individual bias terms (b_q, b_k, b_v) in the attention projections of LLMs for parameter efficiency in low-data regimes. The central claim is that directly fine-tuning b_v yields higher downstream performance than b_q or b_k, supported by evaluations across encoder-only and decoder-only models up to 6.7B parameters (including bias-free LLMs), with public code release.

Significance. If the performance advantage of b_v is confirmed to stem from its algebraic role rather than optimization artifacts, the result would be significant for practical low-resource fine-tuning and for understanding attention biases. The broad evaluation across architectures and scales, plus reproducible code, are strengths that would support the contribution once controls are verified.

major comments (2)

[Experimental Setup] Experimental Setup section: The protocol does not state that identical learning-rate schedules, optimizer states, gradient clipping, and total training steps were used for the b_v, b_q, and b_k bias-only runs. Without explicit equalization or ablation of these factors, the headline result that b_v fine-tuning is superior could be an artifact of better-tuned optimization dynamics rather than an intrinsic property of the value bias (directly addressing the weakest assumption and skeptic concern).
[Results] Results section (performance tables/figures): No statistical significance tests, standard deviations over random seeds, or controls for total parameter count and training steps are reported. This leaves the central claim only partially supported despite the abstract's assertion of extensive evaluation across architectures and sizes.

minor comments (2)

[Experimental Setup] Clarify the exact definition of the 'low-data regime' (e.g., number of training examples per task) in the experimental details.
[Figures] Add explicit legends or captions in figures comparing b_q, b_k, and b_v to improve readability of the performance gaps.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below, providing clarifications based on our experimental protocol and outlining the revisions we will incorporate to strengthen the presentation of results.

Referee: [Experimental Setup] Experimental Setup section: The protocol does not state that identical learning-rate schedules, optimizer states, gradient clipping, and total training steps were used for the b_v, b_q, and b_k bias-only runs. Without explicit equalization or ablation of these factors, the headline result that b_v fine-tuning is superior could be an artifact of better-tuned optimization dynamics rather than an intrinsic property of the value bias (directly addressing the weakest assumption and skeptic concern).

Authors: We confirm that all bias-only fine-tuning runs (for b_v, b_q, and b_k) used identical learning-rate schedules, the same optimizer (AdamW with default betas), the same gradient-clipping threshold, and the same total number of training steps; the step count was fixed by using the same number of epochs and the same batch size on the low-data training sets. This protocol was followed uniformly across all model architectures and scales to isolate the effect of the bias term being tuned. We will revise the Experimental Setup section to document these equalized settings explicitly and add a brief statement confirming the absence of differential hyperparameter tuning. revision: yes
Referee: [Results] Results section (performance tables/figures): No statistical significance tests, standard deviations over random seeds, or controls for total parameter count and training steps are reported. This leaves the central claim only partially supported despite the abstract's assertion of extensive evaluation across architectures and sizes.

Authors: We agree that reporting standard deviations and significance tests would further strengthen the results. In the revised manuscript, we will update the performance tables and figures to include means and standard deviations over at least three random seeds, along with statistical significance tests (e.g., paired t-tests) comparing b_v fine-tuning against b_q and b_k. For controls: the total number of trainable parameters is identical across b_q, b_k, and b_v fine-tuning because each targets bias vectors of the same dimensionality within the attention projection layers. Training steps were equalized by fixing the number of epochs and effective batch size for all compared methods. We will add an explicit paragraph in the Results section clarifying these controls. revision: yes
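
The reporting promised in this response (mean and standard deviation over seeds, plus a paired test) can be sketched with the Python standard library alone. The accuracy lists below are made-up placeholders used only to exercise the functions; they are not results from the paper, and the function names are hypothetical.

```python
import math
import statistics


def mean_std(scores):
    """Mean and sample standard deviation of per-seed scores."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t_statistic(a, b):
    """Paired t statistic for scores from the same seeds under two settings.

    Returns (t, degrees_of_freedom). Turning t into a p-value needs a
    t-distribution table or a statistics library.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long lists with at least 2 seeds")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Placeholder accuracies for three seeds (NOT taken from the paper).
acc_bv = [0.90, 0.88, 0.91]
acc_bq = [0.85, 0.86, 0.87]

t_stat, dof = paired_t_statistic(acc_bv, acc_bq)
```

With only three seeds the test has two degrees of freedom, so even a large t statistic should be read cautiously; more seeds would give a more convincing comparison.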

Circularity Check

0 steps flagged

No circularity: empirical performance comparison is independent of any self-referential derivation

The paper reports direct experimental results from bias-only fine-tuning runs on multiple LLMs (encoder- and decoder-only, up to 6.7B). The central claim—that updating b_v yields higher downstream accuracy than b_q or b_k in low-data regimes—is presented as an observed outcome of those runs, not as a mathematical derivation or fitted quantity defined in terms of itself. No equations, ansatzes, uniqueness theorems, or self-citations are invoked to derive the performance ordering; the ordering is simply measured. The abstract and reader's summary contain no load-bearing self-citation chain or renaming of a known result. The finding is therefore self-contained against external benchmarks (replication of the reported fine-tuning protocol) and receives the default non-circularity score.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim is supported by comparative fine-tuning experiments rather than new theoretical axioms or invented entities; no free parameters, domain assumptions, or postulated objects are introduced in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Our bias-efficient approach jointly considers both the angular change and magnitude change... I(b_T) = 1/L ∑ (1 − b_pre · b_post / max(‖b_pre‖², ‖b_post‖²))

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: fine-tuning b_v generally leads to higher downstream performance
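
The importance score quoted in the first passage above can be read as a per-layer average of one minus a normalized dot product. The sketch below is one possible reading, assuming the sum runs over L layers and that b_pre and b_post are the per-layer bias vectors before and after fine-tuning; the function name and the handling of all-zero vectors are assumptions, not the authors' implementation.

```python
def importance(pre_biases, post_biases):
    """Average over layers of 1 - (b_pre . b_post) / max(|b_pre|^2, |b_post|^2).

    pre_biases and post_biases are lists with one vector (list of floats)
    per layer. A layer whose two vectors are both all-zero contributes 0,
    which is an assumption made here to avoid division by zero.
    """
    if not pre_biases or len(pre_biases) != len(post_biases):
        raise ValueError("need the same non-zero number of layers on both sides")
    total = 0.0
    for b_pre, b_post in zip(pre_biases, post_biases):
        dot = sum(x * y for x, y in zip(b_pre, b_post))
        denom = max(sum(x * x for x in b_pre), sum(y * y for y in b_post))
        if denom > 0:
            total += 1.0 - dot / denom
    return total / len(pre_biases)


unchanged = importance([[1.0, 0.0]], [[1.0, 0.0]])   # no change at all
rotated = importance([[1.0, 0.0]], [[0.0, 1.0]])     # pure angular change
rescaled = importance([[1.0, 0.0]], [[2.0, 0.0]])    # pure magnitude change
```

Dividing by the larger squared norm rather than the product of norms is what lets a pure rescaling register as a change, so the score reacts to both the angle and the magnitude of the update.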

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 2 internal anchors

[3] Aghajanyan, A.; Gupta, S.; and Zettlemoyer, L. 2021. Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 7319–7328.
[4] Ansell, A.; Ponti, E.; Korhonen, A.; and Vulić, I. 2022. Composable Sparse Fine-Tuning for Cross-Lingual Transfer. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1778–1796.
[5] Bini, M.; Girrbach, L.; and Akata, Z. 2025. Decoupling Angles and Strength in Low-rank Adaptation. In The Thirteenth International Conference on Learning Representations.
[6] Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J. D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33: 1877–1901.
[7] Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186.
[8] Ding, N.; Qin, Y.; Yang, G.; Wei, F.; Yang, Z.; Su, Y.; Hu, S.; Chen, Y.; Chan, C.-M.; Chen, W.; et al. 2023. Parameter-efficient fine-tuning of large-scale pre-trained language models. Nature Machine Intelligence, 5(3): 220–235.
[9] Doering, N.; Gorlla, C.; Tuttle, T.; and Vijay, A. 2024. Empirical Analysis of Efficient Fine-Tuning Methods for Large Pre-Trained Language Models. arXiv preprint arXiv:2401.04051.
[10] Dua, D.; Wang, Y.; Dasigi, P.; Stanovsky, G.; Singh, S.; and Gardner, M. 2019. DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2368–2378.
[11] Guo, W.; Long, J.; Zeng, Y.; Liu, Z.; Yang, X.; Ran, Y.; Gardner, J. R.; Bastani, O.; De Sa, C.; Yu, X.; et al. 2025. Zeroth-Order Fine-Tuning of LLMs with Transferable Static Sparsity. In The Thirteenth International Conference on Learning Representations.
[12] Houlsby, N.; Giurgiu, A.; Jastrzebski, S.; Morrone, B.; De Laroussilhe, Q.; Gesmundo, A.; Attariyan, M.; and Gelly, S. 2019. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning, 2790–2799. PMLR.
[13] Hu, E. J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; and Chen, W. 2022. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations.
[14] Lester, B.; Al-Rfou, R.; and Constant, N. 2021. The Power of Scale for Parameter-Efficient Prompt Tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 3045–3059.
[15] Li, X. L.; and Liang, P. 2021. Prefix-Tuning: Optimizing Continuous Prompts for Generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 4582–4597.
[16] Liu, H.; Tam, D.; Muqeeth, M.; Mohta, J.; Huang, T.; Bansal, M.; and Raffel, C. A. 2022. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Advances in Neural Information Processing Systems, 35: 1950–1965.
[17] Liu, S.-Y.; Wang, C.-Y.; Yin, H.; Molchanov, P.; Wang, Y.-C. F.; Cheng, K.-T.; and Chen, M.-H. 2024. DoRA: Weight-decomposed low-rank adaptation. In Forty-first International Conference on Machine Learning.
[18] Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
[19] Logan IV, R.; Balažević, I.; Wallace, E.; Petroni, F.; Singh, S.; and Riedel, S. 2022. Cutting Down on Prompts and Parameters: Simple Few-Shot Learning with Language Models. In Findings of the Association for Computational Linguistics: ACL 2022, 2824–2835.
[20] Mosbach, M.; Andriushchenko, M.; and Klakow, D. 2021. On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines. In International Conference on Learning Representations.
[21] Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2383–2392.
[22] Sung, Y.-L.; Nair, V.; and Raffel, C. A. 2021. Training neural networks with fixed sparse masks. Advances in Neural Information Processing Systems, 34: 24193–24205.
[23] Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.
[24] Wang, A.; Pruksachatkun, Y.; Nangia, N.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; and Bowman, S. 2019a. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. Advances in Neural Information Processing Systems, 32.
[25] Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; and Bowman, S. R. 2019b. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In International Conference on Learning Representations.
[26] Weng, Y. R.; Kerrigan, S.; and Yang, K. 2024. BitFit+: Fine-tuning bias and gamma parameters.
[27] Xue, K.; Dong, M.; Tu, X.; and He, T. 2025. FISH-Tuning: Enhancing PEFT Methods with Fisher Information. arXiv preprint arXiv:2504.04050.
[28] Yang, J.; Jin, H.; Tang, R.; Han, X.; Feng, Q.; Jiang, H.; Zhong, S.; Yin, B.; and Hu, X. 2024. Harnessing the power of LLMs in practice: A survey on ChatGPT and beyond. ACM Transactions on Knowledge Discovery from Data, 18(6): 1–32.
[29] Zaken, E. B.; Goldberg, Y.; and Ravfogel, S. 2022. BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 1–9.
[30] Zhang, S.; Roller, S.; Goyal, N.; Artetxe, M.; Chen, M.; Chen, S.; Dewan, C.; Diab, M.; Li, X.; Lin, X. V.; et al. 2022. OPT: Open Pre-trained Transformer Language Models. arXiv preprint arXiv:2205.01068.
[31] Zhao, J.; Zhang, Z.; Chen, B.; Wang, Z.; Anandkumar, A.; and Tian, Y. 2024. GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection. In Forty-first International Conference on Machine Learning.
