arxiv: 2511.21285 · v3 · pith:FHPN6PIUnew · submitted 2025-11-26 · 💻 cs.CL

PEFT-Bench: A Parameter-Efficient Fine-Tuning Methods Benchmark

Robert Belanec, Branislav Pecher, Ivan Srba, Maria Bielikova

Pith reviewed 2026-05-17 05:04 UTC · model grok-4.3

classification 💻 cs.CL

keywords parameter-efficient fine-tuning · PEFT methods · benchmark · large language models · NLP evaluation · cost-aware metrics · fine-tuning efficiency · autoregressive models

The pith

PEFT-Bench offers a standardized way to compare parameter-efficient fine-tuning methods for large language models while factoring in training and inference costs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PEFT-Bench as a unified end-to-end benchmark for testing various parameter-efficient fine-tuning methods on autoregressive large language models. It applies the benchmark to 27 natural language processing datasets using 7 different methods to demonstrate consistent evaluation. The authors also introduce the PEFT Soft Cost Penalties metric to balance accuracy against the number of trainable parameters, inference speed, and training memory usage. This approach addresses limitations in current evaluations that are often narrow in scope and hard to reproduce. A reader would care because it provides concrete guidance for selecting fine-tuning strategies that maintain performance without excessive computational demands.

Core claim

The paper claims that PEFT-Bench, applied across 27 NLP datasets and 7 PEFT methods on autoregressive LLMs, combined with the PSCP metric that incorporates trainable parameters, inference speed, and training memory, enables more reproducible and practical comparisons of these methods than prior limited evaluations.

What carries the argument

PEFT-Bench, the unified end-to-end benchmark, and the PEFT Soft Cost Penalties (PSCP) metric that weights performance by training and inference costs.
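
As a rough illustration of what a cost-penalized score of this kind could look like, the Python sketch below multiplies a task score by one soft penalty each for trainable parameters, inference time, and training memory. The functional form, the reference costs, the exponents, and the two example methods are all assumptions made for this example; the paper's actual PSCP definition may differ and should be taken from the source.

```python
from dataclasses import dataclass


@dataclass
class MethodResult:
    """Measurements for one PEFT method (all values here are illustrative)."""
    name: str
    score: float             # task performance in [0, 1]
    trainable_params: float  # number of trainable parameters
    inference_s: float       # seconds per inference batch
    train_mem_gb: float      # peak training memory in GB


def soft_penalty(cost: float, reference: float, beta: float) -> float:
    """Hypothetical penalty in (0, 1]: equals 1 at zero cost, shrinks as cost grows."""
    return (1.0 + cost / reference) ** (-beta)


def penalized_score(r: MethodResult, refs: dict, betas: dict) -> float:
    """Task score multiplied by one soft penalty per cost factor (assumed form)."""
    return (
        r.score
        * soft_penalty(r.trainable_params, refs["params"], betas["params"])
        * soft_penalty(r.inference_s, refs["inference"], betas["inference"])
        * soft_penalty(r.train_mem_gb, refs["memory"], betas["memory"])
    )


# Made-up reference costs and exponents, chosen only for the demonstration.
REFS = {"params": 1e8, "inference": 1.0, "memory": 40.0}
BETAS = {"params": 1.0, "inference": 1.0, "memory": 1.0}

cheap = MethodResult("cheap", 0.80, 1e6, 0.5, 20.0)
costly = MethodResult("costly", 0.82, 1e8, 0.5, 20.0)

for m in (cheap, costly):
    print(m.name, round(penalized_score(m, REFS, BETAS), 4))
```

With these invented numbers, the slightly more accurate but far more parameter-hungry method ends up with the lower penalized score, which is the kind of trade-off a metric like this is meant to expose.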

If this is right

PEFT methods can now be ranked consistently across tasks instead of relying on scattered individual studies.
The PSCP metric produces efficiency-aware rankings that favor methods with lower memory and faster inference.
Researchers gain a shared testbed that makes it easier to identify which fine-tuning approaches scale to new tasks.
Adoption of the benchmark could reduce redundant experiments and improve comparability in the field.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If widely used, the benchmark might steer new PEFT designs toward explicit optimization of the PSCP factors.
The approach could extend naturally to measuring long-term deployment costs beyond initial training.
Rankings from this setup might differ when tested on specialized domains or much larger model scales.

Load-bearing premise

The 27 datasets and 7 methods chosen are representative enough to support general claims about PEFT method quality, and the specific cost weightings in the PSCP metric match practical needs.

What would settle it

Re-evaluating the same methods on a new collection of datasets or models yields substantially different performance rankings or shows that high-PSCP methods fail in real-world low-resource deployments.
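
One way to make "substantially different rankings" concrete is a rank-correlation check between the original ordering and the re-evaluated one. The sketch below computes Kendall's tau for two rankings of the same methods. The method labels and orderings are invented for illustration, and any threshold for what counts as "substantially different" would be a judgment call rather than something the paper specifies.

```python
from itertools import combinations


def kendall_tau(rank_a: list[str], rank_b: list[str]) -> float:
    """Kendall's tau between two orderings of the same items (no ties)."""
    if set(rank_a) != set(rank_b) or len(rank_a) != len(set(rank_a)):
        raise ValueError("rankings must contain the same unique items")
    if len(rank_a) < 2:
        raise ValueError("need at least two items to compare rankings")
    pos_a = {m: i for i, m in enumerate(rank_a)}
    pos_b = {m: i for i, m in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        # A pair is concordant when both rankings order x and y the same way.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical method labels; these are not results from the paper.
original = ["method_a", "method_b", "method_c", "method_d"]
replicated = ["method_a", "method_c", "method_b", "method_d"]

print(kendall_tau(original, replicated))
```

A value of 1 means the two orderings agree on every pair and -1 means they are fully reversed, so values well below 1 on a new dataset collection would point toward the failure case described above.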

Figures

Figures reproduced from arXiv: 2511.21285 by Branislav Pecher, Ivan Srba, Maria Bielikova, Robert Belanec.

**Figure 1.** Diagram describing the methodology of PEFT-Bench. Blue components represent our contributions. We design PEFT-Factory, a framework based on LLaMa-Factory (Zheng et al., 2024) backbone to implement off-the-shelf methods from the HuggingFace PEFT library and an easy-to-use interface for new PEFT methods. Using these methods, we train LLaMa on selected datasets, which we have also included in the backbone. Af…

**Figure 2.** A diagram showing the overview and cate…

**Figure 3.** We evaluate methods from additive, reparametrized, and selective PEFT categories. The diagram shows the categorization of each method. 3.2 PEFT Methods and Pretrained Models With the popularization of PEFT, many new methods are being introduced within a short period of time (Xu et al., 2023; Prottasha et al., 2025). Therefore, it is computationally expensive to evaluate them all. We design our PEFT metho…

**Figure 4.** Bar chart showing the stability of different PEFT methods on 4 low-resource datasets. IA…

Original abstract

Despite the state-of-the-art performance of Large Language Models (LLMs) achieved on many tasks, their massive scale often leads to high computational and environmental costs, limiting their accessibility. Parameter-Efficient Fine-Tuning (PEFT) methods address this challenge by reducing the number of trainable parameters while maintaining strong downstream performance. Despite the advances in PEFT methods, current evaluations remain limited (in terms of evaluated models and datasets) and difficult to reproduce. To bridge this gap, we introduce PEFT-Bench, a unified end-to-end benchmark for evaluating diverse PEFT methods on autoregressive LLMs. We demonstrate its usage across 27 NLP datasets and 7 PEFT methods. To account for different PEFT training and inference factors, we also introduce the PEFT Soft Cost Penalties (PSCP) metric, which takes trainable parameters, inference speed, and training memory usage into account.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PEFT-Bench gives a unified testbed for PEFT methods on autoregressive LLMs plus a simple cost metric, but the dataset and method choices look narrow and the metric lacks justification for its weights.

The main point is that the authors built PEFT-Bench to run the same end-to-end evaluation on seven PEFT methods across twenty-seven NLP datasets for autoregressive models, and they added the PSCP score that folds in trainable parameters, inference speed, and training memory. This setup directly targets the scattered, hard-to-reproduce comparisons that show up in most PEFT papers right now. Having one place where people can run the same pipeline is a practical step forward for anyone who needs to pick an adaptation method under real resource limits. The metric itself is transparent and uses only observable quantities, which avoids the circularity problems that sometimes appear in efficiency claims. The paper demonstrates the benchmark on concrete tasks and reports the resulting rankings, which at least lets readers see how the numbers move when cost factors are included. That demonstration is the clearest contribution.

The soft spots are in the selection and validation steps. Twenty-seven datasets and seven methods do not automatically guarantee coverage of task diversity or method families, and the paper does not show strong evidence that these choices avoid selection bias or that the PSCP weights remain stable under different deployment constraints. No sensitivity checks or variance analysis across runs appear in the description, so it is hard to tell how much the final orderings depend on the specific setup.

Readers who already work on efficient LLM adaptation will get the most out of this, because they can plug their own methods into the same framework and compare directly. The work is coherent on its own terms and shows clear thinking about what a usable benchmark needs, even if the current scope stays inside standard NLP tasks. I would send it to peer review so that reviewers can check the implementation details and push on the representativeness and metric justification.

Referee Report

3 major / 1 minor

Summary. The paper introduces PEFT-Bench, a unified end-to-end benchmark for evaluating diverse Parameter-Efficient Fine-Tuning (PEFT) methods on autoregressive LLMs. It demonstrates the benchmark across 27 NLP datasets and 7 PEFT methods, and proposes the PEFT Soft Cost Penalties (PSCP) metric that incorporates trainable parameters, inference speed, and training memory usage to support more comprehensive evaluations.

Significance. If the benchmark implementation includes proper statistical validation, variance handling, and the PSCP metric is shown to produce stable, practically useful rankings, this work could provide a valuable standardized framework for fair and reproducible comparisons of PEFT methods, helping address current limitations in evaluation scope within the field.

major comments (3)

Abstract: The abstract states the benchmark scope and introduces PSCP but provides no details on implementation, statistical validation of the metric, handling of variance across runs, or justification for dataset and method selection; this leaves the central claim of improved reproducibility and fair comparison without sufficient support.
§3 (Dataset and Method Selection): The representativeness of the 27 NLP datasets and 7 PEFT methods for drawing general conclusions about PEFT method quality is not evidenced; without analysis of task diversity (e.g., classification, generation, reasoning) or coverage of major PEFT families (LoRA variants, adapters, prefix-tuning), the resulting rankings may not generalize.
§4 (PSCP Metric Definition): The PSCP metric combines trainable parameters, inference speed, and training memory without explicit justification for the weighting scheme, sensitivity analysis, or comparison to existing cost models; this risks arbitrary and unstable rankings under different application constraints.

minor comments (1)

Notation and formulas: The mathematical definition of the PSCP metric would benefit from a clearer, self-contained equation to improve reproducibility and ease of implementation by other researchers.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, indicating where revisions will be incorporated to improve clarity and support for our claims.

Referee: Abstract: The abstract states the benchmark scope and introduces PSCP but provides no details on implementation, statistical validation of the metric, handling of variance across runs, or justification for dataset and method selection; this leaves the central claim of improved reproducibility and fair comparison without sufficient support.

Authors: The abstract is kept concise to summarize the core contributions. Implementation details appear in Sections 3 and 4, and the experimental protocol includes multiple runs with different random seeds to report averaged results. To directly address the concern, we will revise the abstract to briefly note the multi-run evaluation for variance handling and the selection rationale for datasets and methods. revision: yes
Referee: §3 (Dataset and Method Selection): The representativeness of the 27 NLP datasets and 7 PEFT methods for drawing general conclusions about PEFT method quality is not evidenced; without analysis of task diversity (e.g., classification, generation, reasoning) or coverage of major PEFT families (LoRA variants, adapters, prefix-tuning), the resulting rankings may not generalize.

Authors: Section 3 describes the 27 datasets covering classification, generation, and reasoning tasks drawn from established benchmarks, along with 7 PEFT methods spanning adapter, LoRA, and prefix-tuning families. We agree an explicit diversity analysis is beneficial and will add a dedicated paragraph in §3 with categorization tables and references to demonstrate coverage of major categories. revision: yes
Referee: §4 (PSCP Metric Definition): The PSCP metric combines trainable parameters, inference speed, and training memory without explicit justification for the weighting scheme, sensitivity analysis, or comparison to existing cost models; this risks arbitrary and unstable rankings under different application constraints.

Authors: The PSCP weights prioritize trainable parameters as the dominant efficiency factor in PEFT settings, with secondary terms for memory and speed. We will expand §4 to include explicit justification for the chosen weights, results from sensitivity analysis under varied constraints, and direct comparisons to prior cost models in the PEFT literature. revision: yes
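
The sensitivity analysis promised in the last response could take roughly the following shape: sweep the weight on one cost factor and record whether the top-ranked method changes. Everything in this Python sketch is assumed for illustration, including the penalty form, the two toy methods, and the grid of weights; it is not the authors' procedure.

```python
def penalized(score: float, costs: dict, betas: dict) -> float:
    """Score times soft penalties; costs are pre-normalized to reference values (assumed form)."""
    out = score
    for factor, cost in costs.items():
        out *= (1.0 + cost) ** (-betas[factor])
    return out


def top_method(methods: dict, betas: dict) -> str:
    """Name of the method with the highest penalized score under the given weights."""
    return max(methods, key=lambda name: penalized(*methods[name], betas))


# Toy data: name -> (task score, normalized costs). All values are invented.
METHODS = {
    "accurate_but_heavy": (0.85, {"params": 1.0, "memory": 0.5, "inference": 0.2}),
    "light_but_weaker": (0.70, {"params": 0.1, "memory": 0.3, "inference": 0.2}),
}

winners = {}
for beta_params in (0.0, 0.25, 0.5, 1.0, 2.0):
    # Memory and inference weights are held at zero to isolate the parameter weight.
    betas = {"params": beta_params, "memory": 0.0, "inference": 0.0}
    winners[beta_params] = top_method(METHODS, betas)
    print(f"beta_params={beta_params}: {winners[beta_params]}")

# A winner that flips inside a plausible weight range is a sign of sensitivity.
print("ranking stable:", len(set(winners.values())) == 1)
```

In this toy setting the winner changes as the parameter weight grows, which is the behavior a reported sensitivity analysis would need to either rule out or quantify for the real benchmark data.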

Circularity Check

0 steps flagged

No circularity: benchmark and metric defined from observables without self-referential reduction

The paper's core contribution is the creation of PEFT-Bench as an evaluation framework and the PSCP metric, both constructed directly from observable quantities (trainable parameters, inference speed, training memory) rather than any derived prediction or equation that loops back to inputs. No load-bearing derivations, fitted parameters renamed as predictions, or self-citation chains appear in the described claims. The usage across 27 datasets and 7 methods is an empirical demonstration, not a self-definitional or uniqueness-imported result. The paper remains self-contained against external benchmarks for its stated purpose of providing a unified evaluation setup.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The paper rests on the domain assumption that the selected datasets and methods are representative and on the introduction of a new composite metric whose weighting is not externally validated in the abstract.

axioms (1)

domain assumption: The 27 NLP datasets and 7 PEFT methods provide a representative sample for general PEFT evaluation.
Invoked when demonstrating usage across these resources to support claims of unified evaluation.

invented entities (1)

PEFT Soft Cost Penalties (PSCP) metric · no independent evidence
purpose: To produce a single score that accounts for trainable parameters, inference speed, and training memory usage when comparing PEFT methods.
Newly defined in the paper to address cost factors not captured by accuracy alone.

pith-pipeline@v0.9.0 · 5457 in / 1539 out tokens · 36721 ms · 2026-05-17T05:04:49.167491+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

We propose the PEFT Soft Cost Penalties metric (PSCP), which introduces a number of trainable parameters, memory usage, and inference speed in the final score calculation.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Low-Data Supervised Adaptation Outperforms Prompting for Cloud Segmentation Under Domain Shift
cs.CV 2026-04 unverdicted novelty 5.0

Supervised fine-tuning with 0.1% labeled data outperforms all 60 tested prompt variants for CLIPSeg cloud segmentation on satellite imagery under domain shift.

PEFT-Factory: Unified Parameter-Efficient Fine-Tuning of Autoregressive Large Language Models
cs.CL 2025-12 unverdicted novelty 5.0

PEFT-Factory supplies a ready-to-use, extensible codebase that unifies 19 PEFT methods and evaluation pipelines for fine-tuning large autoregressive language models.

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · cited by 2 Pith papers · 12 internal anchors

[3]

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, and 1 others. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774

[4]

Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. 2019. https://doi.org/10.18653/v1/N19-1245 MathQA: Towards interpretable math word problem solving with operation-based formalisms. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics:...

[5]

Akari Asai, Mohammadreza Salehi, Matthew Peters, and Hannaneh Hajishirzi. 2022. https://doi.org/10.18653/v1/2022.emnlp-main.446 ATTEMPT: Parameter-efficient multi-task tuning via attentional mixtures of soft prompts. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 6655--6672, Abu Dhabi, United Arab Emirat...

[6]

Roy Bar Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, and Idan Szpektor. 2006. The second PASCAL recognising textual entailment challenge

[7]

Robert Belanec, Simon Ostermann, Ivan Srba, and Maria Bielikova. 2025. Task prompt vectors: Effective initialization through multi-task soft prompt transfer. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 77--94. Springer

[8]

Luisa Bentivogli, Ido Dagan, Hoa Trang Dang, Danilo Giampiccolo, and Bernardo Magnini. 2009. The fifth PASCAL recognizing textual entailment challenge

[9]

Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, and 1 others. 2020. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 7432--7439

[10]

Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. https://doi.org/10.18653/v1/S17-2001 SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1--14, Vancouver, Canada. ACL

[11]

Sahil Chaudhary. 2023. Code alpaca: An instruction-following llama model for code generation. https://github.com/sahil280114/codealpaca

[12]

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. https://doi.org/10.18653/v1/N19-1300 BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language...

[13]

Charles W Cobb and Paul H Douglas. 1928. A theory of production. The American economic review, 18(1):139--165

[14]

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, and 1 others. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168

[15]

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2005. https://doi.org/10.1007/11736790_9 The PASCAL recognising textual entailment challenge. In Proceedings of the First International Conference on Machine Learning Challenges: Evaluating Predictive Uncertainty Visual Object Classification, and Recognizing Textual Entailment, MLCW'05, page 177--190, Berlin...

[16]

Marie-Catherine De Marneffe, Mandy Simons, and Judith Tonhauser. 2019. The commitmentbank: Investigating projection in naturally occurring discourse. In proceedings of Sinn und Bedeutung, volume 23, pages 107--124

[17]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. https://doi.org/10.18653/v1/N19-1423 BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long a...

[18]

Ning Ding, Yujia Qin, Guang Yang, Fuchao Wei, Zonghan Yang, Yusheng Su, Shengding Hu, Yulin Chen, Chi-Min Chan, Weize Chen, and 1 others. 2023. Parameter-efficient fine-tuning of large-scale pre-trained language models. Nature Machine Intelligence, 5(3):220--235

[19]

William B Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the International Workshop on Paraphrasing

[20]

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, and 1 others. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783

[21]

William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1--39

[22]

Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. 2007. The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL workshop on textual entailment and paraphrasing, pages 1--9. Association for Computational Linguistics

[23]

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. https://openreview.net/forum?id=d7KBjmI3GmQ Measuring massive multitask language understanding. In International Conference on Learning Representations

[24]

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, and 1 others. 2022. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3

[25]

Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, and 1 others. 2023. Mistral 7b. arXiv preprint arXiv:2310.06825

[26]

Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. 2018. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), page...

[27]

Tushar Khot, Ashish Sabharwal, and Peter Clark. 2019. https://doi.org/10.18653/v1/D19-1281 What's missing: A knowledge gap guided approach for multi-hop question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), p...

[28]

Pengxiang Lan, Haoyu Xu, Enneng Yang, Yuliang Liang, Guibing Guo, Jianzhe Zhao, and Xingwei Wang. 2025. https://aclanthology.org/2025.naacl-long.225/ Efficient and effective prompt tuning via prompt decomposition and compressed outer product. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational...

[29]

Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. https://doi.org/10.18653/v1/2021.emnlp-main.243 The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3045--3059, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics

[30]

Hector J Levesque, Ernest Davis, and Leora Morgenstern. 2011. The Winograd schema challenge. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, volume 46, page 47

[31]

Xiang Lisa Li and Percy Liang. 2021. https://doi.org/10.18653/v1/2021.acl-long.353 Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582--4597, Onl...

[32]

Vladislav Lialin, Vijeta Deshpande, and Anna Rumshisky. 2023. Scaling down to scale up: A guide to parameter-efficient fine-tuning. arXiv preprint arXiv:2303.15647

[33]

Vijay Lingam, Atula Tejaswi Neerkaje, Aditya Vavre, Aneesh Shetty, Gautham Krishna Gudur, Joydeep Ghosh, Eunsol Choi, Alex Dimakis, Aleksandar Bojchevski, and Sujay Sanghavi. 2024. https://openreview.net/forum?id=DOUskwCqg5 SVFT: Parameter-efficient fine-tuning with singular vectors. In 2nd Workshop on Advancing Neural Network Training: Computational Ef...

[34]

Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, and Colin A Raffel. 2022a. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Advances in Neural Information Processing Systems, 35:1950--1965

[35]

Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023a. https://doi.org/10.1145/3560815 Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Comput. Surv., 55(9)

[36]

Xiao Liu, Kaixuan Ji, Yicheng Fu, Weng Tam, Zhengxiao Du, Zhilin Yang, and Jie Tang. 2022b. https://doi.org/10.18653/v1/2022.acl-short.8 P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 61--68, Dub...

[37]

Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. 2023b. Gpt understands, too. AI Open

[38]

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692

[39]

Ilya Loshchilov and Frank Hutter. 2019. https://openreview.net/forum?id=Bkg6RiCqY7 Decoupled weight decay regularization. In International Conference on Learning Representations

[40]

Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, Sayak Paul, and Benjamin Bossan. 2022. Peft: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft

[41]

Arkil Patel, Satwik Bhattamishra, and Navin Goyal. 2021. https://doi.org/10.18653/v1/2021.naacl-main.168 Are NLP models really able to solve simple math word problems? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2080--2094, Online. Association for ...

[42]

Mohammad Taher Pilehvar and Jose Camacho-Collados. 2019. https://doi.org/10.18653/v1/N19-1128 WiC: the word-in-context dataset for evaluating context-sensitive meaning representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and S...

[43]

Nusrat Jahan Prottasha, Upama Roy Chowdhury, Shetu Mohanto, Tasfia Nuzhat, Abdullah As Sami, Md Shamol Ali, Md Shohanur Islam Sobuj, Hafijur Raman, Md Kowsher, and Ozlem Ozmen Garibay. 2025. Peft a2z: Parameter-efficient fine-tuning survey for large language and vision models. arXiv preprint arXiv:2504.14117

[44]

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, and 1 others. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9

[45]

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1--67

[46]

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of EMNLP, pages 2383--2392. Association for Computational Linguistics

[47]

Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, and Shuai Ma. 2020. Codebleu: a method for automatic evaluation of code synthesis. arXiv preprint arXiv:2009.10297

[48]

Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S Gordon. 2011. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In AAAI spring symposium: logical formalizations of commonsense reasoning, pages 90--95

[49]

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2021. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99--106

[50]

Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. 2019. https://doi.org/10.18653/v1/D19-1454 Social IQa: Commonsense reasoning about social interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP),...

[51]

Zhengxiang Shi and Aldo Lipani. 2024. https://openreview.net/forum?id=KjegfPGRde DePT: Decomposed prompt tuning for parameter-efficient fine-tuning. In The Twelfth International Conference on Learning Representations

[52]

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on EMNLP, pages 1631--1642

[53]

Qi Sun, Edoardo Cetin, and Yujin Tang. 2025. https://openreview.net/forum?id=dh4t9qmcvK Transformer-squared: Self-adaptive LLMs. In The Thirteenth International Conference on Learning Representations

[54]

Pengwei Tang, Xiaolin Hu, and Yong Liu. 2025. https://openreview.net/forum?id=fswihJIYbd ADePT: Adaptive decomposed prompt tuning for parameter-efficient fine-tuning. In The Thirteenth International Conference on Learning Representations

[55]

Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, and 1 others. 2024. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530

[56]

A Vaswani. 2017. Attention is all you need. Advances in Neural Information Processing Systems

[57]

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. Advances in Neural Information Processing Systems, 32

[58]

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461

[59]

Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2019. https://doi.org/10.1162/tacl_a_00290 Neural network acceptability judgments. Transactions of the ACL, 7:625--641

[60]

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. https://doi.org/10.18653/v1/N18-1101 A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the ACL: Human Language Technologies, Volume 1 (Long Papers), pages 1112--1122, New Orleans, Louisiana. ACL

[61]

Yi Xin, Siqi Luo, Xuyang Liu, Haodi Zhou, Xinyu Cheng, Christina E Lee, Junlong Du, Haozhe Wang, MingCai Chen, Ting Liu, and 1 others. 2024. V-PETL Bench: A unified visual parameter-efficient transfer learning benchmark. Advances in Neural Information Processing Systems, 37:80522--80535

[62]

Lingling Xu, Haoran Xie, Si-Zhao Joe Qin, Xiaohui Tao, and Fu Lee Wang. 2023. Parameter-efficient fine-tuning methods for pretrained language models: A critical review and assessment. arXiv preprint arXiv:2312.12148

[63]

An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, and 39 others. 2024. https://api.semanticscholar.org/CorpusID:271212307 Qwen2 technical report. ArXiv, abs/2407.10671

[64]

Pengcheng Yin, Bowen Deng, Edgar Chen, Bogdan Vasilescu, and Graham Neubig. 2018. Learning to mine aligned code and natural language pairs from stack overflow. In Proceedings of the 15th international conference on mining software repositories, pages 476--486

[65]

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. https://doi.org/10.18653/v1/P19-1472 HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791--4800, Florence, Italy. Association for Computational Linguistics

[66]

Jia-Chen Zhang, Yu-Jie Xiong, Chun-Ming Xia, Dong-Hai Zhu, and Xi-He Qiu. 2025a. https://aclanthology.org/2025.coling-main.265/ Parameter-efficient fine-tuning of large language models via deconvolution in subspace. In Proceedings of the 31st International Conference on Computational Linguistics, pages 3924--3935, Abu Dhabi, UAE. Association for Comput...

[67]

Pieyi Zhang, Richong Zhang, and Zhijie Nie. 2025b. Dynamic task vector grouping for efficient multi-task prompt tuning. arXiv preprint arXiv:2503.18063

[68]

Sheng Zhang, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Kevin Duh, and Benjamin Van Durme. 2018. ReCoRD: Bridging the gap between human and machine commonsense reading comprehension. arXiv preprint arXiv:1810.12885

[69]

Yi-Kai Zhang, Lu Ren, Chao Yi, Qi-Wei Wang, De-Chuan Zhan, and Han-Jia Ye. 2023. Zhijian: A unifying and rapidly deployable toolbox for pre-trained model reuse. arXiv preprint arXiv:2308.09158

[70]

Bingchen Zhao, Haoqin Tu, Chen Wei, Jieru Mei, and Cihang Xie. 2024. https://openreview.net/forum?id=YR3ETaElNK Tuning layernorm in attention: Towards efficient multi-modal LLM finetuning. In The Twelfth International Conference on Learning Representations

[71]

Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. 2024. http://arxiv.org/abs/2403.13372 LlamaFactory: Unified efficient fine-tuning of 100+ language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Bangkok, Thailand. Assoc...
