pith. machine review for the scientific record.

arxiv: 2511.21285 · v3 · pith:FHPN6PIUnew · submitted 2025-11-26 · 💻 cs.CL

PEFT-Bench: A Parameter-Efficient Fine-Tuning Methods Benchmark

Pith reviewed 2026-05-17 05:04 UTC · model grok-4.3

classification 💻 cs.CL
keywords parameter-efficient fine-tuning · PEFT methods · benchmark · large language models · NLP evaluation · cost-aware metrics · fine-tuning efficiency · autoregressive models

The pith

PEFT-Bench offers a standardized way to compare parameter-efficient fine-tuning methods for large language models while factoring in training and inference costs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PEFT-Bench as a unified end-to-end benchmark for testing various parameter-efficient fine-tuning methods on autoregressive large language models. It applies the benchmark to 27 natural language processing datasets using 7 different methods to demonstrate consistent evaluation. The authors also introduce the PEFT Soft Cost Penalties metric to balance accuracy against the number of trainable parameters, inference speed, and training memory usage. This approach addresses limitations in current evaluations that are often narrow in scope and hard to reproduce. A reader would care because it provides concrete guidance for selecting fine-tuning strategies that maintain performance without excessive computational demands.

Core claim

The paper claims that PEFT-Bench, applied across 27 NLP datasets and 7 PEFT methods on autoregressive LLMs, combined with the PSCP metric that incorporates trainable parameters, inference speed, and training memory, enables more reproducible and practical comparisons of these methods than prior limited evaluations.

What carries the argument

PEFT-Bench, the unified end-to-end benchmark, and the PEFT Soft Cost Penalties (PSCP) metric that weights performance by training and inference costs.
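
To make the idea of a cost-weighted score concrete, here is a minimal Python sketch. It is not the paper's PSCP formula, which this page does not reproduce. The multiplicative form, the reference budgets (`ref_params`, `ref_ms`, `ref_gb`), the exponents, and the two example methods with their numbers are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class MethodResult:
    """Measurements for one PEFT method (all example values are made up)."""
    name: str
    accuracy: float          # task performance in [0, 1]
    trainable_params: float  # number of trainable parameters
    inference_ms: float      # latency per example, in milliseconds
    train_memory_gb: float   # peak training memory, in GB


def soft_penalty(cost: float, reference: float, weight: float) -> float:
    """Factor in (0, 1] that shrinks smoothly as cost grows relative to reference.

    Hypothetical form: 1 / (1 + cost / reference) ** weight.
    """
    return 1.0 / (1.0 + cost / reference) ** weight


def penalized_score(result: MethodResult,
                    ref_params: float = 1e7, ref_ms: float = 50.0,
                    ref_gb: float = 24.0,
                    w_params: float = 1.0, w_ms: float = 1.0,
                    w_gb: float = 1.0) -> float:
    """Multiply accuracy by one soft penalty per cost factor (assumed form)."""
    score = result.accuracy
    score *= soft_penalty(result.trainable_params, ref_params, w_params)
    score *= soft_penalty(result.inference_ms, ref_ms, w_ms)
    score *= soft_penalty(result.train_memory_gb, ref_gb, w_gb)
    return score


# Two hypothetical methods: one cheap, one slightly more accurate but costly.
method_a = MethodResult("method_a", 0.80, 4e6, 40.0, 18.0)
method_b = MethodResult("method_b", 0.82, 7e9, 40.0, 70.0)

for r in (method_a, method_b):
    print(r.name, round(penalized_score(r), 4))
```

In this sketch, setting every weight to zero recovers plain accuracy, which makes it easy to see how much of a ranking comes from the penalties rather than from task performance.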

If this is right

  • PEFT methods can now be ranked consistently across tasks instead of relying on scattered individual studies.
  • The PSCP metric produces efficiency-aware rankings that favor methods with lower memory and faster inference.
  • Researchers gain a shared testbed that makes it easier to identify which fine-tuning approaches scale to new tasks.
  • Adoption of the benchmark could reduce redundant experiments and improve comparability in the field.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If widely used, the benchmark might steer new PEFT designs toward explicit optimization of the PSCP factors.
  • The approach could extend naturally to measuring long-term deployment costs beyond initial training.
  • Rankings from this setup might differ when tested on specialized domains or much larger model scales.

Load-bearing premise

The 27 datasets and 7 methods chosen are representative enough to support general claims about PEFT method quality, and the specific cost weightings in the PSCP metric match practical needs.

What would settle it

Re-evaluating the same methods on a new collection of datasets or models yields substantially different performance rankings or shows that high-PSCP methods fail in real-world low-resource deployments.
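
One way to quantify "substantially different rankings" is a rank correlation between the method orderings from two evaluation suites. The sketch below computes Kendall's tau (the tau-a variant, where tied pairs count as neither concordant nor discordant) in plain Python. The function name, the method names, and the scores are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def kendall_tau(scores_a: dict, scores_b: dict) -> float:
    """Kendall's tau-a between two score tables keyed by method name.

    Returns 1.0 for identical orderings and -1.0 for fully reversed ones.
    """
    methods = sorted(set(scores_a) & set(scores_b))
    if len(methods) < 2:
        raise ValueError("need at least two methods present in both tables")

    concordant = discordant = 0
    for m1, m2 in combinations(methods, 2):
        # The sign of the product tells whether both suites order the pair alike.
        agreement = (scores_a[m1] - scores_a[m2]) * (scores_b[m1] - scores_b[m2])
        if agreement > 0:
            concordant += 1
        elif agreement < 0:
            discordant += 1

    n_pairs = len(methods) * (len(methods) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores for three methods on two dataset collections.
first_suite = {"method_a": 0.9, "method_b": 0.8, "method_c": 0.7}
second_suite = {"method_a": 0.9, "method_b": 0.7, "method_c": 0.8}

print(kendall_tau(first_suite, second_suite))
```

A value near 1 would suggest the rankings transfer to the new collection; a value near 0 or below would support the concern raised here. Any threshold for "substantially different" is a judgment call, not something the paper specifies.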

Figures

Figures reproduced from arXiv: 2511.21285 by Branislav Pecher, Ivan Srba, Maria Bielikova, Robert Belanec.

Figure 1: Diagram describing the methodology of PEFT-Bench. Blue components represent our contributions. We design PEFT-Factory, a framework based on LLaMa-Factory (Zheng et al., 2024) backbone to implement off-the-shelf methods from the HuggingFace PEFT library and an easy-to-use interface for new PEFT methods. Using these methods, we train LLaMa on selected datasets, which we have also included in the backbone. Af…
Figure 2: A diagram showing the overview and cate…
Figure 3: We evaluate methods from additive, reparametrized, and selective PEFT categories. The diagram shows the categorization of each method. 3.2 PEFT Methods and Pretrained Models: With the popularization of PEFT, many new methods are being introduced within a short period of time (Xu et al., 2023; Prottasha et al., 2025). Therefore, it is computationally expensive to evaluate them all. We design our PEFT metho…
Figure 4: Bar chart showing the stability of different PEFT methods on 4 low-resource datasets. IA…
Original abstract

Despite the state-of-the-art performance of Large Language Models (LLMs) achieved on many tasks, their massive scale often leads to high computational and environmental costs, limiting their accessibility. Parameter-Efficient Fine-Tuning (PEFT) methods address this challenge by reducing the number of trainable parameters while maintaining strong downstream performance. Despite the advances in PEFT methods, current evaluations remain limited (in terms of evaluated models and datasets) and difficult to reproduce. To bridge this gap, we introduce PEFT-Bench, a unified end-to-end benchmark for evaluating diverse PEFT methods on autoregressive LLMs. We demonstrate its usage across 27 NLP datasets and 7 PEFT methods. To account for different PEFT training and inference factors, we also introduce the PEFT Soft Cost Penalties (PSCP) metric, which takes trainable parameters, inference speed, and training memory usage into account.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces PEFT-Bench, a unified end-to-end benchmark for evaluating diverse Parameter-Efficient Fine-Tuning (PEFT) methods on autoregressive LLMs. It demonstrates the benchmark across 27 NLP datasets and 7 PEFT methods, and proposes the PEFT Soft Cost Penalties (PSCP) metric that incorporates trainable parameters, inference speed, and training memory usage to support more comprehensive evaluations.

Significance. If the benchmark implementation includes proper statistical validation, variance handling, and the PSCP metric is shown to produce stable, practically useful rankings, this work could provide a valuable standardized framework for fair and reproducible comparisons of PEFT methods, helping address current limitations in evaluation scope within the field.

major comments (3)
  1. Abstract: The abstract states the benchmark scope and introduces PSCP but provides no details on implementation, statistical validation of the metric, handling of variance across runs, or justification for dataset and method selection; this leaves the central claim of improved reproducibility and fair comparison without sufficient support.
  2. §3 (Dataset and Method Selection): The representativeness of the 27 NLP datasets and 7 PEFT methods for drawing general conclusions about PEFT method quality is not evidenced; without analysis of task diversity (e.g., classification, generation, reasoning) or coverage of major PEFT families (LoRA variants, adapters, prefix-tuning), the resulting rankings may not generalize.
  3. §4 (PSCP Metric Definition): The PSCP metric combines trainable parameters, inference speed, and training memory without explicit justification for the weighting scheme, sensitivity analysis, or comparison to existing cost models; this risks arbitrary and unstable rankings under different application constraints.
minor comments (1)
  1. Notation and formulas: The mathematical definition of the PSCP metric would benefit from a clearer, self-contained equation to improve reproducibility and ease of implementation by other researchers.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, indicating where revisions will be incorporated to improve clarity and support for our claims.

point-by-point responses
  1. Referee: Abstract: The abstract states the benchmark scope and introduces PSCP but provides no details on implementation, statistical validation of the metric, handling of variance across runs, or justification for dataset and method selection; this leaves the central claim of improved reproducibility and fair comparison without sufficient support.

    Authors: The abstract is kept concise to summarize the core contributions. Implementation details appear in Sections 3 and 4, and the experimental protocol includes multiple runs with different random seeds to report averaged results. To directly address the concern, we will revise the abstract to briefly note the multi-run evaluation for variance handling and the selection rationale for datasets and methods. revision: yes

  2. Referee: §3 (Dataset and Method Selection): The representativeness of the 27 NLP datasets and 7 PEFT methods for drawing general conclusions about PEFT method quality is not evidenced; without analysis of task diversity (e.g., classification, generation, reasoning) or coverage of major PEFT families (LoRA variants, adapters, prefix-tuning), the resulting rankings may not generalize.

    Authors: Section 3 describes the 27 datasets covering classification, generation, and reasoning tasks drawn from established benchmarks, along with 7 PEFT methods spanning adapter, LoRA, and prefix-tuning families. We agree an explicit diversity analysis is beneficial and will add a dedicated paragraph in §3 with categorization tables and references to demonstrate coverage of major categories. revision: yes

  3. Referee: §4 (PSCP Metric Definition): The PSCP metric combines trainable parameters, inference speed, and training memory without explicit justification for the weighting scheme, sensitivity analysis, or comparison to existing cost models; this risks arbitrary and unstable rankings under different application constraints.

    Authors: The PSCP weights prioritize trainable parameters as the dominant efficiency factor in PEFT settings, with secondary terms for memory and speed. We will expand §4 to include explicit justification for the chosen weights, results from sensitivity analysis under varied constraints, and direct comparisons to prior cost models in the PEFT literature. revision: yes
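
The sensitivity analysis discussed in point 3 can be sketched as a sweep over weight settings that records how often the method ranking changes. The code below is only an illustration: the scoring form, the weight grid, and the three methods with their normalized costs are assumptions, not the paper's PSCP definition or results.

```python
from itertools import product

# Hypothetical measurements per method:
# (accuracy, params / reference, latency / reference, memory / reference)
METHODS = {
    "method_a": (0.80, 0.4, 0.8, 0.75),
    "method_b": (0.84, 3.0, 1.0, 1.5),
    "method_c": (0.78, 0.1, 1.2, 0.6),
}


def score(accuracy, costs, weights):
    """Assumed scoring form: accuracy divided by (1 + cost) ** weight per factor."""
    value = accuracy
    for cost, weight in zip(costs, weights):
        value /= (1.0 + cost) ** weight
    return value


def ranking(weights):
    """Method names ordered from best to worst under the given weights."""
    return tuple(sorted(
        METHODS,
        key=lambda name: score(METHODS[name][0], METHODS[name][1:], weights),
        reverse=True,
    ))


def sweep(grid=(0.0, 0.5, 1.0, 2.0)):
    """Count how many weight settings produce each distinct ranking."""
    counts = {}
    for weights in product(grid, repeat=3):
        order = ranking(weights)
        counts[order] = counts.get(order, 0) + 1
    return counts


for order, count in sorted(sweep().items(), key=lambda item: -item[1]):
    print(count, " > ".join(order))
```

If a single ranking dominates the sweep, the conclusions are robust to the weighting; if several rankings appear with similar counts, the choice of weights is doing much of the work, which is the risk the referee describes.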

Circularity Check

0 steps flagged

No circularity: benchmark and metric defined from observables without self-referential reduction

full rationale

The paper's core contribution is the creation of PEFT-Bench as an evaluation framework and the PSCP metric, both constructed directly from observable quantities (trainable parameters, inference speed, training memory) rather than any derived prediction or equation that loops back to inputs. No load-bearing derivations, fitted parameters renamed as predictions, or self-citation chains appear in the described claims. The usage across 27 datasets and 7 methods is an empirical demonstration, not a self-definitional or uniqueness-imported result. The paper remains self-contained against external benchmarks for its stated purpose of providing a unified evaluation setup.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The paper rests on the domain assumption that the selected datasets and methods are representative and on the introduction of a new composite metric whose weighting is not externally validated in the abstract.

axioms (1)
  • domain assumption: The 27 NLP datasets and 7 PEFT methods provide a representative sample for general PEFT evaluation.
    Invoked when demonstrating usage across these resources to support claims of unified evaluation.
invented entities (1)
  • PEFT Soft Cost Penalties (PSCP) metric (no independent evidence)
    purpose: To produce a single score that accounts for trainable parameters, inference speed, and training memory usage when comparing PEFT methods.
    Newly defined in the paper to address cost factors not captured by accuracy alone.

pith-pipeline@v0.9.0 · 5457 in / 1539 out tokens · 36721 ms · 2026-05-17T05:04:49.167491+00:00 · methodology



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Low-Data Supervised Adaptation Outperforms Prompting for Cloud Segmentation Under Domain Shift

    cs.CV · 2026-04 · unverdicted · novelty 5.0

    Supervised fine-tuning with 0.1% labeled data outperforms all 60 tested prompt variants for CLIPSeg cloud segmentation on satellite imagery under domain shift.

  2. PEFT-Factory: Unified Parameter-Efficient Fine-Tuning of Autoregressive Large Language Models

    cs.CL · 2025-12 · unverdicted · novelty 5.0

    PEFT-Factory supplies a ready-to-use, extensible codebase that unifies 19 PEFT methods and evaluation pipelines for fine-tuning large autoregressive language models.

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · cited by 2 Pith papers · 12 internal anchors

  3. [3]

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, and 1 others. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774

  4. [4]

    Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. 2019. https://doi.org/10.18653/v1/N19-1245 MathQA: Towards interpretable math word problem solving with operation-based formalisms. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics:...

  5. [5]

    Akari Asai, Mohammadreza Salehi, Matthew Peters, and Hannaneh Hajishirzi. 2022. https://doi.org/10.18653/v1/2022.emnlp-main.446 ATTEMPT: Parameter-efficient multi-task tuning via attentional mixtures of soft prompts. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 6655--6672, Abu Dhabi, United Arab Emirat...

  6. [6]

    Roy Bar Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, and Idan Szpektor. 2006. The second PASCAL recognising textual entailment challenge

  7. [7]

    Robert Belanec, Simon Ostermann, Ivan Srba, and Maria Bielikova. 2025. Task prompt vectors: Effective initialization through multi-task soft prompt transfer. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 77--94. Springer

  8. [8]

    Luisa Bentivogli, Ido Dagan, Hoa Trang Dang, Danilo Giampiccolo, and Bernardo Magnini. 2009. The fifth PASCAL recognizing textual entailment challenge

  9. [9]

    Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, and 1 others. 2020. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 7432--7439

  10. [10]

    Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. https://doi.org/10.18653/v1/S17-2001 SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1--14, Vancouver, Canada. ACL

  11. [11]

    Sahil Chaudhary. 2023. Code alpaca: An instruction-following llama model for code generation. https://github.com/sahil280114/codealpaca

  12. [12]

    Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. https://doi.org/10.18653/v1/N19-1300 BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language...

  13. [13]

    Charles W Cobb and Paul H Douglas. 1928. A theory of production. The American economic review, 18(1):139--165

  14. [14]

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, and 1 others. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168

  15. [15]

    Ido Dagan, Oren Glickman, and Bernardo Magnini. 2005. https://doi.org/10.1007/11736790_9 The pascal recognising textual entailment challenge. In Proceedings of the First International Conference on Machine Learning Challenges: Evaluating Predictive Uncertainty Visual Object Classification, and Recognizing Textual Entailment, MLCW'05, page 177–190, Berlin...

  16. [16]

    Marie-Catherine De Marneffe, Mandy Simons, and Judith Tonhauser. 2019. The commitmentbank: Investigating projection in naturally occurring discourse. In proceedings of Sinn und Bedeutung, volume 23, pages 107--124

  17. [17]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. https://doi.org/10.18653/v1/N19-1423 BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long a...

  18. [18]

    Ning Ding, Yujia Qin, Guang Yang, Fuchao Wei, Zonghan Yang, Yusheng Su, Shengding Hu, Yulin Chen, Chi-Min Chan, Weize Chen, and 1 others. 2023. Parameter-efficient fine-tuning of large-scale pre-trained language models. Nature Machine Intelligence, 5(3):220--235

  19. [19]

    William B Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the International Workshop on Paraphrasing

  20. [20]

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, and 1 others. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783

  21. [21]

    William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1--39

  22. [22]

    Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. 2007. The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL workshop on textual entailment and paraphrasing, pages 1--9. Association for Computational Linguistics

  23. [23]

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. https://openreview.net/forum?id=d7KBjmI3GmQ Measuring massive multitask language understanding. In International Conference on Learning Representations

  24. [24]

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, and 1 others. 2022. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3

  25. [25]

    Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, and 1 others. 2023. Mistral 7b. arXiv preprint arXiv:2310.06825

  26. [26]

    Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. 2018. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), page...

  27. [27]

    Tushar Khot, Ashish Sabharwal, and Peter Clark. 2019. https://doi.org/10.18653/v1/D19-1281 What's missing: A knowledge gap guided approach for multi-hop question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), p...

  28. [28]

    Pengxiang Lan, Haoyu Xu, Enneng Yang, Yuliang Liang, Guibing Guo, Jianzhe Zhao, and Xingwei Wang. 2025. https://aclanthology.org/2025.naacl-long.225/ Efficient and effective prompt tuning via prompt decomposition and compressed outer product. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational...

  29. [29]

    Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. https://doi.org/10.18653/v1/2021.emnlp-main.243 The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3045--3059, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics

  30. [30]

    Hector J Levesque, Ernest Davis, and Leora Morgenstern. 2011. The Winograd schema challenge. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, volume 46, page 47

  31. [31]

    Xiang Lisa Li and Percy Liang. 2021. https://doi.org/10.18653/v1/2021.acl-long.353 Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582--4597, Onl...

  32. [32]

    Vladislav Lialin, Vijeta Deshpande, and Anna Rumshisky. 2023. Scaling down to scale up: A guide to parameter-efficient fine-tuning. arXiv preprint arXiv:2303.15647

  33. [33]

    Vijay Lingam, Atula Tejaswi Neerkaje, Aditya Vavre, Aneesh Shetty, Gautham Krishna Gudur, Joydeep Ghosh, Eunsol Choi, Alex Dimakis, Aleksandar Bojchevski, and Sujay Sanghavi. 2024. https://openreview.net/forum?id=DOUskwCqg5 SVFT: Parameter-efficient fine-tuning with singular vectors. In 2nd Workshop on Advancing Neural Network Training: Computational Ef...

  34. [34]

    Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, and Colin A Raffel. 2022a. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Advances in Neural Information Processing Systems, 35:1950--1965

  35. [35]

    Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023a. https://doi.org/10.1145/3560815 Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Comput. Surv., 55(9)

  36. [36]

    Xiao Liu, Kaixuan Ji, Yicheng Fu, Weng Tam, Zhengxiao Du, Zhilin Yang, and Jie Tang. 2022b. https://doi.org/10.18653/v1/2022.acl-short.8 P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 61--68, Dub...

  37. [37]

    Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. 2023b. Gpt understands, too. AI Open

  38. [38]

    Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692

  39. [39]

    Ilya Loshchilov and Frank Hutter. 2019. https://openreview.net/forum?id=Bkg6RiCqY7 Decoupled weight decay regularization. In International Conference on Learning Representations

  40. [40]

    Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, Sayak Paul, and Benjamin Bossan. 2022. Peft: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft

  41. [41]

    Arkil Patel, Satwik Bhattamishra, and Navin Goyal. 2021. https://doi.org/10.18653/v1/2021.naacl-main.168 Are NLP models really able to solve simple math word problems? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2080--2094, Online. Association for ...

  42. [42]

    Mohammad Taher Pilehvar and Jose Camacho-Collados. 2019. https://doi.org/10.18653/v1/N19-1128 WiC: the word-in-context dataset for evaluating context-sensitive meaning representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and S...

  43. [43]

    Nusrat Jahan Prottasha, Upama Roy Chowdhury, Shetu Mohanto, Tasfia Nuzhat, Abdullah As Sami, Md Shamol Ali, Md Shohanur Islam Sobuj, Hafijur Raman, Md Kowsher, and Ozlem Ozmen Garibay. 2025. Peft a2z: Parameter-efficient fine-tuning survey for large language and vision models. arXiv preprint arXiv:2504.14117

  44. [44]

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, and 1 others. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9

  45. [45]

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1--67

  46. [46]

    Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of EMNLP, pages 2383--2392. Association for Computational Linguistics

  47. [47]

    Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, and Shuai Ma. 2020. Codebleu: a method for automatic evaluation of code synthesis. arXiv preprint arXiv:2009.10297

  48. [48]

    Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S Gordon. 2011. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In AAAI spring symposium: logical formalizations of commonsense reasoning, pages 90--95

  49. [49]

    Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2021. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99--106

  50. [50]

    Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. 2019. https://doi.org/10.18653/v1/D19-1454 Social IQa: Commonsense reasoning about social interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP),...

  51. [51]

    Zhengxiang Shi and Aldo Lipani. 2024. https://openreview.net/forum?id=KjegfPGRde DePT: Decomposed prompt tuning for parameter-efficient fine-tuning. In The Twelfth International Conference on Learning Representations

  52. [52]

    Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on EMNLP, pages 1631--1642

  53. [53]

    Qi Sun, Edoardo Cetin, and Yujin Tang. 2025. https://openreview.net/forum?id=dh4t9qmcvK Transformer-squared: Self-adaptive LLMs. In The Thirteenth International Conference on Learning Representations

  54. [54]

    Pengwei Tang, Xiaolin Hu, and Yong Liu. 2025. https://openreview.net/forum?id=fswihJIYbd ADePT: Adaptive decomposed prompt tuning for parameter-efficient fine-tuning. In The Thirteenth International Conference on Learning Representations

  55. [55]

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, and 1 others. 2024. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530

  56. [56]

    A Vaswani. 2017. Attention is all you need. Advances in Neural Information Processing Systems

  57. [57]

    Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019. Superglue: A stickier benchmark for general-purpose language understanding systems. Advances in neural information processing systems, 32

  58. [58]

    Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2018. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461

  59. [59]

    Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2019. https://doi.org/10.1162/tacl_a_00290 Neural network acceptability judgments. Transactions of the ACL, 7:625--641

  60. [60]

    Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. https://doi.org/10.18653/v1/N18-1101 A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the ACL: Human Language Technologies, Volume 1 (Long Papers), pages 1112--1122, New Orleans, Louisiana. ACL

  61. [61]

    Yi Xin, Siqi Luo, Xuyang Liu, Haodi Zhou, Xinyu Cheng, Christina E Lee, Junlong Du, Haozhe Wang, MingCai Chen, Ting Liu, and 1 others. 2024. V-petl bench: A unified visual parameter-efficient transfer learning benchmark. Advances in Neural Information Processing Systems, 37:80522--80535

  62. [62]

    Lingling Xu, Haoran Xie, Si-Zhao Joe Qin, Xiaohui Tao, and Fu Lee Wang. 2023. Parameter-efficient fine-tuning methods for pretrained language models: A critical review and assessment. arXiv preprint arXiv:2312.12148

  63. [63]

    An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, and 39 others. 2024. https://api.semanticscholar.org/CorpusID:271212307 Qwen2 technical report. ArXiv, abs/2407.10671

  64. [64]

    Pengcheng Yin, Bowen Deng, Edgar Chen, Bogdan Vasilescu, and Graham Neubig. 2018. Learning to mine aligned code and natural language pairs from stack overflow. In Proceedings of the 15th international conference on mining software repositories, pages 476--486

  65. [65]

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. https://doi.org/10.18653/v1/P19-1472 HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791--4800, Florence, Italy. Association for Computational Linguistics

  66. [66]

    Jia-Chen Zhang, Yu-Jie Xiong, Chun-Ming Xia, Dong-Hai Zhu, and Xi-He Qiu. 2025a. https://aclanthology.org/2025.coling-main.265/ Parameter-efficient fine-tuning of large language models via deconvolution in subspace. In Proceedings of the 31st International Conference on Computational Linguistics, pages 3924--3935, Abu Dhabi, UAE. Association for Comput...

  67. [67]

    Pieyi Zhang, Richong Zhang, and Zhijie Nie. 2025b. Dynamic task vector grouping for efficient multi-task prompt tuning. arXiv preprint arXiv:2503.18063

  68. [68]

    Sheng Zhang, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Kevin Duh, and Benjamin Van Durme. 2018. Record: Bridging the gap between human and machine commonsense reading comprehension. arXiv preprint arXiv:1810.12885

  69. [69]

    Yi-Kai Zhang, Lu Ren, Chao Yi, Qi-Wei Wang, De-Chuan Zhan, and Han-Jia Ye. 2023. Zhijian: A unifying and rapidly deployable toolbox for pre-trained model reuse. arXiv preprint arXiv:2308.09158

  70. [70]

    Bingchen Zhao, Haoqin Tu, Chen Wei, Jieru Mei, and Cihang Xie. 2024. https://openreview.net/forum?id=YR3ETaElNK Tuning layernorm in attention: Towards efficient multi-modal LLM finetuning. In The Twelfth International Conference on Learning Representations

  71. [71]

    Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. 2024. http://arxiv.org/abs/2403.13372 Llamafactory: Unified efficient fine-tuning of 100+ language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Bangkok, Thailand. Assoc...