Rethinking Adapter Placement: A Dominant Adaptation Module Perspective

Di Fang; Huiping Zhuang; Kaixuan Chen; Run He; Suoxin Zhang; Xiang Tan

arXiv: 2605.06183 · v1 · submitted 2026-05-07 · 💻 cs.AI · cs.CL · cs.LG

Pith reviewed 2026-05-08 10:15 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.LG

keywords LoRA · adapter placement · parameter-efficient fine-tuning · dominant adaptation module · gradient sensitivity · instruction tuning · large language models

The pith

A single LoRA adapter placed at one shallow FFN down-projection outperforms the standard practice of distributing many adapters while using far fewer parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines where to locate a limited number of trainable LoRA adapters inside frozen large models to maximize downstream performance. It introduces PAGE, a gradient-based probe that scores every possible adapter site by the initial energy available for training. Across two model families and four task types the probe shows that this energy concentrates almost entirely inside one shallow feed-forward down-projection layer. The authors therefore define DomLoRA as the method that inserts only a single adapter at this dominant site and demonstrate that the resulting model exceeds vanilla LoRA on average while training roughly 0.7 percent as many parameters.
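The single-site placement described above can be expressed as a module-name filter. The sketch below is a minimal illustration, not the authors' code: it assumes module names in the style used by Hugging Face decoder-only checkpoints (`model.layers.<i>.mlp.down_proj`), and the layer index is a placeholder, since this page does not report the actual index for either model. Libraries such as Hugging Face PEFT accept a regex string of this kind for `LoraConfig(target_modules=...)`; that usage is not exercised here and should be checked against the library version in use.

```python
import re

def single_site_pattern(layer_index, module_path="mlp.down_proj"):
    """Regex that fully matches exactly one projection in one decoder layer."""
    return rf".*\.layers\.{layer_index}\.{re.escape(module_path)}"

# Hypothetical module names, following the common Hugging Face naming style.
names = [
    f"model.layers.{i}.{m}"
    for i in range(12)
    for m in ("self_attn.q_proj", "mlp.up_proj", "mlp.down_proj")
]

DOMINANT_LAYER = 3  # placeholder index, NOT a value reported by the paper
pattern = single_site_pattern(DOMINANT_LAYER)
selected = [n for n in names if re.fullmatch(pattern, n)]
print(selected)  # ['model.layers.3.mlp.down_proj']
```

Anchoring the pattern on `\.layers\.<i>\.` keeps index 3 from also matching layers such as 13 or 30.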

Core claim

Gradient analysis reveals that the projected adapter gradient energy concentrates overwhelmingly on a single shallow FFN down-projection. The layer index of this dominant module depends on model architecture yet remains stable across tasks. Inserting one low-rank adapter exactly at that location yields higher average accuracy than distributing adapters throughout the network while training only about 0.7 percent of the parameters required by standard LoRA.

What carries the argument

PAGE (Projected Adapter Gradient Energy), a gradient-based sensitivity probe that ranks every candidate LoRA position by its initial trainable gradient energy and thereby locates the dominant adaptation module.
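This page does not give the exact PAGE formula, so the following is only a sketch of one plausible reading. It assumes the standard LoRA initialization (update `B @ A` with `B = 0` and `A` random), under which the only nonzero adapter gradient at step 0 is the one for `B`, equal to `G @ A^T` where `G` is the loss gradient of the frozen weight. The score used here is the squared Frobenius norm of that product. The function names, the toy gradients, and the choice to reuse one seed per module are all assumptions made for illustration.

```python
import random

def projected_energy(grad_w, a):
    """Squared Frobenius norm of grad_w @ a^T.

    grad_w: d_out x d_in loss gradient of the frozen weight (list of rows).
    a:      r x d_in LoRA down-projection at initialization (list of rows).
    """
    energy = 0.0
    for g_row in grad_w:
        for a_row in a:
            dot = sum(g * x for g, x in zip(g_row, a_row))
            energy += dot * dot
    return energy

def energy_shares(grads, rank, seed=0):
    """Return (module name, share of total energy) pairs, largest first."""
    scores = {}
    for name, grad_w in grads.items():
        d_in = len(grad_w[0])
        rng = random.Random(seed)  # same seed per module: a simplification
        a = [[rng.gauss(0.0, d_in ** -0.5) for _ in range(d_in)]
             for _ in range(rank)]
        scores[name] = projected_energy(grad_w, a)
    total = sum(scores.values())
    return sorted(((n, s / total) for n, s in scores.items()),
                  key=lambda pair: pair[1], reverse=True)

# Toy gradients, made up for illustration only.
toy_grads = {
    "layers.2.mlp.down_proj": [[10.0, -10.0], [10.0, 10.0]],
    "layers.5.self_attn.q_proj": [[0.1, 0.1], [0.1, -0.1]],
}
ranking = energy_shares(toy_grads, rank=2)
print(ranking[0])
```

In a real probe the gradients would come from a backward pass through the frozen model on target-task data; here they are supplied directly so the scoring step can be read in isolation.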

Load-bearing premise

The concentration of PAGE on one shallow FFN down-projection generalizes to other models and tasks and remains the single best placement without further search.

What would settle it

Repeating the PAGE measurement and DomLoRA training on a new model family or task category and finding that the identified single site does not match or exceed the performance of a standard multi-adapter LoRA configuration.

Figures

Figures reproduced from arXiv: 2605.06183 by Di Fang, Huiping Zhuang, Kaixuan Chen, Run He, Suoxin Zhang, Xiang Tan.

**Figure 1.** From broad to dominant placement. Vanilla LoRA places adapters across many layers and module types. (a) Layer-wise Reduction: fewer layers, all module types retained. (b) Module-type Reduction: fewer module types, all layers retained. (c) Reduction to a Single Dominant Module: PAGE is highly concentrated at one shallow FFN down-projection (magnified), suggesting that a single adapter placed there suffices.…

**Figure 2.** PAGE across projection modules. (a)–(g) show all attention and FFN projections of Qwen3-8B, and (h) shows the FFN down-projection of LLaMA-3.1-8B-Instruct. Dashed vertical lines indicate the dominant adaptation module. PAGE is highly concentrated at one shallow FFN down-projection. As shown in …

**Figure 3.** PAGE of all projection modules in Qwen3-8B on Tulu.

**Figure 4.** PAGE of all projection modules in LLaMA-3.1-8B-Instruct on Tulu.

**Figure 5.** PAGE of all projection modules in Qwen3-8B on MetaMathQA.

**Figure 6.** PAGE of all projection modules in LLaMA-3.1-8B-Instruct on MetaMathQA.

**Figure 7.** PAGE of all projection modules in Qwen3-8B on Magicoder.

**Figure 8.** PAGE of all projection modules in LLaMA-3.1-8B-Instruct on Magicoder.

**Figure 9.** PAGE of all projection modules in Qwen3-8B on WizardLM.

**Figure 10.** PAGE of all projection modules in LLaMA-3.1-8B-Instruct on WizardLM.

Original abstract

Low-rank adaptation (LoRA) is a widely used parameter-efficient fine-tuning method that places trainable low-rank adapters into frozen pre-trained models. Recent studies show that using fewer LoRA adapters may still maintain or even improve performance, but existing methods still distribute adapters broadly, leaving where to place a limited number of adapters to maximize performance largely open. To investigate this, we introduce PAGE (Projected Adapter Gradient Energy), a gradient-based sensitivity probe that estimates the initial trainable gradient energy available to each candidate LoRA adapter. Surprisingly, we find that PAGE is highly concentrated on a single shallow FFN down-projection across two model families and four downstream tasks. We term this module the dominant adaptation module and show that its layer index is architecture-dependent but task-stable. Motivated by this finding, we propose DomLoRA, a placement method that places a single adapter at the dominant adaptation module. With only ~0.7% of vanilla LoRA's trainable parameters, DomLoRA outperforms it on average across various downstream tasks, including instruction following, mathematical reasoning, code generation, and multi-turn conversation. This method also improves other LoRA variants, supporting the dominant adaptation module perspective as a practical placement guideline.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper finds gradient energy concentrates on one shallow FFN layer, so a single adapter there beats distributed LoRA on their tests, but the pattern is only shown for two model families and four tasks.

The main thing to know is that this work measures initial gradient energy across possible LoRA spots and finds it piles up in one shallow feed-forward down-projection. Placing a single adapter there, which they call DomLoRA, then matches or exceeds standard LoRA on instruction following, math, code, and conversation tasks while using roughly 0.7 percent of the parameters. The layer index shifts with model architecture but stays fixed across the tasks they ran, which is the concrete empirical result they report.

Referee Report

3 major / 2 minor

Summary. The paper introduces PAGE (Projected Adapter Gradient Energy), a gradient-based probe computed from initial gradients on target tasks, to identify a dominant adaptation module. It reports that PAGE concentrates on a single shallow FFN down-projection layer whose index is architecture-dependent but task-stable across two model families and four tasks (instruction following, mathematical reasoning, code generation, multi-turn conversation). Motivated by this, DomLoRA places one LoRA adapter at this module and claims to outperform vanilla LoRA on average while using only ~0.7% of the trainable parameters; the same placement also improves other LoRA variants.

Significance. If the concentration finding and performance gains hold under broader testing, the work supplies a practical, low-cost placement rule for parameter-efficient fine-tuning that reduces adapter count without sacrificing, and in some cases while improving, downstream results. Computing the gradient-energy probe before any adapter training is a methodological strength, as it avoids post-hoc or fitted-parameter circularity.

major comments (3)

[Abstract and §4] Abstract and §4 (empirical findings): the reported concentration of PAGE on one shallow FFN down-projection is demonstrated only on two model families and four tasks. Because the central claim is that this yields a general, task-stable placement guideline, additional experiments on at least one more architecture family (e.g., encoder-decoder) and a broader task distribution are required to establish that the single-module rule is not setup-specific.
[§5] §5 (DomLoRA results): the average outperformance is stated, but the manuscript must report per-task and per-model breakdowns, standard deviations across random seeds, and statistical significance tests. Without these, it is impossible to determine whether the claimed gains are robust or driven by a subset of the four tasks.
[§3] §3 (PAGE definition): the description must explicitly confirm that all gradient-energy measurements are taken on the frozen model before any adapter training or fine-tuning begins, and that layer selection is performed once per architecture rather than tuned post-hoc on validation performance.
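The robustness reporting requested in the §5 comment (spread across seeds plus a paired comparison) can be sketched with the standard library alone. The per-seed scores below are invented for illustration and are not results from the paper; the helper name is also an assumption. Only the t statistic is computed here, since a p-value needs a Student-t distribution, which the standard library does not provide (a routine such as `scipy.stats.ttest_rel` would return both).

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """t statistic for paired samples (a minus b) and its degrees of freedom."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1

# Hypothetical per-seed accuracies on one task; NOT results from the paper.
single_adapter = [71.2, 70.8, 71.5]
vanilla_lora = [70.1, 70.4, 70.0]

for label, scores in (("single", single_adapter), ("vanilla", vanilla_lora)):
    print(label, round(statistics.mean(scores), 2), round(statistics.stdev(scores), 2))

t_stat, dof = paired_t_statistic(single_adapter, vanilla_lora)
print("t =", round(t_stat, 3), "dof =", dof)
```

With only three seeds there are two degrees of freedom, so even a large-looking t statistic should be read cautiously.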

minor comments (2)

[Figure captions and §4.1] Figure captions and §4.1 should include the precise mathematical definition of PAGE (including projection and energy aggregation) so readers can reproduce the metric without ambiguity.
[Abstract and introduction] The abstract and introduction should state the exact parameter count ratio (0.7%) relative to a concrete vanilla LoRA configuration (rank, alpha, target modules) for direct comparison.
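To make the parameter-ratio point concrete, the arithmetic can be sketched as follows. The dimensions are assumptions chosen to resemble a publicly documented 8B-class decoder with grouped-query attention; the paper's actual baseline configuration (rank, alpha, target modules) is not stated on this page, and equal rank at every site is assumed. A LoRA adapter on a `d_out x d_in` linear layer trains `rank * (d_in + d_out)` parameters.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters of one adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)

# Assumed dimensions (not taken from the paper).
HIDDEN, INTERMEDIATE, KV_DIM, LAYERS, RANK = 4096, 14336, 1024, 32, 16

per_layer = {
    "q_proj": lora_params(HIDDEN, HIDDEN, RANK),
    "k_proj": lora_params(HIDDEN, KV_DIM, RANK),
    "v_proj": lora_params(HIDDEN, KV_DIM, RANK),
    "o_proj": lora_params(HIDDEN, HIDDEN, RANK),
    "gate_proj": lora_params(HIDDEN, INTERMEDIATE, RANK),
    "up_proj": lora_params(HIDDEN, INTERMEDIATE, RANK),
    "down_proj": lora_params(INTERMEDIATE, HIDDEN, RANK),
}
all_sites = LAYERS * sum(per_layer.values())   # adapters on every projection
single_site = per_layer["down_proj"]           # one adapter on one down-projection
ratio = single_site / all_sites
print(f"{single_site:,} / {all_sites:,} = {ratio:.2%}")
# prints: 294,912 / 41,943,040 = 0.70%
```

Under these assumptions the ratio lands near the quoted 0.7 percent, and it does not depend on the rank as long as both configurations share it. That agreement does not confirm the paper used this configuration.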

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. The comments have helped us identify areas where the manuscript can be strengthened for clarity and robustness. We address each major comment below and indicate the corresponding revisions.

Referee: [Abstract and §4] Abstract and §4 (empirical findings): the reported concentration of PAGE on one shallow FFN down-projection is demonstrated only on two model families and four tasks. Because the central claim is that this yields a general, task-stable placement guideline, additional experiments on at least one more architecture family (e.g., encoder-decoder) and a broader task distribution are required to establish that the single-module rule is not setup-specific.

Authors: We agree that the current experiments are confined to two decoder-only families and four tasks, which limits the strength of the generality claim. The observed task-stability and architecture-dependent index are consistent within the tested setups, but we acknowledge that encoder-decoder models represent an important additional family. In the revised manuscript we will add a new limitations paragraph in §6 explicitly discussing the scope of the evaluated architectures and tasks, and we commit to including at least one encoder-decoder experiment (e.g., on T5) together with two additional tasks in a future extended version. This partial revision clarifies the current evidence without overstating it.

revision: partial

Referee: [§5] §5 (DomLoRA results): the average outperformance is stated, but the manuscript must report per-task and per-model breakdowns, standard deviations across random seeds, and statistical significance tests. Without these, it is impossible to determine whether the claimed gains are robust or driven by a subset of the four tasks.

Authors: The referee is correct that aggregate averages alone are insufficient. We have prepared expanded tables for the revised §5 that report per-task and per-model scores, standard deviations computed over three independent random seeds, and paired t-test p-values comparing DomLoRA against vanilla LoRA and other baselines. These additions will allow readers to assess robustness directly.

revision: yes

Referee: [§3] §3 (PAGE definition): the description must explicitly confirm that all gradient-energy measurements are taken on the frozen model before any adapter training or fine-tuning begins, and that layer selection is performed once per architecture rather than tuned post-hoc on validation performance.

Authors: We confirm that PAGE is computed exclusively on the frozen pre-trained weights using a single forward-backward pass on target-task data before any adapters are inserted or training begins. Layer selection is performed once per model architecture from the resulting PAGE profile and is never adjusted on validation performance. We have revised the opening paragraphs of §3 to state this procedure explicitly and have added a short algorithmic note to remove any possible ambiguity.

revision: yes
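The selection step described in the last response (pick the top-scoring site from a probe profile, once per architecture) and the task-stability claim can be checked with a few lines. This is an illustrative sketch under stated assumptions: the function names are hypothetical, and the module names and scores are made up rather than taken from the paper's figures.

```python
def dominant_module(profile):
    """Return (module name, share of total score) for the highest-scoring site."""
    total = sum(profile.values())
    name = max(profile, key=profile.get)
    return name, profile[name] / total

def is_task_stable(profiles_by_task):
    """True when every task's profile selects the same dominant site."""
    winners = {dominant_module(p)[0] for p in profiles_by_task.values()}
    return len(winners) == 1

# Hypothetical probe scores per task; names and numbers are invented.
profiles = {
    "instruction": {
        "layers.2.mlp.down_proj": 9.1,
        "layers.2.mlp.up_proj": 0.3,
        "layers.20.self_attn.v_proj": 0.2,
    },
    "math": {
        "layers.2.mlp.down_proj": 7.4,
        "layers.2.mlp.up_proj": 0.5,
        "layers.20.self_attn.v_proj": 0.4,
    },
}
for task, profile in profiles.items():
    print(task, dominant_module(profile))
print("task-stable:", is_task_stable(profiles))
```

Reporting the winning site's share alongside its name helps distinguish a sharply concentrated profile from a near tie, which matters for the generality concern raised in the first major comment.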

Circularity Check

0 steps flagged

No circularity; empirical gradient probe directly informs placement rule without self-reference or fitting

The paper's chain begins with the definition of PAGE as a gradient-energy probe computed from initial trainable gradients on the target tasks, followed by the empirical observation that this energy concentrates on one shallow FFN down-projection. This concentration is reported as a measured fact across the tested models and tasks rather than derived from any equation or prior result. DomLoRA is then defined simply as the rule that places the single adapter at the observed dominant module; its superiority is established by direct performance comparison against vanilla LoRA on the same tasks. No equations reduce to their own inputs, no parameters are fitted on a subset and then relabeled as predictions, and no load-bearing claims rest on self-citations. The derivation remains self-contained because the placement guideline is an output of the measurement step and is externally validated by the reported accuracy gains.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work is almost entirely empirical. No free parameters are introduced to derive the result; the only modeling choice is the definition of PAGE itself, which is computed directly from gradients rather than fitted. No new entities are postulated beyond the observed concentration pattern.

axioms (1)

Domain assumption: Initial gradient energy computed before any fine-tuning is a reliable proxy for the ultimate utility of an adapter location.
Invoked when PAGE is used to rank and select the dominant module.


[1] [1]

Parameter-efficient transfer learning for NLP

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors,Proceedings of the 36th International Conference on Machine Learning, volume 97 ofProceedings of Machine Learn...

work page 2019

[2] [2]

LoRA: Low-rank adaptation of large language models

Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. InInternational Conference on Learning Representations, 2022

work page 2022

[3] [3]

QLoRA: Efficient finetuning of quantized LLMs

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient finetuning of quantized LLMs. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Processing Systems, volume 36, pages 10088–10115. Curran Associates, Inc., 2023

work page 2023

[4] [4]

Towards a unified view of parameter-efficient transfer learning

Junxian He, Chunting Zhou, Xuezhe Ma, Taylor Berg-Kirkpatrick, and Graham Neubig. Towards a unified view of parameter-efficient transfer learning. InInternational Conference on Learning Representations, 2022

work page 2022

[5] [5]

Layer-wise importance matters: Less memory for better performance in parameter-efficient fine-tuning of large language models

Kai Yao, Penglei Gao, Lichun Li, Yuan Zhao, Xiaofeng Wang, Wei Wang, and Jianke Zhu. Layer-wise importance matters: Less memory for better performance in parameter-efficient fine-tuning of large language models. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Findings of the Association for Computational Linguistics: EMNLP 2024, pages 1977–1...

work page 2024

[6] [6]

From bottom to top: Extending the potential of parameter efficient fine-tuning

Jihao Gu, Zelin Wang, Yibo Zhang, Ziji Zhang, and Ping Gong. From bottom to top: Extending the potential of parameter efficient fine-tuning. In Yaser Al-Onaizan, Mohit Bansal, and Yun- Nung Chen, editors,Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 3488–3500, Miami, Florida, USA, November 2024. Association ...

work page 2024

[7] [7]

Layer-wise LoRA fine-tuning: A similarity metric approach, 2026

Keith Ando Ogawa, Bruno Lopes Yamamoto, Lucas Lauton de Alcantara, Lucas Pellicer, Rosimeire Pereira Costa, Edson Bollis, Anna Helena Reali Costa, and Artur Jordao. Layer-wise LoRA fine-tuning: A similarity metric approach, 2026

work page 2026

[8] [8]

LoRA is all you need for safety alignment of reasoning LLMs, 2026

Yihao Xue and Baharan Mirzasoleiman. LoRA is all you need for safety alignment of reasoning LLMs, 2026

work page 2026

[9] [9]

Train more parameters but mind their placement: Insights into language adaptation with PEFT

Jenny Kunz. Train more parameters but mind their placement: Insights into language adaptation with PEFT. In Richard Johansson and Sara Stymne, editors,Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025), pages 323–330, Tallinn, Estonia, March 2025...

[10] Soufiane Hayou, Nikhil Ghosh, and Bin Yu. PLoP: Precise LoRA placement for efficient finetuning of large models. In The Fourteenth International Conference on Learning Representations, 2026.

[11] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

[12] Aaron Grattafiori et al. The Llama 3 herd of models, 2024.

[13] Qingru Zhang, Minshuo Chen, Alexander Bukharin, Pengcheng He, Yu Cheng, Weizhu Chen, and Tuo Zhao. Adaptive budget allocation for parameter-efficient fine-tuning. In The Eleventh International Conference on Learning Representations, 2023.

[14] Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. DoRA: Weight-decomposed low-rank adaptation. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Proceedings of the 41st International Conference on Machine...

[15] Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 458...

[16] Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3045–3059, Online and Punta Cana, Dominican Republic, November 2021. Ass...

[17] Soufiane Hayou, Nikhil Ghosh, and Bin Yu. LoRA+: Efficient low rank adaptation of large models. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, ...

[18] Dawid J. Kopiczko, Tijmen Blankevoort, and Yuki M. Asano. VeRA: Vector-based random matrix adaptation. In Proceedings of the 2024 International Conference on Learning Representations (ICLR), 2024.

[19] Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5484–5495, Online and Punta Cana, Dominican Republic, November ...

[20] Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei. Knowledge neurons in pretrained transformers. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8493–8502, Dublin, Ireland, May 2022. Association ...

[21] Zhi Zhang, Qizhe Zhang, Zijun Gao, Renrui Zhang, Ekaterina Shutova, Shiji Zhou, and Shanghang Zhang. Gradient-based parameter selection for efficient fine-tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 28566–28577, June 2024.

[22] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), December 2015.

[23] Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, Qingwei Lin, and Daxin Jiang. WizardLM: Empowering large pre-trained language models to follow complex instructions. In The Twelfth International Conference on Learning Representations, 2024.

[24] Hamish Ivison, Yizhong Wang, Valentina Pyatkin, Nathan Lambert, Matthew Peters, Pradeep Dasigi, Joel Jang, David Wadden, Noah A. Smith, Iz Beltagy, and Hannaneh Hajishirzi. Camels in a changing climate: Enhancing LM adaptation with Tulu 2, 2023.

[25] Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. MetaMath: Bootstrap your own mathematical questions for large language models. In The Twelfth International Conference on Learning Representations, 2024.

[26] Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, and Lingming Zhang. Magicoder: Empowering code generation with OSS-instruct. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings o...

[27] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2021.

[28] Jonathan H. Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki. TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages. Transactions of the Association for Computational Linguistics, 8:454–470, 2020.

[29] Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volum...

[30] Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252, Dublin, Ireland, May 2022. Association for Computatio...

[31] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021.

[32] Jian Liu, Leyang Cui, Hanmeng Liu, Dandan Huang, Yile Wang, and Yue Zhang. LogiQA: A challenge dataset for machine reading comprehension with logical reasoning, 7 2020. Main track.

[33] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Sys...

[34] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In J. Vanschoren and S. Yeung, editors, Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, volume 1, 2021.

[35] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

[36] Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 21558–21572. Curran Associate...

[37] Shuyang Jiang, Yusheng Liao, Ya Zhang, Yanfeng Wang, and Yu Wang. Fine-tuning with reserved majority for noise reduction. In The Thirteenth International Conference on Learning Representations, 2025.

[38] Yeonjoon Jung, Daehyun Ahn, Hyungjun Kim, Taesu Kim, and Eunhyeok Park. GraLoRA: Granular low-rank adaptation for parameter-efficient fine-tuning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.

Appendix Overview

The appendix provides additional theoretical details, empirical evidence, and experimental results. • Ap...
