Rethinking Adapter Placement: A Dominant Adaptation Module Perspective
Pith reviewed 2026-05-08 10:15 UTC · model grok-4.3
The pith
A single LoRA adapter placed at one shallow FFN down-projection uses far fewer trainable parameters than the standard practice of distributing many adapters, yet outperforms it on average.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Gradient analysis reveals that the projected adapter gradient energy concentrates overwhelmingly on a single shallow FFN down-projection. The layer index of this dominant module depends on model architecture yet remains stable across tasks. Inserting one low-rank adapter exactly at that location yields higher average accuracy than distributing adapters throughout the network while training only about 0.7 percent of the parameters required by standard LoRA.
What carries the argument
PAGE (Projected Adapter Gradient Energy), a gradient-based sensitivity probe that ranks every candidate LoRA position by its initial trainable gradient energy and thereby locates the dominant adaptation module.
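For orientation, here is a minimal Python sketch of one plausible reading of a projected adapter gradient energy. It is an assumption, not the paper's definition, which this summary does not reproduce. With the usual LoRA initialisation (B = 0, A random), the gradient with respect to A vanishes, and the only non-zero initial adapter gradient is dL/dB = G Aᵀ, where G is the gradient of the loss with respect to the frozen weight. The sketch scores a candidate site by the squared Frobenius norm of that matrix. The function names, the toy site names, the toy gradients, and the Gaussian scaling of A are all hypothetical.

```python
import random


def projected_gradient_energy(grads_out, inputs, A):
    """Squared Frobenius norm of dL/dB = sum_i g_i (A x_i)^T for a zero-initialised B.

    grads_out: upstream gradients g_i at the module output (each of length d_out)
    inputs:    module inputs x_i (each of length d_in)
    A:         r x d_in adapter down-projection, given as a list of rows
    """
    d_out, r = len(grads_out[0]), len(A)
    grad_B = [[0.0] * r for _ in range(d_out)]
    for g, x in zip(grads_out, inputs):
        # Project the input into the rank-r adapter subspace.
        Ax = [sum(a * xi for a, xi in zip(row, x)) for row in A]
        # Accumulate the outer product g (A x)^T over the batch.
        for o in range(d_out):
            for k in range(r):
                grad_B[o][k] += g[o] * Ax[k]
    return sum(v * v for row in grad_B for v in row)


def rank_sites(sites, r=2, seed=0):
    """Return site names sorted by descending energy, plus the energy table."""
    rng = random.Random(seed)
    energy = {}
    for name, (grads_out, inputs) in sites.items():
        d_in = len(inputs[0])
        # Assumed initialisation: i.i.d. Gaussian entries scaled by 1/sqrt(d_in).
        A = [[rng.gauss(0.0, d_in ** -0.5) for _ in range(d_in)] for _ in range(r)]
        energy[name] = projected_gradient_energy(grads_out, inputs, A)
    return sorted(energy, key=energy.get, reverse=True), energy


# Toy, hand-written gradients for two hypothetical candidate sites.
sites = {
    "layer0.down_proj": ([[0.0, 0.0]], [[1.0, 2.0, 3.0]]),
    "layer1.down_proj": ([[1.0, 1.0]], [[1.0, 2.0, 3.0]]),
}
order, energy = rank_sites(sites)
print(order[0], energy)
```

In a real model the per-site upstream gradients and inputs would come from a forward-backward pass on target-task data with the base weights frozen; the ranking step itself would be unchanged.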
Load-bearing premise
The concentration of PAGE on one shallow FFN down-projection generalizes to other models and tasks and remains the single best placement without further search.
What would settle it
Repeating the PAGE measurement and DomLoRA training on a new model family or task category and finding that the identified single site does not match or exceed the performance of a standard multi-adapter LoRA configuration.
Original abstract
Low-rank adaptation (LoRA) is a widely used parameter-efficient fine-tuning method that places trainable low-rank adapters into frozen pre-trained models. Recent studies show that using fewer LoRA adapters may still maintain or even improve performance, but existing methods still distribute adapters broadly, leaving where to place a limited number of adapters to maximize performance largely open. To investigate this, we introduce PAGE (Projected Adapter Gradient Energy), a gradient-based sensitivity probe that estimates the initial trainable gradient energy available to each candidate LoRA adapter. Surprisingly, we find that PAGE is highly concentrated on a single shallow FFN down-projection across two model families and four downstream tasks. We term this module the dominant adaptation module and show that its layer index is architecture-dependent but task-stable. Motivated by this finding, we propose DomLoRA, a placement method that places a single adapter at the dominant adaptation module. With only ~0.7% of vanilla LoRA's trainable parameters, DomLoRA outperforms it on average across various downstream tasks, including instruction following, mathematical reasoning, code generation, and multi-turn conversation. This method also improves other LoRA variants, supporting the dominant adaptation module perspective as a practical placement guideline.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PAGE (Projected Adapter Gradient Energy), a gradient-based probe computed from initial gradients on target tasks, to identify a dominant adaptation module. It reports that PAGE concentrates on a single shallow FFN down-projection layer whose index is architecture-dependent but task-stable across two model families and four tasks (instruction following, mathematical reasoning, code generation, multi-turn conversation). Motivated by this, DomLoRA places one LoRA adapter at this module and claims to outperform vanilla LoRA on average while using only ~0.7% of the trainable parameters; the same placement also improves other LoRA variants.
Significance. If the concentration finding and performance gains hold under broader testing, the work supplies a practical, low-cost placement rule for parameter-efficient fine-tuning that reduces adapter count while matching, and sometimes improving, downstream results. Because the gradient-energy probe is computed before any adapter training, it avoids post-hoc or fitted-parameter circularity, which is a methodological strength.
major comments (3)
- [Abstract and §4] Abstract and §4 (empirical findings): the reported concentration of PAGE on one shallow FFN down-projection is demonstrated only on two model families and four tasks. Because the central claim is that this yields a general, task-stable placement guideline, additional experiments on at least one more architecture family (e.g., encoder-decoder) and a broader task distribution are required to establish that the single-module rule is not setup-specific.
- [§5] §5 (DomLoRA results): the average outperformance is stated, but the manuscript must report per-task and per-model breakdowns, standard deviations across random seeds, and statistical significance tests. Without these, it is impossible to determine whether the claimed gains are robust or driven by a subset of the four tasks.
- [§3] §3 (PAGE definition): the description must explicitly confirm that all gradient-energy measurements are taken on the frozen model before any adapter training or fine-tuning begins, and that layer selection is performed once per architecture rather than tuned post-hoc on validation performance.
minor comments (2)
- [Figure captions and §4.1] Figure captions and §4.1 should include the precise mathematical definition of PAGE (including projection and energy aggregation) so readers can reproduce the metric without ambiguity.
- [Abstract and introduction] The abstract and introduction should state the exact parameter count ratio (0.7%) relative to a concrete vanilla LoRA configuration (rank, alpha, target modules) for direct comparison.
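To illustrate the kind of concrete comparison the last comment asks for, the following Python sketch counts LoRA parameters under an assumed configuration. The dimensions are those of a Llama-3-8B-style decoder (hidden size 4096, FFN intermediate size 14336, 32 layers, key/value projection width 1024), the baseline is assumed to adapt all seven linear projections in every layer, and both setups are assumed to use the same rank. None of these choices is confirmed by the summary above; they only show that a ratio near 0.7% is arithmetically reachable.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters of one LoRA adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


# Assumed Llama-3-8B-style shapes (not stated in the summary).
hidden, inter, kv_dim, layers, rank = 4096, 14336, 1024, 32, 16

per_layer = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, inter),
    "up_proj": (hidden, inter),
    "down_proj": (inter, hidden),
}

# Baseline: an adapter on every listed projection in every layer.
vanilla = layers * sum(lora_params(i, o, rank) for i, o in per_layer.values())
# Single-site placement: one adapter on one FFN down-projection.
single = lora_params(*per_layer["down_proj"], rank)
ratio = single / vanilla

print(f"{vanilla:,} vs {single:,} -> {ratio:.2%}")  # 41,943,040 vs 294,912 -> 0.70%
```

When both setups share the same rank, the rank cancels out of the ratio, so the figure depends only on the layer shapes and on which modules the baseline adapts.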
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive report. The comments have helped us identify areas where the manuscript can be strengthened for clarity and robustness. We address each major comment below and indicate the corresponding revisions.
- Referee: [Abstract and §4] Abstract and §4 (empirical findings): the reported concentration of PAGE on one shallow FFN down-projection is demonstrated only on two model families and four tasks. Because the central claim is that this yields a general, task-stable placement guideline, additional experiments on at least one more architecture family (e.g., encoder-decoder) and a broader task distribution are required to establish that the single-module rule is not setup-specific.
  Authors: We agree that the current experiments are confined to two decoder-only families and four tasks, which limits the strength of the generality claim. The observed task-stability and architecture-dependent index are consistent within the tested setups, but we acknowledge that encoder-decoder models represent an important additional family. In the revised manuscript we will add a new limitations paragraph in §6 explicitly discussing the scope of the evaluated architectures and tasks, and we commit to including at least one encoder-decoder experiment (e.g., on T5) together with two additional tasks in a future extended version. This partial revision clarifies the current evidence without overstating it.
  Revision: partial
- Referee: [§5] §5 (DomLoRA results): the average outperformance is stated, but the manuscript must report per-task and per-model breakdowns, standard deviations across random seeds, and statistical significance tests. Without these, it is impossible to determine whether the claimed gains are robust or driven by a subset of the four tasks.
  Authors: The referee is correct that aggregate averages alone are insufficient. We have prepared expanded tables for the revised §5 that report per-task and per-model scores, standard deviations computed over three independent random seeds, and paired t-test p-values comparing DomLoRA against vanilla LoRA and other baselines. These additions will allow readers to assess robustness directly.
  Revision: yes
- Referee: [§3] §3 (PAGE definition): the description must explicitly confirm that all gradient-energy measurements are taken on the frozen model before any adapter training or fine-tuning begins, and that layer selection is performed once per architecture rather than tuned post-hoc on validation performance.
  Authors: We confirm that PAGE is computed exclusively on the frozen pre-trained weights using a single forward-backward pass on target-task data before any adapters are inserted or training begins. Layer selection is performed once per model architecture from the resulting PAGE profile and is never adjusted on validation performance. We have revised the opening paragraphs of §3 to state this procedure explicitly and have added a short algorithmic note to remove any possible ambiguity.
  Revision: yes
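The paired t-test over three seeds mentioned in the response on §5 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the function name is hypothetical, the inputs are obviously synthetic placeholders rather than scores from the paper, and the p-value uses the closed form of the Student t distribution that holds only for two degrees of freedom (that is, exactly three paired runs).

```python
import math
from statistics import mean, stdev


def paired_t_test_three_seeds(scores_a, scores_b):
    """Two-sided paired t-test for exactly three paired runs (df = 2).

    Returns (t, p). For df = 2 the t CDF is F(t) = 1/2 + t / (2 * sqrt(t^2 + 2)),
    so the two-sided p-value is 1 - |t| / sqrt(t^2 + 2).
    """
    if len(scores_a) != 3 or len(scores_b) != 3:
        raise ValueError("this closed form only covers three paired runs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    spread = stdev(diffs)  # sample standard deviation of the paired differences
    if spread == 0.0:
        raise ValueError("all paired differences are identical; t is undefined")
    t = mean(diffs) / (spread / math.sqrt(3))
    p = 1.0 - abs(t) / math.sqrt(t * t + 2.0)
    return t, p


# Synthetic placeholder inputs, chosen only to make the arithmetic easy to follow.
t, p = paired_t_test_three_seeds([1.0, 2.0, 3.0], [0.0, 0.0, 0.0])
print(f"t = {t:.4f}, p = {p:.4f}")
```

With only three seeds the test has little power: at two degrees of freedom, |t| must exceed roughly 4.30 to reach p < 0.05, so per-task standard deviations remain informative alongside the p-values.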
Circularity Check
No circularity: the empirical gradient probe directly informs the placement rule, without self-reference or fitting.
The paper's chain begins with the definition of PAGE as a gradient-energy probe computed from initial trainable gradients on the target tasks, followed by the empirical observation that this energy concentrates on one shallow FFN down-projection. This concentration is reported as a measured fact across the tested models and tasks rather than derived from any equation or prior result. DomLoRA is then defined simply as the rule that places the single adapter at the observed dominant module; its superiority is established by direct performance comparison against vanilla LoRA on the same tasks. No equations reduce to their own inputs, no parameters are fitted on a subset and then relabeled as predictions, and no load-bearing claims rest on self-citations. The derivation remains self-contained because the placement guideline is an output of the measurement step and is externally validated by the reported accuracy gains.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: initial gradient energy computed before any fine-tuning is a reliable proxy for the ultimate utility of an adapter location.