Gradient-Based LoRA Rank Allocation Under GRPO: An Empirical Study
Pith reviewed 2026-05-11 02:15 UTC · model grok-4.3
The pith
Proportional LoRA rank allocation under GRPO lowers accuracy by 4.5 points versus uniform allocation on identical budgets.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Gradient-based proportional rank allocation for LoRA under GRPO reinforcement learning reduces accuracy by 4.5 points relative to uniform allocation on the same parameter budget. The GRPO gradient landscape is flatter than under SFT, with a max-to-min layer importance ratio of only 2.17x, so every layer carries meaningful signal. Non-uniform allocation triggers a gradient amplification effect that widens the importance spread to 3.00x, creating a positive feedback loop in which high-rank layers absorb more gradient while low-rank layers are progressively silenced.
What carries the argument
The gradient amplification effect under GRPO, in which non-uniform LoRA ranks increase the max-to-min gradient magnitude ratio from 2.17x to 3.00x and thereby create a positive feedback loop that silences low-rank layers.
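The profiling behind this claim can be sketched in a few lines. The function below is a hypothetical reconstruction (name and input shape assumed, not taken from the paper): it reduces each layer's LoRA gradients to a mean absolute magnitude and reports the max-to-min importance ratio the review tracks (2.17x under GRPO, versus the >10x reported for SFT).

```python
def importance_profile(layer_grads):
    """Gradient-magnitude importance per layer and the max-to-min ratio.

    layer_grads maps a layer name to a list of absolute gradient values
    for that layer's LoRA parameters, collected over some profiling steps.
    This input shape is an assumption for illustration; the paper's exact
    profiling procedure is not specified here.
    """
    # Importance of a layer = mean absolute gradient over its LoRA params.
    scores = {name: sum(g) / len(g) for name, g in layer_grads.items()}
    # A flat landscape (ratio near 1) means every layer carries signal.
    ratio = max(scores.values()) / min(scores.values())
    return scores, ratio
```

Under this reading, "flatness" is just how close the ratio sits to 1; the amplification effect is the claim that training with non-uniform ranks moves this ratio from 2.17 toward 3.00.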
If this is right
- Uniform rank allocation avoids the feedback loop and preserves higher accuracy under GRPO.
- Gradient importance measured at the start of training is not a reliable predictor of the capacity a layer needs during reinforcement learning.
- Naive transfer of SFT-era proportional LoRA allocation to alignment training should be avoided.
- All layers carry meaningful gradient signal under GRPO, unlike the highly skewed importance patterns reported for SFT.
Where Pith is reading between the lines
- The flatter gradient profile may stem from the relative nature of policy optimization rather than absolute token prediction, suggesting that allocation strategies should be tailored to RL objectives.
- Future work could test whether dynamic rank adjustment during training, rather than static allocation, mitigates the amplification effect.
- The result may extend to other RL-based alignment methods that rely on relative advantage signals instead of supervised losses.
Load-bearing premise
That the gradient magnitudes measured on Qwen 2.5 1.5B with GSM8K reliably indicate the capacity each layer needs and that the observed performance gap generalizes to other models, tasks, and GRPO implementations.
What would settle it
Repeating the uniform-versus-proportional comparison on a different base model or task under the same GRPO setup and checking whether the 4.5-point gap persists or reverses.
Figures
read the original abstract
Adaptive rank allocation for LoRA, allocating more parameters to important layers and fewer to unimportant ones, consistently improves efficiency under supervised fine-tuning (SFT). We investigate whether this success transfers to reinforcement learning, specifically Group Relative Policy Optimization (GRPO). Using gradient-magnitude profiling on Qwen 2.5 1.5B with GSM8K, we find that it does not: proportional rank allocation degrades accuracy by 4.5 points compared to uniform allocation (70.0% vs. 74.5%), despite using identical parameter budgets. We identify two mechanisms behind this failure. First, the gradient landscape under GRPO is fundamentally flatter than under SFT: the max-to-min layer importance ratio is only 2.17x, compared to the >10x reported in the SFT literature. All layers carry meaningful gradient signal; none are truly idle. Second, we discover a gradient amplification effect: non-uniform allocation widens the importance spread from 2.17x to 3.00x, creating a positive feedback loop where high-rank layers absorb more gradient while low-rank layers are progressively silenced. Our results suggest that gradient importance does not predict capacity requirements under RL, and that naive transfer of SFT-era rank allocation to alignment training should be avoided.
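The proportional scheme the abstract compares against uniform allocation can be sketched as follows. The function name and the rounding policy are assumptions for illustration, since the paper's exact scheme is not given here: each layer receives rank proportional to its gradient-magnitude importance, with leftover budget going to the layers with the largest fractional shares so the total exactly matches the uniform baseline's budget.

```python
def allocate_ranks(grad_norms, rank_budget):
    """Proportional LoRA rank allocation under a fixed total rank budget.

    grad_norms: {layer_name: gradient-magnitude importance score}
    rank_budget: total rank to distribute across all layers.
    Hypothetical sketch; the minimum-rank floor and the largest-fraction
    tie-breaking are assumptions, not the paper's stated procedure.
    """
    total = sum(grad_norms.values())
    raw = {k: rank_budget * v / total for k, v in grad_norms.items()}
    # Floor each share, but give every layer at least rank 1.
    ranks = {k: max(1, int(r)) for k, r in raw.items()}
    # Hand leftover budget to the layers with the largest fractional parts.
    leftover = rank_budget - sum(ranks.values())
    by_fraction = sorted(raw, key=lambda k: raw[k] - int(raw[k]), reverse=True)
    for k in by_fraction[: max(leftover, 0)]:
        ranks[k] += 1
    return ranks
```

The uniform baseline is simply `rank_budget // n_layers` per layer; the paper's finding is that the proportional variant above, despite spending the identical budget, loses 4.5 accuracy points under GRPO.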
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that gradient-based proportional LoRA rank allocation, which improves efficiency under SFT, fails to transfer to GRPO reinforcement learning. On Qwen 2.5 1.5B with GSM8K, proportional allocation yields 70.0% accuracy versus 74.5% for uniform allocation at identical parameter budgets. The authors attribute the 4.5-point gap to a flatter GRPO gradient landscape (max-to-min ratio 2.17x vs. >10x in SFT literature) where all layers carry signal, plus a positive feedback loop in which non-uniform ranks amplify the importance spread to 3.00x.
Significance. If the reported gradient flatness and amplification effects hold, the work supplies concrete empirical evidence that SFT-era adaptive LoRA heuristics are not reliable under GRPO-style RL, motivating the development of RL-specific rank allocation methods. The measurements of layer-wise gradient ratios and the observed feedback dynamic are potentially useful for practitioners tuning LoRA on alignment tasks.
major comments (2)
- [Abstract / Results] The central 4.5-point accuracy gap (70.0% vs. 74.5%) is reported without error bars, number of runs, random seeds, or statistical tests. Given that the claim rests on this difference being meaningful and reproducible, the absence of these details makes it impossible to judge whether the gap exceeds run-to-run variance.
- [Abstract] The claim that gradient-based allocation 'should be avoided' for GRPO is supported only by measurements on a single 1.5B model and a single math task (GSM8K). The reported 2.17x flatness ratio and 3.00x amplification are presented as general properties of GRPO, yet no additional scales, architectures, or GRPO variants are shown; this single-experiment basis is load-bearing for the recommendation against transferring SFT methods.
minor comments (1)
- [Abstract] The abstract refers to 'gradient-magnitude profiling' without stating at which training step(s) the ratios are computed or whether they are averaged; a brief clarification would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, with revisions where feasible to improve clarity and rigor.
read point-by-point responses
-
Referee: [Abstract / Results] The central 4.5-point accuracy gap (70.0% vs. 74.5%) is reported without error bars, number of runs, random seeds, or statistical tests. Given that the claim rests on this difference being meaningful and reproducible, the absence of these details makes it impossible to judge whether the gap exceeds run-to-run variance.
Authors: We agree that reporting the 4.5-point gap without error bars, run counts, seeds, or statistical tests limits the ability to assess its robustness against variance. In the revised manuscript, we will add results from five independent runs using distinct random seeds, include standard deviation error bars on all relevant accuracy figures and tables, and report a paired t-test p-value confirming statistical significance of the difference. These updates will appear in both the abstract and results section. revision: yes
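The significance check the authors promise can be sketched with a paired t statistic over per-seed accuracy pairs. All numbers and names below are hypothetical; a real analysis would look up the p-value for the statistic on n-1 degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(xs, ys):
    """Paired t statistic for per-seed accuracies under two conditions.

    xs, ys: accuracies from the same random seeds under uniform and
    proportional allocation respectively (hypothetical pairing; the
    paper's actual seeds and run counts are not reported here).
    """
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    # t = mean difference over its standard error; compare against the
    # t distribution with n - 1 degrees of freedom for a p-value.
    return mean(diffs) / (stdev(diffs) / sqrt(n))
```

A paired test is the right choice here because both allocation schemes can be run from the same seed, removing between-seed variance from the comparison.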
-
Referee: [Abstract] The claim that gradient-based allocation 'should be avoided' for GRPO is supported only by measurements on a single 1.5B model and a single math task (GSM8K). The reported 2.17x flatness ratio and 3.00x amplification are presented as general properties of GRPO, yet no additional scales, architectures, or GRPO variants are shown; this single-experiment basis is load-bearing for the recommendation against transferring SFT methods.
Authors: We acknowledge that the study uses a single model scale and task, which restricts the generality of the observed flatness ratio and amplification effect. The core contribution is an empirical demonstration that SFT-style gradient allocation fails to transfer under GRPO in this setting. We have revised the abstract to replace the prescriptive phrasing 'should be avoided' with 'may not transfer reliably to GRPO, motivating RL-specific methods,' better aligning the language with the evidence presented. Broader validation would be valuable but is not feasible within current resource constraints. revision: partial
- Deferred to future work: additional experiments across multiple model scales, architectures, and GRPO variants to establish broader properties of GRPO gradient landscapes.
Circularity Check
No circularity: purely empirical measurements with no derivations or self-referential fits.
full rationale
The paper conducts direct experiments profiling gradient magnitudes on Qwen 2.5 1.5B under GRPO with GSM8K, then measures accuracy for uniform vs. proportional LoRA rank allocations using identical parameter budgets. Reported values (74.5% uniform vs. 70.0% proportional; gradient ratio 2.17x widening to 3.00x) are observed experimental outcomes, not quantities obtained by fitting parameters to a subset and renaming the fit as a prediction. No equations, ansatzes, uniqueness theorems, or self-citations are invoked to derive the central claims; the work contains no derivation chain that reduces to its own inputs. The findings are falsifiable via replication on other models/tasks and stand as independent empirical evidence.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: gradient magnitude during training is a valid proxy for a layer's importance when deciding LoRA rank allocation.
Reference graph
Works this paper leans on
- [1] Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.
- [2] Zhang, Q., Chen, M., Bukharin, A., He, P., Cheng, Y., Chen, W., and Zhao, T. AdaLoRA: Adaptive budget allocation for parameter-efficient fine-tuning. In International Conference on Learning Representations, 2023.
- [3] He, H., Ye, P., Ren, Y., Yuan, Y., Zhou, L., Ju, S., and Chen, L. GoRA: Gradient-driven adaptive low rank adaptation. In Advances in Neural Information Processing Systems, 2025.
- [4] Cui, X., Li, H., Zeng, R., Zhao, Y., Qian, J., Duan, W., Liu, B., and Zhou, Z. IGU-LoRA: Adaptive rank allocation via integrated gradients and uncertainty-aware scoring. arXiv preprint arXiv:2603.13792, 2026.
- [5] Saket, A. Aletheia: Gradient-guided layer selection for efficient LoRA fine-tuning across architectures. arXiv preprint arXiv:2604.15351, 2026.
- [6] Shi, G., Lu, Z., Dong, X., Zhang, W., Zhang, X., Feng, Y., and Wu, X.-M. Understanding layer significance in LLM alignment. arXiv preprint arXiv:2410.17875, 2024.
- [7] Young, R. Why is RLHF alignment shallow? A gradient analysis. arXiv preprint arXiv:2603.04851, 2026.
- [8] Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Zhang, M., Li, Y. K., Wu, Y., and Guo, D. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- [9] Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.