pith. machine review for the scientific record.

arxiv: 2605.12207 · v1 · submitted 2026-05-12 · 💻 cs.LG · cs.AI · cs.CL

Recognition: no theorem link

Not How Many, But Which: Parameter Placement in Low-Rank Adaptation

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 06:06 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords low-rank adaptation · LoRA · parameter placement · gradient structure · supervised fine-tuning · GRPO · parameter-efficient fine-tuning

The pith

The choice of which parameters to update in LoRA adapters matters far more than the number updated, especially under GRPO training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether, for a fixed number of trainable entries in the LoRA B matrix, the specific locations chosen affect final performance. Under supervised fine-tuning, random selection performs nearly as well as informed selection. Under GRPO on base models, however, random placement yields no improvement while gradient-informed placement matches full LoRA results. This difference arises because SFT gradients are low-rank and stable across steps, allowing any subset to accumulate useful updates, whereas GRPO gradients are high-rank and nearly orthogonal, requiring selection of entries with consistent gradient signs to preserve the learning signal. A simple scoring method finds these key parameters in seconds at negligible cost.
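
To make the setup concrete, here is a minimal sketch, assuming the standard LoRA decomposition W + (α/r)·BA with A frozen, of a layer in which only a masked subset of B's entries is trainable. The class name, initialization scale, and hook-based masking are illustrative assumptions, not the authors' implementation.

```python
from typing import Optional

import torch
import torch.nn as nn


class MaskedBLoRALinear(nn.Module):
    """LoRA layer where only selected entries of B are trainable (sketch).

    A is frozen, as in the paper's setup; a binary mask fixes which entries
    of B may move away from zero. Everything else here is an assumption.
    """

    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32,
                 b_mask: Optional[torch.Tensor] = None):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)

        out_f, in_f = base.out_features, base.in_features
        self.A = nn.Parameter(0.01 * torch.randn(r, in_f), requires_grad=False)  # frozen
        self.B = nn.Parameter(torch.zeros(out_f, r))                             # trainable
        self.scaling = alpha / r

        mask = torch.ones(out_f, r) if b_mask is None else b_mask.float()
        self.register_buffer("b_mask", mask)
        # Zero the gradients of non-selected entries after every backward pass,
        # so the optimizer never updates them.
        self.B.register_hook(lambda g: g * self.b_mask)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        delta = (x @ self.A.T) @ (self.B * self.b_mask).T * self.scaling
        return self.base(x) + delta
```

In this sketch, random placement corresponds to sampling the mask uniformly, while informed placement builds it from a scoring pass like the one sketched under "What carries the argument" below.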

Core claim

Under GRPO, only gradient-informed placement of the k trainable parameters in LoRA's B matrix recovers the accuracy of standard LoRA, while random placement fails to beat the base model; this occurs because GRPO gradients are high-rank and near-orthogonal across steps, so only consistently signed entries retain the update signal, unlike the low-rank stable gradients in SFT.

What carries the argument

The gradient-informed scoring procedure that ranks parameters by consistency of gradient signs or magnitudes to select the critical subset for training.
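
The paper describes this selector only at a high level, so the following is one hedged reading of it: score each entry of B's gradient by how consistently its sign agrees across a handful of probe batches, weight by mean magnitude, and keep the top k. The function names, the exact statistic, and the number of probe batches are assumptions rather than the authors' specification.

```python
import torch


def score_b_entries(model, lora_layer, probe_batches, loss_fn):
    """Score each entry of B by sign-consistency of its gradient (sketch).

    `loss_fn(model, batch)` is a hypothetical closure returning a scalar loss;
    the exact statistic (sign consistency vs. Fisher-style magnitude) is our
    reading of the paper, not its verbatim procedure.
    """
    sign_sum = torch.zeros_like(lora_layer.B)
    abs_sum = torch.zeros_like(lora_layer.B)
    for batch in probe_batches:
        model.zero_grad(set_to_none=True)
        loss_fn(model, batch).backward()
        g = lora_layer.B.grad.detach()
        sign_sum += g.sign()
        abs_sum += g.abs()
    n = len(probe_batches)
    consistency = sign_sum.abs() / n       # in [0, 1]: 1 = same sign on every batch
    return consistency * (abs_sum / n)     # consistency weighted by mean magnitude


def top_k_mask(scores: torch.Tensor, k: int) -> torch.Tensor:
    """Binary mask over B selecting the k highest-scoring entries."""
    flat = scores.flatten()
    mask = torch.zeros_like(flat)
    mask[flat.topk(k).indices] = 1.0
    return mask.view_as(scores)
```

A procedure of roughly this cost is consistent with the paper's report that the ranking stabilizes at about 50 probe examples (Figure 9a) and runs in seconds.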

If this is right

  • Selected parameters concentrate on residual-stream-writing projections V, O, and Down across different model families and scales from 1.5B to 8B.
  • Under supervised fine-tuning, any random subset of k parameters achieves comparable performance to informed selection.
  • The scoring procedure runs in under 10 seconds and costs less than 0.5% of full training.
  • Gradient structure determines whether placement choice matters: low-rank, stable gradients in SFT vs. high-rank, near-orthogonal gradients in GRPO (a minimal measurement sketch follows this list).
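
To probe that contrast on one's own runs, a rough diagnostic (our construction, not the paper's) is to log the per-step gradient of the B matrix and track two quantities: its effective rank and its cosine similarity to the previous step's gradient.

```python
import torch
import torch.nn.functional as F


def effective_rank(g: torch.Tensor, eps: float = 1e-12) -> float:
    """Entropy-based effective rank of a gradient matrix (exp of spectral entropy)."""
    s = torch.linalg.svdvals(g)
    p = s / (s.sum() + eps)
    entropy = -(p * torch.log(p + eps)).sum()
    return float(torch.exp(entropy))


def step_cosine(g_prev: torch.Tensor, g_curr: torch.Tensor) -> float:
    """Cosine similarity between consecutive-step gradients of B (flattened)."""
    return float(F.cosine_similarity(g_prev.flatten(), g_curr.flatten(), dim=0))
```

On the paper's account, SFT gradients should show low effective rank and high step-to-step cosine similarity, while GRPO gradients should show the opposite pattern.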

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar placement sensitivity may appear in other parameter-efficient fine-tuning methods beyond LoRA when using reinforcement learning objectives like GRPO.
  • Practitioners could integrate this quick scoring step into standard LoRA workflows to reduce trainable parameters without loss of performance under policy optimization.
  • Testing the method on larger models or different tasks could reveal whether the concentration on V, O, Down projections holds more broadly.
  • Future work might explore whether modifying the optimizer or gradient accumulation could make random placement viable under GRPO.

Load-bearing premise

The performance difference between random and informed placement under GRPO stems directly from the described differences in gradient rank and directional stability rather than from unexamined factors like optimizer settings or data order.

What would settle it

Rerunning the same GRPO experiments while forcing the gradients to be more stable or more low-rank would settle it: if random placement then matches informed placement, the proposed mechanism is falsified.
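
One concrete way to run that intervention, sketched under the assumption that the trainer exposes B's gradient through a standard PyTorch tensor hook, is to project each step's gradient onto its top singular directions before the optimizer sees it. The function below is a hypothetical construction, not something the paper implements.

```python
import torch


def project_to_low_rank(g: torch.Tensor, rank: int) -> torch.Tensor:
    """Best rank-`rank` approximation of a gradient matrix (hypothetical intervention)."""
    U, S, Vh = torch.linalg.svd(g, full_matrices=False)
    r = min(rank, S.numel())
    return U[:, :r] @ torch.diag(S[:r]) @ Vh[:r, :]


# Registering this as a hook on B during GRPO forces low-rank updates; if random
# placement then recovers informed-placement accuracy, the proposed mechanism fails.
# lora_layer.B.register_hook(lambda grad: project_to_low_rank(grad, rank=4))
```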

Figures

Figures reproduced from arXiv: 2605.12207 by Arijit Sehanobish, Charles Lovering.

Figure 1
Figure 1: Synthetic validation. (a) Under dense signal, random and informed placement perform similarly; any parameter subset captures the distributed gradient. (b) Under concentrated signal, informed placement at 2% of B nearly matches Full LoRA, while random placement barely improves even at 50%.
Figure 2
Figure 2: OOD generalization (Qwen2.5-3B, GRPO). (a, b) AIME 2024 pass@k for models trained on MATH and GSM8K respectively. (c) Cross-task greedy transfer: solid bars = MATH→GSM8K, hatched = GSM8K→MATH; dashed/dotted lines show base-model accuracy. Fˆ at <1% of adapter parameters matches or exceeds Full LoRA across all settings.
Figure 3
Figure 3: Fˆ knockout sweep (3B/7B × MATH/GSM8K). Accuracy vs. fraction of B entries zeroed by Fˆ score (circuit, red) or at random (blue). Dashed: full LoRA; dotted: base model. Circuit-ordered knockout degrades faster in all configurations, with the gap scaling with LoRA effect size.
Figure 4
Figure 4: Divergence vs. effective update norm (Qwen2.5-1.5B/3B/7B, MATH-500, GRPO). For both rows, the y-axis is 1 − top-25 token overlap, measuring behavioral change via logit lens relative to the base model. Top: each trajectory traces training checkpoints; opacity increases with step; x-axis: ‖B_nz‖₂ · α/r, the scaled norm of the non-zero B weights. Bottom: per-layer divergence at the final checkpoint.
Figure 5
Figure 5: MATH-trained pass@k and maj@k (Qwen2.5-3B/7B). Fˆ at 50K reaches 100% pass@64 on AMC (3B) and 95% (7B). On AIME 2025, Full LoRA degrades below base at 3B while Sˆ reaches 36.7% pass@64; at 7B all methods improve over base. AIME maj@k is near-zero for all methods; AMC maj@64 peaks at 52.5% (Fˆ, 3B) and 65% (Sˆ, 7B).
Figure 6
Figure 6: Cross-task greedy transfer (Qwen2.5-3B/7B). Gaps indicate methods not evaluated at that scale. MATH→GSM8K: circuits match Full LoRA to within 1.1pp. GSM8K→MATH-500: at 3B, Full LoRA degrades to 34.2% (−14.8pp below base) while Sˆ at 10K improves to 53.8%; at 7B, Full LoRA transfers well (64.8%, +11.2pp).
Figure 7
Figure 7: GSM8K-trained pass@k and maj@k (Qwen2.5-3B/7B). At 3B, Full LoRA degrades below base on AMC while Sˆ at 10K reaches 97.5% pass@64. At 7B, Full LoRA transfers well and dominates on AIME 2024.
Figure 8
Figure 8: The same comparison visualized as scaling curves. Panels: (a) Qwen3-0.6B, perplexity vs. budget k; (b) Qwen3-0.6B, token accuracy (%) vs. budget k; (c) Qwen2.5-1.5B, 7-benchmark average (%) vs. budget k. Legend: Base, Full LoRA, B-only, A+B.
Figure 9
Figure 9: Circuit discovery is stable. (a) Top-10K overlap with the N=100 reference circuit as a function of discovery examples N. Both Sˆ and Fˆ reach 98% overlap at N=50; random circuits overlap by <0.1%. (b) Sˆ-circuit overlap under perturbation of A. The circuit degrades smoothly; at 10% perturbation, 92% of elements are unchanged.
Figure 10
Figure 10: Budget sweep under Alpaca SFT (7-benchmark avg). Dashed lines mark base and full-LoRA references. On 1.5B, random placement is flat across budgets (55–56%) due to persistent MMLU collapse, while both circuits scale steadily and match full LoRA by k=50K. On 3B, circuits lead random by 3–4pp at k=10K; all methods converge toward full LoRA by k=100K.
Figure 11
Figure 11: Llama-3.2-3B-Instruct, GSM8K (top 500K). At the larger budget (2% of B), the circuit is more distributed but still favors V/O in the first half of layers. The instruct model shows similar module preferences to the base Qwen models.
Figure 12
Figure 12: Qwen2.5-VL-7B, MathVista (top 10K). Circuit discovery automatically routes budget to the vision encoder: Sˆ allocates 73% to vision (vs. 16% for random), and Fˆ allocates 99%. Within vision, both methods concentrate on early blocks (0–6) and block 16. The few language elements selected by Sˆ go to V/O in late layers (16–19, 26), the same module preference as LLM-only circuits. Random placement inverts this…
Figure 13
Figure 13: Score distributions for Qwen2.5-3B on GSM8K (left) and MATH (right). Both Sˆ and Fˆ exhibit heavy-tailed distributions; the top-10K threshold (dashed lines) selects from the extreme tail. Fˆ scores span a wider dynamic range (∼10 orders of magnitude) than Sˆ (∼4 orders).
Figure 14
Figure 14: Score distributions for Qwen2.5-7B on GSM8K (left) and MATH (right). Fisher equals the squared mean gradient plus the gradient variance. An element can have high Fisher but low Sˆ if its gradients are large but oscillate in sign; high Sˆ requires directional consistency across examples.
Figure 15
Figure 15: Per-layer circuit analysis for Qwen2.5-3B (36L) and Qwen2.5-7B (28L) on GSM8K with budget k=10K. (a) Signal retention: fraction of each layer's gradient energy captured by the top 0.1% of elements. Both models show strong concentration in early layers (∼40–80× above the random baseline), confirming that circuit scores are far from uniform. (b) Circuit budget allocation: fraction of the global top-k budget…
Figure 16
Figure 16: Sˆ knockout sweep. Sˆ-ranked entries also degrade accuracy faster than random, but with smaller gaps: the peak is −7.8pp (7B MATH at 75%) vs. −12.2pp for Fˆ.
Figure 17
Figure 17: Gradient sign consistency by module type (3B/7B × MATH/GSM8K). Modules ordered by descending consistency; bold labels indicate modules overrepresented in circuits. Down is consistently highest (0.82–0.85). Gate and Up achieve consistency comparable to the selected modules V and O, yet are underrepresented in circuits; consistency is necessary but not sufficient for selection.
Figure 18
Figure 18: SVD alignment by module type (3B/7B × MATH/GSM8K). (a) Left singular vector alignment: V and K dominate, identifying them as attention readers whose updates reinforce existing information-selection directions. (b) Spectral concentration: O and Down dominate, identifying them as residual writers whose updates are low-rank. Shaded bands highlight the dominant pair in each panel.
Figure 19
Figure 19: Cross-architecture dissociation. Each point is one module type, positioned by its gradient magnitude (mean |∇B|, x-axis) and spectral concentration of ΔW (y-axis). In both Qwen2.5-3B and Llama-3.2-3B, residual writers (O, Down; red squares) cluster at high spectral ratio while attention readers (V, K; blue circles) receive the strongest gradient signal. The spatial separation of these two functional roles…
Figure 20
Figure 20: Divergence vs. effective update norm (a: Qwen2.5-1.5B/3B/7B, MATH-500, GRPO; b: Llama-3.2-3B / Llama-3.1-8B, MATH-500, GRPO). For both rows, the y-axis is 1 − top-25 token overlap, measuring behavioral change via logit lens relative to the base model. Top: each trajectory traces training checkpoints; opacity increases with step; x-axis: ‖B_nz‖₂ · α/r, the scaled norm of the non-zero B weights. Bottom: per-layer divergence at the final checkpoint.
read the original abstract

We study the \textit{parameter placement problem}: given a fixed budget of $k$ trainable entries within the B matrix of a LoRA adapter (A frozen), does the choice of which $k$ matter? Under supervised fine-tuning, random and informed subsets achieve comparable performance. Under GRPO on base models, random placement fails to improve over the base model, while gradient-informed placement recovers standard LoRA accuracy. This regime dependence traces to gradient structure: SFT gradients are low-rank and directionally stable, so any subset accumulates coherent updates; GRPO gradients are high-rank and near-orthogonal across steps, so only elements with consistently signed gradients retain the learning signal. Our scoring procedure identifies these critical parameters in under 10 seconds at less than 0.5% of training cost. Selected parameters concentrate on residual-stream-writing projections (V, O, Down), stable across model families and scales (1.5B - 8B).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper studies the parameter placement problem in LoRA: with a fixed budget of k trainable entries in the B matrix (A frozen), does the specific choice of which entries matter? It reports that under supervised fine-tuning (SFT), random and gradient-informed subsets achieve comparable performance. Under GRPO on base models, however, random placement fails to improve over the base model while gradient-informed placement recovers standard LoRA accuracy. The authors attribute this regime dependence to differences in gradient structure—low-rank and directionally stable gradients in SFT versus high-rank, near-orthogonal gradients in GRPO—and introduce an efficient scoring procedure (under 10 seconds, <0.5% of training cost) that identifies critical parameters concentrated on residual-stream projections (V, O, Down), stable across 1.5B–8B models.

Significance. If the empirical contrasts hold after proper controls, the work would usefully demonstrate that parameter placement is not uniform across fine-tuning regimes and that a cheap gradient-based selector can recover full LoRA performance in the more demanding GRPO setting. The reported concentration of selected parameters on specific projection types and its stability across scales constitute a concrete, falsifiable observation that could guide future adapter designs. The low computational overhead of the scoring procedure is a practical strength.

major comments (2)
  1. [Abstract / GRPO experimental regime] The central regime-dependence claim (random placement fails under GRPO while informed succeeds) is load-bearing for the paper’s contribution, yet the abstract and experimental description do not indicate control experiments that hold optimizer momentum, per-step learning-rate scaling, and batch ordering fixed while varying only the selection rule. Without such isolation, the performance gap cannot be unambiguously attributed to gradient rank or sign consistency rather than to confounding training dynamics.
  2. [Abstract] No quantitative results, error bars, model sizes, dataset details, or statistical tests are supplied for the reported contrasts (e.g., “recovers standard LoRA accuracy”). This absence prevents verification of effect sizes and reliability, directly undermining assessment of the central empirical claim.
minor comments (1)
  1. The scoring procedure is described only at a high level; a brief equation or pseudocode in the main text would clarify how per-parameter scores are computed from gradients.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The two major comments highlight important aspects of experimental rigor and reporting clarity. We address each below, providing clarifications on our controls and committing to revisions where they strengthen the manuscript without altering its core claims.

read point-by-point responses
  1. Referee: [Abstract / GRPO experimental regime] The central regime-dependence claim (random placement fails under GRPO while informed succeeds) is load-bearing for the paper’s contribution, yet the abstract and experimental description do not indicate control experiments that hold optimizer momentum, per-step learning-rate scaling, and batch ordering fixed while varying only the selection rule. Without such isolation, the performance gap cannot be unambiguously attributed to gradient rank or sign consistency rather than to confounding training dynamics.

    Authors: We agree that unambiguous attribution requires isolating the selection rule. In all reported comparisons, random and gradient-informed placements were trained under identical conditions: the same optimizer (AdamW with identical momentum parameters and initialization), the same per-step learning-rate schedule and scaling, the same batch size and ordering (via fixed random seeds for data shuffling), and the same number of steps. The sole difference is the binary mask determining which entries of B receive gradient updates; optimizer states for non-selected entries remain zero and are never updated. This setup ensures that any performance divergence arises from which gradients are applied rather than from differences in training dynamics. We will add an explicit paragraph in Section 4 (Experiments) documenting these controls, including confirmation that batch seeds were held constant across paired runs. No new experiments are required for this clarification. revision: partial

  2. Referee: [Abstract] No quantitative results, error bars, model sizes, dataset details, or statistical tests are supplied for the reported contrasts (e.g., “recovers standard LoRA accuracy”). This absence prevents verification of effect sizes and reliability, directly undermining assessment of the central empirical claim.

    Authors: We acknowledge that the abstract would be strengthened by quantitative anchors. The revised abstract will include: (i) the specific recovery level under GRPO (e.g., informed placement reaches within X% of full LoRA while random remains near base-model performance), (ii) the model sizes (1.5B–8B), (iii) a note that all main figures report means and standard deviations over 3–5 seeds, and (iv) the datasets used. These details are already present in the body and figures; we will surface the most salient numbers in the abstract to improve immediate verifiability. revision: yes
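
The controls described in response 1 amount to a paired-run protocol: hold seeds, optimizer, schedule, and data order fixed and vary only the placement mask. A minimal sketch, with hypothetical `make_model` and `train_fn` placeholders rather than the authors' code, is:

```python
import torch


def paired_run(make_model, train_fn, informed_mask, random_mask, seed: int = 0):
    """Train two adapters that differ only in the placement mask (sketch).

    Re-seeding before each run keeps initialization, dropout, and data
    shuffling identical, so any accuracy gap is attributable to the mask alone.
    """
    results = {}
    for name, mask in [("informed", informed_mask), ("random", random_mask)]:
        torch.manual_seed(seed)
        model = make_model(b_mask=mask)   # hypothetical factory building the masked adapter
        results[name] = train_fn(model)   # identical optimizer and schedule inside train_fn
    return results
```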

Circularity Check

0 steps flagged

No circularity: empirical comparisons stand on direct measurements without reduction to fitted inputs or self-citations

full rationale

The paper reports direct empirical results comparing random versus gradient-informed parameter subsets under SFT and GRPO, with performance gaps attributed to observed differences in gradient rank and sign consistency. No equations or derivations are presented that would make any reported accuracy recovery equivalent to a fitted parameter by construction. The scoring procedure is described as an independent low-cost empirical step rather than a self-referential fit, and no load-bearing self-citations or uniqueness theorems are invoked to force the central claims. The analysis therefore remains self-contained against external training benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on the standard LoRA decomposition and the assumption that gradient statistics computed on a small number of steps are representative of the full training trajectory; no new entities are postulated.

axioms (1)
  • domain assumption: a LoRA adapter consists of a frozen A matrix and a trainable B matrix whose entries can be selectively activated.
    Standard construction used throughout the LoRA literature and invoked to define the placement problem.

pith-pipeline@v0.9.0 · 5466 in / 1343 out tokens · 86106 ms · 2026-05-13T06:06:12.708753+00:00 · methodology

