SMoA: Spectrum Modulation Adapter for Parameter-Efficient Fine-Tuning

Daling Wang; Feiliang Ren; Hinrich Schütze; Mengjie Zhao; Qian Li; Shanru Zhang; Shi Feng; Xing Li; Yongkang Liu; Zijing Wang

arxiv: 2605.21147 · v1 · pith:TMZOORUZnew · submitted 2026-05-20 · 💻 cs.LG · cs.CL

Pith reviewed 2026-05-21 05:56 UTC · model grok-4.3

keywords: parameter-efficient fine-tuning, LoRA, spectrum modulation, Hadamard modulation, spectral blocks, low-rank adaptation, large language models

The pith

SMoA improves fine-tuning performance over LoRA at lower parameter budgets by partitioning layers into aligned spectral blocks with Hadamard modulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SMoA to address the representational limits of low-rank methods like LoRA when ranks are reduced to control parameter count. It partitions each layer into multiple aligned spectral blocks and applies a Hadamard-modulated low-rank branch to each diagonal block. This construction is meant to reach a wider set of pretrained spectral directions without raising the parameter budget. Readers would care if the result holds because it offers a way to adapt large models more effectively when compute and memory are constrained.

Core claim

SMoA partitions the layer into multiple aligned spectral blocks and applies one in-block Hadamard-modulated low-rank branch to each diagonal block, yielding broader coverage of pretrained spectral directions under a smaller parameter budget than a comparable-rank LoRA update.

What carries the argument

Partitioning the weight matrix into aligned spectral blocks and applying per-block Hadamard-modulated low-rank branches.
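The construction can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the modulation matrix `M_k` is assumed here to be the frozen diagonal block of the pretrained weight, and the names `smoa_like_update`, `A_blocks`, `B_blocks`, `rho`, and the sizes are hypothetical choices made only for illustration.

```python
import numpy as np

def smoa_like_update(W0, A_blocks, B_blocks):
    """Block-diagonal update whose k-th diagonal block is M_k * (B_k @ A_k)."""
    d_out, d_in = W0.shape
    K = len(A_blocks)
    s_out, s_in = d_out // K, d_in // K
    delta = np.zeros_like(W0)
    for k, (A, B) in enumerate(zip(A_blocks, B_blocks)):
        rows = slice(k * s_out, (k + 1) * s_out)
        cols = slice(k * s_in, (k + 1) * s_in)
        M_k = W0[rows, cols]               # assumption: frozen modulator taken from W0
        delta[rows, cols] = M_k * (B @ A)  # elementwise (Hadamard) modulation
    return delta

rng = np.random.default_rng(0)
d, K, rho, r = 64, 4, 1, 2                 # toy sizes, not taken from the paper
s = d // K
W0 = rng.standard_normal((d, d))           # stand-in for a pretrained weight

# One rank-rho branch per diagonal block.
A_blocks = [rng.standard_normal((rho, s)) for _ in range(K)]
B_blocks = [rng.standard_normal((s, rho)) for _ in range(K)]
delta_smoa = smoa_like_update(W0, A_blocks, B_blocks)

# Plain LoRA update of rank r for comparison.
A = rng.standard_normal((r, d))
B = rng.standard_normal((d, r))
delta_lora = B @ A

smoa_params = sum(a.size + b.size for a, b in zip(A_blocks, B_blocks))
lora_params = A.size + B.size
smoa_rank = np.linalg.matrix_rank(delta_smoa)
lora_rank = np.linalg.matrix_rank(delta_lora)
print(f"SMoA-like: {smoa_params} params, rank {smoa_rank}")
print(f"LoRA:      {lora_params} params, rank {lora_rank}")
```

In this toy setup the K branches of rank rho together hold K · 2 · s · rho = 2 · d · rho trainable values, the same count as a single LoRA update of rank rho, while the modulated update is no longer limited to rank rho. Whether the paper's construction behaves the same way depends on its actual definition of the modulation.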

If this is right

Broader coverage of pretrained spectral directions is obtained at fixed parameter cost.
Average performance rises relative to LoRA in the lower-budget fine-tuning setting the paper evaluates.
Competitive results are retained against other LoRA-style baselines while using fewer parameters.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The block-wise modulation idea could be tested in other parameter-efficient methods to check whether spectral coverage gains appear more generally.
Measuring the actual singular-value coverage achieved by SMoA versus LoRA on real weight matrices would provide a direct test of the proposed mechanism.
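A rough version of such a coverage measurement might look like the sketch below. The metric is hypothetical and not taken from the paper: it projects an update onto each pretrained singular pair (u_i, v_i) and reports the fraction of pairs with a non-negligible coefficient. The function name `spectral_coverage` and the threshold `rel_tol` are illustrative assumptions.

```python
import numpy as np

def spectral_coverage(W0, delta, rel_tol=1e-3):
    """Fraction of pretrained singular pairs (u_i, v_i) that the update touches.

    Hypothetical diagnostic: c_i = |u_i^T delta v_i|, counted as covered when
    c_i exceeds rel_tol times the largest coefficient.
    """
    U, _, Vt = np.linalg.svd(W0, full_matrices=False)
    # coeffs[i] = |sum_{j,k} U[j, i] * delta[j, k] * Vt[i, k]|
    coeffs = np.abs(np.einsum("ji,jk,ik->i", U, delta, Vt))
    peak = coeffs.max()
    if peak == 0.0:
        return 0.0
    return float(np.mean(coeffs > rel_tol * peak))
```

Applied to the measured update of each adapted layer, a diagnostic of this kind would let the coverage of two adapters be compared at a matched parameter count.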

Load-bearing premise

That spectral block partitioning together with per-block Hadamard modulation enlarges the family of spectrum-aware updates while keeping the total trainable parameters below those of a comparable-rank LoRA.

What would settle it

A head-to-head experiment on standard benchmarks where SMoA shows no average performance gain over LoRA at the same low parameter budget would disprove the central empirical claim.

Figures

Figures reproduced from arXiv: 2605.21147 by Daling Wang, Feiliang Ren, Hinrich Schütze, Mengjie Zhao, Qian Li, Shanru Zhang, Shi Feng, Xing Li, Yongkang Liu, Zijing Wang.

**Figure 1.** Pretrained weights have informative spectral tails. Top: Distribution of normalized singular values ν for layers 5, 15, and 25 of Llama-2-7B. The black dashed curve shows the Marchenko–Pastur prediction for a random matrix of the same shape; values beyond the bulk edge (vertical line) are tail outliers encoding structured and task-relevant information. Bottom: Maximum overlap O_k between right singular vectors a…

**Figure 2.** Overview of the SMoA pipeline (when K = 4). We illustrate the informative spectral tails of pretrained model weights in …

**Figure 3.** Original Llama2-7B vs. SMoA-finetuned spectra and overlaps on BoolQ for the attention key weight matrix. We compare all layers before and after finetuning. The figure shows the 5th, 15th, and 25th layers; results for other layers are in Appendix F. Top: empirical distributions of normalized singular values ν; light brown denotes the pretrained model and orange denotes the finetuned model. The dashed cur…

**Figure 4.** Left: comparison of the measured update rank of …

**Figure 5.** Original Llama2-7B vs. finetuned spectra and overlaps for the key weight matrix (layers 0 to 15). Top: empirical distributions of normalized singular values ν; light brown denotes the pretrained model and orange denotes the finetuned model. The dashed curve is the theoretical random-matrix bulk. Bottom: the corresponding maximum overlap O_k between right singular vectors and activation-covariance eigenvectors, w…

**Figure 6.** Pretrained vs. finetuned spectra and overlaps on the Llama2-7B key weight matrix (layers 16 to 31). Top: empirical distributions of normalized singular values ν; light brown denotes the pretrained model and orange denotes the finetuned model. The dashed curve is the theoretical random-matrix bulk. Bottom: the corresponding maximum overlap O_k between right singular vectors and activation-covariance eigenvectors,…

**Figure 7.** Attention key singular-value histograms of Llama-2-7B based on SMoA fine-tuning in …
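The Marchenko–Pastur bulk edge mentioned in the Figure 1 caption can be sketched numerically. The normalization below is an assumption, since this page does not state how the paper defines ν: singular values are divided by the square root of the larger dimension, and the entry standard deviation `sigma` is supplied by the caller. The function names and the 5% `slack` margin are hypothetical.

```python
import numpy as np

def mp_upper_edge(p, n, sigma=1.0):
    """Upper bulk edge for singular values of X / sqrt(max(p, n)),
    where X is a p x n matrix with i.i.d. entries of standard deviation sigma."""
    gamma = min(p, n) / max(p, n)
    return sigma * (1.0 + np.sqrt(gamma))

def count_tail_outliers(W, sigma, slack=1.05):
    """Count normalized singular values that lie beyond the bulk edge.
    The slack factor absorbs finite-size fluctuations around the edge."""
    p, n = W.shape
    nu = np.linalg.svd(W, compute_uv=False) / np.sqrt(max(p, n))
    return int(np.sum(nu > slack * mp_upper_edge(p, n, sigma)))

rng = np.random.default_rng(0)
p, n = 200, 400
noise = rng.standard_normal((p, n))        # pure random matrix

# Plant one strong rank-1 direction as a stand-in for learned structure.
u = rng.standard_normal(p)
u /= np.linalg.norm(u)
v = rng.standard_normal(n)
v /= np.linalg.norm(v)
spiked = noise + 10.0 * np.sqrt(n) * np.outer(u, v)

outliers_noise = count_tail_outliers(noise, sigma=1.0)
outliers_spiked = count_tail_outliers(spiked, sigma=1.0)
print(outliers_noise, outliers_spiked)
```

The pure random matrix stays inside the bulk, while the planted direction shows up as a singular value beyond the edge, which is the kind of tail outlier the caption describes for pretrained weights.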

Original abstract

As the number of model parameters increases, parameter-efficient fine-tuning (PEFT) has become the go-to choice for tailoring pre-trained large language models. Low-rank Adaptation (LoRA) uses a low-rank update method to simulate full parameter fine-tuning, which is widely used to reduce resource requirements. However, decreasing the rank encounters challenges with limited representational capacity. Theory suggests that LoRA fine-tuning with rank r converges toward the top r singular values of the pre-trained weight matrix. As the rank increases, more principal singular directions are preserved, which generally improves the model's performance. However, a larger rank also introduces more trainable parameters, leading to higher computational cost. To overcome this dilemma, we propose SMoA, a **S**pectrum **Mo**dulation **A**dapter that enlarges the accessible family of spectrum-aware updates under a smaller parameter budget. SMoA partitions the layer into multiple aligned spectral blocks and applies one in-block Hadamard-modulated low-rank branch to each diagonal block, yielding broader coverage of pretrained spectral directions. We provide theoretical analysis and empirical results on multiple tasks. In our experiments, SMoA improves average performance in the current lower-budget setting over LoRA and competitive LoRA-style baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SMoA partitions weights into spectral blocks and adds per-block Hadamard modulation to low-rank branches, but the abstract leaves open whether this actually reaches new singular directions at the same total parameter count as plain LoRA.

The paper's main proposal is to split each layer's weight matrix into aligned spectral blocks and attach one Hadamard-modulated low-rank adapter to each diagonal block. The goal is to cover more of the pretrained singular directions than a standard LoRA while staying inside a tighter parameter budget. That construction is the concrete novelty here; it is not just another rank schedule but a spectrum-aware repartitioning step followed by element-wise scaling inside the blocks.

Referee Report

3 major / 3 minor

Summary. The paper proposes SMoA (Spectrum Modulation Adapter), a PEFT method for large language models. It partitions each pretrained weight matrix into multiple aligned spectral blocks, then applies a single Hadamard-modulated low-rank branch per diagonal block. The central claim is that this construction enlarges the family of spectrum-aware updates relative to standard LoRA while using a strictly smaller parameter budget, yielding broader coverage of pretrained singular directions; the claim is supported by a theoretical analysis and by empirical results showing higher average performance than LoRA and other LoRA-style baselines in low-budget regimes.

Significance. If the theoretical argument is made rigorous and the empirical gains prove robust to rank and block-size choices, SMoA would constitute a meaningful incremental advance in the design of spectrum-aware low-rank adapters. The explicit use of the pretrained singular spectrum to guide both partitioning and modulation is a concrete idea that could be adopted by other PEFT variants; reproducible code and clear ablation tables would further increase its utility to the community.

major comments (3)

[Method] Method section (definition of SMoA): the paper must supply an explicit matrix-level equation showing how the per-block Hadamard modulation vector is constructed from the singular values of the full weight matrix and how the sum of the per-block ranks is constrained to remain below the rank that would be required for a comparable LoRA update. Without this, it is impossible to verify that the construction is not algebraically equivalent to a single low-rank factor of the same total parameter count.
[Theoretical analysis] Theoretical analysis section: the claim that the modulation injects non-zero components into singular vectors outside the top-r subspace of the full matrix must be accompanied by a short proof sketch or a concrete low-dimensional counter-example demonstrating that the effective column space is strictly larger than that of a standard LoRA update with identical total trainable parameters. The current argument appears to rest on the assumption that the blocks and modulation are independent of the low-rank factors; this independence needs to be shown formally.
[Experiments] Experiments section (Table X and Figure Y): the reported performance advantage must be accompanied by the exact rank and block-size settings used for SMoA versus each LoRA baseline so that the parameter-budget comparison is transparent. In addition, standard deviations or confidence intervals over at least three random seeds should be provided; without them the average improvement cannot be assessed for statistical reliability.

minor comments (3)

[Abstract] The abstract states that SMoA “improves average performance … over LoRA and competitive LoRA-style baselines” but does not quantify the improvement or name the tasks; adding one sentence with the magnitude and the main benchmarks would improve clarity.
[Method] Notation for the Hadamard product and the modulation vector should be introduced once in the method section and used consistently thereafter; several passages currently reuse the same symbol for different quantities.
[Related work] The related-work section should explicitly contrast SMoA with prior spectral or block-wise LoRA variants (e.g., those that also exploit singular-value information) to clarify the incremental contribution.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight opportunities to improve clarity in the method definition, rigor in the theoretical analysis, and transparency in the experimental reporting. We address each major comment below and outline the planned revisions.

Referee: [Method] Method section (definition of SMoA): the paper must supply an explicit matrix-level equation showing how the per-block Hadamard modulation vector is constructed from the singular values of the full weight matrix and how the sum of the per-block ranks is constrained to remain below the rank that would be required for a comparable LoRA update. Without this, it is impossible to verify that the construction is not algebraically equivalent to a single low-rank factor of the same total parameter count.

Authors: We agree that an explicit matrix-level formulation would strengthen verifiability. In the revised manuscript we will insert a new displayed equation in Section 3 that defines the per-block Hadamard modulation vector directly from the singular values of the corresponding spectral partition of the pretrained weight matrix and states the global constraint that the sum of the per-block ranks is strictly smaller than the rank that would be needed for a LoRA update achieving the same total parameter count. This addition will make the algebraic distinction from a single low-rank factor transparent.

Revision: yes

Referee: [Theoretical analysis] Theoretical analysis section: the claim that the modulation injects non-zero components into singular vectors outside the top-r subspace of the full matrix must be accompanied by a short proof sketch or a concrete low-dimensional counter-example demonstrating that the effective column space is strictly larger than that of a standard LoRA update with identical total trainable parameters. The current argument appears to rest on the assumption that the blocks and modulation are independent of the low-rank factors; this independence needs to be shown formally.

Authors: We acknowledge that a concise formal demonstration would be helpful. The revised theoretical section will contain a short proof sketch establishing that the combination of spectral partitioning and Hadamard modulation (derived solely from pretrained singular values) produces an effective column space that properly contains directions outside the top-r subspace of the full matrix, even under a smaller total parameter budget. We will also supply a concrete 4-by-4 low-dimensional counter-example that isolates the contribution of the block-wise modulation and clarifies the independence between the modulation vectors and the trainable low-rank factors.

Revision: yes

Referee: [Experiments] Experiments section (Table X and Figure Y): the reported performance advantage must be accompanied by the exact rank and block-size settings used for SMoA versus each LoRA baseline so that the parameter-budget comparison is transparent. In addition, standard deviations or confidence intervals over at least three random seeds should be provided; without them the average improvement cannot be assessed for statistical reliability.

Authors: We agree that explicit hyper-parameter disclosure and statistical reporting are necessary for reproducibility and fair comparison. In the revised version we will augment Table X and Figure Y with the precise rank and block-size values employed for SMoA and every baseline, together with the resulting parameter counts. We will additionally report mean performance plus standard deviation over three independent random seeds for all tasks, allowing readers to evaluate the reliability of the observed gains.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity: SMoA construction and claims remain independent of fitted inputs or self-citation chains.

full rationale

The paper defines SMoA via an explicit architectural choice—partitioning into aligned spectral blocks with per-block Hadamard-modulated low-rank branches—and supports its claim of broader spectral coverage under reduced parameter count through theoretical analysis plus direct empirical comparison against external LoRA baselines. No equation or claim in the provided text reduces the reported gains or the 'enlarged family of spectrum-aware updates' to a quantity defined by the method's own fitted parameters, nor does any load-bearing premise rest on a self-citation whose content is itself unverified. The derivation is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The abstract invokes the existing theoretical claim that LoRA rank-r updates converge to the top-r singular directions; the new design elements (spectral blocks, Hadamard modulation) are introduced without explicit free parameters or new invented entities listed.

axioms (1)

Domain assumption: LoRA fine-tuning with rank r converges toward the top r singular values of the pre-trained weight matrix.
Stated directly in the abstract as background theory that motivates the need for broader spectral coverage.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "SMoA partitions the layer into multiple aligned spectral blocks and applies one in-block Hadamard-modulated low-rank branch to each diagonal block"

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Paper passage: "Proposition 1 (Spectrum-Aware Rank Ceiling) ... rank(ΔW) ≤ U := Σ min(s_k, ρ rank(M_k))"
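
The ceiling quoted from Proposition 1 can be motivated by two standard linear-algebra facts. The derivation below is a sketch under assumptions this page does not confirm: that ΔW is block diagonal, that its k-th block is M_k ∘ (B_k A_k) with a fixed modulation matrix M_k, that s_k is the smaller side length of block k, and that ρ is the rank of each low-rank branch.

```latex
% Assumed form of the update (not confirmed by the source):
\Delta W = \operatorname{blockdiag}\bigl(M_1 \circ (B_1 A_1), \dots, M_K \circ (B_K A_K)\bigr),
\qquad \operatorname{rank}(B_k A_k) \le \rho .

% Fact 1: the Hadamard product is a submatrix of the Kronecker product, hence
\operatorname{rank}(X \circ Y) \le \operatorname{rank}(X)\,\operatorname{rank}(Y).

% Fact 2: the rank of a block-diagonal matrix is the sum of its block ranks,
% and a block whose smaller side is s_k has rank at most s_k.
\operatorname{rank}(\Delta W)
  = \sum_{k=1}^{K} \operatorname{rank}\bigl(M_k \circ (B_k A_k)\bigr)
  \le \sum_{k=1}^{K} \min\bigl(s_k,\ \rho\,\operatorname{rank}(M_k)\bigr) = U .
```

For comparison, an unmodulated LoRA update BA of rank r satisfies rank(BA) ≤ r, so under this reading the ceiling U can exceed ρ as soon as there is more than one block or the M_k have rank above one.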

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · 14 internal anchors

[1]

Winogrande: An adversarial winograd schema challenge at scale. 2019

work page 2019
[2]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

Piqa: Reasoning about physical commonsense in natural language

Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about physical commonsense in natural language. InThirty-Fourth AAAI Conference on Artificial Intelligence, 2020

work page 2020
[5]

B ool Q : Exploring the Surprising Difficulty of Natural Yes/No Questions

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Jill Burstein, Christy Doran, and Thamar Solorio, editors,Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Huma...

work page doi:10.18653/v1/n19-1300 2019
[6]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge.arXiv preprint arXiv:1803.05457, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[8]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[9]

Qlora: Efficient finetuning of quantized llms.Advances in neural information processing systems, 36:10088– 10115, 2023

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms.Advances in neural information processing systems, 36:10088– 10115, 2023

work page 2023
[10]

The second conversational intelligence challenge (convai2)

Emily Dinan, Varvara Logacheva, Valentin Malykh, Alexander Miller, Kurt Shuster, Jack Ur- banek, Douwe Kiela, Arthur Szlam, Iulian Serban, Ryan Lowe, et al. The second conversational intelligence challenge (convai2). InThe NeurIPS’18 Competition: From Machine Learning to Intelligent Conversations, pages 187–208. Springer, 2019

work page 2019
[11]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[12]

Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey

Zeyu Han, Chao Gao, Jinyang Liu, Jeff Zhang, and Sai Qian Zhang. Parameter-efficient fine-tuning for large models: A comprehensive survey.arXiv preprint arXiv:2403.14608, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[13]

Hayou, N

Soufiane Hayou, Nikhil Ghosh, and Bin Yu. Lora+: Efficient low rank adaptation of large models.arXiv preprint arXiv:2402.12354, 2024

work page arXiv 2024
[14]

Training compute-optimal large language models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. InProceedings of the 36th International Conference on Neural Information Processing Systems, pages 30016–30030, 2022

work page 2022
[15]

Parameter-efficient transfer learning for nlp

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. InInternational conference on machine learning, pages 2790–2799. PMLR, 2019. 10

work page 2019
[16]

Lora: Low-rank adaptation of large language models.ICLR, 1 (2):3, 2022

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 1 (2):3, 2022

work page 2022
[17]

Llm-adapters: An adapter family for parameter-efficient fine-tuning of large language models

Zhiqiang Hu, Lei Wang, Yihuai Lan, Wanyu Xu, Ee-Peng Lim, Lidong Bing, Xing Xu, Soujanya Poria, and Roy Lee. Llm-adapters: An adapter family for parameter-efficient fine-tuning of large language models. InProceedings of the 2023 conference on empirical methods in natural language processing, pages 5254–5276, 2023

work page 2023
[18]

Hira: Parameter-efficient hadamard high-rank adaptation for large language models

Qiushi Huang, Tom Ko, Zhan Zhuang, Lilian Tang, and Yu Zhang. Hira: Parameter-efficient hadamard high-rank adaptation for large language models. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[19]

Mora: High-rank updating for parameter- efficient fine-tuning.arXiv preprint arXiv:2405.12130, 2024

Ting Jiang, Shaohan Huang, Shengyue Luo, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, Qi Zhang, Deqing Wang, et al. Mora: High-rank updating for parameter- efficient fine-tuning.arXiv preprint arXiv:2405.12130, 2024

work page arXiv 2024
[20]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001
[21]

Vera: Vector-based random matrix adaptation.arXiv preprint arXiv:2310.11454, 2023

Dawid J Kopiczko, Tijmen Blankevoort, and Yuki M Asano. Vera: Vector-based random matrix adaptation.arXiv preprint arXiv:2310.11454, 2023

work page arXiv 2023
[22]

The power of scale for parameter-efficient prompt tuning

Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3045–3059, 2021

work page 2021
[23]

Prefix-tuning: Optimizing continuous prompts for generation

Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582–4597, 2021

work page 2021
[24]

Stack more layers differently: High-rank training through low-rank updates

Vladislav Lialin, Sherin Muckatira, Namrata Shivagunde, and Anna Rumshisky. Stack more layers differently: High-rank training through low-rank updates. 2023

work page 2023
[25]

Relora: High- rank training through low-rank updates.arXiv preprint arXiv:2307.05695, 2023

Vladislav Lialin, Namrata Shivagunde, Sherin Muckatira, and Anna Rumshisky. Relora: High- rank training through low-rank updates.arXiv preprint arXiv:2307.05695, 2023

work page arXiv 2023
[26]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[27]

Dora: weight-decomposed low-rank adaptation

Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. Dora: weight-decomposed low-rank adaptation. In Proceedings of the 41st International Conference on Machine Learning, pages 32100–32121, 2024

work page 2024
[28]

P-tuning v2: Prompt tuning can be comparable to fine-tuning universally across scales and tasks

Xiang Liu et al. P-tuning v2: Prompt tuning can be comparable to fine-tuning universally across scales and tasks. InACL, 2021

work page 2021
[29]

P- tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks

Xiao Liu, Kaixuan Ji, Yicheng Fu, Weng Tam, Zhengxiao Du, Zhilin Yang, and Jie Tang. P- tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 61–68, 2022

work page 2022
[30]

Gpt understands, too.AI Open, 5:208–215, 2024

Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. Gpt understands, too.AI Open, 5:208–215, 2024

work page 2024
[31]

Look within or look beyond? a theoretical comparison between parameter- efficient and full fine-tuning.arXiv preprint arXiv:2505.22355, 2025

Yongkang Liu, Xingle Xu, Ercong Nie, Zijing Wang, Shi Feng, Daling Wang, Qian Li, and Hinrich Schütze. Look within or look beyond? a theoretical comparison between parameter- efficient and full fine-tuning.arXiv preprint arXiv:2505.22355, 2025. 11

work page arXiv 2025
[32]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[33]

Distribution of eigenvalues for some sets of random matrices.Mathematics of the USSR-Sbornik, 1(4):457–483, 1967

Vladimir A Marˇcenko and Leonid Andreevich Pastur. Distribution of eigenvalues for some sets of random matrices.Mathematics of the USSR-Sbornik, 1(4):457–483, 1967

work page 1967
[34]

Can a suit of armor conduct electricity? a new dataset for open book question answering

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. InProceedings of the 2018 conference on empirical methods in natural language processing, pages 2381–2391, 2018

work page 2018
[35]

Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

work page 2022
[36]

Bleu: a method for automatic evaluation of machine translation

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. InProceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002

work page 2002
[37]

Melora: Mini-ensemble low-rank adapters for parameter- efficient fine-tuning

Pengjie Ren, Chengshun Shi, Shiguang Wu, Mengqi Zhang, Zhaochun Ren, Maarten Rijke, Zhumin Chen, and Jiahuan Pei. Melora: Mini-ensemble low-rank adapters for parameter- efficient fine-tuning. InProceedings of the 62nd Annual Meeting of the Association for Compu- tational Linguistics (Volume 1: Long Papers), pages 3052–3064, 2024

work page 2024
[38]

Rank-accuracy trade-off for lora: A gradient-flow analysis

Michael Rushka and Diego Klabjan. Rank-accuracy trade-off for lora: A gradient-flow analysis. arXiv preprint arXiv:2602.10212, 2026

work page arXiv 2026
[39]

SocialIQA: Commonsense Reasoning about Social Interactions

Maarten Sap, Hannah Rashkin, Derek Chen, Ronan LeBras, and Yejin Choi. Socialiqa: Com- monsense reasoning about social interactions.arXiv preprint arXiv:1904.09728, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1904
[40]

Small singular values matter: A random matrix analysis of transformer models.arXiv preprint arXiv:2410.17770, 2024

Max Staats, Matthias Thamm, and Bernd Rosenow. Small singular values matter: A random matrix analysis of transformer models.arXiv preprint arXiv:2410.17770, 2024

work page arXiv 2024
[41]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timo- thée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[42]

Finetuned language models are zero-shot learners

Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. In International Conference on Learning Representations

work page
[43]

Batched low-rank adaptation of foundation models

Yeming Wen and Swarat Chaudhuri. Batched low-rank adaptation of foundation models. In The Twelfth International Conference on Learning Representations

work page
[44]

Chain of LoRA: Efficient fine-tuning of language models via residual learning

Wenhan Xia, Chengwei Qin, and Elad Hazan. Chain of LoRA: Efficient fine-tuning of language models via residual learning. arXiv preprint arXiv:2401.04151, 2024

[45]

SSMLoRA: Enhancing low-rank adaptation with state space model

Jiayang Yu, Yihang Zhang, Bin Wang, Peiqin Lin, Yongkang Liu, and Shi Feng. SSMLoRA: Enhancing low-rank adaptation with state space model. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 4493–4506, 2025

[46]

BitFit: Simple parameter-efficient fine-tuning for transformer-based masked language-models

Elad Ben Zaken, Yoav Goldberg, and Shauli Ravfogel. BitFit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1–9, 2022

[47]

HellaSwag: Can a machine really finish your sentence?

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019

[48]

AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning

Qingru Zhang, Minshuo Chen, Alexander Bukharin, Nikos Karampatziakis, Pengcheng He, Yu Cheng, Weizhu Chen, and Tuo Zhao. AdaLoRA: Adaptive budget allocation for parameter-efficient fine-tuning. arXiv preprint arXiv:2303.10512, 2023

[49]

OPT: Open Pre-trained Transformer Language Models

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022

[50]

BERTScore: Evaluating Text Generation with BERT

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675, 2019

[51]

Time-varying LoRA: Towards effective cross-domain fine-tuning of diffusion models

Zhan Zhuang, Yulong Zhang, Xuehao Wang, Jiangang Lu, Ying Wei, and Yu Zhang. Time-varying LoRA: Towards effective cross-domain fine-tuning of diffusion models. Advances in Neural Information Processing Systems, 37:73920–73951, 2024

[52]

Come together, but not right now: A progressive strategy to boost low-rank adaptation

Zhan Zhuang, Xiequn Wang, Wei Li, Yulong Zhang, Qiushi Huang, Shuhao Chen, Xuehao Wang, Yanbin Wei, Yuhe Nie, Kede Ma, et al. Come together, but not right now: A progressive strategy to boost low-rank adaptation. In 42nd International Conference on Machine Learning, ICML 2025, 2025

A Datasets

We evaluate three categories of tasks: commonsense reasoning...

[53]

Limitations

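If

$$\operatorname{rank}\big(\widetilde{\Delta W}^{\star}\big) > r, \tag{58}$$

then

$$P_{\mathrm{out}}^{\top}\, \widetilde{\Delta W}^{\star}\, P_{\mathrm{in}} \notin \mathcal{F}_{\mathrm{LoRA}}(r). \tag{59}$$

Therefore, any block-aligned anchor-modulated target whose rank exceeds $r$ serves as a witness that

$$\mathcal{F}_{\mathrm{SMoA}}(W_0; r, K) \setminus \mathcal{F}_{\mathrm{LoRA}}(r) \neq \emptyset. \tag{60}$$

Proof. Since

$$\operatorname{rank}(C_k) \le \rho, \tag{61}$$

each matrix $C_k$ admits a rank-$\rho$ factorization

$$C_k = B_k A_k, \quad A_k \in \mathbb{R}^{\rho \times d_{\mathrm{in}}/K}, \quad B_k \in \mathbb{R}^{d_{\mathrm{out}}/K \times \rho}. \tag{62}$$

Substituting these factorizations into th...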
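The rank argument in the fragment above can be checked numerically with a small sketch. It assumes a plain block-diagonal update whose diagonal blocks are products $C_k = B_k A_k$, reads the LoRA family as "matrices of rank at most $r$", and treats the outer maps as permutations; the excerpt states none of these explicitly. The Hadamard modulation is left out because it is not specified here. The sizes ($K=3$, $\rho=2$, $r=4$, $4 \times 4$ blocks), the factor entries, and the helper names `matmul`, `rank`, and `block_diag` are hypothetical.

```python
from fractions import Fraction

def matmul(B, A):
    """Multiply B (m x p) by A (p x n); matrices are lists of rows."""
    return [[sum(b * a for b, a in zip(row, col)) for col in zip(*A)] for row in B]

def rank(M):
    """Exact rank via Gaussian elimination over the rationals."""
    rows = [[Fraction(x) for x in row] for row in M]
    rk = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(rk, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue  # no pivot in this column
        rows[rk], rows[pivot] = rows[pivot], rows[rk]
        for i in range(rk + 1, len(rows)):
            if rows[i][col] != 0:
                f = rows[i][col] / rows[rk][col]
                rows[i] = [x - f * y for x, y in zip(rows[i], rows[rk])]
        rk += 1
        if rk == len(rows):
            break
    return rk

def block_diag(blocks):
    """Place equally shaped blocks on the diagonal of a zero matrix."""
    h, w, n = len(blocks[0]), len(blocks[0][0]), len(blocks)
    out = [[0] * (w * n) for _ in range(h * n)]
    for k, blk in enumerate(blocks):
        for i in range(h):
            for j in range(w):
                out[k * h + i][k * w + j] = blk[i][j]
    return out

# Hypothetical sizes: K blocks, per-block rank rho, LoRA rank budget r.
K, rho, r = 3, 2, 4
B_k = [[1, 0], [0, 1], [1, 1], [1, 2]]  # (d_out/K) x rho, full column rank
A_k = [[1, 0, 1, 1], [0, 1, 1, 2]]      # rho x (d_in/K), full row rank

# C_k = B_k A_k, scaled per block so the blocks are not identical.
blocks = [[[(k + 1) * x for x in row] for row in matmul(B_k, A_k)] for k in range(K)]
delta_w = block_diag(blocks)
permuted = delta_w[::-1]  # reversing the rows is one example of a permutation

d_out, d_in = len(delta_w), len(delta_w[0])
block_params = K * rho * (d_out // K + d_in // K)  # parameters in all B_k, A_k
lora_params = rho * (d_out + d_in)                 # rank-rho LoRA on the full matrix

print("rank of each block:", [rank(b) for b in blocks])
print("rank of block-diagonal update:", rank(delta_w))
print("rank after row permutation:", rank(permuted))
print("exceeds r:", rank(delta_w) > r)
print("parameters (block factors, rank-rho LoRA):", block_params, lora_params)
```

In this toy setting the block factors use as many parameters as a rank-$\rho$ LoRA update on the same $12 \times 12$ matrix, while the assembled update has rank $K\rho$. Any family restricted to rank at most $r < K\rho$ therefore cannot contain it, and permuting rows or columns does not change that.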


