PALoRA: Projection-Adaptive LoRA for Preserving Reasoning in Large Language Models

Albert Bifet; Azzedine Idir Ait Said; Mariam Barry; Mustafa Hayri Bilgin; Soumya Banerjee

arxiv: 2605.24549 · v1 · pith:Q4HTZZNFnew · submitted 2026-05-23 · 💻 cs.AI

Mustafa Hayri Bilgin, Mariam Barry, Albert Bifet, Azzedine Idir Ait Said, Soumya Banerjee

Pith reviewed 2026-06-30 12:59 UTC · model grok-4.3

keywords: PALoRA, LoRA, singular value fine-tuning, reasoning preservation, knowledge injection, parameter-efficient fine-tuning, large language models, spectral adaptation

The pith

PALoRA shields reasoning subspaces in LLMs during factual updates by using an SVF probe to enforce orthogonal LoRA changes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that reasoning ability in models like Llama 3.1 8B and Mistral 7B sits across the full singular spectrum of MLP weights rather than only in the largest directions. It introduces a two-stage method that first trains a singular-value fine-tuning expert on reasoning data to learn a scaling vector marking the critical subspace, then applies LoRA updates for new facts while constraining those updates to stay orthogonal to the marked directions. If this works, models can absorb evolving knowledge without the usual erosion of math, code, and science skills. Experiments report that the approach keeps 95 percent of the expert's reasoning scores on average across benchmarks while matching standard LoRA on factual recall and adding under 0.006 percent extra parameters.

Core claim

Reasoning-critical information is distributed across the singular spectrum of weight matrices; training an SVF expert on a reasoning set produces a singular scaling vector that serves as a reliable geometric probe; performing subsequent LoRA adaptation under an orthogonality constraint relative to this vector injects facts while preserving the target skill.
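
To make the first stage concrete, here is a minimal NumPy sketch of singular value fine-tuning as it is commonly formulated: factor a weight matrix as W = U diag(s) Vᵀ and learn only a per-direction scaling vector z, so the adapted matrix is U diag(s ⊙ z) Vᵀ. The function name `svf_apply`, the toy matrix size, and the hand-set entries of `z` are illustrative assumptions; the paper's exact parameterization and training objective are not given in this summary.

```python
import numpy as np

def svf_apply(W, z):
    """Rescale the singular values of W by z (one scalar per singular direction)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    # Multiplying the columns of U by (s * z) is the same as U @ diag(s * z).
    return (U * (s * z)) @ Vt

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 6))        # toy stand-in for an MLP weight matrix

z = np.ones(6)                         # z = 1 everywhere leaves W unchanged
W_same = svf_apply(W, z)

z[[1, 4]] = [1.3, 0.7]                 # hand-set values; a real probe would be learned
W_expert = svf_apply(W, z)
```

In this reading, the entries of z that move away from 1 during training are the ones the probe marks as skill-relevant, and they need not sit at the top of the spectrum.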

What carries the argument

The frozen singular scaling vector from the SVF expert, which identifies the skill-relevant subspace and supplies the structural orthogonality constraint for LoRA updates.
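
One way such a structural constraint could be realized is sketched below: pick the singular directions whose scaling deviates from 1, then project the LoRA factors so that the low-rank update has no component along the corresponding left or right singular vectors. The selection rule (`tol`), the helper names `protected_indices` and `project_lora`, and the choice to project both factors are assumptions made for illustration, not the paper's confirmed construction.

```python
import numpy as np

def protected_indices(z, tol=0.05):
    """Indices of singular directions the probe moved away from 1 (assumed rule)."""
    return np.flatnonzero(np.abs(z - 1.0) > tol)

def project_lora(B, A, U, Vt, idx):
    """Remove from B and A any component along the protected singular vectors."""
    Up = U[:, idx]                     # protected left singular vectors, (d_out, k)
    Vp = Vt[idx, :].T                  # protected right singular vectors, (d_in, k)
    B_perp = B - Up @ (Up.T @ B)       # (I - Up Up^T) B
    A_perp = A - (A @ Vp) @ Vp.T       # A (I - Vp Vp^T)
    return B_perp, A_perp

rng = np.random.default_rng(0)
d_out, d_in, r = 8, 6, 2
W = rng.standard_normal((d_out, d_in))
U, s, Vt = np.linalg.svd(W, full_matrices=False)

z = np.ones(d_in)
z[[1, 4]] = [1.3, 0.7]                 # toy probe values
idx = protected_indices(z)

B = rng.standard_normal((d_out, r))
A = rng.standard_normal((r, d_in))
B_perp, A_perp = project_lora(B, A, U, Vt, idx)
delta = B_perp @ A_perp                # low-rank update after projection
```

Under this construction the protected singular triplets of W remain exact singular triplets of W + delta, because the update annihilates the protected right vectors and is annihilated by the protected left vectors.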

If this is right

Knowledge injection can be performed with minimal interference to distributed reasoning representations.
Spectral probes trained on one skill set can guide parameter updates for other capabilities.
The added overhead remains below 0.006 percent while outperforming prior spectral PEFT baselines on retention.
The same two-stage pattern applies across 7B-8B models on mathematical, coding, and scientific tasks.
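
As a rough sanity check on the overhead figure, the sketch below counts one learned scalar per singular value for each MLP projection. It assumes the publicly documented shape of Llama 3.1 8B (32 layers, hidden size 4096, MLP intermediate size 14336, three MLP projections per layer, about 8.03 billion parameters in total) and assumes the scaling vectors are the only added parameters. The paper's own accounting is not given in this summary and may differ.

```python
def svf_overhead_percent(n_layers, hidden, intermediate, mats_per_layer, total_params):
    """Added scalars (one per singular value) and their share of the model, in percent."""
    per_matrix = min(hidden, intermediate)   # a d_out x d_in matrix has min(d_out, d_in) singular values
    added = n_layers * mats_per_layer * per_matrix
    return added, 100.0 * added / total_params

# Assumed Llama 3.1 8B configuration; not taken from the paper.
added, pct = svf_overhead_percent(
    n_layers=32, hidden=4096, intermediate=14336, mats_per_layer=3, total_params=8.03e9
)
print(added, f"{pct:.4f}%")
```

Under these assumptions the count is 393,216 scalars, a little under 0.005 percent of the model, which is at least consistent with the reported bound of 0.006 percent.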

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the probe generalizes, similar subspace shielding could protect other skills such as safety alignment during capability updates.
The distributed nature of reasoning suggests that full-rank updates may be needed only inside narrow subspaces rather than everywhere.
Testing the method on continual learning streams would check whether repeated orthogonal injections accumulate without compounding skill loss.

Load-bearing premise

The singular scaling vector learned from reasoning data correctly marks the directions whose avoidance will protect reasoning performance.

What would settle it

A direct comparison showing that LoRA updates forced into the identified subspace produce substantially larger reasoning drops than the orthogonal updates on the same benchmarks.
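
The two arms of that comparison could be constructed as in the sketch below: one update confined to the probe-marked subspace, one confined to its complement, rescaled to the same Frobenius norm so that any difference in reasoning scores is not explained by update size. The function name `ablation_arms`, the toy shapes, and the hard-coded protected indices are assumptions for illustration; the benchmark evaluation itself is not shown.

```python
import numpy as np

def ablation_arms(delta, Up, Vp):
    """Return norm-matched updates inside and outside the protected subspace."""
    P_u, P_v = Up @ Up.T, Vp @ Vp.T                  # projectors onto protected directions
    inside = P_u @ delta @ P_v                       # forced into the marked subspace
    outside = (np.eye(len(P_u)) - P_u) @ delta @ (np.eye(len(P_v)) - P_v)
    inside *= np.linalg.norm(outside) / np.linalg.norm(inside)  # match Frobenius norms
    return inside, outside

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 6))                      # toy weight matrix
U, s, Vt = np.linalg.svd(W, full_matrices=False)
idx = [1, 4]                                         # toy protected directions
Up, Vp = U[:, idx], Vt[idx, :].T

delta = rng.standard_normal((8, 2)) @ rng.standard_normal((2, 6))  # rank-2 candidate update
inside, outside = ablation_arms(delta, Up, Vp)
```

If the probe marks the right directions, applying `inside` should hurt reasoning benchmarks noticeably more than applying `outside`.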

Figures

Figures reproduced from arXiv: 2605.24549 by Albert Bifet, Azzedine Idir Ait Said, Mariam Barry, Mustafa Hayri Bilgin, Soumya Banerjee.

**Figure 2.** Visual evidence for distributed skill information. The SVF scaling factors show a sharp ...

**Figure 3.** Training dynamics for Llama-3.1-8B with the GSM8K expert and 1000 unknown facts.

**Figure 4.** SVF visualization for the Llama-3.1-8B-Instruct GSM8K expert.

**Figure 5.** SVF visualization for the Llama-3.1-8B-Instruct MBPP expert.

**Figure 6.** SVF visualization for the Llama-3.1-8B-Instruct AI2-ARC expert.

**Figure 7.** SVF visualization for the Mistral-7B-Instruct-v0.3 GSM8K expert.

**Figure 8.** SVF visualization for the Mistral-7B-Instruct-v0.3 MBPP expert.

**Figure 9.** SVF visualization for the Mistral-7B-Instruct-v0.3 AI2-ARC expert.

Original abstract

Efficiently updating Large Language Models (LLMs) with new or evolving factual knowledge remains a central challenge, as even parameter-efficient adaptation can erode previously acquired reasoning abilities. This tension reflects a plasticity-stability dilemma: models must incorporate new knowledge while preserving skill-critical representations. In this work, we study this trade-off through the spectral structure of multilayer perceptron weight matrices. We show, both theoretically and empirically, that information essential for reasoning is not localized only in dominant singular directions, but is instead distributed across the singular spectrum. Motivated by this observation, we introduce PALoRA, a two-stage framework for knowledge injection with reduced interference. PALoRA first trains a Singular Value Fine-Tuning (SVF) expert on a reasoning dataset and uses its learned singular scaling vector as a frozen geometric probe to identify components that are critical for the target skill. It then performs factual knowledge injection with Low-Rank Adaptation (LoRA) under a structural orthogonality constraint, ensuring that updates avoid the identified skill-relevant subspace. Across Llama 3.1 8B and Mistral 7B, and across mathematical, coding, and scientific reasoning benchmarks, PALoRA preserves on average 95% of the SVF expert's reasoning performance while maintaining competitive factual recall. It consistently improves skill retention over prior spectral Parameter-Efficient Fine-Tuning (PEFT) methods while adding less than 0.006% parameter overhead.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PALoRA's orthogonality constraint, derived from an SVF probe, is a reasonable answer to the stability-plasticity problem, but the abstract leaves the subspace claim and the 95% result too thin to judge without the full experiments.

The paper's main move is to train an SVF expert on reasoning data, freeze its singular scaling vector as a probe for skill-critical directions, then run factual LoRA updates under an orthogonality constraint so the updates stay out of that subspace. This two-stage setup with the structural constraint is the concrete new piece.

It does a clean job naming the distributed nature of reasoning information across the singular spectrum instead of assuming it lives only in the top directions. That observation directly motivates the method and fits the practical goal of adding facts without tanking math or coding performance.

The soft spots sit in the verification layer. The abstract states both theoretical and empirical backing for the spectral claim and the 95% average preservation on Llama 3.1 8B and Mistral 7B, yet gives no derivation steps, no dataset sizes or splits, and no ablation on whether the scaling vector is stable across different reasoning corpora. The stress-test worry that the vector might latch onto corpus artifacts rather than a general reasoning subspace is not obviously answered by what is shown. If the protected directions shift once the base model receives the factual updates, the orthogonality guarantee could weaken.
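
The corpus-stability question could be checked with something as simple as the sketch below, which compares two scaling vectors defined on the same singular basis: cosine similarity of their deviations from 1, and Jaccard overlap of the directions each would protect. The function name `probe_agreement`, the threshold `tol`, and the two toy vectors are assumptions for illustration and are not values from the paper.

```python
import numpy as np

def probe_agreement(z_a, z_b, tol=0.05):
    """Compare two SVF scaling vectors learned on the same singular basis."""
    da, db = z_a - 1.0, z_b - 1.0                    # deviation from "leave unchanged"
    # Cosine is undefined if either probe is exactly all ones.
    cosine = float(da @ db / (np.linalg.norm(da) * np.linalg.norm(db)))
    sel_a = set(np.flatnonzero(np.abs(da) > tol))    # directions probe A would protect
    sel_b = set(np.flatnonzero(np.abs(db) > tol))    # directions probe B would protect
    union = sel_a | sel_b
    jaccard = len(sel_a & sel_b) / len(union) if union else 1.0
    return cosine, jaccard

# Toy vectors standing in for probes trained on two different reasoning corpora.
z_first = np.array([1.0, 1.3, 1.0, 0.7, 1.0, 1.2])
z_second = np.array([1.0, 1.25, 1.0, 1.0, 0.8, 1.2])
cosine, jaccard = probe_agreement(z_first, z_second)
```

High agreement across corpora would suggest the probe tracks a general reasoning subspace; low agreement would point toward corpus-specific artifacts.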

The work is aimed at people already working on spectral or constrained PEFT for LLMs. A reader who wants another data point on how to trade off knowledge injection against skill retention will find the framing useful, but anyone needing reproducible numbers or a closed-form argument will have to wait for the full text.

It is worth sending to peer review so the experiments and any formal argument can be checked properly.

Referee Report

2 major / 2 minor

Summary. The paper introduces PALoRA, a two-stage PEFT framework for injecting factual knowledge into LLMs without eroding reasoning skills. It first trains an SVF expert on a reasoning dataset to obtain a singular scaling vector that identifies skill-critical directions distributed across the singular spectrum of MLP weights (rather than only dominant ones). It then applies LoRA updates for factual adaptation subject to a structural orthogonality constraint that avoids the identified subspace. Experiments on Llama 3.1 8B and Mistral 7B across math, coding, and science benchmarks report that PALoRA retains on average 95% of the SVF expert's reasoning performance while achieving competitive factual recall and outperforming prior spectral PEFT baselines with <0.006% added parameters.

Significance. If the empirical results and the underlying subspace-identification claim hold under scrutiny, the work would offer a concrete, low-overhead approach to the plasticity-stability trade-off in LLM adaptation. The spectral-distribution observation and the use of a learned geometric probe to enforce orthogonality could inform future PEFT designs that aim to protect distributed capabilities rather than relying on magnitude-based or random projections.

major comments (2)

Abstract and §3 (Method): The central 95% preservation claim rests on the SVF-derived singular scaling vector accurately marking a stable reasoning subspace whose orthogonal complement can safely receive factual LoRA updates. No derivation or ablation is supplied showing why a single scaling vector (rather than the full support of non-zero scalings) suffices, nor how the orthogonality constraint remains valid after the base weights are subsequently modified; this directly affects whether the reported performance numbers can be attributed to the proposed mechanism rather than dataset-specific artifacts.
§4 (Experiments): The abstract states both theoretical and empirical support for the distributed-spectrum claim and the 95% result, yet provides no dataset descriptions, control conditions (e.g., random vs. learned probe), or statistical tests. Without these, the cross-model, cross-benchmark superiority over prior spectral PEFT methods cannot be verified as load-bearing evidence.
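
The question of whether protected directions survive an update can at least be probed at the level of linear algebra. The sketch below measures how far the protected singular triplets of the original matrix are from being singular triplets of the updated one. The function name `triplet_residual`, the toy shapes, and the hard-coded protected indices are assumptions for illustration. The check covers a single update against a fixed SVD basis and says nothing about whether the same directions remain functionally important for reasoning.

```python
import numpy as np

def triplet_residual(W_new, U, s, Vt, idx):
    """Relative violation of W_new v_i = s_i u_i and W_new^T u_i = s_i v_i on protected triplets."""
    Up, Vp, sp = U[:, idx], Vt[idx, :].T, s[idx]
    right = np.linalg.norm(W_new @ Vp - Up * sp)
    left = np.linalg.norm(W_new.T @ Up - Vp * sp)
    return (right + left) / np.linalg.norm(sp)

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 6))                      # toy weight matrix
U, s, Vt = np.linalg.svd(W, full_matrices=False)
idx = np.array([1, 4])                               # toy protected directions
Up, Vp = U[:, idx], Vt[idx, :].T

delta = 0.1 * rng.standard_normal((8, 2)) @ rng.standard_normal((2, 6))
delta_perp = delta - Up @ (Up.T @ delta)             # strip protected left components
delta_perp = delta_perp - (delta_perp @ Vp) @ Vp.T   # strip protected right components

res_raw = triplet_residual(W + delta, U, s, Vt, idx)        # unconstrained update
res_perp = triplet_residual(W + delta_perp, U, s, Vt, idx)  # projected update
```

With the projected update the residual is zero up to floating-point error, while the unconstrained update leaves a visible residual; tracking this quantity over training would be one way to report how well a constraint holds in practice.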

minor comments (2)

[Abstract] Notation for the singular scaling vector and the orthogonality projection operator should be introduced once with explicit definitions and reused consistently; current usage in the abstract is informal.
[§4] The parameter-overhead figure (<0.006%) should be broken down by model size and compared directly to the baselines in a table.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to incorporate additional justification and experimental details.

Referee: Abstract and §3 (Method): The central 95% preservation claim rests on the SVF-derived singular scaling vector accurately marking a stable reasoning subspace whose orthogonal complement can safely receive factual LoRA updates. No derivation or ablation is supplied showing why a single scaling vector (rather than the full support of non-zero scalings) suffices, nor how the orthogonality constraint remains valid after the base weights are subsequently modified; this directly affects whether the reported performance numbers can be attributed to the proposed mechanism rather than dataset-specific artifacts.

Authors: We agree that the manuscript would benefit from an explicit derivation and targeted ablations. While empirical results across models support the single-vector probe, we will add a derivation in §3 based on the SVD decomposition properties showing why the learned scaling vector captures the distributed subspace, plus ablations comparing it to the full non-zero support. We will also include analysis of the orthogonality constraint's invariance under subsequent base-weight updates. These additions will be made in the revised version.

Revision: yes

Referee: §4 (Experiments): The abstract states both theoretical and empirical support for the distributed-spectrum claim and the 95% result, yet provides no dataset descriptions, control conditions (e.g., random vs. learned probe), or statistical tests. Without these, the cross-model, cross-benchmark superiority over prior spectral PEFT methods cannot be verified as load-bearing evidence.

Authors: We acknowledge the need for expanded experimental reporting. The manuscript contains dataset descriptions in §4.1 and cross-model comparisons, but we will add explicit control experiments (random probes and full-spectrum baselines), statistical tests (e.g., significance testing on performance deltas), and fuller dataset details in the revised §4 to strengthen verifiability of the results.

Revision: yes
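
The random-probe control mentioned in this exchange could be built along the lines of the sketch below. It offers two matched controls: the learned scaling values reassigned to random singular directions, and a random set of protected indices of the same size. The helper names `shuffled_probe` and `random_index_probe` and the toy vector are assumptions for illustration, not the authors' planned experiments.

```python
import numpy as np

def shuffled_probe(z, rng):
    """Same scaling values as the learned probe, assigned to random singular directions."""
    return rng.permutation(z)

def random_index_probe(n_directions, n_protected, rng):
    """Random set of protected directions with the same size as the learned set."""
    return np.sort(rng.choice(n_directions, size=n_protected, replace=False))

rng = np.random.default_rng(0)
z_learned = np.array([1.0, 1.3, 1.0, 0.7, 1.0, 1.2])   # toy stand-in for a learned probe
n_protected = int(np.sum(np.abs(z_learned - 1.0) > 0.05))

z_control = shuffled_probe(z_learned, rng)
idx_control = random_index_probe(len(z_learned), n_protected, rng)
```

Both controls keep the number of protected directions fixed, so a retention gap between the learned probe and either control can be attributed to which directions are protected rather than how many.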

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

The paper's method trains an SVF expert on a separate reasoning dataset to obtain a singular scaling vector, then applies an orthogonality constraint during factual LoRA adaptation; the 95% preservation claim is an empirical measurement on held-out mathematical/coding/scientific benchmarks rather than a quantity that reduces by the paper's own equations to the fitted vector itself. No self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the abstract or described framework. The theoretical claim that reasoning information is distributed across the singular spectrum is offered as motivation and is externally falsifiable via the reported benchmark comparisons, leaving the derivation self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that reasoning information is distributed across the singular spectrum and that the SVF-derived scaling vector provides a reliable probe for protecting that information; no free parameters beyond the learned scaling vector or invented entities are mentioned.

free parameters (1)

singular scaling vector
Learned by the SVF expert on the reasoning dataset and used as a frozen probe; its values are data-dependent.

axioms (1)

Domain assumption: Information essential for reasoning is not localized only in dominant singular directions but is distributed across the singular spectrum of multilayer perceptron weight matrices.
This observation is stated as the theoretical motivation for identifying skill-relevant components via SVF.

pith-pipeline@v0.9.1-grok · 5806 in / 1444 out tokens · 42025 ms · 2026-06-30T12:59:41.977810+00:00 · methodology

