PEFT-Arena: Understanding Parameter-Efficient Finetuning from a Stability-Plasticity Perspective

Bernhard Schölkopf; Jiale Kang; Ruotian Peng; Weiyang Liu; Yandong Wen; Yangyi Huang; Zeju Qiu

arxiv: 2605.28819 · v1 · pith:MXIXUY25new · submitted 2026-05-27 · 💻 cs.LG · cs.CL

Yangyi Huang, Ruotian Peng, Zeju Qiu, Jiale Kang, Yandong Wen, Bernhard Schölkopf, Weiyang Liu

Pith reviewed 2026-06-29 13:56 UTC · model grok-4.3

classification 💻 cs.LG cs.CL

keywords: parameter-efficient finetuning · stability-plasticity · large language models · catastrophic forgetting · orthogonal finetuning · benchmark · geometric analysis

The pith

PEFT methods exhibit distinct stability-plasticity profiles, with orthogonal finetuning reaching the best Pareto frontier under matched parameter budgets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper contends that parameter-efficient finetuning of large language models should be judged by the stability-plasticity trade-off: how much it improves a target task while avoiding loss of general pretrained capabilities. It introduces PEFT-Arena to measure both downstream performance and retention of general abilities on the same runs. Results across methods show clear differences in these profiles, and orthogonal finetuning sits farthest along the favorable frontier when parameter counts are comparable. Geometric breakdowns in weight space and activation space then account for why certain methods distort representations more than others. A separate analysis finds that final checkpoints often overshoot an earlier operating point with better retention.

Core claim

PEFT-Arena jointly tracks downstream accuracy and retention of general capabilities to expose the stability-plasticity dilemma in finetuning. Across methods, distinct profiles appear on this trade-off; orthogonal finetuning occupies the most favorable Pareto frontier at comparable parameter budgets. Weight-space spectral analysis shows how each parameterization interacts with the pretrained singular-value structure. Activation-space retention metrics tie forgetting to non-isometric distortion of general-capability representations. Supervised finetuning checkpoints frequently overshoot a superior earlier operating point, which path-wise rewinding can recover post hoc.
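
A Pareto frontier over two axes can be computed directly from per-method scores. The following is a minimal Python sketch, not the benchmark's implementation: the method names and numbers are hypothetical placeholders, and each point is assumed to be a (downstream gain, retention) pair where higher is better on both axes.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (downstream_gain, retention), higher is better


def dominates(a: Point, b: Point) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    return a[0] >= b[0] and a[1] >= b[1] and (a[0] > b[0] or a[1] > b[1])


def pareto_frontier(results: Dict[str, Point]) -> List[str]:
    """Return the names of all non-dominated configurations, sorted by name."""
    frontier = []
    for name, point in results.items():
        is_dominated = any(
            dominates(other, point)
            for other_name, other in results.items()
            if other_name != name
        )
        if not is_dominated:
            frontier.append(name)
    return sorted(frontier)


# Hypothetical scores for illustration only (not taken from the paper).
example = {
    "method_a": (0.30, 0.95),
    "method_b": (0.45, 0.80),
    "method_c": (0.25, 0.90),  # dominated by method_a
    "method_d": (0.45, 0.70),  # dominated by method_b
}

frontier_names = pareto_frontier(example)
```

A method is on the frontier when no other method matches or beats it on both axes at once, so several methods with different trade-offs can share the frontier.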

What carries the argument

PEFT-Arena benchmark that jointly measures downstream performance and general capability retention to quantify the stability-plasticity trade-off.

If this is right

Different PEFT methods produce reliably distinct stability-plasticity profiles.
Orthogonal finetuning achieves the strongest combined downstream gain and retention under matched parameter budgets.
Forgetting correlates with non-isometric distortion of general representations in activation space.
Spectral properties of updates in weight space determine compatibility with pretrained singular-value structure.
Final supervised finetuning checkpoints commonly overshoot an earlier point of better retention that path-wise rewinding can restore.
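
The rewinding idea in the last point can be sketched as a search along the straight line between pretrained and finetuned weights. This is an illustrative assumption about what path-wise rewinding involves; the paper's actual procedure may differ. The names `interpolate`, `rewind`, `target_score`, `retention_score`, and `retention_floor` are hypothetical, and the toy scoring functions stand in for real benchmark evaluations.

```python
from typing import Callable, List, Sequence

Weights = Sequence[float]


def interpolate(theta_pre: Weights, theta_ft: Weights, alpha: float) -> List[float]:
    """Point on the line from pretrained (alpha=0) to finetuned (alpha=1) weights."""
    return [p + alpha * (f - p) for p, f in zip(theta_pre, theta_ft)]


def rewind(
    theta_pre: Weights,
    theta_ft: Weights,
    target_score: Callable[[Weights], float],
    retention_score: Callable[[Weights], float],
    retention_floor: float,
    steps: int = 10,
) -> float:
    """Return the alpha with the best target score whose retention is at or above the floor.

    Falls back to alpha=0.0 (the pretrained model) if no other point qualifies.
    """
    best_alpha, best_target = 0.0, target_score(theta_pre)
    for i in range(1, steps + 1):
        alpha = i / steps
        theta = interpolate(theta_pre, theta_ft, alpha)
        if retention_score(theta) < retention_floor:
            continue  # too much forgetting at this point on the path
        score = target_score(theta)
        if score > best_target:
            best_alpha, best_target = alpha, score
    return best_alpha


# Toy one-parameter example (hypothetical scoring functions): the target task
# improves as the weight approaches 1.0, retention decays with distance from 0.0.
def toy_target(theta: Weights) -> float:
    return -(theta[0] - 1.0) ** 2


def toy_retention(theta: Weights) -> float:
    return 1.0 - theta[0] ** 2


best = rewind([0.0], [1.0], toy_target, toy_retention, retention_floor=0.7)
```

In this toy setting the final checkpoint (alpha = 1.0) violates the retention floor, so the search settles on an intermediate point along the path.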

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

When retention of broad capabilities matters more than peak task accuracy, orthogonal methods become the default choice under fixed budgets.
The geometric diagnostics could be used to screen candidate PEFT designs before full benchmark runs.
Trajectory monitoring during training might replace single final-checkpoint evaluation as standard practice.
The same stability-plasticity lens could be applied to continual learning settings beyond single-task adaptation.

Load-bearing premise

The chosen downstream performance and capability retention metrics accurately reflect the stability-plasticity trade-off without being distorted by task selection, model scale, or evaluation protocol details.

What would settle it

Re-running PEFT-Arena on a new collection of tasks or at a different model scale and obtaining a substantially altered ordering of methods along the Pareto frontier between downstream gain and retention loss.

Figures

Figures reproduced from arXiv: 2605.28819 by Bernhard Schölkopf, Jiale Kang, Ruotian Peng, Weiyang Liu, Yandong Wen, Yangyi Huang, Zeju Qiu.

**Figure 2.** Weight-space spectral retention-adaptation profiling. (a & b) Distributions of projected spectrum changes…

**Figure 3.** Target-general performance trade-off with…

**Figure 4.** Training trajectory versus interpolation tra…

**Figure 5.** Additional weight-space profiles for SFT. We visualize diagonal-projection changes on the pretrained…

**Figure 6.** Additional OFT geometry diagnostic using singular vector alignment (SVA). OFT primarily changes…

**Figure 7.** Full SFT interpolation curves across PEFT methods and target domains. These results complement the…

**Figure 8.** Layer-wise OFT update strength for Qwen2.5-7B math SFT with OFT-b32. Values are normalized by the…

**Figure 9.** Trade-off alternatives achieve consistent improvement on OFT adapters of different sizes. This figure…

Original abstract

Parameter-efficient finetuning (PEFT) has become the standard approach for adapting large language models, yet evaluations largely emphasize downstream accuracy while overlooking the retention of pretrained capabilities. We argue that PEFT should be assessed through the stability-plasticity dilemma: the trade-off between target-task adaptation and resistance to forgetting. We introduce PEFT-Arena, a benchmark that jointly measures downstream performance and general capability retention. Across methods, we find distinct stability-plasticity profiles; under comparable parameter budgets, orthogonal finetuning achieves the most favorable Pareto frontier. To explain these differences, we analyze PEFT updates from two geometric perspectives. In weight space, spectral analysis reveals how parameterizations interact with the pretrained singular-value structure. In activation space, retention metrics show whether finetuning preserves or distorts general-capability representations, with forgetting linked to non-isometric representation distortion. Finally, an analysis shows that final SFT checkpoints often overshoot a better target-retention operating point. Inspired by this, we present case studies of a post-hoc improvement with path-wise rewinding.
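
The weight-space point about singular-value structure rests on a general linear-algebra fact: multiplying a matrix by an orthogonal matrix leaves its singular values unchanged, while an additive update generally shifts them. The sketch below illustrates this on a 2x2 toy matrix. It assumes the multiplicative form W' = R W with R orthogonal, which is how orthogonal finetuning is commonly described; the matrices, the helper names, and the rank-1 `delta` are hypothetical and are not the paper's analysis code.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def singular_values_2x2(m: Matrix) -> Tuple[float, float]:
    """Singular values of a 2x2 matrix via the eigenvalues of M^T M."""
    a = m[0][0] ** 2 + m[1][0] ** 2            # (M^T M)[0][0]
    b = m[0][0] * m[0][1] + m[1][0] * m[1][1]  # (M^T M)[0][1]
    d = m[0][1] ** 2 + m[1][1] ** 2            # (M^T M)[1][1]
    mean = (a + d) / 2.0
    radius = math.sqrt(((a - d) / 2.0) ** 2 + b ** 2)
    return math.sqrt(mean + radius), math.sqrt(max(mean - radius, 0.0))


def rotation(angle: float) -> Matrix:
    """2x2 rotation matrix, which is orthogonal."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s], [s, c]]


W = [[3.0, 1.0], [0.0, 2.0]]        # stand-in "pretrained" weight
W_orth = matmul(rotation(0.7), W)   # multiplicative orthogonal update
delta = [[0.5, 0.5], [0.5, 0.5]]    # rank-1 additive update
W_add = [[W[i][j] + delta[i][j] for j in range(2)] for i in range(2)]

sv_pre = singular_values_2x2(W)
sv_orth = singular_values_2x2(W_orth)  # matches sv_pre up to rounding
sv_add = singular_values_2x2(W_add)    # differs from sv_pre
```

The toy case only shows that the two parameterizations interact differently with the spectrum; it says nothing about which behavior is better for a given task.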

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PEFT-Arena introduces a joint stability-plasticity benchmark for PEFT and reports orthogonal finetuning on the best Pareto frontier, but the experimental details needed to assess that claim are missing from the abstract.

The paper's main addition is PEFT-Arena, which scores methods on both downstream task gains and retention of general capabilities instead of accuracy alone. They observe distinct trade-off profiles across PEFT approaches and place orthogonal finetuning ahead on the frontier at matched parameter budgets. The geometric sections look at singular-value structure in weight space and isometric properties in activation space to link forgetting to non-isometric distortion, and they note that final checkpoints often pass a better retention point, with rewinding as a simple fix.

This framing is useful because most PEFT papers still optimize only for the target task and treat forgetting as an afterthought. Adding a retention suite and showing that methods differ systematically on it gives practitioners a clearer way to choose among options when base-model behavior matters.

The soft spots sit in the strength of the supporting evidence. The abstract states the Pareto result but supplies no information on run counts, variance, exact retention metrics, or how tasks were selected. Without those controls it is difficult to rule out that the observed ordering depends on the particular downstream and retention suites chosen. The geometric analyses are post-hoc, so they describe the data rather than test whether the ranking survives changes in task distribution or evaluation protocol. The stress-test concern about confounding therefore stands until the full paper shows robustness checks.

The rewinding case studies are a concrete, low-cost suggestion that stands on its own.

The work is aimed at researchers and engineers who adapt LLMs under real constraints on forgetting. A reader who needs to pick or design a PEFT method for production use would get direct value from the benchmark setup and the retention focus, even before the specific rankings are confirmed.

It deserves peer review because the joint evaluation idea fills a clear gap and the geometric angle is worth checking, even if the current experiments require more documentation and controls before the frontier claim can be taken as settled.

Referee Report

2 major / 1 minor

Summary. The paper introduces PEFT-Arena, a benchmark jointly measuring downstream task performance and retention of general capabilities to evaluate PEFT methods through the stability-plasticity trade-off. It reports that methods exhibit distinct profiles and that, under comparable parameter budgets, orthogonal finetuning achieves the most favorable Pareto frontier. Geometric analyses in weight space (spectral structure) and activation space (isometry/distortion) are used to explain differences, with an additional observation that final SFT checkpoints often overshoot better operating points and case studies of path-wise rewinding for post-hoc gains.
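
One simple way to picture the activation-space notion of isometry versus distortion is to compare pairwise distances between representations before and after finetuning. The sketch below is an assumed, illustrative metric (the largest absolute log-ratio of pairwise distances), not the retention metric defined in the paper; `distance_distortion` and the toy point sets are hypothetical.

```python
import math
from itertools import combinations
from typing import Sequence

Vector = Sequence[float]


def distance_distortion(before: Sequence[Vector], after: Sequence[Vector]) -> float:
    """Largest absolute log-ratio of pairwise distances between two sets of activations.

    Returns 0.0 when every pairwise distance is preserved (an isometry on these points).
    """
    if len(before) != len(after):
        raise ValueError("before and after must contain the same number of vectors")
    worst = 0.0
    for i, j in combinations(range(len(before)), 2):
        d_before = math.dist(before[i], before[j])
        d_after = math.dist(after[i], after[j])
        if d_before == 0.0:
            continue  # coincident inputs carry no distance information
        if d_after == 0.0:
            return math.inf  # distinct points collapsed onto each other
        worst = max(worst, abs(math.log(d_after / d_before)))
    return worst


base = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
rotated = [(-y, x) for x, y in base]         # 90-degree rotation: an isometry
stretched = [(2.0 * x, y) for x, y in base]  # stretches one axis: not an isometry

rotated_distortion = distance_distortion(base, rotated)
stretched_distortion = distance_distortion(base, stretched)
```

A rotation leaves the score at zero, while stretching one axis raises it, which mirrors the distinction between isometric and non-isometric changes to a representation.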

Significance. If the empirical ordering holds under controlled conditions, the work supplies a useful shift in PEFT evaluation away from accuracy-only metrics toward explicit trade-off measurement, together with geometric diagnostics that link parameterization choices to forgetting. The benchmark and rewinding case studies constitute concrete, falsifiable contributions that could be adopted by the community. The paper is an empirical benchmark study without machine-checked proofs or parameter-free derivations.

major comments (2)

[§3 and §4] §3 (PEFT-Arena definition) and §4 (results): the central claim that orthogonal finetuning achieves the most favorable Pareto frontier requires that the chosen downstream tasks and retention suite isolate the stability-plasticity trade-off. The manuscript provides no analysis or ablation showing robustness of the ranking to task selection or to the precise definition of the retention metrics; without these controls the observed frontier could be an artifact of the particular evaluation protocol.
[§4] §4 (experimental protocol): the abstract and results sections state clear empirical findings on Pareto frontiers but report neither the number of independent runs, random seeds, statistical significance tests, nor exclusion criteria for outlier checkpoints. This information is load-bearing for any claim that one method dominates the frontier.

minor comments (1)

[Figures 2-4] Figure captions and axis labels in the Pareto plots should explicitly state whether points represent single runs or means, and whether error bars or confidence intervals are shown.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for robustness checks and clearer statistical reporting. We address each major comment below and outline the revisions we will make.

Referee: [§3 and §4] §3 (PEFT-Arena definition) and §4 (results): the central claim that orthogonal finetuning achieves the most favorable Pareto frontier requires that the chosen downstream tasks and retention suite isolate the stability-plasticity trade-off. The manuscript provides no analysis or ablation showing robustness of the ranking to task selection or to the precise definition of the retention metrics; without these controls the observed frontier could be an artifact of the particular evaluation protocol.

Authors: We agree that explicit sensitivity analysis would strengthen the claim. The PEFT-Arena tasks were selected to span diverse domains (reasoning, knowledge recall, and instruction following) and the retention suite uses established general-capability benchmarks; however, the manuscript does not contain ablations on task subsets or alternative retention metric formulations. In the revision we will add a dedicated subsection with such ablations, confirming that the Pareto ordering, including the position of orthogonal finetuning, is stable under reasonable variations of the evaluation protocol.

Revision: yes

Referee: [§4] §4 (experimental protocol): the abstract and results sections state clear empirical findings on Pareto frontiers but report neither the number of independent runs, random seeds, statistical significance tests, nor exclusion criteria for outlier checkpoints. This information is load-bearing for any claim that one method dominates the frontier.

Authors: We accept this point. All reported results were obtained from three independent random seeds per method, with means and standard deviations shown in the figures; no checkpoints were excluded as outliers. The revised manuscript will state these details explicitly in §4, add a statistical significance analysis (paired t-tests across seeds) for the frontier comparisons, and include the seed values and run counts in the experimental protocol description.

Revision: yes
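
For readers unfamiliar with the paired t-test mentioned in this response, the statistic is the mean of the per-seed differences divided by its standard error. The sketch below uses only the Python standard library; the per-seed scores are hypothetical and do not come from the paper, and `paired_t_statistic` is an illustrative helper name.

```python
import math
import statistics
from typing import Sequence


def paired_t_statistic(scores_a: Sequence[float], scores_b: Sequence[float]) -> float:
    """t statistic for paired samples: mean difference over its standard error."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally sized samples with at least two pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    if std_err == 0.0:
        raise ValueError("all paired differences are identical; t is undefined")
    return statistics.mean(diffs) / std_err


# Hypothetical per-seed accuracies (in percent) for two methods, paired by seed.
method_a = [71.0, 69.0, 73.0]
method_b = [68.0, 67.0, 69.0]
t_stat = paired_t_statistic(method_a, method_b)
```

With three seeds there are only two degrees of freedom, so the two-sided 5% critical value is about 4.30, and moderate differences between methods can easily fail to reach significance.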

Circularity Check

0 steps flagged

No circularity: empirical benchmark with post-hoc analyses

The paper is an empirical benchmark study that introduces PEFT-Arena to measure downstream performance and general capability retention across PEFT methods, then reports observed stability-plasticity profiles and a Pareto frontier ranking. These are direct measurements from experiments, not quantities derived from fitted parameters or self-referential definitions. The weight-space spectral analysis and activation-space isometry checks are interpretive post-hoc examinations of the empirical results. No equations reduce claims to inputs by construction, no uniqueness theorems are imported via self-citation, and no ansatz or renaming patterns appear. The derivation chain consists of experimental protocol followed by observation and geometric interpretation, remaining self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is an empirical benchmarking study; it introduces no new mathematical axioms, free parameters fitted inside a derivation, or postulated entities. It relies on standard machine-learning assumptions about evaluation metrics and model training.

axioms (1)

Domain assumption: standard assumptions that benchmark metrics for downstream accuracy and capability retention reflect meaningful stability-plasticity trade-offs.
Invoked when defining PEFT-Arena and interpreting its results as profiles.

pith-pipeline@v0.9.1-grok · 5738 in / 1256 out tokens · 43800 ms · 2026-06-29T13:56:17.133498+00:00 · methodology

