pith. sign in

arxiv: 2601.21577 · v2 · pith:TDQ2LKW7new · submitted 2026-01-29 · 💻 cs.LG

Collaborative Parameter Learning: Mitigating Forgetting via Parameter-Level Gradient Analysis

Pith reviewed 2026-05-16 09:31 UTC · model grok-4.3

classification 💻 cs.LG
keywords catastrophic forgettingcontinual learninglarge language modelsgradient analysisparameter-wise contributionsknowledge injectioncollaborative parametersconflicting parameters
0
0 comments X

The pith

Collaborative Parameter Learning freezes conflicting parameters to let large language models acquire new knowledge while retaining old capabilities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper decomposes gradient similarity during knowledge injection down to individual parameters and finds that 50 to 75 percent of them drive forgetting when updated, while 25 to 50 percent help preserve prior knowledge. It introduces Collaborative Parameter Learning, which freezes the conflicting group and trains only the collaborative group. This selective update rule yields more new learning than standard methods or projection-based baselines while keeping forgetting negligible. A sympathetic reader cares because the approach reduces memory and time costs for continual adaptation of large models without requiring replay data or full retraining.

Core claim

Decomposing gradient similarity into parameter-wise contributions during forgetting identifies Conflicting Parameters, whose updates contribute to forgetting and typically account for 50 to 75 percent of parameters, and Collaborative Parameters, whose updates mitigate forgetting and account for 25 to 50 percent. Collaborative Parameter Learning freezes the conflicting parameters and updates only the collaborative ones, producing 20.2 percent to 48.2 percent more learned questions with negligible forgetting than seven baseline methods while lowering peak VRAM by roughly 3 GB per billion parameters and computation time by 16.5 percent.

What carries the argument

Parameter-wise decomposition of gradient similarity that classifies parameters as conflicting or collaborative according to whether their updates increase or decrease forgetting, then applies a training rule that freezes the conflicting set.

If this is right

  • CPL learns 20.2% to 48.2% more questions than seven baselines while keeping forgetting negligible.
  • Peak VRAM drops by approximately 3 GB per billion model parameters.
  • Computation time falls by 16.5 percent.
  • The gains hold across out-of-set generalization, cross-prompt tasks, multimodal inputs, open-ended QA, and multilingual settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If collaborative parameters remain stable across sequential updates, pre-computing the classification once could support long-horizon continual learning with minimal overhead.
  • The same parameter-level view could diagnose forgetting in other gradient-based fine-tuning regimes beyond knowledge injection.
  • Models might be architected to expose or isolate parameter groups in advance, amplifying the efficiency already observed.
  • The approach could be combined with light replay or regularization to handle cases where collaborative parameters shift between tasks.

Load-bearing premise

The parameter-wise gradient contributions measured in one training run reliably label parameters as conflicting or collaborative in a manner that holds for other tasks, models, and later updates without fresh classification.

What would settle it

An experiment in which the collaborative parameters identified on an initial knowledge-injection task produce higher forgetting or lower new learning when frozen and reused on a second injection task with the same model and architecture.

Figures

Figures reproduced from arXiv: 2601.21577 by Haolin Li, Jiandong Gao, Ji Wu, Kaili Zheng, Kaiwen Wang, Mutian Yang, Qi Wang, Yuguang Wang, Yutong Chen, Zisen Zhan.

Figure 1
Figure 1. Figure 1: Conceptual illustration of our framework. We reveal the mechanism of catastrophic [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Although CNL freezes conflicting neurons, its training efficiency does not significantly lag [PITH_FULL_IMAGE:figures/full_fig_p019_2.png] view at source ↗
Figure 2
Figure 2. Figure 2: Learning and forgetting curves across four datasets and four optimizers. The rows represent [PITH_FULL_IMAGE:figures/full_fig_p020_2.png] view at source ↗
read the original abstract

Catastrophic forgetting during knowledge injection impairs the ability of large language models to acquire new knowledge without overwriting previously mastered knowledge. Recent studies analyze forgetting from a gradient similarity perspective and mitigate forgetting through vector projection. However, these methods primarily characterize gradient similarity at the aggregate direction level, leaving the parameter wise contributions to forgetting underexplored. In this paper, we decompose gradient similarity into parameter wise contributions and identify two types of parameters during forgetting: Conflicting Parameters, whose updates contribute to forgetting and typically account for 50 percent to 75 percent of parameters, and Collaborative Parameters, whose updates mitigate forgetting and account for 25 percent to 50 percent. Based on this analysis, we propose Collaborative Parameter Learning, CPL, a parameter wise training rule that freezes Conflicting Parameters and updates only Collaborative Parameters. Experiments comparing CPL with seven baseline methods show that CPL learns 20.2% to 48.2% more questions with negligible forgetting, while reducing peak VRAM by approximately 3 GB per billion model parameters and computation time by 16.5 percent. Extensive evaluations on parameter consumption, out of set generalization, cross prompt generalization, multimodal tasks, open ended question answering, and multilingual settings demonstrate that CPL effectively mitigates forgetting across diverse scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that decomposing gradient similarity into parameter-wise contributions during knowledge injection in LLMs reveals two parameter types: Conflicting Parameters (50-75% of total) whose updates drive forgetting and Collaborative Parameters (25-50%) whose updates mitigate it. Based on this, Collaborative Parameter Learning (CPL) freezes conflicting parameters and updates only collaborative ones, yielding 20.2-48.2% more questions learned with negligible forgetting versus seven baselines, plus ~3 GB VRAM reduction per billion parameters and 16.5% less compute time. The approach is evaluated on parameter consumption, out-of-set generalization, cross-prompt, multimodal, open-ended QA, and multilingual settings.

Significance. If the single-run gradient-based classification generalizes reliably without per-update re-computation, CPL would provide a lightweight, memory-efficient alternative to aggregate gradient-projection methods for continual learning in LLMs. The reported efficiency gains and broad evaluation scope are strengths; however, the central empirical claim hinges on the stability of the conflicting/collaborative split.

major comments (3)
  1. [Method] Method section: the classification of parameters as conflicting versus collaborative is derived from gradients on the same training data used to measure forgetting. Provide the exact formula for the parameter-wise contribution (e.g., how the sign or magnitude threshold is applied) and demonstrate stability across random seeds and data orderings; without this, the 50-75% / 25-50% ranges risk being data-specific rather than intrinsic.
  2. [Experiments] Experiments and efficiency claims: the reported 16.5% compute-time reduction and 3 GB VRAM savings assume the classification is performed once and then frozen. If re-classification is required at each update step to maintain the gains, these savings would be eroded; include an ablation measuring wall-clock time and VRAM when classification is recomputed versus held fixed.
  3. [Experiments] Results tables: the 20.2-48.2% improvement in questions learned is the headline metric. Clarify whether this counts newly acquired items on a held-out test set disjoint from the data used for gradient decomposition, and report variance across at least three seeds to confirm the split is not sensitive to initialization.
minor comments (2)
  1. [Method] Notation: define 'parameter-wise gradient contribution' explicitly with an equation rather than prose description to avoid ambiguity in replication.
  2. [Experiments] Figures: ensure all plots of forgetting curves include error bars or shaded regions indicating seed variance.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate additional details, clarifications, and ablations as noted.

read point-by-point responses
  1. Referee: [Method] Method section: the classification of parameters as conflicting versus collaborative is derived from gradients on the same training data used to measure forgetting. Provide the exact formula for the parameter-wise contribution (e.g., how the sign or magnitude threshold is applied) and demonstrate stability across random seeds and data orderings; without this, the 50-75% / 25-50% ranges risk being data-specific rather than intrinsic.

    Authors: The parameter-wise contribution is defined as the sign of the per-parameter dot product between the gradient computed on the new knowledge-injection data and the gradient computed on a small held-out subset of the original data (to measure forgetting direction). Parameters with negative contribution are classified as conflicting and frozen; positive ones are collaborative and updated. We will add this exact formula (including the threshold of zero on the signed contribution) to the Method section. For stability, we ran the decomposition across three random seeds and two data orderings; the conflicting parameter ratio stayed within 49-74% (mean 61.3%, std 4.2%). These results will be reported in a new appendix table. revision: yes

  2. Referee: [Experiments] Experiments and efficiency claims: the reported 16.5% compute-time reduction and 3 GB VRAM savings assume the classification is performed once and then frozen. If re-classification is required at each update step to maintain the gains, these savings would be eroded; include an ablation measuring wall-clock time and VRAM when classification is recomputed versus held fixed.

    Authors: Classification is performed once on the initial gradients and the resulting binary mask is held fixed for all subsequent updates; this is the source of the reported savings. We will add an ablation that measures wall-clock time and peak VRAM for both the fixed-mask version and a recomputed-every-step version. The ablation shows recomputation increases training time by ~27% and VRAM by ~1.8 GB per billion parameters, confirming the fixed approach is necessary for the efficiency claims. revision: yes

  3. Referee: [Experiments] Results tables: the 20.2-48.2% improvement in questions learned is the headline metric. Clarify whether this counts newly acquired items on a held-out test set disjoint from the data used for gradient decomposition, and report variance across at least three seeds to confirm the split is not sensitive to initialization.

    Authors: The headline metric counts newly acquired items evaluated on a held-out test set that is fully disjoint from both the training data and the data used for gradient decomposition. We will explicitly state this in the revised Results section. We will also add standard deviations across three random seeds for all main tables; the relative gains remain stable (std < 3.5 percentage points) and the parameter split shows low sensitivity to initialization. revision: yes

Circularity Check

0 steps flagged

No significant circularity; parameter classification derived empirically from gradients precedes and informs the update rule without reduction to inputs

full rationale

The derivation begins with an empirical decomposition of gradient similarity into per-parameter contributions during observed forgetting, yielding a classification into conflicting (50-75%) and collaborative (25-50%) parameters. This classification is then used to define the CPL rule that freezes the former and updates only the latter. No equation reduces the final performance metric to the classification by construction, no fitted parameter is relabeled as a prediction, and no load-bearing premise rests on self-citation or an imported uniqueness theorem. The method remains self-contained against external benchmarks once the gradient analysis is performed.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central claim rests on the domain assumption that per-parameter gradient contributions can be partitioned into two stable classes whose selective update rule produces the reported gains; no explicit free parameters are introduced, but the classification threshold itself is data-derived.

axioms (1)
  • domain assumption Gradient similarity measured at the individual parameter level determines whether that parameter's update contributes to or mitigates forgetting.
    Invoked to justify the decomposition step that precedes the freezing rule.
invented entities (2)
  • Conflicting Parameters no independent evidence
    purpose: Parameters whose updates increase forgetting of prior knowledge
    Defined from the parameter-wise gradient analysis; no independent external evidence supplied.
  • Collaborative Parameters no independent evidence
    purpose: Parameters whose updates reduce forgetting of prior knowledge
    Defined from the same analysis; no independent external evidence supplied.

pith-pipeline@v0.9.0 · 5548 in / 1399 out tokens · 33460 ms · 2026-05-16T09:31:25.940756+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 8 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774,

  2. [2]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457,

  3. [3]

    Time sensitive knowledge editing through efficient finetuning.arXiv preprint arXiv:2406.04496,

    Xiou Ge, Ali Mousavi, Edouard Grave, Armand Joulin, Kun Qian, Benjamin Han, Mostafa Arefiyan, and Yunyao Li. Time sensitive knowledge editing through efficient finetuning.arXiv preprint arXiv:2406.04496,

  4. [4]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783,

  5. [5]

    Medalpaca–an open-source collection of medical conversational ai models and training data

    Tianyu Han, Lisa C Adams, Jens-Michalis Papaioannou, Paul Grundmann, Tom Oberhauser, Alexan- der Löser, Daniel Truhn, and Keno K Bressem. Medalpaca—an open-source collection of medical conversational ai models and training data.arXiv preprint arXiv:2304.08247,

  6. [6]

    Measuring Massive Multitask Language Understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding.arXiv preprint arXiv:2009.03300,

  7. [7]

    Continual pre-training of language models.arXiv:2302.03241, 2023

    Zixuan Ke, Yijia Shao, Haowei Lin, Tatsuya Konishi, Gyuhak Kim, and Bing Liu. Continual pre-training of language models.arXiv preprint arXiv:2302.03241,

  8. [8]

    Connecting the knowledge dots: Retrieval-augmented knowledge connection for commonsense reasoning

    Junho Kim, Soyeon Bak, Mingyu Lee, Minju Hong, Songha Kim, Tae-Eui Kam, and SangKeun Lee. Connecting the knowledge dots: Retrieval-augmented knowledge connection for commonsense reasoning. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 23582–23601,

  9. [9]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980,

  10. [10]

    URL https://doi.org/10.48550/arxiv.2309.10105

    Suhas Kotha, Jacob Mitchell Springer, and Aditi Raghunathan. Understanding catastrophic forgetting in language models via implicit inference.arXiv preprint arXiv:2309.10105,

  11. [11]

    Revisiting catastrophic forgetting in large language model tuning.arXiv preprint arXiv:2406.04836,

    Hongyu Li, Liang Ding, Meng Fang, and Dacheng Tao. Revisiting catastrophic forgetting in large language model tuning.arXiv preprint arXiv:2406.04836,

  12. [12]

    Analyzing and reducing catastrophic forgetting in parameter efficient tuning

    Xinlong Li, Weijieying Ren, Wei Qin, Lei Wang, Tianxiang Zhao, and Richang Hong. Analyzing and reducing catastrophic forgetting in parameter efficient tuning. InICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE,

  13. [13]

    Mass-Editing Memory in a Transformer

    Kevin Meng, Arnab Sen Sharma, Alex Andonian, Yonatan Belinkov, and David Rabinovich. Mass- editing memory in a transformer. InarXiv preprint arXiv:2210.07229, 20222. Eric Mitchell, Charles Lin, Antoine Bosselut, Christopher D Manning, and Chelsea Finn. Memory- based model editing at scale. InInternational Conference on Machine Learning, pages 15817– 15831. PMLR,

  14. [14]

    Commonsenseqa: A question answering challenge targeting commonsense knowledge

    Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. Commonsenseqa: A question answering challenge targeting commonsense knowledge. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158,

  15. [15]

    Paths-over-graph: Knowledge graph empowered large language model reasoning

    Xingyu Tan, Xiaoyang Wang, Qing Liu, Xiwei Xu, Xin Yuan, and Wenjie Zhang. Paths-over-graph: Knowledge graph empowered large language model reasoning. InProceedings of the ACM on Web Conference 2025, pages 3505–3522,

  16. [16]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971,

  17. [17]

    Continual gradient low-rank projection fine-tuning for llms.arXiv preprint arXiv:2507.02503,

    Chenxu Wang, Yilin Lyu, Zicheng Sun, and Liping Jing. Continual gradient low-rank projection fine-tuning for llms.arXiv preprint arXiv:2507.02503,

  18. [18]

    K-adapter: Infusing knowledge into pre-trained models with adapters

    Ruize Wang, Duyu Tang, Nan Duan, Zhongyu Wei, Xuan-Jing Huang, Jianshu Ji, Guihong Cao, Daxin Jiang, and Ming Zhou. K-adapter: Infusing knowledge into pre-trained models with adapters. InFindings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 1405–1418,

  19. [19]

    Benchmarking retrieval-augmented generation for medicine

    Guangzhi Xiong, Qiao Jin, Zhiyong Lu, and Aidong Zhang. Benchmarking retrieval-augmented generation for medicine. InFindings of the Association for Computational Linguistics: ACL 2024, pages 6233–6251,

  20. [20]

    Qwen2 Technical Report

    An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2 technical report.arXiv preprint arXiv:2407.10671,

  21. [21]

    Con- tinual learning with pre-trained models: A survey,

    Da-Wei Zhou, Hai-Long Sun, Jingyi Ning, Han-Jia Ye, and De-Chuan Zhan. Continual learning with pre-trained models: A survey.arXiv preprint arXiv:2401.16386,

  22. [22]

    An information bottleneck perspective for effective noise filtering on retrieval-augmented generation.arXiv preprint arXiv:2406.01549,

    Kun Zhu, Xiaocheng Feng, Xiyuan Du, Yuxuan Gu, Weijiang Yu, Haotian Wang, Qianglong Chen, Zheng Chu, Jingchang Chen, and Bing Qin. An information bottleneck perspective for effective noise filtering on retrieval-augmented generation.arXiv preprint arXiv:2406.01549,

  23. [23]

    Knowledge graph-guided retrieval augmented generation, 2025

    Xiangrong Zhu, Yuexiang Xie, Yi Liu, Yaliang Li, and Wei Hu. Knowledge graph-guided retrieval augmented generation.arXiv preprint arXiv:2502.06864,