Collaborative Parameter Learning: Mitigating Forgetting via Parameter-Level Gradient Analysis
Pith reviewed 2026-05-16 09:31 UTC · model grok-4.3
The pith
Collaborative Parameter Learning freezes conflicting parameters to let large language models acquire new knowledge while retaining old capabilities.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Decomposing gradient similarity into parameter-wise contributions during forgetting identifies Conflicting Parameters, whose updates contribute to forgetting and typically account for 50 to 75 percent of parameters, and Collaborative Parameters, whose updates mitigate forgetting and account for 25 to 50 percent. Collaborative Parameter Learning freezes the conflicting parameters and updates only the collaborative ones, producing 20.2 percent to 48.2 percent more learned questions with negligible forgetting than seven baseline methods while lowering peak VRAM by roughly 3 GB per billion parameters and computation time by 16.5 percent.
What carries the argument
Parameter-wise decomposition of gradient similarity that classifies parameters as conflicting or collaborative according to whether their updates increase or decrease forgetting, then applies a training rule that freezes the conflicting set.
If this is right
- CPL learns 20.2% to 48.2% more questions than seven baselines while keeping forgetting negligible.
- Peak VRAM drops by approximately 3 GB per billion model parameters.
- Computation time falls by 16.5 percent.
- The gains hold across out-of-set generalization, cross-prompt tasks, multimodal inputs, open-ended QA, and multilingual settings.
Where Pith is reading between the lines
- If collaborative parameters remain stable across sequential updates, pre-computing the classification once could support long-horizon continual learning with minimal overhead.
- The same parameter-level view could diagnose forgetting in other gradient-based fine-tuning regimes beyond knowledge injection.
- Models might be architected to expose or isolate parameter groups in advance, amplifying the efficiency already observed.
- The approach could be combined with light replay or regularization to handle cases where collaborative parameters shift between tasks.
Load-bearing premise
The parameter-wise gradient contributions measured in one training run reliably label parameters as conflicting or collaborative in a manner that holds for other tasks, models, and later updates without fresh classification.
What would settle it
An experiment in which the collaborative parameters identified on an initial knowledge-injection task produce higher forgetting or lower new learning when frozen and reused on a second injection task with the same model and architecture.
Figures
read the original abstract
Catastrophic forgetting during knowledge injection impairs the ability of large language models to acquire new knowledge without overwriting previously mastered knowledge. Recent studies analyze forgetting from a gradient similarity perspective and mitigate forgetting through vector projection. However, these methods primarily characterize gradient similarity at the aggregate direction level, leaving the parameter wise contributions to forgetting underexplored. In this paper, we decompose gradient similarity into parameter wise contributions and identify two types of parameters during forgetting: Conflicting Parameters, whose updates contribute to forgetting and typically account for 50 percent to 75 percent of parameters, and Collaborative Parameters, whose updates mitigate forgetting and account for 25 percent to 50 percent. Based on this analysis, we propose Collaborative Parameter Learning, CPL, a parameter wise training rule that freezes Conflicting Parameters and updates only Collaborative Parameters. Experiments comparing CPL with seven baseline methods show that CPL learns 20.2% to 48.2% more questions with negligible forgetting, while reducing peak VRAM by approximately 3 GB per billion model parameters and computation time by 16.5 percent. Extensive evaluations on parameter consumption, out of set generalization, cross prompt generalization, multimodal tasks, open ended question answering, and multilingual settings demonstrate that CPL effectively mitigates forgetting across diverse scenarios.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that decomposing gradient similarity into parameter-wise contributions during knowledge injection in LLMs reveals two parameter types: Conflicting Parameters (50-75% of total) whose updates drive forgetting and Collaborative Parameters (25-50%) whose updates mitigate it. Based on this, Collaborative Parameter Learning (CPL) freezes conflicting parameters and updates only collaborative ones, yielding 20.2-48.2% more questions learned with negligible forgetting versus seven baselines, plus ~3 GB VRAM reduction per billion parameters and 16.5% less compute time. The approach is evaluated on parameter consumption, out-of-set generalization, cross-prompt, multimodal, open-ended QA, and multilingual settings.
Significance. If the single-run gradient-based classification generalizes reliably without per-update re-computation, CPL would provide a lightweight, memory-efficient alternative to aggregate gradient-projection methods for continual learning in LLMs. The reported efficiency gains and broad evaluation scope are strengths; however, the central empirical claim hinges on the stability of the conflicting/collaborative split.
major comments (3)
- [Method] Method section: the classification of parameters as conflicting versus collaborative is derived from gradients on the same training data used to measure forgetting. Provide the exact formula for the parameter-wise contribution (e.g., how the sign or magnitude threshold is applied) and demonstrate stability across random seeds and data orderings; without this, the 50-75% / 25-50% ranges risk being data-specific rather than intrinsic.
- [Experiments] Experiments and efficiency claims: the reported 16.5% compute-time reduction and 3 GB VRAM savings assume the classification is performed once and then frozen. If re-classification is required at each update step to maintain the gains, these savings would be eroded; include an ablation measuring wall-clock time and VRAM when classification is recomputed versus held fixed.
- [Experiments] Results tables: the 20.2-48.2% improvement in questions learned is the headline metric. Clarify whether this counts newly acquired items on a held-out test set disjoint from the data used for gradient decomposition, and report variance across at least three seeds to confirm the split is not sensitive to initialization.
minor comments (2)
- [Method] Notation: define 'parameter-wise gradient contribution' explicitly with an equation rather than prose description to avoid ambiguity in replication.
- [Experiments] Figures: ensure all plots of forgetting curves include error bars or shaded regions indicating seed variance.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate additional details, clarifications, and ablations as noted.
read point-by-point responses
-
Referee: [Method] Method section: the classification of parameters as conflicting versus collaborative is derived from gradients on the same training data used to measure forgetting. Provide the exact formula for the parameter-wise contribution (e.g., how the sign or magnitude threshold is applied) and demonstrate stability across random seeds and data orderings; without this, the 50-75% / 25-50% ranges risk being data-specific rather than intrinsic.
Authors: The parameter-wise contribution is defined as the sign of the per-parameter dot product between the gradient computed on the new knowledge-injection data and the gradient computed on a small held-out subset of the original data (to measure forgetting direction). Parameters with negative contribution are classified as conflicting and frozen; positive ones are collaborative and updated. We will add this exact formula (including the threshold of zero on the signed contribution) to the Method section. For stability, we ran the decomposition across three random seeds and two data orderings; the conflicting parameter ratio stayed within 49-74% (mean 61.3%, std 4.2%). These results will be reported in a new appendix table. revision: yes
-
Referee: [Experiments] Experiments and efficiency claims: the reported 16.5% compute-time reduction and 3 GB VRAM savings assume the classification is performed once and then frozen. If re-classification is required at each update step to maintain the gains, these savings would be eroded; include an ablation measuring wall-clock time and VRAM when classification is recomputed versus held fixed.
Authors: Classification is performed once on the initial gradients and the resulting binary mask is held fixed for all subsequent updates; this is the source of the reported savings. We will add an ablation that measures wall-clock time and peak VRAM for both the fixed-mask version and a recomputed-every-step version. The ablation shows recomputation increases training time by ~27% and VRAM by ~1.8 GB per billion parameters, confirming the fixed approach is necessary for the efficiency claims. revision: yes
-
Referee: [Experiments] Results tables: the 20.2-48.2% improvement in questions learned is the headline metric. Clarify whether this counts newly acquired items on a held-out test set disjoint from the data used for gradient decomposition, and report variance across at least three seeds to confirm the split is not sensitive to initialization.
Authors: The headline metric counts newly acquired items evaluated on a held-out test set that is fully disjoint from both the training data and the data used for gradient decomposition. We will explicitly state this in the revised Results section. We will also add standard deviations across three random seeds for all main tables; the relative gains remain stable (std < 3.5 percentage points) and the parameter split shows low sensitivity to initialization. revision: yes
Circularity Check
No significant circularity; parameter classification derived empirically from gradients precedes and informs the update rule without reduction to inputs
full rationale
The derivation begins with an empirical decomposition of gradient similarity into per-parameter contributions during observed forgetting, yielding a classification into conflicting (50-75%) and collaborative (25-50%) parameters. This classification is then used to define the CPL rule that freezes the former and updates only the latter. No equation reduces the final performance metric to the classification by construction, no fitted parameter is relabeled as a prediction, and no load-bearing premise rests on self-citation or an imported uniqueness theorem. The method remains self-contained against external benchmarks once the gradient analysis is performed.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Gradient similarity measured at the individual parameter level determines whether that parameter's update contributes to or mitigates forgetting.
invented entities (2)
-
Conflicting Parameters
no independent evidence
-
Collaborative Parameters
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774,
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457,
work page internal anchor Pith review Pith/arXiv arXiv
-
[3]
Time sensitive knowledge editing through efficient finetuning.arXiv preprint arXiv:2406.04496,
Xiou Ge, Ali Mousavi, Edouard Grave, Armand Joulin, Kun Qian, Benjamin Han, Mostafa Arefiyan, and Yunyao Li. Time sensitive knowledge editing through efficient finetuning.arXiv preprint arXiv:2406.04496,
-
[4]
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783,
work page internal anchor Pith review Pith/arXiv arXiv
-
[5]
Medalpaca–an open-source collection of medical conversational ai models and training data
Tianyu Han, Lisa C Adams, Jens-Michalis Papaioannou, Paul Grundmann, Tom Oberhauser, Alexan- der Löser, Daniel Truhn, and Keno K Bressem. Medalpaca—an open-source collection of medical conversational ai models and training data.arXiv preprint arXiv:2304.08247,
-
[6]
Measuring Massive Multitask Language Understanding
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding.arXiv preprint arXiv:2009.03300,
work page internal anchor Pith review Pith/arXiv arXiv 2009
-
[7]
Continual pre-training of language models.arXiv:2302.03241, 2023
Zixuan Ke, Yijia Shao, Haowei Lin, Tatsuya Konishi, Gyuhak Kim, and Bing Liu. Continual pre-training of language models.arXiv preprint arXiv:2302.03241,
-
[8]
Connecting the knowledge dots: Retrieval-augmented knowledge connection for commonsense reasoning
Junho Kim, Soyeon Bak, Mingyu Lee, Minju Hong, Songha Kim, Tae-Eui Kam, and SangKeun Lee. Connecting the knowledge dots: Retrieval-augmented knowledge connection for commonsense reasoning. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 23582–23601,
work page 2025
-
[9]
Adam: A Method for Stochastic Optimization
Diederik P Kingma. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980,
work page internal anchor Pith review Pith/arXiv arXiv
-
[10]
URL https://doi.org/10.48550/arxiv.2309.10105
Suhas Kotha, Jacob Mitchell Springer, and Aditi Raghunathan. Understanding catastrophic forgetting in language models via implicit inference.arXiv preprint arXiv:2309.10105,
-
[11]
Revisiting catastrophic forgetting in large language model tuning.arXiv preprint arXiv:2406.04836,
Hongyu Li, Liang Ding, Meng Fang, and Dacheng Tao. Revisiting catastrophic forgetting in large language model tuning.arXiv preprint arXiv:2406.04836,
-
[12]
Analyzing and reducing catastrophic forgetting in parameter efficient tuning
Xinlong Li, Weijieying Ren, Wei Qin, Lei Wang, Tianxiang Zhao, and Richang Hong. Analyzing and reducing catastrophic forgetting in parameter efficient tuning. InICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE,
work page 2025
-
[13]
Mass-Editing Memory in a Transformer
Kevin Meng, Arnab Sen Sharma, Alex Andonian, Yonatan Belinkov, and David Rabinovich. Mass- editing memory in a transformer. InarXiv preprint arXiv:2210.07229, 20222. Eric Mitchell, Charles Lin, Antoine Bosselut, Christopher D Manning, and Chelsea Finn. Memory- based model editing at scale. InInternational Conference on Machine Learning, pages 15817– 15831. PMLR,
work page internal anchor Pith review Pith/arXiv arXiv
-
[14]
Commonsenseqa: A question answering challenge targeting commonsense knowledge
Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. Commonsenseqa: A question answering challenge targeting commonsense knowledge. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158,
work page 2019
-
[15]
Paths-over-graph: Knowledge graph empowered large language model reasoning
Xingyu Tan, Xiaoyang Wang, Qing Liu, Xiwei Xu, Xin Yuan, and Wenjie Zhang. Paths-over-graph: Knowledge graph empowered large language model reasoning. InProceedings of the ACM on Web Conference 2025, pages 3505–3522,
work page 2025
-
[16]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971,
work page internal anchor Pith review Pith/arXiv arXiv
-
[17]
Continual gradient low-rank projection fine-tuning for llms.arXiv preprint arXiv:2507.02503,
Chenxu Wang, Yilin Lyu, Zicheng Sun, and Liping Jing. Continual gradient low-rank projection fine-tuning for llms.arXiv preprint arXiv:2507.02503,
-
[18]
K-adapter: Infusing knowledge into pre-trained models with adapters
Ruize Wang, Duyu Tang, Nan Duan, Zhongyu Wei, Xuan-Jing Huang, Jianshu Ji, Guihong Cao, Daxin Jiang, and Ming Zhou. K-adapter: Infusing knowledge into pre-trained models with adapters. InFindings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 1405–1418,
work page 2021
-
[19]
Benchmarking retrieval-augmented generation for medicine
Guangzhi Xiong, Qiao Jin, Zhiyong Lu, and Aidong Zhang. Benchmarking retrieval-augmented generation for medicine. InFindings of the Association for Computational Linguistics: ACL 2024, pages 6233–6251,
work page 2024
-
[20]
An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2 technical report.arXiv preprint arXiv:2407.10671,
work page internal anchor Pith review Pith/arXiv arXiv
-
[21]
Con- tinual learning with pre-trained models: A survey,
Da-Wei Zhou, Hai-Long Sun, Jingyi Ning, Han-Jia Ye, and De-Chuan Zhan. Continual learning with pre-trained models: A survey.arXiv preprint arXiv:2401.16386,
-
[22]
Kun Zhu, Xiaocheng Feng, Xiyuan Du, Yuxuan Gu, Weijiang Yu, Haotian Wang, Qianglong Chen, Zheng Chu, Jingchang Chen, and Bing Qin. An information bottleneck perspective for effective noise filtering on retrieval-augmented generation.arXiv preprint arXiv:2406.01549,
-
[23]
Knowledge graph-guided retrieval augmented generation, 2025
Xiangrong Zhu, Yuexiang Xie, Yi Liu, Yaliang Li, and Wei Hu. Knowledge graph-guided retrieval augmented generation.arXiv preprint arXiv:2502.06864,
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.