Geometry Conflict: Explaining and Controlling Forgetting in LLM Continual Post-Training
Pith reviewed 2026-05-12 02:28 UTC · model grok-4.3
The pith
Forgetting during sequential LLM updates occurs when the covariance geometry of a new task misaligns with the geometry of the current model state.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Forgetting can be understood as a state-relative update-integration failure: it arises when the covariance geometries induced by tasks misalign with the geometry of the evolving model state. Sequential updates transfer when they remain compatible with the model state shaped by previous updates, and interfere when state-relative geometry conflict becomes high.
What carries the argument
The covariance geometry induced by a task's parameter update, which acts as a descriptor of whether the update will integrate compatibly with the geometry of the current model state.
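This descriptor can be made concrete. As a hedged sketch (the summary does not reproduce the paper's exact conflict measure), one plausible score compares the top eigenvector subspaces of two covariance matrices:

```python
import numpy as np

def geometry_conflict(cov_a, cov_b, k=4):
    # Hypothetical misalignment score between the covariance geometry of a
    # new task's update (cov_a) and that of the current model state (cov_b):
    # 0 when the dominant directions coincide, near 1 when they are orthogonal.
    k = min(k, cov_a.shape[0])
    _, va = np.linalg.eigh(cov_a)      # eigenvalues in ascending order
    _, vb = np.linalg.eigh(cov_b)
    ua, ub = va[:, -k:], vb[:, -k:]    # top-k eigenvectors of each geometry
    # Frobenius norm squared of U_a^T U_b = sum of squared cosines of
    # principal angles between the two subspaces; dividing by k maps to [0, 1].
    overlap = np.linalg.norm(ua.T @ ub) ** 2 / k
    return 1.0 - overlap
```

Any subspace-overlap measure with the same endpoints would serve equally well here; the choice of top-k eigenvectors is an assumption, not the paper's definition.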
If this is right
- Sequential updates transfer when their induced covariance geometry stays compatible with the model state shaped by earlier updates.
- Interference and forgetting increase when state-relative geometry conflict becomes high.
- Geometry conflict can serve as a control signal to gate how updates are integrated or corrected.
- A geometry-aware merging procedure improves retention and final performance on domain-continual and capability-continual tasks without replay data.
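To illustrate the control-signal bullet, a minimal assumed gating rule (not GCWM's actual Wasserstein-based correction) could look like:

```python
import numpy as np

def gated_integrate(state, delta, conflict, tau=0.5):
    # Toy gate: apply the update fully when geometry conflict is below a
    # threshold tau, otherwise shrink it linearly toward zero. The threshold
    # and the linear schedule are placeholders, not values from the paper.
    gate = 1.0 if conflict < tau else max(0.0, 1.0 - conflict)
    return state + gate * np.asarray(delta), gate
```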
Where Pith is reading between the lines
- If geometry conflict is the decisive factor, then task ordering could be chosen in advance by computing pairwise geometry alignments to reduce expected forgetting.
- The same compatibility diagnostic might apply to other continual adaptation settings such as instruction tuning or multi-domain fine-tuning.
- Preprocessing updates to reduce their geometry conflict before merging could offer an additional lever for preserving capabilities.
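The first speculation above, choosing task order in advance from pairwise geometry alignment, could be prototyped by brute force over orderings (a hypothetical procedure, feasible only for small task counts):

```python
from itertools import permutations

def order_tasks(conflict):
    # conflict[i][j]: pairwise geometry conflict between tasks i and j.
    # Return an ordering that minimizes the summed conflict between
    # consecutive tasks in the sequence.
    n = len(conflict)
    return list(min(
        permutations(range(n)),
        key=lambda p: sum(conflict[p[i]][p[i + 1]] for i in range(n - 1)),
    ))
```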
Load-bearing premise
The covariance geometry of a task's parameter update is a sufficient and stable descriptor of whether that task will integrate compatibly with the current model state.
What would settle it
Measuring geometry conflict on a sequence of tasks and finding no reliable correlation between higher conflict values and greater forgetting rates would falsify the central claim.
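That falsification test can be operationalized as a rank correlation between per-task conflict values and measured forgetting (the variable names below are hypothetical):

```python
import numpy as np

def spearman(x, y):
    # Spearman rank correlation between geometry-conflict values and
    # forgetting rates across a task sequence. A value near zero would
    # undercut the claim that higher conflict predicts greater forgetting.
    # (Double argsort gives ranks; valid when there are no ties.)
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return float(np.corrcoef(rx, ry)[0, 1])
```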
Original abstract
Continual post-training aims to extend large language models (LLMs) with new knowledge, skills, and behaviors, yet it remains unclear when sequential updates enable capability transfer and when they cause catastrophic forgetting. Existing methods mitigate forgetting through sequential fine-tuning, replay, regularization, or model merging, but offer limited criteria for determining when incorporating new updates is beneficial or harmful. In this work, we study LLM continual post-training through three questions: What drives forgetting? When do sequentially acquired capabilities transfer or interfere? How can compatibility be used to control update integration? We address these questions through task geometry: we represent each post-training task by its parameter update and study the covariance geometry induced by the update. Our central finding is that forgetting can be considered a state-relative update-integration failure: it arises when the covariance geometries induced by tasks misalign with the geometry of the evolving model state. Sequential updates transfer when they remain compatible with the model state shaped by previous updates, and interfere when state-relative geometry conflict becomes high. Motivated by this finding, we propose Geometry-Conflict Wasserstein Merging (GCWM), a data-free update-integration method that constructs a shared Wasserstein metric via Gaussian Wasserstein barycenters and uses geometry conflict to gate geometry-aware correction. Across Qwen3 0.6B--14B on domain-continual and capability-continual settings, GCWM consistently outperforms data-free baselines, improving retention and final performance without replay data. These results identify geometry conflict as both an explanatory signal for forgetting and a practical control signal for LLM continual post-training.
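The abstract's "Gaussian Wasserstein barycenters" rest on the closed-form 2-Wasserstein (Bures–Wasserstein) distance between Gaussians. The sketch below is that standard formula, not the paper's algorithm:

```python
import numpy as np
from scipy.linalg import sqrtm

def w2_gaussian(mu1, cov1, mu2, cov2):
    # W2^2(N(mu1, cov1), N(mu2, cov2)) =
    #   ||mu1 - mu2||^2 + Tr(cov1 + cov2 - 2 (cov2^1/2 cov1 cov2^1/2)^1/2)
    s2 = sqrtm(cov2)
    cross = np.real(sqrtm(s2 @ cov1 @ s2))  # discard tiny imaginary noise
    gap = np.sum((np.asarray(mu1, float) - np.asarray(mu2, float)) ** 2)
    # max(..., 0) guards against small negative values from round-off
    return float(np.sqrt(max(gap + np.trace(cov1 + cov2 - 2 * cross), 0.0)))
```

With identical covariances the distance reduces to the Euclidean distance between the means, which gives a quick sanity check.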
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that forgetting in LLM continual post-training arises as a state-relative update-integration failure when covariance geometries induced by task parameter updates misalign with the geometry of the evolving model state. Sequential updates transfer when compatible with the prior state and interfere at high geometry conflict. The authors propose Geometry-Conflict Wasserstein Merging (GCWM), a data-free method using Gaussian Wasserstein barycenters to build a shared metric and gate geometry-aware corrections. On Qwen3 models (0.6B–14B) in domain-continual and capability-continual settings, GCWM outperforms data-free baselines in retention and final performance.
Significance. If the geometry-conflict interpretation is shown to be load-bearing rather than a proxy for simpler signals, the work would supply both an explanatory account of when updates integrate or interfere and a practical data-free control mechanism. The direct linkage of analysis to the GCWM algorithm and the use of optimal-transport barycenters are technical strengths that could influence continual-learning research beyond replay or regularization heuristics.
Major comments (3)
- [Abstract] Abstract (central finding paragraph): the claim that covariance geometry is a sufficient and stable descriptor of state compatibility is presented as following from the analysis, yet no evidence is given that geometry conflict adds explanatory power beyond proxies such as ||Δθ|| or task embedding similarity. Ablation experiments that isolate the geometry term are required to substantiate the sufficiency assumption.
- [Method] GCWM construction (method section): the method relies on Gaussian approximations and Wasserstein barycenters whose covariance parameters are estimated from the same updates used to measure conflict; the manuscript must show that these parameters are independent of the evaluation data or provide a derivation demonstrating that the circularity does not affect the reported gains.
- [Experiments] Experimental results (tables/figures): no description of covariance estimation procedure, run-to-run variance, or statistical significance tests is supplied, so it is impossible to determine whether the observed improvements over baselines reliably support the geometry-conflict explanation rather than implementation details.
Minor comments (1)
- [Abstract] The abstract packs the central claim, method, and results into a single dense paragraph; splitting the finding into two sentences would improve readability.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback on our manuscript. We address each major comment below and will incorporate revisions to strengthen the presentation of our geometry-conflict analysis and the GCWM method.
Point-by-point responses
Referee: [Abstract] Abstract (central finding paragraph): the claim that covariance geometry is a sufficient and stable descriptor of state compatibility is presented as following from the analysis, yet no evidence is given that geometry conflict adds explanatory power beyond proxies such as ||Δθ|| or task embedding similarity. Ablation experiments that isolate the geometry term are required to substantiate the sufficiency assumption.
Authors: We agree that additional evidence is needed to demonstrate that geometry conflict provides explanatory power beyond simpler proxies. In the revised manuscript we will add ablation experiments that directly compare geometry conflict against ||Δθ|| and task-embedding cosine similarity as predictors of forgetting and interference. These ablations will quantify the incremental predictive value of the geometry term on held-out validation sets. We will also revise the abstract to state the findings more precisely as supported by both the geometric analysis and the new ablations. revision: yes
Referee: [Method] GCWM construction (method section): the method relies on Gaussian approximations and Wasserstein barycenters whose covariance parameters are estimated from the same updates used to measure conflict; the manuscript must show that these parameters are independent of the evaluation data or provide a derivation demonstrating that the circularity does not affect the reported gains.
Authors: We will expand the method section to clarify that covariance matrices are estimated solely from the task-specific parameter updates (via sample covariance of the delta vectors or mini-batch gradients collected during fine-tuning). These statistics are computed before any evaluation on downstream tasks and do not incorporate test or validation data. We will include a short derivation showing that the Wasserstein barycenter construction uses only these pre-computed update covariances to define the shared metric, ensuring the conflict measurement and subsequent merging step remain independent of the reported performance metrics. revision: yes
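As a sketch of the estimation step this response describes (the shrinkage strength and the use of a scaled-identity target are placeholders, not values from the paper):

```python
import numpy as np

def estimate_update_covariance(deltas, shrink=0.1):
    # Sample covariance over a stack of update (delta) vectors, shrunk toward
    # a scaled identity for numerical stability. As the response stipulates,
    # only pre-computed update statistics enter; no evaluation data is used.
    X = np.asarray(deltas, dtype=float)
    X = X - X.mean(axis=0, keepdims=True)
    cov = X.T @ X / max(len(X) - 1, 1)
    target = (np.trace(cov) / cov.shape[0]) * np.eye(cov.shape[0])
    return (1.0 - shrink) * cov + shrink * target
```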
Referee: [Experiments] Experimental results (tables/figures): no description of covariance estimation procedure, run-to-run variance, or statistical significance tests is supplied, so it is impossible to determine whether the observed improvements over baselines reliably support the geometry-conflict explanation rather than implementation details.
Authors: We acknowledge the omission of these experimental details. In the revised manuscript we will add: (i) a precise description of the covariance estimation procedure (sample covariance over update vectors with explicit batch size and regularization), (ii) mean and standard deviation of all metrics over at least three independent runs with different random seeds, and (iii) statistical significance tests (paired t-tests or Wilcoxon signed-rank tests with p-values) comparing GCWM against each baseline. Updated tables and figures will report these statistics. revision: yes
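The proposed comparisons could be run with standard paired tests; the scores below are made-up per-seed placeholders, not the paper's numbers:

```python
from scipy.stats import ttest_rel, wilcoxon

def compare_paired(method_scores, baseline_scores):
    # Paired t-test and Wilcoxon signed-rank test over per-seed scores,
    # as promised in item (iii) of the response.
    _, t_p = ttest_rel(method_scores, baseline_scores)
    _, w_p = wilcoxon(method_scores, baseline_scores)
    return {"t_p": float(t_p), "wilcoxon_p": float(w_p)}
```

Note that with only five seeds the smallest attainable two-sided Wilcoxon p-value is 2/2^5 = 0.0625, so the t-test will usually be the more sensitive of the two at that sample size.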
Circularity Check
No significant circularity: the central claim is an interpretive observation from the geometry analysis, not reduced by construction to its inputs.
Full rationale
The paper's derivation chain begins with representing tasks via parameter updates and analyzing induced covariance geometries, leading to the interpretive claim that forgetting arises from state-relative misalignment. This is presented as a finding from the geometry study rather than a mathematical derivation or fitted prediction. GCWM is motivated by the finding and applies Gaussian Wasserstein barycenters with geometry-conflict gating, but the abstract and description provide no equations showing that the conflict metrics or barycenter parameters are fitted directly to forgetting outcomes or reduce the explanatory claim to the input updates by definition. No load-bearing self-citations, uniqueness theorems, or ansatz smuggling are evident in the provided text. The analysis remains self-contained as an empirical geometry-based interpretation without circular reduction.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: the covariance geometry induced by a task's parameter update reflects the compatibility of that task with the current model state.
Invented entities (1)
- Geometry conflict signal (no independent evidence)