Self-Consolidating Language Models: Continual Knowledge Incorporation from Context
Pith reviewed 2026-05-13 07:51 UTC · model grok-4.3
The pith
Language models can learn to consolidate new context into their own weights by generating layer-specific update instructions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SCoL is a post-training framework in which an LLM learns to generate textual update instructions for its own Transformer layers, writing current context into model weights while limiting interference with previously consolidated information. Trained with meta-reinforcement learning over an evolving model state, it improves performance on SQuAD and LongBench v2.
What carries the argument
Meta-reinforcement learning over an evolving model state to train generation of textual layer-update instructions that produce sparse updates aligned with high Fisher information layers.
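As a rough illustration of the mechanism described above, the inner consolidation loop might look like the following toy sketch. Everything here — the dict-of-scalars "model", `generate_instruction`, `apply_update` — is a hypothetical stand-in for illustration, not the paper's actual interface:

```python
# Toy sketch of a SCoL-style consolidation loop (illustrative only).
# The "model" is a dict of per-layer scalar weights; the "instruction"
# is the list of layers the policy chooses to update for the context.
import random

def generate_instruction(model, context, k=2):
    """Stand-in for the LLM emitting a textual update instruction:
    the selection depends on both the context and the current
    (evolving) weights, mimicking the meta-RL setting."""
    random.seed(sum(map(ord, context)) + int(sum(model.values()) * 1000))
    return random.sample(sorted(model), k)

def apply_update(model, instruction, context, lr=0.1):
    """Commit the update: only the selected layers move toward a
    context-derived target, leaving the rest untouched."""
    target = float(len(context))
    for layer in instruction:
        model[layer] += lr * (target - model[layer])

def consolidation_stream(model, contexts):
    """Outer loop: each committed update changes the model state that
    conditions the next selection."""
    history = []
    for ctx in contexts:
        instr = generate_instruction(model, ctx)
        apply_update(model, instr, ctx)
        history.append(instr)
    return history

model = {f"layer_{i}": 0.0 for i in range(6)}
history = consolidation_stream(model, ["ctx-a", "ctx-bb", "ctx-ccc"])
updated = {layer for instr in history for layer in instr}
```

The sparsity constraint is implicit in `k=2`: only a small subset of layers moves per step, which is the property the Fisher-alignment analysis later examines.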
Load-bearing premise
The model can learn to produce update instructions that cause stable changes without unintended interference as the model state changes over time.
What would settle it
Observing whether performance on earlier information drops significantly, compared to baselines, when SCoL processes a long sequence of new contexts; a marked drop would falsify the retention claim.
Original abstract
Large language models (LLMs) increasingly receive information as streams of passages, conversations, and long-context workflows. While longer context windows expose more evidence, they do not ensure that useful information is preserved and reused. We study continual context consolidation: writing current context into model weights while limiting interference with previously consolidated information. We propose Self-Consolidating Language Models (SCoL), a post-training framework in which, given current context, an LLM learns to generate textual update instructions specifying which of its own Transformer layers should be updated. Because committed updates change the model that later generates future selections, we train SCoL with meta-reinforcement learning over an evolving model state. We instantiate SCoL with supervised QA rewards on SQuAD knowledge incorporation and intrinsic likelihood-based rewards for LongBench v2 long-context consolidation. Across both settings, SCoL improves acquisition and retention over prompting, summarization, batch test-time training, and sequential finetuning baselines. Analysis of learned selection patterns shows that SCoL encourages the LLM to generate sparse update locations that align with layers of high Fisher information, suggesting that the model learns to route plasticity toward loss-sensitive regions while limiting interference. Moreover, SCoL transfers from shorter meta-training streams to longer LongBench v2 streams at evaluation, suggesting that our framework supports scalable streaming consolidation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Self-Consolidating Language Models (SCoL), a post-training framework in which an LLM generates textual instructions specifying which of its own Transformer layers to update given new context. Because updates alter the generator itself, training uses meta-reinforcement learning over an evolving model state. Experiments on SQuAD (supervised QA rewards) and LongBench v2 (intrinsic likelihood rewards) claim improved acquisition and retention relative to prompting, summarization, batch test-time training, and sequential finetuning. Post-hoc analysis indicates that learned selections are sparse and align with high-Fisher-information layers; the method also transfers from shorter meta-training streams to longer evaluation streams.
Significance. If the stability of the learned textual updates under evolving model state can be verified and the quantitative gains substantiated, the work would offer a practical route to continual knowledge incorporation in LLMs that selectively routes plasticity while limiting interference, with potential scalability to streaming settings.
major comments (2)
- [§3 (Meta-RL Training Loop)] The central claim that SCoL improves retention over sequential finetuning rests on the assumption that generated layer-update instructions produce stable, non-interfering effects as the base model evolves. No ablations on update magnitude, sign consistency across sequential steps, or retention decay after repeated consolidation cycles are reported, leaving open the possibility that net effects drift or cancel prior consolidations.
- [Experimental Results (SQuAD and LongBench v2 sections)] The abstract and main claims assert performance gains in acquisition and retention, yet the manuscript supplies no concrete metrics, error bars, statistical significance tests, or detailed baseline hyperparameters, preventing verification of the magnitude and reliability of the reported improvements.
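The ablations requested above could be operationalized with simple diagnostics along the following lines; the function names and toy inputs are illustrative assumptions, not the paper's protocol:

```python
# Sketch of the stability diagnostics the report asks for: per-step update
# norms, layer-selection overlap between consecutive steps, and a retention
# curve after repeated consolidation cycles. All inputs are toy stand-ins.
import math

def update_norm(before, after):
    """L2 norm of the committed weight change across layers."""
    return math.sqrt(sum((after[l] - before[l]) ** 2 for l in before))

def selection_overlap(prev_layers, curr_layers):
    """Jaccard overlap of selected layers at consecutive steps:
    high overlap suggests consistent, low-drift routing."""
    prev, curr = set(prev_layers), set(curr_layers)
    return len(prev & curr) / len(prev | curr) if prev | curr else 1.0

def retention_curve(scores_per_cycle):
    """Score after each consolidation cycle, normalized by the score
    measured right after the fact was first incorporated."""
    base = scores_per_cycle[0]
    return [s / base for s in scores_per_cycle]

before = {"layer_0": 0.0, "layer_1": 0.0}
after = {"layer_0": 0.3, "layer_1": -0.4}
norm = update_norm(before, after)
overlap = selection_overlap(["layer_0", "layer_1"], ["layer_1", "layer_2"])
curve = retention_curve([0.8, 0.76, 0.72])  # decaying retention
```

A flat `curve` near 1.0 over many cycles would directly support the non-interference assumption; a steadily falling curve would indicate that later updates cancel earlier consolidations.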
minor comments (2)
- [§2 (Method Overview)] The format and application procedure for the generated textual update instructions would be clearer with a concrete example (e.g., a sample instruction string and the corresponding layer-modification code) placed in the main text rather than only in the appendix.
- [Analysis of Learned Selection Patterns] Fisher-information alignment analysis would benefit from an explicit definition of how Fisher values are computed (e.g., which loss and which subset of parameters) to allow replication.
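For concreteness, one common definition is the diagonal empirical Fisher: the mean squared gradient of the loss per parameter, aggregated per layer. The sketch below computes it on a toy model with scalar "layers"; `toy_grads` and all names are illustrative assumptions, not the paper's procedure:

```python
# Toy sketch of a per-parameter diagonal empirical Fisher estimate, the
# quantity such an alignment analysis would presumably compare layer
# selections against. grad_fn stands in for backprop through the chosen
# loss; real use would average squared log-likelihood gradients.

def empirical_fisher(params, data, grad_fn):
    """Mean squared gradient per parameter over a data sample."""
    fisher = {name: 0.0 for name in params}
    for example in data:
        grads = grad_fn(params, example)
        for name, g in grads.items():
            fisher[name] += g * g / len(data)
    return fisher

# Toy model: per-layer loss 0.5 * (w * x - y)^2, so d(loss)/dw = (w*x - y) * x.
def toy_grads(params, example):
    x, y = example
    return {name: (w * x - y) * x for name, w in params.items()}

params = {"layer_0": 1.0, "layer_1": 0.0}
data = [(1.0, 2.0), (2.0, 2.0)]
fisher = empirical_fisher(params, data, toy_grads)
# layer_0: grads -1 and 0 -> mean square 0.5
# layer_1: grads -2 and -4 -> mean square 10.0
```

Reporting which loss (QA reward surrogate vs. language-modeling likelihood) and which parameter subset enter `grad_fn` is exactly the detail the comment asks the authors to pin down.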
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below, providing clarifications on update stability and committing to include the requested experimental details and analyses in the revision.
Point-by-point responses
- Referee: [§3 (Meta-RL Training Loop)] The central claim that SCoL improves retention over sequential finetuning rests on the assumption that generated layer-update instructions produce stable, non-interfering effects as the base model evolves. No ablations on update magnitude, sign consistency across sequential steps, or retention decay after repeated consolidation cycles are reported, leaving open the possibility that net effects drift or cancel prior consolidations.
Authors: We agree that explicit verification of update stability is important to support the retention claims. Our LongBench v2 results demonstrate sustained gains over long sequential streams relative to finetuning baselines, which indirectly suggests non-interfering effects, but we did not report dedicated ablations on update magnitude, sign consistency, or multi-cycle decay. In the revised manuscript we will add these analyses, including norms of weight updates, consistency of selected layers across steps, and retention curves after repeated consolidation cycles. revision: yes
- Referee: [Experimental Results (SQuAD and LongBench v2 sections)] The abstract and main claims assert performance gains in acquisition and retention, yet the manuscript supplies no concrete metrics, error bars, statistical significance tests, or detailed baseline hyperparameters, preventing verification of the magnitude and reliability of the reported improvements.
Authors: We apologize that the numerical results were not presented with sufficient prominence or detail in the submitted version. The full manuscript contains tables with exact metrics (F1 on SQuAD, likelihood/perplexity on LongBench v2), standard deviations from multiple runs, and statistical significance tests against baselines; all baseline hyperparameters are listed in the appendix. We will revise the main text and results sections to prominently display key numbers, error bars, and significance values, with explicit cross-references to the appendix for reproducibility. revision: yes
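One standard way to supply the promised significance tests is a paired bootstrap over per-example scores. The sketch below uses synthetic scores and hypothetical names, not the paper's data:

```python
# Sketch of a paired bootstrap significance test: resample per-example
# score pairs and estimate how often the baseline would match or beat
# the proposed method. Scores here are synthetic illustrations.
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=2000, seed=0):
    """Fraction of resamples where mean(a) <= mean(b); a small value
    suggests a's advantage is unlikely to be a chance artifact."""
    rng = random.Random(seed)
    n = len(scores_a)
    losses = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_a <= mean_b:
            losses += 1
    return losses / n_resamples

# Synthetic per-example F1 scores: the method is consistently ~0.1 higher.
scol = [0.8, 0.7, 0.9, 0.75, 0.85, 0.8, 0.7, 0.9]
baseline = [0.7, 0.6, 0.8, 0.65, 0.75, 0.7, 0.6, 0.8]
p = paired_bootstrap(scol, baseline)  # 0.0 here: the gap is constant
```

Pairing by example matters: it controls for per-question difficulty, which is substantial on SQuAD-style QA.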
Circularity Check
No significant circularity; derivation uses external rewards and empirical baselines
Full rationale
The paper's core method trains an LLM via meta-RL to output textual layer-update instructions, with rewards drawn from independent sources (supervised QA accuracy on SQuAD and intrinsic likelihood on LongBench v2). These rewards evaluate post-update performance rather than being fitted to the selection policy itself. No equations or steps reduce the claimed improvements to self-definition, renamed fits, or load-bearing self-citations. The reported gains are measured against external baselines (prompting, summarization, batch test-time training, sequential finetuning), and the Fisher-information alignment is presented as post-hoc analysis rather than a definitional input. The framework is therefore self-contained against verifiable external metrics.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: LLMs possess differentiable layers whose updates can be selectively triggered by generated instructions.
- Domain assumption: Meta-reinforcement learning can optimize policies over an evolving model state without instability.
invented entities (1)
- SCoL update-instruction generator (no independent evidence)
Reference graph
Works this paper leans on
- [1] Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024.
- [2] Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, and Boris Ginsburg. RULER: What's the real context size of your long-context language models? In First Conference on Language Modeling, 2024.
- [3] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems, volume 33, 2020.
- [4] Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, ... Improving language models by retrieving from trillions of tokens, 2022.
- [5] Weizhi Wang, Li Dong, Hao Cheng, Xiaodong Liu, Xifeng Yan, Jianfeng Gao, and Furu Wei. Augmenting language models with long-term memory. In Advances in Neural Information Processing Systems, volume 36, 2023.
- [6] Amirkeivan Mohtashami and Martin Jaggi. Random-access infinite context length for transformers. In Advances in Neural Information Processing Systems, volume 36, 2023.
- [7] Alexis Chevalier, Alexander Wettig, Anirudh Ajith, and Danqi Chen. Adapting language models to compress contexts. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3829–3846, Singapore, 2023. Association for Computational Linguistics.
- [8] Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. LLMLingua: Compressing prompts for accelerated inference of large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13358–13376, Singapore, 2023. Association for Computational Linguistics.
- [9] Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. MemGPT: Towards LLMs as operating systems. arXiv preprint arXiv:2310.08560, 2023.
- [10] Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. MemoryBank: Enhancing large language models with long-term memory. Proceedings of the AAAI Conference on Artificial Intelligence, 38(17):19724–19731, 2024.
- [11] James L. McClelland, Bruce L. McNaughton, and Randall C. O'Reilly. Why there are complementary learning systems in the hippocampus and neocortex: Insights from the successes and failures of connectionist models of learning and memory. Psychological Review, 102(3):419–457, 1995.
- [12] Dharshan Kumaran, Demis Hassabis, and James L. McClelland. What learning systems do intelligent agents need? Complementary learning systems theory updated. Trends in Cognitive Sciences, 20(7):512–534, 2016.
- [13] Yadin Dudai, Avi Karni, and Jan Born. The consolidation and transformation of memory. Neuron, 88(1):20–32, 2015.
- [14] Michael McCloskey and Neal J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation, volume 24, pages 109–165. Academic Press, 1989.
- [15] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):352..., 2017.
- [16] Gido M. van de Ven. On the computation of the Fisher information in continual learning. arXiv preprint arXiv:2502.11756, 2025. To appear in the ICLR 2025 Blogpost Track.
- [17] Adam Zweiger, Jyothish Pari, Han Guo, Yoon Kim, and Pulkit Agrawal. Self-adapting language models. In Advances in Neural Information Processing Systems, 2025.
- [18] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR), 2022.
- [19] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas, 2016. Association for Computational Linguistics.
- [20] Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. LongBench v2: Towards deeper understanding and reasoning on realistic long-context multitasks. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025.
- [21] Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. In Advances in Neural Information Processing Systems, volume 35, pages 17359–17372, 2022.
- [22] Eric Mitchell, Charles Lin, Antoine Bosselut, Chelsea Finn, and Christopher D. Manning. Fast model editing at scale. In International Conference on Learning Representations (ICLR), 2022.
- [23] Peng Wang, Zexi Li, Ningyu Zhang, Ziwen Xu, Yunzhi Yao, Yong Jiang, Pengjun Xie, Fei Huang, and Huajun Chen. WISE: Rethinking the knowledge memory for lifelong model editing of large language models. Advances in Neural Information Processing Systems, 37:53764–53797, 2024.
- [24] Qizhou Chen, Taolin Zhang, Xiaofeng He, Dongyang Li, Chengyu Wang, Longtao Huang, and Hui Xue'. Lifelong knowledge editing for LLMs with retrieval-augmented continuous prompt learning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 13565–13580, Miami, Florida, USA, 2024. Association for Computational Linguistics.
- [25] Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. Don't stop pretraining: Adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8342–8360, Online, 2020. Association for Computational Linguistics.
- [26] Haizhou Shi, Zihao Xu, Hengyi Wang, Weiyi Qin, Wenyuan Wang, Yibin Wang, Zifeng Wang, Sayna Ebrahimi, and Hao Wang. Continual learning of large language models: A comprehensive survey. ACM Computing Surveys, 58(5):1–42, 2025.
- [27] Hongyang Chen, Zhongwu Sun, Hongfei Ye, Kunchi Li, and Xuemin Lin. Continual learning in large language models: Methods, challenges, and opportunities. arXiv preprint arXiv:2603.12658, 2026.
- [28] Zhizhong Li and Derek Hoiem. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2935–2947, 2018.
- [29] Xiao Wang, Tianze Chen, Qiming Ge, Han Xia, Rong Bao, Rui Zheng, Qi Zhang, Tao Gui, and Xuanjing Huang. Orthogonal subspace learning for language model continual learning. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 10658–10671, Singapore, 2023. Association for Computational Linguistics.
- [30] Renzhi Wang and Piji Li. LEMoE: Advanced mixture of experts adaptor for lifelong model editing of large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 2551–2575, Miami, Florida, USA, 2024. Association for Computational Linguistics.
- [31] Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. Retrieval augmented language model pre-training. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 3929–3938. PMLR, 2020.
- [32] Jack W Rae, Anna Potapenko, Siddhant M Jayakumar, and Timothy P Lillicrap. Compressive transformers for long-range sequence modelling. arXiv preprint arXiv:1911.05507, 2019.
- [33] Joon Sung Park, Joseph C. O'Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, New York, NY, USA, 2023. Association for Computing Machinery.
- [34] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems, volume 36, 2023.
- [35] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Gray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback..., 2022.
- [36] DeepSeek-AI. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature, 645:633–638, 2025.
- [37] Thomas Anthony, Zheng Tian, and David Barber. Thinking fast and slow with deep learning and tree search. In Advances in Neural Information Processing Systems 30 (NIPS), 2017.
- [38] Qi Sun, Edoardo Cetin, and Yujin Tang. Transformer-squared: Self-adaptive LLMs. arXiv preprint arXiv:2501.06252, 2025.
- [39] Nathan Hu, Eric Mitchell, Christopher Manning, and Chelsea Finn. Meta-learning online adaptation of language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4418–4432, Singapore, 2023. Association for Computational Linguistics.
- [40] Kevin Meng, Arnab Sen Sharma, Alex Andonian, Yonatan Belinkov, and David Bau. Mass-editing memory in a transformer. In International Conference on Learning Representations (ICLR), 2023.
- [41] Song Wang, Yaochen Zhu, Haochen Liu, Zaiyi Zheng, Chen Chen, and Jundong Li. Knowledge editing for large language models: A survey. arXiv preprint arXiv:2310.16218, 2024.
- [42] Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), pages 328–339, 2018.
- [43] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.
- [44] Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, and Rémi Munos. A general theoretical paradigm to understand learning from human preferences. In Proceedings of the 27th International Conference on Artificial Intelligence and Statistics (AISTATS), 2024.
- [45] Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, Wolfgang Macherey, Arnaud Doucet, Orhan Firat, and Nando de Freitas. Reinforced self-training (ReST) for language modeling. arXiv preprint arXiv:2308.08998, 2023.
- [46] Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595, 2023.
- [47] Jesse Mu, Xiang Lisa Li, and Noah Goodman. Learning to compress prompts with gist tokens. In Advances in Neural Information Processing Systems 36 (NeurIPS), 2023.
- [48] Lukas Berglund, Meg Tong, Max Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, and Owain Evans. The reversal curse: LLMs trained on "A is B" fail to learn "B is A". In International Conference on Learning Representations, 2024.
- [49] Weijieying Ren, Xinlong Li, Lei Wang, Tianxiang Zhao, and Wei Qin. Analyzing and reducing catastrophic forgetting in parameter efficient tuning, 2024.
- [50] Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021.
- [51] Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3987–3995. PMLR, 2017.
- [52] Arun Mallya and Svetlana Lazebnik. PackNet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7765–7773, 2018.
- [53] Joan Serrà, Dídac Surís, Marius Miron, and Alexandros Karatzoglou. Overcoming catastrophic forgetting with hard attention to the task. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4548–4557. PMLR, 2018.
- [54] Moonshot AI. Kimi k2.6 tech blog: Advancing open-source coding, 2026. Accessed: 2026-05-05.
- [55] Yufeng Du, Minyang Tian, Srikanth Ronanki, Subendhu Rongali, Sravan Bodapati, Aram Galstyan, Azton Wells, Roy Schwartz, Eliu A Huerta, and Hao Peng. Context length alone hurts LLM performance despite perfect retrieval. arXiv preprint arXiv:2510.05381, 2025.
- [56] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.
- [57] Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems 36 (NeurIPS), 2023.
- [58] Laura O'Mahony, Léo Grinsztajn, Hailey Schoelkopf, and Stella Biderman. Attributing mode collapse in the fine-tuning of large language models. In ICLR 2024 Workshop on Mathematical and Empirical Understanding of Foundation Models, 2024.
- [59] Ziniu Li, Congliang Chen, Tian Xu, Zeyu Qin, Jiancong Xiao, Zhi-Quan Luo, and Ruoyu Sun. Preserving diversity in supervised fine-tuning of large language models. arXiv preprint arXiv:2408.16673, 2024.
- [60] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022.
- [61] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng X..., 2023.