Self-Consolidating Language Models: Continual Knowledge Incorporation from Context
Pith reviewed 2026-05-13 07:51 UTC · model grok-4.3
The pith
Language models can learn to consolidate new context into their own weights by generating layer-specific update instructions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SCoL is a post-training framework in which an LLM learns to generate textual update instructions for its own Transformer layers, writing current context into model weights while limiting interference with previously consolidated information. Trained with meta-reinforcement learning over an evolving model state, it improves performance on SQuAD and LongBench v2.
What carries the argument
Meta-reinforcement learning over an evolving model state to train generation of textual layer-update instructions that produce sparse updates aligned with high Fisher information layers.
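As a rough illustration of the mechanism described above, the inner consolidation loop might look like the following toy sketch. Everything here — the dict-of-scalars "model", `generate_instruction`, `apply_update` — is a hypothetical stand-in for illustration, not the paper's actual interface:

```python
# Toy sketch of a SCoL-style consolidation loop (illustrative only).
# The "model" is a dict of per-layer scalar weights; the "instruction"
# is the list of layers the policy chooses to update for the context.
import random

def generate_instruction(model, context, k=2):
    """Stand-in for the LLM emitting a textual update instruction:
    the selection depends on both the context and the current
    (evolving) weights, mimicking the meta-RL setting."""
    random.seed(sum(map(ord, context)) + int(sum(model.values()) * 1000))
    return random.sample(sorted(model), k)

def apply_update(model, instruction, context, lr=0.1):
    """Commit the update: only the selected layers move toward a
    context-derived target, leaving the rest untouched."""
    target = float(len(context))
    for layer in instruction:
        model[layer] += lr * (target - model[layer])

def consolidation_stream(model, contexts):
    """Outer loop: each committed update changes the model state that
    conditions the next selection."""
    history = []
    for ctx in contexts:
        instr = generate_instruction(model, ctx)
        apply_update(model, instr, ctx)
        history.append(instr)
    return history

model = {f"layer_{i}": 0.0 for i in range(6)}
history = consolidation_stream(model, ["ctx-a", "ctx-bb", "ctx-ccc"])
updated = {layer for instr in history for layer in instr}
```

The sparsity constraint is implicit in `k=2`: only a small subset of layers moves per step, which is the property the Fisher-alignment analysis later examines.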
Load-bearing premise
The model can learn to produce update instructions that cause stable changes without unintended interference as the model state changes over time.
What would settle it
Observing whether performance on earlier information drops significantly, compared to baselines, when SCoL processes a long sequence of new contexts; a marked drop would falsify the retention claim.
Original abstract
Large language models (LLMs) increasingly receive information as streams of passages, conversations, and long-context workflows. While longer context windows expose more evidence, they do not ensure that useful information is preserved and reused. We study continual context consolidation: writing current context into model weights while limiting interference with previously consolidated information. We propose Self-Consolidating Language Models (SCoL), a post-training framework in which, given current context, an LLM learns to generate textual update instructions specifying which of its own Transformer layers should be updated. Because committed updates change the model that later generates future selections, we train SCoL with meta-reinforcement learning over an evolving model state. We instantiate SCoL with supervised QA rewards on SQuAD knowledge incorporation and intrinsic likelihood-based rewards for LongBench v2 long-context consolidation. Across both settings, SCoL improves acquisition and retention over prompting, summarization, batch test-time training, and sequential finetuning baselines. Analysis of learned selection patterns shows that SCoL encourages the LLM to generate sparse update locations that align with layers of high Fisher information, suggesting that the model learns to route plasticity toward loss-sensitive regions while limiting interference. Moreover, SCoL transfers from shorter meta-training streams to longer LongBench v2 streams at evaluation, suggesting that our framework supports scalable streaming consolidation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Self-Consolidating Language Models (SCoL), a post-training framework in which an LLM generates textual instructions specifying which of its own Transformer layers to update given new context. Because updates alter the generator itself, training uses meta-reinforcement learning over an evolving model state. Experiments on SQuAD (supervised QA rewards) and LongBench v2 (intrinsic likelihood rewards) claim improved acquisition and retention relative to prompting, summarization, batch test-time training, and sequential finetuning. Post-hoc analysis indicates that learned selections are sparse and align with high-Fisher-information layers; the method also transfers from shorter meta-training streams to longer evaluation streams.
Significance. If the stability of the learned textual updates under evolving model state can be verified and the quantitative gains substantiated, the work would offer a practical route to continual knowledge incorporation in LLMs that selectively routes plasticity while limiting interference, with potential scalability to streaming settings.
major comments (2)
- [§3 (Meta-RL Training Loop)] The central claim that SCoL improves retention over sequential finetuning rests on the assumption that generated layer-update instructions produce stable, non-interfering effects as the base model evolves. No ablations on update magnitude, sign consistency across sequential steps, or retention decay after repeated consolidation cycles are reported, leaving open the possibility that net effects drift or cancel prior consolidations.
- [Experimental Results (SQuAD and LongBench v2 sections)] The abstract and main claims assert performance gains in acquisition and retention, yet the manuscript supplies no concrete metrics, error bars, statistical significance tests, or detailed baseline hyperparameters, preventing verification of the magnitude and reliability of the reported improvements.
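The ablations requested above could be operationalized with simple diagnostics along the following lines; the function names and toy inputs are illustrative assumptions, not the paper's protocol:

```python
# Sketch of the stability diagnostics the report asks for: per-step update
# norms, layer-selection overlap between consecutive steps, and a retention
# curve after repeated consolidation cycles. All inputs are toy stand-ins.
import math

def update_norm(before, after):
    """L2 norm of the committed weight change across layers."""
    return math.sqrt(sum((after[l] - before[l]) ** 2 for l in before))

def selection_overlap(prev_layers, curr_layers):
    """Jaccard overlap of selected layers at consecutive steps:
    high overlap suggests consistent, low-drift routing."""
    prev, curr = set(prev_layers), set(curr_layers)
    return len(prev & curr) / len(prev | curr) if prev | curr else 1.0

def retention_curve(scores_per_cycle):
    """Score after each consolidation cycle, normalized by the score
    measured right after the fact was first incorporated."""
    base = scores_per_cycle[0]
    return [s / base for s in scores_per_cycle]

before = {"layer_0": 0.0, "layer_1": 0.0}
after = {"layer_0": 0.3, "layer_1": -0.4}
norm = update_norm(before, after)
overlap = selection_overlap(["layer_0", "layer_1"], ["layer_1", "layer_2"])
curve = retention_curve([0.8, 0.76, 0.72])  # decaying retention
```

A flat `curve` near 1.0 over many cycles would directly support the non-interference assumption; a steadily falling curve would indicate that later updates cancel earlier consolidations.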
minor comments (2)
- [§2 (Method Overview)] The format and application procedure for the generated textual update instructions would be clearer with a concrete example (e.g., a sample instruction string and the corresponding layer-modification code) placed in the main text rather than only in the appendix.
- [Analysis of Learned Selection Patterns] Fisher-information alignment analysis would benefit from an explicit definition of how Fisher values are computed (e.g., which loss and which subset of parameters) to allow replication.
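For concreteness, one common definition is the diagonal empirical Fisher: the mean squared gradient of the loss per parameter, aggregated per layer. The sketch below computes it on a toy model with scalar "layers"; `toy_grads` and all names are illustrative assumptions, not the paper's procedure:

```python
# Toy sketch of a per-parameter diagonal empirical Fisher estimate, the
# quantity such an alignment analysis would presumably compare layer
# selections against. grad_fn stands in for backprop through the chosen
# loss; real use would average squared log-likelihood gradients.

def empirical_fisher(params, data, grad_fn):
    """Mean squared gradient per parameter over a data sample."""
    fisher = {name: 0.0 for name in params}
    for example in data:
        grads = grad_fn(params, example)
        for name, g in grads.items():
            fisher[name] += g * g / len(data)
    return fisher

# Toy model: per-layer loss 0.5 * (w * x - y)^2, so d(loss)/dw = (w*x - y) * x.
def toy_grads(params, example):
    x, y = example
    return {name: (w * x - y) * x for name, w in params.items()}

params = {"layer_0": 1.0, "layer_1": 0.0}
data = [(1.0, 2.0), (2.0, 2.0)]
fisher = empirical_fisher(params, data, toy_grads)
# layer_0: grads -1 and 0 -> mean square 0.5
# layer_1: grads -2 and -4 -> mean square 10.0
```

Reporting which loss (QA reward surrogate vs. language-modeling likelihood) and which parameter subset enter `grad_fn` is exactly the detail the comment asks the authors to pin down.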
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below, providing clarifications on update stability and committing to include the requested experimental details and analyses in the revision.
Point-by-point responses
- Referee: [§3 (Meta-RL Training Loop)] The central claim that SCoL improves retention over sequential finetuning rests on the assumption that generated layer-update instructions produce stable, non-interfering effects as the base model evolves. No ablations on update magnitude, sign consistency across sequential steps, or retention decay after repeated consolidation cycles are reported, leaving open the possibility that net effects drift or cancel prior consolidations.
Authors: We agree that explicit verification of update stability is important to support the retention claims. Our LongBench v2 results demonstrate sustained gains over long sequential streams relative to finetuning baselines, which indirectly suggests non-interfering effects, but we did not report dedicated ablations on update magnitude, sign consistency, or multi-cycle decay. In the revised manuscript we will add these analyses, including norms of weight updates, consistency of selected layers across steps, and retention curves after repeated consolidation cycles. revision: yes
- Referee: [Experimental Results (SQuAD and LongBench v2 sections)] The abstract and main claims assert performance gains in acquisition and retention, yet the manuscript supplies no concrete metrics, error bars, statistical significance tests, or detailed baseline hyperparameters, preventing verification of the magnitude and reliability of the reported improvements.
Authors: We apologize that the numerical results were not presented with sufficient prominence or detail in the submitted version. The full manuscript contains tables with exact metrics (F1 on SQuAD, likelihood/perplexity on LongBench v2), standard deviations from multiple runs, and statistical significance tests against baselines; all baseline hyperparameters are listed in the appendix. We will revise the main text and results sections to prominently display key numbers, error bars, and significance values, with explicit cross-references to the appendix for reproducibility. revision: yes
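One standard way to supply the promised significance tests is a paired bootstrap over per-example scores. The sketch below uses synthetic scores and hypothetical names, not the paper's data:

```python
# Sketch of a paired bootstrap significance test: resample per-example
# score pairs and estimate how often the baseline would match or beat
# the proposed method. Scores here are synthetic illustrations.
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=2000, seed=0):
    """Fraction of resamples where mean(a) <= mean(b); a small value
    suggests a's advantage is unlikely to be a chance artifact."""
    rng = random.Random(seed)
    n = len(scores_a)
    losses = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_a <= mean_b:
            losses += 1
    return losses / n_resamples

# Synthetic per-example F1 scores: the method is consistently ~0.1 higher.
scol = [0.8, 0.7, 0.9, 0.75, 0.85, 0.8, 0.7, 0.9]
baseline = [0.7, 0.6, 0.8, 0.65, 0.75, 0.7, 0.6, 0.8]
p = paired_bootstrap(scol, baseline)  # 0.0 here: the gap is constant
```

Pairing by example matters: it controls for per-question difficulty, which is substantial on SQuAD-style QA.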
Circularity Check
No significant circularity; derivation uses external rewards and empirical baselines
Full rationale
The paper's core method trains an LLM via meta-RL to output textual layer-update instructions, with rewards drawn from independent sources (supervised QA accuracy on SQuAD and intrinsic likelihood on LongBench v2). These rewards evaluate post-update performance rather than being fitted to the selection policy itself. No equations or steps reduce the claimed improvements to self-definition, renamed fits, or load-bearing self-citations. The reported gains are measured against external baselines (prompting, summarization, batch test-time training, sequential finetuning), and the Fisher-information alignment is presented as post-hoc analysis rather than a definitional input. The framework is therefore self-contained against verifiable external metrics.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: LLMs possess differentiable layers whose updates can be selectively triggered by generated instructions.
- Domain assumption: Meta-reinforcement learning can optimize policies over an evolving model state without instability.
invented entities (1)
- SCoL update-instruction generator (no independent evidence)
Reference graph
Works this paper leans on
- [1] Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024.
- [2] Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, and Boris Ginsburg. RULER: What's the real context size of your long-context language models? In First Conference on Language Modeling, 2024.
- [3] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems, volume 33, 2020.
- [4] Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, ... Improving language models by retrieving from trillions of tokens, 2022.
- [5] Weizhi Wang, Li Dong, Hao Cheng, Xiaodong Liu, Xifeng Yan, Jianfeng Gao, and Furu Wei. Augmenting language models with long-term memory. In Advances in Neural Information Processing Systems, volume 36, 2023.
- [6] Amirkeivan Mohtashami and Martin Jaggi. Random-access infinite context length for transformers. In Advances in Neural Information Processing Systems, volume 36, 2023.
- [7] Alexis Chevalier, Alexander Wettig, Anirudh Ajith, and Danqi Chen. Adapting language models to compress contexts. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3829–3846, Singapore, 2023. Association for Computational Linguistics.
- [8] Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. LLMLingua: Compressing prompts for accelerated inference of large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13358–13376, Singapore, 2023. Association for Computational Linguistics.
- [9] Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. MemGPT: Towards LLMs as operating systems. arXiv preprint arXiv:2310.08560, 2023.
- [10] Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. MemoryBank: Enhancing large language models with long-term memory. Proceedings of the AAAI Conference on Artificial Intelligence, 38(17):19724–19731, 2024.
- [11] James L. McClelland, Bruce L. McNaughton, and Randall C. O'Reilly. Why there are complementary learning systems in the hippocampus and neocortex: Insights from the successes and failures of connectionist models of learning and memory. Psychological Review, 102(3):419–457, 1995.
- [12] Dharshan Kumaran, Demis Hassabis, and James L. McClelland. What learning systems do intelligent agents need? Complementary learning systems theory updated. Trends in Cognitive Sciences, 20(7):512–534, 2016.
- [13] Yadin Dudai, Avi Karni, and Jan Born. The consolidation and transformation of memory. Neuron, 88(1):20–32, 2015.
- [14] Michael McCloskey and Neal J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation, volume 24, pages 109–165. Academic Press, 1989.
- [15] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):352..., 2017.
- [16] Gido M. van de Ven. On the computation of the Fisher information in continual learning. arXiv preprint arXiv:2502.11756, 2025. To appear in the ICLR 2025 Blogpost Track.
- [17] Adam Zweiger, Jyothish Pari, Han Guo, Yoon Kim, and Pulkit Agrawal. Self-adapting language models. In Advances in Neural Information Processing Systems, 2025.
- [18] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR), 2022.
- [19] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas, 2016. Association for Computational Linguistics.
- [20] Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. LongBench v2: Towards deeper understanding and reasoning on realistic long-context multitasks. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025.
- [21] Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. In Advances in Neural Information Processing Systems, volume 35, pages 17359–17372, 2022.
- [22] Eric Mitchell, Charles Lin, Antoine Bosselut, Chelsea Finn, and Christopher D. Manning. Fast model editing at scale. In International Conference on Learning Representations (ICLR), 2022.
- [23] Peng Wang, Zexi Li, Ningyu Zhang, Ziwen Xu, Yunzhi Yao, Yong Jiang, Pengjun Xie, Fei Huang, and Huajun Chen. WISE: Rethinking the knowledge memory for lifelong model editing of large language models. Advances in Neural Information Processing Systems, 37:53764–53797, 2024.
- [24] Qizhou Chen, Taolin Zhang, Xiaofeng He, Dongyang Li, Chengyu Wang, Longtao Huang, and Hui Xue'. Lifelong knowledge editing for LLMs with retrieval-augmented continuous prompt learning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 13565–13580, Miami, Florida, USA, 2024. Association for Computational Linguistics.
- [25] Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. Don't stop pretraining: Adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8342–8360, Online, 2020. Association for Computational Linguistics.
- [26] Haizhou Shi, Zihao Xu, Hengyi Wang, Weiyi Qin, Wenyuan Wang, Yibin Wang, Zifeng Wang, Sayna Ebrahimi, and Hao Wang. Continual learning of large language models: A comprehensive survey. ACM Computing Surveys, 58(5):1–42, 2025.
- [27] Hongyang Chen, Zhongwu Sun, Hongfei Ye, Kunchi Li, and Xuemin Lin. Continual learning in large language models: Methods, challenges, and opportunities. arXiv preprint arXiv:2603.12658, 2026.
- [28] Zhizhong Li and Derek Hoiem. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2935–2947, 2018.
- [29] Xiao Wang, Tianze Chen, Qiming Ge, Han Xia, Rong Bao, Rui Zheng, Qi Zhang, Tao Gui, and Xuanjing Huang. Orthogonal subspace learning for language model continual learning. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 10658–10671, Singapore, 2023. Association for Computational Linguistics.
- [30] Renzhi Wang and Piji Li. LEMoE: Advanced mixture of experts adaptor for lifelong model editing of large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 2551–2575, Miami, Florida, USA, 2024. Association for Computational Linguistics.
- [31] Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. Retrieval augmented language model pre-training. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 3929–3938. PMLR, 2020.
- [32] Jack W Rae, Anna Potapenko, Siddhant M Jayakumar, and Timothy P Lillicrap. Compressive transformers for long-range sequence modelling. arXiv preprint arXiv:1911.05507, 2019.
- [33] Joon Sung Park, Joseph C. O'Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, New York, NY, USA, 2023. Association for Computing Machinery.
- [34] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems, volume 36, 2023.
- [35] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Gray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback..., 2022.
- [36] DeepSeek-AI. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature, 645:633–638, 2025.
- [37] Thomas Anthony, Zheng Tian, and David Barber. Thinking fast and slow with deep learning and tree search. In Advances in Neural Information Processing Systems 30 (NIPS), 2017.
- [38] Qi Sun, Edoardo Cetin, and Yujin Tang. Transformer-squared: Self-adaptive LLMs. arXiv preprint arXiv:2501.06252, 2025.
- [39] Nathan Hu, Eric Mitchell, Christopher Manning, and Chelsea Finn. Meta-learning online adaptation of language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4418–4432, Singapore, 2023. Association for Computational Linguistics.
- [40] Kevin Meng, Arnab Sen Sharma, Alex Andonian, Yonatan Belinkov, and David Bau. Mass-editing memory in a transformer. In International Conference on Learning Representations (ICLR), 2023.
- [41] Song Wang, Yaochen Zhu, Haochen Liu, Zaiyi Zheng, Chen Chen, and Jundong Li. Knowledge editing for large language models: A survey. arXiv preprint arXiv:2310.16218, 2024.
- [42] Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), pages 328–339, 2018.
- [43] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.
- [44] Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, and Rémi Munos. A general theoretical paradigm to understand learning from human preferences. In Proceedings of the 27th International Conference on Artificial Intelligence and Statistics (AISTATS), 2024.
- [45] Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, Wolfgang Macherey, Arnaud Doucet, Orhan Firat, and Nando de Freitas. Reinforced self-training (ReST) for language modeling. arXiv preprint arXiv:2308.08998, 2023.
- [46] Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595, 2023.
- [47] Jesse Mu, Xiang Lisa Li, and Noah Goodman. Learning to compress prompts with gist tokens. In Advances in Neural Information Processing Systems 36 (NeurIPS), 2023.
- [48] Lukas Berglund, Meg Tong, Max Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, and Owain Evans. The reversal curse: LLMs trained on "A is B" fail to learn "B is A". In International Conference on Learning Representations, 2024.
- [49] Weijieying Ren, Xinlong Li, Lei Wang, Tianxiang Zhao, and Wei Qin. Analyzing and reducing catastrophic forgetting in parameter efficient tuning, 2024.
- [50] Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021.
- [51] Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3987–3995. PMLR, 2017.
- [52] Arun Mallya and Svetlana Lazebnik. PackNet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7765–7773, 2018.
- [53] Joan Serrà, Dídac Surís, Marius Miron, and Alexandros Karatzoglou. Overcoming catastrophic forgetting with hard attention to the task. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4548–4557. PMLR, 2018.
- [54] Moonshot AI. Kimi k2.6 tech blog: Advancing open-source coding, 2026. Accessed: 2026-05-05.
- [55] Yufeng Du, Minyang Tian, Srikanth Ronanki, Subendhu Rongali, Sravan Bodapati, Aram Galstyan, Azton Wells, Roy Schwartz, Eliu A Huerta, and Hao Peng. Context length alone hurts LLM performance despite perfect retrieval. arXiv preprint arXiv:2510.05381, 2025.
- [56] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.
- [57] Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems 36 (NeurIPS), 2023.
- [58] Laura O'Mahony, Léo Grinsztajn, Hailey Schoelkopf, and Stella Biderman. Attributing mode collapse in the fine-tuning of large language models. In ICLR 2024 Workshop on Mathematical and Empirical Understanding of Foundation Models, 2024.
- [59] Ziniu Li, Congliang Chen, Tian Xu, Zeyu Qin, Jiancong Xiao, Zhi-Quan Luo, and Ruoyu Sun. Preserving diversity in supervised fine-tuning of large language models. arXiv preprint arXiv:2408.16673, 2024.
- [60] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022.
- [61] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng X..., 2023.