MeMo: Memory as a Model

Alfred Wei Lun Leong; Alok Prakash; Armando Solar-Lezama; Arun Verma; Bryan Kian Hsiang Low; Daniela Rus; Nancy F. Chen; Ryan Wei Heng Quek; Sanghyuk Lee

arxiv: 2605.15156 · v2 · pith:LRYPOV6Gnew · submitted 2026-05-14 · 💻 cs.CL · cs.AI · cs.LG

Ryan Wei Heng Quek, Sanghyuk Lee, Alfred Wei Lun Leong, Arun Verma, Alok Prakash, Nancy F. Chen, Bryan Kian Hsiang Low, Daniela Rus, Armando Solar-Lezama

Pith reviewed 2026-05-21 08:32 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords memory augmentation · large language models · knowledge integration · retrieval augmentation · plug and play · cross document reasoning

The pith

MeMo encodes new knowledge into a dedicated memory model while leaving the LLM parameters frozen.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MeMo as a modular framework that stores fresh information in a separate memory component instead of updating the core large language model or requiring access to its weights or output logits. This setup lets the system incorporate timely or domain-specific facts without retraining the LLM or risking loss of prior capabilities. A sympathetic reader would care because real-world uses often demand up-to-date knowledge that static pretrained models cannot provide, and current approaches either require full model access or scale poorly with data size. MeMo is positioned to handle cross-document relations and to work with both open and closed models at a retrieval cost that does not grow with corpus size.

Core claim

MeMo encodes new knowledge into a dedicated memory model while keeping the LLM parameters unchanged. It captures complex cross-document relationships, stays robust to retrieval noise, avoids catastrophic forgetting, needs no access to LLM weights or output logits for plug-and-play use with open or proprietary models, and keeps retrieval cost independent of corpus size at inference time. Results on BrowseComp-Plus, NarrativeQA, and MuSiQue benchmarks indicate strong performance relative to existing methods.
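The text-only coupling claimed above can be illustrated with a minimal sketch. It assumes both models are reachable as plain text-in, text-out callables; the names `TextModel`, `answer_with_memory`, `toy_memory`, and `toy_executive`, and the prompt wording, are hypothetical and are not taken from the paper, whose actual prompting scheme is not described on this page.

```python
from typing import Callable

# Hypothetical interface: any model reachable as text in, text out.
TextModel = Callable[[str], str]


def answer_with_memory(query: str, memory: TextModel, executive: TextModel) -> str:
    """Answer a query by consulting a memory model, then a frozen executive.

    Only strings cross the boundary between the two models, so the
    executive's weights and logits are never touched.
    """
    recalled = memory(query)  # new knowledge comes from the memory model
    prompt = (
        "Use the recalled knowledge to answer the question.\n"
        f"Recalled knowledge: {recalled}\n"
        f"Question: {query}"
    )
    return executive(prompt)


# Toy stand-ins so the sketch runs without any real model.
def toy_memory(question: str) -> str:
    facts = {"capital of freedonia": "Freedonia's capital is Fredville."}
    for key, fact in facts.items():
        if key in question.lower():
            return fact
    return "No relevant knowledge."


def toy_executive(prompt: str) -> str:
    # Echo the recalled line; a real executive would reason over it.
    prefix = "Recalled knowledge: "
    for line in prompt.splitlines():
        if line.startswith(prefix):
            return line[len(prefix):]
    return ""
```

Under these assumptions, replacing `toy_executive` with a client for a hosted model would leave `answer_with_memory` unchanged, which is the sense in which the composition is plug-and-play.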

What carries the argument

The dedicated memory model that encodes and retrieves new knowledge separately from the LLM.

If this is right

Integration with closed-source LLMs becomes possible without exposing model internals.
Retrieval costs remain constant even as the stored knowledge corpus grows larger.
The underlying LLM avoids any catastrophic forgetting of its original training.
Complex relationships that span multiple documents can be represented directly in memory.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This separation could enable incremental knowledge addition in production systems that must stay current without periodic full retraining cycles.
Private or user-specific knowledge bases could be maintained alongside a shared base model for personalized applications.
The approach might extend to settings where retrieval must operate under strict latency or cost constraints as data volume increases.

Load-bearing premise

A dedicated memory model can reliably capture complex cross-document relationships and remain robust to retrieval noise without any access to the LLM's weights or output logits.

What would settle it

A controlled test on NarrativeQA or MuSiQue where MeMo is given noisy multi-document inputs and fails to outperform standard retrieval baselines in answer accuracy would undermine the robustness and cross-document claims.
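A test of this kind could be organised roughly as follows. This is a sketch under stated assumptions, not the paper's evaluation protocol: the `(question, passages) -> answer` system interface, the exact-match metric, the distractor-injection scheme, and the two toy systems are all hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, List[str], str]          # (question, gold passages, answer)
System = Callable[[str, Sequence[str]], str]  # (question, passages) -> answer


def add_noise(passages: Sequence[str], distractors: Sequence[str],
              n_noise: int, rng: random.Random) -> List[str]:
    """Mix n_noise randomly chosen distractors into the gold passages."""
    noisy = list(passages) + rng.sample(list(distractors), n_noise)
    rng.shuffle(noisy)
    return noisy


def accuracy(system: System, examples: Sequence[Example],
             distractors: Sequence[str], n_noise: int, seed: int = 0) -> float:
    """Exact-match accuracy of a system at a given noise level."""
    rng = random.Random(seed)
    hits = 0
    for question, passages, answer in examples:
        prediction = system(question, add_noise(passages, distractors, n_noise, rng))
        hits += int(prediction.strip().lower() == answer.strip().lower())
    return hits / len(examples)


# Toy systems: one looks up the matching key, one trusts the first passage.
def keyed_lookup(question: str, passages: Sequence[str]) -> str:
    for passage in passages:
        key, _, value = passage.partition(": ")
        if key == question:
            return value
    return ""


def first_passage(question: str, passages: Sequence[str]) -> str:
    return passages[0].partition(": ")[2] if passages else ""


EXAMPLES = [("sky", ["sky: blue"], "blue"), ("grass", ["grass: green"], "green")]
DISTRACTORS = ["snow: white", "coal: black", "rose: red"]
```

Calling `accuracy` for the candidate and for a baseline at increasing `n_noise` gives the comparison described above: the robustness claim would be undermined if the candidate's accuracy did not exceed the baseline's once distractors are added.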

Figures

Figures reproduced from arXiv: 2605.15156 by Alfred Wei Lun Leong, Alok Prakash, Armando Solar-Lezama, Arun Verma, Bryan Kian Hsiang Low, Daniela Rus, Nancy F. Chen, Ryan Wei Heng Quek, Sanghyuk Lee.

**Figure 1.** Overview of the training and inference pipeline of MEMO. During MEMORY model training (left), a frozen GENERATOR model transforms a target corpus into a reflection QA dataset via fact extraction, consolidation, verification, entity surfacing, and cross-document synthesis, which is then used to train a dedicated MEMORY model. During inference (right), the frozen EXECUTIVE model answers complex user queries …
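The ordering of the five synthesis stages named in the Figure 1 caption can be sketched as a simple staged pipeline. Only the stage names and their order come from the caption; the `Item` record type, `make_placeholder_stage`, and `run_pipeline` are hypothetical, and each stage is a stand-in that merely records that it ran, because the prompts and logic of the real stages are not given on this page.

```python
from functools import reduce
from typing import Callable, Dict, List

Item = Dict[str, object]                    # hypothetical record type
Stage = Callable[[List[Item]], List[Item]]

# Stage names and their order are taken from the Figure 1 caption.
STAGE_ORDER = [
    "fact_extraction",
    "consolidation",
    "verification",
    "entity_surfacing",
    "cross_document_synthesis",
]


def make_placeholder_stage(name: str) -> Stage:
    """Build a stand-in stage that only records that it ran.

    A real stage would prompt the frozen generator model; that logic is
    not described here, so it is deliberately left out.
    """
    def stage(items: List[Item]) -> List[Item]:
        return [
            {**item, "trace": list(item.get("trace", [])) + [name]}
            for item in items
        ]
    return stage


def run_pipeline(corpus: List[Item], stages: List[Stage]) -> List[Item]:
    """Feed the corpus through every stage in order."""
    return reduce(lambda items, stage: stage(items), stages, corpus)


corpus = [{"doc_id": "d1", "text": "example document"}]
qa_items = run_pipeline(corpus, [make_placeholder_stage(n) for n in STAGE_ORDER])
```

Each output record carries a `trace` listing the stages it passed through, which makes the stage order easy to check when a placeholder is swapped for a real implementation.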

**Figure 2.** Cost–accuracy trade-off on NarrativeQA when a second corpus arrives (K=2, MEMORY model = Qwen2.5-14B-Instruct, 8×H100). Cumulative training cost is shown on the x-axis (one Qwen-14B SFT run takes ≈ 24 GPU-hours on a 640k-QA-pair corpus). Merging trains MEMORY model only on the new corpus, costing X+Y ≈ 48 GPU-hours, while full retraining re-runs on the union, costing X+(X+Y) ≈ 72 GPU-hours, a 33% savin…
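The arithmetic in the Figure 2 caption can be reproduced with a short calculation. The 24 GPU-hour figure and K=2 come from the caption; the function name `cumulative_gpu_hours`, the assumption of equally sized corpora, and the assumption that training cost grows linearly with corpus size are illustrative simplifications.

```python
from typing import Tuple


def cumulative_gpu_hours(num_corpora: int, hours_per_corpus: float) -> Tuple[float, float]:
    """Return (merging_cost, full_retraining_cost) after num_corpora arrivals.

    Assumes equally sized corpora and training cost linear in corpus size.
    """
    # Merging: each corpus is trained on exactly once.
    merging = num_corpora * hours_per_corpus
    # Full retraining: arrival k retrains on the union of the first k corpora.
    retraining = sum(k * hours_per_corpus for k in range(1, num_corpora + 1))
    return merging, retraining


merging, retraining = cumulative_gpu_hours(num_corpora=2, hours_per_corpus=24.0)
saving = 1 - merging / retraining
print(f"merging={merging:.0f} h, retraining={retraining:.0f} h, saving={saving:.0%}")
# prints: merging=48 h, retraining=72 h, saving=33%
```

Under the same assumptions the relative saving is 1 - 2/(K+1), so it would grow as more corpora arrive; the caption itself reports only the K=2 case.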

**Figure 3.** BrowseComp-Plus accuracy (%) vs. training epoch (Full SFT) for each …

**Figure 4.** NarrativeQA accuracy (%) vs. training epoch (Full SFT) for each …

**Figure 5.** MuSiQue accuracy (%) vs. training epoch (Full SFT) for each …

Original abstract

Large language models (LLMs) achieve strong performance across a wide range of tasks, but remain frozen after pretraining until subsequent updates. Many real-world applications require timely, domain-specific information, motivating the need for efficient mechanisms to incorporate new knowledge. In this paper, we introduce MeMo (Memory as a Model), a modular framework that encodes new knowledge into a dedicated memory model while keeping the LLM parameters unchanged. Compared to existing methods, MeMo offers several advantages: (a) it captures complex cross-document relationships, (b) it is robust to retrieval noise, (c) it avoids catastrophic forgetting in the LLM, (d) it does not require access to the LLM's weights or output logits, enabling plug-and-play integration with both open and proprietary closed-source LLMs, and (e) its retrieval cost is independent of corpus size at inference time. Our experimental results on three benchmarks, BrowseComp-Plus, NarrativeQA, and MuSiQue, show that MeMo achieves strong performance compared to existing methods across diverse settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces MeMo (Memory as a Model), a modular framework that encodes new knowledge into a dedicated memory model while keeping the LLM parameters frozen. It claims five advantages over prior methods: capturing complex cross-document relationships, robustness to retrieval noise, avoidance of catastrophic forgetting, compatibility with closed-source LLMs via no access to weights or logits, and inference-time retrieval cost independent of corpus size. Experimental results on BrowseComp-Plus, NarrativeQA, and MuSiQue are said to demonstrate strong performance relative to existing methods.

Significance. If the empirical results hold and the memory model demonstrably delivers the listed properties without relying on the downstream LLM, the approach would offer a practical route for timely knowledge injection into both open and proprietary LLMs, addressing a common limitation of retrieval-augmented systems.

major comments (1)

The central empirical claim—that MeMo achieves strong performance on BrowseComp-Plus, NarrativeQA, and MuSiQue—is asserted in the abstract and experimental summary but is unsupported by any reported metrics, baselines, ablation studies, or experimental details in the manuscript text. This omission prevents assessment of whether the memory model itself, rather than the LLM, is responsible for the claimed robustness and relational capacity.

minor comments (1)

The abstract enumerates advantages (a)–(e) without indicating which architectural choices or training objectives are intended to realize each property; a short forward reference to the relevant sections would improve readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address the major comment below and outline the revisions we will make to strengthen the empirical presentation.

Referee: The central empirical claim—that MeMo achieves strong performance on BrowseComp-Plus, NarrativeQA, and MuSiQue—is asserted in the abstract and experimental summary but is unsupported by any reported metrics, baselines, ablation studies, or experimental details in the manuscript text. This omission prevents assessment of whether the memory model itself, rather than the LLM, is responsible for the claimed robustness and relational capacity.

Authors: We agree that the current manuscript text does not include the specific quantitative metrics, baseline comparisons, ablation studies, or full experimental details needed to substantiate the claims and to isolate the memory model's contributions. This is a valid observation that limits evaluation of whether the reported advantages arise from the memory model rather than the frozen LLM. In the revised version, we will add a comprehensive experimental section containing tables with exact performance numbers on BrowseComp-Plus, NarrativeQA, and MuSiQue, direct comparisons to relevant baselines, and targeted ablations (e.g., with and without the memory model, under varying retrieval noise levels) that demonstrate the memory model's role in capturing cross-document relations and providing robustness while the LLM remains unchanged. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on experiments

The paper presents MeMo as a modular framework that encodes knowledge into a dedicated memory model while keeping LLM parameters fixed. It lists advantages (cross-document relationships, robustness to noise, no access to weights/logits, fixed retrieval cost) and reports empirical results on BrowseComp-Plus, NarrativeQA, and MuSiQue. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central claims are supported by benchmark comparisons rather than reducing to self-definitional inputs or ansatzes smuggled via prior work. The argument is therefore self-contained and exhibits none of the circularity patterns checked.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations or explicit parameters are described in the abstract; the framework is presented at a high level without free parameters, axioms, or new postulated entities beyond the memory model itself.

pith-pipeline@v0.9.0 · 5742 in / 1086 out tokens · 33981 ms · 2026-05-21T08:32:34.938031+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "We introduce MEMO (Memory as a Model), a modular framework that encodes new knowledge into a dedicated MEMORY model while keeping the LLM parameters unchanged."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "five-step data synthesis pipeline ... reflections ... cross-document synthesis"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

86 extracted references · 86 canonical work pages · 17 internal anchors

[1]

Large Language Models are Zero-Shot Reasoners

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners.arXiv:2205.11916, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

A Survey of Large Language Models

Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models. arXiv:2303.18223, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

A survey on large language models for code generation.ACM Transactions on Software Engineering and Method- ology, 2026

Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghoon Kim. A survey on large language models for code generation.ACM Transactions on Software Engineering and Method- ology, 2026

work page 2026
[4]

Knowledge conflicts for llms: A survey.arXiv:2403.08319, 2024

Rongwu Xu, Zehan Qi, Zhijiang Guo, Cunxiang Wang, Hongru Wang, Yue Zhang, and Wei Xu. Knowledge conflicts for llms: A survey.arXiv:2403.08319, 2024

work page arXiv 2024
[5]

Dated data: Tracing knowledge cutoffs in large language models.arXiv preprint arXiv:2403.12958, 2024

Jeffrey Cheng, Marc Marone, Orion Weller, Dawn Lawrie, Daniel Khashabi, and Benjamin Van Durme. Dated data: Tracing knowledge cutoffs in large language models.arXiv:2403.12958, 2024

work page arXiv 2024
[6]

Smith, Yejin Choi, and Kentaro Inui

Jungo Kasai, Keisuke Sakaguchi, Yoichi Takahashi, Ronan Le Bras, Akari Asai, Xinyan Yu, Dragomir Radev, Noah A. Smith, Yejin Choi, and Kentaro Inui. Realtime qa: What’s the answer right now?arXiv:2207.13332, 2024

work page arXiv 2024
[7]

Karan Singhal, Shekoofeh Azizi, Tao Tu, S. Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, Perry Payne, Martin Senevi- ratne, Paul Gamble, Chris Kelly, Nathaneal Scharli, Aakanksha Chowdhery, Philip Mansfield, Blaise Aguera y Arcas, Dale Webster, Greg S. Corrado, Yossi Matias, Katherine Chou, Juraj...

work page arXiv 2022
[8]

BloombergGPT: A Large Language Model for Finance

Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, and Gideon Mann. Bloomberggpt: A large language model for finance.arXiv:2303.17564, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[9]

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Na- man Goyal, Heinrich Küttler, Mike Lewis, Wen tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive nlp tasks. arXiv:2005.11401, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2005
[10]

Large language models struggle to learn long-tail knowledge.arXiv:2211.08411, 2023

Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric Wallace, and Colin Raffel. Large language models struggle to learn long-tail knowledge.arXiv:2211.08411, 2023

work page arXiv 2023
[11]

Sustainable ai: Environmen- tal implications, challenges and opportunities

Carole-Jean Wu, Ramya Raghavendra, Udit Gupta, Bilge Acun, Newsha Ardalani, Kiwan Maeng, Gloria Chang, Fiona Aga, Jinshi Huang, Charles Bai, et al. Sustainable ai: Environmen- tal implications, challenges and opportunities. InProc. MLSys, pages 795–813, 2022

work page 2022
[12]

Robertson and Steve Walker

Stephen E. Robertson and Steve Walker. Some simple effective approximations to the 2-poisson model for probabilistic weighted retrieval. InProc. SIGIR, 1994

work page 1994
[13]

NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models

Chankyu Lee, Rajarshi Roy, Mengyao Xu, Jonathan Raiman, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Nv-embed: Improved techniques for training llms as generalist embedding models.arXiv:2405.17428, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[14]

Retrieval-augmented generation for knowledge-intensive nlp tasks

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. InProc. NeurIPS, pages 9459–9474, 2020

work page 2020
[15]

From Local to Global: A Graph RAG Approach to Query-Focused Summarization

Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, and Jonathan Larson. From local to global: A graph rag approach to query-focused summarization.arXiv:2404.16130, 2024. 11

work page internal anchor Pith review Pith/arXiv arXiv 2024
[16]

Hipporag: Neuro- biologically inspired long-term memory for large language models

Bernal J Gutiérrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su. Hipporag: Neuro- biologically inspired long-term memory for large language models. InProc. NeurIPS, pages 59532–59569, 2024

work page 2024
[17]

From rag to memory: Non-parametric continual learning for large language models

Bernal Jiménez Gutiérrez, Yiheng Shu, Weijian Qi, Sizhe Zhou, and Yu Su. From rag to memory: Non-parametric continual learning for large language models. InProc. ICML, 2025

work page 2025
[18]

Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. InProc. NeurIPS, pages 1877–1901, 2020

work page 1901
[19]

A survey on in-context learning

Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Baobao Chang, et al. A survey on in-context learning. InProc. EMNLP, 2024

work page 2024
[20]

MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries

Yixuan Tang and Yi Yang. MultiHop-RAG: Benchmarking retrieval-augmented generation for multi-hop queries.arXiv:2401.15391, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[21]

Optimizing multi-hop document retrieval through intermediate representations.arXiv:2503.04796, 2025

Jiaen Lin, Jingyu Liu, and Yingbo Liu. Optimizing multi-hop document retrieval through intermediate representations.arXiv:2503.04796, 2025

work page arXiv 2025
[22]

Continual pre-training of language models.arXiv:2302.03241, 2023

Zixuan Ke, Yijia Shao, Haowei Lin, Tatsuya Konishi, Gyuhak Kim, and Bing Liu. Continual pre-training of language models.arXiv:2302.03241, 2023

work page arXiv 2023
[23]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. InProc. NeurIPS, 2022

work page 2022
[24]

Self-instruct: Aligning language models with self-generated instruc- tions

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instruc- tions. InProc. ACL, pages 13484–13508, 2023

work page 2023
[25]

Scaling instruction-finetuned language models.Journal of Machine Learning Research, pages 1–53, 2024

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models.Journal of Machine Learning Research, pages 1–53, 2024

work page 2024
[26]

An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning

Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang. An empirical study of catastrophic forgetting in large language models during continual fine-tuning.arXiv:2308.08747, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[27]

SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V Le, Sergey Levine, and Yi Ma. Sft memorizes, rl generalizes: A comparative study of foundation model post-training.arXiv:2501.17161, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[28]

Adapting language models to compress contexts

Alexis Chevalier, Alexander Wettig, Anirudh Ajith, and Danqi Chen. Adapting language models to compress contexts. InProc. EMNLP, 2023

work page 2023
[29]

Jesse Mu, Xiang Li, and Noah D. Goodman. Learning to compress prompts with gist tokens. In Proc. NeurIPS, 2023

work page 2023
[30]

In-context autoencoder for context compression in a large language model

Tao Ge, Hu Jing, Lei Wang, Xun Wang, Si-Qing Chen, and Furu Wei. In-context autoencoder for context compression in a large language model. InProc. ICLR, 2024

work page 2024
[31]

Memgen: Weaving generative latent memory for self-evolving agents

Guibin Zhang, Muxin Fu, and Shuicheng Y AN. Memgen: Weaving generative latent memory for self-evolving agents. InProc. ICLR, 2026

work page 2026
[32]

Data augmentation approaches in natural language processing: A survey.AI Open, pages 71–90, 2022

Bohan Li, Yutai Hou, and Wanxiang Che. Data augmentation approaches in natural language processing: A survey.AI Open, pages 71–90, 2022

work page 2022
[33]

An empirical survey of data augmentation for limited data learning in nlp.Transactions of the Association for Computational Linguistics, pages 191–211, 2023

Jiaao Chen, Derek Tam, Colin Raffel, Mohit Bansal, and Diyi Yang. An empirical survey of data augmentation for limited data learning in nlp.Transactions of the Association for Computational Linguistics, pages 191–211, 2023

work page 2023
[34]

Physics of language models: part 3.1, knowledge storage and extraction

Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: part 3.1, knowledge storage and extraction. InProc. ICML, pages 1067–1077, 2024. 12

work page 2024
[35]

Synthetic qa corpora generation with roundtrip consistency

Chris Alberti, Daniel Andor, Emily Pitler, Jacob Devlin, and Michael Collins. Synthetic qa corpora generation with roundtrip consistency. InProc. ACL, pages 6168–6173, 2019

work page 2019
[36]

Training question answering models from synthetic data

Raul Puri, Ryan Spring, Mohammad Shoeybi, Mostofa Patwary, and Bryan Catanzaro. Training question answering models from synthetic data. InProc. EMNLP, pages 5811–5826, 2020

work page 2020
[37]

Don’t hallucinate, abstain: Identifying llm knowledge gaps via multi-llm collaboration

Shangbin Feng, Weijia Shi, Yike Wang, Wenxuan Ding, Vidhisha Balachandran, and Yulia Tsvetkov. Don’t hallucinate, abstain: Identifying llm knowledge gaps via multi-llm collaboration. InProc. ACL, pages 14664–14690, 2024

work page 2024
[38]

Self-training large language models through knowledge detection

Yeo Wei Jie, Teddy Ferdinan, Przemyslaw Kazienko, Ranjan Satapathy, and Erik Cambria. Self-training large language models through knowledge detection. InProc. EMNLP Findings, pages 15033–15045, 2024

work page 2024
[39]

Gomez, Lukasz Kaiser, and Illia Polosukhin

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. InProc. NeurIPS, 2017

work page 2017
[40]

Scaling context requires rethinking attention.arXiv:2507.04239, 2025

Carles Gelada, Jacob Buckman, Sean Zhang, and Txus Bach. Scaling context requires rethinking attention.arXiv:2507.04239, 2025

work page arXiv 2025
[41]

Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang

Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts.Transactions of the Association for Computational Linguistics, 12:157–173, 2024

work page 2024
[42]

RULER: What’s the real context size of your long-context language models? InProc

Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, and Boris Ginsburg. RULER: What’s the real context size of your long-context language models? InProc. COLM, 2024

work page 2024
[43]

The power of noise: Redefining retrieval for rag systems

Florin Cuconasu, Giovanni Trappolini, Federico Siciliano, Simone Filice, Cesare Campagnano, Yoelle Maarek, Nicola Tonellotto, and Fabrizio Silvestri. The power of noise: Redefining retrieval for rag systems. InProc. SIGIR, 2024

work page 2024
[44]

Tackling the inherent difficulty of noise filtering in rag

Jingyu Liu, Jiaen Lin, and Yong Liu. Tackling the inherent difficulty of noise filtering in rag. arXiv:2601.01896, 2026

work page arXiv 2026
[45]

Understanding the relationship between prompts and response uncertainty in large language models

Ze Yu Zhang, Arun Verma, Finale Doshi-Velez, and Bryan Kian Hsiang Low. Understanding the relationship between prompts and response uncertainty in large language models. InProc. ACL Findings, 2026

work page 2026
[46]

ERNIE 2.0: A continual pre-training framework for language understanding

Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Hao Tian, Hua Wu, and Haifeng Wang. ERNIE 2.0: A continual pre-training framework for language understanding. InProc. AAAI, 2020

work page 2020
[47]

Learning without forgetting.IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2935–2947, 2018

Zhizhong Li and Derek Hoiem. Learning without forgetting.IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2935–2947, 2018

work page 2018
[48]

Mapping post- training forgetting in language models at scale.arXiv:2510.17776, 2025

Jackson Harmon, Andreas Hochlehnert, Matthias Bethge, and Ameya Prabhu. Mapping post- training forgetting in language models at scale.arXiv:2510.17776, 2025

work page arXiv 2025
[49]

Fine-tuning aligned language models compromises safety, even when users do not intend to! In Proc

Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to! In Proc. ICLR, 2024

work page 2024
[50]

Dissecting the runtime performance of the training, fine-tuning, and inference of large language models.arXiv:2311.03687, 2023

Longteng Zhang, Xiang Liu, Zeyu Li, Xinglin Pan, Peijie Dong, Ruibo Fan, Rui Guo, Xin Wang, Qiong Luo, Shaohuai Shi, et al. Dissecting the runtime performance of the training, fine-tuning, and inference of large language models.arXiv:2311.03687, 2023

work page arXiv 2023
[51]

Understanding the performance and estimating the cost of llm fine-tuning

Yuchen Xia, Jiho Kim, Yuhan Chen, Haojie Ye, Souvik Kundu, Cong Callie Hao, and Nishil Talati. Understanding the performance and estimating the cost of llm fine-tuning. InProc. IISWC, 2024

work page 2024
[52]

The open source advantage in large language models (llms).arXiv:2412.12004, 2025

Jiya Manchanda, Laura Boettcher, Matheus Westphalen, and Jasser Jasser. The open source advantage in large language models (llms).arXiv:2412.12004, 2025. 13

work page arXiv 2025
[53]

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv:2312.00752, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[54]

Retentive Network: A Successor to Transformer for Large Language Models

Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. Retentive network: A successor to transformer for large language models. arXiv:2307.08621, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[55]

Rabe, DeLesley Hutchins, and Christian Szegedy

Yuhuai Wu, Markus N. Rabe, DeLesley Hutchins, and Christian Szegedy. Memorizing trans- formers. InProc. ICLR, 2022

work page 2022
[56]

General- ization through memorization: Nearest neighbor language models

Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. General- ization through memorization: Nearest neighbor language models. InProc. ICLR, 2020

work page 2020
[58]

a is b" fail to learn

Lukas Berglund, Meg Tong, Max Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, and Owain Evans. The reversal curse: Llms trained on" a is b" fail to learn" b is a". arXiv:2309.12288, 2023

work page arXiv 2023
[59]

Physics of language models: Part 3.2, knowledge manipula- tion.arXiv:2309.14402, 2023

Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 3.2, knowledge manipula- tion.arXiv:2309.14402, 2023

work page arXiv 2023
[60]

Model merging in llms, mllms, and beyond: Methods, theories, applications, and opportu- nities.ACM Computing Surveys, 2024

Enneng Yang, Li Shen, Guibing Guo, Xingwei Wang, Xiaochun Cao, Jie Zhang, and Dacheng Tao. Model merging in llms, mllms, and beyond: Methods, theories, applications, and opportu- nities.ACM Computing Surveys, 2024

work page 2024
[61]

Browsecomp-plus: A more fair and transparent evaluation benchmark of deep-research agent.arXiv:2508.06600, 2025

Zijian Chen, Xueguang Ma, Shengyao Zhuang, Ping Nie, Kai Zou, Andrew Liu, Joshua Green, Kshama Patel, Ruoxi Meng, Mingyi Su, Sahel Sharifymoghaddam, Yanxi Li, Haoran Hong, Xinyu Shi, Xuye Liu, Nandan Thakur, Crystina Zhang, Luyu Gao, Wenhu Chen, and Jimmy Lin. Browsecomp-plus: A more fair and transparent evaluation benchmark of deep-research agent.arXiv:2...

work page arXiv 2025
[62]

langdetect.https://github.com/Mimino666/langdetect, 2021

Michal Danilák. langdetect.https://github.com/Mimino666/langdetect, 2021

work page 2021
[63]

The narrativeqa reading comprehension challenge.Transac- tions of the Association for Computational Linguistics, pages 317–328, 2018

Tomáš Koˇcisk`y, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. The narrativeqa reading comprehension challenge.Transac- tions of the Association for Computational Linguistics, pages 317–328, 2018

work page 2018
[64]

Musique: Multihop questions via single-hop question composition.arXiv:2108.00573, 2022

Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Musique: Multihop questions via single-hop question composition.arXiv:2108.00573, 2022

work page arXiv 2022
[65]

Cartridges: Lightweight and general- purpose long context representations via self-study.arXiv:2506.06266, 2025

Sabri Eyuboglu, Ryan Ehrlich, Simran Arora, Neel Guha, Dylan Zinsley, Emily Liu, Will Tennien, Atri Rudra, James Zou, Azalia Mirhoseini, et al. Cartridges: Lightweight and general- purpose long context representations via self-study.arXiv:2506.06266, 2025

work page arXiv 2025
[66]

Memory decoder: A pretrained, plug-and-play memory for large language models

Jiaqi Cao, Jiarui Wang, Rubin Wei, Qipeng Guo, Kai Chen, Bowen Zhou, and Zhouhan Lin. Memory decoder: A pretrained, plug-and-play memory for large language models. arXiv:2508.09874, 2025

[67] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, et al. Qwen2.5 technical report. arXiv:2412.15115, 2025

[68] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. arXiv:2309.06180, 2023

[69] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, page 127063, 2024

[70] Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. Yarn: Efficient context window extension of large language models. In Proc. ICLR, 2024

[71] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv:1711.05101, 2017

[72] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. In SC20: international conference for high performance computing, networking, storage and analysis, pages 1–16. IEEE, 2020

[73] Google DeepMind. Gemini 3 flash model card. https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Flash-Model-Card.pdf, December 2025

[74] Gheorghe Comanici, Eric Bieber, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv:2507.06261, 2025

[75] Jeffrey Ip and Kritin Vongthongsri. deepeval. https://github.com/confident-ai/deepeval, 2025

[76] Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean-bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, Gaël Liu, Francesco Visin, Kathleen Kenealy, Lucas Bey...

[77] Alexander Amini, Anna Banaszak, Harold Benoit, Arthur Böök, Tarek Dakhran, et al. LFM2 technical report. arXiv:2511.23404, 2025

[78] Prateek Yadav, Derek Tam, Leshem Choshen, Colin Raffel, and Mohit Bansal. Ties-merging: Resolving interference when merging models. In Proc. NeurIPS, 2023

[79] Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction. MIT press Cambridge, 1998

[80] Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, et al. Tulu 3: Pushing frontiers in open language model post-training. arXiv:2411.15124, 2024

[81] Oded Ovadia, Menachem Brief, Moshik Mishaeli, and Oren Elisha. Fine-tuning or retrieval? comparing knowledge injection in llms. In Proc. EMNLP, pages 237–250, 2024


[1] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. arXiv:2205.11916, 2023

[2] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models. arXiv:2303.18223, 2023

[3] Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghoon Kim. A survey on large language models for code generation. ACM Transactions on Software Engineering and Methodology, 2026

[4] Rongwu Xu, Zehan Qi, Zhijiang Guo, Cunxiang Wang, Hongru Wang, Yue Zhang, and Wei Xu. Knowledge conflicts for llms: A survey. arXiv:2403.08319, 2024

[5] Jeffrey Cheng, Marc Marone, Orion Weller, Dawn Lawrie, Daniel Khashabi, and Benjamin Van Durme. Dated data: Tracing knowledge cutoffs in large language models. arXiv:2403.12958, 2024

[6] Jungo Kasai, Keisuke Sakaguchi, Yoichi Takahashi, Ronan Le Bras, Akari Asai, Xinyan Yu, Dragomir Radev, Noah A. Smith, Yejin Choi, and Kentaro Inui. Realtime qa: What’s the answer right now? arXiv:2207.13332, 2024

[7] Karan Singhal, Shekoofeh Azizi, Tao Tu, S. Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, Perry Payne, Martin Seneviratne, Paul Gamble, Chris Kelly, Nathaneal Scharli, Aakanksha Chowdhery, Philip Mansfield, Blaise Aguera y Arcas, Dale Webster, Greg S. Corrado, Yossi Matias, Katherine Chou, Juraj...

[8] Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, and Gideon Mann. Bloomberggpt: A large language model for finance. arXiv:2303.17564, 2023

[9] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive nlp tasks. arXiv:2005.11401, 2021

[10] Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric Wallace, and Colin Raffel. Large language models struggle to learn long-tail knowledge. arXiv:2211.08411, 2023

[11] Carole-Jean Wu, Ramya Raghavendra, Udit Gupta, Bilge Acun, Newsha Ardalani, Kiwan Maeng, Gloria Chang, Fiona Aga, Jinshi Huang, Charles Bai, et al. Sustainable ai: Environmental implications, challenges and opportunities. In Proc. MLSys, pages 795–813, 2022

[12] Stephen E. Robertson and Steve Walker. Some simple effective approximations to the 2-poisson model for probabilistic weighted retrieval. In Proc. SIGIR, 1994

[13] Chankyu Lee, Rajarshi Roy, Mengyao Xu, Jonathan Raiman, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Nv-embed: Improved techniques for training llms as generalist embedding models. arXiv:2405.17428, 2024

[14] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. In Proc. NeurIPS, pages 9459–9474, 2020

[15] Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, and Jonathan Larson. From local to global: A graph rag approach to query-focused summarization. arXiv:2404.16130, 2024

[16] Bernal J Gutiérrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su. Hipporag: Neurobiologically inspired long-term memory for large language models. In Proc. NeurIPS, pages 59532–59569, 2024

[17] Bernal Jiménez Gutiérrez, Yiheng Shu, Weijian Qi, Sizhe Zhou, and Yu Su. From rag to memory: Non-parametric continual learning for large language models. In Proc. ICML, 2025

[18] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In Proc. NeurIPS, pages 1877–1901, 2020

[19] Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Baobao Chang, et al. A survey on in-context learning. In Proc. EMNLP, 2024

[20] Yixuan Tang and Yi Yang. MultiHop-RAG: Benchmarking retrieval-augmented generation for multi-hop queries. arXiv:2401.15391, 2024

[21] Jiaen Lin, Jingyu Liu, and Yingbo Liu. Optimizing multi-hop document retrieval through intermediate representations. arXiv:2503.04796, 2025

[22] Zixuan Ke, Yijia Shao, Haowei Lin, Tatsuya Konishi, Gyuhak Kim, and Bing Liu. Continual pre-training of language models. arXiv:2302.03241, 2023

[23] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. In Proc. NeurIPS, 2022

[24] Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instructions. In Proc. ACL, pages 13484–13508, 2023

[25] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. Journal of Machine Learning Research, pages 1–53, 2024

[26] Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang. An empirical study of catastrophic forgetting in large language models during continual fine-tuning. arXiv:2308.08747, 2025

[27] Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V Le, Sergey Levine, and Yi Ma. Sft memorizes, rl generalizes: A comparative study of foundation model post-training. arXiv:2501.17161, 2025

[28] Alexis Chevalier, Alexander Wettig, Anirudh Ajith, and Danqi Chen. Adapting language models to compress contexts. In Proc. EMNLP, 2023

[29] Jesse Mu, Xiang Li, and Noah D. Goodman. Learning to compress prompts with gist tokens. In Proc. NeurIPS, 2023

[30] Tao Ge, Hu Jing, Lei Wang, Xun Wang, Si-Qing Chen, and Furu Wei. In-context autoencoder for context compression in a large language model. In Proc. ICLR, 2024

[31] Guibin Zhang, Muxin Fu, and Shuicheng Yan. Memgen: Weaving generative latent memory for self-evolving agents. In Proc. ICLR, 2026

[32] Bohan Li, Yutai Hou, and Wanxiang Che. Data augmentation approaches in natural language processing: A survey. AI Open, pages 71–90, 2022

[33] Jiaao Chen, Derek Tam, Colin Raffel, Mohit Bansal, and Diyi Yang. An empirical survey of data augmentation for limited data learning in nlp. Transactions of the Association for Computational Linguistics, pages 191–211, 2023

[34] Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: part 3.1, knowledge storage and extraction. In Proc. ICML, pages 1067–1077, 2024

[35] Chris Alberti, Daniel Andor, Emily Pitler, Jacob Devlin, and Michael Collins. Synthetic qa corpora generation with roundtrip consistency. In Proc. ACL, pages 6168–6173, 2019

[36] Raul Puri, Ryan Spring, Mohammad Shoeybi, Mostofa Patwary, and Bryan Catanzaro. Training question answering models from synthetic data. In Proc. EMNLP, pages 5811–5826, 2020

[37] Shangbin Feng, Weijia Shi, Yike Wang, Wenxuan Ding, Vidhisha Balachandran, and Yulia Tsvetkov. Don’t hallucinate, abstain: Identifying llm knowledge gaps via multi-llm collaboration. In Proc. ACL, pages 14664–14690, 2024

[38] Yeo Wei Jie, Teddy Ferdinan, Przemyslaw Kazienko, Ranjan Satapathy, and Erik Cambria. Self-training large language models through knowledge detection. In Proc. EMNLP Findings, pages 15033–15045, 2024

[39] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proc. NeurIPS, 2017

[40] Carles Gelada, Jacob Buckman, Sean Zhang, and Txus Bach. Scaling context requires rethinking attention. arXiv:2507.04239, 2025

[41] Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024

[42] Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, and Boris Ginsburg. RULER: What’s the real context size of your long-context language models? In Proc. COLM, 2024

[43] Florin Cuconasu, Giovanni Trappolini, Federico Siciliano, Simone Filice, Cesare Campagnano, Yoelle Maarek, Nicola Tonellotto, and Fabrizio Silvestri. The power of noise: Redefining retrieval for rag systems. In Proc. SIGIR, 2024

[44] Jingyu Liu, Jiaen Lin, and Yong Liu. Tackling the inherent difficulty of noise filtering in rag. arXiv:2601.01896, 2026

[45] Ze Yu Zhang, Arun Verma, Finale Doshi-Velez, and Bryan Kian Hsiang Low. Understanding the relationship between prompts and response uncertainty in large language models. In Proc. ACL Findings, 2026

[46] Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Hao Tian, Hua Wu, and Haifeng Wang. ERNIE 2.0: A continual pre-training framework for language understanding. In Proc. AAAI, 2020

[47] Zhizhong Li and Derek Hoiem. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2935–2947, 2018

[48] Jackson Harmon, Andreas Hochlehnert, Matthias Bethge, and Ameya Prabhu. Mapping post-training forgetting in language models at scale. arXiv:2510.17776, 2025

[49] Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to! In Proc. ICLR, 2024

[50] Longteng Zhang, Xiang Liu, Zeyu Li, Xinglin Pan, Peijie Dong, Ruibo Fan, Rui Guo, Xin Wang, Qiong Luo, Shaohuai Shi, et al. Dissecting the runtime performance of the training, fine-tuning, and inference of large language models. arXiv:2311.03687, 2023

[51] Yuchen Xia, Jiho Kim, Yuhan Chen, Haojie Ye, Souvik Kundu, Cong Callie Hao, and Nishil Talati. Understanding the performance and estimating the cost of llm fine-tuning. In Proc. IISWC, 2024

[52] Jiya Manchanda, Laura Boettcher, Matheus Westphalen, and Jasser Jasser. The open source advantage in large language models (llms). arXiv:2412.12004, 2025

[53] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv:2312.00752, 2023

[54] Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. Retentive network: A successor to transformer for large language models. arXiv:2307.08621, 2023

[55] Yuhuai Wu, Markus N. Rabe, DeLesley Hutchins, and Christian Szegedy. Memorizing transformers. In Proc. ICLR, 2022

[56] Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. Generalization through memorization: Nearest neighbor language models. In Proc. ICLR, 2020

[58] Lukas Berglund, Meg Tong, Max Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, and Owain Evans. The reversal curse: Llms trained on "a is b" fail to learn "b is a". arXiv:2309.12288, 2023

[59] Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 3.2, knowledge manipulation. arXiv:2309.14402, 2023

[60] Enneng Yang, Li Shen, Guibing Guo, Xingwei Wang, Xiaochun Cao, Jie Zhang, and Dacheng Tao. Model merging in llms, mllms, and beyond: Methods, theories, applications, and opportunities. ACM Computing Surveys, 2024
