Context Recycling for Long-Horizon LLM Inference

Derek Thomas

arxiv: 2606.26105 · v1 · pith:HATNR7Q5new · submitted 2026-05-01 · 💻 cs.CL · cs.AI · cs.LG

Pith reviewed 2026-07-01 08:09 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords context recycling · long-horizon inference · LLM efficiency · conversational agents · external memory · multi-turn reasoning · token reduction

The pith

ContextForge recycles context via structured queries and external memory to reduce token use in long LLM conversations while keeping accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models lose performance in long conversations because of context window limits and wasteful token replay. ContextForge tackles this by generating structured queries to pull relevant details from external memory and then synthesizing controlled responses instead of replaying full history. The system is tested on a 15-turn benchmark of healthcare queries that includes multi-turn reasoning, back-references, and domain shifts. Results show better consistency and lower token counts than baseline agents that use the same models, with matching response accuracy. This suggests context recycling can stretch existing models to longer horizons without larger windows or retraining.

Core claim

ContextForge enables efficient reuse of prior computation in LLM inference for long-horizon tasks by combining structured query generation, external memory retrieval, and controlled synthesis. This maintains task-relevant information across turns without full context replay. On a 15-turn conversational benchmark testing multi-turn reasoning, back-references, and domain shifts across structured healthcare queries, ContextForge achieves improved consistency and reduced token consumption while maintaining comparable response accuracy to baseline agents using identical underlying models.

What carries the argument

ContextForge system that recycles context through structured query generation, external memory retrieval, and controlled synthesis to reuse prior computation without full replay.
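
A minimal sketch of the recycling pattern described above, not the author's implementation. Every name here (`MemoryStore`, `make_query`, `build_prompt`) is hypothetical, and plain keyword matching stands in for the paper's structured query generation, whose details the abstract does not give. The point it illustrates is that the prompt is assembled from retrieved facts rather than from the full transcript.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryStore:
    """Hypothetical external memory: maps a topic key to stored facts."""
    facts: dict[str, list[str]] = field(default_factory=dict)

    def write(self, key: str, fact: str) -> None:
        self.facts.setdefault(key, []).append(fact)

    def retrieve(self, keys: list[str]) -> list[str]:
        # Return only the facts filed under the requested keys.
        return [f for k in keys for f in self.facts.get(k, [])]


def make_query(user_turn: str, known_keys: set[str]) -> list[str]:
    """Assumed stand-in for structured query generation: known keys named in the turn."""
    words = {w.strip(".,?!").lower() for w in user_turn.split()}
    return sorted(words & known_keys)


def build_prompt(user_turn: str, memory: MemoryStore) -> str:
    """Assemble a prompt from retrieved facts only, never from the full history."""
    keys = make_query(user_turn, set(memory.facts))
    context = memory.retrieve(keys)
    return "\n".join(["Context:", *context, "Question:", user_turn])
```

Under these assumptions, the prompt size depends on how many facts match the current turn, not on how many turns came before it.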

If this is right

LLM capabilities extend to longer conversational horizons without needing larger context windows or model retraining.
Token consumption drops in multi-turn tasks while answer quality stays comparable.
Consistency improves on benchmarks that require back-references and domain shifts.
Existing models can handle extended interactions through context recycling rather than full history replay.
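
To make the token argument concrete, here is a back-of-the-envelope sketch. It assumes every turn adds the same number of tokens and that recycling sends a fixed retrieved budget per turn; both are simplifications, and the numbers in the usage note are illustrative, not results from the paper.

```python
def replay_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens when turn t resends all earlier turns plus the new one."""
    return sum(t * tokens_per_turn for t in range(1, turns + 1))


def recycled_tokens(turns: int, tokens_per_turn: int, retrieved_budget: int) -> int:
    """Total input tokens when each turn sends the new turn plus a fixed retrieved budget."""
    return turns * (tokens_per_turn + retrieved_budget)
```

With 15 turns of 200 tokens each, full replay sends 24,000 input tokens in total, while recycling with a 400-token retrieved budget sends 9,000. Replay grows quadratically with the number of turns; recycling grows linearly.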

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same recycling pattern could apply to long non-conversational contexts such as document summarization chains.
Combining ContextForge with other memory techniques might further lower token counts in production systems.
Applying the method to benchmarks with more than 15 turns or different domains would test its scaling limits.

Load-bearing premise

That structured query generation combined with external memory retrieval and controlled synthesis can keep task-relevant information intact across turns without major loss of context or accuracy.

What would settle it

A run of the 15-turn healthcare benchmark in which ContextForge shows lower accuracy or loses critical details compared with the baseline agent on the same models.

Figures

Figures reproduced from arXiv: 2606.26105 by Derek Thomas.

**Figure 2.** Query processing flow through the five-layer hierarchy. Each query traverses keyword extraction, index lookup, cache resolution, context assembly with LoRA activation, LLM generation, and context recycling. 3. The LLM generates a response using the assembled context. 4. The branch is freed: its tokens are released from the active context. 5. The next query can load an entirely different branch using the s…
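
The load, generate, and free cycle in the caption can be sketched roughly as follows. This is an illustration under stated assumptions: `ActiveContext`, its `index` and `branches` dictionaries, and the `generate` callback are all hypothetical, and cache resolution and LoRA activation are left out because the caption does not describe how they work.

```python
class ActiveContext:
    """Hypothetical active context that holds at most one memory branch at a time."""

    def __init__(self, index: dict[str, str], branches: dict[str, list[str]]):
        self.index = index          # keyword -> branch name
        self.branches = branches    # branch name -> stored text chunks
        self.loaded: list[str] = []

    def answer(self, query: str, generate) -> str:
        # Keyword extraction and index lookup (both simplified stand-ins).
        keywords = [w.strip(".,?!").lower() for w in query.split()]
        branch = next((self.index[k] for k in keywords if k in self.index), None)
        # Context assembly: load only the matching branch, or nothing if none matches.
        self.loaded = list(self.branches.get(branch, []))
        try:
            return generate(self.loaded, query)
        finally:
            # Context recycling: free the branch so its tokens leave the active context.
            self.loaded = []
```

Because the branch is released in a `finally` clause, the next query starts from an empty active context even if generation raises an error.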

read the original abstract

Large language models (LLMs) exhibit strong capabilities in short-context reasoning but degrade in performance over long conversational horizons due to context window limitations and inefficient token usage. We introduce ContextForge, a system for context recycling that maintains task-relevant information across turns by combining structured query generation, external memory retrieval, and controlled synthesis. The system enables efficient reuse of prior computation without relying on full context replay, reducing token overhead while preserving answer quality. We evaluate ContextForge using a 15-turn conversational benchmark that tests multi-turn reasoning, back-references, and domain shifts across structured healthcare queries. Compared to a baseline agent using identical underlying models, ContextForge demonstrates improved consistency and reduced token consumption, while maintaining comparable response accuracy. These results suggest that context recycling provides a practical approach for extending LLM capabilities in long-horizon tasks without requiring larger context windows or model retraining. Code and evaluation artifacts are available at https://github.com/Betanu701/ContextForge.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ContextForge applies standard retrieval and memory techniques to a 15-turn healthcare benchmark with public code, but the abstract gives no numbers so the size of any gains is impossible to judge from the description.

The main takeaway is that this paper takes the usual RAG pattern—structured queries, external memory lookup, and controlled synthesis—and runs it on a new 15-turn conversational benchmark focused on healthcare queries. It reports better consistency and lower token counts with accuracy that stays comparable to a plain baseline agent using the same models.

The work is honest about the practical problem of context window limits in long sessions and shows a modular way to recycle prior turns without full replay. Releasing the code and artifacts on GitHub is the clearest positive, since it lets anyone rerun the exact setup on the benchmark that includes back-references and domain shifts.

The soft spots are straightforward. The abstract states the improvements without any percentages, baselines, variance numbers, or statistical tests, so the claims cannot be assessed from the summary alone. The individual pieces are already common in the literature, so the paper’s value depends entirely on whether the combined system actually delivers measurable gains on this specific benchmark. If the full text has those details and they hold up, the contribution is modest but usable.

This is the kind of paper that might interest engineers who need to stretch LLM conversations in production without buying larger context windows. It is not going to shift research directions, but the public artifacts make the results checkable.

I would send it to peer review. The evaluation is concrete enough and the code is available, so referees can verify the numbers rather than accept the abstract at face value.

Referee Report

1 major / 0 minor

Summary. The paper introduces ContextForge, a modular system combining structured query generation, external memory retrieval, and controlled synthesis for context recycling in long-horizon LLM conversations. It evaluates the approach on a 15-turn conversational benchmark focused on multi-turn reasoning, back-references, and domain shifts in structured healthcare queries, claiming improved consistency and reduced token consumption with comparable accuracy relative to a baseline agent using identical models. Code and evaluation artifacts are provided for reproducibility.

Significance. If the empirical claims hold, the work demonstrates a practical, modular method for extending LLM performance on long conversations without larger context windows or retraining, by reusing prior computation via external memory. The availability of code and artifacts is a strength that enables direct verification of whether task-relevant information is preserved across turns.

major comments (1)

[Abstract and Evaluation] The abstract and evaluation description assert improved consistency, reduced token consumption, and comparable accuracy on the 15-turn benchmark but supply no quantitative metrics, error bars, baseline details, or statistical tests, preventing verification that the data support the central claim.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the major comment below and commit to revisions that provide the requested quantitative details.

Referee: [Abstract and Evaluation] The abstract and evaluation description assert improved consistency, reduced token consumption, and comparable accuracy on the 15-turn benchmark but supply no quantitative metrics, error bars, baseline details, or statistical tests, preventing verification that the data support the central claim.

Authors: We agree that the current abstract and evaluation description lack specific quantitative metrics, error bars, baseline details, and statistical tests. This prevents full verification of the claims as presented. We will revise the abstract to report key numerical results (e.g., token reduction percentages, consistency and accuracy scores with comparisons to the baseline) and expand the evaluation section to include error bars, explicit baseline specifications, and any applicable statistical tests. These additions will be supported by the existing code and artifacts.

Revision: yes
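
For readers wondering what such reporting could look like, the sketch below computes a token reduction percentage and a percentile bootstrap interval over paired per-turn differences. It is one common approach, not the analysis the authors have committed to; the function names and defaults are assumptions, and no numbers from the paper are used.

```python
import random
from statistics import mean


def token_reduction_pct(baseline: list[int], system: list[int]) -> float:
    """Percent fewer tokens used by the system over the same set of turns."""
    return 100.0 * (sum(baseline) - sum(system)) / sum(baseline)


def bootstrap_ci(diffs: list[float], n_resamples: int = 2000, alpha: float = 0.05,
                 seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap interval for the mean of paired per-turn differences."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    means = sorted(
        mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

With only 15 turns per conversation, an interval like this will be wide unless the benchmark is run many times, which is one reason the referee's request for variance numbers matters.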

Circularity Check

0 steps flagged

No significant circularity; empirical system evaluation only

The paper presents ContextForge as an engineering system (structured query generation + external memory + controlled synthesis) evaluated empirically on a 15-turn benchmark. No equations, fitted parameters, predictions derived from inputs, or self-citations appear in the provided text. The central claims rest on direct comparison of token consumption, consistency, and accuracy against a baseline using the same models; these outcomes are externally falsifiable via the supplied code and artifacts rather than reducing to any definitional or self-referential step.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract introduces no free parameters, mathematical axioms, or invented physical entities; the system is presented as a composition of standard components whose effectiveness is asserted via evaluation.

pith-pipeline@v0.9.1-grok · 5681 in / 1038 out tokens · 36997 ms · 2026-07-01T08:09:17.989765+00:00 · methodology

