pith. machine review for the scientific record.

arxiv: 2604.09852 · v1 · submitted 2026-04-10 · 💻 cs.AI · cs.LG

Recognition: unknown

MEMENTO: Teaching LLMs to Manage Their Own Context

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 17:12 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords context compression · LLM reasoning · KV cache reduction · memento summaries · self-managed context · reasoning traces · supervised fine-tuning · dual information stream

The pith

LLMs can be trained to segment their reasoning into blocks, generate dense memento summaries for each, and continue while attending mainly to those summaries plus retained KV states.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Reasoning models produce long, unstructured streams that inflate context length, KV cache, and compute. MEMENTO teaches them to divide the stream into blocks, output a compact memento summary for each block, and reason forward by attending primarily to the mementos. A two-stage supervised fine-tuning process on a new 228K-trace dataset works across model families and sizes from 8B to 32B parameters. The resulting models keep strong accuracy on math, science, and coding benchmarks while cutting peak KV cache size by roughly 2.5 times and raising throughput by about 1.75 times. The method also exposes a dual information channel: both the explicit memento text and the implicit information carried in the mementos' retained KV states are needed for full performance.

Core claim

By training models to produce a memento as a dense textual summary of each reasoning block, and to carry forward the KV states written while generating that memento (computed while the block was still attendable, so they absorb implicit block information), subsequent reasoning can proceed with far shorter active context while preserving the information required for correct continuation. The dual stream is essential: dropping the KV channel alone reduces accuracy by 15 percentage points on AIME24.
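To make the claimed accounting concrete, here is a minimal sketch of the block/memento loop as Pith reads it. Everything in it is hypothetical illustration (the KVCache class, the solve_with_mementos loop, the block and memento lengths); it is not the paper's vLLM implementation, and it only shows why peak cache tracks one block plus the accumulated mementos rather than the full trace.

```python
# Toy model of MEMENTO-style cache accounting (Pith's sketch, not the paper's code).
# Loop: (1) generate a block with full attention, (2) write its memento while the
# block is still attendable (so the memento's KV states absorb implicit block
# information), (3) compact: evict the block's KV entries, keep the memento's.
from dataclasses import dataclass, field

@dataclass
class KVCache:
    """Tracks which token spans are resident, measured in tokens."""
    resident: list = field(default_factory=list)

    def add(self, start, end):
        self.resident.append((start, end))

    def evict(self, start, end):
        self.resident.remove((start, end))

    def size(self):
        return sum(end - start for start, end in self.resident)

def solve_with_mementos(prompt_len, block_lens, memento_lens):
    """Return (peak, final) resident-token counts for one simulated trace."""
    cache, pos, peak = KVCache(), prompt_len, 0
    cache.add(0, prompt_len)                    # the prompt stays resident throughout
    for blk, mem in zip(block_lens, memento_lens):
        cache.add(pos, pos + blk)               # (1) reasoning block
        cache.add(pos + blk, pos + blk + mem)   # (2) memento, written with block in view
        peak = max(peak, cache.size())
        cache.evict(pos, pos + blk)             # (3) compaction: drop block, keep memento
        pos += blk + mem
    return peak, cache.size()

# Sawtooth: peak = prompt + prior mementos + one block, never the whole trace.
peak, final = solve_with_mementos(500, [3000] * 6, [200] * 6)
baseline = 500 + 6 * (3000 + 200)               # vanilla decoding keeps everything
print(f"peak {peak} vs baseline {baseline} tokens ({baseline / peak:.1f}x reduction)")
```

The printed ratio depends entirely on the assumed lengths; the paper's ∼2.5× figure is an empirical average over real traces, not a consequence of this toy.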

What carries the argument

The memento, a model-generated dense textual summary of one reasoning block, paired with the KV states the model writes while generating that memento with the block still in view; those states implicitly retain information from the original block's tokens.

If this is right

  • Peak KV cache usage falls by a factor of about 2.5 while benchmark accuracy stays intact.
  • Inference throughput rises by roughly 1.75 times with the extended vLLM implementation.
  • The two-stage training recipe succeeds on Qwen3, Phi-4, and Olmo 3 models from 8B to 32B parameters.
  • Reinforcement learning can be run on the compressed context to gain further accuracy.
  • Information from each block travels in two parallel channels: the explicit memento text and the retained KV states.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Longer reasoning horizons become feasible because active context no longer grows linearly with every token generated.
  • Models could learn to decide when to emit a memento and how detailed it should be, turning context management into an explicit policy.
  • The retained KV states appear to hold recoverable implicit details that pure text summaries miss, suggesting future work on distilling that channel into more compact forms.
  • The same block-and-memento pattern might extend to agent loops or multi-turn tool use where memory management is the main bottleneck.

Load-bearing premise

For each reasoning block, the memento text plus its retained KV states together hold every piece of information the model needs to continue correctly, with no critical loss from the compression step.

What would settle it

Measure whether a MEMENTO model forced to use only the memento text (no retained KV states) solves fewer AIME24 problems than the full dual-channel version or than an uncompressed baseline model on the same problems.
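A sketch of what that deciding experiment could look like as a harness. Every interface here is assumed, not the paper's API: run_memento with a hypothetical restart_kv flag (True reproduces the Figure 8 restart, discarding the cache and re-prefilling over prompt plus mementos only), run_vanilla as the uncompressed baseline, and problem objects with statement and check attributes.

```python
# Hypothetical evaluation harness for the settling experiment (Pith's sketch).
# restart_kv=True forces text-only mementos: the cache is rebuilt without the
# blocks ever being attendable, so memento KV states carry no implicit channel.

def pass_at_1(solver, problems):
    """Fraction of problems solved in a single attempt."""
    return sum(p.check(solver(p.statement)) for p in problems) / len(problems)

def settle_it(run_memento, run_vanilla, aime24):
    dual = pass_at_1(lambda q: run_memento(q, restart_kv=False), aime24)
    text = pass_at_1(lambda q: run_memento(q, restart_kv=True), aime24)
    full = pass_at_1(run_vanilla, aime24)
    print(f"dual-channel {dual:.1%} | text-only {text:.1%} | uncompressed {full:.1%}")
    # The paper's claim predicts dual ~ full, and text ~ dual minus ~15 pp.
    return dual, text, full
```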

Figures

Figures reproduced from arXiv: 2604.09852 by Ahmed Awadallah, Dimitris Papailiopoulos, Eric Horvitz, Hao Tang, John Langford, Lingjiao Chen, Shivam Garg, Vasilis Kontonis, Yuchen Zeng, Ziyan Wang.

Figure 1
Figure 1: Boundary scoring assigns each inter-sentence boundary a score from 0 to 3. Sentence 179 wraps up a conclusion …
Figure 2
Figure 2: OPENMEMENTOS dataset distributions by domain (228K samples). (a) Math and code have ∼9 blocks/sample; science has ∼7. (b) Block sizes range from 2.3K (science) to 3.8K (math) chars. (c) Summary sizes cluster around 509–603 chars across all domains, indicating a stable compression target. (d) Math achieves the tightest compression ratio (median 0.16) due to its larger blocks.
Figure 3
Figure 3: Training data scaling. Pass@1 accuracy on AIME24 and AIME25 for Qwen2.5-7B-Instruct fine-tuned on 1K–100K examples. All methods improve monotonically with data size.
Figure 4
Figure 4: KV cache traces on individual problems (Qwen3-8B, both answers correct). (a) AIME24 P2: typical sawtooth pattern with 6 compactions; peak 0.77 vs 2.17 GB (2.8× reduction). (b) AIME24 P26: MEMENTO solves the problem in 12k tokens (vs 23k); frequent compactions keep peak at 0.82 vs 3.41 GB (4.2×). (c) AIME24 P5: MEMENTO generates 3× more tokens (31k vs 10k) with many compactions. Peak is still lower (1.27 vs …
Figure 5
Figure 5: CDF of compression ratio (summary chars / block chars). OLMo3-7B and Qwen3-8B achieve the tightest compression on competition math. Phi-4 has the widest compression spread, especially on MATH-500. Summary length is stable; compression scales with difficulty. Across four model families and four benchmarks, mementos converge to ∼260–615 characters regardless of block length, matching training targets.
Figure 6
Figure 6: Majority-vote headroom and Qwen3-8B CISPO (MiniMax et al., 2025) RL trajectory. Left: AIME 2026 maj@k for the three MEMENTO SFT models …
Figure 7
Figure 7: Serving throughput (Qwen3-8B, 1× B200 GPU). AIME24 × 8 repetitions (240 requests, 32K max tokens). Left: MEMENTO sustains 1.75× higher token throughput at full concurrency. Right: 1.58× faster batch completion. Vanilla plateaus as KV cache fills GPU memory.
Figure 8
Figure 8: Restart ablation. Step 1: Mi is generated with full attention to Ti (same as normal memento attention). Step 2: the KV cache is discarded and recomputed via prefill over prompt + M1..i only; Ti is masked, so Mi's KV states no longer encode block information. The 15 pp accuracy drop …
Figure 9
Figure 9: Probing the implicit KV channel. Both Qwen3-8B and Qwen3-32B recover the passcode well above 10% chance from masked memento positions (26.7% and 23.0%), while causal controls show exactly chance-level accuracy. The dotted line marks 10% chance (random guessing over 10 digits).
Figure 10
Figure 10: Iterative memento refinement on a reasoning block about NBA playoff probabilities. The initial memento (pass 1, …
Figure 11
Figure 11: Multi-stage SFT ablation on AIME 2024, AIME 2025, and GPQA-Diamond (Pass@1, n=8, Qwen2.5-7B). OT = OpenThoughts only; OM/Full = OPENMEMENTOS Full Attention; OM/Mem = OPENMEMENTOS Memento Attention; 2-Stg = OT → OM/Full; 3-Stg = OT → OM/Full → OM/Mem (Ours). Training directly on OPENMEMENTOS from the base model (OM variants) substantially underperforms vanilla reasoning SFT (OT). Our three-stage pipeline e…
Figure 12
Figure 12: Toy transformer KV probing (keep0). Left: probe accuracy by layer depth; deeper layers carry more signal, especially for the masked condition. Center: masked probe accuracy decays with distance from the target block but remains above chance even 7 hops away. Right: leakage stays constant (17–19%) across training checkpoints even as task accuracy climbs from 77% to 95%, confirming the channel is architectur…
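Figure 9's probing setup is simple enough to sketch. The stand-in below uses random vectors in place of real KV states, so the probe lands at the 10% chance line by construction; with actual KV states read out at memento positions, the paper reports 26.7% and 23.0%. The data shapes, the single-layer readout, and the use of scikit-learn are all Pith's assumptions, not the paper's setup.

```python
# Sketch of a linear probe on retained KV states (Pith's reconstruction).
# Label: a planted passcode digit (10 classes). Feature: the KV vector at a
# memento position after compaction. Random features here give ~10% accuracy.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_traces, d_head = 2000, 128
kv_at_memento = rng.normal(size=(n_traces, d_head))  # stand-in for real KV states
digit = rng.integers(0, 10, size=n_traces)           # planted passcode digit

X_tr, X_te, y_tr, y_te = train_test_split(kv_at_memento, digit, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"probe accuracy: {probe.score(X_te, y_te):.1%} (chance: 10%)")
```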
read the original abstract

Reasoning models think in long, unstructured streams with no mechanism for compressing or organizing their own intermediate state. We introduce MEMENTO: a method that teaches models to segment reasoning into blocks, compress each block into a memento, i.e., a dense state summary, and reason forward by attending only to mementos, reducing context, KV cache, and compute. To train MEMENTO models, we release OpenMementos, a public dataset of 228K reasoning traces derived from OpenThoughts-v3, segmented and annotated with intermediate summaries. We show that a two-stage SFT recipe on OpenMementos is effective across different model families (Qwen3, Phi-4, Olmo 3) and scales (8B–32B parameters). Trained models maintain strong accuracy on math, science, and coding benchmarks while achieving ∼2.5× peak KV cache reduction. We extend vLLM to support our inference method, achieving ∼1.75× throughput improvement while also enabling us to perform RL and further improve accuracy. Finally, we identify a dual information stream: information from each reasoning block is carried both by the memento text and by the corresponding KV states, which retain implicit information from the original block. Removing this channel drops accuracy by 15 pp on AIME24.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces MEMENTO, a method that teaches LLMs to segment long reasoning streams into blocks, compress each into a dense memento summary, and continue by attending only to mementos (plus retained KV states from original blocks), thereby reducing context length, KV cache, and compute. It releases the OpenMementos dataset of 228K segmented and annotated reasoning traces derived from OpenThoughts-v3. A two-stage SFT recipe is shown to work across model families (Qwen3, Phi-4, Olmo 3) and scales (8B-32B), with trained models maintaining strong accuracy on math/science/coding benchmarks while achieving ~2.5× peak KV cache reduction; vLLM is extended for inference support yielding ~1.75× throughput gains and enabling further RL. A dual information stream is identified, with ablation showing 15 pp AIME24 accuracy drop when the KV channel is removed.

Significance. If the efficiency and accuracy claims hold after clarification, the work would be a meaningful contribution to scalable reasoning in LLMs by enabling models to self-manage and compress intermediate state. Notable strengths include the public release of OpenMementos (228K traces), the vLLM extension that supports both inference and RL, and consistent evaluation across multiple model families and parameter scales.

major comments (2)
  1. [Abstract] The ~2.5× peak KV cache reduction claim is in tension with the dual information stream (memento text plus retained KV states from each original reasoning block). Standard transformer KV cache memory scales linearly with the number of tokens for which K/V tensors are stored; retaining full original-block KV tensors while also storing KV for mementos would produce a cache footprint at least as large as the baseline unless the retained states are compressed, sparsified, or excluded from the reported metric. No equation, section, or description details the cache accounting, position mapping, or vLLM modifications that reconcile retention with net reduction. This is load-bearing for the central efficiency claim.
  2. [Abstract and evaluation sections] The manuscript states that trained models maintain strong accuracy on math, science, and coding benchmarks and reports a 15 pp AIME24 drop when the KV channel is removed, but provides no numerical accuracy tables, baseline comparisons, standard deviations, or details on how segmentation and annotation were performed. This limits verification of the maintained-performance claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address the major comments point by point below, providing clarifications on the KV cache mechanism and committing to expanded evaluation details in the revision.

read point-by-point responses
  1. Referee: [Abstract] The ~2.5× peak KV cache reduction claim is in tension with the dual information stream (memento text plus retained KV states from each original reasoning block). Standard transformer KV cache memory scales linearly with the number of tokens for which K/V tensors are stored; retaining full original-block KV tensors while also storing KV for mementos would produce a cache footprint at least as large as the baseline unless the retained states are compressed, sparsified, or excluded from the reported metric. No equation, section, or description details the cache accounting, position mapping, or vLLM modifications that reconcile retention with net reduction. This is load-bearing for the central efficiency claim.

    Authors: We appreciate the referee highlighting this important point regarding the KV cache accounting. In MEMENTO, after processing each reasoning block and generating its memento, the KV cache entries corresponding to the original block tokens are evicted; only the KV tensors for the much shorter memento tokens are retained for subsequent attention. The dual information stream therefore consists of the explicit memento text plus the KV states generated while producing and attending to that memento, which carry implicit information from the block. This yields a peak KV cache size governed by the cumulative length of mementos rather than the full trace, producing the reported ~2.5× reduction. We will add a dedicated subsection (Methods 3.4) containing (i) the cache-size equations (baseline O(total tokens) vs. MEMENTO O(∑ memento lengths)), (ii) position-ID remapping rules, and (iii) the precise vLLM modifications for selective eviction and retention. These additions will eliminate any ambiguity. revision: yes

  2. Referee: [Abstract and evaluation sections] The manuscript states that trained models maintain strong accuracy on math, science, and coding benchmarks and reports a 15 pp AIME24 drop when the KV channel is removed, but provides no numerical accuracy tables, baseline comparisons, standard deviations, or details on how segmentation and annotation were performed. This limits verification of the maintained-performance claim.

    Authors: We agree that explicit numerical tables and procedural details are necessary for verification. The full manuscript already contains Table 1 (benchmark accuracies for Qwen3/Phi-4/Olmo-3 at 8B–32B scales vs. standard SFT baselines), Table 2 (KV-channel ablation with the 15 pp AIME24 drop and standard deviations over three seeds), and Section 3.2 (segmentation/annotation pipeline using GPT-4o on OpenThoughts-v3 traces). To strengthen the submission we will (i) expand the evaluation section with additional per-benchmark tables, (ii) include direct comparisons against long-context baselines without MEMENTO, and (iii) provide the exact segmentation prompts and annotation guidelines. These revisions will make all performance claims fully verifiable. revision: yes
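The cache-size equations promised in response 1 above are easy to reconstruct in outline. The following uses Pith's notation under the eviction scheme the authors describe; it is a sketch, not the paper's derivation.

```latex
% Cache accounting sketch (Pith's notation, not the paper's equations).
% P = prompt tokens, B_i = tokens of reasoning block i, M_i = tokens of memento i.
\[
  C_{\text{vanilla}}(n) = P + \sum_{i=1}^{n} (B_i + M_i)
  \qquad \text{(the full trace stays resident)}
\]
\[
  C_{\text{memento}}(n) = P + \sum_{i=1}^{n-1} M_i + B_n + M_n
  \qquad \text{(sawtooth peak just before the $n$-th compaction)}
\]
% With M_i << B_i (median compression ~0.16 on math, Figure 2), the ratio of
% peaks max_n C_vanilla / max_n C_memento can plausibly reach the reported ~2.5x.
```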

Circularity Check

0 steps flagged

No circularity: empirical training on new dataset with external benchmark evaluation

full rationale

The paper introduces OpenMementos as a new dataset of 228K traces, applies a two-stage SFT recipe, and reports measured accuracy on standard math/science/coding benchmarks plus observed KV cache reduction. No equations, parameters, or claims are shown to reduce to self-definition, fitted inputs renamed as predictions, or self-citation chains. The dual information stream is presented as an empirical ablation result rather than a definitional loop. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that reasoning traces admit useful segmentation into compressible blocks whose essential state can be captured by short text summaries plus KV states.

axioms (1)
  • domain assumption: Long reasoning traces can be segmented into blocks that each possess a meaningful, summarizable intermediate state.
    Invoked when constructing OpenMementos and when training models to generate mementos.
invented entities (1)
  • Memento · no independent evidence
    purpose: Dense textual summary of a reasoning block that replaces the full block for subsequent attention.
    New concept introduced by the paper to organize and compress intermediate state.

pith-pipeline@v0.9.0 · 5576 in / 1309 out tokens · 41635 ms · 2026-05-10T17:12:17.577343+00:00 · methodology


Reference graph

Works this paper leans on

44 extracted references · 33 canonical work pages · 15 internal anchors

  1. [1]

    The Markovian Thinker: Architecture-agnostic linear scaling of reasoning

    Milad Aghajohari, Kamran Chitsaz, Amirhossein Kazemnejad, Sarath Chandar, Alessandro Sordoni, Aaron Courville, and Siva Reddy. The Markovian Thinker : Architecture-agnostic linear scaling of reasoning. arXiv preprint arXiv:2510.06557, 2025

  2. [2]

    MathArena : Evaluating LLMs on uncontaminated math competitions

    Mislav Balunović, Jasper Dekoninck, Ivo Petrov, Nikola Jovanović, and Martin Vechev. MathArena: Evaluating LLMs on uncontaminated math competitions. Advances in Neural Information Processing Systems, Datasets and Benchmarks Track, 2025

  3. [3]

    Longformer: The Long-Document Transformer

    Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020

  4. [4]

    R-KV: Redundancy-aware KV cache compression for reasoning models

    Zefan Cai, Wen Xiao, Hanshi Sun, Cheng Luo, Yikai Zhang, Ke Wan, Yucheng Li, Yeyang Zhou, Li-Wen Chang, Jiuxiang Gu, Zhen Dong, Anima Anandkumar, Abedelkadir Asi, and Junjie Hu. R-KV : Redundancy-aware KV cache compression for reasoning models. arXiv preprint arXiv:2505.24133, 2025

  5. [5]

    DeepSeek-V3 Technical Report

    DeepSeek-AI . DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024

  6. [6]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-AI , Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, et al. DeepSeek-R1 : Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  7. [7]

    OpenThoughts: Data Recipes for Reasoning Models

    Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, Ashima Suvarna, Benjamin Feuer, Liangyu Chen, Zaid Khan, Eric Frankel, Sachin Grover, Caroline Choi, Niklas Muennighoff, Shiye Su, Wanjia Zhao, John Yang, Shreyas Pimpalgaonkar, Kartik Sharma, Charlie Cheng-Jie Ji, ...

  8. [8]

    Training Large Language Models to Reason in a Continuous Latent Space

    Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. Training large language models to reason in a continuous latent space. arXiv preprint arXiv:2412.06769, 2024

  9. [9]

    ThinkPrune: Pruning long chain-of-thought of LLMs via reinforcement learning

    Bairu Hou, Yang Zhang, Jiabao Ji, Yujian Liu, Kaizhi Qian, Jacob Andreas, and Shiyu Chang. ThinkPrune : Pruning long chain-of-thought of LLMs via reinforcement learning. arXiv preprint arXiv:2504.01296, 2025

  10. [10]

    C3oT : Generating shorter chain-of-thought without compromising effectiveness

    Yu Kang, Xianghui Sun, Liangyu Chen, and Wei Zou. C3oT : Generating shorter chain-of-thought without compromising effectiveness. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 24312--24320, 2025

  11. [11]

    Efficient Memory Management for Large Language Model Serving with PagedAttention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention . In Proceedings of the 29th Symposium on Operating Systems Principles (SOSP), 2023

  12. [12]

    Making Slow Thinking Faster: Compressing LLM Chain-of-Thought via Step Entropy

    Zeju Li, Jianyuan Zhong, Ziyang Zheng, Xiangyu Wen, Zhijian Xu, Yingying Cheng, Fan Zhang, and Qiang Xu. Making slow thinking faster: Compressing LLM chain-of-thought via step entropy. arXiv preprint arXiv:2508.03346, 2025

  13. [13]

    MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

    MiniMax , Aonian Li, Bangwei Gong, Bo Yang, Boji Shan, Chang Liu, Cheng Zhu, Chunhao Zhang, Congchao Guo, Da Chen, et al. MiniMax-M1 : Scaling test-time compute efficiently with lightning attention. arXiv preprint arXiv:2506.13585, 2025

  14. [14]

    Breadcrumbs Reasoning: Memory-Efficient Reasoning with Compression Beacons

    Giovanni Monea, Yair Feldman, Shankar Padmanabhan, Kianté Brantley, and Yoav Artzi. Breadcrumbs reasoning: Memory-efficient reasoning with compression beacons. arXiv preprint arXiv:2510.13797, 2025

  15. [15]

    Learning to reason with LLMs

    OpenAI . Learning to reason with LLMs . https://openai.com/index/learning-to-reason-with-llms/, 2024. Blog post, September 2024

  16. [16]

    QwQ : Reflect deeply on the boundaries of the unknown

    Qwen Team . QwQ : Reflect deeply on the boundaries of the unknown. https://qwenlm.github.io/blog/qwq-32b-preview/, 2024. Blog post, November 2024

  17. [17]

    ThinKV: Thought-Adaptive KV Cache Compression for Efficient Reasoning Models

    Akshat Ramachandran, Marina Neseem, Charbel Sakr, Rangharajan Venkatesan, Brucek Khailany, and Tushar Krishna. ThinKV: Thought-adaptive KV cache compression for efficient reasoning models. arXiv preprint arXiv:2510.01290, 2025

  18. [18]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  19. [19]

    DeepSeekMath: Pushing the limits of mathematical reasoning in open language models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y.K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  20. [20]

    Codi: Compressing chain-of-thought into continuous space via self-distillation

    Zhenyi Shen, Hanqi Yan, Linhai Zhang, Zhanghao Hu, Yali Du, and Yulan He. Codi: Compressing chain-of-thought into continuous space via self-distillation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 677--693, 2025

  21. [21]

    Sample more to think less: Group filtered policy optimization for concise reasoning

    Vaishnavi Shrivastava, Ahmed Awadallah, Vidhisha Balachandran, Shivam Garg, Harkirat Behl, and Dimitris Papailiopoulos. Sample more to think less: Group filtered policy optimization for concise reasoning. arXiv preprint arXiv:2508.09726, 2025

  22. [22]

    Reasoning Path Compression: Compressing generation trajectories for efficient LLM reasoning

    Jiwon Song, Dongwon Jo, Yulhwa Kim, and Jae-Joon Kim. Reasoning path compression: Compressing generation trajectories for efficient LLM reasoning. arXiv preprint arXiv:2505.13866, 2025

  23. [23]

    Think silently, think fast: Dynamic latent compression of llm reasoning chains

    Wenhui Tan, Jiaze Li, Jianzhong Ju, Zhenbo Luo, Ruihua Song, and Jian Luan. Think silently, think fast: Dynamic latent compression of LLM reasoning chains. arXiv preprint arXiv:2505.16552, 2025. URL https://arxiv.org/abs/2505.16552

  24. [24]

    Olmo 3

    Team Olmo, Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heineman, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, et al. Olmo 3. arXiv preprint arXiv:2512.13961, 2025

  25. [25]

    TRL: Transformer Reinforcement Learning

    Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, and Shengyi Huang. Trl: Transformer reinforcement learning. https://github.com/huggingface/trl, 2020

  26. [26]

    Memento 2: Learning by stateful reflective memory

    Jun Wang. Memento 2: Learning by stateful reflective memory, 2026. URL https://arxiv.org/abs/2512.22716

  27. [27]

    R1-Compress: Long chain-of-thought compression via chunk compression and search

    Yibo Wang, Haotian Luo, Huanjin Yao, Tiansheng Huang, Haiying He, Rui Liu, Naiqiang Tan, Jiaxing Huang, Xiaochun Cao, Dacheng Tao, and Li Shen. R1-compress: Long chain-of-thought compression via chunk compression and search, 2025. URL https://arxiv.org/abs/2505.16838

  28. [28]

    ReSum: Unlocking long-horizon search intelligence via context summarization

    Xixi Wu, Kuan Li, Yida Zhao, Liwen Zhang, Litu Ou, Huifeng Yin, Zhongwang Zhang, Xinmiao Yu, Dingchu Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Minhao Cheng, Shuai Wang, Hong Cheng, and Jingren Zhou. ReSum : Unlocking long-horizon search intelligence via context summarization. arXiv preprint arXiv:2509.13313, 2025

  29. [29]

    TokenSkip : Controllable chain-of-thought compression in LLMs

    Heming Xia, Chak Tou Leong, Wenjie Wang, Yongqi Li, and Wenjie Li. TokenSkip : Controllable chain-of-thought compression in LLMs . In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 3351--3363, 2025

  30. [30]

    A-MEM: Agentic Memory for LLM Agents

    Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-MEM : Agentic memory for LLM agents. arXiv preprint arXiv:2502.12110, 2025

  31. [31]

    InftyThink: Breaking the length limits of long-context reasoning in large language models

    Yuchen Yan, Yongliang Shen, Yang Liu, Jin Jiang, Mengdi Zhang, Jian Shao, and Yueting Zhuang. InftyThink : Breaking the length limits of long-context reasoning in large language models. arXiv preprint arXiv:2503.06692, 2025

  32. [32]

    InftyThink+ : Effective and efficient infinite-horizon reasoning via reinforcement learning

    Yuchen Yan, Liang Jiang, Jin Jiang, Shuaicheng Li, Zujie Wen, Zhiqiang Zhang, Jun Zhou, Jian Shao, Yueting Zhuang, and Yongliang Shen. InftyThink+ : Effective and efficient infinite-horizon reasoning via reinforcement learning. arXiv preprint arXiv:2602.06960, 2026

  33. [33]

    PENCIL: Long thoughts with short memory

    Chenxiao Yang, Nathan Srebro, David McAllester, and Zhiyuan Li. PENCIL : Long thoughts with short memory. arXiv preprint arXiv:2503.14337, 2025

  34. [34]

    Accordion-Thinking: Self-Regulated Step Summaries for Efficient and Readable LLM Reasoning

    Zhicheng Yang, Zhijiang Guo, Yinya Huang, Yongxin Wang, Wenlei Shi, Yiwei Wang, Xiaodan Liang, and Jing Tang. Accordion-Thinking : Self-regulated step summaries for efficient and readable LLM reasoning. arXiv preprint arXiv:2602.03249, 2026

  35. [35]

    MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent

    Hongli Yu, Tinghong Chen, Jiangtao Feng, Jiangjie Chen, Weinan Dai, Qiying Yu, Ya-Qin Zhang, Wei-Ying Ma, Jingjing Liu, Mingxuan Wang, and Hao Zhou. MemAgent : Reshaping long-context LLM with multi-conv RL -based memory agent. arXiv preprint arXiv:2507.02259, 2025

  36. [36]

    LazyEviction : Lagged KV eviction with attention pattern observation for efficient long reasoning

    Haoyue Zhang, Hualei Zhang, Xiaosong Ma, Jie Zhang, and Song Guo. LazyEviction : Lagged KV eviction with attention pattern observation for efficient long reasoning. arXiv preprint arXiv:2506.15969, 2025 a

  37. [37]

    LightThinker : Thinking step-by-step compression

    Jintian Zhang, Yuqi Zhu, Mengshu Sun, Yujie Luo, Shuofei Qiao, Lun Du, Da Zheng, Huajun Chen, and Ningyu Zhang. LightThinker : Thinking step-by-step compression. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 13318--13339, 2025 b

  38. [38]

    TokenSqueeze: Performance-preserving compression for reasoning LLMs

    Yuxiang Zhang, Zhengxu Yu, Weihang Pan, Zhongming Jin, Qiang Fu, Deng Cai, Binbin Lin, and Jieping Ye. TokenSqueeze: Performance-preserving compression for reasoning LLMs. arXiv preprint arXiv:2511.13223, 2025 c

  39. [39]

    SGLang: Efficient Execution of Structured Language Model Programs

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. SGLang : Efficient execution of structured language model programs. arXiv preprint arXiv:2312.07104, 2024

  40. [40]

    Memento: Fine-tuning LLM agents without fine-tuning LLMs

    Huichi Zhou, Yihang Chen, Siyuan Guo, Xue Yan, Kin Hei Lee, Zihan Wang, Ka Yiu Lee, Guchun Zhang, Kun Shao, Linyi Yang, and Jun Wang. Memento: Fine-tuning LLM agents without fine-tuning LLMs , 2025 a . URL https://arxiv.org/abs/2508.16153

  41. [41]

    MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents

    Zijian Zhou, Ao Qu, Zhaoxuan Wu, Sunghwan Kim, Alok Prakash, Daniela Rus, Jinhua Zhao, Bryan Kian Hsiang Low, and Paul Pu Liang. MEM1 : Learning to synergize memory and reasoning for efficient long-horizon agents. arXiv preprint arXiv:2506.15841, 2025 b
