Beyond Compaction: Structured Context Eviction for Long-Horizon Agents

Andrew Semenov; Svyatoslav Dorofeev

arxiv: 2606.11213 · v1 · pith:UDZE2D72new · submitted 2026-05-01 · 💻 cs.CL

Pith reviewed 2026-07-01 07:37 UTC · model grok-4.3

classification 💻 cs.CL

keywords context management · long-horizon agents · LLM agents · context eviction · episode annotation · token budget · agentic benchmarks · structured eviction

The pith

Structured episode annotations and deterministic eviction let LLM agents run 89 sequential tasks across 80 million tokens with no measurable loss in task accuracy relative to isolated per-task sessions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Context Window Lifecycle (CWL), a context-management scheme that keeps LLM agent sessions within fixed token budgets by having the agent annotate its own trajectory as typed, dependency-linked episodes and then applying a deterministic, LLM-free eviction policy on that graph. The method preserves user turns and active reasoning context while shedding action episodes whose effects are already persisted in the environment. It claims to avoid the four limitations the abstract attributes to summarization-based compaction (unpredictable lossiness, destruction of causal structure, blocking model cost, and compression-induced hallucination) as well as the relevance-blind cuts of simple recency truncation. The central demonstration is a single continuous session completing 89 tasks over 80 million tokens with task accuracy indistinguishable from running each task in isolation.

Core claim

By requiring the agent to annotate its trajectory as typed, dependency-linked episodes and then using a deterministic policy to evict content in priority order within that structure, CWL maintains context near a stable ceiling that preserves performance equivalent to per-task isolated sessions even after 80 million tokens of cumulative history.

What carries the argument

The episode graph formed by the agent's typed, dependency-linked annotations, together with the deterministic, LLM-free eviction policy that removes the oldest recoverable action episodes first.
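To make the mechanism concrete, here is a minimal Python sketch of an episode graph with a deterministic eviction pass. It is an illustration under stated assumptions, not the paper's implementation: the names `Episode`, `EpisodeType`, `persisted`, and `evict` are hypothetical, and the protection rule (an episode stays while any retained episode depends on it) is one plausible reading of "dependency-linked".

```python
from dataclasses import dataclass, field
from enum import Enum


class EpisodeType(Enum):
    USER_TURN = "user_turn"      # never evicted in this sketch
    EXPLORATION = "exploration"  # context the agent is still reasoning over
    ACTION = "action"            # effects may already live in the environment


@dataclass
class Episode:
    eid: int                     # doubles as age: lower eid means older
    kind: EpisodeType
    tokens: int
    depends_on: set[int] = field(default_factory=set)
    persisted: bool = False      # assumption: effect is recoverable from the environment


def evict(episodes: list[Episode], budget: int) -> list[Episode]:
    """Remove the oldest recoverable action episode, one at a time, until within budget."""
    kept = {e.eid: e for e in episodes}
    total = sum(e.tokens for e in episodes)
    while total > budget:
        candidates = [
            e for e in kept.values()
            if e.kind is EpisodeType.ACTION
            and e.persisted
            # Assumed rule: keep anything a retained episode still depends on.
            and not any(e.eid in other.depends_on for other in kept.values())
        ]
        if not candidates:
            break  # nothing recoverable is left; this rule alone cannot meet the budget
        victim = min(candidates, key=lambda e: e.eid)  # oldest first, no randomness
        del kept[victim.eid]
        total -= victim.tokens
    return [e for e in episodes if e.eid in kept]


# Illustrative data only.
episodes = [
    Episode(1, EpisodeType.USER_TURN, 50),
    Episode(2, EpisodeType.ACTION, 400, {1}, persisted=True),
    Episode(3, EpisodeType.ACTION, 300, {1}, persisted=True),
    Episode(4, EpisodeType.EXPLORATION, 200, {3}),
    Episode(5, EpisodeType.ACTION, 250, {4}, persisted=False),
]
kept = evict(episodes, budget=900)
```

With a budget of 900 tokens, only episode 2 is removed: episode 3 is still referenced by the exploration episode, and episode 5 has not been persisted. The policy makes no model calls, so the same inputs always produce the same result.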

If this is right

Context size remains bounded near a stable ceiling regardless of total session length.
Eviction decisions are fully deterministic and do not require additional LLM calls for compression.
Causal structure among episodes is preserved because only recoverable, environment-persisted content is removed.
The approach scales to sessions whose total history greatly exceeds typical context windows while avoiding very-large-prompt degradation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same annotation-plus-eviction loop could be applied to non-LLM sequential planners if they can emit comparable typed dependency records.
Persistent environment state becomes even more valuable because CWL explicitly relies on it to justify eviction.
The method implies that long-horizon reliability may depend more on accurate self-annotation than on raw model scale.

Load-bearing premise

The agent can produce accurate typed, dependency-linked episode annotations during operation such that the deterministic eviction policy removes only recoverable content without breaking task performance.

What would settle it

Measure task accuracy on the 89-task long-horizon benchmark using CWL versus per-task isolated sessions; a statistically significant drop would falsify the no-degradation claim.
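Because each task yields a pass/fail outcome under both conditions, one natural way to check for a significant drop is an exact McNemar test on the paired results. The sketch below is an assumption about how such a check could be run; the paper's abstract does not say which test, if any, was used, and the outcome vectors are invented for illustration.

```python
from math import comb


def mcnemar_exact(cwl: list[bool], isolated: list[bool]) -> float:
    """Two-sided exact McNemar p-value for paired binary task outcomes."""
    if len(cwl) != len(isolated):
        raise ValueError("outcome lists must be paired task by task")
    # Only discordant pairs carry information about a difference.
    cwl_only = sum(1 for a, b in zip(cwl, isolated) if a and not b)
    iso_only = sum(1 for a, b in zip(cwl, isolated) if b and not a)
    n = cwl_only + iso_only
    if n == 0:
        return 1.0  # outcomes agree on every task
    k = min(cwl_only, iso_only)
    # Binomial(n, 0.5) tail probability, doubled for a two-sided test.
    tail = sum(comb(n, j) for j in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Invented outcomes for 89 tasks; these are NOT the paper's results.
isolated_runs = [True] * 60 + [False] * 29
same_as_isolated = list(isolated_runs)
six_regressions = [True] * 54 + [False] * 35  # six tasks flip from pass to fail

p_same = mcnemar_exact(same_as_isolated, isolated_runs)
p_drop = mcnemar_exact(six_regressions, isolated_runs)
```

A large p-value only means no difference was detected; showing equivalence would need a separate test with a pre-stated margin.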

Figures

Figures reproduced from arXiv: 2606.11213 by Andrew Semenov, Svyatoslav Dorofeev.

**Figure 1.** Components of CWL. The agent annotates its trajectory as it works; the episode graph accumulates these annotations; the eviction policy consults token accounting and evicts content in priority order when the budget is exceeded.

Principle 6 (Graduated, not catastrophic). Compression proceeds in the smallest increments that will meet the budget. The first response to overflow is to strip the single most reco…

**Figure 2.** Mean accuracy (%) for CWL (single session, 80k token budget) and the individual-sessions baseline across four benchmarks. Terminal Bench 2.0: n = 5; SWE Bench Lite, Recovery Bench, LongCLI Bench: n = 3.

… the agent’s ability to recover from failure states mid-task; and LongCLI Bench [3], which covers long-horizon command-line interaction tasks. Protocol. The standard evaluation protocol for all four benchmar…

Original abstract

We present Context Window Lifecycle (CWL), a context-management scheme that gives long-horizon LLM agents an effectively unbounded working horizon. As a session accumulates history, CWL keeps the context within budget through graduated, semantically aware eviction: the agent annotates its trajectory as typed, dependency-linked episodes as work proceeds, and a deterministic, LLM-free policy evicts content in priority order within that structure when a token budget is exceeded. CWL preserves user turns and the exploratory context the agent is actively reasoning over, while aggressively shedding action episodes whose effects are already persisted in the environment, keeping active context near a stable ceiling that also avoids the performance degradation associated with very large prompts. Compared to summarization-based compaction, CWL avoids four well-known limitations: unpredictable lossiness, destruction of causal structure, blocking model cost, and compression-induced hallucination. Compared to recency truncation, CWL is semantically aware: it drops the oldest-and-most-recoverable content according to the dependency graph rather than oldest-in-time regardless of relevance. We describe the annotation protocol, the episode graph, the eviction policy, and the token-accounting loop, and evaluate CWL on long-horizon agentic benchmarks: a single agent session completing 89 sequential tasks across 80 million tokens with no measurable degradation in task accuracy relative to per-task isolated sessions.
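The contrast the abstract draws with recency truncation can be illustrated with a small Python comparison. This is a deliberately simplified sketch: the `Item` record, its `kind` labels, and both functions are hypothetical, and dependency edges are left out so that the only difference on display is what each policy chooses to drop.

```python
from typing import NamedTuple


class Item(NamedTuple):
    name: str
    kind: str          # "user", "explore", or "action" (hypothetical labels)
    tokens: int
    recoverable: bool  # assumption: effect is already persisted in the environment


def recency_truncate(history: list[Item], budget: int) -> list[Item]:
    """Keep the newest items that fit; drop older ones regardless of what they are."""
    kept, used = [], 0
    for item in reversed(history):
        if used + item.tokens > budget:
            break
        kept.append(item)
        used += item.tokens
    return list(reversed(kept))


def structured_evict(history: list[Item], budget: int) -> list[Item]:
    """Drop the oldest recoverable action items first; never touch user turns."""
    kept = list(history)
    for item in history:  # oldest first
        if sum(i.tokens for i in kept) <= budget:
            break
        if item.kind == "action" and item.recoverable:
            kept.remove(item)
    return kept


# Illustrative history, oldest first.
history = [
    Item("user: task spec", "user", 100, False),
    Item("write config file", "action", 500, True),
    Item("run test suite", "action", 400, True),
    Item("notes on failing test", "explore", 300, False),
]

truncated = recency_truncate(history, budget=800)
structured = structured_evict(history, budget=800)
```

On this toy history, truncation loses the user's task specification because it is the oldest entry, while the structured policy keeps it and sheds the already-persisted config write instead.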

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CWL gives agents a dependency-graph eviction policy that looks cleaner than summarization or truncation, but the 89-task result hinges on unverified annotation quality.

The paper's core move is to have the agent itself label its trajectory into typed episodes with explicit dependency links, then apply a fixed, non-LLM policy that drops only the episodes whose effects are already persisted elsewhere. That structure lets it keep user turns and active reasoning while shedding the rest, which is a clear step past both lossy compaction and blind recency cuts.

By construction it avoids the four problems listed in the abstract, and the single long run (89 tasks, 80 million tokens, no measurable accuracy drop versus isolated baselines) is the kind of end-to-end number that matters for agent work. The protocol and token-accounting loop are described at a usable level of detail.

The soft spot is exactly the one the stress test flags: everything rests on the agent producing accurate typed annotations and correct dependency edges in real time. If those are noisy, the deterministic policy will evict the wrong content and the task-accuracy metric will not catch it. The abstract gives no separate measurement of annotation fidelity, no ablation on annotation errors, and no statistical detail on the baseline comparison. That makes the headline result suggestive rather than conclusive.

This is for groups already running long-horizon agents and looking for practical context-management alternatives. It is worth sending to referees because the idea is concrete, the limitations of prior approaches are stated plainly, and the evaluation, while thin, is on the right kind of benchmark. A serious review would mainly press for annotation-quality checks and fuller experimental controls.

Referee Report

2 major / 0 minor

Summary. The paper presents Context Window Lifecycle (CWL), a context-management scheme for long-horizon LLM agents. The agent annotates its trajectory on-the-fly as typed, dependency-linked episodes; a deterministic, LLM-free policy then performs graduated eviction when the token budget is exceeded, preserving user turns and active reasoning context while shedding recoverable action episodes. The central claim is that this yields an effectively unbounded working horizon, demonstrated by a single agent session completing 89 sequential tasks across 80 million tokens with no measurable degradation in task accuracy relative to per-task isolated sessions.

Significance. If the headline performance result holds under rigorous controls, CWL would represent a meaningful advance over both summarization-based compaction (by avoiding lossiness, causal-structure destruction, and hallucination) and recency truncation (by using semantic dependency structure). The deterministic, parameter-free eviction step is a clear strength that could enable more reliable long-horizon agent deployments.

major comments (2)

[Abstract] Evaluation paragraph: the claim that a single session completes 89 sequential tasks across 80 million tokens 'with no measurable degradation in task accuracy relative to per-task isolated sessions' supplies no details on the exact accuracy metric, baseline implementations, statistical tests for equivalence, or experimental controls. This information is load-bearing for the central empirical claim.
[Annotation protocol and eviction policy] The deterministic eviction policy is stated to remove only recoverable content, yet the manuscript provides no independent verification (e.g., human audit or held-out annotation accuracy metric) of the fidelity of the agent-generated episode types and dependency edges. Systematic annotation errors would cause non-recoverable state to be evicted while the reported task-accuracy metric would not isolate this failure mode.
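The kind of independent check the second comment asks for could be as simple as scoring the agent's dependency edges against a human-audited sample. The Python sketch below assumes edges are stored as `(dependent_episode, prerequisite_episode)` pairs; that representation, the function name, and the sample data are all hypothetical.

```python
Edge = tuple[int, int]  # (dependent_episode, prerequisite_episode), an assumed encoding


def edge_scores(agent_edges: set[Edge], audited_edges: set[Edge]) -> tuple[float, float]:
    """Return (precision, recall) of agent-annotated edges against an audited set."""
    true_pos = len(agent_edges & audited_edges)
    precision = true_pos / len(agent_edges) if agent_edges else 1.0
    recall = true_pos / len(audited_edges) if audited_edges else 1.0
    return precision, recall


# Invented sample: the auditor finds one dependency the agent never recorded.
agent = {(2, 1), (3, 1), (4, 3)}
audit = {(2, 1), (3, 1), (4, 3), (5, 2)}

precision, recall = edge_scores(agent, audit)
missed = audit - agent  # omitted edges make a prerequisite look safe to evict
```

Recall is the more safety-relevant number here: a spurious edge only keeps extra content in context, whereas a missing edge can let the policy evict something a later episode still needs.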

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. The two major comments highlight important aspects of the empirical claims and the validation of the annotation process. We address each point below, indicating where revisions will be made to strengthen the manuscript.

Referee: [Abstract] Evaluation paragraph: the claim that a single session completes 89 sequential tasks across 80 million tokens 'with no measurable degradation in task accuracy relative to per-task isolated sessions' supplies no details on the exact accuracy metric, baseline implementations, statistical tests for equivalence, or experimental controls. This information is load-bearing for the central empirical claim.

Authors: We agree that the abstract would be strengthened by including a concise description of the evaluation setup. The full details appear in Section 4, where task accuracy is defined as the binary success rate on each sequential task (identical to the isolated-session baseline), the baseline consists of independent per-task sessions using the same agent and prompt template, and controls include fixed model, temperature, and environment state. No statistical equivalence tests were run because success rates matched exactly across the 89 tasks. We will revise the abstract to briefly note the metric and baseline.

revision: yes

Referee: [Annotation protocol and eviction policy] The deterministic eviction policy is stated to remove only recoverable content, yet the manuscript provides no independent verification (e.g., human audit or held-out annotation accuracy metric) of the fidelity of the agent-generated episode types and dependency edges. Systematic annotation errors would cause non-recoverable state to be evicted while the reported task-accuracy metric would not isolate this failure mode.

Authors: The manuscript does not provide an independent human audit or held-out accuracy metric for the agent-generated annotations. We maintain that the primary evaluation (exact parity in task success rate versus isolated sessions) serves as an indirect but rigorous check, because any systematic eviction of non-recoverable state would produce observable degradation. Nevertheless, we acknowledge the referee’s point as a methodological gap and will add a short paragraph in the Discussion section explicitly noting the reliance on end-to-end task metrics rather than direct annotation verification.

revision: partial
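How much an exact match over 89 tasks can rule out is easy to work through. As a hedged illustration, suppose that none of the 89 paired tasks changed outcome between conditions (stronger than matching aggregate rates, and an assumption here, not something the rebuttal states). The one-sided 95% upper bound on the true per-task flip rate is then 1 - 0.05^(1/n), which the short Python sketch below evaluates; the function name is hypothetical.

```python
def zero_event_upper_bound(n: int, alpha: float = 0.05) -> float:
    """One-sided exact upper bound on an event rate when 0 events occur in n trials."""
    if n <= 0:
        raise ValueError("n must be positive")
    # Solve (1 - p) ** n = alpha for p: the largest rate still consistent with zero events.
    return 1.0 - alpha ** (1.0 / n)


bound_89 = zero_event_upper_bound(89)
bound_10 = zero_event_upper_bound(10)
```

For n = 89 the bound is about 3.3%, so a single run of this size cannot exclude eviction-induced failures on roughly one task in thirty. This is why parity in end-to-end accuracy is an indirect check and not a direct measurement of annotation fidelity.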

Circularity Check

0 steps flagged

No circularity; method presented as new policy without equations or self-referential reductions

The paper describes CWL as a context-management scheme relying on agent-generated typed episode annotations and a deterministic LLM-free eviction policy. No equations, fitted parameters, self-citations, or derivation steps are present that would reduce the claimed performance (89 tasks, 80M tokens, no accuracy drop) to inputs by construction. The central result is an empirical evaluation of the policy rather than a mathematical derivation; the annotation accuracy assumption is a correctness precondition, not a circularity pattern. The method is self-contained against external benchmarks as a proposed system.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are detailed beyond the high-level description of the annotation protocol and eviction policy.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Governance Decay: How Context Compaction Silently Erases Safety Constraints in Long-Horizon LLM Agents
cs.AI 2026-06 unverdicted novelty 8.0

Context compaction erases in-context governance constraints in LLM agents, raising policy violation rates from 0% to 30% (up to 59% for some models) on the ConstraintRot benchmark.

Governance Decay: How Context Compaction Silently Erases Safety Constraints in Long-Horizon LLM Agents
cs.AI 2026-06 unverdicted novelty 6.0

Context compaction silently drops governance constraints in LLM agents, raising policy violation rates from 0% to 30% on average, with a proposed pinning mitigation restoring compliance.

Reference graph

Works this paper leans on

15 extracted references · cited by 1 Pith paper

[1] Anonymous. Recovery-bench: Evaluating agentic recovery from mistakes, 2025. URL https://openreview.net/pdf/3b7f176c50002e59438321f581063295986b269e.pdf. Submitted to the 39th Conference on Neural Information Processing Systems (NeurIPS 2025).

[2] Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready AI agents with scalable long-term memory. In Proceedings of the 27th European Conference on Artificial Intelligence (ECAI), 2025.

[3] Yukang Feng et al. LongCLI-Bench: A preliminary benchmark and study for long-horizon agentic programming in command-line interfaces, 2026.

[4] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R. Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In The Twelfth International Conference on Learning Representations (ICLR), 2024. URL https://openreview.net/forum?id=VTF8yNQM66.

[5] Minki Kang, Wei-Ning Chen, Dongge Han, Huseyin A. Inan, Lukas Wutschitz, Yanzhi Chen, Robert Sim, and Saravan Rajmohan. ACON: Optimizing context compression for long-horizon LLM agents, 2025.

[6] Kiz8 Team. pi-cwl: Open-source agent harness with context window lifecycle. https://github.com/Kiz8-Team/pi-cwl, 2026.

[7] Tobias Lindenbauer, Igor Slinko, Ludwig Felder, Egor Bogomolov, and Yaroslav Zharov. The complexity trap: Simple observation masking is as efficient as LLM summarization for agent context management. In NeurIPS 2025 Workshop on Deep Learning for Code (DL4C), 2025.

[8] Mike A. Merrill, Alexander G. Shaw, Nicholas Carlini, et al. Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces, 2026. URL https://www.tbench.ai/.

[9] Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. MemGPT: Towards LLMs as operating systems. In Conference on Language Modeling (COLM), 2024.

[10] Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST), 2023.

[11] Preston Rasmussen, Pavlo Paliychuk, Travis Beauvais, Jack Ryan, and Daniel Chalef. Zep: A temporal knowledge graph architecture for agent memory, 2025.

[12] Weiwei Sun, Miao Lu, Zhan Ling, Kang Liu, Xuesong Yao, Yiming Yang, and Jiecao Chen. Scaling long-horizon LLM agent via context-folding, 2025.

[13] Xixi Wu, Kuan Li, Yida Zhao, et al. ReSum: Unlocking long-horizon search intelligence via context summarization, 2025.

[14] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. In International Conference on Learning Representations (ICLR), 2024.

[15] Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-MEM: Agentic memory for LLM agents, 2025.
