
arxiv: 2604.12376 · v2 · pith:LDXXKYC4new · submitted 2026-04-14 · 💻 cs.CL · cs.AI

Cooperative Memory Paging with Keyword Bookmarks for Long-Horizon LLM Conversations

Pith reviewed 2026-05-10 15:35 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords: LLM memory management, keyword bookmarks, cooperative paging, long-horizon conversations, context eviction, recall tool, LoCoMo benchmark, multi-session dialogues

The pith

Cooperative paging with keyword bookmarks lets LLMs recover evicted conversation turns on demand and achieve higher answer quality than full context or retrieval methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes cooperative memory paging for conversations that exceed an LLM's context window. Evicted segments are replaced by short keyword bookmarks, and the model receives a recall tool to fetch the original text when needed. On the LoCoMo benchmark of ten real multi-session conversations with over 300 turns, this approach produces the best answers among six strategies, including truncation, BM25, word-overlap retrieval, a search-tool baseline, and keeping the full context. The result holds across four models and is confirmed by four independent LLM judges. Design ablations show that fixed-size pages work better than topic-based ones and that bookmark distinctiveness drives most of the remaining performance gap.

Core claim

By inserting compact keyword bookmarks in place of evicted conversation segments and giving the model a recall tool, cooperative paging yields higher answer quality on long-horizon dialogues than truncation, external retrieval, or retaining the entire history. On the LoCoMo benchmark the method ranks first among the six tested approaches across GPT-4o-mini, DeepSeek-v3.2, Claude Haiku, and GLM-5, with statistical support from paired bootstrap tests.

What carries the argument

The cooperative paging mechanism that substitutes evicted text with minimal keyword bookmarks and equips the model with a recall tool for selective retrieval.
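
A minimal sketch of this mechanism may help make it concrete. It assumes fixed-size pages, FIFO eviction, and a toy "longest distinct words" bookmark heuristic; the class name `PagedMemory`, the `MAX_RESIDENT` budget, and the heuristic are illustrative assumptions, not the paper's implementation. Only the `[pN:keywords]` bookmark format and the `recall()` tool name come from the abstract.

```python
import re
from dataclasses import dataclass, field

PAGE_SIZE = 20      # turns per page, mirroring the fixed_20 setting
MAX_RESIDENT = 2    # hypothetical budget: pages kept verbatim in context


@dataclass
class PagedMemory:
    pages: list[list[str]] = field(default_factory=list)  # every page, in order
    resident: list[int] = field(default_factory=list)     # ids of pages still in context

    def add_turn(self, turn: str) -> None:
        # Open a new page when none exists yet or the last one is full.
        if not self.pages or len(self.pages[-1]) == PAGE_SIZE:
            self.pages.append([])
            self.resident.append(len(self.pages) - 1)
            # Assumed policy: evict the oldest resident page (FIFO) when over budget.
            while len(self.resident) > MAX_RESIDENT:
                self.resident.pop(0)
        self.pages[-1].append(turn)

    def bookmark(self, page_id: int, max_keywords: int = 4) -> str:
        # Toy heuristic (an assumption): keep the longest distinct words on the page.
        words = {w for t in self.pages[page_id] for w in re.findall(r"[a-z0-9]+", t.lower())}
        keywords = sorted(words, key=lambda w: (-len(w), w))[:max_keywords]
        return f"[p{page_id}:{','.join(keywords)}]"

    def context(self) -> str:
        # Evicted pages appear as bookmarks; resident pages appear verbatim.
        parts = []
        for pid, page in enumerate(self.pages):
            parts.append("\n".join(page) if pid in self.resident else self.bookmark(pid))
        return "\n".join(parts)

    def recall(self, page_id: int) -> str:
        # The tool the model would call to fetch an evicted page in full.
        return "\n".join(self.pages[page_id])


mem = PagedMemory()
for i in range(45):
    mem.add_turn(f"turn {i} about topic{i}")
print(mem.context().splitlines()[0])   # bookmark standing in for evicted page 0
print(mem.recall(0).splitlines()[0])   # first turn of the recalled page
```

In this sketch the evicted text is never discarded: the bookmark only replaces it in the prompt, so a `recall()` call can always restore the full page.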

If this is right

  • Coarse fixed-size pages reach 96.7 percent success while topic-shift boundaries fall to 56.7 percent.
  • Eviction policy performance depends on data type, with FIFO strongest on synthetic probes and LFU strongest on LoCoMo.
  • Two improved bookmark-generation strategies each raise end-to-end scores over the basic heuristic.
  • The main remaining limit is bookmark discrimination: the model triggers recall 96 percent of the time but picks the correct page only 57 percent of the time when bookmarks are insufficiently distinctive.
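
The eviction policies compared in the ablation differ only in which resident page they give up first. The sketch below shows the textbook definitions of FIFO, LRU, and LFU; the function name `choose_victim` and its bookkeeping arguments are hypothetical, and the paper's own implementations may differ in details such as tie-breaking.

```python
def choose_victim(resident, policy, last_used, use_count):
    """Pick the page id to evict.

    resident  -- page ids in the order they were loaded (oldest first)
    last_used -- page id -> step at which the page was last accessed
    use_count -- page id -> number of accesses so far
    """
    if policy == "fifo":
        # First in, first out: evict whichever page was loaded earliest.
        return resident[0]
    if policy == "lru":
        # Least recently used: evict the page untouched for the longest time.
        return min(resident, key=lambda p: last_used[p])
    if policy == "lfu":
        # Least frequently used: evict the least-accessed page,
        # breaking ties by recency (an assumed tie-break rule).
        return min(resident, key=lambda p: (use_count[p], last_used[p]))
    raise ValueError(f"unknown policy: {policy}")


# Illustrative bookkeeping, not data from the paper.
resident = [0, 1, 2]
last_used = {0: 9, 1: 3, 2: 7}
use_count = {0: 5, 1: 2, 2: 1}
for policy in ("fifo", "lru", "lfu"):
    print(policy, choose_victim(resident, policy, last_used, use_count))
```

On this toy state the three policies pick three different victims, which is one way to see why the best policy can depend on how a conversation revisits earlier material.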

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same bookmark-plus-tool pattern could be applied to multi-agent histories where agents share a common long-term record.
  • Models could be post-trained to generate more discriminative bookmarks automatically.
  • The 25-point accuracy swing from keyword specificity suggests that prompt engineering or fine-tuning focused on bookmark reading may close much of the remaining gap.

Load-bearing premise

Keyword bookmarks contain enough distinctive information for the model to correctly select the right page when it calls the recall tool.

What would settle it

A dataset in which all bookmarks use only generic keywords, causing correct page selection to drop well below 57 percent and overall answer quality to fall below the truncation baseline.

Figures

Figures reproduced from arXiv: 2604.12376 by Ziyang Liu.

Figure 1: NLL-based fault detection fails. (a) Mean …
Figure 2: Cooperative memory paging. As the conversation grows (left), turns are grouped into pages that occupy …
Figure 3: LoCoMo per-category comparison. Bookmark+Recall outperforms baselines across all 5 QA categories, with the largest gains on temporal reasoning and open-domain questions, which require accessing distant conversation history.
Figure 4: Page boundary ablation. Coarse fixed-size …
Figure 5: Eviction policy ablation. Bélády's oracle (rightmost, highlighted) upper-bounds online policies by 8–14 points, revealing headroom for smarter practical policies.
Figure 6: Two conversation topologies explain the eviction-policy inversion. In forward-moving conversations …
Figure 7: Bookmark format Pareto analysis (n=22 controlled probes). Minimal keywords (∼24 tokens) achieve the highest accuracy at the lowest token cost.
LRU vs. LFU matters much less than on GPT-4o-mini. This is consistent with DeepSeek being a stronger model: it can compensate for mediocre eviction by more aggressively calling recall() to bring evicted pages back. The Bélády gap shrinks from 14.3 to 4.7 points,…
Figure 8: The information gap principle. A minimal bookmark (left) creates just enough uncertainty for the model to call recall() before answering, retrieving the full page and responding correctly. A rich bookmark (right) gives the same model a false sense of sufficiency, suppresses the recall call, and leads to a hallucinated answer. Paradoxically, more information in the bookmark yields worse end-to-end accuracy.…
Figure 9: Heatmap of boundary×eviction accuracy. Page granularity (rows) dominates eviction policy (columns): fixed_20 is uniformly high regardless of policy.
… 159 probes per strategy). All strategies degrade compared to the fixed_10 results in the main text, but the ranking is preserved: hybrid remains the best strategy (54.7%), and random remains worst (22.6%).
Original abstract

When LLM conversations grow beyond the context window, old content must be evicted -- but how does the model recover it when needed? We propose cooperative paging: evicted segments are replaced with minimal keyword bookmarks ([pN:keywords], ~8-24 tokens each), and the model is given a recall() tool to retrieve full content on demand. On the LoCoMo benchmark (10 real multi-session conversations, 300+ turns), cooperative paging achieves the highest answer quality among six methods -- outperforming truncation, BM25, word-overlap retrieval, a search-tool baseline, and full context -- on four models (GPT-4o-mini, DeepSeek-v3.2, Claude Haiku, GLM-5), confirmed by four independent LLM judges ($p=0.017$, paired bootstrap). We then study the paging design space with a 5x4 ablation over boundary strategies and eviction policies (3,176 synthetic probes, 1,600 LoCoMo probes). Key findings: (1) coarse fixed-size pages (fixed_20) reach 96.7% while content-aware topic_shift collapses to 56.7%; (2) eviction policy choice is data-dependent (FIFO best on synthetic, LFU on LoCoMo); (3) two bookmark generation strategies improve over the heuristic baseline (+4.4 and +8.7 E2E points); (4) the remaining bottleneck is bookmark discrimination -- the model triggers recall() 96% of the time but selects the correct page only 57% when bookmarks are insufficiently distinctive. Keyword specificity alone accounts for a 25 percentage point accuracy difference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper proposes cooperative paging for long-horizon LLM conversations: evicted context segments are replaced by compact keyword bookmarks ([pN:keywords]), and the model is equipped with a recall() tool to fetch full content on demand. On the LoCoMo benchmark (10 real multi-session conversations, 300+ turns), cooperative paging outperforms truncation, BM25, word-overlap retrieval, a search-tool baseline, and full context on four models (GPT-4o-mini, DeepSeek-v3.2, Claude Haiku, GLM-5), with the advantage confirmed by four independent LLM judges at p=0.017 via paired bootstrap. A 5x4 ablation over page-boundary strategies and eviction policies (3,176 synthetic + 1,600 LoCoMo probes) identifies coarse fixed-size pages as superior to topic-shift boundaries, data-dependent eviction effects, and two improved bookmark generators (+4.4 / +8.7 E2E points), while highlighting bookmark discrimination as the remaining bottleneck (96% recall trigger rate but only 57% correct page selection when bookmarks lack specificity).

Significance. If the end-to-end gains prove robust, cooperative paging supplies a lightweight, model-in-the-loop mechanism for recovering evicted history without full-context overhead or irreversible loss, which could improve reliability in extended multi-session applications. The multi-model evaluation, paired-bootstrap significance testing, and explicit ablation of boundary/eviction choices are methodological strengths; the paper also earns credit for openly reporting the 57% selection-rate limitation rather than concealing it.

major comments (2)
  1. [Abstract] Abstract: the headline claim that cooperative paging achieves the highest answer quality rests on a 57% correct-page-selection rate when bookmarks are insufficiently distinctive. Because the LoCoMo evaluation is end-to-end and the benchmark contains only 10 conversations, it is unclear whether the observed quality edge is produced by successful paging or by incidental retrieval of useful fragments in the subset of cases where selection happens to succeed. A breakdown of answer quality conditioned on correct versus incorrect recall() outcomes is required to establish that the paging mechanism is the causal source of the gains.
  2. [Evaluation] Evaluation section (LoCoMo results): the paired-bootstrap significance (p=0.017) is computed over only 10 conversations. Given that eviction-policy performance is explicitly data-dependent (FIFO best on synthetic probes, LFU best on LoCoMo) and that bookmark specificity alone drives a 25-point accuracy gap, the small sample size makes it difficult to rule out conversation-specific artifacts; results should be replicated on a larger, held-out set of multi-session dialogues.
minor comments (3)
  1. [Abstract] Abstract: main metrics are reported without error bars or confidence intervals, making it hard to judge the practical magnitude of the improvements over the six baselines.
  2. The manuscript does not state whether code, prompts, or the exact bookmark-generation heuristics will be released, which limits reproducibility of the 5x4 ablation and the two improved bookmark generators.
  3. The consistency of the four LLM judges is not quantified (e.g., inter-rater agreement or agreement with human labels on a subset); this is especially relevant because the primary metric is LLM-as-judge quality.
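
For readers unfamiliar with the paired bootstrap cited in major comment 2, the following is a minimal sketch of one common one-sided variant, resampling per-conversation score differences with replacement. The function name `paired_bootstrap_p`, the resample count, and the scores are illustrative assumptions; the paper's exact procedure is not specified here, and the numbers below are made up rather than taken from its results.

```python
import random


def paired_bootstrap_p(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Estimate how often method A fails to beat method B under resampling.

    scores_a, scores_b -- per-conversation scores for the same conversations,
    in the same order, so each pair is compared like for like.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired scores must have equal length")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    not_better = 0
    for _ in range(n_resamples):
        # Draw n conversations with replacement, keeping each pair intact.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(sample) <= 0:
            not_better += 1
    return not_better / n_resamples


# Made-up scores for 10 conversations, purely for illustration.
paging = [0.71, 0.64, 0.80, 0.58, 0.69, 0.75, 0.62, 0.77, 0.66, 0.70]
full_context = [0.68, 0.66, 0.74, 0.55, 0.70, 0.69, 0.60, 0.73, 0.61, 0.67]
print(paired_bootstrap_p(paging, full_context))
```

With only 10 pairs, every resample draws from the same 10 differences, which is why the referee asks whether a few conversations could dominate the estimate.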

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below, providing clarifications and indicating planned revisions to strengthen the presentation of results.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the headline claim that cooperative paging achieves the highest answer quality rests on a 57% correct-page-selection rate when bookmarks are insufficiently distinctive. Because the LoCoMo evaluation is end-to-end and the benchmark contains only 10 conversations, it is unclear whether the observed quality edge is produced by successful paging or by incidental retrieval of useful fragments in the subset of cases where selection happens to succeed. A breakdown of answer quality conditioned on correct versus incorrect recall() outcomes is required to establish that the paging mechanism is the causal source of the gains.

    Authors: We agree that conditioning answer quality on recall() success would strengthen the causal interpretation. The manuscript already reports the 96% recall trigger rate and 57% correct-page selection rate, along with the 25-point accuracy gap attributable to bookmark specificity. However, we did not include the explicit split of end-to-end quality metrics for correct versus incorrect selections. We will add this breakdown (e.g., as a table or additional rows in the LoCoMo results) to the revised manuscript to demonstrate that gains are concentrated in successful paging cases. revision: yes

  2. Referee: [Evaluation] Evaluation section (LoCoMo results): the paired-bootstrap significance (p=0.017) is computed over only 10 conversations. Given that eviction-policy performance is explicitly data-dependent (FIFO best on synthetic probes, LFU best on LoCoMo) and that bookmark specificity alone drives a 25-point accuracy gap, the small sample size makes it difficult to rule out conversation-specific artifacts; results should be replicated on a larger, held-out set of multi-session dialogues.

    Authors: We acknowledge that N=10 conversations is a limitation of the LoCoMo benchmark and that our ablations already highlight data-dependent effects (e.g., FIFO vs. LFU) and the impact of bookmark quality. The paired bootstrap is applied over the 10 conversations to assess significance given the sample, and the large-scale synthetic probes (3,176) plus LoCoMo probes (1,600) provide supporting evidence of robustness. Replicating the full end-to-end evaluation on an entirely new, larger held-out corpus of multi-session dialogues would require new data collection and is not feasible in the current work. In revision we will expand the limitations discussion to more explicitly address generalizability and conversation-specific risks. revision: partial

standing simulated objections not resolved
  • Replicating the end-to-end LoCoMo evaluation on a larger held-out set of multi-session dialogues, as this requires substantial new data collection beyond the scope of the present study.

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation on external benchmark

Full rationale

The paper contains no equations, derivations, fitted parameters presented as predictions, or self-citation chains. All claims rest on direct measurements from the LoCoMo benchmark (10 conversations), synthetic probes, and ablations across boundary/eviction strategies and bookmark generators, with results confirmed by multiple independent models and judges. The reported 57% page-selection rate and 96% recall trigger rate are explicit empirical observations, not inputs that are redefined as outputs. The evaluation is grounded in an external benchmark, and no result reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 2 invented entities

The approach introduces new interface elements and relies on assumptions about model tool-use behavior rather than deriving from prior fitted constants.

free parameters (2)
  • bookmark token length
    Set to ~8-24 tokens to remain minimal while informative; tested via ablation.
  • page boundary strategy
    fixed_20 vs topic_shift; fixed_20 reached 96.7% while topic_shift fell to 56.7%.
axioms (1)
  • domain assumption: LLMs can be instructed to use the recall() tool effectively when given keyword bookmarks
    Central to enabling cooperative paging; invoked throughout the method description.
invented entities (2)
  • keyword bookmark (no independent evidence)
    purpose: Compact proxy replacing evicted conversation segments
    New construct introduced to support paging without full context retention.
  • recall() tool (no independent evidence)
    purpose: On-demand retrieval of full page content
    New tool interface provided to the model.

pith-pipeline@v0.9.0 · 5589 in / 1453 out tokens · 39735 ms · 2026-05-10T15:35:26.791767+00:00 · methodology
