Cooperative Memory Paging with Keyword Bookmarks for Long-Horizon LLM Conversations

Ziyang Liu

REVIEW · 4 major objections · 2 minor · 4 references

Cooperative paging replaces evicted chat history with tiny keyword bookmarks and a recall() tool, and it beats truncation, retrieval, and full context on long multi-session conversations.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.

T0 review · grok-4.5

2026-07-12 21:21 UTC pith:LDXXKYC4

arxiv 2604.12376 v2 pith:LDXXKYC4 submitted 2026-04-14 cs.CL cs.AI

keywords cooperative paging, keyword bookmarks, long-horizon conversations, memory eviction, LLM tools, LoCoMo, context management

verification ladder: T0 review · T1 audit · T2 compute · T3 formal · T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

When an LLM conversation exceeds the context window, older turns must be dropped, yet the model still needs a way to recover them when a later question depends on that history. This paper proposes cooperative paging: each evicted segment is replaced by a short keyword bookmark of roughly 8–24 tokens, and the model is given an explicit recall() tool so it can fetch the full page on demand. On the LoCoMo benchmark of real multi-session chats (10 conversations, 300+ turns), the method is reported to produce the highest answer quality among six methods, ahead of truncation, BM25, word-overlap retrieval, a search-tool baseline, and even keeping the entire history, across four different models and as scored by four independent LLM judges. Ablations further show that coarse fixed-size pages work far better than topic-aware boundaries (96.7% vs. 56.7%), that the best eviction policy is data-dependent, and that the remaining failure mode is not whether the model calls recall() but whether the bookmark is distinctive enough for it to pick the right page.

Core claim

On long-horizon multi-session conversations, replacing evicted segments with minimal keyword bookmarks plus a model-callable recall() tool yields higher answer quality than truncation, standard retrieval methods, a search-tool baseline, and even full-context retention, with the ranking holding across four models and four independent LLM judges (p=0.017).

What carries the argument

Cooperative paging: each page is summarized by a short keyword bookmark of the form [pN:keywords] that stays in context; the model is given a recall() tool that returns the original full page when the bookmark is insufficient.
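To make the bookmark-plus-recall mechanism concrete, here is a minimal Python sketch. It assumes only what the abstract states: bookmarks have the form [pN:keywords] and a recall() tool returns the full page. The `PageStore` class, the `evict` and `parse_bookmarks` helpers, and the comma-separated keyword layout are hypothetical illustrations, not the author's implementation.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List

# Matches bookmarks of the form [pN:keywords] described in the abstract.
# The comma separator between keywords is an assumption of this sketch.
BOOKMARK_RE = re.compile(r"\[p(\d+):([^\]]*)\]")


@dataclass
class PageStore:
    """Hypothetical store: full pages live outside the context window."""
    pages: Dict[int, List[str]] = field(default_factory=dict)

    def evict(self, page_id: int, turns: List[str], keywords: List[str]) -> str:
        """Move a page out of context and return the bookmark left behind."""
        self.pages[page_id] = list(turns)
        return f"[p{page_id}:{','.join(keywords)}]"

    def recall(self, page_id: int) -> List[str]:
        """Tool handler: return the original turns of an evicted page."""
        if page_id not in self.pages:
            raise KeyError(f"unknown page p{page_id}")
        return list(self.pages[page_id])


def parse_bookmarks(context: str) -> Dict[int, List[str]]:
    """List the bookmarks currently visible in the context string."""
    return {
        int(n): (kw.split(",") if kw else [])
        for n, kw in BOOKMARK_RE.findall(context)
    }
```

In this sketch the model would see only the string returned by `evict`, and a tool call such as `recall(3)` would be routed to `PageStore.recall`. Whether the model picks the right page number from the visible keywords is exactly the discrimination problem the abstract reports.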

Load-bearing premise

That scores from four independent LLM judges are a faithful measure of answer quality for multi-session factual recall, and that the ten LoCoMo conversations plus the synthetic probes are representative enough for the superiority claim to generalize.

What would settle it

Run the same six methods on a fresh set of multi-session conversations scored by human raters (or by a held-out human-validated automatic metric) and check whether cooperative paging still ranks first with a comparable effect size and significance.

If this is right

Long-running agent or multi-session chat systems can keep only a few dozen bookmark tokens per page instead of the full history while still recovering needed facts.
Bookmark distinctiveness, not recall frequency, becomes the primary engineering target: more specific keywords alone move page-selection accuracy by roughly 25 points.
Coarse fixed-size paging (e.g., fixed_20) is preferable to content-aware topic-shift segmentation for this style of memory.
Eviction policy should be chosen per domain (FIFO for synthetic probes, LFU for real LoCoMo chats) rather than assumed universal.
Two improved bookmark-generation strategies already add 4–9 end-to-end points over a simple heuristic, suggesting further gains are available from better keyword extraction.
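The two engineering levers in the list above, fixed-size pages and distinctive keywords, can be sketched as follows. This is an illustration under stated assumptions: it treats fixed_20 as "20 turns per page" (the abstract does not define the unit), and `distinctive_keywords` is a hypothetical TF-IDF-style heuristic, not one of the bookmark generators evaluated in the paper.

```python
import math
import re
from collections import Counter
from typing import List


def fixed_size_pages(turns: List[str], page_size: int = 20) -> List[List[str]]:
    """Split turns into consecutive pages of at most page_size turns."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    return [turns[i:i + page_size] for i in range(0, len(turns), page_size)]


def _words(page: List[str]) -> List[str]:
    return re.findall(r"[a-z0-9]+", " ".join(page).lower())


def distinctive_keywords(pages: List[List[str]], page_idx: int, k: int = 4) -> List[str]:
    """Rank a page's words by in-page count times a smoothed inverse page frequency.

    Words that occur on every page score zero, so they are ranked last.
    """
    page_sets = [set(_words(p)) for p in pages]
    counts = Counter(_words(pages[page_idx]))
    n = len(pages)

    def score(word: str) -> float:
        df = sum(word in s for s in page_sets)  # pages containing the word
        return counts[word] * math.log((1 + n) / (1 + df))

    # Sort by descending score, then alphabetically for a stable result.
    return sorted(counts, key=lambda w: (-score(w), w))[:k]
```

The design choice worth noting is that the score depends on the other pages, not only on the page being bookmarked: a keyword is useful for page selection only if it separates this page from its neighbours.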

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same bookmark-plus-recall pattern could be applied to tool-use traces or long agent trajectories, not only human–LLM chat logs.
If bookmark discrimination remains the bottleneck, a cheap second-stage re-ranker or a small learned bookmark encoder might close the remaining gap without enlarging the context window.
The result that full context underperforms paging implies that raw length can introduce more distraction than signal once conversations exceed a few hundred turns.
Because the method is model-agnostic and needs only a tool interface, it can be layered on top of any existing long-context or RAG stack with almost no architectural change.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

Clean systems idea with a strong reported LoCoMo ranking, but the full-text dump is the wrong paper, so the p=0.017 win and ablations are still unchecked.

The one thing you need to know: this is a practical paging-plus-tool design for long multi-session chats—evict old segments, leave short keyword bookmarks of the form [pN:keywords], and give the model a recall() tool—and the abstract claims it beats truncation, BM25, word-overlap, a search tool, and even full context on LoCoMo across four models, with four LLM judges and p=0.017. That ranking, if real, is the contribution.

What is actually new is the cooperative framing: the model, not a silent retriever, decides when to page content back in, with bookmarks as the only residual in context. The design-space study is the useful part of the abstract—fixed_20 pages at 96.7% vs topic_shift at 56.7%, eviction policy data-dependent, better bookmark generators (+4.4 / +8.7 E2E), and a clear bottleneck that recall fires 96% of the time but correct-page selection is only 57% when bookmarks are weak, with keyword specificity alone worth 25 points. That is honest systems work, not a theory paper.

Soft spots, in proportion: we do not have the correct full manuscript. The cached “full text” is SCRIPT (Korean subcharacter injection, 2604.12377), so protocol, judge prompts, conversation-level breakdowns, and code are not checkable here. On the abstract alone, N=10 LoCoMo conversations is thin for a bootstrap p-value; beating full context is the claim that most needs human correlation or leave-one-conversation-out, because judges may prefer short tool-mediated answers. Free parameters (page boundary, eviction policy, bookmark length) are acknowledged in the ablations rather than hidden. Circularity is low; residual risk is ordinary LLM-judge bias.

Who it is for: people building long-horizon agents and chat memory, not core theory. If the real paper and artifacts match the abstract, it is a solid new_method piece worth engaging. Until then the superiority claim is unverified.

I would send it to peer review on the strength of the systems idea and the reported design-space findings, with referees asked hard for human ratings, per-conversation robustness, and release of code/data. Engage if you work on agent memory; wait for the correct PDF before citing the ranking.

Referee Report

4 major / 2 minor

Summary. The abstract proposes cooperative paging for long-horizon LLM conversations: when content exceeds the context window, evicted segments are replaced by compact keyword bookmarks of the form [pN:keywords] (~8–24 tokens), and the model is given a recall() tool to restore full pages on demand. On LoCoMo (10 multi-session conversations, 300+ turns) the method is reported to achieve the highest answer quality among six methods (ahead of truncation, BM25, word-overlap retrieval, a search-tool baseline, and full context) across four models, with statistical support from four LLM judges (p=0.017, paired bootstrap). A 5×4 ablation over page-boundary strategies and eviction policies (3,176 synthetic and 1,600 LoCoMo probes) yields design findings: fixed-size pages outperform topic-shift segmentation, eviction policy is data-dependent, improved bookmark generation helps, and the residual bottleneck is bookmark discrimination (high recall trigger rate but only ~57% correct-page selection when bookmarks are weak).

Significance. If the ranking and ablation results hold under proper verification, the work would offer a lightweight, model-agnostic alternative to pure truncation or external retrieval for multi-session dialogue, with a clear mechanistic diagnosis (bookmark specificity) that is actionable. The design-space study (boundary × eviction, bookmark generation variants) is a useful contribution even if absolute superiority over full context proves fragile. Code or reproducible probes would strengthen the claim; none are checkable from the materials supplied for this arXiv ID.

major comments (4)

The supplied full manuscript text is not the paper under review. paper_id 2604.12376 and the abstract describe Cooperative Memory Paging; the body is the unrelated SCRIPT paper (Korean subcharacter module, arXiv 2604.12377). No sections, tables, equations, or experimental protocols for cooperative paging are available. All numerical claims (p=0.017, 96.7% vs 56.7%, +4.4/+8.7 E2E, 57% correct-page selection, 25 pp keyword effect) are therefore unverifiable. A review of the actual manuscript is required before any accept/reject decision can be grounded.
From the abstract alone: the load-bearing superiority claim rests on N=10 LoCoMo conversations. With such a small conversation set, paired bootstrap p-values can be dominated by a few dialogues; the abstract reports no leave-one-conversation-out, per-conversation scores, or inter-judge agreement. This is insufficient to secure a ranking that includes beating full context.
From the abstract alone: answer quality is measured solely by four LLM judges with no reported human correlation or bias audit. Beating full context is counter-intuitive (paging discards information) and is most plausible if judges systematically prefer concise, tool-mediated answers. Without human ratings or a controlled preference study, the ranking and p=0.017 cannot be treated as established.
From the abstract alone: the paper itself identifies a 57% correct-page selection rate when bookmarks are weak, and a 25 pp accuracy gap driven by keyword specificity. End-to-end quality gains may therefore be concentrated on easy probes. Absent a breakdown of quality by probe difficulty or by correct vs incorrect recall, it is unclear whether cooperative paging robustly recovers long-horizon facts or mainly succeeds when discrimination is trivial.
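The robustness checks requested in the second major comment can be sketched in a few lines. The code below is an assumption-laden illustration: it resamples conversations as the unit, uses a one-sided test on the mean paired difference, and takes per-conversation scores as plain floats. The paper's actual bootstrap procedure is not available here, and the function names are hypothetical.

```python
import random
from statistics import mean
from typing import List, Sequence


def paired_bootstrap_p(a: Sequence[float], b: Sequence[float],
                       n_boot: int = 10_000, seed: int = 0) -> float:
    """Fraction of resamples in which method a is NOT better than method b.

    a[i] and b[i] are scores of the two methods on the same conversation i.
    """
    if len(a) != len(b) or not a:
        raise ValueError("need two equal-length, non-empty score lists")
    diffs = [x - y for x, y in zip(a, b)]
    rng = random.Random(seed)
    n = len(diffs)
    not_better = 0
    for _ in range(n_boot):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]  # resample with replacement
        if mean(sample) <= 0:
            not_better += 1
    return not_better / n_boot


def leave_one_out_deltas(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Mean paired difference with each conversation removed in turn."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need at least two paired scores")
    diffs = [x - y for x, y in zip(a, b)]
    total = sum(diffs)
    return [(total - d) / (len(diffs) - 1) for d in diffs]
```

With only ten conversations, a single dialogue with a large difference can carry the bootstrap result. If any entry of `leave_one_out_deltas` drops to zero or below, the ranking depends on that one conversation.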

minor comments (2)

Abstract notation for bookmarks ([pN:keywords]) and the recall() tool interface should be defined more precisely (token budget, generation method, failure modes) once the correct manuscript is supplied.
The five baselines and four models are named but not characterized (context lengths, retrieval corpus construction, search-tool prompt). These details matter for interpreting 'outperforms full context'.

Circularity Check

0 steps flagged

No circular derivation: cooperative paging claims are empirical rankings on external LoCoMo/synthetic probes, not results forced by definition or self-fit.

The paper’s load-bearing claim is an empirical ranking: cooperative paging (keyword bookmarks + recall() tool) yields the highest answer quality among six methods on LoCoMo (10 multi-session conversations), across four models and four LLM judges (p=0.017, paired bootstrap). That ranking is produced by running the methods on held-out conversation probes and scoring outputs; it is not obtained by defining a quantity in terms of itself, fitting a parameter on a subset and re-labeling a related quantity as a prediction, or importing a uniqueness theorem from the authors’ prior work. The subsequent 5×4 ablation (boundary strategies × eviction policies; 3,176 synthetic + 1,600 LoCoMo probes) and the bookmark-generation comparisons are likewise exploratory measurements of accuracy and end-to-end quality, not closed-form derivations. Keyword-specificity’s 25-point accuracy gap is a reported correlation, not a tautology. Residual risks (N=10 conversations, unvalidated LLM judges, counter-intuitive win over full context) are validity/robustness concerns, not circularity of the derivation chain. No self-definitional step, fitted-input-as-prediction, load-bearing self-citation uniqueness claim, or renamed known result appears in the abstract or the stated method–evaluation structure. Score 0 is therefore appropriate; steps remain empty.

Axiom & Free-Parameter Ledger

3 free parameters · 3 axioms · 2 invented entities

Abstract-only review. Load-bearing premises are the validity of LLM judges, the representativeness of LoCoMo (10 conversations), the operational definitions of the six methods, and the bookmark/recall interface. No free parameters are numerically fitted in the abstract; design choices (page size fixed_20, eviction policies, bookmark generators) are discrete experimental factors rather than continuous fits to the target metric.

free parameters (3)

page size / boundary strategy (e.g. fixed_20)
Chosen design hyperparameter; fixed_20 is reported best (96.7%) but is selected from a discrete ablation, not derived.
bookmark length / keyword count (~8–24 tokens)
Heuristic range stated in abstract; generation strategy is ablated but not uniquely determined.
eviction policy (FIFO, LFU, etc.)
Data-dependent choice reported as free experimental factor rather than a fixed law.
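As a rough illustration of the eviction-policy parameter, the sketch below selects a victim page under FIFO or LFU. The data layout (`load_order` as page ids from oldest to newest, `hit_counts` as recall counts per page) and the function name are assumptions made for this example; the manuscript's bookkeeping is not available.

```python
from typing import Dict, List


def pick_victim(policy: str, load_order: List[int], hit_counts: Dict[int, int]) -> int:
    """Choose which resident page to evict next.

    load_order: resident page ids, oldest first (hypothetical layout).
    hit_counts: how often each page has been accessed so far.
    """
    if not load_order:
        raise ValueError("no resident pages to evict")
    if policy == "fifo":
        # First in, first out: evict the page that entered context earliest.
        return load_order[0]
    if policy == "lfu":
        # Least frequently used; min() keeps the first minimum, so ties go to the oldest page.
        return min(load_order, key=lambda p: hit_counts.get(p, 0))
    raise ValueError(f"unknown policy: {policy}")
```

The two policies diverge as soon as an old page is accessed often: FIFO evicts it anyway, while LFU keeps it. That is one plausible reading of why the reported best policy differs between synthetic probes and LoCoMo.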

axioms (3)

domain assumption: LLM-as-judge scores from four independent judges are a valid proxy for multi-session answer quality.
Central superiority claim and p=0.017 rest on this evaluation protocol (stated in abstract).
domain assumption: The model can reliably use a recall() tool when bookmarks are present in context.
Cooperative paging presupposes tool-use competence; abstract reports 96% trigger rate.
domain assumption: LoCoMo's 10 conversations plus synthetic probes adequately sample long-horizon retrieval needs.
Generalization of ranking and ablation findings depends on this sample.

invented entities (2)

keyword bookmark tokens of form [pN:keywords] (no independent evidence)
purpose: Compact stand-in for evicted conversation segments that the model can later resolve via recall().
Core interface invention of the method; no independent existence outside this design.
cooperative paging, i.e. model-driven page restore via recall() (no independent evidence)
purpose: Frame long-context eviction as OS-style paging under model control.
Named methodological construct; evidence is the reported LoCoMo gains, not external measurement.

pith-pipeline@v1.1.0-grok45 · 35211 in / 2638 out tokens · 28512 ms · 2026-07-12T21:21:27.377800+00:00 · methodology

Original abstract

When LLM conversations grow beyond the context window, old content must be evicted -- but how does the model recover it when needed? We propose cooperative paging: evicted segments are replaced with minimal keyword bookmarks ([pN:keywords], ~8-24 tokens each), and the model is given a recall() tool to retrieve full content on demand. On the LoCoMo benchmark (10 real multi-session conversations, 300+ turns), cooperative paging achieves the highest answer quality among six methods -- outperforming truncation, BM25, word-overlap retrieval, a search-tool baseline, and full context -- on four models (GPT-4o-mini, DeepSeek-v3.2, Claude Haiku, GLM-5), confirmed by four independent LLM judges ($p=0.017$, paired bootstrap). We then study the paging design space with a 5x4 ablation over boundary strategies and eviction policies (3,176 synthetic probes, 1,600 LoCoMo probes). Key findings: (1) coarse fixed-size pages (fixed_20) reach 96.7% while content-aware topic_shift collapses to 56.7%; (2) eviction policy choice is data-dependent (FIFO best on synthetic, LFU on LoCoMo); (3) two bookmark generation strategies improve over the heuristic baseline (+4.4 and +8.7 E2E points); (4) the remaining bottleneck is bookmark discrimination -- the model triggers recall() 96% of the time but selects the correct page only 57% when bookmarks are insufficiently distinctive. Keyword specificity alone accounts for a 25 percentage point accuracy difference.

Figures

Figures reproduced from arXiv: 2604.12376 by Ziyang Liu.

**Figure 2.** Cooperative memory paging. As the conversation grows (left), turns are grouped into pages that occupy…

**Figure 3.** LoCoMo per-category comparison. Bookmark+Recall outperforms baselines across all 5 QA categories, with the largest gains on temporal reasoning and open-domain questions, which require accessing distant conversation history.

**Figure 4.** Page boundary ablation. Coarse fixed-size…

**Figure 5.** Eviction policy ablation. Bélády's oracle (rightmost, highlighted) upper-bounds online policies by 8–14 points, revealing headroom for smarter practical policies.

**Figure 6.** Two conversation topologies explain the eviction-policy inversion. In forward-moving conversations…

**Figure 7.** Bookmark format Pareto analysis (n=22 controlled probes). Minimal keywords (∼24 tokens) achieve the highest accuracy at the lowest token cost. LRU vs. LFU matters much less than on GPT-4o-mini. This is consistent with DeepSeek being a stronger model: it can compensate for mediocre eviction by more aggressively calling recall() to bring evicted pages back. The Bélády gap shrinks from 14.3 to 4.7 points,…

**Figure 8.** The information gap principle. A minimal bookmark (left) creates just enough uncertainty for the model to call recall() before answering, retrieving the full page and responding correctly. A rich bookmark (right) gives the same model a false sense of sufficiency, suppresses the recall call, and leads to a hallucinated answer. Paradoxically, more information in the bookmark yields worse end-to-end accuracy.…

**Figure 9.** Heatmap of boundary × eviction accuracy. Page granularity (rows) dominates eviction policy (columns): fixed_20 is uniformly high regardless of policy. … (159 probes per strategy). All strategies degrade compared to the fixed_10 results in the main text, but the ranking is preserved: hybrid remains the best strategy (54.7%), and random remains worst (22.6%).
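The Figure 5 caption uses Bélády's oracle as an upper bound on online eviction policies. As a minimal sketch of that standard rule (evict the resident page whose next use lies farthest in the future), assuming the full future access sequence is known, which is why it is an offline bound rather than a practical policy. The function name and argument layout are hypothetical.

```python
from typing import List, Sequence


def belady_victim(resident: List[int], future_accesses: Sequence[int]) -> int:
    """Evict the resident page whose next access is farthest away (or never)."""
    if not resident:
        raise ValueError("no resident pages to evict")
    future = list(future_accesses)

    def next_use(page: int) -> float:
        # Position of the next access; pages never accessed again sort last.
        return future.index(page) if page in future else float("inf")

    # max() keeps the first maximum, so ties go to the earliest-listed page.
    return max(resident, key=next_use)
```

An online policy such as FIFO or LFU has to guess at `future_accesses`; the 8–14 point gap quoted in the caption is the cost of that guess on the reported probes.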
