PRISM: Pareto-Efficient Retrieval over Intent-Aware Structured Memory for Long-Horizon Agents

Jingyi Peng; Qiuzhuang Sun; Weiting Liu; Zhongwei Wan

arxiv: 2605.12260 · v2 · pith:GSRUJ57Anew · submitted 2026-05-12 · 💻 cs.CL

Pith reviewed 2026-05-25 06:22 UTC · model grok-4.3

classification 💻 cs.CL

keywords long-horizon agents, memory retrieval, graph-structured memory, intent-aware retrieval, evidence compression, training-free framework, context efficiency, Pareto frontier

The pith

PRISM retrieves the right facts from long conversation graphs using an order of magnitude less context while raising answer accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PRISM as a training-free method that turns memory retrieval for long-horizon language agents into a joint search-and-compression task over an existing graph of typed relations. It scores paths according to detected query intent, bundles evidence hierarchically, compresses the bundle, and routes most queries through tiers that need no LLM call. A sympathetic reader would care because agents that keep multi-turn histories currently face a hard trade-off between missing key facts and paying high token costs. If the method works, agents could sustain far longer interactions under fixed context budgets without retraining any model or altering how memory is first built.

Core claim

PRISM treats retrieval as min-cost selection over typed path templates in a graph-structured memory and pairs it with an LLM-side compression step. The framework runs four inference-time components without any fine-tuning or changes to the upstream ingestion pipeline: Hierarchical Bundle Search over typed relation paths, Query-Sensitive Edge Costing that aligns traversal with detected query intent, Evidence Compression that shrinks the candidate bundle into compact answer-side context, and Adaptive Intent Routing that sends most queries through zero-LLM tiers. On the LoCoMo benchmark this produces substantially higher LLM-judge accuracy than every same-protocol baseline at an order-of-magnitude smaller context budget.

What carries the argument

Min-cost selection over typed path templates paired with LLM-side evidence compression
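To make the phrase "min-cost selection over typed path templates" concrete, here is a minimal Python sketch. It is not the authors' implementation: the edge types, the intent-specific cost table, the templates, and the toy graph are all hypothetical, chosen only to show how a detected intent can change which template-conforming path is cheapest.

```python
from dataclasses import dataclass

# Hypothetical cost of traversing each edge type under each detected intent.
# Types, intents, and numbers are illustrative assumptions, not paper values.
EDGE_COST = {
    "temporal": {"mentions": 1.0, "before": 0.2, "causes": 1.5},
    "causal": {"mentions": 1.0, "before": 1.0, "causes": 0.2},
}

# Hypothetical path templates: the allowed sequences of edge types.
TEMPLATES = [("mentions",), ("mentions", "before"), ("mentions", "causes")]


@dataclass(frozen=True)
class Edge:
    src: str
    dst: str
    etype: str


def match_template(edges, start, template):
    """Enumerate simple paths from `start` whose edge types follow `template`."""
    paths = [((start,), ())]  # each entry: (nodes visited, edges used)
    for etype in template:
        extended = []
        for nodes, used in paths:
            for e in edges:
                if e.src == nodes[-1] and e.etype == etype and e.dst not in nodes:
                    extended.append((nodes + (e.dst,), used + (e,)))
        paths = extended
    return paths


def min_cost_paths(edges, start, intent, k=2):
    """Return the k cheapest template-conforming paths as (cost, nodes) pairs."""
    costs = EDGE_COST[intent]
    scored = []
    for template in TEMPLATES:
        for nodes, used in match_template(edges, start, template):
            scored.append((sum(costs[e.etype] for e in used), nodes))
    scored.sort()
    return scored[:k]


# Toy graph: a query anchor "q" mentions event e1, which precedes e2 and causes e3.
EDGES = [
    Edge("q", "e1", "mentions"),
    Edge("e1", "e2", "before"),
    Edge("e1", "e3", "causes"),
]
```

With this toy graph, a temporal intent ranks the path through the `before` edge above the path through `causes`, and a causal intent reverses that order. A real system would also need a context budget and a way to merge paths into bundles, which this sketch leaves out.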

If this is right

Agents can sustain longer histories without a proportional rise in token cost.
Retrieval quality remains high even when the allowed context size is tightly limited.
No changes are required to existing memory-ingestion pipelines or to the underlying language model.
Many queries can be routed without a classifier-side LLM call; Figure 4 reports an overall no-LLM routing rate of 42.3% on LoCoMo categories 1–4.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If intent detection proves brittle on new domains, the accuracy advantage could shrink unless paired with more robust intent classifiers.
The same min-cost path formulation might extend to other structured memory representations such as trees or hypergraphs if typed relations can still be defined.
Combining PRISM with improved upstream graph construction could push the required context size even lower while preserving the accuracy lift.

Load-bearing premise

The framework assumes that an upstream pipeline already supplies a graph with typed relation paths and that query intent can be detected reliably enough to set edge costs and choose routes without adding substantial error or extra cost.

What would settle it

Measuring LLM-judge accuracy on the LoCoMo benchmark while restricting PRISM to one-tenth the context budget used by baselines and finding no accuracy gain would falsify the central performance claim.

Figures

Figures reproduced from arXiv: 2605.12260 by Jingyi Peng, Qiuzhuang Sun, Weiting Liu, Zhongwei Wan.

**Figure 1.** (a) Existing memory designs cluster in three regions of the accuracy–context-cost plane, leaving the high-accuracy / low-cost corner underfilled. (b) PRISM is the only system that combines all six design dimensions we identify as relevant. GraphRAG and MAGMA [4, 7] build typed graphs over events, entities, and causal links, and use graph traversal as the retrieval primitive.

**Figure 2.** Architectural overview of PRISM. PRISM is composed of a four-layer memory graph and four inference-time modules: (1) N4 routes query intent; (2) N2 adjusts traversal costs over typed edges; (3) N1 searches relation paths and assembles candidate bundles; and (4) N3 compresses retrieved evidence into a compact context for the answer model.

**Figure 3.** Accuracy–context trade-off on LoCoMo. Each point is one system; x-axis is average retrieved context tokens per query, y-axis is LLM-judge score. Evidence Compression Sets the Corner. The orange diamond (PRISM − N3) isolates Evidence Compression’s contribution. Without N3, PRISM passes the top-10 candidate bundle directly to the answer model, roughly doubling the per-query context while moving judge by les…

**Figure 4.** Per-category routing distribution of Adaptive Intent Routing (N4) on LoCoMo categories 1–4. Each bar shows the share of queries dispatched through each routing path. The keyword_gated, prototype, and none paths incur zero LLM calls; only the LLM path incurs one classifier-side LLM call per query. The annotation marks the overall no-LLM rate of 42.3%.
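The routing idea behind Figure 4 can be sketched as a small tiered router in Python. This is a simplified illustration under stated assumptions: the signal phrases are taken from the intent descriptions quoted among the extracted prompt fragments on this page, but the gating rule (accept the keyword tier only when exactly one intent matches), the omission of the prototype and none tiers, and the `llm_classify` callable are all hypothetical.

```python
import re

# Signal phrases borrowed from the intent descriptions on this page.
# The gating rule built on top of them is an assumption.
SIGNALS = {
    "temporal": ["when", "before", "after", "during", "how long", "what year"],
    "causal": ["why", "because", "what caused", "what led to", "as a result of"],
}


def keyword_intents(query):
    """Return every intent whose signal phrases appear as whole words in the query."""
    q = query.lower()
    return [
        intent
        for intent, phrases in SIGNALS.items()
        if any(re.search(r"\b" + re.escape(p) + r"\b", q) for p in phrases)
    ]


def route(query, llm_classify):
    """Use the zero-LLM keyword tier when it is unambiguous, else call the classifier."""
    hits = keyword_intents(query)
    if len(hits) == 1:
        return hits[0], "keyword_gated"
    # `llm_classify` is a hypothetical callable standing in for the LLM tier.
    return llm_classify(query), "llm"


def no_llm_rate(paths):
    """Share of queries whose routing path made no classifier-side LLM call."""
    return sum(path != "llm" for path in paths) / len(paths)
```

A query such as "When did Melanie go camping?" matches only temporal signals and stays in the keyword tier, while a query that matches both temporal and causal signals falls through to the classifier. Summing the shares of the non-LLM paths is how a figure like the 42.3% no-LLM rate would be computed.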

Original abstract

Long-horizon language agents accumulate conversation history far faster than any fixed context window can hold, making memory management critical to both answer accuracy and serving cost. Existing approaches either expand the context window without addressing what is retrieved, perform heavy ingestion-time fact extraction at substantial token cost, or rely on heuristic graph traversal that leaves both accuracy and efficiency on the table. We present PRISM, a training-free retrieval-side framework that treats long-horizon memory as a joint retrieval-and-compression problem over a graph-structured memory. PRISM combines four orthogonal inference-time components: Hierarchical Bundle Search over typed relation paths, Query-Sensitive Edge Costing that aligns traversal with detected query intent, Evidence Compression that compresses the candidate bundle into a compact answer-side context, and Adaptive Intent Routing that routes most queries through zero-LLM tiers. By formulating retrieval as min-cost selection over typed path templates and pairing it with an LLM-side compression step, PRISM surfaces the right evidence under a strict context budget without any fine-tuning or modification to the upstream ingestion pipeline. Experiments on the LoCoMo benchmark show that PRISM delivers substantially higher LLM-judge accuracy than every same-protocol baseline at an order-of-magnitude smaller context budget, occupying a previously empty corner of the accuracy-context-cost frontier and demonstrating a superior balance between answer quality and retrieval efficiency.
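The "accuracy-context-cost frontier" in the abstract is a Pareto frontier: a system sits on it when no other system is at least as good on both axes and strictly better on one. The Python sketch below shows that check for two axes, average context tokens (lower is better) and judge score (higher is better). The system names and numbers are placeholders for illustration and are not results from the paper.

```python
def pareto_frontier(points):
    """Return the names of systems that no other system dominates.

    `points` maps a system name to (avg_context_tokens, judge_score).
    System B dominates system A if B uses no more tokens and scores at
    least as high, with a strict improvement on at least one axis.
    """
    frontier = []
    for name, (tokens, score) in points.items():
        dominated = any(
            t <= tokens and s >= score and (t < tokens or s > score)
            for other, (t, s) in points.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return sorted(frontier)


# Placeholder values only; these are not measurements from the paper.
TOY_POINTS = {
    "system_a": (8000, 0.70),  # highest score, large context
    "system_b": (2000, 0.60),  # dominated by system_d
    "system_c": (9000, 0.65),  # dominated by system_a
    "system_d": (800, 0.68),   # smallest context, near-top score
}
```

In this toy data `system_a` and `system_d` form the frontier. Claiming a "previously empty corner" amounts to showing that a new point has both low tokens and high score, so that it dominates most earlier systems.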

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PRISM puts forward a training-free four-component retrieval system over graph memory for long-horizon agents and claims a strong accuracy-context tradeoff on LoCoMo, but the results rest on untested assumptions about intent detection and graph quality with no supporting ablations or error analysis.

The paper's main point is a concrete inference-time framework called PRISM that treats retrieval as min-cost selection over typed relation paths in a graph memory. It adds hierarchical bundle search, query-sensitive edge costing, evidence compression, and adaptive intent routing, all without training or changes to the upstream pipeline. This is presented as a way to get higher LLM-judge accuracy at roughly 10x smaller context on the LoCoMo benchmark compared to same-protocol baselines. The formulation is new in how it pairs the costing and routing steps with a later compression stage to hit a strict budget. The component breakdown is clear and directly targets the accuracy-cost tension in agent memory.

The soft spots are exactly where the stress-test note flags them. The whole Pareto claim depends on the intent detector being reliable enough that the wrong bundles do not get selected, and on the graph already having correct typed paths. The framework assumes both hold without substantial error or extra cost, yet the abstract supplies no error rates, robustness checks, or ablations on misclassification. If intent accuracy is only moderate, the claimed gains would not materialize. The benchmark results are asserted without baseline details, statistical tests, or ablation tables, so it is not possible to judge whether the data actually support the frontier claim.

This work is aimed at people building retrieval layers for long-running agents. A reader who needs ideas for inference-time graph traversal and intent-aware costing could extract usable pieces even if the experiments need strengthening. It deserves peer review because the problem is real and the proposal is specific, though any referee would have to press hard on the missing validation for the two load-bearing assumptions.

Referee Report

3 major / 0 minor

Summary. The paper presents PRISM, a training-free retrieval framework for long-horizon language agents that operates over graph-structured memory. It integrates four components—Hierarchical Bundle Search over typed relation paths, Query-Sensitive Edge Costing aligned to detected intent, Evidence Compression, and Adaptive Intent Routing—to formulate retrieval as min-cost selection over path templates. The central claim is that this yields substantially higher LLM-judge accuracy than same-protocol baselines on the LoCoMo benchmark while using an order-of-magnitude smaller context budget, occupying a new point on the accuracy-context-cost frontier without fine-tuning or changes to the upstream ingestion pipeline.

Significance. If the empirical results and underlying assumptions hold after proper validation, the work could meaningfully advance efficient memory management for agents by demonstrating a practical Pareto improvement that avoids both context expansion and heavy ingestion-time costs.

major comments (3)

[Abstract] The performance claim rests on Query-Sensitive Edge Costing and Adaptive Intent Routing operating 'without introducing substantial error,' yet the manuscript supplies no error rates for intent detection, no ablation on misclassification impact, and no sensitivity analysis; if intent error exceeds a few percent the min-cost selection would route incorrect bundles and the claimed frontier improvement would not hold.
[Description of the four components] The framework presupposes that an upstream pipeline already emits a correctly typed relation graph, but provides no validation, error statistics, or robustness checks on graph quality or typing accuracy; this assumption is load-bearing because incorrect edge types would invalidate the typed path templates used for Hierarchical Bundle Search.
[Experiments on LoCoMo] The abstract asserts superior LLM-judge accuracy at 10x smaller context but reports no implementation details for baselines, no statistical significance tests, no ablation results isolating each component, and no breakdown of intent-detection accuracy on the benchmark queries, rendering it impossible to verify whether the data support the central Pareto claim.
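The sensitivity analysis requested in the first comment could be set up by corrupting intent labels at a controlled rate and re-running the pipeline. The Python sketch below shows one way to do that. The intent names follow the four categories in the classifier prompt fragments on this page; `evaluate` is a hypothetical callable standing in for the full retrieval-and-answer pipeline, and the error rates are arbitrary.

```python
import random

INTENTS = ["temporal", "causal", "multi_hop", "entity_centric"]


def inject_intent_noise(labels, error_rate, seed=0):
    """Replace each label with a different intent with probability `error_rate`."""
    rng = random.Random(seed)
    noisy = []
    for label in labels:
        if rng.random() < error_rate:
            noisy.append(rng.choice([i for i in INTENTS if i != label]))
        else:
            noisy.append(label)
    return noisy


def sensitivity_curve(labels, evaluate, rates=(0.0, 0.05, 0.1, 0.2)):
    """Score the pipeline at several injected error rates.

    `evaluate` is a hypothetical callable that maps a list of intent
    labels to a scalar score such as LLM-judge accuracy.
    """
    return {rate: evaluate(inject_intent_noise(labels, rate)) for rate in rates}
```

Plotting the resulting scores against the error rate would show how quickly the accuracy advantage erodes, which is the evidence the comment asks for.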

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below, agreeing where additional validation is needed and outlining specific revisions to strengthen the empirical support for PRISM's claims without altering the core contributions.

Referee: [Abstract] The performance claim rests on Query-Sensitive Edge Costing and Adaptive Intent Routing operating 'without introducing substantial error,' yet the manuscript supplies no error rates for intent detection, no ablation on misclassification impact, and no sensitivity analysis; if intent error exceeds a few percent the min-cost selection would route incorrect bundles and the claimed frontier improvement would not hold.

Authors: We agree that explicit quantification of intent detection error and its downstream effects is necessary to fully substantiate the claims. The manuscript emphasizes end-to-end results and the design of Adaptive Intent Routing (which includes fallback mechanisms), but does not report per-query intent accuracy or sensitivity ablations. In revision we will add: (i) intent detection accuracy on LoCoMo queries, (ii) an ablation injecting controlled misclassification rates, and (iii) sensitivity plots showing accuracy-context trade-offs under varying error levels. These will be presented as new tables and figures.

revision: yes

Referee: [Description of the four components] The framework presupposes that an upstream pipeline already emits a correctly typed relation graph, but provides no validation, error statistics, or robustness checks on graph quality or typing accuracy; this assumption is load-bearing because incorrect edge types would invalidate the typed path templates used for Hierarchical Bundle Search.

Authors: The manuscript positions PRISM as a retrieval-time method that operates on any provided typed graph and explicitly states it requires no changes to upstream ingestion. We do not claim to solve or measure ingestion errors. To address the concern we will expand the component description with a dedicated robustness subsection discussing the impact of edge-type errors on path templates and, where possible, include a controlled experiment injecting synthetic typing noise to quantify degradation. This adds transparency without requiring new upstream pipelines.

revision: partial

Referee: [Experiments on LoCoMo] The abstract asserts superior LLM-judge accuracy at 10x smaller context but reports no implementation details for baselines, no statistical significance tests, no ablation results isolating each component, and no breakdown of intent-detection accuracy on the benchmark queries, rendering it impossible to verify whether the data support the central Pareto claim.

Authors: We acknowledge that the current experimental section, while reporting aggregate LLM-judge accuracy and context sizes, lacks the requested granularity. In the revised manuscript we will: (i) provide complete baseline implementation details and hyperparameters, (ii) include statistical significance tests (e.g., bootstrap confidence intervals or paired tests), (iii) present full ablations isolating Hierarchical Bundle Search, Query-Sensitive Edge Costing, Evidence Compression, and Adaptive Intent Routing, and (iv) add a breakdown of intent-detection accuracy per query category. These additions will directly enable verification of the Pareto improvement.

revision: yes
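The bootstrap confidence intervals mentioned in the last response could look like the following Python sketch. It is a generic paired percentile bootstrap, not a procedure taken from the paper: `correct_a` and `correct_b` are assumed to be per-query 0/1 judge outcomes for two systems on the same queries, and the resample count and seed are arbitrary defaults.

```python
import random


def paired_bootstrap_ci(correct_a, correct_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the accuracy difference A minus B.

    Both inputs are per-query 0/1 outcomes on the same queries, so each
    resample draws query indices once and applies them to both systems.
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("inputs must be non-empty and the same length")
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

If the interval for PRISM minus a baseline excludes zero, the accuracy gap is unlikely to be a resampling artifact. Pairing by query matters because both systems answer the same questions, so their errors are correlated.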

Circularity Check

0 steps flagged

No circularity: empirical benchmark results with no self-referential derivation

full rationale

The paper presents PRISM as a training-free framework whose central performance claim (higher LLM-judge accuracy at 10x smaller context on LoCoMo) is an observed experimental outcome, not a quantity derived from equations or fits. The four components are described at the level of inference-time procedures; no self-definitional loops, fitted-input predictions, or load-bearing self-citations appear. Assumptions about upstream graph memory and intent detection are stated as operating conditions rather than quantities the paper claims to derive or validate internally. The derivation chain therefore remains self-contained against external benchmarks and does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; it contains no equations, no explicit free parameters, no listed axioms, and no new postulated entities. The framework is described at the level of named components and high-level workflow.

pith-pipeline@v0.9.0 · 5774 in / 1235 out tokens · 61037 ms · 2026-05-25T06:22:00.681295+00:00 · methodology

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 12 internal anchors

[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[2] Anthropic. The Claude 3 model family: Opus, Sonnet, Haiku. Claude-3 Model Card, 1(1):4, 2024.
[3] Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready AI agents with scalable long-term memory. arXiv preprint arXiv:2504.19413, 2025.
[4] Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, and Jonathan Larson. From local to global: A graph RAG approach to query-focused summarization. arXiv preprint arXiv:2404.16130, 2024.
[5] Jizhan Fang, Xinle Deng, Haoming Xu, Ziyan Jiang, Yuqi Tang, Ziwen Xu, Shumin Deng, Yunzhi Yao, Mengru Wang, Shuofei Qiao, et al. LightMem: Lightweight and efficient memory-augmented generation. arXiv preprint arXiv:2510.18866, 2025.
[6] FlowElement-ai. M-flow. https://github.com/FlowElement-ai/m_flow, 2026. GitHub repository. Accessed: 2026-05-06.
[7] Dongming Jiang, Yi Li, Guanpeng Li, and Bingzhe Li. MAGMA: A multi-graph based agentic memory architecture for AI agents. arXiv preprint arXiv:2601.03236, 2026.
[8] Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. LLMLingua: Compressing prompts for accelerated inference of large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13358–13376, 2023.
[9] Jiazheng Kang, Mingming Ji, Zhe Zhao, and Ting Bai. Memory OS of AI agent. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 25972–25981, 2025.
[10] Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In EMNLP (1), pages 6769–6781, 2020.
[11] Omar Khattab and Matei Zaharia. ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 39–48, 2020.
[12] Jiaqi Liu, Yaofeng Su, Peng Xia, Siwei Han, Zeyu Zheng, Cihang Xie, Mingyu Ding, and Huaxiu Yao. SimpleMem: Efficient lifelong memory for LLM agents. arXiv preprint arXiv:2601.02553, 2026.
[13] Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024.
[14] Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long-term conversational memory of LLM agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13851–13870, 2024.
[15] Rodrigo Nogueira and Kyunghyun Cho. Passage re-ranking with BERT. arXiv preprint arXiv:1901.04085, 2019.
[16] Charles Packer, Vivian Fang, Shishir G Patil, Kevin Lin, Sarah Wooders, and Joseph E Gonzalez. MemGPT: Towards LLMs as operating systems. arXiv preprint arXiv:2310.08560, 2023.
[17] Preston Rasmussen, Pavlo Paliychuk, Travis Beauvais, Jack Ryan, and Daniel Chalef. Zep: A temporal knowledge graph architecture for agent memory. arXiv preprint arXiv:2501.13956, 2025.
[18] Stephen Robertson and Hugo Zaragoza. The probabilistic relevance framework: BM25 and beyond, volume 4. Now Publishers Inc, 2009.
[19] Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.
[20] Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. MuSiQue: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10:539–554, 2022.
[21] Zhongwei Wan, Hui Shen, Xin Wang, Che Liu, Zheda Mai, and Mi Zhang. MEDA: Dynamic KV cache allocation for efficient multimodal long-context inference. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 2485–2497, 2025.
[22] Zhongwei Wan, Xinjian Wu, Yu Zhang, Yi Xin, Chaofan Tao, Zhihong Zhu, Xin Wang, Siqi Luo, Jing Xiong, Longyue Wang, et al. D2O: Dynamic discriminative operations for efficient long-context inference of large language models. arXiv preprint arXiv:2406.13035, 2024.
[23] Zhongwei Wan, Ziang Wu, Che Liu, Jinfa Huang, Zhihong Zhu, Peng Jin, Longyue Wang, and Li Yuan. LOOK-M: Look-once optimization in KV cache for efficient multimodal long-context inference. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 4065–4078, 2024.
[24] Xixi Wu, Kuan Li, Yida Zhao, Liwen Zhang, Litu Ou, Huifeng Yin, Zhongwang Zhang, Xinmiao Yu, Dingchu Zhang, Yong Jiang, et al. ReSum: Unlocking long-horizon search intelligence via context summarization. arXiv preprint arXiv:2509.13313, 2025.
[25] Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-MEM: Agentic memory for LLM agents. arXiv preprint arXiv:2502.12110, 2025.
[26] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, 2018.
[27] Yi Yu, Liuyi Yao, Yuexiang Xie, Qingquan Tan, Jiaqi Feng, Yaliang Li, and Libing Wu. Agentic memory: Learning unified long-term and short-term memory management for large language model agents. arXiv preprint arXiv:2601.01885, 2026.
[28] Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. MemoryBank: Enhancing large language models with long-term memory. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 19724–19731, 2024.

A Limitations and Broader Impacts. Limitations. PRISM currently focuses on retrieval-side compression for LLM-based long-horizon convers...

[29] episode_summary - A concise but comprehensive summary of ALL events and facts mentioned in the chunk. Include specific details like names, dates, places, objects, and quantities
[30] entities - Each item must be: {"name": string, "entity_type": string} - entity_type should be one of: "person", "organization", "place", "concept", "event", "other". - Keep names as they appear in the text whenever possible. - Include specific items mentioned (books, foods, activities, pets, places visited, etc.) as entities with type "concept" or "other"
[31] facet_points - Each item must be: {"content": string, "related_entity_name": string or null, "timestamp_text": string or null} - content should be atomic and factual. - IMPORTANT: Be specific. Include concrete details like exact names, quantities, colors, and descriptions. Good: "Melanie made a cup in her pottery class" Bad: "Melanie does pottery" Good: ...
[32] facets - Each item must be: {"theme": string, "facet_point_indices": array of integers} - facet_point_indices refers to zero-based indices in the facet_points array
[33] temporal_info - Each item must be: {"subject": string, "time_expression": string, "normalized_time": string or null, "relation": string} - relation examples: "before", "after", "during", "at". - normalized_time should use ISO-8601 when explicit enough, otherwise null. - For relative time references (e.g., "yesterday", "last week"), use the conversation ti...
[34] Answer the question using the provided context. Be specific and cite concrete details from the context
[35] For time-related questions, follow these steps: Step 1: Find the conversation date from the header (e.g., [1:56 pm on 8 May, 2023] means the conversation date is 8 May 2023). Step 2: Identify the relative time expression (e.g., "yesterday", "last week", "last Saturday"). Step 3: Calculate the actual date. "yesterday" = conversation date minus 1 day. "last...
[36] When multiple events of the same type exist (e.g., multiple camping trips, multiple beach visits), distinguish between them using their dates
[37] Prefer quoting specific details (names, dates, objects, places) from the context over paraphrasing
[38] If the context contains partial but relevant information, provide the best answer you can
[39] Only say you cannot answer if the context truly contains NO relevant information at all. Answer: LLM-as-a-Judge Prompt. You are an evaluation judge. Compare the generated answer with the gold answer and determine if the generated answer is correct. Be lenient with format differences. For example: - "May 7th" and "7 May" are the same date -> CORRECT - "Cae...
[40] temporal -- The query asks about WHEN something happened, time ordering, duration, or sequence of events. Signals: "when", "before", "after", "during", "how long", "what year", explicit dates, or asking about the timing of events relative to each other
[41] causal -- The query asks WHY something happened, what caused it, or what led to an outcome. Signals: "why", "because", "what caused", "what led to", "as a result of", or asking about reasons, motivations, or consequences
[42] multi_hop -- The query requires combining facts from multiple separate events, interactions, or contexts to answer. A single-fact lookup is NOT multi_hop. Signals: "based on X and Y", "how does X relate to Y", "given that ... what ...", "combining these conversations", "across multiple sessions", asking about trends/patterns/shifts across time, or asking ...
[43] entity_centric -- The query asks about a specific attribute, description, or property of a person, place, or thing that can be looked up as a stored fact. Signals: "who is", "what does X look like", "where does X live", "what is X’s job", or asking to retrieve a single concrete fact about a named entity. NOTE: if answering requires inference or reasoning ...
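The date-resolution steps in fragment [35] are instructions for the answer model, but the same arithmetic can be written out as a Python sketch. Only the "yesterday" rule (conversation date minus 1 day) comes from the fragment; the "last week" offset of 7 days, the function names, and the assumption that headers always follow the bracketed format shown in the example are illustrative.

```python
import re
from datetime import datetime, timedelta

# Offsets in days. Only "yesterday" is spelled out in the prompt fragment;
# the "last week" value is an assumption added for illustration.
RELATIVE_OFFSETS = {"yesterday": 1, "last week": 7}


def conversation_date(header):
    """Step 1: parse a header such as '[1:56 pm on 8 May, 2023]' into a date."""
    stamp = header.strip("[]")
    return datetime.strptime(stamp, "%I:%M %p on %d %B, %Y").date()


def resolve_relative(header, text):
    """Steps 2 and 3: find a known relative expression and subtract its offset.

    Returns None when the text contains no expression from RELATIVE_OFFSETS.
    """
    base = conversation_date(header)
    lowered = text.lower()
    for phrase, days in RELATIVE_OFFSETS.items():
        if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
            return base - timedelta(days=days)
    return None
```

For the header in the fragment, "yesterday" resolves to 7 May 2023. Expressions such as "last Saturday" need weekday arithmetic and are left out because the fragment is truncated before it defines them.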

[1] [1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

The claude 3 model family: Opus, sonnet, haiku.Claude-3 Model Card, 1(1):4, 2024

AI Anthropic. The claude 3 model family: Opus, sonnet, haiku.Claude-3 Model Card, 1(1):4, 2024

work page 2024

[3] [3]

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready ai agents with scalable long-term memory.arXiv preprint arXiv:2504.19413, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, and Jonathan Larson. From local to global: A graph RAG approach to query-focused summarization. arXiv preprint arXiv:2404.16130, 2024.

[5] Jizhan Fang, Xinle Deng, Haoming Xu, Ziyan Jiang, Yuqi Tang, Ziwen Xu, Shumin Deng, Yunzhi Yao, Mengru Wang, Shuofei Qiao, et al. LightMem: Lightweight and efficient memory-augmented generation. arXiv preprint arXiv:2510.18866, 2025.

[6] FlowElement-ai. M-flow. https://github.com/FlowElement-ai/m_flow, 2026. GitHub repository. Accessed: 2026-05-06.

[7] Dongming Jiang, Yi Li, Guanpeng Li, and Bingzhe Li. MAGMA: A multi-graph based agentic memory architecture for AI agents. arXiv preprint arXiv:2601.03236, 2026.

[8] Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. LLMLingua: Compressing prompts for accelerated inference of large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13358–13376, 2023.

[9] Jiazheng Kang, Mingming Ji, Zhe Zhao, and Ting Bai. Memory OS of AI agent. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 25972–25981, 2025.

[10] Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In EMNLP (1), pages 6769–6781, 2020.

[11] Omar Khattab and Matei Zaharia. ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 39–48, 2020.

[12] Jiaqi Liu, Yaofeng Su, Peng Xia, Siwei Han, Zeyu Zheng, Cihang Xie, Mingyu Ding, and Huaxiu Yao. SimpleMem: Efficient lifelong memory for LLM agents. arXiv preprint arXiv:2601.02553, 2026.

[13] Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024.

[14] Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long-term conversational memory of LLM agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13851–13870, 2024.

[15] Rodrigo Nogueira and Kyunghyun Cho. Passage re-ranking with BERT. arXiv preprint arXiv:1901.04085, 2019.

[16] Charles Packer, Vivian Fang, Shishir G Patil, Kevin Lin, Sarah Wooders, and Joseph E Gonzalez. MemGPT: Towards LLMs as operating systems. arXiv preprint arXiv:2310.08560, 2023.

[17] Preston Rasmussen, Pavlo Paliychuk, Travis Beauvais, Jack Ryan, and Daniel Chalef. Zep: A temporal knowledge graph architecture for agent memory. arXiv preprint arXiv:2501.13956, 2025.

[18] Stephen Robertson and Hugo Zaragoza. The probabilistic relevance framework: BM25 and beyond, volume 4. Now Publishers Inc, 2009.

[19] Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.

[20] Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. MuSiQue: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10:539–554, 2022.

[21] Zhongwei Wan, Hui Shen, Xin Wang, Che Liu, Zheda Mai, and Mi Zhang. MEDA: Dynamic KV cache allocation for efficient multimodal long-context inference. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 2485–2497, 2025.

[22] Zhongwei Wan, Xinjian Wu, Yu Zhang, Yi Xin, Chaofan Tao, Zhihong Zhu, Xin Wang, Siqi Luo, Jing Xiong, Longyue Wang, et al. D2O: Dynamic discriminative operations for efficient long-context inference of large language models. arXiv preprint arXiv:2406.13035, 2024.

[23] Zhongwei Wan, Ziang Wu, Che Liu, Jinfa Huang, Zhihong Zhu, Peng Jin, Longyue Wang, and Li Yuan. LOOK-M: Look-once optimization in KV cache for efficient multimodal long-context inference. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 4065–4078, 2024.

[24] Xixi Wu, Kuan Li, Yida Zhao, Liwen Zhang, Litu Ou, Huifeng Yin, Zhongwang Zhang, Xinmiao Yu, Dingchu Zhang, Yong Jiang, et al. ReSum: Unlocking long-horizon search intelligence via context summarization. arXiv preprint arXiv:2509.13313, 2025.

[25] Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-MEM: Agentic memory for LLM agents. arXiv preprint arXiv:2502.12110, 2025.

[26] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, 2018.

[27] Yi Yu, Liuyi Yao, Yuexiang Xie, Qingquan Tan, Jiaqi Feng, Yaliang Li, and Libing Wu. Agentic memory: Learning unified long-term and short-term memory management for large language model agents. arXiv preprint arXiv:2601.01885, 2026.

[28] Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. MemoryBank: Enhancing large language models with long-term memory. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 19724–19731, 2024.

A Limitations and Broader Impacts

Limitations. PRISM currently focuses on retrieval-side compression for LLM-based long-horizon convers...

episode_summary - A concise but comprehensive summary of ALL events and facts mentioned in the chunk. Include specific details like names, dates, places, objects, and quantities.

entities - Each item must be: {"name": string, "entity_type": string}
- entity_type should be one of: "person", "organization", "place", "concept", "event", "other".
- Keep names as they appear in the text whenever possible.
- Include specific items mentioned (books, foods, activities, pets, places visited, etc.) as entities with type "concept" or "other".

facet_points - Each item must be: {"content": string, "related_entity_name": string or null, "timestamp_text": string or null}
- content should be atomic and factual.
- IMPORTANT: Be specific. Include concrete details like exact names, quantities, colors, and descriptions.
Good: "Melanie made a cup in her pottery class"
Bad: "Melanie does pottery"
Good: ...

facets - Each item must be: {"theme": string, "facet_point_indices": array of integers}
- facet_point_indices refers to zero-based indices in the facet_points array.

temporal_info - Each item must be: {"subject": string, "time_expression": string, "normalized_time": string or null, "relation": string}
- relation examples: "before", "after", "during", "at".
- normalized_time should use ISO-8601 when explicit enough, otherwise null.
- For relative time references (e.g., "yesterday", "last week"), use the conversation ti...
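
The field rules above (episode_summary, entities, facet_points, facets, temporal_info) can be checked mechanically before an extraction is written to memory. The following is a minimal sketch, not the authors' ingestion code: it assumes the five fields arrive as top-level keys of a single JSON object, and the names `validate_extraction` and `_is_opt_str` are hypothetical helpers introduced here for illustration.

```python
from typing import Any

# Allowed values copied from the entity_type rule in the prompt.
ENTITY_TYPES = {"person", "organization", "place", "concept", "event", "other"}


def _is_opt_str(value: Any) -> bool:
    """True for a string or None (the prompt's "string or null")."""
    return value is None or isinstance(value, str)


def validate_extraction(doc: dict[str, Any]) -> list[str]:
    """Return human-readable schema violations; an empty list means valid.

    Assumption: the five prompt fields are top-level keys of one object.
    """
    errors: list[str] = []
    if not isinstance(doc.get("episode_summary"), str):
        errors.append("episode_summary must be a string")

    for i, ent in enumerate(doc.get("entities", [])):
        if not isinstance(ent.get("name"), str):
            errors.append(f"entities[{i}].name must be a string")
        if ent.get("entity_type") not in ENTITY_TYPES:
            errors.append(f"entities[{i}].entity_type is not an allowed value")

    points = doc.get("facet_points", [])
    for i, point in enumerate(points):
        if not isinstance(point.get("content"), str):
            errors.append(f"facet_points[{i}].content must be a string")
        for key in ("related_entity_name", "timestamp_text"):
            if not _is_opt_str(point.get(key)):
                errors.append(f"facet_points[{i}].{key} must be string or null")

    for i, facet in enumerate(doc.get("facets", [])):
        if not isinstance(facet.get("theme"), str):
            errors.append(f"facets[{i}].theme must be a string")
        # Indices are zero-based positions in the facet_points array.
        for idx in facet.get("facet_point_indices", []):
            if type(idx) is not int or not 0 <= idx < len(points):
                errors.append(f"facets[{i}] has out-of-range index {idx!r}")

    for i, item in enumerate(doc.get("temporal_info", [])):
        for key in ("subject", "time_expression", "relation"):
            if not isinstance(item.get(key), str):
                errors.append(f"temporal_info[{i}].{key} must be a string")
        if not _is_opt_str(item.get("normalized_time")):
            errors.append(f"temporal_info[{i}].normalized_time must be string or null")
    return errors
```

Returning a list of violations rather than raising on the first one lets a caller log every problem in a malformed extraction at once; whether to retry the extraction or discard the chunk is left to the caller.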

Answer the question using the provided context. Be specific and cite concrete details from the context.

For time-related questions, follow these steps:
Step 1: Find the conversation date from the header (e.g., [1:56 pm on 8 May, 2023] means the conversation date is 8 May 2023).
Step 2: Identify the relative time expression (e.g., "yesterday", "last week", "last Saturday").
Step 3: Calculate the actual date. "yesterday" = conversation date minus 1 day. "last...

When multiple events of the same type exist (e.g., multiple camping trips, multiple beach visits), distinguish between them using their dates.

Prefer quoting specific details (names, dates, objects, places) from the context over paraphrasing.

If the context contains partial but relevant information, provide the best answer you can.

Only say you cannot answer if the context truly contains NO relevant information at all.

Answer:

LLM-as-a-Judge Prompt.

You are an evaluation judge. Compare the generated answer with the gold answer and determine if the generated answer is correct. Be lenient with format differences. For example:
- "May 7th" and "7 May" are the same date -> CORRECT
- "Cae...
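
The three-step date procedure in the answer prompt above (find the conversation date, identify the relative expression, compute the actual date) can be illustrated in code. This is a minimal sketch under stated assumptions: the header is assumed to follow the format of the prompt's own example, `[1:56 pm on 8 May, 2023]`, with an English full month name; the function names `conversation_date` and `resolve_relative` are hypothetical; and only the "yesterday" rule is implemented, because the remaining rules are truncated in the text above.

```python
import re
from datetime import date, datetime, timedelta

# Assumed header shape, taken from the prompt's example: "[<time> on <day> <Month>, <year>]".
HEADER_RE = re.compile(r"\[[^\]]*?\bon\s+(\d{1,2}\s+[A-Za-z]+),\s*(\d{4})\s*\]")


def conversation_date(header: str) -> date:
    """Step 1: extract the conversation date from a bracketed header."""
    match = HEADER_RE.search(header)
    if match is None:
        raise ValueError(f"no conversation date found in {header!r}")
    day_month, year = match.groups()
    return datetime.strptime(f"{day_month} {year}", "%d %B %Y").date()


def resolve_relative(expression: str, conv_date: date) -> date:
    """Steps 2 and 3: map a relative expression to an absolute date.

    Only the rule stated in full in the prompt is covered here.
    """
    if expression.strip().lower() == "yesterday":
        return conv_date - timedelta(days=1)
    raise NotImplementedError(f"rule for {expression!r} is not given in the text")


conv = conversation_date("[1:56 pm on 8 May, 2023]")
print(conv.isoformat())                                 # 2023-05-08
print(resolve_relative("yesterday", conv).isoformat())  # 2023-05-07
```

In the prompt these steps are carried out by the answering model itself; a deterministic helper like this one would only be useful as a cross-check on model output or for normalizing timestamps at ingestion time.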

temporal -- The query asks about WHEN something happened, time ordering, duration, or sequence of events. Signals: "when", "before", "after", "during", "how long", "what year", explicit dates, or asking about the timing of events relative to each other.

causal -- The query asks WHY something happened, what caused it, or what led to an outcome. Signals: "why", "because", "what caused", "what led to", "as a result of", or asking about reasons, motivations, or consequences.

multi_hop -- The query requires combining facts from multiple separate events, interactions, or contexts to answer. A single-fact lookup is NOT multi_hop. Signals: "based on X and Y", "how does X relate to Y", "given that ... what ...", "combining these conversations", "across multiple sessions", asking about trends/patterns/shifts across time, or asking ...

entity_centric -- The query asks about a specific attribute, description, or property of a person, place, or thing that can be looked up as a stored fact. Signals: "who is", "what does X look like", "where does X live", "what is X’s job", or asking to retrieve a single concrete fact about a named entity. NOTE: if answering requires inference or reasoning ...
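
The surface signals listed for the four intents lend themselves to a cheap lexical check. The sketch below is a hypothetical illustration, not the paper's classifier: the prompt above is written for a language model, and the regular expressions, the `SIGNALS` table, the function names, and the tie-breaking rule (first intent in table order wins) are all assumptions made here. It covers only the quoted phrases, not the descriptive criteria such as "explicit dates" or the truncated NOTE clauses.

```python
import re
from typing import Optional

# Quoted signal phrases from the intent definitions, turned into regexes.
# "X" and "Y" placeholders become ".+" gaps; this encoding is an assumption.
SIGNALS: dict[str, list[str]] = {
    "temporal": [r"\bwhen\b", r"\bbefore\b", r"\bafter\b", r"\bduring\b",
                 r"\bhow long\b", r"\bwhat year\b"],
    "causal": [r"\bwhy\b", r"\bbecause\b", r"\bwhat caused\b",
               r"\bwhat led to\b", r"\bas a result of\b"],
    "multi_hop": [r"\bbased on\b.+\band\b", r"\bhow does\b.+\brelate to\b",
                  r"\bgiven that\b.+\bwhat\b", r"\bcombining these conversations\b",
                  r"\bacross multiple sessions\b"],
    "entity_centric": [r"\bwho is\b", r"\bwhat does\b.+\blook like\b",
                       r"\bwhere does\b.+\blive\b", r"\bwhat is\b.+\bjob\b"],
}


def signal_counts(query: str) -> dict[str, int]:
    """Count how many signal patterns of each intent occur in the query."""
    text = query.lower()
    return {
        intent: sum(1 for pattern in patterns if re.search(pattern, text))
        for intent, patterns in SIGNALS.items()
    }


def guess_intent(query: str) -> Optional[str]:
    """Return the intent with the most matched signals, or None if none match.

    Ties go to the intent listed first in SIGNALS (an arbitrary choice).
    """
    counts = signal_counts(query)
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None


print(guess_intent("When did Melanie go camping?"))  # temporal
print(guess_intent("Where does Melanie live?"))      # entity_centric
print(guess_intent("Tell me something nice."))       # None
```

A `None` result marks a query with no lexical signal, which is the natural point to fall back to a model-based classifier using the full definitions above.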