PRISM: Pareto-Efficient Retrieval over Intent-Aware Structured Memory for Long-Horizon Agents
Pith reviewed 2026-05-25 06:22 UTC · model grok-4.3
The pith
PRISM retrieves the right facts from long conversation graphs using an order of magnitude less context while raising answer accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PRISM treats retrieval as min-cost selection over typed path templates in a graph-structured memory and pairs it with an LLM-side compression step. The framework runs four inference-time components (Hierarchical Bundle Search over typed relation paths, Query-Sensitive Edge Costing that aligns traversal with detected query intent, Evidence Compression that shrinks the candidate bundle into compact answer-side context, and Adaptive Intent Routing that sends most queries through zero-LLM tiers) without any fine-tuning or changes to the upstream ingestion pipeline. On the LoCoMo benchmark this produces substantially higher LLM-judge accuracy than every same-protocol baseline at an order-of-magnitude smaller context budget.
What carries the argument
Min-cost selection over typed path templates paired with LLM-side evidence compression
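To make the idea of intent-dependent traversal costs concrete, here is a minimal sketch, not the paper's algorithm. It uses plain Dijkstra search as a stand-in for min-cost selection, and every name in it (`GRAPH`, `INTENT_COSTS`, `DEFAULT_COST`, `min_cost_reach`, the relation types, and the cost values) is a hypothetical chosen for illustration; the summary above does not specify how PRISM scores edges or defines its path templates.

```python
import heapq
from typing import Dict, List, Tuple

# Hypothetical typed memory graph: node -> list of (neighbor, relation_type).
GRAPH: Dict[str, List[Tuple[str, str]]] = {
    "entity:A": [("event:E1", "mentions")],
    "event:E1": [("time:T1", "happened_at"), ("event:E0", "caused_by")],
}

# Hypothetical cost table: relations that match the intent are cheap to follow.
INTENT_COSTS: Dict[str, Dict[str, float]] = {
    "temporal": {"mentions": 1.0, "happened_at": 0.2},
    "causal": {"mentions": 1.0, "caused_by": 0.2},
}
DEFAULT_COST = 2.0  # assumed cost for relations the intent does not favor


def edge_cost(relation: str, intent: str) -> float:
    """Look up the traversal cost of a relation type under a query intent."""
    return INTENT_COSTS.get(intent, {}).get(relation, DEFAULT_COST)


def min_cost_reach(
    graph: Dict[str, List[Tuple[str, str]]],
    source: str,
    intent: str,
    budget: float,
) -> Dict[str, float]:
    """Dijkstra from source; keep only nodes whose cheapest path fits the budget."""
    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if cost > best.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, relation in graph.get(node, []):
            new_cost = cost + edge_cost(relation, intent)
            if new_cost <= budget and new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                heapq.heappush(heap, (new_cost, neighbor))
    return best


if __name__ == "__main__":
    print(sorted(min_cost_reach(GRAPH, "entity:A", "temporal", budget=1.5)))
    print(sorted(min_cost_reach(GRAPH, "entity:A", "causal", budget=1.5)))
```

In this toy setup the same start node and budget surface the time node under a temporal intent and the cause node under a causal intent, which is the behavior the core claim attributes to query-sensitive costing.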
If this is right
- Agents can sustain longer histories without a proportional rise in token cost.
- Retrieval quality remains high even when the allowed context size is tightly limited.
- No changes are required to existing memory-ingestion pipelines or to the underlying language model.
- The majority of queries can be answered without invoking the full language model at all.
Where Pith is reading between the lines
- If intent detection proves brittle on new domains, the accuracy advantage could shrink unless paired with more robust intent classifiers.
- The same min-cost path formulation might extend to other structured memory representations such as trees or hypergraphs if typed relations can still be defined.
- Combining PRISM with improved upstream graph construction could push the required context size even lower while preserving the accuracy lift.
Load-bearing premise
The framework assumes that an upstream pipeline already supplies a graph with typed relation paths and that query intent can be detected reliably enough to set edge costs and choose routes without adding substantial error or extra cost.
What would settle it
Measuring LLM-judge accuracy on the LoCoMo benchmark while restricting PRISM to one-tenth the context budget used by baselines and finding no accuracy gain would falsify the central performance claim.
Original abstract
Long-horizon language agents accumulate conversation history far faster than any fixed context window can hold, making memory management critical to both answer accuracy and serving cost. Existing approaches either expand the context window without addressing what is retrieved, perform heavy ingestion-time fact extraction at substantial token cost, or rely on heuristic graph traversal that leaves both accuracy and efficiency on the table. We present PRISM, a training-free retrieval-side framework that treats long-horizon memory as a joint retrieval-and-compression problem over a graph-structured memory. PRISM combines four orthogonal inference-time components: Hierarchical Bundle Search over typed relation paths, Query-Sensitive Edge Costing that aligns traversal with detected query intent, Evidence Compression that compresses the candidate bundle into a compact answer-side context, and Adaptive Intent Routing that routes most queries through zero-LLM tiers. By formulating retrieval as min-cost selection over typed path templates and pairing it with an LLM-side compression step, PRISM surfaces the right evidence under a strict context budget without any fine-tuning or modification to the upstream ingestion pipeline. Experiments on the LoCoMo benchmark show that PRISM delivers substantially higher LLM-judge accuracy than every same-protocol baseline at an order-of-magnitude smaller context budget, occupying a previously empty corner of the accuracy-context-cost frontier and demonstrating a superior balance between answer quality and retrieval efficiency.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents PRISM, a training-free retrieval framework for long-horizon language agents that operates over graph-structured memory. It integrates four components—Hierarchical Bundle Search over typed relation paths, Query-Sensitive Edge Costing aligned to detected intent, Evidence Compression, and Adaptive Intent Routing—to formulate retrieval as min-cost selection over path templates. The central claim is that this yields substantially higher LLM-judge accuracy than same-protocol baselines on the LoCoMo benchmark while using an order-of-magnitude smaller context budget, occupying a new point on the accuracy-context-cost frontier without fine-tuning or changes to the upstream ingestion pipeline.
Significance. If the empirical results and underlying assumptions hold after proper validation, the work could meaningfully advance efficient memory management for agents by demonstrating a practical Pareto improvement that avoids both context expansion and heavy ingestion-time costs.
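As an illustration of what a "Pareto improvement" on the accuracy-context frontier means operationally, the sketch below checks dominance between systems described by accuracy (higher is better) and context tokens (lower is better). The `Point` type, the function names, and all numbers are placeholders invented for this example; they are not results from the paper.

```python
from typing import List, NamedTuple


class Point(NamedTuple):
    name: str
    accuracy: float       # higher is better
    context_tokens: int   # lower is better


def dominates(a: Point, b: Point) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.context_tokens <= b.context_tokens
    strictly_better = a.accuracy > b.accuracy or a.context_tokens < b.context_tokens
    return no_worse and strictly_better


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the points that no other point dominates, in input order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Placeholder values for illustration only.
SYSTEMS = [
    Point("system_a", 0.60, 20000),
    Point("system_b", 0.70, 2000),
    Point("system_c", 0.65, 5000),
    Point("system_d", 0.80, 30000),
]

if __name__ == "__main__":
    for point in pareto_front(SYSTEMS):
        print(point)
```

A claim of occupying a previously empty corner of the frontier corresponds to a point like `system_b` here: it dominates the lower-accuracy, higher-context alternatives rather than merely trading one axis for the other.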
major comments (3)
- [Abstract] The performance claim rests on Query-Sensitive Edge Costing and Adaptive Intent Routing operating 'without introducing substantial error,' yet the manuscript supplies no error rates for intent detection, no ablation on misclassification impact, and no sensitivity analysis; if intent error exceeds a few percent the min-cost selection would route incorrect bundles and the claimed frontier improvement would not hold.
- [Description of the four components] The framework presupposes that an upstream pipeline already emits a correctly typed relation graph, but provides no validation, error statistics, or robustness checks on graph quality or typing accuracy; this assumption is load-bearing because incorrect edge types would invalidate the typed path templates used for Hierarchical Bundle Search.
- [Experiments on LoCoMo] The abstract asserts superior LLM-judge accuracy at 10x smaller context but reports no implementation details for baselines, no statistical significance tests, no ablation results isolating each component, and no breakdown of intent-detection accuracy on the benchmark queries, rendering it impossible to verify whether the data support the central Pareto claim.
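The misclassification ablation the referee asks for could be set up along the following lines. This is a sketch under assumptions: the four intent labels are the ones that appear in the intent-classification prompt excerpt further down this page (the full label set may differ), and `inject_intent_noise` is a hypothetical helper, not part of PRISM.

```python
import random
from typing import List

# Labels taken from the intent-classification prompt excerpt; the complete
# label set used by the paper is an assumption.
INTENTS = ["temporal", "causal", "multi_hop", "entity_centric"]


def inject_intent_noise(labels: List[str], error_rate: float, seed: int = 0) -> List[str]:
    """Flip each label to a different intent with probability error_rate."""
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be in [0, 1]")
    rng = random.Random(seed)  # seeded so the ablation is reproducible
    noisy = []
    for label in labels:
        if rng.random() < error_rate:
            alternatives = [intent for intent in INTENTS if intent != label]
            noisy.append(rng.choice(alternatives))
        else:
            noisy.append(label)
    return noisy


if __name__ == "__main__":
    clean = ["temporal", "causal", "multi_hop", "entity_centric"] * 5
    for rate in (0.0, 0.05, 0.2):
        noisy = inject_intent_noise(clean, rate, seed=1)
        flipped = sum(a != b for a, b in zip(clean, noisy))
        print(rate, flipped)
```

Running retrieval once per noise level with the corrupted labels in place of the detector's output would give the sensitivity curve the comment describes.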
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. We address each major point below, agreeing where additional validation is needed and outlining specific revisions to strengthen the empirical support for PRISM's claims without altering the core contributions.
- Referee: [Abstract] The performance claim rests on Query-Sensitive Edge Costing and Adaptive Intent Routing operating 'without introducing substantial error,' yet the manuscript supplies no error rates for intent detection, no ablation on misclassification impact, and no sensitivity analysis; if intent error exceeds a few percent the min-cost selection would route incorrect bundles and the claimed frontier improvement would not hold.
  Authors: We agree that explicit quantification of intent detection error and its downstream effects is necessary to fully substantiate the claims. The manuscript emphasizes end-to-end results and the design of Adaptive Intent Routing (which includes fallback mechanisms), but does not report per-query intent accuracy or sensitivity ablations. In revision we will add: (i) intent detection accuracy on LoCoMo queries, (ii) an ablation injecting controlled misclassification rates, and (iii) sensitivity plots showing accuracy-context trade-offs under varying error levels. These will be presented as new tables and figures.
  Revision: yes
- Referee: [Description of the four components] The framework presupposes that an upstream pipeline already emits a correctly typed relation graph, but provides no validation, error statistics, or robustness checks on graph quality or typing accuracy; this assumption is load-bearing because incorrect edge types would invalidate the typed path templates used for Hierarchical Bundle Search.
  Authors: The manuscript positions PRISM as a retrieval-time method that operates on any provided typed graph and explicitly states it requires no changes to upstream ingestion. We do not claim to solve or measure ingestion errors. To address the concern we will expand the component description with a dedicated robustness subsection discussing the impact of edge-type errors on path templates and, where possible, include a controlled experiment injecting synthetic typing noise to quantify degradation. This adds transparency without requiring new upstream pipelines.
  Revision: partial
- Referee: [Experiments on LoCoMo] The abstract asserts superior LLM-judge accuracy at 10x smaller context but reports no implementation details for baselines, no statistical significance tests, no ablation results isolating each component, and no breakdown of intent-detection accuracy on the benchmark queries, rendering it impossible to verify whether the data support the central Pareto claim.
  Authors: We acknowledge that the current experimental section, while reporting aggregate LLM-judge accuracy and context sizes, lacks the requested granularity. In the revised manuscript we will: (i) provide complete baseline implementation details and hyperparameters, (ii) include statistical significance tests (e.g., bootstrap confidence intervals or paired tests), (iii) present full ablations isolating Hierarchical Bundle Search, Query-Sensitive Edge Costing, Evidence Compression, and Adaptive Intent Routing, and (iv) add a breakdown of intent-detection accuracy per query category. These additions will directly enable verification of the Pareto improvement.
  Revision: yes
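One way the promised bootstrap confidence intervals might be computed is a paired percentile bootstrap over per-query judge verdicts. The sketch below assumes each system's results are available as a list of 0/1 correctness flags aligned by query; the function name and defaults are illustrative choices, not the authors' procedure.

```python
import random
from typing import List, Tuple


def paired_bootstrap_ci(
    correct_a: List[int],
    correct_b: List[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI for accuracy(A) - accuracy(B) on the same queries."""
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("need two non-empty, equally long 0/1 lists")
    rng = random.Random(seed)
    n = len(correct_a)
    # Pairing: resample queries, keeping each query's A and B verdicts together.
    per_query_diff = [a - b for a, b in zip(correct_a, correct_b)]
    means = []
    for _ in range(n_resamples):
        sample = [per_query_diff[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


if __name__ == "__main__":
    # Made-up verdicts for illustration only.
    system_a = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
    system_b = [1, 0, 0, 1, 0, 0, 1, 0, 1, 0]
    print(paired_bootstrap_ci(system_a, system_b))
```

An interval that excludes zero would support the claim that one system is more accurate on the benchmark; resampling by query rather than by system keeps the comparison paired.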
Circularity Check
No circularity: empirical benchmark results with no self-referential derivation
The paper presents PRISM as a training-free framework whose central performance claim (higher LLM-judge accuracy at 10x smaller context on LoCoMo) is an observed experimental outcome, not a quantity derived from equations or fits. The four components are described at the level of inference-time procedures; no self-definitional loops, fitted-input predictions, or load-bearing self-citations appear. Assumptions about upstream graph memory and intent detection are stated as operating conditions rather than quantities the paper claims to derive or validate internally. The derivation chain therefore remains self-contained against external benchmarks and does not reduce to its inputs by construction.
A Limitations and Broader Impacts
Limitations. PRISM currently focuses on retrieval-side compression for LLM-based long-horizon convers...
- episode_summary: A concise but comprehensive summary of ALL events and facts mentioned in the chunk. Include specific details like names, dates, places, objects, and quantities.
- entities:
  - Each item must be: {"name": string, "entity_type": string}
  - entity_type should be one of: "person", "organization", "place", "concept", "event", "other".
  - Keep names as they appear in the text whenever possible.
  - Include specific items mentioned (books, foods, activities, pets, places visited, etc.) as entities with type "concept" or "other".
- facet_points:
  - Each item must be: {"content": string, "related_entity_name": string or null, "timestamp_text": string or null}
  - content should be atomic and factual.
  - IMPORTANT: Be specific. Include concrete details like exact names, quantities, colors, and descriptions.
    Good: "Melanie made a cup in her pottery class"
    Bad: "Melanie does pottery"
    Good: ...
- facets:
  - Each item must be: {"theme": string, "facet_point_indices": array of integers}
  - facet_point_indices refers to zero-based indices in the facet_points array.
- temporal_info:
  - Each item must be: {"subject": string, "time_expression": string, "normalized_time": string or null, "relation": string}
  - relation examples: "before", "after", "during", "at".
  - normalized_time should use ISO-8601 when explicit enough, otherwise null.
  - For relative time references (e.g., "yesterday", "last week"), use the conversation ti...
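The zero-based index constraint between facets and facet_points in the schema above is easy to violate in LLM output, so a consumer may want to check it. The sketch below is one such check; the helper name `invalid_facet_indices`, the example record, and the "pottery" theme are illustrative assumptions, while the field names come from the prompt excerpt.

```python
from typing import Any, Dict, List, Tuple


def invalid_facet_indices(record: Dict[str, Any]) -> List[Tuple[int, Any]]:
    """Return (facet_position, bad_index) pairs that do not point into facet_points."""
    n_points = len(record.get("facet_points", []))
    problems = []
    for facet_pos, facet in enumerate(record.get("facets", [])):
        for idx in facet.get("facet_point_indices", []):
            # bool is a subclass of int in Python, so reject it explicitly.
            is_int = isinstance(idx, int) and not isinstance(idx, bool)
            if not is_int or not 0 <= idx < n_points:
                problems.append((facet_pos, idx))
    return problems


# Example record; the facet point text is the "Good" example from the prompt.
RECORD = {
    "facet_points": [
        {
            "content": "Melanie made a cup in her pottery class",
            "related_entity_name": "Melanie",
            "timestamp_text": None,
        }
    ],
    "facets": [{"theme": "pottery", "facet_point_indices": [0]}],
}

if __name__ == "__main__":
    print(invalid_facet_indices(RECORD))
```

A non-empty result signals an extraction record that should be repaired or dropped before it reaches the graph.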
- Answer the question using the provided context. Be specific and cite concrete details from the context.
- For time-related questions, follow these steps:
  Step 1: Find the conversation date from the header (e.g., [1:56 pm on 8 May, 2023] means the conversation date is 8 May 2023).
  Step 2: Identify the relative time expression (e.g., "yesterday", "last week", "last Saturday").
  Step 3: Calculate the actual date. "yesterday" = conversation date minus 1 day. "last...
- When multiple events of the same type exist (e.g., multiple camping trips, multiple beach visits), distinguish between them using their dates.
- Prefer quoting specific details (names, dates, objects, places) from the context over paraphrasing.
- If the context contains partial but relevant information, provide the best answer you can.
- Only say you cannot answer if the context truly contains NO relevant information at all.
Answer:
LLM-as-a-Judge Prompt. You are an evaluation judge. Compare the generated answer with the gold answer and determine if the generated answer is correct. Be lenient with format differences. For example:
- "May 7th" and "7 May" are the same date -> CORRECT
- "Cae...
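The date arithmetic that the answer prompt's time-related steps ask the model to perform can be sketched deterministically. This assumes headers always follow the single format shown in the prompt's example and an English locale for month names; only the "yesterday" rule is stated in the excerpt, so every other expression is deliberately left unresolved. The function names are hypothetical.

```python
from datetime import date, datetime, timedelta
from typing import Optional

# Assumed header layout, based on the single example "[1:56 pm on 8 May, 2023]".
HEADER_FORMAT = "%I:%M %p on %d %B, %Y"

# Only the rule stated in the excerpt; other expressions are not guessed.
RELATIVE_OFFSETS = {"yesterday": timedelta(days=1)}


def conversation_date(header: str) -> date:
    """Step 1: extract the conversation date from a bracketed header."""
    return datetime.strptime(header.strip().strip("[]"), HEADER_FORMAT).date()


def resolve_relative(header: str, expression: str) -> Optional[date]:
    """Steps 2-3: map a relative expression to a date, or None if unsupported."""
    offset = RELATIVE_OFFSETS.get(expression.strip().lower())
    if offset is None:
        return None
    return conversation_date(header) - offset


if __name__ == "__main__":
    header = "[1:56 pm on 8 May, 2023]"
    print(conversation_date(header))
    print(resolve_relative(header, "yesterday"))
```

Returning `None` for unsupported expressions leaves room for the LLM to handle them, rather than silently producing a wrong date.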
- temporal -- The query asks about WHEN something happened, time ordering, duration, or sequence of events. Signals: "when", "before", "after", "during", "how long", "what year", explicit dates, or asking about the timing of events relative to each other.
- causal -- The query asks WHY something happened, what caused it, or what led to an outcome. Signals: "why", "because", "what caused", "what led to", "as a result of", or asking about reasons, motivations, or consequences.
- multi_hop -- The query requires combining facts from multiple separate events, interactions, or contexts to answer. A single-fact lookup is NOT multi_hop. Signals: "based on X and Y", "how does X relate to Y", "given that ... what ...", "combining these conversations", "across multiple sessions", asking about trends/patterns/shifts across time, or asking ...
- entity_centric -- The query asks about a specific attribute, description, or property of a person, place, or thing that can be looked up as a stored fact. Signals: "who is", "what does X look like", "where does X live", "what is X’s job", or asking to retrieve a single concrete fact about a named entity. NOTE: if answering requires inference or reasoning ...
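The signal phrases listed in the categories above suggest how a cheap, zero-LLM first tier might route obvious queries before falling back to an LLM classifier. The sketch below is an assumption about how such a tier could look, not the paper's router: the phrase lists are a simplified subset of the quoted signals (templated signals such as "where does X live" are reduced to their fixed prefix), and the priority order is invented.

```python
import re
from typing import Optional

# Simplified subset of the signal phrases quoted in the prompt excerpt.
SIGNALS = {
    "multi_hop": ["combining these conversations", "across multiple sessions"],
    "causal": ["why", "because", "what caused", "what led to", "as a result of"],
    "temporal": ["when", "before", "after", "during", "how long", "what year"],
    "entity_centric": ["who is", "where does"],
}

# Assumed priority: earlier intents win when several signals match.
PRIORITY = ["multi_hop", "causal", "temporal", "entity_centric"]


def route_intent(query: str) -> Optional[str]:
    """Return an intent if a signal phrase matches, else None (defer to the LLM)."""
    text = query.lower()
    for intent in PRIORITY:
        for phrase in SIGNALS[intent]:
            # Word boundaries prevent "when" from matching inside "whenever".
            if re.search(r"\b" + re.escape(phrase) + r"\b", text):
                return intent
    return None


if __name__ == "__main__":
    for q in ["When did the trip happen?", "Why did she move?", "Tell me more."]:
        print(q, "->", route_intent(q))
```

Queries that return `None` are the ones a keyword tier cannot decide, which is where the intent-detection error rates requested in the referee report would matter most.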