Temporal Order Matters for Agentic Memory: Segment Trees for Long-Horizon Agents

Armin Toroghi; Faeze Moradi Kalarde; Jiazhou Liang; Liam Gallagher; Scott Sanner; Yifan Simon Liu

arxiv: 2606.04555 · v1 · pith:FDB2P6OKnew · submitted 2026-06-03 · 💻 cs.CL · cs.AI

Yifan Simon Liu, Liam Gallagher, Faeze Moradi Kalarde, Jiazhou Liang, Armin Toroghi, Scott Sanner

Pith reviewed 2026-06-28 06:11 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords: segment tree memory, temporal order, long-horizon agents, conversational memory, agentic memory, memory architecture, online insertion

The pith

A segment tree that inserts conversation turns in chronological order improves answer quality on long-horizon agent benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SegTreeMem, a memory architecture that builds a segment tree over utterances using an online rightmost-frontier insertion rule to keep events in their original time sequence. Retrieval works by propagating relevance scores down the tree so that local semantic matches are combined with hierarchical temporal context. Experiments on three long-horizon benchmarks with two different LLM backbones show higher answer quality than flat retrieval, graph memory, and other tree baselines. A follow-up test that randomly permutes temporal order during construction finds that the gains disappear, indicating that chronological structure is necessary for the observed improvement.

Core claim

SegTreeMem represents conversation history as a temporally ordered Segment Tree over utterances, incrementally inserts new utterances through an online rightmost-frontier update rule while forming hierarchical memory segments, and propagates relevance scores through the tree to combine local semantic matching with hierarchical temporal context; across three long-horizon memory benchmarks and two LLM backbones this yields higher answer quality than flat retrieval, graph-structured memory, and tree-structured memory baselines, and a temporal-order permutation analysis shows that the performance gain depends on preserving temporal order during memory construction.

What carries the argument

Segment tree with online rightmost-frontier update rule that preserves chronological order while forming hierarchical segments, plus relevance-score propagation for retrieval.
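The insertion rule can be pictured with a small sketch. This is not the authors' implementation: the `compatibility` function (word-set Jaccard overlap), the `threshold` value, and the choice to open a new segment under a fresh root when no frontier node is compatible are all assumptions made for illustration. What the sketch does share with the description above is that new leaves only ever attach on the rightmost path, so a left-to-right read of the leaves stays chronological.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    lo: int                                   # index of first utterance covered
    hi: int                                   # index of last utterance covered
    text: str = ""                            # utterance text (empty for internal nodes)
    words: set = field(default_factory=set)   # stand-in for a node annotation
    children: list = field(default_factory=list)
    parent: "Node | None" = None


def frontier(root):
    """Non-leaf nodes on the path from the root down the rightmost branch."""
    path, node = [], root
    while node.children:
        path.append(node)
        node = node.children[-1]
    return path


def compatibility(node, words):
    """Hypothetical compatibility model: Jaccard overlap of word sets."""
    union = node.words | words
    return len(node.words & words) / len(union) if union else 0.0


def new_segment(text, words, t):
    """A one-leaf segment holding utterance t."""
    leaf = Node(lo=t, hi=t, text=text, words=words)
    seg = Node(lo=t, hi=t, words=set(words), children=[leaf])
    leaf.parent = seg
    return seg


def insert(root, text, t, threshold=0.2):
    """Attach utterance t on the rightmost frontier; return the (possibly new) root."""
    words = set(text.lower().split())
    if root is None:
        return new_segment(text, words, t)
    best = max(frontier(root), key=lambda n: compatibility(n, words))
    if compatibility(best, words) >= threshold:
        leaf = Node(lo=t, hi=t, text=text, words=words, parent=best)
        best.children.append(leaf)
        node = best
        while node is not None:               # extend intervals up to the root
            node.hi = t
            node.words |= words
            node = node.parent
        return root
    # Assumed fallback: no compatible frontier node, so start a new segment
    # and join it with the old tree under a new root.
    seg = new_segment(text, words, t)
    top = Node(lo=root.lo, hi=t, words=root.words | words, children=[root, seg])
    root.parent = top
    seg.parent = top
    return top


def leaves(node):
    """Leaves in left-to-right order."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


utterances = [
    "book flight to paris",
    "paris flight seat window",
    "dinner recipe pasta",
    "pasta sauce recipe tomato",
    "flight paris delayed",
]
root = None
for t, text in enumerate(utterances):
    root = insert(root, text, t)
order = [leaf.lo for leaf in leaves(root)]
```

Because every attachment happens at the right edge, `order` comes out sorted no matter which frontier node is chosen; only the grouping into segments depends on the compatibility scores.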

If this is right

Long-horizon agents can maintain coherent recall of evolving events without explicit time-stamping at query time.
Memory systems that ignore insertion order will underperform even when they use hierarchical or graph structures.
Online incremental construction is sufficient to produce usable temporal hierarchy without full offline reordering.
Retrieval that combines local match scores with ancestor context captures both topical and sequential relevance.
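The last point can be made concrete with a toy top-down pass. The exact propagation rule is not given in the text available here, so the update below (a node's final score is its local score plus `decay` times its parent's final score) is an assumption, as are the node names and the scores.

```python
def propagate_top_down(tree, local, decay=0.5, root="root"):
    """Push relevance from parents to children.

    tree:  {node_id: [child_id, ...]}; leaves may be omitted as keys.
    local: {node_id: local semantic match score}.
    Assumed rule: final = local + decay * parent's final score.
    """
    final = {}
    stack = [(root, 0.0)]
    while stack:
        node, inherited = stack.pop()
        final[node] = local[node] + decay * inherited
        for child in tree.get(node, []):
            stack.append((child, final[node]))
    return final


tree = {"root": ["s1", "s2"], "s1": ["u0", "u1"], "s2": ["u2", "u3"]}
local = {"root": 0.2, "s1": 0.8, "s2": 0.1,
         "u0": 0.3, "u1": 0.3, "u2": 0.4, "u3": 0.3}
final = propagate_top_down(tree, local, decay=0.5)
```

In this toy case `u2` has the best local match among the leaves, yet `u0` ends up ranked higher because it sits inside the segment `s1` that matches the query well. Setting `decay=0.0` recovers purely local scoring.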

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same insertion rule could be tested on non-conversational event streams such as sensor logs or code commit histories.
If temporal order is the dominant factor, simpler linear buffers with time-aware scoring might close much of the gap.
The approach suggests that agent memory benchmarks should include explicit order-permutation controls as a standard diagnostic.
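To make the second extension concrete, here is one way a linear buffer with time-aware scoring could look. It is a hypothetical baseline, not something the paper evaluates: each utterance's score is its local match plus a bonus from temporally nearby utterances, discounted exponentially with distance. The parameter names `window`, `tau`, and `weight` are invented for this sketch.

```python
import math


def time_aware_scores(local, window=2, tau=1.0, weight=0.5):
    """Rescore a chronological list of local match scores.

    Each position borrows relevance from neighbours within `window`
    turns, discounted by exp(-distance / tau) and scaled by `weight`.
    """
    n = len(local)
    scores = []
    for i in range(n):
        context = 0.0
        for j in range(max(0, i - window), min(n, i + window + 1)):
            if j != i:
                context += math.exp(-abs(i - j) / tau) * local[j]
        scores.append(local[i] + weight * context)
    return scores


local = [0.1, 0.9, 0.2, 0.0, 0.3]
scores = time_aware_scores(local)
```

Here turn 2 overtakes turn 4 despite a weaker local match, because it sits next to the strongly matching turn 1. Whether such a flat scheme would close much of the gap to a tree is exactly the open question raised above.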

Load-bearing premise

That the measured gains come from temporal-order preservation rather than from other differences in how the segment tree inserts items or scores them.

What would settle it

The claim would fail if the same benchmarks, rerun with the rightmost-frontier rule disabled or with utterance order randomly shuffled at insertion time, showed no drop in answer quality.
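A shuffle-at-insertion test of this kind can be wrapped in a small harness. The sketch below is generic: `build` and `evaluate` are placeholders for a real memory constructor and a real benchmark metric, and the toy versions used here (an identity build and an "adjacent pairs in order" metric) are stand-ins chosen only so the example runs.

```python
import random
from statistics import mean


def order_ablation(utterances, build, evaluate, n_shuffles=5, seed=0):
    """Score memory built in original order vs. randomly permuted order."""
    rng = random.Random(seed)
    ordered = evaluate(build(list(utterances)))
    shuffled = []
    for _ in range(n_shuffles):
        perm = list(utterances)
        rng.shuffle(perm)                 # permute only the insertion order
        shuffled.append(evaluate(build(perm)))
    return ordered, mean(shuffled)


# Toy stand-ins (assumptions): memory is just the inserted sequence, and the
# metric is the fraction of adjacent items that appear in chronological order.
def toy_build(seq):
    return seq


def toy_evaluate(memory):
    pairs = list(zip(memory, memory[1:]))
    return sum(a[0] < b[0] for a, b in pairs) / len(pairs)


utterances = [(t, f"utterance {t}") for t in range(8)]
ordered_score, shuffled_score = order_ablation(utterances, toy_build, toy_evaluate)
```

Averaging over several seeds matters here, since a single permutation can land close to the original order by chance.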

Figures

Figures reproduced from arXiv: 2606.04555 by Armin Toroghi, Faeze Moradi Kalarde, Jiazhou Liang, Liam Gallagher, Scott Sanner, Yifan Simon Liu.

**Figure 2.** Online Segment Tree update. For a new utterance x_{t+1}, the compatibility model selects a frontier node v_3, to which a subtree with leaf x_{t+1} is attached. Online tree update via rightmost frontier. To select an attachment node, we use a compatibility model, which either returns the non-leaf frontier node most compatible with the incoming utterance x_{t+1}, or indicates that no compatible frontier node exists…

**Figure 3.** Two score propagation policies: top: top-down, where scores propagate from parents to…

**Figure 4.** Controlled comparison of tree construction strategies. We compare a non-temporal similarity…

**Figure 5.** Construction and retrieval efficiency as memory grows. We report per-utterance construc…

**Figure 6.** Effect of propagation policy and decay factor on retrieval quality. The dashed line…

**Figure 7.** Retrieved-node level distribution as a function of the propagation setting, using batch-LLM…

**Figure 8.** Whole-tree visualizations of SEGTREEMEM (top), MEMTREE [31] (middle), and RAPTOR [34] (bottom) on LoCoMo conv-47, which spans 329 utterances across 31 dialogues. Each leaf (filled circle) is one utterance; internal nodes are drawn as outlined circles. Both leaves and internal nodes are colored by source dialogue index.

**Figure 9.** Degenerate cascading segment tree produced by maximally topic-switching input. Each…

**Figure 10.** Long-range topical recurrence. Two leaves on the same topic (purple) are separated by an…

**Figure 11.** Q1 ground-truth answer-bearing utterance and ancestor nodes in the LoCoMo…

**Figure 12.** Effect of propagation direction and decay factor under the…

**Figure 13.** Accuracy as a function of the decay factor…

**Figure 14.** Retrieved-node level distribution as a function of the propagation setting, using batch-LLM…

Original abstract

Long-horizon conversational agents need to interact with users through evolving events, tasks, and goals. Such histories are naturally temporal, yet many existing memory systems organize information primarily by topical similarity and may ignore the order in which events occur. We introduce Segment Tree Memory, or SegTreeMem, a memory architecture that represents conversation history as a temporally ordered Segment Tree over utterances. SegTreeMem incrementally inserts new utterances through an online rightmost-frontier update rule, preserving chronological order while forming hierarchical memory segments. For retrieval, SegTreeMem propagates relevance scores through the tree to combine local semantic matching with hierarchical temporal context. Across three long-horizon memory benchmarks and two LLM backbones, SegTreeMem improves answer quality over flat retrieval, graph-structured memory, and tree-structured memory baselines. Additional temporal-order permutation analysis shows that the performance gain depends on preserving temporal order during memory construction, supporting the claim that temporal order is a key structure for agentic memory.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SegTreeMem uses segment trees with rightmost insertion to keep temporal order in agent memory and reports gains over baselines, but the permutation test mixes order effects with changes to segment boundaries.

The paper's core move is to represent conversation history as a segment tree built incrementally by always inserting at the rightmost frontier. This keeps utterances in strict chronological order while creating hierarchical segments. Relevance scores then propagate up and down the tree to blend local matches with temporal context. The authors test this on three long-horizon benchmarks with two different LLM backbones and report better answer quality than flat retrieval, graph memory, and other tree baselines. A permutation experiment is included to argue that the gains depend on preserving order.

The adaptation itself is clean and the incremental rule fits the online nature of agent conversations. Running the same setup on real benchmarks with multiple models gives the result some grounding.

The soft spot is the permutation analysis. Randomly reordering the utterances necessarily changes the rightmost-frontier insertion sequence, so the resulting segments and their boundaries shift. The paper does not appear to include a control that holds the tree topology fixed while only altering order, which leaves open whether the performance drop comes from lost chronology or from disrupted scoring propagation. The abstract also gives no numbers, error bars, or statistical details, so the size and reliability of the gains are hard to judge from the summary alone.

This is aimed at people building memory modules for long-running agents. A reader already working on temporal or hierarchical retrieval might pick up the segment-tree trick, but the order claim needs tighter isolation before it can be taken as settled.

I would send it to peer review. The idea is concrete and the experiments are on external tasks, so referees can ask for the missing controls and numbers.

Referee Report

1 major / 2 minor

Summary. The paper introduces SegTreeMem, a memory architecture that represents conversation history as a temporally ordered segment tree over utterances. New utterances are inserted incrementally via an online rightmost-frontier update rule that preserves chronological order while forming hierarchical segments; relevance scores are propagated through the tree at retrieval time to combine local semantic matching with hierarchical temporal context. Empirical results across three long-horizon benchmarks and two LLM backbones show improved answer quality relative to flat retrieval, graph-structured memory, and other tree-structured baselines. A temporal-order permutation analysis is presented to support the claim that performance gains depend on preserving temporal order during memory construction.

Significance. If the central empirical claims and the attribution to temporal order hold after addressing the noted methodological gap, the work would strengthen the case that chronological structure is a load-bearing inductive bias for agentic memory systems, distinct from topical similarity alone. The choice of a standard segment-tree data structure with an explicit online insertion rule is a clear methodological strength that could facilitate reproducibility.

major comments (1)

[Permutation analysis] Permutation analysis (described in the abstract and presumably §4 or §5): randomly reordering utterances necessarily alters the rightmost-frontier insertion sequence and therefore the resulting hierarchical segment boundaries and tree topology. No control experiment is described that holds segment structure fixed while only permuting order (or vice versa), so the observed performance drop cannot be attributed solely to loss of chronological order rather than to disruption of the tree topology or scoring propagation rules. This directly affects the load-bearing claim that 'the performance gain depends on preserving temporal order.'

minor comments (2)

The abstract states empirical improvements but supplies no numerical values, error bars, or statistical tests; the full results section should include these details with explicit baseline comparisons.
Clarify the exact relevance-score propagation rule (e.g., how scores are combined across levels) with a worked example or pseudocode, as the current description leaves the interaction between local matching and hierarchical context underspecified.

Simulated Author's Rebuttal

1 response · 0 unresolved

Thank you for the constructive feedback. We respond to the major comment on the permutation analysis and will revise the manuscript accordingly to address the methodological concern.

Referee: [Permutation analysis] Permutation analysis (described in the abstract and presumably §4 or §5): randomly reordering utterances necessarily alters the rightmost-frontier insertion sequence and therefore the resulting hierarchical segment boundaries and tree topology. No control experiment is described that holds segment structure fixed while only permuting order (or vice versa), so the observed performance drop cannot be attributed solely to loss of chronological order rather than to disruption of the tree topology or scoring propagation rules. This directly affects the load-bearing claim that 'the performance gain depends on preserving temporal order.'

Authors: We thank the referee for highlighting this important point. The permutation analysis was intended to demonstrate the importance of chronological order in the construction process, which inherently determines the tree topology via the online insertion rule. However, we acknowledge that this does not fully disentangle the contribution of order from the resulting structure. To address this, we will include an additional control experiment in the revised manuscript where the segment tree topology is fixed based on the original temporal order, but the leaf utterances are permuted in content. This will allow us to assess the impact of order independently of topology changes. We believe this will strengthen the evidence for our central claim. revision: yes
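The control proposed in this response can be sketched as follows. This is an illustration under assumptions, not the authors' planned experiment: the tree is represented as nested Python lists with utterance strings at the leaves, and the helper names `shape`, `flatten`, and `permute_leaves` are invented here. The idea is to keep the bracket structure produced by chronological insertion and shuffle only which utterance sits in which leaf slot.

```python
import random


def shape(tree):
    """Topology of the tree with leaf contents erased."""
    if isinstance(tree, list):
        return [shape(child) for child in tree]
    return None


def flatten(tree):
    """Leaf contents in left-to-right order."""
    if isinstance(tree, list):
        return [leaf for child in tree for leaf in flatten(child)]
    return [tree]


def permute_leaves(tree, seed=0):
    """Return a tree with the same topology but leaf contents shuffled."""
    contents = flatten(tree)
    random.Random(seed).shuffle(contents)
    slots = iter(contents)

    def rebuild(node):
        if isinstance(node, list):
            return [rebuild(child) for child in node]
        return next(slots)                # fill leaf slots left to right

    return rebuild(tree)


original = [["u0", "u1"], [["u2"], ["u3", "u4"]]]
control = permute_leaves(original, seed=1)
```

In a real run, any internal-node summaries or annotations would presumably need to be recomputed from the permuted leaves, otherwise the control would mix old segment descriptions with new contents.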

Circularity Check

0 steps flagged

No circularity: empirical claims rest on external benchmarks

The paper introduces SegTreeMem via an online insertion rule and reports answer-quality gains on three external long-horizon benchmarks plus a temporal-order permutation test. No equations, fitted parameters, self-citations, or ansatzes are described that reduce any claimed result to an input by construction. The reported improvements and the dependence on temporal order are presented as measured outcomes on independent benchmarks rather than quantities defined in terms of the method itself; the derivation chain is therefore self-contained against external data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The segment tree itself is a standard structure repurposed here, so no new postulated entities are introduced in the provided text.

Reference graph

Works this paper leans on

50 extracted references · 27 canonical work pages · 13 internal anchors

[1] Alibaba Cloud. Qwen3.5-Flash model documentation. https://www.alibabacloud.com/help/en/model-studio/getting-started/models, 2026. Alibaba Cloud Model Studio documentation for qwen3.5-flash; snapshot qwen3.5-flash-2026-02-23.

[2] Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al. Longbench: A bilingual, multitask benchmark for long context understanding. In Proceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers), pages 3119–3137, 2024.

[3] Haonan Bian, Zhiyuan Yao, Sen Hu, Zishan Xu, Shaolei Zhang, Yifu Guo, Ziliang Yang, Xueran Han, Huacan Wang, and Ronghao Chen. Realmem: Benchmarking llms in real-world memory-driven interaction. arXiv preprint arXiv:2601.06966, 2026.

[4] Howard Chen, Ramakanth Pasunuru, Jason Weston, and Asli Celikyilmaz. Walking down the memory maze: Beyond context limit through interactive reading. arXiv preprint arXiv:2310.05029, 2023.

[5] Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready ai agents with scalable long-term memory. arXiv preprint arXiv:2504.19413, 2025. Includes Mem0 and graph-enhanced Mem0g variants.

[6] Mark de Berg, Otfried Cheong, Marc van Kreveld, and Mark Overmars. Computational Geometry: Algorithms and Applications. Springer, 3 edition, 2008.

[7] Michel Galley, Kathleen R. McKeown, Eric Fosler-Lussier, and Hongyan Jing. Discourse segmentation of multi-party conversation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 562–569, Sapporo, Japan, July 2003. Association for Computational Linguistics. doi: 10.3115/1075096.1075167. URL https://aclantholog...

[8] Nilesh Gupta, Wei-Cheng Chang, Ngot Bui, Cho-Jui Hsieh, and Inderjit S. Dhillon. LLM-guided hierarchical retrieval. arXiv preprint arXiv:2510.13217, 2025.

[9] Bernal Jiménez Gutiérrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su. Hipporag: Neurobiologically inspired long-term memory for large language models. arXiv preprint arXiv:2405.14831, 2024.

[10] Yuanzhe Hu, Yu Wang, and Julian McAuley. Evaluating memory in llm agents via incremental multi-turn interactions. arXiv preprint arXiv:2507.05257, 2025.
[11] Glen Jeh and Jennifer Widom. Scaling personalized web search. In Proceedings of the 12th International Conference on World Wide Web, pages 271–279. ACM, 2003.

[12] Shafiq Joty, Giuseppe Carenini, and Raymond T. Ng. Topic segmentation and labeling in asynchronous conversations. Journal of Artificial Intelligence Research, 47:521–573, 2013. doi: 10.1613/jair.3940. URL https://doi.org/10.1613/jair.3940

[13] Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, Online, November 2020. Association for Computational Linguistics. ...

[14] Junyoung Kim, Anton Korikov, Jiazhou Liang, Justin Cui, Yifan Simon Liu, Qianfeng Wen, Mark Zhao, and Scott Sanner. Bayesian active learning with gaussian processes guided by llm relevance scoring for dense passage retrieval. arXiv preprint arXiv:2604.17906, 2026.

[15] Jiazhou Liang, Yifan Simon Liu, David Guo, Minqi Sun, Yilun Jiang, and Scott Sanner. Evaluating scene-based in-situ item labeling for immersive conversational recommendation. arXiv preprint arXiv:2604.09698, 2026.

[16] Jiazhou Liang, Armin Toroghi, Yifan Simon Liu, Faeze Moradi Kalarde, Liam Gallagher, and Scott Sanner. Goal-oriented reasoning for rag-based memory in conversational agentic llm systems. arXiv preprint arXiv:2605.12213, 2026.

[18] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-eval: Nlg evaluation using gpt-4 with better human alignment. In Proceedings of the 2023 conference on empirical methods in natural language processing, pages 2511–2522, 2023.

[19] Yifan Liu, Gelila Tilahun, Xinxiang Gao, Qianfeng Wen, and Michael Gervers. A comparative study of static and contextual embeddings for analyzing semantic changes in medieval latin charters. In Proceedings of the First Workshop on Language Models for Low-Resource Languages, pages 182–192, 2025.

[20] Yifan Liu, Qianfeng Wen, Jiazhou Liang, Mark Zhao, Justin Cui, Anton Korikov, Armin Toroghi, Junyoung Kim, and Scott Sanner. Multimodal item scoring for natural language recommendation via gaussian process regression with llm relevance judgments. arXiv preprint arXiv:2510.22023, 2025.

[21] Yifan Liu, Qianfeng Wen, Mark Zhao, Jiazhou Liang, and Scott Sanner. Ma-dpr: Manifold-aware distance metrics for dense passage retrieval. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 31073–31091, 2025.
[22] Yifan Simon Liu, Ruifan Wu, Liam Gallagher, Jiazhou Liang, Armin Toroghi, and Scott Sanner. Semantic xpath: Structured agentic memory access for conversational ai. arXiv preprint arXiv:2603.01160, 2026.

[23] Xueguang Ma, Xinyu Zhang, Ronak Pradeep, and Jimmy Lin. Zero-shot listwise document reranking with a large language model. CoRR, abs/2305.02156, 2023. doi: 10.48550/arXiv.2305.02156

[24] Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long-term conversational memory of llm agents. arXiv preprint arXiv:2402.17753, 2024.

[25] OpenAI. GPT-4o mini model documentation. https://platform.openai.com/docs/models/gpt-4o-mini, 2024. Official OpenAI API documentation for gpt-4o-mini; snapshot gpt-4o-mini-2024-07-18.

[26] OpenAI. New embedding models and api updates. https://openai.com/index/new-embedding-models-and-api-updates/, 2024. Introduces text-embedding-3-small.

[27] OpenAI. Gpt-5.4 mini model documentation. https://developers.openai.com/api/docs/models/gpt-5.4-mini, 2026. Official OpenAI API documentation for gpt-5.4-mini.

[28] Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. Memgpt: Towards llms as operating systems. arXiv preprint arXiv:2310.08560, 2023.

[29] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, 1999.

[30] Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, UIST ’23, 2023.

[31] Alireza Rezazadeh, Zichao Li, Wei Wei, and Yujia Bao. From isolated conversations to hierarchical schemas: Dynamic tree memory representation for llms. arXiv preprint arXiv:2410.14052, 2024.
[32] Stephen Robertson and Hugo Zaragoza. The probabilistic relevance framework: Bm25 and beyond. Foundations and Trends in Information Retrieval, 3(4):333–389, 2009. doi: 10.1561/1500000019

[33] Stephen E. Robertson, Steve Walker, Susan Jones, Micheline Hancock-Beaulieu, and Mike Gatford. Okapi at trec-3. In Text Retrieval Conference, 1994.

[34] Parth Sarthi, Salman Abdullah, Aditi Tuli, Shubh Khanna, Anna Goldie, and Christopher D. Manning. RAPTOR: Recursive abstractive processing for tree-organized retrieval. arXiv preprint arXiv:2401.18059, 2024.

[35] Weiwei Sun, Lingyong Yan, Xinyu Ma, Shuaiqiang Wang, Pengjie Ren, Zhumin Chen, Dawei Yin, and Zhaochun Ren. Is ChatGPT good at search? investigating large language models as re-ranking agents. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 14918–14937, Singapore, December 2023. Association for Computati...

[36] Yixuan Tang and Yi Yang. Multihop-rag: Benchmarking retrieval-augmented generation for multi-hop queries. arXiv preprint arXiv:2401.15391, 2024.

[37] Hanghang Tong, Christos Faloutsos, and Jia-Yu Pan. Fast random walk with restart and its applications. In Proceedings of the Sixth IEEE International Conference on Data Mining, pages 613–622, 2006.

[38] Qianfeng Wen, Yifan Liu, Joshua Zhang, George Saad, Anton Korikov, Yury Sambale, and Scott Sanner. Elaborative subtopic query reformulation for broad and indirect queries in travel destination recommendation. arXiv preprint arXiv:2410.01598, 2024.

[39] Qianfeng Wen, Yifan Liu, Justin Cui, Joshua Zhang, Anton Korikov, George-Kirollos Saad, and Scott Sanner. A simple but effective elaborative query reformulation approach for natural language recommendation. arXiv preprint arXiv:2510.02656, 2025.

[40] Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. LongMemEval: Benchmarking chat assistants on long-term interactive memory. arXiv preprint arXiv:2410.10813, 2024.

[41] Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-mem: Agentic memory for llm agents. arXiv preprint arXiv:2502.12110, 2025.

[42] Xinyu Zhang, Sebastian Hofstätter, Patrick Lewis, Raphael Tang, and Jimmy Lin. Rank-without-GPT: Building GPT-independent listwise rerankers on open-source large language models. CoRR, abs/2312.02969, 2023. doi: 10.48550/arXiv.2312.02969

[43] Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. Memorybank: Enhancing large language models with long-term memory. Proceedings of the AAAI Conference on Artificial Intelligence, 38(17), 2024.

[1] [1]

Qwen3.5-Flash model documentation

Alibaba Cloud. Qwen3.5-Flash model documentation. https://www.alibabacloud.com/ help/en/model-studio/getting-started/models, 2026. Alibaba Cloud Model Studio documentation forqwen3.5-flash; snapshotqwen3.5-flash-2026-02-23

2026

[2]

Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al. Longbench: A bilingual, multitask benchmark for long context understanding. In Proceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers), pages 3119–3137, 2024.

[3]

Haonan Bian, Zhiyuan Yao, Sen Hu, Zishan Xu, Shaolei Zhang, Yifu Guo, Ziliang Yang, Xueran Han, Huacan Wang, and Ronghao Chen. Realmem: Benchmarking llms in real-world memory-driven interaction. arXiv preprint arXiv:2601.06966, 2026.

[4]

Howard Chen, Ramakanth Pasunuru, Jason Weston, and Asli Celikyilmaz. Walking down the memory maze: Beyond context limit through interactive reading. arXiv preprint arXiv:2310.05029, 2023.

[5]

Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready ai agents with scalable long-term memory. arXiv preprint arXiv:2504.19413, 2025. Includes Mem0 and graph-enhanced Mem0g variants.

[6]

Mark de Berg, Otfried Cheong, Marc van Kreveld, and Mark Overmars. Computational Geometry: Algorithms and Applications. Springer, 3 edition, 2008.

[7]

Michel Galley, Kathleen R. McKeown, Eric Fosler-Lussier, and Hongyan Jing. Discourse segmentation of multi-party conversation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 562–569, Sapporo, Japan, July 2003. Association for Computational Linguistics. doi: 10.3115/1075096.1075167. URL https: //aclantholog...

[8]

Nilesh Gupta, Wei-Cheng Chang, Ngot Bui, Cho-Jui Hsieh, and Inderjit S. Dhillon. LLM-guided hierarchical retrieval. arXiv preprint arXiv:2510.13217, 2025.

[9]

Bernal Jiménez Gutiérrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su. Hipporag: Neurobiologically inspired long-term memory for large language models. arXiv preprint arXiv:2405.14831, 2024.

[10]

Yuanzhe Hu, Yu Wang, and Julian McAuley. Evaluating memory in llm agents via incremental multi-turn interactions. arXiv preprint arXiv:2507.05257, 2025.

[11]

Glen Jeh and Jennifer Widom. Scaling personalized web search. In Proceedings of the 12th International Conference on World Wide Web, pages 271–279. ACM, 2003.

[12]

Shafiq Joty, Giuseppe Carenini, and Raymond T. Ng. Topic segmentation and labeling in asynchronous conversations. Journal of Artificial Intelligence Research, 47:521–573, 2013. doi: 10.1613/jair.3940. URL https://doi.org/10.1613/jair.3940

[13]

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, Online, November 2020. Association for Computational Linguistics. ...

[14]

Junyoung Kim, Anton Korikov, Jiazhou Liang, Justin Cui, Yifan Simon Liu, Qianfeng Wen, Mark Zhao, and Scott Sanner. Bayesian active learning with gaussian processes guided by llm relevance scoring for dense passage retrieval. arXiv preprint arXiv:2604.17906, 2026.

[15]

Jiazhou Liang, Yifan Simon Liu, David Guo, Minqi Sun, Yilun Jiang, and Scott Sanner. Evaluating scene-based in-situ item labeling for immersive conversational recommendation. arXiv preprint arXiv:2604.09698, 2026.

[16]

Jiazhou Liang, Armin Toroghi, Yifan Simon Liu, Faeze Moradi Kalarde, Liam Gallagher, and Scott Sanner. Goal-oriented reasoning for rag-based memory in conversational agentic llm systems. arXiv preprint arXiv:2605.12213, 2026.

[18]

Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-eval: Nlg evaluation using gpt-4 with better human alignment. In Proceedings of the 2023 conference on empirical methods in natural language processing, pages 2511–2522, 2023.

[19]

Yifan Liu, Gelila Tilahun, Xinxiang Gao, Qianfeng Wen, and Michael Gervers. A comparative study of static and contextual embeddings for analyzing semantic changes in medieval latin charters. In Proceedings of the First Workshop on Language Models for Low-Resource Languages, pages 182–192, 2025.

[20]

Yifan Liu, Qianfeng Wen, Jiazhou Liang, Mark Zhao, Justin Cui, Anton Korikov, Armin Toroghi, Junyoung Kim, and Scott Sanner. Multimodal item scoring for natural language recommendation via gaussian process regression with llm relevance judgments. arXiv preprint arXiv:2510.22023, 2025.

[21]

Yifan Liu, Qianfeng Wen, Mark Zhao, Jiazhou Liang, and Scott Sanner. Ma-dpr: Manifold-aware distance metrics for dense passage retrieval. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 31073–31091, 2025.

[22]

Yifan Simon Liu, Ruifan Wu, Liam Gallagher, Jiazhou Liang, Armin Toroghi, and Scott Sanner. Semantic xpath: Structured agentic memory access for conversational ai. arXiv preprint arXiv:2603.01160, 2026.

[23]

Xueguang Ma, Xinyu Zhang, Ronak Pradeep, and Jimmy Lin. Zero-shot listwise document reranking with a large language model. CoRR, abs/2305.02156, 2023. doi: 10.48550/arXiv.2305.02156.

[24]

Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long-term conversational memory of llm agents. arXiv preprint arXiv:2402.17753, 2024.

[25]

OpenAI. GPT-4o mini model documentation. https://platform.openai.com/docs/models/gpt-4o-mini, 2024. Official OpenAI API documentation for gpt-4o-mini; snapshot gpt-4o-mini-2024-07-18.

[26]

OpenAI. New embedding models and api updates. https://openai.com/index/new-embedding-models-and-api-updates/, 2024. Introduces text-embedding-3-small.

[27]

OpenAI. Gpt-5.4 mini model documentation. https://developers.openai.com/api/docs/models/gpt-5.4-mini, 2026. Official OpenAI API documentation for gpt-5.4-mini.

[28]

Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. Memgpt: Towards llms as operating systems. arXiv preprint arXiv:2310.08560, 2023.

[29]

Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, 1999.

[30]

Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, UIST ’23, 2023.

[31]

Alireza Rezazadeh, Zichao Li, Wei Wei, and Yujia Bao. From isolated conversations to hierarchical schemas: Dynamic tree memory representation for llms. arXiv preprint arXiv:2410.14052, 2024.

[32]

Stephen Robertson and Hugo Zaragoza. The probabilistic relevance framework: Bm25 and beyond. Foundations and Trends in Information Retrieval, 3(4):333–389, 2009. doi: 10.1561/1500000019.

[33]

Stephen E. Robertson, Steve Walker, Susan Jones, Micheline Hancock-Beaulieu, and Mike Gatford. Okapi at trec-3. In Text Retrieval Conference, 1994.

[34]

Parth Sarthi, Salman Abdullah, Aditi Tuli, Shubh Khanna, Anna Goldie, and Christopher D. Manning. RAPTOR: Recursive abstractive processing for tree-organized retrieval. arXiv preprint arXiv:2401.18059, 2024.

[35]

Weiwei Sun, Lingyong Yan, Xinyu Ma, Shuaiqiang Wang, Pengjie Ren, Zhumin Chen, Dawei Yin, and Zhaochun Ren. Is ChatGPT good at search? investigating large language models as re-ranking agents. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 14918–14937, Singapore, December 2023. Association for Computati...

[36]

Yixuan Tang and Yi Yang. Multihop-rag: Benchmarking retrieval-augmented generation for multi-hop queries. arXiv preprint arXiv:2401.15391, 2024.

[37]

Hanghang Tong, Christos Faloutsos, and Jia-Yu Pan. Fast random walk with restart and its applications. In Proceedings of the Sixth IEEE International Conference on Data Mining, pages 613–622, 2006.

[38]

Qianfeng Wen, Yifan Liu, Joshua Zhang, George Saad, Anton Korikov, Yury Sambale, and Scott Sanner. Elaborative subtopic query reformulation for broad and indirect queries in travel destination recommendation. arXiv preprint arXiv:2410.01598, 2024.

[39]

Qianfeng Wen, Yifan Liu, Justin Cui, Joshua Zhang, Anton Korikov, George-Kirollos Saad, and Scott Sanner. A simple but effective elaborative query reformulation approach for natural language recommendation. arXiv preprint arXiv:2510.02656, 2025.

[40]

Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. Longmemeval: Benchmarking chat assistants on long-term interactive memory. arXiv preprint arXiv:2410.10813, 2024.

[41]

Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-mem: Agentic memory for llm agents. arXiv preprint arXiv:2502.12110, 2025.

[42]

Xinyu Zhang, Sebastian Hofstätter, Patrick Lewis, Raphael Tang, and Jimmy Lin. Rank-without-GPT: Building GPT-independent listwise rerankers on open-source large language models. CoRR, abs/2312.02969, 2023. doi: 10.48550/arXiv.2312.02969.

[43]

Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. Memorybank: Enhancing large language models with long-term memory. Proceedings of the AAAI Conference on Artificial Intelligence, 38(17), 2024.

A Algorithmic Details

This subsection provides pseudocode for the two SEGTREEMEM operations. Node annotations are denoted by A(v), node intervals by I(...

[44]

The question category (1=single-hop, 2=temporal, 3=open-ended, 4=multi-hop, 5=adversarial)

[45]

The gold answer (or adversarial answer for category 5)
