pith. machine review for the scientific record. sign in

arxiv: 2603.12572 · v3 · submitted 2026-03-13 · 💻 cs.CL

Recognition: 1 theorem link

· Lean Theorem

LMEB: Long-horizon Memory Embedding Benchmark

Authors on Pith no claims yet

Pith reviewed 2026-05-15 12:23 UTC · model grok-4.3

classification 💻 cs.CL
keywords memoryretrievallmebembeddinglong-horizonmodelstasksbenchmark
0
0 comments X

The pith

LMEB benchmark shows that embedding models' performance on traditional retrieval does not transfer to long-horizon memory tasks, larger models do not always perform better, and LMEB measures capabilities orthogonal to MTEB.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Memory embeddings help AI systems store and retrieve information over time, similar to how humans remember past events or conversations. Current benchmarks for these embeddings mainly test short, straightforward searches for information. This paper argues that real-world use often requires handling information that is spread out, depends on previous context, and spans long periods. To measure this, the authors built LMEB, which includes 22 different datasets turned into 193 retrieval tasks. These tasks cover four kinds of memory: remembering personal experiences, tracking dialogues, recalling facts, and following procedures. They ran tests on 15 popular embedding models of various sizes. The findings indicate that the new benchmark is appropriately challenging, that making models larger does not guarantee better results on these tasks, and that success on older benchmarks does not predict success here. This points to the need for new approaches in building embeddings that work well for complex memory needs. By separating these memory types, the benchmark allows researchers to see where models are strong or weak in different areas. The benchmark is shared publicly to encourage further research and standardized evaluation.

Core claim

LMEB and MTEB measure orthogonal capabilities. This suggests that the field has yet to converge on a universal model capable of excelling across all memory retrieval tasks, and that strong performance on traditional passage retrieval does not necessarily transfer to long-horizon memory retrieval.

Load-bearing premise

The assumption that the 22 datasets and their categorization into four memory types accurately and comprehensively represent the challenges of real-world long-horizon memory retrieval without significant selection bias or overlap.

Figures

Figures reproduced from arXiv: 2603.12572 by Baotian Hu, Danyu Tang, Jiaxin Xu, Meishan Zhang, Mengjia Zhou, Min Zhang, Xinping Zhao, Xinshuo Hu, Xin Zhang, Yan Zhong, Yao Zhou, Zifei Shan.

Figure 1
Figure 1. Figure 1: Overview of LMEB memory categories and specificities. Table 1 presents detailed dataset [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Memory taxonomy of LMEB. In LMEB, we categorize memory into four types, characterized along two key dimensions, as shown in [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Inter-dataset diversity in LMEB. The left side illustrates pairwise weighted Jaccard [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Performance comparison between w/o inst. and w/ inst.. The x-axis represents the two conditions (w/o inst. and w/ inst.), and the y-axis indicates the N@10 performance. 4 Main Results [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Correlation between the evaluation scores on LMEB and MTEB (eng, v2) (retrieval [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Correlation between the scores on LMEB-Episodic and MTEB (eng, v2) (retrieval subset). [PITH_FULL_IMAGE:figures/full_fig_p034_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Correlation between the scores on LMEB-Dialogue and MTEB (eng, v2) (retrieval subset). [PITH_FULL_IMAGE:figures/full_fig_p034_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Correlation between the scores on LMEB-Semantic and MTEB (eng, v2) (retrieval subset). [PITH_FULL_IMAGE:figures/full_fig_p035_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Correlation between the scores on LMEB-Procedural and MTEB (eng, v2) (retrieval [PITH_FULL_IMAGE:figures/full_fig_p035_9.png] view at source ↗
read the original abstract

Memory embeddings are crucial for memory-augmented systems, such as OpenClaw, but their evaluation is underexplored in current text embedding benchmarks, which narrowly focus on traditional passage retrieval and fail to assess models' ability to handle long-horizon memory retrieval tasks involving fragmented, context-dependent, and temporally distant information. To address this gap, we introduce the Long-horizon Memory Embedding Benchmark (LMEB), a comprehensive framework for evaluating embedding models on complex, long-horizon memory retrieval. LMEB comprises 22 datasets and 193 zero-shot retrieval tasks spanning four memory types: episodic, dialogue, semantic, and procedural. These memory types differ in terms of level of abstraction and temporal dependency, capturing distinct aspects of memory retrieval that reflect the diverse challenges of the real world. We evaluate 15 widely used embedding models, ranging from hundreds of millions to ten billion parameters. The results reveal that (1) LMEB provides a reasonable level of difficulty; (2) Larger models do not always perform better; (3) LMEB and MTEB measure orthogonal capabilities. This suggests that the field has yet to converge on a universal model capable of excelling across all memory retrieval tasks, and that strong performance on traditional passage retrieval does not necessarily transfer to long-horizon memory retrieval. LMEB provides a standardized and reproducible framework that fills a key gap in memory embedding evaluation and supports future advances in long-term, context-dependent retrieval. LMEB is available at https://kalm-embedding.github.io/LMEB.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces the Long-horizon Memory Embedding Benchmark (LMEB), comprising 22 datasets and 193 zero-shot retrieval tasks across four memory types (episodic, dialogue, semantic, procedural). It evaluates 15 embedding models ranging from hundreds of millions to 10B parameters and reports three main findings: LMEB presents reasonable difficulty, larger models do not always outperform smaller ones, and LMEB scores are orthogonal to those on MTEB, implying that strong passage-retrieval performance does not transfer to long-horizon memory retrieval.

Significance. If the orthogonality result holds after verification of task disjointness, the work would be significant: it supplies a reproducible framework that exposes a gap in current embedding benchmarks and could steer development of models for temporally distant, context-dependent retrieval in memory-augmented systems.

major comments (3)
  1. [§4] §4 (Dataset Curation): the paper provides no explicit audit or overlap analysis between the 22 LMEB datasets and MTEB corpora or task templates; without this, the central claim that LMEB and MTEB measure orthogonal capabilities (reported in §5.3 and Figure 3) cannot be distinguished from a curation artifact.
  2. [§3.2] §3.2 (Task Construction): the description of how the 193 zero-shot tasks are generated from the four memory types lacks concrete details on query formulation, relevance labeling, and temporal-dependency handling, which are required to assess whether the evaluation genuinely tests long-horizon retrieval rather than standard passage matching.
  3. [Table 3] Table 3 (Model Rankings): the statement that 'larger models do not always perform better' is supported only by raw scores; without statistical tests, confidence intervals, or ablation on model scale, this observation remains descriptive and does not yet undermine scaling hypotheses.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'reasonable level of difficulty' is used without reference to a quantitative baseline (e.g., random or BM25 performance) that would allow readers to interpret the reported numbers.
  2. [§6] §6 (Reproducibility): the GitHub link is given, but the paper should list exact preprocessing steps, data splits, and prompt templates used for the 193 tasks to ensure full reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each of the major comments below and have revised the paper accordingly to incorporate additional analyses and details.

read point-by-point responses
  1. Referee: [§4] §4 (Dataset Curation): the paper provides no explicit audit or overlap analysis between the 22 LMEB datasets and MTEB corpora or task templates; without this, the central claim that LMEB and MTEB measure orthogonal capabilities (reported in §5.3 and Figure 3) cannot be distinguished from a curation artifact.

    Authors: We agree that an explicit audit for overlap with MTEB was missing. In the revised manuscript, we will add a new subsection in §4 detailing the overlap analysis. Our preliminary check shows that LMEB datasets are sourced from distinct domains (e.g., personal memory logs, dialogue histories) with no direct overlap in documents or templates with MTEB tasks, supporting that the orthogonality is not merely a curation artifact. revision: yes

  2. Referee: [§3.2] §3.2 (Task Construction): the description of how the 193 zero-shot tasks are generated from the four memory types lacks concrete details on query formulation, relevance labeling, and temporal-dependency handling, which are required to assess whether the evaluation genuinely tests long-horizon retrieval rather than standard passage matching.

    Authors: We appreciate this point and will expand §3.2 with concrete examples. For instance, for episodic memory, queries are formulated as 'What was the outcome of the event described in the memory from two weeks ago?' with relevance labels based on whether the passage contains the specific temporal reference. We will include pseudocode for task generation and explain how temporal dependencies are handled by including time-stamped contexts in the retrieval corpus. revision: yes

  3. Referee: [Table 3] Table 3 (Model Rankings): the statement that 'larger models do not always perform better' is supported only by raw scores; without statistical tests, confidence intervals, or ablation on model scale, this observation remains descriptive and does not yet undermine scaling hypotheses.

    Authors: The statement is based on the observed rankings in Table 3, where for example a 1B model outperforms a 7B model on certain tasks. To address the concern, we will include bootstrap confidence intervals for the scores and a note on the lack of consistent scaling, while acknowledging that this is an empirical observation rather than a full refutation of scaling laws. revision: yes

Circularity Check

0 steps flagged

No circularity: new benchmark creation and empirical evaluation against external MTEB

full rationale

The paper constructs LMEB from 22 new datasets spanning episodic, dialogue, semantic, and procedural memory types, then directly evaluates 15 models on 193 zero-shot tasks and compares the resulting scores to published MTEB numbers. This comparison is an external empirical measurement with no equations, fitted parameters, or derivations that reduce to the paper's own inputs. No self-citations are load-bearing for the orthogonality claim, and the work contains no self-definitional steps, ansatz smuggling, or renaming of known results. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on domain assumptions about memory type distinctions and the representativeness of chosen datasets for long-horizon retrieval; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption The four memory types (episodic, dialogue, semantic, procedural) capture distinct and relevant aspects of long-horizon memory retrieval.
    Used to structure the 22 datasets and 193 tasks as described in the abstract.

pith-pipeline@v0.9.0 · 5613 in / 1248 out tokens · 42762 ms · 2026-05-15T12:23:01.082361+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues

    cs.CL 2026-05 unverdicted novelty 7.0

    LongMemEval-V2 is a new benchmark where AgentRunbook-C reaches 72.5% accuracy on long-term agent memory tasks, beating RAG baselines at 48.5% and basic coding agents at 69.3%.

  2. MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models

    cs.IR 2026-04 unverdicted novelty 7.0

    MMEB-V3 benchmark shows omni-modality embedding models fail to enforce instruction-specified modality constraints and exhibit asymmetric, query-biased retrieval.

  3. SelRoute: Query-Type-Aware Routing for Long-Term Conversational Memory Retrieval

    cs.IR 2026-04 conditional novelty 6.0

    SelRoute routes queries to type-specific retrieval pipelines, achieving Recall@5 of 0.800 with a 109M model on LongMemEval_M and outperforming LLM-augmented baselines including a strong zero-ML lexical method.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · cited by 3 Pith papers · 14 internal anchors

  1. [1]

    Mayonlp at semeval-2016 task 1: Semantic textual similarity based on lexical semantic net and deep learning semantic model

    Naveed Afzal, Yanshan Wang, and Hongfang Liu. Mayonlp at semeval-2016 task 1: Semantic textual similarity based on lexical semantic net and deep learning semantic model. InSemEval@NAACL- HLT, pages 674–679. The Association for Computer Linguistics,

  2. [2]

    Cer, Mona T

    Eneko Agirre, Daniel M. Cer, Mona T. Diab, and Aitor Gonzalez-Agirre. Semeval-2012 task 6: A pilot on semantic textual similarity. InSemEval@NAACL-HLT, pages 385–393. The Association for Computer Linguistics,

  3. [3]

    Cer, Mona T

    Eneko Agirre, Daniel M. Cer, Mona T. Diab, Aitor Gonzalez-Agirre, and Weiwei Guo. *sem 2013 shared task: Semantic textual similarity. In*SEM@NAACL-HLT, pages 32–43. Association for Computational Linguistics,

  4. [4]

    Cer, Mona T

    Eneko Agirre, Carmen Banea, Claire Cardie, Daniel M. Cer, Mona T. Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Rada Mihalcea, German Rigau, and Janyce Wiebe. Semeval-2014 task 10: Mul- tilingual semantic textual similarity. InSemEval@COLING, pages 81–91. The Association for Computer Linguistics,

  5. [5]

    Cer, Mona T

    Eneko Agirre, Carmen Banea, Claire Cardie, Daniel M. Cer, Mona T. Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Iñigo Lopez-Gazpio, Montse Maritxalar, Rada Mihalcea, German Rigau, Larraitz Uria, and Janyce Wiebe. Semeval-2015 task 2: Semantic textual similarity, english, spanish and pilot on interpretability. InSemEval@NAACL-HLT, pages 252–263. The Association...

  6. [6]

    jina-embeddings-v5-text: Task-Targeted Embedding Distillation

    URLhttps://arxiv.org/abs/2602.15547. Nick Alonso, Tomas Figliolia, Anthony Ndirango, and Beren Millidge. Toward conversational agents with context and time sensitive long-term memory.CoRR, abs/2406.00057,

  7. [7]

    Remember Me, Refine Me: A Dynamic Procedural Memory Framework for Experience-Driven Agent Evolution

    Zouying Cao, Jiaji Deng, Li Yu, Weikang Zhou, Zhaoyang Liu, Bolin Ding, and Hai Zhao. Remember me, refine me: A dynamic procedural memory framework for experience-driven agent evolution. arXiv preprint arXiv:2512.10696,

  8. [8]

    M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation

    Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. BGE m3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation.CoRR, abs/2402.03216,

  9. [9]

    Air-bench: Automated heterogeneous information retrieval benchmark

    11 Jianlyu Chen, Nan Wang, Chaofan Li, Bo Wang, Shitao Xiao, Han Xiao, Hao Liao, Defu Lian, and Zheng Liu. Air-bench: Automated heterogeneous information retrieval benchmark. InACL (1), pages 19991–20022. Association for Computational Linguistics, 2025a. Silin Chen, Shaoxin Lin, Xiaodong Gu, Yuling Shi, Heng Lian, Longfei Yun, Dong Chen, Weiguo Sun, Lin C...

  10. [10]

    Perltqa: A personal long-term memory dataset for memory classification, retrieval, and synthesis in question answering.CoRR, abs/2402.16288,

    Yiming Du, Hongru Wang, Zhengyi Zhao, Bin Liang, Baojun Wang, Wanjun Zhong, Zezhong Wang, and Kam-Fai Wong. Perltqa: A personal long-term memory dataset for memory classification, retrieval, and synthesis in question answering.CoRR, abs/2402.16288,

  11. [11]

    Rethinking memory in ai: Taxonomy, operations, topics, and future directions.arXiv preprint arXiv:2505.00675, 2025a

    Yiming Du, Wenyu Huang, Danna Zheng, Zhaowei Wang, Sebastien Montella, Mirella Lapata, Kam-Fai Wong, and Jeff Z Pan. Rethinking memory in ai: Taxonomy, operations, topics, and future directions.arXiv preprint arXiv:2505.00675, 2025a. Yiming Du, Wenyu Huang, Danna Zheng, Zhaowei Wang, Sebastien Montella, Mirella Lapata, Kam- Fai Wong, and Jeff Z. Pan. Reth...

  12. [12]

    Memp: Exploring Agent Procedural Memory

    Runnan Fang, Yuan Liang, Xiaobin Wang, Jialong Wu, Shuofei Qiao, Pengjun Xie, Fei Huang, Huajun Chen, and Ningyu Zhang. Memp: Exploring agent procedural memory.CoRR, abs/2508.06433,

  13. [13]

    Human-like episodic memory for infinite context llms

    Zafeirios Fountas, Martin Benfeghoul, Adnan Oomerjee, Fenia Christopoulou, Gerasimos Lampouras, Haitham Bou-Ammar, and Jun Wang. Human-like episodic memory for infinite context llms. CoRR, abs/2407.09450,

  14. [14]

    jina-embeddings-v4: Universal embeddings for multimodal multilingual retrieval.CoRR, abs/2506.18902,

    Michael Günther, Saba Sturua, Mohammad Kalim Akram, Isabelle Mohr, Andrei Ungureanu, Bo Wang, Sedigheh Eslami, Scott Martens, Maximilian Werk, Nan Wang, and Han Xiao. jina-embeddings-v4: Universal embeddings for multimodal multilingual retrieval.CoRR, abs/2506.18902,

  15. [15]

    Kalm-embedding: Superior training data brings A stronger embedding model.CoRR, abs/2501.01028,

    Xinshuo Hu, Zifei Shan, Xinping Zhao, Zetian Sun, Zhenyu Liu, Dongfang Li, Shaolin Ye, Xinyuan Wei, Qian Chen, Baotian Hu, Haofen Wang, Jun Yu, and Min Zhang. Kalm-embedding: Superior training data brings A stronger embedding model.CoRR, abs/2501.01028,

  16. [16]

    A benchmark for procedural memory retrieval in language agents.CoRR, abs/2511.21730,

    Ishant Kohar and Aswanth Krishnan. A benchmark for procedural memory retrieval in language agents.CoRR, abs/2511.21730,

  17. [17]

    Nv-embed: Improved techniques for training llms as generalist embedding models

    Chankyu Lee, Rajarshi Roy, Mengyao Xu, Jonathan Raiman, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Nv-embed: Improved techniques for training llms as generalist embedding models. InICLR. OpenReview.net, 2025a. Dong-Ho Lee, Adyasha Maharana, Jay Pujara, Xiang Ren, and Francesco Barbieri. REALTALK: A 21-day real-world dataset for long-term conversatio...

  18. [18]

    Toward multi-session personalized conversation: A large-scale dataset and hierarchical tree framework for implicit reasoning.CoRR, abs/2503.07018,

    Xintong Li, Jalend Bantupalli, Ria Dharmani, Yuwei Zhang, and Jingbo Shang. Toward multi-session personalized conversation: A large-scale dataset and hierarchical tree framework for implicit reasoning.CoRR, abs/2503.07018,

  19. [19]

    Towards General Text Embeddings with Multi-stage Contrastive Learning

    Zehan Li, Xin Zhang, Yanzhao Zhang, Dingkun Long, Pengjun Xie, and Meishan Zhang. Towards general text embeddings with multi-stage contrastive learning.CoRR, abs/2308.03281,

  20. [20]

    Elias Lumer, Anmol Gulati, Vamse Kumar Subbiah, Pradeep Honaganahalli Basavaraju, and James A. Burke. Memtool: Optimizing short-term memory management for dynamic tool calling in LLM agent multi-turn conversations.CoRR, abs/2507.21428,

  21. [21]

    Vidore benchmark V2: raising the bar for visual retrieval.CoRR, abs/2505.17166,

    Quentin Macé, António Loison, and Manuel Faysse. Vidore benchmark V2: raising the bar for visual retrieval.CoRR, abs/2505.17166,

  22. [22]

    Vlm2vec-v2: Advancing multimodal embedding for videos, images, and visual documents.CoRR, abs/2507.04590,

    Rui Meng, Ziyan Jiang, Ye Liu, Mingyi Su, Xinyi Yang, Yuepeng Fu, Can Qin, Zeyuan Chen, Ran Xu, Caiming Xiong, Yingbo Zhou, Wenhu Chen, and Semih Yavuz. Vlm2vec-v2: Advancing multimodal embedding for videos, images, and visual documents.CoRR, abs/2507.04590,

  23. [23]

    Multi-task contrastive learning for 8192-token bilingual text embeddings

    Isabelle Mohr, Markus Krimmel, Saba Sturua, Mohammad Kalim Akram, Andreas Koukounas, Michael Günther, Georgios Mastrapas, Vinit Ravishankar, Joan Fontanals Martínez, Feng Wang, Qi Liu, Ziniu Yu, Jie Fu, Saahil Ognawala, Susana Guzman, Bo Wang, Maximilian Werk, Nan Wang, and Han Xiao. Multi-task contrastive learning for 8192-token bilingual text embeddings...

  24. [24]

    COVID-QA: A question answering dataset for COVID-19

    Timo Möller, Anthony Reina, Raghavan Jayakumar, and Malte Pietsch. COVID-QA: A question answering dataset for COVID-19. In Karin Verspoor, Kevin Bretonnel Cohen, Mark Dredze, Emilio Ferrara, Jonathan May, Robert Munro, Cecile Paris, and Byron Wallace, editors,Proceedings of the 1st Workshop on NLP for COVID-19 at ACL 2020, Online, July

  25. [25]

    URL https://aclanthology.org/2020.nlpcovid19-acl.18/

    Association for Computational Linguistics. URL https://aclanthology.org/2020.nlpcovid19-acl.18/. 13 Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers. MTEB: massive text embedding benchmark. InEACL, pages 2006–2029. Association for Computational Linguistics,

  26. [26]

    ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory

    GitHub repository, accessed: 2026-03-10. Siru Ouyang, Jun Yan, I-Hung Hsu, Yanfei Chen, Ke Jiang, Zifeng Wang, Rujun Han, Long T. Le, Samira Daruki, Xiangru Tang, Vishy Tirumalashetty, George Lee, Mahsan Rofouei, Hangfei Lin, Jiawei Han, Chen-Yu Lee, and Tomas Pfister. Reasoningbank: Scaling agent self-evolving with reasoning memory.CoRR, abs/2509.25140,

  27. [27]

    Convomem benchmark: Why your first 150 conversations don’t need RAG.CoRR, abs/2511.10523,

    Egor Pakhomov, Erik Nijkamp, and Caiming Xiong. Convomem benchmark: Why your first 150 conversations don’t need RAG.CoRR, abs/2511.10523,

  28. [28]

    Position: Episodic memory is the missing piece for long-term LLM agents.CoRR, abs/2502.06975,

    Mathis Pink, Qinyuan Wu, Vy Ai V o, Javier Turek, Jianing Mu, Alexander Huth, and Mariya Toneva. Position: Episodic memory is the missing piece for long-term LLM agents.CoRR, abs/2502.06975,

  29. [29]

    Retrieval models aren’t tool-savvy: Benchmarking tool retrieval for large lan- guage models

    Zhengliang Shi, Yuhan Wang, Lingyong Yan, Pengjie Ren, Shuaiqiang Wang, Dawei Yin, and Zhaochun Ren. Retrieval models aren’t tool-savvy: Benchmarking tool retrieval for large lan- guage models. InACL (Findings), volume ACL 2025 ofFindings of ACL, pages 24497–24524. Association for Computational Linguistics,

  30. [30]

    Membench: Towards more comprehensive evaluation on the memory of llm-based agents

    Haoran Tan, Zeyu Zhang, Chen Ma, Xu Chen, Quanyu Dai, and Zhenhua Dong. Membench: Towards more comprehensive evaluation on the memory of llm-based agents. InACL (Findings), volume ACL 2025 ofFindings of ACL, pages 19336–19352. Association for Computational Linguistics,

  31. [31]

    BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models

    14 Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. BEIR: A heterogenous benchmark for zero-shot evaluation of information retrieval models.CoRR, abs/2104.08663,

  32. [32]

    EmbeddingGemma: Powerful and Lightweight Text Representations

    URLhttps://arxiv.org/abs/2509.20354. David Wadden, Shanchuan Lin, Kyle Lo, Lucy Lu Wang, Madeleine van Zuylen, Arman Cohan, and Hannaneh Hajishirzi. Fact or fiction: Verifying scientific claims. InEMNLP (1), pages 7534–7550. Association for Computational Linguistics,

  33. [33]

    Text Embeddings by Weakly-Supervised Contrastive Pre-training

    Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Ma- jumder, and Furu Wei. Text embeddings by weakly-supervised contrastive pre-training.CoRR, abs/2212.03533,

  34. [35]

    Multilingual E5 Text Embeddings: A Technical Report

    doi: 10.48550/ARXIV .2402.05672. URLhttps://doi.org/10.48550/arXiv.2402.05672. Qihao Wang, Ziming Cheng, Shuo Zhang, Fan Liu, Rui Xu, Heng Lian, Kunyi Wang, Xiaoming Yu, Jianghao Yin, Sen Hu, et al. Memgovern: Enhancing code agents through learning from governed human experiences.arXiv preprint arXiv:2601.06789,

  35. [36]

    KnowMe-Bench: Benchmarking Person Understanding for Lifelong Digital Companions

    Tingyu Wu, Zhisheng Chen, Ziyan Weng, Shuhe Wang, Chenglong Li, Shuo Zhang, Sen Hu, Silin Wu, Qizhen Lan, Huacan Wang, et al. Knowme-bench: Benchmarking person understanding for lifelong digital companions.arXiv preprint arXiv:2601.04745,

  36. [37]

    Enevoldsen, and Niklas Muennighoff

    15 Chenghao Xiao, Isaac Chung, Imene Kerboua, Jamie Stirling, Xin Zhang, Márton Kardos, Roman Solomatin, Noura Al Moubayed, Kenneth C. Enevoldsen, and Niklas Muennighoff. MIEB: massive image embedding benchmark.CoRR, abs/2504.10471,

  37. [38]

    MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent

    Hongli Yu, Tinghong Chen, Jiangtao Feng, Jiangjie Chen, Weinan Dai, Qiying Yu, Ya-Qin Zhang, Wei-Ying Ma, Jingjing Liu, Mingxuan Wang, and Hao Zhou. Memagent: Reshaping long-context LLM with multi-conv rl-based memory agent.CoRR, abs/2507.02259,

  38. [39]

    On the role of pretrained language models in general-purpose text embeddings: A survey.CoRR, abs/2507.20783, 2025a

    Meishan Zhang, Xin Zhang, Xinping Zhao, Shouzheng Huang, Baotian Hu, and Min Zhang. On the role of pretrained language models in general-purpose text embeddings: A survey.CoRR, abs/2507.20783, 2025a. Xin Zhang, Yanzhao Zhang, Dingkun Long, Wen Xie, Ziqi Dai, Jialong Tang, Huan Lin, Baosong Yang, Pengjun Xie, Fei Huang, Meishan Zhang, Wenjie Li, and Min Zh...

  39. [40]

    Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

    Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou. Qwen3 embedding: Advanc- ing text embedding and reranking through foundation models.CoRR, abs/2506.05176, 2025b. Yinger Zhang, Shutong Jiang, Renhao Li, Jianhong Tu, Yang Su, Lianghao Deng, Xudong Guo, C...

  40. [41]

    Kalm-embedding-v2: Superior training techniques and data inspire a versatile embedding model.arXiv preprint arXiv:2506.20923, 2025

    Xinping Zhao, Xinshuo Hu, Zifei Shan, Shouzheng Huang, Yao Zhou, Zetian Sun, Zhenyu Liu, Dongfang Li, Xinyuan Wei, Qian Chen, Youcheng Pan, Yang Xiang, Meishan Zhang, Haofen Wang, Jun Yu, Baotian Hu, and Min Zhang. Kalm-embedding-v2: Superior training techniques and data inspire A versatile embedding model.CoRR, abs/2506.20923, 2025a. Xinping Zhao, Yan Zh...

  41. [42]

    MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents

    Zijian Zhou, Ao Qu, Zhaoxuan Wu, Sunghwan Kim, Alok Prakash, Daniela Rus, Jinhua Zhao, Bryan Kian Hsiang Low, and Paul Pu Liang. MEM1: learning to synergize memory and reasoning for efficient long-horizon agents.CoRR, abs/2506.15841,

  42. [43]

    A Datasets Below, we list the 22 evaluation datasets in LMEB, spanning four memory types

    16 Dataset Website (Link) Episodic Memory EPBench [Huet et al., 2025]https://doi.org/10.6084/m9.figshare.28244480 KnowMeBench [Wu et al., 2026]https://github.com/QuantaAlpha/KnowMeBench/tree/main/KnowmeBench Dialogue Memory LoCoMo [Maharana et al., 2024]https://github.com/snap-research/locomo/tree/main/data LongMemEval [Wu et al., 2025]https://huggingface...

  43. [44]

    In LMEB, we treat an episodic query as inputand retrieve the corresponding event memories asoutput

    We describe each dataset below: A.1 Episodic Memory Retrieval Episodic memory retrieval aims to recall past events grounded in temporal cues, entities, contents, and spatial contexts [Fountas et al., 2024, Pink et al., 2025]. In LMEB, we treat an episodic query as inputand retrieve the corresponding event memories asoutput. Examples are presented in Table

  44. [45]

    It represents episodic events with structured fields, including temporal and spatial context, involved entities, and detailed descriptions

    EPBench[Huet et al., 2025] is a synthetic episodic memory benchmark for evaluating event recall and episodic reasoning in LLMs. It represents episodic events with structured fields, including temporal and spatial context, involved entities, and detailed descriptions. We use its event set as the corpusCand the provided task queries as queriesQ. KnowMeBench...

  45. [46]

    id": 1,"timestamp

    17 Dataset Query Relevant-Document Granularity EPBench Think about Aurora Chavez’s experiences. De-scribe all the key events they’ve been involvedin, focusing on what happened rather than whenor where it occurred. . . . . . . Aurora implemented blockchain solutions with a determinationthat bordered on obsession. . . . . . . It wasn’t until Samara Bayes ta...

  46. [47]

    It contains 5,049 questions over 1,585 NLP papers, where questions are written from the title and abstract and answered using evidence from the full text

    QASPER[Dasigi et al., 2021] is a question-answering (QA) dataset grounded in full research papers, designed to reflect information-seeking queries that require reasoning across multiple document sections. It contains 5,049 questions over 1,585 NLP papers, where questions are written from the title and abstract and answered using evidence from the full tex...

  47. [48]

    Task Type

    Gorilla[Patil et al., 2024] introduces APIBench, a benchmark covering APIs from HuggingFace, TorchHub, and TensorHub, designed to evaluate tool-use through API call generation. A key component of Gorilla is the integration of a document retriever, which enables models to fetch up-to-date API documentation and adapt to version changes at test time. In LMEB...

  48. [49]

    speaker":

    D Dataset Licenses The authors of 2 datasets in the LMEB benchmark (REALTALK and TMD) do not specify the dataset license in the paper or repository. We summarize the licenses for the remaining datasets below. • EPBench, LongMemEval, MemBench, MLDR, MemGovern: Provided under the MIT License. • KnowMeBench, Covid-QA, Gorilla, ToolBench, ReMe, Proced_mem_ben...

  49. [50]

    He has produced many public memorials and installations inboth England and throughout the United States with subjects ranging from miners,to soldiers and fire fighters

    is a sculptor classically trained inthe traditional methodology of figurative bronze and portrait sculpture living inCarmel, California. He has produced many public memorials and installations inboth England and throughout the United States with subjects ranging from miners,to soldiers and fire fighters. He is credited with over fifty life size and larger...

  50. [51]

    product_id

    Ensure sample mean and variance tests only run when the theoreticalmoments are finite Experience DeepPlanningI’m putting together a completeoutdoor adventure outfit andneed to find three specific itemsonline. First, I’m looking for aWomen’s Rho Hybrid Zip NeckTop that’s proven popular withcustomers - it needs to havemore than 25 four-star reviewsand total...