arxiv: 2603.12572 · v3 · submitted 2026-03-13 · 💻 cs.CL

Recognition: 1 theorem link

· Lean Theorem

LMEB: Long-horizon Memory Embedding Benchmark

Xinping Zhao , Xinshuo Hu , Jiaxin Xu , Danyu Tang , Xin Zhang , Mengjia Zhou , Yan Zhong , Yao Zhou

show 4 more authors

Zifei Shan Meishan Zhang Baotian Hu Min Zhang

Authors on Pith no claims yet

Pith reviewed 2026-05-15 12:23 UTC · model grok-4.3

classification 💻 cs.CL

keywords memoryretrievallmebembeddinglong-horizonmodelstasksbenchmark

0 comments

The pith

LMEB benchmark shows that embedding models' performance on traditional retrieval does not transfer to long-horizon memory tasks, larger models do not always perform better, and LMEB measures capabilities orthogonal to MTEB.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Memory embeddings help AI systems store and retrieve information over time, similar to how humans remember past events or conversations. Current benchmarks for these embeddings mainly test short, straightforward searches for information. This paper argues that real-world use often requires handling information that is spread out, depends on previous context, and spans long periods. To measure this, the authors built LMEB, which includes 22 different datasets turned into 193 retrieval tasks. These tasks cover four kinds of memory: remembering personal experiences, tracking dialogues, recalling facts, and following procedures. They ran tests on 15 popular embedding models of various sizes. The findings indicate that the new benchmark is appropriately challenging, that making models larger does not guarantee better results on these tasks, and that success on older benchmarks does not predict success here. This points to the need for new approaches in building embeddings that work well for complex memory needs. By separating these memory types, the benchmark allows researchers to see where models are strong or weak in different areas. The benchmark is shared publicly to encourage further research and standardized evaluation.

Core claim

LMEB and MTEB measure orthogonal capabilities. This suggests that the field has yet to converge on a universal model capable of excelling across all memory retrieval tasks, and that strong performance on traditional passage retrieval does not necessarily transfer to long-horizon memory retrieval.

Load-bearing premise

The assumption that the 22 datasets and their categorization into four memory types accurately and comprehensively represent the challenges of real-world long-horizon memory retrieval without significant selection bias or overlap.

Figures

Figures reproduced from arXiv: 2603.12572 by Baotian Hu, Danyu Tang, Jiaxin Xu, Meishan Zhang, Mengjia Zhou, Min Zhang, Xinping Zhao, Xinshuo Hu, Xin Zhang, Yan Zhong, Yao Zhou, Zifei Shan.

**Figure 2.** Figure 2: Memory taxonomy of LMEB. In LMEB, we categorize memory into four types, characterized along two key dimensions, as shown in [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Inter-dataset diversity in LMEB. The left side illustrates pairwise weighted Jaccard [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Performance comparison between w/o inst. and w/ inst.. The x-axis represents the two conditions (w/o inst. and w/ inst.), and the y-axis indicates the N@10 performance. 4 Main Results [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Correlation between the evaluation scores on LMEB and MTEB (eng, v2) (retrieval [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 6.** Figure 6: Correlation between the scores on LMEB-Episodic and MTEB (eng, v2) (retrieval subset). [PITH_FULL_IMAGE:figures/full_fig_p034_6.png] view at source ↗

**Figure 7.** Figure 7: Correlation between the scores on LMEB-Dialogue and MTEB (eng, v2) (retrieval subset). [PITH_FULL_IMAGE:figures/full_fig_p034_7.png] view at source ↗

**Figure 8.** Figure 8: Correlation between the scores on LMEB-Semantic and MTEB (eng, v2) (retrieval subset). [PITH_FULL_IMAGE:figures/full_fig_p035_8.png] view at source ↗

**Figure 9.** Figure 9: Correlation between the scores on LMEB-Procedural and MTEB (eng, v2) (retrieval [PITH_FULL_IMAGE:figures/full_fig_p035_9.png] view at source ↗

read the original abstract

Memory embeddings are crucial for memory-augmented systems, such as OpenClaw, but their evaluation is underexplored in current text embedding benchmarks, which narrowly focus on traditional passage retrieval and fail to assess models' ability to handle long-horizon memory retrieval tasks involving fragmented, context-dependent, and temporally distant information. To address this gap, we introduce the Long-horizon Memory Embedding Benchmark (LMEB), a comprehensive framework for evaluating embedding models on complex, long-horizon memory retrieval. LMEB comprises 22 datasets and 193 zero-shot retrieval tasks spanning four memory types: episodic, dialogue, semantic, and procedural. These memory types differ in terms of level of abstraction and temporal dependency, capturing distinct aspects of memory retrieval that reflect the diverse challenges of the real world. We evaluate 15 widely used embedding models, ranging from hundreds of millions to ten billion parameters. The results reveal that (1) LMEB provides a reasonable level of difficulty; (2) Larger models do not always perform better; (3) LMEB and MTEB measure orthogonal capabilities. This suggests that the field has yet to converge on a universal model capable of excelling across all memory retrieval tasks, and that strong performance on traditional passage retrieval does not necessarily transfer to long-horizon memory retrieval. LMEB provides a standardized and reproducible framework that fills a key gap in memory embedding evaluation and supports future advances in long-term, context-dependent retrieval. LMEB is available at https://kalm-embedding.github.io/LMEB.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

LMEB adds a benchmark for long-horizon memory retrieval but its claim of orthogonality to MTEB needs verification on dataset overlap.

read the letter

LMEB is a new benchmark for evaluating text embeddings on long-horizon memory retrieval tasks that standard benchmarks like MTEB overlook. The authors collected 22 datasets into 193 zero-shot tasks split into episodic, dialogue, semantic, and procedural memory types, then tested 15 models from hundreds of millions to 10 billion parameters. They report that the tasks are reasonably difficult, larger models do not always do better, and performance on LMEB does not track with MTEB scores. The work does a solid job highlighting a practical gap in how we test embeddings for memory-augmented systems. Creating tasks that require handling fragmented and temporally distant information is useful, and releasing the benchmark publicly gives others a way to measure progress on this. The finding that model scale does not guarantee better results on these tasks is worth noting, as it challenges some scaling assumptions in the area. The main soft spot is around the orthogonality claim. The paper says LMEB and MTEB measure different things based on the model score correlations, but this only holds if the underlying data and retrieval patterns are genuinely separate. The stress-test concern is fair: without a clear audit showing no shared documents, similar queries, or overlapping sources from MTEB, the lack of correlation could come from how the tasks were picked rather than from distinct model capabilities. The abstract also skips specifics on exact metrics, data selection rules, and how the four memory types were operationalized, which makes it harder to judge if the evaluation is robust. The categorization into four memory types based on abstraction and temporal dependency sounds reasonable on paper, but without seeing how individual datasets map to these or any checks for overlap within the benchmark itself, it's tough to know if the categories capture distinct challenges. This paper is aimed at researchers working on embeddings for long-context or memory-heavy applications. Anyone building retrieval systems that need to handle extended, context-dependent information would find the tasks and baseline numbers useful. It is not a complete solution but a starting point for better evaluation. I would send it to peer review. The core idea addresses a real limitation in current benchmarks, and the empirical results on the 15 models provide something concrete to discuss, even if the authors need to add more documentation on task construction and overlap checks to strengthen the orthogonality argument.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces the Long-horizon Memory Embedding Benchmark (LMEB), comprising 22 datasets and 193 zero-shot retrieval tasks across four memory types (episodic, dialogue, semantic, procedural). It evaluates 15 embedding models ranging from hundreds of millions to 10B parameters and reports three main findings: LMEB presents reasonable difficulty, larger models do not always outperform smaller ones, and LMEB scores are orthogonal to those on MTEB, implying that strong passage-retrieval performance does not transfer to long-horizon memory retrieval.

Significance. If the orthogonality result holds after verification of task disjointness, the work would be significant: it supplies a reproducible framework that exposes a gap in current embedding benchmarks and could steer development of models for temporally distant, context-dependent retrieval in memory-augmented systems.

major comments (3)

[§4] §4 (Dataset Curation): the paper provides no explicit audit or overlap analysis between the 22 LMEB datasets and MTEB corpora or task templates; without this, the central claim that LMEB and MTEB measure orthogonal capabilities (reported in §5.3 and Figure 3) cannot be distinguished from a curation artifact.
[§3.2] §3.2 (Task Construction): the description of how the 193 zero-shot tasks are generated from the four memory types lacks concrete details on query formulation, relevance labeling, and temporal-dependency handling, which are required to assess whether the evaluation genuinely tests long-horizon retrieval rather than standard passage matching.
[Table 3] Table 3 (Model Rankings): the statement that 'larger models do not always perform better' is supported only by raw scores; without statistical tests, confidence intervals, or ablation on model scale, this observation remains descriptive and does not yet undermine scaling hypotheses.

minor comments (2)

[Abstract] Abstract: the phrase 'reasonable level of difficulty' is used without reference to a quantitative baseline (e.g., random or BM25 performance) that would allow readers to interpret the reported numbers.
[§6] §6 (Reproducibility): the GitHub link is given, but the paper should list exact preprocessing steps, data splits, and prompt templates used for the 193 tasks to ensure full reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each of the major comments below and have revised the paper accordingly to incorporate additional analyses and details.

read point-by-point responses

Referee: [§4] §4 (Dataset Curation): the paper provides no explicit audit or overlap analysis between the 22 LMEB datasets and MTEB corpora or task templates; without this, the central claim that LMEB and MTEB measure orthogonal capabilities (reported in §5.3 and Figure 3) cannot be distinguished from a curation artifact.

Authors: We agree that an explicit audit for overlap with MTEB was missing. In the revised manuscript, we will add a new subsection in §4 detailing the overlap analysis. Our preliminary check shows that LMEB datasets are sourced from distinct domains (e.g., personal memory logs, dialogue histories) with no direct overlap in documents or templates with MTEB tasks, supporting that the orthogonality is not merely a curation artifact. revision: yes
Referee: [§3.2] §3.2 (Task Construction): the description of how the 193 zero-shot tasks are generated from the four memory types lacks concrete details on query formulation, relevance labeling, and temporal-dependency handling, which are required to assess whether the evaluation genuinely tests long-horizon retrieval rather than standard passage matching.

Authors: We appreciate this point and will expand §3.2 with concrete examples. For instance, for episodic memory, queries are formulated as 'What was the outcome of the event described in the memory from two weeks ago?' with relevance labels based on whether the passage contains the specific temporal reference. We will include pseudocode for task generation and explain how temporal dependencies are handled by including time-stamped contexts in the retrieval corpus. revision: yes
Referee: [Table 3] Table 3 (Model Rankings): the statement that 'larger models do not always perform better' is supported only by raw scores; without statistical tests, confidence intervals, or ablation on model scale, this observation remains descriptive and does not yet undermine scaling hypotheses.

Authors: The statement is based on the observed rankings in Table 3, where for example a 1B model outperforms a 7B model on certain tasks. To address the concern, we will include bootstrap confidence intervals for the scores and a note on the lack of consistent scaling, while acknowledging that this is an empirical observation rather than a full refutation of scaling laws. revision: yes

Circularity Check

0 steps flagged

No circularity: new benchmark creation and empirical evaluation against external MTEB

full rationale

The paper constructs LMEB from 22 new datasets spanning episodic, dialogue, semantic, and procedural memory types, then directly evaluates 15 models on 193 zero-shot tasks and compares the resulting scores to published MTEB numbers. This comparison is an external empirical measurement with no equations, fitted parameters, or derivations that reduce to the paper's own inputs. No self-citations are load-bearing for the orthogonality claim, and the work contains no self-definitional steps, ansatz smuggling, or renaming of known results. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on domain assumptions about memory type distinctions and the representativeness of chosen datasets for long-horizon retrieval; no free parameters or invented entities are introduced.

axioms (1)

domain assumption The four memory types (episodic, dialogue, semantic, procedural) capture distinct and relevant aspects of long-horizon memory retrieval.
Used to structure the 22 datasets and 193 tasks as described in the abstract.

pith-pipeline@v0.9.0 · 5613 in / 1248 out tokens · 42762 ms · 2026-05-15T12:23:01.082361+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

LMEB comprises 22 datasets and 193 zero-shot retrieval tasks spanning four memory types: episodic, dialogue, semantic, and procedural.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues
cs.CL 2026-05 unverdicted novelty 7.0

LongMemEval-V2 is a new benchmark where AgentRunbook-C reaches 72.5% accuracy on long-term agent memory tasks, beating RAG baselines at 48.5% and basic coding agents at 69.3%.
MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models
cs.IR 2026-04 unverdicted novelty 7.0

MMEB-V3 benchmark shows omni-modality embedding models fail to enforce instruction-specified modality constraints and exhibit asymmetric, query-biased retrieval.
SelRoute: Query-Type-Aware Routing for Long-Term Conversational Memory Retrieval
cs.IR 2026-04 conditional novelty 6.0

SelRoute routes queries to type-specific retrieval pipelines, achieving Recall@5 of 0.800 with a 109M model on LongMemEval_M and outperforming LLM-augmented baselines including a strong zero-ML lexical method.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · cited by 3 Pith papers · 14 internal anchors

[1]

Mayonlp at semeval-2016 task 1: Semantic textual similarity based on lexical semantic net and deep learning semantic model

Naveed Afzal, Yanshan Wang, and Hongfang Liu. Mayonlp at semeval-2016 task 1: Semantic textual similarity based on lexical semantic net and deep learning semantic model. InSemEval@NAACL- HLT, pages 674–679. The Association for Computer Linguistics,

work page 2016
[2]

Cer, Mona T

Eneko Agirre, Daniel M. Cer, Mona T. Diab, and Aitor Gonzalez-Agirre. Semeval-2012 task 6: A pilot on semantic textual similarity. InSemEval@NAACL-HLT, pages 385–393. The Association for Computer Linguistics,

work page 2012
[3]

Cer, Mona T

Eneko Agirre, Daniel M. Cer, Mona T. Diab, Aitor Gonzalez-Agirre, and Weiwei Guo. *sem 2013 shared task: Semantic textual similarity. In*SEM@NAACL-HLT, pages 32–43. Association for Computational Linguistics,

work page 2013
[4]

Cer, Mona T

Eneko Agirre, Carmen Banea, Claire Cardie, Daniel M. Cer, Mona T. Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Rada Mihalcea, German Rigau, and Janyce Wiebe. Semeval-2014 task 10: Mul- tilingual semantic textual similarity. InSemEval@COLING, pages 81–91. The Association for Computer Linguistics,

work page 2014
[5]

Cer, Mona T

Eneko Agirre, Carmen Banea, Claire Cardie, Daniel M. Cer, Mona T. Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Iñigo Lopez-Gazpio, Montse Maritxalar, Rada Mihalcea, German Rigau, Larraitz Uria, and Janyce Wiebe. Semeval-2015 task 2: Semantic textual similarity, english, spanish and pilot on interpretability. InSemEval@NAACL-HLT, pages 252–263. The Association...

work page 2015
[6]

jina-embeddings-v5-text: Task-Targeted Embedding Distillation

URLhttps://arxiv.org/abs/2602.15547. Nick Alonso, Tomas Figliolia, Anthony Ndirango, and Beren Millidge. Toward conversational agents with context and time sensitive long-term memory.CoRR, abs/2406.00057,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Remember Me, Refine Me: A Dynamic Procedural Memory Framework for Experience-Driven Agent Evolution

Zouying Cao, Jiaji Deng, Li Yu, Weikang Zhou, Zhaoyang Liu, Bolin Ding, and Hai Zhao. Remember me, refine me: A dynamic procedural memory framework for experience-driven agent evolution. arXiv preprint arXiv:2512.10696,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation

Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. BGE m3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation.CoRR, abs/2402.03216,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

Air-bench: Automated heterogeneous information retrieval benchmark

11 Jianlyu Chen, Nan Wang, Chaofan Li, Bo Wang, Shitao Xiao, Han Xiao, Hao Liao, Defu Lian, and Zheng Liu. Air-bench: Automated heterogeneous information retrieval benchmark. InACL (1), pages 19991–20022. Association for Computational Linguistics, 2025a. Silin Chen, Shaoxin Lin, Xiaodong Gu, Yuling Shi, Heng Lian, Longfei Yun, Dong Chen, Weiguo Sun, Lin C...

work page arXiv
[10]

Perltqa: A personal long-term memory dataset for memory classification, retrieval, and synthesis in question answering.CoRR, abs/2402.16288,

Yiming Du, Hongru Wang, Zhengyi Zhao, Bin Liang, Baojun Wang, Wanjun Zhong, Zezhong Wang, and Kam-Fai Wong. Perltqa: A personal long-term memory dataset for memory classification, retrieval, and synthesis in question answering.CoRR, abs/2402.16288,

work page arXiv
[11]

Rethinking memory in ai: Taxonomy, operations, topics, and future directions.arXiv preprint arXiv:2505.00675, 2025a

Yiming Du, Wenyu Huang, Danna Zheng, Zhaowei Wang, Sebastien Montella, Mirella Lapata, Kam-Fai Wong, and Jeff Z Pan. Rethinking memory in ai: Taxonomy, operations, topics, and future directions.arXiv preprint arXiv:2505.00675, 2025a. Yiming Du, Wenyu Huang, Danna Zheng, Zhaowei Wang, Sebastien Montella, Mirella Lapata, Kam- Fai Wong, and Jeff Z. Pan. Reth...

work page arXiv
[12]

Memp: Exploring Agent Procedural Memory

Runnan Fang, Yuan Liang, Xiaobin Wang, Jialong Wu, Shuofei Qiao, Pengjun Xie, Fei Huang, Huajun Chen, and Ningyu Zhang. Memp: Exploring agent procedural memory.CoRR, abs/2508.06433,

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Human-like episodic memory for infinite context llms

Zafeirios Fountas, Martin Benfeghoul, Adnan Oomerjee, Fenia Christopoulou, Gerasimos Lampouras, Haitham Bou-Ammar, and Jun Wang. Human-like episodic memory for infinite context llms. CoRR, abs/2407.09450,

work page arXiv
[14]

jina-embeddings-v4: Universal embeddings for multimodal multilingual retrieval.CoRR, abs/2506.18902,

Michael Günther, Saba Sturua, Mohammad Kalim Akram, Isabelle Mohr, Andrei Ungureanu, Bo Wang, Sedigheh Eslami, Scott Martens, Maximilian Werk, Nan Wang, and Han Xiao. jina-embeddings-v4: Universal embeddings for multimodal multilingual retrieval.CoRR, abs/2506.18902,

work page arXiv
[15]

Kalm-embedding: Superior training data brings A stronger embedding model.CoRR, abs/2501.01028,

Xinshuo Hu, Zifei Shan, Xinping Zhao, Zetian Sun, Zhenyu Liu, Dongfang Li, Shaolin Ye, Xinyuan Wei, Qian Chen, Baotian Hu, Haofen Wang, Jun Yu, and Min Zhang. Kalm-embedding: Superior training data brings A stronger embedding model.CoRR, abs/2501.01028,

work page arXiv
[16]

A benchmark for procedural memory retrieval in language agents.CoRR, abs/2511.21730,

Ishant Kohar and Aswanth Krishnan. A benchmark for procedural memory retrieval in language agents.CoRR, abs/2511.21730,

work page arXiv
[17]

Nv-embed: Improved techniques for training llms as generalist embedding models

Chankyu Lee, Rajarshi Roy, Mengyao Xu, Jonathan Raiman, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Nv-embed: Improved techniques for training llms as generalist embedding models. InICLR. OpenReview.net, 2025a. Dong-Ho Lee, Adyasha Maharana, Jay Pujara, Xiang Ren, and Francesco Barbieri. REALTALK: A 21-day real-world dataset for long-term conversatio...

work page arXiv
[18]

Toward multi-session personalized conversation: A large-scale dataset and hierarchical tree framework for implicit reasoning.CoRR, abs/2503.07018,

Xintong Li, Jalend Bantupalli, Ria Dharmani, Yuwei Zhang, and Jingbo Shang. Toward multi-session personalized conversation: A large-scale dataset and hierarchical tree framework for implicit reasoning.CoRR, abs/2503.07018,

work page arXiv
[19]

Towards General Text Embeddings with Multi-stage Contrastive Learning

Zehan Li, Xin Zhang, Yanzhao Zhang, Dingkun Long, Pengjun Xie, and Meishan Zhang. Towards general text embeddings with multi-stage contrastive learning.CoRR, abs/2308.03281,

work page internal anchor Pith review Pith/arXiv arXiv
[20]

Elias Lumer, Anmol Gulati, Vamse Kumar Subbiah, Pradeep Honaganahalli Basavaraju, and James A. Burke. Memtool: Optimizing short-term memory management for dynamic tool calling in LLM agent multi-turn conversations.CoRR, abs/2507.21428,

work page arXiv
[21]

Vidore benchmark V2: raising the bar for visual retrieval.CoRR, abs/2505.17166,

Quentin Macé, António Loison, and Manuel Faysse. Vidore benchmark V2: raising the bar for visual retrieval.CoRR, abs/2505.17166,

work page arXiv
[22]

Vlm2vec-v2: Advancing multimodal embedding for videos, images, and visual documents.CoRR, abs/2507.04590,

Rui Meng, Ziyan Jiang, Ye Liu, Mingyi Su, Xinyi Yang, Yuepeng Fu, Can Qin, Zeyuan Chen, Ran Xu, Caiming Xiong, Yingbo Zhou, Wenhu Chen, and Semih Yavuz. Vlm2vec-v2: Advancing multimodal embedding for videos, images, and visual documents.CoRR, abs/2507.04590,

work page arXiv
[23]

Multi-task contrastive learning for 8192-token bilingual text embeddings

Isabelle Mohr, Markus Krimmel, Saba Sturua, Mohammad Kalim Akram, Andreas Koukounas, Michael Günther, Georgios Mastrapas, Vinit Ravishankar, Joan Fontanals Martínez, Feng Wang, Qi Liu, Ziniu Yu, Jie Fu, Saahil Ognawala, Susana Guzman, Bo Wang, Maximilian Werk, Nan Wang, and Han Xiao. Multi-task contrastive learning for 8192-token bilingual text embeddings...

work page arXiv
[24]

COVID-QA: A question answering dataset for COVID-19

Timo Möller, Anthony Reina, Raghavan Jayakumar, and Malte Pietsch. COVID-QA: A question answering dataset for COVID-19. In Karin Verspoor, Kevin Bretonnel Cohen, Mark Dredze, Emilio Ferrara, Jonathan May, Robert Munro, Cecile Paris, and Byron Wallace, editors,Proceedings of the 1st Workshop on NLP for COVID-19 at ACL 2020, Online, July

work page 2020
[25]

URL https://aclanthology.org/2020.nlpcovid19-acl.18/

Association for Computational Linguistics. URL https://aclanthology.org/2020.nlpcovid19-acl.18/. 13 Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers. MTEB: massive text embedding benchmark. InEACL, pages 2006–2029. Association for Computational Linguistics,

work page 2020
[26]

ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory

GitHub repository, accessed: 2026-03-10. Siru Ouyang, Jun Yan, I-Hung Hsu, Yanfei Chen, Ke Jiang, Zifeng Wang, Rujun Han, Long T. Le, Samira Daruki, Xiangru Tang, Vishy Tirumalashetty, George Lee, Mahsan Rofouei, Hangfei Lin, Jiawei Han, Chen-Yu Lee, and Tomas Pfister. Reasoningbank: Scaling agent self-evolving with reasoning memory.CoRR, abs/2509.25140,

work page internal anchor Pith review Pith/arXiv arXiv 2026
[27]

Convomem benchmark: Why your first 150 conversations don’t need RAG.CoRR, abs/2511.10523,

Egor Pakhomov, Erik Nijkamp, and Caiming Xiong. Convomem benchmark: Why your first 150 conversations don’t need RAG.CoRR, abs/2511.10523,

work page arXiv
[28]

Position: Episodic memory is the missing piece for long-term LLM agents.CoRR, abs/2502.06975,

Mathis Pink, Qinyuan Wu, Vy Ai V o, Javier Turek, Jianing Mu, Alexander Huth, and Mariya Toneva. Position: Episodic memory is the missing piece for long-term LLM agents.CoRR, abs/2502.06975,

work page arXiv
[29]

Retrieval models aren’t tool-savvy: Benchmarking tool retrieval for large lan- guage models

Zhengliang Shi, Yuhan Wang, Lingyong Yan, Pengjie Ren, Shuaiqiang Wang, Dawei Yin, and Zhaochun Ren. Retrieval models aren’t tool-savvy: Benchmarking tool retrieval for large lan- guage models. InACL (Findings), volume ACL 2025 ofFindings of ACL, pages 24497–24524. Association for Computational Linguistics,

work page 2025
[30]

Membench: Towards more comprehensive evaluation on the memory of llm-based agents

Haoran Tan, Zeyu Zhang, Chen Ma, Xu Chen, Quanyu Dai, and Zhenhua Dong. Membench: Towards more comprehensive evaluation on the memory of llm-based agents. InACL (Findings), volume ACL 2025 ofFindings of ACL, pages 19336–19352. Association for Computational Linguistics,

work page 2025
[31]

BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models

14 Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. BEIR: A heterogenous benchmark for zero-shot evaluation of information retrieval models.CoRR, abs/2104.08663,

work page internal anchor Pith review Pith/arXiv arXiv
[32]

EmbeddingGemma: Powerful and Lightweight Text Representations

URLhttps://arxiv.org/abs/2509.20354. David Wadden, Shanchuan Lin, Kyle Lo, Lucy Lu Wang, Madeleine van Zuylen, Arman Cohan, and Hannaneh Hajishirzi. Fact or fiction: Verifying scientific claims. InEMNLP (1), pages 7534–7550. Association for Computational Linguistics,

work page internal anchor Pith review Pith/arXiv arXiv
[33]

Text Embeddings by Weakly-Supervised Contrastive Pre-training

Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Ma- jumder, and Furu Wei. Text embeddings by weakly-supervised contrastive pre-training.CoRR, abs/2212.03533,

work page internal anchor Pith review Pith/arXiv arXiv
[35]

Multilingual E5 Text Embeddings: A Technical Report

doi: 10.48550/ARXIV .2402.05672. URLhttps://doi.org/10.48550/arXiv.2402.05672. Qihao Wang, Ziming Cheng, Shuo Zhang, Fan Liu, Rui Xu, Heng Lian, Kunyi Wang, Xiaoming Yu, Jianghao Yin, Sen Hu, et al. Memgovern: Enhancing code agents through learning from governed human experiences.arXiv preprint arXiv:2601.06789,

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv
[36]

KnowMe-Bench: Benchmarking Person Understanding for Lifelong Digital Companions

Tingyu Wu, Zhisheng Chen, Ziyan Weng, Shuhe Wang, Chenglong Li, Shuo Zhang, Sen Hu, Silin Wu, Qizhen Lan, Huacan Wang, et al. Knowme-bench: Benchmarking person understanding for lifelong digital companions.arXiv preprint arXiv:2601.04745,

work page internal anchor Pith review Pith/arXiv arXiv
[37]

Enevoldsen, and Niklas Muennighoff

15 Chenghao Xiao, Isaac Chung, Imene Kerboua, Jamie Stirling, Xin Zhang, Márton Kardos, Roman Solomatin, Noura Al Moubayed, Kenneth C. Enevoldsen, and Niklas Muennighoff. MIEB: massive image embedding benchmark.CoRR, abs/2504.10471,

work page arXiv
[38]

MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent

Hongli Yu, Tinghong Chen, Jiangtao Feng, Jiangjie Chen, Weinan Dai, Qiying Yu, Ya-Qin Zhang, Wei-Ying Ma, Jingjing Liu, Mingxuan Wang, and Hao Zhou. Memagent: Reshaping long-context LLM with multi-conv rl-based memory agent.CoRR, abs/2507.02259,

work page internal anchor Pith review Pith/arXiv arXiv
[39]

On the role of pretrained language models in general-purpose text embeddings: A survey.CoRR, abs/2507.20783, 2025a

Meishan Zhang, Xin Zhang, Xinping Zhao, Shouzheng Huang, Baotian Hu, and Min Zhang. On the role of pretrained language models in general-purpose text embeddings: A survey.CoRR, abs/2507.20783, 2025a. Xin Zhang, Yanzhao Zhang, Dingkun Long, Wen Xie, Ziqi Dai, Jialong Tang, Huan Lin, Baosong Yang, Pengjun Xie, Fei Huang, Meishan Zhang, Wenjie Li, and Min Zh...

work page arXiv
[40]

Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou. Qwen3 embedding: Advanc- ing text embedding and reranking through foundation models.CoRR, abs/2506.05176, 2025b. Yinger Zhang, Shutong Jiang, Renhao Li, Jianhong Tu, Yang Su, Lianghao Deng, Xudong Guo, C...

work page internal anchor Pith review Pith/arXiv arXiv
[41]

Kalm-embedding-v2: Superior training techniques and data inspire a versatile embedding model.arXiv preprint arXiv:2506.20923, 2025

Xinping Zhao, Xinshuo Hu, Zifei Shan, Shouzheng Huang, Yao Zhou, Zetian Sun, Zhenyu Liu, Dongfang Li, Xinyuan Wei, Qian Chen, Youcheng Pan, Yang Xiang, Meishan Zhang, Haofen Wang, Jun Yu, Baotian Hu, and Min Zhang. Kalm-embedding-v2: Superior training techniques and data inspire A versatile embedding model.CoRR, abs/2506.20923, 2025a. Xinping Zhao, Yan Zh...

work page arXiv 2025
[42]

MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents

Zijian Zhou, Ao Qu, Zhaoxuan Wu, Sunghwan Kim, Alok Prakash, Daniela Rus, Jinhua Zhao, Bryan Kian Hsiang Low, and Paul Pu Liang. MEM1: learning to synergize memory and reasoning for efficient long-horizon agents.CoRR, abs/2506.15841,

work page internal anchor Pith review Pith/arXiv arXiv
[43]

A Datasets Below, we list the 22 evaluation datasets in LMEB, spanning four memory types

16 Dataset Website (Link) Episodic Memory EPBench [Huet et al., 2025]https://doi.org/10.6084/m9.figshare.28244480 KnowMeBench [Wu et al., 2026]https://github.com/QuantaAlpha/KnowMeBench/tree/main/KnowmeBench Dialogue Memory LoCoMo [Maharana et al., 2024]https://github.com/snap-research/locomo/tree/main/data LongMemEval [Wu et al., 2025]https://huggingface...

work page doi:10.6084/m9.figshare.28244480 2025
[44]

In LMEB, we treat an episodic query as inputand retrieve the corresponding event memories asoutput

We describe each dataset below: A.1 Episodic Memory Retrieval Episodic memory retrieval aims to recall past events grounded in temporal cues, entities, contents, and spatial contexts [Fountas et al., 2024, Pink et al., 2025]. In LMEB, we treat an episodic query as inputand retrieve the corresponding event memories asoutput. Examples are presented in Table

work page 2024
[45]

It represents episodic events with structured fields, including temporal and spatial context, involved entities, and detailed descriptions

EPBench[Huet et al., 2025] is a synthetic episodic memory benchmark for evaluating event recall and episodic reasoning in LLMs. It represents episodic events with structured fields, including temporal and spatial context, involved entities, and detailed descriptions. We use its event set as the corpusCand the provided task queries as queriesQ. KnowMeBench...

work page 2025
[46]

id": 1,"timestamp

17 Dataset Query Relevant-Document Granularity EPBench Think about Aurora Chavez’s experiences. De-scribe all the key events they’ve been involvedin, focusing on what happened rather than whenor where it occurred. . . . . . . Aurora implemented blockchain solutions with a determinationthat bordered on obsession. . . . . . . It wasn’t until Samara Bayes ta...

work page 1969
[47]

It contains 5,049 questions over 1,585 NLP papers, where questions are written from the title and abstract and answered using evidence from the full text

QASPER[Dasigi et al., 2021] is a question-answering (QA) dataset grounded in full research papers, designed to reflect information-seeking queries that require reasoning across multiple document sections. It contains 5,049 questions over 1,585 NLP papers, where questions are written from the title and abstract and answered using evidence from the full tex...

work page 2021
[48]

Task Type

Gorilla[Patil et al., 2024] introduces APIBench, a benchmark covering APIs from HuggingFace, TorchHub, and TensorHub, designed to evaluate tool-use through API call generation. A key component of Gorilla is the integration of a document retriever, which enables models to fetch up-to-date API documentation and adapt to version changes at test time. In LMEB...

work page 2024
[49]

speaker":

D Dataset Licenses The authors of 2 datasets in the LMEB benchmark (REALTALK and TMD) do not specify the dataset license in the paper or repository. We summarize the licenses for the remaining datasets below. • EPBench, LongMemEval, MemBench, MLDR, MemGovern: Provided under the MIT License. • KnowMeBench, Covid-QA, Gorilla, ToolBench, ReMe, Proced_mem_ben...

work page 2010
[50]

He has produced many public memorials and installations inboth England and throughout the United States with subjects ranging from miners,to soldiers and fire fighters

is a sculptor classically trained inthe traditional methodology of figurative bronze and portrait sculpture living inCarmel, California. He has produced many public memorials and installations inboth England and throughout the United States with subjects ranging from miners,to soldiers and fire fighters. He is credited with over fifty life size and larger...

work page 1948
[51]

product_id

Ensure sample mean and variance tests only run when the theoreticalmoments are finite Experience DeepPlanningI’m putting together a completeoutdoor adventure outfit andneed to find three specific itemsonline. First, I’m looking for aWomen’s Rho Hybrid Zip NeckTop that’s proven popular withcustomers - it needs to havemore than 25 four-star reviewsand total...

work page 2026