Recognition: 1 theorem link · Lean theorem
LMEB: Long-horizon Memory Embedding Benchmark
Pith reviewed 2026-05-15 12:23 UTC · model grok-4.3
The pith
The LMEB benchmark shows that embedding models' performance on traditional retrieval does not transfer to long-horizon memory tasks, that larger models do not always perform better, and that LMEB measures capabilities orthogonal to MTEB.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LMEB and MTEB measure orthogonal capabilities. This suggests that the field has yet to converge on a universal model capable of excelling across all memory retrieval tasks, and that strong performance on traditional passage retrieval does not necessarily transfer to long-horizon memory retrieval.
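How "orthogonal" is operationalized is not stated in this excerpt; the natural reading is a weak rank correlation between model orderings on the two benchmarks. A minimal sketch, assuming Spearman correlation over per-model benchmark averages (the scores below are placeholders, not values from the paper):

```python
# Minimal sketch: read "orthogonality" as near-zero rank correlation
# between model orderings on MTEB and LMEB. Scores are placeholders.
from scipy.stats import spearmanr

models = ["model-a", "model-b", "model-c", "model-d", "model-e"]
mteb_scores = [64.1, 62.3, 59.8, 57.2, 55.0]  # hypothetical MTEB averages
lmeb_scores = [41.5, 48.9, 39.2, 47.1, 44.0]  # hypothetical LMEB averages

rho, p = spearmanr(mteb_scores, lmeb_scores)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
# rho near 0 is consistent with orthogonal capabilities; rho near 1 would
# mean MTEB rank largely predicts LMEB rank.
```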
Load-bearing premise
The assumption that the 22 datasets and their categorization into four memory types accurately and comprehensively represent the challenges of real-world long-horizon memory retrieval without significant selection bias or overlap.
Original abstract
Memory embeddings are crucial for memory-augmented systems, such as OpenClaw, but their evaluation is underexplored in current text embedding benchmarks, which narrowly focus on traditional passage retrieval and fail to assess models' ability to handle long-horizon memory retrieval tasks involving fragmented, context-dependent, and temporally distant information. To address this gap, we introduce the Long-horizon Memory Embedding Benchmark (LMEB), a comprehensive framework for evaluating embedding models on complex, long-horizon memory retrieval. LMEB comprises 22 datasets and 193 zero-shot retrieval tasks spanning four memory types: episodic, dialogue, semantic, and procedural. These memory types differ in terms of level of abstraction and temporal dependency, capturing distinct aspects of memory retrieval that reflect the diverse challenges of the real world. We evaluate 15 widely used embedding models, ranging from hundreds of millions to ten billion parameters. The results reveal that (1) LMEB provides a reasonable level of difficulty; (2) Larger models do not always perform better; (3) LMEB and MTEB measure orthogonal capabilities. This suggests that the field has yet to converge on a universal model capable of excelling across all memory retrieval tasks, and that strong performance on traditional passage retrieval does not necessarily transfer to long-horizon memory retrieval. LMEB provides a standardized and reproducible framework that fills a key gap in memory embedding evaluation and supports future advances in long-term, context-dependent retrieval. LMEB is available at https://kalm-embedding.github.io/LMEB.github.io/.
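To make the setup concrete, each of the 193 tasks is a zero-shot retrieval problem over a memory corpus. The sketch below scores one such task end to end, assuming cosine-similarity ranking and nDCG@10 (a standard choice in BEIR/MTEB-style evaluation; the abstract does not name the metric). The `embed` function is a hypothetical stand-in for any of the evaluated models.

```python
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    # Hypothetical stand-in: swap in a real model's encoder here.
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 384))

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    gains = [1.0 if d in relevant_ids else 0.0 for d in ranked_ids[:k]]
    dcg = sum(g / np.log2(i + 2) for i, g in enumerate(gains))
    ideal = sum(1.0 / np.log2(i + 2) for i in range(min(len(relevant_ids), k)))
    return dcg / ideal if ideal > 0 else 0.0

corpus = {"m1": "Dinner with Sam two weeks ago ...", "m2": "Fixed the CI pipeline ..."}
queries = {"q1": "What happened at the dinner with Sam?"}
qrels = {"q1": {"m1"}}  # relevance labels for the query

doc_ids = list(corpus)
D = embed([corpus[d] for d in doc_ids])
Q = embed(list(queries.values()))
D /= np.linalg.norm(D, axis=1, keepdims=True)  # normalize for cosine similarity
Q /= np.linalg.norm(Q, axis=1, keepdims=True)

for qi, qid in enumerate(queries):
    order = np.argsort(-(Q[qi] @ D.T))  # rank documents by similarity
    ranked = [doc_ids[i] for i in order]
    print(qid, "nDCG@10 =", ndcg_at_k(ranked, qrels[qid]))
```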
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the Long-horizon Memory Embedding Benchmark (LMEB), comprising 22 datasets and 193 zero-shot retrieval tasks across four memory types (episodic, dialogue, semantic, procedural). It evaluates 15 embedding models ranging from hundreds of millions to 10B parameters and reports three main findings: LMEB presents reasonable difficulty, larger models do not always outperform smaller ones, and LMEB scores are orthogonal to those on MTEB, implying that strong passage-retrieval performance does not transfer to long-horizon memory retrieval.
Significance. If the orthogonality result holds after verification of task disjointness, the work would be significant: it supplies a reproducible framework that exposes a gap in current embedding benchmarks and could steer development of models for temporally distant, context-dependent retrieval in memory-augmented systems.
major comments (3)
- [§4] §4 (Dataset Curation): the paper provides no explicit audit or overlap analysis between the 22 LMEB datasets and MTEB corpora or task templates; without this, the central claim that LMEB and MTEB measure orthogonal capabilities (reported in §5.3 and Figure 3) cannot be distinguished from a curation artifact.
- [§3.2] §3.2 (Task Construction): the description of how the 193 zero-shot tasks are generated from the four memory types lacks concrete details on query formulation, relevance labeling, and temporal-dependency handling, which are required to assess whether the evaluation genuinely tests long-horizon retrieval rather than standard passage matching.
- [Table 3] Table 3 (Model Rankings): the statement that 'larger models do not always perform better' is supported only by raw scores; without statistical tests, confidence intervals, or ablation on model scale, this observation remains descriptive and does not yet undermine scaling hypotheses.
minor comments (2)
- [Abstract] Abstract: the phrase 'reasonable level of difficulty' is used without reference to a quantitative baseline (e.g., random or BM25 performance) that would allow readers to interpret the reported numbers; a BM25 sketch follows this list.
- [§6] §6 (Reproducibility): the GitHub link is given, but the paper should list exact preprocessing steps, data splits, and prompt templates used for the 193 tasks to ensure full reproducibility.
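On the baseline point in the first minor comment, a minimal BM25 reference is easy to produce with the rank_bm25 package; the corpus and query below are toy placeholders, not LMEB data, and the paper may well use a different lexical baseline.

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

corpus = [
    "dinner with sam at the italian place two weeks ago",
    "fixed the flaky ci pipeline on tuesday",
    "sam recommended a book about memory systems",
]
bm25 = BM25Okapi([doc.split() for doc in corpus])

query = "what book did sam recommend".split()
scores = bm25.get_scores(query)  # one BM25 score per document
best = max(range(len(corpus)), key=lambda i: scores[i])
print(scores, "-> best doc:", corpus[best])
```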
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address each of the major comments below and have revised the paper accordingly to incorporate additional analyses and details.
Point-by-point responses
-
Referee: [§4] §4 (Dataset Curation): the paper provides no explicit audit or overlap analysis between the 22 LMEB datasets and MTEB corpora or task templates; without this, the central claim that LMEB and MTEB measure orthogonal capabilities (reported in §5.3 and Figure 3) cannot be distinguished from a curation artifact.
Authors: We agree that an explicit audit for overlap with MTEB was missing. In the revised manuscript, we will add a new subsection in §4 detailing the overlap analysis. Our preliminary check shows that LMEB datasets are sourced from distinct domains (e.g., personal memory logs, dialogue histories) with no direct overlap in documents or templates with MTEB tasks, supporting that the orthogonality is not merely a curation artifact. revision: yes
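The rebuttal does not say how the overlap check is performed; a minimal sketch, assuming a character n-gram Jaccard screen between corpora (the threshold is illustrative, and a real audit at MTEB scale would use MinHash/LSH rather than this brute-force loop):

```python
# Hedged sketch of an LMEB-vs-MTEB overlap audit via n-gram Jaccard.
def shingles(text: str, n: int = 8) -> set[str]:
    t = " ".join(text.lower().split())
    return {t[i:i + n] for i in range(max(len(t) - n + 1, 1))}

def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def flag_overlaps(lmeb_docs, mteb_docs, threshold=0.5):
    """Return (lmeb_idx, mteb_idx, similarity) for pairs above threshold."""
    mteb_sets = [shingles(d) for d in mteb_docs]
    hits = []
    for i, doc in enumerate(lmeb_docs):
        s = shingles(doc)
        for j, m in enumerate(mteb_sets):
            sim = jaccard(s, m)
            if sim >= threshold:
                hits.append((i, j, sim))
    return hits

print(flag_overlaps(["I met Sam for dinner last week."],
                    ["The capital of France is Paris."]))  # -> []
```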
-
Referee: [§3.2] §3.2 (Task Construction): the description of how the 193 zero-shot tasks are generated from the four memory types lacks concrete details on query formulation, relevance labeling, and temporal-dependency handling, which are required to assess whether the evaluation genuinely tests long-horizon retrieval rather than standard passage matching.
Authors: We appreciate this point and will expand §3.2 with concrete examples. For instance, for episodic memory, queries are formulated as 'What was the outcome of the event described in the memory from two weeks ago?' with relevance labels based on whether the passage contains the specific temporal reference. We will include pseudocode for task generation and explain how temporal dependencies are handled by including time-stamped contexts in the retrieval corpus. revision: yes
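A hedged sketch of what the promised task-generation pseudocode might look like for episodic memory, with time-stamped contexts in the retrieval corpus; the field names and query template are illustrative, not taken from the paper:

```python
from datetime import date, timedelta

def build_episodic_task(events, today: date, weeks_ago: int = 2):
    """events: list of dicts with 'id', 'date', and 'description' fields."""
    target_day = today - timedelta(weeks=weeks_ago)
    query = (f"What was the outcome of the event described in the memory "
             f"from {weeks_ago} weeks ago?")
    # Relevance label: passages whose timestamp matches the temporal reference.
    relevant = {e["id"] for e in events if e["date"] == target_day}
    corpus = {e["id"]: f"[{e['date'].isoformat()}] {e['description']}"
              for e in events}  # time-stamped contexts in the retrieval corpus
    return query, corpus, relevant

events = [
    {"id": "e1", "date": date(2026, 5, 1), "description": "Project kickoff with the team."},
    {"id": "e2", "date": date(2026, 5, 8), "description": "Demo day; the prototype crashed."},
]
print(build_episodic_task(events, today=date(2026, 5, 15), weeks_ago=1))
```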
-
Referee: [Table 3] Table 3 (Model Rankings): the statement that 'larger models do not always perform better' is supported only by raw scores; without statistical tests, confidence intervals, or ablation on model scale, this observation remains descriptive and does not yet undermine scaling hypotheses.
Authors: The statement is based on the observed rankings in Table 3, where for example a 1B model outperforms a 7B model on certain tasks. To address the concern, we will include bootstrap confidence intervals for the scores and a note on the lack of consistent scaling, while acknowledging that this is an empirical observation rather than a full refutation of scaling laws. revision: yes
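A minimal sketch of the proposed bootstrap intervals: resample per-task scores with replacement and report a percentile interval for the mean. The scores are placeholders, not Table 3 values.

```python
import numpy as np

def bootstrap_ci(scores, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-task scores."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    means = np.array([rng.choice(scores, size=len(scores), replace=True).mean()
                      for _ in range(n_boot)])
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), (lo, hi)

small = [0.52, 0.48, 0.61, 0.44, 0.57, 0.50]  # hypothetical 1B-model scores
large = [0.49, 0.51, 0.55, 0.42, 0.53, 0.47]  # hypothetical 7B-model scores
for name, s in [("1B", small), ("7B", large)]:
    mean, (lo, hi) = bootstrap_ci(s)
    print(f"{name}: mean={mean:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
# Overlapping intervals would support reporting the scale observation as
# descriptive rather than conclusive.
```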
Circularity Check
No circularity: new benchmark creation and empirical evaluation against external MTEB
Full rationale
The paper constructs LMEB from 22 new datasets spanning episodic, dialogue, semantic, and procedural memory types, then directly evaluates 15 models on 193 zero-shot tasks and compares the resulting scores to published MTEB numbers. This comparison is an external empirical measurement with no equations, fitted parameters, or derivations that reduce to the paper's own inputs. No self-citations are load-bearing for the orthogonality claim, and the work contains no self-definitional steps, ansatz smuggling, or renaming of known results. The derivation chain is therefore self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the four memory types (episodic, dialogue, semantic, procedural) capture distinct and relevant aspects of long-horizon memory retrieval.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
unclear: relation between the paper passage and the cited Recognition theorem.
LMEB comprises 22 datasets and 193 zero-shot retrieval tasks spanning four memory types: episodic, dialogue, semantic, and procedural.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 3 Pith papers
-
LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues
LongMemEval-V2 is a new benchmark where AgentRunbook-C reaches 72.5% accuracy on long-term agent memory tasks, beating RAG baselines at 48.5% and basic coding agents at 69.3%.
-
MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models
MMEB-V3 benchmark shows omni-modality embedding models fail to enforce instruction-specified modality constraints and exhibit asymmetric, query-biased retrieval.
-
SelRoute: Query-Type-Aware Routing for Long-Term Conversational Memory Retrieval
SelRoute routes queries to type-specific retrieval pipelines, achieving Recall@5 of 0.800 with a 109M model on LongMemEval_M and outperforming LLM-augmented baselines including a strong zero-ML lexical method.
Reference graph
Works this paper leans on
-
[1]
Naveed Afzal, Yanshan Wang, and Hongfang Liu. MayoNLP at SemEval-2016 Task 1: Semantic textual similarity based on lexical semantic net and deep learning semantic model. In SemEval@NAACL-HLT, pages 674–679. The Association for Computer Linguistics,
work page 2016
-
[2]
Eneko Agirre, Daniel M. Cer, Mona T. Diab, and Aitor Gonzalez-Agirre. SemEval-2012 Task 6: A pilot on semantic textual similarity. In SemEval@NAACL-HLT, pages 385–393. The Association for Computer Linguistics,
work page 2012
-
[3]
Eneko Agirre, Daniel M. Cer, Mona T. Diab, Aitor Gonzalez-Agirre, and Weiwei Guo. *SEM 2013 shared task: Semantic textual similarity. In *SEM@NAACL-HLT, pages 32–43. Association for Computational Linguistics,
work page 2013
-
[4]
Eneko Agirre, Carmen Banea, Claire Cardie, Daniel M. Cer, Mona T. Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Rada Mihalcea, German Rigau, and Janyce Wiebe. SemEval-2014 Task 10: Multilingual semantic textual similarity. In SemEval@COLING, pages 81–91. The Association for Computer Linguistics,
work page 2014
-
[5]
Eneko Agirre, Carmen Banea, Claire Cardie, Daniel M. Cer, Mona T. Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Iñigo Lopez-Gazpio, Montse Maritxalar, Rada Mihalcea, German Rigau, Larraitz Uria, and Janyce Wiebe. SemEval-2015 Task 2: Semantic textual similarity, English, Spanish and pilot on interpretability. In SemEval@NAACL-HLT, pages 252–263. The Association...
work page 2015
-
[6]
jina-embeddings-v5-text: Task-Targeted Embedding Distillation
URL https://arxiv.org/abs/2602.15547. Nick Alonso, Tomas Figliolia, Anthony Ndirango, and Beren Millidge. Toward conversational agents with context and time sensitive long-term memory. CoRR, abs/2406.00057,
work page internal anchor Pith review Pith/arXiv arXiv
-
[7]
Remember Me, Refine Me: A Dynamic Procedural Memory Framework for Experience-Driven Agent Evolution
Zouying Cao, Jiaji Deng, Li Yu, Weikang Zhou, Zhaoyang Liu, Bolin Ding, and Hai Zhao. Remember me, refine me: A dynamic procedural memory framework for experience-driven agent evolution. arXiv preprint arXiv:2512.10696,
work page internal anchor Pith review Pith/arXiv arXiv
-
[8]
Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. BGE M3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. CoRR, abs/2402.03216,
work page internal anchor Pith review Pith/arXiv arXiv
-
[9]
Air-bench: Automated heterogeneous information retrieval benchmark
Jianlyu Chen, Nan Wang, Chaofan Li, Bo Wang, Shitao Xiao, Han Xiao, Hao Liao, Defu Lian, and Zheng Liu. Air-bench: Automated heterogeneous information retrieval benchmark. In ACL (1), pages 19991–20022. Association for Computational Linguistics, 2025a. Silin Chen, Shaoxin Lin, Xiaodong Gu, Yuling Shi, Heng Lian, Longfei Yun, Dong Chen, Weiguo Sun, Lin C...
-
[10]
Yiming Du, Hongru Wang, Zhengyi Zhao, Bin Liang, Baojun Wang, Wanjun Zhong, Zezhong Wang, and Kam-Fai Wong. Perltqa: A personal long-term memory dataset for memory classification, retrieval, and synthesis in question answering. CoRR, abs/2402.16288,
-
[11]
Yiming Du, Wenyu Huang, Danna Zheng, Zhaowei Wang, Sebastien Montella, Mirella Lapata, Kam-Fai Wong, and Jeff Z. Pan. Rethinking memory in AI: Taxonomy, operations, topics, and future directions. arXiv preprint arXiv:2505.00675, 2025a. Yiming Du, Wenyu Huang, Danna Zheng, Zhaowei Wang, Sebastien Montella, Mirella Lapata, Kam-Fai Wong, and Jeff Z. Pan. Reth...
-
[12]
Memp: Exploring Agent Procedural Memory
Runnan Fang, Yuan Liang, Xiaobin Wang, Jialong Wu, Shuofei Qiao, Pengjun Xie, Fei Huang, Huajun Chen, and Ningyu Zhang. Memp: Exploring agent procedural memory. CoRR, abs/2508.06433,
work page internal anchor Pith review Pith/arXiv arXiv
-
[13]
Human-like episodic memory for infinite context llms
Zafeirios Fountas, Martin Benfeghoul, Adnan Oomerjee, Fenia Christopoulou, Gerasimos Lampouras, Haitham Bou-Ammar, and Jun Wang. Human-like episodic memory for infinite context llms. CoRR, abs/2407.09450,
-
[14]
jina-embeddings-v4: Universal embeddings for multimodal multilingual retrieval. CoRR, abs/2506.18902,
Michael Günther, Saba Sturua, Mohammad Kalim Akram, Isabelle Mohr, Andrei Ungureanu, Bo Wang, Sedigheh Eslami, Scott Martens, Maximilian Werk, Nan Wang, and Han Xiao. jina-embeddings-v4: Universal embeddings for multimodal multilingual retrieval. CoRR, abs/2506.18902,
-
[15]
Kalm-embedding: Superior training data brings a stronger embedding model. CoRR, abs/2501.01028,
Xinshuo Hu, Zifei Shan, Xinping Zhao, Zetian Sun, Zhenyu Liu, Dongfang Li, Shaolin Ye, Xinyuan Wei, Qian Chen, Baotian Hu, Haofen Wang, Jun Yu, and Min Zhang. Kalm-embedding: Superior training data brings a stronger embedding model. CoRR, abs/2501.01028,
-
[16]
A benchmark for procedural memory retrieval in language agents. CoRR, abs/2511.21730,
Ishant Kohar and Aswanth Krishnan. A benchmark for procedural memory retrieval in language agents. CoRR, abs/2511.21730,
-
[17]
Nv-embed: Improved techniques for training llms as generalist embedding models
Chankyu Lee, Rajarshi Roy, Mengyao Xu, Jonathan Raiman, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Nv-embed: Improved techniques for training llms as generalist embedding models. In ICLR. OpenReview.net, 2025a. Dong-Ho Lee, Adyasha Maharana, Jay Pujara, Xiang Ren, and Francesco Barbieri. REALTALK: A 21-day real-world dataset for long-term conversatio...
-
[18]
Xintong Li, Jalend Bantupalli, Ria Dharmani, Yuwei Zhang, and Jingbo Shang. Toward multi-session personalized conversation: A large-scale dataset and hierarchical tree framework for implicit reasoning. CoRR, abs/2503.07018,
-
[19]
Towards General Text Embeddings with Multi-stage Contrastive Learning
Zehan Li, Xin Zhang, Yanzhao Zhang, Dingkun Long, Pengjun Xie, and Meishan Zhang. Towards general text embeddings with multi-stage contrastive learning. CoRR, abs/2308.03281,
work page internal anchor Pith review Pith/arXiv arXiv
- [20]
-
[21]
Vidore benchmark V2: raising the bar for visual retrieval. CoRR, abs/2505.17166,
Quentin Macé, António Loison, and Manuel Faysse. Vidore benchmark V2: raising the bar for visual retrieval. CoRR, abs/2505.17166,
-
[22]
Rui Meng, Ziyan Jiang, Ye Liu, Mingyi Su, Xinyi Yang, Yuepeng Fu, Can Qin, Zeyuan Chen, Ran Xu, Caiming Xiong, Yingbo Zhou, Wenhu Chen, and Semih Yavuz. Vlm2vec-v2: Advancing multimodal embedding for videos, images, and visual documents. CoRR, abs/2507.04590,
-
[23]
Multi-task contrastive learning for 8192-token bilingual text embeddings
Isabelle Mohr, Markus Krimmel, Saba Sturua, Mohammad Kalim Akram, Andreas Koukounas, Michael Günther, Georgios Mastrapas, Vinit Ravishankar, Joan Fontanals Martínez, Feng Wang, Qi Liu, Ziniu Yu, Jie Fu, Saahil Ognawala, Susana Guzman, Bo Wang, Maximilian Werk, Nan Wang, and Han Xiao. Multi-task contrastive learning for 8192-token bilingual text embeddings...
-
[24]
COVID-QA: A question answering dataset for COVID-19
Timo Möller, Anthony Reina, Raghavan Jayakumar, and Malte Pietsch. COVID-QA: A question answering dataset for COVID-19. In Karin Verspoor, Kevin Bretonnel Cohen, Mark Dredze, Emilio Ferrara, Jonathan May, Robert Munro, Cecile Paris, and Byron Wallace, editors, Proceedings of the 1st Workshop on NLP for COVID-19 at ACL 2020, Online, July
work page 2020
-
[25]
URL https://aclanthology.org/2020.nlpcovid19-acl.18/
Association for Computational Linguistics. URL https://aclanthology.org/2020.nlpcovid19-acl.18/. Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers. MTEB: massive text embedding benchmark. In EACL, pages 2006–2029. Association for Computational Linguistics,
work page 2020
-
[26]
ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory
Siru Ouyang, Jun Yan, I-Hung Hsu, Yanfei Chen, Ke Jiang, Zifeng Wang, Rujun Han, Long T. Le, Samira Daruki, Xiangru Tang, Vishy Tirumalashetty, George Lee, Mahsan Rofouei, Hangfei Lin, Jiawei Han, Chen-Yu Lee, and Tomas Pfister. Reasoningbank: Scaling agent self-evolving with reasoning memory. CoRR, abs/2509.25140,
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[27]
Convomem benchmark: Why your first 150 conversations don’t need RAG. CoRR, abs/2511.10523,
Egor Pakhomov, Erik Nijkamp, and Caiming Xiong. Convomem benchmark: Why your first 150 conversations don’t need RAG. CoRR, abs/2511.10523,
-
[28]
Position: Episodic memory is the missing piece for long-term LLM agents. CoRR, abs/2502.06975,
Mathis Pink, Qinyuan Wu, Vy Ai Vo, Javier Turek, Jianing Mu, Alexander Huth, and Mariya Toneva. Position: Episodic memory is the missing piece for long-term LLM agents. CoRR, abs/2502.06975,
-
[29]
Retrieval models aren’t tool-savvy: Benchmarking tool retrieval for large language models
Zhengliang Shi, Yuhan Wang, Lingyong Yan, Pengjie Ren, Shuaiqiang Wang, Dawei Yin, and Zhaochun Ren. Retrieval models aren’t tool-savvy: Benchmarking tool retrieval for large language models. In ACL (Findings), volume ACL 2025 of Findings of ACL, pages 24497–24524. Association for Computational Linguistics,
work page 2025
-
[30]
Membench: Towards more comprehensive evaluation on the memory of llm-based agents
Haoran Tan, Zeyu Zhang, Chen Ma, Xu Chen, Quanyu Dai, and Zhenhua Dong. Membench: Towards more comprehensive evaluation on the memory of llm-based agents. In ACL (Findings), volume ACL 2025 of Findings of ACL, pages 19336–19352. Association for Computational Linguistics,
work page 2025
-
[31]
BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models
Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. BEIR: A heterogenous benchmark for zero-shot evaluation of information retrieval models. CoRR, abs/2104.08663,
work page internal anchor Pith review Pith/arXiv arXiv
-
[32]
EmbeddingGemma: Powerful and Lightweight Text Representations
URL https://arxiv.org/abs/2509.20354. David Wadden, Shanchuan Lin, Kyle Lo, Lucy Lu Wang, Madeleine van Zuylen, Arman Cohan, and Hannaneh Hajishirzi. Fact or fiction: Verifying scientific claims. In EMNLP (1), pages 7534–7550. Association for Computational Linguistics,
work page internal anchor Pith review Pith/arXiv arXiv
-
[33]
Text Embeddings by Weakly-Supervised Contrastive Pre-training
Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. Text embeddings by weakly-supervised contrastive pre-training. CoRR, abs/2212.03533,
work page internal anchor Pith review Pith/arXiv arXiv
-
[35]
Multilingual E5 Text Embeddings: A Technical Report
doi: 10.48550/ARXIV.2402.05672. URL https://doi.org/10.48550/arXiv.2402.05672. Qihao Wang, Ziming Cheng, Shuo Zhang, Fan Liu, Rui Xu, Heng Lian, Kunyi Wang, Xiaoming Yu, Jianghao Yin, Sen Hu, et al. Memgovern: Enhancing code agents through learning from governed human experiences. arXiv preprint arXiv:2601.06789,
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv
-
[36]
KnowMe-Bench: Benchmarking Person Understanding for Lifelong Digital Companions
Tingyu Wu, Zhisheng Chen, Ziyan Weng, Shuhe Wang, Chenglong Li, Shuo Zhang, Sen Hu, Silin Wu, Qizhen Lan, Huacan Wang, et al. KnowMe-Bench: Benchmarking person understanding for lifelong digital companions. arXiv preprint arXiv:2601.04745,
work page internal anchor Pith review Pith/arXiv arXiv
-
[37]
MIEB: Massive Image Embedding Benchmark
Chenghao Xiao, Isaac Chung, Imene Kerboua, Jamie Stirling, Xin Zhang, Márton Kardos, Roman Solomatin, Noura Al Moubayed, Kenneth C. Enevoldsen, and Niklas Muennighoff. MIEB: massive image embedding benchmark. CoRR, abs/2504.10471,
-
[38]
MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent
Hongli Yu, Tinghong Chen, Jiangtao Feng, Jiangjie Chen, Weinan Dai, Qiying Yu, Ya-Qin Zhang, Wei-Ying Ma, Jingjing Liu, Mingxuan Wang, and Hao Zhou. Memagent: Reshaping long-context LLM with multi-conv RL-based memory agent. CoRR, abs/2507.02259,
work page internal anchor Pith review Pith/arXiv arXiv
-
[39]
Meishan Zhang, Xin Zhang, Xinping Zhao, Shouzheng Huang, Baotian Hu, and Min Zhang. On the role of pretrained language models in general-purpose text embeddings: A survey. CoRR, abs/2507.20783, 2025a. Xin Zhang, Yanzhao Zhang, Dingkun Long, Wen Xie, Ziqi Dai, Jialong Tang, Huan Lin, Baosong Yang, Pengjun Xie, Fei Huang, Meishan Zhang, Wenjie Li, and Min Zh...
-
[40]
Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models
Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou. Qwen3 embedding: Advancing text embedding and reranking through foundation models. CoRR, abs/2506.05176, 2025b. Yinger Zhang, Shutong Jiang, Renhao Li, Jianhong Tu, Yang Su, Lianghao Deng, Xudong Guo, C...
work page internal anchor Pith review Pith/arXiv arXiv
-
[41]
Xinping Zhao, Xinshuo Hu, Zifei Shan, Shouzheng Huang, Yao Zhou, Zetian Sun, Zhenyu Liu, Dongfang Li, Xinyuan Wei, Qian Chen, Youcheng Pan, Yang Xiang, Meishan Zhang, Haofen Wang, Jun Yu, Baotian Hu, and Min Zhang. Kalm-embedding-v2: Superior training techniques and data inspire a versatile embedding model. CoRR, abs/2506.20923, 2025a. Xinping Zhao, Yan Zh...
-
[42]
MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents
Zijian Zhou, Ao Qu, Zhaoxuan Wu, Sunghwan Kim, Alok Prakash, Daniela Rus, Jinhua Zhao, Bryan Kian Hsiang Low, and Paul Pu Liang. MEM1: Learning to synergize memory and reasoning for efficient long-horizon agents. CoRR, abs/2506.15841,
work page internal anchor Pith review Pith/arXiv arXiv
-
[43]
Appendix A (Datasets): the 22 evaluation datasets in LMEB, spanning four memory types.
Episodic Memory: EPBench [Huet et al., 2025] (https://doi.org/10.6084/m9.figshare.28244480); KnowMeBench [Wu et al., 2026] (https://github.com/QuantaAlpha/KnowMeBench/tree/main/KnowmeBench). Dialogue Memory: LoCoMo [Maharana et al., 2024] (https://github.com/snap-research/locomo/tree/main/data); LongMemEval [Wu et al., 2025] (https://huggingface...
-
[44]
Appendix A.1 (Episodic Memory Retrieval): episodic memory retrieval aims to recall past events grounded in temporal cues, entities, contents, and spatial contexts [Fountas et al., 2024, Pink et al., 2025]. In LMEB, an episodic query is treated as the input and the corresponding event memories are retrieved as the output. Examples are presented in Table ...
-
[45]
EPBench [Huet et al., 2025] is a synthetic episodic memory benchmark for evaluating event recall and episodic reasoning in LLMs. It represents episodic events with structured fields, including temporal and spatial context, involved entities, and detailed descriptions. Its event set is used as the corpus C and the provided task queries as queries Q. KnowMeBench...
-
[46]
Example task (EPBench): query "Think about Aurora Chavez's experiences. Describe all the key events they've been involved in, focusing on what happened rather than when or where it occurred." with relevant documents such as "Aurora implemented blockchain solutions with a determination that bordered on obsession. ... It wasn't until Samara Bayes ta..."
-
[47]
QASPER [Dasigi et al., 2021] is a question-answering (QA) dataset grounded in full research papers, designed to reflect information-seeking queries that require reasoning across multiple document sections. It contains 5,049 questions over 1,585 NLP papers, where questions are written from the title and abstract and answered using evidence from the full tex...
-
[48]
Gorilla [Patil et al., 2024] introduces APIBench, a benchmark covering APIs from HuggingFace, TorchHub, and TensorHub, designed to evaluate tool use through API call generation. A key component of Gorilla is the integration of a document retriever, which enables models to fetch up-to-date API documentation and adapt to version changes at test time. In LMEB...
-
[49]
Appendix D (Dataset Licenses): the authors of 2 datasets in the LMEB benchmark (REALTALK and TMD) do not specify the dataset license in the paper or repository. For the remaining datasets: EPBench, LongMemEval, MemBench, MLDR, and MemGovern are provided under the MIT License; KnowMeBench, Covid-QA, Gorilla, ToolBench, ReMe, Proced_mem_ben...