LMEB: Long-horizon Memory Embedding Benchmark

2026 · cs.CL · arXiv 2603.12572

3 Pith papers cite this work. Polarity classification is still indexing.



abstract

Memory embeddings are crucial for memory-augmented systems, such as OpenClaw, but their evaluation is underexplored in current text embedding benchmarks, which narrowly focus on traditional passage retrieval and fail to assess models' ability to handle long-horizon memory retrieval tasks involving fragmented, context-dependent, and temporally distant information. To address this gap, we introduce the Long-horizon Memory Embedding Benchmark (LMEB), a comprehensive framework for evaluating embedding models on complex, long-horizon memory retrieval. LMEB comprises 22 datasets and 193 zero-shot retrieval tasks spanning four memory types: episodic, dialogue, semantic, and procedural. These memory types differ in terms of level of abstraction and temporal dependency, capturing distinct aspects of memory retrieval that reflect the diverse challenges of the real world. We evaluate 15 widely used embedding models, ranging from hundreds of millions to ten billion parameters. The results reveal that (1) LMEB provides a reasonable level of difficulty; (2) Larger models do not always perform better; (3) LMEB and MTEB measure orthogonal capabilities. This suggests that the field has yet to converge on a universal model capable of excelling across all memory retrieval tasks, and that strong performance on traditional passage retrieval does not necessarily transfer to long-horizon memory retrieval. LMEB provides a standardized and reproducible framework that fills a key gap in memory embedding evaluation and supports future advances in long-term, context-dependent retrieval. LMEB is available at https://kalm-embedding.github.io/LMEB.github.io/.

representative citing papers

LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues

cs.CL · 2026-05-12 · unverdicted · novelty 7.0

LongMemEval-V2 is a new benchmark where AgentRunbook-C reaches 72.5% accuracy on long-term agent memory tasks, beating RAG baselines at 48.5% and basic coding agents at 69.3%.

MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models

cs.IR · 2026-04-25 · unverdicted · novelty 7.0

MMEB-V3 benchmark shows omni-modality embedding models fail to enforce instruction-specified modality constraints and exhibit asymmetric, query-biased retrieval.

SelRoute: Query-Type-Aware Routing for Long-Term Conversational Memory Retrieval

cs.IR · 2026-04-02 · conditional · novelty 6.0

SelRoute routes queries to type-specific retrieval pipelines, achieving Recall@5 of 0.800 with a 109M model on LongMemEval_M and outperforming LLM-augmented baselines including a strong zero-ML lexical method.

citing papers explorer

Showing 3 of 3 citing papers.

LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues cs.CL · 2026-05-12 · unverdicted · none · ref 106 · internal anchor
MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models cs.IR · 2026-04-25 · unverdicted · none · ref 42 · internal anchor
SelRoute: Query-Type-Aware Routing for Long-Term Conversational Memory Retrieval cs.IR · 2026-04-02 · conditional · none · ref 3 · internal anchor
