KnowMe-Bench: Benchmarking Person Understanding for Lifelong Digital Companions

2026 · cs.AI · arXiv 2601.04745

8 Pith papers cite this work. Polarity classification is still indexing.


abstract

Existing long-horizon memory benchmarks mostly use multi-turn dialogues or synthetic user histories, which makes retrieval performance an imperfect proxy for person understanding. We present KnowMe-Bench, a publicly releasable benchmark built from long-form autobiographical narratives, where actions, context, and inner thoughts provide dense evidence for inferring stable motivations and decision principles. KnowMe-Bench reconstructs each narrative into a flashback-aware, time-anchored stream and evaluates models with evidence-linked questions spanning factual recall, subjective state attribution, and principle-level reasoning. Across diverse narrative sources, retrieval-augmented systems mainly improve factual accuracy, while errors persist on temporally grounded explanations and higher-level inferences, highlighting the need for memory mechanisms beyond retrieval. Our data is available at https://github.com/QuantaAlpha/KnowMeBench.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare

cs.AI · 2026-05-12 · conditional · novelty 8.0

MedMemoryBench supplies a 2,000-session synthetic medical trajectory dataset and an evaluate-while-constructing streaming protocol to expose memory saturation and reasoning failures in current agent architectures for personalized healthcare.

BehaviorBench: Modeling Real-World User Decisions from Behavioral Traces

cs.AI · 2026-06-01 · unverdicted · novelty 7.0

BehaviorBench reconstructs 2,000 real wallets into 141k belief and 1.4M trade prediction tasks to test if personalization from history improves model performance over non-personalized baselines.

HEART-Bench: Do LLM Agents Exhibit Human-like Psychology?

cs.CL · 2026-05-28 · unverdicted · novelty 7.0

HEART-Bench evaluates LLM agents on psychological consistency using 11 Big-Five-grounded characters with 1,000 episodic memories each and 64 DIAMONDS-based decision scenarios, yielding 673 validated MCQs.

MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models

cs.IR · 2026-04-25 · unverdicted · novelty 7.0

MMEB-V3 benchmark shows omni-modality embedding models fail to enforce instruction-specified modality constraints and exhibit asymmetric, query-biased retrieval.

PERMA: Benchmarking Personalized Memory Agents via Event-Driven Preference and Realistic Task Environments

cs.AI · 2026-03-24 · unverdicted · novelty 7.0

PERMA is a new benchmark using temporally ordered events, text variability, and linguistic alignment to evaluate LLM memory agents on persona consistency beyond simple retrieval.

LMEB: Long-horizon Memory Embedding Benchmark

cs.CL · 2026-03-13 · unverdicted · novelty 7.0

LMEB benchmark shows that embedding models' performance on traditional retrieval does not transfer to long-horizon memory tasks, larger models do not always perform better, and LMEB measures capabilities orthogonal to MTEB.

Connecting the Dots: Benchmarking Reflective Memory in Long-Horizon Dialogue

cs.CL · 2026-05-31 · unverdicted · novelty 6.0

RefMem-Bench benchmarks reflective memory in dialogue with 26K instances across eight dimensions, and REMIND improves model accuracy via hierarchical evidence retrieval, grounding, and abstraction.

FlowTime: Towards Continuous Generative Watch Time Prediction via Flow-based Personalized Priors

cs.AI · 2026-05-31 · unverdicted · novelty 5.0

FlowTime introduces continuous generative regression using a one-step VAE and normalizing flows for personalized priors to predict watch time while addressing mean-collapse, quantization, and latency issues in prior paradigms.

citing papers explorer

Showing 8 of 8 citing papers.

MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare cs.AI · 2026-05-12 · conditional · none · ref 35 · internal anchor
BehaviorBench: Modeling Real-World User Decisions from Behavioral Traces cs.AI · 2026-06-01 · unverdicted · none · ref 12 · internal anchor
HEART-Bench: Do LLM Agents Exhibit Human-like Psychology? cs.CL · 2026-05-28 · unverdicted · none · ref 41 · internal anchor
MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models cs.IR · 2026-04-25 · unverdicted · none · ref 37 · internal anchor
PERMA: Benchmarking Personalized Memory Agents via Event-Driven Preference and Realistic Task Environments cs.AI · 2026-03-24 · unverdicted · none · ref 64 · internal anchor
LMEB: Long-horizon Memory Embedding Benchmark cs.CL · 2026-03-13 · unverdicted · none · ref 36 · internal anchor
Connecting the Dots: Benchmarking Reflective Memory in Long-Horizon Dialogue cs.CL · 2026-05-31 · unverdicted · none · ref 58 · internal anchor
FlowTime: Towards Continuous Generative Watch Time Prediction via Flow-based Personalized Priors cs.AI · 2026-05-31 · unverdicted · none · ref 39 · internal anchor
