Memory augmentation in LLMs amplifies sycophancy up to 25x compared to in-context baselines due to lossy memory extraction, with two lightweight mitigations that reduce the effect while preserving recall.
hub Mixed citations
Evaluating Very Long-Term Conversational Memory of
Mixed citation behavior. Most common role is background (57%).
hub tools
citation-role summary
citation-polarity summary
representative citing papers
GateMem benchmark shows no existing memory method for LLM agents achieves strong utility, access control, and reliable forgetting simultaneously in multi-principal shared settings.
Tangram makes non-uniform KV cache compression practical for LLM serving with deterministic budget allocation, head group paging, and ahead-of-time load balancing, achieving up to 2.6x throughput gains.
MemPoison enables stealthy memory poisoning in LLM agents via dialogue by using semantic relational bridges, entity masquerading, and joint embedding optimization to bypass selective extraction and rewriting, achieving up to 0.95 attack success rate.
LongMemEval-V2 is a new benchmark where AgentRunbook-C reaches 72.5% accuracy on long-term agent memory tasks, beating RAG baselines at 48.5% and basic coding agents at 69.3%.
Memory for long-horizon agents should preserve distinctions that affect decisions under a fixed budget, not descriptive features, yielding an exact forgetting boundary and a new online learner DeMem with regret guarantees.
A stateful backdoor for LLM agents, modeled as a Mealy machine with a decomposition framework, enables incremental malicious actions across sessions and achieves 80-95% attack success rate on four models.
ReasoningBank distills generalizable reasoning strategies from agent successes and failures to enable self-evolution, with memory-aware test-time scaling amplifying gains over raw-trajectory or success-only memory on web and software benchmarks.
LongMemEval benchmarks long-term memory in chat assistants, revealing 30% accuracy drops across sustained interactions and proposing indexing-retrieval-reading optimizations that boost performance.
LPM encodes personal history as N latent slots projected by cross-attention into input-conditioned soft prompts for frozen LLMs, reporting up to 8.8% higher accuracy than LoRA and 64x lower KV-cache on PersonaMem v1 plus matching LoRA accuracy with 120x fewer parameters on LoCoMo.
MRAgent combines a Cue-Tag-Content associative graph with active reconstruction to enable dynamic memory access in LLM agents, reporting up to 23% gains on long-memory benchmarks with lower token costs.
AURA improves implicit-need coverage by 0.07 over ReAct baselines on a 100-query benchmark by inserting an intent inference step controlled by a gap score, while cutting probes 82% on factual tasks.
Eywa introduces a provenance-grounded memory system for persistent AI agents featuring evidence-first storage, typed validation, and deterministic multi-route retrieval, reporting 90.19% accuracy on LoCoMo and 88.2% on LongMemEval-S.
In LLM agents, memory routing circuits emerge at 0.6B scale while content circuits appear only at 4B, and write/read operations recruit a pre-existing late-layer context hub instead of creating a new one, enabling a 76% accurate unsupervised failure diagnostic.
A lightweight supervised router using frozen-LLM embeddings for memory admission decisions outperforms LLM-based memory managers in both F1 score and latency on the LoCoMo benchmark.
CL-bench Life shows frontier language models achieve only 13.8% average success on real-life context tasks, with the best model at 19.3%.
Introduces MemHome benchmark and RL with multi-dimensional rewards for memory-driven smart home device control.
TSUBASA improves long-horizon personalization in LLMs via dynamic memory evolution for writing and context-distillation self-learning for reading, outperforming Mem0 and Memory-R1 on Qwen-3 benchmarks while reducing token use.
EpiCache clusters long conversation history into coherent episodes for per-episode KV cache eviction, delivering up to 30% accuracy gains and 3.7x peak memory reduction on LongConvQA tasks under fixed budgets.
CoreMem replaces cosine retrieval with Fisher-Rao Riemannian matching and introduces Fisher-guided discrete token distillation for syntax-aware compression, reporting +4.51 pp open-domain and +4.17 pp temporal gains on LOCOMO and LongMemEval-S while staying inside an 8 GB VRAM budget.
KG-CFR decouples planning from execution via knowledge-grounded counterfactual reasoning, preventing critical degradation in over 95% of perturbed runs and raising argument quality from 0.694 to 0.822 in a 1v1v1 simulation.
Introduces Parametric Memory Law as power law for LoRA memory capacity and MemFT threshold-guided optimization for better memory fidelity.
UserGPT introduces a generative LLM framework with a behavior simulation engine, semantization module, and DF-GRPO post-training that scores 0.7325 on tag prediction and 0.7528 on summary generation on HPR-Bench while compressing records by up to 97.9%.
The paper measures policy-carriage failures during LLM context assembly and evaluates SafeContext as a partial mitigation on Llama, Qwen, and Mistral models.
citing papers explorer
-
Recalling Too Well: Sycophancy Evaluation and Mitigation in Memory-Augmented Models
Memory augmentation in LLMs amplifies sycophancy up to 25x compared to in-context baselines due to lossy memory extraction, with two lightweight mitigations that reduce the effect while preserving recall.
-
ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory
ReasoningBank distills generalizable reasoning strategies from agent successes and failures to enable self-evolution, with memory-aware test-time scaling amplifying gains over raw-trajectory or success-only memory on web and software benchmarks.