SAGE: A Novelty Gate for Efficient Memory Evolution in Agentic LLMs

Dhanajit Brahma; Ricardo Henao; Sijia Wang

arXiv: 2605.30711 · v2 · pith:J6XVKJETnew · submitted 2026-05-29 · cs.CL · cs.AI · cs.LG · stat.ML

Pith reviewed 2026-06-28 23:01 UTC · model grok-4.3

keywords: novelty detection · memory evolution · agentic LLMs · von Mises-Fisher · adaptive threshold · memory management · write-side control

The pith

SAGE routes new facts through a density-based novelty gate to add, ignore, or merge them in agent memory while cutting LLM calls.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats memory evolution in agentic LLMs as a novelty-detection problem instead of invoking full LLM reasoning on every extracted fact. It scores each candidate fact with a von Mises-Fisher density estimator computed over the embeddings of existing memories and applies an adaptive threshold that follows the geometry of the memory store. Facts judged clearly novel are added directly, clearly redundant facts are ignored, and only uncertain facts reach an LLM merge step. This selective routing produces the highest average token-F1 scores against prior memory systems across seven open-weight models on the reported benchmark and lowers API cost and latency on the closed model with only a small judge-score difference.

Core claim

SAGE is a Spherical Adaptive Gate for memory Evolution that scores candidate facts with a von Mises-Fisher-based density estimator over memory embeddings and routes them with an adaptive threshold that tracks memory-store geometry. SAGE resolves clearly novel facts as ADD, clearly redundant facts as NOOP, and sends only uncertain cases to an LLM merge step, reducing expensive write-time reasoning.
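For orientation, the first line below is the standard von Mises-Fisher density on the unit sphere $S^{d-1}$. The second line is only one plausible reading of "a vMF-based density estimator over memory embeddings": a kernel density estimate with one vMF kernel per stored memory. The symbols $m_i$ (unit-norm memory embeddings), $n$ (store size), and the shared concentration $\kappa$ are illustrative assumptions; the abstract does not state the exact estimator.

```latex
f(x \mid \mu, \kappa) = C_d(\kappa)\, \exp\!\left(\kappa\, \mu^{\top} x\right),
\qquad
C_d(\kappa) = \frac{\kappa^{d/2 - 1}}{(2\pi)^{d/2}\, I_{d/2 - 1}(\kappa)},
\qquad \lVert x \rVert = \lVert \mu \rVert = 1

\hat{p}(x) = \frac{1}{n} \sum_{i=1}^{n} C_d(\kappa)\, \exp\!\left(\kappa\, m_i^{\top} x\right)
```

Here $I_{\nu}$ is the modified Bessel function of the first kind. Under this reading, a high $\hat{p}(x)$ means the candidate sits in a densely populated region of the store (likely redundant), and a low value means it sits in a sparse region (likely novel).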

What carries the argument

Spherical Adaptive Gate (SAGE) that applies a von Mises-Fisher density estimator to memory embeddings together with an adaptive threshold derived from memory-store geometry to classify each fact as novel, redundant, or uncertain.
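The routing described here can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the value of `kappa`, the `"MERGE"` label for the LLM merge step, and the quantile rule used to derive thresholds from the store are all assumptions made for the example. The score is an unnormalised vMF kernel density estimate, and the two thresholds are recomputed from the current memory embeddings so that they move with the store.

```python
import numpy as np

def unit(x):
    """Project vectors onto the unit sphere."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def vmf_score(query, memory, kappa):
    """Log of a vMF kernel density estimate at `query`, up to a constant.

    The vMF normaliser depends only on kappa and the dimension, so it is
    dropped: it shifts every score by the same amount.
    """
    z = kappa * (memory @ query)           # kappa * cosine similarity
    m = z.max()
    return float(m + np.log(np.mean(np.exp(z - m))))  # stable log-mean-exp

def store_thresholds(memory, kappa, q=0.1):
    """Hypothetical adaptive rule (an assumption, not the paper's).

    tau_low:  low quantile of leave-one-out scores, i.e. how dense the
              store is around its own sparsest members.
    tau_high: low quantile of the scores exact duplicates would receive.
    """
    n = len(memory)
    loo = [vmf_score(memory[i], np.delete(memory, i, axis=0), kappa)
           for i in range(n)]
    dup = [vmf_score(memory[i], memory, kappa) for i in range(n)]
    return float(np.quantile(loo, q)), float(np.quantile(dup, q))

def gate(query, memory, kappa=10.0, q=0.1):
    """Route a candidate embedding to ADD, NOOP, or MERGE (LLM step)."""
    if len(memory) < 2:
        return "ADD"                       # too little data to estimate density
    memory, query = unit(memory), unit(query)
    tau_low, tau_high = store_thresholds(memory, kappa, q)
    score = vmf_score(query, memory, kappa)
    if score < tau_low:
        return "ADD"                       # sparse region: clearly novel
    if score > tau_high:
        return "NOOP"                      # dense region: clearly redundant
    return "MERGE"                         # uncertain: defer to the LLM

# Toy store: a tight cluster of three memories plus two isolated ones.
memory = np.array([[1, 0.1, 0, 0], [1, 0, 0.1, 0], [1, 0.1, 0.1, 0],
                   [0, 1, 0, 0], [0, 0, 1, 0]], dtype=float)
print(gate([0, 0, 0, 1], memory))   # orthogonal to everything stored
print(gate(memory[0], memory))      # exact copy of a cluster member
print(gate([1, 0, 0, 1], memory))   # partly related to the cluster
```

On this toy store the three calls route to ADD, NOOP, and MERGE respectively. Only the last case would cost an LLM call, which is the saving the gate is meant to deliver.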

If this is right

On LoCoMo, SAGE achieves the best average token-F1 against Mem0 on all seven open-weight backbone comparisons.
On GPT-4o-mini it reduces add-phase API cost by 3.4× and add-phase latency by 2.5× with only a small average judge-score gap.
Used as a drop-in binary gate for A-Mem, SAGE skips roughly 16-18% of LLM calls across five models with minimal quality change on open-weight backbones.
Novelty-aware write control improves both memory quality and system efficiency in long-term agentic memory.
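The reduction factors above are easier to compare once converted to fractions saved. The helper below is a hypothetical convenience function, not something from the paper; it only restates the reported 3.4× and 2.5× figures.

```python
def fraction_saved(reduction_factor: float) -> float:
    """Convert an 'N times lower' factor into the fraction of cost removed."""
    return 1.0 - 1.0 / reduction_factor

cost_saved = fraction_saved(3.4)      # add-phase API cost
latency_saved = fraction_saved(2.5)   # add-phase latency
print(f"cost saved: {cost_saved:.1%}, latency saved: {latency_saved:.1%}")
```

A 3.4× reduction removes about 71% of add-phase API cost, and a 2.5× reduction removes 60% of add-phase latency.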

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same embedding-density gate could be inserted into other memory architectures to limit redundant storage without changing their retrieval logic.
Agents that ingest facts in a continuous stream might keep the gate active at all times to avoid repeated LLM involvement.
If the geometry-tracking threshold proves stable, the same mechanism could be tested on embedding spaces produced by non-LLM encoders.

Load-bearing premise

The von Mises-Fisher density estimator over memory embeddings combined with an adaptive threshold that tracks memory-store geometry is sufficient to separate clearly novel, clearly redundant, and uncertain facts without systematic misclassification that would degrade downstream memory quality.

What would settle it

A controlled run in which every fact is forced through the full LLM merge step and the resulting token-F1 and judge scores are compared against the selective routing produced by the density gate on identical inputs and backbones.

Figures

Figures reproduced from arXiv: 2605.30711 by Dhanajit Brahma, Ricardo Henao, Sijia Wang.

**Figure 2.** Adaptive threshold sensitivity on Qwen2.5-3B.

**Figure 3.** Illustration of decaying adaptive threshold.

**Figure 4.** Threshold sensitivity to the fixed threshold.

Original abstract

Agentic LLMs must continuously decide whether newly extracted facts should be added, merged with existing memories, or ignored, yet prior work has focused more on retrieval and storage than on principled write-side control. We frame memory evolution as a novelty-detection problem and propose SAGE, a Spherical Adaptive Gate for memory Evolution that scores candidate facts with a von Mises-Fisher-based density estimator over memory embeddings and routes them with an adaptive threshold that tracks memory-store geometry. SAGE resolves clearly novel facts as ADD, clearly redundant facts as NOOP, and sends only uncertain cases to an LLM merge step, reducing expensive write-time reasoning. On LoCoMo, SAGE achieves the best average token-F1 against Mem0 on all seven open-weight backbone comparisons, while on GPT-4o-mini it reduces add-phase API cost by 3.4× and add-phase latency by 2.5× with only a small average judge-score gap. As a drop-in binary gate for A-Mem, SAGE skips roughly 16-18% of LLM calls across five models with minimal quality change on open-weight backbones. These results suggest that novelty-aware write control is a practical lever for improving both memory quality and system efficiency in long-term agentic memory. The source code for our approach is accessible at https://github.com/swang1024/SAGE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

SAGE uses a vMF density estimator plus an adaptive threshold to gate memory writes, cutting add-phase API cost by 3.4× and add-phase latency by 2.5× on GPT-4o-mini, but the gate's classification accuracy lacks supporting metrics.

The core idea is a write-side novelty gate that scores new facts against existing memory embeddings with von Mises-Fisher density estimation, then routes clear cases to ADD or NOOP while sending only uncertain ones to the LLM merge step. This is framed as a practical way to reduce expensive reasoning during memory evolution in agentic systems.

The paper shows concrete efficiency numbers on LoCoMo: 3.4× lower API cost and 2.5× lower latency on GPT-4o-mini with only a small judge-score gap, plus best average token-F1 against Mem0 across seven open-weight models. It also reports skipping 16-18% of calls when dropped into A-Mem. Releasing the code is useful for anyone wanting to test the gate.

The combination of vMF estimation with a geometry-tracking threshold for the write decision is the main technical step beyond prior retrieval-focused memory work. That part is straightforward and directly targets the bottleneck mentioned in the abstract.

The soft spot is that everything depends on the gate making accurate novel/redundant/uncertain splits. The abstract supplies no per-class precision or recall, no ablation on the adaptive threshold, and no checks for what happens when embeddings cluster or sit near boundaries. If the spherical uniformity assumption does not hold in practice, the small quality gap could mask gradual memory degradation even while the cost numbers look good. The stress-test concern about geometry mismatch therefore stands until those details appear.

This is for groups working on long-running agent memory who need to control write costs. The claims are specific enough and the code is public, so it deserves a serious referee even if the gate reliability needs more evidence in review.

Referee Report

2 major / 0 minor

Summary. The paper introduces SAGE, a Spherical Adaptive Gate for memory Evolution in agentic LLMs. It frames memory write decisions as a novelty-detection task, using a von Mises-Fisher density estimator over memory embeddings together with an adaptive threshold that tracks store geometry to route facts as ADD (clearly novel), NOOP (clearly redundant), or LLM merge (uncertain). This is claimed to reduce expensive write-time LLM calls. On the LoCoMo benchmark SAGE reports the best average token-F1 versus Mem0 across seven open-weight backbones; on GPT-4o-mini it reports 3.4× lower add-phase API cost and 2.5× lower latency with only a small judge-score gap. As a drop-in binary gate for A-Mem it skips 16-18% of LLM calls across five models with minimal quality change. The code is released at https://github.com/swang1024/SAGE.

Significance. If the gate's classification accuracy holds, the work supplies a lightweight, geometry-aware mechanism that simultaneously improves memory quality and system efficiency for long-horizon agentic applications; the open-source release is a concrete strength that would allow direct replication and extension.

major comments (2)

[Abstract / method description] The central claim that the vMF density estimator plus adaptive threshold cleanly partitions novel/redundant/uncertain facts (thereby routing only uncertain cases to the LLM merge step) is load-bearing for both the token-F1 gains and the 3.4× cost reduction, yet the manuscript supplies neither per-class precision/recall for the gate nor an ablation of the adaptive threshold nor any analysis of embedding-distribution mismatch when memories cluster tightly or lie near decision boundaries.
[Abstract] All reported performance numbers (best average token-F1 on seven backbones, 3.4× cost / 2.5× latency reductions, 16-18% call skipping) are stated without any experimental protocol, statistical tests, data splits, or ablation details, so it is impossible to determine whether the observed gains are attributable to the proposed gate rather than to uncontrolled factors.
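The per-class precision and recall requested in the first comment could be computed along the following lines. This is a sketch under assumptions: the gold labels are hypothetical (for example, the decision an always-on LLM merge step would have made for each fact), the `"MERGE"` label stands for the uncertain route, and the toy data is invented purely to show the calculation.

```python
from collections import Counter

LABELS = ("ADD", "NOOP", "MERGE")

def per_class_precision_recall(gold, pred):
    """Precision and recall of the gate's three-way routing, per label."""
    pairs = Counter(zip(gold, pred))
    report = {}
    for label in LABELS:
        tp = pairs[(label, label)]
        predicted = sum(c for (_, p), c in pairs.items() if p == label)
        actual = sum(c for (g, _), c in pairs.items() if g == label)
        report[label] = {
            "precision": tp / predicted if predicted else float("nan"),
            "recall": tp / actual if actual else float("nan"),
        }
    return report

# Invented toy data: reference decisions vs. gate decisions for six facts.
gold = ["ADD", "ADD", "NOOP", "NOOP", "MERGE", "MERGE"]
pred = ["ADD", "MERGE", "NOOP", "ADD", "MERGE", "MERGE"]
for label, stats in per_class_precision_recall(gold, pred).items():
    print(label, stats)
```

Low ADD precision would indicate redundant facts being written without review, and low NOOP precision would indicate novel facts being silently dropped; both are the kinds of systematic misclassification the comment is concerned with.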

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will incorporate revisions to improve clarity and completeness.

Referee: [Abstract / method description] The central claim that the vMF density estimator plus adaptive threshold cleanly partitions novel/redundant/uncertain facts (thereby routing only uncertain cases to the LLM merge step) is load-bearing for both the token-F1 gains and the 3.4× cost reduction, yet the manuscript supplies neither per-class precision/recall for the gate nor an ablation of the adaptive threshold nor any analysis of embedding-distribution mismatch when memories cluster tightly or lie near decision boundaries.

Authors: We agree that explicit per-class precision/recall for the gate, an ablation of the adaptive threshold, and analysis of embedding-distribution mismatch would strengthen the central claims. The current manuscript focuses on end-to-end results but does not include these gate-specific diagnostics. In revision we will add a dedicated 'Gate Analysis' subsection with precision/recall breakdowns on LoCoMo, a threshold ablation, and discussion of vMF behavior under tight clustering or boundary cases, supported by embedding visualizations. revision: yes

Referee: [Abstract] All reported performance numbers (best average token-F1 on seven backbones, 3.4× cost / 2.5× latency reductions, 16-18% call skipping) are stated without any experimental protocol, statistical tests, data splits, or ablation details, so it is impossible to determine whether the observed gains are attributable to the proposed gate rather than to uncontrolled factors.

Authors: The experimental protocol (LoCoMo splits, backbone configurations, judge scoring, and cost/latency measurement) is described in Section 4, but the abstract indeed omits this context and no statistical tests appear in the reported results. We will revise the abstract to reference the evaluation setup and add paired statistical significance tests plus ablation details to the results section to confirm attribution to the gate. revision: yes
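One common form of paired significance test is a sign-flip permutation test on per-question score differences. The sketch below assumes two systems were scored on the same questions; the function name, the resample count, and the toy scores are illustrative, and nothing here is taken from the paper's evaluation.

```python
import numpy as np

def paired_permutation_pvalue(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation test on paired score differences.

    Under the null hypothesis the two systems are exchangeable per item,
    so each paired difference is equally likely to have either sign.
    """
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    observed = abs(diffs.mean())
    rng = np.random.default_rng(seed)
    signs = rng.choice([-1.0, 1.0], size=(n_resamples, diffs.size))
    null = np.abs((signs * diffs).mean(axis=1))
    # Add-one correction keeps the estimated p-value strictly positive.
    return float((np.sum(null >= observed) + 1) / (n_resamples + 1))

# Invented per-question token-F1 scores for two systems on the same items.
system_a = [0.5] * 20
system_b = [0.4] * 20
print(paired_permutation_pvalue(system_a, system_b))
```

Pairing matters because both systems answer the same questions: the test looks only at per-question differences, so variation in question difficulty cancels out.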

Circularity Check

0 steps flagged

No circularity; method relies on standard density estimation without self-referential reductions

The paper frames memory evolution as novelty detection and describes SAGE via a von Mises-Fisher density estimator plus adaptive threshold on embeddings. No equations, derivations, or fitted parameters are shown that reduce the reported LoCoMo token-F1 gains, cost reductions, or gate decisions to quantities defined by construction within the same work. The approach invokes standard statistical tools rather than self-citations, ansatzes, or uniqueness theorems from the authors. Empirical results on benchmarks provide the support, with no load-bearing steps that collapse to input definitions or prior self-work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the vMF concentration and adaptive threshold are likely fitted but their values and fitting procedure are not stated.
