arxiv: 2605.29630 · v1 · pith:WLQMLYIJnew · submitted 2026-05-28 · 💻 cs.CL · cs.AI · cs.IR

Entity-Collision: A Stratified Protocol for Attributing Retrieval Lift in Agent Memory

Pith reviewed 2026-06-29 08:07 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.IR
keywords entity-collision · agent memory retrieval · BM25 baseline pinning · embedder evaluation · tag stratification · lexical leakage control · retrieval lift attribution

The pith

The entity-collision protocol pins the BM25 floor by forcing every distractor to share the answer's entity tokens, so any lift over BM25 is attributable to the embedder.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current agent-memory benchmarks report a single hit rate that confounds two effects: uncontrolled lexical overlap between queries and distractors, and the averaging of different query tags into one number. The entity-collision protocol targets both confounds by requiring every distractor to reuse the answer's entity tokens and by splitting evaluation across discriminator tags. This construction fixes the BM25 baseline so measured gains can be credited to the embedder rather than to leakage. Experiments across five tags, three embedders, and five collision depths show that a 256-d hash trigram helps only on closed-vocabulary lexical tags at deep collision, MiniLM-384 leads on both axes, and a BGE-large model with 2.7× the parameters wins on intent-style queries yet loses on lexical ones. The pattern implies that model size alone does not determine retrieval performance in this setting.
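To make the difference between an aggregate hit rate and a stratified one concrete, here is a minimal sketch. The record layout (`tag`, `k`, `bm25_hit`, `emb_hit`) and the toy values are illustrative assumptions, not data or code from the paper.

```python
from collections import defaultdict

def delta_hit_at_1(records):
    """Mean (embedder hit@1 - BM25 hit@1) per (tag, collision depth K) cell."""
    cells = defaultdict(list)
    for r in records:
        cells[(r["tag"], r["k"])].append(r["emb_hit"] - r["bm25_hit"])
    return {cell: sum(d) / len(d) for cell, d in cells.items()}

# Hypothetical per-query outcomes (1 = gold ranked first, 0 = not).
records = [
    {"tag": "service", "k": 8, "bm25_hit": 0, "emb_hit": 1},
    {"tag": "service", "k": 8, "bm25_hit": 0, "emb_hit": 1},
    {"tag": "preference", "k": 8, "bm25_hit": 1, "emb_hit": 0},
    {"tag": "preference", "k": 8, "bm25_hit": 0, "emb_hit": 0},
]

per_cell = delta_hit_at_1(records)  # one delta per (tag, K) cell
aggregate = sum(r["emb_hit"] - r["bm25_hit"] for r in records) / len(records)
```

In this toy data the aggregate delta is positive even though one tag moves in the opposite direction, which is the kind of pattern a single pooled number hides.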

Core claim

By forcing every distractor to share the answer's entity tokens and by stratifying queries by discriminator tag, the entity-collision protocol pins the BM25 floor by construction; any remaining lift over BM25 is therefore attributable to the embedder. Applied to an open-source agent-memory testbed, the protocol produces a two-axis pattern in which a 256-d hash trigram improves only closed-vocabulary lexical tags at deep collision, MiniLM-384 dominates both axes, and a BGE-large model with 2.7× the parameters fails to improve uniformly, winning on intent-style queries while losing on lexical ones. Encoder capacity alone is therefore not the binding constraint.

What carries the argument

Entity-collision protocol that forces distractors to share answer entity tokens (pinning BM25) and stratifies queries by discriminator tag (isolating embedder effects).
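The pinning idea can be illustrated with a small, self-contained BM25 scorer. This is a sketch under stated assumptions rather than the paper's implementation: the tokenizer is whitespace splitting, the `k1` and `b` values are common defaults, and the entity, gold, and distractor strings are hypothetical. The candidates are deliberately equal in length and the query overlaps them only on the entity tokens.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Plain Okapi BM25 over whitespace tokens; returns one score per doc."""
    toks = [d.lower().split() for d in docs]
    n = len(toks)
    avgdl = sum(len(t) for t in toks) / n
    # Document frequency: number of docs containing each term.
    df = Counter(term for t in toks for term in set(t))
    scores = []
    for t in toks:
        tf = Counter(t)
        s = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(t) / avgdl)
            s += idf * tf[term] * (k1 + 1) / norm
        scores.append(s)
    return scores

# Hypothetical collision set: every candidate carries the same entity tokens.
entity = "whitebridge cafe"
gold = f"{entity} prefers oat milk"
distractors = [f"{entity} accepts card payments", f"{entity} closes at nine"]
candidates = [gold] + distractors

scores = bm25_scores("what does whitebridge cafe like to drink", candidates)
```

Because every candidate matches the query on the same entity tokens with the same term frequency and length, BM25 assigns them identical scores and cannot rank the gold item above chance. Real queries may also overlap candidates on non-entity tokens, which is the residual leakage the referee comment asks the authors to measure.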

If this is right

  • Any lift above BM25 can be attributed to the embedder once entity overlap is controlled.
  • A 256-d hash trigram improves performance only on closed-vocabulary lexical tags under deep collision.
  • MiniLM-384 outperforms the tested alternatives on both lexical and intent-style query axes.
  • A 2.7-times-larger BGE model does not improve uniformly and underperforms MiniLM on lexical queries.
  • The synthetic intent-tag null result replicates on the external LongMemEval set as a single-session preference recall cliff.
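For readers unfamiliar with the weakest embedder in the comparison, the following sketch shows one way a 256-d hashed character-trigram vector could be built. The paper's exact scheme is not given in the text above, so the lowercasing, the two-space padding, the use of MD5 for bucket assignment, and the function names are all assumptions made for illustration.

```python
import hashlib
import math

def hash_trigram_embed(text, dim=256):
    """Hashed bag of character trigrams, L2-normalised (illustrative scheme)."""
    vec = [0.0] * dim
    padded = f"  {text.lower()}  "  # padding so word edges form trigrams
    for i in range(len(padded) - 2):
        tri = padded[i:i + 3]
        digest = hashlib.md5(tri.encode("utf-8")).digest()
        # Deterministic bucket (Python's built-in hash() is salted per process).
        bucket = int.from_bytes(digest[:4], "big") % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

def cosine(u, v):
    """Cosine similarity for vectors that are already L2-normalised."""
    return sum(a * b for a, b in zip(u, v))

e = hash_trigram_embed("payments")
near = cosine(e, hash_trigram_embed("payment"))
far = cosine(e, hash_trigram_embed("oat milk"))
```

A vector like this captures surface-form similarity only, which is consistent with the finding that it helps on closed-vocabulary lexical tags and not on intent-style queries.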

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Retriever systems may gain from query-type routing rather than seeking a single universal embedder.
  • Benchmarks that report only aggregate hit rates will continue to obscure the tag-specific and collision-specific patterns shown here.
  • Adaptive vector-weight routing will need stronger signals than those examined to close the reported oracle headroom gap.
  • Re-applying the protocol to other memory testbeds would test whether the two-axis dominance pattern generalizes.
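The oracle headroom mentioned in the list above can be made concrete with a short sketch. It assumes one plausible definition: the hit rate of an oracle that picks the best vector weight per query, minus the hit rate of the best single fixed weight. The paper's precise definition is not stated in the text above, and the weights and hit values here are hypothetical.

```python
def oracle_headroom(hits_by_weight):
    """hits_by_weight maps a vector weight to a list of per-query 0/1 hit@1."""
    n = len(next(iter(hits_by_weight.values())))
    # Best policy that uses one weight for every query.
    best_fixed = max(sum(h) / n for h in hits_by_weight.values())
    # Oracle: a query counts as a hit if any weight retrieves the gold item.
    oracle = sum(max(h[i] for h in hits_by_weight.values()) for i in range(n)) / n
    return oracle - best_fixed

# Hypothetical outcomes for five queries under three vector weights.
hits = {
    0.0: [1, 0, 1, 0, 0],  # BM25 only
    0.5: [1, 1, 0, 0, 0],  # equal blend
    1.0: [0, 1, 0, 1, 0],  # vector only
}
headroom = oracle_headroom(hits)
```

A router closes this gap only if it can predict, from features available at query time, which weight the oracle would have chosen.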

Load-bearing premise

Forcing shared entity tokens between distractors and answers together with tag stratification fully removes residual confounds from query generation and testbed structure.

What would settle it

Re-running the protocol on a testbed that removes the forced entity-token overlap and finding that embedder rankings or lift magnitudes shift substantially would show the isolation claim does not hold.

Figures

Figures reproduced from arXiv: 2605.29630 by Youwang Deng.

Figure 1: Entity-collision ∆hit@1 vs K, by tag × embedder, paired 95% CI bands at +0.104 [+0.076, +0.131], ~1.8× the hash lift. BGE-large-1024. BGE is CI-positive on 18/20 cells at K ≥ 4 (vs 20/20 MiniLM, 5/20 hash). The two CI-touching-zero cells (service/tool K=2) match the MiniLM and hash patterns at the same low collision regime: structural, not BGE-specific. 5.3 Two-axis interpretation The grid factors cleanly…
Original abstract

End-to-end agent-memory benchmarks report a single hit@k per retriever, confounding lexical leakage (uncontrolled query/gold/distractor entity overlap) with tag-mixing (preferences, services, tools averaged together). We propose entity-collision, a system-agnostic protocol that pins the BM25 floor by construction -- every distractor shares the answer's entity tokens -- and stratifies queries by discriminator tag, so any lift over BM25 is attributable to the embedder. Applied to an open-source agent-memory testbed across 5 tags x 3 embedders x 5 collision degrees with paired-bootstrap 95% CIs, the protocol reveals a two-axis pattern: a 256-d hash trigram helps only on closed-vocabulary lexical tags at deep collision; MiniLM-384 dominates both axes; and a 2.7x-parameter BGE-large does not uniformly improve on MiniLM -- it wins on intent-style queries but loses on lexical ones. Encoder capacity alone is not the binding constraint. The synthetic intent-tag null replicates on LongMemEval (n=500) as a single-session-preference recall cliff. Adaptive vector-weight routing on LoCoMo is a measured null: 11.7pp of oracle headroom exists, but no signal we tested recovers it. All 26 result tables and 37 reproduce scripts are version-controlled and verified by a public registry; the protocol is exercised on a deterministically governed memory testbed (event-sourced decision log, DAG-state-machine schema lifecycle) so every reported CI is reproducible byte-for-byte from the ingest stream.
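The abstract reports paired-bootstrap 95% CIs on the lift over BM25. The sketch below shows one common variant, a percentile bootstrap that resamples queries and keeps each query's BM25 and embedder outcomes paired. The resampling unit, the number of replicates, and the function name are assumptions; the paper's scripts may differ.

```python
import random

def paired_bootstrap_ci(base_hits, emb_hits, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for mean(emb_hit - base_hit), resampling queries in pairs."""
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    diffs = [e - b for b, e in zip(base_hits, emb_hits)]
    n = len(diffs)
    point = sum(diffs) / n
    means = sorted(
        sum(diffs[rng.randrange(n)] for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return point, lo, hi

# Hypothetical cell: BM25 never ranks gold first, the embedder does on 30 of 100.
bm25 = [0] * 100
emb = [1] * 30 + [0] * 70
point, lo, hi = paired_bootstrap_ci(bm25, emb)
```

A cell would be called CI-positive when `lo` is above zero. Pairing matters because both retrievers are scored on the same queries, so per-query difficulty cancels in the difference.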

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes the entity-collision protocol to attribute retrieval lifts in agent-memory systems to embedders rather than lexical leakage. By forcing every distractor to share the answer's entity tokens and stratifying queries by discriminator tag, the protocol is claimed to pin the BM25 baseline by construction; any remaining lift is then attributed to the embedder. Applied across 5 tags, 3 embedders, and 5 collision degrees on an event-sourced DAG testbed, the protocol yields a two-axis pattern: a 256-d hash trigram improves only closed-vocabulary lexical tags at deep collision; MiniLM-384 dominates both axes; and the 2.7×-parameter BGE-large does not uniformly beat MiniLM. The work also reports null results for adaptive vector routing and a replication on LongMemEval, with all 26 tables and 37 scripts version-controlled and byte-for-byte reproducible from the ingest stream.

Significance. If the isolation claim holds, the protocol supplies a system-agnostic, stratified evaluation method that separates embedder effects from uncontrolled entity overlap, revealing that encoder capacity is not the binding constraint and that lexical methods suffice under specific tag/collision conditions. The reproducibility provisions (public registry, deterministic testbed, verified scripts) constitute a concrete strength that permits independent verification of the paired-bootstrap CIs.

major comments (1)
  1. [Abstract] The claim that shared entity tokens plus tag stratification 'pins the BM25 floor by construction' and isolates embedder effects is load-bearing for the attribution of all reported lifts. The manuscript does not provide an explicit check that query-generation or distractor-selection steps introduce no residual tag-specific lexical overlaps beyond the entity tokens; without such a check, the two-axis pattern could partly reflect testbed artifacts rather than embedder properties.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful review and for isolating the load-bearing isolation claim. We respond point-by-point below.

  1. Referee: [Abstract] The claim that shared entity tokens plus tag stratification 'pins the BM25 floor by construction' and isolates embedder effects is load-bearing for the attribution of all reported lifts. The manuscript does not provide an explicit check that query-generation or distractor-selection steps introduce no residual tag-specific lexical overlaps beyond the entity tokens; without such a check, the two-axis pattern could partly reflect testbed artifacts rather than embedder properties.

    Authors: We agree that an explicit verification step would strengthen the attribution. The protocol enforces entity-token sharing at the distractor-selection stage by construction, and queries are drawn from the same event-sourced DAG to keep tag content controlled. Nevertheless, to rule out residual tag-specific lexical leakage introduced during query generation, the revision will add an appendix reporting per-tag token-overlap statistics (excluding the pinned entity tokens) between queries and distractors on the deterministic testbed. This will confirm that no systematic tag-specific overlaps remain beyond the entity tokens. revision: yes
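The check promised in the rebuttal could look roughly like the sketch below: for each tag, measure how much queries and distractors overlap once the pinned entity tokens are removed. The choice of Jaccard overlap, the whitespace tokenizer, the item layout, and the example strings are assumptions for illustration; the rebuttal does not specify the statistic.

```python
from collections import defaultdict

def residual_overlap_by_tag(items):
    """Mean query/distractor Jaccard overlap per tag, entity tokens excluded."""
    per_tag = defaultdict(list)
    for it in items:
        entity = set(it["entity"].lower().split())
        q = set(it["query"].lower().split()) - entity
        for d in it["distractors"]:
            dt = set(d.lower().split()) - entity
            union = q | dt
            per_tag[it["tag"]].append(len(q & dt) / len(union) if union else 0.0)
    return {tag: sum(v) / len(v) for tag, v in per_tag.items()}

# Hypothetical items; only the "service" query leaks a non-entity token.
items = [
    {
        "tag": "service",
        "entity": "whitebridge cafe",
        "query": "which payments does whitebridge cafe accept",
        "distractors": [
            "whitebridge cafe closes at nine",
            "whitebridge cafe payments desk moved",
        ],
    },
    {
        "tag": "preference",
        "entity": "alice",
        "query": "what drink does alice like",
        "distractors": ["alice owns a bicycle"],
    },
]
overlap = residual_overlap_by_tag(items)
```

A tag whose residual overlap is clearly above the others would indicate lexical leakage that the entity pinning does not control, and lifts reported for that tag would need a more cautious reading.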

Circularity Check

0 steps flagged

No circularity: protocol defined independently and applied to external testbeds

The paper defines the entity-collision protocol by explicit construction rules (shared entity tokens between distractors and answer, plus tag stratification) and then measures empirical lifts on external, deterministically governed testbeds with reproducible scripts and CIs. No equations reduce reported patterns or lifts to fitted parameters or self-referential quantities. No self-citations are load-bearing. The 'by construction' phrasing describes the protocol's definitional isolation, not a derivation that loops back to the results themselves. The chain is self-contained against the testbed data.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The protocol rests on the domain assumption that entity-token sharing controls lexical leakage and that the testbed is deterministic; no free parameters are fitted to produce the central claims, and no new entities are postulated.

axioms (2)
  • domain assumption: Entity-token overlap between distractors and answer pins the BM25 floor by construction.
    Core mechanism stated in the abstract for isolating embedder lift.
  • domain assumption: The memory testbed is deterministically governed, with an event-sourced decision log and a DAG-state-machine schema lifecycle.
    Invoked to support byte-for-byte reproducibility of all CIs.

pith-pipeline@v0.9.1-grok · 5818 in / 1278 out tokens · 42145 ms · 2026-06-29T08:07:37.089079+00:00 · methodology

