Entity-Collision: A Stratified Protocol for Attributing Retrieval Lift in Agent Memory
Pith reviewed 2026-06-29 08:07 UTC · model grok-4.3
The pith
The entity-collision protocol pins the BM25 floor by forcing every distractor to share the answer's entity tokens, so any lift over BM25 is attributable to the embedder.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By forcing every distractor to share the answer's entity tokens and by stratifying queries by discriminator tag, the entity-collision protocol pins the BM25 floor by construction; any remaining lift over BM25 is therefore attributable to the embedder. Applied to an open agent-memory testbed the protocol produces a two-axis pattern in which a 256-d hash trigram improves only closed-vocabulary lexical tags at deep collision, MiniLM-384 dominates both axes, and a 2.7-times-larger BGE model fails to improve uniformly, winning on intent-style queries while losing on lexical ones; encoder capacity is therefore not the binding constraint.
What carries the argument
Entity-collision protocol that forces distractors to share answer entity tokens (pinning BM25) and stratifies queries by discriminator tag (isolating embedder effects).
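The two mechanisms named above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Memory` record, the whitespace tokenizer, the tag labels, and the function names are all hypothetical, and the paper may select distractors differently.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Memory:
    text: str
    tag: str  # discriminator tag; the label values used here are hypothetical


def tokens(text: str) -> set[str]:
    # Assumption: lowercase whitespace tokenization stands in for the real tokenizer.
    return set(text.lower().split())


def build_collision_set(gold: Memory, pool: list[Memory],
                        entity_tokens: list[str], degree: int) -> list[Memory]:
    """Return `degree` distractors that each contain every entity token of the gold memory."""
    ents = {t.lower() for t in entity_tokens}
    if not ents <= tokens(gold.text):
        raise ValueError("gold memory must contain the entity tokens")
    # Keep only candidates that share all entity tokens, so the shared
    # entity terms cannot separate gold from distractors lexically.
    candidates = [m for m in pool if m != gold and ents <= tokens(m.text)]
    if len(candidates) < degree:
        raise ValueError("not enough entity-sharing distractors for this collision degree")
    return candidates[:degree]


def stratify_by_tag(items: list[Memory]) -> dict[str, list[Memory]]:
    """Group items by discriminator tag so each stratum can be scored separately."""
    buckets: dict[str, list[Memory]] = defaultdict(list)
    for item in items:
        buckets[item.tag].append(item)
    return dict(buckets)
```

In this sketch the collision degree is simply the number of entity-sharing distractors placed next to the gold memory; scoring each tag bucket separately is what keeps tag-mixing out of the reported numbers.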
If this is right
- Any lift above BM25 can be attributed to the embedder once entity overlap is controlled.
- A 256-d hash trigram improves performance only on closed-vocabulary lexical tags under deep collision.
- MiniLM-384 outperforms the tested alternatives on both lexical and intent-style query axes.
- A 2.7-times-larger BGE model does not improve uniformly and underperforms MiniLM on lexical queries.
- The synthetic intent-tag null result replicates on the external LongMemEval set as a single-session preference recall cliff.
Where Pith is reading between the lines
- Retriever systems may gain from query-type routing rather than seeking a single universal embedder.
- Benchmarks that report only aggregate hit rates will continue to obscure the tag-specific and collision-specific patterns shown here.
- Adaptive vector-weight routing will need stronger signals than those examined to close the reported oracle headroom gap.
- Re-applying the protocol to other memory testbeds would test whether the two-axis dominance pattern generalizes.
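The oracle headroom mentioned in the routing point can be made concrete with a small sketch. It assumes a simplified two-way choice between BM25 and a vector retriever per query, which is an assumption made for illustration; the paper's adaptive vector-weight routing may mix scores rather than pick one system, and the function name is hypothetical.

```python
def oracle_headroom(bm25_hits: list[int], vector_hits: list[int]) -> tuple[float, float, float]:
    """Return (best fixed hit rate, oracle hit rate, headroom) from per-query 0/1 hit@1 flags."""
    if len(bm25_hits) != len(vector_hits) or not bm25_hits:
        raise ValueError("need two non-empty hit lists of equal length")
    n = len(bm25_hits)
    # Best single retriever applied to every query.
    best_fixed = max(sum(bm25_hits) / n, sum(vector_hits) / n)
    # Oracle router: per query, pick whichever retriever hit.
    oracle = sum(max(b, v) for b, v in zip(bm25_hits, vector_hits)) / n
    return best_fixed, oracle, oracle - best_fixed
```

Headroom defined this way is an upper bound: it says how much a perfect router could gain, not that any observable signal can recover it.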
Load-bearing premise
Forcing distractors to share the answer's entity tokens, combined with tag stratification, fully removes residual confounds from query generation and testbed structure.
What would settle it
Re-running the protocol on a testbed that removes the forced entity-token overlap and finding that embedder rankings or lift magnitudes shift substantially would show the isolation claim does not hold.
Original abstract
End-to-end agent-memory benchmarks report a single hit@k per retriever, confounding lexical leakage (uncontrolled query/gold/distractor entity overlap) with tag-mixing (preferences, services, tools averaged together). We propose entity-collision, a system-agnostic protocol that pins the BM25 floor by construction -- every distractor shares the answer's entity tokens -- and stratifies queries by discriminator tag, so any lift over BM25 is attributable to the embedder. Applied to an open-source agent-memory testbed across 5 tags x 3 embedders x 5 collision degrees with paired-bootstrap 95% CIs, the protocol reveals a two-axis pattern: a 256-d hash trigram helps only on closed-vocabulary lexical tags at deep collision; MiniLM-384 dominates both axes; and a 2.7x-parameter BGE-large does not uniformly improve on MiniLM -- it wins on intent-style queries but loses on lexical ones. Encoder capacity alone is not the binding constraint. The synthetic intent-tag null replicates on LongMemEval (n=500) as a single-session-preference recall cliff. Adaptive vector-weight routing on LoCoMo is a measured null: 11.7pp of oracle headroom exists, but no signal we tested recovers it. All 26 result tables and 37 reproduce scripts are version-controlled and verified by a public registry; the protocol is exercised on a deterministically governed memory testbed (event-sourced decision log, DAG-state-machine schema lifecycle) so every reported CI is reproducible byte-for-byte from the ingest stream.
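The abstract reports paired-bootstrap 95% CIs without giving the procedure. One common form is a percentile bootstrap over per-query differences, sketched below. The resample count, seed, percentile method, and function name are assumptions, not details taken from the paper.

```python
import random


def paired_bootstrap_ci(hits_a: list[int], hits_b: list[int],
                        n_resamples: int = 2000, alpha: float = 0.05,
                        seed: int = 0) -> tuple[float, float, float]:
    """Return (mean difference, lower bound, upper bound) for paired per-query scores."""
    if len(hits_a) != len(hits_b) or not hits_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(hits_a)
    # Pairing: each query contributes one difference, resampled as a unit.
    diffs = [a - b for a, b in zip(hits_a, hits_b)]
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lower, upper
```

Resampling the differences rather than each system independently is what makes the test paired: query difficulty cancels within each pair, so the interval reflects the gap between retrievers rather than variation across queries.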
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes the entity-collision protocol to attribute retrieval lifts in agent-memory systems to embedders rather than lexical leakage. By forcing every distractor to share the answer's entity tokens and stratifying queries by discriminator tag, the protocol is claimed to pin the BM25 baseline by construction; any remaining lift is then attributed to the embedder. Applied across 5 tags, 3 embedders, and 5 collision degrees on an event-sourced DAG testbed, the protocol yields a two-axis pattern: a 256-d hash trigram improves only closed-vocabulary lexical tags at deep collision; MiniLM-384 dominates both axes; and the 2.7x-parameter BGE-large does not uniformly beat MiniLM. The work also reports null results for adaptive vector routing and a replication on LongMemEval, with all 26 tables and 37 scripts version-controlled and byte-for-byte reproducible from the ingest stream.
Significance. If the isolation claim holds, the protocol supplies a system-agnostic, stratified evaluation method that separates embedder effects from uncontrolled entity overlap, revealing that encoder capacity is not the binding constraint and that lexical methods suffice under specific tag/collision conditions. The reproducibility provisions (public registry, deterministic testbed, verified scripts) constitute a concrete strength that permits independent verification of the paired-bootstrap CIs.
major comments (1)
- [Abstract] The claim that shared entity tokens plus tag stratification 'pins the BM25 floor by construction' and isolates embedder effects is load-bearing for the attribution of all reported lifts. The manuscript does not provide an explicit check that the query-generation or distractor-selection steps introduce no residual tag-specific lexical overlaps beyond the entity tokens; without such a check, the two-axis pattern could partly reflect testbed artifacts rather than embedder properties.
Simulated Author's Rebuttal
We thank the referee for the careful review and for isolating the load-bearing isolation claim. We respond point-by-point below.
- Referee: [Abstract] The claim that shared entity tokens plus tag stratification 'pins the BM25 floor by construction' and isolates embedder effects is load-bearing for the attribution of all reported lifts. The manuscript does not provide an explicit check that the query-generation or distractor-selection steps introduce no residual tag-specific lexical overlaps beyond the entity tokens; without such a check, the two-axis pattern could partly reflect testbed artifacts rather than embedder properties.
Authors: We agree that an explicit verification step would strengthen the attribution. The protocol enforces entity-token sharing at the distractor-selection stage by construction, and queries are drawn from the same event-sourced DAG to keep tag content controlled. Nevertheless, to rule out residual tag-specific lexical leakage introduced during query generation, the revision will add an appendix reporting per-tag token-overlap statistics (excluding the pinned entity tokens) between queries and distractors on the deterministic testbed. This will confirm that no systematic tag-specific overlaps remain beyond the entity tokens. revision: yes
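The promised per-tag overlap check could take roughly the following shape. This is a sketch only: Jaccard overlap, whitespace tokenization, and the row layout are assumptions chosen for illustration, since the rebuttal does not specify which statistic the appendix will report.

```python
from collections import defaultdict


def residual_overlap(query: str, distractor: str, entity_tokens: list[str]) -> float:
    """Jaccard overlap between query and distractor tokens after removing pinned entity tokens."""
    ents = {t.lower() for t in entity_tokens}
    # Assumption: lowercase whitespace tokenization stands in for the real tokenizer.
    q = set(query.lower().split()) - ents
    d = set(distractor.lower().split()) - ents
    union = q | d
    return len(q & d) / len(union) if union else 0.0


def per_tag_overlap(rows: list[tuple[str, str, str, list[str]]]) -> dict[str, float]:
    """Average residual overlap per tag; each row is (tag, query, distractor, entity_tokens)."""
    by_tag: dict[str, list[float]] = defaultdict(list)
    for tag, query, distractor, ents in rows:
        by_tag[tag].append(residual_overlap(query, distractor, ents))
    return {tag: sum(vals) / len(vals) for tag, vals in by_tag.items()}
```

A tag whose mean residual overlap stands well apart from the others would point to lexical leakage that the entity-token pin does not cover.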
Circularity Check
No circularity: protocol defined independently and applied to external testbeds
The paper defines the entity-collision protocol by explicit construction rules (shared entity tokens between distractors and answer, plus tag stratification) and then measures empirical lifts on external, deterministically governed testbeds with reproducible scripts and CIs. No equations reduce reported patterns or lifts to fitted parameters or self-referential quantities. No self-citations are load-bearing. The 'by construction' phrasing describes the protocol's definitional isolation, not a derivation that loops back to the results themselves. The chain is self-contained against the testbed data.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Entity token overlap between distractors and answer pins the BM25 floor by construction
- domain assumption: The memory testbed is deterministically governed with event-sourced decision log and DAG-state-machine schema