CoMeT: Collaborative Memory Transformer for Efficient Long Context Modeling
Pith reviewed 2026-05-16 08:50 UTC · model grok-4.3
The pith
CoMeT uses a dual FIFO-and-gated memory system to let transformers process arbitrarily long sequences at constant memory and linear time cost.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CoMeT processes input in sequential chunks. It maintains a temporary FIFO memory for recent events and a global memory whose updates are controlled by a gating mechanism. The combined memories serve as a compact, dynamic soft prompt that conditions the transformer layers as they process the current chunk, yielding a constant memory footprint and linear time complexity regardless of total sequence length.
What carries the argument
Dual-memory system (FIFO temporary queue plus gated global memory) that supplies a dynamic soft prompt to each new chunk.
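The chunk loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `compress` (mean-pooling), the sigmoid gate in `gated_update`, the random `w_gate`, and all sizes (`chunk_len`, `n_slots`, `fifo_len`) are assumptions made for the example, and the transformer forward pass is left as a comment. The sketch also assumes every chunk has at least `n_slots` rows.

```python
from collections import deque
import numpy as np

def compress(chunk, n_slots):
    """Stand-in for a learned chunk summary: mean-pool the chunk into n_slots vectors."""
    parts = np.array_split(chunk, n_slots, axis=0)
    return np.stack([p.mean(axis=0) for p in parts])

def gated_update(global_mem, candidate, w_gate):
    """Assumed gate: g = sigmoid(candidate @ w_gate), then a convex mix of old and new."""
    g = 1.0 / (1.0 + np.exp(-(candidate @ w_gate)))  # shape (n_slots, 1)
    return (1.0 - g) * global_mem + g * candidate

def process(sequence, chunk_len=64, n_slots=4, fifo_len=3, seed=0):
    """Walk the sequence chunk by chunk, returning the final global memory and
    the soft-prompt length seen by each chunk."""
    d = sequence.shape[1]
    rng = np.random.default_rng(seed)
    w_gate = rng.normal(scale=0.1, size=(d, 1))  # hypothetical gate weights
    fifo = deque(maxlen=fifo_len)  # temporary memory: oldest summary is evicted automatically
    global_mem = np.zeros((n_slots, d))
    prompt_sizes = []
    for start in range(0, len(sequence), chunk_len):
        chunk = sequence[start:start + chunk_len]
        soft_prompt = np.concatenate([global_mem, *fifo], axis=0)
        prompt_sizes.append(soft_prompt.shape[0])
        # A real model would run its transformer layers on [soft_prompt; chunk] here.
        summary = compress(chunk, n_slots)
        fifo.append(summary)
        global_mem = gated_update(global_mem, summary, w_gate)
    return global_mem, prompt_sizes
```

With these assumed sizes the soft prompt never exceeds `n_slots * (1 + fifo_len)` rows, however many chunks are processed, which is the constant-footprint property the claim rests on.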
If this is right
- Models fine-tuned with CoMeT on 32k contexts retrieve a passkey from any position inside a 1M-token sequence with high accuracy.
- On SCROLLS summarization tasks, CoMeT achieves performance comparable to a full-attention baseline while using constant memory.
- The method supports practical agent and user-behavior QA workloads that require long context.
- Layer-level pipeline parallelism makes fine-tuning on extremely long contexts efficient.
Where Pith is reading between the lines
- The fixed-size memory design could enable long-context inference on hardware with tight RAM limits without retraining the base model.
- If the gated global memory successfully filters noise, similar chunked memory mechanisms might apply to streaming video or sensor data where full history is unavailable.
- The separation of short-term and long-term memories suggests a possible route to reduce attention cost in multi-turn dialogue systems that must retain distant user preferences.
Load-bearing premise
The dual-memory system with its FIFO temporary queue and gated global updates can preserve all task-relevant long-range information across arbitrary lengths without irreversible loss or interference between chunks.
What would settle it
Run the passkey retrieval test on a 1M-token sequence after 32k fine-tuning and observe whether accuracy drops to chance for keys placed at positions far beyond the 32k training length.
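A passkey test of this kind can be assembled as follows. The sketch is an assumption about the setup, not the paper's evaluation harness: the filler text, the needle wording, and the names `build_passkey_prompt`, `accuracy_by_depth`, and `answer_fn` are all hypothetical, with `answer_fn` standing in for whichever model is under test.

```python
import random

FILLER = "The grass is green. The sky is blue. The sun is yellow. "

def build_passkey_prompt(n_filler, depth, seed=0):
    """Return (prompt, passkey) with the passkey sentence placed at a relative depth in [0, 1]."""
    rng = random.Random(seed)
    passkey = str(rng.randint(10000, 99999))
    needle = f"The pass key is {passkey}. Remember it. "
    insert_at = int(round(depth * n_filler))
    parts = [FILLER] * insert_at + [needle] + [FILLER] * (n_filler - insert_at)
    return "".join(parts) + "What is the pass key?", passkey

def accuracy_by_depth(answer_fn, n_filler=1000, depths=(0.0, 0.25, 0.5, 0.75, 1.0), trials=5):
    """Score answer_fn(prompt) -> str at each depth; answer_fn is a hypothetical model hook."""
    results = {}
    for depth in depths:
        hits = 0
        for t in range(trials):
            prompt, passkey = build_passkey_prompt(n_filler, depth, seed=t)
            hits += passkey in answer_fn(prompt)
        results[depth] = hits / trials
    return results
```

Reporting accuracy per depth rather than a single average is what would show whether retrieval fails specifically for keys far from the end of the sequence.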
Original abstract
The quadratic complexity and indefinitely growing key-value (KV) cache of standard Transformers pose a major barrier to long-context processing. To overcome this, we introduce the Collaborative Memory Transformer (CoMeT), a novel architecture that enables LLMs to handle arbitrarily long sequences with constant memory usage and linear time complexity. Designed as an efficient, plug-in module, CoMeT can be integrated into pre-trained models with only minimal fine-tuning. It operates on sequential data chunks, using a dual-memory system to manage context: a temporary memory on a FIFO queue for recent events, and a global memory with a gated update rule for long-range dependencies. These memories then act as a dynamic soft prompt for the next chunk. To enable efficient fine-tuning on extremely long contexts, we introduce a novel layer-level pipeline parallelism strategy. The effectiveness of our approach is remarkable: a model equipped with CoMeT and fine-tuned on 32k contexts can accurately retrieve a passkey from any position within a 1M token sequence. On the SCROLLS benchmark, CoMeT surpasses other efficient methods and achieves performance comparable to a full-attention baseline on summarization tasks. Its practical effectiveness is further validated on real-world agent and user behavior QA tasks. The code is available at: https://github.com/LivingFutureLab/Comet
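The growth of the KV cache that motivates the abstract is simple arithmetic. The sketch below uses the standard size formula for a cache of keys and values; the model configuration, chunk length, and number of memory slots are hypothetical and not taken from the paper.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Keys and values (the factor 2) for every layer, KV head, and cached position."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical configuration: 32 layers, 8 KV heads, head dimension 128, 16-bit values.
cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128)

# Full attention: the cache grows linearly with the number of tokens seen.
for seq_len in (32_000, 1_000_000):
    gib = kv_cache_bytes(seq_len, **cfg) / 2**30
    print(f"{seq_len:>9} tokens -> {gib:.1f} GiB")

# Chunked processing with fixed-size memory: only one chunk plus its memory slots is cached.
chunk_len, memory_slots = 2_048, 256  # hypothetical sizes
fixed_gib = kv_cache_bytes(chunk_len + memory_slots, **cfg) / 2**30
print(f"chunk + memory -> {fixed_gib:.2f} GiB, independent of total length")
```

Under these assumed numbers the full-attention cache at 1M tokens is over a hundred GiB, while the chunked footprint stays well under one GiB at any sequence length.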
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CoMeT, a plug-in module for pre-trained Transformers that processes long sequences in chunks using a dual-memory system: a temporary FIFO queue for recent context and a fixed-size global memory updated via a gated rule. These memories serve as dynamic soft prompts for subsequent chunks, yielding constant memory usage and linear time complexity. The central empirical claims are that a model fine-tuned only on 32k contexts achieves 100% passkey retrieval accuracy at any position in 1M-token sequences, competitive SCROLLS scores (especially on summarization), and strong results on agent/user-behavior QA tasks. Code is released.
Significance. If the preservation properties of the gated global memory hold under extrapolation, the work would offer a practical route to constant-memory long-context inference without quadratic attention costs or growing KV caches. The plug-in design and pipeline-parallelism fine-tuning strategy lower the barrier to adoption. Reproducibility via the linked code repository strengthens the contribution relative to purely theoretical proposals.
major comments (3)
- [Passkey retrieval experiment] Passkey retrieval experiment (abstract and §4): the claim of 100% accuracy at 1M tokens after 32k fine-tuning is load-bearing for the extrapolation argument, yet no ablation on chunk count, no measurement of passkey embedding fidelity (e.g., cosine similarity before/after each global update), and no error analysis across positions are provided. This leaves the weakest assumption—that the fixed-size gated memory retains arbitrary tokens without irreversible dilution over ~30 chunks—unverified.
- [Method: dual-memory system] Gated global memory update rule (method section): the description of the gated update does not include a formal analysis or ablation demonstrating that task-relevant signals survive successive updates as sequence length grows linearly. Because memory size is constant while chunk count increases, any non-perfect gate introduces a risk of overwriting or averaging away specific tokens; this mechanism is central to the constant-memory claim and requires explicit verification.
- [Experiments] Experimental reporting (abstract and §4): strong claims are made without quantitative metrics, baseline comparisons with error bars, or ablation studies on memory sizes, gate parameters, or chunk lengths. This absence weakens support for both the 1M retrieval result and the SCROLLS competitiveness assertion.
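The dilution risk raised in the second major comment can be made concrete with a short derivation. The update form below is an assumption for illustration (a scalar gate $g_t \in [0,1]$ mixing the previous memory $M_{t-1}$ with a chunk summary $C_t$); the paper's actual rule may differ.

```latex
% Assumed convex gated update (illustrative, not the paper's stated rule):
M_t = (1 - g_t)\, M_{t-1} + g_t\, C_t .

% Unrolling over T chunks:
M_T = \Big( \prod_{j=1}^{T} (1 - g_j) \Big) M_0
      + \sum_{k=1}^{T} \Big( \prod_{j=k+1}^{T} (1 - g_j) \Big) g_k\, C_k .

% Weight of the content written at chunk k after all later updates:
w_k = g_k \prod_{j=k+1}^{T} (1 - g_j) .

% If every later gate satisfies g_j \ge \varepsilon > 0, then
w_k \le g_k\, (1 - \varepsilon)^{T-k},
\qquad \text{e.g. } (1 - 0.1)^{30} \approx 0.042 .
```

Under this assumed form, a token survives many updates only if the later gates stay close to zero for the slot that holds it, which is the behaviour the requested ablation would need to demonstrate.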
minor comments (3)
- [Abstract] Abstract states results are 'remarkable' but supplies no numbers; move at least the key quantitative outcomes (e.g., passkey accuracy, SCROLLS deltas) into the abstract.
- [Method] Notation for the gated update and FIFO queue should be formalized with equations rather than prose alone to aid reproducibility.
- [Related work and experiments] Add explicit comparison table against recent memory-augmented or chunked-attention baselines (e.g., Infini-Transformer, Ring Attention) with identical model sizes.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. We address each major point below with the strongest honest defense supported by the manuscript. Where additional experiments or clarifications are feasible, we will revise the paper accordingly.
- Referee: [Passkey retrieval experiment] Passkey retrieval experiment (abstract and §4): the claim of 100% accuracy at 1M tokens after 32k fine-tuning is load-bearing for the extrapolation argument, yet no ablation on chunk count, no measurement of passkey embedding fidelity (e.g., cosine similarity before/after each global update), and no error analysis across positions are provided. This leaves the weakest assumption (that the fixed-size gated memory retains arbitrary tokens without irreversible dilution over ~30 chunks) unverified.
  Authors: We agree that deeper verification of memory retention strengthens the extrapolation claim. The manuscript reports end-to-end 100% accuracy on 1M-token passkey retrieval after 32k fine-tuning as the primary evidence of effective long-range preservation. In revision we will add (i) an ablation varying chunk count while holding total length fixed, (ii) cosine-similarity measurements of the passkey embedding before and after each global-memory update, and (iii) per-position error rates across the 1M sequence. These additions will directly test whether the gated memory avoids irreversible dilution over the ~30 chunks required for 1M tokens.
  Revision: yes
- Referee: [Method: dual-memory system] Gated global memory update rule (method section): the description of the gated update does not include a formal analysis or ablation demonstrating that task-relevant signals survive successive updates as sequence length grows linearly. Because memory size is constant while chunk count increases, any non-perfect gate introduces a risk of overwriting or averaging away specific tokens; this mechanism is central to the constant-memory claim and requires explicit verification.
  Authors: The gated update is presented as an empirical design choice that selectively retains information via learned gates. While a closed-form proof of perfect preservation is not provided (and may be intractable for learned gates), the manuscript demonstrates that the mechanism supports 100% retrieval at 1M tokens. In revision we will include an ablation that tracks retrieval accuracy and gate activation statistics as the number of chunks increases from 1 to 30+, thereby empirically verifying that task-relevant signals survive successive updates under constant memory size.
  Revision: partial
- Referee: [Experiments] Experimental reporting (abstract and §4): strong claims are made without quantitative metrics, baseline comparisons with error bars, or ablation studies on memory sizes, gate parameters, or chunk lengths. This absence weakens support for both the 1M retrieval result and the SCROLLS competitiveness assertion.
  Authors: The current manuscript already reports SCROLLS scores against multiple efficient baselines and a full-attention reference, plus results on agent/user-behavior QA. To address the concern we will expand §4 with (i) mean and standard-deviation metrics over at least three random seeds, (ii) explicit numerical tables comparing all baselines with error bars, and (iii) additional ablations on memory size, gate temperature, and chunk length. These revisions will supply the quantitative rigor requested while preserving the existing competitive claims.
  Revision: yes
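The cosine-similarity diagnostic discussed in the responses can be prototyped before touching a real model. The sketch below is a toy simulation under stated assumptions: a single memory slot, a constant scalar `gate`, a convex update, and random vectors standing in for chunk content. `retention_trace` is a hypothetical name; in an actual experiment the memory states would be read from the model instead of simulated.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def retention_trace(n_chunks=30, dim=64, gate=0.1, seed=0):
    """Store one 'passkey' vector in a memory slot, apply n_chunks gated updates with
    unrelated random content, and record cosine similarity to the passkey after each."""
    rng = np.random.default_rng(seed)
    passkey = rng.normal(size=dim)
    memory = passkey.copy()
    trace = []
    for _ in range(n_chunks):
        distractor = rng.normal(size=dim)
        memory = (1.0 - gate) * memory + gate * distractor  # assumed convex gated update
        trace.append(cosine(memory, passkey))
    return trace

if __name__ == "__main__":
    for g in (0.0, 0.01, 0.1):
        print(f"gate={g}: similarity after 30 chunks = {retention_trace(gate=g)[-1]:.3f}")
```

Plotting the trace against chunk index, alongside the gate activations a trained model actually produces, would show directly whether a stored token is being preserved or averaged away.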
Circularity Check
No circularity: architectural design with external benchmark validation
The paper presents CoMeT as a new plug-in architecture using a dual-memory system (FIFO temporary queue + gated global memory) that acts as a dynamic soft prompt. All reported results, including 1M-token passkey retrieval after 32k fine-tuning and SCROLLS benchmark scores, are empirical measurements on external tasks. No equations, fitted parameters, or self-citations are shown to reduce any performance claim to a quantity defined by the paper's own inputs. The derivation chain consists of design choices followed by standard training and evaluation; it remains self-contained against benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- gated update parameters
axioms (1)
- domain assumption: Pre-trained transformer layers remain effective when augmented with external memory prompts
invented entities (2)
- temporary FIFO memory: no independent evidence
- global gated memory: no independent evidence