Online Vector Quantized Attention
Pith reviewed 2026-05-21 13:38 UTC · model grok-4.3
The pith
Online vector-quantized attention matches self-attention performance on sequences up to 64,000 tokens while using linear compute and constant memory.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OVQ-attention requires linear compute costs and constant memory, but unlike linear attention and SSMs it uses a sparse memory update that allows it to greatly increase the size of its memory state and consequently memory capacity. A theoretical basis is developed from Gaussian mixture regression, and the method is evaluated on synthetic long-context tasks as well as long-context language modeling, where it shows significant gains over linear attention baselines and the original VQ-attention while remaining competitive with strong self-attention baselines up to 64k sequence length despite using only a small fraction of the memory.
What carries the argument
The sparse memory update rule derived from Gaussian mixture regression, which expands the effective memory state size while keeping overall memory usage constant.
If this is right
- Language models could process sequences up to at least 64k tokens with performance close to standard self-attention but at a small fraction of the memory cost.
- Linear attention and SSM approaches could be improved by incorporating similar sparse memory expansion for better long-range dependency handling.
- Training and inference on longer contexts become feasible without quadratic memory growth or the information loss typical of fixed-size memory states.
- The same mechanism could extend the effective context length in existing transformer architectures with minimal additional hardware requirements.
Where Pith is reading between the lines
- Hybrid models that combine this sparse memory state with existing state-space models might further reduce compute while preserving recall over even longer ranges.
- The approach suggests a path toward constant-memory architectures that still support retrieval of arbitrary distant tokens if the Gaussian mixture update can be made more selective.
- Real-world document and code completion tasks that span tens of thousands of tokens would be a natural next test to measure whether the synthetic-task gains translate to practical utility.
Load-bearing premise
The sparse memory update rule derived from Gaussian mixture regression is assumed to retain enough long-range information across updates without interference or loss of earlier details.
What would settle it
An experiment in which OVQ-attention produces substantially lower accuracy than full self-attention on a task that requires recalling a specific fact introduced only at the start of a 64,000-token sequence would indicate that the memory capacity increase does not deliver the claimed long-range retention.
Figures
read the original abstract
Standard sequence mixing layers used in language models struggle to balance efficiency and performance. Self-attention performs well on long context tasks but has expensive quadratic compute and linear memory costs, while linear attention and SSMs use only linear compute and constant memory but struggle with long context processing. In this paper, we develop a sequence mixing layer that aims to find a better compromise between memory-compute costs and long-context processing, which we call online vector-quantized (OVQ) attention. OVQ-attention requires linear compute costs and constant memory, but, unlike linear attention and SSMs, it uses a sparse memory update that allows it to greatly increase the size of its memory state and, consequently, memory capacity. We develop a theoretical basis for OVQ-attention based on Gaussian mixture regression, and we test it on a variety of synthetic long context tasks and on long context language modeling. OVQ-attention shows significant improvements over linear attention baselines and the original VQ-attention, on which OVQ-attention was inspired. It demonstrates competitive, and sometimes identical, performance to strong self-attention baselines up 64k sequence length, despite using a small fraction of the memory of full self-attention.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces online vector-quantized (OVQ) attention, a sequence mixing layer for language models that achieves linear compute and constant memory via a sparse memory update rule derived from Gaussian mixture regression. It reports significant gains over linear attention and original VQ-attention baselines, along with competitive (sometimes identical) performance to strong self-attention models on synthetic long-context tasks and language modeling up to 64k sequence lengths, while using a small fraction of the memory.
Significance. If the results hold under further verification, OVQ-attention offers a promising compromise between the efficiency of linear attention/SSMs and the long-context performance of full self-attention. The explicit derivation from Gaussian mixture regression supplies a theoretical basis that strengthens the construction relative to purely empirical alternatives.
major comments (1)
- [§4] §4 (Experiments): the central performance claims at 64k lengths rest on the sparse update rule preserving long-range information without catastrophic forgetting or interference, yet the manuscript provides no ablation comparing this GMM-derived rule to other constant-memory strategies of matched capacity (e.g., FIFO, reservoir sampling, or learned compression). Without such isolation, gains could be attributable to codebook size, quantization details, or training procedure rather than the proposed sparsity mechanism.
minor comments (2)
- [Abstract] The abstract contains the phrase 'up 64k sequence length'; this should be corrected to 'up to 64k sequence length' for clarity.
- [Methods] Notation for the sparsity/update-rate hyperparameter and its relation to the number of mixture components could be made more explicit in the methods section to aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and recommendation for major revision. The suggested ablation will help isolate the contribution of the GMM-derived update rule, and we will incorporate it in the revised manuscript.
read point-by-point responses
-
Referee: [§4] §4 (Experiments): the central performance claims at 64k lengths rest on the sparse update rule preserving long-range information without catastrophic forgetting or interference, yet the manuscript provides no ablation comparing this GMM-derived rule to other constant-memory strategies of matched capacity (e.g., FIFO, reservoir sampling, or learned compression). Without such isolation, gains could be attributable to codebook size, quantization details, or training procedure rather than the proposed sparsity mechanism.
Authors: We agree that additional ablations would strengthen the experimental claims by isolating the effect of the GMM-derived sparse update. The update rule is derived from Gaussian mixture regression, which supplies a probabilistic justification for sparse, quantized memory that preserves long-range information without explicit forgetting mechanisms. This theoretical basis distinguishes OVQ-attention from heuristic constant-memory approaches. In the revision we will add experiments comparing OVQ-attention against FIFO and reservoir sampling baselines with matched memory capacity on the 64k-length tasks. We will also include a brief discussion of learned compression, noting that it typically requires extra parameters and training overhead not present in our fixed codebook approach. These changes should clarify that performance differences arise from the sparsity mechanism rather than codebook size or training details alone. revision: yes
Circularity Check
No significant circularity; OVQ-attention derivation from GMM regression is independent of target performance claims
full rationale
The paper constructs OVQ-attention by deriving a sparse memory update rule from Gaussian mixture regression, then validates the resulting mechanism through direct empirical comparisons to linear attention, original VQ-attention, and full self-attention baselines on synthetic long-context tasks and language modeling up to 64k lengths. No step reduces a claimed prediction or first-principles result to a fitted parameter or self-citation by construction; the central performance claims rest on reported experimental deltas rather than tautological re-derivations of inputs. The method is presented as a novel construction with constant-memory scaling, and the theoretical basis is external to the evaluation metrics.
Axiom & Free-Parameter Ledger
free parameters (2)
- codebook size / number of mixture components
- sparsity / update rate hyperparameter
axioms (1)
- domain assumption Gaussian mixture regression provides a suitable probabilistic model for online attention updates
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We develop a theoretical basis for OVQ-attention based on Gaussian mixture regression... sparse online update rules... growth function Nt = t N / (t + N)
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
IEEE, 2010. Li, J., Zhang, Y ., Hassan, M. Y ., Chafekar, T., Cai, T., Ren, Z., Guo, P., Karimzadeh, F., Wang, C., and Gan, C. Com- mvq: Commutative vector quantization for kv cache com- pression.arXiv preprint arXiv:2506.18879, 2025a. Li, K., Tang, Y ., Cheng, Y ., Bai, Y ., Zeng, Y ., Wang, C., Liu, X., and Jiang, P. Vql: An end-to-end context-aware vec...
-
[2]
Appendix A: Theoretical Results 7.1. Details on Relation between Gaussian Kernel Regression and Self-Attention Assumeq t andk t are unit norm. In this case, −1 2 ||qT −k t||2 =− 1 2(||qT ||2 +||k t||2 −2q Tk⊤ t )(20) =− 1 2(1 + 1−2q Tk⊤ t )(21) =−1 +q Tk⊤ t .(22) If we take the exponent of this result we get e−1+qT k⊤ t =P e qT k⊤ t (23) where P is a cons...
work page 2014
-
[3]
Evaluate the responsibilities zt,n using the current parameter values
E-Step (Expectation). Evaluate the responsibilities zt,n using the current parameter values. This represents the posterior probability that data point[k t,v t]belongs to componentn: zt,n = πnN([k t,v t]|[µ k n,µ v n],Σ n)Pn j=1 πnN([k t,v t]|[µ kn,µ vn],Σ n) .(32) Under our assumptions about the covariance matrix N([k t,v t]|[µ k n,µ v n],Σ n)∝e −β||[kt,v...
-
[4]
Re-estimate the parameters using the responsibilities calculated in the E-step
M-Step (Maximization). Re-estimate the parameters using the responsibilities calculated in the E-step. First, calculate the total weight assigned to componentk: γn = TX t=1 zt,n (34) Then, update the means and mixing coefficients: [µk n,µ v n]new = 1 γn TX t=0 zt,n[kt,v t](35) πnew n = γn γ ,(36) whereγ= P n γn. 7.3. Linking GMR and OVQ-attention Assumpti...
work page 2006
-
[5]
First, we show how our algorithm implements standard GMM initialization techniques
-
[6]
13 Online V ector Quantized Attention
Next, we show that, under this initialization scheme and our assumption, Σn =I 1 β , the first batch EM step on a GMM is equivalent to a batch k-means update, and we explain how our update rule is the standard online approximation of batch k-means. 13 Online V ector Quantized Attention
-
[7]
Next, we show that the first EM update, after our initialization scheme, is equivalent to a Newton update on the NLL, thus providing a second-order approximation of the minimizing parameters
-
[8]
Finally, we show that generating predictions using GMR after initialization and one EM update is equivalent to OVQ-attention
-
[9]
Initialization.First, our online algorithm instantiates a standard initialization process for GMMs. EM describes how to update parameters of a GMM, but it does not specify how to initialize parameters of GMMs. The common method for initialization in the batch setting sets the means of the N Gaussian components of a GMM equal to a subset of N data points (...
work page 2006
-
[10]
To perform this EM step the values of the prior distribution, π, and precision,β, need to be set
OVQ-attention Approximates a Batch EM Update.After initializing means with a sample of N data points, a step of EM is performed using the remaining T−N data points. To perform this EM step the values of the prior distribution, π, and precision,β, need to be set. Setting the prior to a uniform distribution is justified at this point, since immediately afte...
work page 2006
-
[11]
K-means and Newton’s Method.It has previously been establish by Bottou et al. (Bottou & Bengio, 1994) that a batched K-means update is equivalent to performing a step of Newton’s method on the K-mean’s loss function, which in our case, would be L(θ) = TX i=1 ||[kt,v t]−[µ k t∗ ,µ v t∗]||2,(37) where [µk t∗ ,µ v t∗] is the nearest neighbor mean for data po...
work page 1994
-
[12]
OVQ-attention and GMR Prediction.Under the assumptions, we can see that the procedure for computing E(V|K= qT )for a GMR is equivalent to OVQ-attention. E(V|K=q T ) = πnN(q T |µ k n,Σ k n)PN i=1 πiN(q T |µ k i ,Σ k i ) µv|qt n (43) = NX n=0 πne−(qT −µk n)⊤Σk,−1 n (qT −µk n) PN j=0 πne−(qT −µkn)⊤Σk,−1 n (qT −µkn) µv n (44) = NX n=0 cne− β 2 ||qT −µk n||2 P...
-
[13]
The second step also applies assumption 1 to covariance term, and replaces πn with cn, which is true in the case where learning is done via hard cluster assignments, which is the implied by the previous result. The third step applies our second assumption, and the finaly step simply moves the count term,c n, inside the exponential. 15 Online V ector Quant...
-
[14]
Appendix B: Methods 8.1. VQ-attention Implementation Dictionary Training.The original VQ-attention implementation (Lingle, 2023) trained Dk in pre-training using exponential moving average updates. It also used an auxiliary commitment loss similar to the original VQ-V AE (Van Den Oord et al., 2017). This approach is known to have several difficulties: it ...
work page 2023
-
[15]
Appendix C: Further Results Figure 9 Ablations on Basic ICR.We test three ablations on OVQ-attention using the basic ICR task. First, instead of assigning setting new centroids based on our spread maximizing scheme, we set the new centroids equal to a random sample of key values within the current chunk (rand assign ). Second, instead of using our plateau...
work page 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.