Online Vector Quantized Attention

Beren Millidge; Nick Alonso; Tomas Figliolia

arxiv: 2602.03922 · v3 · pith:2RFKET6Anew · submitted 2026-02-03 · 💻 cs.LG

Online Vector Quantized Attention

Nick Alonso , Tomas Figliolia , Beren Millidge This is my paper

Pith reviewed 2026-05-21 13:38 UTC · model grok-4.3

classification 💻 cs.LG

keywords attention mechanismsvector quantizationlong context modelinglanguage modelssequence mixingefficient transformersgaussian mixture modelsmemory efficient attention

0 comments

The pith

Online vector-quantized attention matches self-attention performance on sequences up to 64,000 tokens while using linear compute and constant memory.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents online vector-quantized attention as a new sequence mixing layer designed to improve the trade-off between computational efficiency and the ability to handle long contexts in language models. It achieves linear computation costs and constant memory usage by maintaining a memory state that grows through sparse updates rather than remaining fixed in size like standard linear attention methods. This design draws on Gaussian mixture regression to guide how new information is incorporated without overwriting distant details. A reader would care because current approaches either scale poorly in memory for long inputs or lose the capacity to connect information across long distances. The reported experiments indicate that this method improves on prior linear and vector-quantized attention variants and reaches parity with full self-attention in several long-context settings.

Core claim

OVQ-attention requires linear compute costs and constant memory, but unlike linear attention and SSMs it uses a sparse memory update that allows it to greatly increase the size of its memory state and consequently memory capacity. A theoretical basis is developed from Gaussian mixture regression, and the method is evaluated on synthetic long-context tasks as well as long-context language modeling, where it shows significant gains over linear attention baselines and the original VQ-attention while remaining competitive with strong self-attention baselines up to 64k sequence length despite using only a small fraction of the memory.

What carries the argument

The sparse memory update rule derived from Gaussian mixture regression, which expands the effective memory state size while keeping overall memory usage constant.

If this is right

Language models could process sequences up to at least 64k tokens with performance close to standard self-attention but at a small fraction of the memory cost.
Linear attention and SSM approaches could be improved by incorporating similar sparse memory expansion for better long-range dependency handling.
Training and inference on longer contexts become feasible without quadratic memory growth or the information loss typical of fixed-size memory states.
The same mechanism could extend the effective context length in existing transformer architectures with minimal additional hardware requirements.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Hybrid models that combine this sparse memory state with existing state-space models might further reduce compute while preserving recall over even longer ranges.
The approach suggests a path toward constant-memory architectures that still support retrieval of arbitrary distant tokens if the Gaussian mixture update can be made more selective.
Real-world document and code completion tasks that span tens of thousands of tokens would be a natural next test to measure whether the synthetic-task gains translate to practical utility.

Load-bearing premise

The sparse memory update rule derived from Gaussian mixture regression is assumed to retain enough long-range information across updates without interference or loss of earlier details.

What would settle it

An experiment in which OVQ-attention produces substantially lower accuracy than full self-attention on a task that requires recalling a specific fact introduced only at the start of a 64,000-token sequence would indicate that the memory capacity increase does not deliver the claimed long-range retention.

Figures

Figures reproduced from arXiv: 2602.03922 by Beren Millidge, Nick Alonso, Tomas Figliolia.

**Figure 2.** Figure 2: Process for generating OVQ-attention output for token at position T. This quantization error adds a bias to the self-attention operation and the gradients, since we must use a straight-through estimator to propagate gradients through the quantization operation. As we saw, this performance degredation cannot be trivially solved by increasing the size of Dk. We, therefore, propose an alternative soluti… view at source ↗

**Figure 3.** Figure 3: State updates in linear and OVQ-attention models. prediction generation are parallelized across the chunk. As above, we assume qt and kt are unit norm and Σn = I 1 β , where β is a parameter fixed during test time training and learned in the outer-loop. Prediction. Our algorithm uses a chunk-wise parallel implementation, where the model loops through chunks, Qc, Kc, Vc, each of length L, but parallelizes o… view at source ↗

**Figure 4.** Figure 4: In-context recall. Left two plots show per-token-accuracy for our two synthetic recall tasks up to 64k context length. The right plot shows how the memory state, i.e. kv-cache, grows with context length. tion maximization (EM) (Dempster et al., 1977), which has guarantees of converging to minima of the NLL. EM iteratively performs batch updates on the parameters for multiple epochs until convergence. In t… view at source ↗

**Figure 5.** Figure 5: In context learning. Models are trained on 2k context with 16 functions. We show the per-token accuracy over the output, yn, for each example, n, in the context, averaged over the test set. 3.4. Comparison to Linear Attention Models A central difference between OVQ-attention and common sequence mixing layers with constant memory, i.e., SSMs (Gu & Dao, 2024; Dao & Gu, 2024) and linear attention models (Qin … view at source ↗

**Figure 6.** Figure 6: Long context language modeling on PG19. Long In-Context Learning. Next, we test models on a long-context version of linear regression based ICL tasks (Akyurek et al. ¨ , 2022). In this task, the context consists of input-output pairs, x, y, where the output of each is generated by some linear function, funcf (x) = y = b + aPx. In our tests, a and b are integers such that 0 < a < 5, 0 < b < 5 and P is a pe… view at source ↗

**Figure 7.** Figure 7: Ablations on Basic ICR. 4.3. Short Context Benchmarks OVQ-attention does not compress keys and values significantly over short contexts, so we should expect similar performance on common short context benchmarks to standard attention. To confirm this, we compare a 480M parameter sw-nope, std att, and sw-ovq model on standard short context benchmarks. Models are trained for 50B to7 [PITH_FULL_IMAGE:fig… view at source ↗

**Figure 8.** Figure 8: Comparison to Linear Attention and SSM Baselines. Left two plots show per-token-accuracy for ICL tasks. Models train on 2k context with four functions, tested on four and 16 functions. The right plot shows performance for basic ICR. Model Params PIQA Hella. Wino. ARC-e ARC-c Avg. std att 480M 66.4 ±0.1 40.5 ±0.1 52.3 ±0.7 52.8 ±0.4 29.1 ±0.5 48.21 sw-nope 480M 66.7 ±0.4 40.9 ±0.1 52.6 ±0.6 53.1 ±0.1 28.4 ±… view at source ↗

read the original abstract

Standard sequence mixing layers used in language models struggle to balance efficiency and performance. Self-attention performs well on long context tasks but has expensive quadratic compute and linear memory costs, while linear attention and SSMs use only linear compute and constant memory but struggle with long context processing. In this paper, we develop a sequence mixing layer that aims to find a better compromise between memory-compute costs and long-context processing, which we call online vector-quantized (OVQ) attention. OVQ-attention requires linear compute costs and constant memory, but, unlike linear attention and SSMs, it uses a sparse memory update that allows it to greatly increase the size of its memory state and, consequently, memory capacity. We develop a theoretical basis for OVQ-attention based on Gaussian mixture regression, and we test it on a variety of synthetic long context tasks and on long context language modeling. OVQ-attention shows significant improvements over linear attention baselines and the original VQ-attention, on which OVQ-attention was inspired. It demonstrates competitive, and sometimes identical, performance to strong self-attention baselines up 64k sequence length, despite using a small fraction of the memory of full self-attention.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces online vector-quantized (OVQ) attention, a sequence mixing layer for language models that achieves linear compute and constant memory via a sparse memory update rule derived from Gaussian mixture regression. It reports significant gains over linear attention and original VQ-attention baselines, along with competitive (sometimes identical) performance to strong self-attention models on synthetic long-context tasks and language modeling up to 64k sequence lengths, while using a small fraction of the memory.

Significance. If the results hold under further verification, OVQ-attention offers a promising compromise between the efficiency of linear attention/SSMs and the long-context performance of full self-attention. The explicit derivation from Gaussian mixture regression supplies a theoretical basis that strengthens the construction relative to purely empirical alternatives.

major comments (1)

[§4] §4 (Experiments): the central performance claims at 64k lengths rest on the sparse update rule preserving long-range information without catastrophic forgetting or interference, yet the manuscript provides no ablation comparing this GMM-derived rule to other constant-memory strategies of matched capacity (e.g., FIFO, reservoir sampling, or learned compression). Without such isolation, gains could be attributable to codebook size, quantization details, or training procedure rather than the proposed sparsity mechanism.

minor comments (2)

[Abstract] The abstract contains the phrase 'up 64k sequence length'; this should be corrected to 'up to 64k sequence length' for clarity.
[Methods] Notation for the sparsity/update-rate hyperparameter and its relation to the number of mixture components could be made more explicit in the methods section to aid reproducibility.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive feedback and recommendation for major revision. The suggested ablation will help isolate the contribution of the GMM-derived update rule, and we will incorporate it in the revised manuscript.

read point-by-point responses

Referee: [§4] §4 (Experiments): the central performance claims at 64k lengths rest on the sparse update rule preserving long-range information without catastrophic forgetting or interference, yet the manuscript provides no ablation comparing this GMM-derived rule to other constant-memory strategies of matched capacity (e.g., FIFO, reservoir sampling, or learned compression). Without such isolation, gains could be attributable to codebook size, quantization details, or training procedure rather than the proposed sparsity mechanism.

Authors: We agree that additional ablations would strengthen the experimental claims by isolating the effect of the GMM-derived sparse update. The update rule is derived from Gaussian mixture regression, which supplies a probabilistic justification for sparse, quantized memory that preserves long-range information without explicit forgetting mechanisms. This theoretical basis distinguishes OVQ-attention from heuristic constant-memory approaches. In the revision we will add experiments comparing OVQ-attention against FIFO and reservoir sampling baselines with matched memory capacity on the 64k-length tasks. We will also include a brief discussion of learned compression, noting that it typically requires extra parameters and training overhead not present in our fixed codebook approach. These changes should clarify that performance differences arise from the sparsity mechanism rather than codebook size or training details alone. revision: yes

Circularity Check

0 steps flagged

No significant circularity; OVQ-attention derivation from GMM regression is independent of target performance claims

full rationale

The paper constructs OVQ-attention by deriving a sparse memory update rule from Gaussian mixture regression, then validates the resulting mechanism through direct empirical comparisons to linear attention, original VQ-attention, and full self-attention baselines on synthetic long-context tasks and language modeling up to 64k lengths. No step reduces a claimed prediction or first-principles result to a fitted parameter or self-citation by construction; the central performance claims rest on reported experimental deltas rather than tautological re-derivations of inputs. The method is presented as a novel construction with constant-memory scaling, and the theoretical basis is external to the evaluation metrics.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The central claim rests on the validity of the Gaussian mixture regression framing for attention updates and on the assumption that sparse slot updates suffice for long-range dependency capture; no explicit free parameters are named in the abstract, but quantization codebook size and update sparsity rate are implicit design choices.

free parameters (2)

codebook size / number of mixture components
Determines memory capacity and is chosen to balance capacity against compute; not derived from first principles.
sparsity / update rate hyperparameter
Controls how many memory slots are refreshed per step; selected to achieve constant memory while preserving performance.

axioms (1)

domain assumption Gaussian mixture regression provides a suitable probabilistic model for online attention updates
Invoked to justify the sparse memory mechanism; stated as the theoretical basis in the abstract.

pith-pipeline@v0.9.0 · 5735 in / 1355 out tokens · 34072 ms · 2026-05-21T13:38:14.239167+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We develop a theoretical basis for OVQ-attention based on Gaussian mixture regression... sparse online update rules... growth function Nt = t N / (t + N)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages

[1]

Li, J., Zhang, Y ., Hassan, M

IEEE, 2010. Li, J., Zhang, Y ., Hassan, M. Y ., Chafekar, T., Cai, T., Ren, Z., Guo, P., Karimzadeh, F., Wang, C., and Gan, C. Com- mvq: Commutative vector quantization for kv cache com- pression.arXiv preprint arXiv:2506.18879, 2025a. Li, K., Tang, Y ., Cheng, Y ., Bai, Y ., Zeng, Y ., Wang, C., Liu, X., and Jiang, P. Vql: An end-to-end context-aware vec...

work page arXiv 2010
[2]

Details on Relation between Gaussian Kernel Regression and Self-Attention Assumeq t andk t are unit norm

Appendix A: Theoretical Results 7.1. Details on Relation between Gaussian Kernel Regression and Self-Attention Assumeq t andk t are unit norm. In this case, −1 2 ||qT −k t||2 =− 1 2(||qT ||2 +||k t||2 −2q Tk⊤ t )(20) =− 1 2(1 + 1−2q Tk⊤ t )(21) =−1 +q Tk⊤ t .(22) If we take the exponent of this result we get e−1+qT k⊤ t =P e qT k⊤ t (23) where P is a cons...

work page 2014
[3]

Evaluate the responsibilities zt,n using the current parameter values

E-Step (Expectation). Evaluate the responsibilities zt,n using the current parameter values. This represents the posterior probability that data point[k t,v t]belongs to componentn: zt,n = πnN([k t,v t]|[µ k n,µ v n],Σ n)Pn j=1 πnN([k t,v t]|[µ kn,µ vn],Σ n) .(32) Under our assumptions about the covariance matrix N([k t,v t]|[µ k n,µ v n],Σ n)∝e −β||[kt,v...

work page
[4]

Re-estimate the parameters using the responsibilities calculated in the E-step

M-Step (Maximization). Re-estimate the parameters using the responsibilities calculated in the E-step. First, calculate the total weight assigned to componentk: γn = TX t=1 zt,n (34) Then, update the means and mixing coefficients: [µk n,µ v n]new = 1 γn TX t=0 zt,n[kt,v t](35) πnew n = γn γ ,(36) whereγ= P n γn. 7.3. Linking GMR and OVQ-attention Assumpti...

work page 2006
[5]

First, we show how our algorithm implements standard GMM initialization techniques

work page
[6]

13 Online V ector Quantized Attention

Next, we show that, under this initialization scheme and our assumption, Σn =I 1 β , the first batch EM step on a GMM is equivalent to a batch k-means update, and we explain how our update rule is the standard online approximation of batch k-means. 13 Online V ector Quantized Attention

work page
[7]

Next, we show that the first EM update, after our initialization scheme, is equivalent to a Newton update on the NLL, thus providing a second-order approximation of the minimizing parameters

work page
[8]

Finally, we show that generating predictions using GMR after initialization and one EM update is equivalent to OVQ-attention

work page
[9]

EM describes how to update parameters of a GMM, but it does not specify how to initialize parameters of GMMs

Initialization.First, our online algorithm instantiates a standard initialization process for GMMs. EM describes how to update parameters of a GMM, but it does not specify how to initialize parameters of GMMs. The common method for initialization in the batch setting sets the means of the N Gaussian components of a GMM equal to a subset of N data points (...

work page 2006
[10]

To perform this EM step the values of the prior distribution, π, and precision,β, need to be set

OVQ-attention Approximates a Batch EM Update.After initializing means with a sample of N data points, a step of EM is performed using the remaining T−N data points. To perform this EM step the values of the prior distribution, π, and precision,β, need to be set. Setting the prior to a uniform distribution is justified at this point, since immediately afte...

work page 2006
[11]

K-means and Newton’s Method.It has previously been establish by Bottou et al. (Bottou & Bengio, 1994) that a batched K-means update is equivalent to performing a step of Newton’s method on the K-mean’s loss function, which in our case, would be L(θ) = TX i=1 ||[kt,v t]−[µ k t∗ ,µ v t∗]||2,(37) where [µk t∗ ,µ v t∗] is the nearest neighbor mean for data po...

work page 1994
[12]

OVQ-attention and GMR Prediction.Under the assumptions, we can see that the procedure for computing E(V|K= qT )for a GMR is equivalent to OVQ-attention. E(V|K=q T ) = πnN(q T |µ k n,Σ k n)PN i=1 πiN(q T |µ k i ,Σ k i ) µv|qt n (43) = NX n=0 πne−(qT −µk n)⊤Σk,−1 n (qT −µk n) PN j=0 πne−(qT −µkn)⊤Σk,−1 n (qT −µkn) µv n (44) = NX n=0 cne− β 2 ||qT −µk n||2 P...

work page
[13]

The third step applies our second assumption, and the finaly step simply moves the count term,c n, inside the exponential

The second step also applies assumption 1 to covariance term, and replaces πn with cn, which is true in the case where learning is done via hard cluster assignments, which is the implied by the previous result. The third step applies our second assumption, and the finaly step simply moves the count term,c n, inside the exponential. 15 Online V ector Quant...

work page
[14]

VQ-attention Implementation Dictionary Training.The original VQ-attention implementation (Lingle, 2023) trained Dk in pre-training using exponential moving average updates

Appendix B: Methods 8.1. VQ-attention Implementation Dictionary Training.The original VQ-attention implementation (Lingle, 2023) trained Dk in pre-training using exponential moving average updates. It also used an auxiliary commitment loss similar to the original VQ-V AE (Van Den Oord et al., 2017). This approach is known to have several difficulties: it ...

work page 2023
[15]

Appendix C: Further Results Figure 9 Ablations on Basic ICR.We test three ablations on OVQ-attention using the basic ICR task. First, instead of assigning setting new centroids based on our spread maximizing scheme, we set the new centroids equal to a random sample of key values within the current chunk (rand assign ). Second, instead of using our plateau...

work page 2023

[1] [1]

Li, J., Zhang, Y ., Hassan, M

IEEE, 2010. Li, J., Zhang, Y ., Hassan, M. Y ., Chafekar, T., Cai, T., Ren, Z., Guo, P., Karimzadeh, F., Wang, C., and Gan, C. Com- mvq: Commutative vector quantization for kv cache com- pression.arXiv preprint arXiv:2506.18879, 2025a. Li, K., Tang, Y ., Cheng, Y ., Bai, Y ., Zeng, Y ., Wang, C., Liu, X., and Jiang, P. Vql: An end-to-end context-aware vec...

work page arXiv 2010

[2] [2]

Details on Relation between Gaussian Kernel Regression and Self-Attention Assumeq t andk t are unit norm

Appendix A: Theoretical Results 7.1. Details on Relation between Gaussian Kernel Regression and Self-Attention Assumeq t andk t are unit norm. In this case, −1 2 ||qT −k t||2 =− 1 2(||qT ||2 +||k t||2 −2q Tk⊤ t )(20) =− 1 2(1 + 1−2q Tk⊤ t )(21) =−1 +q Tk⊤ t .(22) If we take the exponent of this result we get e−1+qT k⊤ t =P e qT k⊤ t (23) where P is a cons...

work page 2014

[3] [3]

Evaluate the responsibilities zt,n using the current parameter values

E-Step (Expectation). Evaluate the responsibilities zt,n using the current parameter values. This represents the posterior probability that data point[k t,v t]belongs to componentn: zt,n = πnN([k t,v t]|[µ k n,µ v n],Σ n)Pn j=1 πnN([k t,v t]|[µ kn,µ vn],Σ n) .(32) Under our assumptions about the covariance matrix N([k t,v t]|[µ k n,µ v n],Σ n)∝e −β||[kt,v...

work page

[4] [4]

Re-estimate the parameters using the responsibilities calculated in the E-step

M-Step (Maximization). Re-estimate the parameters using the responsibilities calculated in the E-step. First, calculate the total weight assigned to componentk: γn = TX t=1 zt,n (34) Then, update the means and mixing coefficients: [µk n,µ v n]new = 1 γn TX t=0 zt,n[kt,v t](35) πnew n = γn γ ,(36) whereγ= P n γn. 7.3. Linking GMR and OVQ-attention Assumpti...

work page 2006

[5] [5]

First, we show how our algorithm implements standard GMM initialization techniques

work page

[6] [6]

13 Online V ector Quantized Attention

Next, we show that, under this initialization scheme and our assumption, Σn =I 1 β , the first batch EM step on a GMM is equivalent to a batch k-means update, and we explain how our update rule is the standard online approximation of batch k-means. 13 Online V ector Quantized Attention

work page

[7] [7]

Next, we show that the first EM update, after our initialization scheme, is equivalent to a Newton update on the NLL, thus providing a second-order approximation of the minimizing parameters

work page

[8] [8]

Finally, we show that generating predictions using GMR after initialization and one EM update is equivalent to OVQ-attention

work page

[9] [9]

EM describes how to update parameters of a GMM, but it does not specify how to initialize parameters of GMMs

Initialization.First, our online algorithm instantiates a standard initialization process for GMMs. EM describes how to update parameters of a GMM, but it does not specify how to initialize parameters of GMMs. The common method for initialization in the batch setting sets the means of the N Gaussian components of a GMM equal to a subset of N data points (...

work page 2006

[10] [10]

To perform this EM step the values of the prior distribution, π, and precision,β, need to be set

OVQ-attention Approximates a Batch EM Update.After initializing means with a sample of N data points, a step of EM is performed using the remaining T−N data points. To perform this EM step the values of the prior distribution, π, and precision,β, need to be set. Setting the prior to a uniform distribution is justified at this point, since immediately afte...

work page 2006

[11] [11]

K-means and Newton’s Method.It has previously been establish by Bottou et al. (Bottou & Bengio, 1994) that a batched K-means update is equivalent to performing a step of Newton’s method on the K-mean’s loss function, which in our case, would be L(θ) = TX i=1 ||[kt,v t]−[µ k t∗ ,µ v t∗]||2,(37) where [µk t∗ ,µ v t∗] is the nearest neighbor mean for data po...

work page 1994

[12] [12]

OVQ-attention and GMR Prediction.Under the assumptions, we can see that the procedure for computing E(V|K= qT )for a GMR is equivalent to OVQ-attention. E(V|K=q T ) = πnN(q T |µ k n,Σ k n)PN i=1 πiN(q T |µ k i ,Σ k i ) µv|qt n (43) = NX n=0 πne−(qT −µk n)⊤Σk,−1 n (qT −µk n) PN j=0 πne−(qT −µkn)⊤Σk,−1 n (qT −µkn) µv n (44) = NX n=0 cne− β 2 ||qT −µk n||2 P...

work page

[13] [13]

The third step applies our second assumption, and the finaly step simply moves the count term,c n, inside the exponential

The second step also applies assumption 1 to covariance term, and replaces πn with cn, which is true in the case where learning is done via hard cluster assignments, which is the implied by the previous result. The third step applies our second assumption, and the finaly step simply moves the count term,c n, inside the exponential. 15 Online V ector Quant...

work page

[14] [14]

VQ-attention Implementation Dictionary Training.The original VQ-attention implementation (Lingle, 2023) trained Dk in pre-training using exponential moving average updates

Appendix B: Methods 8.1. VQ-attention Implementation Dictionary Training.The original VQ-attention implementation (Lingle, 2023) trained Dk in pre-training using exponential moving average updates. It also used an auxiliary commitment loss similar to the original VQ-V AE (Van Den Oord et al., 2017). This approach is known to have several difficulties: it ...

work page 2023

[15] [15]

Appendix C: Further Results Figure 9 Ablations on Basic ICR.We test three ablations on OVQ-attention using the basic ICR task. First, instead of assigning setting new centroids based on our spread maximizing scheme, we set the new centroids equal to a random sample of key values within the current chunk (rand assign ). Second, instead of using our plateau...

work page 2023