
arxiv: 2605.27646 · v1 · pith:CWCRSCFRnew · submitted 2026-05-26 · 💻 cs.LG · cs.AI

Hurwitz Quaternion Multiplicative Quantization for KV Cache Compression

Pith reviewed 2026-06-29 18:11 UTC · model grok-4.3

keywords KV cache compression · quaternion quantization · Hurwitz group · calibration-free · large language models · inference optimization · multiplicative quantization · outlier handling

The pith

Hurwitz quaternion multiplicative quantization compresses KV caches to roughly 5 bits per element while matching fp16 perplexity on multiple modern language models without calibration.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a calibration-free compression technique for the key-value cache used during large language model inference. Each 4-element chunk is viewed as a quaternion whose direction is quantized by multiplying an element from the fixed 24-element Hurwitz group with one from a small per-layer-head random secondary codebook. This product construction creates a larger effective codebook while storing only the secondary parameters, and a simple per-batch median multiplier extracts outliers. Experiments across five models demonstrate that the resulting ~5-bit representations stay within 0.02 to 0.10 perplexity points of full precision, even on architectures where standard 4-bit integers collapse. The approach also yields higher compression ratios and better downstream task accuracy than integer baselines at comparable bit widths.

Core claim

HQMQ treats each 4-element KV chunk as a quaternion and quantizes its direction to the product of an element of the fixed 24-element Hurwitz group and a random unit quaternion drawn from a per-(layer, head) secondary codebook of size S, which yields 24S effective directions from only S stored parameters. A per-batch median-multiplier outlier step at C=3 then recovers fp16-level perplexity at approximately 5 bits on Mistral-7B, Qwen2.5-7B, Qwen3-8B and related models without any calibration data or model-specific tuning.
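The product construction can be sketched in a few lines of NumPy. This is an illustrative reading of the core claim, not the authors' implementation: the function names (`hurwitz_units`, `qmul`, `build_codebook`, `quantize_directions`), the Gaussian test data, the nearest-codeword search by maximum dot product, and the unquantized radius are all assumptions made here for clarity.

```python
import itertools
import numpy as np

def hurwitz_units():
    """The 24 unit Hurwitz quaternions (group 2T) as rows (w, x, y, z)."""
    units = []
    # 8 axis units: +-1, +-i, +-j, +-k
    for axis in range(4):
        for sign in (1.0, -1.0):
            q = np.zeros(4)
            q[axis] = sign
            units.append(q)
    # 16 half-integer units: (+-1 +-i +-j +-k) / 2
    for signs in itertools.product((0.5, -0.5), repeat=4):
        units.append(np.array(signs))
    return np.stack(units)

def qmul(a, b):
    """Hamilton product a * b, broadcasting over leading axes."""
    aw, ax, ay, az = np.moveaxis(a, -1, 0)
    bw, bx, by, bz = np.moveaxis(b, -1, 0)
    return np.stack([
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    ], axis=-1)

def build_codebook(num_secondary, seed=0):
    """All 24 * S products q_p * q_s; only the S random q_s would be stored."""
    rng = np.random.default_rng(seed)
    q_s = rng.normal(size=(num_secondary, 4))
    q_s /= np.linalg.norm(q_s, axis=1, keepdims=True)  # random unit quaternions
    q_p = hurwitz_units()
    return qmul(q_p[:, None, :], q_s[None, :, :]).reshape(-1, 4)

def quantize_directions(chunks, codebook):
    """Return the nearest-codeword index and the norm of each 4-element chunk."""
    norms = np.linalg.norm(chunks, axis=1, keepdims=True)
    dirs = chunks / np.maximum(norms, 1e-12)
    idx = np.argmax(dirs @ codebook.T, axis=1)  # max cosine = nearest on S^3
    return idx, norms[:, 0]

codebook = build_codebook(num_secondary=96)               # 24 * 96 = 2304 codewords
chunks = np.random.default_rng(1).normal(size=(1000, 4))  # synthetic stand-in data
idx, radii = quantize_directions(chunks, codebook)
recon = codebook[idx] * radii[:, None]                    # radius kept exact here
```

A real pipeline would also quantize the radius and would not materialize the full product codebook per head; both are omitted to keep the sketch short.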

What carries the argument

The multiplicative composition of the 24-element Hurwitz group 2T with a per-(layer, head) secondary codebook of random unit quaternions, which exploits S^3 isometry to enlarge the effective codebook size while storing few parameters.
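The isometry property invoked here follows from the multiplicativity of the quaternion norm. The short derivation below is standard quaternion algebra, written out for readers who want the missing step; the symbols $p$, $x$, $y$ are introduced here for illustration.

```latex
% Norm multiplicativity for quaternions p, q:
\[
  |pq| = |p|\,|q| .
\]
% Hence, for a unit quaternion p (|p| = 1) and any x, y on S^3:
\[
  |px - py| = |p\,(x - y)| = |p|\,|x - y| = |x - y| .
\]
% So q_s \mapsto q_p q_s maps S^3 to itself and preserves all pairwise distances:
% each of the 24 Hurwitz elements places a rigid copy of the S secondary codewords on S^3.
```

The argument shows that the geometry of the secondary codebook is preserved under the product; it says nothing by itself about how well the 24S codewords match real KV directions.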

If this is right

  • On Mistral-7B and Qwen3-8B, perplexity stays within 0.02-0.03 points of fp16 at ~5 bits.
  • On Qwen2.5-7B and Qwen3-8B where int4 collapses, HQMQ plus median-3x recovers fp16 quality within 0.02-0.10 points at ~5 bits.
  • HQMQ Pareto-dominates naive integer quantization by factors of 3 to 1900 at matched bit widths across all tested models.
  • Zero-shot downstream accuracy matches fp16 at 3.79 bits on Mistral, which is 16 percent fewer bits than KIVI-4 (~4.5 bits), the strongest calibrated baseline.
  • Storage shrinks by up to 5.05x, reducing a Llama-3-70B 128k-context KV cache from 43 GB to 8.5 GB.
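The storage figures in the last bullet can be checked with back-of-the-envelope arithmetic. The sketch below assumes the publicly documented Llama-3-70B shape (80 layers, 8 KV heads under GQA, head dimension 128) and reads "128k" as 131,072 tokens; neither assumption is stated in the text above, and gigabytes are taken as 10^9 bytes.

```python
# Assumed model shape (from the public Llama-3-70B configuration, not from this page).
NUM_LAYERS = 80
NUM_KV_HEADS = 8
HEAD_DIM = 128
CONTEXT_TOKENS = 128 * 1024   # "128k" read as 131,072 tokens (assumption)
BYTES_FP16 = 2
COMPRESSION = 5.05            # ratio reported for the most aggressive setting

# Keys and values are both cached, hence the factor of 2.
bytes_per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_FP16
fp16_gb = bytes_per_token * CONTEXT_TOKENS / 1e9
compressed_gb = fp16_gb / COMPRESSION

print(f"fp16 cache: {fp16_gb:.1f} GB, compressed: {compressed_gb:.1f} GB")
```

Under these assumptions the fp16 cache comes to about 43 GB and the compressed cache to about 8.5 GB, consistent with the bullet.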

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The isometry property of left quaternion multiplication may allow the same multiplicative construction to be reused for compressing other rotationally structured tensors in neural networks.
  • Because the secondary codebooks are initialized randomly and require no training, the method could be applied on-the-fly to new models or even non-LLM sequence models that maintain similar caches.
  • The per-batch median multiplier may interact with very small batch sizes or highly variable sequence lengths in ways that require additional safeguards not explored in the current experiments.
  • Extending the same Hurwitz-group idea to higher-dimensional division algebras or other discrete subgroups on the 3-sphere could yield further bit-rate improvements.
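The median-multiplier step mentioned in the third bullet can be illustrated as a simple threshold on chunk norms. This is a sketch under assumptions: the page only says that a per-batch median multiplier with C=3 extracts outliers, so thresholding the chunk norm at C times the batch median, the function name `split_outliers`, and the synthetic data are all choices made here.

```python
import numpy as np

def split_outliers(chunks, c=3.0):
    """Flag chunks whose norm exceeds c times the batch median norm."""
    norms = np.linalg.norm(chunks, axis=-1)
    threshold = c * np.median(norms)   # recomputed per batch, no calibration data
    return norms > threshold, threshold

rng = np.random.default_rng(0)
chunks = rng.normal(size=(4096, 4))    # synthetic "well-behaved" chunks
chunks[:40] = 100.0                    # 40 planted heavy outliers (norm 200)

mask, threshold = split_outliers(chunks, c=3.0)
inliers = chunks[~mask]                # would go through the quaternion quantizer
outliers = chunks[mask]                # assumed to be kept at higher precision
print(f"threshold={threshold:.2f}, extracted {mask.mean():.2%} of chunks")
```

Because the threshold depends on a batch statistic, a batch holding only a handful of chunks gives a noisy median, which is the small-batch concern the bullet raises.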

Load-bearing premise

Random per-layer-head secondary codebooks of unit quaternions together with the fixed Hurwitz group and median-multiplier outlier extraction can represent KV cache distributions across different model families without calibration data or tuning.

What would settle it

Running HQMQ on an additional model family whose KV cache value distributions lie far outside the range observed in the five evaluated models and measuring whether perplexity remains within 0.10 of fp16 at 5 bits.

Figures

Figures reproduced from arXiv: 2605.27646 by Antonio Torralba, Daniel Karl I. Weidele, David Cox, Kabir Swain, Mauro Martino, Sijie Han.

Figure 1: HQMQ + Med3× at ∼5 bits/element matches fp16 perplexity across four modern open LLMs; naive int4 catastrophically fails on outlier-heavy attention (17,661 ppl on Qwen2.5-7B vs fp16’s 7.59). The recipe transfers across architecture families with a single C=3 outlier-multiplier constant and zero calibration data. 1. Introduction The memory cost of the KV cache during long-context LLM inference often domina…
Figure 2: Pareto frontier across four modern open models from three vendors. HQMQ (deep blue, no outlier extraction) and HQMQ + Med3× (amber, our headline method) dominate naive int (cyan) and the spherical+JL baseline (mid-blue) at every bit budget. On Mistral-7B and Llama-3-8B (naive int4 is functional), HQMQ alone Pareto-dominates. On Qwen2.5-7B and Qwen3-8B (outlier-heavy, naive int4 catastrophic at > 10^4 ppl), …
Figure 3: Head-to-head against KIVI on Mistral-7B (CoQA EM, TruthfulQA BLEU, GSM8K exact match). HQMQ s96 r4 at 3.79 bits (amber) matches KIVI-4 at ∼4.5 bits across all three tasks while using 16% fewer bits and no calibration pass; on TruthfulQA, the calibration-free HQMQ bar actually exceeds calibrated KIVI-4 by 2.7 pts. HQMQ + Med3× at 4.41 bits (deep blue) crosses KIVI-4 on CoQA. The sub-3-bit HQMQ s24 r2 bar (…
Figure 4: Fused HQMQ-Attention kernel on the production decode workload (Mistral-class GQA, Tq=1, s192 codebook, RTX 4090, fp16). Left: per-step latency vs context length. The fake-quant pipeline scales linearly with Tkv (full-cache dequant per step); the fused kernel (amber) stays roughly constant (∼0.033 ms) because the codebook gather and softmax are computed inline and only touch the per-step KV window. Right: …
Figure 5: RULER long-context retrieval on Qwen3-8B at Tkv ∈ {4k, 8k} (n=50/task). HQMQ s96 r6 + Med3× (amber, 4.89 bits) preserves fp16’s perfect VT score (1.00 → 1.00) at both context lengths and matches fp16 within 2 pts on SQuAD (0.60 vs 0.60 exact at 8k). Naive int4 (cyan) collapses on every subtask, and the fp16-to-int4 gap on SQuAD widens from 0.31 at 4k to 0.44 at 8k as quantization noise accumulates over lon…
Figure 6: KV cache memory: fp16 (deep blue) vs HQMQ at three bit budgets for Llama-3-8B (dotted) and Llama-3-70B (solid). HQMQ s24 r3 (5.05× compression) makes 70B / 128k-context inference fit on a single 24 GB consumer GPU. Q. Ablation: HQMQ vs uncalibrated additive VQ. To isolate the contribution of the multiplicative quaternion structure (as opposed to “having a large effective codebook”), we compare HQMQ against …
Figure 7: K-chunk outlier diagnosis on Mistral-7B vs Qwen2.5-7B. Left: per-layer max-over-heads ratio of Kmax/Kmed. Qwen2.5 (amber) exceeds the Med3× threshold (C=3, dashed) in ≥ 95% of layers, with peaks > 100×; Mistral (blue) hovers near C=3 with peaks ≲ 10×. Right: K-chunk-norm quantile profile (median across layers). Qwen’s 99.9th-percentile chunk norm is ∼50× its median; Mistral’s is ∼8×. The architectural diff…
Figure 8: Outlier-multiplier sweep on Qwen2.5-7B (HQMQ s192 r6). Left: perplexity vs C. Right: empirical outlier fraction vs C. Med3× extracts ∼3% of chunks and gives the best ppl; Med100× (0.03% extracted) is catastrophic.

config          bits   ppl range      std     CoV
hqmq s24 r3     3.04   5.759–5.781    0.008   0.14%
hqmq s96 r4     3.79   5.567–5.580    0.004   0.07%
hqmq s192 r4    4.04   5.538–5.551    0.005   0.09%
hqmq s192 r6    4.54   5.531–5.543    0.004   0.08%
…
Figure 9: Qwen2.5-7B downstream task accuracy (PIQA / HellaSwag / ARC-Easy at n=200/task). Naive int4 collapses to (or below) the 25% random baseline on HellaSwag and ARC, while the three HQMQ + Med3× configurations are within ±2 percentage points of fp16 on every task at ∼5 bits. HQMQ s192 r6 + Med3× exceeds fp16 on HellaSwag (0.680 vs 0.630), which we attribute to the slight regularizing effect of the radius quan…

Original abstract

We propose \textbf{Hurwitz Quaternion Multiplicative Quantization (HQMQ)}, a \textbf{calibration-free} method for KV cache compression of large language models. HQMQ treats each 4-element chunk of K or V as a quaternion and quantizes its unit direction to the \emph{product} $q_p \cdot q_s$, where $q_p$ ranges over the 24-element Hurwitz group $2T$ (the 24 vertices of the 24-cell on $S^3$, pairwise angle $60^\circ$) and $q_s$ ranges over a per-(layer, head) secondary codebook of $S$ \emph{random} unit quaternions. The multiplicative composition yields $24S$ effective codewords at $S$ stored parameters; random initialization suffices because left-multiplication is an $S^3$ isometry, so seeded codebooks vary in end-task ppl by $<1.5\%$. A per-batch median-multiplier outlier extraction step ($C{=}3$, no calibration) handles modern outlier-heavy architectures. We evaluate on five modern open models: Mistral-7B (dense MHA), Llama-3-8B and Qwen2.5-7B and Qwen3-8B (dense GQA), and gpt-oss-20b (sparse MoE). On Mistral-7B and Qwen3-8B, HQMQ matches fp16 within $0.02$--$0.03$ ppl points at $\sim$5 bits. On Qwen2.5-7B and Qwen3-8B, where naive int4 collapses to $10^4{+}$ ppl, HQMQ + Med3$\times$ recovers fp16 quality within $0.02$--$0.10$ ppl points at $\sim$5 bits. HQMQ Pareto-dominates naive int by $3$--$1900\times$ at matched bits across all five models, and downstream zero-shot accuracy matches fp16 at $3.79$ bits on Mistral. Against the strongest calibrated KV-quantization baseline, HQMQ at $3.79$ bits matches KIVI-4 ($\sim 4.5$ bits) within ${\sim}1$ pt on CoQA, $0.6$ pts on TruthfulQA, and $2.3$ pts on GSM8K, at $16\%$ fewer bits and without a calibration pass. At the storage level, HQMQ delivers up to $5.05\times$ KV compression, shrinking a Llama-3-70B 128k-context cache from 43 GB to 8.5 GB.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Hurwitz Quaternion Multiplicative Quantization (HQMQ), a calibration-free KV cache compression technique that represents 4-element K/V chunks as quaternions and quantizes their unit directions via the product of the fixed 24-element Hurwitz group (2T) and per-(layer, head) random unit quaternions, augmented by a per-batch median-multiplier outlier handler (C=3). It reports that this yields ~5-bit representations matching fp16 perplexity within 0.02-0.03 points on Mistral-7B and Qwen3-8B, recovers fp16 quality within 0.02-0.10 points on Qwen2.5-7B/Qwen3-8B where int4 fails, Pareto-dominates naive int quantization by 3-1900x at matched bits, and matches the calibrated KIVI-4 baseline on downstream tasks at 16% fewer bits without calibration; downstream zero-shot accuracy also matches fp16 at 3.79 bits on Mistral, with up to 5.05x cache compression.

Significance. If the empirical results hold, the work provides a practically significant calibration-free approach to KV cache compression that could enable longer contexts on memory-constrained hardware across dense and MoE architectures. The multiplicative quaternion construction and median-multiplier step are notable for avoiding per-model tuning while achieving competitive or superior performance to calibrated baselines; the explicit reporting of bit rates, perplexity deltas, and downstream metrics on five models strengthens the contribution.

major comments (3)
  1. [Abstract] Abstract: The justification that 'random initialization suffices because left-multiplication is an S^3 isometry, so seeded codebooks vary in end-task ppl by <1.5%' addresses only inter-seed variation and does not establish that the resulting 24S codewords lie near the empirical mass of normalized 4-vectors from real KV caches; this assumption is load-bearing for the calibration-free claim on Qwen2.5-7B and Qwen3-8B where int4 collapses.
  2. [Abstract] Abstract (results paragraph): The reported recovery of fp16 quality 'within 0.02--0.10 ppl points at ~5 bits' on Qwen2.5-7B/Qwen3-8B is presented without error bars, number of evaluation runs, or details on how the per-batch median multiplier interacts with GQA head grouping; this makes it difficult to assess robustness of the central no-calibration performance claim.
  3. [Abstract] Abstract (comparison paragraph): The claim that HQMQ at 3.79 bits 'matches KIVI-4 (~4.5 bits) within ~1 pt on CoQA, 0.6 pts on TruthfulQA, and 2.3 pts on GSM8K' requires explicit confirmation that the effective bit-rate calculation for HQMQ (including the stored S parameters per head plus the median multiplier) is directly comparable to KIVI's reported rate; any mismatch would affect the Pareto-dominance conclusion.
minor comments (2)
  1. [Abstract] The abstract would benefit from a brief parenthetical on how the 24-element Hurwitz group is stored (e.g., as indices) to clarify the exact parameter count in the 24S product codebook.
  2. [Abstract] Notation for the secondary codebook size S and the multiplier C should be introduced with a short definition on first use to improve readability for readers unfamiliar with quaternion quantization.
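Major comment 1 asks whether the 24S codewords actually sit near the directions that occur in practice. One way to probe this is a nearest-codeword angle statistic, sketched below. Everything here is an assumption for illustration: the helper names, the use of isotropic Gaussian samples as a stand-in for normalized KV chunks (real caches are unlikely to be isotropic), and the choice of mean and maximum angle as the reported metrics.

```python
import itertools
import numpy as np

def hurwitz_units():
    """The 24 unit Hurwitz quaternions as rows (w, x, y, z)."""
    axis = [s * np.eye(4)[i] for i in range(4) for s in (1.0, -1.0)]
    half = [np.array(s) for s in itertools.product((0.5, -0.5), repeat=4)]
    return np.stack(axis + half)

def qmul(a, b):
    """Hamilton product a * b, broadcasting over leading axes."""
    aw, ax, ay, az = np.moveaxis(a, -1, 0)
    bw, bx, by, bz = np.moveaxis(b, -1, 0)
    return np.stack([
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    ], axis=-1)

def product_codebook(num_secondary, rng):
    """24 * S codewords built from S random unit quaternions."""
    q_s = rng.normal(size=(num_secondary, 4))
    q_s /= np.linalg.norm(q_s, axis=1, keepdims=True)
    return qmul(hurwitz_units()[:, None, :], q_s[None, :, :]).reshape(-1, 4)

def nearest_angles_deg(vectors, codebook):
    """Angle in degrees from each vector's direction to its nearest codeword."""
    dirs = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    cos = np.clip((dirs @ codebook.T).max(axis=1), -1.0, 1.0)
    return np.degrees(np.arccos(cos))

rng = np.random.default_rng(0)
samples = rng.normal(size=(2000, 4))   # synthetic stand-in for sampled KV chunks
for s in (1, 24, 96):
    ang = nearest_angles_deg(samples, product_codebook(s, rng))
    print(f"S={s:3d}  mean={ang.mean():5.2f} deg  max={ang.max():5.2f} deg")
```

Replacing `samples` with chunks drawn from an actual KV cache would turn this into the kind of coverage evidence the comment requests.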

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. The comments highlight areas where additional clarification and evidence can strengthen the calibration-free claims. We address each point below and will make targeted revisions to the abstract and supporting sections.

  1. Referee: [Abstract] Abstract: The justification that 'random initialization suffices because left-multiplication is an S^3 isometry, so seeded codebooks vary in end-task ppl by <1.5%' addresses only inter-seed variation and does not establish that the resulting 24S codewords lie near the empirical mass of normalized 4-vectors from real KV caches; this assumption is load-bearing for the calibration-free claim on Qwen2.5-7B and Qwen3-8B where int4 collapses.

    Authors: We agree the isometry argument demonstrates robustness to seed choice but does not directly quantify alignment with the empirical distribution of normalized KV vectors. The calibration-free results on Qwen models provide indirect support via end-task performance. To address this, we will add an appendix analysis with angular coverage metrics and nearest-neighbor distances between the 24S codewords and sampled KV vectors from the evaluated models. revision: partial

  2. Referee: [Abstract] Abstract (results paragraph): The reported recovery of fp16 quality 'within 0.02--0.10 ppl points at ~5 bits' on Qwen2.5-7B/Qwen3-8B is presented without error bars, number of evaluation runs, or details on how the per-batch median multiplier interacts with GQA head grouping; this makes it difficult to assess robustness of the central no-calibration performance claim.

    Authors: We will update the abstract and results section to report error bars from multiple runs (specifying 3 seeds), the number of evaluations, and explicit details on the median multiplier: it is computed per batch and applied uniformly within each GQA head group sharing KV projections. This will allow better assessment of robustness. revision: yes

  3. Referee: [Abstract] Abstract (comparison paragraph): The claim that HQMQ at 3.79 bits 'matches KIVI-4 (~4.5 bits) within ~1 pt on CoQA, 0.6 pts on TruthfulQA, and 2.3 pts on GSM8K' requires explicit confirmation that the effective bit-rate calculation for HQMQ (including the stored S parameters per head plus the median multiplier) is directly comparable to KIVI's reported rate; any mismatch would affect the Pareto-dominance conclusion.

    Authors: The 3.79-bit figure already incorporates all overheads, including the S random quaternions stored per head (as 16-bit values) and the per-batch median multiplier. We will add an explicit bit-rate breakdown table in the methods section and a clarifying sentence in the abstract to confirm direct comparability with KIVI's reported rates, preserving the Pareto-dominance claim. revision: yes
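The bit-rate breakdown discussed in response 3 can be approximated from the configuration labels in the figure captions. The sketch below assumes that a label such as "s96 r4" means S=96 secondary codewords and 4 radius bits, and that each 4-element chunk costs log2(24S) index bits plus the radius bits. This reading is an inference, not something the page states; it is adopted because it reproduces the bit widths listed with Figure 8. Codebook storage and the Med3× overhead are not modeled.

```python
import math

def bits_per_element(num_secondary, radius_bits, chunk_size=4, primary_size=24):
    """Assumed rate model: (index bits + radius bits) spread over one chunk."""
    index_bits = math.log2(primary_size * num_secondary)  # picks one of 24 * S codewords
    return (index_bits + radius_bits) / chunk_size

# Labels and reported bit widths taken from the Figure 8 table.
reported = {(24, 3): 3.04, (96, 4): 3.79, (192, 4): 4.04, (192, 6): 4.54}
for (s, r), bits in reported.items():
    model = bits_per_element(s, r)
    print(f"s{s} r{r}: model {model:.2f} bits/element, reported {bits:.2f}")
```

Note that log2(24S) is not an integer for these values of S, so matching the reported rates would require packing indices across chunks; how the paper handles this is not visible from the page.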

Circularity Check

0 steps flagged

No significant circularity; empirical method grounded in standard quaternion geometry


The paper presents HQMQ as a calibration-free empirical quantization technique motivated by the geometry of the Hurwitz group and S^3 isometry under left-multiplication. This isometry is a standard mathematical property of unit quaternions, not derived from or dependent on the paper's own results. No derivation chain reduces a claimed prediction or first-principles result to a quantity defined by the method itself (e.g., no fitted parameters renamed as predictions, no self-definitional loops). Performance is assessed via direct experiments on external models without self-referential fitting. No load-bearing self-citations or ansatzes imported via prior author work are present in the text. The approach is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

Abstract-only review; limited visibility into parameters or assumptions. The method invokes the isometry property of quaternion multiplication to justify random codebooks and introduces a median-multiplier rule without calibration.

free parameters (2)
  • S
    Size of the per-(layer, head) secondary random quaternion codebook; it sets the effective codebook size to 24S, but the abstract does not give its numerical value.
  • C
    Multiplier for per-batch median outlier extraction, stated as C=3.
axioms (1)
  • [standard math] Left-multiplication by a unit quaternion is an isometry of S^3
    Invoked to explain why random initialization of q_s yields <1.5% end-task ppl variation.

pith-pipeline@v0.9.1-grok · 6067 in / 1585 out tokens · 62365 ms · 2026-06-29T18:11:48.796346+00:00 · methodology

