
TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate

Amir Zandieh, Majid Daliri, Majid Hadian, Vahab Mirrokni · 2025 · cs.LG · arXiv 2504.19874

24 Pith papers cite this work. Polarity classification is still indexing.


abstract

Vector quantization, a problem rooted in Shannon's source coding theory, aims to quantize high-dimensional Euclidean vectors while minimizing distortion in their geometric structure. We propose TurboQuant to address both mean-squared error (MSE) and inner product distortion, overcoming limitations of existing methods that fail to achieve optimal distortion rates. Our data-oblivious algorithms, suitable for online applications, achieve near-optimal distortion rates (within a small constant factor) across all bit-widths and dimensions. TurboQuant achieves this by randomly rotating input vectors, inducing a concentrated Beta distribution on coordinates, and leveraging the near-independence property of distinct coordinates in high dimensions to simply apply optimal scalar quantizers per each coordinate. Recognizing that MSE-optimal quantizers introduce bias in inner product estimation, we propose a two-stage approach: applying an MSE quantizer followed by a 1-bit Quantized JL (QJL) transform on the residual, resulting in an unbiased inner product quantizer. We also provide a formal proof of the information-theoretic lower bounds on best achievable distortion rate by any vector quantizer, demonstrating that TurboQuant closely matches these bounds, differing only by a small constant ($\approx 2.7$) factor. Experimental results validate our theoretical findings, showing that for KV cache quantization, we achieve absolute quality neutrality with 3.5 bits per channel and marginal quality degradation with 2.5 bits per channel. Furthermore, in nearest neighbor search tasks, our method outperforms existing product quantization techniques in recall while reducing indexing time to virtually zero.
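The two-stage pipeline described in the abstract (random rotation, per-coordinate scalar quantization, then a 1-bit sign sketch of the residual) can be illustrated with a minimal sketch. This is not the authors' implementation. It rests on several assumptions: the codebook is fitted by plain Lloyd iterations on Gaussian samples with variance 1/d, a high-dimensional stand-in for the Beta-distributed coordinates the abstract mentions; the rotation comes from Gram-Schmidt on Gaussian rows; and all function names (`random_rotation`, `lloyd_codebook`, `quantize_mse`, `dequantize_mse`, `qjl_encode`, `inner_product_estimate`) are hypothetical. The sketch spends 2 bits per coordinate on the first stage and 1 further bit on the residual, plus one stored norm.

```python
import math
import random

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def matvec(M, x):
    return [dot(row, x) for row in M]

def unit(v):
    n = math.sqrt(dot(v, v))
    return [a / n for a in v]

def random_rotation(d, rng):
    """Random orthogonal matrix: Gram-Schmidt on i.i.d. Gaussian rows."""
    rows = []
    while len(rows) < d:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for u in rows:  # strip components along earlier rows
            c = dot(v, u)
            v = [a - c * b for a, b in zip(v, u)]
        if dot(v, v) > 1e-18:
            rows.append(unit(v))
    return rows

def nearest(v, centers):
    return min(range(len(centers)), key=lambda j: abs(v - centers[j]))

def lloyd_codebook(samples, bits, iters=25):
    """1-D Lloyd iterations (k-means) yielding 2**bits centroids."""
    k, s = 2 ** bits, sorted(samples)
    centers = [s[(2 * i + 1) * len(s) // (2 * k)] for i in range(k)]  # quantile init
    for _ in range(iters):
        sums, counts = [0.0] * k, [0] * k
        for v in s:
            j = nearest(v, centers)
            sums[j] += v
            counts[j] += 1
        centers = [sums[j] / counts[j] if counts[j] else centers[j] for j in range(k)]
    return centers

def quantize_mse(x, R, centers):
    """Stage 1: rotate, then snap each coordinate to its nearest centroid."""
    return [nearest(v, centers) for v in matvec(R, x)]

def dequantize_mse(codes, R, centers):
    """Look up centroids and undo the rotation with R^T."""
    y_hat = [centers[j] for j in codes]
    return matvec([list(col) for col in zip(*R)], y_hat)

def qjl_encode(r, S):
    """Stage 2: one sign bit per Gaussian projection of the residual, plus its norm."""
    signs = [1.0 if p >= 0 else -1.0 for p in matvec(S, r)]
    return signs, math.sqrt(dot(r, r))

def inner_product_estimate(q, x_hat, signs, r_norm, S):
    """<q, x_hat> plus a sign-based estimate of <q, residual>, unbiased over S."""
    scale = math.sqrt(math.pi / 2) / len(S) * r_norm
    return dot(q, x_hat) + scale * dot(matvec(S, q), signs)

# Small demo with assumed sizes: d = 32, 2-bit first stage.
rng = random.Random(0)
d, bits = 32, 2
x = unit([rng.gauss(0, 1) for _ in range(d)])
q = unit([rng.gauss(0, 1) for _ in range(d)])
R = random_rotation(d, rng)
centers = lloyd_codebook([rng.gauss(0, 1 / math.sqrt(d)) for _ in range(4000)], bits)
codes = quantize_mse(x, R, centers)
x_hat = dequantize_mse(codes, R, centers)
residual = [a - b for a, b in zip(x, x_hat)]
S = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(d)]
signs, r_norm = qjl_encode(residual, S)
print("exact   :", dot(q, x))
print("estimate:", inner_product_estimate(q, x_hat, signs, r_norm, S))
```

A single draw of `S` gives a noisy estimate. The correction term is unbiased only in expectation over `S`, since for a standard Gaussian row s the identity E[sign(s·r)(s·q)] = sqrt(2/π)·⟨r, q⟩/‖r‖ holds, so averaging over independent draws should approach the exact inner product. Nothing in the sketch depends on the data beyond the input norm, which is the sense in which such a scheme is data-oblivious.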


citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Block-Sphere Vector Quantization

cs.LG · 2026-05-19 · unverdicted · novelty 7.0

BlockQuant is a new block quantization algorithm on the sphere after random rotation that theoretically improves reconstruction MSE and expected inner-product distortion over EDEN, RabitQ, and TurboQuant.

LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation

cs.CV · 2026-05-18 · unverdicted · novelty 7.0

LongLive-2.0 delivers an NVFP4 parallel infrastructure that enables direct training of long multi-shot autoregressive diffusion video models and achieves up to 2.15x training and 1.84x inference speedups on Blackwell and other GPUs.

IVF-TQ: Calibration-Free Streaming Vector Search via a Codebook-Free Residual Layer

cs.LG · 2026-05-17 · unverdicted · novelty 7.0 · 2 refs

IVF-TQ replaces learned codebooks with a fixed random rotation and precomputed scalar quantizer in the residual layer of an IVF index, delivering streaming recall stability at fixed bit budgets via a uniform-over-sphere inner-product bound.

PrismQuant: Rate-Distortion-Optimal Vector Quantization for Gaussian-Mixture Sources

cs.IT · 2026-05-15 · conditional · novelty 7.0

PrismQuant achieves near rate-distortion optimality for Gaussian-mixture sources by losslessly transmitting the mixture component label at H(C)/n bits per dimension and applying component-matched KLT plus scalar quantization, with vanishing gap to the genie-aided bound.

Quantizing With Randomized Hadamard Transforms: Efficient Heuristic Now Proven

cs.LG · 2026-05-07 · unverdicted · novelty 7.0

Two randomized Hadamard transforms suffice to make coordinate marginals O(d^{-1/2})-close to Gaussian for most quantization methods, with three needed for vector quantization to match uniform random rotations asymptotically.

When Quantization Is Free: An int4 KV Cache That Outruns fp16 on Apple Silicon

cs.PF · 2026-05-07 · unverdicted · novelty 7.0

A single fused int4 KV cache kernel on Apple Silicon outperforms fp16 in latency with 3x memory compression and near-zero quality loss on tested models.

Sequential KV Cache Compression via Probabilistic Language Tries: Beyond the Per-Vector Shannon Limit

cs.LG · 2026-04-10 · unverdicted · novelty 7.0

Sequential KV compression via probabilistic language tries and predictive delta coding achieves 3.3-4.3 bits per token entropy, yielding up to 914x better ratios than TurboQuant even with large overhead.

3DTurboQuant: Training-Free Near-Optimal Quantization for 3D Reconstruction Models

cs.CV · 2026-04-07 · conditional · novelty 7.0

3DTurboQuant achieves training-free near-optimal quantization for 3DGS and DUSt3R models via random rotations inducing Beta distributions, enabling precomputed Lloyd-Max quantizers that deliver 3.5x and 7.9x compression with negligible quality loss.

SuperLocalMemory V3.3: The Living Brain -- Biologically-Inspired Forgetting, Cognitive Quantization, and Multi-Channel Retrieval for Zero-LLM Agent Memory Systems

cs.AI · 2026-04-06 · unverdicted · novelty 7.0

SuperLocalMemory V3.3 implements a cognitive memory taxonomy with mathematical forgetting and multi-channel retrieval, reaching 70.4% on LoCoMo in zero-LLM mode.

Runtime-Certified Bounded-Error Quantized Attention

cs.LG · 2026-05-20 · unverdicted · novelty 6.0

A tiered KV cache architecture computes per-head per-step error bounds on quantized attention and uses adaptive fallback to guarantee bounded or exact outputs relative to FP16 reference.

PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents

cs.AI · 2026-05-19 · unverdicted · novelty 6.0

PEEK maintains a constant-sized context map via a programmable cache policy to give LLM agents persistent orientation knowledge about recurring external contexts, yielding 6-34% gains and lower cost than prior prompt-learning methods.

OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond

cs.LG · 2026-05-19 · unverdicted · novelty 6.0

OScaR mitigates token norm imbalance via canalized rotation and omni-token scaling to enable near-lossless INT2 KV cache quantization with up to 3x decoding speedup and 5.3x memory reduction.

OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization

cs.LG · 2026-05-18 · unverdicted · novelty 6.0

OSCAR achieves near-BF16 accuracy for 2-bit KV cache quantization by using offline spectral covariance-aware rotations aligned with attention, plus a custom deployable INT2 kernel compatible with paged serving.

VeriCache: Turning Lossy KV Cache into Lossless LLM Inference

cs.AR · 2026-05-17 · unverdicted · novelty 6.0

VeriCache turns lossy KV cache compression into lossless LLM inference by drafting with compressed cache and verifying drafts with full cache, achieving up to 4x throughput with identical outputs.

Design Conductor 2.0: An agent builds a TurboQuant inference accelerator in 80 hours

cs.AR · 2026-05-06 · unverdicted · novelty 6.0

Design Conductor 2.0 uses April 2026 frontier models to autonomously create a 5129-unit FP16/32 TurboQuant inference accelerator mapped to FPGA at 125 MHz in 80 hours.

Statistical Inference and Quality Measures of KV Cache Quantisations Inspired by TurboQuant

cs.LG · 2026-04-27 · unverdicted · novelty 6.0

At 4-bit budget KQV wins on KL divergence, geometric K error and 6D distance with unconditional K-V asymmetry; QKQV wins geometrically at other budgets because the Jensen-amplified variance inflation from QJL on K does not bind.

HARBOR: Automated Harness Optimization

cs.LG · 2026-04-22 · unverdicted · novelty 6.0

HARBOR formalizes harness optimization as constrained noisy Bayesian optimization over mixed-variable spaces and reports a case study where it outperforms manual tuning on a production coding agent.

SAW-INT4: System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving

cs.LG · 2026-04-21 · unverdicted · novelty 6.0

Token-wise INT4 KV-cache quantization plus block-diagonal Hadamard rotation recovers nearly all accuracy lost by naive INT4 while adding zero end-to-end overhead under paged serving constraints.

Open-TQ-Metal: Fused Compressed-Domain Attention for Long-Context LLM Inference on Apple Silicon

cs.LG · 2026-04-18 · unverdicted · novelty 6.0

Fused compressed-domain int4 attention on Apple Silicon delivers 48x speedup and 3.2x KV cache compression for 128K-context 70B models while matching FP16 token predictions.

eOptShrinkQ: Near-Lossless KV Cache Compression Through Optimal Spectral Denoising and Quantization

cs.LG · 2026-04-06 · unverdicted · novelty 6.0

eOptShrinkQ compresses KV caches to ~2.2 bits per entry via optimal spectral shrinkage and quantization, outperforming prior methods on LongBench while matching FP16 on multi-needle retrieval.

HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization

cs.LG · 2026-05-05 · unverdicted · novelty 5.0 · 2 refs

HeadQ applies score-space logit corrections for keys and attention-weighted surrogates for values to KV-cache quantization, removing 84-94% of excess perplexity in 2-bit key experiments across six models.

High-Rate Quantized Matrix Multiplication I

cs.IT · 2026-01-23 · unverdicted · novelty 5.0

High-rate quantization theory yields accurate approximations for the distortion of absmax INT and FP schemes in generic weight-plus-activation matrix multiplication.

Hierarchical vs. Flat Iteration in Shared-Weight Transformers

cs.CL · 2026-04-15 · unverdicted · novelty 4.0

Hierarchical two-speed shared-weight recurrence in Transformers shows a sharp performance gap compared to independent layer stacking in empirical language modeling tests.

ECG Foundation Models and Medical LLMs for Agentic Cardiovascular Intelligence at the Edge: A Review and Outlook

eess.SP · 2026-04-02 · unverdicted · novelty 3.0

ECG foundation models for signal interpretation and medical LLMs for reasoning can be integrated into agentic systems for real-time cardiovascular intelligence on edge devices.

