TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate
Pith reviewed 2026-05-20 08:03 UTC · model grok-4.3
pith:S4XIHWQS Add to your LaTeX paper
What is a Pith Number?\usepackage{pith}
\pithnumber{S4XIHWQS}
Prints a linked pith:S4XIHWQS badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files. Learn more
The pith
TurboQuant achieves vector quantization with distortion rates within a small constant factor of the information-theoretic minimum.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TurboQuant randomly rotates input vectors to induce a concentrated Beta distribution on coordinates and then applies optimal scalar quantizers independently to each coordinate. This yields near-optimal mean-squared-error distortion. For unbiased inner-product estimation, an MSE quantizer is followed by a 1-bit Quantized JL transform on the residual. The resulting scheme matches the proven information-theoretic lower bound on distortion rate up to a small constant factor of approximately 2.7 across all bit widths and dimensions.
What carries the argument
Random rotation that concentrates coordinates into a Beta distribution, enabling independent optimal scalar quantization per coordinate while preserving near-optimality due to high-dimensional near-independence.
If this is right
- KV-cache quantization reaches absolute quality neutrality at 3.5 bits per channel.
- Only marginal quality degradation appears at 2.5 bits per channel for KV caches.
- Nearest-neighbor search recall improves over product quantization while indexing time drops to essentially zero.
- The same scheme works online and data-obliviously for every bit width and dimension.
- Any future vector quantizer must respect the same information-theoretic lower bounds established in the paper.
Where Pith is reading between the lines
- The rotation-plus-scalar-quantizer pattern could be combined with learned per-coordinate codebooks to tighten the constant factor in domain-specific settings.
- Because no data-dependent preprocessing is required, the method fits naturally into streaming or continually changing high-dimensional data pipelines.
- The same independence argument may extend to other distortion measures such as cosine similarity or Wasserstein distance once suitable scalar quantizers are identified.
Load-bearing premise
After random rotation the coordinates become sufficiently near-independent in high dimensions that separate scalar quantization loses only a small constant factor in distortion compared with any joint scheme.
What would settle it
A calculation or experiment in which TurboQuant's distortion rate exceeds three times the proven lower bound for some bit width and dimension would show the near-optimality claim is false.
read the original abstract
Vector quantization, a problem rooted in Shannon's source coding theory, aims to quantize high-dimensional Euclidean vectors while minimizing distortion in their geometric structure. We propose TurboQuant to address both mean-squared error (MSE) and inner product distortion, overcoming limitations of existing methods that fail to achieve optimal distortion rates. Our data-oblivious algorithms, suitable for online applications, achieve near-optimal distortion rates (within a small constant factor) across all bit-widths and dimensions. TurboQuant achieves this by randomly rotating input vectors, inducing a concentrated Beta distribution on coordinates, and leveraging the near-independence property of distinct coordinates in high dimensions to simply apply optimal scalar quantizers per each coordinate. Recognizing that MSE-optimal quantizers introduce bias in inner product estimation, we propose a two-stage approach: applying an MSE quantizer followed by a 1-bit Quantized JL (QJL) transform on the residual, resulting in an unbiased inner product quantizer. We also provide a formal proof of the information-theoretic lower bounds on best achievable distortion rate by any vector quantizer, demonstrating that TurboQuant closely matches these bounds, differing only by a small constant ($\approx 2.7$) factor. Experimental results validate our theoretical findings, showing that for KV cache quantization, we achieve absolute quality neutrality with 3.5 bits per channel and marginal quality degradation with 2.5 bits per channel. Furthermore, in nearest neighbor search tasks, our method outperforms existing product quantization techniques in recall while reducing indexing time to virtually zero.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes TurboQuant, a data-oblivious online vector quantization algorithm for high-dimensional Euclidean vectors. It applies a random rotation to induce concentrated Beta marginals on coordinates, then performs independent scalar quantization per coordinate. For inner-product preservation it adds a two-stage correction using an MSE quantizer followed by a 1-bit Quantized JL transform. The authors supply a formal proof of information-theoretic lower bounds on achievable distortion and claim that TurboQuant attains rates within a small constant factor (approximately 2.7) of these bounds across bit-widths and dimensions. Experiments report absolute quality neutrality at 3.5 bits per channel for KV-cache quantization and improved recall with near-zero indexing time for nearest-neighbor search.
Significance. If the central claims are substantiated, the work offers a theoretically grounded, online quantization primitive that is attractive for memory-constrained inference in large models. The combination of a data-oblivious construction, explicit lower-bound proof, and concrete task-level validation constitutes a useful contribution to the quantization literature.
major comments (2)
- [Core algorithm and Beta distribution analysis] Core algorithm description and Beta-concentration analysis: the central claim that random rotation plus per-coordinate scalar quantization yields distortion within a small constant factor of the vector-quantization lower bound rests on the coordinates behaving as sufficiently near-independent after rotation. The unit-norm constraint induces negative dependence (sum x_i^2 = ||v||^2) that is not obviously canceled by the marginal concentration argument; this dependence could enlarge the mutual-information gap or error accumulation beyond the stated ≈2.7 factor, especially at moderate dimensions or low bit-widths. A quantitative bound that accounts for the residual joint dependence is needed to support the optimality claim.
- [Lower-bound proof section] Formal proof of information-theoretic lower bounds: the proof is presented as independent of the algorithm construction, yet the manuscript must explicitly show how the reported empirical distortion rates are obtained relative to the derived lower bound without post-hoc fitting from the same evaluation data. Without this explicit linkage, the constant-factor comparison cannot be verified as non-circular.
minor comments (2)
- [Experimental results] Experimental sections on KV-cache and nearest-neighbor search should report standard deviations or results over multiple random seeds and datasets to allow assessment of variability in the quality metrics.
- [Notation and definitions] Clarify the precise normalization used for the distortion-rate quantity when comparing across bit-widths and dimensions in both theory and experiments.
Simulated Author's Rebuttal
Thank you for the detailed review and the recommendation for major revision. We believe the concerns can be addressed by clarifying the analysis and adding explicit comparisons, as detailed in our point-by-point responses below.
read point-by-point responses
-
Referee: Core algorithm and Beta distribution analysis: the central claim that random rotation plus per-coordinate scalar quantization yields distortion within a small constant factor of the vector-quantization lower bound rests on the coordinates behaving as sufficiently near-independent after rotation. The unit-norm constraint induces negative dependence (sum x_i^2 = ||v||^2) that is not obviously canceled by the marginal concentration argument; this dependence could enlarge the mutual-information gap or error accumulation beyond the stated ≈2.7 factor, especially at moderate dimensions or low bit-widths. A quantitative bound that accounts for the residual joint dependence is needed to support the optimality claim.
Authors: We agree that the unit-norm constraint introduces negative dependence among coordinates. However, for high-dimensional vectors uniformly distributed on the sphere, after a random orthogonal transformation, the coordinates are known to be approximately independent with the dependence becoming negligible as the dimension increases. Our proof leverages concentration properties of the Beta distribution for the marginals and bounds the quantization error using these marginals, with the constant factor of approximately 2.7 derived from the ratio of the vector rate-distortion function to the scalar one under the induced distribution. To strengthen this, we will add a quantitative analysis of the residual dependence in the revised manuscript, showing that it does not increase the factor beyond 2.7 for dimensions d >= 64 and bit-widths >= 2, using bounds from negative association or Stein's method. revision: partial
-
Referee: Formal proof of information-theoretic lower bounds: the proof is presented as independent of the algorithm construction, yet the manuscript must explicitly show how the reported empirical distortion rates are obtained relative to the derived lower bound without post-hoc fitting from the same evaluation data. Without this explicit linkage, the constant-factor comparison cannot be verified as non-circular.
Authors: The information-theoretic lower bound is derived independently in the proof section using standard techniques from rate-distortion theory applied to the distortion measures (MSE and inner-product), without reference to TurboQuant. The empirical rates are obtained by running the algorithm on test vectors and computing the average distortion directly. The factor ≈2.7 is the maximum observed ratio across our experiments and is consistent with the theoretical gap between scalar and vector quantization under Beta marginals. We will revise the manuscript to include an explicit comparison table that lists the theoretical lower bound, the achieved distortion for TurboQuant, and the ratio for each setting, making the non-circular nature clear. No post-hoc fitting was performed; the constant is an upper bound on the observed ratios. revision: yes
Circularity Check
No significant circularity; lower-bound proof and algorithm analysis are independent
full rationale
The paper derives its main result by first proving an information-theoretic lower bound on distortion for any vector quantizer, then separately analyzing TurboQuant's distortion under random rotation plus per-coordinate scalar quantization using Beta marginals. The lower-bound proof is presented as a standalone information-theoretic argument that does not invoke the specific TurboQuant construction or its fitted constants. Distortion rates for the method are obtained from explicit concentration bounds and scalar quantization error formulas rather than by fitting parameters to the same evaluation data used to report empirical performance. The near-independence assumption after rotation is an explicit modeling choice applied uniformly to both the algorithm and the gap analysis, but it does not create a definitional loop or reduce the claimed constant-factor gap to a tautology. No self-citations, ansatzes smuggled via prior work, or renaming of known results appear as load-bearing steps in the provided derivation outline.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Random orthogonal transformations concentrate coordinate marginals to a Beta distribution and render distinct coordinates nearly independent in high dimensions.
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.DimensionForcingalexander_duality_circle_linking echoes?
echoesECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
randomly rotating input vectors, inducing a concentrated Beta distribution on coordinates, and leveraging the near-independence property of distinct coordinates in high dimensions to simply apply optimal scalar quantizers per each coordinate
-
IndisputableMonolith.Cost.FunctionalEquationwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We also provide a formal proof of the information-theoretic lower bounds on best achievable distortion rate by any vector quantizer, demonstrating that TurboQuant closely matches these bounds, differing only by a small constant (≈2.7) factor
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 23 Pith papers
-
HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization
HeadQ removes 84-94% of excess perplexity from 2-bit key quantization by storing low-rank residuals in a calibration-learned query basis for score-space correction and using A²-weighted distortion for values.
-
Block-Sphere Vector Quantization
BlockQuant is a new block quantization algorithm on the sphere after random rotation that theoretically improves reconstruction MSE and expected inner-product distortion over EDEN, RabitQ, and TurboQuant.
-
LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation
LongLive-2.0 delivers an NVFP4 parallel infrastructure that enables direct training of long multi-shot autoregressive diffusion video models and achieves up to 2.15x training and 1.84x inference speedups on Blackwell ...
-
PrismQuant: Rate-Distortion-Optimal Vector Quantization for Gaussian-Mixture Sources
PrismQuant achieves near rate-distortion optimality for Gaussian-mixture sources by losslessly transmitting the mixture component label at H(C)/n bits per dimension and applying component-matched KLT plus scalar quant...
-
Quantizing With Randomized Hadamard Transforms: Efficient Heuristic Now Proven
Two randomized Hadamard transforms suffice to make coordinate marginals O(d^{-1/2})-close to Gaussian for most quantization methods, with three needed for vector quantization to match uniform random rotations asymptotically.
-
When Quantization Is Free: An int4 KV Cache That Outruns fp16 on Apple Silicon
A single fused int4 KV cache kernel on Apple Silicon outperforms fp16 in latency with 3x memory compression and near-zero quality loss on tested models.
-
Sequential KV Cache Compression via Probabilistic Language Tries: Beyond the Per-Vector Shannon Limit
Sequential KV compression via probabilistic language tries and predictive delta coding achieves 3.3-4.3 bits per token entropy, yielding up to 914x better ratios than TurboQuant even with large overhead.
-
3DTurboQuant: Training-Free Near-Optimal Quantization for 3D Reconstruction Models
3DTurboQuant achieves training-free near-optimal quantization for 3DGS and DUSt3R models via random rotations inducing Beta distributions, enabling precomputed Lloyd-Max quantizers that deliver 3.5x and 7.9x compressi...
-
SuperLocalMemory V3.3: The Living Brain -- Biologically-Inspired Forgetting, Cognitive Quantization, and Multi-Channel Retrieval for Zero-LLM Agent Memory Systems
SuperLocalMemory V3.3 implements a cognitive memory taxonomy with mathematical forgetting and multi-channel retrieval, reaching 70.4% on LoCoMo in zero-LLM mode.
-
PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents
PEEK maintains a constant-sized context map via a programmable cache policy to give LLM agents persistent orientation knowledge about recurring external contexts, yielding 6-34% gains and lower cost than prior prompt-...
-
OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond
OScaR mitigates token norm imbalance via canalized rotation and omni-token scaling to enable near-lossless INT2 KV cache quantization with up to 3x decoding speedup and 5.3x memory reduction.
-
OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization
OSCAR achieves near-BF16 accuracy for 2-bit KV cache quantization by using offline spectral covariance-aware rotations aligned with attention, plus a custom deployable INT2 kernel compatible with paged serving.
-
VeriCache: Turning Lossy KV Cache into Lossless LLM Inference
VeriCache turns lossy KV cache compression into lossless LLM inference by drafting with compressed cache and verifying drafts with full cache, achieving up to 4x throughput with identical outputs.
-
Design Conductor 2.0: An agent builds a TurboQuant inference accelerator in 80 hours
Design Conductor 2.0 uses April 2026 frontier models to autonomously create a 5129-unit FP16/32 TurboQuant inference accelerator mapped to FPGA at 125 MHz in 80 hours.
-
HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization
HeadQ reduces 84-94% of excess perplexity in 2-bit key quantization by adding low-rank logit corrections in a calibration-learned query basis, with further gains from an A^2-weighted value policy.
-
Statistical Inference and Quality Measures of KV Cache Quantisations Inspired by TurboQuant
At 4-bit budget KQV wins on KL divergence, geometric K error and 6D distance with unconditional K-V asymmetry; QKQV wins geometrically at other budgets because the Jensen-amplified variance inflation from QJL on K doe...
-
HARBOR: Automated Harness Optimization
HARBOR formalizes harness optimization as constrained noisy Bayesian optimization over mixed-variable spaces and reports a case study where it outperforms manual tuning on a production coding agent.
-
SAW-INT4: System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving
Token-wise INT4 KV-cache quantization plus block-diagonal Hadamard rotation recovers nearly all accuracy lost by naive INT4 while adding zero end-to-end overhead under paged serving constraints.
-
Open-TQ-Metal: Fused Compressed-Domain Attention for Long-Context LLM Inference on Apple Silicon
Fused compressed-domain int4 attention on Apple Silicon delivers 48x speedup and 3.2x KV cache compression for 128K-context 70B models while matching FP16 token predictions.
-
eOptShrinkQ: Near-Lossless KV Cache Compression Through Optimal Spectral Denoising and Quantization
eOptShrinkQ compresses KV caches to ~2.2 bits per entry via optimal spectral shrinkage and quantization, outperforming prior methods on LongBench while matching FP16 on multi-needle retrieval.
-
High-Rate Quantized Matrix Multiplication I
High-rate quantization theory yields accurate approximations for the distortion of absmax INT and FP schemes in generic weight-plus-activation matrix multiplication.
-
Hierarchical vs. Flat Iteration in Shared-Weight Transformers
Hierarchical two-speed shared-weight recurrence in Transformers shows a sharp performance gap compared to independent layer stacking in empirical language modeling tests.
-
ECG Foundation Models and Medical LLMs for Agentic Cardiovascular Intelligence at the Edge: A Review and Outlook
ECG foundation models for signal interpretation and medical LLMs for reasoning can be integrated into agentic systems for real-time cardiovascular intelligence on edge devices.
Reference graph
Works this paper leans on
-
[1]
https://www.elastic.co/enterprise-search/vector-search
Elastic search., 2025. https://www.elastic.co/enterprise-search/vector-search
work page 2025
- [2]
-
[3]
https://github.com/pgvector/pgvector/
Pgvector search., 2025. https://github.com/pgvector/pgvector/
work page 2025
- [4]
-
[5]
Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[6]
Gqa: Training generalized multi-query transformer models from multi-head checkpoints
Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebron, F., and Sanghai, S. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. In Pro- ceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pp. 4895–4901, 2023
work page 2023
- [7]
-
[8]
L., Li, B., Cameron, P., Jaggi, M., Alistarh, D., Hoefler, T., and Hensman, J
Ashkboos, S., Mohtashami, A., Croci, M. L., Li, B., Cameron, P., Jaggi, M., Alistarh, D., Hoefler, T., and Hensman, J. Quarot: Outlier-free 4-bit inference in rotated llms. arXiv preprint arXiv:2404.00456, 2024
-
[9]
Babenko, A. and Lempitsky, V. Additive quantization for extreme vector compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pp. 931– 938, 2014
work page 2014
-
[10]
LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding
Bai, Y., Lv, X., Zhang, J., Lyu, H., Tang, J., Huang, Z., Du, Z., Liu, X., Zeng, A., Hou, L., Dong, Y., Tang, J., and Li, J. Longbench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508 , 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[11]
Longformer: The Long-Document Transformer
Beltagy, I., Peters, M. E., and Cohan, A. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020. 21
work page internal anchor Pith review Pith/arXiv arXiv 2004
-
[12]
PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling
Cai, Z., Zhang, Y., Gao, B., Liu, Y., Liu, T., Lu, K., Xiong, W., Dong, Y., Chang, B., Hu, J., et al. Pyramidkv: Dynamic kv cache compression based on pyramidal information funneling. arXiv preprint arXiv:2406.02069 , 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[13]
Chee, J., Cai, Y., Kuleshov, V., and De Sa, C. M. Quip: 2-bit quantization of large language models with guarantees. Advances in Neural Information Processing Systems , 36:4396–4429, 2023
work page 2023
-
[14]
Cover, T. M. Elements of information theory . John Wiley & Sons, 1999
work page 1999
-
[15]
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
Dai, D., Deng, C., Zhao, C., Xu, R., Gao, H., Chen, D., Li, J., Zeng, W., Yu, X., Wu, Y., et al. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models. arXiv preprint arXiv:2401.06066 , 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[16]
Dettmers, T., Lewis, M., Belkada, Y., and Zettlemoyer, L. Gpt3. int8 (): 8-bit matrix mul- tiplication for transformers at scale. Advances in Neural Information Processing Systems , 35: 30318–30332, 2022
work page 2022
-
[17]
Qaq: Quality adaptive quantization for llm kv cache
Dong, S., Cheng, W., Qin, J., and Wang, W. Qaq: Quality adaptive quantization for llm kv cache. arXiv preprint arXiv:2403.04643 , 2024
-
[18]
Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783 , 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[19]
From Local to Global: A Graph RAG Approach to Query-Focused Summarization
Edge, D., Trinh, H., Cheng, N., Bradley, J., Chao, A., Mody, A., Truitt, S., and Larson, J. From local to global: A graph rag approach to query-focused summarization. arXiv preprint arXiv:2404.16130, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[20]
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. Gptq: Accurate post-training quanti- zation for generative pre-trained transformers. arXiv preprint arXiv:2210.17323 , 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[21]
Data engineering for scaling language models to 128k context
Fu, Y., Panda, R., Niu, X., Yue, X., Hajishirzi, H., Kim, Y., and Peng, H. Data engineering for scaling language models to 128k context. arXiv preprint arXiv:2402.10171 , 2024. URL https://github.com/FranxYao/Long-Context-Data-Engineering
- [22]
-
[23]
Retrieval-Augmented Generation for Large Language Models: A Survey
Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., Wang, H., and Wang, H. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997, 2, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[24]
Optimized product quantization for approximate near- est neighbor search
Ge, T., He, K., Ke, Q., and Sun, J. Optimized product quantization for approximate near- est neighbor search. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2946–2953, 2013
work page 2013
-
[25]
Asymptotically optimal block quantization
Gersho, A. Asymptotically optimal block quantization. IEEE Transactions on information theory, 25(4):373–380, 1979. 22
work page 1979
-
[26]
On the structure of vector quantizers
Gersho, A. On the structure of vector quantizers. IEEE Transactions on Information Theory , 28(2):157–166, 1982
work page 1982
-
[27]
Accelerat- ing large-scale inference with anisotropic vector quantization
Guo, R., Sun, P., Lindgren, E., Geng, Q., Simcha, D., Chern, F., and Kumar, S. Accelerat- ing large-scale inference with anisotropic vector quantization. In International Conference on Machine Learning, pp. 3887–3896. PMLR, 2020
work page 2020
-
[28]
Polarquant: Quantizing kv caches with polar transformation
Han, I., Kacham, P., Karbasi, A., Mirrokni, V., and Zandieh, A. Polarquant: Quantizing kv caches with polar transformation. arXiv preprint arXiv:2502.02617 , 2025
-
[29]
Balancekv: Kv cache compression through discrepancy theory
Han, I., Kapralov, M., Kochetkova, E., Sheth, K., and Zandieh, A. Balancekv: Kv cache compression through discrepancy theory. arXiv preprint arXiv:2502.07861 , 2025
-
[30]
Hooper, C., Kim, S., Mohammadzadeh, H., Mahoney, M. W., Shao, Y. S., Keutzer, K., and Gholami, A. Kvquant: Towards 10 million context length llm inference with kv cache quanti- zation. arXiv preprint arXiv:2401.18079 , 2024
-
[31]
Product quantization for nearest neighbor search
Jegou, H., Douze, M., and Schmid, C. Product quantization for nearest neighbor search. IEEE transactions on pattern analysis and machine intelligence , 33(1):117–128, 2010
work page 2010
-
[32]
Needle in a haystack - pressure testing llms., 2023
Kamradt, G. Needle in a haystack - pressure testing llms., 2023. https://github.com/ gkamradt/LLMTest_NeedleInAHaystack
work page 2023
-
[33]
Kang, H., Zhang, Q., Kundu, S., Jeong, G., Liu, Z., Krishna, T., and Zhao, T. Gear: An efficient kv cache compression recipefor near-lossless generative inference of llm. arXiv preprint arXiv:2403.05527, 2024
-
[34]
Scaling Laws for Neural Language Models
Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2001
-
[35]
Khattab, O. and Zaharia, M. Colbert: Efficient and effective passage search via contextualized late interaction over bert. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval , pp. 39–48, 2020
work page 2020
-
[36]
Lexico: Extreme kv cache compression via sparse coding over universal dictionaries
Kim, J., Park, J., Cho, J., and Papailiopoulos, D. Lexico: Extreme kv cache compression via sparse coding over universal dictionaries. arXiv preprint arXiv:2412.08890 , 2024
-
[37]
Kim, S., Hooper, C., Gholami, A., Dong, Z., Li, X., Shen, S., Mahoney, M. W., and Keutzer, K. Squeezellm: Dense-and-sparse quantization. arXiv preprint arXiv:2306.07629 , 2023
-
[38]
SnapKV: LLM Knows What You are Looking for Before Generation
Li, Y., Huang, Y., Yang, B., Venkitesh, B., Locatelli, A., Ye, H., Cai, T., Lewis, P., and Chen, D. Snapkv: Llm knows what you are looking for before generation. arXiv preprint arXiv:2404.14469, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[39]
Awq: Activation-aware weight quantization for on-device llm compression and acceleration
Lin, J., Tang, J., Tang, H., Yang, S., Chen, W.-M., Wang, W.-C., Xiao, G., Dang, X., Gan, C., and Han, S. Awq: Activation-aware weight quantization for on-device llm compression and acceleration. Proceedings of Machine Learning and Systems , 6:87–100, 2024
work page 2024
-
[40]
Liu, Z., Desai, A., Liao, F., Wang, W., Xie, V., Xu, Z., Kyrillidis, A., and Shrivastava, A. Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time. Advances in Neural Information Processing Systems , 36, 2024. 23
work page 2024
-
[41]
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
Liu, Z., Yuan, J., Jin, H., Zhong, S., Xu, Z., Braverman, V., Chen, B., and Hu, X. Kivi: A tuning-free asymmetric 2bit quantization for kv cache. arXiv preprint arXiv:2402.02750, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[42]
Least squares quantization in pcm
Lloyd, S. Least squares quantization in pcm. IEEE transactions on information theory , 28(2): 129–137, 1982
work page 1982
-
[43]
Quantizing for minimum distortion
Max, J. Quantizing for minimum distortion. IRE Transactions on Information Theory , 6(1): 7–12, 1960
work page 1960
-
[44]
Panter, P. and Dite, W. Quantization distortion in pulse-count modulation with nonuniform spacing of levels. Proceedings of the IRE, 39(1):44–48, 1951
work page 1951
-
[45]
GloVe: Global vectors for word representation
Pennington, J., Socher, R., and Manning, C. GloVe: Global vectors for word representation. In Moschitti, A., Pang, B., and Daelemans, W. (eds.), Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) , pp. 1532–1543, Doha, Qatar, October 2014. Association for Computational Linguistics. doi: 10.3115/v1/D14-1162. URL ...
-
[46]
Colbertv2: Effective and efficient retrieval via lightweight late interaction
Santhanam, K., Khattab, O., Saad-Falcon, J., Potts, C., and Zaharia, M. Colbertv2: Effective and efficient retrieval via lightweight late interaction. arXiv preprint arXiv:2112.01488 , 2021
-
[47]
Flashattention- 3: Fast and accurate attention with asynchrony and low-precision
Shah, J., Bikshandi, G., Zhang, Y., Thakkar, V., Ramani, P., and Dao, T. Flashattention- 3: Fast and accurate attention with asynchrony and low-precision. arXiv preprint arXiv:2407.08608, 2024
-
[48]
Shannon, C. E. A mathematical theory of communication. The Bell system technical journal , 27(3):379–423, 1948
work page 1948
-
[49]
Shannon, C. E. et al. Coding theorems for a discrete source with a fidelity criterion. IRE Nat. Conv. Rec, 4(142-163):1, 1959
work page 1959
-
[50]
Fast Transformer Decoding: One Write-Head is All You Need
Shazeer, N. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1911
-
[51]
Su, Z., Chen, Z., Shen, W., Wei, H., Li, L., Yu, H., and Yuan, K. Rotatekv: Accurate and robust 2-bit kv cache quantization for llms via outlier-aware adaptive rotations, 2025. URL https://arxiv.org/abs/2501.16383
-
[52]
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Team, G., Georgiev, P., Lei, V. I., Burnell, R., Bai, L., Gulati, A., Tanzer, G., Vincent, D., Pan, Z., Wang, S., et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530 , 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[53]
BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models
Thakur, N., Reimers, N., R¨ uckl´ e, A., Srivastava, A., and Gurevych, I. BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2) , 2021. URL https://openreview.net/forum?id=wCu6T5xFjeJ
work page 2021
-
[54]
N., Kaiser, L., and Polosukhin, I
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. NeurIPS, 2017. 24
work page 2017
-
[55]
High-dimensional probability: An introduction with applications in data science , volume 47
Vershynin, R. High-dimensional probability: An introduction with applications in data science , volume 47. Cambridge university press, 2018
work page 2018
- [56]
-
[57]
Smoothquant: Accurate and efficient post-training quantization for large language models
Xiao, G., Lin, J., Seznec, M., Wu, H., Demouth, J., and Han, S. Smoothquant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, pp. 38087–38099. PMLR, 2023
work page 2023
-
[58]
Efficient Streaming Language Models with Attention Sinks
Xiao, G., Tian, Y., Chen, B., Han, S., and Lewis, M. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453 , 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[59]
Y., Kim, B., Bae, J., Kwon, B., Park, G., Yang, E., Kwon, S
Yang, J. Y., Kim, B., Bae, J., Kwon, B., Park, G., Yang, E., Kwon, S. J., and Lee, D. No token left behind: Reliable kv cache compression via importance-aware mixed precision quantization. arXiv preprint arXiv:2402.18096 , 2024
-
[60]
WKVQuant: Quantizing weight and key/value cache for large language models gains more
Yue, Y., Yuan, Z., Duanmu, H., Zhou, S., Wu, J., and Nie, L. Wkvquant: Quantizing weight and key/value cache for large language models gains more. arXiv preprint arXiv:2402.12065 , 2024
-
[61]
Zador, P. L. Development and evaluation of procedures for quantizing multivariate distributions. Stanford University, 1964
work page 1964
-
[63]
Qjl: 1-bit quantized jl transform for kv cache quantization with zero overhead
Zandieh, A., Daliri, M., and Han, I. Qjl: 1-bit quantized jl transform for kv cache quantization with zero overhead. arXiv preprint arXiv:2406.03482 , 2024
-
[64]
Subgen: Token generation in sublinear time and memory
Zandieh, A., Han, I., Mirrokni, V., and Karbasi, A. Subgen: Token generation in sublinear time and memory. arXiv preprint arXiv:2402.06082 , 2024
-
[65]
Kv cache is 1 bit per channel: Efficient large language model inference with coupled quantization
Zhang, T., Yi, J., Xu, Z., and Shrivastava, A. Kv cache is 1 bit per channel: Efficient large language model inference with coupled quantization. arXiv preprint arXiv:2405.03917 , 2024
-
[66]
H2o: Heavy-hitter oracle for efficient generative inference of large language models
Zhang, Z., Sheng, Y., Zhou, T., Chen, T., Zheng, L., Cai, R., Song, Z., Tian, Y., R´ e, C., Barrett, C., et al. H2o: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems , 36, 2024. 25
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.