pith. sign in

arxiv: 2606.00289 · v1 · pith:FPWZZSHYnew · submitted 2026-05-29 · 💻 cs.LG · cs.DS

Inner Product Aware Quantization: Provably Fast, Accurate, and Adaptive Algorithms

Pith reviewed 2026-06-28 22:47 UTC · model grok-4.3

classification 💻 cs.LG cs.DS
keywords quantizationinner productsadaptive stochastic quantizationvector compressionmachine learning algorithmsapproximate algorithms
0
0 comments X

The pith

Quantization methods targeting inner product preservation admit faster algorithms than standard approaches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces objectives for quantization that aim to approximately preserve inner products between vectors rather than just minimizing reconstruction error. It establishes a connection between these objectives and adaptive stochastic quantization, then provides provably efficient exact and approximate algorithms for solving them. These theoretical advances translate into practical algorithms that work well on various data distributions and deliver 2-10 times speedup for standard ASQ tasks without loss in quality. Readers interested in efficient compression for machine learning workloads would care because inner products are central to many similarity-based computations.

Core claim

Formulating inner product aware quantization objectives leads to a tight connection with adaptive stochastic quantization, enabling the development of provably fast exact and approximate algorithms that inspire practical methods 2-10 times faster than prior state-of-the-art while maintaining quality.

What carries the argument

Inner product aware quantization objectives that approximately preserve inner products with worst-case and average-case inputs, connected to adaptive stochastic quantization.

If this is right

  • Provably fast exact and approximate algorithms exist for the inner product preservation objectives.
  • Practical algorithms perform well across a variety of workload distributions.
  • Standard ASQ can be accelerated by 2-10× using the inspired practical algorithms.
  • Adaptive quantization techniques become more efficient and tractable in practical settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • These methods might be combined with other compression techniques like pruning for further gains in neural network deployment.
  • Extending the worst-case and average-case analyses to specific application domains could yield tailored variants.
  • The unbiased property of the quantization methods may allow for better integration into stochastic gradient descent pipelines.

Load-bearing premise

The formulated objectives capture natural desiderata for approximately preserving inner products with worst-case and average-case inputs.

What would settle it

A benchmark where the proposed practical algorithms do not achieve 2-10 times speedup over prior methods on standard inner product workloads while keeping comparable accuracy would falsify the main practical claim.

Figures

Figures reproduced from arXiv: 2606.00289 by Krish Singal, Nathan White.

Figure 1
Figure 1. Figure 1: Figure comparing the runtime of the previous state-of-the-art algorithm for traditional [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: (a) Average variance VarSSQ[wi ] of the worst k coordinates on vectors w drawn from LogNormal(0,1) over 100 trials. (b) Average absolute inner product error on vectors w drawn from a mixture of 16 Gaussians with variance 10 and means separated by 105 . In both plots, the shaded regions show the 10–90% range of average variances We now briefly discuss the merits of unbiased and adaptive quantization methods… view at source ↗
Figure 3
Figure 3. Figure 3: Runtime of exact algorithms for ASQ on vectors [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Runtime comparison of approximate QUIVER and our implementation using faster exact sub￾routines on vectors w drawn from LogNormal(0, 1), with m = 100s uniformly spaced discretization points. The shaded regions show the 10–90% range of runtimes across draws of w. 9Due to the special structure of the calls approximate QUIVER makes to the exact algorithm, we use a slight variant of our fast exact algorithm; s… view at source ↗
Figure 5
Figure 5. Figure 5: Runtime of algorithms for MDV compared against Improved Approx QUIVER as a baseline (although it does not provide a good approximation to MDV), on vectors drawn from LogNormal(0, 1). The shaded regions show the 10–90% range of runtimes across draws of the input vector. In [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Illustration of distribution D′ , showing the reduction of support of D to just two elements See [PITH_FULL_IMAGE:figures/full_fig_p014_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Illustration of (a) case 1 and (b) case 2 [PITH_FULL_IMAGE:figures/full_fig_p026_7.png] view at source ↗
read the original abstract

Quantization is a fundamental tool used to compress datasets, neural network weights, and memory usage in a range of computational tasks. Many downstream applications of vector quantization perform inner products with arbitrary inputs. This motivates the study of inner product aware quantization schemes that approximately preserve inner products with unseen vectors -- in contrast to simply minimizing the mean-squared error. In this work, we formulate objectives that capture natural desiderata and develop adaptive and unbiased quantization methods that approximately preserve inner products with worst-case and average-case inputs. An analysis of these objectives shows a tight connection with the well-studied notion of Adaptive Stochastic Quantization (ASQ). We develop provably fast exact and approximate algorithms for our objectives. Our theoretical results inspire efficient practical algorithms that perform well across a variety of workload distributions. They also lead to practical algorithms for standard ASQ which are 2-10$\times$ faster than prior state-of-the-art methods while maintaining quality. These theoretical and empirical results contribute towards making adaptive quantization techniques more efficient and tractable in practical settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript formulates inner-product-aware quantization objectives that approximately preserve inner products with worst-case and average-case inputs (as opposed to MSE minimization), establishes a tight connection between these objectives and Adaptive Stochastic Quantization (ASQ), develops provably fast exact and approximate algorithms for the objectives, and derives practical algorithms that achieve 2-10× speedups over prior state-of-the-art methods for standard ASQ while preserving quality across varied workload distributions.

Significance. If the claimed theoretical guarantees and empirical speedups are substantiated, the work would meaningfully advance practical quantization for inner-product-heavy tasks in machine learning. The explicit connection to ASQ together with the provision of both exact/approximate algorithms and faster practical implementations for the standard ASQ setting constitute a clear strength; the emphasis on adaptive, unbiased methods with worst-case and average-case analysis is a useful framing.

minor comments (3)
  1. [Abstract] The abstract states that the analysis 'shows a tight connection' to ASQ; a brief sentence clarifying whether this is an equivalence, a reduction, or an approximation under specific conditions would improve readability.
  2. [Abstract] The claim of 'provably fast exact and approximate algorithms' would benefit from a one-sentence statement of the key complexity results (e.g., time or space bounds) already in the abstract or introduction.
  3. Figure and table captions should explicitly state the workload distributions and baseline implementations used to obtain the reported 2-10× speedups so that the empirical claims are immediately verifiable.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the thorough summary and positive evaluation of the manuscript, including the recognition of the theoretical connections to ASQ and the practical speedups. The recommendation for minor revision is appreciated. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The provided abstract and description formulate inner-product-preserving objectives, note a connection to ASQ, and claim provably fast algorithms with empirical speedups. No equations, derivations, or self-referential steps are visible in the text. No load-bearing claim reduces by construction to a fit, self-definition, or self-citation chain. The argument structure (formulation to analysis to algorithms) appears self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, axioms, or invented entities used in the derivations or objectives.

pith-pipeline@v0.9.1-grok · 5708 in / 835 out tokens · 19314 ms · 2026-06-28T22:47:43.858775+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

32 extracted references · 4 canonical work pages · 3 internal anchors

  1. [1]

    Qsgd: Communication-efficient sgd via gradient quantization and encoding

    Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. Qsgd: Communication-efficient sgd via gradient quantization and encoding. Advances in neural information processing systems , 30, 2017

  2. [2]

    Geometric applications of a matrix searching algorithm

    Alok Aggarwal, Maria Klawe, Shlomo Moran, Peter Shor, and Robert Wilber. Geometric applications of a matrix searching algorithm. In Proceedings of the second annual symposium on Computational geometry , pages 285--292, 1986

  3. [3]

    Finding a minimum weight k-link path in graphs with monge property and applications

    Alok Aggarwal, Baruch Schieber, and Takashi Tokuyama. Finding a minimum weight k-link path in graphs with monge property and applications. In Proceedings of the ninth annual symposium on Computational geometry , pages 189--197, 1993

  4. [4]

    Optimal and approximate adaptive stochastic quantization

    Ran Ben-Basat, Yaniv Ben-Itzhak, Michael Mitzenmacher, and Shay Vargaftik. Optimal and approximate adaptive stochastic quantization. Advances in Neural Information Processing Systems , 37:94265--94291, 2024

  5. [5]

    Better than optimal: Improving adaptive stochastic quantization using shared randomness

    Ran Ben Basat, Yaniv Ben-Itzhak, Michael Mitzenmacher, and Shay Vargaftik. Better than optimal: Improving adaptive stochastic quantization using shared randomness. Proceedings of the ACM on Measurement and Analysis of Computing Systems , 9(3):1--44, 2025

  6. [6]

    Post training 4-bit quantization of convolutional networks for rapid-deployment

    Ron Banner, Yury Nahshan, and Daniel Soudry. Post training 4-bit quantization of convolutional networks for rapid-deployment. Advances in neural information processing systems , 32, 2019

  7. [7]

    Riley Carlson, John Bauer, and Christopher D. Manning. A new pair of gloves, 2025

  8. [8]

    Neural gradients are near-lognormal: improved quantized and sparse training

    Brian Chmiel, Liad Ben-Uri, Moran Shkolnik, Elad Hoffer, Ron Banner, and Daniel Soudry. Neural gradients are near-lognormal: improved quantized and sparse training. arXiv preprint arXiv:2006.08173 , 2020

  9. [9]

    Optq: Accurate quantization for generative pre-trained transformers

    E Frantar, S Ashkboos, T Hoefler, and D Alistarh. Optq: Accurate quantization for generative pre-trained transformers. 2023. In URL https://openreview. net/forum , 2023

  10. [10]

    Optimal algorithms for approximate clustering

    Tom\' a s Feder and Daniel Greene. Optimal algorithms for approximate clustering. In Proceedings of the Twentieth Annual ACM Symposium on Theory of Computing , STOC '88, page 434–444, New York, NY, USA, 1988. Association for Computing Machinery

  11. [11]

    Don’t waste your bits! squeeze activations and gradients for deep neural networks via tinyscript

    Fangcheng Fu, Yuzheng Hu, Yihan He, Jiawei Jiang, Yingxia Shao, Ce Zhang, and Bin Cui. Don’t waste your bits! squeeze activations and gradients for deep neural networks via tinyscript. In International Conference on Machine Learning , pages 3304--3314. PMLR, 2020

  12. [12]

    Generalized selection and ranking: sorted matrices

    Greg N Frederickson and Donald B Johnson. Generalized selection and ranking: sorted matrices. SIAM Journal on computing , 13(1):14--30, 1984

  13. [13]

    Adaptive gradient quantization for data-parallel sgd

    Fartash Faghri, Iman Tabrizian, Ilia Markov, Dan Alistarh, Daniel M Roy, and Ali Ramezani-Kebrya. Adaptive gradient quantization for data-parallel sgd. Advances in neural information processing systems , 33:3174--3185, 2020

  14. [14]

    Practical and asymptotically optimal quantization of high-dimensional vectors in euclidean space for approximate nearest neighbor search

    Jianyang Gao, Yutong Gou, Yuexuan Xu, Yongyi Yang, Cheng Long, and Raymond Chi-Wing Wong. Practical and asymptotically optimal quantization of high-dimensional vectors in euclidean space for approximate nearest neighbor search. Proceedings of the ACM on Management of Data , 3(3):1--26, 2025

  15. [15]

    Optimized product quantization

    Tiezheng Ge, Kaiming He, Qifa Ke, and Jian Sun. Optimized product quantization. IEEE transactions on pattern analysis and machine intelligence , 36(4):744--755, 2013

  16. [16]

    Some simplified np-complete problems

    Michael R Garey, David S Johnson, and Larry Stockmeyer. Some simplified np-complete problems. In Proceedings of the sixth annual ACM symposium on Theory of computing , pages 47--63, 1974

  17. [17]

    Fast Exact k-Means, k-Medians and Bregman Divergence Clustering in 1D

    Allan Gr nlund, Kasper Green Larsen, Alexander Mathiasen, Jesper Sindahl Nielsen, Stefan Schneider, and Mingzhou Song. Fast exact k-means, k-medians and bregman divergence clustering in 1d. arXiv preprint arXiv:1701.07204 , 2017

  18. [18]

    Clustering to minimize the maximum intercluster distance

    Teofilo F Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical computer science , 38:293--306, 1985

  19. [19]

    Horn and Charles R

    Roger A. Horn and Charles R. Johnson. Matrix Analysis . Cambridge University Press, 2nd edition, 2012

  20. [20]

    A frustratingly easy post-training quantization scheme for llms

    Yongkweon Jeon, Chungman Lee, Kyungphil Park, and Ho-young Kim. A frustratingly easy post-training quantization scheme for llms. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages 14446--14461, 2023

  21. [21]

    Selection in x+ y and matrices with sorted rows and columns

    Andranik Mirzaian and Eshrat Arjomandi. Selection in x+ y and matrices with sorted rows and columns. Information processing letters , 20(1):13--17, 1985

  22. [22]

    Mixed Precision Training

    Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. Mixed precision training. arXiv preprint arXiv:1710.03740 , 2017

  23. [23]

    A survey of product quantization

    Yusuke Matsui, Yusuke Uchida, Herve Jegou, and Shin'ichi Satoh. A survey of product quantization. ITE Transactions on Media Technology and Applications , 6(1):2--10, 2018

  24. [24]

    Up or down? adaptive rounding for post-training quantization

    Markus Nagel, Rana Ali Amjad, Mart Van Baalen, Christos Louizos, and Tijmen Blankevoort. Up or down? adaptive rounding for post-training quantization. In International conference on machine learning , pages 7197--7206. PMLR, 2020

  25. [25]

    Nuqsgd: Provably communication-efficient data-parallel sgd via nonuniform quantization

    Ali Ramezani-Kebrya, Fartash Faghri, Ilya Markov, Vitalii Aksenov, Dan Alistarh, and Daniel M Roy. Nuqsgd: Provably communication-efficient data-parallel sgd via nonuniform quantization. Journal of Machine Learning Research , 22(114):1--43, 2021

  26. [26]

    Computing a minimum weightk-link path in graphs with the concave monge property

    Baruch Schieber. Computing a minimum weightk-link path in graphs with the concave monge property. Journal of Algorithms , 29(2):204--222, 1998

  27. [27]

    Flexgen: High-throughput generative inference of large language models with a single gpu

    Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher R \'e , Ion Stoica, and Ce Zhang. Flexgen: High-throughput generative inference of large language models with a single gpu. In International Conference on Machine Learning , pages 31094--31116. PMLR, 2023

  28. [28]

    Bayesian neural networks become heavier-tailed with depth

    Mariia Vladimirova, Julyan Arbel, and Pablo Mesejo. Bayesian neural networks become heavier-tailed with depth. In NeurIPS 2018-Thirty-second Conference on Neural Information Processing Systems , pages 1--7, 2018

  29. [29]

    The concave least-weight subsequence problem revisited

    Robert Wilber. The concave least-weight subsequence problem revisited. Journal of Algorithms , 9(3):418--425, 1988

  30. [30]

    Optimal quantization by matrix searching

    Xiaolin Wu. Optimal quantization by matrix searching. Journal of algorithms , 12(4):663--673, 1991

  31. [31]

    TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate

    Amir Zandieh, Majid Daliri, Majid Hadian, and Vahab Mirrokni. Turboquant: Online vector quantization with near-optimal distortion rate. arXiv preprint arXiv:2504.19874 , 2025

  32. [32]

    Zipml: Training linear models with end-to-end low precision, and a little bit of deep learning

    Hantian Zhang, Jerry Li, Kaan Kara, Dan Alistarh, Ji Liu, and Ce Zhang. Zipml: Training linear models with end-to-end low precision, and a little bit of deep learning. In International Conference on Machine Learning , pages 4035--4043. PMLR, 2017