StatQAT: Statistical Quantizer Optimization for Deep Networks

Daniel Huang; Ke Ding; Mehmet Aktukmak

arxiv: 2605.17745 · v1 · pith:BITBCD4Mnew · submitted 2026-05-18 · 📊 stat.ML · cs.LG

StatQAT: Statistical Quantizer Optimization for Deep Networks

Mehmet Aktukmak , Daniel Huang , Ke Ding This is my paper

Pith reviewed 2026-05-20 01:16 UTC · model grok-4.3

classification 📊 stat.ML cs.LG

keywords quantizationdeep neural networksquantization-aware trainingstatistical error analysislow-precision inferenceuniform quantizationfloating-point quantization

0 comments

The pith

A statistical error analysis framework enables better quantizer design for low-precision neural networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces a statistical framework to study how quantization errors arise in both uniform and floating-point schemes. It uses the analysis to build iterative quantizers that adapt to any data distribution and analytic quantizers suited to Gaussian-like weight distributions. These quantizers are plugged into quantization-aware training, where experiments show gains in accuracy and training stability for integer and floating-point low-precision networks. The work aims to reduce memory and compute costs on hardware while preserving model performance during both training and inference.

Core claim

The authors develop a statistical error analysis framework for uniform and floating-point quantization that supplies theoretical insight into error behavior across configurations and data distributions. From this they derive iterative quantizers for arbitrary distributions and analytic quantizers for Gaussian-like weights, allowing low-error quantization of both activations and weights when used inside quantization-aware training.

What carries the argument

The statistical error analysis framework that models quantization error behavior for arbitrary data distributions and Gaussian-like weight distributions.

If this is right

Quantization parameters can be chosen more effectively for the varied distributions seen in training and inference.
Iterative quantizers deliver efficient low-error quantization for activations in both integer and floating-point formats.
Analytic quantizers improve weight quantization when distributions are approximately Gaussian.
Integration into quantization-aware training produces higher accuracy and greater stability than prior approaches.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The framework could be extended to non-uniform or learned quantization schemes not covered in the current analysis.
Hardware designers might use the error predictions to select bit widths and formats that balance accuracy against power and memory.
Testing the analytic quantizers on modern architectures with highly non-Gaussian weights would reveal how far the Gaussian assumption can be stretched.

Load-bearing premise

The statistical error analysis framework correctly predicts how quantization errors behave for arbitrary data distributions and Gaussian-like weight distributions.

What would settle it

Measuring actual quantization errors in a trained network and finding that they deviate substantially from the errors predicted by the statistical framework for the same configurations would falsify the central insight.

Figures

Figures reproduced from arXiv: 2605.17745 by Daniel Huang, Ke Ding, Mehmet Aktukmak.

**Figure 2.** Figure 2: a shows training curves (log scale) over 40 epochs for uniform quantizers. [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗

**Figure 2.** Figure 2: Loss functions with 4-bit uniform (left) and float (right) quantization across [PITH_FULL_IMAGE:figures/full_fig_p010_2.png] view at source ↗

**Figure 3.** Figure 3: Inferred quantization levels of 3-bit non-uniform quantizers for a normally [PITH_FULL_IMAGE:figures/full_fig_p021_3.png] view at source ↗

**Figure 4.** Figure 4: b-d. Hence, the stepping error power is computed by Eq. 13 since now the data [PITH_FULL_IMAGE:figures/full_fig_p022_4.png] view at source ↗

**Figure 5.** Figure 5: SNR per layer of the Llama-3-1B model under MinMax and Analytic quanti [PITH_FULL_IMAGE:figures/full_fig_p023_5.png] view at source ↗

read the original abstract

Quantization is essential for reducing the computational cost and memory usage of deep neural networks, enabling efficient inference on low-precision hardware. Despite the growing adoption of uniform and floating-point quantization schemes, selecting optimal quantization parameters remains a key challenge, particularly for diverse data distributions encountered during training and inference. This work presents a novel statistical error analysis framework for uniform and floating-point quantization, providing theoretical insight into error behavior across quantization configurations. Building on this analysis, we propose iterative quantizers designed for arbitrary data distributions and analytic quantizers tailored for Gaussian-like weight distributions. These methods enable efficient, low-error quantization suitable for both activations and weights. We incorporate our quantizers into quantization-aware training and evaluate them across integer and floating-point formats. Experiments demonstrate improved accuracy and stability, highlighting the effectiveness of our approach for training low-precision neural networks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

StatQAT adds a statistical error analysis and tailored quantizers to QAT with reported accuracy gains, but the generality to arbitrary distributions likely rests on unstated moment or tail conditions.

read the letter

Colleague, this paper introduces a statistical error analysis for quantization in deep networks and uses it to create better quantizers for training low-precision models. What is new is the framework that gives insight into error behavior for uniform and floating-point schemes. They propose iterative quantizers that handle arbitrary distributions and analytic ones for Gaussian weights. These get integrated into quantization-aware training, and the experiments report better accuracy and stability. The paper does well in addressing both activations and weights with different strategies. The experimental evaluation across formats shows the approach can improve results in practice. A soft spot is the reach of the theoretical part. The analysis claims to work for arbitrary data distributions, but it may depend on unverified assumptions about moments or tail behavior. Without seeing the full derivations, it's unclear if the bounds or expressions hold for the heavy-tailed activations that show up in real networks. That could limit how much new insight it actually provides beyond existing statistical modeling in quantization. The citation pattern looks standard, and the math seems to build on established ideas without obvious circularity. This is for people working on efficient deep learning deployment. Readers interested in quantization-aware training and hardware-friendly models would get value from the methods and results. It deserves a serious referee because the core claims are specific enough to review and the experiments provide something to check against. Recommendation: Send it for peer review with attention to the distribution assumptions in the analysis.

Referee Report

2 major / 2 minor

Summary. The manuscript presents StatQAT, a framework that develops a statistical error analysis for uniform and floating-point quantization schemes in deep networks. It derives iterative quantizers suited to arbitrary data distributions and analytic quantizers for Gaussian-like weight distributions, incorporates these into quantization-aware training, and reports experimental gains in accuracy and stability across integer and floating-point formats.

Significance. If the derivations hold, the work supplies concrete error expressions that could guide quantizer selection beyond heuristic search, a useful contribution to efficient inference. The combination of distribution-specific analytic forms with QAT integration is a strength; experiments showing stability improvements add practical value. The absence of free parameters in the core analysis (as indicated by the ledger) would further strengthen the result if confirmed in the derivations.

major comments (2)

[§3] §3 (Statistical Error Analysis): The claimed generality to arbitrary activation distributions rests on implicit regularity conditions (finite higher moments or sub-exponential tails) that are not stated or verified; without these, the iterative quantizer construction and error bounds may not extend to the heavy-tailed or multimodal activations common in practice. This is load-bearing for both the theoretical insight and the subsequent QAT improvements.
[§5.2] §5.2 (Experiments): The reported accuracy and stability gains lack explicit details on run count, variance across seeds, or baseline hyperparameter matching; this weakens the cross-format claim that the proposed quantizers outperform standard uniform/floating-point schemes.

minor comments (2)

[§2] Notation for the error expectation operator and the distinction between empirical versus population distributions is introduced without a dedicated table or appendix, making cross-references between equations cumbersome.
[Figure 3] Figure 3 caption does not specify the exact network architectures or dataset splits used for the floating-point results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below and outline the revisions we will make to strengthen the presentation.

read point-by-point responses

Referee: [§3] §3 (Statistical Error Analysis): The claimed generality to arbitrary activation distributions rests on implicit regularity conditions (finite higher moments or sub-exponential tails) that are not stated or verified; without these, the iterative quantizer construction and error bounds may not extend to the heavy-tailed or multimodal activations common in practice. This is load-bearing for both the theoretical insight and the subsequent QAT improvements.

Authors: We agree that the error bounds and iterative construction implicitly rely on regularity conditions such as finite higher moments and sub-exponential tails to ensure convergence and bounded error. In the revised manuscript we will add an explicit statement of these assumptions at the beginning of Section 3, together with a short discussion of their implications for heavy-tailed or multimodal activations. We will also note that the iterative procedure itself remains well-defined for any distribution with finite first and second moments, while the analytic error expressions require the stronger tail conditions. revision: yes
Referee: [§5.2] §5.2 (Experiments): The reported accuracy and stability gains lack explicit details on run count, variance across seeds, or baseline hyperparameter matching; this weakens the cross-format claim that the proposed quantizers outperform standard uniform/floating-point schemes.

Authors: We acknowledge that the current experimental section does not report the number of independent runs or variance across random seeds, nor does it detail the hyperparameter search budget used for the baselines. In the revision we will add these details: we will state that all results are averaged over five independent seeds with standard deviations reported, and we will confirm that baseline uniform and floating-point quantizers were tuned using the same hyperparameter search procedure and computational budget as the proposed methods. This will make the cross-format comparisons more rigorous. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation chain is self-contained with independent statistical analysis

full rationale

The paper introduces a statistical error analysis framework for quantization error behavior and then builds iterative and analytic quantizers on top of it for arbitrary and Gaussian-like distributions. No equations or sections are quoted in the provided material that reduce a prediction or optimal parameter directly to a fitted input by construction, nor is there evidence of load-bearing self-citations or ansatz smuggling. The central theoretical insight is presented as derived from expectations over data distributions rather than presupposing the final quantizer performance, making the chain non-circular under the stated criteria.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. Full manuscript would be required to populate the ledger.

pith-pipeline@v0.9.0 · 5665 in / 1110 out tokens · 65696 ms · 2026-05-20T01:16:55.008062+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

mean-squared quantization error function is given by: E[e²] = E[(x−Q(x|l,t))²] = Σ ∫ (x−l_k)² p(x) dx
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean costAlphaLog_high_calibrated_iff unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

analytic uniform quantizer … assuming the data is normally distributed … clipping error Ec = 2(σ² + C²)Q(C/σ) …

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 5 internal anchors

[1]

A survey of quantization methods for efficient neural network infer- ence

Amir Gholami, Sehoon Kim, Zhen Dong, Zhewei Yao, Michael W Mahoney, and Kurt Keutzer. A survey of quantization methods for efficient neural network infer- ence. InLow-power computer vision, pages 291–326. Chapman and Hall/CRC, 2022

work page 2022
[2]

Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding

Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In International Conference on Learning Representations (ICLR), 2016

work page 2016
[3]

8-bit inference with tensorrt

Scott Migacz. 8-bit inference with tensorrt. https://developer.nvidia. com/blog/int8-inference-autonomous-vehicles-tensorrt/ ,

work page
[4]

GPU Technology Conference

work page
[5]

Quantization and training of neural networks for efficient integer-arithmetic-only inference

Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, An- drew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. 13

work page 2018
[6]

Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. Gpt3. int8 (): 8-bit matrix multiplication for transformers at scale.Advances in neural information processing systems, 35:30318–30332, 2022

work page 2022
[7]

Learning non-uniform step sizes for neural network quantization

Yoshitaka Gongyo et al. Learning non-uniform step sizes for neural network quantization. InAsian Conference on Computer Vision (ACCV), 2024

work page 2024
[8]

Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients, 2016

Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients, 2016

work page 2016
[9]

Pact: Parameterized clipping activation for quantized neural networks, 2018

Yoonho Choi, Mostafa El-Khamy, and Jungwon Lee. Pact: Parameterized clipping activation for quantized neural networks, 2018

work page 2018
[10]

Fp8 quantization: The power of the exponent.Advances in Neural Information Processing Systems, 35:14651–14662, 2022

Andrey Kuzmin, Mart Van Baalen, Yuwei Ren, Markus Nagel, Jorn Peters, and Tijmen Blankevoort. Fp8 quantization: The power of the exponent.Advances in Neural Information Processing Systems, 35:14651–14662, 2022

work page 2022
[11]

Llm-fp4: 4-bit floating-point quantized transformers,

Shih-yang Liu, Zechun Liu, Xijie Huang, Pingcheng Dong, and Kwang-Ting Cheng. Llm-fp4: 4-bit floating-point quantized transformers.arXiv preprint arXiv:2310.16836, 2023

work page arXiv 2023
[12]

Qlora: Efficient finetuning of quantized llms.Advances in neural information processing systems, 36:10088–10115, 2023

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms.Advances in neural information processing systems, 36:10088–10115, 2023

work page 2023
[13]

Ieee standard for floating-point arithmetic.IEEE Std 754-2019 (Revision of IEEE 754-2008), 2019

IEEE Computer Society. Ieee standard for floating-point arithmetic.IEEE Std 754-2019 (Revision of IEEE 754-2008), 2019

work page 2019
[14]

An- alytical and numerical studies of quantization effects in coherent optical ofdm transmission with 100 gbit/s and beyond.ITG-Fachtagung Photonische Netze, pages 34–40, 2012

Michael Bernhard, David Rörich, Thomas Handte, and Joachim Speidel. An- alytical and numerical studies of quantization effects in coherent optical ofdm transmission with 100 gbit/s and beyond.ITG-Fachtagung Photonische Netze, pages 34–40, 2012

work page 2012
[15]

Analytic quantization mod- eling of ofdm signals using normal gaussian distribution

Henning Ehm, Sebastian Winter, and Robert Weigel. Analytic quantization mod- eling of ofdm signals using normal gaussian distribution. In2006 Asia-Pacific Microwave Conference, pages 847–850. IEEE, 2006

work page 2006
[16]

Weight uncertainty in neural networks

Charles Blundell et al. Weight uncertainty in neural networks. InInternational Conference on Machine Learning (ICML), 2015

work page 2015
[17]

Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

Yoshua Bengio et al. Estimating or propagating gradients through stochastic neurons for conditional computation. InarXiv preprint arXiv:1308.3432, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013
[18]

ParetoQ: Scaling laws in extremely low-bit LLM quantization.arXiv preprint arXiv:2502.02631, 2025

Zechun Liu, Changsheng Zhao, Hanxian Huang, Sijia Chen, Jing Zhang, Jiawei Zhao, Scott Roy, Lisa Jin, Yunyang Xiong, Yangyang Shi, et al. Paretoq: Scaling laws in extremely low-bit llm quantization.arXiv preprint arXiv:2502.02631, 2025. 14

work page arXiv 2025
[19]

Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding

Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. International Conference on Learning Representations (ICLR), 2016

work page 2016
[20]

Learned step size quantization

Steven K Esser, Jeffrey L McKinstry, Arun Bablani, Rathinakumar Appuswamy, and Dharmendra S Modha. Learned step size quantization. InInternational Conference on Learning Representations (ICLR), 2020

work page 2020
[21]

Differentiable quantization of deep neural networks

Stefan Uhlich et al. Differentiable quantization of deep neural networks. In International Conference on Learning Representations (ICLR), 2020

work page 2020
[22]

Relaxed quantization for discretized neural networks

Christos Louizos, Charles Blundell, and Max Welling. Relaxed quantization for discretized neural networks. InInternational Conference on Learning Representa- tions (ICLR), 2019

work page 2019
[23]

Learning to quantize deep networks by optimizing quantization intervals with task loss

Sangil Jung, Byeongho Choi, Junha Kim, and Nojun Kwak. Learning to quantize deep networks by optimizing quantization intervals with task loss. InCVPR, pages 4350–4359, 2019

work page 2019
[24]

Optimal clipping and magnitude-aware differentiation for improved quantization-aware training

Charbel Sakr, Steve Dai, Rangha Venkatesan, Brian Zimmer, William Dally, and Brucek Khailany. Optimal clipping and magnitude-aware differentiation for improved quantization-aware training. InInternational conference on machine learning, pages 19123–19138. PMLR, 2022

work page 2022
[25]

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Elias Frantar, Pierre Stock, and Dan Alistarh. Gptq: Accurate post-training quanti- zation for generative pre-trained transformers.arXiv preprint arXiv:2210.17323, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[26]

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Ji Lin, Zhenhua Tang, Yujun Liu, and Song Han. Awq: Activation-aware weight quantization for llm compression and acceleration.arXiv preprint arXiv:2306.00978, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[27]

Spqr: A sparse-quantized representation for near-lossless llm weight compression.arXiv preprint arXiv:2306.03078,

Yuxuan Xiao, Yuhui Ren, Yifan Sun, Haotong Wang, Zheng Wang, et al. Smoothquant: Accurate and efficient post-training quantization for large language models.arXiv preprint arXiv:2306.03078, 2023

work page arXiv 2023
[28]

FP8 Formats for Deep Learning

Paulius Micikevicius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisenthwaite, Sangwon Ha, Alexander Heinecke, Patrick Judd, John Kamalu, et al. Fp8 formats for deep learning.arXiv preprint arXiv:2209.05433, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[29]

Ultra-low precision 4-bit training of deep neural networks.Advances in Neural Information Processing Systems, 33:1796– 1807, 2020

Xiao Sun, Naigang Wang, Chia-Yu Chen, Jiamin Ni, Ankur Agrawal, Xiaodong Cui, Swagath Venkataramani, Kaoutar El Maghraoui, Vijayalakshmi Viji Srini- vasan, and Kailash Gopalakrishnan. Ultra-low precision 4-bit training of deep neural networks.Advances in Neural Information Processing Systems, 33:1796– 1807, 2020

work page 2020
[30]

Bridging the Accuracy Gap for 2-bit Quantized Neural Networks (QNN)

Jungwook Choi, Pierce I-Jen Chuang, Zhuo Wang, Swagath Venkataramani, Vijay- alakshmi Srinivasan, and Kailash Gopalakrishnan. Bridging the accuracy gap for 2-bit quantized neural networks (qnn).arXiv preprint arXiv:1807.06964, 2018. 15

work page internal anchor Pith review Pith/arXiv arXiv 2018
[31]

Relaxed quantization for discretized neural networks

Christos Louizos, Charles Blundell, and Max Welling. Relaxed quantization for discretized neural networks. InInternational Conference on Learning Representa- tions (ICLR), 2019. A Practical Implementation of Uniform Quantization In real-world applications, instead of storing high-precision lk values corresponding to each data point, only the indexes are s...

work page 2019

[1] [1]

A survey of quantization methods for efficient neural network infer- ence

Amir Gholami, Sehoon Kim, Zhen Dong, Zhewei Yao, Michael W Mahoney, and Kurt Keutzer. A survey of quantization methods for efficient neural network infer- ence. InLow-power computer vision, pages 291–326. Chapman and Hall/CRC, 2022

work page 2022

[2] [2]

Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding

Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In International Conference on Learning Representations (ICLR), 2016

work page 2016

[3] [3]

8-bit inference with tensorrt

Scott Migacz. 8-bit inference with tensorrt. https://developer.nvidia. com/blog/int8-inference-autonomous-vehicles-tensorrt/ ,

work page

[4] [4]

GPU Technology Conference

work page

[5] [5]

Quantization and training of neural networks for efficient integer-arithmetic-only inference

Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, An- drew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. 13

work page 2018

[6] [6]

Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. Gpt3. int8 (): 8-bit matrix multiplication for transformers at scale.Advances in neural information processing systems, 35:30318–30332, 2022

work page 2022

[7] [7]

Learning non-uniform step sizes for neural network quantization

Yoshitaka Gongyo et al. Learning non-uniform step sizes for neural network quantization. InAsian Conference on Computer Vision (ACCV), 2024

work page 2024

[8] [8]

Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients, 2016

Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients, 2016

work page 2016

[9] [9]

Pact: Parameterized clipping activation for quantized neural networks, 2018

Yoonho Choi, Mostafa El-Khamy, and Jungwon Lee. Pact: Parameterized clipping activation for quantized neural networks, 2018

work page 2018

[10] [10]

Fp8 quantization: The power of the exponent.Advances in Neural Information Processing Systems, 35:14651–14662, 2022

Andrey Kuzmin, Mart Van Baalen, Yuwei Ren, Markus Nagel, Jorn Peters, and Tijmen Blankevoort. Fp8 quantization: The power of the exponent.Advances in Neural Information Processing Systems, 35:14651–14662, 2022

work page 2022

[11] [11]

Llm-fp4: 4-bit floating-point quantized transformers,

Shih-yang Liu, Zechun Liu, Xijie Huang, Pingcheng Dong, and Kwang-Ting Cheng. Llm-fp4: 4-bit floating-point quantized transformers.arXiv preprint arXiv:2310.16836, 2023

work page arXiv 2023

[12] [12]

Qlora: Efficient finetuning of quantized llms.Advances in neural information processing systems, 36:10088–10115, 2023

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms.Advances in neural information processing systems, 36:10088–10115, 2023

work page 2023

[13] [13]

Ieee standard for floating-point arithmetic.IEEE Std 754-2019 (Revision of IEEE 754-2008), 2019

IEEE Computer Society. Ieee standard for floating-point arithmetic.IEEE Std 754-2019 (Revision of IEEE 754-2008), 2019

work page 2019

[14] [14]

An- alytical and numerical studies of quantization effects in coherent optical ofdm transmission with 100 gbit/s and beyond.ITG-Fachtagung Photonische Netze, pages 34–40, 2012

Michael Bernhard, David Rörich, Thomas Handte, and Joachim Speidel. An- alytical and numerical studies of quantization effects in coherent optical ofdm transmission with 100 gbit/s and beyond.ITG-Fachtagung Photonische Netze, pages 34–40, 2012

work page 2012

[15] [15]

Analytic quantization mod- eling of ofdm signals using normal gaussian distribution

Henning Ehm, Sebastian Winter, and Robert Weigel. Analytic quantization mod- eling of ofdm signals using normal gaussian distribution. In2006 Asia-Pacific Microwave Conference, pages 847–850. IEEE, 2006

work page 2006

[16] [16]

Weight uncertainty in neural networks

Charles Blundell et al. Weight uncertainty in neural networks. InInternational Conference on Machine Learning (ICML), 2015

work page 2015

[17] [17]

Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

Yoshua Bengio et al. Estimating or propagating gradients through stochastic neurons for conditional computation. InarXiv preprint arXiv:1308.3432, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013

[18] [18]

ParetoQ: Scaling laws in extremely low-bit LLM quantization.arXiv preprint arXiv:2502.02631, 2025

Zechun Liu, Changsheng Zhao, Hanxian Huang, Sijia Chen, Jing Zhang, Jiawei Zhao, Scott Roy, Lisa Jin, Yunyang Xiong, Yangyang Shi, et al. Paretoq: Scaling laws in extremely low-bit llm quantization.arXiv preprint arXiv:2502.02631, 2025. 14

work page arXiv 2025

[19] [19]

Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding

Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. International Conference on Learning Representations (ICLR), 2016

work page 2016

[20] [20]

Learned step size quantization

Steven K Esser, Jeffrey L McKinstry, Arun Bablani, Rathinakumar Appuswamy, and Dharmendra S Modha. Learned step size quantization. InInternational Conference on Learning Representations (ICLR), 2020

work page 2020

[21] [21]

Differentiable quantization of deep neural networks

Stefan Uhlich et al. Differentiable quantization of deep neural networks. In International Conference on Learning Representations (ICLR), 2020

work page 2020

[22] [22]

Relaxed quantization for discretized neural networks

Christos Louizos, Charles Blundell, and Max Welling. Relaxed quantization for discretized neural networks. InInternational Conference on Learning Representa- tions (ICLR), 2019

work page 2019

[23] [23]

Learning to quantize deep networks by optimizing quantization intervals with task loss

Sangil Jung, Byeongho Choi, Junha Kim, and Nojun Kwak. Learning to quantize deep networks by optimizing quantization intervals with task loss. InCVPR, pages 4350–4359, 2019

work page 2019

[24] [24]

Optimal clipping and magnitude-aware differentiation for improved quantization-aware training

Charbel Sakr, Steve Dai, Rangha Venkatesan, Brian Zimmer, William Dally, and Brucek Khailany. Optimal clipping and magnitude-aware differentiation for improved quantization-aware training. InInternational conference on machine learning, pages 19123–19138. PMLR, 2022

work page 2022

[25] [25]

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Elias Frantar, Pierre Stock, and Dan Alistarh. Gptq: Accurate post-training quanti- zation for generative pre-trained transformers.arXiv preprint arXiv:2210.17323, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[26] [26]

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Ji Lin, Zhenhua Tang, Yujun Liu, and Song Han. Awq: Activation-aware weight quantization for llm compression and acceleration.arXiv preprint arXiv:2306.00978, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[27] [27]

Spqr: A sparse-quantized representation for near-lossless llm weight compression.arXiv preprint arXiv:2306.03078,

Yuxuan Xiao, Yuhui Ren, Yifan Sun, Haotong Wang, Zheng Wang, et al. Smoothquant: Accurate and efficient post-training quantization for large language models.arXiv preprint arXiv:2306.03078, 2023

work page arXiv 2023

[28] [28]

FP8 Formats for Deep Learning

Paulius Micikevicius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisenthwaite, Sangwon Ha, Alexander Heinecke, Patrick Judd, John Kamalu, et al. Fp8 formats for deep learning.arXiv preprint arXiv:2209.05433, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[29] [29]

Ultra-low precision 4-bit training of deep neural networks.Advances in Neural Information Processing Systems, 33:1796– 1807, 2020

Xiao Sun, Naigang Wang, Chia-Yu Chen, Jiamin Ni, Ankur Agrawal, Xiaodong Cui, Swagath Venkataramani, Kaoutar El Maghraoui, Vijayalakshmi Viji Srini- vasan, and Kailash Gopalakrishnan. Ultra-low precision 4-bit training of deep neural networks.Advances in Neural Information Processing Systems, 33:1796– 1807, 2020

work page 2020

[30] [30]

Bridging the Accuracy Gap for 2-bit Quantized Neural Networks (QNN)

Jungwook Choi, Pierce I-Jen Chuang, Zhuo Wang, Swagath Venkataramani, Vijay- alakshmi Srinivasan, and Kailash Gopalakrishnan. Bridging the accuracy gap for 2-bit quantized neural networks (qnn).arXiv preprint arXiv:1807.06964, 2018. 15

work page internal anchor Pith review Pith/arXiv arXiv 2018

[31] [31]

Relaxed quantization for discretized neural networks

Christos Louizos, Charles Blundell, and Max Welling. Relaxed quantization for discretized neural networks. InInternational Conference on Learning Representa- tions (ICLR), 2019. A Practical Implementation of Uniform Quantization In real-world applications, instead of storing high-precision lk values corresponding to each data point, only the indexes are s...

work page 2019