Quantizing Whisper-small: How design choices affect ASR performance
Pith reviewed 2026-05-22 12:11 UTC · model grok-4.3
The pith
Dynamic int8 quantization with Quanto reduces Whisper-small model size by 57% while improving word error rate.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Experiments on LibriSpeech test-clean and test-other show that dynamic int8 quantization with Quanto offers the best trade-off, reducing model size by 57% while improving on the baseline's word error rate. Static quantization performed worse, likely due to Whisper's Transformer architecture, while more aggressive formats (e.g., nf4, int3) achieved up to 71% compression at the cost of accuracy in noisy conditions.
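To make the bit-width trade-off in the claim above concrete, here is a minimal sketch of uniform asymmetric (affine) quantization at 8, 4, and 3 bits. It is an illustration under stated assumptions, not the paper's code: the weight values are made up, the function names are hypothetical, and only uniform integer grids are modeled (nf4 uses a non-uniform codebook and is not covered).

```python
def affine_quantize(values, bits):
    """Asymmetric (affine) quantization to unsigned integers in [0, 2**bits - 1]."""
    qmax = 2 ** bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero_point = round(-lo / scale)  # integer code that represents 0.0
    q = [max(0, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def affine_dequantize(q, scale, zero_point):
    """Map integer codes back to approximate floating-point values."""
    return [(x - zero_point) * scale for x in q]


def rms_error(values, bits):
    """Root-mean-square error introduced by a quantize/dequantize round trip."""
    q, scale, zero_point = affine_quantize(values, bits)
    restored = affine_dequantize(q, scale, zero_point)
    return (sum((a - b) ** 2 for a, b in zip(values, restored)) / len(values)) ** 0.5


# Hypothetical weight values, chosen only for illustration.
weights = [-0.82, -0.41, -0.07, 0.0, 0.13, 0.29, 0.56, 1.10]
errors = {bits: rms_error(weights, bits) for bits in (8, 4, 3)}
```

On this toy tensor the round-trip error grows as the bit-width shrinks, which is the general pattern behind the size versus accuracy trade-off; it says nothing about how large the effect is on Whisper-small itself.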
What carries the argument
Post-training quantization (PTQ) tested across quantization scheme, method, granularity, and bit-width in PyTorch, Optimum-Quanto, HQQ, and bitsandbytes on Whisper-small.
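Of the four variables listed above, granularity is perhaps the easiest to illustrate in a few lines. The sketch below compares per-tensor and per-channel symmetric int8 quantization on a small, made-up weight matrix; the helper names are hypothetical and the code does not reproduce any of the four libraries' implementations.

```python
def quantize_symmetric(values, bits=8):
    """Symmetric quantization of a list of floats with one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 127 for int8
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def total_abs_error(values, bits=8):
    """Sum of absolute round-trip errors for one group sharing a scale."""
    q, scale = quantize_symmetric(values, bits)
    return sum(abs(v - x * scale) for v, x in zip(values, q))


def per_tensor_error(matrix, bits=8):
    """Mean absolute error when the whole matrix shares a single scale."""
    flat = [v for row in matrix for v in row]
    return total_abs_error(flat, bits) / len(flat)


def per_channel_error(matrix, bits=8):
    """Mean absolute error when every row (output channel) has its own scale."""
    count = sum(len(row) for row in matrix)
    return sum(total_abs_error(row, bits) for row in matrix) / count


# Hypothetical weights: one small-magnitude row and one large-magnitude row.
weights = [
    [0.011, -0.023, 0.017, 0.004],
    [9.5, -12.7, 3.3, 0.8],
]
tensor_err = per_tensor_error(weights)
channel_err = per_channel_error(weights)
```

With a single scale, the large row dictates the step size and the small row collapses toward zero; per-channel scales avoid this at the cost of storing one scale per row. Per-group quantization, also mentioned in the paper, applies the same idea to fixed-size blocks within a row.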
If this is right
- Dynamic int8 quantization can cut model size by more than half with no accuracy penalty on clean and noisy speech tests.
- Static quantization is less effective than dynamic for Transformer-based speech models.
- Aggressive low-bit formats such as nf4 and int3 reach higher compression but reduce accuracy especially in noisy conditions.
- Post-training quantization enables Whisper-small deployment on constrained hardware without any retraining step.
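The difference between dynamic and static schemes in the list above can be sketched as follows. The assumption here, which the source does not confirm as the cause of the static results, is that a scale fixed from calibration data can clip activations that fall outside the calibrated range, while a dynamic scale is recomputed per input. All names and data are hypothetical.

```python
def scale_from_range(max_abs, bits=8):
    """Symmetric scale that maps max_abs onto the largest signed integer code."""
    qmax = 2 ** (bits - 1) - 1
    return max_abs / qmax if max_abs > 0 else 1.0


def fake_quantize(values, scale, bits=8):
    """Quantize then dequantize, clamping to the signed integer range."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def static_scale(calibration_batches, bits=8):
    """One fixed scale from the largest magnitude seen during calibration."""
    max_abs = max(abs(v) for batch in calibration_batches for v in batch)
    return scale_from_range(max_abs, bits)


def dynamic_scale(activations, bits=8):
    """A fresh scale computed from the activations of the current input."""
    return scale_from_range(max(abs(v) for v in activations), bits)


def max_error(original, approx):
    """Largest absolute deviation between two equally long sequences."""
    return max(abs(a - b) for a, b in zip(original, approx))


# Hypothetical activations: the runtime input exceeds the calibrated range.
calibration = [[0.2, -0.5, 0.9], [0.1, 0.7, -1.0]]
runtime_input = [0.3, -2.4, 4.0]

static_err = max_error(
    runtime_input, fake_quantize(runtime_input, static_scale(calibration))
)
dynamic_err = max_error(
    runtime_input, fake_quantize(runtime_input, dynamic_scale(runtime_input))
)
```

In this toy case the static scale saturates on the out-of-range values, while the dynamic scale keeps the error within half a quantization step. Whether this mechanism explains the paper's static results on Whisper-small is not established by the source.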
Where Pith is reading between the lines
- These quantization patterns could extend to other large speech or language models to support mobile and embedded use.
- Real-world inference speed and memory savings would need direct measurement on target edge devices beyond the reported size figures.
- Results might shift if the same methods are applied to fine-tuned Whisper variants or non-English speech data.
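For the speed measurement mentioned in the list above, a minimal sketch of real-time factor (RTF) bookkeeping may help. It assumes the usual definition, processing time divided by audio duration, which the excerpts here do not spell out; `transcribe` is a hypothetical callable standing in for any model. The arithmetic uses the RTF figures quoted in the Results excerpt of the reference graph (0.077 against a 0.121 baseline).

```python
import time


def real_time_factor(processing_seconds, audio_seconds):
    """RTF under the assumed definition: processing time / audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def measure_rtf(transcribe, audio, audio_seconds):
    """Time one call to a (hypothetical) transcribe function and return its RTF."""
    start = time.perf_counter()
    transcribe(audio)
    return real_time_factor(time.perf_counter() - start, audio_seconds)


def speedup_percent(baseline_rtf, new_rtf):
    """Extra audio processed per unit of compute time, in percent."""
    return (baseline_rtf / new_rtf - 1) * 100


def rtf_reduction_percent(baseline_rtf, new_rtf):
    """Relative drop in RTF (processing time per second of audio), in percent."""
    return (1 - new_rtf / baseline_rtf) * 100


speedup = speedup_percent(0.121, 0.077)          # about 57.1
reduction = rtf_reduction_percent(0.121, 0.077)  # about 36.4
```

Read this way, the quoted "57.1% faster" matches the throughput ratio 0.121 / 0.077, while the processing time per second of audio falls by roughly 36%. Both describe the same measurement, so it is worth stating which one a reported percentage refers to.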
Load-bearing premise
That the quantization implementations in the four libraries were applied without hidden configuration differences or bugs, and that LibriSpeech results predict behavior on other datasets or hardware.
What would settle it
Repeating the quantization experiments on a different dataset such as Common Voice and checking whether the word error rate improvement and size reduction with dynamic int8 in Quanto still appear.
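Since the check above hinges on word error rate (WER), here is a minimal sketch of the standard definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. It omits the text normalization that real evaluations typically apply, and the function name is illustrative rather than taken from the paper.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level Levenshtein distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One deleted word out of six reference words.
example = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

A corpus-level WER is normally computed by summing errors and reference words over all utterances before dividing, rather than by averaging per-utterance rates.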
Original abstract
Large speech recognition models like Whisper-small achieve high accuracy but are difficult to deploy on edge devices due to their high computational demand. To this end, we present a unified, cross-library evaluation of post-training quantization (PTQ) on Whisper-small that disentangles the impact of quantization scheme, method, granularity, and bit-width. Our study is based on four libraries: PyTorch, Optimum-Quanto, HQQ, and bitsandbytes. Experiments on LibriSpeech test-clean and test-other show that dynamic int8 quantization with Quanto offers the best trade-off, reducing model size by 57% while improving on the baseline's word error rate. Static quantization performed worse, likely due to Whisper's Transformer architecture, while more aggressive formats (e.g., nf4, int3) achieved up to 71% compression at the cost of accuracy in noisy conditions. Overall, our results demonstrate that carefully chosen PTQ methods can substantially reduce model size and inference cost without retraining, enabling efficient deployment of Whisper-small on constrained hardware.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates post-training quantization (PTQ) on Whisper-small across four libraries (PyTorch, Optimum-Quanto, HQQ, bitsandbytes) to isolate effects of scheme, granularity, and bit-width on ASR performance. Experiments on LibriSpeech test-clean/test-other report that dynamic int8 quantization via Quanto yields the strongest trade-off: 57% model-size reduction with WER improvement over the unquantized baseline. Static quantization underperforms, while more aggressive formats (nf4, int3) reach up to 71% compression at the cost of accuracy on noisy data.
Significance. If the ranking of methods is robust, the study supplies concrete, library-aware guidance for deploying Whisper-scale ASR models on edge hardware without retraining. The cross-library design is a strength that could help practitioners avoid library-specific pitfalls, provided implementation details are shown to be comparable.
major comments (2)
- The central claim that dynamic int8 with Quanto is unambiguously best requires that observed WER and size differences arise from the studied variables rather than library-specific defaults. The manuscript does not describe explicit controls ensuring identical quantization scope (e.g., whether embeddings, layer-norm, or convolutional layers in the Whisper encoder-decoder are quantized) or identical dynamic-scale computation across PyTorch, Quanto, HQQ, and bitsandbytes. Without such verification, the ranking cannot be securely attributed to design choices.
- Results section: single-run WER numbers are reported without error bars, multiple random seeds, or statistical tests. Given that the claimed improvement over baseline is modest, it is unclear whether the advantage of Quanto int8 is reproducible or within run-to-run variance.
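One way to put an uncertainty estimate on a single-run WER difference, even when decoding is deterministic, is a paired bootstrap over test utterances. The sketch below is an assumption about how such a check could look, not something the manuscript reports; the per-utterance counts are hypothetical, and the interval reflects sampling variability of the test set rather than run-to-run variance.

```python
import random


def corpus_wer(errors, words):
    """Corpus-level WER: total word errors over total reference words."""
    return sum(errors) / sum(words)


def paired_bootstrap_wer_diff(base_errors, quant_errors, words,
                              n_resamples=2000, seed=0):
    """95% percentile interval for WER(quantized) - WER(baseline).

    Utterances are resampled with replacement, and the same resampled
    indices are used for both systems so the comparison stays paired.
    """
    rng = random.Random(seed)
    n = len(words)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        total_words = sum(words[i] for i in idx)
        delta = sum(quant_errors[i] - base_errors[i] for i in idx)
        diffs.append(delta / total_words)
    diffs.sort()
    lower = diffs[int(0.025 * n_resamples)]
    upper = diffs[int(0.975 * n_resamples) - 1]
    return lower, upper


# Hypothetical per-utterance reference word counts and error counts.
words = [12, 20, 8, 15, 30, 11, 9, 25]
base_errors = [1, 2, 0, 1, 3, 0, 1, 2]
quant_errors = [1, 1, 0, 2, 3, 0, 1, 2]

lower, upper = paired_bootstrap_wer_diff(base_errors, quant_errors, words)
```

If the resulting interval contains zero, the data do not distinguish the two systems at that confidence level; an interval entirely below zero would support a genuine WER improvement for the quantized model.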
minor comments (2)
- Abstract and §4: the phrase 'improving on the baseline's word error rate' should be accompanied by the exact baseline WER values for both test-clean and test-other to allow immediate assessment of the magnitude of improvement.
- The manuscript would benefit from a short table summarizing, for each library, which modules receive quantization and whether calibration data are used; this would directly address the comparability concern.
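The comparability concern can be made concrete with simple size arithmetic. The sketch below assumes a 32-bit floating-point baseline (the paper's background section calls this typical but the excerpts do not confirm it for these experiments), ignores the storage overhead of scales and zero points, and uses the 244M parameter count quoted in the Methods excerpt. Function names are hypothetical.

```python
def model_size_mb(n_params, bits_per_param):
    """Storage in megabytes (10**6 bytes) for n_params at a given bit-width."""
    return n_params * bits_per_param / 8 / 1e6


def size_reduction(quantized_fraction, low_bits, base_bits=32):
    """Fractional size reduction when only part of the weights is quantized."""
    remaining = quantized_fraction * low_bits / base_bits + (1 - quantized_fraction)
    return 1 - remaining


def quantized_fraction_for(reduction, low_bits, base_bits=32):
    """Fraction of parameters that must be quantized to reach a given reduction."""
    return reduction / (1 - low_bits / base_bits)


N_PARAMS = 244_000_000                      # parameter count quoted for Whisper-small

baseline_mb = model_size_mb(N_PARAMS, 32)   # assumed fp32 baseline
full_int8 = size_reduction(1.0, 8)          # every parameter stored as int8
implied_fraction = quantized_fraction_for(0.57, 8)
```

Under these assumptions, quantizing every parameter to int8 would give a 75% reduction, so a reported 57% would correspond to roughly three quarters of the parameters being quantized. That figure is only illustrative, but it shows why a per-library table of quantized modules matters: different scopes alone can shift the size numbers.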
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our cross-library PTQ evaluation of Whisper-small. The comments highlight important aspects of ensuring fair comparisons and assessing result robustness. We address each major comment below and describe the revisions planned for the updated manuscript.
- Referee: The central claim that dynamic int8 with Quanto is unambiguously best requires that observed WER and size differences arise from the studied variables rather than library-specific defaults. The manuscript does not describe explicit controls ensuring identical quantization scope (e.g., whether embeddings, layer-norm, or convolutional layers in the Whisper encoder-decoder are quantized) or identical dynamic-scale computation across PyTorch, Quanto, HQQ, and bitsandbytes. Without such verification, the ranking cannot be securely attributed to design choices.
  Authors: We agree that documenting the precise quantization scope and dynamic-scale handling is necessary to attribute differences to the intended design choices rather than library defaults. In the revised manuscript we will add an appendix that details, for each library, exactly which components (embeddings, layer norms, convolutional layers, etc.) are quantized and how per-tensor or per-channel dynamic scales are computed. We will also state the default configurations used and note any unavoidable library-specific constraints, thereby allowing readers to verify that the reported ranking reflects the studied variables.
  Revision: yes
- Referee: Results section: single-run WER numbers are reported without error bars, multiple random seeds, or statistical tests. Given that the claimed improvement over baseline is modest, it is unclear whether the advantage of Quanto int8 is reproducible or within run-to-run variance.
  Authors: Post-training quantization is a deterministic process and the LibriSpeech evaluations employ fixed test sets together with deterministic decoding, so repeated runs produce identical WER values. We will clarify this point in the revised manuscript and emphasize that the observed improvement is consistent across both test-clean and test-other. If the library APIs permit controlled randomness in scale estimation, we will report results from a small number of independent quantization runs; otherwise we will explain why error bars are not applicable.
  Revision: partial
Circularity Check
No circularity: purely empirical comparison with direct benchmark runs
The paper reports results from direct experimental runs of post-training quantization using four libraries on the fixed LibriSpeech test-clean and test-other sets. No equations, derivations, fitted parameters, or first-principles predictions are present. All claims (e.g., 57% size reduction and WER improvement for dynamic int8 with Quanto) are measurements of observed outcomes rather than any chain that reduces to its own inputs by construction. The evaluation is self-contained against external benchmarks with no load-bearing self-citations or ansatzes.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Post-training quantization preserves sufficient accuracy for ASR without retraining when applied to Whisper-small.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel: unclear
  Relation between the paper passage and the cited Recognition theorem.
  "dynamic int8 quantization with Quanto offers the best trade-off, reducing model size by 57% while improving on the baseline's word error rate"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction: unclear
  Relation between the paper passage and the cited Recognition theorem.
  "Table 1. Selected dynamic (dyn.) and static (stat.) quantization results on LibriSpeech"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] INTRODUCTION. Automatic speech recognition (ASR) has advanced rapidly with large-scale Transformer models such as Whisper-small, which deliver state-of-the-art transcription accuracy across diverse languages and domains. However, this accuracy comes at a cost: models with hundreds of millions of parameters are difficult to deploy on edge devices, embedde...
- [2] RELATED WORK. While its theoretical foundations date back decades [2], modern neural network quantization has evolved rapidly, with several surveys providing systematic overviews [6, 7, 8]. These works classify methods into post-training quantization and quantization-aware training, and outline factors such as bit-width, quantization granularity, and...
- [3] BACKGROUND. Quantization reduces the memory and computational cost of neural networks by mapping high-precision values (typically 32-bit floating point) to lower-precision representations. In practice, this means storing weights and activations in compact formats such as int8 or fp8, with memory usage scaling approximately as 1/b where b is the bit-width [6, 7]...
- [4] METHODS. We evaluate PTQ of Whisper-small, a 244M-parameter Transformer-based ASR model pre-trained for multilingual and multitask speech recognition [13]. Experiments are conducted on the LibriSpeech test-clean and test-other subsets [14], representing clean and noisy conditions. Depending on library support, we apply quantization across a range of bit-widt...
- [5] RESULTS. Table 1 summarizes the best-performing quantized models relative to the full-precision baseline. On CPU, PyTorch dynamic int8 delivered the fastest inference (RTF 0.077; 57.1% faster than the 0.121 baseline) with only a small accuracy drop. HQQ dynamic int4 preserved accuracy on clean speech while achieving the largest compression (69%). In contrast, stat...
- [6] DISCUSSION. 6.1. Trade-offs Between Different Quantization Methods. On CPUs, PyTorch dynamic int8 consistently achieved the fastest inference. Its advantage stems from using a per-tensor asymmetric scheme, which applies a single scale across an entire tensor. This approach simplifies computation and reduces the overhead of quantization and dequantization, exp...
- [7] LIMITATIONS AND FUTURE WORK. This study has several limitations. First, the evaluation was restricted to LibriSpeech, which does not capture the full range of noise profiles, accents, and spontaneous speech found in real-world scenarios. Future work should evaluate on more diverse datasets. Second, we focused exclusively on post-training quantization; ...
- [8] CONCLUSION. This study evaluated post-training quantization of Whisper-small across four libraries and multiple bit-widths, disentangling the effects of scheme (dynamic vs. static), method (symmetric vs. asymmetric), and granularity (per-tensor, per-channel, per-group) in quantization. On LibriSpeech test-clean and test-other, dynamic quantization consist...
- [9] ACKNOWLEDGEMENTS. We thank Jabra and GN Group for supporting this research. Computational experiments were performed on the Danish e-Infrastructure Consortium (DeiC) National HPC facilities, utilizing Lenovo ThinkSystem SR675 V3 nodes equipped with dual AMD EPYC 9454 processors (2.75 GHz, 192 vCPUs total), 768 GB DDR5-4800 memory, and four NVIDIA Hopper ...
- [10] Vincent Vanhoucke, Andrew Senior, and Mark Z. Mao, “Improving the speed of neural networks on cpus,” in Deep Learning and Unsupervised Feature Learning Workshop, NIPS 2011, 2011.
- [11] R.M. Gray and D.L. Neuhoff, “Quantization,” IEEE Transactions on Information Theory, vol. 44, no. 6, pp. 2325–2383, 1998.
- [12] Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev, and Paulius Micikevicius, “Integer quantization for deep learning inference: Principles and empirical evaluation,” arXiv, 2020.
- [13] Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer, “Llm.int8(): 8-bit matrix multiplication for transformers at scale,” 2022.
- [14] Haihao Shen, Naveen Mellempudi, Xin He, Qun Gao, Chang Wang, and Mengni Wang, “Efficient post-training quantization with fp8 formats,” arXiv, 2023.
- [15] Amir Gholami, Sehoon Kim, Zhen Dong, Zhewei Yao, Michael W. Mahoney, and Kurt Keutzer, “A survey of quantization methods for efficient neural network inference,” arXiv, 2021.
- [16] Markus Nagel, Marios Fournarakis, Rana Ali Amjad, Yelysei Bondarenko, Mart van Baalen, and Tijmen Blankevoort, “A white paper on neural network quantization,” arXiv, 2021.
- [17] Lu Wei, Zhong Ma, Chaojie Yang, and Qin Yao, “Advances in the neural network quantization: A comprehensive review,” Applied Sciences, vol. 14, no. 17, 2024.
- [18] Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, and Kurt Keutzer, “I-bert: Integer-only bert quantization,” arXiv, 2021.
- [19] Sehoon Kim, Amir Gholami, Zhewei Yao, Nicholas Lee, Patrick Wang, Aniruddha Nrusimha, Bohan Zhai, Tianren Gao, Michael W. Mahoney, and Kurt Keutzer, “Integer-only zero-shot quantization for efficient speech recognition,” 2022.
- [20] Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, and Yuxiong He, “Zeroquant: Efficient and affordable post-training quantization for large-scale transformers,” arXiv, 2022.
- [21] Song Han, Jeff Pool, John Tran, and William J. Dally, “Learning both weights and connections for efficient neural networks,” 2015.
- [22] A. Radford, J.W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” arXiv preprint arXiv:2212.04356, 2022.
- [23] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, “Librispeech: an asr corpus based on public domain audio books,” in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 5206–5210.