Quantizing Whisper-small: How design choices affect ASR performance

Andreas S{\o}eborg Kirkedal; Arthur S\"ohler; Julian Irigoyen

arxiv: 2511.08093 · v2 · pith:XSJMEROLnew · submitted 2025-11-11 · 📡 eess.AS · cs.CL · cs.SD


Pith reviewed 2026-05-22 12:11 UTC · model grok-4.3

classification 📡 eess.AS · cs.CL · cs.SD

keywords post-training quantization · Whisper-small · automatic speech recognition · model compression · int8 quantization · LibriSpeech · edge deployment · ASR performance


The pith

Dynamic int8 quantization with Quanto reduces Whisper-small model size by 57% while improving word error rate.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates post-training quantization options for the Whisper-small speech recognition model across four libraries to measure how choices in scheme, method, granularity, and bit-width change accuracy and size. It establishes that dynamic int8 quantization using Quanto delivers the strongest balance by shrinking the model substantially while actually lowering word error rate on standard test sets. A reader would care because Whisper models are accurate but too large for phones and other edge hardware, and this approach shows a way to deploy them efficiently without retraining. The work also finds that static quantization lags behind and that very low-bit formats like int3 or nf4 trade more size for worse results in noisy audio.

Core claim

Experiments on LibriSpeech test-clean and test-other show that dynamic int8 quantization with Quanto offers the best trade-off, reducing model size by 57% while improving on the baseline's word error rate. Static quantization performed worse, likely due to Whisper's Transformer architecture, while more aggressive formats (e.g., nf4, int3) achieved up to 71% compression at the cost of accuracy in noisy conditions.
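For a sense of scale, the reported compression figures can be set against an idealized bound. The sketch below is a back-of-the-envelope calculation, not the paper's method: it assumes a 32-bit floating-point baseline and ignores the storage needed for scales and zero points, and the function names `ideal_size_reduction` and `implied_quantized_fraction` are illustrative.

```python
def ideal_size_reduction(bits, quantized_fraction=1.0, baseline_bits=32):
    """Fractional size reduction when `quantized_fraction` of the parameters
    is stored in `bits` bits and the remainder stays at `baseline_bits`.

    Assumption: scale and zero-point metadata is ignored, so this is an
    optimistic upper bound rather than a measured file size.
    """
    if not 0.0 <= quantized_fraction <= 1.0:
        raise ValueError("quantized_fraction must be in [0, 1]")
    return quantized_fraction * (1.0 - bits / baseline_bits)


def implied_quantized_fraction(observed_reduction, bits, baseline_bits=32):
    """Invert the formula above: the share of parameters that would have to be
    quantized to explain an observed reduction under the same assumptions."""
    return observed_reduction / (1.0 - bits / baseline_bits)


if __name__ == "__main__":
    print(f"int8 bound: {ideal_size_reduction(8):.0%}")
    print(f"int4 bound: {ideal_size_reduction(4):.1%}")
    share = implied_quantized_fraction(0.57, 8)
    print(f"share of parameters implied by 57% at int8: {share:.0%}")
```

Under these assumptions the reported 57% sits below the 75% bound for int8, which would be consistent with part of the model staying in higher precision. The text above does not say which modules those are, so the implied share is only an illustration.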

What carries the argument

Post-training quantization (PTQ) tested across quantization scheme, method, granularity, and bit-width in PyTorch, Optimum-Quanto, HQQ, and bitsandbytes on Whisper-small.
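To make two of these axes concrete, here is a minimal pure-Python sketch of symmetric quantization at per-tensor and per-channel granularity, with bit-width as a parameter. It is a toy illustration under stated assumptions (a hand-written 2x4 weight matrix, round-to-nearest, no library kernels) and does not reproduce how any of the four libraries implement quantization.

```python
def quantize_symmetric(values, bits=8):
    """Symmetric quantization: a single scale, zero point fixed at 0."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for int8, 7 for int4
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map integer codes back to approximate real values."""
    return [x * scale for x in q]


def per_tensor_error(matrix, bits=8):
    """Mean absolute error when one scale is shared by the whole matrix."""
    flat = [v for row in matrix for v in row]
    q, scale = quantize_symmetric(flat, bits)
    restored = dequantize(q, scale)
    return sum(abs(a - b) for a, b in zip(flat, restored)) / len(flat)


def per_channel_error(matrix, bits=8):
    """Mean absolute error when every row (output channel) has its own scale."""
    errors = []
    for row in matrix:
        q, scale = quantize_symmetric(row, bits)
        errors.extend(abs(a - b) for a, b in zip(row, dequantize(q, scale)))
    return sum(errors) / len(errors)


# Hypothetical weights: two channels with very different dynamic ranges.
weights = [
    [0.011, -0.020, 0.015, -0.007],
    [1.90, -2.54, 0.73, 2.10],
]

if __name__ == "__main__":
    print("per-tensor  int8:", per_tensor_error(weights, bits=8))
    print("per-channel int8:", per_channel_error(weights, bits=8))
    print("per-tensor  int4:", per_tensor_error(weights, bits=4))
```

In this toy case the small-range channel is rounded coarsely when it shares a scale with the large-range channel, so finer granularity lowers the error, and dropping from 8 to 4 bits raises it. Real models and libraries add group-wise scales, asymmetric zero points, and outlier handling that this sketch leaves out.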

If this is right

Dynamic int8 quantization can cut model size by more than half with no accuracy penalty on clean and noisy speech tests.
Static quantization is less effective than dynamic for Transformer-based speech models.
Aggressive low-bit formats such as nf4 and int3 reach higher compression but reduce accuracy especially in noisy conditions.
Post-training quantization enables Whisper-small deployment on constrained hardware without any retraining step.
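The dynamic versus static distinction in the list above can be illustrated with a small sketch. Static quantization freezes activation scales from calibration data, while dynamic quantization recomputes them from each input at inference time. The example below shows one commonly discussed failure mode, an activation outlier beyond the calibrated range; it is an assumption-laden toy with hypothetical values, not the paper's explanation for why static quantization lagged.

```python
def scale_from(values, bits=8):
    """Symmetric scale that maps the largest magnitude onto the integer range."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0


def fake_quantize(values, scale, bits=8):
    """Quantize to signed integers with the given scale, then dequantize.
    Values outside the representable range are clipped."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def max_abs_error(original, restored):
    return max(abs(a - b) for a, b in zip(original, restored))


# Hypothetical calibration batch: activations stay within [-1, 1].
calibration = [0.2, -0.5, 0.9, -1.0]
static_scale = scale_from(calibration)        # frozen after calibration

# Hypothetical runtime activation with an outlier outside the calibrated range.
activation = [0.3, -0.8, 4.0, 0.1]

static_out = fake_quantize(activation, static_scale)              # reuses old scale
dynamic_out = fake_quantize(activation, scale_from(activation))   # scale per input

if __name__ == "__main__":
    print("static  max error:", max_abs_error(activation, static_out))
    print("dynamic max error:", max_abs_error(activation, dynamic_out))
```

With the frozen scale, the outlier is clipped to the edge of the calibrated range. The per-input scale covers it at the price of slightly coarser steps for the small values and extra work at inference time.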

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

These quantization patterns could extend to other large speech or language models to support mobile and embedded use.
Real-world inference speed and memory savings would need direct measurement on target edge devices beyond the reported size figures.
Results might shift if the same methods are applied to fine-tuned Whisper variants or non-English speech data.
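For the direct speed measurement mentioned above, a common metric is the real-time factor (RTF): processing time divided by audio duration. The sketch below assumes a hypothetical `transcribe` callable standing in for any model wrapper. The two RTF values in the demo are the ones quoted in the results excerpt further down this page (0.077 for PyTorch dynamic int8 against a 0.121 baseline).

```python
import time


def real_time_factor(processing_seconds, audio_seconds):
    """RTF = processing time / audio duration; values below 1 are faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def measure_rtf(transcribe, audio, audio_seconds, repeats=3):
    """Time a hypothetical `transcribe(audio)` callable and return the best RTF
    over `repeats` runs, which reduces the influence of warm-up effects."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        transcribe(audio)
        best = min(best, time.perf_counter() - start)
    return real_time_factor(best, audio_seconds)


def throughput_gain(baseline_rtf, new_rtf):
    """Relative increase in audio processed per unit of compute time."""
    return baseline_rtf / new_rtf - 1.0


def time_saved(baseline_rtf, new_rtf):
    """Relative reduction in processing time for the same audio."""
    return 1.0 - new_rtf / baseline_rtf


if __name__ == "__main__":
    print(f"throughput gain: {throughput_gain(0.121, 0.077):.1%}")
    print(f"time saved:      {time_saved(0.121, 0.077):.1%}")
```

The two summary functions give different percentages for the same pair of RTF values, so a figure such as "57.1% faster" reads most naturally as a throughput gain (0.121 / 0.077 is about 1.571), while the processing time itself drops by roughly 36%.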

Load-bearing premise

That the quantization implementations in the four libraries were applied without hidden configuration differences or bugs, and that LibriSpeech results predict behavior on other datasets or hardware.

What would settle it

Repeating the quantization experiments on a different dataset such as Common Voice and checking whether the word error rate improvement and size reduction with dynamic int8 in Quanto still appear.
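A replication on another dataset would need the same scoring for every model. The following is a minimal sketch of corpus-level word error rate using word-level edit distance. The lowercasing and whitespace tokenization are simplifying assumptions; the text normalization used in the paper is not described here, and the example sentences are hypothetical.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1]


def corpus_wer(pairs):
    """Corpus-level WER: total edit errors divided by total reference words.
    `pairs` is an iterable of (reference, hypothesis) strings."""
    errors = 0
    ref_total = 0
    for reference, hypothesis in pairs:
        ref_words = reference.lower().split()
        hyp_words = hypothesis.lower().split()
        errors += edit_distance(ref_words, hyp_words)
        ref_total += len(ref_words)
    if ref_total == 0:
        raise ValueError("references contain no words")
    return errors / ref_total


# Hypothetical transcripts from a baseline and a quantized model.
references = ["the cat sat on the mat", "speech recognition is hard"]
baseline_hyps = ["the cat sat on mat", "speech recognition is hard"]
quantized_hyps = ["the cat sat on the mat", "speech recognition is hard"]

if __name__ == "__main__":
    print("baseline WER: ", corpus_wer(zip(references, baseline_hyps)))
    print("quantized WER:", corpus_wer(zip(references, quantized_hyps)))
```

Scoring both models with one function and one normalization keeps the comparison between baseline and quantized output from being affected by scoring differences.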

Original abstract

Large speech recognition models like Whisper-small achieve high accuracy but are difficult to deploy on edge devices due to their high computational demand. To this end, we present a unified, cross-library evaluation of post-training quantization (PTQ) on Whisper-small that disentangles the impact of quantization scheme, method, granularity, and bit-width. Our study is based on four libraries: PyTorch, Optimum-Quanto, HQQ, and bitsandbytes. Experiments on LibriSpeech test-clean and test-other show that dynamic int8 quantization with Quanto offers the best trade-off, reducing model size by 57% while improving on the baseline's word error rate. Static quantization performed worse, likely due to Whisper's Transformer architecture, while more aggressive formats (e.g., nf4, int3) achieved up to 71% compression at the cost of accuracy in noisy conditions. Overall, our results demonstrate that carefully chosen PTQ methods can substantially reduce model size and inference cost without retraining, enabling efficient deployment of Whisper-small on constrained hardware.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper runs a useful cross-library check on quantizing Whisper-small and reports concrete size and WER numbers, but library implementation differences could be driving the ranking more than the stated design choices.


The main thing to know is that dynamic int8 quantization with Quanto comes out ahead on Whisper-small, cutting model size by 57 percent while improving word error rate on LibriSpeech test-clean and test-other. The work also shows that static quantization underperforms and that lower-bit formats like nf4 or int3 push compression to 71 percent but lose accuracy in noisier conditions. These are the practical takeaways for anyone trying to shrink the model for edge use without retraining.

Referee Report

2 major / 2 minor

Summary. The manuscript evaluates post-training quantization (PTQ) on Whisper-small across four libraries (PyTorch, Optimum-Quanto, HQQ, bitsandbytes) to isolate effects of scheme, granularity, and bit-width on ASR performance. Experiments on LibriSpeech test-clean/test-other report that dynamic int8 quantization via Quanto yields the strongest trade-off: 57% model-size reduction with WER improvement over the unquantized baseline. Static quantization underperforms, while more aggressive formats (nf4, int3) reach up to 71% compression at the cost of accuracy on noisy data.

Significance. If the ranking of methods is robust, the study supplies concrete, library-aware guidance for deploying Whisper-scale ASR models on edge hardware without retraining. The cross-library design is a strength that could help practitioners avoid library-specific pitfalls, provided implementation details are shown to be comparable.

major comments (2)

The central claim that dynamic int8 with Quanto is unambiguously best requires that observed WER and size differences arise from the studied variables rather than library-specific defaults. The manuscript does not describe explicit controls ensuring identical quantization scope (e.g., whether embeddings, layer-norm, or convolutional layers in the Whisper encoder-decoder are quantized) or identical dynamic-scale computation across PyTorch, Quanto, HQQ, and bitsandbytes. Without such verification, the ranking cannot be securely attributed to design choices.
Results section: single-run WER numbers are reported without error bars, multiple random seeds, or statistical tests. Given that the claimed improvement over baseline is modest, it is unclear whether the advantage of Quanto int8 is reproducible or within run-to-run variance.
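One way to attach uncertainty to a WER difference, even when quantization and decoding are deterministic, is to resample the test utterances. The sketch below is a paired bootstrap over per-utterance error counts; it is a generic resampling approach offered as an illustration, not something the manuscript reports, and the parameter names and defaults are assumptions.

```python
import random


def resampled_wer(errors, ref_lengths, indices):
    """Corpus WER computed on a resampled set of utterance indices."""
    total_errors = sum(errors[i] for i in indices)
    total_words = sum(ref_lengths[i] for i in indices)
    return total_errors / total_words


def paired_bootstrap_wer_diff(errors_a, errors_b, ref_lengths,
                              n_resamples=2000, alpha=0.05, seed=0):
    """Percentile confidence interval for WER(A) - WER(B).

    Utterances are resampled with replacement, and the same indices are used
    for both systems so that the comparison stays paired.
    """
    n = len(ref_lengths)
    if n == 0 or len(errors_a) != n or len(errors_b) != n:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(resampled_wer(errors_a, ref_lengths, idx)
                     - resampled_wer(errors_b, ref_lengths, idx))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

If the resulting interval excludes zero, the difference is unlikely to be an artifact of which utterances happen to be in the test set. This captures test-set sampling variance, which is a separate question from run-to-run variance.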

minor comments (2)

Abstract and §4: the phrase 'improving on the baseline's word error rate' should be accompanied by the exact baseline WER values for both test-clean and test-other to allow immediate assessment of the magnitude of improvement.
The manuscript would benefit from a short table summarizing, for each library, which modules receive quantization and whether calibration data are used; this would directly address the comparability concern.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our cross-library PTQ evaluation of Whisper-small. The comments highlight important aspects of ensuring fair comparisons and assessing result robustness. We address each major comment below and describe the revisions planned for the updated manuscript.


Referee: The central claim that dynamic int8 with Quanto is unambiguously best requires that observed WER and size differences arise from the studied variables rather than library-specific defaults. The manuscript does not describe explicit controls ensuring identical quantization scope (e.g., whether embeddings, layer-norm, or convolutional layers in the Whisper encoder-decoder are quantized) or identical dynamic-scale computation across PyTorch, Quanto, HQQ, and bitsandbytes. Without such verification, the ranking cannot be securely attributed to design choices.

Authors: We agree that documenting the precise quantization scope and dynamic-scale handling is necessary to attribute differences to the intended design choices rather than library defaults. In the revised manuscript we will add an appendix that details, for each library, exactly which components (embeddings, layer norms, convolutional layers, etc.) are quantized and how per-tensor or per-channel dynamic scales are computed. We will also state the default configurations used and note any unavoidable library-specific constraints, thereby allowing readers to verify that the reported ranking reflects the studied variables.
Revision: yes

Referee: Results section: single-run WER numbers are reported without error bars, multiple random seeds, or statistical tests. Given that the claimed improvement over baseline is modest, it is unclear whether the advantage of Quanto int8 is reproducible or within run-to-run variance.

Authors: Post-training quantization is a deterministic process and the LibriSpeech evaluations employ fixed test sets together with deterministic decoding, so repeated runs produce identical WER values. We will clarify this point in the revised manuscript and emphasize that the observed improvement is consistent across both test-clean and test-other. If the library APIs permit controlled randomness in scale estimation, we will report results from a small number of independent quantization runs; otherwise we will explain why error bars are not applicable.
Revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical comparison with direct benchmark runs


The paper reports results from direct experimental runs of post-training quantization using four libraries on the fixed LibriSpeech test-clean and test-other sets. No equations, derivations, fitted parameters, or first-principles predictions are present. All claims (e.g., 57% size reduction and WER improvement for dynamic int8 with Quanto) are measurements of observed outcomes rather than any chain that reduces to its own inputs by construction. The evaluation is self-contained against external benchmarks with no load-bearing self-citations or ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The study rests on standard assumptions about post-training quantization applicability to transformers and the representativeness of LibriSpeech for ASR evaluation.

axioms (1)

Domain assumption: Post-training quantization preserves sufficient accuracy for ASR without retraining when applied to Whisper-small.
Invoked by the decision to use PTQ rather than quantization-aware training.

pith-pipeline@v0.9.0 · 5721 in / 1122 out tokens · 54161 ms · 2026-05-22T12:11:16.307267+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: dynamic int8 quantization with Quanto offers the best trade-off, reducing model size by 57% while improving on the baseline's word error rate

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: Table 1. Selected dynamic (dyn.) and static (stat.) quantization results on LibriSpeech

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 2 internal anchors

[1]

INTRODUCTION. Automatic speech recognition (ASR) has advanced rapidly with large-scale Transformer models such as Whisper-small, which deliver state-of-the-art transcription accuracy across diverse languages and domains. However, this accuracy comes at a cost: models with hundreds of millions of parameters are difficult to deploy on edge devices, embedde...

[2]

RELATED WORK. While its theoretical foundations date back decades [2], modern neural network quantization has evolved rapidly, with several surveys providing systematic overviews [6, 7, 8]. These works classify methods into post-training quantization and quantization-aware training, and outline factors such as bit-width, quantization granularity, and...

[3]

BACKGROUND. Quantization reduces the memory and computational cost of neural networks by mapping high-precision values (typically 32-bit floating point) to lower-precision representations. In practice, this means storing weights and activations in compact formats such as int8 or fp8, with memory usage scaling approximately as 1/b where b is the bit-width [6, 7]...

[4]

METHODS. We evaluate PTQ of Whisper-small, a 244M-parameter Transformer-based ASR model pre-trained for multilingual and multitask speech recognition [13]. Experiments are conducted on the LibriSpeech test-clean and test-other subsets [14], representing clean and noisy conditions. Depending on library support, we apply quantization across a range of bit-widt...

[5]

RESULTS. Table 1 summarizes the best-performing quantized models relative to the full-precision baseline. On CPU, PyTorch dynamic int8 delivered the fastest inference (RTF 0.077; 57.1% faster than the 0.121 baseline) with only a small accuracy drop. HQQ dynamic int4 preserved accuracy on clean speech while achieving the largest compression (69%). In contrast, stat...

[6]

DISCUSSION. 6.1. Trade-offs Between Different Quantization Methods. On CPUs, PyTorch dynamic int8 consistently achieved the fastest inference. Its advantage stems from using a per-tensor asymmetric scheme, which applies a single scale across an entire tensor. This approach simplifies computation and reduces the overhead of quantization and dequantization, exp...

[7]

LIMITATIONS AND FUTURE WORK. This study has several limitations. First, the evaluation was restricted to LibriSpeech, which does not capture the full range of noise profiles, accents, and spontaneous speech found in real-world scenarios. Future work should evaluate on more diverse datasets. Second, we focused exclusively on post-training quantization; ...

[8]

CONCLUSION. This study evaluated post-training quantization of Whisper-small across four libraries and multiple bit-widths, disentangling the effects of scheme (dynamic vs. static), method (symmetric vs. asymmetric), and granularity (per-tensor, per-channel, per-group) in quantization. On LibriSpeech test-clean and test-other, dynamic quantization consist...

[9]

ACKNOWLEDGEMENTS. We thank Jabra and GN Group for supporting this research. Computational experiments were performed on the Danish e-Infrastructure Consortium (DeiC) National HPC facilities, utilizing Lenovo ThinkSystem SR675 V3 nodes equipped with dual AMD EPYC 9454 processors (2.75 GHz, 192 vCPUs total), 768 GB DDR5-4800 memory, and four NVIDIA Hopper ...

[10]

Vincent Vanhoucke, Andrew Senior, and Mark Z. Mao, “Improving the speed of neural networks on CPUs,” in Deep Learning and Unsupervised Feature Learning Workshop, NIPS 2011, 2011.

[11]

R.M. Gray and D.L. Neuhoff, “Quantization,” IEEE Transactions on Information Theory, vol. 44, no. 6, pp. 2325–2383, 1998.

[12]

Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev, and Paulius Micikevicius, “Integer quantization for deep learning inference: Principles and empirical evaluation,” arXiv, 2020.

[13]

Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer, “LLM.int8(): 8-bit matrix multiplication for transformers at scale,” 2022.

[14]

Haihao Shen, Naveen Mellempudi, Xin He, Qun Gao, Chang Wang, and Mengni Wang, “Efficient post-training quantization with fp8 formats,” arXiv, 2023.

[15]

Amir Gholami, Sehoon Kim, Zhen Dong, Zhewei Yao, Michael W. Mahoney, and Kurt Keutzer, “A survey of quantization methods for efficient neural network inference,” arXiv, 2021.

[16]

Markus Nagel, Marios Fournarakis, Rana Ali Amjad, Yelysei Bondarenko, Mart van Baalen, and Tijmen Blankevoort, “A white paper on neural network quantization,” arXiv, 2021.

[17]

Lu Wei, Zhong Ma, Chaojie Yang, and Qin Yao, “Advances in the neural network quantization: A comprehensive review,” Applied Sciences, vol. 14, no. 17, 2024.

[18]

Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, and Kurt Keutzer, “I-BERT: Integer-only BERT quantization,” arXiv, 2021.

[19]

Sehoon Kim, Amir Gholami, Zhewei Yao, Nicholas Lee, Patrick Wang, Aniruddha Nrusimha, Bohan Zhai, Tianren Gao, Michael W. Mahoney, and Kurt Keutzer, “Integer-only zero-shot quantization for efficient speech recognition,” 2022.

[20]

Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, and Yuxiong He, “ZeroQuant: Efficient and affordable post-training quantization for large-scale transformers,” arXiv, 2022.

[21]

Song Han, Jeff Pool, John Tran, and William J. Dally, “Learning both weights and connections for efficient neural networks,” 2015.

[22]

A. Radford, J.W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” arXiv preprint arXiv:2212.04356, 2022.

[23]

Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, “Librispeech: an ASR corpus based on public domain audio books,” in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 5206–5210.
