BaldWhisper: Faster Whisper with Head Shearing and Layer Merging
Pith reviewed 2026-05-18 09:12 UTC · model grok-4.3
The pith
Whisper can be pruned to be 48 percent smaller and 2.15 times faster while retaining 90 percent of its performance on Bambara, using only limited data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By compressing embeddings via low-rank decomposition and feature distillation while merging layers rather than removing them, the method yields a pruned Whisper model that preserves 90 percent of original performance on Bambara speech-to-text, reduces size by 48 percent, and delivers 2.15 times faster inference on edge hardware, all without relying on massive retraining corpora.
What carries the argument
Head shearing (low-rank decomposition of the embeddings plus feature distillation) combined with layer merging, which jointly compress the model while limiting accuracy loss in data-scarce conditions.
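To make the size argument concrete, here is a minimal sketch of the parameter arithmetic behind a low-rank embedding factorization. The paper's rank is not stated in the material above, so `VOCAB_SIZE`, `D_MODEL`, and `RANK` are illustrative assumptions, and the function names are hypothetical.

```python
def embedding_params(vocab_size: int, d_model: int) -> int:
    """Parameters in a dense embedding matrix E of shape (vocab_size, d_model)."""
    return vocab_size * d_model


def low_rank_params(vocab_size: int, d_model: int, rank: int) -> int:
    """Parameters after factorizing E ~= A @ B with A: (vocab_size, rank), B: (rank, d_model)."""
    return vocab_size * rank + rank * d_model


def break_even_rank(vocab_size: int, d_model: int) -> float:
    """Rank below which the factorized form holds fewer parameters than the dense matrix."""
    return vocab_size * d_model / (vocab_size + d_model)


# Illustrative values only (assumptions, not taken from the paper).
VOCAB_SIZE = 50_000
D_MODEL = 512
RANK = 128

dense = embedding_params(VOCAB_SIZE, D_MODEL)
factored = low_rank_params(VOCAB_SIZE, D_MODEL, RANK)
saving = 1 - factored / dense
print(f"dense={dense:,} factored={factored:,} saving={saving:.1%}")
print(f"break-even rank ~ {break_even_rank(VOCAB_SIZE, D_MODEL):.1f}")
```

Because the vocabulary dimension dominates, the saving is driven almost entirely by the ratio of the rank to the hidden size. This is why the full vocabulary can be kept while the embedding still shrinks substantially.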
If this is right
- The model runs 2.15 times faster on a MacBook Air M1.
- It retains 90 percent of original performance with only 32 hours of Bambara data.
- Avoiding vocabulary pruning keeps the model able to handle the code-switching that is common in Bambara speech.
- Layer merging reduces performance drop compared with simple layer removal.
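The contrast between merging and removing layers can be sketched as follows. This assumes that merging means a weighted sum of two adjacent layers' parameters, with weights `alpha` and `beta`; the exact merge rule is not given above, so treat the functions below as a hypothetical illustration rather than the authors' implementation.

```python
from typing import Dict, List

Layer = Dict[str, List[float]]  # parameter name -> flattened weights


def merge_layers(layer_1: Layer, layer_2: Layer, alpha: float, beta: float) -> Layer:
    """Return one layer whose weights are alpha * layer_1 + beta * layer_2."""
    if layer_1.keys() != layer_2.keys():
        raise ValueError("layers must expose the same parameter names")
    merged: Layer = {}
    for name, w1 in layer_1.items():
        w2 = layer_2[name]
        if len(w1) != len(w2):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [alpha * a + beta * b for a, b in zip(w1, w2)]
    return merged


def merge_adjacent_pairs(layers: List[Layer], alpha: float, beta: float) -> List[Layer]:
    """Halve the depth by merging layers (0, 1), (2, 3), ...; an odd last layer is kept."""
    merged = [
        merge_layers(layers[i], layers[i + 1], alpha, beta)
        for i in range(0, len(layers) - 1, 2)
    ]
    if len(layers) % 2 == 1:
        merged.append(layers[-1])
    return merged


# Toy 4-layer stack with a single two-weight parameter per layer.
stack = [{"w": [float(i), float(i) + 1.0]} for i in range(4)]
halved = merge_adjacent_pairs(stack, alpha=0.5, beta=0.5)
print(halved)
```

Simple layer removal would correspond to `alpha=1.0, beta=0.0`, which discards the second layer of each pair entirely. Merging instead keeps a contribution from both layers before any retraining.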
Where Pith is reading between the lines
- The same compression steps could be tested on other low-resource languages beyond Bambara.
- Layer merging might reduce retraining needs when pruning other transformer speech models.
- Combining the approach with quantization could yield additional speed gains on mobile devices.
Load-bearing premise
Low-rank decomposition of embeddings combined with feature distillation and layer merging can recover most of the original performance on Bambara without needing the massive retraining data used in prior distillation approaches.
What would settle it
Evaluating the final pruned model on a held-out Bambara test set and finding a word error rate more than 10 percent higher than that of the original Whisper would show that 90 percent of the performance was not preserved.
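A check of this kind could look like the sketch below. It assumes that "90 percent performance" is read as a relative WER increase of at most 10 percent, which is only one possible reading. The sentences and WER values are made-up placeholders, not results from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances against an empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_increase(baseline_wer: float, pruned_wer: float) -> float:
    """Relative degradation of the pruned model with respect to the baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (pruned_wer - baseline_wer) / baseline_wer


def claim_is_refuted(baseline_wer: float, pruned_wer: float, tolerance: float = 0.10) -> bool:
    """True if the pruned model is more than `tolerance` worse in relative WER."""
    return relative_wer_increase(baseline_wer, pruned_wer) > tolerance


# Placeholder inputs for illustration only.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"example WER = {example_wer:.3f}")
print(claim_is_refuted(baseline_wer=0.40, pruned_wer=0.46))
```

In practice the WER would be aggregated over the whole test set (total edits divided by total reference words) rather than averaged per utterance.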
Original abstract
Pruning large pre-trained transformers in a data-scarce scenario is challenging, as it often requires massive retraining data to recover performance. For instance, Distill-Whisper prunes Whisper by 40% and retrains on 21,000 hours of speech, far beyond what is available for most languages. Can Whisper be made lighter and faster for edge devices in data-scarce settings? Focusing on Bambara with only 32 h of speech-to-text data, we propose a new pruning recipe. Instead of vocabulary pruning, which is unsuitable due to frequent code-switching by Bambara speakers, we compress the embeddings with low-rank decomposition and feature distillation. Rather than removing layers, we merge them to limit performance loss. The final model preserves 90% of the original performance while being 48% smaller and 2.15x faster on a MacBook Air M1.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces BaldWhisper, a pruning recipe for the Whisper ASR model targeted at Bambara using only 32 hours of paired speech-to-text data. Due to frequent French/English code-switching, the authors avoid vocabulary pruning and instead apply low-rank decomposition plus feature distillation to the embedding matrix, while using layer merging rather than layer removal. They report that the resulting model retains 90% of the original performance, is 48% smaller, and runs 2.15 times faster on a MacBook Air M1.
Significance. If the empirical claims are substantiated, the work would offer a practical path for compressing large ASR transformers in genuinely data-scarce, code-switched settings without the 21,000-hour retraining budgets required by Distill-Whisper. The combination of low-rank embedding compression and layer merging, rather than aggressive pruning, is a potentially reusable recipe for edge deployment of multilingual models.
major comments (2)
- [Abstract] Abstract: the central claim that low-rank decomposition plus feature distillation plus layer merging recovers 90 % of Whisper performance on Bambara with only 32 h of data is presented without any reported metric (WER, CER, or otherwise), baseline model version, error bars, or comparison against a naïve fine-tune or Distill-Whisper on the same Bambara corpus; this absence is load-bearing because the abstract supplies the only quantitative evidence for the claim.
- [Method] Method section (presumed §3): no rank value is stated for the low-rank decomposition of the embedding matrix, and the feature-distillation loss is not formulated (e.g., whether it is MSE on encoder hidden states, attention maps, or logit matching); without these definitions or an ablation isolating distillation from merging, it is impossible to evaluate whether the cross-lingual subspaces needed for code-switched Bambara utterances are preserved.
minor comments (1)
- [Abstract] Abstract: 'preserves 90 of the original performance' should read 'preserves 90% of the original performance'; '48 smaller' should read '48% smaller'.
Simulated Author's Rebuttal
We are grateful to the referee for their insightful comments, which will help improve the clarity and completeness of our paper. Below, we provide detailed responses to each major comment.
- Referee: [Abstract] Abstract: the central claim that low-rank decomposition plus feature distillation plus layer merging recovers 90 % of Whisper performance on Bambara with only 32 h of data is presented without any reported metric (WER, CER, or otherwise), baseline model version, error bars, or comparison against a naïve fine-tune or Distill-Whisper on the same Bambara corpus; this absence is load-bearing because the abstract supplies the only quantitative evidence for the claim.
  Authors: We agree that the abstract should be more self-contained with specific details. In the revised version, we will specify the performance metric used (word error rate on the Bambara evaluation set), the baseline Whisper model version, and include a comparison against a naïve fine-tuning of the original model on the same 32-hour corpus. We will also report error bars from our experimental runs. Note that a direct comparison to Distill-Whisper is not feasible in this data-scarce setting as it requires 21,000 hours of data, but the naïve fine-tune serves as the appropriate baseline.
  Revision: yes
- Referee: [Method] Method section (presumed §3): no rank value is stated for the low-rank decomposition of the embedding matrix, and the feature-distillation loss is not formulated (e.g., whether it is MSE on encoder hidden states, attention maps, or logit matching); without these definitions or an ablation isolating distillation from merging, it is impossible to evaluate whether the cross-lingual subspaces needed for code-switched Bambara utterances are preserved.
  Authors: We acknowledge these gaps in the method section. We will explicitly state the rank chosen for the low-rank decomposition of the embedding matrix. We will also provide the exact formulation of the feature-distillation loss, which is based on MSE between the hidden states of the encoder layers. Additionally, we will add an ablation study that isolates the contribution of feature distillation from that of layer merging. This will allow readers to assess how well the cross-lingual capabilities are preserved in the compressed model.
  Revision: yes
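The loss described in the second response can be sketched as below. The rebuttal only says the loss is MSE between hidden states, so the choice of which layers are paired, the equal layer weighting, and the function names are assumptions made for illustration.

```python
from typing import Sequence

Matrix = Sequence[Sequence[float]]  # one layer's hidden states: (time_steps, hidden_dim)


def mse(teacher: Matrix, student: Matrix) -> float:
    """Mean squared error between two equally shaped hidden-state matrices."""
    if len(teacher) != len(student):
        raise ValueError("time dimension mismatch")
    total, count = 0.0, 0
    for t_row, s_row in zip(teacher, student):
        if len(t_row) != len(s_row):
            raise ValueError("hidden dimension mismatch")
        for t, s in zip(t_row, s_row):
            total += (t - s) ** 2
            count += 1
    if count == 0:
        raise ValueError("empty hidden states")
    return total / count


def feature_distillation_loss(
    teacher_layers: Sequence[Matrix], student_layers: Sequence[Matrix]
) -> float:
    """Average of per-layer MSE over paired teacher/student layers (equal weights assumed)."""
    if not teacher_layers or len(teacher_layers) != len(student_layers):
        raise ValueError("need the same non-zero number of teacher and student layers")
    per_layer = [mse(t, s) for t, s in zip(teacher_layers, student_layers)]
    return sum(per_layer) / len(per_layer)


# Toy hidden states: 2 layers, 2 time steps, hidden size 2.
teacher_states = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [1.0, 1.0]]]
student_states = [[[1.0, 2.0], [3.0, 2.0]], [[0.0, 0.0], [1.0, 1.0]]]
loss = feature_distillation_loss(teacher_states, student_states)
print(loss)
```

If the student's hidden size differed from the teacher's, a learned projection would be needed before comparing states; the sketch assumes matching shapes.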
Circularity Check
No circularity: empirical pruning results are direct measurements
The paper describes an empirical pruning recipe for Whisper on Bambara using low-rank embedding decomposition, feature distillation, and layer merging. All reported outcomes (90% performance retention, 48% size reduction, 2.15x speedup) are presented as measured results on held-out data rather than predictions derived from fitted parameters or self-referential equations. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked to close a derivation loop; the method is justified by experimental outcomes in a data-scarce setting. The chain is therefore self-contained and externally falsifiable via replication on the same 32h corpus.
Reference graph
Works this paper leans on
- [1] INTRODUCTION. One approach for local edge-device inference for speech recognition is to prune a large, pre-trained, high-performing model into a small model. However, traditional pruning methods often require substantial amounts of retraining data. For example, Distill-Whisper [1] prunes the English-only version of Whisper [2] and further retrained on 21...
- [2] RELATED WORKS. Deep Encoder, Shallow Decoder. We take inspiration from [3], who show that the decoder part of encoder-decoder architectures can be very short while still achieving good accuracy and being faster. To make the model lighter and faster, we mainly compress the decoder part of Whisper. However, contrary to [3], which trains encoder-decoder models from scratch, we use a pre-trained Whisper and compress the decoder. ...
- [4] BALDWHISPER APPROACH. Motivation. We are interested in local offline deployment, preferably on mobile devices. To achieve efficient inference on edge devices, our method reduces model size and improves the generation speed using a two-stage compression approach. Both stages are designed to be data-efficient, since Bambara is a low-resource language with o...
- [5] EXPERIMENTS. 4.1. Method. Data. We use 32 hours of Bambara speech data from an openly available resource (footnote 1). We use 50 min for development, 1 h 20 min for testing, and the remainder for training. Implementation Details. We first fine-tuned the Whisper-73M parameters on the Bambara dataset for 20 epochs, on a single A100-80GB GPU, with a learning rate of 5e−5. After fin...
- [6] ANALYSIS. How to choose the merging parameters α and β? To choose the best merging weights α (importance of layer 1) and β (importance of layer 2), we performed a hyperparameter search using Bayesian optimization as implemented in the Ax library. We searched for 30 iterations, each time training on 30% of the training set and testing on 60% of the development set. F...
- [7] CONCLUSION. We design a new pruning approach to work in low-resource scenarios and apply it to Whisper for Bambara. We show that merging adjacent layers instead of pruning limits the performance drop. The compressed model, with half of the layers and low-rank embedding, is 2.15x faster and 48% smaller while maintaining over 90% of the performance of t...
- [8] Sanchit Gandhi, Patrick von Platen, and Alexander M. Rush, “Distil-whisper: Robust knowledge distillation via large-scale pseudo labelling,” 2023.
- [9] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever, “Robust speech recognition via large-scale weak supervision,” 2022.
- [10] Jungo Kasai, Nikolaos Pappas, Hao Peng, James Cross, and Noah Smith, “Deep encoder, shallow decoder: Reevaluating non-autoregressive machine translation,” in International Conference on Learning Representations, 2021.
- [11] Yoon Kim and Alexander M. Rush, “Sequence-level knowledge distillation,” in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Jian Su, Kevin Duh, and Xavier Carreras, Eds., Austin, Texas, Nov. 2016, pp. 1317–1327, Association for Computational Linguistics.
- [12] Mitchell Wortsman, Gabriel Ilharco, Samir Ya Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, and Ludwig Schmidt, “Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time,” in Proceedings of the 39th International Conferenc...
- [13] Prateek Yadav, Derek Tam, Leshem Choshen, Colin Raffel, and Mohit Bansal, “TIES-merging: Resolving interference when merging models,” in Thirty-seventh Conference on Neural Information Processing Systems, 2023.
- [14] Deyuan Liu, Zhanyue Qin, Hairu Wang, Zhao Yang, Zecheng Wang, Fangying Rong, Qingbin Liu, Yanchao Hao, Bo Li, Xi Chen, Cunhang Fan, Zhao Lv, Dianhui Chu, Zhiying Tu, and Dianbo Sui, “Pruning via merging: Compressing LLMs via manifold alignment based layer merging,” in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,...
- [15] Amine Abdaoui, Camille Pradel, and Grégoire Sigel, “Load what you need: Smaller versions of mutililingual BERT,” in Proceedings of SustaiNLP: Workshop on Simple and Efficient Natural Language Processing, Nafise Sadat Moosavi, Angela Fan, Vered Shwartz, Goran Glavaš, Shafiq Joty, Alex Wang, and Thomas Wolf, Eds., Online, Nov. 2020, pp. 119–123, Associ...
- [16] Truong Do, Minh-Phuong Nguyen, and Le Minh Nguyen, “PruneSLU: Efficient On-device Spoken Language Understanding through Vocabulary and Structural Pruning,” in Interspeech 2025, 2025, pp. 1738–1742.
- [17] Hao Yu and Jianxin Wu, “Compressing Transformers: Features Are Low-Rank, but Weights Are Not!,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 9, pp. 11007–11015, June 2023.
- [18] Patrick H Chen and Inderjit S Dhillon, “DRONE: Data-aware Low-rank Compression for Large NLP Models,” 2021.
- [19] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean, “Distilling the knowledge in a neural network,” 2015.
- [20] Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, and Yongbin Li, “Language models are super mario: absorbing abilities from homologous models as a free lunch,” in Proceedings of the 41st International Conference on Machine Learning. 2024, ICML’24, JMLR.org.