BaldWhisper: Faster Whisper with Head Shearing and Layer Merging

Christophe Cerisara; Irina Illina; Yaya Sy

arXiv: 2510.08599 · v2 · submitted 2025-10-06 · 📡 eess.AS · cs.AI · cs.CL · cs.SD

Pith reviewed 2026-05-18 09:12 UTC · model grok-4.3

keywords Whisper pruning · speech model compression · low-resource ASR · Bambara language · layer merging · low-rank decomposition · feature distillation · edge inference

The pith

Whisper can be pruned to be 48 percent smaller and 2.15 times faster while retaining 90 percent of its performance on Bambara, using only limited data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tackles pruning large speech models like Whisper when only small amounts of training data exist for a target language. For Bambara with 32 hours of data, the authors replace vocabulary pruning with low-rank decomposition of embeddings plus feature distillation, and they merge layers instead of deleting them. This produces a model that keeps 90 percent of the original accuracy. The compressed version runs 2.15 times faster on a MacBook Air M1 and occupies 48 percent less space. The recipe demonstrates how to adapt large audio transformers to low-resource settings without the thousands of hours of retraining data used in earlier distillation work.

Core claim

By compressing embeddings via low-rank decomposition and feature distillation while merging layers rather than removing them, the method yields a pruned Whisper model that preserves 90 percent of original performance on Bambara speech-to-text, reduces size by 48 percent, and delivers 2.15 times faster inference on edge hardware, all without relying on massive retraining corpora.

What carries the argument

Head shearing (low-rank decomposition of the embeddings, recovered with feature distillation) combined with layer merging; jointly, these compress the model while limiting accuracy loss in data-scarce conditions.
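To make the size argument concrete, here is a minimal sketch of the parameter accounting behind a low-rank embedding factorization. It assumes the dense embedding matrix E (vocabulary size V by model width d) is approximated by a product A·B with A of shape V×r and B of shape r×d. The values of `VOCAB`, `D_MODEL`, and `RANK` are hypothetical and are not taken from the paper, which (per the referee report) does not state its rank on this page.

```python
def embedding_params(vocab_size: int, d_model: int) -> int:
    """Parameter count of a dense vocab_size x d_model embedding matrix."""
    return vocab_size * d_model


def low_rank_params(vocab_size: int, d_model: int, rank: int) -> int:
    """Parameter count after factoring E ~= A @ B (A: vocab x rank, B: rank x d_model)."""
    return rank * (vocab_size + d_model)


# Hypothetical sizes, chosen only for illustration (not from the paper).
VOCAB, D_MODEL, RANK = 50_000, 512, 64

dense = embedding_params(VOCAB, D_MODEL)
factored = low_rank_params(VOCAB, D_MODEL, RANK)
saving = 1 - factored / dense

print(f"dense={dense:,} factored={factored:,} saving={saving:.1%}")
```

Note that every vocabulary row survives the factorization, which is the property that matters when speakers code-switch: no token is dropped, only the width of its representation shrinks.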

If this is right

The model runs 2.15 times faster on a MacBook Air M1.
It retains 90 percent of original performance with only 32 hours of Bambara data.
Avoiding vocabulary pruning handles code-switching common in Bambara speech.
Layer merging reduces performance drop compared with simple layer removal.
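The difference between merging and removing layers can be sketched as follows. This is an illustrative example, assuming that two adjacent layers are merged by a weighted sum of their parameters with weights `alpha` and `beta` (the excerpts further down mention such weights), and that layers are paired 0 with 1, 2 with 3, and so on. The exact merge rule and pairing used by the authors are not shown on this page; `merge_layers` and `merge_adjacent` are hypothetical helper names.

```python
from typing import Dict, List

# A layer is modelled as a mapping from parameter name to a flat list of weights.
Params = Dict[str, List[float]]


def merge_layers(layer_a: Params, layer_b: Params, alpha: float, beta: float) -> Params:
    """Return alpha * layer_a + beta * layer_b, parameter by parameter."""
    if layer_a.keys() != layer_b.keys():
        raise ValueError("layers must expose the same parameter names")
    merged: Params = {}
    for name, a_vals in layer_a.items():
        b_vals = layer_b[name]
        if len(a_vals) != len(b_vals):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [alpha * a + beta * b for a, b in zip(a_vals, b_vals)]
    return merged


def merge_adjacent(layers: List[Params], alpha: float, beta: float) -> List[Params]:
    """Merge layers pairwise (assumed pairing: 0+1, 2+3, ...), halving the depth."""
    if len(layers) % 2 != 0:
        raise ValueError("expected an even number of layers")
    return [
        merge_layers(layers[i], layers[i + 1], alpha, beta)
        for i in range(0, len(layers), 2)
    ]


# Toy four-layer stack with a single two-element parameter per layer.
stack = [{"w": [1.0, 1.0]}, {"w": [3.0, 3.0]}, {"w": [0.0, 2.0]}, {"w": [4.0, 2.0]}]
halved = merge_adjacent(stack, alpha=0.5, beta=0.5)
print(halved)
```

Plain removal would keep one layer of each pair and discard the other; the weighted sum keeps a contribution from both, which is the intuition behind the smaller performance drop.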

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same compression steps could be tested on other low-resource languages beyond Bambara.
Layer merging might reduce retraining needs when pruning other transformer speech models.
Combining the approach with quantization could yield additional speed gains on mobile devices.

Load-bearing premise

Low-rank decomposition of embeddings combined with feature distillation and layer merging can recover most of the original performance on Bambara without needing the massive retraining data used in prior distillation approaches.

What would settle it

Evaluating the final pruned model on a held-out Bambara test set and finding word error rate more than 10 percent higher than the original Whisper would show that 90 percent performance was not preserved.
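The check described above can be sketched in a few lines. This is a minimal example, assuming that performance is measured by word error rate (WER) and that "10 percent higher" means a relative increase over the baseline WER; how the paper defines its 90 percent retention is not stated on this page. The function names and the WER values in the usage lines are hypothetical.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Dynamic-programming Levenshtein distance over words, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, substitution))
        prev = cur
    return prev[-1] / len(ref)


def relative_wer_increase(baseline_wer: float, pruned_wer: float) -> float:
    """Relative WER degradation of the pruned model with respect to the baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (pruned_wer - baseline_wer) / baseline_wer


def exceeds_budget(baseline_wer: float, pruned_wer: float, budget: float = 0.10) -> bool:
    """True if the pruned model degrades WER by more than the allowed budget."""
    return relative_wer_increase(baseline_wer, pruned_wer) > budget


# Hypothetical corpus-level WER values, for illustration only.
print(exceeds_budget(baseline_wer=0.40, pruned_wer=0.42))
print(exceeds_budget(baseline_wer=0.40, pruned_wer=0.50))
```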

Original abstract

Pruning large pre-trained transformers in a data-scarce scenario is challenging, as it often requires massive retraining data to recover performance. For instance, Distill-Whisper prunes Whisper by 40% and retrains on 21,000 hours of speech, far beyond what is available for most languages. Can Whisper be made lighter and faster for edge devices in data-scarce settings? Focusing on Bambara with only 32h of speech-to-text data, we propose a new pruning recipe. Instead of vocabulary pruning, which is unsuitable due to frequent code-switching by Bambara speakers, we compress the embeddings with low-rank decomposition and feature distillation. Rather than removing layers, we merge them to limit performance loss. The final model preserves 90% of the original performance while being 48% smaller and 2.15x faster on a MacBook Air M1.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper gives a targeted pruning recipe for Whisper on code-switched Bambara that skips vocabulary pruning, uses low-rank embedding compression plus distillation and layer merging, and claims 90% performance retention with only 32 hours of data.

This work proposes a pruning strategy for Whisper ASR models aimed at low-resource, code-switched languages such as Bambara. Using low-rank decomposition of the embeddings combined with feature distillation and layer merging, the authors report a model that is 48 percent smaller and 2.15 times faster while retaining 90% of the original performance, trained on just 32 hours of data.

Referee Report

2 major / 1 minor

Summary. The paper introduces BaldWhisper, a pruning recipe for the Whisper ASR model targeted at Bambara using only 32 hours of paired speech-to-text data. Due to frequent French/English code-switching, the authors avoid vocabulary pruning and instead apply low-rank decomposition plus feature distillation to the embedding matrix, while using layer merging rather than layer removal. They report that the resulting model retains 90% of the original performance, is 48% smaller, and runs 2.15 times faster on a MacBook Air M1.

Significance. If the empirical claims are substantiated, the work would offer a practical path for compressing large ASR transformers in genuinely data-scarce, code-switched settings without the 21,000-hour retraining budgets required by Distill-Whisper. The combination of low-rank embedding compression and layer merging, rather than aggressive pruning, is a potentially reusable recipe for edge deployment of multilingual models.

major comments (2)

[Abstract] Abstract: the central claim that low-rank decomposition plus feature distillation plus layer merging recovers 90 % of Whisper performance on Bambara with only 32 h of data is presented without any reported metric (WER, CER, or otherwise), baseline model version, error bars, or comparison against a naïve fine-tune or Distill-Whisper on the same Bambara corpus; this absence is load-bearing because the abstract supplies the only quantitative evidence for the claim.
[Method] Method section (presumed §3): no rank value is stated for the low-rank decomposition of the embedding matrix, and the feature-distillation loss is not formulated (e.g., whether it is MSE on encoder hidden states, attention maps, or logit matching); without these definitions or an ablation isolating distillation from merging, it is impossible to evaluate whether the cross-lingual subspaces needed for code-switched Bambara utterances are preserved.

minor comments (1)

[Abstract] Abstract: 'preserves 90 of the original performance' should read 'preserves 90 % of the original performance'; '48 smaller' should read '48 % smaller'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for their insightful comments, which will help improve the clarity and completeness of our paper. Below, we provide detailed responses to each major comment.

Referee: [Abstract] Abstract: the central claim that low-rank decomposition plus feature distillation plus layer merging recovers 90 % of Whisper performance on Bambara with only 32 h of data is presented without any reported metric (WER, CER, or otherwise), baseline model version, error bars, or comparison against a naïve fine-tune or Distill-Whisper on the same Bambara corpus; this absence is load-bearing because the abstract supplies the only quantitative evidence for the claim.

Authors: We agree that the abstract should be more self-contained with specific details. In the revised version, we will specify the performance metric used (word error rate on the Bambara evaluation set), the baseline Whisper model version, and include a comparison against a naïve fine-tuning of the original model on the same 32-hour corpus. We will also report error bars from our experimental runs. Note that a direct comparison to Distill-Whisper is not feasible in this data-scarce setting as it requires 21,000 hours of data, but the naïve fine-tune serves as the appropriate baseline. revision: yes
Referee: [Method] Method section (presumed §3): no rank value is stated for the low-rank decomposition of the embedding matrix, and the feature-distillation loss is not formulated (e.g., whether it is MSE on encoder hidden states, attention maps, or logit matching); without these definitions or an ablation isolating distillation from merging, it is impossible to evaluate whether the cross-lingual subspaces needed for code-switched Bambara utterances are preserved.

Authors: We acknowledge these gaps in the method section. We will explicitly state the rank chosen for the low-rank decomposition of the embedding matrix. We will also provide the exact formulation of the feature-distillation loss, which is based on MSE between the hidden states of the encoder layers. Additionally, we will add an ablation study that isolates the contribution of feature distillation from that of layer merging. This will allow readers to assess how well the cross-lingual capabilities are preserved in the compressed model. revision: yes
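The loss named in the simulated response above can be illustrated with a short sketch. It assumes a mean-squared error between teacher and student hidden states, averaged uniformly over paired states; which layers are paired and whether any weighting is applied are assumptions, since the paper's formulation is not reproduced on this page. `mse` and `feature_distillation_loss` are hypothetical helper names, and the vectors are toy values.

```python
from typing import Sequence

Vector = Sequence[float]


def mse(teacher: Vector, student: Vector) -> float:
    """Mean squared error between two hidden-state vectors of equal length."""
    if len(teacher) != len(student) or len(teacher) == 0:
        raise ValueError("vectors must be non-empty and of equal length")
    return sum((t - s) ** 2 for t, s in zip(teacher, student)) / len(teacher)


def feature_distillation_loss(
    teacher_states: Sequence[Vector], student_states: Sequence[Vector]
) -> float:
    """Average MSE over paired teacher/student hidden states (uniform weighting assumed)."""
    if len(teacher_states) != len(student_states) or not teacher_states:
        raise ValueError("need the same non-zero number of teacher and student states")
    per_state = [mse(t, s) for t, s in zip(teacher_states, student_states)]
    return sum(per_state) / len(per_state)


# Toy hidden states: two positions, two features each.
teacher = [[1.0, 2.0], [0.0, 0.0]]
student = [[1.0, 4.0], [0.0, 0.0]]
loss = feature_distillation_loss(teacher, student)
print(loss)
```

In a real training loop the same quantity would be computed on tensors and back-propagated through the student only, with the teacher kept frozen.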

Circularity Check

0 steps flagged

No circularity: empirical pruning results are direct measurements

The paper describes an empirical pruning recipe for Whisper on Bambara using low-rank embedding decomposition, feature distillation, and layer merging. All reported outcomes (90% performance retention, 48% size reduction, 2.15x speedup) are presented as measured results on held-out data rather than predictions derived from fitted parameters or self-referential equations. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked to close a derivation loop; the method is justified by experimental outcomes in a data-scarce setting. The chain is therefore self-contained and externally falsifiable via replication on the same 32h corpus.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities are stated.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · 1 internal anchor

[1] INTRODUCTION. One approach for local edge-device inference for speech recognition is to prune a large, pre-trained, high-performing model into a small model. However, traditional pruning methods often require substantial amounts of retraining data. For example, Distill-Whisper [1] prunes the English-only version of Whisper [2] and further retrained on 21...

[2] RELATED WORKS. Deep Encoder, Shallow Decoder. We take inspiration from

[3] BaldWhisper: Faster Whisper with Head Shearing and Layer Merging (arXiv, 2025). who show that the decoder part of encoder-decoder architectures can be very short while still achieving good accuracy and being faster. To make the model lighter and faster, we mainly compress the decoder part of Whisper. However, contrary to [3], which trains encoder-decoder models from scratch, we use a pre-trained Whisper and compress the decoder. ...

[4] BALDWHISPER APPROACH. Motivation. We are interested in local offline deployment, preferably on mobile devices. To achieve efficient inference on edge devices, our method reduces model size and improves the generation speed using a two-stage compression approach. Both stages are designed to be data-efficient, since Bambara is a low-resource language with o...

[5] EXPERIMENTS. 4.1. Method. Data. We use 32 hours of Bambara speech data from an openly available resource (footnote 1). We use 50 min for development, 1 h 20 min for testing and the remaining for training. Implementation Details. We first fine-tuned the Whisper-73M parameters on the Bambara dataset for 20 epochs, on a single A100-80GB GPU, with a learning rate of 5e−5. After fin...

[6] ANALYSIS. How to choose the merging parameters α & β? To choose the best merging weights α (importance of layer 1) and β (importance of layer 2), we performed a hyperparameter search using Bayesian optimization as implemented in the Ax library. We searched for 30 iterations by training at each time on 30% of the training set and testing on 60% of the development set. F...

[7] CONCLUSION. We design a new pruning approach to work in low-resource scenario and applied it to Whisper for Bambara. We show that merging adjacent layers instead of pruning limits the performance drop. The compressed model, with half of the layers and low-rank embedding, is 2.15x faster and 48% smaller while maintaining over 90% of the performance of t...
[8] Sanchit Gandhi, Patrick von Platen, and Alexander M. Rush, “Distil-whisper: Robust knowledge distillation via large-scale pseudo labelling,” 2023

[9] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever, “Robust speech recognition via large-scale weak supervision,” 2022

[10] Jungo Kasai, Nikolaos Pappas, Hao Peng, James Cross, and Noah Smith, “Deep encoder, shallow decoder: Reevaluating non-autoregressive machine translation,” in International Conference on Learning Representations, 2021

[11] Yoon Kim and Alexander M. Rush, “Sequence-level knowledge distillation,” in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Jian Su, Kevin Duh, and Xavier Carreras, Eds., Austin, Texas, Nov. 2016, pp. 1317–1327, Association for Computational Linguistics

[12] Mitchell Wortsman, Gabriel Ilharco, Samir Ya Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, and Ludwig Schmidt, “Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time,” in Proceedings of the 39th International Conferenc...

[13] Prateek Yadav, Derek Tam, Leshem Choshen, Colin Raffel, and Mohit Bansal, “TIES-merging: Resolving interference when merging models,” in Thirty-seventh Conference on Neural Information Processing Systems, 2023

[14] Deyuan Liu, Zhanyue Qin, Hairu Wang, Zhao Yang, Zecheng Wang, Fangying Rong, Qingbin Liu, Yanchao Hao, Bo Li, Xi Chen, Cunhang Fan, Zhao Lv, Dianhui Chu, Zhiying Tu, and Dianbo Sui, “Pruning via merging: Compressing LLMs via manifold alignment based layer merging,” in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,...

[15] Amine Abdaoui, Camille Pradel, and Grégoire Sigel, “Load what you need: Smaller versions of multilingual BERT,” in Proceedings of SustaiNLP: Workshop on Simple and Efficient Natural Language Processing, Nafise Sadat Moosavi, Angela Fan, Vered Shwartz, Goran Glavaš, Shafiq Joty, Alex Wang, and Thomas Wolf, Eds., Online, Nov. 2020, pp. 119–123, Associ...

[16] Truong Do, Minh-Phuong Nguyen, and Le Minh Nguyen, “PruneSLU: Efficient On-device Spoken Language Understanding through Vocabulary and Structural Pruning,” in Interspeech 2025, 2025, pp. 1738–1742

[17] Hao Yu and Jianxin Wu, “Compressing Transformers: Features Are Low-Rank, but Weights Are Not!,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 9, pp. 11007–11015, June 2023

[18] Patrick H Chen and Inderjit S Dhillon, “DRONE: Data-aware Low-rank Compression for Large NLP Models,” 2021

[19] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean, “Distilling the knowledge in a neural network,” 2015

[20] Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, and Yongbin Li, “Language models are super mario: absorbing abilities from homologous models as a free lunch,” in Proceedings of the 41st International Conference on Machine Learning. 2024, ICML’24, JMLR.org
