BaldWhisper: Faster Whisper with Head Shearing and Layer Merging

2025 · eess.AS · arXiv 2510.08599

1 Pith paper cites this work. Polarity classification is still indexing.


abstract

Pruning large pre-trained transformers in a data-scarce scenario is challenging, as it often requires massive retraining data to recover performance. For instance, Distill-Whisper prunes Whisper by 40% and retrains on 21,000 hours of speech, far beyond what is available for most languages. Can Whisper be made lighter and faster for edge devices in data-scarce settings? Focusing on Bambara with only 32h of speech-to-text data, we propose a new pruning recipe. Instead of vocabulary pruning, which is unsuitable due to frequent code-switching by Bambara speakers, we compress the embeddings with low-rank decomposition and feature distillation. Rather than removing layers, we merge them to limit performance loss. The final model preserves 90% of the original performance while being 48% smaller and 2.15x faster on a MacBook Air M1.
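The abstract's low-rank embedding compression can be illustrated with a minimal sketch. It assumes truncated SVD as the factorization, which is one standard choice and is not confirmed here as the authors' exact method. The matrix sizes, the rank, and the helper name `low_rank_factorize` are hypothetical toy values chosen for illustration, and the feature-distillation step is not shown.

```python
import numpy as np


def low_rank_factorize(embedding, rank):
    """Split a (vocab, dim) matrix into (vocab, rank) and (rank, dim) factors.

    Uses truncated SVD; this is an assumed technique, not the paper's code.
    """
    u, s, vt = np.linalg.svd(embedding, full_matrices=False)
    a = u[:, :rank] * s[:rank]  # scale the kept left singular vectors
    b = vt[:rank, :]            # kept right singular vectors
    return a, b


# Toy sizes (hypothetical, far smaller than Whisper's real embedding table).
vocab, dim, rank = 1000, 64, 16
rng = np.random.default_rng(0)

# Synthetic embedding that is close to rank 16, plus a little noise.
embedding = rng.normal(size=(vocab, rank)) @ rng.normal(size=(rank, dim))
embedding += 0.01 * rng.normal(size=(vocab, dim))

a, b = low_rank_factorize(embedding, rank)

params_before = embedding.size   # vocab * dim
params_after = a.size + b.size   # vocab * rank + rank * dim
rel_error = np.linalg.norm(embedding - a @ b) / np.linalg.norm(embedding)

print(f"parameters: {params_before} -> {params_after}")
print(f"relative reconstruction error: {rel_error:.4f}")
```

The factored form stores `vocab * rank + rank * dim` values instead of `vocab * dim`, so the saving grows as the rank shrinks. Real embeddings are usually not this close to low rank, which is presumably why the abstract pairs the decomposition with feature distillation to recover accuracy.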

representative citing papers

BaldWhisper: Faster Whisper with Head Shearing and Layer Merging

eess.AS · 2025-10-06 · unverdicted · novelty 5.0

A new pruning recipe for Whisper on Bambara with 32h of data uses low-rank embedding compression, feature distillation, and layer merging to produce a model that is 48% smaller and 2.15x faster while retaining 90% of the original performance.

