arxiv: 2510.24570 · v2 · submitted 2025-10-28 · 💻 cs.CL

BEST-RQ-Based Self-Supervised Learning for Whisper Domain Adaptation

Pith reviewed 2026-05-18 02:52 UTC · model grok-4.3

classification 💻 cs.CL
keywords self-supervised learning · domain adaptation · Whisper · automatic speech recognition · BEST-RQ · knowledge distillation · air traffic control

The pith

The BEARD framework adapts Whisper's encoder to new domains by combining BEST-RQ self-supervised learning with knowledge distillation from a frozen teacher, delivering a 12% relative improvement on air traffic control speech using mostly un

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BEARD to adapt Whisper's encoder in low-resource settings where labeled data for a new domain is scarce. It combines the BEST-RQ objective with distillation from a frozen teacher encoder so that the updated encoder stays compatible with the existing decoder. Experiments on the ATCO2 air traffic control corpus use roughly 5,000 hours of untranscribed speech for the adaptation stage and only 2 hours of transcribed speech for final fine-tuning. This setup produces a 12% relative gain over a model that was fine-tuned without the preceding self-supervised stage. The work is presented as the first application of a self-supervised objective specifically for domain adaptation of Whisper.

Core claim

The paper establishes that the BEARD framework, which applies the BEST-RQ self-supervised objective together with knowledge distillation from a frozen teacher encoder, enables effective domain adaptation of Whisper's encoder using primarily unlabeled speech data. On the ATCO2 corpus of air traffic control communications, this yields a 12% relative improvement in recognition performance compared to fine-tuning alone when only 2 hours of transcribed data are available for the final step.

What carries the argument

The BEARD (BEST-RQ Encoder Adaptation with Re-training and Distillation) framework, which merges the BEST-RQ self-supervised objective with knowledge distillation from a frozen teacher encoder to update the encoder while preserving complementarity with the pre-trained decoder.
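
As a rough illustration of the two ingredients named above, the following Python sketch builds BEST-RQ-style targets with a frozen random projection and a frozen random codebook, then adds a distillation penalty toward a frozen teacher. Everything specific here is an assumption made for illustration: the function names, the mean-squared-error form of the distillation term, and the `distill_weight` parameter are hypothetical and are not taken from the paper.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def make_quantizer(feat_dim, code_dim, num_codes, seed=0):
    """Create a frozen random projection and codebook; neither is trained."""
    rng = random.Random(seed)
    projection = [[rng.gauss(0.0, 1.0) for _ in range(feat_dim)]
                  for _ in range(code_dim)]
    codebook = [l2_normalize([rng.gauss(0.0, 1.0) for _ in range(code_dim)])
                for _ in range(num_codes)]
    return projection, codebook


def quantize(frame, projection, codebook):
    """Map one feature frame to the index of its nearest codebook entry."""
    projected = l2_normalize(
        [sum(w * x for w, x in zip(row, frame)) for row in projection])
    # With unit-length vectors, the largest dot product is the nearest code.
    scores = [sum(p * c for p, c in zip(projected, code)) for code in codebook]
    return max(range(len(codebook)), key=scores.__getitem__)


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_z - logits[target]


def mean_squared_error(a, b):
    """Average squared difference between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def combined_loss(student_logits, targets, masked_positions,
                  student_out, teacher_out, distill_weight=1.0):
    """Masked-prediction loss plus a distillation penalty (assumed MSE form)."""
    ssl = sum(cross_entropy(student_logits[t], targets[t])
              for t in masked_positions) / len(masked_positions)
    distill = sum(mean_squared_error(s, t)
                  for s, t in zip(student_out, teacher_out)) / len(student_out)
    return ssl + distill_weight * distill
```

In this sketch the targets come from `quantize` applied to the input features, so the adaptation stage needs no transcripts; only the student encoder would receive gradient updates, while the projection, the codebook, and the teacher stay fixed.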

If this is right

  • The adapted model shows higher accuracy on noisy, non-native, domain-specific speech such as air traffic control communications.
  • Large quantities of unlabeled domain data can be used for encoder adaptation without requiring matching volumes of new transcriptions.
  • The pre-trained decoder continues to function effectively after the encoder update because the distillation step maintains complementarity.
  • This constitutes the first reported use of a self-supervised learning objective for domain adaptation of Whisper.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same distillation-plus-self-supervised pattern could be tested on other pre-trained ASR models that face domain shifts with limited labels.
  • Varying the amount of unlabeled data while holding labeled data fixed would show how performance scales with the size of the adaptation corpus.
  • The approach implies that encoder-only updates can be made to large speech models without retraining the entire system from scratch.
  • Applying BEARD to additional specialized domains would test whether the complementarity preservation holds beyond air traffic control speech.

Load-bearing premise

The assumption that combining the BEST-RQ objective with knowledge distillation from a frozen teacher encoder will ensure the adapted encoder remains complementary to the pre-trained decoder.

What would settle it

If the word error rate on the ATCO2 test set after BEARD adaptation plus 2-hour fine-tuning is not approximately 12% lower than the word error rate of the fine-tuned baseline alone, or if encoder changes visibly degrade decoder compatibility, the central performance claim would be falsified.
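
The check described above reduces to simple arithmetic on two word error rates. The short Python sketch below shows that computation; the function name is hypothetical and the absolute WER values are invented purely to illustrate a 12% relative reduction, since the abstract does not report absolute figures.

```python
def relative_wer_reduction(baseline_wer, adapted_wer):
    """Fraction by which the adapted system lowers WER relative to the baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - adapted_wer) / baseline_wer


# Hypothetical absolute WERs (in %), chosen only to illustrate the arithmetic.
example = relative_wer_reduction(baseline_wer=25.0, adapted_wer=22.0)
print(f"relative reduction: {example:.1%}")
```

A relative figure alone hides the operating point: the same 12% could correspond to a drop of 3 points from a 25% baseline or 1.2 points from a 10% baseline, which is why the absolute values matter for interpreting the claim.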

Original abstract

Automatic Speech Recognition (ASR) systems, despite large multilingual training, struggle in low-resource scenarios where labeled data is scarce. We propose BEARD (BEST-RQ Encoder Adaptation with Re-training and Distillation), a novel framework designed to adapt Whisper's encoder with unlabeled data. Unlike traditional self-supervised learning methods, BEARD uniquely combines a BEST-RQ objective with knowledge distillation from a frozen teacher encoder, ensuring the encoder's complementarity with the pre-trained decoder. Our experiments focus on the ATCO2 corpus from the challenging Air Traffic Control (ATC) communications domain, characterized by non-native speech, noise, and specialized phraseology. Using about 5,000 hours of untranscribed speech for BEARD and 2 hours of transcribed speech for fine-tuning, the proposed approach significantly outperforms previous baseline and fine-tuned model, achieving a relative improvement of 12% compared to the fine-tuned model. To the best of our knowledge, this is the first work to use a self-supervised learning objective for domain adaptation of Whisper.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes BEARD (BEST-RQ Encoder Adaptation with Re-training and Distillation), a self-supervised framework for domain adaptation of Whisper's encoder. It combines the BEST-RQ objective on unlabeled speech with knowledge distillation from a frozen teacher encoder to preserve complementarity with the pre-trained decoder, followed by limited supervised fine-tuning. On the ATCO2 air-traffic-control corpus, the method reports a 12% relative WER improvement over a fine-tuned baseline when using ~5,000 hours of untranscribed data for adaptation and 2 hours of transcribed data for fine-tuning, and claims to be the first application of SSL for Whisper domain adaptation.

Significance. If the performance gains prove robust and the distillation term successfully maintains decoder compatibility, the result would offer a practical route for adapting large multilingual ASR models to specialized, low-resource domains where labeled data is scarce but raw audio is plentiful. The focus on the noisy, accented, phraseology-heavy ATC setting adds immediate engineering relevance.

major comments (2)
  1. Abstract: the headline claim of a 12% relative WER improvement supplies neither absolute WER values for the baseline and proposed systems, nor error bars, statistical significance tests, or explicit data-split descriptions. Without these quantities the magnitude and reliability of the reported gain cannot be assessed.
  2. Abstract (definition of BEARD): the assertion that BEST-RQ plus distillation from the frozen teacher 'ensures the encoder's complementarity with the pre-trained decoder' is presented as a design property but is not accompanied by any supporting ablation, representation-similarity metric, or decoder-isolation experiment. Because the entire adaptation pipeline rests on the adapted encoder remaining usable by the original Whisper decoder, this unverified assumption is load-bearing for the central claim.
minor comments (1)
  1. Abstract: the challenges of the ATCO2 domain (non-native speech, noise, specialized phraseology) are stated qualitatively; a short quantitative characterization (e.g., SNR statistics or vocabulary overlap with Whisper training data) would help readers gauge the severity of the domain shift.
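
One common way to supply the error bars requested in major comment 1 is a paired bootstrap over test utterances. The Python sketch below is an illustrative assumption, not a procedure described in the paper: it takes hypothetical per-utterance `(errors, reference_words)` pairs for the two systems, resamples utterances with replacement, and returns a percentile interval for the absolute WER gain.

```python
import random


def corpus_wer(pairs):
    """Corpus-level WER from (errors, reference_words) pairs."""
    errors = sum(e for e, _ in pairs)
    words = sum(n for _, n in pairs)
    return errors / words


def bootstrap_wer_gain(baseline, adapted, num_samples=1000, seed=0, alpha=0.05):
    """Percentile confidence interval for baseline WER minus adapted WER.

    Both inputs list (errors, reference_words) for the same utterances, in the
    same order, so each resample compares the systems on identical audio.
    """
    if len(baseline) != len(adapted):
        raise ValueError("systems must be scored on the same utterances")
    rng = random.Random(seed)
    n = len(baseline)
    gains = []
    for _ in range(num_samples):
        idx = [rng.randrange(n) for _ in range(n)]
        wer_base = corpus_wer([baseline[i] for i in idx])
        wer_adapt = corpus_wer([adapted[i] for i in idx])
        gains.append(wer_base - wer_adapt)
    gains.sort()
    lower = gains[int((alpha / 2) * num_samples)]
    upper = gains[int((1 - alpha / 2) * num_samples) - 1]
    return lower, upper
```

If the resulting interval excludes zero, the gain is unlikely to be an artifact of which utterances happened to land in the test set; this says nothing about variance across training runs, which would need repeated training with different seeds.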

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and describe the changes planned for the revised manuscript.

Point-by-point responses
  1. Referee: Abstract: the headline claim of a 12% relative WER improvement supplies neither absolute WER values for the baseline and proposed systems, nor error bars, statistical significance tests, or explicit data-split descriptions. Without these quantities the magnitude and reliability of the reported gain cannot be assessed.

    Authors: We agree that the abstract would be clearer with absolute numbers and setup details. In the revision we will insert the absolute WER figures for the fine-tuned baseline and the BEARD system, together with a concise description of the 5,000-hour unlabeled ATCO2 split used for adaptation and the 2-hour transcribed split used for fine-tuning. If multiple runs were performed we will also report standard deviations; otherwise we will state the limitation explicitly. revision: yes

  2. Referee: Abstract (definition of BEARD): the assertion that BEST-RQ plus distillation from the frozen teacher 'ensures the encoder's complementarity with the pre-trained decoder' is presented as a design property but is not accompanied by any supporting ablation, representation-similarity metric, or decoder-isolation experiment. Because the entire adaptation pipeline rests on the adapted encoder remaining usable by the original Whisper decoder, this unverified assumption is load-bearing for the central claim.

    Authors: The distillation loss is introduced precisely to keep the adapted encoder outputs aligned with those of the original Whisper encoder, thereby preserving compatibility with the frozen decoder. We acknowledge that the submitted manuscript provides no quantitative verification of this effect. We will add (i) a representation-similarity analysis (cosine similarity between adapted and teacher encoder activations on held-out audio) and (ii) an ablation that removes the distillation term and measures the resulting WER when the original decoder is used. These results will appear in the experiments section or an appendix of the revised paper. revision: yes
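
The representation-similarity analysis proposed in the second response can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: activations are represented as plain lists of per-frame vectors, and the helper names are hypothetical rather than taken from the paper or its code release.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_frame_similarity(adapted_frames, teacher_frames):
    """Average frame-by-frame cosine similarity between two encoders' outputs."""
    if len(adapted_frames) != len(teacher_frames):
        raise ValueError("both encoders must produce the same number of frames")
    sims = [cosine_similarity(a, t)
            for a, t in zip(adapted_frames, teacher_frames)]
    return sum(sims) / len(sims)
```

Values near 1.0 on held-out audio would indicate that the adapted encoder still produces outputs pointing in roughly the same direction as the teacher's. Cosine similarity ignores magnitude, so it would not by itself show that the decoder receives inputs on the scale it expects.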

Circularity Check

0 steps flagged

No derivation chain; empirical adaptation procedure

The paper presents BEARD as an empirical framework combining BEST-RQ self-supervised learning with knowledge distillation for Whisper encoder adaptation on unlabeled ATC data, followed by limited fine-tuning. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described structure. Central claims rest on reported WER improvements from experiments (e.g., 12% relative gain), which are externally falsifiable via replication on the ATCO2 corpus and do not reduce to self-definitional inputs or ansatzes smuggled via prior work. This is a standard honest non-finding for an applied adaptation study.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The contribution rests on standard self-supervised and distillation techniques plus the unverified assumption that the proposed combination preserves decoder compatibility; no new physical entities or formal axioms are introduced.

free parameters (1)
  • BEST-RQ and distillation hyperparameters
    These control the adaptation process and are presumably tuned on the unlabeled data but are not specified in the abstract.
axioms (1)
  • domain assumption: Distillation from the frozen teacher encoder ensures complementarity with the pre-trained decoder
    Invoked directly in the abstract's description of how BEARD maintains encoder-decoder compatibility.
invented entities (1)
  • BEARD framework (no independent evidence)
    purpose: Named method for encoder adaptation combining BEST-RQ and distillation
    Newly introduced named procedure whose effectiveness is the central empirical claim.

pith-pipeline@v0.9.0 · 5707 in / 1538 out tokens · 56367 ms · 2026-05-18T02:52:56.680333+00:00 · methodology

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · 1 internal anchor

  1. [1]

    BEST-RQ-Based Self-Supervised Learning for Whisper Domain Adaptation

    INTRODUCTION Automatic speech recognition (ASR) has reached near-human accuracy in many domains [1]. The arrival of large-scale end-to-end models made these models easier to use out-of-the-box. However, despite being trained on massive multilingual datasets, these models still struggle with out-of-domain scenarios, like out-of-vocabulary words, spontane...

  2. [2]

    Whisper Whisper, end-to-end encoder-decoder Transformer, is a state-of-the-art model for automatic speech recognition [20]

    PROPOSED METHODOLOGY 2.1. Whisper Whisper, end-to-end encoder-decoder Transformer, is a state-of-the-art model for automatic speech recognition [20]. It has been trained on 680,000 hours of transcribed multilingual data. Whisper’s encoder is mostly focused on acoustic features, while its decoder is mostly focused on linguistic features. Whisper differs...

  3. [3]

    Dataset We conducted our experiments using the ATCO2 dataset 1 [13]

    EXPERIMENTAL SETTINGS 3.1. Dataset We conducted our experiments using the ATCO2 dataset 1 [13]. It contains air traffic control communications between pilots and air traffic controllers from various airports. The speech is non-native, with high speech rate and noisy, with signal-to-noise ratios (SNR) varying from -10dB to 40dB, estimated using WADA-SNR [...

  4. [4]

    Baselines We consider four baselines

    RESULTS AND DISCUSSIONS 4.1. Baselines We consider four baselines. The first two are prior works on the ATCO2 dataset. An XLS-R model fine-tuned on 132 hours of ATC speech from diverse corpora (referred to as XLS-R FT) [18], ATCO2 was only used for testing. This is the first baseline that was presented 3https://gitlab.inria.fr/rbagat/beard Table 2: WER (%...

  5. [5]

    We introduced BEARD, a framework that combines self-supervised learning and distillation to adapt Whisper’s encoder using unlabeled speech

    CONCLUSION In this work, we investigated whether self-supervised learning can help Whisper adapt to a new domain. We introduced BEARD, a framework that combines self-supervised learning and distillation to adapt Whisper’s encoder using unlabeled speech. The modified encoder is then fine-tuned with the decoder using a limited amount of labeled data. On t...

  6. [6]

    Toward human parity in conversational speech recognition,

    W. Xiong, J. Droppo, X. Huang, F. Seide, M.L. Seltzer, A. Stolcke, D. Yu, and G. Zweig, “Toward human parity in conversational speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 12, pp. 2410–2423, 2017

  7. [7]

    Utilizing untranscribed training data to improve performance,

    G. Zavaliagkos and T. Colthurst, “Utilizing untranscribed training data to improve performance,” in DARPA Broadcast News Transcription and Understanding Workshop, Landsdowne, 1998

  8. [8]

    Self-training for end-to-end speech recognition,

    J. Kahn, A. Lee, and A. Hannun, “Self-training for end-to-end speech recognition,” in Proc. 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7084–7088

  9. [9]

    wav2vec 2.0: A framework for self-supervised learning of speech representations,

    A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in neural information processing systems, vol. 33, pp. 12449–12460, 2020

  10. [10]

    XLS-R: Self-supervised cross-lingual speech representation learning at scale,

    A. Babu, C. Wang, A. Tjandra, K. Lakhotia, Q. Xu, N. Goyal, K. Singh, P. von Platen, Y. Saraf, J. Pino, et al., “XLS-R: Self-supervised cross-lingual speech representation learning at scale,” in Proc. Interspeech 2022, 2022, pp. 2278–2282

  11. [11]

    wav2vec 2.0 ASR for cantonese-speaking older adults in a clinical setting,

    R. Huang and B. Mak, “wav2vec 2.0 ASR for cantonese-speaking older adults in a clinical setting,” in Proc. Interspeech 2023, 2023, pp. 4958–4962

  12. [12]

    HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,

    W-N. Hsu, B. Bolte, Y-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM transactions on audio, speech, and language processing, vol. 29, pp. 3451–3460, 2021

  13. [13]

    BERT: Pre-training of deep bidirectional transformers for language understanding,

    J. Devlin, M-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), June 2019, pp. 4171–4186

  14. [14]

    Improving automatic speech recognition performance for low-resource languages with self-supervised models,

    J. Zhao and W-Q. Zhang, “Improving automatic speech recognition performance for low-resource languages with self-supervised models,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1227–1241, 2022

  15. [15]

    Self-supervised learning with random-projection quantizer for speech recognition,

    C-C. Chiu, J. Qin, Y. Zhang, J. Yu, and Y. Wu, “Self-supervised learning with random-projection quantizer for speech recognition,” in International Conference on Machine Learning. PMLR, 2022, pp. 3915–3924

  16. [16]

    Google USM: Scaling automatic speech recognition beyond 100 languages,

    Y. Zhang, W. Han, J. Qin, Y. Wang, A. Bapna, Z. Chen, N. Chen, B. Li, V. Axelrod, G. Wang, et al., “Google USM: Scaling automatic speech recognition beyond 100 languages,” arXiv preprint arXiv:2303.01037, 2023

  17. [17]

    NEST: Self-supervised fast conformer as all-purpose seasoning to speech processing tasks,

    H. Huang, T. Park, K. Dhawan, I. Medennikov, K.C. Puvvada, N.R. Koluguri, W. Wang, J. Balam, and B. Ginsburg, “NEST: Self-supervised fast conformer as all-purpose seasoning to speech processing tasks,” in Proc. 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5

  18. [18]

    ATCO2 corpus: A large-scale dataset for research on automatic speech recognition and natural language understanding of air traffic control communications,

    J. Zuluaga-Gomez, K. Veselý, I. Szöke, P. Motlicek, et al., “ATCO2 corpus: A large-scale dataset for research on automatic speech recognition and natural language understanding of air traffic control communications,” arXiv preprint arXiv:2211.04054, 2022

  19. [19]

    A speech interface for air traffic control terminals,

    J. Ferreiros, JM. Pardo, R. De Córdoba, J. Macias-Guarasa, JM. Montero, F. Fernández, V. Sama, G. González, et al., “A speech interface for air traffic control terminals,” Aerospace Science and Technology, vol. 21, no. 1, pp. 7–15, 2012

  20. [20]

    A unified framework for multilingual speech recognition in air traffic control systems,

    Y. Lin, D. Guo, J. Zhang, Z. Chen, and B. Yang, “A unified framework for multilingual speech recognition in air traffic control systems,” IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 8, pp. 3608–3620, 2021

  21. [21]

    The airbus air traffic control speech recognition 2018 challenge: Towards ATC automatic transcription and call sign detection,

    T. Pellegrini, J. Farinas, E. Delpech, and F. Lancelot, “The airbus air traffic control speech recognition 2018 challenge: Towards ATC automatic transcription and call sign detection,” in Proc. Interspeech 2019, 2019, pp. 2993–2997

  22. [22]

    Automatic speech recognition for air traffic control communications,

    S. Badrinath and H. Balakrishnan, “Automatic speech recognition for air traffic control communications,” Transportation research record, vol. 2676, no. 1, pp. 798–810, 2022

  23. [23]

    How does pre-trained wav2vec 2.0 perform on domain-shifted ASR? an extensive benchmark on air traffic control communications,

    J. Zuluaga-Gomez, A. Prasad, I. Nigmatulina, S.S. Sarfjoo, P. Motlicek, M. Kleinert, H. Helmke, O. Ohneiser, and Q. Zhan, “How does pre-trained wav2vec 2.0 perform on domain-shifted ASR? an extensive benchmark on air traffic control communications,” in 2022 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2023, pp. 205–212

  24. [24]

    Whisper-ATC open models for air traffic control automatic speech recognition with accuracy,

    J. van Doorn, J. Sun, J. Hoekstra, P. Jonk, and V. de Vries, “Whisper-ATC open models for air traffic control automatic speech recognition with accuracy,” in Proc. Int. Conf. Res. Air Transp. (ICRAT), 2024

  25. [25]

    Robust speech recognition via large-scale weak supervision,

    A. Radford, J.W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in International conference on machine learning. PMLR, 2023, pp. 28492–28518

  26. [26]

    Conformer: Convolution-augmented transformer for speech recognition,

    A. Gulati, J. Qin, C-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, et al., “Conformer: Convolution-augmented transformer for speech recognition,” in Proc. Interspeech 2020, 2020, pp. 5036–5040

  27. [27]

    Revisiting convolution-free transformer for speech recognition,

    Z. Hou, G. Huybrechts, A. Bhatia, D. Garcia-Romero, K.J. Han, and K. Kirchhoff, “Revisiting convolution-free transformer for speech recognition,” in Proc. Interspeech 2024, 2024, pp. 4568–4572

  28. [28]

    Robust signal-to-noise ratio estimation based on waveform amplitude distribution analysis,

    C. Kim and R.M. Stern, “Robust signal-to-noise ratio estimation based on waveform amplitude distribution analysis,” in Proc. Interspeech 2008, 2008, pp. 2598–2601

  29. [29]

    Open implementation and study of BEST-RQ for speech processing,

    R. Whetten, T. Parcollet, M. Dinarelli, and Y. Estève, “Open implementation and study of BEST-RQ for speech processing,” in Proc. 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW). IEEE, 2024, pp. 460–464

  30. [30]

    NIST, “SCTK,” https://github.com/usnistgov/SCTK.git, 2024