BEST-RQ-Based Self-Supervised Learning for Whisper Domain Adaptation

Emmanuel Vincent; Irina Illina; Raphaël Bagat

arxiv: 2510.24570 · v2 · submitted 2025-10-28 · 💻 cs.CL

Pith reviewed 2026-05-18 02:52 UTC · model grok-4.3

keywords: self-supervised learning · domain adaptation · Whisper · automatic speech recognition · BEST-RQ · knowledge distillation · air traffic control

The pith

The BEARD framework adapts Whisper's encoder to new domains by combining BEST-RQ self-supervised learning with knowledge distillation from a frozen teacher, delivering a 12% relative improvement on air traffic control speech using mostly un

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BEARD to adapt Whisper's encoder in low-resource settings where labeled data for a new domain is scarce. It combines the BEST-RQ objective with distillation from a frozen teacher encoder so that the updated encoder stays compatible with the existing decoder. Experiments on the ATCO2 air traffic control corpus use roughly 5,000 hours of untranscribed speech for the adaptation stage and only 2 hours of transcribed speech for final fine-tuning. This setup produces a 12% relative gain over a model that was fine-tuned without the preceding self-supervised stage. The work is presented as the first application of a self-supervised objective specifically for domain adaptation of Whisper.

Core claim

The paper establishes that the BEARD framework, which applies the BEST-RQ self-supervised objective together with knowledge distillation from a frozen teacher encoder, enables effective domain adaptation of Whisper's encoder using primarily unlabeled speech data. On the ATCO2 corpus of air traffic control communications, this yields a 12% relative improvement in recognition performance compared to fine-tuning alone when only 2 hours of transcribed data are available for the final step.

What carries the argument

The BEARD (BEST-RQ Encoder Adaptation with Re-training and Distillation) framework, which merges the BEST-RQ self-supervised objective with knowledge distillation from a frozen teacher encoder to update the encoder while preserving complementarity with the pre-trained decoder.
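As a rough illustration of how a BEST-RQ term and a teacher-distillation term can be summed into one objective, here is a minimal pure-Python sketch. It is not the authors' implementation: the function names, the two-term form `l_q + lam * l_d`, and all dimensions are assumptions made for illustration, and the paper's actual objective may include further terms and weights.

```python
import math
import random

def random_projection_quantizer(dim_in, dim_code, codebook_size, seed=0):
    """Build a frozen random projection and codebook (never trained), BEST-RQ style."""
    rng = random.Random(seed)
    proj = [[rng.gauss(0, 1) for _ in range(dim_in)] for _ in range(dim_code)]
    codebook = [[rng.gauss(0, 1) for _ in range(dim_code)] for _ in range(codebook_size)]
    return proj, codebook

def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def quantize(frame, proj, codebook):
    """Return the index of the codebook entry nearest to the projected input frame."""
    z = _normalize([sum(w * x for w, x in zip(row, frame)) for row in proj])
    def sq_dist(code):
        return sum((a - b) ** 2 for a, b in zip(z, _normalize(code)))
    return min(range(len(codebook)), key=lambda k: sq_dist(codebook[k]))

def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb + 1e-12)

def combined_loss(student_logits, targets, masked, student_feats, teacher_feats, lam=1.0):
    """Hypothetical two-term objective: masked code prediction + cosine distillation."""
    masked_idx = [t for t, is_masked in enumerate(masked) if is_masked]
    # BEST-RQ term: predict the quantizer's code only at masked frames.
    l_q = sum(cross_entropy(student_logits[t], targets[t]) for t in masked_idx)
    l_q /= max(len(masked_idx), 1)
    # Distillation term: keep student features close to the frozen teacher's.
    l_d = sum(cosine_distance(s, t) for s, t in zip(student_feats, teacher_feats))
    l_d /= len(student_feats)
    return l_q + lam * l_d, l_q, l_d
```

In this sketch the quantizer targets come from the unmasked input frames, while the student sees the masked input; the weight `lam` trades off domain adaptation against staying close to the teacher.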

If this is right

The adapted model shows higher accuracy on noisy, non-native, domain-specific speech such as air traffic control communications.
Large quantities of unlabeled domain data can be used for encoder adaptation without requiring matching volumes of new transcriptions.
The pre-trained decoder continues to function effectively after the encoder update because the distillation step maintains complementarity.
This constitutes the first reported use of a self-supervised learning objective for domain adaptation of Whisper.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same distillation-plus-self-supervised pattern could be tested on other pre-trained ASR models that face domain shifts with limited labels.
Varying the amount of unlabeled data while holding labeled data fixed would show how performance scales with the size of the adaptation corpus.
The approach implies that encoder-only updates can be made to large speech models without retraining the entire system from scratch.
Applying BEARD to additional specialized domains would test whether the complementarity preservation holds beyond air traffic control speech.

Load-bearing premise

The assumption that combining the BEST-RQ objective with knowledge distillation from a frozen teacher encoder will ensure the adapted encoder remains complementary to the pre-trained decoder.

What would settle it

If the word error rate on the ATCO2 test set after BEARD adaptation plus 2-hour fine-tuning is not approximately 12% lower than the word error rate of the fine-tuned baseline alone, or if encoder changes visibly degrade decoder compatibility, the central performance claim would be falsified.
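The check described above comes down to simple arithmetic on two word error rates. The sketch below shows that computation; the WER values in the usage line are hypothetical placeholders, since this page does not report the paper's absolute figures, and the tolerance is an arbitrary assumption.

```python
def relative_wer_improvement(baseline_wer, adapted_wer):
    """Relative WER reduction of the adapted system with respect to the baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - adapted_wer) / baseline_wer

def claim_holds(baseline_wer, adapted_wer, claimed=0.12, tolerance=0.01):
    """True if the measured relative gain is within `tolerance` of the claimed gain."""
    return abs(relative_wer_improvement(baseline_wer, adapted_wer) - claimed) <= tolerance

# Hypothetical numbers only: a baseline at 25.0% WER and an adapted model at 22.0% WER.
print(f"{relative_wer_improvement(25.0, 22.0):.0%}")  # prints "12%"
```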

Original abstract

Automatic Speech Recognition (ASR) systems, despite large multilingual training, struggle in low-resource scenarios where labeled data is scarce. We propose BEARD (BEST-RQ Encoder Adaptation with Re-training and Distillation), a novel framework designed to adapt Whisper's encoder with unlabeled data. Unlike traditional self-supervised learning methods, BEARD uniquely combines a BEST-RQ objective with knowledge distillation from a frozen teacher encoder, ensuring the encoder's complementarity with the pre-trained decoder. Our experiments focus on the ATCO2 corpus from the challenging Air Traffic Control (ATC) communications domain, characterized by non-native speech, noise, and specialized phraseology. Using about 5,000 hours of untranscribed speech for BEARD and 2 hours of transcribed speech for fine-tuning, the proposed approach significantly outperforms previous baseline and fine-tuned model, achieving a relative improvement of 12% compared to the fine-tuned model. To the best of our knowledge, this is the first work to use a self-supervised learning objective for domain adaptation of Whisper.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper adapts Whisper's encoder for ATC speech via BEST-RQ plus distillation on mostly unlabeled data and reports a 12% relative WER gain after light fine-tuning.

The main thing to know is that they adapt the Whisper encoder for the air traffic control domain using about 5,000 hours of unlabeled speech through a BEST-RQ self-supervised objective combined with distillation from a frozen teacher encoder. The distillation is meant to keep the adapted encoder usable by the original decoder. The model is then fine-tuned on just 2 hours of labeled data, yielding a claimed 12% relative WER improvement on ATCO2 over a fine-tuned baseline.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes BEARD (BEST-RQ Encoder Adaptation with Re-training and Distillation), a self-supervised framework for domain adaptation of Whisper's encoder. It combines the BEST-RQ objective on unlabeled speech with knowledge distillation from a frozen teacher encoder to preserve complementarity with the pre-trained decoder, followed by limited supervised fine-tuning. On the ATCO2 air-traffic-control corpus, the method reports a 12% relative WER improvement over a fine-tuned baseline when using ~5,000 hours of untranscribed data for adaptation and 2 hours of transcribed data for fine-tuning, and claims to be the first application of a self-supervised learning (SSL) objective to Whisper domain adaptation.

Significance. If the performance gains prove robust and the distillation term successfully maintains decoder compatibility, the result would offer a practical route for adapting large multilingual ASR models to specialized, low-resource domains where labeled data is scarce but raw audio is plentiful. The focus on the noisy, accented, phraseology-heavy ATC setting adds immediate engineering relevance.

major comments (2)

Abstract: the headline claim of a 12% relative WER improvement supplies neither absolute WER values for the baseline and proposed systems, nor error bars, statistical significance tests, or explicit data-split descriptions. Without these quantities the magnitude and reliability of the reported gain cannot be assessed.
Abstract (definition of BEARD): the assertion that BEST-RQ plus distillation from the frozen teacher 'ensures the encoder's complementarity with the pre-trained decoder' is presented as a design property but is not accompanied by any supporting ablation, representation-similarity metric, or decoder-isolation experiment. Because the entire adaptation pipeline rests on the adapted encoder remaining usable by the original Whisper decoder, this unverified assumption is load-bearing for the central claim.
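The first major comment asks for absolute WER values and a measure of reliability. One common way to provide the latter is a paired bootstrap over test utterances; the sketch below is a generic illustration of that idea, not the paper's evaluation code. The function names are hypothetical, and the example transcripts used to exercise it are invented ATC-like phrases.

```python
import random

def word_errors(ref, hyp):
    """Return (word-level edit distance, reference length) for one utterance."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        cur = [i]
        for j, hw in enumerate(h, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (rw != hw)))  # substitution or match
        prev = cur
    return prev[-1], len(r)

def corpus_wer(refs, hyps):
    """Corpus-level WER: total errors divided by total reference words."""
    stats = [word_errors(r, h) for r, h in zip(refs, hyps)]
    return sum(e for e, _ in stats) / sum(n for _, n in stats)

def paired_bootstrap_ci(refs, hyps_a, hyps_b, n_resamples=1000, seed=0):
    """95% interval for WER(A) - WER(B), resampling the same utterances for both systems."""
    rng = random.Random(seed)
    stats_a = [word_errors(r, h) for r, h in zip(refs, hyps_a)]
    stats_b = [word_errors(r, h) for r, h in zip(refs, hyps_b)]
    n = len(refs)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        words = sum(stats_a[i][1] for i in idx)
        if words == 0:
            continue  # skip degenerate resamples with no reference words
        wer_a = sum(stats_a[i][0] for i in idx) / words
        wer_b = sum(stats_b[i][0] for i in idx) / words
        diffs.append(wer_a - wer_b)
    diffs.sort()
    lower = diffs[int(0.025 * len(diffs))]
    upper = diffs[min(len(diffs) - 1, int(0.975 * len(diffs)))]
    return lower, upper
```

If the interval for the baseline-minus-adapted difference excludes zero, the gain is unlikely to be an artifact of which utterances happened to land in the test set.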

minor comments (1)

Abstract: the challenges of the ATCO2 domain (non-native speech, noise, specialized phraseology) are stated qualitatively; a short quantitative characterization (e.g., SNR statistics or vocabulary overlap with Whisper training data) would help readers gauge the severity of the domain shift.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and describe the changes planned for the revised manuscript.

Referee: Abstract: the headline claim of a 12% relative WER improvement supplies neither absolute WER values for the baseline and proposed systems, nor error bars, statistical significance tests, or explicit data-split descriptions. Without these quantities the magnitude and reliability of the reported gain cannot be assessed.

Authors: We agree that the abstract would be clearer with absolute numbers and setup details. In the revision we will insert the absolute WER figures for the fine-tuned baseline and the BEARD system, together with a concise description of the 5,000-hour unlabeled ATCO2 split used for adaptation and the 2-hour transcribed split used for fine-tuning. If multiple runs were performed we will also report standard deviations; otherwise we will state the limitation explicitly.
Revision: yes

Referee: Abstract (definition of BEARD): the assertion that BEST-RQ plus distillation from the frozen teacher 'ensures the encoder's complementarity with the pre-trained decoder' is presented as a design property but is not accompanied by any supporting ablation, representation-similarity metric, or decoder-isolation experiment. Because the entire adaptation pipeline rests on the adapted encoder remaining usable by the original Whisper decoder, this unverified assumption is load-bearing for the central claim.

Authors: The distillation loss is introduced precisely to keep the adapted encoder outputs aligned with those of the original Whisper encoder, thereby preserving compatibility with the frozen decoder. We acknowledge that the submitted manuscript provides no quantitative verification of this effect. We will add (i) a representation-similarity analysis (cosine similarity between adapted and teacher encoder activations on held-out audio) and (ii) an ablation that removes the distillation term and measures the resulting WER when the original decoder is used. These results will appear in the experiments section or an appendix of the revised paper.
Revision: yes
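The representation-similarity analysis mentioned in this response could look roughly like the following sketch, which averages frame-wise cosine similarity between two sets of encoder activations. The function names and the toy activations are assumptions for illustration; a real analysis would read activations from the adapted and teacher encoders on held-out audio.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def mean_frame_similarity(adapted, teacher):
    """Average cosine similarity over time-aligned frames of two activation sequences."""
    if len(adapted) != len(teacher) or not adapted:
        raise ValueError("sequences must be non-empty and have the same number of frames")
    sims = [cosine_similarity(a, t) for a, t in zip(adapted, teacher)]
    return sum(sims) / len(sims)

# Toy activations (hypothetical): two frames of 3-dimensional features per encoder.
teacher_acts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
adapted_acts = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.2]]
score = mean_frame_similarity(adapted_acts, teacher_acts)
```

A score near 1 would suggest the adapted encoder still produces features the original decoder can read, while a sharp drop when the distillation term is removed would support the authors' design argument.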

Circularity Check

0 steps flagged

No derivation chain; empirical adaptation procedure

The paper presents BEARD as an empirical framework combining BEST-RQ self-supervised learning with knowledge distillation for Whisper encoder adaptation on unlabeled ATC data, followed by limited fine-tuning. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described structure. Central claims rest on reported WER improvements from experiments (e.g., 12% relative gain), which are externally falsifiable via replication on the ATCO2 corpus and do not reduce to self-definitional inputs or ansatzes smuggled via prior work. This is a standard honest non-finding for an applied adaptation study.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The contribution rests on standard self-supervised and distillation techniques plus the unverified assumption that the proposed combination preserves decoder compatibility; no new physical entities or formal axioms are introduced.

free parameters (1)

BEST-RQ and distillation hyperparameters
These control the adaptation process and are presumably tuned on the unlabeled data but are not specified in the abstract.

axioms (1)

Domain assumption: Distillation from the frozen teacher encoder ensures complementarity with the pre-trained decoder
Invoked directly in the abstract's description of how BEARD maintains encoder-decoder compatibility.

invented entities (1)

BEARD framework (no independent evidence)
purpose: Named method for encoder adaptation combining BEST-RQ and distillation
Newly introduced named procedure whose effectiveness is the central empirical claim.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

The overall training objective L is defined as L = L_q^ℓ + λ L_d^ℓ + βλ L_n^d ... using cosine similarity ... applied to the output of its ℓ-th layer
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean absolute_floor_iff_bare_distinguishability unclear

BEARD uniquely combines a BEST-RQ objective with knowledge distillation from a frozen teacher encoder, ensuring the encoder's complementarity with the pre-trained decoder

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · 1 internal anchor

[1] BEST-RQ-Based Self-Supervised Learning for Whisper Domain Adaptation (internal anchor). INTRODUCTION. Automatic speech recognition (ASR) has reached near-human accuracy in many domains [1]. The arrival of large-scale end-to-end models made these models easier to use out-of-the-box. However, despite being trained on massive multilingual datasets, these models still struggle with out-of-domain scenarios, like out-of-vocabulary words, spontane...
[2] PROPOSED METHODOLOGY 2.1. Whisper. Whisper, end-to-end encoder-decoder Transformer, is a state-of-the-art model for automatic speech recognition [20]. It has been trained on 680,000 hours of transcribed multilingual data. Whisper’s encoder is mostly focused on acoustic features, while its decoder is mostly focused on linguistic features. Whisper differs...
[3] EXPERIMENTAL SETTINGS 3.1. Dataset. We conducted our experiments using the ATCO2 dataset [13]. It contains air traffic control communications between pilots and air traffic controllers from various airports. The speech is non-native, with high speech rate and noisy, with signal-to-noise ratios (SNR) varying from -10dB to 40dB, estimated using WADA-SNR [...
[4] RESULTS AND DISCUSSIONS 4.1. Baselines. We consider four baselines. The first two are prior works on the ATCO2 dataset. An XLS-R model fine-tuned on 132 hours of ATC speech from diverse corpora (referred to as XLS-R FT) [18], ATCO2 was only used for testing. This is the first baseline that was presented (footnote 3: https://gitlab.inria.fr/rbagat/beard) Table 2: WER (%...
[5] CONCLUSION. In this work, we investigated whether self-supervised learning can help Whisper adapt to a new domain. We introduced BEARD, a framework that combines self-supervised learning and distillation to adapt Whisper’s encoder using unlabeled speech. The modified encoder is then fine-tuned with the decoder using a limited amount of labeled data. On t...
[6] W. Xiong, J. Droppo, X. Huang, F. Seide, M.L. Seltzer, A. Stolcke, D. Yu, and G. Zweig, “Toward human parity in conversational speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 12, pp. 2410–2423, 2017
[7] G. Zavaliagkos and T. Colthurst, “Utilizing untranscribed training data to improve performance,” in DARPA Broadcast News Transcription and Understanding Workshop, Landsdowne, 1998
[8] J. Kahn, A. Lee, and A. Hannun, “Self-training for end-to-end speech recognition,” in Proc. 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7084–7088
[9] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in Neural Information Processing Systems, vol. 33, pp. 12449–12460, 2020
[10] A. Babu, C. Wang, A. Tjandra, K. Lakhotia, Q. Xu, N. Goyal, K. Singh, P. von Platen, Y. Saraf, J. Pino, et al., “XLS-R: Self-supervised cross-lingual speech representation learning at scale,” in Proc. Interspeech 2022, 2022, pp. 2278–2282
[11] R. Huang and B. Mak, “wav2vec 2.0 ASR for Cantonese-speaking older adults in a clinical setting,” in Proc. Interspeech 2023, 2023, pp. 4958–4962
[12] W-N. Hsu, B. Bolte, Y-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021
[13] J. Devlin, M-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), June 2019, pp. 4171–4186
[14] J. Zhao and W-Q. Zhang, “Improving automatic speech recognition performance for low-resource languages with self-supervised models,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1227–1241, 2022
[15] C-C. Chiu, J. Qin, Y. Zhang, J. Yu, and Y. Wu, “Self-supervised learning with random-projection quantizer for speech recognition,” in International Conference on Machine Learning. PMLR, 2022, pp. 3915–3924
[16] Y. Zhang, W. Han, J. Qin, Y. Wang, A. Bapna, Z. Chen, N. Chen, B. Li, V. Axelrod, G. Wang, et al., “Google USM: Scaling automatic speech recognition beyond 100 languages,” arXiv preprint arXiv:2303.01037, 2023
[17] H. Huang, T. Park, K. Dhawan, I. Medennikov, K.C. Puvvada, N.R. Koluguri, W. Wang, J. Balam, and B. Ginsburg, “NEST: Self-supervised fast conformer as all-purpose seasoning to speech processing tasks,” in Proc. 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5
[18] J. Zuluaga-Gomez, K. Veselý, I. Szöke, P. Motlicek, et al., “ATCO2 corpus: A large-scale dataset for research on automatic speech recognition and natural language understanding of air traffic control communications,” arXiv preprint arXiv:2211.04054, 2022
[19] J. Ferreiros, JM. Pardo, R. De Córdoba, J. Macias-Guarasa, JM. Montero, F. Fernández, V. Sama, G. González, et al., “A speech interface for air traffic control terminals,” Aerospace Science and Technology, vol. 21, no. 1, pp. 7–15, 2012
[20] Y. Lin, D. Guo, J. Zhang, Z. Chen, and B. Yang, “A unified framework for multilingual speech recognition in air traffic control systems,” IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 8, pp. 3608–3620, 2021
[21] T. Pellegrini, J. Farinas, E. Delpech, and F. Lancelot, “The Airbus air traffic control speech recognition 2018 challenge: Towards ATC automatic transcription and call sign detection,” in Proc. Interspeech 2019, 2019, pp. 2993–2997
[22] S. Badrinath and H. Balakrishnan, “Automatic speech recognition for air traffic control communications,” Transportation Research Record, vol. 2676, no. 1, pp. 798–810, 2022
[23] J. Zuluaga-Gomez, A. Prasad, I. Nigmatulina, S.S. Sarfjoo, P. Motlicek, M. Kleinert, H. Helmke, O. Ohneiser, and Q. Zhan, “How does pre-trained wav2vec 2.0 perform on domain-shifted ASR? An extensive benchmark on air traffic control communications,” in 2022 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2023, pp. 205–212
[24] J. van Doorn, J. Sun, J. Hoekstra, P. Jonk, and V. de Vries, “Whisper-ATC open models for air traffic control automatic speech recognition with accuracy,” in Proc. Int. Conf. Res. Air Transp. (ICRAT), 2024
[25] A. Radford, J.W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in International Conference on Machine Learning. PMLR, 2023, pp. 28492–28518
[26] A. Gulati, J. Qin, C-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, et al., “Conformer: Convolution-augmented transformer for speech recognition,” in Proc. Interspeech 2020, 2020, pp. 5036–5040
[27] Z. Hou, G. Huybrechts, A. Bhatia, D. Garcia-Romero, K.J. Han, and K. Kirchhoff, “Revisiting convolution-free transformer for speech recognition,” in Proc. Interspeech 2024, 2024, pp. 4568–4572
[28] C. Kim and R.M. Stern, “Robust signal-to-noise ratio estimation based on waveform amplitude distribution analysis,” in Proc. Interspeech 2008, 2008, pp. 2598–2601
[29] R. Whetten, T. Parcollet, M. Dinarelli, and Y. Estève, “Open implementation and study of BEST-RQ for speech processing,” in Proc. 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW). IEEE, 2024, pp. 460–464
[30] NIST, “SCTK,” https://github.com/usnistgov/SCTK.git, 2024
