Synthetic Audio Generation Framework for Air Traffic Control Speech Recognition

Emmanuel Vincent; Irina Illina; Junichi Yamagishi; Rapha\"el Bagat; Zhe Zhang

arxiv: 2606.21340 · v1 · pith:UZPDY4GQnew · submitted 2026-06-19 · 💻 cs.CL

Synthetic Audio Generation Framework for Air Traffic Control Speech Recognition

Rapha\"el Bagat , Zhe Zhang , Junichi Yamagishi , Irina Illina , Emmanuel Vincent This is my paper

Pith reviewed 2026-06-26 14:33 UTC · model grok-4.3

classification 💻 cs.CL

keywords air traffic controlautomatic speech recognitionsynthetic dataaccent conversiontext-to-speechvoice conversionwhisper modelword error rate

0 comments

The pith

Fine-tuning Whisper on synthetic ATC audio with accent simulation reduces word error rates compared to real-data baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Air traffic control speech recognition faces challenges from channel noise, non-native English accents, and scarce real recordings. The paper builds a generation pipeline that starts with text-to-speech, adds voice conversion, converts L2 accents to L1, and introduces a controllable L1-to-L2 conversion to create realistic accented speech. Experiments fine-tune the Whisper model on the ATCO2 corpus using either the synthetic data alone or a mix with real data. Both approaches lower word error rates relative to an out-of-the-box model and to real-data-only fine-tuning. A sympathetic reader cares because the method offers a way to overcome data limits in safety-critical, acoustically difficult domains without needing vastly more real recordings.

Core claim

The paper claims that its synthetic data generation pipeline, built from neural text-to-speech, voice conversion, L2-to-L1 accent conversion, and a novel controllable L1-to-L2 accent conversion, produces audio whose acoustic properties allow fine-tuning that significantly improves word error rate on the ATCO2 corpus over both out-of-the-box and real-data-only baselines.

What carries the argument

The synthetic audio generation pipeline that combines neural generation techniques with a controllable L1-to-L2 accent conversion framework to simulate channel noise and accent distributions.

If this is right

Synthetic data alone improves performance over an out-of-the-box Whisper model.
A mix of real and synthetic data improves performance over real data alone.
The pipeline addresses both non-native accents and channel noise through targeted conversion steps.
The improvements are demonstrated specifically on the Whisper model and ATCO2 corpus.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same generation steps could be tested on other low-resource noisy domains such as medical dictation or emergency radio.
If the controllable accent conversion proves robust, it might reduce reliance on expensive collection of real accented ATC recordings.
Applying the pipeline to additional ASR architectures beyond Whisper would test whether the gains are model-specific.

Load-bearing premise

The synthetic audio must accurately reproduce the channel noise, accent distributions, and acoustic conditions of real ATC communications in the ATCO2 corpus.

What would settle it

Fine-tuning Whisper on the synthetic or mixed data and then measuring word error rate on a held-out real ATCO2 test set; if the rate shows no improvement or worsens versus the real-data-only baseline, the central claim is false.

Figures

Figures reproduced from arXiv: 2606.21340 by Emmanuel Vincent, Irina Illina, Junichi Yamagishi, Rapha\"el Bagat, Zhe Zhang.

**Figure 1.** Figure 1: Synthetic audio generation framework for ATC. 3. synthesizing the Mel-spectrogram from the L1 tokens using a token-to-Mel synthesizer model, conditioned on a speaker embedding extracted from the input speech and the input utterance duration; 4. synthesizing the L1 speech waveform from the generated Mel-spectrogram using a vocoder. L1-to-L2 accent conversion (ACL1→L2) — Our fourth generative approach is L… view at source ↗

**Figure 2.** Figure 2: Repurposed TokAN architecture for controllable L1- to-L2 accent conversion. provides the accent reference. While the speaker embedding can be extracted from the L1 input by default, it can also be extracted from a separate reference audio, enabling simultaneous control of both accent and speaker identity. 2.3. ATC acoustic simulation To faithfully simulate ATC speech, we apply a sequence of ATC-specific a… view at source ↗

read the original abstract

Automatic Speech Recognition (ASR) systems, despite achieving remarkable accuracy in general-purpose domains with native speech (L1), struggle in domains like Air Traffic Control (ATC) due to strong channel noise, a presence of non-native (L2) English accents, and data scarcity. We propose a synthetic data generation pipeline with acoustical properties simulations specifically designed to address this lack of real data to improve recognition accuracy in the ATC domain. Our approach leverages a combination of neural generation techniques, including Text-to-Speech, Voice Conversion, L2-to-L1 accent conversion, and a novel controllable L1-to-L2 accent conversion framework built to simulate accented speech. Our experiments with the Whisper model on the ATCO2 corpus demonstrate that fine-tuning with either synthetic data alone, or a mix of real and synthetic data, significantly improves the word error rate over out-of-the-box and real data only baselines respectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper builds a synthetic ATC audio pipeline around a new controllable L1-to-L2 accent converter and claims WER gains on Whisper with ATCO2, but the abstract supplies no numbers or fidelity checks.

read the letter

The main takeaway is a targeted pipeline that combines TTS, voice conversion, L2-to-L1 conversion, and a new controllable L1-to-L2 accent framework to generate training data for ATC speech recognition. Experiments reportedly fine-tune Whisper on ATCO2 and show gains from synthetic data alone or mixed with real data.

What is actually new is the controllable L1-to-L2 component built specifically for this domain rather than a generic extension of existing conversion methods.

The work does a reasonable job identifying a concrete, high-stakes data scarcity problem and laying out a practical generation approach that tries to simulate channel noise and accents.

The soft spots are clear from the abstract alone: no reported WER numbers, no ablations, no statistical tests, and no objective checks that the synthetic samples match real ATCO2 acoustics, SNR, or accent distributions. Without those, it is impossible to tell whether any gains come from domain-matched data or from other effects like added volume. The stress-test concern about fidelity holds up on the available text.

This is for researchers focused on domain-specific ASR or accent handling in constrained settings. A reader already working on aviation or low-resource speech might pick up the pipeline idea.

It deserves peer review because the application area matters and the targeted synthetic approach is worth testing properly, even if the current evidence is thin.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a synthetic audio generation pipeline for ATC speech recognition that combines TTS, voice conversion, L2-to-L1 accent conversion, and a novel controllable L1-to-L2 accent conversion framework to simulate channel noise and non-native accents. Experiments fine-tune the Whisper model on the ATCO2 corpus and claim that synthetic data alone or mixed with real data yields significant WER improvements over out-of-the-box and real-data-only baselines.

Significance. If the synthetic samples faithfully reproduce ATCO2 acoustics, the framework could offer a practical route to mitigating data scarcity in noisy, accented, safety-critical ASR domains.

major comments (2)

[Abstract] Abstract: the central claim that synthetic or mixed data 'significantly improves' WER is stated without any numerical results, confidence intervals, statistical tests, ablation tables, or baseline WER values, preventing assessment of effect size or robustness.
[Abstract] Abstract (and implied experimental section): the pipeline's validity rests on the unverified assumption that generated audio matches ATCO2 channel noise, L2 accent distributions, and acoustic conditions, yet no objective metrics (SNR histograms, formant statistics, spectrogram distribution distances) or listening tests are referenced to confirm domain fidelity.

minor comments (1)

The phrase 'acoustical properties simulations' is underspecified; a brief enumeration of the simulation parameters or modules would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract and the need for stronger validation of the synthetic pipeline. We address each point below.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim that synthetic or mixed data 'significantly improves' WER is stated without any numerical results, confidence intervals, statistical tests, ablation tables, or baseline WER values, preventing assessment of effect size or robustness.

Authors: We agree that the abstract should include concrete numerical support. The manuscript body reports the relevant WER figures, but the abstract does not. We will revise the abstract to state the baseline WER, the WER achieved with synthetic data alone, the WER with mixed data, and the magnitude of improvement. revision: yes
Referee: [Abstract] Abstract (and implied experimental section): the pipeline's validity rests on the unverified assumption that generated audio matches ATCO2 channel noise, L2 accent distributions, and acoustic conditions, yet no objective metrics (SNR histograms, formant statistics, spectrogram distribution distances) or listening tests are referenced to confirm domain fidelity.

Authors: The primary validation in the work is the measured WER reduction on the held-out ATCO2 test set when synthetic or mixed data is used for fine-tuning; this constitutes task-level evidence that the generated samples are useful under the target acoustic and accent conditions. We acknowledge that the manuscript does not report direct acoustic similarity metrics or listening tests. Adding such analyses would require additional experiments that are outside the current scope, so we will instead clarify in the text that downstream ASR performance is the intended validation criterion. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical evaluation on external corpus

full rationale

The paper reports an empirical study: a synthetic audio pipeline (TTS, voice conversion, accent conversion) is used to fine-tune Whisper, with WER measured on the external ATCO2 corpus against out-of-the-box and real-data baselines. No equations, fitted parameters, self-definitional steps, or load-bearing self-citations are present that would reduce any claimed result to its inputs by construction. The evaluation is independently falsifiable via the public ATCO2 benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Ledger constructed from abstract only; full paper may introduce additional parameters or assumptions not visible here.

axioms (1)

domain assumption The ATCO2 corpus represents typical real-world ATC acoustic and accent conditions.
Used as the evaluation benchmark for all reported improvements.

invented entities (1)

controllable L1-to-L2 accent conversion framework no independent evidence
purpose: To generate accented synthetic speech with tunable accent strength for ATC training data.
Explicitly described as novel in the abstract; no independent evidence provided.

pith-pipeline@v0.9.1-grok · 5691 in / 1245 out tokens · 17916 ms · 2026-06-26T14:33:04.574590+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

46 extracted references · 2 linked inside Pith

[1]

However, deploying these systems in safety-critical environments like Air Traffic Control (ATC) remains a challenge

Introduction Automatic Speech Recognition (ASR) systems have achieved remarkable accuracy in general-purpose domains [1]. However, deploying these systems in safety-critical environments like Air Traffic Control (ATC) remains a challenge. ATC communica- tions are characterized by acoustic and linguistic complexities: domain-specific phraseology, rapid spe...

Pith/arXiv arXiv 2026
[2]

Proposed Methodology To address data scarcity in the ATC domain, we aim to syn- thesize diverse aspects of ATC speech. Figure 1 illustrates an overview of our framework, which generates diverse variants of transcribed ATC utterances by leveraging pre-trained mod- ules for speech/noise separation, super-resolution, TTS, VC, and AC. In addition, we propose ...
[3]

extracting discrete HuBERT tokens from input L2 speech
[4]

converting the L2 discrete HuBERT tokens into L1 tokens using a token-based accent conversion module, conditioned by the accent embedding computed from the input speech; 1https://gitlab.inria.fr/rbagat/atc_ generation Figure 1:Synthetic audio generation framework for ATC
[5]

synthesizing the Mel-spectrogram from the L1 tokens using a token-to-Mel synthesizer model, conditioned on a speaker embedding extracted from the input speech and the input ut- terance duration
[6]

L1-to-L2 accent conversion (ACL1→L2) —Our fourth gener- ative approach is L1-to-L2 accent conversion, i.e., generating non-native accented speech from native speech

synthesizing the L1 speech waveform from the generated Mel-spectrogram using a vocoder. L1-to-L2 accent conversion (ACL1→L2) —Our fourth gener- ative approach is L1-to-L2 accent conversion, i.e., generating non-native accented speech from native speech. By increas- ing accent diversity, we investigate whether exposure to a wider range of L2 accents improv...
[7]

radio-like

to leverage its learned representations and adapt it to ATC. TokAN’s token conversion module and token-to-mel synthe- sizer are the core components for accent conversion, and typ- ically require substantially less data to fine-tune than the Hu- BERT encoder or the vocoder. We therefore fine-tune these two modules for L1-to-L2 conversion. In the original T...
[8]

Datasets Real ATC data —Our experiments focus on the ATCO2 3 dataset [30]

Experimental Setup 3.1. Datasets Real ATC data —Our experiments focus on the ATCO2 3 dataset [30]. It consists of ATC communications recorded at seven airports. The dataset exhibits key characteristics of ATC speech, including non-native accents, strong channel noise, high speech rates, and domain-specific phraseology. In our experiments, we use only the ...
[9]

normalized

Results and Discussions Baselines —We consider two baselines (last two rows of Ta- ble 1): the out-of-the-box Whisper-small model (63.32% WER) and the same model fine-tuned exclusively on real ATCO2 data (22.69% WER). In-domain fine-tuning substantially im- proves performance, highlighting the importance of adapting ASR models to ATC-specific data. Fine-t...
[10]

We proposed a novel synthetic data genera- tion framework for ATC speech recognition that includes TTS, VC, AC, and domain-specific acoustical simulation

Conclusion We investigated generative data augmentation strategies to ad- dress data scarcity in Air Traffic Control (ATC) automatic speech recognition. We proposed a novel synthetic data genera- tion framework for ATC speech recognition that includes TTS, VC, AC, and domain-specific acoustical simulation. Further- more, this work introduced a novel contr...
[11]

The authors take the full responsibility for the con- tent of this manuscript

Use of Generative AI Disclosure Generative AI tools were solely used for polishing the manuscript. The authors take the full responsibility for the con- tent of this manuscript
[12]

It was granted access to the HPC resources of IDRIS under the allocation 2024-AD011015024 made by GENCI

Acknowledgments This work was funded by the DeepMAUVES project sup- ported by DGA of french MoD and CNRS, by the Inria-NII TrustedSpeech Associate Team, and MEXT KAKENHI Grants (24K21324). It was granted access to the HPC resources of IDRIS under the allocation 2024-AD011015024 made by GENCI. Authors would like to thank Ye-Xin Lu for giving us access to t...

2024
[13]

wav2vec 2.0: A framework for self-supervised learning of speech repre- sentations,

A. Baevski, Y . Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech repre- sentations,”Advances in Neural Information Processing Systems, vol. 33, pp. 12 449–12 460, 2020

2020
[14]

How does pre-trained wav2vec 2.0 perform on domain-shifted ASR? An extensive benchmark on air traffic control communi- cations,

J. Zuluaga-Gomez, A. Prasad, I. Nigmatulina, S. S. Sarfjoo, P. Motlicek, M. Kleinert, H. Helmke, O. Ohneiser, and Q. Zhan, “How does pre-trained wav2vec 2.0 perform on domain-shifted ASR? An extensive benchmark on air traffic control communi- cations,” in2022 IEEE Spoken Language Technology Workshop (SLT), 2023, pp. 205–212

2023
[15]

The air- bus air traffic control speech recognition 2018 challenge: Towards ATC automatic transcription and call sign detection,

T. Pellegrini, J. Farinas, E. Delpech, and F. Lancelot, “The air- bus air traffic control speech recognition 2018 challenge: Towards ATC automatic transcription and call sign detection,” inInter- speech, 2019, pp. 2993–2997

2018
[16]

Whisper-ATC open models for air traffic control automatic speech recognition with accuracy,

J. van Doorn, J. Sun, J. Hoekstra, P. Jonk, and V . de Vries, “Whisper-ATC open models for air traffic control automatic speech recognition with accuracy,” inInternational Conference on Research in Air Transportation (ICRAT), 2024

2024
[17]

In-domain SSL pre-training and streaming ASR: Application to air traffic control communications,

J. Duret, S. Mdhaffar, G. Laperri `ere, R. Whetten, A. Galametz, C. Kobus, M.-C. Martin, J. Oleiwan, and Y . Est `eve, “In-domain SSL pre-training and streaming ASR: Application to air traffic control communications,” inSpeech and Computer, 2026, pp. 3– 12

2026
[18]

BEST-RQ-based self- supervised learning for Whisper domain adaptation,

R. Bagat, I. Illina, and E. Vincent, “BEST-RQ-based self- supervised learning for Whisper domain adaptation,” in2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2026

2026
[19]

Audio augmen- tation for speech recognition,

T. Ko, V . Peddinti, D. Povey, and S. Khudanpur, “Audio augmen- tation for speech recognition,” inInterspeech, 2015, pp. 3586– 3589

2015
[20]

Deep Speech: Scaling up end-to-end speech recogni- tion,

A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, and A. Y . Ng, “Deep Speech: Scaling up end-to-end speech recogni- tion,” 2014, arXiv preprint arXiv:1412.5567

Pith/arXiv arXiv 2014
[21]

Specaugment: A simple data augmentation method for automatic speech recognition,

D. S. Park, W. Chan, Y . Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V . Le, “Specaugment: A simple data augmentation method for automatic speech recognition,” inInterspeech, 2019, pp. 2613–2617

2019
[22]

Accentron: Foreign accent conversion to arbitrary non-native speakers using zero-shot learning,

S. Ding, G. Zhao, and R. Gutierrez-Osuna, “Accentron: Foreign accent conversion to arbitrary non-native speakers using zero-shot learning,”Computer Speech & Language, vol. 72, p. 101302, 2022

2022
[23]

AccentBox: Towards high-fidelity zero-shot accent generation,

J. Zhong, K. Richmond, Z. Su, and S. Sun, “AccentBox: Towards high-fidelity zero-shot accent generation,” in2025 IEEE Inter- national Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025, pp. 1–5

2025
[24]

Par- rotron: An end-to-end speech-to-speech conversion model and its applications to hearing-impaired speech and speech separation,

F. Biadsy, R. J. Weiss, P. J. Moreno, D. Kanvesky, and Y . Jia, “Par- rotron: An end-to-end speech-to-speech conversion model and its applications to hearing-impaired speech and speech separation,” inInterspeech, 2019, pp. 4115–4119

2019
[25]

Accent modification for speech recognition of non-native speakers using neural style transfer,

K. Radzikowski, L. Wang, O. Yoshie, and R. Nowak, “Accent modification for speech recognition of non-native speakers using neural style transfer,”EURASIP Journal on Audio, Speech, and Music Processing, vol. 2021, no. 1, p. 11, 2021

2021
[26]

Accent conversion us- ing discrete units with parallel data synthesized from controllable accented TTS,

T.-N. Nguyen, Q. Pham, and A. Waibel, “Accent conversion us- ing discrete units with parallel data synthesized from controllable accented TTS,” inSynthetic Data’s Transformative Role in Foun- dational Speech Models, 2024, pp. 51–55

2024
[27]

Ac- cent normalization using self-supervised discrete tokens with non- parallel data,

Q. Bai, S. Inoue, S. Wang, Z. Jiang, Y . Wang, and H. Li, “Ac- cent normalization using self-supervised discrete tokens with non- parallel data,” inInterspeech, 2025, pp. 1618–1622

2025
[28]

Non-parallel voice conversion for ASR augmentation,

G. Wang, A. Rosenberg, B. Ramabhadran, F. Biadsy, J. Emond, Y . Huang, and P. J. Moreno, “Non-parallel voice conversion for ASR augmentation,” inInterspeech, 2022, pp. 3408–3412

2022
[29]

Fairness in automatic speech recognition isn’t a one-size-fits- all,

H. ElGhazaly, B. Mirheidari, H. Christensen, and N. S. Moosavi, “Fairness in automatic speech recognition isn’t a one-size-fits- all,” inFindings of the Association for Computational Linguistics: EMNLP 2025, 2025, pp. 19 169–19 178

2025
[30]

V oice conversion can improve ASR in very low-resource settings,

M. Baas and H. Kamper, “V oice conversion can improve ASR in very low-resource settings,” inInterspeech, 2022, pp. 3513–3517

2022
[31]

Speech recognition for air traf- fic control utilizing a multi-head state-space model and transfer learning,

H. Liang, H. Chang, and J. Kong, “Speech recognition for air traf- fic control utilizing a multi-head state-space model and transfer learning,”Aerospace, vol. 11, no. 5, p. 390, 2024

2024
[32]

Adapting auto- matic speech recognition for accented air traffic control commu- nications,

M. Y . Z. Wee, J. J. H. Wong, L. Lim, J. Y . W. Tan, P. Gupta, D. Lim, E. H. Tew, A. K. S. Han, and Y . Z. Lim, “Adapting auto- matic speech recognition for accented air traffic control commu- nications,” in2025 International Conference on Military Commu- nication and Information Systems (ICMCIS), 2025, pp. 1–10

2025
[33]

A unified frame- work for multilingual speech recognition in air traffic control sys- tems,

Y . Lin, D. Guo, J. Zhang, Z. Chen, and B. Yang, “A unified frame- work for multilingual speech recognition in air traffic control sys- tems,”IEEE Transactions on Neural Networks and Learning Sys- tems, vol. 32, no. 8, pp. 3608–3620, 2020

2020
[34]

Automatic speech recognition for air traffic control communications,

S. Badrinath and H. Balakrishnan, “Automatic speech recognition for air traffic control communications,”Transportation research record, vol. 2676, no. 1, pp. 798–810, 2022

2022
[35]

Improv- ing speech recognition models with small samples for air traffic control systems,

Y . Lin, Q. Li, B. Yang, Z. Yan, H. Tan, and Z. Chen, “Improv- ing speech recognition models with small samples for air traffic control systems,”Neurocomputing, vol. 445, pp. 287–297, 2021

2021
[36]

Exploring con- textual knowledge-enhanced speech recognition in air traffic con- trol communication: A comparative study,

D. Guo, S. Zhang, J. Zhang, B. Yang, and Y . Lin, “Exploring con- textual knowledge-enhanced speech recognition in air traffic con- trol communication: A comparative study,”IEEE Transactions on Neural Networks and Learning Systems, vol. 36, no. 9, pp. 16 085–16 099, 2025

2025
[37]

DAIEN-TTS: Disentangled audio infilling for environment- aware text-to-speech synthesis,

Y .-X. Lu, Y . Gu, K. Wei, H.-P. Du, Y . Ai, and Z.-H. Ling, “DAIEN-TTS: Disentangled audio infilling for environment- aware text-to-speech synthesis,” in2026 IEEE International Con- ference on Acoustics, Speech and Signal Processing (ICASSP), 2026

2026
[38]

Au- dioSR: Versatile audio super-resolution at scale,

H. Liu, K. Chen, Q. Tian, W. Wang, and M. D. Plumbley, “Au- dioSR: Versatile audio super-resolution at scale,” in2024 IEEE International Conference on Acoustics, Speech and Signal Pro- cessing (ICASSP), 2024, pp. 1076–1080

2024
[39]

F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching,

Y . Chen, Z. Niu, Z. Ma, K. Deng, C. Wang, J. JianZhao, K. Yu, and X. Chen, “F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching,” in63rd Annual Meeting of the ACL (Volume 1: Long Papers), 2025, pp. 6255–6271

2025
[40]

V oice conversion with just nearest neighbors,

M. Baas, B. van Niekerk, and H. Kamper, “V oice conversion with just nearest neighbors,” inInterspeech 2023, 2023, pp. 2053– 2057

2023
[41]

HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,

W.-N. Hsu, B. Bolte, Y .-H. H. Tsai, K. Lakhotia, R. Salakhut- dinov, and A. Mohamed, “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Pro- cessing, vol. 29, pp. 3451–3460, 2021

2021
[42]

ATCO2 corpus: A large-scale dataset for research on automatic speech recognition and natural language understanding of air traffic control communications,

J. P. Z. Gomez, K. Vesel ´y, I. Sz¨oke, A. Blatt, P. Motlicek, M. Ko- cour, K. Choukri, I. Nigmatulina, C. Cevenini, A. Tart, J. Cer- nock´y, and D. Klakow, “ATCO2 corpus: A large-scale dataset for research on automatic speech recognition and natural language understanding of air traffic control communications,”Journal of Data-centric Machine Learning Res...

2024
[43]

L2-ARCTIC: A non-native English speech corpus,

G. Zhao, S. Sonsaat, A. Silpachai, I. Lucic, E. Chukharev- Hudilainen, J. Levis, and R. Gutierrez-Osuna, “L2-ARCTIC: A non-native English speech corpus,” inInterspeech, 2018, pp. 2783–2787

2018
[44]

Robust speech recognition via large-scale weak supervision,

A. Radford, J. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” inInternational Conference on Machine Learning, 2023, pp. 28 492–28 518

2023
[45]

Non- native children speech recognition through transfer learning,

M. Matassoni, R. Gretter, D. Falavigna, and D. Giuliani, “Non- native children speech recognition through transfer learning,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 6229–6233

2018
[46]

NIST, “SCTK,” https://github.com/usnistgov/SCTK.git, 2024

2024

[1] [1]

However, deploying these systems in safety-critical environments like Air Traffic Control (ATC) remains a challenge

Introduction Automatic Speech Recognition (ASR) systems have achieved remarkable accuracy in general-purpose domains [1]. However, deploying these systems in safety-critical environments like Air Traffic Control (ATC) remains a challenge. ATC communica- tions are characterized by acoustic and linguistic complexities: domain-specific phraseology, rapid spe...

Pith/arXiv arXiv 2026

[2] [2]

Proposed Methodology To address data scarcity in the ATC domain, we aim to syn- thesize diverse aspects of ATC speech. Figure 1 illustrates an overview of our framework, which generates diverse variants of transcribed ATC utterances by leveraging pre-trained mod- ules for speech/noise separation, super-resolution, TTS, VC, and AC. In addition, we propose ...

[3] [3]

extracting discrete HuBERT tokens from input L2 speech

[4] [4]

converting the L2 discrete HuBERT tokens into L1 tokens using a token-based accent conversion module, conditioned by the accent embedding computed from the input speech; 1https://gitlab.inria.fr/rbagat/atc_ generation Figure 1:Synthetic audio generation framework for ATC

[5] [5]

synthesizing the Mel-spectrogram from the L1 tokens using a token-to-Mel synthesizer model, conditioned on a speaker embedding extracted from the input speech and the input ut- terance duration

[6] [6]

L1-to-L2 accent conversion (ACL1→L2) —Our fourth gener- ative approach is L1-to-L2 accent conversion, i.e., generating non-native accented speech from native speech

synthesizing the L1 speech waveform from the generated Mel-spectrogram using a vocoder. L1-to-L2 accent conversion (ACL1→L2) —Our fourth gener- ative approach is L1-to-L2 accent conversion, i.e., generating non-native accented speech from native speech. By increas- ing accent diversity, we investigate whether exposure to a wider range of L2 accents improv...

[7] [7]

radio-like

to leverage its learned representations and adapt it to ATC. TokAN’s token conversion module and token-to-mel synthe- sizer are the core components for accent conversion, and typ- ically require substantially less data to fine-tune than the Hu- BERT encoder or the vocoder. We therefore fine-tune these two modules for L1-to-L2 conversion. In the original T...

[8] [8]

Datasets Real ATC data —Our experiments focus on the ATCO2 3 dataset [30]

Experimental Setup 3.1. Datasets Real ATC data —Our experiments focus on the ATCO2 3 dataset [30]. It consists of ATC communications recorded at seven airports. The dataset exhibits key characteristics of ATC speech, including non-native accents, strong channel noise, high speech rates, and domain-specific phraseology. In our experiments, we use only the ...

[9] [9]

normalized

Results and Discussions Baselines —We consider two baselines (last two rows of Ta- ble 1): the out-of-the-box Whisper-small model (63.32% WER) and the same model fine-tuned exclusively on real ATCO2 data (22.69% WER). In-domain fine-tuning substantially im- proves performance, highlighting the importance of adapting ASR models to ATC-specific data. Fine-t...

[10] [10]

We proposed a novel synthetic data genera- tion framework for ATC speech recognition that includes TTS, VC, AC, and domain-specific acoustical simulation

Conclusion We investigated generative data augmentation strategies to ad- dress data scarcity in Air Traffic Control (ATC) automatic speech recognition. We proposed a novel synthetic data genera- tion framework for ATC speech recognition that includes TTS, VC, AC, and domain-specific acoustical simulation. Further- more, this work introduced a novel contr...

[11] [11]

The authors take the full responsibility for the con- tent of this manuscript

Use of Generative AI Disclosure Generative AI tools were solely used for polishing the manuscript. The authors take the full responsibility for the con- tent of this manuscript

[12] [12]

It was granted access to the HPC resources of IDRIS under the allocation 2024-AD011015024 made by GENCI

Acknowledgments This work was funded by the DeepMAUVES project sup- ported by DGA of french MoD and CNRS, by the Inria-NII TrustedSpeech Associate Team, and MEXT KAKENHI Grants (24K21324). It was granted access to the HPC resources of IDRIS under the allocation 2024-AD011015024 made by GENCI. Authors would like to thank Ye-Xin Lu for giving us access to t...

2024

[13] [13]

wav2vec 2.0: A framework for self-supervised learning of speech repre- sentations,

A. Baevski, Y . Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech repre- sentations,”Advances in Neural Information Processing Systems, vol. 33, pp. 12 449–12 460, 2020

2020

[14] [14]

How does pre-trained wav2vec 2.0 perform on domain-shifted ASR? An extensive benchmark on air traffic control communi- cations,

J. Zuluaga-Gomez, A. Prasad, I. Nigmatulina, S. S. Sarfjoo, P. Motlicek, M. Kleinert, H. Helmke, O. Ohneiser, and Q. Zhan, “How does pre-trained wav2vec 2.0 perform on domain-shifted ASR? An extensive benchmark on air traffic control communi- cations,” in2022 IEEE Spoken Language Technology Workshop (SLT), 2023, pp. 205–212

2023

[15] [15]

The air- bus air traffic control speech recognition 2018 challenge: Towards ATC automatic transcription and call sign detection,

T. Pellegrini, J. Farinas, E. Delpech, and F. Lancelot, “The air- bus air traffic control speech recognition 2018 challenge: Towards ATC automatic transcription and call sign detection,” inInter- speech, 2019, pp. 2993–2997

2018

[16] [16]

Whisper-ATC open models for air traffic control automatic speech recognition with accuracy,

J. van Doorn, J. Sun, J. Hoekstra, P. Jonk, and V . de Vries, “Whisper-ATC open models for air traffic control automatic speech recognition with accuracy,” inInternational Conference on Research in Air Transportation (ICRAT), 2024

2024

[17] [17]

In-domain SSL pre-training and streaming ASR: Application to air traffic control communications,

J. Duret, S. Mdhaffar, G. Laperri `ere, R. Whetten, A. Galametz, C. Kobus, M.-C. Martin, J. Oleiwan, and Y . Est `eve, “In-domain SSL pre-training and streaming ASR: Application to air traffic control communications,” inSpeech and Computer, 2026, pp. 3– 12

2026

[18] [18]

BEST-RQ-based self- supervised learning for Whisper domain adaptation,

R. Bagat, I. Illina, and E. Vincent, “BEST-RQ-based self- supervised learning for Whisper domain adaptation,” in2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2026

2026

[19] [19]

Audio augmen- tation for speech recognition,

T. Ko, V . Peddinti, D. Povey, and S. Khudanpur, “Audio augmen- tation for speech recognition,” inInterspeech, 2015, pp. 3586– 3589

2015

[20] [20]

Deep Speech: Scaling up end-to-end speech recogni- tion,

A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, and A. Y . Ng, “Deep Speech: Scaling up end-to-end speech recogni- tion,” 2014, arXiv preprint arXiv:1412.5567

Pith/arXiv arXiv 2014

[21] [21]

Specaugment: A simple data augmentation method for automatic speech recognition,

D. S. Park, W. Chan, Y . Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V . Le, “Specaugment: A simple data augmentation method for automatic speech recognition,” inInterspeech, 2019, pp. 2613–2617

2019

[22] [22]

Accentron: Foreign accent conversion to arbitrary non-native speakers using zero-shot learning,

S. Ding, G. Zhao, and R. Gutierrez-Osuna, “Accentron: Foreign accent conversion to arbitrary non-native speakers using zero-shot learning,”Computer Speech & Language, vol. 72, p. 101302, 2022

2022

[23] [23]

AccentBox: Towards high-fidelity zero-shot accent generation,

J. Zhong, K. Richmond, Z. Su, and S. Sun, “AccentBox: Towards high-fidelity zero-shot accent generation,” in2025 IEEE Inter- national Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025, pp. 1–5

2025

[24] [24]

Par- rotron: An end-to-end speech-to-speech conversion model and its applications to hearing-impaired speech and speech separation,

F. Biadsy, R. J. Weiss, P. J. Moreno, D. Kanvesky, and Y . Jia, “Par- rotron: An end-to-end speech-to-speech conversion model and its applications to hearing-impaired speech and speech separation,” inInterspeech, 2019, pp. 4115–4119

2019

[25] [25]

Accent modification for speech recognition of non-native speakers using neural style transfer,

K. Radzikowski, L. Wang, O. Yoshie, and R. Nowak, “Accent modification for speech recognition of non-native speakers using neural style transfer,”EURASIP Journal on Audio, Speech, and Music Processing, vol. 2021, no. 1, p. 11, 2021

2021

[26] [26]

Accent conversion us- ing discrete units with parallel data synthesized from controllable accented TTS,

T.-N. Nguyen, Q. Pham, and A. Waibel, “Accent conversion us- ing discrete units with parallel data synthesized from controllable accented TTS,” inSynthetic Data’s Transformative Role in Foun- dational Speech Models, 2024, pp. 51–55

2024

[27] [27]

Ac- cent normalization using self-supervised discrete tokens with non- parallel data,

Q. Bai, S. Inoue, S. Wang, Z. Jiang, Y . Wang, and H. Li, “Ac- cent normalization using self-supervised discrete tokens with non- parallel data,” inInterspeech, 2025, pp. 1618–1622

2025

[28] [28]

Non-parallel voice conversion for ASR augmentation,

G. Wang, A. Rosenberg, B. Ramabhadran, F. Biadsy, J. Emond, Y . Huang, and P. J. Moreno, “Non-parallel voice conversion for ASR augmentation,” inInterspeech, 2022, pp. 3408–3412

2022

[29] [29]

Fairness in automatic speech recognition isn’t a one-size-fits- all,

H. ElGhazaly, B. Mirheidari, H. Christensen, and N. S. Moosavi, “Fairness in automatic speech recognition isn’t a one-size-fits- all,” inFindings of the Association for Computational Linguistics: EMNLP 2025, 2025, pp. 19 169–19 178

2025

[30] [30]

V oice conversion can improve ASR in very low-resource settings,

M. Baas and H. Kamper, “V oice conversion can improve ASR in very low-resource settings,” inInterspeech, 2022, pp. 3513–3517

2022

[31] [31]

Speech recognition for air traf- fic control utilizing a multi-head state-space model and transfer learning,

H. Liang, H. Chang, and J. Kong, “Speech recognition for air traf- fic control utilizing a multi-head state-space model and transfer learning,”Aerospace, vol. 11, no. 5, p. 390, 2024

2024

[32] [32]

Adapting auto- matic speech recognition for accented air traffic control commu- nications,

M. Y . Z. Wee, J. J. H. Wong, L. Lim, J. Y . W. Tan, P. Gupta, D. Lim, E. H. Tew, A. K. S. Han, and Y . Z. Lim, “Adapting auto- matic speech recognition for accented air traffic control commu- nications,” in2025 International Conference on Military Commu- nication and Information Systems (ICMCIS), 2025, pp. 1–10

2025

[33] [33]

A unified frame- work for multilingual speech recognition in air traffic control sys- tems,

Y . Lin, D. Guo, J. Zhang, Z. Chen, and B. Yang, “A unified frame- work for multilingual speech recognition in air traffic control sys- tems,”IEEE Transactions on Neural Networks and Learning Sys- tems, vol. 32, no. 8, pp. 3608–3620, 2020

2020

[34] [34]

Automatic speech recognition for air traffic control communications,

S. Badrinath and H. Balakrishnan, “Automatic speech recognition for air traffic control communications,”Transportation research record, vol. 2676, no. 1, pp. 798–810, 2022

2022

[35] [35]

Improv- ing speech recognition models with small samples for air traffic control systems,

Y . Lin, Q. Li, B. Yang, Z. Yan, H. Tan, and Z. Chen, “Improv- ing speech recognition models with small samples for air traffic control systems,”Neurocomputing, vol. 445, pp. 287–297, 2021

2021

[36] [36]

Exploring con- textual knowledge-enhanced speech recognition in air traffic con- trol communication: A comparative study,

D. Guo, S. Zhang, J. Zhang, B. Yang, and Y . Lin, “Exploring con- textual knowledge-enhanced speech recognition in air traffic con- trol communication: A comparative study,”IEEE Transactions on Neural Networks and Learning Systems, vol. 36, no. 9, pp. 16 085–16 099, 2025

2025

[37] [37]

DAIEN-TTS: Disentangled audio infilling for environment- aware text-to-speech synthesis,

Y .-X. Lu, Y . Gu, K. Wei, H.-P. Du, Y . Ai, and Z.-H. Ling, “DAIEN-TTS: Disentangled audio infilling for environment- aware text-to-speech synthesis,” in2026 IEEE International Con- ference on Acoustics, Speech and Signal Processing (ICASSP), 2026

2026

[38] [38]

Au- dioSR: Versatile audio super-resolution at scale,

H. Liu, K. Chen, Q. Tian, W. Wang, and M. D. Plumbley, “Au- dioSR: Versatile audio super-resolution at scale,” in2024 IEEE International Conference on Acoustics, Speech and Signal Pro- cessing (ICASSP), 2024, pp. 1076–1080

2024

[39] [39]

F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching,

Y . Chen, Z. Niu, Z. Ma, K. Deng, C. Wang, J. JianZhao, K. Yu, and X. Chen, “F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching,” in63rd Annual Meeting of the ACL (Volume 1: Long Papers), 2025, pp. 6255–6271

2025

[40] [40]

V oice conversion with just nearest neighbors,

M. Baas, B. van Niekerk, and H. Kamper, “V oice conversion with just nearest neighbors,” inInterspeech 2023, 2023, pp. 2053– 2057

2023

[41] [41]

HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,

W.-N. Hsu, B. Bolte, Y .-H. H. Tsai, K. Lakhotia, R. Salakhut- dinov, and A. Mohamed, “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Pro- cessing, vol. 29, pp. 3451–3460, 2021

2021

[42] [42]

ATCO2 corpus: A large-scale dataset for research on automatic speech recognition and natural language understanding of air traffic control communications,

J. P. Z. Gomez, K. Vesel ´y, I. Sz¨oke, A. Blatt, P. Motlicek, M. Ko- cour, K. Choukri, I. Nigmatulina, C. Cevenini, A. Tart, J. Cer- nock´y, and D. Klakow, “ATCO2 corpus: A large-scale dataset for research on automatic speech recognition and natural language understanding of air traffic control communications,”Journal of Data-centric Machine Learning Res...

2024

[43] [43]

L2-ARCTIC: A non-native English speech corpus,

G. Zhao, S. Sonsaat, A. Silpachai, I. Lucic, E. Chukharev- Hudilainen, J. Levis, and R. Gutierrez-Osuna, “L2-ARCTIC: A non-native English speech corpus,” inInterspeech, 2018, pp. 2783–2787

2018

[44] [44]

Robust speech recognition via large-scale weak supervision,

A. Radford, J. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” inInternational Conference on Machine Learning, 2023, pp. 28 492–28 518

2023

[45] [45]

Non- native children speech recognition through transfer learning,

M. Matassoni, R. Gretter, D. Falavigna, and D. Giuliani, “Non- native children speech recognition through transfer learning,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 6229–6233

2018

[46] [46]

NIST, “SCTK,” https://github.com/usnistgov/SCTK.git, 2024

2024