Can Large Language Models Reliably Correct Errors in Low-Resource ASR? A Contamination-Aware Case Study on West Frisian

Martijn Wieling; Reihaneh Amooie; Rik van Noord; Wietse de Vries; Yun Hao

arxiv: 2605.19711 · v1 · pith:GUIZUHKUnew · submitted 2026-05-19 · 💻 cs.CL

Yun Hao, Reihaneh Amooie, Wietse de Vries, Rik van Noord, Martijn Wieling

Pith reviewed 2026-05-20 05:28 UTC · model grok-4.3

classification 💻 cs.CL

keywords Large Language Models · Generative Error Correction · Automatic Speech Recognition · Low-Resource Languages · West Frisian · Data Contamination · Error Analysis

The pith

Large language models can correct errors in low-resource West Frisian ASR, even when evaluated on non-public texts they are unlikely to have seen during training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether generative error correction with LLMs can improve automatic speech recognition outputs for the low-resource language West Frisian. The authors evaluate several models on a standard public corpus and introduce a new offline dataset of non-public texts to check whether gains come from genuine correction or from data the models already saw during training. Results show improvements in most settings, and the strongest GPT-5.1 outputs even beat oracle word error rates. Similar gains on the offline set indicate the models are fixing errors rather than recalling memorized sentences. The work also includes an error analysis that maps out the kinds of mistakes the models tend to repair.

Core claim

Generative error correction using LLMs improves ASR performance for West Frisian in most tested configurations, with the best GPT-5.1 results exceeding oracle WERs; comparable gains appear on a newly constructed offline dataset of non-public texts, supporting the conclusion that observed improvements reflect true correction ability rather than contamination.
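
To make the oracle comparison concrete, the sketch below computes corpus-level WER and an N-best oracle WER in plain Python. It is an illustration under stated assumptions, not the authors' evaluation code: the function names (`edit_distance`, `corpus_wer`, `oracle_wer`), the whitespace tokenization, and the toy English sentences are hypothetical, and the oracle is assumed to be the usual "pick the lowest-error hypothesis from each N-best list" definition.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance between a reference and a hypothesis."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def corpus_wer(refs: List[str], hyps: List[str]) -> float:
    """Total word errors divided by total reference words."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    words = sum(len(r.split()) for r in refs)
    return errors / words


def oracle_wer(refs: List[str], nbest: List[List[str]]) -> float:
    """WER obtained by selecting the lowest-error hypothesis per utterance."""
    errors = sum(min(edit_distance(r.split(), h.split()) for h in hyps)
                 for r, hyps in zip(refs, nbest))
    words = sum(len(r.split()) for r in refs)
    return errors / words


# Toy data (hypothetical): each hypothesis has one error in a different place.
refs = ["the cat sat on the mat"]
nbest = [["the cat sat on a mat", "the bat sat on the mat"]]
generated = ["the cat sat on the mat"]  # a corrector that merges both hypotheses

print("oracle WER:", oracle_wer(refs, nbest))
print("generated WER:", corpus_wer(refs, generated))
```

Because the oracle can only select among existing hypotheses, a generative corrector that combines the correct parts of several hypotheses can score below it, which is the situation the toy data is built to show.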

What carries the argument

Generative error correction (GER) applied to ASR hypotheses for Frisian, using public and offline evaluation sets to separate correction skill from training-data overlap.

If this is right

GER can be added as a post-processing step to raise accuracy for other low-resource ASR systems without retraining the recognizer.
Offline or private evaluation sets become necessary to validate LLM-based correction claims in any language with limited public data.
Error-pattern analysis can inform prompt design or model choice to target the specific mistake types that LLMs handle well.
Surpassing oracle WER suggests hybrid ASR-plus-LLM pipelines may reach lower error rates than either component alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The contamination-control method used here could be adopted as standard practice when testing LLMs on any low-resource speech or text task.
Smaller open models might achieve similar correction gains if the same offline evaluation protocol is applied.
Success on Frisian raises the possibility that LLM correction works for other under-resourced languages whose orthography or phonology differs from high-resource training data.
If the pattern holds, future ASR systems for minority languages could rely on lightweight recognizers followed by LLM correction rather than massive end-to-end training.

Load-bearing premise

The offline dataset of non-public texts shares no overlap with any LLM training corpus, so measured gains must come from correction rather than recall.

What would settle it

Finding any of the offline dataset sentences inside an LLM's training data, or observing that performance gains vanish on a second, independently verified unseen corpus, would falsify the claim of genuine correction.

Figures

Figures reproduced from arXiv: 2605.19711 by Martijn Wieling, Reihaneh Amooie, Rik van Noord, Wietse de Vries, Yun Hao.

**Figure 1.** Pipeline of the LLM-based generative error correction system. Note that we use N = 5 in this paper.

**Figure 2.** The prompts used for our generative error correction system.

**Figure 3.** Sentence-level improvement rates of trigram, Qwen3-FT and GPT-5.1 (generation vs. selection) on both Frisian ASR datasets.

Original abstract

Automatic speech recognition (ASR) has improved substantially in recent years, yet performance remains limited for low-resource languages. Large language models (LLMs) have shown promise for improving ASR through generative error correction (GER), but their effectiveness in low-resource settings remains underexplored. In addition, it remains unclear to what extent data contamination influences the reported improvements in LLM-based GER. This study investigates LLM-based GER for low-resource Frisian. In addition to a public corpus, we construct and use a Frisian offline dataset with non-public texts for evaluation to control for potential data contamination. Results show that GER improves ASR performance in most settings, with the best GPT-5.1 results surpassing oracle WERs. Comparable gains on the offline dataset indicate that improvements reflect true correction ability. We further provide a detailed error analysis revealing model correction patterns.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper shows LLM-based error correction can lift low-resource Frisian ASR and that gains hold on an offline dataset, but the no-overlap claim rests on an unverified assumption.

The main thing to know is that this work tests whether LLMs can correct ASR errors in West Frisian and uses an offline dataset of non-public texts to check whether the gains come from real correction or from training data leakage. The offline comparison is the clearest new element here. They report that generative error correction improves WER in most conditions, that the strongest GPT-5.1 runs beat the oracle, and that the pattern looks similar on the offline set. The error analysis adds some detail on what kinds of mistakes the models fix. That combination gives a practical data point for low-resource settings where new labeled data is hard to get. The design is straightforward and the Frisian case is a useful addition to the existing GER literature. The central claim holds up on its own terms as long as the offline texts really are outside the pretraining distribution.

The soft spot is exactly the one the stress-test flags. The paper states it uses non-public texts for the control but supplies no n-gram overlap numbers, no membership-inference results, and no deduplication report against Common Crawl or other known corpora. Without those checks, the comparable gains on the offline set do not fully isolate genuine correction from possible partial leakage. That assumption is load-bearing for the contamination-aware conclusion, so the evidence is weaker than it first appears. Everything else in the experimental setup looks standard and reproducible from the abstract.

This paper is for researchers working on low-resource ASR, post-processing with LLMs, or language preservation tools. A reader who needs a concrete protocol for contamination-aware evaluation will find the offline dataset idea worth looking at. It deserves a serious referee because the question is practical, the Frisian results are new, and the control attempt is a step in the right direction even if it needs tighter verification.

I would send it to review and ask the authors to add quantitative overlap statistics or a clear statement of what they did to confirm isolation.

Referee Report

1 major / 2 minor

Summary. The paper evaluates generative error correction (GER) using large language models to improve ASR for low-resource West Frisian. It compares results on a public corpus against a constructed offline dataset using non-public texts, reporting that GER yields consistent gains in most settings, with top GPT-5.1 outputs surpassing oracle WERs, and that comparable gains on the offline data indicate genuine correction rather than contamination effects. A detailed error analysis of correction patterns is included.

Significance. If the contamination control holds, the work offers a useful empirical demonstration that LLM-based GER can aid low-resource ASR without relying on memorization, addressing a practical concern for languages with limited public data. The offline dataset construction is a methodological strength that could inform future evaluations in similar settings.

major comments (1)

[Offline dataset construction and evaluation] The section on the offline dataset and contamination control: the central claim that 'comparable gains on the offline dataset indicate that improvements reflect true correction ability' depends on the non-public Frisian texts having zero or negligible overlap with LLM pretraining corpora. No n-gram overlap statistics, deduplication reports, or membership-inference results against Common Crawl or known LLM training mixtures are supplied, leaving the isolation of genuine correction from possible partial contamination unverified.
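
One of the checks requested here, n-gram overlap against an accessible text corpus, could be sketched roughly as follows. This is a minimal illustration and not a procedure taken from the paper: the helper names (`ngrams`, `overlap_rate`), the lowercased whitespace tokenization, and the default of n = 5 are assumptions, and the placeholder strings stand in for real sentences.

```python
from typing import Iterable, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int) -> Set[NGram]:
    """Set of word n-grams in a lowercased, whitespace-tokenized string."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_rate(eval_sentences: Iterable[str],
                 reference_corpus: Iterable[str],
                 n: int = 5) -> float:
    """Fraction of evaluation sentences sharing at least one n-gram
    with any document in the reference corpus."""
    seen: Set[NGram] = set()
    for doc in reference_corpus:
        seen |= ngrams(doc, n)
    sentences = list(eval_sentences)
    if not sentences:
        return 0.0
    flagged = sum(1 for s in sentences if ngrams(s, n) & seen)
    return flagged / len(sentences)


# Placeholder data (hypothetical): the first sentence shares a 5-gram
# with the reference document, the second does not.
eval_set = ["a b c d e f", "x y z w v u"]
corpus = ["q a b c d e r"]
print(overlap_rate(eval_set, corpus, n=5))
```

A check like this can only be run against corpora that are actually available, such as a public web crawl snapshot. The training data of closed models cannot be inspected this way, so a low overlap rate would count as supporting evidence rather than proof of isolation.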

minor comments (2)

[Results and discussion] Results reporting lacks exact WER numbers, confidence intervals, and statistical significance tests for the claimed improvements over baselines and oracle; these should be added to tables and text for verifiability.
[Abstract] The abstract states 'best GPT-5.1 results surpassing oracle WERs' but does not clarify whether this holds after accounting for variance or on which specific test sets; add precise qualifiers.
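
For the confidence intervals requested above, a paired bootstrap over utterances is one common option. The sketch below is an assumption about how such an interval might be computed, not the authors' method: the function name `bootstrap_wer_diff`, its parameters, and the toy error counts are hypothetical. It expects per-utterance reference lengths and word-error counts for the baseline and the corrected output.

```python
import random
from typing import List, Tuple


def bootstrap_wer_diff(ref_lens: List[int],
                       base_errs: List[int],
                       corr_errs: List[int],
                       n_resamples: int = 1000,
                       alpha: float = 0.05,
                       seed: int = 0) -> Tuple[float, float, float]:
    """Return (observed WER reduction, CI lower bound, CI upper bound).

    Utterances are resampled with replacement; the same indices are used
    for both systems, so the comparison stays paired.
    """
    rng = random.Random(seed)
    n = len(ref_lens)
    observed = (sum(base_errs) - sum(corr_errs)) / sum(ref_lens)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        words = sum(ref_lens[i] for i in idx)
        diffs.append(sum(base_errs[i] - corr_errs[i] for i in idx) / words)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return observed, lower, upper


# Toy counts (hypothetical): 20 utterances of 10 words each,
# 3 baseline errors and 2 corrected errors per utterance.
obs, lo, hi = bootstrap_wer_diff([10] * 20, [3] * 20, [2] * 20, n_resamples=200)
print(obs, lo, hi)
```

If the resulting interval excludes zero, the WER reduction is unlikely to be a resampling artifact. The same routine could be applied to the gap between the corrected output and the oracle.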

Simulated Author's Rebuttal

1 response · 0 unresolved

Thank you for reviewing our manuscript and for highlighting the importance of rigorously verifying the contamination control in our offline dataset. We provide a detailed response to the major comment below and have updated the manuscript to address the concerns raised.

Referee: [Offline dataset construction and evaluation] The section on the offline dataset and contamination control: the central claim that 'comparable gains on the offline dataset indicate that improvements reflect true correction ability' depends on the non-public Frisian texts having zero or negligible overlap with LLM pretraining corpora. No n-gram overlap statistics, deduplication reports, or membership-inference results against Common Crawl or known LLM training mixtures are supplied, leaving the isolation of genuine correction from possible partial contamination unverified.

Authors: We agree with the referee that additional verification such as n-gram overlap statistics would strengthen our claims. The offline dataset was constructed using non-public texts that have not been released online or in any public repository, making their presence in LLM pretraining corpora (sourced primarily from public web data) extremely unlikely. We have revised the manuscript to provide a more thorough description of how the offline dataset was assembled and to include an explicit discussion of the contamination control assumptions in a new limitations paragraph.

Revision: yes

Circularity Check

0 steps flagged

Empirical evaluation with no circular derivation

This is an empirical study comparing ASR error correction performance using LLMs on a public Frisian corpus versus a constructed offline dataset with non-public texts. The central claim that comparable gains on the offline set demonstrate genuine correction (rather than contamination) rests on direct experimental measurements of WER improvements, not on any mathematical derivation, fitted parameter renamed as prediction, or self-citation chain that reduces the result to its own inputs by construction. No equations or ansatzes are invoked that would create self-definitional loops, and the offline control is presented as an external benchmark rather than a tautological redefinition of the outcome.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the assumption that the offline dataset avoids LLM training data overlap; no free parameters or invented entities are introduced.

axioms (1)

Domain assumption: Offline dataset texts have no overlap with any LLM training data.
This premise is required to interpret offline gains as evidence of true correction ability rather than contamination.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear
Paper passage: Results show that GER improves ASR performance in most settings, with the best GPT-5.1 results surpassing oracle WERs. Comparable gains on the offline dataset indicate that improvements reflect true correction ability.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · relation: unclear
Paper passage: we construct and use a Frisian offline dataset with non-public texts for evaluation to control for potential data contamination

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · 2 internal anchors

[1] Introduction. In recent years, automatic speech recognition (ASR) has seen remarkable progress, with substantial gains in recognition accuracy and robustness. Multilingual self-supervised and weakly supervised speech models, including XLS-R [1] and Whisper [2], have played a central role in improving ASR performance, particularly for low-resource lan...
[2] Related Work. 2.1. Generative Error Correction with LLMs. Ma et al. [5] first demonstrated that generative LLMs such as ChatGPT can effectively correct ASR outputs using zero-shot and few-shot prompting with N-best hypotheses as input. Chen et al. [7] introduced HyPoradise, an open benchmark providing large-scale N-best hypotheses paired with reference tran...
[3] Methods. 3.1. Datasets. Common Voice 17.0. The Common Voice corpus [9] is a massively-multilingual collection of transcribed speech intended for speech technology research and development. The text in Common Voice primarily originates from publicly available sources, particularly Wikipedia articles, and is supplemented by community-submitted sentences...
[4] Results and Discussion. Common Voice Test Dataset. Table 2 presents the WER results for the Common Voice test dataset. We observe that, with the exception of the original (non-fine-tuned) Qwen3 model, all LLMs improve over the baseline XLS-R system. Even after LoRA fine-tuning and providing examples, Qwen3-FT yields only a marginal improvement (13.4%), in...
[5] Conclusion. This work investigated the effectiveness of LLM-based generative error correction for low-resource ASR on both public and non-public Frisian datasets. We demonstrated that GPT models can substantially improve ASR performance beyond the traditional trigram model and even surpass the five-best oracle in certain settings (RQ1). The consistent ...
[6] Generative AI Use Disclosure. In preparing this manuscript, we used GPT-5.1 to improve the quality of the writing, translate written text into English, and help with writing code and debugging. It was not used for writing any major part of the paper. The final content was fully reviewed by all the authors.
[7] A. Babu, C. Wang, A. Tjandra, K. Lakhotia, Q. Xu, N. Goyal, K. Singh, P. von Platen, Y. Saraf, J. Pino, A. Baevski, A. Conneau, and M. Auli, “XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale,” in Interspeech 2022, 2022, pp. 2278–2282.
[8] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in International conference on machine learning. PMLR, 2023, pp. 28492–28518.
[9] M. Bartelds, N. San, B. McDonnell, D. Jurafsky, and M. Wieling, “Making more of little data: Improving low-resource automatic speech recognition using data augmentation,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 715–729.
[10] Z. X. Yong, V. Pratap, M. Auli, and J. Maillard, “Effects of Speaker Count, Duration, and Accent Diversity on Zero-Shot Accent Robustness in Low-Resource ASR,” in Interspeech 2025, 2025, pp. 1148–1152.
[11] R. Ma, M. Qian, P. Manakul, M. Gales, and K. Knill, “Can generative large language models perform ASR error correction?” arXiv preprint arXiv:2307.04172, 2023.
[12] C.-H. H. Yang, Y. Gu, Y.-C. Liu, S. Ghosh, I. Bulyko, and A. Stolcke, “Generative speech recognition error correction with large language models and task-activating prompting,” in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2023, pp. 1–8.
[13] C. Chen, Y. Hu, C.-H. H. Yang, S. M. Siniscalchi, P.-Y. Chen, and E.-S. Chng, “Hyporadise: An open baseline for generative speech recognition with large language models,” Advances in Neural Information Processing Systems, vol. 36, pp. 31665–31688, 2023.
[14] N. Yamashita, M. Yamamoto, H. Kokubo, and Y. Kawaguchi, “LLM-based Generative Error Correction for Rare Words with Synthetic Data and Phonetic Context,” in Interspeech 2025, 2025, pp. 3653–3657.
[15] R. Ardila, M. Branson, K. Davis, M. Kohler, J. Meyer, M. Henretty, R. Morais, L. Saunders, F. Tyers, and G. Weber, “Common voice: A massively-multilingual speech corpus,” in Proceedings of the twelfth language resources and evaluation conference, 2020, pp. 4218–4222.
[16] M. Naderi, E. Hermann, A. Nanchen, S. Hovsepyan, and M. Magimai.-Doss, “Towards interfacing large language models with ASR systems using confidence measures and prompting,” in Interspeech 2024, 2024, pp. 2980–2984.
[17] S. Li, C. Chen, C. Y. Kwok, C. Chu, E. S. Chng, and H. Kawai, “Investigating ASR Error Correction with Large Language Model and Multilingual 1-best Hypotheses,” in Interspeech 2024, 2024, pp. 1315–1319.
[18] Z. Yang, Z. Wan, S. Li, C.-H. H. Yang, and C. Chu, “CoVoGER: A multilingual multitask benchmark for speech-to-text generative error correction with large language models,” in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025, pp. 6313–6325.
[19] A. Xu, T. Feng, S. H. Kim, S. Bishop, C. Lord, and S. Narayanan, “Large Language Models based ASR Error Correction for Child Conversations,” in Interspeech 2025, 2025, pp. 2840–2844.
[20] O. Sainz, J. Campos, I. García-Ferrero, J. Etxaniz, O. L. de Lacalle, and E. Agirre, “NLP evaluation in trouble: On the need to measure LLM data contamination for each benchmark,” in Findings of the Association for Computational Linguistics: EMNLP 2023, H. Bouamor, J. Pino, and K. Bali, Eds. Singapore: Association for Computational Linguistics, Dec. 2023...
[21] Y. Li, Y. Guo, F. Guerin, and C. Lin, “An open-source data contamination report for large language models,” in Findings of the Association for Computational Linguistics: EMNLP 2024, 2024, pp. 528–541.
[22] C. Deng, Y. Zhao, X. Tang, M. Gerstein, and A. Cohan, “Investigating data contamination in modern benchmarks for large language models,” in Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2024, pp. 8706–8719.
[23] S. Chen, Y. Chen, Z. Li, Y. Jiang, Z. Wan, Y. He, D. Ran, T. Gu, H. Li, T. Xie et al., “Benchmarking large language models under data contamination: A survey from static to dynamic evaluation,” in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025, pp. 10091–10109.
[24] R. Ma, M. Qian, M. Gales, and K. Knill, “ASR error correction using large language models,” IEEE Transactions on Audio, Speech and Language Processing, 2025.
[25] S. Balloccu, P. Schmidtová, M. Lango, and O. Dušek, “Leak, cheat, repeat: Data contamination and evaluation malpractices in closed-source LLMs,” in Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 67–93.
[26] C. Xu, S. Guan, D. Greene, M. Kechadi et al., “Benchmark data contamination of large language models: A survey,” arXiv preprint arXiv:2406.04244, 2024.
[27] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in neural information processing systems, vol. 33, pp. 12449–12460, 2020.
