Can Large Language Models Reliably Correct Errors in Low-Resource ASR? A Contamination-Aware Case Study on West Frisian
Pith reviewed 2026-05-20 05:28 UTC · model grok-4.3
The pith
Large language models can correct errors in low-resource West Frisian ASR, even when evaluated on texts they could not have seen during training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Generative error correction using LLMs improves ASR performance for West Frisian in most tested configurations, with the best GPT-5.1 results exceeding oracle WERs; comparable gains appear on a newly constructed offline dataset of non-public texts, supporting the conclusion that observed improvements reflect true correction ability rather than contamination.
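To make the "exceeding oracle WERs" comparison concrete, here is a minimal sketch of corpus-level WER and N-best oracle WER. It is not the authors' evaluation code: the function names are hypothetical, tokenization is plain whitespace splitting, references are assumed to be non-empty, and the toy sentences are invented English examples rather than Frisian data.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def corpus_wer(refs: List[str], hyps: List[str]) -> float:
    """Total word errors divided by total reference words."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    words = sum(len(r.split()) for r in refs)
    return errors / words


def oracle_wer(refs: List[str], nbest_lists: List[List[str]]) -> float:
    """WER obtained by always picking the best hypothesis in each N-best list."""
    errors = sum(
        min(edit_distance(r.split(), h.split()) for h in nbest)
        for r, nbest in zip(refs, nbest_lists)
    )
    words = sum(len(r.split()) for r in refs)
    return errors / words


# Invented toy data: one reference and a 2-best list from a hypothetical recognizer.
refs = ["it is nice weather today"]
nbest = [["it is nice whether today", "it his nice weather to day"]]
one_best = [hyps[0] for hyps in nbest]
corrected = ["it is nice weather today"]  # what a corrector might output

print(corpus_wer(refs, one_best), oracle_wer(refs, nbest), corpus_wer(refs, corrected))
```

In this toy case the 1-best WER and the oracle WER are both 0.2, while the corrected output scores 0.0. The oracle can only select among existing hypotheses, so a generative corrector that writes a new word sequence can go below it.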
What carries the argument
Generative error correction (GER) applied to ASR hypotheses for Frisian, using public and offline evaluation sets to separate correction skill from training-data overlap.
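As an illustration of what "applied to ASR hypotheses" can look like in practice, the sketch below assembles an N-best list into a zero-shot correction prompt. The prompt wording and the function name `build_ger_prompt` are assumptions made for illustration; the paper's actual prompts are not reproduced here, and no LLM API is called.

```python
from typing import List


def build_ger_prompt(nbest: List[str], language: str = "West Frisian") -> str:
    """Format an N-best list as a zero-shot correction prompt (hypothetical wording)."""
    numbered = "\n".join(f"{i}. {hyp}" for i, hyp in enumerate(nbest, start=1))
    return (
        f"Below are {len(nbest)} candidate transcriptions of one {language} "
        "utterance, produced by a speech recognizer and ordered from most to "
        "least likely.\n"
        f"{numbered}\n"
        "Return the single most plausible corrected transcription and nothing else."
    )


# Placeholder hypotheses; real input would be the recognizer's N-best output.
prompt = build_ger_prompt(["hypothesis one", "hypothesis two"])
print(prompt)
```

The returned string would then be sent to whichever model is under test; a few-shot variant could prepend hypothesis and reference pairs from a training split.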
If this is right
- GER can be added as a post-processing step to raise accuracy for other low-resource ASR systems without retraining the recognizer.
- Offline or private evaluation sets become necessary to validate LLM-based correction claims in any language with limited public data.
- Error-pattern analysis can inform prompt design or model choice to target the specific mistake types that LLMs handle well.
- Surpassing oracle WER suggests hybrid ASR-plus-LLM pipelines may reach lower error rates than either component alone.
Where Pith is reading between the lines
- The contamination-control method used here could be adopted as standard practice when testing LLMs on any low-resource speech or text task.
- Smaller open models might achieve similar correction gains if the same offline evaluation protocol is applied.
- Success on Frisian raises the possibility that LLM correction works for other under-resourced languages whose orthography or phonology differs from high-resource training data.
- If the pattern holds, future ASR systems for minority languages could rely on lightweight recognizers followed by LLM correction rather than massive end-to-end training.
Load-bearing premise
The offline dataset of non-public texts shares no overlap with any LLM training corpus, so measured gains must come from correction rather than recall.
What would settle it
Finding any of the offline dataset sentences inside an LLM's training data, or observing that performance gains vanish on a second, independently verified unseen corpus, would falsify the claim of genuine correction.
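One simple way to look for such overlap is a word n-gram check against whatever reference text is available. The sketch below is an assumption-laden proxy, not a procedure from the paper: the function names are hypothetical, and because the pretraining data of closed models is not public, the comparison corpus can only be a public stand-in such as a web crawl sample.

```python
from typing import Iterable, Set, Tuple


def word_ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """All lowercase word n-grams in a text, as a set of tuples."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_rate(eval_sentences: Iterable[str],
                 reference_corpus: Iterable[str],
                 n: int = 5) -> float:
    """Fraction of distinct evaluation n-grams that also occur in the corpus."""
    corpus_ngrams: Set[Tuple[str, ...]] = set()
    for doc in reference_corpus:
        corpus_ngrams |= word_ngrams(doc, n)
    eval_ngrams: Set[Tuple[str, ...]] = set()
    for sentence in eval_sentences:
        eval_ngrams |= word_ngrams(sentence, n)
    if not eval_ngrams:
        return 0.0  # nothing to compare (all sentences shorter than n words)
    return len(eval_ngrams & corpus_ngrams) / len(eval_ngrams)


# Invented toy example with trigrams.
rate = overlap_rate(
    ["the quick brown fox jumps over the lazy dog"],
    ["a quick brown fox jumps high"],
    n=3,
)
print(rate)
```

A rate near zero on a large public corpus would be consistent with the texts being unseen, but it cannot prove absence from a proprietary training set.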
Original abstract
Automatic speech recognition (ASR) has improved substantially in recent years, yet performance remains limited for low-resource languages. Large language models (LLMs) have shown promise for improving ASR through generative error correction (GER), but their effectiveness in low-resource settings remains underexplored. In addition, it remains unclear to what extent data contamination influences the reported improvements in LLM-based GER. This study investigates LLM-based GER for low-resource Frisian. In addition to a public corpus, we construct and use a Frisian offline dataset with non-public texts for evaluation to control for potential data contamination. Results show that GER improves ASR performance in most settings, with the best GPT-5.1 results surpassing oracle WERs. Comparable gains on the offline dataset indicate that improvements reflect true correction ability. We further provide a detailed error analysis revealing model correction patterns.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates generative error correction (GER) using large language models to improve ASR for low-resource West Frisian. It compares results on a public corpus against a constructed offline dataset using non-public texts, reporting that GER yields consistent gains in most settings, with top GPT-5.1 outputs surpassing oracle WERs, and that comparable gains on the offline data indicate genuine correction rather than contamination effects. A detailed error analysis of correction patterns is included.
Significance. If the contamination control holds, the work offers a useful empirical demonstration that LLM-based GER can aid low-resource ASR without relying on memorization, addressing a practical concern for languages with limited public data. The offline dataset construction is a methodological strength that could inform future evaluations in similar settings.
major comments (1)
- [Offline dataset construction and evaluation] The section on the offline dataset and contamination control: the central claim that 'comparable gains on the offline dataset indicate that improvements reflect true correction ability' depends on the non-public Frisian texts having zero or negligible overlap with LLM pretraining corpora. No n-gram overlap statistics, deduplication reports, or membership-inference results against Common Crawl or known LLM training mixtures are supplied, leaving the isolation of genuine correction from possible partial contamination unverified.
minor comments (2)
- [Results and discussion] Results reporting lacks exact WER numbers, confidence intervals, and statistical significance tests for the claimed improvements over baselines and oracle; these should be added to tables and text for verifiability.
- [Abstract] The abstract states 'best GPT-5.1 results surpassing oracle WERs' but does not clarify whether this holds after accounting for variance or on which specific test sets; add precise qualifiers.
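The confidence intervals requested above could be obtained with a paired bootstrap over utterances. The following is a minimal sketch under stated assumptions: per-utterance error counts for the baseline and the corrected output are already available, every reference has at least one word, and the function name and toy numbers are invented for illustration.

```python
import random
from typing import List, Tuple


def bootstrap_wer_diff(base_errors: List[int],
                       corr_errors: List[int],
                       ref_lengths: List[int],
                       n_resamples: int = 1000,
                       alpha: float = 0.05,
                       seed: int = 0) -> Tuple[float, float]:
    """Percentile interval for (baseline WER - corrected WER), resampling utterances."""
    rng = random.Random(seed)
    n = len(ref_lengths)
    diffs = []
    for _ in range(n_resamples):
        # Draw utterance indices with replacement; both systems share the sample.
        idx = [rng.randrange(n) for _ in range(n)]
        words = sum(ref_lengths[i] for i in idx)
        base = sum(base_errors[i] for i in idx) / words
        corr = sum(corr_errors[i] for i in idx) / words
        diffs.append(base - corr)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Invented per-utterance counts: word errors before and after correction.
low, high = bootstrap_wer_diff(
    base_errors=[2, 1, 3, 2],
    corr_errors=[1, 0, 2, 1],
    ref_lengths=[10, 8, 12, 9],
)
print(low, high)
```

If the interval excludes zero, the improvement is unlikely to be a resampling artifact; with real data, far more than four utterances would be needed for the interval to be meaningful.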
Simulated Author's Rebuttal
Thank you for reviewing our manuscript and for highlighting the importance of rigorously verifying the contamination control in our offline dataset. We provide a detailed response to the major comment below and have updated the manuscript to address the concerns raised.
Point-by-point responses
- Referee: [Offline dataset construction and evaluation] The section on the offline dataset and contamination control: the central claim that 'comparable gains on the offline dataset indicate that improvements reflect true correction ability' depends on the non-public Frisian texts having zero or negligible overlap with LLM pretraining corpora. No n-gram overlap statistics, deduplication reports, or membership-inference results against Common Crawl or known LLM training mixtures are supplied, leaving the isolation of genuine correction from possible partial contamination unverified.
  Authors: We agree with the referee that additional verification such as n-gram overlap statistics would strengthen our claims. The offline dataset was constructed using non-public texts that have not been released online or in any public repository, making their presence in LLM pretraining corpora (sourced primarily from public web data) extremely unlikely. We have revised the manuscript to provide a more thorough description of how the offline dataset was assembled and to include an explicit discussion of the contamination control assumptions in a new limitations paragraph.
  Revision: yes
Circularity Check
Empirical evaluation with no circular derivation
Full rationale
This is an empirical study comparing ASR error correction performance using LLMs on a public Frisian corpus versus a constructed offline dataset with non-public texts. The central claim that comparable gains on the offline set demonstrate genuine correction (rather than contamination) rests on direct experimental measurements of WER improvements, not on any mathematical derivation, fitted parameter renamed as prediction, or self-citation chain that reduces the result to its own inputs by construction. No equations or ansatzes are invoked that would create self-definitional loops, and the offline control is presented as an external benchmark rather than a tautological redefinition of the outcome.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: offline dataset texts have no overlap with any LLM training data.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: "Results show that GER improves ASR performance in most settings, with the best GPT-5.1 results surpassing oracle WERs. Comparable gains on the offline dataset indicate that improvements reflect true correction ability."
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, absolute_floor_iff_bare_distinguishability
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: "we construct and use a Frisian offline dataset with non-public texts for evaluation to control for potential data contamination"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Introduction. In recent years, automatic speech recognition (ASR) has seen remarkable progress, with substantial gains in recognition accuracy and robustness. Multilingual self-supervised and weakly supervised speech models, including XLS-R [1] and Whisper [2], have played a central role in improving ASR performance, particularly for low-resource lan...
- [2] Related Work. 2.1. Generative Error Correction with LLMs. Ma et al. [5] first demonstrated that generative LLMs such as ChatGPT can effectively correct ASR outputs using zero-shot and few-shot prompting with N-best hypotheses as input. Chen et al. [7] introduced HyPoradise, an open benchmark providing large-scale N-best hypotheses paired with reference tran...
- [3] Methods. 3.1. Datasets. Common Voice 17.0: The Common Voice corpus [9] is a massively-multilingual collection of transcribed speech intended for speech technology research and development. The text in Common Voice primarily originates from publicly available sources, particularly Wikipedia articles, and is supplemented by community-submitted sentences...
- [4] Results and Discussion. Common Voice Test Dataset: Table 2 presents the WER results for the Common Voice test dataset. We observe that, with the exception of the original (non-fine-tuned) Qwen3 model, all LLMs improve over the baseline XLS-R system. Even after LoRA fine-tuning and providing examples, Qwen3-FT yields only a marginal improvement (13.4%), in...
- [5] Conclusion. This work investigated the effectiveness of LLM-based generative error correction for low-resource ASR on both public and non-public Frisian datasets. We demonstrated that GPT models can substantially improve ASR performance beyond the traditional trigram model and even surpass the five-best oracle in certain settings (RQ1). The consistent ...
- [6] Generative AI Use Disclosure. In preparing this manuscript, we used GPT-5.1 to improve the quality of the writing, translate written text into English, and help with writing code and debugging. It was not used for writing any major part of the paper. The final content was fully reviewed by all the authors.
- [7] A. Babu, C. Wang, A. Tjandra, K. Lakhotia, Q. Xu, N. Goyal, K. Singh, P. von Platen, Y. Saraf, J. Pino, A. Baevski, A. Conneau, and M. Auli, “XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale,” in Interspeech 2022, 2022, pp. 2278–2282.
- [8] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in International conference on machine learning. PMLR, 2023, pp. 28492–28518.
- [9] M. Bartelds, N. San, B. McDonnell, D. Jurafsky, and M. Wieling, “Making more of little data: Improving low-resource automatic speech recognition using data augmentation,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 715–729.
- [10] Z. X. Yong, V. Pratap, M. Auli, and J. Maillard, “Effects of Speaker Count, Duration, and Accent Diversity on Zero-Shot Accent Robustness in Low-Resource ASR,” in Interspeech 2025, 2025, pp. 1148–1152.
- [11] R. Ma, M. Qian, P. Manakul, M. Gales, and K. Knill, “Can generative large language models perform ASR error correction?” arXiv preprint arXiv:2307.04172, 2023.
- [12] C.-H. H. Yang, Y. Gu, Y.-C. Liu, S. Ghosh, I. Bulyko, and A. Stolcke, “Generative speech recognition error correction with large language models and task-activating prompting,” in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2023, pp. 1–8.
- [13] C. Chen, Y. Hu, C.-H. H. Yang, S. M. Siniscalchi, P.-Y. Chen, and E.-S. Chng, “Hyporadise: An open baseline for generative speech recognition with large language models,” Advances in Neural Information Processing Systems, vol. 36, pp. 31665–31688, 2023.
- [14] N. Yamashita, M. Yamamoto, H. Kokubo, and Y. Kawaguchi, “LLM-based Generative Error Correction for Rare Words with Synthetic Data and Phonetic Context,” in Interspeech 2025, 2025, pp. 3653–3657.
- [15] R. Ardila, M. Branson, K. Davis, M. Kohler, J. Meyer, M. Henretty, R. Morais, L. Saunders, F. Tyers, and G. Weber, “Common voice: A massively-multilingual speech corpus,” in Proceedings of the twelfth language resources and evaluation conference, 2020, pp. 4218–4222.
- [16] M. Naderi, E. Hermann, A. Nanchen, S. Hovsepyan, and M. Magimai.-Doss, “Towards interfacing large language models with ASR systems using confidence measures and prompting,” in Interspeech 2024, 2024, pp. 2980–2984.
- [17] S. Li, C. Chen, C. Y. Kwok, C. Chu, E. S. Chng, and H. Kawai, “Investigating ASR Error Correction with Large Language Model and Multilingual 1-best Hypotheses,” in Interspeech 2024, 2024, pp. 1315–1319.
- [18] Z. Yang, Z. Wan, S. Li, C.-H. H. Yang, and C. Chu, “CoVoGER: A multilingual multitask benchmark for speech-to-text generative error correction with large language models,” in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025, pp. 6313–6325.
- [19] A. Xu, T. Feng, S. H. Kim, S. Bishop, C. Lord, and S. Narayanan, “Large Language Models based ASR Error Correction for Child Conversations,” in Interspeech 2025, 2025, pp. 2840–2844.
- [20] O. Sainz, J. Campos, I. García-Ferrero, J. Etxaniz, O. L. de Lacalle, and E. Agirre, “NLP evaluation in trouble: On the need to measure LLM data contamination for each benchmark,” in Findings of the Association for Computational Linguistics: EMNLP 2023, H. Bouamor, J. Pino, and K. Bali, Eds. Singapore: Association for Computational Linguistics, Dec. 2023...
- [21] Y. Li, Y. Guo, F. Guerin, and C. Lin, “An open-source data contamination report for large language models,” in Findings of the Association for Computational Linguistics: EMNLP 2024, 2024, pp. 528–541.
- [22] C. Deng, Y. Zhao, X. Tang, M. Gerstein, and A. Cohan, “Investigating data contamination in modern benchmarks for large language models,” in Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2024, pp. 8706–8719.
- [23] S. Chen, Y. Chen, Z. Li, Y. Jiang, Z. Wan, Y. He, D. Ran, T. Gu, H. Li, T. Xie et al., “Benchmarking large language models under data contamination: A survey from static to dynamic evaluation,” in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025, pp. 10091–10109.
- [24] R. Ma, M. Qian, M. Gales, and K. Knill, “ASR error correction using large language models,” IEEE Transactions on Audio, Speech and Language Processing, 2025.
- [25] S. Balloccu, P. Schmidtová, M. Lango, and O. Dušek, “Leak, cheat, repeat: Data contamination and evaluation malpractices in closed-source LLMs,” in Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 67–93.
- [26] C. Xu, S. Guan, D. Greene, M. Kechadi et al., “Benchmark data contamination of large language models: A survey,” arXiv preprint arXiv:2406.04244, 2024.
- [27] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in neural information processing systems, vol. 33, pp. 12449–12460, 2020.