
arxiv: 2605.19711 · v1 · pith:GUIZUHKUnew · submitted 2026-05-19 · 💻 cs.CL

Can Large Language Models Reliably Correct Errors in Low-Resource ASR? A Contamination-Aware Case Study on West Frisian

Pith reviewed 2026-05-20 05:28 UTC · model grok-4.3

classification 💻 cs.CL
keywords Large Language Models · Generative Error Correction · Automatic Speech Recognition · Low-Resource Languages · West Frisian · Data Contamination · Error Analysis

The pith

Large language models can correct errors in low-resource West Frisian ASR, even on texts they are unlikely to have seen during training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether generative error correction with LLMs can improve automatic speech recognition outputs for the low-resource language West Frisian. The authors evaluate several models on a standard public corpus and introduce a new offline dataset of non-public texts to check whether gains come from genuine correction or from data the models already saw during training. Results show improvements in most settings, and the strongest GPT-5.1 outputs even beat oracle word error rates (WERs). Comparable gains on the offline set indicate the models are fixing errors rather than recalling memorized sentences. The work also includes an error analysis that maps out the kinds of mistakes the models tend to repair.

Core claim

Generative error correction using LLMs improves ASR performance for West Frisian in most tested configurations, with the best GPT-5.1 results exceeding oracle WERs; comparable gains appear on a newly constructed offline dataset of non-public texts, supporting the conclusion that observed improvements reflect true correction ability rather than contamination.
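To make the oracle comparison concrete, here is a minimal sketch of corpus-level WER and an N-best oracle WER, assuming the oracle is the usual "pick the best hypothesis in each N-best list" bound (the paper refers to a five-best oracle). The function names and the toy placeholder strings are assumptions for illustration; they are not the paper's code or data.

```python
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance between a reference and a hypothesis."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(refs: List[str], hyps: List[str]) -> float:
    """Corpus-level WER: total word edits divided by total reference words."""
    edits = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    words = sum(len(r.split()) for r in refs)
    return edits / words


def oracle_wer(refs: List[str], nbest_lists: List[List[str]]) -> float:
    """WER obtained if the closest hypothesis in each N-best list were always selected."""
    edits = sum(min(edit_distance(r.split(), h.split()) for h in nbest)
                for r, nbest in zip(refs, nbest_lists))
    words = sum(len(r.split()) for r in refs)
    return edits / words


# Toy placeholder tokens (not Frisian data): every hypothesis has one wrong word.
refs = ["a b c d"]
nbest_lists = [["a x c d", "a b c y"]]
generated = ["a b c d"]  # a generated correction that combines both hypotheses

print(oracle_wer(refs, nbest_lists))  # 0.25
print(wer(refs, generated))           # 0.0
```

The toy case shows why a generation-based corrector can fall below the oracle: selection is limited to sentences already in the N-best list, while generation can assemble a sentence that appears in none of them.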

What carries the argument

Generative error correction (GER) applied to ASR hypotheses for Frisian, using public and offline evaluation sets to separate correction skill from training-data overlap.

If this is right

  • GER can be added as a post-processing step to raise accuracy for other low-resource ASR systems without retraining the recognizer.
  • Offline or private evaluation sets become necessary to validate LLM-based correction claims in any language with limited public data.
  • Error-pattern analysis can inform prompt design or model choice to target the specific mistake types that LLMs handle well.
  • Surpassing oracle WER suggests hybrid ASR-plus-LLM pipelines may reach lower error rates than either component alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The contamination-control method used here could be adopted as standard practice when testing LLMs on any low-resource speech or text task.
  • Smaller open models might achieve similar correction gains if the same offline evaluation protocol is applied.
  • Success on Frisian raises the possibility that LLM correction works for other under-resourced languages whose orthography or phonology differs from high-resource training data.
  • If the pattern holds, future ASR systems for minority languages could rely on lightweight recognizers followed by LLM correction rather than massive end-to-end training.

Load-bearing premise

The offline dataset of non-public texts shares no overlap with any LLM training corpus, so measured gains must come from correction rather than recall.

What would settle it

Finding any of the offline dataset sentences inside an LLM's training data, or observing that performance gains vanish on a second, independently verified unseen corpus, would falsify the claim of genuine correction.
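One simple way to look for such overlap is an n-gram check of the evaluation sentences against whatever reference text is available. The sketch below is only a proxy under stated assumptions: the training data of closed models cannot be inspected directly, so `corpus_sentences` stands in for a public crawl or similar collection, and the function names and the choice of n = 5 are illustrative rather than taken from the paper.

```python
from typing import Iterable, List, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Set of lowercased word n-grams in a sentence (empty if the sentence is too short)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_rate(eval_sentences: List[str],
                 corpus_sentences: Iterable[str],
                 n: int = 5) -> float:
    """Fraction of evaluation sentences that share at least one n-gram with the corpus."""
    corpus_grams: Set[Tuple[str, ...]] = set()
    for sentence in corpus_sentences:
        corpus_grams |= ngrams(sentence, n)
    flagged = sum(1 for s in eval_sentences if ngrams(s, n) & corpus_grams)
    return flagged / len(eval_sentences)


# Toy placeholder sentences, not drawn from either Frisian dataset.
eval_sentences = ["the quick brown fox jumps",
                  "an entirely different sentence here now"]
corpus_sentences = ["see the quick brown fox jumps over"]

print(overlap_rate(eval_sentences, corpus_sentences, n=5))  # 0.5
```

A low rate against public text would be consistent with the premise but would not prove it, since a model's training mixture may include sources that are not in the reference corpus.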

Figures

Figures reproduced from arXiv: 2605.19711 by Martijn Wieling, Reihaneh Amooie, Rik van Noord, Wietse de Vries, Yun Hao.

Figure 1: Pipeline of the LLM-based generative error correction system. Note that we use N = 5 in this paper.
Figure 2: The prompts used for our generative error correction system.
Figure 3: Sentence-level improvement rates of trigram, Qwen3-FT and GPT-5.1 (generation vs. selection) on both Frisian ASR datasets.
Original abstract

Automatic speech recognition (ASR) has improved substantially in recent years, yet performance remains limited for low-resource languages. Large language models (LLMs) have shown promise for improving ASR through generative error correction (GER), but their effectiveness in low-resource settings remains underexplored. In addition, it remains unclear to what extent data contamination influences the reported improvements in LLM-based GER. This study investigates LLM-based GER for low-resource Frisian. In addition to a public corpus, we construct and use a Frisian offline dataset with non-public texts for evaluation to control for potential data contamination. Results show that GER improves ASR performance in most settings, with the best GPT-5.1 results surpassing oracle WERs. Comparable gains on the offline dataset indicate that improvements reflect true correction ability. We further provide a detailed error analysis revealing model correction patterns.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper evaluates generative error correction (GER) using large language models to improve ASR for low-resource West Frisian. It compares results on a public corpus against a constructed offline dataset using non-public texts, reporting that GER yields consistent gains in most settings, with top GPT-5.1 outputs surpassing oracle WERs, and that comparable gains on the offline data indicate genuine correction rather than contamination effects. A detailed error analysis of correction patterns is included.

Significance. If the contamination control holds, the work offers a useful empirical demonstration that LLM-based GER can aid low-resource ASR without relying on memorization, addressing a practical concern for languages with limited public data. The offline dataset construction is a methodological strength that could inform future evaluations in similar settings.

major comments (1)
  1. [Offline dataset construction and evaluation] The section on the offline dataset and contamination control: the central claim that 'comparable gains on the offline dataset indicate that improvements reflect true correction ability' depends on the non-public Frisian texts having zero or negligible overlap with LLM pretraining corpora. No n-gram overlap statistics, deduplication reports, or membership-inference results against Common Crawl or known LLM training mixtures are supplied, leaving the isolation of genuine correction from possible partial contamination unverified.
minor comments (2)
  1. [Results and discussion] Results reporting lacks exact WER numbers, confidence intervals, and statistical significance tests for the claimed improvements over baselines and oracle; these should be added to tables and text for verifiability.
  2. [Abstract] The abstract states 'best GPT-5.1 results surpassing oracle WERs' but does not clarify whether this holds after accounting for variance or on which specific test sets; add precise qualifiers.
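The confidence intervals requested in minor comment 1 could be obtained with a paired bootstrap over sentences. The sketch below is one possible approach, not a procedure described in the paper: it assumes per-sentence edit counts for the baseline and the corrected output are already available, and every name and number in it is hypothetical.

```python
import random
from typing import List, Tuple


def bootstrap_wer_diff(base_edits: List[int],
                       corr_edits: List[int],
                       ref_lens: List[int],
                       n_resamples: int = 1000,
                       alpha: float = 0.05,
                       seed: int = 0) -> Tuple[float, float]:
    """Percentile confidence interval for (baseline WER - corrected WER).

    Sentences are resampled with replacement, and the same resample is used
    for both systems so that the comparison stays paired.
    """
    rng = random.Random(seed)
    n = len(ref_lens)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        words = sum(ref_lens[i] for i in idx)
        base = sum(base_edits[i] for i in idx) / words
        corr = sum(corr_edits[i] for i in idx) / words
        diffs.append(base - corr)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical per-sentence word edit counts and reference lengths.
base_edits = [2, 1, 3, 2]
corr_edits = [1, 0, 2, 1]
ref_lens = [10, 8, 12, 10]

lower, upper = bootstrap_wer_diff(base_edits, corr_edits, ref_lens)
print(lower, upper)
```

If the interval excludes zero, the improvement is unlikely to be a resampling artifact at the chosen level. The same routine applied to the corrected output versus the oracle would address the qualifier requested in minor comment 2.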

Simulated Author's Rebuttal

1 response · 0 unresolved

Thank you for reviewing our manuscript and for highlighting the importance of rigorously verifying the contamination control in our offline dataset. We provide a detailed response to the major comment below and have updated the manuscript to address the concerns raised.

Point-by-point responses
  1. Referee: [Offline dataset construction and evaluation] The section on the offline dataset and contamination control: the central claim that 'comparable gains on the offline dataset indicate that improvements reflect true correction ability' depends on the non-public Frisian texts having zero or negligible overlap with LLM pretraining corpora. No n-gram overlap statistics, deduplication reports, or membership-inference results against Common Crawl or known LLM training mixtures are supplied, leaving the isolation of genuine correction from possible partial contamination unverified.

    Authors: We agree with the referee that additional verification such as n-gram overlap statistics would strengthen our claims. The offline dataset was constructed using non-public texts that have not been released online or in any public repository, making their presence in LLM pretraining corpora (sourced primarily from public web data) extremely unlikely. We have revised the manuscript to provide a more thorough description of how the offline dataset was assembled and to include an explicit discussion of the contamination control assumptions in a new limitations paragraph. revision: yes

Circularity Check

0 steps flagged

Empirical evaluation with no circular derivation

full rationale

This is an empirical study comparing ASR error correction performance using LLMs on a public Frisian corpus versus a constructed offline dataset with non-public texts. The central claim that comparable gains on the offline set demonstrate genuine correction (rather than contamination) rests on direct experimental measurements of WER improvements, not on any mathematical derivation, fitted parameter renamed as prediction, or self-citation chain that reduces the result to its own inputs by construction. No equations or ansatzes are invoked that would create self-definitional loops, and the offline control is presented as an external benchmark rather than a tautological redefinition of the outcome.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the assumption that the offline dataset avoids LLM training data overlap; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption Offline dataset texts have no overlap with any LLM training data.
    This premise is required to interpret offline gains as evidence of true correction ability rather than contamination.




Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · 2 internal anchors

  1. [1]

    Introduction. In recent years, automatic speech recognition (ASR) has seen remarkable progress, with substantial gains in recognition accuracy and robustness. Multilingual self-supervised and weakly supervised speech models, including XLS-R [1] and Whisper [2], have played a central role in improving ASR performance, particularly for low-resource lan...

  2. [2]

    Can Large Language Models Reliably Correct Errors in Low-Resource ASR? A Contamination-Aware Case Study on West Frisian

    Related Work 2.1. Generative Error Correction with LLMs Ma et al. [5] first demonstrated that generative LLMs such as ChatGPT can effectively correct ASR outputs using zero-shot and few-shot prompting with N-best hypotheses as input. Chen et al. [7] introduced HyPoradise, an open benchmark providing large-scale N-best hypotheses paired with reference tran...

  3. [3]

    Datasets Common Voice 17.0. The Common Voice corpus [9] is a massively-multilingual collection of transcribed speech intended for speech technology research and development

    Methods 3.1. Datasets Common Voice 17.0. The Common Voice corpus [9] is a massively-multilingual collection of transcribed speech intended for speech technology research and development. The text in Common Voice primarily originates from publicly available sources, particularly Wikipedia articles, and is supplemented by community-submitted sentences...

  4. [4]

    We observe that, with the exception of the original (non-fine-tuned) Qwen3 model, all LLMs improve over the baseline XLS-R system

    Results and Discussion Common Voice Test Dataset. Table 2 presents the WER results for the Common Voice test dataset. We observe that, with the exception of the original (non-fine-tuned) Qwen3 model, all LLMs improve over the baseline XLS-R system. Even after LoRA fine-tuning and providing examples, Qwen3-FT yields only a marginal improvement (13.4%), in...

  5. [5]

    We demonstrated that GPT models can substantially improve ASR performance beyond the traditional trigram model and even surpass the five-best oracle in certain settings (RQ1)

    Conclusion. This work investigated the effectiveness of LLM-based generative error correction for low-resource ASR on both public and non-public Frisian datasets. We demonstrated that GPT models can substantially improve ASR performance beyond the traditional trigram model and even surpass the five-best oracle in certain settings (RQ1). The consistent ...

  6. [6]

    It was not used for writing any major part of the paper

    Generative AI Use Disclosure. In preparing this manuscript, we used GPT-5.1 to improve the quality of the writing, translate written text into English, and help with writing code and debugging. It was not used for writing any major part of the paper. The final content was fully reviewed by all the authors

  7. [7]

    XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale,

    A. Babu, C. Wang, A. Tjandra, K. Lakhotia, Q. Xu, N. Goyal, K. Singh, P. von Platen, Y. Saraf, J. Pino, A. Baevski, A. Conneau, and M. Auli, “XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale,” in Interspeech 2022, 2022, pp. 2278–2282

  8. [8]

    Robust speech recognition via large-scale weak supervision,

    A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in International conference on machine learning. PMLR, 2023, pp. 28492–28518

  9. [9]

    Making more of little data: Improving low-resource automatic speech recognition using data augmentation,

    M. Bartelds, N. San, B. McDonnell, D. Jurafsky, and M. Wieling, “Making more of little data: Improving low-resource automatic speech recognition using data augmentation,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 715–729

  10. [10]

    Effects of Speaker Count, Duration, and Accent Diversity on Zero-Shot Accent Robustness in Low-Resource ASR,

    Z. X. Yong, V. Pratap, M. Auli, and J. Maillard, “Effects of Speaker Count, Duration, and Accent Diversity on Zero-Shot Accent Robustness in Low-Resource ASR,” in Interspeech 2025, 2025, pp. 1148–1152

  11. [11]

    Can generative large language models perform ASR error correction?

    R. Ma, M. Qian, P. Manakul, M. Gales, and K. Knill, “Can generative large language models perform ASR error correction?” arXiv preprint arXiv:2307.04172, 2023

  12. [12]

    Generative speech recognition error correction with large language models and task-activating prompting,

    C.-H. H. Yang, Y. Gu, Y.-C. Liu, S. Ghosh, I. Bulyko, and A. Stolcke, “Generative speech recognition error correction with large language models and task-activating prompting,” in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2023, pp. 1–8

  13. [13]

    Hyporadise: An open baseline for generative speech recognition with large language models,

    C. Chen, Y. Hu, C.-H. H. Yang, S. M. Siniscalchi, P.-Y. Chen, and E.-S. Chng, “Hyporadise: An open baseline for generative speech recognition with large language models,” Advances in Neural Information Processing Systems, vol. 36, pp. 31665–31688, 2023

  14. [14]

    LLM-based Generative Error Correction for Rare Words with Synthetic Data and Phonetic Context,

    N. Yamashita, M. Yamamoto, H. Kokubo, and Y. Kawaguchi, “LLM-based Generative Error Correction for Rare Words with Synthetic Data and Phonetic Context,” in Interspeech 2025, 2025, pp. 3653–3657

  15. [15]

    Common voice: A massively-multilingual speech corpus,

    R. Ardila, M. Branson, K. Davis, M. Kohler, J. Meyer, M. Henretty, R. Morais, L. Saunders, F. Tyers, and G. Weber, “Common voice: A massively-multilingual speech corpus,” in Proceedings of the twelfth language resources and evaluation conference, 2020, pp. 4218–4222

  16. [16]

    Towards interfacing large language models with ASR systems using confidence measures and prompting,

    M. Naderi, E. Hermann, A. Nanchen, S. Hovsepyan, and M. Magimai.-Doss, “Towards interfacing large language models with ASR systems using confidence measures and prompting,” in Interspeech 2024, 2024, pp. 2980–2984

  17. [17]

    Investigating ASR Error Correction with Large Language Model and Multilingual 1-best Hypotheses,

    S. Li, C. Chen, C. Y. Kwok, C. Chu, E. S. Chng, and H. Kawai, “Investigating ASR Error Correction with Large Language Model and Multilingual 1-best Hypotheses,” in Interspeech 2024, 2024, pp. 1315–1319

  18. [18]

    CoVoGER: A multilingual multitask benchmark for speech-to-text generative error correction with large language models,

    Z. Yang, Z. Wan, S. Li, C.-H. H. Yang, and C. Chu, “CoVoGER: A multilingual multitask benchmark for speech-to-text generative error correction with large language models,” in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025, pp. 6313–6325

  19. [19]

    Large Language Models based ASR Error Correction for Child Conversations,

    A. Xu, T. Feng, S. H. Kim, S. Bishop, C. Lord, and S. Narayanan, “Large Language Models based ASR Error Correction for Child Conversations,” in Interspeech 2025, 2025, pp. 2840–2844

  20. [20]

    NLP evaluation in trouble: On the need to measure LLM data contamination for each benchmark,

    O. Sainz, J. Campos, I. García-Ferrero, J. Etxaniz, O. L. de Lacalle, and E. Agirre, “NLP evaluation in trouble: On the need to measure LLM data contamination for each benchmark,” in Findings of the Association for Computational Linguistics: EMNLP 2023, H. Bouamor, J. Pino, and K. Bali, Eds. Singapore: Association for Computational Linguistics, Dec. 2023...

  21. [21]

    An open-source data contamination report for large language models,

    Y. Li, Y. Guo, F. Guerin, and C. Lin, “An open-source data contamination report for large language models,” in Findings of the Association for Computational Linguistics: EMNLP 2024, 2024, pp. 528–541

  22. [22]

    Investigating data contamination in modern benchmarks for large language models,

    C. Deng, Y. Zhao, X. Tang, M. Gerstein, and A. Cohan, “Investigating data contamination in modern benchmarks for large language models,” in Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2024, pp. 8706–8719

  23. [23]

    Benchmarking large language models under data contamination: A survey from static to dynamic evaluation,

    S. Chen, Y. Chen, Z. Li, Y. Jiang, Z. Wan, Y. He, D. Ran, T. Gu, H. Li, T. Xie et al., “Benchmarking large language models under data contamination: A survey from static to dynamic evaluation,” in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025, pp. 10091–10109

  24. [24]

    ASR error correction using large language models,

    R. Ma, M. Qian, M. Gales, and K. Knill, “ASR error correction using large language models,” IEEE Transactions on Audio, Speech and Language Processing, 2025

  25. [25]

    Leak, cheat, repeat: Data contamination and evaluation malpractices in closed-source LLMs,

    S. Balloccu, P. Schmidtová, M. Lango, and O. Dušek, “Leak, cheat, repeat: Data contamination and evaluation malpractices in closed-source LLMs,” in Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 67–93

  26. [26]

    Benchmark Data Contamination of Large Language Models: A Survey

    C. Xu, S. Guan, D. Greene, M. Kechadi et al., “Benchmark data contamination of large language models: A survey,” arXiv preprint arXiv:2406.04244, 2024

  27. [27]

    wav2vec 2.0: A framework for self-supervised learning of speech representations,

    A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in neural information processing systems, vol. 33, pp. 12449–12460, 2020