Analyzing Error Propagation in Korean Spoken QA with ASR-LLM Cascades
Pith reviewed 2026-05-20 12:58 UTC · model grok-4.3
The pith
ASR errors cause consistent relative degradation in Korean spoken QA across LLMs of varying strength.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In Korean spoken question answering with ASR-LLM cascades, the relative downstream degradation caused by ASR errors is consistent across LLMs that have different absolute performance levels. This indicates that overall cascade degradation largely tracks the information loss that occurs at the ASR stage. Single-character Korean ASR errors create a distinct semantic-failure channel in which the gold answer becomes entirely absent from the downstream prediction despite only a minimal difference in the transcription. An auxiliary comparison further shows that a large audio language model outperforms an ASR-LLM pipeline using a matched language backbone when handling noisy Korean spoken questions.
What carries the argument
Consistency of relative downstream degradation as a signal that cascade performance tracks ASR-stage information loss, together with single-character semantic-failure channels in Korean transcriptions.
If this is right
- Overall cascade performance for Korean spoken QA is limited primarily by ASR accuracy rather than by the choice of downstream LLM.
- Minimal single-character transcription errors can eliminate the correct answer from the final output even when the rest of the question remains intact.
- Direct audio input models can reduce transcript-induced semantic losses compared with ASR-LLM pipelines in noisy conditions.
- Efforts to improve Korean spoken QA should target preservation of answer-critical characters during recognition.
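The single-character failure pattern described above can be checked mechanically. The following is a minimal sketch, assuming that "single-character error" means a character-level edit distance of exactly 1 between the reference question and the ASR transcript, and that "answer absent" means the gold answer string does not occur in the prediction. Both definitions, the function names, and the example strings are illustrative assumptions, not the paper's stated procedure.

```python
import unicodedata


def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance.

    Inputs are NFC-normalized so that each precomposed Hangul syllable
    block counts as a single character.
    """
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def is_single_char_semantic_failure(reference: str, hypothesis: str,
                                    gold_answer: str, prediction: str) -> bool:
    """True when the transcript is one character off AND the gold answer is missing."""
    one_char_off = edit_distance(reference, hypothesis) == 1
    answer_missing = gold_answer not in prediction
    return one_char_off and answer_missing


# Invented example: one syllable block differs between reference and transcript.
flag = is_single_char_semantic_failure(
    reference="세종대왕은 누구인가",
    hypothesis="세종대앙은 누구인가",
    gold_answer="조선",
    prediction="알 수 없습니다",
)
print(flag)
```

A substring test is a deliberately strict notion of "absent"; a token-overlap F1 of zero would be a looser alternative.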
Where Pith is reading between the lines
- System builders may achieve larger gains by investing in ASR improvements than by swapping in larger language models when speech input is noisy.
- The single-character failure pattern may appear in other character-based or syllabic languages and could be checked with similar controlled error injections.
- ASR systems for QA tasks might benefit from semantic-aware error correction that protects key answer tokens even when overall word error rate stays low.
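One way the controlled error injections mentioned above could be set up for Korean is to alter a single jamo inside one syllable block. The sketch below relies on the standard Unicode arithmetic for precomposed Hangul syllables (base U+AC00, 21 medial vowels, 28 final-consonant slots including "none"). The helper names and the choice to perturb only the final consonant are assumptions made for illustration; the paper's own error source is real ASR output, not synthetic edits.

```python
from typing import Tuple

HANGUL_BASE = 0xAC00      # code point of the first precomposed syllable
NUM_MEDIALS = 21          # medial vowels per initial consonant
NUM_FINALS = 28           # final-consonant slots, index 0 means "no final"
NUM_SYLLABLES = 19 * NUM_MEDIALS * NUM_FINALS  # 11172 precomposed syllables


def decompose(syllable: str) -> Tuple[int, int, int]:
    """Split one precomposed Hangul syllable into (initial, medial, final) indices."""
    offset = ord(syllable) - HANGUL_BASE
    if not 0 <= offset < NUM_SYLLABLES:
        raise ValueError(f"{syllable!r} is not a precomposed Hangul syllable")
    initial, rest = divmod(offset, NUM_MEDIALS * NUM_FINALS)
    medial, final = divmod(rest, NUM_FINALS)
    return initial, medial, final


def compose(initial: int, medial: int, final: int) -> str:
    """Inverse of decompose: rebuild the syllable from jamo indices."""
    return chr(HANGUL_BASE + (initial * NUM_MEDIALS + medial) * NUM_FINALS + final)


def inject_final_consonant_error(text: str, position: int, new_final: int) -> str:
    """Replace the final consonant of the syllable at `position`, leaving the rest intact."""
    initial, medial, _ = decompose(text[position])
    return text[:position] + compose(initial, medial, new_final) + text[position + 1:]


# Dropping the final consonant of the first syllable changes exactly one character.
print(inject_final_consonant_error("한국", 0, 0))
```

Perturbing one jamo at a time keeps the character-level edit distance at 1, so any downstream change can be attributed to that single edit.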
Load-bearing premise
The observed consistency in relative degradation across LLMs is caused by tracking of ASR-stage information loss rather than by LLM-specific robustness or dataset characteristics.
What would settle it
The claim that degradation tracks ASR information loss would be falsified if relative degradation varied across LLMs when the experiments are repeated on a new dataset with controlled ASR error rates, or on LLMs engineered for matched robustness to noisy text.
Original abstract
We analyze how automatic speech recognition (ASR) errors propagate through ASR-LLM cascades in Korean spoken question answering (SQA), focusing on downstream semantic failures that conventional ASR metrics cannot fully capture. Our analysis shows that the relative downstream degradation caused by ASR errors is consistent across LLMs with different absolute performance, suggesting that cascade degradation largely tracks ASR-stage information loss. We further identify single-character Korean ASR errors as a distinct semantic-failure channel, where the gold answer becomes entirely absent from the downstream prediction despite only a minimal transcription difference. Finally, an auxiliary comparison shows that a large audio language model outperforms an ASR-LLM pipeline with a matched language backbone in noisy Korean SQA, indicating the potential of direct audio input to mitigate transcript-induced information loss.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript analyzes error propagation in ASR-LLM cascades for Korean spoken question answering. It reports that the relative downstream degradation caused by ASR errors is consistent across LLMs with different absolute performance levels, suggesting that cascade degradation largely tracks ASR-stage information loss. It further identifies single-character Korean ASR errors as a distinct semantic-failure channel in which the gold answer becomes entirely absent from the downstream prediction despite only minimal transcription differences. An auxiliary comparison indicates that a large audio language model outperforms an ASR-LLM pipeline with a matched language backbone in noisy Korean SQA.
Significance. If the consistency of relative degradation is shown to track ASR information loss after controlling for LLM-specific robustness, the work supplies useful empirical evidence that improvements at the ASR stage can yield predictable gains in Korean SQA cascades. The identification of single-character errors supplies a concrete, language-specific failure mode not captured by conventional ASR metrics. The audio-LM comparison provides a direct, falsifiable indication that end-to-end audio modeling can mitigate transcript-induced losses. These contributions rest on empirical measurements and cross-model comparisons rather than parameter fitting or derivations.
major comments (2)
- [§4] Relative-degradation analysis: the claim that cascade degradation 'largely tracks ASR-stage information loss' rests on the observed consistency of relative degradation across LLMs. This inference is not yet well supported, because the manuscript provides no ablation or measurement of LLM-specific factors (robustness to Hangul character substitutions, tokenization sensitivity, or pre-training overlap with ASR error patterns) that could produce the same consistency without degradation directly tracking ASR information loss.
- [§3 and §4] Experimental details: the reported patterns and performance gap lack dataset sizes, error bars, statistical significance tests, and explicit controls for post-hoc model selection. These omissions prevent verification that the central observations on consistency and single-character failures are robust rather than sensitive to unstated choices.
minor comments (2)
- [§4] Define 'relative downstream degradation' explicitly (e.g., as a normalized difference in exact-match or F1) and state how it is aggregated across LLMs of differing absolute performance.
- [Figures/Tables in §4] Add confidence intervals or significance markers to any plots or tables that display degradation patterns or single-character error rates.
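To make the first minor comment concrete, here is one possible definition of relative downstream degradation: the drop in score from clean to ASR transcripts, divided by the clean score. This is a sketch of a candidate definition only; the review does not state which normalization the paper uses. The model names and scores below are invented for illustration.

```python
from typing import Dict, Tuple


def relative_degradation(clean_score: float, noisy_score: float) -> float:
    """Fraction of the clean-transcript score that is lost on ASR transcripts."""
    if clean_score <= 0:
        raise ValueError("clean_score must be positive")
    return (clean_score - noisy_score) / clean_score


def degradation_spread(scores: Dict[str, Tuple[float, float]]) -> float:
    """Max minus min relative degradation across models; small values mean 'consistent'."""
    values = [relative_degradation(clean, noisy) for clean, noisy in scores.values()]
    return max(values) - min(values)


# Hypothetical (clean EM, noisy EM) pairs; NOT results from the paper.
example_scores = {
    "model_a": (80.0, 60.0),  # strong model
    "model_b": (60.0, 45.0),  # weaker model
}

for name, (clean, noisy) in example_scores.items():
    print(name, relative_degradation(clean, noisy))
print("spread:", degradation_spread(example_scores))
```

In this invented case the two models differ by 20 points in absolute score yet lose the same fraction of it, which is the kind of pattern the consistency claim describes.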
Simulated Author's Rebuttal
We thank the referee for their detailed review and constructive comments on our manuscript analyzing error propagation in Korean spoken QA with ASR-LLM cascades. We address each major comment below, providing clarifications and indicating the revisions we will make to strengthen the paper.
- Referee: [§4] Relative-degradation analysis: the claim that cascade degradation 'largely tracks ASR-stage information loss' rests on the observed consistency of relative degradation across LLMs. This inference is not yet well supported, because the manuscript provides no ablation or measurement of LLM-specific factors (robustness to Hangul character substitutions, tokenization sensitivity, or pre-training overlap with ASR error patterns) that could produce the same consistency without degradation directly tracking ASR information loss.
  Authors: We agree that additional controls for LLM-specific factors would further strengthen the inference. The consistency we observe across diverse LLMs (including those with different tokenizers and pre-training corpora) provides suggestive evidence that ASR information loss is the dominant factor, as model-specific effects would likely lead to more variable relative degradations. In the revision, we will expand the discussion in §4 to explicitly address potential LLM-specific confounds and include a qualitative analysis of how tokenization and Hangul handling might interact with ASR errors. If feasible with available resources, we will add a small-scale ablation using a controlled set of synthetic errors.
  Revision: partial
- Referee: [§3 and §4] Experimental details: the reported patterns and performance gap lack dataset sizes, error bars, statistical significance tests, and explicit controls for post-hoc model selection. These omissions prevent verification that the central observations on consistency and single-character failures are robust rather than sensitive to unstated choices.
  Authors: We acknowledge these omissions in the current manuscript. In the revised version, we will report the exact sizes of the datasets and subsets used for each experiment, include error bars computed via bootstrapping or multiple random seeds where applicable, conduct and report statistical significance tests (e.g., paired t-tests or Wilcoxon tests) for key comparisons, and clarify the model selection process to rule out post-hoc biases. These additions will be integrated into §3 and §4 to enhance the reproducibility and robustness of our findings.
  Revision: yes
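The bootstrapped error bars promised in the rebuttal could take the following form. This is a minimal sketch of a paired percentile bootstrap over per-question exact-match scores, assuming each question is scored once on the clean transcript and once on the ASR transcript. The function name, resample count, and toy data are assumptions; the authors do not specify their procedure.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_ci(clean: Sequence[float], noisy: Sequence[float],
                        num_resamples: int = 2000, alpha: float = 0.05,
                        seed: int = 0) -> Tuple[float, float, float]:
    """Return (mean drop, lower bound, upper bound) for the clean-minus-noisy score.

    Questions are resampled with replacement, keeping each clean/noisy pair
    together so that per-question difficulty cancels out.
    """
    if len(clean) != len(noisy) or len(clean) == 0:
        raise ValueError("clean and noisy must be non-empty and the same length")
    rng = random.Random(seed)
    diffs = [c - n for c, n in zip(clean, noisy)]
    size = len(diffs)
    means = []
    for _ in range(num_resamples):
        sample = [diffs[rng.randrange(size)] for _ in range(size)]
        means.append(sum(sample) / size)
    means.sort()
    lower = means[int((alpha / 2) * num_resamples)]
    upper = means[min(num_resamples - 1, int((1 - alpha / 2) * num_resamples))]
    return sum(diffs) / size, lower, upper


# Toy per-question EM scores (1 = exact match); NOT data from the paper.
clean_em = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
noisy_em = [1, 0, 1, 0, 0, 1, 0, 1, 0, 1]
print(paired_bootstrap_ci(clean_em, noisy_em))
```

If the resulting interval excludes zero, the drop is unlikely to be a resampling artifact; a Wilcoxon signed-rank test on the same paired differences would be a complementary check.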
Circularity Check
No circularity in empirical error analysis
The paper reports direct empirical measurements of ASR error propagation through ASR-LLM cascades on Korean SQA tasks, including relative degradation consistency across LLMs and identification of single-character error channels. No mathematical derivations, equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided abstract or described analysis chain. All claims rest on experimental comparisons and observations rather than any reduction to prior inputs by construction, making the work self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "relative downstream degradation caused by ASR errors is consistent across LLMs... single-character Korean ASR errors as a distinct semantic-failure channel"
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We use Whisper-large-v3 as the ASR system... downstream QA performance is evaluated using exact match (EM) and F1 score"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.