Analyzing Error Propagation in Korean Spoken QA with ASR-LLM Cascades

Donghyuk Jung; Youngwon Choi

arXiv: 2605.17443 · v1 · pith:5FQAMGOJnew · submitted 2026-05-17 · 💻 cs.CL · cs.SD · eess.AS

Pith reviewed 2026-05-20 12:58 UTC · model grok-4.3

classification 💻 cs.CL · cs.SD · eess.AS

keywords ASR error propagation · Korean spoken QA · ASR-LLM cascades · semantic failure · single-character errors · downstream degradation · spoken question answering

The pith

ASR errors cause consistent relative degradation in Korean spoken QA across LLMs of varying strength.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how errors introduced by automatic speech recognition propagate through cascades into large language models when answering spoken questions in Korean. It shows that the proportional performance drop from these errors remains similar even when the LLMs themselves differ substantially in their overall accuracy. This pattern implies that most of the information loss occurs during the initial speech-to-text conversion rather than in the language model's processing of the imperfect text. The analysis also identifies single-character transcription mistakes as a special failure mode in Korean where the correct answer disappears entirely from the model's output. A side comparison indicates that models ingesting audio directly can avoid some of the losses seen in the standard recognition-then-language-model pipeline.

Core claim

In Korean spoken question answering with ASR-LLM cascades, the relative downstream degradation caused by ASR errors is consistent across LLMs that have different absolute performance levels. This indicates that overall cascade degradation largely tracks the information loss that occurs at the ASR stage. Single-character Korean ASR errors create a distinct semantic-failure channel in which the gold answer becomes entirely absent from the downstream prediction despite only a minimal difference in the transcription. An auxiliary comparison further shows that a large audio language model outperforms an ASR-LLM pipeline using a matched language backbone when handling noisy Korean spoken questions.

What carries the argument

Consistency of relative downstream degradation as a signal that cascade performance tracks ASR-stage information loss, together with single-character semantic-failure channels in Korean transcriptions.
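The abstract does not state the exact formula behind "relative downstream degradation". As a minimal sketch, assuming it is the fraction of the clean-transcript score that is lost once ASR output replaces the reference text, the consistency check could look like the following. The function names and the example scores are hypothetical and chosen only for illustration; they are not results from the paper.

```python
from typing import Dict, Tuple


def relative_degradation(clean_score: float, asr_score: float) -> float:
    """Fraction of the clean-transcript score lost when the LLM reads ASR output.

    Assumed definition: (clean - asr) / clean. The paper may normalize differently.
    """
    if clean_score <= 0:
        raise ValueError("clean_score must be positive")
    return (clean_score - asr_score) / clean_score


def degradation_spread(scores: Dict[str, Tuple[float, float]]) -> float:
    """Max minus min relative degradation across models.

    A small spread corresponds to the 'consistent across LLMs' pattern.
    """
    values = [relative_degradation(clean, asr) for clean, asr in scores.values()]
    return max(values) - min(values)


# Hypothetical (clean, ASR) scores for three models of different strength.
example_scores = {
    "model_a": (80.0, 60.0),
    "model_b": (60.0, 45.0),
    "model_c": (40.0, 30.0),
}

for name, (clean, asr) in example_scores.items():
    print(name, relative_degradation(clean, asr))
print("spread:", degradation_spread(example_scores))
```

In this made-up example the absolute scores differ by a factor of two, yet every model loses the same fraction of its clean-input score, which is the shape of evidence the argument relies on.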

If this is right

Overall cascade performance for Korean spoken QA is limited primarily by ASR accuracy rather than by the choice of downstream LLM.
Minimal single-character transcription errors can eliminate the correct answer from the final output even when the rest of the question remains intact.
Direct audio input models can reduce transcript-induced semantic losses compared with ASR-LLM pipelines in noisy conditions.
Efforts to improve Korean spoken QA should target preservation of answer-critical characters during recognition.
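The single-character failure mode described above can be flagged mechanically. The sketch below assumes a case counts as such a failure when the ASR transcript is exactly one syllable-level edit away from the reference question and the gold answer string does not appear in the downstream prediction. That criterion, the function names, and the example strings are assumptions for illustration, not the paper's stated procedure.

```python
import unicodedata


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance over code points.

    Text is NFC-normalized first so each Hangul syllable block is one unit.
    """
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def is_single_char_semantic_failure(reference: str, hypothesis: str,
                                    gold_answer: str, prediction: str) -> bool:
    """True when a one-character transcript difference coincides with a missing answer."""
    one_char_off = edit_distance(reference, hypothesis) == 1
    gold = unicodedata.normalize("NFC", gold_answer)
    pred = unicodedata.normalize("NFC", prediction)
    return one_char_off and gold not in pred


# Made-up example: one syllable changes "length" (길이) into "depth" (깊이).
reference = "한강의 길이는 얼마인가"
hypothesis = "한강의 깊이는 얼마인가"
print(edit_distance(reference, hypothesis))
print(is_single_char_semantic_failure(reference, hypothesis,
                                      gold_answer="약 500km",
                                      prediction="평균 약 2.5m"))
```

A character error rate computed on this pair would be small, which is why a check on the downstream answer is needed in addition to the transcript-level metric.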

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

System builders may achieve larger gains by investing in ASR improvements than by swapping in larger language models when speech input is noisy.
The single-character failure pattern may appear in other character-based or syllabic languages and could be checked with similar controlled error injections.
ASR systems for QA tasks might benefit from semantic-aware error correction that protects key answer tokens even when overall word error rate stays low.
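A controlled error injection of the kind suggested above could be sketched as follows. It relies on the standard Unicode layout of precomposed Hangul syllables, where the code point equals 0xAC00 + (lead × 21 + vowel) × 28 + tail, and swaps only the final consonant of one syllable. The helper names are hypothetical, and real ASR confusions need not follow this pattern; this is one simple way to produce single-character substitutions, not the paper's method.

```python
HANGUL_BASE = 0xAC00   # first precomposed syllable, "가"
HANGUL_LAST = 0xD7A3   # last precomposed syllable, "힣"
NUM_TAILS = 28         # final-consonant slots, index 0 means "no final consonant"


def is_hangul_syllable(ch: str) -> bool:
    """True for a single precomposed Hangul syllable block."""
    return len(ch) == 1 and HANGUL_BASE <= ord(ch) <= HANGUL_LAST


def swap_final_consonant(ch: str, new_tail: int) -> str:
    """Return the syllable with the same lead and vowel but a different final consonant."""
    if not is_hangul_syllable(ch):
        raise ValueError("expected one precomposed Hangul syllable")
    if not 0 <= new_tail < NUM_TAILS:
        raise ValueError("new_tail must be in [0, 27]")
    offset = ord(ch) - HANGUL_BASE
    lead_vowel_block = (offset // NUM_TAILS) * NUM_TAILS
    return chr(HANGUL_BASE + lead_vowel_block + new_tail)


def inject_single_char_error(text: str, position: int, new_tail: int) -> str:
    """Replace exactly one syllable of text, leaving every other character intact."""
    chars = list(text)
    chars[position] = swap_final_consonant(chars[position], new_tail)
    return "".join(chars)


# Tail index 4 is the final consonant "ㄴ", so "강" becomes "간".
print(inject_single_char_error("한강", 1, 4))  # → 한간
```

Feeding original and perturbed questions to the same LLM would then isolate the effect of one character while holding the rest of the input fixed.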

Load-bearing premise

The observed consistency in relative degradation across LLMs is caused by tracking of ASR-stage information loss rather than by LLM-specific robustness or dataset characteristics.

What would settle it

Repeating the experiments on a new dataset with controlled ASR error rates or on LLMs engineered for matched robustness to noisy text and finding that relative degradation then varies would falsify the claim that degradation tracks ASR information loss.

Figures

Figures reproduced from arXiv: 2605.17443 by Donghyuk Jung, Youngwon Choi.

**Figure 3.** Representative cases of the Korean single-character ASR loss channel.

Original abstract

We analyze how automatic speech recognition (ASR) errors propagate through ASR-LLM cascades in Korean spoken question answering (SQA), focusing on downstream semantic failures that conventional ASR metrics cannot fully capture. Our analysis shows that the relative downstream degradation caused by ASR errors is consistent across LLMs with different absolute performance, suggesting that cascade degradation largely tracks ASR-stage information loss. We further identify single-character Korean ASR errors as a distinct semantic-failure channel, where the gold answer becomes entirely absent from the downstream prediction despite only a minimal transcription difference. Finally, an auxiliary comparison shows that a large audio language model outperforms an ASR-LLM pipeline with a matched language backbone in noisy Korean SQA, indicating the potential of direct audio input to mitigate transcript-induced information loss.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

The paper identifies single-character Korean ASR errors as a semantic wipeout in spoken QA cascades and reports consistent relative degradation across LLMs, plus an audio model edge in noise, but the causal reading of the consistency needs tighter checks.

The main things to know are that relative downstream drops from ASR errors stay similar across LLMs with different base strengths in Korean spoken QA, single-character transcription mistakes can erase the gold answer completely despite minimal change, and a large audio language model beats the matched ASR-LLM pipeline under noise. These come from their empirical measurements and examples rather than any new theory or derivation.

The single-character error channel and the direct audio comparison stand out as the freshest parts relative to earlier cascade work, since they tie the failure mode specifically to Korean character substitutions and show the transcription step as a clear bottleneck. The paper does a clean job documenting how standard ASR metrics miss these downstream semantic losses and gives concrete cases that illustrate the point without overclaiming generality. The consistency finding is useful for thinking about where to intervene in cascades.

The soft spot sits in the causal step from consistent relative degradation to pure tracking of ASR information loss. The tested LLMs could simply share similar sensitivities to Hangul errors or training patterns, which would produce the same pattern even if information loss varied. The abstract gives no ablations or controls for that, so the interpretation stays plausible but not locked down. Dataset sizes, error bars, and statistical tests are also not visible here, which leaves the strength of the patterns open until the methods section is checked.

This is for researchers focused on multilingual spoken QA or practical cascade debugging, especially anyone handling Korean or character-based languages. It gives targeted failure-mode examples that could inform system design. I would send it for peer review. The specific observations on single-character errors and the audio comparison are narrow but concrete enough to benefit from referee scrutiny on the controls and reproducibility.

Referee Report

2 major / 2 minor

Summary. The manuscript analyzes error propagation in ASR-LLM cascades for Korean spoken question answering. It reports that the relative downstream degradation caused by ASR errors is consistent across LLMs with different absolute performance levels, suggesting that cascade degradation largely tracks ASR-stage information loss. It further identifies single-character Korean ASR errors as a distinct semantic-failure channel in which the gold answer becomes entirely absent from the downstream prediction despite only minimal transcription differences. An auxiliary comparison indicates that a large audio language model outperforms an ASR-LLM pipeline with a matched language backbone in noisy Korean SQA.

Significance. If the consistency of relative degradation is shown to track ASR information loss after controlling for LLM-specific robustness, the work supplies useful empirical evidence that improvements at the ASR stage can yield predictable gains in Korean SQA cascades. The identification of single-character errors supplies a concrete, language-specific failure mode not captured by conventional ASR metrics. The audio-LM comparison provides a direct, falsifiable indication that end-to-end audio modeling can mitigate transcript-induced losses. These contributions rest on empirical measurements and cross-model comparisons rather than parameter fitting or derivations.

major comments (2)

[§4] §4 (relative-degradation analysis): the claim that cascade degradation 'largely tracks ASR-stage information loss' rests on observed consistency of relative degradation across LLMs. This inference is not yet load-bearing because the manuscript provides no ablation or measurement of LLM-specific factors (robustness to Hangul character substitutions, tokenization sensitivity, or pre-training overlap with ASR error patterns) that could produce the same consistency without direct tracking of ASR information loss.
[§3 and §4] Experimental details (throughout §3 and §4): the reported patterns and performance gap lack dataset sizes, error bars, statistical significance tests, and explicit controls for post-hoc model selection. These omissions prevent verification that the central observations on consistency and single-character failures are robust rather than sensitive to unstated choices.

minor comments (2)

[§4] Define 'relative downstream degradation' explicitly (e.g., as a normalized difference in exact-match or F1) and state how it is aggregated across LLMs of differing absolute performance.
[Figures/Tables in §4] Add confidence intervals or significance markers to any plots or tables that display degradation patterns or single-character error rates.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed review and constructive comments on our manuscript analyzing error propagation in Korean spoken QA with ASR-LLM cascades. We address each major comment below, providing clarifications and indicating the revisions we will make to strengthen the paper.

Referee: [§4] §4 (relative-degradation analysis): the claim that cascade degradation 'largely tracks ASR-stage information loss' rests on observed consistency of relative degradation across LLMs. This inference is not yet load-bearing because the manuscript provides no ablation or measurement of LLM-specific factors (robustness to Hangul character substitutions, tokenization sensitivity, or pre-training overlap with ASR error patterns) that could produce the same consistency without direct tracking of ASR information loss.

Authors: We agree that additional controls for LLM-specific factors would further strengthen the inference. The consistency we observe across diverse LLMs (including those with different tokenizers and pre-training corpora) provides suggestive evidence that ASR information loss is the dominant factor, as model-specific effects would likely lead to more variable relative degradations. In the revision, we will expand the discussion in §4 to explicitly address potential LLM-specific confounds and include a qualitative analysis of how tokenization and Hangul handling might interact with ASR errors. If feasible with available resources, we will add a small-scale ablation using a controlled set of synthetic errors. revision: partial
Referee: [§3 and §4] Experimental details (throughout §3 and §4): the reported patterns and performance gap lack dataset sizes, error bars, statistical significance tests, and explicit controls for post-hoc model selection. These omissions prevent verification that the central observations on consistency and single-character failures are robust rather than sensitive to unstated choices.

Authors: We acknowledge these omissions in the current manuscript. In the revised version, we will report the exact sizes of the datasets and subsets used for each experiment, include error bars computed via bootstrapping or multiple random seeds where applicable, conduct and report statistical significance tests (e.g., paired t-tests or Wilcoxon tests) for key comparisons, and clarify the model selection process to rule out post-hoc biases. These additions will be integrated into §3 and §4 to enhance the reproducibility and robustness of our findings. revision: yes
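The bootstrapped error bars mentioned in the second response could be computed along the following lines. This is a minimal sketch of a paired percentile bootstrap over per-question exact-match indicators, assuming the same questions are scored with clean and with ASR transcripts; the function name, defaults, and toy data are hypothetical and do not reflect the authors' actual procedure.

```python
import random
from typing import List, Tuple


def paired_bootstrap_ci(clean: List[int], asr: List[int],
                        n_resamples: int = 2000, alpha: float = 0.05,
                        seed: int = 0) -> Tuple[float, float, float]:
    """Point estimate and percentile CI for the mean per-question score drop.

    clean[i] and asr[i] are 0/1 exact-match indicators for the same question.
    """
    if len(clean) != len(asr) or not clean:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(clean)
    diffs = [c - a for c, a in zip(clean, asr)]
    point = sum(diffs) / n

    means = []
    for _ in range(n_resamples):
        # Resample questions with replacement, keeping each clean/ASR pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()

    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return point, lower, upper


# Toy data: 10 questions, all correct on clean text, half correct on ASR text.
clean_em = [1] * 10
asr_em = [1, 0] * 5
print(paired_bootstrap_ci(clean_em, asr_em))
```

Resampling whole questions rather than the two conditions separately keeps the pairing intact, which matters because the clean and ASR runs share the same items.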

Circularity Check

0 steps flagged

No circularity in empirical error analysis

The paper reports direct empirical measurements of ASR error propagation through ASR-LLM cascades on Korean SQA tasks, including relative degradation consistency across LLMs and identification of single-character error channels. No mathematical derivations, equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided abstract or described analysis chain. All claims rest on experimental comparisons and observations rather than any reduction to prior inputs by construction, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical measurement study with no mathematical axioms, free parameters, or invented entities; all claims rest on experimental observations whose details are not supplied in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "relative downstream degradation caused by ASR errors is consistent across LLMs... single-character Korean ASR errors as a distinct semantic-failure channel"

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear

Paper passage: "We use Whisper-large-v3 as the ASR system... downstream QA performance is evaluated using exact match (EM) and F1 score"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
