
arxiv: 2605.17443 · v1 · pith:5FQAMGOJnew · submitted 2026-05-17 · 💻 cs.CL · cs.SD · eess.AS

Analyzing Error Propagation in Korean Spoken QA with ASR-LLM Cascades

Pith reviewed 2026-05-20 12:58 UTC · model grok-4.3

classification: 💻 cs.CL · cs.SD · eess.AS
keywords: ASR error propagation · Korean spoken QA · ASR-LLM cascades · semantic failure · single-character errors · downstream degradation · spoken question answering

The pith

ASR errors cause consistent relative degradation in Korean spoken QA across LLMs of varying strength.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how errors introduced by automatic speech recognition propagate through cascades into large language models when answering spoken questions in Korean. It shows that the proportional performance drop from these errors remains similar even when the LLMs themselves differ substantially in their overall accuracy. This pattern implies that most of the information loss occurs during the initial speech-to-text conversion rather than in the language model's processing of the imperfect text. The analysis also identifies single-character transcription mistakes as a special failure mode in Korean where the correct answer disappears entirely from the model's output. A side comparison indicates that models ingesting audio directly can avoid some of the losses seen in the standard recognition-then-language-model pipeline.

Core claim

In Korean spoken question answering with ASR-LLM cascades, the relative downstream degradation caused by ASR errors is consistent across LLMs that have different absolute performance levels. This indicates that overall cascade degradation largely tracks the information loss that occurs at the ASR stage. Single-character Korean ASR errors create a distinct semantic-failure channel in which the gold answer becomes entirely absent from the downstream prediction despite only a minimal difference in the transcription. An auxiliary comparison further shows that a large audio language model outperforms an ASR-LLM pipeline using a matched language backbone when handling noisy Korean spoken questions.

What carries the argument

Consistency of relative downstream degradation as a signal that cascade performance tracks ASR-stage information loss, together with single-character semantic-failure channels in Korean transcriptions.

If this is right

  • Overall cascade performance for Korean spoken QA is limited primarily by ASR accuracy rather than by the choice of downstream LLM.
  • Minimal single-character transcription errors can eliminate the correct answer from the final output even when the rest of the question remains intact.
  • Direct audio input models can reduce transcript-induced semantic losses compared with ASR-LLM pipelines in noisy conditions.
  • Efforts to improve Korean spoken QA should target preservation of answer-critical characters during recognition.
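The single-character failure mode in the bullets above can be made concrete with a small check. The following is a minimal sketch, not the paper's procedure: it assumes the failure is flagged when the ASR transcript is exactly one character away from the reference question and the gold answer string does not appear in the model's prediction. The function names and the example strings are hypothetical and exist only for illustration.

```python
import unicodedata


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance over code points after NFC normalization.

    In NFC, one precomposed Hangul syllable is one code point, so a
    one-syllable substitution counts as a distance of 1.
    """
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def is_single_char_semantic_failure(reference: str, transcript: str,
                                    gold_answer: str, prediction: str) -> bool:
    """True when the transcript is one character off and the gold answer is absent."""
    one_char_off = edit_distance(reference, transcript) == 1
    answer_missing = gold_answer not in prediction
    return one_char_off and answer_missing


# Invented example (not from the paper): one syllable of a name is misrecognized.
reference = "이순신은 어디에서 태어났나요?"
transcript = "이순진은 어디에서 태어났나요?"
flagged = is_single_char_semantic_failure(
    reference, transcript,
    gold_answer="서울",
    prediction="해당 인물에 대한 정보를 찾을 수 없습니다.",
)
print(flagged)
```

Substring matching is a deliberately strict notion of "absent"; a real evaluation would likely need answer normalization, which this sketch leaves out.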

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • System builders may achieve larger gains by investing in ASR improvements than by swapping in larger language models when speech input is noisy.
  • The single-character failure pattern may appear in other character-based or syllabic languages and could be checked with similar controlled error injections.
  • ASR systems for QA tasks might benefit from semantic-aware error correction that protects key answer tokens even when overall word error rate stays low.
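The controlled error injections mentioned above could be prototyped along the following lines. This is a sketch under stated assumptions, not a method from the paper: it perturbs exactly one precomposed Hangul syllable by replacing its final consonant, relying on the standard Unicode layout in which a syllable's code point is 0xAC00 + (lead × 21 + vowel) × 28 + tail. The function names are hypothetical, and real ASR confusions need not be limited to final consonants.

```python
HANGUL_BASE = 0xAC00  # first precomposed syllable, "가"
HANGUL_LAST = 0xD7A3  # last precomposed syllable, "힣"
NUM_TAILS = 28        # 27 final consonants plus index 0 for "no final consonant"


def swap_final_consonant(syllable: str, new_tail: int) -> str:
    """Return the syllable with its final-consonant index set to new_tail (0..27)."""
    code = ord(syllable)
    if not HANGUL_BASE <= code <= HANGUL_LAST:
        raise ValueError("not a precomposed Hangul syllable")
    if not 0 <= new_tail < NUM_TAILS:
        raise ValueError("tail index must be in 0..27")
    offset = code - HANGUL_BASE
    # Integer division drops the old tail; the lead and vowel are preserved.
    return chr(HANGUL_BASE + (offset // NUM_TAILS) * NUM_TAILS + new_tail)


def inject_single_char_error(text: str, position: int, new_tail: int) -> str:
    """Replace one syllable in text, leaving every other character intact."""
    replaced = swap_final_consonant(text[position], new_tail)
    return text[:position] + replaced + text[position + 1:]


# Tail index 16 is the final consonant "ㅁ", so "강" becomes "감".
noisy = inject_single_char_error("한강", 1, 16)
print(noisy)
```

Feeding such perturbed questions to each LLM directly would separate the effect of the character change from everything else the ASR stage does, which is the kind of control the load-bearing premise below depends on.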

Load-bearing premise

The observed consistency in relative degradation across LLMs is caused by tracking of ASR-stage information loss rather than by LLM-specific robustness or dataset characteristics.

What would settle it

Repeating the experiments on a new dataset with controlled ASR error rates or on LLMs engineered for matched robustness to noisy text and finding that relative degradation then varies would falsify the claim that degradation tracks ASR information loss.

Figures

Figures reproduced from arXiv: 2605.17443 by Donghyuk Jung, Youngwon Choi.

Figure 1: Overview of the speech synthesis and downstream QA evaluation.
Figure 3: Representative cases of the Korean single-character ASR loss channel.
Original abstract

We analyze how automatic speech recognition (ASR) errors propagate through ASR-LLM cascades in Korean spoken question answering (SQA), focusing on downstream semantic failures that conventional ASR metrics cannot fully capture. Our analysis shows that the relative downstream degradation caused by ASR errors is consistent across LLMs with different absolute performance, suggesting that cascade degradation largely tracks ASR-stage information loss. We further identify single-character Korean ASR errors as a distinct semantic-failure channel, where the gold answer becomes entirely absent from the downstream prediction despite only a minimal transcription difference. Finally, an auxiliary comparison shows that a large audio language model outperforms an ASR-LLM pipeline with a matched language backbone in noisy Korean SQA, indicating the potential of direct audio input to mitigate transcript-induced information loss.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript analyzes error propagation in ASR-LLM cascades for Korean spoken question answering. It reports that the relative downstream degradation caused by ASR errors is consistent across LLMs with different absolute performance levels, suggesting that cascade degradation largely tracks ASR-stage information loss. It further identifies single-character Korean ASR errors as a distinct semantic-failure channel in which the gold answer becomes entirely absent from the downstream prediction despite only minimal transcription differences. An auxiliary comparison indicates that a large audio language model outperforms an ASR-LLM pipeline with a matched language backbone in noisy Korean SQA.

Significance. If the consistency of relative degradation is shown to track ASR information loss after controlling for LLM-specific robustness, the work supplies useful empirical evidence that improvements at the ASR stage can yield predictable gains in Korean SQA cascades. The identification of single-character errors supplies a concrete, language-specific failure mode not captured by conventional ASR metrics. The audio-LM comparison provides a direct, falsifiable indication that end-to-end audio modeling can mitigate transcript-induced losses. These contributions rest on empirical measurements and cross-model comparisons rather than parameter fitting or derivations.

major comments (2)
  1. [§4] Relative-degradation analysis: the claim that cascade degradation 'largely tracks ASR-stage information loss' rests on the observed consistency of relative degradation across LLMs. This inference is not yet secure because the manuscript provides no ablation or measurement of LLM-specific factors (robustness to Hangul character substitutions, tokenization sensitivity, or pre-training overlap with ASR error patterns) that could produce the same consistency without direct tracking of ASR information loss.
  2. [§3 and §4] Experimental details: the reported patterns and the performance gap are presented without dataset sizes, error bars, statistical significance tests, or explicit controls for post-hoc model selection. These omissions prevent verification that the central observations on consistency and single-character failures are robust rather than sensitive to unstated choices.
minor comments (2)
  1. [§4] Define 'relative downstream degradation' explicitly (e.g., as a normalized difference in exact-match or F1) and state how it is aggregated across LLMs of differing absolute performance.
  2. [Figures/Tables in §4] Add confidence intervals or significance markers to any plots or tables that display degradation patterns or single-character error rates.
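Minor comment 1 asks for an explicit definition of relative downstream degradation. One possible formalization, assumed here purely for illustration and not confirmed by the manuscript, is the drop in exact-match accuracy between clean-text and ASR-transcript input, normalized by the clean-text accuracy. The scores in the example are invented.

```python
def exact_match(predictions: list[str], golds: list[str]) -> float:
    """Fraction of predictions that equal the gold answer after trimming whitespace."""
    if len(predictions) != len(golds) or not golds:
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(p.strip() == g.strip() for p, g in zip(predictions, golds))
    return hits / len(golds)


def relative_degradation(clean_score: float, asr_score: float) -> float:
    """Assumed definition: (clean - asr) / clean, the share of clean accuracy lost."""
    if clean_score <= 0:
        raise ValueError("clean score must be positive")
    return (clean_score - asr_score) / clean_score


# Invented numbers: a strong and a weak model lose the same share of their accuracy.
strong = relative_degradation(clean_score=0.80, asr_score=0.60)
weak = relative_degradation(clean_score=0.40, asr_score=0.30)
print(strong, weak)
```

Under this definition two models with very different absolute accuracy can show the same relative degradation (about 0.25 in the invented example), which is the pattern the paper reports. Whether per-model values are averaged or reported individually is exactly the aggregation choice the comment asks the authors to state.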

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed review and constructive comments on our manuscript analyzing error propagation in Korean spoken QA with ASR-LLM cascades. We address each major comment below, providing clarifications and indicating the revisions we will make to strengthen the paper.

  1. Referee: [§4] Relative-degradation analysis: the claim that cascade degradation 'largely tracks ASR-stage information loss' rests on the observed consistency of relative degradation across LLMs. This inference is not yet secure because the manuscript provides no ablation or measurement of LLM-specific factors (robustness to Hangul character substitutions, tokenization sensitivity, or pre-training overlap with ASR error patterns) that could produce the same consistency without direct tracking of ASR information loss.

    Authors: We agree that additional controls for LLM-specific factors would further strengthen the inference. The consistency we observe across diverse LLMs (including those with different tokenizers and pre-training corpora) provides suggestive evidence that ASR information loss is the dominant factor, as model-specific effects would likely lead to more variable relative degradations. In the revision, we will expand the discussion in §4 to explicitly address potential LLM-specific confounds and include a qualitative analysis of how tokenization and Hangul handling might interact with ASR errors. If feasible with available resources, we will add a small-scale ablation using a controlled set of synthetic errors. revision: partial

  2. Referee: [§3 and §4] Experimental details: the reported patterns and the performance gap are presented without dataset sizes, error bars, statistical significance tests, or explicit controls for post-hoc model selection. These omissions prevent verification that the central observations on consistency and single-character failures are robust rather than sensitive to unstated choices.

    Authors: We acknowledge these omissions in the current manuscript. In the revised version, we will report the exact sizes of the datasets and subsets used for each experiment, include error bars computed via bootstrapping or multiple random seeds where applicable, conduct and report statistical significance tests (e.g., paired t-tests or Wilcoxon tests) for key comparisons, and clarify the model selection process to rule out post-hoc biases. These additions will be integrated into §3 and §4 to enhance the reproducibility and robustness of our findings. revision: yes
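The bootstrapped error bars promised in this response could take roughly the following shape. This is a hedged sketch, not the authors' planned analysis: it assumes per-question correctness flags (1 or 0) are available for the same questions under clean-text and ASR-transcript input, resamples questions in pairs, and reports a percentile interval for the relative degradation defined as (clean − asr) / clean. The function name, defaults, and data are hypothetical.

```python
import random


def bootstrap_relative_degradation_ci(clean_correct: list[int],
                                      asr_correct: list[int],
                                      n_resamples: int = 2000,
                                      alpha: float = 0.05,
                                      seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap interval for (clean - asr) / clean over paired questions."""
    if len(clean_correct) != len(asr_correct) or not clean_correct:
        raise ValueError("need two non-empty lists of equal length")
    rng = random.Random(seed)
    n = len(clean_correct)
    stats = []
    for _ in range(n_resamples):
        # Resample question indices so both conditions see the same questions.
        idx = [rng.randrange(n) for _ in range(n)]
        clean = sum(clean_correct[i] for i in idx) / n
        asr = sum(asr_correct[i] for i in idx) / n
        if clean > 0:  # skip degenerate resamples with no clean-condition hits
            stats.append((clean - asr) / clean)
    if not stats:
        raise ValueError("no usable resamples")
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lower, upper


# Invented data: 80 of 100 correct on clean text, 60 of 100 on ASR transcripts.
clean_flags = [1] * 80 + [0] * 20
asr_flags = [1] * 60 + [0] * 40
low, high = bootstrap_relative_degradation_ci(clean_flags, asr_flags)
print(low, high)
```

Pairing the resamples matters because both conditions are evaluated on the same questions; resampling them independently would ignore that correlation and typically widen the interval.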

Circularity Check

0 steps flagged

No circularity in empirical error analysis


The paper reports direct empirical measurements of ASR error propagation through ASR-LLM cascades on Korean SQA tasks, including relative degradation consistency across LLMs and identification of single-character error channels. No mathematical derivations, equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided abstract or described analysis chain. All claims rest on experimental comparisons and observations rather than any reduction to prior inputs by construction, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical measurement study with no mathematical axioms, free parameters, or invented entities; all claims rest on experimental observations whose details are not supplied in the abstract.



