
arxiv: 2605.17443 · v2 · pith:5FQAMGOJnew · submitted 2026-05-17 · 💻 cs.CL · cs.SD · eess.AS

Analyzing Error Propagation in Korean Spoken QA with ASR-LLM Cascades

Pith reviewed 2026-05-20 12:58 UTC · model grok-4.3

keywords: ASR error propagation · Korean spoken QA · ASR-LLM cascades · semantic failure · single-character errors · downstream degradation · spoken question answering

The pith

ASR errors cause consistent relative degradation in Korean spoken QA across LLMs of varying strength.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how errors introduced by automatic speech recognition propagate through cascades into large language models when answering spoken questions in Korean. It shows that the proportional performance drop from these errors remains similar even when the LLMs themselves differ substantially in their overall accuracy. This pattern implies that most of the information loss occurs during the initial speech-to-text conversion rather than in the language model's processing of the imperfect text. The analysis also identifies single-character transcription mistakes as a special failure mode in Korean where the correct answer disappears entirely from the model's output. A side comparison indicates that models ingesting audio directly can avoid some of the losses seen in the standard recognition-then-language-model pipeline.

Core claim

In Korean spoken question answering with ASR-LLM cascades, the relative downstream degradation caused by ASR errors is consistent across LLMs that have different absolute performance levels. This indicates that overall cascade degradation largely tracks the information loss that occurs at the ASR stage. Single-character Korean ASR errors create a distinct semantic-failure channel in which the gold answer becomes entirely absent from the downstream prediction despite only a minimal difference in the transcription. An auxiliary comparison further shows that a large audio language model outperforms an ASR-LLM pipeline using a matched language backbone when handling noisy Korean spoken questions.

What carries the argument

Consistency of relative downstream degradation as a signal that cascade performance tracks ASR-stage information loss, together with single-character semantic-failure channels in Korean transcriptions.

If this is right

  • Overall cascade performance for Korean spoken QA is limited primarily by ASR accuracy rather than by the choice of downstream LLM.
  • Minimal single-character transcription errors can eliminate the correct answer from the final output even when the rest of the question remains intact.
  • Direct audio input models can reduce transcript-induced semantic losses compared with ASR-LLM pipelines in noisy conditions.
  • Efforts to improve Korean spoken QA should target preservation of answer-critical characters during recognition.
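
One way to flag the single-character failure mode described in the bullets above is sketched below. It is a minimal illustration, not the paper's procedure: the function names, the substring test for "answer absent", and the example strings are all assumptions made here. Edit distance is computed over NFC-normalized code points so that one precomposed Hangul syllable counts as one character.

```python
import unicodedata


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance over NFC-normalized Unicode code points."""
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def is_single_char_answer_loss(reference: str, hypothesis: str,
                               gold: str, prediction: str) -> bool:
    """True when the transcript differs from the reference by exactly one
    character and the gold answer does not appear in the prediction.
    The substring check is an assumed, simplified notion of 'absent'."""
    return edit_distance(reference, hypothesis) == 1 and gold not in prediction


# Constructed example (not taken from the paper):
# "서울" (Seoul) misrecognized as "거울" (mirror).
flagged = is_single_char_answer_loss(
    reference="서울의 인구는?",
    hypothesis="거울의 인구는?",
    gold="서울",
    prediction="거울은 사람이 아니므로 인구가 없습니다.",
)
```

A check like this only counts candidate cases; deciding whether the lost character was actually answer-critical would still need manual inspection or a stricter answer-matching rule.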

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • System builders may achieve larger gains by investing in ASR improvements than by swapping in larger language models when speech input is noisy.
  • The single-character failure pattern may appear in other character-based or syllabic languages and could be checked with similar controlled error injections.
  • ASR systems for QA tasks might benefit from semantic-aware error correction that protects key answer tokens even when overall word error rate stays low.
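
The controlled error injection mentioned in the bullets above could be sketched as follows. This is a hypothetical helper, not something the paper describes: it relies only on the standard Unicode layout of precomposed Hangul syllables (base U+AC00, 19 initial consonants, 21 vowels, 28 final slots including "none") and swaps the initial consonant of one syllable. The function names and the choice to perturb the initial consonant are assumptions.

```python
from typing import Tuple

HANGUL_BASE = 0xAC00
NUM_LEADS = 19    # initial consonants
NUM_VOWELS = 21   # medial vowels
NUM_TAILS = 28    # final consonants, index 0 means "no final consonant"


def decompose(syllable: str) -> Tuple[int, int, int]:
    """Split one precomposed Hangul syllable into (lead, vowel, tail) indices."""
    offset = ord(syllable) - HANGUL_BASE
    if not 0 <= offset < NUM_LEADS * NUM_VOWELS * NUM_TAILS:
        raise ValueError(f"not a precomposed Hangul syllable: {syllable!r}")
    lead, rest = divmod(offset, NUM_VOWELS * NUM_TAILS)
    vowel, tail = divmod(rest, NUM_TAILS)
    return lead, vowel, tail


def compose(lead: int, vowel: int, tail: int) -> str:
    """Build a precomposed Hangul syllable from jamo indices."""
    return chr(HANGUL_BASE + (lead * NUM_VOWELS + vowel) * NUM_TAILS + tail)


def inject_single_char_error(text: str, position: int, new_lead: int) -> str:
    """Replace the initial consonant of the syllable at `position`,
    leaving every other character untouched."""
    _, vowel, tail = decompose(text[position])
    corrupted = compose(new_lead, vowel, tail)
    return text[:position] + corrupted + text[position + 1:]


# Lead index 0 is ㄱ, so "서울" becomes "거울".
corrupted_text = inject_single_char_error("서울", position=0, new_lead=0)
```

Running a fixed set of such perturbations through several LLMs would keep the transcription error constant while the downstream model varies, which is the kind of control the bullet has in mind.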

Load-bearing premise

The observed consistency in relative degradation across LLMs is caused by tracking of ASR-stage information loss rather than by LLM-specific robustness or dataset characteristics.

What would settle it

Repeating the experiments on a new dataset with controlled ASR error rates or on LLMs engineered for matched robustness to noisy text and finding that relative degradation then varies would falsify the claim that degradation tracks ASR information loss.

Figures

Figures reproduced from arXiv: 2605.17443 by Donghyuk Jung, Youngwon Choi.

Figure 1: Overview of the speech synthesis and downstream QA evaluation.
Figure 3: Representative cases of the Korean single-character ASR loss channel.
Original abstract

We analyze how automatic speech recognition (ASR) errors propagate through ASR-LLM cascades in Korean spoken question answering (SQA), focusing on downstream semantic failures that conventional ASR metrics cannot fully capture. Our analysis shows that the relative downstream degradation caused by ASR errors is consistent across LLMs with different absolute performance, suggesting that cascade degradation largely tracks ASR-stage information loss. We further identify single-character Korean ASR errors as a Korean-specific loss channel, where even a minimal transcription difference can change the intended question and degrade downstream QA performance. Finally, an auxiliary comparison shows that a large audio language model outperforms an ASR-LLM cascade with an approximately matched language backbone in noisy Korean SQA, indicating the potential of direct audio input to mitigate transcript-induced information loss.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript analyzes error propagation in ASR-LLM cascades for Korean spoken question answering. It reports that the relative downstream degradation caused by ASR errors is consistent across LLMs with different absolute performance levels, suggesting that cascade degradation largely tracks ASR-stage information loss. It further identifies single-character Korean ASR errors as a distinct semantic-failure channel in which the gold answer becomes entirely absent from the downstream prediction despite only minimal transcription differences. An auxiliary comparison indicates that a large audio language model outperforms an ASR-LLM pipeline with a matched language backbone in noisy Korean SQA.

Significance. If the consistency of relative degradation is shown to track ASR information loss after controlling for LLM-specific robustness, the work supplies useful empirical evidence that improvements at the ASR stage can yield predictable gains in Korean SQA cascades. The identification of single-character errors supplies a concrete, language-specific failure mode not captured by conventional ASR metrics. The audio-LM comparison provides a direct, falsifiable indication that end-to-end audio modeling can mitigate transcript-induced losses. These contributions rest on empirical measurements and cross-model comparisons rather than parameter fitting or derivations.

major comments (2)
  1. [§4] §4 (relative-degradation analysis): the claim that cascade degradation 'largely tracks ASR-stage information loss' rests on observed consistency of relative degradation across LLMs. This inference is not yet load-bearing because the manuscript provides no ablation or measurement of LLM-specific factors (robustness to Hangul character substitutions, tokenization sensitivity, or pre-training overlap with ASR error patterns) that could produce the same consistency without direct tracking of ASR information loss.
  2. [§3 and §4] Experimental details (throughout §3 and §4): the reported patterns and performance gap lack dataset sizes, error bars, statistical significance tests, and explicit controls for post-hoc model selection. These omissions prevent verification that the central observations on consistency and single-character failures are robust rather than sensitive to unstated choices.
minor comments (2)
  1. [§4] Define 'relative downstream degradation' explicitly (e.g., as a normalized difference in exact-match or F1) and state how it is aggregated across LLMs of differing absolute performance.
  2. [Figures/Tables in §4] Add confidence intervals or significance markers to any plots or tables that display degradation patterns or single-character error rates.
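
As an illustration of the definition requested in minor comment 1, the sketch below uses one plausible normalization: the drop from clean-transcript score to ASR-transcript score, divided by the clean-transcript score. The paper's actual definition is not given in the material here, so the formula, the function name, and the placeholder scores are assumptions chosen only to show the arithmetic.

```python
def relative_degradation(clean_score: float, noisy_score: float) -> float:
    """Fraction of clean-input performance lost when ASR output is used.
    Assumed definition: (clean - noisy) / clean."""
    if clean_score <= 0:
        raise ValueError("clean_score must be positive")
    return (clean_score - noisy_score) / clean_score


# Placeholder scores (clean, noisy) for two hypothetical models.
# These are NOT results from the paper.
scores = {
    "model_a": (0.80, 0.60),
    "model_b": (0.40, 0.30),
}

per_model = {name: relative_degradation(clean, noisy)
             for name, (clean, noisy) in scores.items()}

# A small spread across models is what "consistent relative degradation"
# would look like under this definition.
spread = max(per_model.values()) - min(per_model.values())
```

In this toy case the two models differ by a factor of two in absolute score yet lose about the same fraction of it, which is the pattern the manuscript reports. Whether the aggregate should be a spread, a variance, or a per-model table is exactly what the comment asks the authors to state.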

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed review and constructive comments on our manuscript analyzing error propagation in Korean spoken QA with ASR-LLM cascades. We address each major comment below, providing clarifications and indicating the revisions we will make to strengthen the paper.

  1. Referee: [§4] §4 (relative-degradation analysis): the claim that cascade degradation 'largely tracks ASR-stage information loss' rests on observed consistency of relative degradation across LLMs. This inference is not yet load-bearing because the manuscript provides no ablation or measurement of LLM-specific factors (robustness to Hangul character substitutions, tokenization sensitivity, or pre-training overlap with ASR error patterns) that could produce the same consistency without direct tracking of ASR information loss.

    Authors: We agree that additional controls for LLM-specific factors would further strengthen the inference. The consistency we observe across diverse LLMs (including those with different tokenizers and pre-training corpora) provides suggestive evidence that ASR information loss is the dominant factor, as model-specific effects would likely lead to more variable relative degradations. In the revision, we will expand the discussion in §4 to explicitly address potential LLM-specific confounds and include a qualitative analysis of how tokenization and Hangul handling might interact with ASR errors. If feasible with available resources, we will add a small-scale ablation using a controlled set of synthetic errors. revision: partial

  2. Referee: [§3 and §4] Experimental details (throughout §3 and §4): the reported patterns and performance gap lack dataset sizes, error bars, statistical significance tests, and explicit controls for post-hoc model selection. These omissions prevent verification that the central observations on consistency and single-character failures are robust rather than sensitive to unstated choices.

    Authors: We acknowledge these omissions in the current manuscript. In the revised version, we will report the exact sizes of the datasets and subsets used for each experiment, include error bars computed via bootstrapping or multiple random seeds where applicable, conduct and report statistical significance tests (e.g., paired t-tests or Wilcoxon tests) for key comparisons, and clarify the model selection process to rule out post-hoc biases. These additions will be integrated into §3 and §4 to enhance the reproducibility and robustness of our findings. revision: yes
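
The bootstrapped error bars promised in the second response could take roughly the following form. This is a minimal sketch under stated assumptions: per-question correctness is recorded as 0/1 for the same questions under clean and ASR transcripts, the statistic is the accuracy difference, and the interval is a simple percentile interval. The function name, resample count, and seed handling are illustrative choices, not the authors' protocol.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_ci(clean_correct: Sequence[int],
                        noisy_correct: Sequence[int],
                        n_resamples: int = 2000,
                        alpha: float = 0.05,
                        seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for (clean accuracy - noisy accuracy),
    resampling questions with replacement and keeping the pairing intact."""
    if len(clean_correct) != len(noisy_correct) or not clean_correct:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(clean_correct)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        clean_acc = sum(clean_correct[i] for i in idx) / n
        noisy_acc = sum(noisy_correct[i] for i in idx) / n
        diffs.append(clean_acc - noisy_acc)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper
```

Resampling question indices once and applying them to both conditions preserves the pairing, so the interval reflects the per-question effect of ASR errors rather than unrelated variation between two samples. An interval that excludes zero would support a real drop; a paired test such as the Wilcoxon test mentioned in the response would be a complementary check.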

Circularity Check

0 steps flagged

No circularity in empirical error analysis


The paper reports direct empirical measurements of ASR error propagation through ASR-LLM cascades on Korean SQA tasks, including relative degradation consistency across LLMs and identification of single-character error channels. No mathematical derivations, equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided abstract or described analysis chain. All claims rest on experimental comparisons and observations rather than any reduction to prior inputs by construction, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical measurement study with no mathematical axioms, free parameters, or invented entities; all claims rest on experimental observations whose details are not supplied in the abstract.

pith-pipeline@v0.9.0 · 5654 in / 1063 out tokens · 26820 ms · 2026-05-20T12:58:02.524780+00:00 · methodology



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CORTIS: Text-Only Adaptation of Spoken Language Models for Task-Oriented Voice Agents

    cs.HC 2026-06 unverdicted novelty 6.0

    CORTIS is a text-only adaptation method for spoken language models that enables direct speech-to-structured-output generation for task-oriented agents and matches or exceeds ASR-LLM cascades under acoustic degradation.