The Cascade Equivalence Hypothesis: When Do Speech LLMs Behave Like ASR → LLM Pipelines?

Preprint · 2026 · arXiv 2602.17598

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Interleaved Speech Language Models Latently Work In Text

cs.CL · 2026-06-21 · unverdicted · novelty 7.0

Interleaved SLMs implicitly transcribe spoken words to text tokens in middle layers (top candidate for 77% of data) before predicting in text space and returning to speech.

WASIL: In-the-Wild Arabic Spoken Interactions with LLMs

cs.SD · 2026-05-09 · unverdicted · novelty 6.0 · 2 refs

WASIL is a released dataset of Arabic spoken interactions with LLMs that includes audio, ASR outputs, responses, user feedback, and answerability labels to isolate ASR effects.

Generating Synthetic Doctor-Patient Conversations for Long-form Audio Summarization

cs.SD · 2026-04-07 · conditional · novelty 6.0

A three-stage synthetic data pipeline generates 8800 doctor-patient conversations totaling 1.3k hours of audio and LLM-produced SOAP notes, with evaluation showing cascaded transcription-then-summarization models outperform end-to-end audio models.

citing papers explorer

WASIL: In-the-Wild Arabic Spoken Interactions with LLMs

cs.SD · 2026-05-09 · unverdicted · none · ref 9 · 2 links

WASIL is a released dataset of Arabic spoken interactions with LLMs that includes audio, ASR outputs, responses, user feedback, and answerability labels to isolate ASR effects.

Generating Synthetic Doctor-Patient Conversations for Long-form Audio Summarization

cs.SD · 2026-04-07 · conditional · none · ref 21

A three-stage synthetic data pipeline generates 8800 doctor-patient conversations totaling 1.3k hours of audio and LLM-produced SOAP notes, with evaluation showing cascaded transcription-then-summarization models outperform end-to-end audio models.
