Interleaved SLMs implicitly transcribe spoken words to text tokens in middle layers (top candidate for 77% of data) before predicting in text space and returning to speech.
The Cascade Equivalence Hypothesis: When Do Speech LLMs Behave Like ASR→LLM Pipelines?
3 Pith papers cite this work.
years: 2026 (3)
citing papers explorer
- WASIL: In-the-Wild Arabic Spoken Interactions with LLMs
WASIL is a released dataset of Arabic spoken interactions with LLMs that includes audio, ASR outputs, responses, user feedback, and answerability labels to isolate ASR effects.
- Generating Synthetic Doctor-Patient Conversations for Long-form Audio Summarization
A three-stage synthetic data pipeline generates 8800 doctor-patient conversations totaling 1.3k hours of audio and LLM-produced SOAP notes, with evaluation showing cascaded transcription-then-summarization models outperform end-to-end audio models.