This setting matches inference-time conditions most closely, because the student is trained on the same type of text tokens it will receive at test time.

ASRQ_ASRA: The student input uses the text transcribed by WhisperPro from the speech-form data, and the teacher distribution is also conditioned on this transcribed text.
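A minimal sketch of what this setting could look like as a distillation objective. The excerpt does not state the loss, so the use of a per-position KL divergence here is an assumption, and the function names and toy logits are hypothetical. The only point taken from the text is that both teacher and student are conditioned on the same ASR-transcribed tokens, so their per-position distributions line up one to one.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(teacher_logits, student_logits):
    """Mean per-position KL(teacher || student).

    Assumption: both models were run on the same transcribed token
    sequence, so position i in one list matches position i in the other.
    """
    losses = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(losses) / len(losses)


# Hypothetical toy logits: 2 positions, 3-token vocabulary.
teacher_logits = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]
student_logits = [[1.5, 0.7, -0.5], [0.0, 1.0, 0.8]]
loss = distillation_loss(teacher_logits, student_logits)
```

The loss is zero when the student reproduces the teacher distribution at every position and positive otherwise.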

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM

cs.CL · 2026-05-07 · unverdicted · novelty 5.0 · 2 refs

TextPro-SLM reduces the speech-text modality gap by feeding an LLM backbone with synchronized text tokens and prosody embeddings from WhisperPro, achieving lowest gap scores at 3B/7B scales with roughly 1,000 hours of audio.
