Serialized output prompting for large language model-based multi-talker speech recognition
1 Pith paper cites this work. Polarity classification is still indexing.
fields: eess.AS (1)
years: 2026 (1)
verdicts: UNVERDICTED (1)
representative citing papers
- Grounding Spoken LLMs in Multi-Speaker Audio via Diarization Conditioning
Dixtral uses diarization conditioning on a Whisper-based encoder within Voxtral to outperform baselines on multi-speaker transcription and to match or exceed them on QA tasks.