MM-When2Speak reformulates conversational timing as dense response-type prediction and achieves up to 3x better performance by integrating video, audio, and text cues on top of an LLM backbone using a new dyadic conversation dataset.
A full-duplex speech dialogue scheme based on large language model
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.CV 1years
2025 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Beyond Words: Multimodal LLM Knows When to Speak
MM-When2Speak reformulates conversational timing as dense response-type prediction and achieves up to 3x better performance by integrating video, audio, and text cues on top of an LLM backbone using a new dyadic conversation dataset.