HumanOmni-Speaker introduces a Visual Delta Encoder and VR-SDR benchmark that enable end-to-end speaker diarization and recognition by sampling video at 25 fps and compressing inter-frame motion residuals into 6 tokens per frame.
Llama-omni: Seamless speech interaction with large language models, 2025
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.CV 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
HumanOmni-Speaker: Identifying Who said What and When
HumanOmni-Speaker introduces a Visual Delta Encoder and VR-SDR benchmark that enable end-to-end speaker diarization and recognition by sampling video at 25 fps and compressing inter-frame motion residuals into 6 tokens per frame.