Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities, 2024
1 Pith paper cites this work. Polarity classification is still indexing.
fields: cs.CV (1)
years: 2026 (1)
verdicts: UNVERDICTED (1)
representative citing papers
HumanOmni-Speaker: Identifying Who said What and When
HumanOmni-Speaker introduces a Visual Delta Encoder and VR-SDR benchmark that enable end-to-end speaker diarization and recognition by sampling video at 25 fps and compressing inter-frame motion residuals into 6 tokens per frame.
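
The two figures in the summary above (25 fps sampling, 6 tokens per frame) imply a visual token rate that can be worked out directly. The sketch below is only back-of-the-envelope arithmetic, not the paper's implementation: the function name `visual_token_budget` is hypothetical, and it assumes every sampled frame contributes exactly 6 tokens, which the summary does not confirm (for example, the first frame has no predecessor to take a residual against).

```python
import math


def visual_token_budget(duration_s: float, fps: int = 25, tokens_per_frame: int = 6) -> int:
    """Estimate visual tokens for a clip (hypothetical helper).

    Assumes a fixed token count per sampled frame; the defaults are the
    25 fps and 6 tokens/frame figures quoted in the summary.
    """
    frames = math.floor(duration_s * fps)  # whole frames sampled from the clip
    return frames * tokens_per_frame


print(visual_token_budget(1.0))   # 150 tokens for one second of video
print(visual_token_budget(60.0))  # 9000 tokens for one minute of video
```

Under these assumptions, one second of video costs 150 visual tokens and one minute costs 9,000.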