Qwen3-Omni Technical Report, 2025
2 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.CV (2)
years: 2026 (2)
verdicts: UNVERDICTED (2)
representative citing papers
OmniSelect is a training-free, modality-adaptive token pruning framework that dynamically selects Audio-Centric, Video-Centric, or Uniform compression regimes using AudioCLIP cross-modal relevance scores and then applies adaptive fine-grained pruning within temporal groups.
citing papers explorer
- HumanOmni-Speaker: Identifying Who said What and When
HumanOmni-Speaker introduces a Visual Delta Encoder and VR-SDR benchmark that enable end-to-end speaker diarization and recognition by sampling video at 25 fps and compressing inter-frame motion residuals into 6 tokens per frame.
- OmniSelect: Dynamic Modality-Aware Token Compression for Efficient Omni-modal Large Language Models
OmniSelect is a training-free, modality-adaptive token pruning framework that dynamically selects Audio-Centric, Video-Centric, or Uniform compression regimes using AudioCLIP cross-modal relevance scores and then applies adaptive fine-grained pruning within temporal groups.