A decoupled dual-stream model for audio-visual speaker detection reaches 95.6% mAP on AVA-ActiveSpeaker by isolating temporal continuity and inter-personal social modeling into separate branches.
Extensive experiments on AVA-ActiveSpeaker and Columbia ASD demonstrate state-of-the-art accuracy and efficiency.
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.MM (1)
Years: 2025 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
- Dual-Stream Decoupled Learning for Temporal Consistency and Speaker Interaction in AVSD
A decoupled dual-stream model for audio-visual speaker detection reaches 95.6% mAP on AVA-ActiveSpeaker by isolating temporal continuity and inter-personal social modeling into separate branches.