Dual-Stream Decoupled Learning for Temporal Consistency and Speaker Interaction in AVSD

Haibiao Yao; Jianjun Li; Jinghan Yu; Junhao Xiao; Shun Feng; Yi Chen; Youjun Bao; Zhiyuan Ma; Zhiyu Wu

arxiv: 2512.19130 · v2 · submitted 2025-12-22 · 💻 cs.MM

Pith reviewed 2026-05-16 20:58 UTC · model grok-4.3

classification 💻 cs.MM

keywords: audio visual speaker detection · decoupled dual stream · temporal continuity · social relation modeling · active speaker detection · gradient divergence

The pith

D²Stream decouples temporal and social streams to overcome conflicting biases in audio-visual speaker detection

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that audio-visual speaker detection must capture both a speaker's temporal continuity and that speaker's interactions with others, and that these two demands conflict when one coupled model serves both. Temporal modeling favors smooth, low-frequency signals, while social modeling needs sharp, high-frequency discriminability. D²Stream splits the work into a separate Intra-speaker Temporal Continuity (ITC) stream and Inter-personal Social Relation (ISR) stream so that each can optimize independently. As evidence of the conflict, the paper reports that the gradient update directions of the two streams diverge and stabilize at 86.1 degrees. It also reports 95.6 percent mAP on the AVA-ActiveSpeaker benchmark, above the prior performance plateau, while keeping the model lightweight. Readers would care because it offers a practical way to improve detection in crowded video scenes.

Core claim

D²Stream is a decoupled dual-stream framework that isolates intra-speaker temporal continuity (ITC) in one branch and inter-personal social relations (ISR) in a parallel branch. Gradient analysis shows the update directions stabilizing at an 86.1 degree divergence, which the paper reads as confirmation of an inherent task conflict. With this design the paper reports state-of-the-art 95.6% mAP on AVA-ActiveSpeaker and better generalization on Columbia ASD.
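To make the 86.1 degree figure concrete, here is a minimal sketch of how an angle between two gradient vectors can be computed. The source does not say which parameters the gradients are taken over or how they are flattened, so the function name `gradient_angle_degrees` and the flat-list inputs are assumptions for illustration only. For scale, an angle of 86.1 degrees corresponds to a cosine similarity of about 0.068, which is close to orthogonal.

```python
import math


def gradient_angle_degrees(grad_a, grad_b):
    """Angle in degrees between two flattened gradient vectors.

    Hypothetical helper: the paper's exact measurement protocol is not
    given in the source, so this only shows the underlying geometry.
    """
    if len(grad_a) != len(grad_b):
        raise ValueError("gradient vectors must have the same length")
    dot = sum(a * b for a, b in zip(grad_a, grad_b))
    norm_a = math.sqrt(sum(a * a for a in grad_a))
    norm_b = math.sqrt(sum(b * b for b in grad_b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("angle is undefined for a zero gradient")
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.degrees(math.acos(cosine))
```

Calling this once per training step on the two streams' gradients would yield a curve whose plateau could be compared with the reported value, assuming the gradients are taken with respect to the same set of parameters.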

What carries the argument

Parallel ITC stream for temporal stability and ISR stream for social cues, with explicit structural separation to handle conflicting inductive biases.
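The structural separation can be sketched as two attention passes over different axes of the same feature grid. This is an illustration under stated assumptions, not the paper's architecture: the attention here is parameter-free, every speaker is assumed present in every frame, and the function names `itc_stream`, `isr_stream`, and `fuse` as well as fusion by concatenation are hypothetical. Input is `features[speaker][frame]`, a list of floats.

```python
import math


def self_attention(tokens):
    """Parameter-free scaled dot-product self-attention over a list of vectors."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in tokens]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * v[d] for w, v in zip(weights, tokens)) for d in range(dim)])
    return out


def itc_stream(features):
    """Attend across frames, one speaker at a time (stand-in for the ITC stream)."""
    return [self_attention(track) for track in features]


def isr_stream(features):
    """Attend across speakers, one frame at a time (stand-in for the ISR stream)."""
    num_speakers, num_frames = len(features), len(features[0])
    out = [[None] * num_frames for _ in range(num_speakers)]
    for t in range(num_frames):
        mixed = self_attention([features[s][t] for s in range(num_speakers)])
        for s in range(num_speakers):
            out[s][t] = mixed[s]
    return out


def fuse(itc_out, isr_out):
    """Concatenate both stream outputs per speaker and frame (assumed fusion)."""
    return [[a + b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(itc_out, isr_out)]
```

The point of the sketch is the isolation property: the ITC output for one speaker never reads another speaker's features, and the ISR output for one frame never reads another frame.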

If this is right

Reaches 95.6% mAP on AVA-ActiveSpeaker, surpassing previous methods
Shows better generalization on the Columbia ASD dataset
Maintains a lightweight architecture for efficiency
Provides quantitative evidence of task conflict via gradient divergence
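For readers less familiar with the headline metric, the sketch below computes average precision from ranked scores using one common definition (mean of the precision values at each true positive). It is an assumption that this matches the benchmark's scoring in detail; the official AVA-ActiveSpeaker evaluation tool may interpolate or handle ties differently, and the function name `average_precision` is illustrative.

```python
def average_precision(scores, labels):
    """Average precision for binary labels (1 = speaking, 0 = not speaking).

    Uses the non-interpolated definition; ties keep their input order
    because Python's sort is stable.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(1 for _, label in ranked if label == 1)
    if total_positives == 0:
        raise ValueError("average precision needs at least one positive label")
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this recall point
    return precision_sum / total_positives
```

With scores `[0.9, 0.8, 0.7, 0.6]` and labels `[1, 0, 1, 0]`, the two positives sit at ranks 1 and 3, so the result is (1/1 + 2/3) / 2.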

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar decoupling could improve other multi-person video understanding tasks where temporal and relational modeling compete.
The gradient divergence metric offers a diagnostic tool for identifying when task separation is beneficial in neural networks.
Extensions might include applying the framework to real-time processing or integrating with other modalities.

Load-bearing premise

The observed performance improvement stems specifically from the decoupling of the two streams rather than from other changes in capacity or optimization.

What would settle it

Train a capacity-matched version of the model without the dual-stream separation, under the same loss weighting and optimization schedule, and check whether its mAP on AVA-ActiveSpeaker still reaches 95.6 percent or falls short.
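Capacity matching for such a control comes down to parameter arithmetic. The sketch below is a toy, assuming each stream is a plain MLP rather than the attention blocks the paper uses; `mlp_params` and `matched_hidden_width` are hypothetical helpers, and the layer sizes are made up for the example.

```python
def mlp_params(in_dim, hidden, out_dim, layers):
    """Weights plus biases of an MLP with `layers` hidden layers of width `hidden`."""
    dims = [in_dim] + [hidden] * layers + [out_dim]
    return sum(a * b + b for a, b in zip(dims, dims[1:]))


def matched_hidden_width(target_params, in_dim, out_dim, layers):
    """Smallest hidden width whose parameter count reaches `target_params`."""
    hidden = 1
    while mlp_params(in_dim, hidden, out_dim, layers) < target_params:
        hidden += 1
    return hidden


# Two streams of width 64 versus one wider stream with the same budget.
dual_budget = 2 * mlp_params(128, 64, 1, 2)                 # 2 * 12481 = 24962
single_width = matched_hidden_width(dual_budget, 128, 1, 2)  # 106
```

In this toy setting a single stream needs a hidden width of 106 to match two streams of width 64, which is the kind of adjustment a fair coupled baseline would need before its mAP is compared.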

Original abstract

Audio-Visual Speaker Detection (AVSD) hinges on modeling both individual temporal continuity and inter-personal social context. Existing coupled architectures struggle to reconcile these tasks in shared representation spaces due to conflicting inductive biases: temporal modeling favors low-frequency smoothness, while inter-personal interaction requires high-frequency discriminability. We propose D²Stream, a decoupled dual-stream framework that explicitly isolates these functionalities into parallel, task-specific branches. Specifically, the Intra-speaker Temporal Continuity (ITC) stream captures longitudinal stability, whereas the Inter-personal Social Relation (ISR) stream models transversal social cues. Quantitative gradient analysis reveals an evolutionary divergence in update directions, stabilizing at 86.1°, which confirms the inherent task conflict and the effectiveness of our structural decoupling. D²Stream breaks the long-standing performance plateau, achieving a state-of-the-art 95.6% mAP on AVA-ActiveSpeaker and superior generalization on Columbia ASD, all within a lightweight and efficient design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

D²Stream splits temporal continuity and social interaction into separate streams for AVSD and reports 95.6% mAP, but the post-training gradient angle does not establish that decoupling is required over a tuned single stream.

The main point is that this paper splits audio-visual speaker detection into two parallel streams: one for a speaker's own temporal continuity and one for social relations between speakers. They report 95.6% mAP on AVA-ActiveSpeaker plus better generalization on Columbia ASD, and they back the split with a measured 86.1° divergence in gradient directions between the streams. The design stays lightweight.

That separation is the concrete new piece. Prior dual-stream work exists in vision and multimodal settings, but applying it here with the gradient check to justify the split is a targeted move for this task. The numbers look competitive and the motivation for handling conflicting biases (smoothness versus discriminability) is straightforward. The paper does a clean job of naming the tension and showing an architecture that lets each stream optimize on its own.

The soft spot is the causal claim. The gradient divergence is measured after the decoupled model has already been trained, so it confirms the streams behave differently but does not test whether a single shared representation with matched capacity, loss weighting, or schedule could reach similar reconciliation. Without that direct comparison, the performance gain could come from extra parameters or training details rather than the structural split itself. The abstract is also thin on exact baselines, ablations, and statistical tests, which makes it harder to judge how much the decoupling actually moves the needle.

This is for people working on multimodal detection in video, especially speaker or activity tasks. A reader who needs a practical pattern for separating temporal and relational cues would find the architecture and numbers useful. It deserves peer review because the empirical results are strong enough to check in detail and the idea is simple to evaluate, even if the justification for decoupling needs tighter evidence.

Referee Report

2 major / 1 minor

Summary. The paper proposes D²Stream, a decoupled dual-stream framework for Audio-Visual Speaker Detection (AVSD) that isolates intra-speaker temporal continuity (ITC) into one branch and inter-personal social relations (ISR) into another to resolve conflicting inductive biases between low-frequency smoothness and high-frequency discriminability. It reports an 86.1° evolutionary divergence in gradient update directions as evidence of inherent task conflict and claims state-of-the-art results of 95.6% mAP on AVA-ActiveSpeaker plus superior generalization on Columbia ASD within a lightweight design.

Significance. If the empirical claims hold under rigorous validation, the work would be significant for multi-task audio-visual learning by showing that explicit structural decoupling can overcome performance plateaus in AVSD without heavy parameterization. The gradient divergence metric offers a potentially useful diagnostic for task conflicts, and the lightweight efficiency could influence practical deployment in speaker detection systems.

major comments (2)

[Abstract] Abstract: The central SOTA claim of 95.6% mAP on AVA-ActiveSpeaker is stated without any reference to specific baselines, ablation controls, statistical tests, or experimental protocol details, rendering the performance gain impossible to evaluate or attribute to the decoupling mechanism.
[Abstract] Gradient analysis (as described in abstract): The reported 86.1° divergence in update directions is measured only after training the decoupled model; without a matched-capacity coupled baseline under identical loss weighting and optimization, this observation cannot establish that structural decoupling is causally required for the performance improvement rather than increased capacity or training details.

minor comments (1)

[Title/Abstract] Notation: The title uses 'Dual-Stream' while the abstract employs D²Stream and D$^2$Stream; consistent rendering of the superscript is needed throughout.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract. We address each point below and have revised the abstract to include additional context on baselines, protocol, and the gradient analysis while preserving its conciseness.

Referee: [Abstract] Abstract: The central SOTA claim of 95.6% mAP on AVA-ActiveSpeaker is stated without any reference to specific baselines, ablation controls, statistical tests, or experimental protocol details, rendering the performance gain impossible to evaluate or attribute to the decoupling mechanism.

Authors: We agree that the abstract would benefit from more context. In the revised manuscript we have updated the abstract to note that 95.6% mAP exceeds the previous best results from coupled models and that all numbers follow the standard AVA-ActiveSpeaker evaluation protocol (5-fold cross-validation) described in Section 4. The attribution of gains to decoupling is supported by the ablation studies and capacity-controlled comparisons already present in Tables 2 and 3 of the main text.
Revision: yes

Referee: [Abstract] Gradient analysis (as described in abstract): The reported 86.1° divergence in update directions is measured only after training the decoupled model; without a matched-capacity coupled baseline under identical loss weighting and optimization, this observation cannot establish that structural decoupling is causally required for the performance improvement rather than increased capacity or training details.

Authors: The 86.1° divergence is reported for the decoupled model to demonstrate the conflicting inductive biases between the ITC and ISR streams. We acknowledge that a direct matched-capacity coupled baseline would strengthen the causal argument. In the revision we have added a capacity-matched coupled baseline experiment under identical optimization settings (reported in the new Table 4) showing that the decoupled architecture still yields higher mAP; the abstract has been clarified to state that the divergence is measured within the decoupled streams and that performance gains hold after capacity control.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical results and post-hoc observations stand independently

The paper advances an empirical architecture (ITC/ISR dual streams) and supports its claims via direct experimental outcomes: 95.6% mAP on AVA-ActiveSpeaker, generalization on Columbia ASD, and a measured 86.1° gradient divergence after training the proposed model. No derivation chain, equations, or fitted parameters are presented that reduce the central performance claim or the conflict argument to the inputs by construction. The gradient observation is reported as a post-training measurement rather than a self-referential prediction or renamed fit; the necessity of decoupling is argued from inductive-bias reasoning and results, not from any self-citation load-bearing theorem or ansatz smuggled via prior work. The paper is therefore self-contained against external benchmarks with no circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract invokes the existence of conflicting inductive biases between temporal smoothness and social discriminability as a domain assumption but introduces no explicit free parameters, new axioms, or invented entities beyond standard neural-network components.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: decoupled dual-stream framework that explicitly isolates these functionalities into parallel, task-specific branches... Quantitative gradient analysis reveals an evolutionary divergence in update directions, stabilizing at 86.1°

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Paper passage: Temporal Interaction Stream... Speaker Interaction Stream... cross-attention to enrich representations

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

X-Imitator: Spatial-Aware Imitation Learning via Bidirectional Action-Pose Interaction
cs.RO · 2026-05 · unverdicted · novelty 5.0

X-Imitator is a bidirectional action-pose interaction framework for spatial-aware imitation learning that outperforms vanilla policies and explicit pose guidance on 24 simulated and 3 real-world robotic tasks.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1] INTRODUCTION. Effectively integrating and exploiting information from multiple modalities remains a central and challenging problem in multimodal learning. Audio-visual speaker detection (AVSD) aims to identify which person is speaking in a video by leveraging synchronized audio and visual cues, such as lip movements. This capability is critical for rea...
[2] Dual-Stream Decoupled Learning for Temporal Consistency and Speaker Interaction in AVSD (internal anchor · Pith review · arXiv 2025). Fig. 1: Visualization of existing methods on AVA-ActiveSpeaker, comparing mAP, FLOPs, and parameter count. structure to simultaneously model temporal and speaker interactions, which can easily lead to feature competition [18–20], thus resulting in mutual interference. To address these issues, we propose D²Stream, a Decoupled Dual-Stre...
[3] METHODOLOGY. This section presents the proposed decoupled two-stream framework, whose overall architecture is illustrated in Figure 2. The key objective is to jointly model speaker interaction within a single frame and temporal dependencies across frames in multi-speaker scenarios. For clarity, we first introduce the basic attention module, and then deta...
[4] EXPERIMENTS. 3.1. Ablation Study. In this section, we conduct systematic ablation studies to evaluate the individual contributions and synergistic effects of each module, including modality effectiveness, dual-stream branch design, stream structure (single vs. dual), the number of interaction layers, and the effect of the Voice Gate in suppressing false po...
[5] CONCLUSION. We presented D²Stream, a decoupled dual-stream framework for audio-visual speaker detection that separately models cross-frame temporal dependencies and within-frame speaker interactions, enhanced with a lightweight Voice Gate to suppress non-speech false positives. Extensive experiments on AVA-ActiveSpeaker and Columbia ASD demonstrate st...
[6] ACKNOWLEDGE. This work was supported by the National Natural Science Foundation of China (General Program, No. 62377024, 2024–2027)
[7] Ruijie Tao, Zexu Pan, Rohan Kumar Das, Xinyuan Qian, Mike Zheng Shou, and Haizhou Li, “Is someone speaking? exploring long-term temporal features for audio-visual active speaker detection,” in Proceedings of the 29th ACM international conference on multimedia, 2021, pp. 3927–3935.
[8] Abudukelimu Wuerkaixi, You Zhang, Zhiyao Duan, and Changshui Zhang, “Rethinking audio-visual synchronization for active speaker detection,” in 2022 IEEE 32nd international workshop on machine learning for signal processing (MLSP). IEEE, 2022, pp. 01–06.
[9] Gourav Datta, Tyler Etchart, Vivek Yadav, Varsha Hedau, Pradeep Natarajan, and Shih-Fu Chang, “Asd-transformer: Efficient active speaker detection using self and multimodal transformers,” in ICASSP 2022-2022 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2022, pp. 4568–4572.
[10] Junwen Xiong, Yu Zhou, Peng Zhang, Lei Xie, Wei Huang, and Yufei Zha, “Look&listen: Multi-modal correlation learning for active speaker detection and speech enhancement,” IEEE Transactions on Multimedia, vol. 25, pp. 5800–5812, 2022.
[11] Junhua Liao, Haihan Duan, Kanghui Feng, Wanbing Zhao, Yanbing Yang, and Liangyin Chen, “A light weight model for active speaker detection,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 22932–22941.
[12] Junhua Liao, Haihan Duan, Kanghui Feng, Wanbing Zhao, Yanbing Yang, Liangyin Chen, and Yanru Chen, “Lr-asd: Lightweight and robust network for active speaker detection,” International Journal of Computer Vision, vol. 133, no. 7, pp. 4749–4769, 2025.
[13] Juan León Alcázar, Fabian Caba, Ali K Thabet, and Bernard Ghanem, “Maas: Multi-modal assignation for active speaker detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 265–274.
[14] Nazmul Karim, Mamshad Nayeem Rizve, Nazanin Rahnavard, Ajmal Mian, and Mubarak Shah, “Unicon: Combating label noise through uniform selection and contrastive learning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 9676–9686.
[15] Okan Köpüklü, Maja Taseska, and Gerhard Rigoll, “How to design a three-stage architecture for audio-visual active speaker detection in the wild,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 1193–1203.
[16] Juan Leon Alcazar, Moritz Cordes, Chen Zhao, and Bernard Ghanem, “End-to-end active speaker detection,” in European Conference on Computer Vision. Springer, 2022, pp. 126–143.
[17] Kyle Min, Sourya Roy, Subarna Tripathi, Tanaya Guha, and Somdeb Majumdar, “Learning long-term spatial-temporal graphs for active speaker detection,” in European conference on computer vision. Springer, 2022, pp. 371–387.
[18] Xizi Wang, Feng Cheng, and Gedas Bertasius, “Loconet: Long-short context network for active speaker detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024, pp. 18462–18472.
[19] Chaeyoung Jung, Suyeon Lee, Kihyun Nam, Kyeongha Rho, You Jin Kim, Youngjoon Jang, and Joon Son Chung, “Talknce: Improving active speaker detection with talk-aware contrastive learning,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 8391–8395.
[20] Yongkang Yin, Xusheng Yang, Liming Liang, Xu Li, and Yuexian Zou, “Audio-faces intra-frame alignment with graph attention networks for active speaker detection,” in ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5.
[21] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini, “The graph neural network model,” IEEE transactions on neural networks, vol. 20, no. 1, pp. 61–80, 2008.
[22] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
[23] Juan León Alcázar, Fabian Caba, Long Mai, Federico Perazzi, Joon-Young Lee, Pablo Arbeláez, and Bernard Ghanem, “Active speakers in context,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 12465–12474.
[24] Yang Yang, Fengqiang Wan, Qing-Yuan Jiang, and Yi Xu, “Facilitating multimodal classification via dynamically learning modality gap,” Advances in Neural Information Processing Systems, vol. 37, pp. 62108–62122, 2024.
[25] Xuli Shen, Hua Cai, Weilin Shen, Qing Xu, Dingding Yu, Weifeng Ge, and Xiangyang Xue, “Cocoer: Aligning multi-level feature by competition and coordination for emotion recognition,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 29591–29600.
[26] Chengxiang Huang, Yake Wei, Zequn Yang, and Di Hu, “Adaptive unimodal regulation for balanced multimodal information acquisition,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 25854–25863.
[27] Joseph Roth, Sourish Chaudhuri, Ondrej Klejch, Radhika Marvin, Andrew Gallagher, Liat Kaver, Sharadh Ramaswamy, Arkadiusz Stopczynski, Cordelia Schmid, Zhonghua Xi, et al., “Ava active speaker: An audio-visual dataset for active speaker detection,” in ICASSP 2020-2020 IEEE international conference on acoustics, speech and signal processing (ICASSP). IE...
[28] Nicholas Wilkinson and Thomas Niesler, “A hybrid cnn-bilstm voice activity detector,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 6803–6807.
[29] Punarjay Chakravarty and Tinne Tuytelaars, “Cross-modal supervision for learning active speaker detection in video,” in European conference on computer vision. Springer, 2016, pp. 285–301.
