Naver at ActivityNet Challenge 2019 -- Task B Active Speaker Detection (AVA)
Pith reviewed 2026-05-25 15:53 UTC · model grok-4.3
The pith
A 3D CNN front-end plus ensemble of temporal convolution and LSTM classifiers detects active speakers with gains over the baseline on the AVA dataset.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors claim that a 3D CNN-based front-end, together with an ensemble of temporal convolution and LSTM classifiers, yields significant improvements over the baseline when predicting whether a visible person is speaking on the AVA-ActiveSpeaker dataset.
What carries the argument
The 3D CNN front-end that extracts spatio-temporal video features, followed by an ensemble of temporal convolution and LSTM classifiers that produce speaker activity predictions.
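The ensemble step can be illustrated with a minimal sketch. The report does not state how the classifier outputs are combined, so the weighted averaging of per-frame probabilities below is an assumption, and the function name `ensemble_scores`, the score values, and the 0.5 threshold are all hypothetical.

```python
from typing import List, Optional, Sequence


def ensemble_scores(
    model_scores: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> List[float]:
    """Fuse per-frame speaking probabilities from several classifiers.

    model_scores[m][t] is the probability classifier m assigns to
    "speaking" at frame t. Equal weights are an assumption of this
    sketch, not something the report specifies.
    """
    if not model_scores:
        raise ValueError("need at least one classifier")
    n_frames = len(model_scores[0])
    if any(len(s) != n_frames for s in model_scores):
        raise ValueError("all classifiers must score the same frames")
    if weights is None:
        weights = [1.0] * len(model_scores)
    if len(weights) != len(model_scores):
        raise ValueError("one weight per classifier is required")
    total = float(sum(weights))
    # Weighted mean of the classifiers' scores, frame by frame.
    return [
        sum(w * s[t] for w, s in zip(weights, model_scores)) / total
        for t in range(n_frames)
    ]


# Hypothetical outputs for one face track (illustrative values only).
tconv_scores = [0.9, 0.8, 0.2, 0.1]  # temporal-convolution classifier
lstm_scores = [0.7, 0.6, 0.4, 0.3]   # LSTM classifier
fused = ensemble_scores([tconv_scores, lstm_scores])
labels = [score >= 0.5 for score in fused]  # assumed decision threshold
```

Averaging scores rather than hard decisions keeps the fused output usable for ranking-based metrics; other fusion schemes (for example, a learned combiner) would fit the same interface.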
If this is right
- The described system outperforms the provided baseline on the AVA-ActiveSpeaker dataset.
- The ensemble of temporal models improves prediction accuracy for visible speaker activity.
- The approach is directly applicable to the Active Speaker Detection task in the ActivityNet Challenge.
- The 3D CNN plus temporal classifier pipeline can be used for visual-only speaker detection in video.
Where Pith is reading between the lines
- The visual pipeline could be tested on datasets that include both video and audio to measure added value from sound.
- The same front-end and ensemble structure might transfer to related tasks such as action recognition in video.
- Detailed per-scene error analysis on the AVA data could identify conditions where the ensemble succeeds or fails.
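A per-condition error analysis of the kind suggested above could start from something like the following sketch. The condition tags (`frontal_face`, `profile_face`, `small_face`) and the records are hypothetical; they are assumed to be added by the analyst and are not taken from the paper or the dataset.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_condition(
    records: Iterable[Tuple[str, bool, bool]],
) -> Dict[str, float]:
    """Compute accuracy separately for each condition tag.

    Each record is (condition, predicted_speaking, actually_speaking).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for condition, predicted, actual in records:
        total[condition] += 1
        correct[condition] += int(predicted == actual)
    return {condition: correct[condition] / total[condition] for condition in total}


# Hypothetical records for illustration only.
records = [
    ("frontal_face", True, True),
    ("frontal_face", False, False),
    ("profile_face", True, False),
    ("profile_face", True, True),
    ("small_face", False, True),
]
report = accuracy_by_condition(records)
```

Conditions with markedly lower accuracy than the rest would be the first candidates for closer inspection.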
Load-bearing premise
That an ensemble of temporal convolution and LSTM classifiers on top of a 3D CNN front-end will produce reliable speaker predictions on the AVA dataset.
What would settle it
Evaluating the same ensemble on the AVA-ActiveSpeaker test set and observing no improvement over the baseline would falsify the claim of significant gains.
Original abstract
This report describes our submission to the ActivityNet Challenge at CVPR 2019. We use a 3D convolutional neural network (CNN) based front-end and an ensemble of temporal convolution and LSTM classifiers to predict whether a visible person is speaking or not. Our results show significant improvements over the baseline on the AVA-ActiveSpeaker dataset.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This manuscript is a short report on the Naver team's submission to the ActivityNet Challenge 2019 Task B (Active Speaker Detection on AVA). It describes a pipeline that extracts features with a 3D CNN front-end and feeds them to an ensemble of temporal-convolution and LSTM classifiers to decide whether a visible person is speaking. The only performance statement is the claim of 'significant improvements over the baseline' on the AVA-ActiveSpeaker dataset, and it is not quantified.
Significance. If the claimed improvement were accompanied by concrete metrics, ablations, and error analysis, the work would supply a practical data point on the utility of 3D-CNN-plus-temporal-ensemble pipelines for active-speaker detection. The approach itself combines well-known components and does not introduce new theoretical machinery or parameter-free derivations.
Major comments (1)
- [Abstract] The assertion that 'Our results show significant improvements over the baseline' is unsupported by any numerical evidence (mAP, baseline scores, statistical tests, or ablation tables). Because this is the only performance claim in the manuscript, the central empirical contribution cannot be evaluated.
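To make concrete what numerical support would look like, here is a minimal sketch of a non-interpolated average precision computation, applied to a system and a baseline on the same labels. All scores and labels are made up for illustration and are not results from the paper; the official evaluation tool may use a different interpolation, so this should be read as an approximation of the metric rather than a reimplementation.

```python
from typing import Sequence


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Non-interpolated average precision for binary labels (1 = speaking)."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    # Rank predictions from most to least confident.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            # Precision at the rank of each positive example.
            precision_sum += hits / rank
    return precision_sum / n_pos


# Hypothetical ground truth and scores (not from the paper).
truth = [1, 0, 1, 0]
system_scores = [0.9, 0.8, 0.7, 0.1]
baseline_scores = [0.4, 0.9, 0.3, 0.2]

ap_system = average_precision(system_scores, truth)
ap_baseline = average_precision(baseline_scores, truth)
improvement = ap_system - ap_baseline
```

Reporting both values and their difference, ideally with the evaluation split named, is the kind of evidence the comment above asks for.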
Minor comments (2)
- The method description is limited to a single sentence; no architecture details, input resolution, training schedule, or ensemble weighting scheme are supplied.
- No references to prior AVA-ActiveSpeaker baselines or related challenge entries are provided.
Simulated Author's Rebuttal
We thank the referee for the detailed review. We agree that the abstract's performance claim requires concrete numerical support and will revise the manuscript to address this.
Point-by-point responses
- Referee: [Abstract] The assertion that 'Our results show significant improvements over the baseline' is unsupported by any numerical evidence (mAP, baseline scores, statistical tests, or ablation tables). Because this is the only performance claim in the manuscript, the central empirical contribution cannot be evaluated.
  Authors: We agree with this assessment. The manuscript is a concise challenge report whose abstract currently states only that 'Our results show significant improvements over the baseline' without accompanying numbers. In the revised version we will add the mAP scores of our 3D-CNN + temporal-convolution/LSTM ensemble and the official baseline on the AVA-ActiveSpeaker validation set, together with a brief statement of the improvement magnitude. This will make the central empirical claim directly verifiable.
  Revision: yes
Circularity Check
No derivation chain or equations; as an empirical challenge report, the manuscript exhibits no circularity.
The manuscript is a brief empirical submission report describing a 3D-CNN front-end plus an ensemble of temporal convolution and LSTM classifiers for AVA active speaker detection. It contains no equations, no derivations, no fitted parameters presented as predictions, and no load-bearing self-citations or ansatzes. The sole claim of 'significant improvements' is an unquantified empirical assertion rather than a mathematical result that could reduce to its inputs by construction. Per the evaluation criteria, the absence of any derivation chain warrants a score of 0, with no steps to list.