Naver at ActivityNet Challenge 2019 -- Task B Active Speaker Detection (AVA)

Joon Son Chung

arXiv: 1906.10555 · v1 · pith:PAA5QJQBnew · submitted 2019-06-25 · cs.SD · cs.CV · eess.AS

Pith reviewed 2026-05-25 15:53 UTC · model grok-4.3

classification cs.SD · cs.CV · eess.AS

keywords active speaker detection · AVA-ActiveSpeaker · 3D CNN · temporal convolution · LSTM · ActivityNet Challenge · video analysis

The pith

A 3D CNN front-end plus an ensemble of temporal convolution and LSTM classifiers detects active speakers, with reported gains over the baseline on the AVA-ActiveSpeaker dataset.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes a submission to the ActivityNet Challenge for detecting whether a visible person is speaking in video. A 3D convolutional network extracts features from video frames, which then pass to an ensemble of temporal convolution networks and LSTM models that output speaking or not-speaking predictions. The system is evaluated on the AVA-ActiveSpeaker dataset. The authors report that this setup yields significant improvements compared to the challenge baseline. A sympathetic reader would care because reliable visual speaker detection supports downstream tasks such as conversation analysis in video.

Core claim

The authors establish that a 3D CNN based front-end together with an ensemble of temporal convolution and LSTM classifiers produces significant improvements over the baseline when predicting whether a visible person is speaking on the AVA-ActiveSpeaker dataset.

What carries the argument

The 3D CNN front-end that extracts spatio-temporal video features, followed by an ensemble of temporal convolution and LSTM classifiers that produce speaker activity predictions.
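As a rough illustration of the ensemble step, the sketch below fuses per-frame speaking probabilities from two back-end classifiers. The paper does not say how the ensemble is combined, so the weighted average, the 0.5 decision threshold, the function names `ensemble_scores` and `is_speaking`, and the toy scores are all assumptions made for illustration, not the author's method.

```python
def ensemble_scores(model_scores, weights=None):
    """Weighted average of per-frame speaking probabilities across models.

    model_scores: one list of probabilities per model, all of equal length.
    weights: optional per-model weights (assumed equal when omitted).
    """
    if not model_scores:
        raise ValueError("need at least one model")
    n_frames = len(model_scores[0])
    if any(len(scores) != n_frames for scores in model_scores):
        raise ValueError("all models must score the same frames")
    if weights is None:
        weights = [1.0] * len(model_scores)
    if len(weights) != len(model_scores):
        raise ValueError("one weight per model is required")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return [
        sum(w * scores[t] for w, scores in zip(weights, model_scores)) / total
        for t in range(n_frames)
    ]


def is_speaking(scores, threshold=0.5):
    """Turn fused probabilities into speaking / not-speaking decisions."""
    return [p >= threshold for p in scores]


# Hypothetical per-frame outputs of a temporal-convolution classifier
# and an LSTM classifier for the same face track.
tc_scores = [0.9, 0.2, 0.6, 0.1]
lstm_scores = [0.7, 0.4, 0.3, 0.1]

fused = ensemble_scores([tc_scores, lstm_scores])
decisions = is_speaking(fused)
```

Averaging at the score level keeps the two classifiers independent, so either one can be retrained or swapped without touching the other; a learned fusion layer would be an equally plausible reading of "ensemble".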

If this is right

The described system outperforms the provided baseline on the AVA-ActiveSpeaker dataset.
The ensemble of temporal models improves prediction accuracy for visible speaker activity.
The approach is directly applicable to the Active Speaker Detection task in the ActivityNet Challenge.
The 3D CNN plus temporal classifier pipeline can be used for visual-only speaker detection in video.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The visual pipeline could be tested on datasets that include both video and audio to measure added value from sound.
The same front-end and ensemble structure might transfer to related tasks such as action recognition in video.
Detailed per-scene error analysis on the AVA data could identify conditions where the ensemble succeeds or fails.

Load-bearing premise

That an ensemble of temporal convolution and LSTM classifiers on top of a 3D CNN front-end will produce reliable speaker predictions on the AVA dataset.

What would settle it

Evaluating the same ensemble on the AVA-ActiveSpeaker test set and observing no improvement over the baseline would falsify the claim of significant gains.
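The comparison described above comes down to computing the same ranking metric for both systems on the same labelled frames. The sketch below uses non-interpolated average precision, on the assumption that the challenge scores submissions by mAP as the referee report implies; the official evaluation tool may compute it differently, and the labels and scores here are made up.

```python
def average_precision(labels, scores):
    """Non-interpolated AP: mean precision at the rank of each positive frame."""
    # Rank frames from most to least confident "speaking" score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(1 for _, y in ranked if y)
    if n_pos == 0:
        raise ValueError("AP is undefined without positive examples")
    hits = 0
    precision_sum = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / n_pos


# Toy data (hypothetical): 1 = speaking, 0 = not speaking.
labels = [1, 0, 1, 0]
system = [0.9, 0.8, 0.7, 0.1]
baseline = [0.4, 0.9, 0.3, 0.8]

system_ap = average_precision(labels, system)
baseline_ap = average_precision(labels, baseline)
gain = system_ap - baseline_ap
```

A gain at or below zero on the held-out test set is the outcome that would count against the paper's claim.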

Figures

Figures reproduced from arXiv: 1906.10555 by Joon Son Chung.

**Figure 2.** LSTM-based back-end classifier.

Original abstract

This report describes our submission to the ActivityNet Challenge at CVPR 2019. We use a 3D convolutional neural network (CNN) based front-end and an ensemble of temporal convolution and LSTM classifiers to predict whether a visible person is speaking or not. Our results show significant improvements over the baseline on the AVA-ActiveSpeaker dataset.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a thin competition report claiming significant gains on active speaker detection with no metrics or analysis to back it up.

This competition report from Naver describes their entry to the 2019 ActivityNet Challenge on active speaker detection using the AVA dataset. They combine a 3D CNN front-end with an ensemble of temporal convolution and LSTM classifiers. The headline is that they report significant improvements over the baseline, but the text gives no numbers or details to evaluate that.

Nothing here is new in terms of methods. 3D CNNs for video features, temporal convolutions for sequence modeling, and LSTMs for classification have all been used in similar audio-visual tasks before. The paper does a decent job of outlining the overall pipeline in a few sentences, which might be helpful if you're looking for a summary of what one team tried in the challenge.

The main weakness is the complete absence of results. The abstract and the report assert the improvement but supply no mAP scores, no baseline values, no comparison tables, and no analysis of where the model succeeds or fails. Without those, it's hard to know if the claim holds or by how much. There's also no mention of training details, data preprocessing, or how the ensemble was constructed.

This kind of short report is mainly useful to people who follow the ActivityNet challenges and want to see the range of approaches submitted that year. It doesn't offer enough for someone trying to build on the work or compare methods rigorously. The citation pattern is light, as expected for a challenge report. I'd say skip bringing this to a reading group. I wouldn't cite it myself. And it doesn't look like it needs or deserves peer review as a standalone paper, given how little evidence is presented.

Referee Report

1 major / 2 minor

Summary. This manuscript is a short report on the Naver team's submission to the ActivityNet Challenge 2019 Task B (Active Speaker Detection on AVA). It describes a pipeline that extracts features with a 3D CNN front-end and feeds them to an ensemble of temporal-convolution and LSTM classifiers to decide whether a visible person is speaking. The sole quantitative statement is the claim of 'significant improvements over the baseline' on the AVA-ActiveSpeaker dataset.

Significance. If the claimed improvement were accompanied by concrete metrics, ablations, and error analysis, the work would supply a practical data point on the utility of 3D-CNN-plus-temporal-ensemble pipelines for active-speaker detection. The approach itself combines well-known components and does not introduce new theoretical machinery or parameter-free derivations.

major comments (1)

[Abstract] The assertion that 'Our results show significant improvements over the baseline' is unsupported by any numerical evidence (mAP, baseline scores, statistical tests, or ablation tables). Because this is the only performance claim in the manuscript, the central empirical contribution cannot be evaluated.

minor comments (2)

The method description is limited to a single sentence; no architecture details, input resolution, training schedule, or ensemble weighting scheme are supplied.
No references to prior AVA-ActiveSpeaker baselines or related challenge entries are provided.
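The major comment asks for statistical tests alongside the raw scores. One hedged way to do that is a paired bootstrap over frames, sketched below: it resamples the evaluation frames with replacement and reports a 95% percentile interval for the difference in average precision. The function names, the toy data, and the choice of test are assumptions for illustration; resampling individual frames also ignores temporal correlation within a face track, so resampling whole tracks would be the more careful variant.

```python
import random


def average_precision(labels, scores):
    """Non-interpolated AP over frames ranked by descending score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(1 for _, y in ranked if y)
    hits, precision_sum = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / n_pos


def bootstrap_ap_gain(labels, system, baseline, n_resamples=1000, seed=0):
    """95% percentile interval for AP(system) - AP(baseline), paired by frame."""
    if not any(labels):
        raise ValueError("need at least one positive label")
    rng = random.Random(seed)
    n = len(labels)
    gains = []
    while len(gains) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if not any(ys):
            continue  # AP is undefined without positives, so redraw
        gain = (average_precision(ys, [system[i] for i in idx])
                - average_precision(ys, [baseline[i] for i in idx]))
        gains.append(gain)
    gains.sort()
    low = gains[int(0.025 * n_resamples)]
    high = gains[int(0.975 * n_resamples) - 1]
    return low, high


# Toy data (made up): the "system" ranks every speaking frame first,
# the "baseline" ranks every speaking frame last.
labels = [1, 1, 1, 0, 0, 0] * 5
system = [0.9 if y else 0.1 for y in labels]
baseline = [0.1 if y else 0.9 for y in labels]

low, high = bootstrap_ap_gain(labels, system, baseline)
```

If the interval excludes zero, the gain is unlikely to be an artefact of which frames happened to be in the evaluation set.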

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review. We agree that the abstract's performance claim requires concrete numerical support and will revise the manuscript to address this.

Point-by-point responses

Referee: [Abstract] The assertion that 'Our results show significant improvements over the baseline' is unsupported by any numerical evidence (mAP, baseline scores, statistical tests, or ablation tables). Because this is the only performance claim in the manuscript, the central empirical contribution cannot be evaluated.

Authors: We agree with this assessment. The manuscript is a concise challenge report whose abstract currently states only that 'Our results show significant improvements over the baseline' without accompanying numbers. In the revised version we will add the mAP scores of our 3D-CNN + temporal-convolution/LSTM ensemble and the official baseline on the AVA-ActiveSpeaker validation set, together with a brief statement of the improvement magnitude. This will make the central empirical claim directly verifiable.

Revision: yes

Circularity Check

0 steps flagged

No derivation chain or equations; empirical challenge report exhibits no circularity

full rationale

The manuscript is a brief empirical submission report describing a 3D-CNN front-end plus ensemble of temporal convolution and LSTM classifiers for AVA active speaker detection. It contains no equations, no derivations, no fitted parameters presented as predictions, and no load-bearing self-citations or ansatzes. The sole claim of 'significant improvements' is an unreported empirical assertion rather than a mathematical result that could reduce to its inputs by construction. Per the evaluation criteria, absence of any derivation chain warrants score 0 with no steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No theoretical content; the paper is an empirical system description with no free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5576 in / 920 out tokens · 26513 ms · 2026-05-25T15:53:00.128731+00:00 · methodology

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · 2 internal anchors

[1] T. Afouras, J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman. Deep audio-visual speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.

[2] T. Afouras, J. S. Chung, and A. Zisserman. The conversation: Deep audio-visual speech enhancement. In INTERSPEECH, 2018.

[3] Y. M. Assael, B. Shillingford, S. Whiteson, and N. De Freitas. Lipnet: End-to-end sentence-level lipreading. arXiv preprint arXiv:1611.01599, 2016.

[4] P. Chakravarty and T. Tuytelaars. Cross-modal supervision for learning active speaker detection in video. In Proc. ECCV, pages 285–301. Springer, 2016.

[5] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In Proc. BMVC, 2014.

[6] J. S. Chung and A. Zisserman. Out of time: automated lip sync in the wild. In Asian conference on computer vision, pages 251–263. Springer, 2016.

[7] S.-W. Chung, J. S. Chung, and H.-G. Kang. Perfect match: Improved cross-modal embeddings for audio-visual synchronisation. In Proc. ICASSP, 2019.

[8] T. G. Dietterich. Ensemble methods in machine learning. In International workshop on multiple classifier systems, pages 1–15. Springer, 2000.

[9] A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein. Looking to listen at the cocktail party: a speaker-independent audio-visual model for speech separation. ACM Transactions on Graphics (TOG), 37(4):112, 2018.

[10] D. P. Kingma and J. Ba. ADAM: A method for stochastic optimization. In Proc. ICLR, 2015.

[11] K. Noda, Y. Yamaguchi, K. Nakadai, H. G. Okuno, and T. Ogata. Audio-visual speech recognition using deep learning. Applied Intelligence, 42(4):722–737, 2015.

[12] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. 2017.

[13] J. Roth, S. Chaudhuri, O. Klejch, R. Marvin, A. Gallagher, L. Kaver, S. Ramaswamy, A. Stopczynski, C. Schmid, Z. Xi, et al. AVA-ActiveSpeaker: An audio-visual dataset for active speaker detection. arXiv preprint arXiv:1901.01342, 2019.
