
arXiv: 1906.10555 · v1 · pith:PAA5QJQBnew · submitted 2019-06-25 · 💻 cs.SD · cs.CV · eess.AS

Naver at ActivityNet Challenge 2019 -- Task B Active Speaker Detection (AVA)

Pith reviewed 2026-05-25 15:53 UTC · model grok-4.3

classification: 💻 cs.SD · cs.CV · eess.AS
keywords: active speaker detection · AVA-ActiveSpeaker · 3D CNN · temporal convolution · LSTM · ActivityNet Challenge · video analysis

The pith

A 3D CNN front-end plus an ensemble of temporal convolution and LSTM classifiers detects active speakers, with reported gains over the baseline on the AVA-ActiveSpeaker dataset.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes a submission to the ActivityNet Challenge for detecting whether a visible person is speaking in video. A 3D convolutional network extracts features from video frames, which then pass to an ensemble of temporal convolution networks and LSTM models that output speaking or not-speaking predictions. The system is evaluated on the AVA-ActiveSpeaker dataset. The authors report that this setup yields significant improvements compared to the challenge baseline. A sympathetic reader would care because reliable visual speaker detection supports downstream tasks such as conversation analysis in video.

Core claim

The authors report that a 3D CNN front-end together with an ensemble of temporal convolution and LSTM classifiers produces significant improvements over the baseline when predicting whether a visible person is speaking on the AVA-ActiveSpeaker dataset.

What carries the argument

The 3D CNN front-end that extracts spatio-temporal video features, followed by an ensemble of temporal convolution and LSTM classifiers that produce speaker activity predictions.
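
As a minimal sketch of the ensembling step only, the snippet below fuses per-frame speaking probabilities from two back-end classifiers and thresholds the result. The paper does not state its fusion rule, so the weighted mean, the 0.5 threshold, the toy scores, and the names `fuse_scores` and `predict_speaking` are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence


def fuse_scores(per_model_scores: Sequence[Sequence[float]],
                weights: Optional[Sequence[float]] = None) -> List[float]:
    """Weighted mean of per-frame speaking probabilities across models.

    Assumption: the ensemble is combined by averaging; the source does not
    say how its members are fused or weighted.
    """
    if not per_model_scores:
        raise ValueError("need at least one model")
    n_frames = len(per_model_scores[0])
    if any(len(s) != n_frames for s in per_model_scores):
        raise ValueError("all models must score the same frames")
    w = list(weights) if weights is not None else [1.0] * len(per_model_scores)
    if len(w) != len(per_model_scores):
        raise ValueError("one weight per model is required")
    total = sum(w)
    return [
        sum(wi * scores[t] for wi, scores in zip(w, per_model_scores)) / total
        for t in range(n_frames)
    ]


def predict_speaking(scores: Sequence[float], threshold: float = 0.5) -> List[bool]:
    """Label a frame as speaking when its fused score reaches the threshold."""
    return [s >= threshold for s in scores]


# Hypothetical per-frame outputs of a temporal-convolution and an LSTM back-end.
tcn_scores = [0.9, 0.6, 0.2, 0.4]
lstm_scores = [0.7, 0.2, 0.4, 0.8]
fused = fuse_scores([tcn_scores, lstm_scores])
labels = predict_speaking(fused)
```

Passing `weights=[2.0, 1.0]` would favour the first model; whether any such weighting was used in the submission is not stated in the source.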

If this is right

  • The described system outperforms the provided baseline on the AVA-ActiveSpeaker dataset.
  • The ensemble of temporal models improves prediction accuracy for visible speaker activity.
  • The approach is directly applicable to the Active Speaker Detection task in the ActivityNet Challenge.
  • The 3D CNN plus temporal classifier pipeline can be used for active speaker detection in video.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The visual pipeline could be tested on datasets that include both video and audio to measure added value from sound.
  • The same front-end and ensemble structure might transfer to related tasks such as action recognition in video.
  • Detailed per-scene error analysis on the AVA data could identify conditions where the ensemble succeeds or fails.
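
The per-scene error analysis suggested in the last bullet could be sketched as a simple grouping of prediction errors by condition. The condition tags (`frontal`, `profile`, `low_light`), the records, and the function name `error_rate_by_condition` are hypothetical; the source does not say that AVA-ActiveSpeaker ships such tags.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def error_rate_by_condition(
        records: Iterable[Tuple[str, int, int]]) -> Dict[str, float]:
    """Return the fraction of wrong predictions for each condition tag.

    Each record is (condition, true_label, predicted_label), 1 = speaking.
    """
    errors: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for condition, truth, pred in records:
        totals[condition] += 1
        errors[condition] += int(truth != pred)
    return {c: errors[c] / totals[c] for c in totals}


# Toy records with made-up condition tags; not data from the paper.
records = [
    ("frontal", 1, 1), ("frontal", 0, 0), ("frontal", 1, 1), ("frontal", 0, 1),
    ("profile", 1, 0), ("profile", 0, 0),
    ("low_light", 1, 0), ("low_light", 0, 1), ("low_light", 1, 0), ("low_light", 0, 0),
]
rates = error_rate_by_condition(records)
worst = max(rates, key=rates.get)  # condition with the highest error rate
```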

Load-bearing premise

That an ensemble of temporal convolution and LSTM classifiers on top of a 3D CNN front-end will produce reliable speaker predictions on the AVA dataset.

What would settle it

Evaluating the same ensemble on the AVA-ActiveSpeaker test set and observing no improvement over the baseline would falsify the claim of significant gains.
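
A comparison of this kind needs a score for each system. The sketch below computes non-interpolated average precision for a baseline and a candidate on the same labels. The labels and scores are toy values rather than results from the paper, and the official challenge tooling may define the metric differently (for example, with interpolation).

```python
from typing import Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Non-interpolated AP: mean precision at the rank of each positive.

    labels use 1 for speaking and 0 for not speaking.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positives")
    # Rank examples from most to least confident.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / n_pos


# Toy ground truth and hypothetical scores; not numbers from the paper.
truth = [1, 0, 1, 1, 0, 0]
baseline_scores = [0.6, 0.7, 0.4, 0.8, 0.3, 0.5]
candidate_scores = [0.9, 0.3, 0.7, 0.8, 0.4, 0.2]

ap_baseline = average_precision(truth, baseline_scores)
ap_candidate = average_precision(truth, candidate_scores)
improved = ap_candidate > ap_baseline
```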

Figures

Figures reproduced from arXiv: 1906.10555 by Joon Son Chung.

Figure 1: Front-end architecture for audio and visual encoders
Figure 2: LSTM-based back-end classifier. The architecture of the
Original abstract

This report describes our submission to the ActivityNet Challenge at CVPR 2019. We use a 3D convolutional neural network (CNN) based front-end and an ensemble of temporal convolution and LSTM classifiers to predict whether a visible person is speaking or not. Our results show significant improvements over the baseline on the AVA-ActiveSpeaker dataset.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. This manuscript is a short report on the Naver team's submission to the ActivityNet Challenge 2019 Task B (Active Speaker Detection on AVA). It describes a pipeline that extracts features with a 3D CNN front-end and feeds them to an ensemble of temporal-convolution and LSTM classifiers to decide whether a visible person is speaking. The sole performance statement is the unquantified claim of 'significant improvements over the baseline' on the AVA-ActiveSpeaker dataset.

Significance. If the claimed improvement were accompanied by concrete metrics, ablations, and error analysis, the work would supply a practical data point on the utility of 3D-CNN-plus-temporal-ensemble pipelines for active-speaker detection. The approach itself combines well-known components and does not introduce new theoretical machinery or parameter-free derivations.

major comments (1)
  1. [Abstract] The assertion that 'Our results show significant improvements over the baseline' is unsupported by any numerical evidence (mAP, baseline scores, statistical tests, or ablation tables). Because this is the only performance claim in the manuscript, the central empirical contribution cannot be evaluated.
minor comments (2)
  1. The method description is limited to a single sentence; no architecture details, input resolution, training schedule, or ensemble weighting scheme are supplied.
  2. No references to prior AVA-ActiveSpeaker baselines or related challenge entries are provided.
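
One way to supply the statistical test the major comment asks for is a paired bootstrap over evaluation examples, sketched below. Accuracy stands in for mAP to keep the example short, and the data, the resample count, the seed, and the function names are assumptions for illustration, not anything described in the manuscript.

```python
import random
from typing import Callable, List, Sequence, Tuple


def accuracy(labels: Sequence[int], preds: Sequence[int]) -> float:
    """Fraction of predictions that match the labels."""
    return sum(int(y == p) for y, p in zip(labels, preds)) / len(labels)


def paired_bootstrap_ci(
        labels: Sequence[int],
        preds_a: Sequence[int],
        preds_b: Sequence[int],
        metric: Callable[[Sequence[int], Sequence[int]], float] = accuracy,
        n_resamples: int = 2000,
        alpha: float = 0.05,
        seed: int = 0) -> Tuple[float, float]:
    """Percentile confidence interval for metric(B) - metric(A).

    The same resampled indices are used for both systems, so the
    comparison stays paired.
    """
    rng = random.Random(seed)
    n = len(labels)
    diffs: List[float] = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        a = [preds_a[i] for i in idx]
        b = [preds_b[i] for i in idx]
        diffs.append(metric(y, b) - metric(y, a))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Toy data: system B is always right, system A is wrong on the first 10 of 40.
labels = [1, 0] * 20
preds_b = list(labels)
preds_a = [1 - y for y in labels[:10]] + list(labels[10:])
lower, upper = paired_bootstrap_ci(labels, preds_a, preds_b)
```

An interval that excludes zero would support calling the improvement significant at the chosen level; an interval that straddles zero would not.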

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review. We agree that the abstract's performance claim requires concrete numerical support and will revise the manuscript to address this.

Point-by-point responses
  1. Referee: [Abstract] The assertion that 'Our results show significant improvements over the baseline' is unsupported by any numerical evidence (mAP, baseline scores, statistical tests, or ablation tables). Because this is the only performance claim in the manuscript, the central empirical contribution cannot be evaluated.

    Authors: We agree with this assessment. The manuscript is a concise challenge report whose abstract currently states only that 'Our results show significant improvements over the baseline' without accompanying numbers. In the revised version we will add the mAP scores of our 3D-CNN + temporal-convolution/LSTM ensemble and the official baseline on the AVA-ActiveSpeaker validation set, together with a brief statement of the improvement magnitude. This will make the central empirical claim directly verifiable.

    Revision: yes

Circularity Check

0 steps flagged

The manuscript contains no derivation chain or equations; as an empirical challenge report, it exhibits no circularity.

Full rationale

The manuscript is a brief empirical submission report describing a 3D CNN front-end plus an ensemble of temporal convolution and LSTM classifiers for AVA active speaker detection. It contains no equations, no derivations, no fitted parameters presented as predictions, and no load-bearing self-citations or ansatzes. The sole claim of 'significant improvements' is an unquantified empirical assertion rather than a mathematical result that could reduce to its inputs by construction. Per the evaluation criteria, the absence of any derivation chain warrants a score of 0 with no steps flagged.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No theoretical content; the paper is an empirical system description with no free parameters, axioms, or invented entities.


