Submission to ActivityNet Challenge 2019: Task B Spatio-temporal Action Localization

Byeongwon Lee; Chunfei Ma; Joonhyang Choi; Seungji Yang

arxiv: 1907.10837 · v1 · pith:NQYKKVQRnew · submitted 2019-07-25 · 💻 cs.CV · cs.LG


Pith reviewed 2026-05-24 16:40 UTC · model grok-4.3

classification 💻 cs.CV cs.LG

keywords: spatio-temporal action localization · SlowFast Networks · data augmentation · class imbalance · overfitting · ActivityNet · RGB video


The pith

A SlowFast network using only RGB frames plus correlation-preserving augmentation and random label subsampling reduces overfitting in spatio-temporal action localization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that an end-to-end trainable two-branch network can perform spatio-temporal action localization on ActivityNet without relying on optical flow or other two-stream inputs. It adapts the SlowFast architecture to extract both short- and long-term features from RGB sequences and adds two targeted techniques to counter the dataset's severe class imbalance and overfitting. A sympathetic reader would care because this approach simplifies the pipeline to RGB alone while claiming to improve results through better data handling.

Core claim

The authors claim that SlowFast Networks, when trained end-to-end on RGB sequences alone, combined with a correlation-preserving data augmentation method and a random label subsampling method, successfully mitigate class imbalance and overfitting and thereby improve performance on the spatio-temporal action localization task.

What carries the argument

SlowFast Networks, a two-branch architecture that captures short- and long-term spatiotemporal features from RGB video, paired with correlation-preserving data augmentation and random label subsampling to address imbalance.
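The two-branch idea can be illustrated with a minimal frame-sampling sketch. It follows the general SlowFast design, in which a slow pathway reads few frames at a large temporal stride and a fast pathway reads more frames at a smaller stride. The function name `slowfast_indices` and the values `tau=16` and `alpha=8` are assumptions chosen for illustration; the report summarized here does not state which settings the authors used.

```python
import numpy as np

def slowfast_indices(num_frames, tau=16, alpha=8):
    """Return (slow, fast) frame indices for one RGB clip.

    tau   : temporal stride of the slow pathway (assumed value).
    alpha : frame-rate ratio between pathways (assumed value);
            the fast pathway uses stride tau // alpha.
    """
    if tau % alpha != 0:
        raise ValueError("tau must be divisible by alpha")
    slow = np.arange(0, num_frames, tau)           # sparse sampling
    fast = np.arange(0, num_frames, tau // alpha)  # dense sampling
    return slow, fast

# A 64-frame clip yields 4 slow frames and 32 fast frames.
slow_idx, fast_idx = slowfast_indices(num_frames=64, tau=16, alpha=8)
```

Both index sets come from the same RGB clip, which is why no optical-flow input is needed; the fast pathway simply sees the clip at a finer temporal resolution.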

If this is right

Training can proceed end-to-end using only RGB sequences without optical flow.
Class imbalance is handled directly through the new subsampling and augmentation steps.
Overfitting is reduced, producing measurable gains on the localization task.
The overall system becomes simpler and more computationally direct than prior two-stream approaches.
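One way to picture label subsampling for an imbalanced multi-label dataset is sketched below. This is a hypothetical illustration, not the authors' method: the summary above does not describe how their random label subsampling works. The sketch assumes a multi-hot label matrix and a per-class cap, and it returns a mask of label entries that would contribute to the loss; the name `subsample_label_mask` and the parameter `cap` are invented here.

```python
import numpy as np

def subsample_label_mask(labels, cap, rng):
    """Hypothetical label subsampling for a multi-hot label matrix.

    labels : (N, C) binary array, one row per annotated box.
    cap    : target number of positives kept per class (assumed knob).
    Returns an (N, C) boolean mask. Negative entries are always kept;
    a positive of a class with more than `cap` positives is kept
    with probability cap / count.
    """
    labels = np.asarray(labels, dtype=bool)
    counts = labels.sum(axis=0)                       # positives per class
    keep_prob = np.minimum(1.0, cap / np.maximum(counts, 1))
    draws = rng.random(labels.shape) < keep_prob      # broadcast over rows
    return ~labels | draws

rng = np.random.default_rng(0)
labels = np.zeros((1000, 2), dtype=int)
labels[:, 0] = 1     # head class: 1000 positives
labels[:5, 1] = 1    # tail class: 5 positives
mask = subsample_label_mask(labels, cap=100, rng=rng)
kept_head = int((mask[:, 0] & (labels[:, 0] == 1)).sum())
```

Redrawing the mask each epoch would let every head-class positive be seen eventually while limiting how much any single epoch is dominated by frequent classes. Whether the report does something similar is not stated in this summary.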

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same handling methods could be tested on other video datasets that suffer from long-tail class distributions.
If the augmentation preserves correlations across frames, it may also improve temporal consistency in related tasks such as action recognition.
End-to-end RGB training opens the possibility of joint optimization with downstream tasks that also operate on raw video.

Load-bearing premise

The SlowFast architecture together with the proposed augmentation and subsampling methods is sufficient to overcome class imbalance and overfitting when trained end-to-end on RGB sequences for the ActivityNet dataset.

What would settle it

A controlled experiment on the ActivityNet validation set in which the correlation-preserving augmentation and random label subsampling are removed and the model shows clear signs of overfitting with no performance gain.

Figures

Figures reproduced from arXiv: 1907.10837 by Byeongwon Lee, Chunfei Ma, Joonhyang Choi, Seungji Yang.

**Figure 2.** Co-occurrence matrix of original dataset (a), aug…
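For readers unfamiliar with the quantity shown in Figure 2, a label co-occurrence matrix can be computed as below. This is a minimal sketch assuming annotations are stored as a multi-hot matrix with one row per annotated box; the function name `cooccurrence_matrix` and the toy data are illustrative, and the summary does not say how the authors built their figure.

```python
import numpy as np

def cooccurrence_matrix(labels):
    """Count how often each pair of classes is labelled together.

    labels : (N, C) binary array (assumed multi-hot format).
    Entry (i, j) is the number of rows where classes i and j are both
    positive; the diagonal holds the per-class positive counts.
    """
    labels = np.asarray(labels, dtype=np.int64)
    return labels.T @ labels

# Toy example with 4 annotated boxes and 3 classes.
labels = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [1, 1, 0],
])
cooc = cooccurrence_matrix(labels)
```

Comparing this matrix before and after augmentation is one way to check whether an augmentation scheme keeps the relationships between classes intact, which is presumably what "correlation-preserving" refers to.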

Original abstract

This technical report presents an overview of our system proposed for the spatio-temporal action localization (SAL) task in ActivityNet Challenge 2019. Unlike previous two-streams-based works, we focus on exploring the end-to-end trainable architecture using only RGB sequential images. To this end, we employ a previously proposed simple yet effective two-branches network called SlowFast Networks which is capable of capturing both short- and long-term spatiotemporal features. Moreover, to handle the severe class imbalance and overfitting problems, we propose a correlation-preserving data augmentation method and a random label subsampling method which have been proven to be able to reduce overfitting and improve the performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript is a technical report describing a submission to the ActivityNet Challenge 2019 Task B on spatio-temporal action localization. It employs the SlowFast network architecture for end-to-end training on RGB sequences alone (avoiding two-stream approaches) and introduces two techniques—a correlation-preserving data augmentation method and a random label subsampling method—to address class imbalance and overfitting.

Significance. If the performance claims were supported by evidence, the work would indicate that a simplified RGB-only SlowFast pipeline with targeted augmentation and subsampling can effectively handle the challenges of the task, offering a contrast to prior two-stream methods. The specific handling methods for imbalance could be reusable if shown to generalize. However, the complete absence of quantitative results prevents any assessment of significance.

major comments (1)

[Abstract] Abstract: The claim that the correlation-preserving data augmentation and random label subsampling methods 'have been proven to be able to reduce overfitting and improve the performance' is unsupported by any mAP scores, ablation tables, baseline comparisons, or measurements of overfitting/class imbalance on ActivityNet. This directly undermines the central contribution of the report.

Simulated Author's Rebuttal

1 response · 0 unresolved

Thank you for reviewing our technical report on the ActivityNet Challenge 2019 Task B submission. We address the referee's major comment below.


Referee: [Abstract] Abstract: The claim that the correlation-preserving data augmentation and random label subsampling methods 'have been proven to be able to reduce overfitting and improve the performance' is unsupported by any mAP scores, ablation tables, baseline comparisons, or measurements of overfitting/class imbalance on ActivityNet. This directly undermines the central contribution of the report.

Authors: We agree that the abstract makes an unsubstantiated claim. The manuscript describes the SlowFast architecture and the two proposed techniques for addressing class imbalance and overfitting but contains no ablation studies, mAP deltas, or quantitative measurements of overfitting reduction. As a short challenge submission report, the focus is on system description rather than experimental validation of each component. We will revise the abstract to remove the phrase 'have been proven to be able to reduce overfitting and improve the performance' and replace it with a description of the methods as proposed solutions intended to mitigate these issues, without asserting empirical proof within the report.

Revision: yes

Circularity Check

0 steps flagged

No circularity: system description with no derivations or self-referential reductions


The manuscript is a challenge submission that describes an architecture (SlowFast Networks) and two proposed handling methods without any equations, derivations, fitted parameters, or mathematical claims. The assertion that the methods 'have been proven to be able to reduce overfitting' is presented as an empirical claim rather than a derived result, and no load-bearing step reduces to its own inputs by construction. No self-citation chains, ansatzes, or uniqueness theorems are invoked in a manner that creates circularity. The paper is self-contained as a descriptive report against external benchmarks (ActivityNet challenge), warranting a score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No derivations, free parameters, or new entities are introduced; the work rests on the prior SlowFast paper and standard supervised learning assumptions for video classification.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Each tag gives the relation between the paper passage and the cited Recognition theorem.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "employ a previously proposed simple yet effective two-branches network called SlowFast Networks ... correlation-preserving data augmentation method and a random label subsampling method"

IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking · unclear
Paper passage: "Table 2: Ablation results on AVA2.2 action localization"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · 3 internal anchors

[1] J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR, 2017.

[2] S. Ji, W. Xu, M. Yang, and K. Yu. 3D convolutional neural networks for human action recognition. TPAMI, 35(1):221-231, 2013.

[3] K. Hara, H. Kataoka, and Y. Satoh. Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In CVPR, 2018.

[4] Y. Zhang, P. Tokmakov, and M. Hebert. A study on action detection in the wild. arXiv:1904.12993, 2019.

[5] C. Gu, C. Sun, D. A. Ross, C. Vondrick, C. Pantofaru, Y. Li, S. Vijayanarasimhan, G. Toderici, S. Ricco, R. Sukthankar, et al. AVA: A video dataset of spatio-temporally localized atomic visual actions. arXiv:1705.08421, 2017.

[6] V. Kalogeiton, P. Weinzaepfel, V. Ferrari, and C. Schmid. Action tubelet detector for spatio-temporal action localization. In ICCV, 2017.

[7] C. Feichtenhofer, H. Fan, J. Malik, and K. He. SlowFast networks for video recognition. arXiv:1812.03982v2, 2019.

[8] X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. CVPR, 2017.

[9] Y. Zhang, P. Tokmakov, M. Hebert, and C. Schmid. A structured model for action detection. CVPR, 2019.

[10] R. Girdhar, J. Carreira, C. Doersch, and A. Zisserman. Video action transformer network. CVPR, 2019.

[11] C. Wu, C. Feichtenhofer, H. Fan, K. He, P. Krähenbühl, and R. Girshick. Long-term feature banks for detailed video understanding. CVPR, 2019.

[12] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. CVPR, 2017.

[13] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. CVPR, 2017.

[14] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. ECCV, 2014.

[15] J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. CVPR, 2009.

[16] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. The Kinetics human action video dataset. arXiv:1705.06950, 2017.

[17] I. Loshchilov and F. Hutter. SGDR: Stochastic gradient descent with warm restarts. ICLR, 2017.

[18] http://blog.qure.ai/notes/deep-learning-for-videos-action-recognition-review
