vireoJD-MM at Activity Detection in Extended Videos

Chong-Wah Ngo; Fuchen Long; Qi Cai; Ting Yao; Yingwei Pan; Zhaofan Qiu; Zhijian Hou

arxiv: 1906.08547 · v1 · pith:R2FFERCGnew · submitted 2019-06-20 · 💻 cs.CV

Fuchen Long, Qi Cai, Zhaofan Qiu, Zhijian Hou, Yingwei Pan, Ting Yao, Chong-Wah Ngo

Pith reviewed 2026-05-25 20:00 UTC · model grok-4.3

classification 💻 cs.CV

keywords activity detection · surveillance videos · late fusion · person detection · action localization · tubelet generation · ActivityNet Challenge

The pith

A surveillance video system detects activities by fusing spatial person and vehicle detections with temporal action localizations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents a system for activity detection in extended surveillance videos that processes inputs at two levels: spatial detection of people and vehicles, plus temporal localization of actions. Outputs from these modules are combined through late fusion to produce final activity predictions. The work also examines different methods for generating tubelets and decomposing models. A sympathetic reader would care because extended videos pose unique challenges of duration and sparsity that single-stage approaches struggle with. The paper reports results from the ActivityNet Challenge 2019.

Core claim

The system exploits person/vehicle detections at the spatial level and action localization at the temporal level for action detection in surveillance videos. The final detections are produced by late fusion of the results from each component. Different tubelet generation and model decomposition methods are studied as well.

What carries the argument

Late fusion that combines outputs from a spatial person/vehicle detection module and a temporal action localization module.
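
The abstract does not say which fusion rule the system uses, so the following is only a minimal sketch of score-level late fusion under an assumed rule: a weighted average of per-activity confidence scores from independently run modules. The function name `late_fuse`, the activity labels, and the weights are all hypothetical.

```python
from typing import Dict, List


def late_fuse(score_sets: List[Dict[str, float]],
              weights: List[float]) -> Dict[str, float]:
    """Weighted average of per-activity scores from independent modules.

    Assumption: each module emits one confidence per activity label.
    The paper's actual fusion rule is not given in the abstract.
    """
    if len(score_sets) != len(weights):
        raise ValueError("one weight per module is required")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    # Collect every label that any module scored.
    labels = set().union(*score_sets)
    fused = {}
    for label in labels:
        # A module that did not score a label contributes 0 for it.
        weighted = sum(w * s.get(label, 0.0)
                       for w, s in zip(weights, score_sets))
        fused[label] = weighted / total
    return fused


# Hypothetical scores from a spatial and a temporal module.
spatial = {"vehicle_turning_left": 0.8, "person_opens_door": 0.2}
temporal = {"vehicle_turning_left": 0.6, "person_opens_door": 0.5}
fused = late_fuse([spatial, temporal], [0.5, 0.5])
```

Because the modules only meet at the score level, each can be retrained or swapped without touching the other; the cost is that any correlation between spatial and temporal cues is invisible to the fusion step.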

If this is right

Tubelet generation choices directly affect the quality of candidate activity regions passed to the fusion stage.
Model decomposition into spatial and temporal parts allows each component to be trained and evaluated independently before fusion.
Late fusion produces the final activity predictions used for ranking in the challenge leaderboard.
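
To make the first point concrete, here is a minimal sketch of one common way to build tubelets: greedily linking per-frame detection boxes whose overlap with the previous frame exceeds a threshold. This is an assumed, generic linking scheme, not one of the tubelet generation variants the paper compares; `link_tubelets`, `box_iou`, and the 0.5 threshold are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Tubelet = List[Tuple[int, Box]]          # (frame index, box) pairs


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def link_tubelets(frames: List[List[Box]],
                  iou_thresh: float = 0.5) -> List[Tubelet]:
    """Greedily extend each open tubelet with its best-overlapping box."""
    finished: List[Tubelet] = []
    active: List[Tubelet] = []
    for t, boxes in enumerate(frames):
        unmatched = list(boxes)
        still_active: List[Tubelet] = []
        for tube in active:
            last_box = tube[-1][1]
            best = max(unmatched,
                       key=lambda b: box_iou(last_box, b),
                       default=None)
            if best is not None and box_iou(last_box, best) >= iou_thresh:
                tube.append((t, best))
                unmatched.remove(best)
                still_active.append(tube)
            else:
                # No continuation in this frame: close the tubelet.
                finished.append(tube)
        # Every unmatched box starts a new tubelet.
        still_active.extend([[(t, b)] for b in unmatched])
        active = still_active
    return finished + active


# Hypothetical detections for three consecutive frames.
frames = [
    [(0, 0, 10, 10)],
    [(1, 1, 11, 11), (50, 50, 60, 60)],
    [(2, 2, 12, 12)],
]
tubelets = link_tubelets(frames)
```

The threshold controls the trade-off the bullet describes: a loose threshold merges distinct objects into one candidate region, while a strict one fragments a single track into many short tubelets.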

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The modular separation might allow reuse of the spatial detector on non-surveillance video domains where only person localization is needed.
If spatial and temporal cues are strongly correlated in some activities, an integrated model could outperform late fusion.
Extending the same decomposition to audio or text streams could test whether late fusion scales beyond visual features.

Load-bearing premise

Late fusion of separately trained spatial and temporal modules will improve overall detection accuracy over any individual module.

What would settle it

A controlled test on the ActEV-PC dataset in which a single end-to-end model or an early-fusion variant records higher mean average precision than the late-fusion system.

Figures

Figures reproduced from arXiv: 1906.08547 by Chong-Wah Ngo, Fuchen Long, Qi Cai, Ting Yao, Yingwei Pan, Zhaofan Qiu, Zhijian Hou.

**Figure 1.** Framework of our three-stage system for Activity Detec

**Figure 3.** Recall of tubelet generation with respect to different IoU
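
A recall-versus-IoU curve of the kind Figure 3 reports can be computed as sketched below. This is an illustration under assumptions: it uses temporal IoU between (start, end) intervals as a stand-in for whatever overlap measure the figure actually uses, and the names `temporal_iou` and `recall_at_iou` and the sample intervals are hypothetical.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]  # (start, end), e.g. in frames or seconds


def temporal_iou(a: Interval, b: Interval) -> float:
    """Intersection over union of two 1-D intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_iou(gt: Sequence[Interval],
                  proposals: Sequence[Interval],
                  thresholds: Sequence[float]) -> List[float]:
    """Fraction of ground-truth intervals matched at each IoU threshold.

    A ground-truth interval counts as recalled when at least one
    proposal overlaps it with IoU at or above the threshold.
    """
    if not gt:
        raise ValueError("need at least one ground-truth interval")
    # Best overlap achieved by any proposal for each ground-truth item.
    best = [max((temporal_iou(g, p) for p in proposals), default=0.0)
            for g in gt]
    return [sum(b >= th for b in best) / len(gt) for th in thresholds]


# Hypothetical ground truth and proposals.
gt = [(0, 10), (20, 30)]
proposals = [(1, 10), (24, 34)]
recalls = recall_at_iou(gt, proposals, [0.3, 0.5, 0.95])
```

Recall can only stay flat or fall as the threshold rises, so the shape of the curve shows how tightly the generated tubelets fit the annotated activities.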

Original abstract

This notebook paper presents an overview and comparative analysis of our system designed for activity detection in extended videos (ActEV-PC) in ActivityNet Challenge 2019. Specifically, we exploit person/vehicle detections in spatial level and action localization in temporal level for action detection in surveillance videos. The mechanism of different tubelet generation and model decomposition methods are studied as well. The detection results are finally predicted by late fusing the results from each component.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a standard challenge notebook that assembles existing detectors and localizers with late fusion for the ActEV task and compares tubelet variants, but claims no new methods or general results.

This paper is a notebook entry from the ActivityNet Challenge 2019 describing the team's ActEV-PC system for activity detection in long surveillance videos. The system runs person and vehicle detection at the spatial level, performs temporal action localization, and combines the outputs with late fusion. It also compares tubelet generation methods and ways of decomposing the models. That comparison is the only part that shows analysis beyond a description of the pipeline; the rest is a straightforward assembly of known components. The abstract supplies no numbers, ablations, or error breakdowns, so it is impossible to tell whether the fusion step adds anything measurable or whether the tubelet comparisons control for obvious variables such as detector strength. The work stays inside the challenge setting and neither makes nor tests any broader claim about activity detection. A report of this kind can be useful to other teams entering the same competition who want to see which configurations were tried. Outside that narrow group it offers little that would change how someone builds or evaluates a system. The math and citations are not an issue: there is no derivation, and the references are the expected challenge and detector papers. I would not bring this to a reading group or cite it. It does not need referee time; challenge notebooks like this belong on arXiv as technical reports rather than in the review pipeline.

Referee Report

1 major / 1 minor

Summary. This notebook paper presents an overview of the vireoJD-MM team's ActEV-PC system for the Activity Detection in Extended Videos task in the ActivityNet Challenge 2019. It combines spatial-level person/vehicle detections with temporal-level action localization, examines different tubelet generation and model decomposition methods, and predicts final detections via late fusion of the component outputs.

Significance. As a challenge notebook paper the work supplies a practical system description and comparative analysis within the ActEV-PC leaderboard context. The explicit study of tubelet generation and model decomposition, together with the late-fusion pipeline, could serve as a useful reference for surveillance-video activity detection if the full manuscript supplies the missing quantitative results and ablations; without them the broader methodological contribution remains modest.

major comments (1)

[Abstract] The description of the spatial-detection, temporal-localization, and late-fusion pipeline is supplied without any quantitative results, ablation tables, or error analysis, so it is impossible to verify whether the fusion step yields measurable improvement over the individual components after controlling for data splits or hyperparameters.

minor comments (1)

The manuscript is a short challenge notebook; adding at least one table of component-wise and fused performance numbers (even if only leaderboard scores) would make the comparative analysis concrete.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the review and the suggestion to strengthen the abstract. We address the major comment below.

Referee: [Abstract] The description of the spatial-detection, temporal-localization, and late-fusion pipeline is supplied without any quantitative results, ablation tables, or error analysis, so it is impossible to verify whether the fusion step yields measurable improvement over the individual components after controlling for data splits or hyperparameters.

Authors: As a challenge notebook paper, the manuscript prioritizes a concise system description and the study of tubelet generation and model decomposition variants, with overall performance placed in the context of the ActEV-PC leaderboard. The abstract therefore omits detailed numbers. We agree that a brief quantitative statement on the fusion benefit would improve clarity and verifiability. We will revise the abstract to include one sentence reporting the observed improvement from late fusion on the official test set.

revision: yes

Circularity Check

0 steps flagged

No significant circularity

This is an empirical system-description notebook for a challenge entry. It reports an architecture (spatial detections + temporal localization + late fusion) and comparative results on the ActEV-PC leaderboard. No equations, fitted parameters renamed as predictions, self-definitional steps, or load-bearing self-citations appear in the provided text. The central claim reduces to an externally scored competition entry rather than an internal derivation that collapses to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical model, derivation, or new postulated entity appears in the abstract; the work is a descriptive engineering report.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · 1 internal anchor

[1] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3D convolutional networks,” in ICCV, 2015.

[2] Z. Qiu, T. Yao, and T. Mei, “Deep quantization: Encoding convolutional activations with deep generative model,” in CVPR, 2017.

[3] Z. Qiu, T. Yao, C.-W. Ngo, X. Tian, and T. Mei, “Learning spatio-temporal representation with local and global diffusion,” in CVPR, 2019.

[4] Z. Qiu, D. Li, Y. Li, Q. Cai, Y. Pan, and T. Yao, “Trimmed action recognition, dense-captioning events in videos, and spatio-temporal action localization with focus on ActivityNet Challenge 2019,” arXiv preprint arXiv:1906.07016, 2019.

[5] Z. Shou, D. Wang, and S.-F. Chang, “Temporal action localization in untrimmed videos via multi-stage CNNs,” in CVPR, 2016.

[6] Z. Shou, J. Chan, A. Zareian, K. Miyazawa, and S.-F. Chang, “CDC: Convolutional-De-Convolutional Network for Precise Temporal Action Localization in Untrimmed Videos,” in CVPR, 2017.

[7] T. Lin, X. Zhao, and Z. Shou, “Temporal Convolution Based Action Proposal: Submission to ActivityNet 2017,” CoRR, 2017.

[8] F. Long, T. Yao, Z. Qiu, X. Tian, J. Luo, and T. Mei, “Gaussian temporal awareness networks for action localization,” in CVPR, 2019.

[9] G. Awad, D.-D. Le, C.-W. Ngo, V.-T. Nguyen, G. Quénot, C. Snoek, and S. Satoh, “Video indexing, search, detection, and description with focus on TRECVID,” in ICMR, 2017.

[10] H. Zhang, L. Pang, Y.-J. Lu, and C.-W. Ngo, “Vireo@TRECVID 2016: Multimedia event detection, ad-hoc video search, video to text description,” in Proc. Int. Workshop Video Retrieval Eval., 2016.

[11] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in NIPS, 2015.

[12] Q. Cai, Y. Pan, C.-W. Ngo, X. Tian, L. Duan, and T. Yao, “Exploring object relation in mean teacher for cross-domain detection,” in CVPR, 2019.

[13] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016.

[14] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in ECCV, 2014.

[15] Z. Qiu, T. Yao, and T. Mei, “Learning spatio-temporal representation with pseudo-3D residual networks,” in ICCV, 2017.

[16] A. Shrivastava, A. Gupta, and R. Girshick, “Training region-based object detectors with online hard example mining,” in CVPR, 2016.

[17] N. Bodla, B. Singh, R. Chellappa, and L. S. Davis, “Soft-NMS – Improving Object Detection With One Line of Code,” in ICCV, 2017.
