vireoJD-MM at Activity Detection in Extended Videos
Pith reviewed 2026-05-25 20:00 UTC · model grok-4.3
The pith
A surveillance video system detects activities by fusing spatial person and vehicle detections with temporal action localizations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The system combines person/vehicle detections at the spatial level with action localization at the temporal level to detect actions in surveillance videos. Final detections are produced by late fusion of the results from each component. Different tubelet generation and model decomposition methods are also studied.
What carries the argument
Late fusion that combines outputs from a spatial person/vehicle detection module and a temporal action localization module.
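To make the fusion step concrete, here is a minimal Python sketch of score-level late fusion. The text available here does not specify the fusion rule, so the weighted average, the function name `late_fuse`, the weight `w_spatial`, and the activity labels are all illustrative assumptions, not the authors' implementation.

```python
from typing import Dict


def late_fuse(
    spatial: Dict[str, float],
    temporal: Dict[str, float],
    w_spatial: float = 0.5,
) -> Dict[str, float]:
    """Fuse per-class confidence scores from two independently run modules.

    Assumption: both modules emit scores in [0, 1] for the same candidate,
    and a class missing from one module counts as a score of 0.0 there.
    """
    if not 0.0 <= w_spatial <= 1.0:
        raise ValueError("w_spatial must lie in [0, 1]")
    w_temporal = 1.0 - w_spatial
    classes = set(spatial) | set(temporal)
    return {
        c: w_spatial * spatial.get(c, 0.0) + w_temporal * temporal.get(c, 0.0)
        for c in classes
    }


# Hypothetical scores for one candidate region (labels are made up).
fused = late_fuse({"opening": 1.0}, {"opening": 0.0, "closing": 1.0})
```

Because fusion only touches the output scores, either module can be retrained or swapped without changing the other, which is the property the decomposition relies on.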
If this is right
- Tubelet generation choices directly affect the quality of candidate activity regions passed to the fusion stage.
- Model decomposition into spatial and temporal parts allows each component to be trained and evaluated independently before fusion.
- Late fusion produces the final activity predictions used for ranking in the challenge leaderboard.
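The first implication is easier to see with a concrete tubelet generator in hand. The sketch below links per-frame boxes into tubelets by greedy IoU matching, one common approach. The summary does not say which generation methods the authors compared, so the algorithm, the `iou_thresh` value, and all names here are assumptions chosen for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Tubelet = List[Tuple[int, Box]]          # list of (frame_index, box)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def link_tubelets(frames: List[List[Box]], iou_thresh: float = 0.5) -> List[Tubelet]:
    """Greedily extend each active tubelet with its best-overlapping box."""
    finished: List[Tubelet] = []
    active: List[Tubelet] = []
    for t, boxes in enumerate(frames):
        unused = list(boxes)
        still_active: List[Tubelet] = []
        for tube in active:
            last_box = tube[-1][1]
            best = max(unused, key=lambda b: iou(last_box, b), default=None)
            if best is not None and iou(last_box, best) >= iou_thresh:
                tube.append((t, best))
                unused.remove(best)
                still_active.append(tube)
            else:
                finished.append(tube)  # no match in this frame: close the tubelet
        # Every unmatched detection starts a new tubelet.
        still_active.extend([[(t, b)] for b in unused])
        active = still_active
    return finished + active


# Hypothetical detections: one object drifting, then an unrelated box.
tubes = link_tubelets([[(0, 0, 10, 10)], [(1, 1, 11, 11)], [(100, 100, 110, 110)]])
```

Raising or lowering `iou_thresh` changes how readily tubelets split or merge, which is one way generation choices alter the candidates that reach the fusion stage.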
Where Pith is reading between the lines
- The modular separation might allow reuse of the spatial detector on non-surveillance video domains where only person localization is needed.
- If spatial and temporal cues are strongly correlated in some activities, an integrated model could outperform late fusion.
- Extending the same decomposition to audio or text streams could test whether late fusion scales beyond visual features.
Load-bearing premise
Late fusion of separately trained spatial and temporal modules will improve overall detection accuracy over any individual module.
What would settle it
A controlled test on the ActEV-PC dataset in which a single end-to-end model or an early-fusion variant records higher mean average precision than the late-fusion system.
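For reference, the comparison described above could be scored with a routine like the following Python sketch of mean average precision. It is a simplified, non-interpolated variant that assumes every ground-truth positive appears among the scored candidates. The official challenge scoring may differ, and the function names and example values are hypothetical.

```python
from typing import Dict, Sequence, Tuple


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Mean of the precision values at each rank where a positive is retrieved."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precisions = []
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(
    per_class: Dict[str, Tuple[Sequence[float], Sequence[int]]]
) -> float:
    """Average the per-class AP values over all classes."""
    aps = [average_precision(s, l) for s, l in per_class.values()]
    return sum(aps) / len(aps) if aps else 0.0


# Hypothetical candidate scores and 0/1 correctness labels for one class.
ap = average_precision([0.9, 0.8, 0.7], [1, 0, 1])
```

Running the same routine on the late-fusion outputs and on a competing end-to-end or early-fusion model, with identical data splits, would give the head-to-head numbers this test calls for.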
Original abstract
This notebook paper presents an overview and comparative analysis of our system designed for activity detection in extended videos (ActEV-PC) in ActivityNet Challenge 2019. Specifically, we exploit person/vehicle detections in spatial level and action localization in temporal level for action detection in surveillance videos. The mechanism of different tubelet generation and model decomposition methods are studied as well. The detection results are finally predicted by late fusing the results from each component.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This notebook paper presents an overview of the vireoJD-MM team's ActEV-PC system for the Activity Detection in Extended Videos task in the ActivityNet Challenge 2019. It combines spatial-level person/vehicle detections with temporal-level action localization, examines different tubelet generation and model decomposition methods, and predicts final detections via late fusion of the component outputs.
Significance. As a challenge notebook paper the work supplies a practical system description and comparative analysis within the ActEV-PC leaderboard context. The explicit study of tubelet generation and model decomposition, together with the late-fusion pipeline, could serve as a useful reference for surveillance-video activity detection if the full manuscript supplies the missing quantitative results and ablations; without them the broader methodological contribution remains modest.
major comments (1)
- [Abstract] The description of the spatial-detection, temporal-localization, and late-fusion pipeline is supplied without any quantitative results, ablation tables, or error analysis, so it is impossible to verify whether the fusion step yields measurable improvement over the individual components after controlling for data splits or hyperparameters.
minor comments (1)
- The manuscript is a short challenge notebook; adding at least one table of component-wise and fused performance numbers (even if only leaderboard scores) would make the comparative analysis concrete.
Simulated Author's Rebuttal
We thank the referee for the review and the suggestion to strengthen the abstract. We address the major comment below.
- Referee: [Abstract] The description of the spatial-detection, temporal-localization, and late-fusion pipeline is supplied without any quantitative results, ablation tables, or error analysis, so it is impossible to verify whether the fusion step yields measurable improvement over the individual components after controlling for data splits or hyperparameters.
  Authors: As a challenge notebook paper, the manuscript prioritizes a concise system description and the study of tubelet generation and model decomposition variants, with overall performance placed in the context of the ActEV-PC leaderboard. The abstract therefore omits detailed numbers. We agree that a brief quantitative statement on the fusion benefit would improve clarity and verifiability. We will revise the abstract to include one sentence reporting the observed improvement from late fusion on the official test set.
  Revision: yes
Circularity Check
No significant circularity
This is an empirical system-description notebook for a challenge entry. It reports an architecture (spatial detections + temporal localization + late fusion) and comparative results on the ActEV-PC leaderboard. No equations, fitted parameters renamed as predictions, self-definitional steps, or load-bearing self-citations appear in the provided text. The central claim reduces to an externally scored competition entry rather than an internal derivation that collapses to its inputs.