arxiv: 1906.08547 · v1 · pith:R2FFERCGnew · submitted 2019-06-20 · 💻 cs.CV

vireoJD-MM at Activity Detection in Extended Videos

Pith reviewed 2026-05-25 20:00 UTC · model grok-4.3

classification 💻 cs.CV
keywords activity detection · surveillance videos · late fusion · person detection · action localization · tubelet generation · ActivityNet Challenge

The pith

A surveillance video system detects activities by fusing spatial person and vehicle detections with temporal action localizations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents a system for activity detection in extended surveillance videos that processes inputs at two levels: spatial detection of people and vehicles, plus temporal localization of actions. Outputs from these modules are combined through late fusion to produce final activity predictions. The work also examines different methods for generating tubelets and decomposing models. A sympathetic reader would care because extended videos pose unique challenges of duration and sparsity that single-stage approaches struggle with. The paper reports results from the ActivityNet Challenge 2019.

Core claim

The system exploits person/vehicle detections at the spatial level and action localization at the temporal level for action detection in surveillance videos. Final detections are produced by late fusion of the results from each component. Different tubelet generation and model decomposition methods are also studied.

What carries the argument

Late fusion that combines outputs from a spatial person/vehicle detection module and a temporal action localization module.
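
As a rough illustration of what late fusion means here, the sketch below averages per-activity confidence scores from two independently run modules. The page does not state the paper's actual fusion rule, so the weighted average, the function name `late_fuse`, the weight `w_spatial`, and the activity labels are all assumptions made for this example.

```python
from typing import Dict


def late_fuse(
    spatial: Dict[str, float],
    temporal: Dict[str, float],
    w_spatial: float = 0.5,
) -> Dict[str, float]:
    """Weighted average of per-activity scores from two separate modules.

    A label missing from one module is treated as a score of 0.0 for that
    module (an assumption of this sketch, not a rule from the paper).
    """
    if not 0.0 <= w_spatial <= 1.0:
        raise ValueError("w_spatial must lie in [0, 1]")
    labels = set(spatial) | set(temporal)
    return {
        label: w_spatial * spatial.get(label, 0.0)
        + (1.0 - w_spatial) * temporal.get(label, 0.0)
        for label in labels
    }


# Hypothetical scores from the spatial and temporal modules.
spatial_scores = {"Entering": 0.75, "Closing": 0.25}
temporal_scores = {"Entering": 0.25, "Talking": 0.5}
fused = late_fuse(spatial_scores, temporal_scores, w_spatial=0.5)
```

Because fusion happens on the outputs only, either module can be retrained or swapped without touching the other, which is the practical appeal of the late-fusion design.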

If this is right

  • Tubelet generation choices directly affect the quality of candidate activity regions passed to the fusion stage.
  • Model decomposition into spatial and temporal parts allows each component to be trained and evaluated independently before fusion.
  • Late fusion produces the final activity predictions used for ranking in the challenge leaderboard.
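
To make the tubelet generation step concrete, here is a minimal sketch of greedy linking of per-frame bounding boxes by IoU. It is a generic greedy linker written for illustration; the linking criterion, the threshold `min_iou`, and the names `iou` and `link_greedy` are assumptions and may differ from the method in the paper.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Tubelet = List[Tuple[int, Box]]          # list of (frame_index, box)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def link_greedy(frames: List[List[Box]], min_iou: float = 0.5) -> List[Tubelet]:
    """Extend each active tubelet with its best-overlapping box in the next frame.

    A tubelet ends when no unused box reaches min_iou; unmatched boxes start
    new tubelets. Tubelets are processed in order, so earlier ones pick first.
    """
    finished: List[Tubelet] = []
    active: List[Tubelet] = []
    for t, boxes in enumerate(frames):
        unused = list(range(len(boxes)))
        still_active: List[Tubelet] = []
        for tube in active:
            last_box = tube[-1][1]
            best = max(unused, key=lambda i: iou(last_box, boxes[i]), default=None)
            if best is not None and iou(last_box, boxes[best]) >= min_iou:
                tube.append((t, boxes[best]))
                unused.remove(best)
                still_active.append(tube)
            else:
                finished.append(tube)
        still_active.extend([(t, boxes[i])] for i in unused)
        active = still_active
    return finished + active


# Hypothetical detections: two objects in frames 0-1, one object left in frame 2.
frames = [
    [(0, 0, 10, 10), (50, 50, 60, 60)],
    [(1, 0, 11, 10), (50, 51, 60, 61)],
    [(2, 0, 12, 10)],
]
tubelets = link_greedy(frames, min_iou=0.5)
```

In this toy input the first object is tracked across all three frames and the second across two. A stricter `min_iou` yields shorter, purer tubelets, while a looser one yields longer tubelets with more drift, which is one way tubelet generation choices can change what reaches the fusion stage.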

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The modular separation might allow reuse of the spatial detector on non-surveillance video domains where only person localization is needed.
  • If spatial and temporal cues are strongly correlated in some activities, an integrated model could outperform late fusion.
  • Extending the same decomposition to audio or text streams could test whether late fusion scales beyond visual features.

Load-bearing premise

Late fusion of separately trained spatial and temporal modules will improve overall detection accuracy over any individual module.

What would settle it

A controlled test on the ActEV-PC dataset in which a single end-to-end model or an early-fusion variant records higher mean average precision than the late-fusion system.
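
For reference, a comparison of this kind could score each system with a routine like the sketch below, which computes non-interpolated average precision per activity class and then averages across classes. It assumes detections have already been matched to ground truth (each carries a true-positive flag); the matching rule, the function names, and the sample data are hypothetical, and the official challenge scoring protocol may differ.

```python
from typing import Dict, List, Tuple

Detection = Tuple[float, bool]  # (confidence, is_true_positive)


def average_precision(detections: List[Detection], num_gt: int) -> float:
    """Non-interpolated AP: mean of precision at each true-positive rank."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    true_positives = 0
    precision_sum = 0.0
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        if is_tp:
            true_positives += 1
            precision_sum += true_positives / rank
    return precision_sum / num_gt


def mean_average_precision(per_class: Dict[str, Tuple[List[Detection], int]]) -> float:
    """Average the per-class AP over all classes."""
    aps = [average_precision(dets, n_gt) for dets, n_gt in per_class.values()]
    return sum(aps) / len(aps)


# Hypothetical matched detections for two activity classes.
results = {
    "Entering": ([(0.9, True), (0.8, False), (0.7, True)], 2),
    "Talking": ([(0.6, True)], 1),
}
map_score = mean_average_precision(results)
```

Running the same routine on the late-fusion output and on an end-to-end or early-fusion baseline, with identical data splits, would give the head-to-head numbers the test calls for.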

Figures

Figures reproduced from arXiv: 1906.08547 by Chong-Wah Ngo, Fuchen Long, Qi Cai, Ting Yao, Yingwei Pan, Zhaofan Qiu, Zhijian Hou.

Figure 1: Framework of our three-stage system for Activity Detec… (caption truncated)
Figure 2: Greedy based object linking with bounding boxes and … (caption truncated)
Figure 3: Recall of tubelet generation with respect to different IoU … (caption truncated)

Original abstract

This notebook paper presents an overview and comparative analysis of our system designed for activity detection in extended videos (ActEV-PC) in ActivityNet Challenge 2019. Specifically, we exploit person/vehicle detections in spatial level and action localization in temporal level for action detection in surveillance videos. The mechanism of different tubelet generation and model decomposition methods are studied as well. The detection results are finally predicted by late fusing the results from each component.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. This notebook paper presents an overview of the vireoJD-MM team's ActEV-PC system for the Activity Detection in Extended Videos task in the ActivityNet Challenge 2019. It combines spatial-level person/vehicle detections with temporal-level action localization, examines different tubelet generation and model decomposition methods, and predicts final detections via late fusion of the component outputs.

Significance. As a challenge notebook paper, the work supplies a practical system description and comparative analysis within the ActEV-PC leaderboard context. The explicit study of tubelet generation and model decomposition, together with the late-fusion pipeline, could serve as a useful reference for surveillance-video activity detection if the full manuscript supplies the missing quantitative results and ablations; without them, the broader methodological contribution remains modest.

major comments (1)
  1. [Abstract] The description of the spatial-detection, temporal-localization, and late-fusion pipeline is supplied without any quantitative results, ablation tables, or error analysis, so it is impossible to verify whether the fusion step yields measurable improvement over the individual components after controlling for data splits or hyperparameters.
minor comments (1)
  1. The manuscript is a short challenge notebook; adding at least one table of component-wise and fused performance numbers (even if only leaderboard scores) would make the comparative analysis concrete.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the review and the suggestion to strengthen the abstract. We address the major comment below.

  1. Referee: [Abstract] The description of the spatial-detection, temporal-localization, and late-fusion pipeline is supplied without any quantitative results, ablation tables, or error analysis, so it is impossible to verify whether the fusion step yields measurable improvement over the individual components after controlling for data splits or hyperparameters.

    Authors: As a challenge notebook paper, the manuscript prioritizes a concise system description and the study of tubelet generation and model decomposition variants, with overall performance placed in the context of the ActEV-PC leaderboard. The abstract therefore omits detailed numbers. We agree that a brief quantitative statement on the fusion benefit would improve clarity and verifiability. We will revise the abstract to include one sentence reporting the observed improvement from late fusion on the official test set. revision: yes

Circularity Check

0 steps flagged

No significant circularity

This is an empirical system-description notebook for a challenge entry. It reports an architecture (spatial detections + temporal localization + late fusion) and comparative results on the ActEV-PC leaderboard. No equations, fitted parameters renamed as predictions, self-definitional steps, or load-bearing self-citations appear in the provided text. The central claim reduces to an externally scored competition entry rather than an internal derivation that collapses to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical model, derivation, or new postulated entity appears in the abstract; the work is a descriptive engineering report.
