
arxiv: 1907.04988 · v1 · pith:2EHCQAHInew · submitted 2019-07-11 · 💻 cs.CV

Object Detection in Video with Spatial-temporal Context Aggregation

Pith reviewed 2026-05-24 23:29 UTC · model grok-4.3

classification 💻 cs.CV
keywords video object detection · feature aggregation · spatio-temporal context · object proposals · ImageNet VID · Faster R-CNN

The pith

Proposal-level modeling of semantic and spatio-temporal relationships lifts video object detection to 80.3% mAP on ImageNet VID.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that estimating feature correspondences across video frames is unreliable because of motion blur and low image quality, so it replaces that step with aggregation at the level of object proposals. Each proposal receives an enhanced feature vector by learning both semantic similarities and spatio-temporal links to other proposals in the same frame and in adjacent frames. The resulting detector improves a single-frame Faster R-CNN baseline by 5.8 points and reaches 80.3% mAP on ImageNet VID; with no temporal post-processing, it exceeds the previous state of the art by 1.4 points.

Core claim

The proposed feature aggregation framework operates on the object proposal-level and learns to enhance each proposal's feature via modeling semantic and spatio-temporal relationships among object proposals from both within a frame and across adjacent frames.

What carries the argument

Proposal-level feature aggregation that models semantic and spatio-temporal relationships among object proposals within and across frames.
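
The abstract does not specify the aggregation architecture, so the following is only a minimal sketch of what proposal-level aggregation could look like, assuming a scaled dot-product attention form. The function name, the weight matrices (w_q, w_k, w_v, w_pos), the (cx, cy, w, h, t) position layout, and the residual update are all illustrative assumptions, not the authors' design.

```python
import numpy as np

def aggregate_proposals(target_feats, cand_feats, target_pos, cand_pos,
                        w_q, w_k, w_v, w_pos):
    """Enhance target proposal features with a weighted sum of candidates.

    target_feats: (T, d) features of proposals in the key frame
    cand_feats:   (C, d) features of candidates (same and adjacent frames)
    target_pos:   (T, 5) hypothetical (cx, cy, w, h, t) per target proposal
    cand_pos:     (C, 5) same layout for candidate proposals
    """
    d_k = w_q.shape[1]
    # Semantic term: similarity between projected proposal features.
    q = target_feats @ w_q                               # (T, d_k)
    k = cand_feats @ w_k                                 # (C, d_k)
    semantic = q @ k.T / np.sqrt(d_k)                    # (T, C)
    # Spatio-temporal term (assumed form): a linear score on the absolute
    # box and time offsets between every target/candidate pair.
    offsets = target_pos[:, None, :] - cand_pos[None, :, :]   # (T, C, 5)
    spatiotemporal = np.abs(offsets) @ w_pos             # (T, C)
    # Softmax over candidates, shifted for numerical stability.
    logits = semantic + spatiotemporal
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    # Residual update keeps the original proposal feature.
    enhanced = target_feats + weights @ (cand_feats @ w_v)
    return enhanced, weights

# Toy usage with random inputs: 3 target proposals, 7 candidates.
rng = np.random.default_rng(0)
d, d_k, T, C = 8, 4, 3, 7
enhanced, weights = aggregate_proposals(
    rng.normal(size=(T, d)), rng.normal(size=(C, d)),
    rng.normal(size=(T, 5)), rng.normal(size=(C, 5)),
    rng.normal(size=(d, d_k)), rng.normal(size=(d, d_k)),
    rng.normal(size=(d, d)), rng.normal(size=5),
)
```

Each row of weights is a distribution over candidate proposals, so no pixel-level alignment between frames is needed; the candidates can come from the key frame, from adjacent frames, or from both.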

If this is right

  • The aggregation step improves a single-frame Faster R-CNN baseline by 5.8% mAP.
  • The full method reaches 80.3% mAP on ImageNet VID without bells and whistles.
  • Under the setting of no temporal post-processing the method exceeds the prior state-of-the-art by 1.4% mAP.
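
The figures above imply two reference points that the abstract does not state. The arithmetic below is a rough sketch, assuming both gains are measured against the 80.3% result; the variable names are illustrative and the derived values are implied, not reported in the abstract.

```python
# Values stated in the abstract (mAP on ImageNet VID, in points).
reported_map = 80.3
gain_over_baseline = 5.8   # vs. single-frame Faster R-CNN
gain_over_prior = 1.4      # vs. prior state of the art, no post-processing

# Implied reference points, under the assumption stated above.
implied_baseline = round(reported_map - gain_over_baseline, 1)
implied_prior = round(reported_map - gain_over_prior, 1)

print(f"implied single-frame baseline: {implied_baseline} mAP")
print(f"implied prior best (no post-processing): {implied_prior} mAP")
```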

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The result suggests that explicit modeling of proposal relationships can substitute for precise pixel-level temporal alignment when video quality is low.
  • The same relationship-modeling idea could be tested on other video tasks such as action recognition or multi-object tracking.
  • Removing the cross-frame component would isolate how much of the gain comes from intra-frame semantic context alone.

Load-bearing premise

Feature correspondence estimation across frames is fundamentally difficult and unstable because of poor image quality and motion blur.

What would settle it

An experiment in which an accurate feature-correspondence method is substituted for the proposal-level relationship model and still matches or exceeds 80.3% mAP on ImageNet VID.

Figures

Figures reproduced from arXiv: 1907.04988 by Chang Huang, Han Shen, Hao Luo, Lichao Huang, Xinggang Wang, Yuan Li.

Figure 1: Illustration of the differences between pixel/instance
Figure 2: Proposed spatial-temporal feature aggregation framework. For each input video frame, ResNet-101 is used to extract the feature
Figure 3: Detection results in mAP
Figure 4: Example detection results of single-frame baseline, MANet [
Figure 5: Examples of the dependency (in Eq. (8)) between the target proposals and the candidate proposals from adjacent frames in the last aggregation unit. Red and blue boxes are target proposals in a key frame and candidate proposals in its neighboring frames, respectively. Each row shows one of the target proposals and its corresponding candidate proposals with highest dependency in both key frame and neighbo…
Original abstract

Recent cutting-edge feature aggregation paradigms for video object detection rely on inferring feature correspondence. The feature correspondence estimation problem is fundamentally difficult due to poor image quality, motion blur, etc., and the results of feature correspondence estimation are unstable. To avoid the problem, we propose a simple but effective feature aggregation framework which operates on the object proposal-level. It learns to enhance each proposal's feature via modeling semantic and spatio-temporal relationships among object proposals from both within a frame and across adjacent frames. Experiments are carried out on the ImageNet VID dataset. Without any bells and whistles, our method obtains 80.3% mAP on the ImageNet VID dataset, which is superior over the previous state-of-the-arts. The proposed feature aggregation mechanism improves the single frame Faster RCNN baseline by 5.8% mAP. Besides, under the setting of no temporal post-processing, our method outperforms the previous state-of-the-art by 1.4% mAP.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript proposes a proposal-level feature aggregation framework for video object detection that models semantic and spatio-temporal relationships among object proposals within a frame and across adjacent frames, thereby avoiding explicit feature correspondence estimation. On the ImageNet VID dataset the method reports 80.3% mAP, a 5.8% gain over a single-frame Faster R-CNN baseline and a 1.4% gain over prior state-of-the-art under the no-temporal-post-processing protocol.

Significance. If the reported mAP figures hold under standard evaluation protocols, the work supplies a concrete, comparatively simple alternative to correspondence-based aggregation pipelines and supplies explicit numerical comparisons against both a single-frame baseline and prior video methods.

minor comments (2)
  1. [Abstract] The phrase 'without any bells and whistles' is used to characterize the 80.3% result; the manuscript should explicitly list which components (backbone, training schedule, data augmentation, etc.) are included so that the baseline comparison can be reproduced.
  2. [Abstract] The motivation paragraph asserts that feature correspondence is 'fundamentally difficult' due to motion blur, yet the central empirical claim does not depend on proving this premise; a brief sentence clarifying that the performance numbers stand independently of the motivation would avoid any appearance of circularity.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive review and recommendation of minor revision. The report raises no specific major comments, so our response focuses on acknowledging the overall assessment while confirming we will incorporate any minor suggestions in the revised manuscript.

Circularity Check

0 steps flagged

No significant circularity; empirical performance claims only

full rationale

The paper reports experimental results on ImageNet VID (80.3% mAP, +5.8% over single-frame Faster R-CNN baseline, +1.4% over prior SOTA without post-processing). No derivation chain, equations, or predictions are present that reduce to inputs by construction. The proposal-level aggregation is a designed architecture whose effectiveness is measured externally via standard benchmarks; the motivation about correspondence instability is stated but is not required for the numerical claims to hold. No self-citation load-bearing steps, fitted inputs renamed as predictions, or ansatz smuggling appear in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no details on model architecture, loss functions, or training hyperparameters are provided, preventing enumeration of free parameters, axioms, or invented entities.



Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 4 internal anchors

  1. [1]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. 2009. 1, 5

  2. [2]

    You only look once: Unified, real-time object detection

    Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016. 1, 2

  3. [3]

    Mask r-cnn

    Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017. 1, 2

  4. [4]

    Faster r-cnn: Towards real-time object detection with region proposal networks

    Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015. 1, 2, 3, 5

  5. [5]

    DenseBox: Unifying Landmark Localization with End to End Object Detection

    Lichao Huang, Yi Yang, Yafeng Deng, and Yinan Yu. Densebox: Unifying landmark localization with end to end object detection. arXiv preprint arXiv:1509.04874, 2015. 1

  6. [6]

    Adascan: Adaptive scan pooling in deep convolutional neural networks for human action recognition in videos

    Amlan Kar, Nishant Rai, Karan Sikka, and Gaurav Sharma. Adascan: Adaptive scan pooling in deep convolutional neural networks for human action recognition in videos. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3376–3385, 2017. 1

  7. [7]

    Large-scale video classification with convolutional neural networks

    Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014. 1

  8. [8]

    Flow-guided feature aggregation for video object detection

    Xizhou Zhu, Yujie Wang, Jifeng Dai, Lu Yuan, and Yichen Wei. Flow-guided feature aggregation for video object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 408–417, 2017. 1, 2, 3, 8

  9. [9]

    Object detection in video with spatiotemporal sampling networks

    Gedas Bertasius, Lorenzo Torresani, and Jianbo Shi. Object detection in video with spatiotemporal sampling networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 331–346, 2018. 1, 2, 3, 6, 7, 8, 9

  10. [10]

    Fully motion-aware network for video object detection

    Shiyao Wang, Yucong Zhou, Junjie Yan, and Zhidong Deng. Fully motion-aware network for video object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 542–557, 2018. 1, 2, 3, 6, 7, 8

  11. [11]

    Flownet: Learning optical flow with convolutional networks

    Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Hausser, Caner Hazirbas, Vladimir Golkov, Patrick Van Der Smagt, Daniel Cremers, and Thomas Brox. Flownet: Learning optical flow with convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 2758–2766, 2015. 1, 3

  12. [12]

    Deformable convolutional networks

    Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 764–773, 2017. 1

  13. [13]

    Fast r-cnn

    Ross Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448,

  14. [14]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017. 2, 3

  15. [15]

    Rich feature hierarchies for accurate object detection and semantic segmentation

    Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014. 2

  16. [16]

    R-fcn: Object detection via region-based fully convolutional networks

    Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-fcn: Object detection via region-based fully convolutional networks. In Advances in neural information processing systems, pages 379–387, 2016. 2, 5

  17. [17]

    Feature pyramid networks for object detection

    Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017. 2

  18. [18]

    Ssd: Single shot multibox detector

    Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016. 2

  19. [19]

    Focal loss for dense object detection

    Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017. 2

  20. [20]

    Relation networks for object detection

    Han Hu, Jiayuan Gu, Zheng Zhang, Jifeng Dai, and Yichen Wei. Relation networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3588–3597, 2018. 2, 3, 4, 5

  21. [21]

    Object Detection in Videos by High Quality Object Linking

    Peng Tang, Chunyu Wang, Xinggang Wang, Wenyu Liu, Wenjun Zeng, and Jingdong Wang. Object detection in videos by high quality object linking. arXiv preprint arXiv:1801.09823, 2018. 3

  22. [22]

    T-cnn: Tubelets with convolutional neural networks for object detection from videos

    Kai Kang, Hongsheng Li, Junjie Yan, Xingyu Zeng, Bin Yang, Tong Xiao, Cong Zhang, Zhe Wang, Ruohui Wang, Xiaogang Wang, et al. T-cnn: Tubelets with convolutional neural networks for object detection from videos. IEEE Transactions on Circuits and Systems for Video Technology, 28(10):2896–2907, 2018. 3, 6

  23. [23]

    Seq-NMS for Video Object Detection

    Wei Han, Pooya Khorrami, Tom Le Paine, Prajit Ramachandran, Mohammad Babaeizadeh, Honghui Shi, Jianan Li, Shuicheng Yan, and Thomas S Huang. Seq-NMS for video object detection. arXiv preprint arXiv:1602.08465, 2016. 3, 8, 9

  24. [24]

    Deep feature flow for video recognition

    Xizhou Zhu, Yuwen Xiong, Jifeng Dai, Lu Yuan, and Yichen Wei. Deep feature flow for video recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2349–2358, 2017. 3, 6

  25. [25]

    Non-local neural networks

    Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018. 3

  26. [26]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. 3

  27. [27]

    Training region-based object detectors with online hard example mining

    Abhinav Shrivastava, Abhinav Gupta, and Ross Girshick. Training region-based object detectors with online hard example mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 761–769,

  28. [28]

    MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems

    Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015. 6

  29. [29]

    Video object detection with an aligned spatial-temporal memory

    Fanyi Xiao and Yong Jae Lee. Video object detection with an aligned spatial-temporal memory. In Proceedings of the European Conference on Computer Vision (ECCV), pages 485–501, 2018. 8, 9

  30. [30]

    Detect to track and track to detect

    Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman. Detect to track and track to detect. In Proceedings of the IEEE International Conference on Computer Vision, pages 3038–3046, 2017. 8

  31. [31]

    Optimizing video object detection via a scale-time lattice

    Kai Chen, Jiaqi Wang, Shuo Yang, Xingcheng Zhang, Yuanjun Xiong, Chen Change Loy, and Dahua Lin. Optimizing video object detection via a scale-time lattice. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7814–7823, 2018. 8

  32. [32]

    Universal transformers

    Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal transformers. In ICLR, 2019. 9