Object Detection in Video with Spatial-temporal Context Aggregation
Pith reviewed 2026-05-24 23:29 UTC · model grok-4.3
The pith
Proposal-level modeling of semantic and spatio-temporal relationships lifts video object detection to 80.3% mAP on ImageNet VID.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The proposed feature aggregation framework operates on the object proposal-level and learns to enhance each proposal's feature via modeling semantic and spatio-temporal relationships among object proposals from both within a frame and across adjacent frames.
What carries the argument
Proposal-level feature aggregation that models semantic and spatio-temporal relationships among object proposals within and across frames.
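One way to picture this kind of aggregation is an attention-style weighted sum over proposals, sketched below in plain Python. This is an illustrative sketch, not the paper's actual formulation: the function name `aggregate_proposals`, the Gaussian-style distance penalty, and the `sigma_xy` and `sigma_t` values are all assumptions made for the example. The paper learns its relationships, whereas the weights here are hand-written.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def aggregate_proposals(features, centers, frame_ids, sigma_xy=50.0, sigma_t=2.0):
    """Enhance each proposal feature with a weighted sum over all proposals.

    features  : list of feature vectors, one per proposal
    centers   : list of (x, y) box centers in pixels
    frame_ids : list of integer frame indices
    The weight combines semantic similarity (scaled dot product) with a
    penalty on spatial and temporal distance. The penalty form and the
    sigma values are hypothetical choices for this sketch.
    """
    dim = len(features[0])
    enhanced = []
    for i, query in enumerate(features):
        scores = []
        for j, key in enumerate(features):
            semantic = dot(query, key) / math.sqrt(dim)
            dx = centers[i][0] - centers[j][0]
            dy = centers[i][1] - centers[j][1]
            dt = frame_ids[i] - frame_ids[j]
            penalty = (dx * dx + dy * dy) / (2 * sigma_xy ** 2)
            penalty += (dt * dt) / (2 * sigma_t ** 2)
            scores.append(semantic - penalty)
        weights = softmax(scores)
        # Context is a convex combination of all proposal features.
        context = [sum(w * f[k] for w, f in zip(weights, features))
                   for k in range(dim)]
        # Residual connection: keep the original feature and add the context.
        enhanced.append([q + c for q, c in zip(query, context)])
    return enhanced

# Toy input: two nearby, similar proposals in adjacent frames and one distant one.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
centers = [(100, 100), (105, 102), (400, 300)]
frame_ids = [0, 1, 1]
enhanced = aggregate_proposals(features, centers, frame_ids)
```

In this toy input, the first proposal draws almost all of its context from itself and its nearby neighbor in the adjacent frame, while the distant third proposal contributes a negligible weight. No pixel-level alignment between frames is computed at any point, which is the property the summary above emphasizes.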
If this is right
- The aggregation step improves a single-frame Faster R-CNN baseline by 5.8% mAP.
- The full method reaches 80.3% mAP on ImageNet VID without bells and whistles.
- Without temporal post-processing, the method exceeds the prior state of the art by 1.4% mAP.
Where Pith is reading between the lines
- The result suggests that explicit modeling of proposal relationships can substitute for precise pixel-level temporal alignment when video quality is low.
- The same relationship-modeling idea could be tested on other video tasks such as action recognition or multi-object tracking.
- Removing the cross-frame component would isolate how much of the gain comes from intra-frame semantic context alone.
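The ablation in the last point could be set up by masking which proposal pairs are allowed to interact. The sketch below assumes an attention-style aggregation in which each proposal weights the others through a softmax; the helper names `allowed_pairs` and `masked_softmax` are hypothetical and do not come from the paper.

```python
import math

def allowed_pairs(frame_ids, cross_frame=True):
    """Return a boolean matrix; entry [i][j] is True if proposal i may use proposal j.

    With cross_frame=False only proposals from the same frame interact,
    which isolates the intra-frame context.
    """
    n = len(frame_ids)
    return [[cross_frame or frame_ids[i] == frame_ids[j] for j in range(n)]
            for i in range(n)]

def masked_softmax(scores, allowed):
    """Softmax over the allowed entries; disallowed entries get weight 0.0.

    Assumes at least one entry is allowed, which holds here because a
    proposal always shares a frame with itself.
    """
    peak = max(s for s, ok in zip(scores, allowed) if ok)
    exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

# Three proposals: two in frame 0, one in frame 1.
mask = allowed_pairs([0, 0, 1], cross_frame=False)
weights = masked_softmax([1.0, 1.0, 5.0], mask[0])
```

Here the first proposal ignores the high-scoring proposal in the other frame and splits its weight evenly between the two same-frame proposals. Comparing detection accuracy with `cross_frame=True` and `cross_frame=False` would then separate the two sources of gain.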
Load-bearing premise
Feature correspondence estimation across frames is fundamentally difficult and unstable because of poor image quality and motion blur.
What would settle it
An experiment in which an accurate feature-correspondence method is substituted for the proposal-level relationship model and still matches or exceeds 80.3% mAP on ImageNet VID.
Original abstract
Recent cutting-edge feature aggregation paradigms for video object detection rely on inferring feature correspondence. The feature correspondence estimation problem is fundamentally difficult due to poor image quality, motion blur, etc., and the results of feature correspondence estimation are unstable. To avoid the problem, we propose a simple but effective feature aggregation framework which operates on the object proposal-level. It learns to enhance each proposal's feature via modeling semantic and spatio-temporal relationships among object proposals from both within a frame and across adjacent frames. Experiments are carried out on the ImageNet VID dataset. Without any bells and whistles, our method obtains 80.3% mAP on the ImageNet VID dataset, which is superior over the previous state-of-the-arts. The proposed feature aggregation mechanism improves the single frame Faster RCNN baseline by 5.8% mAP. Besides, under the setting of no temporal post-processing, our method outperforms the previous state-of-the-art by 1.4% mAP.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a proposal-level feature aggregation framework for video object detection that models semantic and spatio-temporal relationships among object proposals within a frame and across adjacent frames, thereby avoiding explicit feature correspondence estimation. On the ImageNet VID dataset the method reports 80.3% mAP, a 5.8% gain over a single-frame Faster R-CNN baseline and a 1.4% gain over prior state-of-the-art under the no-temporal-post-processing protocol.
Significance. If the reported mAP figures hold under standard evaluation protocols, the work offers a concrete, comparatively simple alternative to correspondence-based aggregation pipelines, backed by explicit numerical comparisons against both a single-frame baseline and prior video methods.
minor comments (2)
- [Abstract] The phrase 'without any bells and whistles' is used to characterize the 80.3% result; the manuscript should explicitly list which components (backbone, training schedule, data augmentation, etc.) are included so that the baseline comparison can be reproduced.
- [Abstract] The motivation paragraph asserts that feature correspondence is 'fundamentally difficult' due to motion blur, yet the central empirical claim does not depend on proving this premise; a brief sentence clarifying that the performance numbers stand independently of the motivation would avoid any appearance of circularity.
Simulated Author's Rebuttal
We thank the referee for the positive review and recommendation of minor revision. The report raises no specific major comments, so our response focuses on acknowledging the overall assessment while confirming we will incorporate any minor suggestions in the revised manuscript.
Circularity Check
No significant circularity; empirical performance claims only
full rationale
The paper reports experimental results on ImageNet VID (80.3% mAP, +5.8% over single-frame Faster R-CNN baseline, +1.4% over prior SOTA without post-processing). No derivation chain, equations, or predictions are present that reduce to inputs by construction. The proposal-level aggregation is a designed architecture whose effectiveness is measured externally via standard benchmarks; the motivation about correspondence instability is stated but is not required for the numerical claims to hold. No self-citation load-bearing steps, fitted inputs renamed as predictions, or ansatz smuggling appear in the provided text.