Object Detection in Video with Spatial-temporal Context Aggregation

Chang Huang; Han Shen; Hao Luo; Lichao Huang; Xinggang Wang; Yuan Li

arxiv: 1907.04988 · v1 · pith:2EHCQAHInew · submitted 2019-07-11 · 💻 cs.CV

Hao Luo, Lichao Huang, Han Shen, Yuan Li, Chang Huang, Xinggang Wang

Pith reviewed 2026-05-24 23:29 UTC · model grok-4.3

classification 💻 cs.CV

keywords: video object detection, feature aggregation, spatio-temporal context, object proposals, ImageNet VID, Faster R-CNN

The pith

Proposal-level modeling of semantic and spatio-temporal relationships lifts video object detection to 80.3% mAP on ImageNet VID.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that estimating feature correspondences across video frames is unreliable because of motion blur and low image quality, so it replaces that step with a proposal-level aggregation step. Each object proposal receives an enhanced feature vector by learning both semantic similarities and spatio-temporal links to other proposals inside the same frame and in nearby frames. The resulting detector improves a single-frame Faster R-CNN baseline by 5.8 points and reaches 80.3% mAP, exceeding prior video methods even when those methods use temporal post-processing.

Core claim

The proposed feature aggregation framework operates on the object proposal-level and learns to enhance each proposal's feature via modeling semantic and spatio-temporal relationships among object proposals from both within a frame and across adjacent frames.

What carries the argument

Proposal-level feature aggregation that models semantic and spatio-temporal relationships among object proposals within and across frames.
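
To make the idea concrete, here is a minimal sketch of attention-style aggregation over proposals. It is an illustration under stated assumptions, not the paper's formulation: the page above only has the abstract to work from, so the scoring rule (dot-product similarity minus hand-set spatial and temporal distance penalties), the function name `aggregate_proposals`, and the parameter `w_time` are all hypothetical. The paper learns these relationships; the sketch hard-codes them so the data flow is visible.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def center(box):
    # box = (x1, y1, x2, y2) in normalized image coordinates.
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def aggregate_proposals(feats, boxes, frames, key_frame, w_time=1.0):
    """Enhance key-frame proposal features with context from all proposals.

    Hypothetical stand-in for a learned relation module: the score between
    proposals i and j is semantic similarity minus fixed distance penalties.
    """
    dim = len(feats[0])
    enhanced, weights = [], []
    for i, f_i in enumerate(feats):
        if frames[i] != key_frame:
            continue  # only key-frame proposals are enhanced
        cx_i, cy_i = center(boxes[i])
        scores = []
        for j, f_j in enumerate(feats):
            # Semantic term: scaled dot product of the two feature vectors.
            semantic = sum(a * b for a, b in zip(f_i, f_j)) / math.sqrt(dim)
            # Spatio-temporal terms: box-center distance and frame gap.
            cx_j, cy_j = center(boxes[j])
            spatial = math.hypot(cx_i - cx_j, cy_i - cy_j)
            temporal = abs(frames[i] - frames[j])
            scores.append(semantic - spatial - w_time * temporal)
        w = softmax(scores)
        # Weighted sum of all proposal features, added back as a residual.
        context = [sum(w[j] * feats[j][d] for j in range(len(feats)))
                   for d in range(dim)]
        enhanced.append([a + c for a, c in zip(f_i, context)])
        weights.append(w)
    return enhanced, weights

# Three identical proposals in frames 0, 1 and 4; frame 0 is the key frame.
feats = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
boxes = [(0.1, 0.1, 0.3, 0.3)] * 3
frames = [0, 1, 4]
enhanced, weights = aggregate_proposals(feats, boxes, frames, key_frame=0)
```

With identical features and boxes, the weights differ only through the frame gap, so the proposal in the nearest frame contributes most. Setting `w_time` very high approximates an intra-frame-only variant, which is one way to probe how much of the gain depends on cross-frame context.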

If this is right

The aggregation step improves a single-frame Faster R-CNN baseline by 5.8 mAP points.
The full method reaches 80.3% mAP on ImageNet VID without bells and whistles.
With no temporal post-processing, the method exceeds the prior state of the art by 1.4 mAP points.
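
The three figures above imply two reference numbers that the page does not state. The short calculation below derives them under two assumptions that the abstract does not confirm: that both gains are absolute mAP points, and that both are measured against the 80.3% result. The variable names are illustrative only.

```python
# Figures quoted from the abstract.
full_map = 80.3            # mAP of the full method on ImageNet VID
gain_over_baseline = 5.8   # gain over single-frame Faster R-CNN
gain_over_prior = 1.4      # gain over prior best without post-processing

# Assumption: gains are absolute points relative to the 80.3 figure.
implied_baseline = round(full_map - gain_over_baseline, 1)
implied_prior = round(full_map - gain_over_prior, 1)

# The same baseline gain expressed as a relative improvement, in percent.
relative_gain = round(100 * gain_over_baseline / implied_baseline, 1)

print(implied_baseline, implied_prior, relative_gain)
```

If the assumptions hold, the single-frame baseline sits near 74.5% mAP and the prior best without post-processing near 78.9% mAP. Both values should be checked against the paper's tables before being cited.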

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The result suggests that explicit modeling of proposal relationships can substitute for precise pixel-level temporal alignment when video quality is low.
The same relationship-modeling idea could be tested on other video tasks such as action recognition or multi-object tracking.
Removing the cross-frame component would isolate how much of the gain comes from intra-frame semantic context alone.

Load-bearing premise

Feature correspondence estimation across frames is fundamentally difficult and unstable because of poor image quality and motion blur.

What would settle it

An experiment in which an accurate feature-correspondence method is substituted for the proposal-level relationship model and still matches or exceeds 80.3% mAP on ImageNet VID.

Figures

Figures reproduced from arXiv: 1907.04988 by Chang Huang, Han Shen, Hao Luo, Lichao Huang, Xinggang Wang, Yuan Li.

**Figure 2.** Proposed spatial-temporal feature aggregation framework. For each input video frame, ResNet-101 is used to extract the feature

**Figure 3.** Detection results in mAP

**Figure 4.** Example detection results of single-frame baseline, MANet [

**Figure 5.** Examples of the dependency (in Eq. (8)) between the target proposals and the candidate proposals from adjacent frames in the last aggregation unit. Red and blue boxes are target proposals in a key frame and candidate proposals in its neighboring frames, respectively. Each row shows one of the target proposals and its corresponding candidate proposals with highest dependency in both key frame and neighbo…

Abstract

Recent cutting-edge feature aggregation paradigms for video object detection rely on inferring feature correspondence. The feature correspondence estimation problem is fundamentally difficult due to poor image quality, motion blur, etc., and the results of feature correspondence estimation are unstable. To avoid the problem, we propose a simple but effective feature aggregation framework which operates on the object proposal-level. It learns to enhance each proposal's feature via modeling semantic and spatio-temporal relationships among object proposals from both within a frame and across adjacent frames. Experiments are carried out on the ImageNet VID dataset. Without any bells and whistles, our method obtains 80.3% mAP on the ImageNet VID dataset, which is superior to the previous state of the art. The proposed feature aggregation mechanism improves the single-frame Faster R-CNN baseline by 5.8% mAP. Besides, under the setting of no temporal post-processing, our method outperforms the previous state of the art by 1.4% mAP.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper shows a proposal-level aggregation method can lift video object detection to 80.3 mAP on ImageNet VID by modeling proposal relationships instead of estimating feature correspondences.

The main point for you is that this work replaces correspondence-based feature aggregation with a simpler proposal-level model that learns semantic and spatio-temporal relations among proposals within and across frames. It reports 80.3% mAP on ImageNet VID, a 5.8 point gain over single-frame Faster R-CNN, and a 1.4 point edge over prior work when no temporal post-processing is used. The abstract frames the shift away from correspondence as the key move because those estimates are unstable under blur and low quality. If the full experiments back this up, it is a practical alternative worth noting.

The paper does a clean job of stating the baseline comparisons and keeping the method description focused on the aggregation step rather than adding extra tricks. The numbers are presented plainly, which makes the empirical claim easy to evaluate on its own terms.

The soft spots are mostly about missing details. Only the abstract is in front of us, so we cannot yet see the exact architecture for the relationship modeling, the ablation breakdowns, or how sensitive the gains are to proposal quality or frame sampling. That leaves open whether the reported lift is robust across different detectors or datasets. The motivation about correspondence difficulty is stated but is not required for the performance numbers to be true; the claim reduces to whether the implemented aggregation actually produces those mAP figures under the stated protocol. No circularity or hidden fitting shows up in the given text.

This paper is for people working on video object detection who are looking for aggregation options that avoid explicit matching. A reader who needs concrete benchmark numbers on ImageNet VID will get direct value from the comparisons. It deserves a serious referee because the results are stated in falsifiable form on a standard dataset and the approach is distinct enough from prior paradigms to merit checking the implementation details.

Referee Report

0 major / 2 minor

Summary. The manuscript proposes a proposal-level feature aggregation framework for video object detection that models semantic and spatio-temporal relationships among object proposals within a frame and across adjacent frames, thereby avoiding explicit feature correspondence estimation. On the ImageNet VID dataset the method reports 80.3% mAP, a 5.8% gain over a single-frame Faster R-CNN baseline and a 1.4% gain over prior state-of-the-art under the no-temporal-post-processing protocol.

Significance. If the reported mAP figures hold under standard evaluation protocols, the work supplies a concrete, comparatively simple alternative to correspondence-based aggregation pipelines and supplies explicit numerical comparisons against both a single-frame baseline and prior video methods.

minor comments (2)

[Abstract] The phrase 'without any bells and whistles' is used to characterize the 80.3% result; the manuscript should explicitly list which components (backbone, training schedule, data augmentation, etc.) are included so that the baseline comparison can be reproduced.
[Abstract] The motivation paragraph asserts that feature correspondence is 'fundamentally difficult' due to motion blur, yet the central empirical claim does not depend on proving this premise; a brief sentence clarifying that the performance numbers stand independently of the motivation would avoid any appearance of circularity.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive review and recommendation of minor revision. The report raises no specific major comments, so our response focuses on acknowledging the overall assessment while confirming we will incorporate any minor suggestions in the revised manuscript.

Circularity Check

0 steps flagged

No significant circularity; empirical performance claims only

The paper reports experimental results on ImageNet VID (80.3% mAP, +5.8% over single-frame Faster R-CNN baseline, +1.4% over prior SOTA without post-processing). No derivation chain, equations, or predictions are present that reduce to inputs by construction. The proposal-level aggregation is a designed architecture whose effectiveness is measured externally via standard benchmarks; the motivation about correspondence instability is stated but is not required for the numerical claims to hold. No self-citation load-bearing steps, fitted inputs renamed as predictions, or ansatz smuggling appear in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no details on model architecture, loss functions, or training hyperparameters are provided, preventing enumeration of free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5701 in / 1086 out tokens · 20883 ms · 2026-05-24T23:29:45.857108+00:00 · methodology

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 4 internal anchors

[1] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. 2009.
[2] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.
[3] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017.
[4] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
[5] Lichao Huang, Yi Yang, Yafeng Deng, and Yinan Yu. Densebox: Unifying landmark localization with end to end object detection. arXiv preprint arXiv:1509.04874, 2015.
[6] Amlan Kar, Nishant Rai, Karan Sikka, and Gaurav Sharma. Adascan: Adaptive scan pooling in deep convolutional neural networks for human action recognition in videos. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3376–3385, 2017.
[7] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014.
[8] Xizhou Zhu, Yujie Wang, Jifeng Dai, Lu Yuan, and Yichen Wei. Flow-guided feature aggregation for video object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 408–417, 2017.
[9] Gedas Bertasius, Lorenzo Torresani, and Jianbo Shi. Object detection in video with spatiotemporal sampling networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 331–346, 2018.
[10] Shiyao Wang, Yucong Zhou, Junjie Yan, and Zhidong Deng. Fully motion-aware network for video object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 542–557, 2018.
[11] Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Hausser, Caner Hazirbas, Vladimir Golkov, Patrick Van Der Smagt, Daniel Cremers, and Thomas Brox. Flownet: Learning optical flow with convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 2758–2766, 2015.
[12] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 764–773, 2017.
[13] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448,
[14] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[15] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014.
[16] Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-fcn: Object detection via region-based fully convolutional networks. In Advances in neural information processing systems, pages 379–387, 2016.
[17] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017.
[18] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016.
[19] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017.
[20] Han Hu, Jiayuan Gu, Zheng Zhang, Jifeng Dai, and Yichen Wei. Relation networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3588–3597, 2018.
[21] Peng Tang, Chunyu Wang, Xinggang Wang, Wenyu Liu, Wenjun Zeng, and Jingdong Wang. Object detection in videos by high quality object linking. arXiv preprint arXiv:1801.09823, 2018.
[22] Kai Kang, Hongsheng Li, Junjie Yan, Xingyu Zeng, Bin Yang, Tong Xiao, Cong Zhang, Zhe Wang, Ruohui Wang, Xiaogang Wang, et al. T-cnn: Tubelets with convolutional neural networks for object detection from videos. IEEE Transactions on Circuits and Systems for Video Technology, 28(10):2896–2907, 2018.
[23] Wei Han, Pooya Khorrami, Tom Le Paine, Prajit Ramachandran, Mohammad Babaeizadeh, Honghui Shi, Jianan Li, Shuicheng Yan, and Thomas S Huang. Seq-NMS for video object detection. arXiv preprint arXiv:1602.08465, 2016.
[24] Xizhou Zhu, Yuwen Xiong, Jifeng Dai, Lu Yuan, and Yichen Wei. Deep feature flow for video recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2349–2358, 2017.
[25] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.
[26] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[27] Abhinav Shrivastava, Abhinav Gupta, and Ross Girshick. Training region-based object detectors with online hard example mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 761–769,
[28] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.
[29] Fanyi Xiao and Yong Jae Lee. Video object detection with an aligned spatial-temporal memory. In Proceedings of the European Conference on Computer Vision (ECCV), pages 485–501, 2018.
[30] Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman. Detect to track and track to detect. In Proceedings of the IEEE International Conference on Computer Vision, pages 3038–3046, 2017.
[31] Kai Chen, Jiaqi Wang, Shuo Yang, Xingcheng Zhang, Yuanjun Xiong, Chen Change Loy, and Dahua Lin. Optimizing video object detection via a scale-time lattice. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7814–7823, 2018.
[32] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal transformers. In ICLR, 2019.
