BMN: Boundary-Matching Network for Temporal Action Proposal Generation

Errui Ding; Shilei Wen; Tianwei Lin; Xiao Liu; Xin Li

arxiv: 1907.09702 · v1 · pith:VZ6UGOK6new · submitted 2019-07-23 · 💻 cs.CV

Pith reviewed 2026-05-24 17:59 UTC · model grok-4.3

keywords temporal action proposal · boundary matching · video action detection · end-to-end network · confidence map · THUMOS-14 · ActivityNet

The pith

The Boundary-Matching Network produces video action proposals that carry both precise start-end times and reliable ranking scores from a single trained model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tackles the gap in temporal action proposal generation where bottom-up approaches locate exact boundaries yet fail to assign trustworthy scores for ranking proposals. It introduces the Boundary-Matching mechanism that represents each candidate proposal as a paired start and end boundary and assembles all such pairs into one dense confidence map. From this map the Boundary-Matching Network extracts both boundaries and scores in one forward pass through jointly trained branches. A sympathetic reader would expect this to remove the need for separate post-processing stages that previous methods required to make scores usable.

Core claim

The Boundary-Matching mechanism scores densely distributed proposals by denoting each one as a matching pair of starting and ending boundaries and combining all such pairs into a BM confidence map. Built on this mechanism, the Boundary-Matching Network is an end-to-end architecture with two jointly trained branches that generates proposals with precise temporal boundaries and reliable confidence scores simultaneously.

What carries the argument

The Boundary-Matching mechanism, which treats each proposal as a start-end boundary pair and aggregates pairs into a single confidence map that supplies both location and score outputs.
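
As a rough illustration of the idea, the sketch below enumerates every start-end pair on a small grid and reads the top-scoring proposals off a dense map. The `bm_map[d][t]` layout (duration by start), the function name `proposals_from_bm_map`, and the toy values are assumptions made for this example; the abstract does not say how the map is indexed or how proposals are extracted from it.

```python
from typing import List, Tuple

Proposal = Tuple[int, int, float]  # (start step, end step, confidence)


def proposals_from_bm_map(bm_map: List[List[float]], top_k: int = 3) -> List[Proposal]:
    """Return the top_k proposals from a dense start-end confidence map.

    Assumed layout (not taken from the paper): bm_map[d][t] is the
    confidence of the proposal that starts at step t and lasts d + 1
    steps, so its end step is t + d + 1.
    """
    num_steps = len(bm_map[0])
    candidates: List[Proposal] = []
    for d, row in enumerate(bm_map):
        for t, score in enumerate(row):
            end = t + d + 1
            if end <= num_steps:  # skip pairs that run past the video
                candidates.append((t, end, score))
    # Rank every valid start-end pair by its map entry.
    candidates.sort(key=lambda p: p[2], reverse=True)
    return candidates[:top_k]


# Toy map with 4 time steps and durations 1..3 (values are made up).
toy_map = [
    [0.1, 0.2, 0.1, 0.3],  # duration 1
    [0.2, 0.9, 0.4, 0.0],  # duration 2
    [0.3, 0.5, 0.0, 0.0],  # duration 3
]
top = proposals_from_bm_map(toy_map, top_k=2)
print(top)
```

The point of the dense layout is that location and score come from the same lookup: the index of an entry gives the boundaries, and its value gives the ranking score.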

If this is right

BMN supplies both boundary locations and usable scores in one network, removing the separate ranking stage required by earlier bottom-up generators.
Joint training of the two branches lets boundary precision and score reliability improve together rather than in isolation.
The same trained model reaches higher proposal quality on both THUMOS-14 and ActivityNet-1.3 while remaining computationally light.
When the proposals are fed to an existing action classifier, the overall temporal action detection pipeline reaches state-of-the-art performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The confidence map format could be reused for other dense sequence prediction problems where paired endpoints need scoring.
Embedding the BM module inside an end-to-end detector might collapse the traditional proposal-then-classify pipeline into one trainable system.
Because the method already reports strong cross-dataset results, it is reasonable to test whether the same architecture transfers to longer untrimmed streams without retraining.

Load-bearing premise

That treating start and end boundaries as matched pairs on a dense grid produces confidence values reliable enough to retrieve good proposals without any later adjustment steps.

What would settle it

A side-by-side retrieval experiment on THUMOS-14 or ActivityNet-1.3 in which the average precision of BMN proposals falls below that of the strongest prior bottom-up method when both are given identical boundary candidates.
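
A minimal sketch of how such a side-by-side comparison could be scored, assuming temporal IoU matching and recall at a fixed proposal budget as a simpler stand-in for the retrieval metric named above. The helper names, the 0.5 threshold, and every number are hypothetical; both toy methods share identical boundary candidates and differ only in their scores.

```python
from typing import List, Tuple

Segment = Tuple[float, float]
Scored = Tuple[float, float, float]  # (start, end, score)


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two 1-D time segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_n(scored: List[Scored], ground_truth: List[Segment],
                n: int, tiou_threshold: float = 0.5) -> float:
    """Fraction of ground-truth segments matched by the n best-scored proposals."""
    ranked = sorted(scored, key=lambda p: p[2], reverse=True)[:n]
    hits = sum(
        any(temporal_iou((s, e), gt) >= tiou_threshold for s, e, _ in ranked)
        for gt in ground_truth
    )
    return hits / len(ground_truth) if ground_truth else 0.0


ground_truth = [(2.0, 6.0), (10.0, 14.0)]
# Same three boundary candidates, two different scorings (made-up values).
method_a = [(2.0, 6.5, 0.9), (9.5, 14.0, 0.8), (0.0, 1.0, 0.1)]
method_b = [(2.0, 6.5, 0.2), (9.5, 14.0, 0.3), (0.0, 1.0, 0.9)]

recall_a = recall_at_n(method_a, ground_truth, n=2)
recall_b = recall_at_n(method_b, ground_truth, n=2)
print(recall_a, recall_b)
```

Holding the candidates fixed isolates the contribution of the scores, which is the part of the claim the experiment is meant to test.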

Figures

Figures reproduced from arXiv: 1907.09702 by Errui Ding, Shilei Wen, Tianwei Lin, Xiao Liu, Xin Li.

**Figure 2.** Illustration of BM confidence map. Proposals in the …

**Figure 4.** The framework of Boundary-Matching Network. After …

**Figure 5.** Ablation comparison between BSN and BMN in terms …

**Figure 6.** Visualization examples of proposals and BM map gen…

Original abstract

Temporal action proposal generation is an challenging and promising task which aims to locate temporal regions in real-world videos where action or event may occur. Current bottom-up proposal generation methods can generate proposals with precise boundary, but cannot efficiently generate adequately reliable confidence scores for retrieving proposals. To address these difficulties, we introduce the Boundary-Matching (BM) mechanism to evaluate confidence scores of densely distributed proposals, which denote a proposal as a matching pair of starting and ending boundaries and combine all densely distributed BM pairs into the BM confidence map. Based on BM mechanism, we propose an effective, efficient and end-to-end proposal generation method, named Boundary-Matching Network (BMN), which generates proposals with precise temporal boundaries as well as reliable confidence scores simultaneously. The two-branches of BMN are jointly trained in an unified framework. We conduct experiments on two challenging datasets: THUMOS-14 and ActivityNet-1.3, where BMN shows significant performance improvement with remarkable efficiency and generalizability. Further, combining with existing action classifier, BMN can achieve state-of-the-art temporal action detection performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

BMN adds a boundary-matching map to produce proposals and confidence scores in one network, but the abstract gives almost no evidence that the scores hold up when boundaries have typical localization error.

The new piece is the Boundary-Matching mechanism that pairs densely sampled start and end boundaries into a 2D map, then extracts proposals from it while a second branch handles regression. Joint training of the two branches is meant to give both precise boundaries and usable scores without separate post-processing steps. That setup is distinct from the bottom-up methods cited in the abstract and looks like a practical engineering move for video pipelines that need proposals for detection or retrieval. The reported gains on THUMOS-14 and ActivityNet-1.3 plus the efficiency claim are the concrete results offered so far.

The soft spot is exactly the one flagged in the stress-test note: the map only produces reliable ranking scores if the boundary predictions are already accurate enough that near-miss pairs do not get inflated or deflated values. The abstract supplies no equations for the matching operation, no ablation on how the map behaves under localization noise, and no error bars or data-selection details, so it is impossible to tell whether the unified training actually fixes the correlation problem or whether the method still relies on later adjustments. From the abstract alone the soundness rating is low for that reason.

This paper is for people already working on temporal action proposals who want to see a concrete alternative to separate boundary and scoring stages. It is worth sending to peer review because the core mechanism is new enough and the datasets are standard, even though the current description needs the full implementation and controls before the reliability claim can be assessed.

Referee Report

3 major / 2 minor

Summary. The paper introduces the Boundary-Matching (BM) mechanism, which represents proposals as pairs of densely sampled start and end boundaries and aggregates them into a 2D BM confidence map to assign reliable scores. Building on this, it proposes the end-to-end Boundary-Matching Network (BMN) whose two branches (boundary regression and matching) are jointly trained to output both precise temporal boundaries and confidence scores simultaneously. Experiments on THUMOS-14 and ActivityNet-1.3 are reported to show performance gains over prior bottom-up methods, with further gains in temporal action detection when combined with an action classifier.

Significance. If the central claim holds—that the BM map produces retrieval-reliable scores directly from the joint boundary predictions without post-hoc adjustment—it would meaningfully advance bottom-up proposal generation by removing a common two-stage pipeline and improving efficiency. The unified training and reported gains on two standard benchmarks would position BMN as a practical component for action detection systems.

major comments (3)

§3 (BM mechanism): the claim that the BM confidence map yields 'reliable confidence scores' for retrieval rests on the untested assumption that boundary-regression errors do not systematically inflate or deflate near-miss pair entries; no propagation analysis or correlation study between boundary localization error and map quality is supplied, which is load-bearing for the 'simultaneously' assertion.
§4 (network architecture and training): the two-branch joint training is presented as guaranteeing that the matching operation produces scores correlated with proposal quality, yet no ablation isolates the effect of boundary-branch accuracy on final ranking metrics (e.g., AR@AN curves with vs. without the matching branch), leaving the superiority over prior bottom-up methods unverified.
§5 (experiments on THUMOS-14): reported improvements lack error bars, data-selection criteria, and explicit comparison of score reliability (e.g., precision-recall of the BM map itself) against methods that apply post-hoc score adjustment, weakening the claim that BMN is superior without such adjustments.

minor comments (2)

Abstract: 'an challenging' should read 'a challenging'; 'where action or event may occur' should read 'where an action or event may occur'.
§3: Notation for the BM map construction is introduced without an explicit equation linking the start/end probability tensors to the final 2D map entries; a single equation would improve clarity.
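
To make the second minor comment concrete, here is one form such a linking equation could take. It is an illustration only: the symbols $p^{s}$ and $p^{e}$ (start and end probability sequences), the map $M$ indexed by duration $d$ and start $t_s$, and the product used to fuse them are all assumptions introduced here, not notation or a formula taken from the paper.

```latex
% Hypothetical notation, for illustration only.
% M_{d, t_s}: map entry for the proposal starting at t_s with duration d.
\[
  t_e = t_s + d, \qquad M_{d,\,t_s} \in [0, 1]
  \quad \text{is the confidence of the proposal } [t_s,\, t_e].
\]
% One possible way to combine boundary probabilities with the map entry:
\[
  \mathrm{score}(t_s, t_e) \;=\; p^{s}_{t_s}\; p^{e}_{t_e}\; M_{t_e - t_s,\; t_s}.
\]
```

Whatever the actual form, an equation of this kind would show at a glance whether a boundary error in $t_s$ or $t_e$ shifts only the boundary terms or also selects a different map entry.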

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, indicating where revisions will be made to strengthen the manuscript.

Referee: §3 (BM mechanism): the claim that the BM confidence map yields 'reliable confidence scores' for retrieval rests on the untested assumption that boundary-regression errors do not systematically inflate or deflate near-miss pair entries; no propagation analysis or correlation study between boundary localization error and map quality is supplied, which is load-bearing for the 'simultaneously' assertion.

Authors: We agree that a dedicated propagation analysis would provide additional support. The joint training objective is designed such that the matching branch learns to produce scores robust to the boundary predictions from the regression branch. In the revision we will add a correlation analysis between boundary localization error and BM map entry quality on a held-out validation set.
Revision: yes

Referee: §4 (network architecture and training): the two-branch joint training is presented as guaranteeing that the matching operation produces scores correlated with proposal quality, yet no ablation isolates the effect of boundary-branch accuracy on final ranking metrics (e.g., AR@AN curves with vs. without the matching branch), leaving the superiority over prior bottom-up methods unverified.

Authors: An ablation isolating the contribution of boundary accuracy to the final ranking would indeed clarify the benefit of joint training. We will include additional experiments in the revision that compare AR@AN when the matching branch receives ground-truth boundaries versus predicted boundaries, as well as a variant trained without the boundary regression loss.
Revision: yes

Referee: §5 (experiments on THUMOS-14): reported improvements lack error bars, data-selection criteria, and explicit comparison of score reliability (e.g., precision-recall of the BM map itself) against methods that apply post-hoc score adjustment, weakening the claim that BMN is superior without such adjustments.

Authors: Data selection follows the standard THUMOS-14 protocol used by prior work. We will add error bars computed over multiple random seeds. Direct precision-recall evaluation of the BM map itself is not the primary metric in the proposal-generation literature; our claims rest on the standard AR@AN and AUC metrics, which already show gains without post-hoc adjustment. We will clarify this distinction and report the requested error bars.
Revision: partial
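
The correlation analysis promised in the first response could be prototyped along these lines. This is a sketch under stated assumptions: it uses a plain Pearson coefficient, the function name `pearson` and both input lists are hypothetical, and the numbers are made up to show the mechanics rather than any measured result.

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-proposal measurements on a validation set:
# boundary localization error (in time steps) and the matching map score.
boundary_error = [0.0, 0.5, 1.0, 1.5, 2.0]
map_score = [0.9, 0.8, 0.7, 0.6, 0.5]

r = pearson(boundary_error, map_score)
print(f"correlation between localization error and map score: {r:.3f}")
```

A strongly negative coefficient would mean scores fall as boundaries drift, which is the behavior the referee asks the authors to demonstrate; a coefficient near zero would support the concern that near-miss pairs are scored unreliably.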

Circularity Check

0 steps flagged

No circularity: BM mechanism and BMN are introduced as novel constructs

The paper defines the Boundary-Matching mechanism explicitly as a new pairing of start/end boundaries into a 2D map and builds BMN as an end-to-end network with two jointly trained branches. No equations, fitted parameters, or predictions are shown to reduce by construction to prior inputs; the central claim is the architectural proposal itself rather than a derived quantity forced by self-citation or re-labeling of existing fits. The provided text contains no load-bearing self-citations or ansatzes imported from prior author work that would collapse the result.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no information on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5726 in / 917 out tokens · 26944 ms · 2026-05-24T17:59:41.121431+00:00 · methodology

