BMN: Boundary-Matching Network for Temporal Action Proposal Generation
Pith reviewed 2026-05-24 17:59 UTC · model grok-4.3
The pith
The Boundary-Matching Network produces video action proposals that carry both precise start-end times and reliable ranking scores from a single trained model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Boundary-Matching (BM) mechanism scores densely distributed proposals by representing each one as a matched pair of starting and ending boundaries and collecting all such pairs into a BM confidence map. Built on this mechanism, the Boundary-Matching Network (BMN) generates proposals with precise temporal boundaries and reliable confidence scores simultaneously, using an end-to-end architecture whose two branches are trained jointly.
What carries the argument
The Boundary-Matching mechanism, which treats each proposal as a start-end boundary pair and aggregates pairs into a single confidence map that supplies both location and score outputs.
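To make the start-end pairing concrete, here is a minimal sketch of how a confidence map could be read back into ranked proposals. It is an illustration under stated assumptions, not the paper's implementation: the function name `bm_map_to_proposals`, the `(duration, start)` array layout, and the choice to multiply the map entry by the start and end probabilities are all hypothetical.

```python
import numpy as np

def bm_map_to_proposals(bm_map, start_prob, end_prob, top_k=5):
    """Turn a (D, T) confidence map into ranked (start, end, score) proposals.

    Assumed layout for this sketch: bm_map[d, s] is the confidence of the
    proposal that starts at snippet s and lasts d + 1 snippets.
    """
    num_durations, num_snippets = bm_map.shape
    proposals = []
    for d in range(num_durations):
        for s in range(num_snippets):
            e = s + d + 1  # exclusive end index of the proposal
            if e > num_snippets:
                continue  # this start-end pair would run past the video
            # Assumed fusion rule: boundary probabilities times map confidence.
            score = start_prob[s] * end_prob[e - 1] * bm_map[d, s]
            proposals.append((s, e, float(score)))
    # Highest-scoring proposals first.
    proposals.sort(key=lambda p: p[2], reverse=True)
    return proposals[:top_k]
```

Under this layout a single array supplies both the location of a proposal (its row and column) and its score (the entry value), which is the property the summary above relies on.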
If this is right
- BMN supplies both boundary locations and usable scores in one network, removing the separate ranking stage required by earlier bottom-up generators.
- Joint training of the two branches lets boundary precision and score reliability improve together rather than in isolation.
- The same trained model reaches higher proposal quality on both THUMOS-14 and ActivityNet-1.3 while remaining computationally light.
- When the proposals are fed to an existing action classifier, the overall temporal action detection pipeline reaches state-of-the-art performance.
Where Pith is reading between the lines
- The confidence map format could be reused for other dense sequence prediction problems where paired endpoints need scoring.
- Embedding the BM module inside an end-to-end detector might collapse the traditional proposal-then-classify pipeline into one trainable system.
- Because the paper reports strong results on two datasets and claims generalizability, it is reasonable to test whether the same architecture transfers to longer untrimmed streams without retraining.
Load-bearing premise
That treating start and end boundaries as matched pairs on a dense grid produces confidence values reliable enough to retrieve good proposals without any later adjustment steps.
What would settle it
A side-by-side retrieval experiment on THUMOS-14 or ActivityNet-1.3 in which the average precision of BMN proposals falls below that of the strongest prior bottom-up method when both are given identical boundary candidates.
Original abstract
Temporal action proposal generation is an challenging and promising task which aims to locate temporal regions in real-world videos where action or event may occur. Current bottom-up proposal generation methods can generate proposals with precise boundary, but cannot efficiently generate adequately reliable confidence scores for retrieving proposals. To address these difficulties, we introduce the Boundary-Matching (BM) mechanism to evaluate confidence scores of densely distributed proposals, which denote a proposal as a matching pair of starting and ending boundaries and combine all densely distributed BM pairs into the BM confidence map. Based on BM mechanism, we propose an effective, efficient and end-to-end proposal generation method, named Boundary-Matching Network (BMN), which generates proposals with precise temporal boundaries as well as reliable confidence scores simultaneously. The two-branches of BMN are jointly trained in an unified framework. We conduct experiments on two challenging datasets: THUMOS-14 and ActivityNet-1.3, where BMN shows significant performance improvement with remarkable efficiency and generalizability. Further, combining with existing action classifier, BMN can achieve state-of-the-art temporal action detection performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Boundary-Matching (BM) mechanism, which represents proposals as pairs of densely sampled start and end boundaries and aggregates them into a 2D BM confidence map to assign reliable scores. Building on this, it proposes the end-to-end Boundary-Matching Network (BMN) whose two branches (boundary regression and matching) are jointly trained to output both precise temporal boundaries and confidence scores simultaneously. Experiments on THUMOS-14 and ActivityNet-1.3 are reported to show performance gains over prior bottom-up methods, with further gains in temporal action detection when combined with an action classifier.
Significance. If the central claim holds—that the BM map produces retrieval-reliable scores directly from the joint boundary predictions without post-hoc adjustment—it would meaningfully advance bottom-up proposal generation by removing a common two-stage pipeline and improving efficiency. The unified training and reported gains on two standard benchmarks would position BMN as a practical component for action detection systems.
major comments (3)
- [§3] §3 (BM mechanism): the claim that the BM confidence map yields 'reliable confidence scores' for retrieval rests on the untested assumption that boundary-regression errors do not systematically inflate or deflate near-miss pair entries; no propagation analysis or correlation study between boundary localization error and map quality is supplied, which is load-bearing for the 'simultaneously' assertion.
- [§4] §4 (network architecture and training): the two-branch joint training is presented as guaranteeing that the matching operation produces scores correlated with proposal quality, yet no ablation isolates the effect of boundary-branch accuracy on final ranking metrics (e.g., AR@AN curves with vs. without the matching branch), leaving the superiority over prior bottom-up methods unverified.
- [§5] §5 (experiments on THUMOS-14): reported improvements lack error bars, data-selection criteria, and explicit comparison of score reliability (e.g., precision-recall of the BM map itself) against methods that apply post-hoc score adjustment, weakening the claim that BMN is superior without such adjustments.
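The AR@AN metric named in the comments above can be sketched for a single video as follows. This is a simplified illustration, not the benchmark's official evaluation code: the function names are hypothetical, ground truth is assumed non-empty, and a full evaluation would average over all videos in the dataset.

```python
def temporal_iou(proposal, gt):
    """Intersection over union of two (start, end) segments."""
    inter = max(0.0, min(proposal[1], gt[1]) - max(proposal[0], gt[0]))
    union = (proposal[1] - proposal[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def average_recall_at_n(ranked_proposals, ground_truths, n, thresholds):
    """Recall of the top-n proposals, averaged over tIoU thresholds.

    ranked_proposals must already be sorted by descending confidence, so the
    result depends directly on how reliable the ranking scores are.
    """
    top = ranked_proposals[:n]
    recalls = []
    for thr in thresholds:
        # A ground-truth instance is recalled if any kept proposal overlaps it enough.
        hits = sum(
            any(temporal_iou(p, g) >= thr for p in top) for g in ground_truths
        )
        recalls.append(hits / len(ground_truths))
    return sum(recalls) / len(recalls)
```

Because only the top `n` proposals are kept, a generator with accurate boundaries but poorly ordered scores loses recall at small `n`, which is why the comments tie ranking quality to this curve.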
minor comments (2)
- [Abstract] Abstract: 'an challenging' should read 'a challenging'; 'where action or event may occur' should read 'where an action or event may occur'.
- [§3] Notation for the BM map construction is introduced without an explicit equation linking the start/end probability tensors to the final 2D map entries; a single equation would improve clarity.
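As an illustration of the kind of equation the last comment asks for, one plausible form is sketched below. The notation is an assumption made for this sketch and is not taken from the paper: $p^{s}_{j}$ and $p^{e}_{j}$ denote start and end probabilities at temporal position $j$, $M_C \in [0,1]^{D \times T}$ is the confidence map with rows indexing duration and columns indexing start position, and the product rule for combining them is hypothetical.

```latex
\[
  \varphi_{i,j} = \bigl(t_s,\, t_e\bigr) = \bigl(j,\; j + i\bigr),
  \qquad
  \operatorname{score}(\varphi_{i,j}) = p^{s}_{j} \cdot p^{e}_{j+i} \cdot M_C(i, j),
  \qquad
  1 \le i \le D,\quad 1 \le j \le T,\quad j + i \le T .
\]
```

Written this way, each map entry corresponds to exactly one start-end pair, and entries whose end position would exceed $T$ are simply excluded.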
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below, indicating where revisions will be made to strengthen the manuscript.
Point-by-point responses
- Referee: [§3] §3 (BM mechanism): the claim that the BM confidence map yields 'reliable confidence scores' for retrieval rests on the untested assumption that boundary-regression errors do not systematically inflate or deflate near-miss pair entries; no propagation analysis or correlation study between boundary localization error and map quality is supplied, which is load-bearing for the 'simultaneously' assertion.
Authors: We agree that a dedicated propagation analysis would provide additional support. The joint training objective is designed such that the matching branch learns to produce scores robust to the boundary predictions from the regression branch. In the revision we will add a correlation analysis between boundary localization error and BM map entry quality on a held-out validation set. revision: yes
- Referee: [§4] §4 (network architecture and training): the two-branch joint training is presented as guaranteeing that the matching operation produces scores correlated with proposal quality, yet no ablation isolates the effect of boundary-branch accuracy on final ranking metrics (e.g., AR@AN curves with vs. without the matching branch), leaving the superiority over prior bottom-up methods unverified.
Authors: An ablation isolating the contribution of boundary accuracy to the final ranking would indeed clarify the benefit of joint training. We will include additional experiments in the revision that compare AR@AN when the matching branch receives ground-truth boundaries versus predicted boundaries, as well as a variant trained without the boundary regression loss. revision: yes
- Referee: [§5] §5 (experiments on THUMOS-14): reported improvements lack error bars, data-selection criteria, and explicit comparison of score reliability (e.g., precision-recall of the BM map itself) against methods that apply post-hoc score adjustment, weakening the claim that BMN is superior without such adjustments.
Authors: Data selection follows the standard THUMOS-14 protocol used by prior work. We will add error bars computed over multiple random seeds. Direct precision-recall evaluation of the BM map itself is not the primary metric in the proposal-generation literature; our claims rest on the standard AR@AN and AUC metrics, which already show gains without post-hoc adjustment. We will clarify this distinction and report the requested error bars. revision: partial
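The correlation analysis promised in the first response could take roughly the following shape. This is a sketch under assumptions, not the authors' procedure: the function name is hypothetical, each proposal is assumed to be already matched to one ground-truth instance, and Pearson correlation is one choice among several.

```python
import numpy as np

def boundary_error_score_correlation(pred, gt, scores):
    """Pearson correlation between boundary error and confidence score.

    pred, gt: arrays of shape (N, 2) holding (start, end) times for N
    proposals and for the ground-truth instance each one is matched to.
    scores: array of shape (N,) with the confidence of each proposal.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    scores = np.asarray(scores, dtype=float)
    # Mean absolute offset of the start and end boundaries, per proposal.
    error = np.abs(pred - gt).mean(axis=1)
    return float(np.corrcoef(error, scores)[0, 1])
```

A strongly negative value would indicate that scores fall as boundaries drift from the ground truth, which is the behaviour the referee asks to see demonstrated. A value near zero would suggest the scores are not tracking localization quality.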
Circularity Check
No circularity: BM mechanism and BMN are introduced as novel constructs
The paper defines the Boundary-Matching mechanism explicitly as a new pairing of start/end boundaries into a 2D map and builds BMN as an end-to-end network with two jointly trained branches. No equations, fitted parameters, or predictions are shown to reduce by construction to prior inputs; the central claim is the architectural proposal itself rather than a derived quantity forced by self-citation or re-labeling of existing fits. The provided text contains no load-bearing self-citations or ansatzes imported from prior author work that would collapse the result.