pith. sign in

arxiv: 1907.09702 · v1 · pith:VZ6UGOK6new · submitted 2019-07-23 · 💻 cs.CV

BMN: Boundary-Matching Network for Temporal Action Proposal Generation

Pith reviewed 2026-05-24 17:59 UTC · model grok-4.3

classification 💻 cs.CV
keywords temporal action proposalboundary matchingvideo action detectionend-to-end networkconfidence mapTHUMOS-14ActivityNet
0
0 comments X

The pith

The Boundary-Matching Network produces video action proposals that carry both precise start-end times and reliable ranking scores from a single trained model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tackles the gap in temporal action proposal generation where bottom-up approaches locate exact boundaries yet fail to assign trustworthy scores for ranking proposals. It introduces the Boundary-Matching mechanism that represents each candidate proposal as a paired start and end boundary and assembles all such pairs into one dense confidence map. From this map the Boundary-Matching Network extracts both boundaries and scores in one forward pass through jointly trained branches. A sympathetic reader would expect this to remove the need for separate post-processing stages that previous methods required to make scores usable.

Core claim

Based on the Boundary-Matching mechanism, which evaluates confidence scores of densely distributed proposals by denoting each proposal as a matching pair of starting and ending boundaries and combining all such pairs into a BM confidence map, the Boundary-Matching Network generates proposals with precise temporal boundaries as well as reliable confidence scores simultaneously through an end-to-end two-branch architecture trained jointly.

What carries the argument

The Boundary-Matching mechanism, which treats each proposal as a start-end boundary pair and aggregates pairs into a single confidence map that supplies both location and score outputs.

If this is right

  • BMN supplies both boundary locations and usable scores in one network, removing the separate ranking stage required by earlier bottom-up generators.
  • Joint training of the two branches lets boundary precision and score reliability improve together rather than in isolation.
  • The same trained model reaches higher proposal quality on both THUMOS-14 and ActivityNet-1.3 while remaining computationally light.
  • When the proposals are fed to an existing action classifier the overall temporal action detection pipeline reaches state-of-the-art numbers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The confidence map format could be reused for other dense sequence prediction problems where paired endpoints need scoring.
  • Embedding the BM module inside an end-to-end detector might collapse the traditional proposal-then-classify pipeline into one trainable system.
  • Because the method already reports strong cross-dataset results, it is reasonable to test whether the same architecture transfers to longer untrimmed streams without retraining.

Load-bearing premise

That treating start and end boundaries as matched pairs on a dense grid produces confidence values reliable enough to retrieve good proposals without any later adjustment steps.

What would settle it

A side-by-side retrieval experiment on THUMOS-14 or ActivityNet-1.3 in which the average precision of BMN proposals falls below that of the strongest prior bottom-up method when both are given identical boundary candidates.

Figures

Figures reproduced from arXiv: 1907.09702 by Errui Ding, Shilei Wen, Tianwei Lin, Xiao Liu, Xin Li.

Figure 1
Figure 1. Figure 1: Overview of our method. Given an untrimmed video, [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Illustration of BM confidence map. Proposals in the [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 4
Figure 4. Figure 4: The framework of Boundary-Matching Network. After [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Ablation comparison between BSN and BMN in terms [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Visualization examples of proposals and BM map gen [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
read the original abstract

Temporal action proposal generation is an challenging and promising task which aims to locate temporal regions in real-world videos where action or event may occur. Current bottom-up proposal generation methods can generate proposals with precise boundary, but cannot efficiently generate adequately reliable confidence scores for retrieving proposals. To address these difficulties, we introduce the Boundary-Matching (BM) mechanism to evaluate confidence scores of densely distributed proposals, which denote a proposal as a matching pair of starting and ending boundaries and combine all densely distributed BM pairs into the BM confidence map. Based on BM mechanism, we propose an effective, efficient and end-to-end proposal generation method, named Boundary-Matching Network (BMN), which generates proposals with precise temporal boundaries as well as reliable confidence scores simultaneously. The two-branches of BMN are jointly trained in an unified framework. We conduct experiments on two challenging datasets: THUMOS-14 and ActivityNet-1.3, where BMN shows significant performance improvement with remarkable efficiency and generalizability. Further, combining with existing action classifier, BMN can achieve state-of-the-art temporal action detection performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces the Boundary-Matching (BM) mechanism, which represents proposals as pairs of densely sampled start and end boundaries and aggregates them into a 2D BM confidence map to assign reliable scores. Building on this, it proposes the end-to-end Boundary-Matching Network (BMN) whose two branches (boundary regression and matching) are jointly trained to output both precise temporal boundaries and confidence scores simultaneously. Experiments on THUMOS-14 and ActivityNet-1.3 are reported to show performance gains over prior bottom-up methods, with further gains in temporal action detection when combined with an action classifier.

Significance. If the central claim holds—that the BM map produces retrieval-reliable scores directly from the joint boundary predictions without post-hoc adjustment—it would meaningfully advance bottom-up proposal generation by removing a common two-stage pipeline and improving efficiency. The unified training and reported gains on two standard benchmarks would position BMN as a practical component for action detection systems.

major comments (3)
  1. [§3] §3 (BM mechanism): the claim that the BM confidence map yields 'reliable confidence scores' for retrieval rests on the untested assumption that boundary-regression errors do not systematically inflate or deflate near-miss pair entries; no propagation analysis or correlation study between boundary localization error and map quality is supplied, which is load-bearing for the 'simultaneously' assertion.
  2. [§4] §4 (network architecture and training): the two-branch joint training is presented as guaranteeing that the matching operation produces scores correlated with proposal quality, yet no ablation isolates the effect of boundary-branch accuracy on final ranking metrics (e.g., AR@AN curves with vs. without the matching branch), leaving the superiority over prior bottom-up methods unverified.
  3. [§5] §5 (experiments on THUMOS-14): reported improvements lack error bars, data-selection criteria, and explicit comparison of score reliability (e.g., precision-recall of the BM map itself) against methods that apply post-hoc score adjustment, weakening the claim that BMN is superior without such adjustments.
minor comments (2)
  1. [Abstract] Abstract: 'an challenging' should read 'a challenging'; 'where action or event may occur' should read 'where an action or event may occur'.
  2. [§3] Notation for the BM map construction is introduced without an explicit equation linking the start/end probability tensors to the final 2D map entries; a single equation would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, indicating where revisions will be made to strengthen the manuscript.

read point-by-point responses
  1. Referee: [§3] §3 (BM mechanism): the claim that the BM confidence map yields 'reliable confidence scores' for retrieval rests on the untested assumption that boundary-regression errors do not systematically inflate or deflate near-miss pair entries; no propagation analysis or correlation study between boundary localization error and map quality is supplied, which is load-bearing for the 'simultaneously' assertion.

    Authors: We agree that a dedicated propagation analysis would provide additional support. The joint training objective is designed such that the matching branch learns to produce scores robust to the boundary predictions from the regression branch. In the revision we will add a correlation analysis between boundary localization error and BM map entry quality on a held-out validation set. revision: yes

  2. Referee: [§4] §4 (network architecture and training): the two-branch joint training is presented as guaranteeing that the matching operation produces scores correlated with proposal quality, yet no ablation isolates the effect of boundary-branch accuracy on final ranking metrics (e.g., AR@AN curves with vs. without the matching branch), leaving the superiority over prior bottom-up methods unverified.

    Authors: An ablation isolating the contribution of boundary accuracy to the final ranking would indeed clarify the benefit of joint training. We will include additional experiments in the revision that compare AR@AN when the matching branch receives ground-truth boundaries versus predicted boundaries, as well as a variant trained without the boundary regression loss. revision: yes

  3. Referee: [§5] §5 (experiments on THUMOS-14): reported improvements lack error bars, data-selection criteria, and explicit comparison of score reliability (e.g., precision-recall of the BM map itself) against methods that apply post-hoc score adjustment, weakening the claim that BMN is superior without such adjustments.

    Authors: Data selection follows the standard THUMOS-14 protocol used by prior work. We will add error bars computed over multiple random seeds. Direct precision-recall evaluation of the BM map itself is not the primary metric in the proposal-generation literature; our claims rest on the standard AR@AN and AUC metrics, which already show gains without post-hoc adjustment. We will clarify this distinction and report the requested error bars. revision: partial

Circularity Check

0 steps flagged

No circularity: BM mechanism and BMN are introduced as novel constructs

full rationale

The paper defines the Boundary-Matching mechanism explicitly as a new pairing of start/end boundaries into a 2D map and builds BMN as an end-to-end network with two jointly trained branches. No equations, fitted parameters, or predictions are shown to reduce by construction to prior inputs; the central claim is the architectural proposal itself rather than a derived quantity forced by self-citation or re-labeling of existing fits. The provided text contains no load-bearing self-citations or ansatzes imported from prior author work that would collapse the result.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no information on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5726 in / 917 out tokens · 26944 ms · 2026-05-24T17:59:41.121431+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 6 internal anchors

  1. [1]

    Bodla, B

    N. Bodla, B. Singh, R. Chellappa, and L. S. Davis. Soft- nmsimproving object detection with one line of code. In Computer Vision (ICCV), 2017 IEEE International Confer- ence on, pages 5562–5570. IEEE, 2017. 6

  2. [2]

    S. Buch, V . Escorcia, B. Ghanem, L. Fei-Fei, and J. C. Niebles. End-to-end, single-stream temporal action detection in untrimmed videos. In Proceedings of the British Machine Vision Conference, 2017. 2

  3. [3]

    S. Buch, V . Escorcia, C. Shen, B. Ghanem, and J. C. Niebles. Sst: Single-stream temporal action proposals. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6373–6382. IEEE, 2017. 1, 2, 3, 6, 8

  4. [4]

    Caba Heilbron, J

    F. Caba Heilbron, J. Carlos Niebles, and B. Ghanem. Fast temporal activity proposals for efficient detection of human actions in untrimmed videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages 1914–1923, 2016. 1, 2

  5. [5]

    Caba Heilbron, V

    F. Caba Heilbron, V . Escorcia, B. Ghanem, and J. Car- los Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages 961–970, 2015. 1, 6

  6. [6]

    Chang and Y .-S

    J.-R. Chang and Y .-S. Chen. Pyramid stereo matching net- work. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5410–5418, 2018. 2

  7. [7]

    X. Dai, B. Singh, G. Zhang, L. S. Davis, and Y . Q. Chen. Temporal context network for activity localization in videos. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 5727–5736. IEEE, 2017. 6

  8. [8]

    Escorcia, F

    V . Escorcia, F. C. Heilbron, J. C. Niebles, and B. Ghanem. Daps: Deep action proposals for action understanding. In European Conference on Computer Vision, pages 768–784. Springer, 2016. 1, 2, 3

  9. [9]

    Feichtenhofer, A

    C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1933–1941, 2016. 2

  10. [10]

    J. Gao, K. Chen, and R. Nevatia. Ctap: Complementary tem- poral action proposal generation. InProceedings of the Euro- pean Conference on Computer Vision (ECCV), pages 68–83,

  11. [11]

    J. Gao, Z. Yang, and R. Nevatia. Cascaded boundary re- gression for temporal action detection. In Proceedings of the British Machine Vision Conference, 2017. 3

  12. [12]

    J. Gao, Z. Yang, C. Sun, K. Chen, and R. Nevatia. Turn tap: Temporal unit regression network for temporal action proposals. In Computer Vision (ICCV), 2017 IEEE Interna- tional Conference on, pages 3648–3656. IEEE, 2017. 2, 3, 6, 8

  13. [13]

    Ghanem, J

    B. Ghanem, J. C. Niebles, C. Snoek, F. C. Heilbron, H. Al- wassel, R. Khrisna, V . Escorcia, K. Hata, and S. Buch. Ac- tivitynet challenge 2017 summary. CVPR ActivityNet Work- shop, 2017. 6

  14. [14]

    Y . G. Jiang, J. Liu, A. R. Zamir, G. Toderici, I. Laptev, M. Shah, and R. Sukthankar. Thumos challenge: Action recognition with a large number of classes. In ECCV Work- shop, 2014. 1, 6

  15. [15]

    Karpathy, G

    A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convo- lutional neural networks. In Proceedings of the IEEE con- ference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014. 8

  16. [16]

    Learning for Disparity Estimation through Feature Constancy

    Z. Liang, Y . Feng, Y . Guo, H. Liu, L. Qiao, W. Chen, L. Zhou, and J. Zhang. Learning deep correspondence through prior and posterior feature constancy.arXiv preprint arXiv:1712.01039, 7(8), 2017. 2

  17. [17]

    T. Lin, X. Zhao, and S. Haisheng. Bsn: Boundary sensitive network for temporal action proposal generation. In Euro- pean Conference on Computer Vision, 2018. 1, 2, 3, 4, 5, 6, 7, 8

  18. [18]

    T. Lin, X. Zhao, and Z. Shou. Single shot temporal action detection. In Proceedings of the 2017 ACM on Multimedia Conference, pages 988–996. ACM, 2017. 2, 3

  19. [19]

    T. Lin, X. Zhao, and Z. Shou. Temporal convolution based action proposal: Submission to activitynet 2017. CVPR Ac- tivityNet Workshop, 2017. 6, 8

  20. [20]

    Mayer, E

    N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox. A large dataset to train con- volutional networks for disparity, optical flow, and scene flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages 4040– 4048, 2016. 2

  21. [21]

    Z. Qiu, T. Yao, and T. Mei. Learning spatio-temporal repre- sentation with pseudo-3d residual networks. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 5534–5542. IEEE, 2017. 2

  22. [22]

    Z. Shou, J. Chan, A. Zareian, K. Miyazawa, and S.-F. Chang. Cdc: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In Com- puter Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 1417–1426. IEEE, 2017. 8

  23. [23]

    Z. Shou, D. Wang, and S.-F. Chang. Temporal action local- ization in untrimmed videos via multi-stage cnns. In Pro- ceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1049–1058, 2016. 1, 2, 6, 8

  24. [24]

    Simonyan and A

    K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems , pages 568–576,

  25. [25]

    Singh and F

    G. Singh and F. Cuzzolin. Untrimmed video classification for activity detection: submission to activitynet challenge. CVPR ActivityNet Workshop, 2016. 2

  26. [26]

    X. Song, X. Zhao, H. Hu, and L. Fang. Edgestereo: A con- text integrated residual pyramid network for stereo matching. arXiv preprint arXiv:1803.05196, 2018. 2

  27. [27]

    D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional net- works. In Proceedings of the IEEE International Conference on Computer Vision, pages 4489–4497, 2015. 2

  28. [28]

    D. Tran, J. Ray, Z. Shou, S.-F. Chang, and M. Paluri. Con- vnet architecture search for spatiotemporal feature learning. arXiv preprint arXiv:1708.05038, 2017. 8

  29. [29]

    H. Wang, A. Kl ¨aser, C. Schmid, and C.-L. Liu. Action recog- nition by dense trajectories. In Computer Vision and Pat- tern Recognition (CVPR), 2011 IEEE Conference on , pages 3169–3176. IEEE, 2011. 2

  30. [30]

    Wang and C

    H. Wang and C. Schmid. Action recognition with improved trajectories. In Proceedings of the IEEE International Con- ference on Computer Vision, pages 3551–3558, 2013. 2

  31. [31]

    L. Wang, Y . Xiong, D. Lin, and L. Van Gool. Untrimmed- nets for weakly supervised action recognition and detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4325–4334, 2017. 8

  32. [32]

    L. Wang, Y . Xiong, Z. Wang, and Y . Qiao. Towards good practices for very deep two-stream convnets. CoRR, abs/1507.02159, 2015. 2

  33. [33]

    Xiong, L

    Y . Xiong, L. Wang, Z. Wang, B. Zhang, H. Song, W. Li, D. Lin, Y . Qiao, L. V . Gool, and X. Tang. Cuhk & ethz & siat submission to activitynet challenge 2016. CVPR ActivityNet Workshop, 2016. 6

  34. [34]

    A Pursuit of Temporal Accuracy in General Activity Detection

    Y . Xiong, Y . Zhao, L. Wang, D. Lin, and X. Tang. A pursuit of temporal accuracy in general activity detection. CoRR, abs/1703.02716, 2017. 8

  35. [35]

    Y . Zhao, Y . Xiong, L. Wang, Z. Wu, X. Tang, and D. Lin. Temporal action detection with structured segment networks. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2933–2942. IEEE, 2017. 2, 3, 6

  36. [36]

    Y . Zhao, B. Zhang, Z. Wu, S. Yang, L. Zhou, S. Yan, L. Wang, Y . Xiong, D. Lin, Y . Qiao, and X. Tang. Cuhk & ethz & siat submission to activitynet challenge 2017. arXiv preprint arXiv:1710.08011, 2017. 8