
arxiv: 2412.07157 · v3 · submitted 2024-12-10 · 💻 cs.CV

Multi-Scale Contrastive Learning for Video Temporal Grounding

Pith reviewed 2026-05-23 07:27 UTC · model grok-4.3

classification 💻 cs.CV
keywords temporal grounding, contrastive learning, multi-scale learning, video moment localization, vision-language, feature pyramid

The pith

Contrastive learning across video encoder stages links short and long moments to improve temporal grounding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets information loss in the higher levels of feature pyramids for video temporal grounding, where downsampling to cover longer moments reduces representational capacity. It introduces a contrastive framework that obtains positive and negative samples directly from multiple stages of the video encoder itself, without data augmentation or memory banks. A new sampling process draws multiple moments per query, enabling multi-scale and cross-scale contrasts that connect local short-range representations to global long-range ones. The authors report gains on both long-form and short-form grounding tasks. The appeal is that the approach aims to preserve semantics across pyramid scales using only the encoder's internal outputs.
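To make the downsampling problem concrete, here is a minimal sketch of a temporal feature pyramid. It assumes stride-2 average pooling between levels, which is an illustrative choice only; the page does not say how the paper's encoder downsamples, and the names `downsample` and `build_pyramid` are hypothetical.

```python
from typing import List

Vector = List[float]


def downsample(features: List[Vector], stride: int = 2) -> List[Vector]:
    """Average every `stride` consecutive clip features into one feature."""
    pooled = []
    for start in range(0, len(features) - stride + 1, stride):
        window = features[start:start + stride]
        dim = len(window[0])
        pooled.append([sum(vec[d] for vec in window) / stride for d in range(dim)])
    return pooled


def build_pyramid(features: List[Vector], num_levels: int) -> List[List[Vector]]:
    """Level 0 keeps full temporal resolution; each higher level halves it."""
    levels = [features]
    for _ in range(num_levels - 1):
        levels.append(downsample(levels[-1]))
    return levels


# Toy video: 16 clips, each described by a 2-dimensional feature.
clips = [[float(t), float(t % 2)] for t in range(16)]
pyramid = build_pyramid(clips, num_levels=4)
print([len(level) for level in pyramid])  # [16, 8, 4, 2]
```

Each higher level summarizes the same video with half as many vectors, so a long moment at the top level is described by far fewer features than a short moment at the bottom. That shrinking budget is the capacity loss the paper sets out to offset.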

Core claim

The paper claims that sampling multiple video moments per query and contrasting their representations across video encoder layers instantiates a novel multi-scale and cross-scale contrastive learning process that links local short-range video moments with global long-range video moments, thereby capturing salient semantics and mitigating information degradation in higher pyramid levels.

What carries the argument

Multi-scale and cross-scale contrastive learning that uses representations from multiple stages of the video encoder as positive and negative samples.
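A cross-scale contrast of this kind can be sketched with a generic InfoNCE loss. This is not the paper's implementation: the function `info_nce`, the temperature value, and the toy vectors are all assumptions, and the rule "same query means positive" is the reading given on this page rather than a confirmed detail of the method.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def info_nce(anchor: Vector, positives: List[Vector], negatives: List[Vector],
             temperature: float = 0.1) -> float:
    """Mean InfoNCE loss of one anchor over its positives.

    Each positive competes against all negatives in the softmax denominator.
    """
    neg_terms = [math.exp(cosine(anchor, n) / temperature) for n in negatives]
    losses = []
    for p in positives:
        pos_term = math.exp(cosine(anchor, p) / temperature)
        losses.append(-math.log(pos_term / (pos_term + sum(neg_terms))))
    return sum(losses) / len(losses)


# Hypothetical toy features: a short moment at a low pyramid level (anchor),
# a same-query moment at a higher level (positive), and moments unrelated
# to the query (negatives).
anchor_low = [1.0, 0.1]
same_query_high = [0.9, 0.2]
unrelated = [[-1.0, 0.3], [0.0, -1.0]]

aligned = info_nce(anchor_low, [same_query_high], unrelated)
misaligned = info_nce(anchor_low, [unrelated[0]], [same_query_high, unrelated[1]])
print(aligned < misaligned)  # True
```

The loss is small when the low-level anchor already sits close to its same-query counterpart at the higher level, and large when an unrelated moment is treated as the positive. Minimizing it therefore pulls same-query representations together across scales.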

If this is right

  • Higher pyramid levels retain more information through cross-scale contrasts with lower levels.
  • The approach improves localization accuracy for both long-form and short-form videos.
  • Contrastive learning for this task requires neither external augmentations nor memory banks.
  • Multiple moments per query supply sufficient diversity for the contrastive signals.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The internal diversity of encoder stages might reduce reliance on external contrastive techniques in other multi-scale vision tasks.
  • The sampling process could be adapted to query-video pairs in related grounding or retrieval settings.
  • If the method scales, it suggests encoder layers already encode the necessary scale variation for contrast without added machinery.

Load-bearing premise

Representations from multiple stages of the video encoder can serve directly as positive and negative samples for contrastive learning without augmentation or memory banks, and sampling multiple moments per query yields effective training signals.

What would settle it

Training the model without the cross-scale contrastive losses and observing no drop in grounding accuracy on standard benchmarks would falsify the claim that these contrasts are what link the moments and improve performance.
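Such an ablation needs a concrete accuracy measure. The sketch below assumes the usual temporal grounding metric, Recall@1 at a temporal IoU threshold; the page does not state which metric the paper reports, and the predictions here are made-up toy values from two hypothetical runs, not results.

```python
from typing import List, Tuple

Moment = Tuple[float, float]  # (start_seconds, end_seconds)


def temporal_iou(pred: Moment, target: Moment) -> float:
    """Intersection over union of two time intervals."""
    intersection = max(0.0, min(pred[1], target[1]) - max(pred[0], target[0]))
    union = (pred[1] - pred[0]) + (target[1] - target[0]) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_1(preds: List[Moment], targets: List[Moment], threshold: float) -> float:
    """Fraction of queries whose top prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, t) >= threshold for p, t in zip(preds, targets))
    return hits / len(targets)


# Toy ground truth and top-1 predictions from two hypothetical training runs.
targets = [(10.0, 20.0), (0.0, 5.0), (30.0, 90.0)]
run_a = [(11.0, 19.0), (0.0, 6.0), (35.0, 85.0)]
run_b = [(11.0, 19.0), (0.0, 6.0), (60.0, 85.0)]

print(recall_at_1(run_a, targets, threshold=0.5))
print(recall_at_1(run_b, targets, threshold=0.5))
```

In this toy data the two runs differ only on the third query, a long moment, where run B's prediction falls below the threshold. Comparing the metric with and without the cross-scale losses, ideally broken down by moment length, is the kind of evidence the test above calls for.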

Figures

Figures reproduced from arXiv: 2412.07157 by Anh Tuan Luu, Cong-Duy T Nguyen, See-kiong Ng, Thong Thanh Nguyen, Xiaobao Wu, Yi Bin, Zhiyuan Hu.

Figure 1: (Left) Illustration of feature pyramid to encode video moments of different lengths; (Right) An example where recent ...
Figure 2: First and Second: IoU results with respect to target video moment length on Ego4D-NLQ (Grauman et al. 2022) of ...
Figure 3: Overall illustration of the proposed framework.
Original abstract

Temporal grounding, which localizes video moments related to a natural language query, is a core problem of vision-language learning and video understanding. To encode video moments of varying lengths, recent methods employ a multi-level structure known as a feature pyramid. In this structure, lower levels concentrate on short-range video moments, while higher levels address long-range moments. Because higher levels experience downsampling to accommodate increasing moment length, their capacity to capture information is reduced and consequently leads to degraded information in moment representations. To resolve this problem, we propose a contrastive learning framework to capture salient semantics among video moments. Our key methodology is to leverage samples from the feature space emanating from multiple stages of the video encoder itself requiring neither data augmentation nor online memory banks to obtain positive and negative samples. To enable such an extension, we introduce a sampling process to draw multiple video moments corresponding to a common query. Subsequently, by utilizing these moments' representations across video encoder layers, we instantiate a novel form of multi-scale and cross-scale contrastive learning that links local short-range video moments with global long-range video moments. Extensive experiments demonstrate the effectiveness of our framework for not only long-form but also short-form video grounding. Code is available at https://github.com/nguyentthong/MSCL

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a contrastive learning framework for video temporal grounding that operates on a feature pyramid encoder. Lower pyramid levels target short-range moments while higher levels (after downsampling) target long-range moments; the authors introduce a sampling procedure to obtain multiple moments per query and then apply multi-scale and cross-scale contrastive losses that treat representations from different encoder stages directly as positives and negatives, without data augmentation or memory banks. The central claim is that this linkage recovers information lost to downsampling and improves grounding performance on both long-form and short-form videos.

Significance. If the cross-scale objective demonstrably enforces semantic rather than trivial layer-wise alignment, the method would constitute a lightweight engineering contribution that avoids extra parameters or memory banks while addressing a recognized limitation of pyramid-based video encoders. Code release supports reproducibility, which is a positive factor.

major comments (2)
  1. [Abstract / Method] Abstract and method description: the central claim that representations from multiple pyramid stages can serve directly as positive/negative samples for contrastive learning rests on the unverified assumption that the sampling of multiple moments per query produces complementary semantics rather than scale-specific artifacts. Because stages share backbone weights and process overlapping temporal content, the risk of collapse to trivial similarity is load-bearing for the claimed linkage between short-range and long-range moments; no concrete test or ablation is referenced to rule this out.
  2. [Abstract] Abstract: the statement that 'extensive experiments demonstrate the effectiveness' supplies no quantitative results, baselines, ablation details, or error analysis, leaving the improvement claim unevidenced and preventing assessment of whether the framework actually outperforms prior multi-level temporal grounding methods.
minor comments (2)
  1. Notation for the contrastive loss (temperature, positive/negative definitions across scales) is not introduced in the provided description; introducing it would aid clarity.
  2. [Abstract] The abstract could briefly indicate the evaluation datasets and primary metrics to contextualize the claimed effectiveness.
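For orientation, one common way to write the kind of loss the first minor comment asks about is the InfoNCE form below. This is a generic sketch and not the paper's notation: the symbols $z_i$ (anchor moment representation), $P(i)$ (positives, for example same-query moments from other pyramid levels), $N(i)$ (negatives), $\mathrm{sim}$ (cosine similarity), and $\tau$ (temperature) are all assumptions introduced here.

```latex
\mathcal{L}_{\mathrm{con}}(i) =
  -\frac{1}{|P(i)|} \sum_{p \in P(i)}
  \log \frac{\exp\left(\mathrm{sim}(z_i, z_p)/\tau\right)}
            {\exp\left(\mathrm{sim}(z_i, z_p)/\tau\right)
             + \sum_{n \in N(i)} \exp\left(\mathrm{sim}(z_i, z_n)/\tau\right)}
```

Stating how $P(i)$ and $N(i)$ are built within a level and across levels would answer the referee's request directly.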

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point-by-point below, indicating planned revisions where appropriate.

  1. Referee: [Abstract / Method] Abstract and method description: the central claim that representations from multiple pyramid stages can serve directly as positive/negative samples for contrastive learning rests on the unverified assumption that the sampling of multiple moments per query produces complementary semantics rather than scale-specific artifacts. Because stages share backbone weights and process overlapping temporal content, the risk of collapse to trivial similarity is load-bearing for the claimed linkage between short-range and long-range moments; no concrete test or ablation is referenced to rule this out.

    Authors: We agree that an explicit check against trivial alignment strengthens the central claim. Section 3.2 details the multi-moment sampling per query, which selects temporally distinct moments at different scales; the cross-scale loss then treats same-query representations from different pyramid stages as positives. Existing ablations in Section 4.3 already show that removing the cross-scale term degrades performance on both long- and short-form benchmarks, indicating the loss captures more than layer-wise artifacts. To directly address the referee's concern, we will add a new analysis measuring inter-stage cosine similarity with and without the contrastive objective, to test whether the loss increases semantic alignment beyond trivial similarity. revision: yes

  2. Referee: [Abstract] Abstract: the statement that 'extensive experiments demonstrate the effectiveness' supplies no quantitative results, baselines, ablation details, or error analysis, leaving the improvement claim unevidenced and preventing assessment of whether the framework actually outperforms prior multi-level temporal grounding methods.

    Authors: The abstract is intentionally concise, but we accept that including key quantitative evidence would improve clarity. We will revise the abstract to report the main performance gains (e.g., absolute improvements on Charades-STA and ActivityNet-Captions relative to recent pyramid-based baselines) together with a brief mention of the ablation findings. revision: yes
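The inter-stage similarity analysis promised in the first response could look roughly like the sketch below. It is an assumption about how such a check might be set up, not the authors' procedure: the function `mean_cross_stage_similarity` and the toy features are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mean_cross_stage_similarity(stage_a: List[Vector], stage_b: List[Vector],
                                query_ids: List[int], same_query: bool) -> float:
    """Average cosine similarity over cross-stage pairs.

    same_query=True averages pairs that share a query; False averages the rest.
    """
    sims = [cosine(a, b)
            for a, qa in zip(stage_a, query_ids)
            for b, qb in zip(stage_b, query_ids)
            if (qa == qb) == same_query]
    return sum(sims) / len(sims)


# Toy moment features at a low and a high pyramid stage; index i in both
# lists belongs to the query query_ids[i].
stage_low = [[1.0, 0.0], [0.0, 1.0]]
stage_high = [[0.9, 0.1], [0.2, 0.8]]
query_ids = [0, 1]

same = mean_cross_stage_similarity(stage_low, stage_high, query_ids, same_query=True)
diff = mean_cross_stage_similarity(stage_low, stage_high, query_ids, same_query=False)
print(same - diff)
```

The quantity of interest is the gap between same-query and different-query similarity. If both averages are high, stages are similar regardless of content, which is the trivial alignment the referee worries about; a gap that widens when the contrastive objective is switched on would support the semantic reading.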

Circularity Check

0 steps flagged

No circularity; independent engineering proposal for multi-scale contrastive pairs


The paper introduces a contrastive learning framework that treats representations from multiple stages of an existing feature pyramid encoder as positives/negatives, augmented by a new sampling process for multiple moments per query. This construction is defined directly in the method section without reducing any claimed result to a fitted parameter, prior self-citation, or input quantity by construction. The central claim (multi-scale and cross-scale contrastive learning linking short- and long-range moments) is an additive loss term whose effectiveness is evaluated externally via experiments rather than derived tautologically from the inputs.
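The "additive loss term" reading can be written as a single training objective. The form below is a sketch under assumptions: $\mathcal{L}_{\mathrm{ground}}$ stands for whatever grounding loss the base model already uses, $\mathcal{L}_{\mathrm{con}}$ for the contrastive term, and $\lambda$ for a hypothetical weighting coefficient; none of these symbols is taken from the paper.

```latex
\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{ground}} + \lambda \, \mathcal{L}_{\mathrm{con}}, \qquad \lambda \ge 0
```

Setting $\lambda = 0$ recovers the base model, which is why an ablation over this term is an external test of the claim rather than something implied by the construction.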

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the domain assumption that intra-encoder multi-stage features can substitute for augmented positives/negatives in contrastive learning; no free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • domain assumption: Contrastive learning on multi-stage encoder features can capture salient semantics and compensate for information loss in downsampled pyramid levels.
    Invoked to justify the proposed multi-scale and cross-scale contrastive losses.



Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · 5 internal anchors

  3. [3]

    An, X.; Deng, J.; Yang, K.; Li, J.; Feng, Z.; Guo, J.; Yang, J.; and Liu, T. 2023. Unicom: Universal and compact representation learning for image retrieval. arXiv preprint arXiv:2304.05884

  4. [4]

    Anne Hendricks, L.; Wang, O.; Shechtman, E.; Sivic, J.; Darrell, T.; and Russell, B. 2017. Localizing moments in video with natural language. In Proceedings of the IEEE international conference on computer vision, 5803--5812

  5. [5]

    Bachman, P.; Hjelm, R. D.; and Buchwalter, W. 2019. Learning representations by maximizing mutual information across views. Advances in neural information processing systems, 32

  6. [6]

    Bodla, N.; Singh, B.; Chellappa, R.; and Davis, L. S. 2017. Soft-NMS--improving object detection with one line of code. In Proceedings of the IEEE international conference on computer vision, 5561--5569

  7. [7]

    Burgner-Kahrs, J.; Rucker, D. C.; and Choset, H. 2015. Continuum robots for medical applications: A survey. IEEE Transactions on Robotics, 31(6): 1261--1280

  8. [8]

    Carreira, J.; and Zisserman, A. 2017. Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6299--6308

  9. [9]

    Chaitanya, K.; Erdil, E.; Karani, N.; and Konukoglu, E. 2020. Contrastive learning of global and local features for medical image segmentation with limited annotations. Advances in neural information processing systems, 33: 12546--12558

  10. [10]

    Claussmann, L.; Revilloud, M.; Gruyer, D.; and Glaser, S. 2019. A review of motion planning for highway autonomous driving. IEEE Transactions on Intelligent Transportation Systems, 21(5): 1826--1848

  11. [11]

    Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805

  12. [12]

    Fang, X.; Liu, D.; Zhou, P.; and Nan, G. 2023. You can ground earlier than see: An effective and efficient pipeline for temporal sentence grounding in compressed videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2448--2460

  13. [13]

    Feichtenhofer, C.; Fan, H.; Malik, J.; and He, K. 2019. Slowfast networks for video recognition. In Proceedings of the IEEE/CVF international conference on computer vision, 6202--6211

  14. [14]

    Gao, J.; Sun, C.; Yang, Z.; and Nevatia, R. 2017. Tall: Temporal activity localization via language query. In Proceedings of the IEEE international conference on computer vision, 5267--5275

  15. [15]

    Gao, J.; Sun, X.; Xu, M.; Zhou, X.; and Ghanem, B. 2021. Relation-aware video reading comprehension for temporal language grounding. arXiv preprint arXiv:2110.05717

  16. [16]

    Grauman, K.; Westbury, A.; Byrne, E.; Chavis, Z.; Furnari, A.; Girdhar, R.; Hamburger, J.; Jiang, H.; Liu, M.; Liu, X.; et al. 2022. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 18995--19012

  17. [17]

    Guo, C.; Fan, B.; Zhang, Q.; Xiang, S.; and Pan, C. 2020. Augfpn: Improving multi-scale feature learning for object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 12595--12604

  18. [18]

    Han, D.; Cheng, X.; Guo, N.; Ye, X.; Rainer, B.; and Priller, P. 2023. Momentum cross-modal contrastive learning for video moment retrieval. IEEE Transactions on Circuits and Systems for Video Technology

  19. [19]

    Hjelm, R. D.; Fedorov, A.; Lavoie-Marchildon, S.; Grewal, K.; Bachman, P.; Trischler, A.; and Bengio, Y. 2018. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670

  20. [20]

    Hou, Z.; Zhong, W.; Ji, L.; Gao, D.; Yan, K.; Chan, W.-K.; Ngo, C.-W.; Shou, Z.; and Duan, N. 2022. Cone: An efficient coarse-to-fine alignment framework for long video temporal grounding. arXiv preprint arXiv:2209.10918

  21. [21]

    Hu, H.; Cui, J.; and Wang, L. 2021. Region-aware contrastive learning for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 16291--16301

  22. [22]

    Ji, W.; Shi, R.; Wei, Y.; Zhao, S.; and Zimmermann, R. 2024. Weakly Supervised Video Moment Retrieval via Location-irrelevant Proposal Learning. In Companion Proceedings of the ACM on Web Conference 2024, 1595--1603

  23. [23]

    Jung, M.; Jang, Y.; Choi, S.; Kim, J.; Kim, J.-H.; and Zhang, B.-T. 2023. Overcoming Weak Visual-Textual Alignment for Video Moment Retrieval. arXiv preprint arXiv:2306.02728

  24. [24]

    Kay, W.; Carreira, J.; Simonyan, K.; Zhang, B.; Hillier, C.; Vijayanarasimhan, S.; Viola, F.; Green, T.; Back, T.; Natsev, P.; et al. 2017. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950

  25. [25]

    Kim, T.; Kim, J.; Shim, M.; Yun, S.; Kang, M.; Wee, D.; and Lee, S. 2022. Exploring temporally dynamic data augmentation for video recognition. arXiv preprint arXiv:2206.15015

  26. [26]

    Krishna, R.; Hata, K.; Ren, F.; Fei-Fei, L.; and Carlos Niebles, J. 2017. Dense-captioning events in videos. In Proceedings of the IEEE international conference on computer vision, 706--715

  27. [27]

    Lei, J.; Berg, T. L.; and Bansal, M. 2021. Detecting moments and highlights in videos via natural language queries. Advances in Neural Information Processing Systems, 34: 11846--11858

  28. [28]

    Li, H.; Cao, M.; Cheng, X.; Li, Y.; Zhu, Z.; and Zou, Y. 2023. G2l: Semantically aligned and uniform video grounding via geodesic and game theory. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 12032--12042

  29. [29]

    Li, J.; Xie, J.; Qian, L.; Zhu, L.; Tang, S.; Wu, F.; Yang, Y.; Zhuang, Y.; and Wang, X. E. 2022. Compositional temporal grounding with structured variational cross-graph correspondence learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3032--3041

  30. [30]

    Li, K.; Guo, D.; and Wang, M. 2021. Proposal-free video grounding with contextual pyramid network. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 1902--1910

  31. [31]

    Lin, K. Q.; Wang, J.; Soldan, M.; Wray, M.; Yan, R.; Xu, E. Z.; Gao, D.; Tu, R.-C.; Zhao, W.; Kong, W.; et al. 2022. Egocentric video-language pretraining. Advances in Neural Information Processing Systems, 35: 7575--7586

  32. [32]

    Lin, K. Q.; Zhang, P.; Chen, J.; Pramanick, S.; Gao, D.; Wang, A. J.; Yan, R.; and Shou, M. Z. 2023. Univtg: Towards unified video-language temporal grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2794--2804

  33. [33]

    Liu, D.; Qu, X.; Di, X.; Cheng, Y.; Xu, Z.; and Zhou, P. 2022. Memory-guided semantic learning network for temporal sentence grounding. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, 1665--1673

  34. [34]

    Liu, D.; Qu, X.; Dong, J.; and Zhou, P. 2021. Adaptive proposal generation network for temporal sentence localization in videos. arXiv preprint arXiv:2109.06398

  35. [35]

    Liu, Z.; Li, J.; Xie, H.; Li, P.; Ge, J.; Liu, S.-A.; and Jin, G. 2024. Towards balanced alignment: Modal-enhanced semantic modeling for video moment retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, 3855--3863

  36. [36]

    Mu, F.; Mo, S.; and Li, Y. 2024. SnAG: Scalable and Accurate Video Grounding. arXiv preprint arXiv:2404.02257

  37. [37]

    Nguyen, C.-D.; Nguyen, T.; Vu, D. A.; and Tuan, L. A. 2023a. Improving multimodal sentiment analysis: Supervised angular margin-based contrastive learning for enhanced fusion representation. arXiv preprint arXiv:2312.02227

  38. [38]

    Nguyen, C.-D.; Nguyen, T.; Wu, X.; and Luu, A. T. 2024a. Kdmcse: Knowledge distillation multimodal sentence embeddings with adaptive angular margin contrastive learning. arXiv preprint arXiv:2403.17486

  39. [39]

    Nguyen, T.; Bin, Y.; Wu, X.; Dong, X.; Hu, Z.; Le, K.; Nguyen, C.-D.; Ng, S.-K.; and Tuan, L. A. 2025. Meta-optimized Angular Margin Contrastive Framework for Video-Language Representation Learning. In European Conference on Computer Vision, 77--98. Springer

  40. [40]

    Nguyen, T.; and Luu, A. T. 2021. Contrastive learning for neural topic model. Advances in neural information processing systems, 34: 11974--11986

  41. [41]

    Nguyen, T.; Wu, X.; Dong, X.; Nguyen, C.-D.; Ng, S.-K.; and Tuan, L. A. 2023b. Demaformer: Damped exponential moving average transformer with energy-based modeling for temporal language grounding. arXiv preprint arXiv:2312.02549

  42. [42]

    Nguyen, T.; Wu, X.; Dong, X.; Nguyen, C.-D. T.; Ng, S.-K.; and Luu, A. T. 2024b. Topic Modeling as Multi-Objective Contrastive Optimization. arXiv preprint arXiv:2402.07577

  43. [43]

    Nguyen, T.; Wu, X.; Luu, A.-T.; Nguyen, C.-D.; Hai, Z.; and Bing, L. 2022. Adaptive contrastive learning on multimodal transformer for review helpfulness predictions. arXiv preprint arXiv:2211.03524

  44. [44]

    Pan, Y.; He, X.; Gong, B.; Lv, Y.; Shen, Y.; Peng, Y.; and Zhao, D. 2023. Scanning only once: An end-to-end framework for fast temporal grounding in long videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 13767--13777

  45. [45]

    Panta, L.; Shrestha, P.; Sapkota, B.; Bhattarai, A.; Manandhar, S.; and Sah, A. K. 2024. Cross-modal Contrastive Learning with Asymmetric Co-attention Network for Video Moment Retrieval. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 607--614

  46. [46]

    Pennington, J.; Socher, R.; and Manning, C. D. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 1532--1543

  47. [47]

    Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning, 8748--8763. PMLR

  48. [48]

    Regneri, M.; Rohrbach, M.; Wetzel, D.; Thater, S.; Schiele, B.; and Pinkal, M. 2013. Grounding action descriptions in videos. Transactions of the Association for Computational Linguistics, 1: 25--36

  49. [49]

    Sigurdsson, G. A.; Varol, G.; Wang, X.; Farhadi, A.; Laptev, I.; and Gupta, A. 2016. Hollywood in homes: Crowdsourcing data collection for activity understanding. In Computer Vision--ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11--14, 2016, Proceedings, Part I 14, 510--526. Springer

  50. [50]

    Soldan, M.; Pardo, A.; Alcázar, J. L.; Caba, F.; Zhao, C.; Giancola, S.; and Ghanem, B. 2022. Mad: A scalable dataset for language grounding in videos from movie audio descriptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5026--5035

  51. [51]

    Soldan, M.; Xu, M.; Qu, S.; Tegner, J.; and Ghanem, B. 2021. Vlg-net: Video-language graph matching network for video grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 3224--3234

  52. [52]

    Touvron, H.; Cord, M.; Sablayrolles, A.; Synnaeve, G.; and Jégou, H. 2021. Going deeper with image transformers. In Proceedings of the IEEE/CVF international conference on computer vision, 32--42

  53. [53]

    Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; and Paluri, M. 2015. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, 4489--4497

  54. [54]

    Wang, H.; Zha, Z.-J.; Li, L.; Liu, D.; and Luo, J. 2021a. Structured multi-level interaction network for video moment localization via language query. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7026--7035

  55. [55]

    Wang, W.; Zhou, T.; Yu, F.; Dai, J.; Konukoglu, E.; and Van Gool, L. 2021b. Exploring cross-image pixel contrast for semantic segmentation. In Proceedings of the IEEE/CVF international conference on computer vision, 7303--7313

  56. [56]

    Wang, Z.; Wang, L.; Wu, T.; Li, T.; and Wu, G. 2022. Negative sample matters: A renaissance of metric learning for temporal grounding. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, 2613--2623

  57. [57]

    Woo, S.; Park, J.; Koo, I.; Lee, S.; Jeong, M.; and Kim, C. 2022. Explore and match: End-to-end video grounding with transformer. arXiv preprint arXiv:2201.10168, 1(4)

  58. [58]

    Wu, X.; Dong, X.; Nguyen, T.; Liu, C.; Pan, L.-M.; and Luu, A. T. 2023. Infoctm: A mutual information maximization perspective of cross-lingual topic modeling. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, 13763--13771

  59. [59]

    Wu, X.; Dong, X.; Pan, L.; Nguyen, T.; and Luu, A. T. 2024. Modeling Dynamic Topics in Chain-Free Fashion by Evolution-Tracking Contrastive Learning and Unassociated Word Exclusion. In Ku, L.-W.; Martins, A.; and Srikumar, V., eds., Findings of the Association for Computational Linguistics ACL 2024, 3088--3105. Bangkok, Thailand and virtual meeting: Assoc...

  60. [60]

    Xiao, S.; Chen, L.; Shao, J.; Zhuang, Y.; and Xiao, J. 2021a. Natural language video localization with learnable moment proposals. arXiv preprint arXiv:2109.10678

  61. [61]

    Xiao, S.; Chen, L.; Zhang, S.; Ji, W.; Shao, J.; Ye, L.; and Xiao, J. 2021b. Boundary proposal network for two-stage natural language video localization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 2986--2994

  62. [62]

    Xiao, Y.; Luo, Z.; Liu, Y.; Ma, Y.; Bian, H.; Ji, Y.; Yang, Y.; and Li, X. 2024. Bridging the gap: A unified video comprehension framework for moment retrieval and highlight detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 18709--18719

  63. [63]

    Xing, Z.; Dai, Q.; Hu, H.; Chen, J.; Wu, Z.; and Jiang, Y.-G. 2023. Svformer: Semi-supervised video transformer for action recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 18816--18826

  64. [64]

    Xu, M.; Soldan, M.; Gao, J.; Liu, S.; Pérez-Rúa, J.-M.; and Ghanem, B. 2023. Boundary-denoising for video activity localization. arXiv preprint arXiv:2304.02934

  65. [65]

    Yang, G.; Lei, J.; Zhu, Z.; Cheng, S.; Feng, Z.; and Liang, R. 2023. AFPN: asymptotic feature pyramid network for object detection. In 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC), 2184--2189. IEEE

  66. [66]

    Zhang, C.-L.; Wu, J.; and Li, Y. 2022. Actionformer: Localizing moments of actions with transformers. In European Conference on Computer Vision, 492--510. Springer

  67. [67]

    Zhang, H.; Sun, A.; Jing, W.; and Zhou, J. T. 2020a. Span-based localizing network for natural language video localization. arXiv preprint arXiv:2004.13931

  68. [68]

    Zhang, M.; Yang, Y.; Chen, X.; Ji, Y.; Xu, X.; Li, J.; and Shen, H. T. 2021. Multi-stage aggregated transformer network for temporal language localization in videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 12669--12678

  69. [69]

    Zhang, S.; Peng, H.; Fu, J.; and Luo, J. 2020b. Learning 2d temporal adjacent networks for moment localization with natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 12870--12877

  70. [70]

    Zhang, S.; Zhu, Y.; and Roy-Chowdhury, A. K. 2016. Context-aware surveillance video summarization. IEEE Transactions on Image Processing, 25(11): 5469--5478

  71. [71]

    Zhang, Y.; He, R.; Liu, Z.; Lim, K. H.; and Bing, L. 2020c. An unsupervised sentence embedding method by mutual information maximization. arXiv preprint arXiv:2009.12061

  72. [72]

    Zhou, H.; Zhang, C.; Luo, Y.; Chen, Y.; and Hu, C. 2021. Embracing uncertainty: Decoupling and de-bias for robust temporal grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8445--8454

  73. [73]

    Zhu, J.; Liu, D.; Zhou, P.; Di, X.; Cheng, Y.; Yang, S.; Xu, W.; Xu, Z.; Wan, Y.; Sun, L.; et al. 2023. Rethinking the video sampling and reasoning strategies for temporal sentence grounding. arXiv preprint arXiv:2301.00514