Multi-Scale Contrastive Learning for Video Temporal Grounding

Anh Tuan Luu; Cong-Duy T Nguyen; See-kiong Ng; Thong Thanh Nguyen; Xiaobao Wu; Yi Bin; Zhiyuan Hu

arXiv: 2412.07157 · v3 · submitted 2024-12-10 · 💻 cs.CV

Thong Thanh Nguyen, Yi Bin, Xiaobao Wu, Zhiyuan Hu, Cong-Duy T Nguyen, See-kiong Ng, Anh Tuan Luu

Pith reviewed 2026-05-23 07:27 UTC · model grok-4.3

classification 💻 cs.CV

keywords temporal grounding · contrastive learning · multi-scale learning · video moment localization · vision-language · feature pyramid

The pith

Contrastive learning across video encoder stages links short and long moments to improve temporal grounding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets the information loss in higher levels of feature pyramids for video temporal grounding, where downsampling for longer moments reduces representational capacity. It introduces a contrastive framework that obtains positive and negative samples directly from multiple stages of the video encoder itself, without data augmentation or memory banks. A new sampling process draws multiple moments per query, enabling multi-scale and cross-scale contrasts that connect local short-range representations to global long-range ones. Experiments show gains on both long-form and short-form grounding tasks. A sympathetic reader cares because this approach preserves semantics across pyramid scales using only internal encoder outputs.

Core claim

The paper claims that sampling multiple video moments per query and contrasting their representations across video encoder layers instantiates a novel multi-scale and cross-scale contrastive learning process that links local short-range video moments with global long-range video moments, thereby capturing salient semantics and mitigating information degradation in higher pyramid levels.

What carries the argument

multi-scale and cross-scale contrastive learning that uses representations from multiple stages of the video encoder as positive and negative samples
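To make the cross-scale idea concrete, here is a minimal sketch in plain Python. It assumes a standard InfoNCE-style objective with cosine similarity and a temperature of 0.1; the paper's exact loss, similarity function, and hyperparameters are not given on this page, so the function names and toy vectors below are illustrative assumptions only.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    # Scale to unit length so dot products become cosine similarities.
    norm = math.sqrt(dot(v, v)) or 1.0
    return [a / norm for a in v]

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor: negative log-softmax score of the positive."""
    a = normalize(anchor)
    logits = [dot(a, normalize(positive)) / temperature]
    logits += [dot(a, normalize(n)) / temperature for n in negatives]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denom - logits[0]

# Hypothetical toy features: a short-range moment from a lower pyramid level,
# a long-range moment from a higher level tied to the same query, and two
# moments unrelated to that query.
low_level_moment = [1.0, 0.1, 0.0]
high_level_moment = [0.9, 0.2, 0.1]
unrelated_moments = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Cross-scale pairing: the positive comes from a different encoder stage.
aligned = info_nce(low_level_moment, high_level_moment, unrelated_moments)
# Wrong pairing: an unrelated moment is treated as the positive.
misaligned = info_nce(low_level_moment, unrelated_moments[0],
                      [high_level_moment, unrelated_moments[1]])
```

In this toy setup the loss is small when the same-query representations from the two levels already agree and large when an unrelated moment is used as the positive, which is the behaviour a cross-scale objective would push toward during training.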

If this is right

Higher pyramid levels retain more information through cross-scale contrasts with lower levels.
The approach improves localization accuracy for both long-form and short-form videos.
Contrastive learning for this task requires neither external augmentations nor memory banks.
Multiple moments per query supply sufficient diversity for the contrastive signals.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The internal diversity of encoder stages might reduce reliance on external contrastive techniques in other multi-scale vision tasks.
The sampling process could be adapted to query-video pairs in related grounding or retrieval settings.
If the method scales, it suggests encoder layers already encode the necessary scale variation for contrast without added machinery.

Load-bearing premise

Representations from multiple stages of the video encoder can serve directly as positive and negative samples for contrastive learning without augmentation or memory banks, and sampling multiple moments per query yields effective training signals.

What would settle it

Training the model without the cross-scale contrastive losses and observing no drop in grounding accuracy on standard benchmarks would falsify the claim that these contrasts are what link the moments and improve performance.

Figures

Figures reproduced from arXiv: 2412.07157 by Anh Tuan Luu, Cong-Duy T Nguyen, See-kiong Ng, Thong Thanh Nguyen, Xiaobao Wu, Yi Bin, Zhiyuan Hu.

**Figure 1.** (Left) Illustration of feature pyramid to encode video moments of different lengths; (Right) An example where recent ...

**Figure 2.** First and Second: IoU results with respect to target video moment length on Ego4D-NLQ (Grauman et al. 2022) of ...

**Figure 3.** Overall illustration of the proposed framework.
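Figure 2 reports IoU against target moment length. As a reminder of what that metric measures, the sketch below computes temporal IoU between a predicted and a ground-truth moment, each given as a `(start, end)` pair in seconds. This is the usual definition used in temporal grounding; the function name and the example intervals are illustrative, not taken from the paper's code.

```python
def temporal_iou(pred, target):
    """Intersection-over-union of two (start, end) intervals."""
    # Overlap is zero when the intervals do not intersect.
    inter = max(0.0, min(pred[1], target[1]) - max(pred[0], target[0]))
    union = (pred[1] - pred[0]) + (target[1] - target[0]) - inter
    return inter / union if union > 0 else 0.0

# Predicted moment 2-6 s against a ground-truth moment 4-8 s:
# 2 s of overlap over a 6 s union.
example_iou = temporal_iou((2.0, 6.0), (4.0, 8.0))
```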

Original abstract

Temporal grounding, which localizes video moments related to a natural language query, is a core problem of vision-language learning and video understanding. To encode video moments of varying lengths, recent methods employ a multi-level structure known as a feature pyramid. In this structure, lower levels concentrate on short-range video moments, while higher levels address long-range moments. Because higher levels experience downsampling to accommodate increasing moment length, their capacity to capture information is reduced and consequently leads to degraded information in moment representations. To resolve this problem, we propose a contrastive learning framework to capture salient semantics among video moments. Our key methodology is to leverage samples from the feature space emanating from multiple stages of the video encoder itself requiring neither data augmentation nor online memory banks to obtain positive and negative samples. To enable such an extension, we introduce a sampling process to draw multiple video moments corresponding to a common query. Subsequently, by utilizing these moments' representations across video encoder layers, we instantiate a novel form of multi-scale and cross-scale contrastive learning that links local short-range video moments with global long-range video moments. Extensive experiments demonstrate the effectiveness of our framework for not only long-form but also short-form video grounding. Code is available at https://github.com/nguyentthong/MSCL
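The downsampling problem described in the abstract can be illustrated with a toy feature pyramid. The sketch below assumes stride-2 average pooling along time; the abstract does not say which downsampling operator the encoder actually uses, so `downsample`, `build_pyramid`, and the toy clip features are hypothetical.

```python
def downsample(features, stride=2):
    """Average-pool a sequence of clip features along time."""
    pooled = []
    for start in range(0, len(features) - stride + 1, stride):
        window = features[start:start + stride]
        # Average each feature dimension over the window.
        pooled.append([sum(col) / stride for col in zip(*window)])
    return pooled

def build_pyramid(features, num_levels=3):
    """Level 0 keeps full temporal resolution; each higher level halves it."""
    levels = [features]
    for _ in range(num_levels - 1):
        levels.append(downsample(levels[-1]))
    return levels

# Eight clips with two feature dimensions: a slow ramp and a fast
# alternating signal (0, 1, 0, 1, ...).
clip_features = [[float(t), float(t % 2)] for t in range(8)]
pyramid = build_pyramid(clip_features, num_levels=3)
lengths = [len(level) for level in pyramid]
```

Each level halves the number of temporal positions, and the fast alternating dimension collapses to a constant 0.5 after the first pooling step. That is a small-scale picture of how higher levels can lose short-range detail.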

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

The paper adds a sampling step that draws multiple moments per query, plus multi-scale contrastive learning on raw encoder stages, but the abstract reports no numbers, so the payoff is still unproven.

The one thing to know is that they sample several moments for each query and then apply contrastive learning directly to features from different encoder layers, tying short-range and long-range representations together without extra augmentations or memory banks. That combination is not in the prior work they cite. They also correctly flag the information loss that comes with downsampling in feature pyramids, which is a real practical issue for variable-length grounding. The method description is clear on how the positives and negatives are drawn from the pyramid stages themselves. The main soft spot is the complete lack of results in the abstract: no tables, no baselines, and no ablation on whether the cross-scale pairs actually add semantic signal or just pick up layer-wise correlation. The stress-test worry about trivial similarity is plausible because the stages share weights and see overlapping video content, and nothing in the abstract rules it out. If the full paper has solid quantitative gains that survive that check, the contribution becomes more interesting; right now the central claim rests on an assertion. The work is aimed at people already using pyramid encoders for temporal grounding who want a lightweight contrastive add-on. A reader who needs reproducible improvements on standard benchmarks would get limited value until the numbers appear. It is coherent enough on its own terms to deserve a serious referee who can look at the experiments and the actual contrastive pairs.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a contrastive learning framework for video temporal grounding that operates on a feature pyramid encoder. Lower pyramid levels target short-range moments while higher levels (after downsampling) target long-range moments; the authors introduce a sampling procedure to obtain multiple moments per query and then apply multi-scale and cross-scale contrastive losses that treat representations from different encoder stages directly as positives and negatives, without data augmentation or memory banks. The central claim is that this linkage recovers information lost to downsampling and improves grounding performance on both long-form and short-form videos.

Significance. If the cross-scale objective demonstrably enforces semantic rather than trivial layer-wise alignment, the method would constitute a lightweight engineering contribution that avoids extra parameters or memory banks while addressing a recognized limitation of pyramid-based video encoders. Code release supports reproducibility, which is a positive factor.

major comments (2)

[Abstract / Method] Abstract and method description: the central claim that representations from multiple pyramid stages can serve directly as positive/negative samples for contrastive learning rests on the unverified assumption that the sampling of multiple moments per query produces complementary semantics rather than scale-specific artifacts. Because stages share backbone weights and process overlapping temporal content, the risk of collapse to trivial similarity is load-bearing for the claimed linkage between short-range and long-range moments; no concrete test or ablation is referenced to rule this out.
[Abstract] Abstract: the statement that 'extensive experiments demonstrate the effectiveness' supplies no quantitative results, baselines, ablation details, or error analysis, leaving the improvement claim unevidenced and preventing assessment of whether the framework actually outperforms prior multi-level temporal grounding methods.

minor comments (2)

Notation for the contrastive loss (temperature, positive/negative definitions across scales) is not introduced in the provided description, which would aid clarity.
[Abstract] The abstract could briefly indicate the evaluation datasets and primary metrics to contextualize the claimed effectiveness.
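For reference, the kind of notation the first minor comment asks for could look like the generic InfoNCE form below. This is the standard textbook formulation, not the paper's own definition: the symbols $z$ (anchor moment representation), $z^{+}$ (positive), $z_k^{-}$ (the $K$ negatives), $\mathrm{sim}$ (a similarity function such as cosine similarity), and $\tau$ (temperature) are assumptions introduced here for illustration.

```latex
\mathcal{L}_{\mathrm{NCE}}
  = -\log
    \frac{\exp\!\left(\mathrm{sim}(z, z^{+}) / \tau\right)}
         {\exp\!\left(\mathrm{sim}(z, z^{+}) / \tau\right)
          + \sum_{k=1}^{K} \exp\!\left(\mathrm{sim}(z, z_k^{-}) / \tau\right)}
```

Under this reading, a multi-scale term would draw $z$ and $z^{+}$ from the same pyramid level and a cross-scale term would draw them from different levels; how the paper actually assigns positives and negatives across scales is what the referee asks the authors to spell out.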

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point-by-point below, indicating planned revisions where appropriate.

Referee: [Abstract / Method] Abstract and method description: the central claim that representations from multiple pyramid stages can serve directly as positive/negative samples for contrastive learning rests on the unverified assumption that the sampling of multiple moments per query produces complementary semantics rather than scale-specific artifacts. Because stages share backbone weights and process overlapping temporal content, the risk of collapse to trivial similarity is load-bearing for the claimed linkage between short-range and long-range moments; no concrete test or ablation is referenced to rule this out.

Authors: We agree that an explicit check against trivial alignment strengthens the central claim. Section 3.2 details the multi-moment sampling per query, which selects temporally distinct moments at different scales; the cross-scale loss then treats same-query representations from different pyramid stages as positives. Existing ablations in Section 4.3 already show that removing the cross-scale term degrades performance on both long- and short-form benchmarks, indicating the loss captures more than layer-wise artifacts. To directly address the referee's concern, we will add a new analysis measuring inter-stage cosine similarity with and without the contrastive objective, to test whether the loss increases semantic alignment beyond trivial similarity. revision: yes

Referee: [Abstract] Abstract: the statement that 'extensive experiments demonstrate the effectiveness' supplies no quantitative results, baselines, ablation details, or error analysis, leaving the improvement claim unevidenced and preventing assessment of whether the framework actually outperforms prior multi-level temporal grounding methods.

Authors: The abstract is intentionally concise, but we accept that including key quantitative evidence would improve clarity. We will revise the abstract to report the main performance gains (e.g., absolute improvements on Charades-STA and ActivityNet-Captions relative to recent pyramid-based baselines) together with a brief mention of the ablation findings. revision: yes
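The inter-stage cosine similarity analysis proposed in the first response could be computed along the lines of the sketch below. It assumes that representations from two encoder stages have already been paired moment by moment; how the authors would pair them is not specified on this page, and the function names and toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def mean_inter_stage_similarity(stage_a, stage_b):
    """Mean cosine similarity over paired moment representations from two stages."""
    pairs = list(zip(stage_a, stage_b))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# Toy example: two moments represented at a lower and a higher stage.
stage_low = [[1.0, 0.0], [0.0, 1.0]]
stage_high = [[1.0, 0.0], [1.0, 0.0]]
similarity = mean_inter_stage_similarity(stage_low, stage_high)
```

Running the same measurement on models trained with and without the contrastive term, and separately on same-query and different-query pairs, would show whether the loss raises similarity selectively or simply uniformly across stages.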

Circularity Check

0 steps flagged

No circularity; independent engineering proposal for multi-scale contrastive pairs

The paper introduces a contrastive learning framework that treats representations from multiple stages of an existing feature pyramid encoder as positives/negatives, augmented by a new sampling process for multiple moments per query. This construction is defined directly in the method section without reducing any claimed result to a fitted parameter, prior self-citation, or input quantity by construction. The central claim (multi-scale and cross-scale contrastive learning linking short- and long-range moments) is an additive loss term whose effectiveness is evaluated externally via experiments rather than derived tautologically from the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the domain assumption that intra-encoder multi-stage features can substitute for augmented positives/negatives in contrastive learning; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

domain assumption: Contrastive learning on multi-stage encoder features can capture salient semantics and compensate for information loss in downsampled pyramid levels.
Invoked to justify the proposed multi-scale and cross-scale contrastive losses.

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · 5 internal anchors

[3]

An, X.; Deng, J.; Yang, K.; Li, J.; Feng, Z.; Guo, J.; Yang, J.; and Liu, T. 2023. Unicom: Universal and compact representation learning for image retrieval. arXiv preprint arXiv:2304.05884

work page arXiv 2023
[4]

Anne Hendricks, L.; Wang, O.; Shechtman, E.; Sivic, J.; Darrell, T.; and Russell, B. 2017. Localizing moments in video with natural language. In Proceedings of the IEEE international conference on computer vision, 5803--5812

work page 2017
[5]

D.; and Buchwalter, W

Bachman, P.; Hjelm, R. D.; and Buchwalter, W. 2019. Learning representations by maximizing mutual information across views. Advances in neural information processing systems, 32

work page 2019
[6]

Bodla, N.; Singh, B.; Chellappa, R.; and Davis, L. S. 2017. Soft-NMS--improving object detection with one line of code. In Proceedings of the IEEE international conference on computer vision, 5561--5569

work page 2017
[7]

C.; and Choset, H

Burgner-Kahrs, J.; Rucker, D. C.; and Choset, H. 2015. Continuum robots for medical applications: A survey. IEEE Transactions on Robotics, 31(6): 1261--1280

work page 2015
[8]

Carreira, J.; and Zisserman, A. 2017. Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6299--6308

work page 2017
[9]

Chaitanya, K.; Erdil, E.; Karani, N.; and Konukoglu, E. 2020. Contrastive learning of global and local features for medical image segmentation with limited annotations. Advances in neural information processing systems, 33: 12546--12558

work page 2020
[10]

Claussmann, L.; Revilloud, M.; Gruyer, D.; and Glaser, S. 2019. A review of motion planning for highway autonomous driving. IEEE Transactions on Intelligent Transportation Systems, 21(5): 1826--1848

work page 2019
[11]

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805

work page internal anchor Pith review Pith/arXiv arXiv 2018
[12]

Fang, X.; Liu, D.; Zhou, P.; and Nan, G. 2023. You can ground earlier than see: An effective and efficient pipeline for temporal sentence grounding in compressed videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2448--2460

work page 2023
[13]

Feichtenhofer, C.; Fan, H.; Malik, J.; and He, K. 2019. Slowfast networks for video recognition. In Proceedings of the IEEE/CVF international conference on computer vision, 6202--6211

work page 2019
[14]

Gao, J.; Sun, C.; Yang, Z.; and Nevatia, R. 2017. Tall: Temporal activity localization via language query. In Proceedings of the IEEE international conference on computer vision, 5267--5275

work page 2017
[15]

Gao, J.; Sun, X.; Xu, M.; Zhou, X.; and Ghanem, B. 2021. Relation-aware video reading comprehension for temporal language grounding. arXiv preprint arXiv:2110.05717

work page arXiv 2021
[16]

Grauman, K.; Westbury, A.; Byrne, E.; Chavis, Z.; Furnari, A.; Girdhar, R.; Hamburger, J.; Jiang, H.; Liu, M.; Liu, X.; et al. 2022. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 18995--19012

work page 2022
[17]

Guo, C.; Fan, B.; Zhang, Q.; Xiang, S.; and Pan, C. 2020. Augfpn: Improving multi-scale feature learning for object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 12595--12604

work page 2020
[18]

Han, D.; Cheng, X.; Guo, N.; Ye, X.; Rainer, B.; and Priller, P. 2023. Momentum cross-modal contrastive learning for video moment retrieval. IEEE Transactions on Circuits and Systems for Video Technology

work page 2023
[19]

Learning deep representations by mutual information estimation and maximization

Hjelm, R. D.; Fedorov, A.; Lavoie-Marchildon, S.; Grewal, K.; Bachman, P.; Trischler, A.; and Bengio, Y. 2018. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670

work page internal anchor Pith review Pith/arXiv arXiv 2018
[20]

Hou, Z.; Zhong, W.; Ji, L.; Gao, D.; Yan, K.; Chan, W.-K.; Ngo, C.-W.; Shou, Z.; and Duan, N. 2022. Cone: An efficient coarse-to-fine alignment framework for long video temporal grounding. arXiv preprint arXiv:2209.10918

work page arXiv 2022
[21]

Hu, H.; Cui, J.; and Wang, L. 2021. Region-aware contrastive learning for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 16291--16301

work page 2021
[22]

Ji, W.; Shi, R.; Wei, Y.; Zhao, S.; and Zimmermann, R. 2024. Weakly Supervised Video Moment Retrieval via Location-irrelevant Proposal Learning. In Companion Proceedings of the ACM on Web Conference 2024, 1595--1603

work page 2024
[23]

Jung, M.; Jang, Y.; Choi, S.; Kim, J.; Kim, J.-H.; and Zhang, B.-T. 2023. Overcoming Weak Visual-Textual Alignment for Video Moment Retrieval. arXiv preprint arXiv:2306.02728

work page arXiv 2023
[24]

Kay, W.; Carreira, J.; Simonyan, K.; Zhang, B.; Hillier, C.; Vijayanarasimhan, S.; Viola, F.; Green, T.; Back, T.; Natsev, P.; et al. 2017. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950

work page internal anchor Pith review Pith/arXiv arXiv 2017
[25]

Kim, T.; Kim, J.; Shim, M.; Yun, S.; Kang, M.; Wee, D.; and Lee, S. 2022. Exploring temporally dynamic data augmentation for video recognition. arXiv preprint arXiv:2206.15015

work page arXiv 2022
[26]

Krishna, R.; Hata, K.; Ren, F.; Fei-Fei, L.; and Carlos Niebles, J. 2017. Dense-captioning events in videos. In Proceedings of the IEEE international conference on computer vision, 706--715

work page 2017
[27]

L.; and Bansal, M

Lei, J.; Berg, T. L.; and Bansal, M. 2021. Detecting moments and highlights in videos via natural language queries. Advances in Neural Information Processing Systems, 34: 11846--11858

work page 2021
[28]

Li, H.; Cao, M.; Cheng, X.; Li, Y.; Zhu, Z.; and Zou, Y. 2023. G2l: Semantically aligned and uniform video grounding via geodesic and game theory. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 12032--12042

work page 2023
[29]

Li, J.; Xie, J.; Qian, L.; Zhu, L.; Tang, S.; Wu, F.; Yang, Y.; Zhuang, Y.; and Wang, X. E. 2022. Compositional temporal grounding with structured variational cross-graph correspondence learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3032--3041

work page 2022
[30]

Li, K.; Guo, D.; and Wang, M. 2021. Proposal-free video grounding with contextual pyramid network. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 1902--1910

work page 2021
[31]

Q.; Wang, J.; Soldan, M.; Wray, M.; Yan, R.; Xu, E

Lin, K. Q.; Wang, J.; Soldan, M.; Wray, M.; Yan, R.; Xu, E. Z.; Gao, D.; Tu, R.-C.; Zhao, W.; Kong, W.; et al. 2022. Egocentric video-language pretraining. Advances in Neural Information Processing Systems, 35: 7575--7586

work page 2022
[32]

Q.; Zhang, P.; Chen, J.; Pramanick, S.; Gao, D.; Wang, A

Lin, K. Q.; Zhang, P.; Chen, J.; Pramanick, S.; Gao, D.; Wang, A. J.; Yan, R.; and Shou, M. Z. 2023. Univtg: Towards unified video-language temporal grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2794--2804

work page 2023
[33]

Liu, D.; Qu, X.; Di, X.; Cheng, Y.; Xu, Z.; and Zhou, P. 2022. Memory-guided semantic learning network for temporal sentence grounding. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, 1665--1673

work page 2022
[34]

Liu, D.; Qu, X.; Dong, J.; and Zhou, P. 2021. Adaptive proposal generation network for temporal sentence localization in videos. arXiv preprint arXiv:2109.06398

work page arXiv 2021
[35]

Liu, Z.; Li, J.; Xie, H.; Li, P.; Ge, J.; Liu, S.-A.; and Jin, G. 2024. Towards balanced alignment: Modal-enhanced semantic modeling for video moment retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, 3855--3863

work page 2024
[36]

Mu, F.; Mo, S.; and Li, Y. 2024. SnAG: Scalable and Accurate Video Grounding. arXiv preprint arXiv:2404.02257

work page arXiv 2024
[37]

A.; and Tuan, L

Nguyen, C.-D.; Nguyen, T.; Vu, D. A.; and Tuan, L. A. 2023 a . Improving multimodal sentiment analysis: Supervised angular margin-based contrastive learning for enhanced fusion representation. arXiv preprint arXiv:2312.02227

work page arXiv 2023
[38]

Nguyen, C.-D.; Nguyen, T.; Wu, X.; and Luu, A. T. 2024 a . Kdmcse: Knowledge distillation multimodal sentence embeddings with adaptive angular margin contrastive learning. arXiv preprint arXiv:2403.17486

work page arXiv 2024
[39]

Nguyen, T.; Bin, Y.; Wu, X.; Dong, X.; Hu, Z.; Le, K.; Nguyen, C.-D.; Ng, S.-K.; and Tuan, L. A. 2025. Meta-optimized Angular Margin Contrastive Framework for Video-Language Representation Learning. In European Conference on Computer Vision, 77--98. Springer

work page 2025
[40]

Nguyen, T.; and Luu, A. T. 2021. Contrastive learning for neural topic model. Advances in neural information processing systems, 34: 11974--11986

work page 2021
[41]

Nguyen, T.; Wu, X.; Dong, X.; Nguyen, C.-D.; Ng, S.-K.; and Tuan, L. A. 2023 b . Demaformer: Damped exponential moving average transformer with energy-based modeling for temporal language grounding. arXiv preprint arXiv:2312.02549

work page internal anchor Pith review Pith/arXiv arXiv 2023
[42]

T.; Ng, S.-K.; and Luu, A

Nguyen, T.; Wu, X.; Dong, X.; Nguyen, C.-D. T.; Ng, S.-K.; and Luu, A. T. 2024 b . Topic Modeling as Multi-Objective Contrastive Optimization. arXiv preprint arXiv:2402.07577

work page arXiv 2024
[43]

Nguyen, T.; Wu, X.; Luu, A.-T.; Nguyen, C.-D.; Hai, Z.; and Bing, L. 2022. Adaptive contrastive learning on multimodal transformer for review helpfulness predictions. arXiv preprint arXiv:2211.03524

work page internal anchor Pith review Pith/arXiv arXiv 2022
[44]

Pan, Y.; He, X.; Gong, B.; Lv, Y.; Shen, Y.; Peng, Y.; and Zhao, D. 2023. Scanning only once: An end-to-end framework for fast temporal grounding in long videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 13767--13777

work page 2023
[45]

Panta, L.; Shrestha, P.; Sapkota, B.; Bhattarai, A.; Manandhar, S.; and Sah, A. K. 2024. Cross-modal Contrastive Learning with Asymmetric Co-attention Network for Video Moment Retrieval. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 607--614

work page 2024
[46]

Pennington, J.; Socher, R.; and Manning, C. D. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 1532--1543

work page 2014
[47]

W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al

Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning, 8748--8763. PMLR

work page 2021
[48]

Regneri, M.; Rohrbach, M.; Wetzel, D.; Thater, S.; Schiele, B.; and Pinkal, M. 2013. Grounding action descriptions in videos. Transactions of the Association for Computational Linguistics, 1: 25--36

work page 2013
[49]

A.; Varol, G.; Wang, X.; Farhadi, A.; Laptev, I.; and Gupta, A

Sigurdsson, G. A.; Varol, G.; Wang, X.; Farhadi, A.; Laptev, I.; and Gupta, A. 2016. Hollywood in homes: Crowdsourcing data collection for activity understanding. In Computer Vision--ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11--14, 2016, Proceedings, Part I 14, 510--526. Springer

work page 2016
[50]

L.; Caba, F.; Zhao, C.; Giancola, S.; and Ghanem, B

Soldan, M.; Pardo, A.; Alc \'a zar, J. L.; Caba, F.; Zhao, C.; Giancola, S.; and Ghanem, B. 2022. Mad: A scalable dataset for language grounding in videos from movie audio descriptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5026--5035

work page 2022
[51]

Soldan, M.; Xu, M.; Qu, S.; Tegner, J.; and Ghanem, B. 2021. Vlg-net: Video-language graph matching network for video grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 3224--3234

work page 2021
[52]

Touvron, H.; Cord, M.; Sablayrolles, A.; Synnaeve, G.; and J \'e gou, H. 2021. Going deeper with image transformers. In Proceedings of the IEEE/CVF international conference on computer vision, 32--42

work page 2021
[53]

Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; and Paluri, M. 2015. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, 4489--4497

work page 2015
[54]

Wang, H.; Zha, Z.-J.; Li, L.; Liu, D.; and Luo, J. 2021 a . Structured multi-level interaction network for video moment localization via language query. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7026--7035

work page 2021
[55]

Wang, W.; Zhou, T.; Yu, F.; Dai, J.; Konukoglu, E.; and Van Gool, L. 2021 b . Exploring cross-image pixel contrast for semantic segmentation. In Proceedings of the IEEE/CVF international conference on computer vision, 7303--7313

work page 2021
[56]

Wang, Z.; Wang, L.; Wu, T.; Li, T.; and Wu, G. 2022. Negative sample matters: A renaissance of metric learning for temporal grounding. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, 2613--2623

work page 2022
[57]

Woo, S.; Park, J.; Koo, I.; Lee, S.; Jeong, M.; and Kim, C. 2022. Explore and match: End-to-end video grounding with transformer. arXiv preprint arXiv:2201.10168, 1(4)

work page arXiv 2022
[58]

Wu, X.; Dong, X.; Nguyen, T.; Liu, C.; Pan, L.-M.; and Luu, A. T. 2023. Infoctm: A mutual information maximization perspective of cross-lingual topic modeling. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, 13763--13771

work page 2023
[59]

Wu, X.; Dong, X.; Pan, L.; Nguyen, T.; and Luu, A. T. 2024. Modeling Dynamic Topics in Chain-Free Fashion by Evolution-Tracking Contrastive Learning and Unassociated Word Exclusion. In Ku, L.-W.; Martins, A.; and Srikumar, V., eds., Findings of the Association for Computational Linguistics ACL 2024, 3088--3105. Bangkok, Thailand and virtual meeting: Assoc...

work page 2024
[60]

Xiao, S.; Chen, L.; Shao, J.; Zhuang, Y.; and Xiao, J. 2021 a . Natural language video localization with learnable moment proposals. arXiv preprint arXiv:2109.10678

work page arXiv 2021
[61]

Xiao, S.; Chen, L.; Zhang, S.; Ji, W.; Shao, J.; Ye, L.; and Xiao, J. 2021 b . Boundary proposal network for two-stage natural language video localization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 2986--2994

work page 2021
[62]

Xiao, Y.; Luo, Z.; Liu, Y.; Ma, Y.; Bian, H.; Ji, Y.; Yang, Y.; and Li, X. 2024. Bridging the gap: A unified video comprehension framework for moment retrieval and highlight detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 18709--18719

work page 2024
[63]

Xing, Z.; Dai, Q.; Hu, H.; Chen, J.; Wu, Z.; and Jiang, Y.-G. 2023. Svformer: Semi-supervised video transformer for action recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 18816--18826

work page 2023
[64]

Xu, M.; Soldan, M.; Gao, J.; Liu, S.; P \'e rez-R \'u a, J.-M.; and Ghanem, B. 2023. Boundary-denoising for video activity localization. arXiv preprint arXiv:2304.02934

work page arXiv 2023
[65]

Yang, G.; Lei, J.; Zhu, Z.; Cheng, S.; Feng, Z.; and Liang, R. 2023. AFPN: asymptotic feature pyramid network for object detection. In 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC), 2184--2189. IEEE

work page 2023
[66]

Zhang, C.-L.; Wu, J.; and Li, Y. 2022. Actionformer: Localizing moments of actions with transformers. In European Conference on Computer Vision, 492--510. Springer

work page 2022
[67]

Zhang, H.; Sun, A.; Jing, W.; and Zhou, J. T. 2020 a . Span-based localizing network for natural language video localization. arXiv preprint arXiv:2004.13931

work page arXiv 2020
[68]

Zhang, M.; Yang, Y.; Chen, X.; Ji, Y.; Xu, X.; Li, J.; and Shen, H. T. 2021. Multi-stage aggregated transformer network for temporal language localization in videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 12669--12678

work page 2021
[69]

Zhang, S.; Peng, H.; Fu, J.; and Luo, J. 2020 b . Learning 2d temporal adjacent networks for moment localization with natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 12870--12877

work page 2020
[70]

Zhang, S.; Zhu, Y.; and Roy-Chowdhury, A. K. 2016. Context-aware surveillance video summarization. IEEE Transactions on Image Processing, 25(11): 5469--5478

work page 2016
[71]

Zhang, Y.; He, R.; Liu, Z.; Lim, K. H.; and Bing, L. 2020c. An unsupervised sentence embedding method by mutual information maximization. arXiv preprint arXiv:2009.12061

[72]

Zhou, H.; Zhang, C.; Luo, Y.; Chen, Y.; and Hu, C. 2021. Embracing uncertainty: Decoupling and de-bias for robust temporal grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8445--8454

[73]

Zhu, J.; Liu, D.; Zhou, P.; Di, X.; Cheng, Y.; Yang, S.; Xu, W.; Xu, Z.; Wan, Y.; Sun, L.; et al. 2023. Rethinking the video sampling and reasoning strategies for temporal sentence grounding. arXiv preprint arXiv:2301.00514

[3]

An, X.; Deng, J.; Yang, K.; Li, J.; Feng, Z.; Guo, J.; Yang, J.; and Liu, T. 2023. Unicom: Universal and compact representation learning for image retrieval. arXiv preprint arXiv:2304.05884

[4]

Anne Hendricks, L.; Wang, O.; Shechtman, E.; Sivic, J.; Darrell, T.; and Russell, B. 2017. Localizing moments in video with natural language. In Proceedings of the IEEE international conference on computer vision, 5803--5812

[5]

Bachman, P.; Hjelm, R. D.; and Buchwalter, W. 2019. Learning representations by maximizing mutual information across views. Advances in neural information processing systems, 32

[6]

Bodla, N.; Singh, B.; Chellappa, R.; and Davis, L. S. 2017. Soft-NMS--improving object detection with one line of code. In Proceedings of the IEEE international conference on computer vision, 5561--5569

[7]

Burgner-Kahrs, J.; Rucker, D. C.; and Choset, H. 2015. Continuum robots for medical applications: A survey. IEEE Transactions on Robotics, 31(6): 1261--1280

[8]

Carreira, J.; and Zisserman, A. 2017. Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6299--6308

[9]

Chaitanya, K.; Erdil, E.; Karani, N.; and Konukoglu, E. 2020. Contrastive learning of global and local features for medical image segmentation with limited annotations. Advances in neural information processing systems, 33: 12546--12558

[10]

Claussmann, L.; Revilloud, M.; Gruyer, D.; and Glaser, S. 2019. A review of motion planning for highway autonomous driving. IEEE Transactions on Intelligent Transportation Systems, 21(5): 1826--1848

[11]

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805

[12]

Fang, X.; Liu, D.; Zhou, P.; and Nan, G. 2023. You can ground earlier than see: An effective and efficient pipeline for temporal sentence grounding in compressed videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2448--2460

[13]

Feichtenhofer, C.; Fan, H.; Malik, J.; and He, K. 2019. Slowfast networks for video recognition. In Proceedings of the IEEE/CVF international conference on computer vision, 6202--6211

[14]

Gao, J.; Sun, C.; Yang, Z.; and Nevatia, R. 2017. Tall: Temporal activity localization via language query. In Proceedings of the IEEE international conference on computer vision, 5267--5275

[15]

Gao, J.; Sun, X.; Xu, M.; Zhou, X.; and Ghanem, B. 2021. Relation-aware video reading comprehension for temporal language grounding. arXiv preprint arXiv:2110.05717

[16]

Grauman, K.; Westbury, A.; Byrne, E.; Chavis, Z.; Furnari, A.; Girdhar, R.; Hamburger, J.; Jiang, H.; Liu, M.; Liu, X.; et al. 2022. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 18995--19012

[17]

Guo, C.; Fan, B.; Zhang, Q.; Xiang, S.; and Pan, C. 2020. Augfpn: Improving multi-scale feature learning for object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 12595--12604

[18]

Han, D.; Cheng, X.; Guo, N.; Ye, X.; Rainer, B.; and Priller, P. 2023. Momentum cross-modal contrastive learning for video moment retrieval. IEEE Transactions on Circuits and Systems for Video Technology

[19]

Hjelm, R. D.; Fedorov, A.; Lavoie-Marchildon, S.; Grewal, K.; Bachman, P.; Trischler, A.; and Bengio, Y. 2018. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670

[20]

Hou, Z.; Zhong, W.; Ji, L.; Gao, D.; Yan, K.; Chan, W.-K.; Ngo, C.-W.; Shou, Z.; and Duan, N. 2022. Cone: An efficient coarse-to-fine alignment framework for long video temporal grounding. arXiv preprint arXiv:2209.10918

[21]

Hu, H.; Cui, J.; and Wang, L. 2021. Region-aware contrastive learning for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 16291--16301

[22]

Ji, W.; Shi, R.; Wei, Y.; Zhao, S.; and Zimmermann, R. 2024. Weakly Supervised Video Moment Retrieval via Location-irrelevant Proposal Learning. In Companion Proceedings of the ACM on Web Conference 2024, 1595--1603

[23]

Jung, M.; Jang, Y.; Choi, S.; Kim, J.; Kim, J.-H.; and Zhang, B.-T. 2023. Overcoming Weak Visual-Textual Alignment for Video Moment Retrieval. arXiv preprint arXiv:2306.02728

[24]

Kay, W.; Carreira, J.; Simonyan, K.; Zhang, B.; Hillier, C.; Vijayanarasimhan, S.; Viola, F.; Green, T.; Back, T.; Natsev, P.; et al. 2017. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950

[25]

Kim, T.; Kim, J.; Shim, M.; Yun, S.; Kang, M.; Wee, D.; and Lee, S. 2022. Exploring temporally dynamic data augmentation for video recognition. arXiv preprint arXiv:2206.15015

[26]

Krishna, R.; Hata, K.; Ren, F.; Fei-Fei, L.; and Carlos Niebles, J. 2017. Dense-captioning events in videos. In Proceedings of the IEEE international conference on computer vision, 706--715

[27]

Lei, J.; Berg, T. L.; and Bansal, M. 2021. Detecting moments and highlights in videos via natural language queries. Advances in Neural Information Processing Systems, 34: 11846--11858

[28]

Li, H.; Cao, M.; Cheng, X.; Li, Y.; Zhu, Z.; and Zou, Y. 2023. G2l: Semantically aligned and uniform video grounding via geodesic and game theory. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 12032--12042

[29]

Li, J.; Xie, J.; Qian, L.; Zhu, L.; Tang, S.; Wu, F.; Yang, Y.; Zhuang, Y.; and Wang, X. E. 2022. Compositional temporal grounding with structured variational cross-graph correspondence learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3032--3041

[30]

Li, K.; Guo, D.; and Wang, M. 2021. Proposal-free video grounding with contextual pyramid network. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 1902--1910

[31]

Lin, K. Q.; Wang, J.; Soldan, M.; Wray, M.; Yan, R.; Xu, E. Z.; Gao, D.; Tu, R.-C.; Zhao, W.; Kong, W.; et al. 2022. Egocentric video-language pretraining. Advances in Neural Information Processing Systems, 35: 7575--7586

[32]

Lin, K. Q.; Zhang, P.; Chen, J.; Pramanick, S.; Gao, D.; Wang, A. J.; Yan, R.; and Shou, M. Z. 2023. Univtg: Towards unified video-language temporal grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2794--2804

[33]

Liu, D.; Qu, X.; Di, X.; Cheng, Y.; Xu, Z.; and Zhou, P. 2022. Memory-guided semantic learning network for temporal sentence grounding. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, 1665--1673

[34]

Liu, D.; Qu, X.; Dong, J.; and Zhou, P. 2021. Adaptive proposal generation network for temporal sentence localization in videos. arXiv preprint arXiv:2109.06398

[35]

Liu, Z.; Li, J.; Xie, H.; Li, P.; Ge, J.; Liu, S.-A.; and Jin, G. 2024. Towards balanced alignment: Modal-enhanced semantic modeling for video moment retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, 3855--3863

[36]

Mu, F.; Mo, S.; and Li, Y. 2024. SnAG: Scalable and Accurate Video Grounding. arXiv preprint arXiv:2404.02257

[37]

Nguyen, C.-D.; Nguyen, T.; Vu, D. A.; and Tuan, L. A. 2023a. Improving multimodal sentiment analysis: Supervised angular margin-based contrastive learning for enhanced fusion representation. arXiv preprint arXiv:2312.02227

[38]

Nguyen, C.-D.; Nguyen, T.; Wu, X.; and Luu, A. T. 2024a. Kdmcse: Knowledge distillation multimodal sentence embeddings with adaptive angular margin contrastive learning. arXiv preprint arXiv:2403.17486

[39]

Nguyen, T.; Bin, Y.; Wu, X.; Dong, X.; Hu, Z.; Le, K.; Nguyen, C.-D.; Ng, S.-K.; and Tuan, L. A. 2025. Meta-optimized Angular Margin Contrastive Framework for Video-Language Representation Learning. In European Conference on Computer Vision, 77--98. Springer

[40]

Nguyen, T.; and Luu, A. T. 2021. Contrastive learning for neural topic model. Advances in neural information processing systems, 34: 11974--11986

[41]

Nguyen, T.; Wu, X.; Dong, X.; Nguyen, C.-D.; Ng, S.-K.; and Tuan, L. A. 2023b. Demaformer: Damped exponential moving average transformer with energy-based modeling for temporal language grounding. arXiv preprint arXiv:2312.02549

[42]

Nguyen, T.; Wu, X.; Dong, X.; Nguyen, C.-D. T.; Ng, S.-K.; and Luu, A. T. 2024b. Topic Modeling as Multi-Objective Contrastive Optimization. arXiv preprint arXiv:2402.07577

[43]

Nguyen, T.; Wu, X.; Luu, A.-T.; Nguyen, C.-D.; Hai, Z.; and Bing, L. 2022. Adaptive contrastive learning on multimodal transformer for review helpfulness predictions. arXiv preprint arXiv:2211.03524

[44]

Pan, Y.; He, X.; Gong, B.; Lv, Y.; Shen, Y.; Peng, Y.; and Zhao, D. 2023. Scanning only once: An end-to-end framework for fast temporal grounding in long videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 13767--13777

[45]

Panta, L.; Shrestha, P.; Sapkota, B.; Bhattarai, A.; Manandhar, S.; and Sah, A. K. 2024. Cross-modal Contrastive Learning with Asymmetric Co-attention Network for Video Moment Retrieval. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 607--614

[46]

Pennington, J.; Socher, R.; and Manning, C. D. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 1532--1543

[47]

Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning, 8748--8763. PMLR

[48]

Regneri, M.; Rohrbach, M.; Wetzel, D.; Thater, S.; Schiele, B.; and Pinkal, M. 2013. Grounding action descriptions in videos. Transactions of the Association for Computational Linguistics, 1: 25--36

[49]

Sigurdsson, G. A.; Varol, G.; Wang, X.; Farhadi, A.; Laptev, I.; and Gupta, A. 2016. Hollywood in homes: Crowdsourcing data collection for activity understanding. In Computer Vision--ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11--14, 2016, Proceedings, Part I 14, 510--526. Springer

[50]

Soldan, M.; Pardo, A.; Alcázar, J. L.; Caba, F.; Zhao, C.; Giancola, S.; and Ghanem, B. 2022. Mad: A scalable dataset for language grounding in videos from movie audio descriptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5026--5035

[51]

Soldan, M.; Xu, M.; Qu, S.; Tegner, J.; and Ghanem, B. 2021. Vlg-net: Video-language graph matching network for video grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 3224--3234

[52]

Touvron, H.; Cord, M.; Sablayrolles, A.; Synnaeve, G.; and Jégou, H. 2021. Going deeper with image transformers. In Proceedings of the IEEE/CVF international conference on computer vision, 32--42

[53]

Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; and Paluri, M. 2015. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, 4489--4497

[54]

Wang, H.; Zha, Z.-J.; Li, L.; Liu, D.; and Luo, J. 2021a. Structured multi-level interaction network for video moment localization via language query. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7026--7035

[55]

Wang, W.; Zhou, T.; Yu, F.; Dai, J.; Konukoglu, E.; and Van Gool, L. 2021b. Exploring cross-image pixel contrast for semantic segmentation. In Proceedings of the IEEE/CVF international conference on computer vision, 7303--7313

[56]

Wang, Z.; Wang, L.; Wu, T.; Li, T.; and Wu, G. 2022. Negative sample matters: A renaissance of metric learning for temporal grounding. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, 2613--2623

[57]

Woo, S.; Park, J.; Koo, I.; Lee, S.; Jeong, M.; and Kim, C. 2022. Explore and match: End-to-end video grounding with transformer. arXiv preprint arXiv:2201.10168, 1(4)

[58]

Wu, X.; Dong, X.; Nguyen, T.; Liu, C.; Pan, L.-M.; and Luu, A. T. 2023. Infoctm: A mutual information maximization perspective of cross-lingual topic modeling. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, 13763--13771

[59]

Wu, X.; Dong, X.; Pan, L.; Nguyen, T.; and Luu, A. T. 2024. Modeling Dynamic Topics in Chain-Free Fashion by Evolution-Tracking Contrastive Learning and Unassociated Word Exclusion. In Ku, L.-W.; Martins, A.; and Srikumar, V., eds., Findings of the Association for Computational Linguistics ACL 2024, 3088--3105. Bangkok, Thailand and virtual meeting: Assoc...

[60]

Xiao, S.; Chen, L.; Shao, J.; Zhuang, Y.; and Xiao, J. 2021a. Natural language video localization with learnable moment proposals. arXiv preprint arXiv:2109.10678

[61]

Xiao, S.; Chen, L.; Zhang, S.; Ji, W.; Shao, J.; Ye, L.; and Xiao, J. 2021b. Boundary proposal network for two-stage natural language video localization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 2986--2994


[62] [62]

Xiao, Y.; Luo, Z.; Liu, Y.; Ma, Y.; Bian, H.; Ji, Y.; Yang, Y.; and Li, X. 2024. Bridging the gap: A unified video comprehension framework for moment retrieval and highlight detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 18709--18719

work page 2024

[63] [63]

Xing, Z.; Dai, Q.; Hu, H.; Chen, J.; Wu, Z.; and Jiang, Y.-G. 2023. Svformer: Semi-supervised video transformer for action recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 18816--18826

work page 2023

[64] [64]

Xu, M.; Soldan, M.; Gao, J.; Liu, S.; P \'e rez-R \'u a, J.-M.; and Ghanem, B. 2023. Boundary-denoising for video activity localization. arXiv preprint arXiv:2304.02934

work page arXiv 2023

[65] [65]

Yang, G.; Lei, J.; Zhu, Z.; Cheng, S.; Feng, Z.; and Liang, R. 2023. AFPN: asymptotic feature pyramid network for object detection. In 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC), 2184--2189. IEEE

work page 2023

[66] [66]

Zhang, C.-L.; Wu, J.; and Li, Y. 2022. Actionformer: Localizing moments of actions with transformers. In European Conference on Computer Vision, 492--510. Springer

work page 2022

[67] [67]

Zhang, H.; Sun, A.; Jing, W.; and Zhou, J. T. 2020 a . Span-based localizing network for natural language video localization. arXiv preprint arXiv:2004.13931

work page arXiv 2020

[68] [68]

Zhang, M.; Yang, Y.; Chen, X.; Ji, Y.; Xu, X.; Li, J.; and Shen, H. T. 2021. Multi-stage aggregated transformer network for temporal language localization in videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 12669--12678

work page 2021

[69] [69]

Zhang, S.; Peng, H.; Fu, J.; and Luo, J. 2020 b . Learning 2d temporal adjacent networks for moment localization with natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 12870--12877

work page 2020

[70] [70]

Zhang, S.; Zhu, Y.; and Roy-Chowdhury, A. K. 2016. Context-aware surveillance video summarization. IEEE Transactions on Image Processing, 25(11): 5469--5478

work page 2016

[71] [71]

An unsupervised sentence embedding method by mutual information maximization

Zhang, Y.; He, R.; Liu, Z.; Lim, K. H.; and Bing, L. 2020 c . An unsupervised sentence embedding method by mutual information maximization. arXiv preprint arXiv:2009.12061

work page arXiv 2020

[72] [72]

Zhou, H.; Zhang, C.; Luo, Y.; Chen, Y.; and Hu, C. 2021. Embracing uncertainty: Decoupling and de-bias for robust temporal grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8445--8454

work page 2021

[73] [73]

Zhu, J.; Liu, D.; Zhou, P.; Di, X.; Cheng, Y.; Yang, S.; Xu, W.; Xu, Z.; Wan, Y.; Sun, L.; et al. 2023. Rethinking the video sampling and reasoning strategies for temporal sentence grounding. arXiv preprint arXiv:2301.00514

work page arXiv 2023