Multi-Scale Contrastive Learning for Video Temporal Grounding
Pith reviewed 2026-05-23 07:27 UTC · model grok-4.3
The pith
Contrastive learning across video encoder stages links short and long moments to improve temporal grounding.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that sampling multiple video moments per query and contrasting their representations across video encoder layers instantiates a novel multi-scale and cross-scale contrastive learning process that links local short-range video moments with global long-range video moments, thereby capturing salient semantics and mitigating information degradation in higher pyramid levels.
What carries the argument
multi-scale and cross-scale contrastive learning that uses representations from multiple stages of the video encoder as positive and negative samples
If this is right
- Higher pyramid levels retain more information through cross-scale contrasts with lower levels.
- The approach improves localization accuracy for both long-form and short-form videos.
- Contrastive learning for this task requires neither external augmentations nor memory banks.
- Multiple moments per query supply sufficient diversity for the contrastive signals.
Where Pith is reading between the lines
- The internal diversity of encoder stages might reduce reliance on external contrastive techniques in other multi-scale vision tasks.
- The sampling process could be adapted to query-video pairs in related grounding or retrieval settings.
- If the method scales, it suggests encoder layers already encode the necessary scale variation for contrast without added machinery.
Load-bearing premise
Representations from multiple stages of the video encoder can serve directly as positive and negative samples for contrastive learning without augmentation or memory banks, and sampling multiple moments per query yields effective training signals.
What would settle it
Training the model without the cross-scale contrastive losses and observing no drop in grounding accuracy on standard benchmarks would falsify the claim that these contrasts are what link the moments and improve performance.
Figures
read the original abstract
Temporal grounding, which localizes video moments related to a natural language query, is a core problem of vision-language learning and video understanding. To encode video moments of varying lengths, recent methods employ a multi-level structure known as a feature pyramid. In this structure, lower levels concentrate on short-range video moments, while higher levels address long-range moments. Because higher levels experience downsampling to accommodate increasing moment length, their capacity to capture information is reduced and consequently leads to degraded information in moment representations. To resolve this problem, we propose a contrastive learning framework to capture salient semantics among video moments. Our key methodology is to leverage samples from the feature space emanating from multiple stages of the video encoder itself requiring neither data augmentation nor online memory banks to obtain positive and negative samples. To enable such an extension, we introduce a sampling process to draw multiple video moments corresponding to a common query. Subsequently, by utilizing these moments' representations across video encoder layers, we instantiate a novel form of multi-scale and cross-scale contrastive learning that links local short-range video moments with global long-range video moments. Extensive experiments demonstrate the effectiveness of our framework for not only long-form but also short-form video grounding. Code is available at https://github.com/nguyentthong/MSCL
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a contrastive learning framework for video temporal grounding that operates on a feature pyramid encoder. Lower pyramid levels target short-range moments while higher levels (after downsampling) target long-range moments; the authors introduce a sampling procedure to obtain multiple moments per query and then apply multi-scale and cross-scale contrastive losses that treat representations from different encoder stages directly as positives and negatives, without data augmentation or memory banks. The central claim is that this linkage recovers information lost to downsampling and improves grounding performance on both long-form and short-form videos.
Significance. If the cross-scale objective demonstrably enforces semantic rather than trivial layer-wise alignment, the method would constitute a lightweight engineering contribution that avoids extra parameters or memory banks while addressing a recognized limitation of pyramid-based video encoders. Code release supports reproducibility, which is a positive factor.
major comments (2)
- [Abstract / Method] Abstract and method description: the central claim that representations from multiple pyramid stages can serve directly as positive/negative samples for contrastive learning rests on the unverified assumption that the sampling of multiple moments per query produces complementary semantics rather than scale-specific artifacts. Because stages share backbone weights and process overlapping temporal content, the risk of collapse to trivial similarity is load-bearing for the claimed linkage between short-range and long-range moments; no concrete test or ablation is referenced to rule this out.
- [Abstract] Abstract: the statement that 'extensive experiments demonstrate the effectiveness' supplies no quantitative results, baselines, ablation details, or error analysis, leaving the improvement claim unevidenced and preventing assessment of whether the framework actually outperforms prior multi-level temporal grounding methods.
minor comments (2)
- Notation for the contrastive loss (temperature, positive/negative definitions across scales) is not introduced in the provided description, which would aid clarity.
- [Abstract] The abstract could briefly indicate the evaluation datasets and primary metrics to contextualize the claimed effectiveness.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments point-by-point below, indicating planned revisions where appropriate.
read point-by-point responses
-
Referee: [Abstract / Method] Abstract and method description: the central claim that representations from multiple pyramid stages can serve directly as positive/negative samples for contrastive learning rests on the unverified assumption that the sampling of multiple moments per query produces complementary semantics rather than scale-specific artifacts. Because stages share backbone weights and process overlapping temporal content, the risk of collapse to trivial similarity is load-bearing for the claimed linkage between short-range and long-range moments; no concrete test or ablation is referenced to rule this out.
Authors: We agree that an explicit check against trivial alignment strengthens the central claim. Section 3.2 details the multi-moment sampling per query, which selects temporally distinct moments at different scales; the cross-scale loss then treats same-query representations from different pyramid stages as positives. Existing ablations in Section 4.3 already show that ablating the cross-scale term degrades performance on both long- and short-form benchmarks, indicating the loss captures more than layer-wise artifacts. To directly address the referee's concern, we will add a new analysis measuring inter-stage cosine similarity with and without the contrastive objective, confirming that the loss increases semantic alignment beyond trivial similarity. revision: yes
-
Referee: [Abstract] Abstract: the statement that 'extensive experiments demonstrate the effectiveness' supplies no quantitative results, baselines, ablation details, or error analysis, leaving the improvement claim unevidenced and preventing assessment of whether the framework actually outperforms prior multi-level temporal grounding methods.
Authors: The abstract is intentionally concise, but we accept that including key quantitative evidence would improve clarity. We will revise the abstract to report the main performance gains (e.g., absolute improvements on Charades-STA and ActivityNet-Captions relative to recent pyramid-based baselines) together with a brief mention of the ablation findings. revision: yes
Circularity Check
No circularity; independent engineering proposal for multi-scale contrastive pairs
full rationale
The paper introduces a contrastive learning framework that treats representations from multiple stages of an existing feature pyramid encoder as positives/negatives, augmented by a new sampling process for multiple moments per query. This construction is defined directly in the method section without reducing any claimed result to a fitted parameter, prior self-citation, or input quantity by construction. The central claim (multi-scale and cross-scale contrastive learning linking short- and long-range moments) is an additive loss term whose effectiveness is evaluated externally via experiments rather than derived tautologically from the inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Contrastive learning on multi-stage encoder features can capture salient semantics and compensate for information loss in downsampled pyramid levels
Reference graph
Works this paper leans on
-
[1]
, " * write output.state after.block = add.period write newline
ENTRY address archivePrefix author booktitle chapter edition editor eid eprint howpublished institution isbn journal key month note number organization pages publisher school series title type volume year label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block FUNCTION init.state.consts #0 'before.a...
-
[2]
" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...
- [3]
-
[4]
Anne Hendricks, L.; Wang, O.; Shechtman, E.; Sivic, J.; Darrell, T.; and Russell, B. 2017. Localizing moments in video with natural language. In Proceedings of the IEEE international conference on computer vision, 5803--5812
work page 2017
-
[5]
Bachman, P.; Hjelm, R. D.; and Buchwalter, W. 2019. Learning representations by maximizing mutual information across views. Advances in neural information processing systems, 32
work page 2019
-
[6]
Bodla, N.; Singh, B.; Chellappa, R.; and Davis, L. S. 2017. Soft-NMS--improving object detection with one line of code. In Proceedings of the IEEE international conference on computer vision, 5561--5569
work page 2017
-
[7]
Burgner-Kahrs, J.; Rucker, D. C.; and Choset, H. 2015. Continuum robots for medical applications: A survey. IEEE Transactions on Robotics, 31(6): 1261--1280
work page 2015
-
[8]
Carreira, J.; and Zisserman, A. 2017. Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6299--6308
work page 2017
-
[9]
Chaitanya, K.; Erdil, E.; Karani, N.; and Konukoglu, E. 2020. Contrastive learning of global and local features for medical image segmentation with limited annotations. Advances in neural information processing systems, 33: 12546--12558
work page 2020
-
[10]
Claussmann, L.; Revilloud, M.; Gruyer, D.; and Glaser, S. 2019. A review of motion planning for highway autonomous driving. IEEE Transactions on Intelligent Transportation Systems, 21(5): 1826--1848
work page 2019
-
[11]
Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[12]
Fang, X.; Liu, D.; Zhou, P.; and Nan, G. 2023. You can ground earlier than see: An effective and efficient pipeline for temporal sentence grounding in compressed videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2448--2460
work page 2023
-
[13]
Feichtenhofer, C.; Fan, H.; Malik, J.; and He, K. 2019. Slowfast networks for video recognition. In Proceedings of the IEEE/CVF international conference on computer vision, 6202--6211
work page 2019
-
[14]
Gao, J.; Sun, C.; Yang, Z.; and Nevatia, R. 2017. Tall: Temporal activity localization via language query. In Proceedings of the IEEE international conference on computer vision, 5267--5275
work page 2017
- [15]
-
[16]
Grauman, K.; Westbury, A.; Byrne, E.; Chavis, Z.; Furnari, A.; Girdhar, R.; Hamburger, J.; Jiang, H.; Liu, M.; Liu, X.; et al. 2022. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 18995--19012
work page 2022
-
[17]
Guo, C.; Fan, B.; Zhang, Q.; Xiang, S.; and Pan, C. 2020. Augfpn: Improving multi-scale feature learning for object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 12595--12604
work page 2020
-
[18]
Han, D.; Cheng, X.; Guo, N.; Ye, X.; Rainer, B.; and Priller, P. 2023. Momentum cross-modal contrastive learning for video moment retrieval. IEEE Transactions on Circuits and Systems for Video Technology
work page 2023
-
[19]
Learning deep representations by mutual information estimation and maximization
Hjelm, R. D.; Fedorov, A.; Lavoie-Marchildon, S.; Grewal, K.; Bachman, P.; Trischler, A.; and Bengio, Y. 2018. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670
work page internal anchor Pith review Pith/arXiv arXiv 2018
- [20]
-
[21]
Hu, H.; Cui, J.; and Wang, L. 2021. Region-aware contrastive learning for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 16291--16301
work page 2021
-
[22]
Ji, W.; Shi, R.; Wei, Y.; Zhao, S.; and Zimmermann, R. 2024. Weakly Supervised Video Moment Retrieval via Location-irrelevant Proposal Learning. In Companion Proceedings of the ACM on Web Conference 2024, 1595--1603
work page 2024
- [23]
-
[24]
Kay, W.; Carreira, J.; Simonyan, K.; Zhang, B.; Hillier, C.; Vijayanarasimhan, S.; Viola, F.; Green, T.; Back, T.; Natsev, P.; et al. 2017. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950
work page internal anchor Pith review Pith/arXiv arXiv 2017
- [25]
-
[26]
Krishna, R.; Hata, K.; Ren, F.; Fei-Fei, L.; and Carlos Niebles, J. 2017. Dense-captioning events in videos. In Proceedings of the IEEE international conference on computer vision, 706--715
work page 2017
-
[27]
Lei, J.; Berg, T. L.; and Bansal, M. 2021. Detecting moments and highlights in videos via natural language queries. Advances in Neural Information Processing Systems, 34: 11846--11858
work page 2021
-
[28]
Li, H.; Cao, M.; Cheng, X.; Li, Y.; Zhu, Z.; and Zou, Y. 2023. G2l: Semantically aligned and uniform video grounding via geodesic and game theory. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 12032--12042
work page 2023
-
[29]
Li, J.; Xie, J.; Qian, L.; Zhu, L.; Tang, S.; Wu, F.; Yang, Y.; Zhuang, Y.; and Wang, X. E. 2022. Compositional temporal grounding with structured variational cross-graph correspondence learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3032--3041
work page 2022
-
[30]
Li, K.; Guo, D.; and Wang, M. 2021. Proposal-free video grounding with contextual pyramid network. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 1902--1910
work page 2021
-
[31]
Q.; Wang, J.; Soldan, M.; Wray, M.; Yan, R.; Xu, E
Lin, K. Q.; Wang, J.; Soldan, M.; Wray, M.; Yan, R.; Xu, E. Z.; Gao, D.; Tu, R.-C.; Zhao, W.; Kong, W.; et al. 2022. Egocentric video-language pretraining. Advances in Neural Information Processing Systems, 35: 7575--7586
work page 2022
-
[32]
Q.; Zhang, P.; Chen, J.; Pramanick, S.; Gao, D.; Wang, A
Lin, K. Q.; Zhang, P.; Chen, J.; Pramanick, S.; Gao, D.; Wang, A. J.; Yan, R.; and Shou, M. Z. 2023. Univtg: Towards unified video-language temporal grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2794--2804
work page 2023
-
[33]
Liu, D.; Qu, X.; Di, X.; Cheng, Y.; Xu, Z.; and Zhou, P. 2022. Memory-guided semantic learning network for temporal sentence grounding. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, 1665--1673
work page 2022
- [34]
-
[35]
Liu, Z.; Li, J.; Xie, H.; Li, P.; Ge, J.; Liu, S.-A.; and Jin, G. 2024. Towards balanced alignment: Modal-enhanced semantic modeling for video moment retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, 3855--3863
work page 2024
- [36]
-
[37]
Nguyen, C.-D.; Nguyen, T.; Vu, D. A.; and Tuan, L. A. 2023 a . Improving multimodal sentiment analysis: Supervised angular margin-based contrastive learning for enhanced fusion representation. arXiv preprint arXiv:2312.02227
- [38]
-
[39]
Nguyen, T.; Bin, Y.; Wu, X.; Dong, X.; Hu, Z.; Le, K.; Nguyen, C.-D.; Ng, S.-K.; and Tuan, L. A. 2025. Meta-optimized Angular Margin Contrastive Framework for Video-Language Representation Learning. In European Conference on Computer Vision, 77--98. Springer
work page 2025
-
[40]
Nguyen, T.; and Luu, A. T. 2021. Contrastive learning for neural topic model. Advances in neural information processing systems, 34: 11974--11986
work page 2021
-
[41]
Nguyen, T.; Wu, X.; Dong, X.; Nguyen, C.-D.; Ng, S.-K.; and Tuan, L. A. 2023 b . Demaformer: Damped exponential moving average transformer with energy-based modeling for temporal language grounding. arXiv preprint arXiv:2312.02549
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[42]
Nguyen, T.; Wu, X.; Dong, X.; Nguyen, C.-D. T.; Ng, S.-K.; and Luu, A. T. 2024 b . Topic Modeling as Multi-Objective Contrastive Optimization. arXiv preprint arXiv:2402.07577
-
[43]
Nguyen, T.; Wu, X.; Luu, A.-T.; Nguyen, C.-D.; Hai, Z.; and Bing, L. 2022. Adaptive contrastive learning on multimodal transformer for review helpfulness predictions. arXiv preprint arXiv:2211.03524
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[44]
Pan, Y.; He, X.; Gong, B.; Lv, Y.; Shen, Y.; Peng, Y.; and Zhao, D. 2023. Scanning only once: An end-to-end framework for fast temporal grounding in long videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 13767--13777
work page 2023
-
[45]
Panta, L.; Shrestha, P.; Sapkota, B.; Bhattarai, A.; Manandhar, S.; and Sah, A. K. 2024. Cross-modal Contrastive Learning with Asymmetric Co-attention Network for Video Moment Retrieval. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 607--614
work page 2024
-
[46]
Pennington, J.; Socher, R.; and Manning, C. D. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 1532--1543
work page 2014
-
[47]
Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning, 8748--8763. PMLR
work page 2021
-
[48]
Regneri, M.; Rohrbach, M.; Wetzel, D.; Thater, S.; Schiele, B.; and Pinkal, M. 2013. Grounding action descriptions in videos. Transactions of the Association for Computational Linguistics, 1: 25--36
work page 2013
-
[49]
A.; Varol, G.; Wang, X.; Farhadi, A.; Laptev, I.; and Gupta, A
Sigurdsson, G. A.; Varol, G.; Wang, X.; Farhadi, A.; Laptev, I.; and Gupta, A. 2016. Hollywood in homes: Crowdsourcing data collection for activity understanding. In Computer Vision--ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11--14, 2016, Proceedings, Part I 14, 510--526. Springer
work page 2016
-
[50]
L.; Caba, F.; Zhao, C.; Giancola, S.; and Ghanem, B
Soldan, M.; Pardo, A.; Alc \'a zar, J. L.; Caba, F.; Zhao, C.; Giancola, S.; and Ghanem, B. 2022. Mad: A scalable dataset for language grounding in videos from movie audio descriptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5026--5035
work page 2022
-
[51]
Soldan, M.; Xu, M.; Qu, S.; Tegner, J.; and Ghanem, B. 2021. Vlg-net: Video-language graph matching network for video grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 3224--3234
work page 2021
-
[52]
Touvron, H.; Cord, M.; Sablayrolles, A.; Synnaeve, G.; and J \'e gou, H. 2021. Going deeper with image transformers. In Proceedings of the IEEE/CVF international conference on computer vision, 32--42
work page 2021
-
[53]
Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; and Paluri, M. 2015. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, 4489--4497
work page 2015
-
[54]
Wang, H.; Zha, Z.-J.; Li, L.; Liu, D.; and Luo, J. 2021 a . Structured multi-level interaction network for video moment localization via language query. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7026--7035
work page 2021
-
[55]
Wang, W.; Zhou, T.; Yu, F.; Dai, J.; Konukoglu, E.; and Van Gool, L. 2021 b . Exploring cross-image pixel contrast for semantic segmentation. In Proceedings of the IEEE/CVF international conference on computer vision, 7303--7313
work page 2021
-
[56]
Wang, Z.; Wang, L.; Wu, T.; Li, T.; and Wu, G. 2022. Negative sample matters: A renaissance of metric learning for temporal grounding. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, 2613--2623
work page 2022
- [57]
-
[58]
Wu, X.; Dong, X.; Nguyen, T.; Liu, C.; Pan, L.-M.; and Luu, A. T. 2023. Infoctm: A mutual information maximization perspective of cross-lingual topic modeling. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, 13763--13771
work page 2023
-
[59]
Wu, X.; Dong, X.; Pan, L.; Nguyen, T.; and Luu, A. T. 2024. Modeling Dynamic Topics in Chain-Free Fashion by Evolution-Tracking Contrastive Learning and Unassociated Word Exclusion. In Ku, L.-W.; Martins, A.; and Srikumar, V., eds., Findings of the Association for Computational Linguistics ACL 2024, 3088--3105. Bangkok, Thailand and virtual meeting: Assoc...
work page 2024
- [60]
-
[61]
Xiao, S.; Chen, L.; Zhang, S.; Ji, W.; Shao, J.; Ye, L.; and Xiao, J. 2021 b . Boundary proposal network for two-stage natural language video localization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 2986--2994
work page 2021
-
[62]
Xiao, Y.; Luo, Z.; Liu, Y.; Ma, Y.; Bian, H.; Ji, Y.; Yang, Y.; and Li, X. 2024. Bridging the gap: A unified video comprehension framework for moment retrieval and highlight detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 18709--18719
work page 2024
-
[63]
Xing, Z.; Dai, Q.; Hu, H.; Chen, J.; Wu, Z.; and Jiang, Y.-G. 2023. Svformer: Semi-supervised video transformer for action recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 18816--18826
work page 2023
- [64]
-
[65]
Yang, G.; Lei, J.; Zhu, Z.; Cheng, S.; Feng, Z.; and Liang, R. 2023. AFPN: asymptotic feature pyramid network for object detection. In 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC), 2184--2189. IEEE
work page 2023
-
[66]
Zhang, C.-L.; Wu, J.; and Li, Y. 2022. Actionformer: Localizing moments of actions with transformers. In European Conference on Computer Vision, 492--510. Springer
work page 2022
- [67]
-
[68]
Zhang, M.; Yang, Y.; Chen, X.; Ji, Y.; Xu, X.; Li, J.; and Shen, H. T. 2021. Multi-stage aggregated transformer network for temporal language localization in videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 12669--12678
work page 2021
-
[69]
Zhang, S.; Peng, H.; Fu, J.; and Luo, J. 2020 b . Learning 2d temporal adjacent networks for moment localization with natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 12870--12877
work page 2020
-
[70]
Zhang, S.; Zhu, Y.; and Roy-Chowdhury, A. K. 2016. Context-aware surveillance video summarization. IEEE Transactions on Image Processing, 25(11): 5469--5478
work page 2016
-
[71]
An unsupervised sentence embedding method by mutual information maximization
Zhang, Y.; He, R.; Liu, Z.; Lim, K. H.; and Bing, L. 2020 c . An unsupervised sentence embedding method by mutual information maximization. arXiv preprint arXiv:2009.12061
-
[72]
Zhou, H.; Zhang, C.; Luo, Y.; Chen, Y.; and Hu, C. 2021. Embracing uncertainty: Decoupling and de-bias for robust temporal grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8445--8454
work page 2021
- [73]
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.