ClipTBP: Clip-Pair based Temporal Boundary Prediction with Boundary-Aware Learning for Moment Retrieval

Ho-Joong Kim; Ji-Hyeon Kim; Seong-Whan Lee

arxiv: 2604.27591 · v1 · submitted 2026-04-30 · 💻 cs.CV · cs.AI

ClipTBP: Clip-Pair based Temporal Boundary Prediction with Boundary-Aware Learning for Moment Retrieval

Ji-Hyeon Kim , Ho-Joong Kim , Seong-Whan Lee This is my paper

Pith reviewed 2026-05-07 09:14 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords moment retrievaltemporal boundary predictionclip-pair alignmentboundary-aware learningvideo-text retrievalauxiliary lossmultimodal alignment

0 comments

The pith

ClipTBP learns relationships between matching video segments to sharpen temporal boundary predictions in moment retrieval.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes ClipTBP to fix a gap in video moment retrieval where models score text-video similarity only at the snippet level and ignore how multiple answer segments relate to one another. It introduces a clip-level alignment loss that explicitly trains the model on semantic relationships between pairs of answer segments so irrelevant but visually similar segments can be excluded. The framework adds an auxiliary boundary loss alongside the main boundary loss to refine the exact start and end times. When plugged into existing models, ClipTBP raises retrieval accuracy and stays stable even when the text query is ambiguous.

Core claim

ClipTBP is a clip-pair temporal boundary prediction framework based on boundary-aware learning. It adds a clip-level alignment loss that learns the semantic relationship between answer segments and combines a main boundary loss with an auxiliary boundary loss to produce accurate temporal boundaries.

What carries the argument

Clip-level alignment loss that operates on pairs of clips to capture relationships between multiple answer segments, paired with auxiliary boundary loss for boundary-aware refinement.

If this is right

Existing moment retrieval models gain consistent performance lifts when the clip-pair losses are added.
Boundary predictions become more robust when the query text leaves room for multiple possible segments.
The model learns to suppress visually similar but query-irrelevant clips that snippet-level scoring misses.
Multimodal alignment moves from independent snippet scoring to explicit inter-segment relationship modeling.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pair-wise alignment idea could be tested on other temporal localization tasks such as action segmentation or video grounding.
Real-world video search tools might see fewer false positives when users give vague descriptions.
If clip-pair relationships prove decisive, future architectures could replace snippet-level similarity heads entirely with pair-aware modules.

Load-bearing premise

The clip-level alignment loss will successfully teach the model to use relationships between answer segments to drop irrelevant ones, and the auxiliary boundary loss will measurably raise boundary accuracy.

What would settle it

A controlled test in which adding the clip-level alignment loss and auxiliary boundary loss to a baseline model produces no gain or a drop in recall on standard benchmarks such as ActivityNet Captions or Charades-STA.

Figures

Figures reproduced from arXiv: 2604.27591 by Ho-Joong Kim, Ji-Hyeon Kim, Seong-Whan Lee.

**Figure 1.** Figure 1: Challenges when multiple answer segments exist in a video. The blue boxes are the answer segments matching the query in the ground truth. The red boxes at the top are the segments predicted by baseline model (FlashVTG [1]) and the red boxes at the bottom are the segments predicted by our proposed model. The images connected to the green arrows are the answer segments that match the query, and the images co… view at source ↗

**Figure 2.** Figure 2: Overall framework of ClipTBP. ClipTBP encodes the input video V = {vt} T t=1 and text query q into a video encoder fv(·) and text encoder fq(·), respectively, and then passes them through a multimodal encoder fm(·) to generate snippet-level multimodal representations Z = {zi} T i=1. The prediction head generates a saliency score, start/end offset, and foreground probability for each clip. The training uses… view at source ↗

**Figure 3.** Figure 3: Visualization of experimental results for two different example queries using the baseline (FlashVTG [1]) and our model on QVHighlights [13]. The photos above show part of the connected segment, with the green arrow indicating the correct segment and the red arrow indicating the incorrect segment. The colored sections are the segments predicted by each model. (i), (ii), (iii), (iv), (v): The baseline mode… view at source ↗

**Figure 4.** Figure 4: t-SNE visualization of snippet-level embeddings on QVHighlights. Blue points denote embeddings of clips that correspond to the ground truth answer segments, while red points denote embeddings of clips that are not part of the ground truth answers and are unrelated to the query. A comparison experiment between FlashVTG [1] and our proposed model. embeddings of segments unrelated to the query are mixed in th… view at source ↗

read the original abstract

Video moment retrieval is the task of retrieving specific segments of a video corresponding to a given text query. Recent studies have been conducted to improve multimodal alignment performance through visual-linguistic similarity learning at the snippet-level and transformer-based temporal boundary regression. However, existing models do not calculate similarity by considering the relationships between multiple answer segments that match the query. Therefore, existing models are easily influenced by visually similar segments in the surrounding context. Existing models calculate similarity at the snippet-level and ignore the relationships between multiple answer segments corresponding to a single query. Therefore, they struggle to exclude segments irrelevant to the query. To address this issues, we propose ClipTBP, a clip-pair temporal boundary prediction framework based on boundary-aware learning. ClipTBP introduces a clip-level alignment loss for explicitly learning the semantic relationship between answer segments. ClipTBP also predicts accurate temporal boundaries by applying both main boundary loss and auxiliary boundary loss. ClipTBP consistently improves performance when applied to various existing models and demonstrates more robust boundary prediction performance even in ambiguous query scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ClipTBP adds a clip-pair alignment loss and auxiliary boundary term on top of standard snippet similarity and transformer regression, but the gains look incremental and need the full tables to judge.

read the letter

The main point is that this paper targets a real but narrow weakness in video moment retrieval: snippet-level alignment ignores how multiple answer segments relate to each other, so models get pulled toward visually similar distractors. ClipTBP tries to fix that with a clip-level alignment loss on pairs plus a dual boundary loss setup, and it claims the combination plugs into existing models without a full rewrite.

Referee Report

2 major / 3 minor

Summary. The paper proposes ClipTBP, a framework for video moment retrieval that augments existing snippet-level multimodal alignment and transformer-based boundary regression models. It introduces a clip-level alignment loss to explicitly model semantic relationships across multiple answer segments matching a query (addressing the problem of visually similar but irrelevant surrounding segments) and combines a main boundary loss with an auxiliary boundary loss for improved temporal boundary prediction. The authors claim that ClipTBP can be plugged into various existing models, yielding consistent performance gains and greater robustness under ambiguous queries.

Significance. If the empirical gains and robustness claims hold under standard benchmarks (e.g., Charades-STA, ActivityNet-Captions), the work would offer a practical, boundary-aware refinement to current alignment objectives in moment retrieval. The clip-pair formulation could generalize to other tasks requiring exclusion of contextually similar negatives, and the dual-loss boundary prediction might reduce sensitivity to query ambiguity without architectural overhaul.

major comments (2)

[§3.2] §3.2, Eq. (3)–(5): the clip-level alignment loss is defined solely in terms of the proposed framework’s own positive/negative clip pairs; without an external anchor (e.g., comparison to a fixed contrastive baseline or human-annotated segment relations), the reported improvement risks being circular to the loss definition itself. An ablation that isolates this loss against a standard InfoNCE or triplet baseline on the same backbone is needed to substantiate the central claim.
[Table 2, §4.3] Table 2 and §4.3: the statement that ClipTBP “consistently improves performance when applied to various existing models” is supported only by aggregate R@1, mIoU numbers; per-model delta tables and statistical significance tests (e.g., paired t-test across 5 seeds) are absent, making it impossible to judge whether gains are uniform or driven by a single backbone.

minor comments (3)

[§3.1] §3.1: the distinction between “snippet-level” and “clip-level” alignment is introduced without a formal definition or diagram; a small schematic showing how clip pairs are sampled from the same video would clarify the input to Eq. (3).
[§4.1] §4.1: training hyper-parameters (learning rate schedule, loss weighting λ_aux, clip-pair sampling ratio) are listed but not ablated; a sensitivity plot would strengthen reproducibility.
[Figure 4] Figure 4: the qualitative examples of “ambiguous query scenarios” lack ground-truth boundaries and model predictions side-by-side; adding these would make the robustness claim visually verifiable.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment below, providing clarifications and committing to revisions that directly respond to the concerns raised.

read point-by-point responses

Referee: [§3.2] §3.2, Eq. (3)–(5): the clip-level alignment loss is defined solely in terms of the proposed framework’s own positive/negative clip pairs; without an external anchor (e.g., comparison to a fixed contrastive baseline or human-annotated segment relations), the reported improvement risks being circular to the loss definition itself. An ablation that isolates this loss against a standard InfoNCE or triplet baseline on the same backbone is needed to substantiate the central claim.

Authors: We acknowledge the referee's concern that the clip-level alignment loss requires comparison to standard baselines to fully substantiate its contribution and avoid any appearance of circular evaluation. The proposed loss is specifically designed to model semantic relationships across multiple answer segments via clip pairs, which extends beyond standard snippet-level contrastive objectives. To address this directly, we will add an ablation study in the revised manuscript that isolates the clip-level alignment loss against both a standard InfoNCE loss and a triplet loss applied to the same backbone models, reporting the resulting performance differences on the benchmarks. revision: yes
Referee: [Table 2, §4.3] Table 2 and §4.3: the statement that ClipTBP “consistently improves performance when applied to various existing models” is supported only by aggregate R@1, mIoU numbers; per-model delta tables and statistical significance tests (e.g., paired t-test across 5 seeds) are absent, making it impossible to judge whether gains are uniform or driven by a single backbone.

Authors: We agree that aggregate metrics alone are insufficient to demonstrate uniform improvements across backbones. In the revised manuscript, we will expand the presentation of results to include a per-model delta table showing individual performance gains (R@1 and mIoU) for each existing model to which ClipTBP is applied. We will also report statistical significance via paired t-tests computed over multiple random seeds (e.g., 5 runs) to confirm that the gains are consistent and not attributable to any single backbone. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper identifies limitations in prior snippet-level alignment methods and introduces clip-level alignment loss plus auxiliary boundary losses as part of a new framework, then reports empirical gains on benchmarks when applied to existing models. No load-bearing derivation step reduces by construction to a fitted input, self-citation, or renamed ansatz; the losses are explicitly defined as novel components and validated externally rather than tautologically. The argument is self-contained against standard retrieval benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no details on parameters or assumptions; empty lists.

pith-pipeline@v0.9.0 · 9032 in / 907 out tokens · 109341 ms · 2026-05-07T09:14:23.254683+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

38 extracted references · 7 canonical work pages

[1]

Flashvtg: Feature layering and adaptive score handling network for video temporal grounding.arXiv preprint arXiv:2412.13441, 2024

Cao, Z. et al.: FlashVTG: Feature layering and adaptive score handling network for video temporal grounding. arXiv preprint arXiv:2412.13441 (2024)

work page arXiv 2024
[2]

et al.: End-to-end object detection with transformers

Carion, N. et al.: End-to-end object detection with transformers. In: Proc. Eur. Conf. Comput. Vis. (ECCV). pp. 213–229 (2020)

2020
[3]

Video mamba suite: State space model as a ver- satile alternative for video understanding

Chen, G. et al.: Video mamba suite: State space model as a versatile alternative for video understanding. arXiv preprint arXiv:2403.09626 (2024)

work page arXiv 2024
[4]

arXiv preprint arXiv:1907.12763 (2019)

Escorcia, V., Soldan, M., Sivic, J., Ghanem, B., Russell, B.: Finding moments in video collections using natural language. arXiv preprint arXiv:1907.12763 (2019)

work page arXiv 1907
[5]

In: Proc

Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV). pp. 6202–6211 (2019)

2019
[6]

In: Proc

Girshick, R.: Fast R-CNN. In: Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV). pp. 1440–1448 (2015)

2015
[7]

Gu, A., Goel, K., Re, C.: Efficiently modeling long sequences with structured state spaces. In: Int. Conf. Learn. Represent. (ICLR) (2022)

2022
[8]

In: Proc

Hadsell, R., Chopra, S., LeCun, Y.: Dimensionality reduction by learning an invariant mapping. In: Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR). vol. 2, pp. 1735–1742 (2006)

2006
[9]

et al.: Multimodal signal dataset for 11 intuitive movement tasks from single upper extremity during multiple recording sessions

Jeong, J.H. et al.: Multimodal signal dataset for 11 intuitive movement tasks from single upper extremity during multiple recording sessions. GigaScience9(10), giaa098 (2020)

2020
[10]

et al.: Prior knowledge integration via llm encoding and pseudo event regulation for video moment retrieval

Jiang, Y. et al.: Prior knowledge integration via llm encoding and pseudo event regulation for video moment retrieval. In: Proc. ACM Int. Conf. Multimedia (ACM MM). pp. 7249–7258 (2024)

2024
[11]

arXiv preprint arXiv:2106.02297 (2021)

Kim, J.H., Lee, S.H., Lee, J.H., Lee, S.W.: Fre-GAN: Adversarial frequency- consistent audio synthesis. arXiv preprint arXiv:2106.02297 (2021)

work page arXiv 2021
[12]

In: Proc

Lee, G.H., Lee, S.W.: Uncertainty-aware mesh decoder for high fidelity 3D face reconstruction. In: Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR) (2020)

2020
[13]

Lei, J., Berg, T.L., Bansal, M.: Detecting moments and highlights in videos via natural language queries. In: Adv. Neural Inf. Process. Syst. (NeurIPS). vol. 34, pp. 11846–11858 (2021)

2021
[14]

In: Proc

Lei, J., Yu, L., Berg, T.L., Bansal, M.: TVR: A large-scale dataset for video-subtitle moment retrieval. In: Proc. Eur. Conf. Comput. Vis. (ECCV). pp. 447–463 (2020)

2020
[15]

et al.: MomentDiff: Generative video moment retrieval from random to real

Li, P. et al.: MomentDiff: Generative video moment retrieval from random to real. Adv. Neural Inf. Process. Syst. (NeurIPS)36, 65948–65966 (2023)

2023
[16]

et al.: UniVTG: Towards unified video-language temporal grounding

Lin, K.Q. et al.: UniVTG: Towards unified video-language temporal grounding. In: Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV). pp. 2794–2804 (2023)

2023
[17]

et al.: R2-Tuning: Efficient image-to-video transfer learning for video temporal grounding

Liu, Y. et al.: R2-Tuning: Efficient image-to-video transfer learning for video temporal grounding. In: Proc. Eur. Conf. Comput. Vis. (ECCV). pp. 421–438 (2024)

2024
[18]

et al.: UMT: Unified multi-modal transformers for joint video moment retrieval and highlight detection

Liu, Y. et al.: UMT: Unified multi-modal transformers for joint video moment retrieval and highlight detection. In: Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR). pp. 3042–3051 (2022)

2022
[19]

In: Proc

Luo, D., Huang, J., Gong, S., Jin, H., Liu, Y.: Zero-shot video moment retrieval from frozen vision-language models. In: Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV). pp. 5464–5473 (2024)

2024
[20]

Neural Networks155, 439–450 (2022) ClipTBP 15

Min, K., Lee, G.H., Lee, S.W.: Attentional feature pyramid network for small object detection. Neural Networks155, 439–450 (2022) ClipTBP 15

2022
[21]

Correlation-guided query-dependency calibration in video representation learning for temporal grounding

Moon, W., Hyun, S., Lee, S., Heo, J.P.: Correlation-guided query-dependency calibration for video temporal grounding. arXiv preprint arXiv:2311.08835 (2023)

work page arXiv 2023
[22]

In: Proc

Moon, W., Hyun, S., Park, S., Park, D., Heo, J.P.: Query-dependent video repre- sentation for moment retrieval and highlight detection. In: Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR). pp. 23023–23033 (2023)

2023
[23]

In: Proc

Mun, J., Cho, M., Han, B.: Local-global video-text interactions for temporal grounding. In: Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR). pp. 10810–10819 (2020)

2020
[24]

In: Proc

Nam, H., Ha, J.W., Kim, J.: Dual attention networks for multimodal reasoning and matching. In: Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR). pp. 299–307 (2017)

2017
[25]

In: Proc

Pan, Y., Zhang, Y., Zhao, X.: FAWL: Weakly-supervised video corpus moment retrieval with frame-wise auxiliary alignment and weighted contrastive learning. In: Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP). pp. 1–5 (2025)

2025
[26]

et al.: Cross-modal contrastive learning with asymmetric co-attention network for video moment retrieval

Panta, L. et al.: Cross-modal contrastive learning with asymmetric co-attention network for video moment retrieval. In: Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV). pp. 607–614 (2024)

2024
[27]

et al.: Learning transferable visual models from natural language supervision

Radford, A. et al.: Learning transferable visual models from natural language supervision. In: Proc. Int. Conf. Machine Learn. (ICML). vol. 139, pp. 8748–8763 (2021)

2021
[28]

et al.: Grounding action descriptions in videos

Regneri, M. et al.: Grounding action descriptions in videos. Trans. of the Assoc. for Comput. Linguistics (TACL)1, 25–36 (2013)

2013
[29]

In: Proc

Schroff, F., Kalenichenko, D., Philbin, J.: FaceNet: A unified embedding for face recognition and clustering. In: Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR). pp. 815–823 (2015)

2015
[30]

Neural Networks180, 106642 (2024)

Shrewsbury, D., Kim, S., Lee, S.W.: Adaptive ambiguity-aware weighting for multi- label recognition with limited annotations. Neural Networks180, 106642 (2024)

2024
[31]

In: Proc

Sun, H., Zhou, M., Chen, W., Xie, W.: TR-DETR: Task-reciprocal transformer for joint moment retrieval and highlight detection. In: Proc. AAAI Conf. Artif. Intell. (AAAI). vol. 38, pp. 4998–5007 (2024)

2024
[32]

et al.: InternVideo2: Scaling foundation models for multimodal video understanding

Wang, Y. et al.: InternVideo2: Scaling foundation models for multimodal video understanding. In: Proc. Eur. Conf. Comput. Vis. (ECCV). pp. 396–416 (2024)

2024
[33]

et al.: Bridging the Gap: A unified video comprehension framework for moment retrieval and highlight detection

Xiao, Y. et al.: Bridging the Gap: A unified video comprehension framework for moment retrieval and highlight detection. In: Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR). pp. 18709–18719 (2024)

2024
[34]

arXiv preprint arXiv:1804.051132(6), 7 (2018)

Xu, H., He, K., Sigal, L., Sclaroff, S., Saenko, K.: Text-to-clip video retrieval with early fusion and re-captioning. arXiv preprint arXiv:1804.051132(6), 7 (2018)

work page arXiv 2018
[35]

et al.: Video corpus moment retrieval with contrastive learning

Zhang, H. et al.: Video corpus moment retrieval with contrastive learning. arXiv preprint arXiv:2105.06247 (2021)

work page arXiv 2021
[36]

In: Proc

Zhang, S., Peng, H., Fu, J., Luo, J.: Learning 2d temporal adjacent networks for moment localization with natural language. In: Proc. AAAI Conf. Artif. Intell. (AAAI). vol. 34, pp. 12870–12877 (2020)

2020
[37]

In: Proc

Zhou, X., Wei, F., Duan, L., Yao, A., Li, W.: The devil is in the spurious correlations: Boosting moment retrieval with dynamic learning. In: Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV). pp. 20981–20990 (2025)

2025
[38]

et al.: Vision Mamba: Efficient visual representation learning with bidirec- tional state space model

Zhu, L. et al.: Vision Mamba: Efficient visual representation learning with bidirec- tional state space model. In: Proc. Int. Conf. Machine Learn. (ICML) (2024)

2024

[1] [1]

Flashvtg: Feature layering and adaptive score handling network for video temporal grounding.arXiv preprint arXiv:2412.13441, 2024

Cao, Z. et al.: FlashVTG: Feature layering and adaptive score handling network for video temporal grounding. arXiv preprint arXiv:2412.13441 (2024)

work page arXiv 2024

[2] [2]

et al.: End-to-end object detection with transformers

Carion, N. et al.: End-to-end object detection with transformers. In: Proc. Eur. Conf. Comput. Vis. (ECCV). pp. 213–229 (2020)

2020

[3] [3]

Video mamba suite: State space model as a ver- satile alternative for video understanding

Chen, G. et al.: Video mamba suite: State space model as a versatile alternative for video understanding. arXiv preprint arXiv:2403.09626 (2024)

work page arXiv 2024

[4] [4]

arXiv preprint arXiv:1907.12763 (2019)

Escorcia, V., Soldan, M., Sivic, J., Ghanem, B., Russell, B.: Finding moments in video collections using natural language. arXiv preprint arXiv:1907.12763 (2019)

work page arXiv 1907

[5] [5]

In: Proc

Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV). pp. 6202–6211 (2019)

2019

[6] [6]

In: Proc

Girshick, R.: Fast R-CNN. In: Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV). pp. 1440–1448 (2015)

2015

[7] [7]

Gu, A., Goel, K., Re, C.: Efficiently modeling long sequences with structured state spaces. In: Int. Conf. Learn. Represent. (ICLR) (2022)

2022

[8] [8]

In: Proc

Hadsell, R., Chopra, S., LeCun, Y.: Dimensionality reduction by learning an invariant mapping. In: Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR). vol. 2, pp. 1735–1742 (2006)

2006

[9] [9]

et al.: Multimodal signal dataset for 11 intuitive movement tasks from single upper extremity during multiple recording sessions

Jeong, J.H. et al.: Multimodal signal dataset for 11 intuitive movement tasks from single upper extremity during multiple recording sessions. GigaScience9(10), giaa098 (2020)

2020

[10] [10]

et al.: Prior knowledge integration via llm encoding and pseudo event regulation for video moment retrieval

Jiang, Y. et al.: Prior knowledge integration via llm encoding and pseudo event regulation for video moment retrieval. In: Proc. ACM Int. Conf. Multimedia (ACM MM). pp. 7249–7258 (2024)

2024

[11] [11]

arXiv preprint arXiv:2106.02297 (2021)

Kim, J.H., Lee, S.H., Lee, J.H., Lee, S.W.: Fre-GAN: Adversarial frequency- consistent audio synthesis. arXiv preprint arXiv:2106.02297 (2021)

work page arXiv 2021

[12] [12]

In: Proc

Lee, G.H., Lee, S.W.: Uncertainty-aware mesh decoder for high fidelity 3D face reconstruction. In: Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR) (2020)

2020

[13] [13]

Lei, J., Berg, T.L., Bansal, M.: Detecting moments and highlights in videos via natural language queries. In: Adv. Neural Inf. Process. Syst. (NeurIPS). vol. 34, pp. 11846–11858 (2021)

2021

[14] [14]

In: Proc

Lei, J., Yu, L., Berg, T.L., Bansal, M.: TVR: A large-scale dataset for video-subtitle moment retrieval. In: Proc. Eur. Conf. Comput. Vis. (ECCV). pp. 447–463 (2020)

2020

[15] [15]

et al.: MomentDiff: Generative video moment retrieval from random to real

Li, P. et al.: MomentDiff: Generative video moment retrieval from random to real. Adv. Neural Inf. Process. Syst. (NeurIPS)36, 65948–65966 (2023)

2023

[16] [16]

et al.: UniVTG: Towards unified video-language temporal grounding

Lin, K.Q. et al.: UniVTG: Towards unified video-language temporal grounding. In: Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV). pp. 2794–2804 (2023)

2023

[17] [17]

et al.: R2-Tuning: Efficient image-to-video transfer learning for video temporal grounding

Liu, Y. et al.: R2-Tuning: Efficient image-to-video transfer learning for video temporal grounding. In: Proc. Eur. Conf. Comput. Vis. (ECCV). pp. 421–438 (2024)

2024

[18] [18]

et al.: UMT: Unified multi-modal transformers for joint video moment retrieval and highlight detection

Liu, Y. et al.: UMT: Unified multi-modal transformers for joint video moment retrieval and highlight detection. In: Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR). pp. 3042–3051 (2022)

2022

[19] [19]

In: Proc

Luo, D., Huang, J., Gong, S., Jin, H., Liu, Y.: Zero-shot video moment retrieval from frozen vision-language models. In: Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV). pp. 5464–5473 (2024)

2024

[20] [20]

Neural Networks155, 439–450 (2022) ClipTBP 15

Min, K., Lee, G.H., Lee, S.W.: Attentional feature pyramid network for small object detection. Neural Networks155, 439–450 (2022) ClipTBP 15

2022

[21] [21]

Correlation-guided query-dependency calibration in video representation learning for temporal grounding

Moon, W., Hyun, S., Lee, S., Heo, J.P.: Correlation-guided query-dependency calibration for video temporal grounding. arXiv preprint arXiv:2311.08835 (2023)

work page arXiv 2023

[22] [22]

In: Proc

Moon, W., Hyun, S., Park, S., Park, D., Heo, J.P.: Query-dependent video repre- sentation for moment retrieval and highlight detection. In: Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR). pp. 23023–23033 (2023)

2023

[23] [23]

In: Proc

Mun, J., Cho, M., Han, B.: Local-global video-text interactions for temporal grounding. In: Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR). pp. 10810–10819 (2020)

2020

[24] [24]

In: Proc

Nam, H., Ha, J.W., Kim, J.: Dual attention networks for multimodal reasoning and matching. In: Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR). pp. 299–307 (2017)

2017

[25] [25]

In: Proc

Pan, Y., Zhang, Y., Zhao, X.: FAWL: Weakly-supervised video corpus moment retrieval with frame-wise auxiliary alignment and weighted contrastive learning. In: Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP). pp. 1–5 (2025)

2025

[26] [26]

et al.: Cross-modal contrastive learning with asymmetric co-attention network for video moment retrieval

Panta, L. et al.: Cross-modal contrastive learning with asymmetric co-attention network for video moment retrieval. In: Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV). pp. 607–614 (2024)

2024

[27] [27]

et al.: Learning transferable visual models from natural language supervision

Radford, A. et al.: Learning transferable visual models from natural language supervision. In: Proc. Int. Conf. Machine Learn. (ICML). vol. 139, pp. 8748–8763 (2021)

2021

[28] [28]

et al.: Grounding action descriptions in videos

Regneri, M. et al.: Grounding action descriptions in videos. Trans. of the Assoc. for Comput. Linguistics (TACL)1, 25–36 (2013)

2013

[29] [29]

In: Proc

Schroff, F., Kalenichenko, D., Philbin, J.: FaceNet: A unified embedding for face recognition and clustering. In: Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR). pp. 815–823 (2015)

2015

[30] [30]

Neural Networks180, 106642 (2024)

Shrewsbury, D., Kim, S., Lee, S.W.: Adaptive ambiguity-aware weighting for multi- label recognition with limited annotations. Neural Networks180, 106642 (2024)

2024

[31] [31]

In: Proc

Sun, H., Zhou, M., Chen, W., Xie, W.: TR-DETR: Task-reciprocal transformer for joint moment retrieval and highlight detection. In: Proc. AAAI Conf. Artif. Intell. (AAAI). vol. 38, pp. 4998–5007 (2024)

2024

[32] [32]

et al.: InternVideo2: Scaling foundation models for multimodal video understanding

Wang, Y. et al.: InternVideo2: Scaling foundation models for multimodal video understanding. In: Proc. Eur. Conf. Comput. Vis. (ECCV). pp. 396–416 (2024)

2024

[33] [33]

et al.: Bridging the Gap: A unified video comprehension framework for moment retrieval and highlight detection

Xiao, Y. et al.: Bridging the Gap: A unified video comprehension framework for moment retrieval and highlight detection. In: Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR). pp. 18709–18719 (2024)

2024

[34] [34]

arXiv preprint arXiv:1804.051132(6), 7 (2018)

Xu, H., He, K., Sigal, L., Sclaroff, S., Saenko, K.: Text-to-clip video retrieval with early fusion and re-captioning. arXiv preprint arXiv:1804.051132(6), 7 (2018)

work page arXiv 2018

[35] [35]

et al.: Video corpus moment retrieval with contrastive learning

Zhang, H. et al.: Video corpus moment retrieval with contrastive learning. arXiv preprint arXiv:2105.06247 (2021)

work page arXiv 2021

[36] [36]

In: Proc

Zhang, S., Peng, H., Fu, J., Luo, J.: Learning 2d temporal adjacent networks for moment localization with natural language. In: Proc. AAAI Conf. Artif. Intell. (AAAI). vol. 34, pp. 12870–12877 (2020)

2020

[37] [37]

In: Proc

Zhou, X., Wei, F., Duan, L., Yao, A., Li, W.: The devil is in the spurious correlations: Boosting moment retrieval with dynamic learning. In: Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV). pp. 20981–20990 (2025)

2025

[38] [38]

et al.: Vision Mamba: Efficient visual representation learning with bidirec- tional state space model

Zhu, L. et al.: Vision Mamba: Efficient visual representation learning with bidirec- tional state space model. In: Proc. Int. Conf. Machine Learn. (ICML) (2024)

2024