Multi-proposal Collaboration and Multi-task Training for Weakly-supervised Video Moment Retrieval

Bin Jiang; Bolin Zhang; Chao Yang; Ichiro Ide; Takahiro Komamizu

arxiv: 2605.14838 · v1 · pith:FYAADENTnew · submitted 2026-05-14 · 💻 cs.CV · cs.MM

Pith reviewed 2026-06-30 21:42 UTC · model grok-4.3

classification 💻 cs.CV cs.MM

keywords weakly-supervised video moment retrieval · multi-proposal collaboration · multi-task training · Gaussian masks · masked query reconstruction · temporal proposals · video retrieval

The pith

Fusing multiple proposal masks and adding dual reconstruction tasks produces more stable weakly-supervised video moment retrieval.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that a new method, MCMT, can locate the moment in an untrimmed video that matches a text query while training only on video-level correspondences between videos and queries, with no temporal annotations. It generates several proposals, turns each into a learnable Gaussian mask, fuses those masks into a single positive mask that highlights the clips most relevant to the query, and trains the model with two masked query reconstruction tasks, one forward and one inverse. A sympathetic reader would care because earlier approaches either aggregated predictions over all instances in the video or relied on a single auxiliary reconstruction task, which the authors say leads to low-quality proposals and unstable training. If this works, it reduces the need for detailed temporal annotations when building video search systems.

Core claim

MCMT generates multiple proposals, derives a learnable Gaussian mask from each, and fuses the masks into a single high-quality positive sample mask over the query-relevant clips. The remaining clips in the same video serve as the easy negative sample and the whole video as the hard negative sample. Forward and inverse masked query reconstruction tasks then constrain the network more strongly than a single auxiliary task would, which the paper claims yields more robust and stable weakly-supervised video moment retrieval.
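The mask-and-fuse step can be made concrete with a small sketch. This is an illustration under stated assumptions, not the paper's implementation: this page does not give the Gaussian parameterization or the fusion operation, so the (center, width) form, the weighted-average fusion, and the names `gaussian_mask` and `fuse_masks` are all hypothetical.

```python
import math

def gaussian_mask(num_clips, center, width):
    """Soft mask over clip indices (requires num_clips > 1).

    center and width are fractions of the video length; this
    parameterization is an assumption, not taken from the paper.
    """
    mask = []
    for t in range(num_clips):
        pos = t / (num_clips - 1)  # normalized clip position in [0, 1]
        mask.append(math.exp(-((pos - center) ** 2) / (2 * width ** 2)))
    return mask

def fuse_masks(masks, weights):
    """Weighted average of proposal masks (an assumed fusion rule)."""
    total = sum(weights)
    return [
        sum(w * m[t] for w, m in zip(weights, masks)) / total
        for t in range(len(masks[0]))
    ]

# Three hypothetical proposals, each given as (center, width).
proposals = [(0.30, 0.10), (0.32, 0.15), (0.40, 0.10)]
masks = [gaussian_mask(11, c, w) for c, w in proposals]
positive_mask = fuse_masks(masks, weights=[1.0, 1.0, 1.0])

# Index of the clip that the fused mask weights most heavily.
peak_clip = max(range(len(positive_mask)), key=lambda t: positive_mask[t])
```

In a trained model the centers, widths, and fusion weights would be predicted by the network; here they are fixed numbers so the arithmetic is easy to follow.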

What carries the argument

The multi-proposal collaboration that combines learnable Gaussian masks from multiple proposals into a single positive mask, together with the pair of forward and inverse masked query reconstruction tasks.

If this is right

Multiple proposals fused via masks yield higher quality positive samples than single proposals.
Forward and inverse reconstruction tasks together provide stronger training constraints than a single auxiliary task.
Easy and hard negative samples help distinguish misaligned moments within the same video.
The overall approach leads to improved performance on standard VMR benchmarks like Charades-STA and ActivityNet-Captions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The Gaussian mask fusion idea could be applied to other weakly supervised localization problems in videos or images.
If the reconstruction tasks prove key, similar multi-task setups might benefit related tasks like video captioning.
Testing the method with different numbers of proposals could reveal optimal configurations for various video lengths.

Load-bearing premise

Fusing the learnable Gaussian masks from multiple proposals will consistently identify the most relevant video clips based only on video-level supervision.

What would settle it

Running the MCMT method on the Charades-STA or ActivityNet-Captions dataset and finding that its retrieval metrics such as R@1 or mIoU do not exceed those of previous weakly-supervised methods would falsify the claim of improved robustness.
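For readers unfamiliar with the metrics named above, the following sketch shows how R@1 at an IoU threshold and mIoU are conventionally computed from top-1 predictions. The segments are made-up numbers for illustration, not results from the paper, and the function names are hypothetical.

```python
def temporal_iou(pred, gt):
    """IoU of two (start, end) segments, e.g. in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_1(top1_preds, gts, iou_threshold):
    """Fraction of queries whose top-1 prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= iou_threshold
               for p, g in zip(top1_preds, gts))
    return hits / len(gts)

def mean_iou(top1_preds, gts):
    """Average IoU of the top-1 prediction over all queries."""
    return sum(temporal_iou(p, g) for p, g in zip(top1_preds, gts)) / len(gts)

# Made-up top-1 predictions and ground-truth moments for three queries.
preds = [(2.0, 8.0), (10.0, 20.0), (0.0, 5.0)]
gts = [(4.0, 8.0), (12.0, 18.0), (6.0, 9.0)]

r1_at_05 = recall_at_1(preds, gts, iou_threshold=0.5)
r1_at_07 = recall_at_1(preds, gts, iou_threshold=0.7)
miou = mean_iou(preds, gts)
```

With these toy segments the per-query IoUs are 4/6, 6/10 and 0, so two of three queries pass the 0.5 threshold and none pass 0.7.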

Original abstract

This study focuses on weakly-supervised Video Moment Retrieval (VMR), aiming to identify a moment semantically similar to the given query within an untrimmed video using only video-level correspondences, without relying on temporal annotations during training. Previous methods either aggregate predictions for all instances in the video, or indirectly address the task by proposing reconstructions for the query. However, these methods often produce low-quality temporal proposals, struggle with distinguishing misaligned moments in the same video, or lack stability due to a reliance on a single auxiliary task. To address these limitations, we present a novel weakly-supervised method called Multi-proposal Collaboration and Multi-task Training (MCMT). Initially, we generate multiple proposals and derive corresponding learnable Gaussian masks from them. These masks are then combined to create a high-quality positive sample mask, highlighting video clips most relevant to the query. Concurrently, we classify other clips in the same video as the easy negative sample and the entire video as the hard negative sample. During training, we introduce forward and inverse masked query reconstruction tasks to impose more substantial constraints on the network, promoting more robust and stable retrieval performance. Extensive experiments on two standard benchmarks affirm the effectiveness of the proposed method in VMR.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MCMT fuses multiple Gaussian masks from proposals with forward-inverse reconstruction tasks for weakly-supervised VMR, but the abstract gives no numbers so the actual gains stay unverified.

The main takeaway is that this paper introduces MCMT, which generates multiple proposals, turns them into learnable Gaussian masks, fuses those masks into a positive sample, labels easy and hard negatives, and trains with both forward and inverse masked query reconstruction under video-level labels only.

What stands out as new is the explicit multi-proposal collaboration step combined with the dual reconstruction tasks. Earlier work either pooled predictions across the whole video or used a single auxiliary reconstruction, and the authors correctly flag the resulting problems with proposal quality and training stability.

The method description is clear on how the masks are derived and combined, and how the two reconstruction directions are meant to add constraints. That part reads as a reasonable incremental fix.

The soft spots sit mostly with the missing evidence. The abstract states that experiments on two benchmarks affirm effectiveness, yet supplies no scores, no baseline comparisons, and no ablation or failure-case discussion. Without those, it is impossible to judge whether the mask fusion actually isolates query-relevant clips, or whether under weak supervision the masks end up highly correlated with one another or collapse onto the same region. The reconstruction losses do not directly supervise localization, so the collaboration step could turn out ineffective.

The setup itself looks internally consistent and the citations follow the standard VMR references.

This is for people already working on weakly-supervised temporal video tasks who want ideas for adding more training constraints. A reader could pull the mask-fusion and dual-task pattern for their own experiments, but the paper needs the full results before it becomes citable.

Send it to peer review so the experiments can be checked properly.

Referee Report

2 major / 2 minor

Summary. The paper proposes MCMT, a weakly-supervised method for Video Moment Retrieval (VMR). It generates multiple proposals from which learnable Gaussian masks are derived and fused into a high-quality positive mask for query-relevant clips; other clips are treated as easy negatives and the full video as a hard negative. Forward and inverse masked query reconstruction tasks are added as auxiliary objectives to constrain the network. The central claim is that this multi-proposal collaboration plus multi-task training yields more robust and stable retrieval than prior single-proposal or single-auxiliary-task approaches, with effectiveness affirmed by experiments on two standard benchmarks.

Significance. If the empirical gains hold and the mask-collaboration mechanism proves stable under video-level supervision, the work would offer a concrete advance in weakly-supervised VMR by addressing instability from low-quality proposals and single auxiliary tasks. The explicit use of both forward and inverse reconstruction plus easy/hard negative classification is a clear methodological contribution that could be adopted more broadly.

major comments (2)

[Abstract and §3] Abstract and §3 (method overview): the claim that fusing multiple learnable Gaussian masks produces a reliably high-quality positive mask is load-bearing for the central contribution, yet the optimization is driven only by reconstruction losses under video-level labels. Nothing in the described procedure prevents degenerate solutions (all masks covering the full video or collapsing to identical regions), so the skeptic's concern that the collaboration step may be ineffective remains unaddressed without additional analysis or ablation.
[Abstract] Abstract: the statement that 'extensive experiments on two standard benchmarks affirm the effectiveness' is presented without any quantitative numbers, baseline comparisons, or failure-mode discussion. This makes it impossible to assess whether reported gains are independent of hyper-parameter tuning or post-hoc selection, directly affecting soundness of the empirical claim.
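The two degenerate cases named in the first major comment (masks covering the full video, masks collapsing to identical regions) are straightforward to measure. The sketch below is one possible diagnostic, not something taken from the paper: `coverage`, `soft_overlap`, and `mean_pairwise_overlap` are hypothetical helpers, and the soft IoU (sum of minima over sum of maxima) is an assumed choice of overlap measure.

```python
def coverage(mask):
    """Mean mask value; close to 1.0 means the mask spans nearly the whole video."""
    return sum(mask) / len(mask)

def soft_overlap(a, b):
    """Soft IoU between two masks: sum of element-wise minima over maxima."""
    inter = sum(min(x, y) for x, y in zip(a, b))
    union = sum(max(x, y) for x, y in zip(a, b))
    return inter / union if union > 0 else 0.0

def mean_pairwise_overlap(masks):
    """Average soft IoU over all pairs of masks (needs at least two masks)."""
    pairs = [(i, j) for i in range(len(masks)) for j in range(i + 1, len(masks))]
    return sum(soft_overlap(masks[i], masks[j]) for i, j in pairs) / len(pairs)

# Toy masks over 8 clips illustrating each failure mode and a healthy case.
full_video = [1.0] * 8
collapsed = [[0.0, 0.2, 1.0, 0.2, 0.0, 0.0, 0.0, 0.0]] * 3
diverse = [
    [1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.5, 1.0, 0.5, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 1.0, 0.5],
]

full_video_coverage = coverage(full_video)
collapsed_overlap = mean_pairwise_overlap(collapsed)
diverse_overlap = mean_pairwise_overlap(diverse)
```

Tracking these two numbers per video during training would show whether the proposals drift toward either failure mode; a mean pairwise overlap near 1.0 means the fusion step has nothing distinct to combine.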

minor comments (2)

[§3] Notation for the Gaussian mask parameters and the exact fusion operation (weighted sum, product, etc.) should be defined with an equation in §3 rather than left descriptive.
[§3] The distinction between 'easy negative' and 'hard negative' samples is introduced but not formalized; a short paragraph or equation clarifying how these are used in the loss would improve clarity.
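One way the easy/hard negative distinction could be formalized is sketched below. Everything here is an assumption for illustration: this page does not state how MCMT builds the negative masks or what loss it uses, so the complement-mask construction, the margin-based ranking form, the margin values, and the function names are hypothetical.

```python
def negative_masks(positive_mask):
    """Build assumed negative masks from a fused positive mask.

    easy: complement of the positive mask (clips outside the moment).
    hard: all ones (the entire video, which still contains the moment).
    """
    easy = [1.0 - p for p in positive_mask]
    hard = [1.0] * len(positive_mask)
    return easy, hard

def ranking_loss(loss_pos, loss_easy, loss_hard,
                 margin_easy=0.2, margin_hard=0.1):
    """Assumed margin ranking loss over query reconstruction losses.

    Encourages the positive mask to reconstruct the query better (lower
    loss) than the easy negative by margin_easy and than the hard
    negative by margin_hard.
    """
    return (max(0.0, loss_pos - loss_easy + margin_easy)
            + max(0.0, loss_pos - loss_hard + margin_hard))

easy, hard = negative_masks([0.0, 0.5, 1.0])

# Positive already beats both negatives by more than the margins: no penalty.
satisfied = ranking_loss(loss_pos=1.0, loss_easy=2.0, loss_hard=1.5)
# All three reconstruct equally well: both margins are violated.
violated = ranking_loss(loss_pos=1.0, loss_easy=1.0, loss_hard=1.0)
```

The smaller margin for the hard negative reflects that the whole video contains the target moment and should reconstruct the query almost as well as the positive mask does.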

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below with point-by-point responses. Where appropriate, we indicate revisions that will be incorporated into the next manuscript version to strengthen the presentation and empirical support.

Referee: [Abstract and §3] Abstract and §3 (method overview): the claim that fusing multiple learnable Gaussian masks produces a reliably high-quality positive mask is load-bearing for the central contribution, yet the optimization is driven only by reconstruction losses under video-level labels. Nothing in the described procedure prevents degenerate solutions (all masks covering the full video or collapsing to identical regions), so the skeptic's concern that the collaboration step may be ineffective remains unaddressed without additional analysis or ablation.

Authors: We agree that the potential for degenerate mask solutions is an important consideration for the multi-proposal collaboration mechanism. While the combination of forward and inverse masked query reconstruction tasks together with easy/hard negative classification is intended to encourage the masks to capture distinct query-relevant regions (as the inverse task penalizes overly broad coverage and the hard-negative objective requires precise discrimination), we acknowledge that the current manuscript does not include explicit analysis of mask diversity or an ablation isolating the collaboration step. We will add both a quantitative analysis of mask overlap/diversity across proposals and a dedicated ablation study on the effect of fusing multiple masks versus using a single mask in the revised manuscript. revision: yes
Referee: [Abstract] Abstract: the statement that 'extensive experiments on two standard benchmarks affirm the effectiveness' is presented without any quantitative numbers, baseline comparisons, or failure-mode discussion. This makes it impossible to assess whether reported gains are independent of hyper-parameter tuning or post-hoc selection, directly affecting soundness of the empirical claim.

Authors: We accept the referee's point that the abstract would benefit from concrete quantitative grounding to allow readers to immediately gauge the scale of improvement. In the revised version we will include the key performance numbers (e.g., R@1, mIoU on Charades-STA and ActivityNet-Captions) together with the primary baseline comparisons, while keeping the abstract concise. Failure-mode discussion will remain in the main text and supplementary material as it requires more space. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in method description or claims

The paper presents an empirical method (MCMT) for weakly-supervised video moment retrieval using multi-proposal Gaussian masks, negative sampling, and forward/inverse reconstruction tasks. No load-bearing derivation, equation, or claim reduces by construction to its own fitted inputs or self-citations. The central performance claims rest on experiments against standard benchmarks rather than any self-referential prediction or uniqueness theorem imported from prior author work. This is the expected non-finding for a standard applied CV paper whose contributions are architectural and loss-based rather than mathematical identities.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on standard deep-learning assumptions for weakly-supervised video processing and benchmark evaluation; no new free parameters, axioms, or invented entities are explicitly introduced beyond the method components described.

axioms (1)

Domain assumption: video-level labels are sufficient to supervise temporal localization when combined with proposal-based masking and reconstruction losses.
Invoked implicitly when the method claims to solve the task without temporal annotations.


[1] [1]

VSAM Final Report2000, 1–68 (2000)

Collins, R.T., Lipton, A.J., Kanade, T., Fujiyoshi, H., Duggins, D., Tsin, Y., Tolliver, D., Enomoto, N., Hasegawa, O., Burt, P., Wixson, L.: A system for video surveillance and monitoring. VSAM Final Report2000, 1–68 (2000)

2000

[2] [2]

International Journal of Machine Learning and Cybernetics, 1–16 (2024)

He, Q., Shi, R., Chen, L., Huo, L.: Video anomaly detection based on multi-scale optical flow spatio-temporal enhancement and normality mining. International Journal of Machine Learning and Cybernetics, 1–16 (2024)

2024

[3] [3]

IEEE Robotics & Automation Magazine14, 20–29 (2007)

Kemp, C.C., Edsinger, A., Torres-Jara, E.: Challenges for robot manipulation in human environments [grand challenges of robotics]. IEEE Robotics & Automation Magazine14, 20–29 (2007)

2007

[4] [4]

In: Proceedings of the 16th IEEE International Conference on Computer Vision, pp

Anne Hendricks, L., Wang, O., Shechtman, E., Sivic, J., Darrell, T., Russell, B.: Localizing moments in video with natural language. In: Proceedings of the 16th IEEE International Conference on Computer Vision, pp. 5803–5812 (2017)

2017

[5] [5]

In: Proceedings of the 16th IEEE International Conference on Computer Vision, pp

Gao, J., Sun, C., Yang, Z., Nevatia, R.: Tall: Temporal activity localization via language query. In: Proceedings of the 16th IEEE International Conference on Computer Vision, pp. 5267–5275 (2017)

2017

[6] [6]

In: Proceedings of the 2022 International Conference on Multimedia Retrieval, pp

Zhang, B., Jiang, B., Yang, C., Pang, L.: Dual-channel localization networks for moment retrieval with natural language. In: Proceedings of the 2022 International Conference on Multimedia Retrieval, pp. 351–359 (2022)

2022

[7] [7]

In: Proceedings of the 30th ACM International Conference on Multimedia, pp

Zhang, B., Yang, C., Jiang, B., Zhou, X.: Video moment retrieval with hierarchical contrastive learning. In: Proceedings of the 30th ACM International Conference on Multimedia, pp. 346–355 (2022)

2022

[8] [8]

In: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp

Liu, M., Wang, X., Nie, L., He, X., Chen, B., Chua, T.-S.: Attentive moment retrieval in videos. In: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 15–24 (2018)

2018

[9] [9]

IEEE Transactions on Multimedia25, 3921–3933 (2022)

Wang, Y., Liu, M., Wei, Y., Cheng, Z., Wang, Y., Nie, L.: Siamese alignment network for weakly supervised video moment retrieval. IEEE Transactions on Multimedia25, 3921–3933 (2022)

2022

[10] [10]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp

Yoon, S., Koo, G., Kim, D., Yoo, C.D.: Scanet: Scene complexity aware network for weakly-supervised video moment retrieval. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13576–13586 (2023) 23

2023

[11] [11]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

Huang, Y., Yang, L., Sato, Y.: Weakly supervised temporal sentence ground- ing with uncertainty-guided self-training. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18908–18918 (2023)

2023

[12] [12]

In: Proceedings of the 31st ACM Interna- tional Conference on Multimedia, pp

Lv, Z., Su, B., Wen, J.-R.: Counterfactual cross-modality reasoning for weakly supervised video moment localization. In: Proceedings of the 31st ACM Interna- tional Conference on Multimedia, pp. 6539–6547 (2023)

2023

[13] [13]

In: Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

Mithun, N.C., Paul, S., Roy-Chowdhury, A.K.: Weakly supervised video moment retrieval from text queries. In: Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11592–11601 (2019)

2019

[14] [14]

In: Proceedings of the 2021 IEEE/CVF Winter Conference on Applications of Computer Vision, pp

Tan, R., Xu, H., Saenko, K., Plummer, B.A.: Logan: Latent graph co-attention network for weakly-supervised video moment retrieval. In: Proceedings of the 2021 IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2083–2092 (2021)

2021

[15] [15]

In: Proceedings of the 18th IEEE/CVF International Conference on Computer Vision, pp

Huang, J., Liu, Y., Gong, S., Jin, H.: Cross-sentence temporal and semantic relations in video activity localisation. In: Proceedings of the 18th IEEE/CVF International Conference on Computer Vision, pp. 7199–7208 (2021)

2021

[16] [16]

IEEE Transactions on Image Processing 30, 3252–3262 (2021)

Yang, W., Zhang, T., Zhang, Y., Wu, F.: Local correspondence network for weakly supervised temporal sentence grounding. IEEE Transactions on Image Processing 30, 3252–3262 (2021)

2021

[17] [17]

Advances in Neural Information Processing Systems31, 1–11 (2018)

Duan, X., Huang, W., Gan, C., Wang, J., Zhu, W., Huang, J.: Weakly supervised dense event captioning in videos. Advances in Neural Information Processing Systems31, 1–11 (2018)

2018

[18] [18]

In: Proceedings of the AAAI Conference on Artificial Intelligence, pp

Lin, Z., Zhao, Z., Zhang, Z., Wang, Q., Liu, H.: Weakly-supervised video moment retrieval via semantic completion network. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 11539–11546 (2020)

2020

[19] Chen, S., Jiang, Y.-G.: Towards bridging event captioner and sentence localizer for weakly supervised dense event captioning. In: Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8425–8435 (2021)

[20] Zheng, M., Huang, Y., Chen, Q., Liu, Y.: Weakly supervised video moment localization with contrastive negative sample mining. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 3517–3525 (2022)

[21] Zhang, H., Sun, A., Jing, W., Zhou, J.T.: Temporal sentence grounding in videos: A survey and future directions. IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 10443–10465 (2023)

[22] Liu, M., Nie, L., Wang, Y., Wang, M., Rui, Y.: A survey on video moment localization. ACM Computing Surveys 55, 1–37 (2023)

[23] Gao, M., Davis, L.S., Socher, R., Xiong, C.: Wslln: Weakly supervised natural language localization networks. Computing Research Repository arXiv Preprint, arXiv:1909.00239 (2019)

[24] Ma, M., Yoon, S., Kim, J., Lee, Y., Kang, S., Yoo, C.D.: Vlanet: Video-language alignment network for weakly-supervised video moment retrieval. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVIII, pp. 156–171 (2020)

[25] Chen, Z., Ma, L., Luo, W., Tang, P., Wong, K.-Y.K.: Look closer to ground better: Weakly-supervised temporal grounding of sentence in video. Computing Research Repository arXiv Preprint, arXiv:2001.09308 (2020)

[26] Wang, Y., Deng, J., Zhou, W., Li, H.: Weakly supervised temporal adjacent network for language grounding. IEEE Transactions on Multimedia 24, 3276–3286 (2022)

[27] Song, Y., Wang, J., Ma, L., Yu, Z., Yu, J.: Weakly-supervised multi-level attentional reconstruction network for grounding textual queries in videos. Computing Research Repository arXiv Preprint, arXiv:2003.07048 (2020)

[28] Zhang, Z., Lin, Z., Zhao, Z., Zhu, J., He, X.: Regularized two-branch proposal networks for weakly-supervised moment retrieval in videos. In: Proceedings of the 28th ACM International Conference on Multimedia, pp. 4098–4106 (2020)

[29] Nam, J., Ahn, D., Kang, D., Ha, S.J., Choi, J.: Zero-shot natural language video localization. In: Proceedings of the 18th IEEE/CVF International Conference on Computer Vision, pp. 1470–1479 (2021)

[30] Gao, J., Xu, C.: Learning video moment retrieval without a single annotated video. IEEE Transactions on Circuits and Systems for Video Technology 32, 1646–1657 (2021)

[31] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pp. 1532–1543 (2014)

[32] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: Proceedings of the 38th International Conference on Machine Learning, pp. 8748–8763 (2021)

[33] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017)

[34] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30, 1–11 (2017)

[35] Bahdanau, D., Cho, K.H., Bengio, Y.: Neural machine translation by jointly learning to align and translate. In: Proceedings of the 3rd International Conference on Learning Representations, pp. 1–10 (2015)

[36] Zhou, Z.-H.: Ensemble learning. Encyclopedia of Biometrics, 270–273 (2009)

[37] Zheng, M., Huang, Y., Chen, Q., Peng, Y., Liu, Y.: Weakly supervised temporal sentence grounding with gaussian-based contrastive proposal learning. In: Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15555–15564 (2022)

[38] Krishna, R., Hata, K., Ren, F., Fei-Fei, L., Carlos Niebles, J.: Dense-captioning events in videos. In: Proceedings of the 16th IEEE International Conference on Computer Vision, pp. 706–715 (2017)

[39] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. Computing Research Repository arXiv Preprint, arXiv:1412.6980 (2014)

[40] Wu, H., Lyu, Y., Shen, X., Zhao, X., Wang, M., Zhang, X., Luo, Z.: Atomic-action-based contrastive network for weakly supervised temporal language grounding. In: 2023 IEEE International Conference on Multimedia and Expo (ICME), pp. 1523–1528 (2023)

[41] Song, Y., Wang, J., Ma, L., Yu, J., Liang, J., Yuan, L., Yu, Z.: MARN: Multi-level attentional reconstruction networks for weakly supervised video temporal grounding. Neurocomputing 554, 126625 (2023)