
arxiv: 2605.00826 · v1 · submitted 2026-03-07 · 💻 cs.IR · cs.CV · cs.LG · cs.MM

Understanding the Performance Plateau in Text-to-Video Retrieval: A Comprehensive Empirical and Linguistic Analysis

Pith reviewed 2026-05-15 15:01 UTC · model grok-4.3

classification 💻 cs.IR · cs.CV · cs.LG · cs.MM
keywords text-to-video retrieval · caption analysis · performance analysis · query difficulty · multimodal models · video search · linguistic features

The pith

Short, clear captions describing single actions achieve higher recall in text-to-video retrieval than complex descriptions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests 14 retrieval methods on three datasets in one consistent setup to understand why performance has plateaued. It breaks down captions by length, clarity, and whether they describe actions or scenes. Simple captions about one action or a color get better results from every model. Complex multi-step events or detailed scenes stay hard no matter the architecture. The work shows how query type interacts with model design choices.

Core claim

The study finds that short, clear, and simple captions, such as those describing single actions or color attributes, achieve higher recall, while complex events, multi-step activities, or fine-grained scene descriptions remain challenging for all existing models. Attention-driven architectures better handle temporally dependent or multi-step queries, whereas dual-encoder and multimodal fusion models perform well primarily on simpler or single-category captions. Cross-dataset generalization improves with larger, more diverse caption sets, but generative captions do not consistently enhance retrieval accuracy.

What carries the argument

Analysis of caption characteristics (length, clarity, semantic category, Action vs. Scene balance) linked to performance under a unified preprocessing and evaluation framework.
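
As a rough illustration of linking one caption trait to performance, the sketch below groups queries by caption length and reports recall per group. The bucket thresholds, the function names `length_bucket` and `recall_by_bucket`, and the example captions are assumptions made for illustration; the paper's own feature definitions are not given here.

```python
from collections import defaultdict


def length_bucket(caption, short_max=6, medium_max=12):
    """Assign a caption to a length bucket by whitespace token count.

    The thresholds are illustrative assumptions, not the paper's.
    """
    n_tokens = len(caption.split())
    if n_tokens <= short_max:
        return "short"
    if n_tokens <= medium_max:
        return "medium"
    return "long"


def recall_by_bucket(records):
    """Compute recall per length bucket.

    records: iterable of (caption, hit), where hit is True when the
    correct video appeared in the top K results for that caption.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for caption, hit in records:
        bucket = length_bucket(caption)
        totals[bucket] += 1
        hits[bucket] += int(hit)
    return {bucket: hits[bucket] / totals[bucket] for bucket in totals}


# Made-up records, not data from the paper.
records = [
    ("a man is singing", True),
    ("a red car drives", True),
    ("a woman chops onions then adds them to a pan", False),
    ("a dog runs across the field and jumps over a fence "
     "before returning to its owner", False),
]
per_bucket = recall_by_bucket(records)
print(per_bucket)
```

The same grouping pattern would apply to any other caption trait (clarity score, semantic category) once a per-caption label is available.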

If this is right

  • Cross-dataset generalization improves with larger, more diverse caption sets.
  • Generative captions do not consistently enhance retrieval accuracy.
  • Attention-driven models are better suited for temporally dependent queries.
  • Dual-encoder models handle simpler captions effectively.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Model development should prioritize handling of multi-step activities and fine-grained details.
  • New benchmarks could include controlled variations in caption complexity to isolate query effects.
  • Dataset curation might benefit from including more balanced mixes of simple and complex descriptions.

Load-bearing premise

That the unified framework fully eliminates biases and that caption traits are the main cause of performance gaps instead of other dataset factors.

What would settle it

Re-running the evaluations after rewriting complex captions into simpler equivalents without changing meaning and observing no recall improvement would falsify the link.

Original abstract

Text-to-video retrieval enables users to find relevant video content using natural language queries, a task that has grown increasingly important with the rapid expansion of online video. Over the past six years, research has produced numerous methods, such as dual encoders, attention-driven models, and multimodal fusion approaches; however, fundamental questions remain about model behavior, dataset influence, and query difficulty. In this work, we evaluate 14 state-of-the-art retrieval methods across 3 widely used datasets under a unified preprocessing and evaluation framework. We analyze caption characteristics, including length, clarity, semantic category, and Action vs. Scene balance, and link these to model performance. Our results show that short, clear, and simple captions, such as those describing single actions or color attributes, achieve higher recall, while complex events, multi-step activities, or fine-grained scene descriptions remain challenging for all existing models. Attention-driven architectures better handle temporally dependent or multi-step queries, whereas dual-encoder and multimodal fusion models perform well primarily on simpler or single-category captions. Cross-dataset generalization improves with larger, more diverse caption sets, but generative captions do not consistently enhance retrieval accuracy. Overall, our findings highlight key dataset factors, benchmark challenges, and the interplay between query content and model architecture, providing guidance for developing more effective text-to-video retrieval systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript evaluates 14 state-of-the-art text-to-video retrieval methods across three widely used datasets under a single unified preprocessing and evaluation framework. It examines caption properties including length, clarity, semantic category, and Action-vs-Scene balance, then links these properties to retrieval recall. The central empirical claim is that short, clear, single-action or color-attribute captions yield higher recall while complex multi-step events and fine-grained scene descriptions remain difficult for all tested architectures; attention-based models are reported to handle temporal queries better than dual-encoder or fusion models, and larger caption diversity aids cross-dataset generalization.

Significance. If the attribution of performance differences to caption traits can be isolated from video-level confounds, the work would supply concrete, actionable guidance on query difficulty and architecture-specific strengths, directly addressing the observed performance plateau in text-to-video retrieval and informing both model design and dataset construction.

major comments (2)
  1. [Results and Analysis] The central claim that caption characteristics are the primary driver of recall differences is load-bearing yet unsupported by controls for correlated video properties. The unified framework standardizes preprocessing but reports no regression, matching, or stratification on video duration, motion entropy, or scene complexity—factors that plausibly co-vary with caption complexity across the three datasets—leaving the attribution vulnerable to omitted-variable bias.
  2. [Experimental Setup] No statistical significance tests, confidence intervals, or multiple-comparison corrections are described for the reported recall differences across caption categories or architectures, so the strength of the performance contrasts cannot be assessed.
minor comments (1)
  1. [Abstract] The abstract lists the number of methods and datasets but does not name the three datasets or the 14 methods; adding these identifiers would improve immediate readability.
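
The stratification asked for in major comment 1 can be sketched as follows: compute recall separately inside each video-duration bin and check whether the gap between simple and complex captions survives within every bin. This is a minimal sketch under assumed labels; the bin names, the "simple"/"complex" split, and the toy records are hypothetical and not taken from the paper.

```python
from collections import defaultdict


def stratified_recall(records):
    """Recall per (duration_bin, caption_type) cell.

    records: iterable of (duration_bin, caption_type, hit).
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for duration_bin, caption_type, hit in records:
        key = (duration_bin, caption_type)
        totals[key] += 1
        hits[key] += int(hit)
    return {key: hits[key] / totals[key] for key in totals}


def gap_within_strata(cell_recall):
    """Simple-minus-complex recall gap inside each duration bin."""
    bins = {duration_bin for duration_bin, _ in cell_recall}
    return {
        b: cell_recall[(b, "simple")] - cell_recall[(b, "complex")]
        for b in bins
        if (b, "simple") in cell_recall and (b, "complex") in cell_recall
    }


# Made-up records, not data from the paper.
records = [
    ("short_video", "simple", True),
    ("short_video", "simple", True),
    ("short_video", "complex", True),
    ("short_video", "complex", False),
    ("long_video", "simple", True),
    ("long_video", "simple", False),
    ("long_video", "complex", False),
    ("long_video", "complex", False),
]
cells = stratified_recall(records)
gaps = gap_within_strata(cells)
print(gaps)
```

If the gap shrinks toward zero inside each bin, video duration rather than caption complexity would be the more plausible driver; a gap that persists across bins is consistent with, though not proof of, a caption effect.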

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important aspects of robustness and statistical rigor that we address below. We have revised the manuscript to incorporate additional analyses where feasible.

Point-by-point responses
  1. Referee: [Results and Analysis] The central claim that caption characteristics are the primary driver of recall differences is load-bearing yet unsupported by controls for correlated video properties. The unified framework standardizes preprocessing but reports no regression, matching, or stratification on video duration, motion entropy, or scene complexity—factors that plausibly co-vary with caption complexity across the three datasets—leaving the attribution vulnerable to omitted-variable bias.

    Authors: We acknowledge the potential for omitted-variable bias from video-level factors. While the unified preprocessing framework reduces some dataset-specific confounds, we agree that explicit controls are needed. In the revised manuscript, we will add a multiple linear regression analysis with recall as the dependent variable and caption properties (length, clarity, action/scene balance) as predictors, while controlling for video duration, estimated motion entropy, and scene complexity (computed via optical flow variance and object detection entropy). We will also report stratified recall results by video duration bins to demonstrate that the caption effects persist across strata. revision: yes

  2. Referee: [Experimental Setup] No statistical significance tests, confidence intervals, or multiple-comparison corrections are described for the reported recall differences across caption categories or architectures, so the strength of the performance contrasts cannot be assessed.

    Authors: We agree that formal statistical assessment strengthens the claims. In the revision, we will report 95% bootstrap confidence intervals for all recall@K values and apply paired t-tests (or Wilcoxon signed-rank tests for non-normal distributions) with Bonferroni correction for multiple comparisons across the caption categories and model architectures. These results will be added to the relevant tables and figures. revision: yes
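
Two of the statistical tools named in the second response can be sketched with the standard library alone: a percentile bootstrap confidence interval over per-query hits, and a Bonferroni threshold for a family of comparisons. The function names, resample count, seed, and toy inputs are assumptions for illustration; the paired t-test and Wilcoxon test the authors mention are not implemented here.

```python
import random


def bootstrap_ci(hits, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for recall from per-query 0/1 hits."""
    rng = random.Random(seed)
    n = len(hits)
    means = []
    for _ in range(n_resamples):
        # Resample queries with replacement and recompute recall.
        sample = [hits[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def bonferroni(p_values, alpha=0.05):
    """Flag p-values that remain significant after Bonferroni correction."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


# Made-up inputs: 70 hits out of 100 queries, three hypothetical p-values.
hits = [1] * 70 + [0] * 30
lower, upper = bootstrap_ci(hits)
significant = bonferroni([0.001, 0.02, 0.04])
print(lower, upper, significant)
```

Bonferroni is the most conservative common correction; with many caption categories and 14 models the number of comparisons grows quickly, so the per-test threshold becomes very small.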

Circularity Check

0 steps flagged

No circularity: purely observational empirical analysis

full rationale

The paper conducts a unified evaluation of 14 existing retrieval methods across 3 external datasets, followed by linguistic analysis of caption properties and their correlation with recall. No derivations, equations, fitted parameters, or predictions are present; performance differences are reported as direct observations from standardized runs on public data. No self-citation chains or ansatzes are invoked to justify core claims. The analysis is self-contained against external benchmarks and does not reduce any result to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on empirical evaluation of existing methods and datasets using standard IR metrics; no new free parameters, ad-hoc axioms, or invented entities are introduced.

axioms (1)
  • domain assumption: Standard information retrieval metrics such as recall@K are appropriate and sufficient for comparing text-to-video retrieval performance.
    Invoked in the unified evaluation framework described in the abstract.
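
For readers unfamiliar with the metric, recall@K in this setting is the fraction of text queries whose correct video appears among the top K ranked results. The sketch below assumes one ground-truth video per query, placed on the diagonal of a query-by-video score matrix; that layout and the toy scores are assumptions for illustration, not the paper's evaluation code.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose ground-truth video ranks in the top k.

    similarity[i][j] is the score between query i and video j.
    Assumption: the ground-truth video for query i is video i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Zero-based rank = number of videos scored strictly higher
        # than the true one (ties are resolved in the query's favor).
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return hits / len(similarity)


# Toy 3x3 score matrix with made-up numbers.
scores = [
    [0.9, 0.1, 0.3],  # query 0: true video ranked first
    [0.8, 0.5, 0.2],  # query 1: true video ranked second
    [0.4, 0.6, 0.1],  # query 2: true video ranked third
]
print(recall_at_k(scores, 1))
print(recall_at_k(scores, 2))
```

In the toy matrix one of three queries is a hit at K=1 and two of three at K=2, which shows why recall@K is reported at several values of K.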




Reference graph

Works this paper leans on

154 extracted references · 154 canonical work pages · 1 internal anchor

  1. [1]

    Gharahsouflou, A., Maihami, V., Khamforoosh, K.: An Efficient Approach for Large-Scale Image-to-Video Retrieval with Convolutional Neural Network Features (2022)

  2. [2]

    In: Proceedings of the AAAI Conference on Artificial Intelligence, vol

    Liu, L., Li, J., Niu, L., Xu, R., Zhang, L.: Activity Image-to-Video Retrieval by Disentangling Appearance and Motion. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2145–2153 (2021)

  3. [3]

    arXiv preprint arXiv:2509.26391 (2025)

    Zhu, C., Wu, Y., Wang, S., Wu, G., Wang, L.: MotionRAG: Motion Retrieval- Augmented Image-to-Video Generation. arXiv preprint arXiv:2509.26391 (2025)

  4. [4]

    In: 2022 34th Chinese Control and Decision Conference (CCDC), pp

    Liu, Y., Yang, J., Yan, X., Song, L.: Activity Image-to-Video Retrieval via Domain Adversarial Learning. In: 2022 34th Chinese Control and Decision Conference (CCDC), pp. 6183–6188 (2022). IEEE

  5. [5]

    In: Proceedings of the AAAI Conference on Artificial Intelligence, vol

    Xu, R., Niu, L., Zhang, J., Zhang, L.: A Proposal-based Approach for Activity Image-to-Video Retrieval. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 12524–12531 (2020)

  6. [6]

    Frontiers in Imaging1, 951934 (2022)

    Qiu, G.: Challenges and Opportunities of Image and Video Retrieval. Frontiers in Imaging1, 951934 (2022)

  7. [7]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

    Yuan, L., Wang, T., Zhang, X., Tay, F.E., Jie, Z., Liu, W., Feng, J.: Central Similarity Quantization for Efficient Image and Video retrieval. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3083–3092 (2020)

  8. [8]

    IEEE Access12, 79342–79366 (2024) 36

    Vadicamo, L., Arnold, R., Bailer, W., Carrara, F., Gurrin, C., Hezel, N., Li, X., Lokoc, J., Lubos, S., Ma, Z.,et al.: Evaluating Performance and Trends in Interactive Video Retrieval: Insights from the 12th VBS Competition. IEEE Access12, 79342–79366 (2024) 36

  9. [9]

    In: Proceedings of the 30th ACM International Conference on Multimedia, pp

    Dong, J., Chen, X., Zhang, M., Yang, X., Chen, S., Li, X., Wang, X.: Par- tially Relevant Video Retrieval. In: Proceedings of the 30th ACM International Conference on Multimedia, pp. 246–257 (2022)

  10. [10]

    In: 2021 10th International Conference on System Modeling & Advancement in Research Trends (SMART), pp

    Chavate, S., Mishra, R., Yadav, P.: A Comparative Analysis of Video Shot Boundary Detection using Different Approaches. In: 2021 10th International Conference on System Modeling & Advancement in Research Trends (SMART), pp. 1–7 (2021). IEEE

  11. [11]

    In: Proceedings of the 31st ACM International Conference on Multimedia, pp

    Hu, Z., Ye, A.N., Hosseini Khorasgani, S., Mohomed, I.: AdaCLIP: Towards Pragmatic Multimodal Video Retrieval. In: Proceedings of the 31st ACM International Conference on Multimedia, pp. 5623–5633 (2023)

  12. [12]

    In: Proceedings of the 30th ACM International Conference on Multimedia, pp

    Falcon, A., Serra, G., Lanz, O.: A Feature-Space Multimodal Data Augmen- tation Technique for Text-Video Retrieval. In: Proceedings of the 30th ACM International Conference on Multimedia, pp. 4385–4394 (2022)

  13. [13]

    In: Proceedings of the 29th ACM International Conference on Multimedia, pp

    Zhang, H., Jepson, A.D., Mohomed, I., Derpanis, K.G., Zhang, R., Fazly, A.: Personalized Multi-modal Video Retrieval on Mobile Devices. In: Proceedings of the 29th ACM International Conference on Multimedia, pp. 1185–1191 (2021)

  14. [14]

    In: Proceedings of the 29th ACM International Conference on Multimedia, pp

    Jiang, C., Huang, K., He, S., Yang, X., Zhang, W., Zhang, X., Cheng, Y., Yang, L., Wang, Q., Xu, F.,et al.: Learning Segment Similarity and Alignment in Large-Scale Content based Video Retrieval. In: Proceedings of the 29th ACM International Conference on Multimedia, pp. 1618–1626 (2021)

  15. [15]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

    Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at Once-Multi-Modal Fusion Transformer for Video Retrieval. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 20020–20029 (2022)

  16. [16]

    Information Systems87, 101374 (2020)

    Aumüller, M., Bernhardsson, E., Faithfull, A.: ANN-Benchmarks: A Bench- marking Tool for Approximate Nearest Neighbor Algorithms. Information Systems87, 101374 (2020)

  17. [17]

    In: International Conference on Multimedia Modeling, pp

    Pegia, M., Lopez, F.A., Moumtzidou, A., Gutierrez-Torre, A., Jónsson, B.Þ., García, J.L.B., Gialampoukidis, I., Vrochidis, S., Kompatsiaris, I.: Time- Quality Tradeoff of MuseHash Query Processing Performance. In: International Conference on Multimedia Modeling, pp. 270–283 (2024). Springer

  18. [18]

    In: Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, pp

    Manohar, M.D., Shen, Z., Blelloch, G., Dhulipala, L., Gu, Y., Simhadri, H.V., Sun, Y.: ParlayANN: Scalable and Deterministic Parallel Graph-based Approximate Nearest Neighbor Search Algorithms. In: Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, pp. 270–285 (2024) 37

  19. [19]

    IEEE IVMSP, 9826–9836 (2022)

    Pegia, M., Moumtzidou, A., Gialampoukidis, I., Jónsson, B.Þ., Vrochidis, S., Kompatsiaris, I.: Biashash: A bayesian hashing framework for image retrieval. IEEE IVMSP, 9826–9836 (2022)

  20. [20]

    Scientific Programming2022(1), 1911345 (2022)

    Subramanian, B., Paul, A., Kim, J., Chee, K.-W.-A.: Metrics Space and Norm: Taxonomy to Distance Metrics. Scientific Programming2022(1), 1911345 (2022)

  21. [21]

    Nature Communications16(1), 5181 (2025)

    Kalinin, A.A., Arevalo, J., Serrano, E., Vulliard, L., Tsang, H., Bornholdt, M., Muñoz, A.F., Sivagurunathan, S., Rajwa, B., Carpenter, A.E.,et al.: A Versatile InformationRetrievalFrameworkforEvaluatingProfileStrengthandSimilarity. Nature Communications16(1), 5181 (2025)

  22. [22]

    In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp

    Xu, H., Ghosh, G., Huang, P.-Y., Okhonko, D., Aghajanyan, A., Metze, F., Zettlemoyer, L., Feichtenhofer, C.: VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 6787–6800 (2021)

  23. [23]

    In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp

    Wang, J., Wang, C., Huang, K., Huang, J., Jin, L.: VideoCLIP-XL: Advancing Long Description Understanding for Video CLIP Models. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 16061–16075 (2024)

  24. [24]

    In: International Conference on Multimedia Modeling, pp

    Nguyen, T.-N., Quang, L.M., Healy, G., Nguyen, B.T., Gurrin, C.: Videoclip 2.0: An Interactive Clip-based Video Retrieval System for Novice Users at VBS2024. In: International Conference on Multimedia Modeling, pp. 394–399 (2024). Springer

  25. [25]

    A CLIP-Hitchhiker’s Guide to Long Video Retrieval

    Bain, M., Nagrani, A., Varol, G., Zisserman, A.: A clip-hitchhiker’s guide to long video retrieval. arXiv preprint arXiv:2205.08508 (2022)

  26. [26]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp

    Wang, Z., Sung, Y.-L., Cheng, F., Bertasius, G., Bansal, M.: Unified coarse- to-fine alignment for video-text retrieval. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2816–2827 (2023)

  27. [27]

    In: Proceedings of the AAAI Conference on Artificial Intelligence, vol

    Ventura, L., Yang, A., Schmid, C., Varol, G.: Covr: Learning composed video retrieval from web video captions. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, pp. 5270–5279 (2024)

  28. [28]

    IEEE transactions on circuits and systems for video technology32(8), 5680– 5694 (2022)

    Dong, J., Wang, Y., Chen, X., Qu, X., Li, X., He, Y., Wang, X.: Reading- Strategy Inspired Visual Representation Learning for Text-to-Video Retrieval. IEEE transactions on circuits and systems for video technology32(8), 5680– 5694 (2022)

  29. [29]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

    Tian,K.,Zhao,R.,Xin,Z.,Lan,B.,Li,X.:HolisticFeaturesarealmostSufficient for Text-to-Video Retrieval. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 17138–17147 (2024) 38

  30. [30]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp

    Zhang, B., Cao, Z., Du, H., Li, Y., Li, X., Liu, J., Wang, S.: Quantifying and Narrowing the Unknown: Interactive Text-to-Video Retrieval via Uncertainty Minimization. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 22120–22130 (2025)

  31. [31]

    In: Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp

    Zhao, Z., Chen, Z., Huang, Z., Sadiq, S., Chen, T.: Continual Text-to-Video Retrieval with Frame Fusion and Task-Aware Routing. In: Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1011–1021 (2025)

  32. [32]

    In: Proceedings of the AAAI Conference on Artificial Intelligence, vol

    Yang, X., Zhu, L., Wang, X., Yang, Y.: DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, pp. 6540–6548 (2024)

  33. [33]

    In: Proceedings of the 31st ACM International Conference on Multimedia, pp

    Song, X., Chen, J., Jiang, Y.-G.: Relation Triplet Construction for Cross- Modal Text-to-Video Retrieval. In: Proceedings of the 31st ACM International Conference on Multimedia, pp. 4759–4767 (2023)

  34. [34]

    In: Proceedings of the 29th ACM International Conference on Multimedia, pp

    Han, N., Chen, J., Xiao, G., Zhang, H., Zeng, Y., Chen, H.: Fine-grained Cross- Modal Alignment Network for Text-Video Retrieval. In: Proceedings of the 29th ACM International Conference on Multimedia, pp. 3826–3834 (2021)

  35. [35]

    In: Proceedings of the 31st ACM International Conference on Multimedia, pp

    Jiang, C., Liu, H., Yu, X., Wang, Q., Cheng, Y., Xu, J., Liu, Z., Guo, Q., Chu, W.,Yang,M.,et al.:Dual-ModalAttention-EnhancedText-VideoRetrievalwith Triplet Partial Margin Contrastive Learning. In: Proceedings of the 31st ACM International Conference on Multimedia, pp. 4626–4636 (2023)

  36. [36]

    In: Proceedings of the 28th ACM International Conference on Multimedia, pp

    Lokoć, J., Soućek, T., Vesel` y, P., Mejzlík, F., Ji, J., Xu, C., Li, X.: A W2VV++ Case study with Automated and Interactive Text-to-Video Retrieval. In: Proceedings of the 28th ACM International Conference on Multimedia, pp. 2553–2561 (2020)

  37. [37]

    10704–10713 (2023)

    Wu, W., Luo, H., Fang, B., Wang, J., Ouyang, W.: Cap4video: What can Aux- iliary Captions do for Text-Video Retrieval? In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10704–10713 (2023)

  38. [38]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

    Wang, X., Zhu, L., Yang, Y.: T2vlad: Global-local sequence alignment for text- video retrieval. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5079–5088 (2021)

  39. [39]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

    Gorti, S.K., Vouitsis, N., Ma, J., Golestan, K., Volkovs, M., Garg, A., Yu, G.: X-Pool: Cross-modal Language-Video Attention for Text-Video Retrieval. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5006–5015 (2022)

  40. [40]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

    Duarte, A., Albanie, S., Giró-i-Nieto, X., Varol, G.: Sign Language Video 39 Retrieval with Free-Form Textual Queries. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14094–14104 (2022)

  41. [41]

    In: Pro- ceedings of the IEEE/CVF International Conference on Computer Vision, pp

    Ibrahimi, S., Sun, X., Wang, P., Garg, A., Sanan, A., Omar, M.: Audio-enhanced Text-to-Video Retrieval using Text-conditioned Feature Alignment. In: Pro- ceedings of the IEEE/CVF International Conference on Computer Vision, pp. 12054–12064 (2023)

  42. [42]

    IEEE Transactions on Multimedia24, 2914–2923 (2021)

    Song, X., Chen, J., Wu, Z., Jiang, Y.-G.: Spatial-Temporal Graphs for Cross- Modal Text2Video Retrieval. IEEE Transactions on Multimedia24, 2914–2923 (2021)

  43. [43]

    IEEE Transactions on Multimedia25, 6079–6089 (2022)

    Wang, X., Zhu, L., Zheng, Z., Xu, M., Yang, Y.: Align and Tell: Boosting Text- Video Retrieval with Local Alignment and Fine-Grained Supervision. IEEE Transactions on Multimedia25, 6079–6089 (2022)

  44. [44]

    IEEE Transactions on Multimedia23, 4351–4362 (2020)

    Li, X., Zhou, F., Xu, C., Ji, J., Yang, G.: SEA: Sentence Encoder Assembly for Video Retrieval by Textual Queries. IEEE Transactions on Multimedia23, 4351–4362 (2020)

  45. [45]

    IEEE Transactions on Pattern Analysis and Machine Intelligence (2021) https://doi.org/10.1109/TPAMI.2021.3059295

    Dong,J., Li,X., Xu,C., Yang,X., Yang, G.,Wang, X.,Wang,M.: Dualencoding for video retrieval by text. IEEE Transactions on Pattern Analysis and Machine Intelligence (2021) https://doi.org/10.1109/TPAMI.2021.3059295

  46. [46]

    arXiv preprint arXiv:2305.12218 (2023)

    Jin, P., Li, H., Cheng, Z., Huang, J., Wang, Z., Yuan, L., Liu, C., Chen, J.: Text- Video Retrieval with Disentangled Conceptualization and Set-to-Set Alignment. arXiv preprint arXiv:2305.12218 (2023)

  47. [47]

    In: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp

    Zhao, S., Zhu, L., Wang, X., Yang, Y.: CenterCLIP: Token Clustering for Effi- cient Text-Video Retrieval. In: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 970–981 (2022)

  48. [48]

    In: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp

    Ji, K., Liu, J., Hong, W., Zhong, L., Wang, J., Chen, J., Chu, W.: CRET: Cross- modal Retrieval Transformer for efficient Text-Video Retrieval. In: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 949–959 (2022)

  49. [49]

    In: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp

    Yakovlev, K., Polyakov, G., Alimova, I., Podolskiy, A., Bout, A., Nikolenko, S., Piontkovskaya, I.: Sinkhorn Transformations for Single-Query Postprocessing in Text-Video Retrieval. In: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2394– 2398 (2023)

  50. [50]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern 40 Recognition, pp

    Wray, M., Doughty, H., Damen, D.: On Semantic Similarity in Video Retrieval. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern 40 Recognition, pp. 3650–3660 (2021)

  51. [51]

    In: Findings of the Association for Computational Linguistics: EMNLP 2022, pp

    Lei, W., Gao, D., Wang, Y., Mao, D., Liang, Z., Ran, L., Shou, M.Z.: Assistsr: Task-oriented video segment retrieval for personal ai assistant. In: Findings of the Association for Computational Linguistics: EMNLP 2022, pp. 319–338 (2022)

  52. [52]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

    Dong, Z., Liu, X., Chen, B., Polak, P., Zhang, P.: Musechat: A conversational music recommendation system for videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12775–12785 (2024)

  53. [53]

    ACM Transactions on Interactive Intelligent Systems13(1), 1–41 (2023)

    Afzal, S., Ghani, S., Hittawe, M.M., Rashid, S.F., Knio, O.M., Hadwiger, M., Hoteit, I.: Visualization and visual analytics approaches for image and video datasets: A survey. ACM Transactions on Interactive Intelligent Systems13(1), 1–41 (2023)

  54. [54]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

    Yu, W., Liu, Y., Hua, W., Jiang, D., Ren, B., Bai, X.: Turning a CLIP Model into a Scene Text Detector. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6978–6988 (2023)

  55. [55]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

    Xie, C.-W., Sun, S., Xiong, X., Zheng, Y., Zhao, D., Zhou, J.: Ra-Clip: Retrieval Augmented Contrastive Language-Image Pre-training. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19265–19274 (2023)

  56. [56]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

    Baldrati, A., Bertini, M., Uricchio, T., Del Bimbo, A.: Effective Conditioned and Composed Image Retrieval Combining Clip-based Features. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 21466–21474 (2022)

  57. [57]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp

    Croitoru, I., Bogolin, S.-V., Leordeanu, M., Jin, H., Zisserman, A., Albanie, S., Liu, Y.: TEACHTEXT: Crossmodal Generalized Distillation for Text- Video Retrieval. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 11583–11593 (2021)

  58. [58]

    IEEE Transactions on Pattern Analysis and Machine Intelligence44(8), 4065–4080 (2021)

    Dong, J., Li, X., Xu, C., Yang, X., Yang, G., Wang, X., Wang, M.: Dual Encod- ing for Video Retrieval by Text. IEEE Transactions on Pattern Analysis and Machine Intelligence44(8), 4065–4080 (2021)

  59. [59]

    Improving video-text retrieval by multi-stream corpus alignment and dual softmax loss.arXiv preprint arXiv:2109.04290, 2021

    Cheng, X., Lin, H., Wu, X., Yang, F., Shen, D.: Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss. arXiv preprint arXiv:2109.04290 (2021)

  60. [60]

    Advances in Neural Information Processing Systems34, 41 24206–24221 (2021)

    Akbari, H., Yuan, L., Qian, R., Chuang, W.-H., Chang, S.-F., Cui, Y., Gong, B.: VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text. Advances in Neural Information Processing Systems34, 41 24206–24221 (2021)

  61. [61]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

    Dzabraev, M., Kalashnikov, M., Komkov, S., Petiushko, A.: MDMMT: Mul- tidomain Multimodal Transformer for Video Retrieval. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3354–3363 (2021)

  62. [62]

    arXiv preprint arXiv:2112.01194 (2021)

    Yan, R., Shou, M.Z., Ge, Y., Wang, A.J., Lin, X., Cai, G., Tang, J.: Video-Text Pre-training with Learned Regions. arXiv preprint arXiv:2112.01194 (2021)

  63. [63]

    In: Proceedings of the 2020 International Conference on Multimedia Retrieval, pp

    Galanopoulos, D., Mezaris, V.: Attention Mechanisms, Signal Encodings and Fusion Strategies for Improved Ad-hoc Video Search with Dual Encoding Net- works. In: Proceedings of the 2020 International Conference on Multimedia Retrieval, pp. 336–340 (2020)

  64. [64]

    arXiv preprint arXiv:2110.15609 (2021)

    Han, N., Chen, J., Xiao, G., Zeng, Y., Shi, C., Chen, H.: Visual Spatio- Temporal Relation-Enhanced Network for Cross-modal Text-Video Retrieval. arXiv preprint arXiv:2110.15609 (2021)

  65. [65]

    Tan, R., Xu, H., Saenko, K., Plummer, B.A.: LOGAN: Latent Graph Co-Attention Network for Weakly-Supervised Video Moment Retrieval. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2083–2092 (2021)

  66. [66]

    Ma, Y., Xu, G., Sun, X., Yan, M., Zhang, J., Ji, R.: X-CLIP: End-to-End Multi-Grained Contrastive Learning for Video-Text Retrieval. In: Proceedings of the 30th ACM International Conference on Multimedia, pp. 638–647 (2022)

  67. [67]

    Shen, L., Hao, T., He, T., Zhao, S., Zhang, Y., Liu, P., Bao, Y., Ding, G.: TempMe: Video Temporal Token Merging for Efficient Text-Video Retrieval. arXiv preprint arXiv:2409.01156 (2024)

  68. [68]

    Mothe, J.: Analytics methods to understand information retrieval effectiveness—a survey. Mathematics 10(12), 2135 (2022)

  69. [69]

    Rafiq, M., Rafiq, G., Choi, G.S.: Video description: Datasets & evaluation metrics. IEEE Access 9, 121665–121685 (2021)

  70. [70]

    Nozza, D., Hovy, D.: The state of profanity obfuscation in natural language processing. arXiv preprint arXiv:2210.07595 (2022)

  71. [71]

    Cooper, N., Scholak, T.: Perplexed: Understanding when large language models are confused. CoRR (2024)

  72. [72]

    Mohammed, L.A., Aljaberi, M.A., Anmary, A.S., Abdulkhaleq, M.: Analysing English for science and technology reading texts using Flesch reading ease online formula: The preparation for academic reading. In: International Conference on Emerging Technologies and Intelligent Systems, pp. 546–561 (2022). Springer

  73. [73]

    Solnyshkina, M., Zamaletdinov, R., Gorodetskaya, L., Gabitov, A.: Evaluating text complexity and Flesch-Kincaid grade level. Journal of social studies education research 8(3), 238–248 (2017)

  74. [74]

    Fang, B., Wu, W., Liu, C., Zhou, Y., Song, Y., Wang, W., Shu, X., Ji, X., Wang, J.: UATVR: Uncertainty-Adaptive Text-Video Retrieval. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13723–13733 (2023)

  75. [75]

    Christel, M.G., Huang, C., Moraveji, N., Papernick, N.: Exploiting Multiple Modalities for Interactive Video Retrieval. In: 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 3, p. 1032 (2004). IEEE

  76. [76]

    Spolaôr, N., Lee, H.D., Takaki, W.S.R., Ensina, L.A., Parmezan, A.R.S., Oliva, J.T., Coy, C.S.R., Wu, F.C.: A Video Indexing and Retrieval Computational Prototype based on Transcribed Speech. Multimedia Tools and Applications 80(25), 33971–34017 (2021)

  77. [77]

    Khan, O.S., Jónsson, B.Þ., Zahálka, J., Rudinac, S., Worring, M.: Impact of Interaction Strategies on User Relevance Feedback. In: Proceedings of the 2021 International Conference on Multimedia Retrieval, pp. 590–598 (2021)

  78. [78]

    Kumar, H., Mahindru, R., Kar, D.: Metadata-based Retrieval for Resolution Recommendation in AIOps. In: Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 1379–1389 (2022)

  79. [79]

    Yin, S., Zhao, S., Wang, H., Xu, T., Chen, E.: Exploiting Instance-level Relationships in Weakly Supervised Text-to-Video Retrieval. ACM Transactions on Multimedia Computing, Communications and Applications 20(10), 1–21 (2024)

  80. [80]

    Liang, K., Albanie, S.: Simple Baselines for Interactive Video Retrieval with Questions and Answers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 11091–11101 (2023)
