Understanding the Performance Plateau in Text-to-Video Retrieval: A Comprehensive Empirical and Linguistic Analysis
Pith reviewed 2026-05-15 15:01 UTC · model grok-4.3
The pith
Short, clear captions describing single actions achieve higher recall in text-to-video retrieval than complex descriptions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The study finds that short, clear, and simple captions, such as those describing single actions or color attributes, achieve higher recall, while complex events, multi-step activities, or fine-grained scene descriptions remain challenging for all existing models. Attention-driven architectures better handle temporally dependent or multi-step queries, whereas dual-encoder and multimodal fusion models perform well primarily on simpler or single-category captions. Cross-dataset generalization improves with larger, more diverse caption sets, but generative captions do not consistently enhance retrieval accuracy.
What carries the argument
Analysis of caption characteristics (length, clarity, semantic category, Action vs. Scene balance) linked to performance under a unified preprocessing and evaluation framework.
If this is right
- Cross-dataset generalization improves with larger, more diverse caption sets.
- Generative captions do not consistently enhance retrieval accuracy.
- Attention-driven models are better suited for temporally dependent queries.
- Dual-encoder models handle simpler captions effectively.
Where Pith is reading between the lines
- Model development should prioritize handling of multi-step activities and fine-grained details.
- New benchmarks could include controlled variations in caption complexity to isolate query effects.
- Dataset curation might benefit from including more balanced mixes of simple and complex descriptions.
Load-bearing premise
The analysis assumes that the unified framework fully eliminates preprocessing and evaluation biases, and that caption traits, rather than other dataset factors, are the main cause of the performance gaps.
What would settle it
If complex captions were rewritten into simpler equivalents with the same meaning and the evaluations re-run, an absence of recall improvement would falsify the link between caption complexity and retrieval difficulty.
Original abstract
Text-to-video retrieval enables users to find relevant video content using natural language queries, a task that has grown increasingly important with the rapid expansion of online video. Over the past six years, research has produced numerous methods, such as dual encoders, attention-driven models, and multimodal fusion approaches; however, fundamental questions remain about model behavior, dataset influence, and query difficulty. In this work, we evaluate 14 state-of-the-art retrieval methods across 3 widely used datasets under a unified preprocessing and evaluation framework. We analyze caption characteristics, including length, clarity, semantic category, and Action vs. Scene balance, and link these to model performance. Our results show that short, clear, and simple captions, such as those describing single actions or color attributes, achieve higher recall, while complex events, multi-step activities, or fine-grained scene descriptions remain challenging for all existing models. Attention-driven architectures better handle temporally dependent or multi-step queries, whereas dual-encoder and multimodal fusion models perform well primarily on simpler or single-category captions. Cross-dataset generalization improves with larger, more diverse caption sets, but generative captions do not consistently enhance retrieval accuracy. Overall, our findings highlight key dataset factors, benchmark challenges, and the interplay between query content and model architecture, providing guidance for developing more effective text-to-video retrieval systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates 14 state-of-the-art text-to-video retrieval methods across three widely used datasets under a single unified preprocessing and evaluation framework. It examines caption properties including length, clarity, semantic category, and Action-vs-Scene balance, then links these properties to retrieval recall. The central empirical claim is that short, clear, single-action or color-attribute captions yield higher recall while complex multi-step events and fine-grained scene descriptions remain difficult for all tested architectures; attention-based models are reported to handle temporal queries better than dual-encoder or fusion models, and larger caption diversity aids cross-dataset generalization.
Significance. If the attribution of performance differences to caption traits can be isolated from video-level confounds, the work would supply concrete, actionable guidance on query difficulty and architecture-specific strengths, directly addressing the observed performance plateau in text-to-video retrieval and informing both model design and dataset construction.
major comments (2)
- [Results and Analysis] The central claim that caption characteristics are the primary driver of recall differences is load-bearing yet unsupported by controls for correlated video properties. The unified framework standardizes preprocessing but reports no regression, matching, or stratification on video duration, motion entropy, or scene complexity—factors that plausibly co-vary with caption complexity across the three datasets—leaving the attribution vulnerable to omitted-variable bias.
- [Experimental Setup] No statistical significance tests, confidence intervals, or multiple-comparison corrections are described for the reported recall differences across caption categories or architectures, so the strength of the performance contrasts cannot be assessed.
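The stratification control requested in the first major comment could look roughly like the following sketch. It is not the authors' analysis: the record layout, the `stratified_recall` helper, the caption group labels, and the 15-second duration boundary are all hypothetical, and the records are toy values invented purely to show the mechanics. The idea is to compare caption groups only within bins of similar video duration, so that a recall gap cannot be explained by duration alone.

```python
from collections import defaultdict

def stratified_recall(records, duration_edges):
    """Recall per (duration bin, caption group).

    records: iterable of (duration_seconds, caption_group, hit), where hit is
    1 if the correct video was retrieved in the top K and 0 otherwise.
    duration_edges: ascending bin boundaries in seconds (hypothetical values).
    """
    totals = defaultdict(lambda: [0, 0])  # key -> [hits, count]
    for duration, group, hit in records:
        # Bin index = number of boundaries the duration meets or exceeds.
        bin_idx = sum(1 for edge in duration_edges if duration >= edge)
        cell = totals[(bin_idx, group)]
        cell[0] += hit
        cell[1] += 1
    return {key: hits / count for key, (hits, count) in totals.items()}

# Toy records, invented for illustration only.
records = [
    (8, "simple", 1), (9, "simple", 1), (7, "complex", 1), (6, "complex", 0),
    (40, "simple", 1), (45, "simple", 0), (50, "complex", 0), (42, "complex", 0),
]
table = stratified_recall(records, duration_edges=[15])
for key in sorted(table):
    print(key, table[key])
```

If the simple-versus-complex gap persists inside every duration bin, duration is less likely to be the hidden driver; the same pattern extends to motion or scene-complexity bins.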
minor comments (1)
- [Abstract] The abstract lists the number of methods and datasets but does not name the three datasets or the 14 methods; adding these identifiers would improve immediate readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments highlight important aspects of robustness and statistical rigor that we address below. We have revised the manuscript to incorporate additional analyses where feasible.
- Referee: [Results and Analysis] The central claim that caption characteristics are the primary driver of recall differences is load-bearing yet unsupported by controls for correlated video properties. The unified framework standardizes preprocessing but reports no regression, matching, or stratification on video duration, motion entropy, or scene complexity (factors that plausibly co-vary with caption complexity across the three datasets), leaving the attribution vulnerable to omitted-variable bias.
  Authors: We acknowledge the potential for omitted-variable bias from video-level factors. While the unified preprocessing framework reduces some dataset-specific confounds, we agree that explicit controls are needed. In the revised manuscript, we will add a multiple linear regression analysis with recall as the dependent variable and caption properties (length, clarity, action/scene balance) as predictors, while controlling for video duration, estimated motion entropy, and scene complexity (computed via optical flow variance and object detection entropy). We will also report stratified recall results by video duration bins to demonstrate that the caption effects persist across strata.
  Revision: yes
- Referee: [Experimental Setup] No statistical significance tests, confidence intervals, or multiple-comparison corrections are described for the reported recall differences across caption categories or architectures, so the strength of the performance contrasts cannot be assessed.
  Authors: We agree that formal statistical assessment strengthens the claims. In the revision, we will report 95% bootstrap confidence intervals for all recall@K values and apply paired t-tests (or Wilcoxon signed-rank tests for non-normal distributions) with Bonferroni correction for multiple comparisons across the caption categories and model architectures. These results will be added to the relevant tables and figures.
  Revision: yes
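As a rough illustration of the statistical additions promised in the second response, the sketch below computes a percentile bootstrap confidence interval for recall@K and a Bonferroni-adjusted significance level. It is an assumption-laden sketch rather than the authors' procedure: the function names, the resample count, the per-query 0/1 hit encoding, and the toy input are hypothetical, and the paired tests themselves are not shown.

```python
import random

def bootstrap_ci(hits, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for recall@K.

    hits: list of 0/1 indicators, one per query (1 = correct video in top K).
    """
    rng = random.Random(seed)
    n = len(hits)
    # Resample queries with replacement and recompute recall each time.
    means = sorted(sum(rng.choices(hits, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi

def bonferroni_threshold(alpha, n_comparisons):
    """Per-test significance level under Bonferroni correction."""
    return alpha / n_comparisons

# Toy input, invented for illustration: 100 queries, 60 of them hits.
hits = [1] * 60 + [0] * 40
lo, hi = bootstrap_ci(hits)
print(f"recall@K = {sum(hits) / len(hits):.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
print("per-test alpha for 10 comparisons:", bonferroni_threshold(0.05, 10))
```

Resampling at the query level keeps the procedure independent of any distributional assumption about recall, which is one reason a bootstrap is a natural fit for retrieval metrics.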
Circularity Check
No circularity: purely observational empirical analysis
The paper conducts a unified evaluation of 14 existing retrieval methods across 3 external datasets, followed by linguistic analysis of caption properties and their correlation with recall. No derivations, equations, fitted parameters, or predictions are present; performance differences are reported as direct observations from standardized runs on public data. No self-citation chains or ansatzes are invoked to justify core claims. The analysis is self-contained against external benchmarks and does not reduce any result to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Standard information retrieval metrics such as recall@K are appropriate and sufficient for comparing text-to-video retrieval performance.
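For readers unfamiliar with the metric named in this assumption, here is a minimal sketch of recall@K for text-to-video retrieval. It assumes a square similarity matrix in which caption i is paired with video i; that layout, the `recall_at_k` name, and the toy scores are illustrative choices, not details taken from the paper.

```python
from typing import Sequence

def recall_at_k(sim: Sequence[Sequence[float]], k: int) -> float:
    """Fraction of text queries whose ground-truth video ranks in the top k.

    Assumption: sim[i][j] is the similarity between caption i and video j,
    and the correct video for caption i is video i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Rank of the correct video = number of videos scored strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return hits / len(sim)

# Toy similarity scores, invented for illustration.
sim = [
    [0.9, 0.1, 0.3],  # caption 0: correct video ranked first
    [0.8, 0.5, 0.6],  # caption 1: correct video ranked third
    [0.2, 0.7, 0.4],  # caption 2: correct video ranked second
]
print(recall_at_k(sim, 1))
print(recall_at_k(sim, 2))
```

Because the metric only records whether the single paired video appears in the top K, it gives no credit for retrieving other videos that also match the caption, which is one way the "sufficient" part of the assumption could fail.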
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "short, clear, and simple captions, such as those describing single actions or color attributes, achieve higher recall, while complex events, multi-step activities, or fine-grained scene descriptions remain challenging"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.