U-CESE: Unified Clip-based Event Search Engine for AI Challenge HCMC 2025

Duc-Nhuan Le; Hoang-Phuc Nguyen; Minh-Hoang Le; Minh-Nhut Dang; Thanh-Duy Lam

arxiv: 2605.23274 · v1 · pith:WP7CJY5Gnew · submitted 2026-05-22 · 💻 cs.CV

Duc-Nhuan Le, Hoang-Phuc Nguyen, Thanh-Duy Lam, Minh-Nhut Dang, Minh-Hoang Le

Pith reviewed 2026-05-25 04:29 UTC · model grok-4.3

classification 💻 cs.CV

keywords: event retrieval · video search engine · multimodal retrieval · keyframe extraction · video captioning · clip-based processing · unified framework

The pith

U-CESE merges three prior modules into one unified clip-based engine for consistent multimodal video event retrieval.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents U-CESE as a solution for retrieving events from large video collections where temporal, spatial, and multimodal details make the task difficult. It unifies the three modules from an earlier CESE system into a single framework to ensure the same processing steps apply no matter the query type. A new Unified Clipping Algorithm combines previous clipping methods into one pipeline. DAKE extracts keyframes without training by watching JPEG file sizes for scene changes, and ReCap creates captions that stay consistent over time. If successful, this approach would let systems handle big video archives more reliably and efficiently across different kinds of searches.
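To make the DAKE idea concrete, here is a minimal sketch of keyframe detection driven only by per-frame JPEG file sizes. The source does not state the actual decision rule, so the relative-change test, the `rel_threshold` value, the function name `detect_keyframes`, and the toy sizes are all assumptions made for illustration.

```python
from typing import List, Sequence


def detect_keyframes(jpeg_sizes: Sequence[int], rel_threshold: float = 0.2) -> List[int]:
    """Return indices of frames whose JPEG size jumps versus the previous frame.

    Assumption: a frame is a keyframe when the relative size change exceeds
    rel_threshold. The paper's actual DAKE rule may differ.
    """
    if not jpeg_sizes:
        return []
    keyframes = [0]  # treat the first frame as the start of the first scene
    for i in range(1, len(jpeg_sizes)):
        prev, cur = jpeg_sizes[i - 1], jpeg_sizes[i]
        # Relative change in compressed size between consecutive frames.
        change = abs(cur - prev) / max(prev, 1)
        if change > rel_threshold:
            keyframes.append(i)
    return keyframes


# Toy byte counts (invented): two abrupt jumps, at frames 3 and 6.
sizes = [50_000, 50_400, 49_900, 81_000, 80_500, 80_900, 42_000, 42_300]
keyframes = detect_keyframes(sizes)
print(keyframes)
```

In practice the sizes would come from frames already stored as JPEG files (for example via `os.path.getsize`), which is what would make such a method training-free and cheap to run.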

Core claim

U-CESE integrates its three modules into a single cohesive framework with the Unified Clipping Algorithm at its core, proposes DAKE as a lightweight keyframe extraction method based on JPEG file size variations, and introduces ReCap as a recurrent-inspired captioning framework, resulting in robust, consistent, and efficient performance for large-scale multimodal event retrieval.

What carries the argument

The Unified Clipping Algorithm, which merges separate clipping algorithms into one efficient pipeline to ensure consistent processing across query types.

If this is right

Enables consistent retrieval across diverse query types in large video datasets
Provides an efficient, training-free way to extract keyframes using file size changes
Generates temporally consistent and detailed captions for events
Supports scalable performance in multimodal event search challenges

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The unification might reduce maintenance overhead when updating individual components in future systems.
DAKE's reliance on JPEG sizes could be tested on compressed video formats other than those in the challenge.
ReCap's RNN inspiration suggests potential for integration with modern sequence models for even better temporal consistency.

Load-bearing premise

Merging the three CESE modules into a single cohesive framework with the Unified Clipping Algorithm will produce consistent processing and retrieval across query types without introducing new inconsistencies or performance drops.

What would settle it

An experiment that applies the unified U-CESE and the original separate CESE modules to the same set of queries and measures whether retrieval accuracy or consistency decreases in the unified version.
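One way such a comparison could be scored is sketched below with mean recall@K over a shared query set. Everything here is hypothetical: the metric choice, the function names, and the toy rankings are invented for illustration and are not results from the paper or the challenge.

```python
from typing import Dict, List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant clips that appear in the top-k ranked results."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def mean_recall_at_k(runs: Dict[str, List[str]], truth: Dict[str, Set[str]], k: int) -> float:
    """Average recall@k over every query that has ground truth."""
    return sum(recall_at_k(runs[q], truth[q], k) for q in truth) / len(truth)


# Toy ground truth and rankings (invented, not measured).
truth = {"q1": {"clip_a", "clip_b"}, "q2": {"clip_c"}}
unified = {"q1": ["clip_a", "clip_x", "clip_b"], "q2": ["clip_y", "clip_c", "clip_z"]}
separate = {"q1": ["clip_x", "clip_a", "clip_y"], "q2": ["clip_y", "clip_z", "clip_c"]}

unified_score = mean_recall_at_k(unified, truth, k=2)
separate_score = mean_recall_at_k(separate, truth, k=2)
print(unified_score, separate_score)
```

Running both systems on identical queries and reporting the per-query difference, not just the mean, would show whether unification hurts any particular query type.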

Figures

Figures reproduced from arXiv: 2605.23274 by Duc-Nhuan Le, Hoang-Phuc Nguyen, Minh-Hoang Le, Minh-Nhut Dang, Thanh-Duy Lam.

**Figure 1.** Overall system architecture of U-CESE. Excerpt: To address these challenges, we present U-CESE, a Unified Clip-based Event Search Engine for AIC [7]. Our system extends the CESE framework [15], which retrieves coherent clips matching event descriptions across multiple queries rather than single frames. However, CESE employs three separate modules, each with distinct user interfaces and re-ranking strategies, leading…

**Figure 2.** Our data preprocessing pipeline. Excerpt: … with large motion, texture, or lighting changes exhibit abrupt variations in compressed file size [3], while static scenes produce stable sizes.

**Figure 3.** Overview of the Recurrence Captioning (ReCap) framework. Excerpt: … where each shot $S_t$ includes keyframes and a subtitle. We utilize AutoShot [28] for this task. At time step $t$, the system maintains a memory string $M_t$ capturing accumulated contextual information. We employ Gemini [5] as the reasoning and generation engine: $(C_t, M_t) = f_{\mathrm{LVLM}}(S_t, M_{t-1})$, where $f_{\mathrm{LVLM}}$ denotes the LVLM’s reasoning and generation function. T…

**Figure 4.** Main screen of U-CESE’s user interface.

**Figure 5.** U-CESE’s Interactive Window. Excerpt (Section 5, Ablation Study; 5.1, Comparing DAKE with AutoShot): We compare keyframes detected by DAKE with those identified by AutoShot on the video K01_V001 from the organizers’ dataset. As shown in…

**Figure 6.** Frame JPEG sizes across frame indices in a video sample. “True Positive” denotes exact matches, “False Positive” refers to DAKE detections not found in AutoShot results, and “False Negative” indicates keyframes detected by AutoShot but missed by DAKE. [Residue of a further plot: axis labels “Keyframe Ratio” and “AutoShot Detection Ratio”; legend entries “= 0 (Exact Match)”, “= 0.5 × fps”, “= 1.0 × fps”, “= 2.0 × fps”.]

**Figure 8.** Effect of Recurrent Memory on captions.

**Figure 9.** User’s workflow for TRAKE queries. The chosen query is the 4th TRAKE in the final round of AIC2025, which is "In a stir-fried beef cooking video, identify the first moments when each of the following ingredients makes contact with the pan: E1: cooking oil, E2: beef, E3: onion, E4: sesbania flower." By utilizing the Tab shortcut, the user can quickly edit the answer. Acknowledgments: This research is supported …
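The Figure 6 caption defines true positives, false positives, and false negatives for DAKE against AutoShot, and the plot legend lists tolerances in multiples of fps. The sketch below counts those three quantities under a frame tolerance. The greedy nearest-match strategy, the function name `match_keyframes`, and the toy indices are assumptions, since the paper's exact matching procedure is not given here.

```python
from typing import List, Tuple


def match_keyframes(dake: List[int], autoshot: List[int], tolerance_frames: float = 0) -> Tuple[int, int, int]:
    """Count (TP, FP, FN) for DAKE detections against AutoShot references.

    Assumption: each DAKE detection is greedily paired with the nearest
    still-unmatched AutoShot keyframe within tolerance_frames.
    """
    unmatched = sorted(autoshot)
    tp = fp = 0
    for d in sorted(dake):
        candidates = [a for a in unmatched if abs(a - d) <= tolerance_frames]
        if candidates:
            best = min(candidates, key=lambda a: abs(a - d))
            unmatched.remove(best)  # each reference can be matched only once
            tp += 1
        else:
            fp += 1  # DAKE detection with no AutoShot counterpart
    fn = len(unmatched)  # AutoShot keyframes that DAKE missed
    return tp, fp, fn


fps = 25  # assumed frame rate for the toy example
dake_frames = [0, 120, 310, 505]      # invented indices
autoshot_frames = [0, 130, 300, 700]  # invented indices

exact = match_keyframes(dake_frames, autoshot_frames, tolerance_frames=0)
half_second = match_keyframes(dake_frames, autoshot_frames, tolerance_frames=0.5 * fps)
print(exact, half_second)
```

Loosening the tolerance turns near misses into matches, which is why reporting several tolerance levels gives a fairer picture than exact matching alone.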

Original abstract

Retrieving events from large-scale video datasets is challenging due to complex temporal, spatial, and multimodal information. This paper presents U-CESE, our solution for the AI Challenge HCMC 2025, a Unified Clip-based Event Search Engine for multimodal event retrieval across diverse video sources. Building on CESE, U-CESE integrates its three modules into a single cohesive framework, ensuring consistent processing and retrieval across query types. A core component is the Unified Clipping Algorithm, which merges separate clipping algorithms into one efficient pipeline. To handle large-scale data, we propose DAKE, a lightweight, training-free keyframe extraction method using JPEG file size variations to identify significant scene changes. Finally, we introduce ReCap, a temporally consistent captioning framework inspired by Recurrent Neural Network, generating detailed and context-aware textual descriptions. Experiments show that U-CESE delivers robust, consistent, and efficient performance in large-scale multimodal event retrieval.
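The recurrence behind ReCap, as described in the Figure 3 excerpt, takes the current shot and the previous memory string and returns a caption plus an updated memory. A minimal sketch of that loop follows. The `Shot` class, the `recap` function, and the `toy_lvlm` stand-in are hypothetical: the paper uses Gemini as the LVLM, and no real model call is made here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Shot:
    keyframes: List[str]  # e.g. paths to keyframe images (placeholder values here)
    subtitle: str


LVLM = Callable[[Shot, str], Tuple[str, str]]  # (shot, previous memory) -> (caption, new memory)


def recap(shots: List[Shot], f_lvlm: LVLM, initial_memory: str = "") -> Tuple[List[str], str]:
    """Caption shots in order, threading a memory string from step to step."""
    memory = initial_memory
    captions = []
    for shot in shots:
        caption, memory = f_lvlm(shot, memory)  # memory carries context forward
        captions.append(caption)
    return captions, memory


def toy_lvlm(shot: Shot, memory: str) -> Tuple[str, str]:
    """Stand-in for a real LVLM call: echoes the subtitle with prior context."""
    caption = f"{shot.subtitle} (context: {memory or 'none'})"
    new_memory = (memory + "; " if memory else "") + shot.subtitle
    return caption, new_memory


shots = [Shot(["f1.jpg"], "oil is poured"), Shot(["f2.jpg"], "beef is added")]
captions, final_memory = recap(shots, toy_lvlm)
print(captions)
print(final_memory)
```

Keeping the memory as plain text means the same loop works with any model that accepts a prompt, at the cost of the memory growing unless the model is asked to summarize it.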

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a competition system paper that names some engineering integrations but supplies no metrics, baselines, or results to back its claims.

The core of this paper is a description of U-CESE, built for the AI Challenge HCMC 2025. It folds three modules from an earlier CESE framework into one pipeline, adds a Unified Clipping Algorithm, introduces DAKE for keyframe extraction via JPEG file size changes, and presents ReCap for temporally consistent captioning inspired by RNNs. These are practical choices aimed at consistent processing and efficiency on large video collections.

The integration itself and the lightweight, training-free nature of DAKE are the clearest additions on the page. They target real constraints in multimodal event retrieval without requiring new model training. The paper does a decent job spelling out the motivation for each piece and how they fit together for the contest setting.

The main weakness is the total absence of evidence. The abstract claims experiments show robust, consistent, and efficient performance, but the manuscript gives no retrieval metrics, no runtime numbers, no ablations, no comparison to the original three-module CESE, and no challenge results. Without those, the assertion that unification avoids new inconsistencies cannot be checked.

This kind of write-up is mainly useful to other teams entering the same challenge or to engineers looking for a quick starting architecture for video search. It does not advance core methods or provide reproducible findings that would interest a broader research audience. I would not send it for peer review. It functions as a system report rather than a contribution with verifiable claims.

Referee Report

1 major / 1 minor

Summary. The paper presents U-CESE, a Unified Clip-based Event Search Engine for multimodal event retrieval in large-scale video datasets for the AI Challenge HCMC 2025. It builds on CESE by integrating its three modules into a single framework with a Unified Clipping Algorithm, introduces DAKE for lightweight keyframe extraction using JPEG file size variations, and ReCap for temporally consistent captioning inspired by RNNs. The abstract claims that experiments demonstrate robust, consistent, and efficient performance.

Significance. If the empirical results were to hold, the unified framework and proposed components could offer practical advances in efficient processing for large-scale multimodal video retrieval tasks, particularly in competition settings where consistency across query types is valuable. The training-free nature of DAKE is a potential strength for scalability.

major comments (1)

[Abstract] The assertion that 'Experiments show that U-CESE delivers robust, consistent, and efficient performance in large-scale multimodal event retrieval' lacks any supporting evidence; the manuscript provides no retrieval metrics such as mAP or recall@K, no runtime measurements, no ablation studies, no comparisons to the original CESE modules, and no challenge leaderboard results or dataset details.

minor comments (1)

The description of ReCap as 'inspired by Recurrent Neural Network' is vague; clarify the specific architectural connection or differences from standard RNN-based captioning.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review. Below we provide a point-by-point response to the major comment.

Referee: [Abstract] The assertion that 'Experiments show that U-CESE delivers robust, consistent, and efficient performance in large-scale multimodal event retrieval' lacks any supporting evidence; the manuscript provides no retrieval metrics such as mAP or recall@K, no runtime measurements, no ablation studies, no comparisons to the original CESE modules, and no challenge leaderboard results or dataset details.

Authors: The referee correctly identifies that the manuscript does not contain the supporting experimental evidence for the claim in the abstract. There are no reported metrics, measurements, studies, comparisons, or dataset details. As this is a system description paper for the AI Challenge HCMC 2025, the performance claim was based on internal testing and challenge participation, but we acknowledge it should not be stated without evidence. We will revise the abstract to remove the unsubstantiated claim about experimental performance.

Revision: yes

Circularity Check

0 steps flagged

No circularity: system integration paper with no derivations or fitted quantities

The paper presents U-CESE as an engineering integration of the three prior CESE modules with two new components (DAKE and ReCap) and a Unified Clipping Algorithm, with performance asserted via unspecified experiments. No equations, parameter-fitting steps, uniqueness theorems, or ansatzes appear in the provided text. Claims reduce to component descriptions and empirical assertion rather than any self-referential loop; the derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities are described.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 2 internal anchors

[1] Elasticsearch: The official distributed search & analytics engine. https://www.elastic.co/elasticsearch

[2] Bautista, D., Atienza, R.: Scene text recognition with permuted autoregressive sequence models. In: European conference on computer vision. pp. 178–196. Springer (2022)

[3] Boreczky, J., Rowe, L.: Comparison of video shot boundary detection techniques. Proceedings of SPIE - The International Society for Optical Engineering 2670 (03 1996). https://doi.org/10.1117/12.238675

[4] Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., et al.: Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 24185–24198 (2024)

[5] Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., Marris, L., Petulla, S., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities (2025), https://arxiv.org/abs/2507.06261

[6] Dinh-Thi, X.B., Dao, A., Trinh, Q.B., Dinh, N.T., Vu, H.N.: Transforming video search: leveraging multimodal techniques and llms for optimal retrieval. In: International Symposium on Information and Communication Technology. pp. 121–131. Springer (2024)

[7] Do, T.L., Huynh, V.T., Nguyen, H.D., Nguyen-Quang, T., Tran, M.K., Nguyen, T.T., Ninh, T.V., Le, T.K., Ngo, T.D., Dang-Nguyen, D.T., Ngo, T.T., Schöffmann, K., Gurrin, C., Tran, M.T.: Toward abstraction-level event retrieval in large video collections: Leveraging human knowledge and LLM-based reasoning in the Ho Chi Minh City AI Challenge 2025. In: Proceedings of the 14th Int... CCIS, Springer, Nha Trang, Vietnam (2025)

[8] Do, T.L., Nguyen, H.D., Nguyen, Q.T., Tran, M.K., Huynh, V.T., Gurrin, C., Ninh, T.V., Le, T.K., Ngo, T.D., Ngo, T.T., et al.: News event retrieval from large video collection in ho chi minh city ai challenge 2023. In: Proceedings of the 12th International Symposium on Information and Communication Technology. pp. 1011–1017 (2023)

[9] Doan, K.T., Huynh, B.G., Hoang, D.T., Pham, T.D., Pham, N.H., Nguyen, Q., Vo, B.Q., Hoang, S.N.: Vintern-1b: an efficient multimodal large language model for vietnamese. arXiv preprint arXiv:2408.12480 (2024)

[10] Douze, M., Guzhva, A., Deng, C., Johnson, J., Szilvasy, G., Mazaré, P.E., Lomeli, M., Hosseini, L., Jégou, H.: The faiss library. IEEE Transactions on Big Data (2025)

[11] Elman, J.L.: Finding structure in time. Cognitive Science 14(2), 179–211 (1990)

[12] Gia, B.T., Khanh, T.B.C., Thanh, T.L.T., Tran, K., Trong, H.H., Doan, T.T., Le, K., Do, T., Le, D.D., Ngo, T.D.: Addressing ambiguous queries in video retrieval with advanced temporal search. In: International Symposium on Information and Communication Technology. pp. 167–180. Springer (2024)

[13] Google: Faster r-cnn inception resnet v2 model. https://www.kaggle.com/models/google/faster-rcnn-inception-resnet-v2/tensorFlow1/faster-rcnn-openimages-v4-inception-resnet-v2 (2024)

[14] Gurrin, C., Jónsson, B.Þ., Nguyen, D.T.D., Healy, G., Lokoc, J., Zhou, L., Rossetto, L., Tran, M.T., Hürst, W., Bailer, W., et al.: Introduction to the sixth annual lifelog search challenge, lsc’23. In: Proceedings of the 2023 ACM International Conference on Multimedia Retrieval. pp. 678–679 (2023)

[15] Le, D.N., Nguyen, H.P., Lam, T.D., Dang, M.N., Le, M.H.: Cese: A clip-based event search engine for ai challenge hcmc 2024. In: Buntine, W., Fjeld, M., Tran, T., Tran, M.T., Huynh Thi Thanh, B., Miyoshi, T. (eds.) Information and Communication Technology. pp. 254–267. Springer Nature Singapore, Singapore (2025)

[16] Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In: International conference on machine learning. pp. 19730–19742. PMLR (2023)

[17] Lokoč, J., Veselý, P., Mejzlík, F., Kovalčík, G., Souček, T., Rossetto, L., Schoeffmann, K., Bailer, W., Gurrin, C., Sauter, L., et al.: Is the reign of interactive search eternal? findings from the video browser showdown 2020. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 17(3), 1–26 (2021)

[18] Phat, T.A., Minh, T.T., Hoan, D.N.T., Nguyen, K.D.: Revimm: Enhanced video retrieval with reweighting mechanism for multi-modal queries. In: International Symposium on Information and Communication Technology. pp. 18–28. Springer (2024)

[19] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)

[20] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)

[21] Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C., Sutskever, I.: Robust speech recognition via large-scale weak supervision (2022), https://arxiv.org/abs/2212.04356

[22] Souček, T., Lokoč, J.: Transnet v2: An effective deep network architecture for fast shot transition detection. arXiv preprint arXiv:2008.04838 (2020)

[23] Vasu, P.K.A., Pouransari, H., Faghri, F., Vemulapalli, R., Tuzel, O.: Mobileclip: Fast image-text models through multi-modal reinforced training. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2024)

[24] Wang, J., Yi, X., Guo, R., Jin, H., Xu, P., Li, S., Wang, X., Guo, X., Li, C., Xu, X., et al.: Milvus: A purpose-built vector data management system. In: Proceedings of the 2021 International Conference on Management of Data. pp. 2614–2627 (2021)

[25] Wang, W., Bao, H., Dong, L., Bjorck, J., Peng, Z., Liu, Q., Aggarwal, K., Mohammed, O.K., Singhal, S., Som, S., et al.: Image as a foreign language: Beit pretraining for vision and vision-language tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 19175–19186 (2023)

[26] Ye, M., Zhang, J., Zhao, S., Liu, J., Liu, T., Du, B., Tao, D.: Deepsolo: Let transformer decoder with explicit points solo for text spotting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 19348–19357 (2023)

[27] Zhu, C., Jia, Q., Chen, W., Guo, Y., Liu, Y.: Deep learning for video-text retrieval: a review. International Journal of Multimedia Information Retrieval 12(1), 3 (2023)

[28] Zhu, W., Huang, Y., Xie, X., Liu, W., Deng, J., Zhang, D., Wang, Z., Liu, J.: Autoshot: A short video dataset and state-of-the-art shot boundary detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (2023)
