U-CESE: Unified Clip-based Event Search Engine for AI Challenge HCMC 2025
Pith reviewed 2026-05-25 04:29 UTC · model grok-4.3
The pith
U-CESE merges three prior modules into one unified clip-based engine for consistent multimodal video event retrieval.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
U-CESE integrates its three modules into a single cohesive framework with the Unified Clipping Algorithm at its core, proposes DAKE as a lightweight keyframe extraction method based on JPEG file size variations, and introduces ReCap as a recurrent-inspired captioning framework, resulting in robust, consistent, and efficient performance for large-scale multimodal event retrieval.
What carries the argument
The Unified Clipping Algorithm, which merges separate clipping algorithms into one efficient pipeline to ensure consistent processing across query types.
If this is right
- Enables consistent retrieval across diverse query types in large video datasets
- Provides an efficient, training-free way to extract keyframes using file size changes
- Generates temporally consistent and detailed captions for events
- Supports scalable performance in multimodal event search challenges
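To make the file-size idea behind DAKE concrete, here is a minimal sketch. The abstract does not describe DAKE's actual selection rule, so everything specific below is an assumption for illustration: the function name `select_keyframes`, the relative-change rule, and the `rel_threshold` value. It assumes each frame has already been encoded as JPEG and only its byte size is known.

```python
from typing import List


def select_keyframes(jpeg_sizes: List[int], rel_threshold: float = 0.2) -> List[int]:
    """Return indices of frames whose JPEG size differs from the most
    recently selected keyframe by more than rel_threshold (relative change).

    Hypothetical rule for illustration; not the paper's DAKE algorithm.
    """
    if not jpeg_sizes:
        return []
    keyframes = [0]          # always keep the first frame
    ref = jpeg_sizes[0]      # size of the last selected keyframe
    for i, size in enumerate(jpeg_sizes[1:], start=1):
        change = abs(size - ref) / max(ref, 1)
        if change > rel_threshold:
            keyframes.append(i)
            ref = size       # compare later frames against the new keyframe
    return keyframes


sizes = [1000, 1010, 990, 1500, 1490, 1520, 800]
print(select_keyframes(sizes))  # [0, 3, 6]
```

Comparing against the last selected keyframe rather than the immediately preceding frame is one possible design choice; it avoids missing slow drifts in content, at the cost of sensitivity to the threshold.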
Where Pith is reading between the lines
- The unification might reduce maintenance overhead when updating individual components in future systems.
- DAKE's reliance on JPEG sizes could be tested on compressed video formats other than those in the challenge.
- ReCap's RNN inspiration suggests potential for integration with modern sequence models for even better temporal consistency.
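One way to read "recurrent-inspired" captioning is that each caption is conditioned on the previous one, with the previous caption acting like a hidden state. The sketch below illustrates only that reading. The names `recurrent_captions` and `toy_captioner` are hypothetical, and the toy captioner stands in for whatever vision-language model the authors use, which this page does not specify.

```python
from typing import Callable, List


def recurrent_captions(
    frames: List[str],
    caption_fn: Callable[[str, str], str],
) -> List[str]:
    """Caption frames in order, feeding each caption into the next call.

    Assumed reading of 'recurrent-inspired'; not the paper's ReCap design.
    """
    captions: List[str] = []
    previous = ""  # plays the role of an RNN hidden state
    for frame in frames:
        caption = caption_fn(frame, previous)
        captions.append(caption)
        previous = caption
    return captions


def toy_captioner(frame: str, previous_caption: str) -> str:
    """Placeholder for a real captioning model: echoes the frame description
    and appends the core of the previous caption as context."""
    if previous_caption:
        context = previous_caption.split(" (")[0]
        return f"{frame} (after: {context})"
    return frame


result = recurrent_captions(
    ["a man enters", "he sits down", "he opens a laptop"], toy_captioner
)
print(result)
```

In a real system `caption_fn` would take image data rather than a string; strings are used here only to keep the example self-contained.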
Load-bearing premise
Merging the three CESE modules into a single cohesive framework with the Unified Clipping Algorithm will produce consistent processing and retrieval across query types without introducing new inconsistencies or performance drops.
What would settle it
An experiment that applies the unified U-CESE and the original separate CESE modules to the same set of queries and measures whether retrieval accuracy or consistency decreases in the unified version.
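Such a comparison could be scored with a standard metric like recall@K, computed per query and averaged. The sketch below assumes each system returns a ranked list of clip IDs per query. The query IDs, clip IDs, and rankings are made-up toy data, not results from the paper.

```python
from typing import Dict, List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant items that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(ranked[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(
    runs: Dict[str, List[str]], truth: Dict[str, Set[str]], k: int
) -> float:
    """Average recall@k over all queries that have ground truth."""
    return sum(recall_at_k(runs[q], truth[q], k) for q in truth) / len(truth)


# Toy data (hypothetical): ground truth and two systems' rankings.
truth = {"q1": {"c1", "c2"}, "q2": {"c9"}}
unified = {"q1": ["c1", "c5", "c2"], "q2": ["c9", "c3", "c4"]}
separate = {"q1": ["c5", "c1", "c7"], "q2": ["c3", "c4", "c9"]}

print(mean_recall_at_k(unified, truth, k=2))   # 0.75
print(mean_recall_at_k(separate, truth, k=2))  # 0.25
```

Running both systems on the same queries and the same ground truth is what makes the two numbers comparable; a consistency check across query types would repeat this per query category.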
Original abstract
Retrieving events from large-scale video datasets is challenging due to complex temporal, spatial, and multimodal information. This paper presents U-CESE, our solution for the AI Challenge HCMC 2025, a Unified Clip-based Event Search Engine for multimodal event retrieval across diverse video sources. Building on CESE, U-CESE integrates its three modules into a single cohesive framework, ensuring consistent processing and retrieval across query types. A core component is the Unified Clipping Algorithm, which merges separate clipping algorithms into one efficient pipeline. To handle large-scale data, we propose DAKE, a lightweight, training-free keyframe extraction method using JPEG file size variations to identify significant scene changes. Finally, we introduce ReCap, a temporally consistent captioning framework inspired by Recurrent Neural Network, generating detailed and context-aware textual descriptions. Experiments show that U-CESE delivers robust, consistent, and efficient performance in large-scale multimodal event retrieval.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents U-CESE, a Unified Clip-based Event Search Engine for multimodal event retrieval in large-scale video datasets for the AI Challenge HCMC 2025. It builds on CESE by integrating its three modules into a single framework with a Unified Clipping Algorithm, introduces DAKE for lightweight keyframe extraction using JPEG file size variations, and ReCap for temporally consistent captioning inspired by RNNs. The abstract claims that experiments demonstrate robust, consistent, and efficient performance.
Significance. If the empirical results were to hold, the unified framework and proposed components could offer practical advances in efficient processing for large-scale multimodal video retrieval tasks, particularly in competition settings where consistency across query types is valuable. The training-free nature of DAKE is a potential strength for scalability.
major comments (1)
- [Abstract] The assertion that 'Experiments show that U-CESE delivers robust, consistent, and efficient performance in large-scale multimodal event retrieval' lacks any supporting evidence; the manuscript provides no retrieval metrics such as mAP or recall@K, no runtime measurements, no ablation studies, no comparisons to the original CESE modules, and no challenge leaderboard results or dataset details.
minor comments (1)
- The description of ReCap as 'inspired by Recurrent Neural Network' is vague; clarify the specific architectural connection or differences from standard RNN-based captioning.
Simulated Author's Rebuttal
We thank the referee for their review. Below we provide a point-by-point response to the major comment.
Referee: [Abstract] The assertion that 'Experiments show that U-CESE delivers robust, consistent, and efficient performance in large-scale multimodal event retrieval' lacks any supporting evidence; the manuscript provides no retrieval metrics such as mAP or recall@K, no runtime measurements, no ablation studies, no comparisons to the original CESE modules, and no challenge leaderboard results or dataset details.
Authors: The referee correctly identifies that the manuscript does not contain the supporting experimental evidence for the claim in the abstract. There are no reported metrics, measurements, studies, comparisons, or dataset details. As this is a system description paper for the AI Challenge HCMC 2025, the performance claim was based on internal testing and challenge participation, but we acknowledge it should not be stated without evidence. We will revise the abstract to remove the unsubstantiated claim about experimental performance.
Revision: yes
Circularity Check
No circularity: system integration paper with no derivations or fitted quantities
The paper presents U-CESE as an engineering integration of the three prior CESE modules under a Unified Clipping Algorithm, together with the newly proposed DAKE and ReCap components, with performance asserted via unspecified experiments. No equations, parameter-fitting steps, uniqueness theorems, or ansatzes appear in the provided text. The claims reduce to component descriptions and an empirical assertion, so there is no derivation chain that could loop back on itself.