pith. sign in

arxiv: 2605.23274 · v1 · pith:WP7CJY5Gnew · submitted 2026-05-22 · 💻 cs.CV

U-CESE: Unified Clip-based Event Search Engine for AI Challenge HCMC 2025

Pith reviewed 2026-05-25 04:29 UTC · model grok-4.3

classification 💻 cs.CV
keywords event retrievalvideo search enginemultimodal retrievalkeyframe extractionvideo captioningclip-based processingunified framework
0
0 comments X

The pith

U-CESE merges three prior modules into one unified clip-based engine for consistent multimodal video event retrieval.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents U-CESE as a solution for retrieving events from large video collections where temporal, spatial, and multimodal details make the task difficult. It unifies the three modules from an earlier CESE system into a single framework to ensure the same processing steps apply no matter the query type. A new Unified Clipping Algorithm combines previous clipping methods into one pipeline. DAKE extracts keyframes without training by watching JPEG file sizes for scene changes, and ReCap creates captions that stay consistent over time. If successful, this approach would let systems handle big video archives more reliably and efficiently across different kinds of searches.

Core claim

U-CESE integrates its three modules into a single cohesive framework with the Unified Clipping Algorithm at its core, proposes DAKE as a lightweight keyframe extraction method based on JPEG file size variations, and introduces ReCap as a recurrent-inspired captioning framework, resulting in robust, consistent, and efficient performance for large-scale multimodal event retrieval.

What carries the argument

The Unified Clipping Algorithm, which merges separate clipping algorithms into one efficient pipeline to ensure consistent processing across query types.

If this is right

  • Enables consistent retrieval across diverse query types in large video datasets
  • Provides an efficient, training-free way to extract keyframes using file size changes
  • Generates temporally consistent and detailed captions for events
  • Supports scalable performance in multimodal event search challenges

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The unification might reduce maintenance overhead when updating individual components in future systems.
  • DAKE's reliance on JPEG sizes could be tested on compressed video formats other than those in the challenge.
  • ReCap's RNN inspiration suggests potential for integration with modern sequence models for even better temporal consistency.

Load-bearing premise

Merging the three CESE modules into a single cohesive framework with the Unified Clipping Algorithm will produce consistent processing and retrieval across query types without introducing new inconsistencies or performance drops.

What would settle it

An experiment that applies the unified U-CESE and the original separate CESE modules to the same set of queries and measures whether retrieval accuracy or consistency decreases in the unified version.

Figures

Figures reproduced from arXiv: 2605.23274 by Duc-Nhuan Le, Hoang-Phuc Nguyen, Minh-Hoang Le, Minh-Nhut Dang, Thanh-Duy Lam.

Figure 1
Figure 1. Figure 1: Overall system architecture of U-CESE To address these challenges, we present U-CESE, a Unified Clip-based Event Search Engine for AIC [7]. Our system extends the CESE framework [15], which retrieves coherent clips matching event descriptions across multiple queries rather than single frames. However, CESE employs three separate modules, each with distinct user interfaces and re-ranking strategies, leading… view at source ↗
Figure 2
Figure 2. Figure 2: Our data preprocessing pipeline with large motion, texture, or lighting changes exhibit abrupt variations in com￾pressed file size [3], while static scenes produce stable sizes [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Overview of the Recurrence Captioning (ReCap) framework. where each shot St includes keyframes and a subtitle. We utilize AutoShot [28] for this task. At time step t, the system maintains a memory string Mt capturing accumulated contextual information. We employ Gemini [5] as the reasoning and generation engine: (Ct, Mt) = fLVLM(St, Mt−1), where fLVLM denotes the LVLM’s reasoning and generation function. T… view at source ↗
Figure 4
Figure 4. Figure 4: Main screen of U-CESE’s user interface [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: U-CESE’s Interactive Window 5 Ablation Study 5.1 Comparing DAKE with AutoShot We compare keyframes detected by DAKE with those identified by AutoShot on the video K01_V001 from the organizers’ dataset. As shown in [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Frame JPEG sizes across frame indices in a video sample. “True Positive” denotes exact matches, “False Positive” refers to DAKE detec￾tions not found in AutoShot results, and “False Negative” indicates keyframes detected by Au￾toShot but missed by DAKE. 0.00 0.01 0.02 0.03 0.04 Keyframe Ratio 0.0 0.2 0.4 0.6 0.8 1.0 AutoShot Detection Ratio = 0 (Exact Match) = 0.5 × fps = 1.0 × fps = 2.0 × fps [PITH_FULL_… view at source ↗
Figure 8
Figure 8. Figure 8: Effect of Recurrent Memory on captions [PITH_FULL_IMAGE:figures/full_fig_p012_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: User’s workflow for TRAKE queries. The chosen query is the 4-th TRAKE in the final round of AIC2025, which is "In a stir-fried beef cooking video, identify the first moments when each of the following ingredients makes contact with the pan: E1: cooking oil, E2: beef, E3: onion, E4: sesbania flower." By utilizing the Tab shortcut, user can quickly edit the answer. Acknowledgments This research is supported … view at source ↗
read the original abstract

Retrieving events from large-scale video datasets is challenging due to complex temporal, spatial, and multimodal information. This paper presents U-CESE, our solution for the AI Challenge HCMC 2025, a Unified Clip-based Event Search Engine for multimodal event retrieval across diverse video sources. Building on CESE, U-CESE integrates its three modules into a single cohesive framework, ensuring consistent processing and retrieval across query types. A core component is the Unified Clipping Algorithm, which merges separate clipping algorithms into one efficient pipeline. To handle large-scale data, we propose DAKE, a lightweight, training-free keyframe extraction method using JPEG file size variations to identify significant scene changes. Finally, we introduce ReCap, a temporally consistent captioning framework inspired by Recurrent Neural Network, generating detailed and context-aware textual descriptions. Experiments show that U-CESE delivers robust, consistent, and efficient performance in large-scale multimodal event retrieval.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper presents U-CESE, a Unified Clip-based Event Search Engine for multimodal event retrieval in large-scale video datasets for the AI Challenge HCMC 2025. It builds on CESE by integrating its three modules into a single framework with a Unified Clipping Algorithm, introduces DAKE for lightweight keyframe extraction using JPEG file size variations, and ReCap for temporally consistent captioning inspired by RNNs. The abstract claims that experiments demonstrate robust, consistent, and efficient performance.

Significance. If the empirical results were to hold, the unified framework and proposed components could offer practical advances in efficient processing for large-scale multimodal video retrieval tasks, particularly in competition settings where consistency across query types is valuable. The training-free nature of DAKE is a potential strength for scalability.

major comments (1)
  1. [Abstract] The assertion that 'Experiments show that U-CESE delivers robust, consistent, and efficient performance in large-scale multimodal event retrieval' lacks any supporting evidence; the manuscript provides no retrieval metrics such as mAP or recall@K, no runtime measurements, no ablation studies, no comparisons to the original CESE modules, and no challenge leaderboard results or dataset details.
minor comments (1)
  1. The description of ReCap as 'inspired by Recurrent Neural Network' is vague; clarify the specific architectural connection or differences from standard RNN-based captioning.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their review. Below we provide a point-by-point response to the major comment.

read point-by-point responses
  1. Referee: [Abstract] The assertion that 'Experiments show that U-CESE delivers robust, consistent, and efficient performance in large-scale multimodal event retrieval' lacks any supporting evidence; the manuscript provides no retrieval metrics such as mAP or recall@K, no runtime measurements, no ablation studies, no comparisons to the original CESE modules, and no challenge leaderboard results or dataset details.

    Authors: The referee correctly identifies that the manuscript does not contain the supporting experimental evidence for the claim in the abstract. There are no reported metrics, measurements, studies, comparisons, or dataset details. As this is a system description paper for the AI Challenge HCMC 2025, the performance claim was based on internal testing and challenge participation, but we acknowledge it should not be stated without evidence. We will revise the abstract to remove the unsubstantiated claim about experimental performance. revision: yes

Circularity Check

0 steps flagged

No circularity: system integration paper with no derivations or fitted quantities

full rationale

The paper presents U-CESE as an engineering integration of prior modules (CESE, DAKE, ReCap) plus a Unified Clipping Algorithm, with performance asserted via unspecified experiments. No equations, parameter-fitting steps, uniqueness theorems, or ansatzes appear in the provided text. Claims reduce to component descriptions and empirical assertion rather than any self-referential loop; the derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities are described.

pith-pipeline@v0.9.0 · 5711 in / 995 out tokens · 20908 ms · 2026-05-25T04:29:48.514259+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 2 internal anchors

  1. [1]

    elastic.co/elasticsearch

    Elasticsearch: The official distributed search & analytics engine.https://www. elastic.co/elasticsearch

  2. [2]

    In: European conference on computer vision

    Bautista, D., Atienza, R.: Scene text recognition with permuted autoregressive se- quence models. In: European conference on computer vision. pp. 178–196. Springer (2022)

  3. [3]

    Proceedings of SPIE - The International Society for Optical Engineering2670(03 1996).https://doi.org/10.1117/12.238675

    Boreczky, J., Rowe, L.: Comparison of video shot boundary detection techniques. Proceedings of SPIE - The International Society for Optical Engineering2670(03 1996).https://doi.org/10.1117/12.238675

  4. [4]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., et al.: Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 24185–24198 (2024)

  5. [5]

    Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., Marris, L., Petulla, S., et al.: Gemini 2.5:Pushingthefrontierwithadvancedreasoning,multimodality,longcontext,and next generation agentic capabilities (2025),https://arxiv.org/abs/2507.06261

  6. [6]

    In: Inter- national Symposium on Information and Communication Technology

    Dinh-Thi, X.B., Dao, A., Trinh, Q.B., Dinh, N.T., Vu, H.N.: Transforming video search: leveraging multimodal techniques and llms for optimal retrieval. In: Inter- national Symposium on Information and Communication Technology. pp. 121–131. Springer (2024) U-CESE: Unified Clip-based Event Search Engine 15

  7. [7]

    CCIS, Springer, Nha Trang, Vietnam (2025)

    Do, T.L., Huynh, V.T., Nguyen, H.D., Nguyen-Quang, T., Tran, M.K., Nguyen, T.T.,Ninh,T.V.,Le,T.K.,Ngo,T.D.,Dang-Nguyen,D.T.,Ngo,T.T.,Schöffmann, K., Gurrin, C., Tran, M.T.: Toward abstraction-level event retrieval in large video collections: Leveraging human knowledge and LLM-based reasoning in the Ho Chi MinhCityAIChallenge2025.In:Proceedingsofthe14thInt...

  8. [8]

    In: Proceedings of the 12th International Symposium on Information and Communication Technology

    Do, T.L., Nguyen, H.D., Nguyen, Q.T., Tran, M.K., Huynh, V.T., Gurrin, C., Ninh, T.V., Le, T.K., Ngo, T.D., Ngo, T.T., et al.: News event retrieval from large video collection in ho chi minh city ai challenge 2023. In: Proceedings of the 12th International Symposium on Information and Communication Technology. pp. 1011–1017 (2023)

  9. [9]

    arXiv preprint arXiv:2408.12480 (2024)

    Doan, K.T., Huynh, B.G., Hoang, D.T., Pham, T.D., Pham, N.H., Nguyen, Q., Vo, B.Q., Hoang, S.N.: Vintern-1b: an efficient multimodal large language model for vietnamese. arXiv preprint arXiv:2408.12480 (2024)

  10. [10]

    IEEE Transactions on Big Data (2025)

    Douze, M., Guzhva, A., Deng, C., Johnson, J., Szilvasy, G., Mazaré, P.E., Lomeli, M., Hosseini, L., Jégou, H.: The faiss library. IEEE Transactions on Big Data (2025)

  11. [11]

    Cognitive Science14(2), 179–211 (1990)

    Elman, J.L.: Finding structure in time. Cognitive Science14(2), 179–211 (1990)

  12. [12]

    In: International Symposium on Information and Communication Technology

    Gia, B.T., Khanh, T.B.C., Thanh, T.L.T., Tran, K., Trong, H.H., Doan, T.T., Le, K., Do, T., Le, D.D., Ngo, T.D.: Addressing ambiguous queries in video retrieval with advanced temporal search. In: International Symposium on Information and Communication Technology. pp. 167–180. Springer (2024)

  13. [13]

    com/models/google/faster-rcnn-inception-resnet-v2/tensorFlow1/ faster-rcnn-openimages-v4-inception-resnet-v2(2024)

    Google: Faster r-cnn inception resnet v2 model.https://www.kaggle. com/models/google/faster-rcnn-inception-resnet-v2/tensorFlow1/ faster-rcnn-openimages-v4-inception-resnet-v2(2024)

  14. [14]

    In: Proceedings of the 2023 ACM International Conference on Multimedia Retrieval

    Gurrin,C.,Jónsson,B.Þ.,Nguyen,D.T.D.,Healy,G.,Lokoc,J.,Zhou,L.,Rossetto, L., Tran, M.T., Hürst, W., Bailer, W., et al.: Introduction to the sixth annual lifelog search challenge, lsc’23. In: Proceedings of the 2023 ACM International Conference on Multimedia Retrieval. pp. 678–679 (2023)

  15. [15]

    (eds.) Information and Communication Technology

    Le,D.N.,Nguyen,H.P.,Lam,T.D.,Dang,M.N.,Le,M.H.:Cese:Aclip-basedevent searchengineforaichallengehcmc2024.In:Buntine,W.,Fjeld,M.,Tran,T.,Tran, M.T., Huynh Thi Thanh, B., Miyoshi, T. (eds.) Information and Communication Technology. pp. 254–267. Springer Nature Singapore, Singapore (2025)

  16. [16]

    In: International conference on machine learning

    Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre- training with frozen image encoders and large language models. In: International conference on machine learning. pp. 19730–19742. PMLR (2023)

  17. [17]

    ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM)17(3), 1–26 (2021)

    Lokoč, J., Vesel` y, P., Mejzlík, F., Kovalčík, G., Souček, T., Rossetto, L., Schoeff- mann, K., Bailer, W., Gurrin, C., Sauter, L., et al.: Is the reign of interactive search eternal? findings from the video browser showdown 2020. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM)17(3), 1–26 (2021)

  18. [18]

    In: International Symposium on Information and Communication Technology

    Phat, T.A., Minh, T.T., Hoan, D.N.T., Nguyen, K.D.: Revimm: Enhanced video retrieval with reweighting mechanism for multi-modal queries. In: International Symposium on Information and Communication Technology. pp. 18–28. Springer (2024)

  19. [19]

    In: International conference on machine learning

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PmLR (2021) 16 Nhuan et al

  20. [20]

    In: International conference on machine learning

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)

  21. [21]

    Radford,A.,Kim,J.W.,Xu,T.,Brockman,G.,McLeavey,C.,Sutskever,I.:Robust speech recognition via large-scale weak supervision (2022),https://arxiv.org/ abs/2212.04356

  22. [22]

    arXiv preprint arXiv:2008.04838 (2020)

    Souček, T., Lokoč, J.: Transnet v2: An effective deep network architecture for fast shot transition detection. arXiv preprint arXiv:2008.04838 (2020)

  23. [23]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2024)

    Vasu, P.K.A., Pouransari, H., Faghri, F., Vemulapalli, R., Tuzel, O.: Mobileclip: Fast image-text models through multi-modal reinforced training. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2024)

  24. [24]

    In: Proceedings of the 2021 International Conference on Management of Data

    Wang, J., Yi, X., Guo, R., Jin, H., Xu, P., Li, S., Wang, X., Guo, X., Li, C., Xu, X., et al.: Milvus: A purpose-built vector data management system. In: Proceedings of the 2021 International Conference on Management of Data. pp. 2614–2627 (2021)

  25. [25]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Wang, W., Bao, H., Dong, L., Bjorck, J., Peng, Z., Liu, Q., Aggarwal, K., Mo- hammed, O.K., Singhal, S., Som, S., et al.: Image as a foreign language: Beit pre- training for vision and vision-language tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 19175–19186 (2023)

  26. [26]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Ye, M., Zhang, J., Zhao, S., Liu, J., Liu, T., Du, B., Tao, D.: Deepsolo: Let trans- former decoder with explicit points solo for text spotting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 19348– 19357 (2023)

  27. [27]

    International Journal of Multimedia Information Retrieval12(1), 3 (2023)

    Zhu, C., Jia, Q., Chen, W., Guo, Y., Liu, Y.: Deep learning for video-text re- trieval: a review. International Journal of Multimedia Information Retrieval12(1), 3 (2023)

  28. [28]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (2023)

    Zhu, W., Huang, Y., Xie, X., Liu, W., Deng, J., Zhang, D., Wang, Z., Liu, J.: Autoshot: A short video dataset and state-of-the-art shot boundary detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (2023)