MERVIN: A Unified Framework for Multimodal Event Retrieval in Vietnamese News Videos
Pith reviewed 2026-05-19 21:47 UTC · model grok-4.3
The pith
A framework unifies visual frames, enhanced transcripts, and summaries for retrieving events in Vietnamese news videos.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors claim that a unified framework retrieves events from Vietnamese news videos by integrating keyframes for visuals, transcripts enhanced to reduce noise from accents and recognition errors, and video summaries. Visual features and textual embeddings are produced separately and used for similarity-based retrieval, with an interactive interface supporting iterative query refinement across modalities. The authors report that this design scored 79 out of 88 points in the AI Challenge HCMC 2025 qualification phase and retrieved all results for every query in the final round.
What carries the argument
The central mechanism is the fusion of visual features extracted from keyframes and textual features from enhanced transcripts and summaries, indexed for similarity search together with a cross-modality query refinement interface.
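A minimal sketch of this mechanism, under stated assumptions: the paper does not say how the two modalities are combined, so the weighted late fusion below is an illustrative choice, not the authors' method. The brute-force `search` function stands in for a vector database, and the segment IDs, two-dimensional vectors, and the `w_visual` weight are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(index, query_vec, top_k):
    """Brute-force similarity search; a stand-in for a vector database."""
    scored = [(seg_id, cosine(vec, query_vec)) for seg_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

def fuse(visual_hits, text_hits, w_visual=0.5):
    """Weighted late fusion (assumed); a segment absent from one list adds 0."""
    fused = {}
    for seg_id, score in visual_hits:
        fused[seg_id] = fused.get(seg_id, 0.0) + w_visual * score
    for seg_id, score in text_hits:
        fused[seg_id] = fused.get(seg_id, 0.0) + (1.0 - w_visual) * score
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)

# Hypothetical toy embeddings; real encoders produce far more dimensions.
visual_index = {"seg_a": [1.0, 0.0], "seg_b": [0.6, 0.8], "seg_c": [0.0, 1.0]}
text_index = {"seg_a": [0.0, 1.0], "seg_b": [0.8, 0.6], "seg_c": [1.0, 0.0]}

# The two modalities are searched separately, then combined.
visual_hits = search(visual_index, [1.0, 0.0], top_k=3)
text_hits = search(text_index, [1.0, 0.0], top_k=3)
ranking = fuse(visual_hits, text_hits)
```

In this toy case `seg_b` ranks first after fusion even though it is not the top hit in either modality, which is the kind of effect a fused ranking is meant to capture.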
If this is right
- The method manages noise from accents, background sounds, and recognition errors in Vietnamese audio transcripts.
- Separate embeddings support efficient similarity searches across large collections of video content.
- Iterative refinement across modalities improves alignment between user intent and retrieved video segments.
- The approach shows robustness for real-world news video event search as measured in competition settings.
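The iterative refinement described in the list above can be pictured as successive filtering of one candidate set by scores from another modality. This is only a sketch of the idea: the `refine` helper, the threshold, the segment IDs, and the scores are all hypothetical, and the paper's interface may work differently.

```python
def refine(candidates, scores, min_score):
    """Keep candidates whose score in another modality reaches min_score.

    Candidates with no score in that modality are treated as scoring 0.
    The original candidate order is preserved.
    """
    return [seg for seg in candidates if scores.get(seg, 0.0) >= min_score]

# Round 1: segments returned by a visual query, best first (hypothetical).
visual_candidates = ["seg_12", "seg_07", "seg_31", "seg_02"]

# Round 2: transcript similarity to a follow-up text query (hypothetical).
transcript_scores = {"seg_12": 0.21, "seg_07": 0.74, "seg_31": 0.68}

# Round 3: summary similarity to a second follow-up query (hypothetical).
summary_scores = {"seg_07": 0.80, "seg_31": 0.35}

after_transcript = refine(visual_candidates, transcript_scores, min_score=0.5)
after_summary = refine(after_transcript, summary_scores, min_score=0.5)
```

Each round narrows the candidate list, so a user can start from a broad visual match and converge on the segment whose transcript and summary also fit the query.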
Where Pith is reading between the lines
- The same multimodal structure could be tested on news videos in other languages that share transcription difficulties.
- It might enable automated systems for ongoing news monitoring and archiving without heavy manual labeling.
- Adding temporal alignment or direct audio features could further localize events within longer videos.
Load-bearing premise
The premise is that combining keyframes, enhanced transcripts, and video summaries via separate visual and textual embeddings produces meaningfully better semantic retrieval than simpler single-modality baselines for Vietnamese news content.
What would settle it
A direct comparison on the same Vietnamese news video queries where a single-modality system using only visuals or only text retrieves fewer correct events than the multimodal version.
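Such a comparison could be scored with a small harness like the one below. It is a sketch under assumptions: recall@k is chosen here for illustration and is not necessarily the challenge metric, and the query IDs, segment IDs, and rankings are made-up toy data, not results from the paper.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of relevant segments that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)

def mean_recall(runs, ground_truth, k):
    """Average recall@k over all queries for one system."""
    total = sum(recall_at_k(runs[q], ground_truth[q], k) for q in ground_truth)
    return total / len(ground_truth)

# Toy ground truth: the relevant segments for each query (hypothetical).
ground_truth = {"q1": {"s1", "s2"}, "q2": {"s5"}}

# Toy ranked outputs of three system variants on the same queries.
systems = {
    "visual_only": {"q1": ["s1", "s9", "s2"], "q2": ["s4", "s6", "s5"]},
    "text_only": {"q1": ["s8", "s2", "s7"], "q2": ["s5", "s3", "s4"]},
    "multimodal": {"q1": ["s1", "s2", "s9"], "q2": ["s5", "s4", "s6"]},
}

report = {name: mean_recall(runs, ground_truth, k=2)
          for name, runs in systems.items()}
```

Running all variants against the same `ground_truth` and the same `k` is what makes the comparison direct; only the retrieval system changes between rows of `report`.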
Original abstract
The growth of online video platforms drives the need for effective, semantically grounded event retrieval. We present MERVIN, a unified multimodal framework for Vietnamese news videos that integrates keyframes, transcripts, and video summaries. Transcript quality is enhanced via Gemini 1.5 Flash, reducing noise from accents, background sounds, and recognition errors. Visual features are extracted with Perception Encoder, while a Vietnamese language model produces textual embeddings; both are indexed in Milvus for efficient similarity-based retrieval. In addition, a React-based interface enables iterative query refinement across modalities, improving semantic alignment. Experimental results on Vietnamese news videos demonstrate the effectiveness of the proposed system, with MERVIN achieving 79 out of 88 points in the AI Challenge HCMC 2025 qualification phase and successfully retrieving all results for every query in the final round.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents MERVIN, a unified multimodal framework for event retrieval in Vietnamese news videos. It integrates keyframe visual features from the Perception Encoder, textual embeddings produced by a Vietnamese language model on Gemini 1.5 Flash-enhanced transcripts and video summaries, and indexes both modalities separately in Milvus for similarity-based retrieval. A React-based interface supports iterative cross-modal query refinement. Effectiveness is claimed via a score of 79/88 in the AI Challenge HCMC 2025 qualification phase and perfect retrieval of all results for every query in the final round.
Significance. The work targets a practical gap in semantic retrieval for Vietnamese-language news video content, where accent and recognition noise are common. The engineering integration of commercial transcription enhancement with open embedding models and a vector database is reproducible in principle and could inform deployed systems. However, because no ablation or baseline results are supplied, the significance of the proposed multimodal unification itself cannot yet be assessed.
major comments (1)
- [Experimental Results] The central claim that the unified multimodal framework is effective rests solely on the reported challenge scores (79/88 qualification, perfect final-round retrieval). No unimodal baselines (visual-only or text-only), no ablation on the fusion or ranking strategy, and no dataset or query statistics are provided, so it is impossible to determine whether the integration is load-bearing or whether success arises from strong individual components, query tuning, or challenge-specific data characteristics.
minor comments (2)
- [Abstract and §3] The description of retrieval states that visual and textual embeddings are produced and indexed separately, yet the framework is called 'unified'; a short paragraph clarifying whether late fusion, re-ranking, or independent retrieval with union is used would remove ambiguity.
- [Experimental Results] The manuscript would benefit from a table listing the exact number of videos, keyframes, queries, and evaluation metric definitions used in the AI Challenge HCMC 2025 evaluation.
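To make the distinction raised in the first minor comment concrete, the sketch below contrasts two of the named options on the same pair of ranked lists. Reciprocal rank fusion is used here as one common form of late fusion; it is an assumption for illustration, not something the manuscript says it uses. The segment IDs are hypothetical, and `k=60` is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Late fusion: score each item by the sum of 1 / (k + rank) over lists."""
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda item: scores[item], reverse=True)

def union_in_order(rankings):
    """Independent retrieval with union: concatenate lists, dropping repeats."""
    seen, merged = set(), []
    for ranking in rankings:
        for item in ranking:
            if item not in seen:
                seen.add(item)
                merged.append(item)
    return merged

# Hypothetical ranked results from the two separately indexed modalities.
visual = ["s3", "s1", "s4"]
text = ["s1", "s2", "s3"]

fused = reciprocal_rank_fusion([visual, text])
merged = union_in_order([visual, text])
```

The two strategies order the same four segments differently: fusion promotes `s1`, which both modalities rank highly, while the plain union simply keeps the visual list first. That difference is why the manuscript's choice matters for interpreting the reported scores.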
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting the need for stronger experimental grounding. We address the concern about the evaluation of the multimodal framework below and outline revisions to the manuscript.
Point-by-point responses
- Referee: [Experimental Results] The central claim that the unified multimodal framework is effective rests solely on the reported challenge scores (79/88 qualification, perfect final-round retrieval). No unimodal baselines (visual-only or text-only), no ablation on the fusion or ranking strategy, and no dataset or query statistics are provided, so it is impossible to determine whether the integration is load-bearing or whether success arises from strong individual components, query tuning, or challenge-specific data characteristics.
  Authors: We agree that the current results section relies on end-to-end challenge performance and that this limits assessment of the multimodal unification's specific contribution. The AI Challenge HCMC 2025 evaluates complete systems on real Vietnamese news videos, where our framework's integration of Perception Encoder visuals, Gemini-enhanced transcripts, and Vietnamese LM embeddings enabled perfect retrieval in the final round. We will revise the manuscript to add dataset and query statistics (e.g., number of videos, average transcript length, query types) and a discussion of Vietnamese-specific issues such as accent-induced ASR noise that motivate cross-modal retrieval. We will also include a qualitative analysis of modality contributions. However, exhaustive quantitative ablations were not performed during challenge development, because time and resources were focused on the integrated system.
  Revision: partial
- Quantitative ablation studies and unimodal baselines were not conducted as part of the challenge-oriented development process.
Circularity Check
No circularity: engineering system with empirical challenge scores only
The paper presents MERVIN as a multimodal retrieval framework combining keyframes, Gemini-enhanced transcripts, and summaries via separate embeddings indexed in Milvus, with performance reported solely as 79/88 in qualification and perfect final-round retrieval on AI Challenge HCMC 2025. No equations, derivations, fitted parameters, or predictions appear in the provided text. No self-citations, uniqueness theorems, or ansatzes are invoked. The central claim reduces to an empirical system description and external challenge outcome rather than any internal reduction of outputs to inputs by construction. This matches the default expectation of a non-circular engineering report.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Passage: "integrates keyframes, transcripts, and video summaries... Perception Encoder... Vietnamese language model... Milvus... 79 out of 88 points in AI Challenge HCMC 2025"
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Passage: "Transcript quality is enhanced via Gemini 1.5 Flash... visual and textual embeddings"
What the tags mean
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.