MERVIN: A Unified Framework for Multimodal Event Retrieval in Vietnamese News Videos

Anh-Duy Le; Anh-Tai Pham-Nguyen; Trung-Hieu Truong-Le; Tung-Duong Le-Duc

arxiv: 2605.16120 · v1 · pith:345QV4PTnew · submitted 2026-05-15 · 💻 cs.IR


Pith reviewed 2026-05-19 21:47 UTC · model grok-4.3

classification 💻 cs.IR

keywords multimodal retrieval · event retrieval · Vietnamese news videos · video search · transcript enhancement · semantic similarity · keyframes


The pith

A framework unifies visual frames, enhanced transcripts, and summaries for retrieving events in Vietnamese news videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that combining visual keyframes, quality-improved transcripts, and video summaries in one system enables effective semantic event retrieval from Vietnamese news videos. A sympathetic reader would care because online video volume is growing rapidly and existing tools often fail on accented speech, background noise, and recognition errors common in non-English content. The method builds separate visual and textual representations for similarity comparisons and includes an interface that lets users refine queries step by step across modalities. In the AI Challenge HCMC 2025, the system scored 79 of 88 points in the qualification phase and retrieved all results for every query in the final round. The paper presents this as evidence that multimodal fusion overcomes the limits of single-modality approaches, although it reports no single-modality comparison.

Core claim

The authors claim that a unified framework retrieves events from Vietnamese news videos by integrating keyframes for visuals, transcripts enhanced to reduce noise from accents and recognition errors, and video summaries. Visual features and textual embeddings are produced separately and used for similarity-based retrieval, with an interactive interface supporting iterative query refinement across modalities. This design scored 79 of 88 points in the qualification phase of the AI Challenge HCMC 2025 and retrieved all results for every query in the final round.

What carries the argument

The central mechanism is the fusion of visual features extracted from keyframes and textual features from enhanced transcripts and summaries, indexed for similarity search together with a cross-modality query refinement interface.
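To make the mechanism concrete, the following is a minimal sketch of similarity search over two separately built indexes, one visual and one textual. It is an illustration only: the plain-Python dictionaries stand in for a vector database such as Milvus, the 3-dimensional vectors are toy values rather than real Perception Encoder or language-model embeddings, and the names `cosine`, `top_k`, `visual_index`, and `text_index` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(index, query_vec, k=3):
    """Return the k (item_id, score) pairs most similar to query_vec."""
    scored = [(item_id, cosine(vec, query_vec)) for item_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy embeddings (assumed values): one index per modality, kept separate.
visual_index = {
    "video1/frame_010": [0.9, 0.1, 0.0],
    "video1/frame_042": [0.1, 0.9, 0.1],
    "video2/frame_003": [0.0, 0.2, 0.9],
}
text_index = {
    "video1/segment_01": [0.8, 0.2, 0.1],
    "video2/segment_01": [0.1, 0.1, 0.9],
}

# Each modality is queried independently with its own query embedding.
visual_hits = top_k(visual_index, [1.0, 0.0, 0.0], k=2)
text_hits = top_k(text_index, [0.0, 0.0, 1.0], k=1)
print(visual_hits)
print(text_hits)
```

In a deployed system the brute-force loop in `top_k` would be replaced by the vector database's approximate nearest-neighbor search; the sketch only shows that the two modalities can be searched independently before any combination step.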

If this is right

The method manages noise from accents, background sounds, and recognition errors in Vietnamese audio transcripts.
Separate embeddings support efficient similarity searches across large collections of video content.
Iterative refinement across modalities improves alignment between user intent and retrieved video segments.
The approach shows robustness for real-world news video event search as measured in competition settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same multimodal structure could be tested on news videos in other languages that share transcription difficulties.
It might enable automated systems for ongoing news monitoring and archiving without heavy manual labeling.
Adding temporal alignment or direct audio features could further localize events within longer videos.

Load-bearing premise

That combining keyframes, enhanced transcripts, and video summaries via separate visual and textual embeddings will produce meaningfully better semantic retrieval than simpler single-modality baselines for Vietnamese news content.

What would settle it

A direct comparison on the same Vietnamese news video queries where a single-modality system using only visuals or only text retrieves fewer correct events than the multimodal version.
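Such a comparison could be scored along the following lines. This is a sketch under stated assumptions: recall@k is chosen here for illustration and is not necessarily the challenge's metric, the function names are hypothetical, and the rankings and ground truth are made-up toy data, not results from the paper.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant items that appear in the top-k retrieved list."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def mean_recall(run, ground_truth, k):
    """Average recall@k of one system over all queries in ground_truth."""
    total = sum(recall_at_k(run[q], ground_truth[q], k) for q in ground_truth)
    return total / len(ground_truth)

# Toy ground truth and toy system outputs (illustrative values only).
ground_truth = {"q1": ["v1", "v2"], "q2": ["v3"]}
runs = {
    "visual_only": {"q1": ["v1", "v9", "v2"], "q2": ["v8", "v7", "v3"]},
    "text_only": {"q1": ["v9", "v2", "v8"], "q2": ["v3", "v1", "v2"]},
    "multimodal": {"q1": ["v1", "v2", "v9"], "q2": ["v3", "v8", "v7"]},
}

# Same queries, same k, one score per system.
scores = {name: mean_recall(run, ground_truth, k=2) for name, run in runs.items()}
print(scores)
```

The point of the design is that all three systems are evaluated on identical queries and an identical cutoff, so any difference in score can be attributed to the modalities used rather than to the evaluation setup.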

Figures

Figures reproduced from arXiv: 2605.16120 by Anh-Duy Le, Anh-Tai Pham-Nguyen, Trung-Hieu Truong-Le, Tung-Duong Le-Duc.

**Figure 2.** Demonstration of our user interface: (A) searching with visual and tran…

**Figure 3.** Demonstration of Submission and Verification Page.

**Figure 4.** Overview of the query pipeline of MERVIN.

**Figure 5.** Comparison of search results for tkis-query-10 and trake-01. To illustrate temporal search, we use the trake-01 query. The task is as follows. At a festival with many people in costume, identify the first moment in the video when the following costumed characters appear prominently in the frame: E1: The Thing from Fantastic Four (character with rock-like skin); E2: A character with deer antlers and pointed …
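The temporal task described in the Figure 5 caption, finding the first moment at which each listed character appears, can be illustrated with a small sketch. This is an assumption about how such a query might be answered from per-keyframe similarity scores, not the authors' implementation; the function name `first_moments`, the threshold of 0.5, and the timestamps and scores are all hypothetical.

```python
def first_moments(frame_scores, threshold=0.5):
    """For each event, return the earliest timestamp whose score passes the threshold.

    frame_scores maps an event label to a list of (timestamp_seconds, score) pairs.
    Events with no passing keyframe map to None.
    """
    result = {}
    for event, scored_frames in frame_scores.items():
        passing = [t for t, s in scored_frames if s >= threshold]
        result[event] = min(passing) if passing else None
    return result

# Toy keyframe scores for two events (illustrative values only).
frame_scores = {
    "E1": [(12.0, 0.31), (47.5, 0.72), (80.0, 0.90)],
    "E2": [(5.0, 0.10), (63.0, 0.66)],
}

moments = first_moments(frame_scores)
print(moments)
```

A fixed threshold is the simplest possible decision rule; in an interactive system the user would more likely inspect the top-scoring keyframes and confirm the first correct one by hand.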

Abstract

The growth of online video platforms drives the need for effective, semantically grounded event retrieval. We present MERVIN, a unified multimodal framework for Vietnamese news videos that integrates keyframes, transcripts, and video summaries. Transcript quality is enhanced via Gemini 1.5 Flash, reducing noise from accents, background sounds, and recognition errors. Visual features are extracted with Perception Encoder, while a Vietnamese language model produces textual embeddings; both are indexed in Milvus for efficient similarity-based retrieval. In addition, a React-based interface enables iterative query refinement across modalities, improving semantic alignment. Experimental results on Vietnamese news videos demonstrate the effectiveness of the proposed system, with MERVIN achieving 79 out of 88 points in the AI Challenge HCMC 2025 qualification phase and successfully retrieving all results for every query in the final round.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MERVIN is a practical engineering pipeline for Vietnamese news video retrieval that scored well in a local challenge, but the results do not show whether the multimodal fusion adds anything over simpler baselines.


The main point to take away is that this paper describes a working system called MERVIN for retrieving events in Vietnamese news videos. It combines keyframe visual features from the Perception Encoder, text embeddings from a Vietnamese language model on Gemini-enhanced transcripts and summaries, and Milvus for similarity search, with a React front end for query tweaks. The reported outcome is 79 out of 88 points in the AI Challenge HCMC 2025 qualification phase and full retrieval success in the final round. That is a concrete applied result for a language-specific setting where accents and audio noise are common problems.

Referee Report

1 major / 2 minor

Summary. The manuscript presents MERVIN, a unified multimodal framework for event retrieval in Vietnamese news videos. It integrates keyframe visual features from the Perception Encoder, textual embeddings produced by a Vietnamese language model on Gemini 1.5 Flash-enhanced transcripts and video summaries, and indexes both modalities separately in Milvus for similarity-based retrieval. A React-based interface supports iterative cross-modal query refinement. Effectiveness is claimed via a score of 79/88 in the AI Challenge HCMC 2025 qualification phase and perfect retrieval of all results for every query in the final round.

Significance. The work targets a practical gap in semantic retrieval for Vietnamese-language news video content, where accent and recognition noise are common. The engineering integration of commercial transcription enhancement with open embedding models and a vector database is reproducible in principle and could inform deployed systems. However, because no ablation or baseline results are supplied, the significance of the proposed multimodal unification itself cannot yet be assessed.

major comments (1)

[Experimental Results] Experimental Results section: the central claim that the unified multimodal framework is effective rests solely on the reported challenge scores (79/88 qualification, perfect final-round retrieval). No unimodal baselines (visual-only or text-only), no ablation on the fusion or ranking strategy, and no dataset or query statistics are provided, so it is impossible to determine whether the integration is load-bearing or whether success arises from strong individual components, query tuning, or challenge-specific data characteristics.

minor comments (2)

[Abstract and §3] Abstract and §3: the description of retrieval states that visual and textual embeddings are produced and indexed separately, yet the framework is called 'unified'; a short paragraph clarifying whether late fusion, re-ranking, or independent retrieval with union is used would remove ambiguity.
[Experimental Results] The manuscript would benefit from a table listing the exact number of videos, keyframes, queries, and evaluation metric definitions used in the AI Challenge HCMC 2025 evaluation.
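The ambiguity raised in the first minor comment can be made concrete with a short sketch that contrasts two of the options the referee names: late fusion of per-modality rankings and a plain union of results. Reciprocal rank fusion is used here as one common late-fusion rule; it is a stand-in chosen for illustration, not a method the paper is known to use, and the video identifiers and the constant `k=60` are assumptions.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several rankings: each item scores sum(1 / (k + rank)) over the lists."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda item_id: scores[item_id], reverse=True)

# Toy per-modality rankings over shared video identifiers (illustrative only).
visual_ranking = ["video_a", "video_b", "video_c"]
text_ranking = ["video_b", "video_d", "video_a"]

# Late fusion: one combined ordering that rewards agreement between modalities.
fused = reciprocal_rank_fusion([visual_ranking, text_ranking])

# Independent retrieval with union: a set of candidates with no combined ordering.
union = sorted(set(visual_ranking) | set(text_ranking))

print(fused)
print(union)
```

The two strategies return the same candidates but differ in what the user sees first, which is why stating the strategy matters for interpreting the reported scores.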

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger experimental grounding. We address the concern about the evaluation of the multimodal framework below and outline revisions to the manuscript.


Referee: [Experimental Results] Experimental Results section: the central claim that the unified multimodal framework is effective rests solely on the reported challenge scores (79/88 qualification, perfect final-round retrieval). No unimodal baselines (visual-only or text-only), no ablation on the fusion or ranking strategy, and no dataset or query statistics are provided, so it is impossible to determine whether the integration is load-bearing or whether success arises from strong individual components, query tuning, or challenge-specific data characteristics.

Authors: We agree that the current results section relies on end-to-end challenge performance and that this limits assessment of the multimodal unification's specific contribution. The AI Challenge HCMC 2025 evaluates complete systems on real Vietnamese news videos, where our framework's integration of Perception Encoder visuals, Gemini-enhanced transcripts, and Vietnamese LM embeddings enabled perfect retrieval in the final round. We will revise the manuscript to add dataset and query statistics (e.g., number of videos, average transcript length, query types) and a discussion of Vietnamese-specific issues such as accent-induced ASR noise that motivate cross-modal retrieval. We will also include a qualitative analysis of modality contributions. However, exhaustive quantitative ablations were not performed during challenge development due to time and resource constraints focused on the integrated system. revision: partial

standing simulated objections not resolved

Quantitative ablation studies and unimodal baselines were not conducted as part of the challenge-oriented development process.

Circularity Check

0 steps flagged

No circularity: engineering system with empirical challenge scores only


The paper presents MERVIN as a multimodal retrieval framework combining keyframes, Gemini-enhanced transcripts, and summaries via separate embeddings indexed in Milvus, with performance reported solely as 79/88 in qualification and perfect final-round retrieval on AI Challenge HCMC 2025. No equations, derivations, fitted parameters, or predictions appear in the provided text. No self-citations, uniqueness theorems, or ansatzes are invoked. The central claim reduces to an empirical system description and external challenge outcome rather than any internal reduction of outputs to inputs by construction. This matches the default expectation of a non-circular engineering report.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an applied systems paper; it introduces no new mathematical axioms, free parameters, or invented entities beyond standard use of existing ML components and databases.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear


integrates keyframes, transcripts, and video summaries... Perception Encoder... Vietnamese language model... Milvus... 79 out of 88 points in AI Challenge HCMC 2025
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear


Transcript quality is enhanced via Gemini 1.5 Flash... visual and textual embeddings

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

