pith. machine review for the scientific record.

arxiv: 2605.10228 · v1 · submitted 2026-05-11 · 💻 cs.MM

Recognition: no theorem link

FLARE: Full-Modality Long-Video Audiovisual Retrieval Benchmark with User-Simulated Queries

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 05:18 UTC · model grok-4.3

classification 💻 cs.MM
keywords: audiovisual retrieval · long video benchmark · user queries · multimodal evaluation · video retrieval · audio language alignment · caption versus query retrieval

The pith

A new benchmark shows that realistic user queries change how retrieval models perform on long audiovisual videos and that audio-language alignment remains a bottleneck.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs FLARE to evaluate retrieval of long videos using short natural-language queries that draw on both visual and audio content. It argues that prior benchmarks rely on short clips, single modalities, and caption matching, which do not match how users actually search. By releasing 87,697 annotated clips from 399 long videos together with 274,933 user-style queries filtered through a hard bimodal constraint, the work compares caption-based and query-based regimes across fifteen models. A sympathetic reader would care because video search and multimodal models are advancing rapidly, yet current evaluation methods may be steering development away from practical performance.

Core claim

FLARE supplies full-modality annotations for long videos and applies a hard bimodal constraint that keeps only those cross-modal queries for which retrieval fails on either modality alone but succeeds when both are available. Experiments with fifteen representative retrievers under caption-based and query-based settings establish that user-style queries substantially alter model behavior, that strong caption-based results do not reliably transfer to query-based retrieval, and that audio-language alignment constitutes a persistent bottleneck for unified audiovisual retrieval.
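To make the two regimes concrete, a minimal sketch of caption-based versus query-based scoring with recall@k follows; the encoder interface and data layout are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def recall_at_k(sim: np.ndarray, targets: np.ndarray, k: int = 10) -> float:
    """sim[i, j]: similarity of text i to clip j; targets[i]: index of the correct clip."""
    top_k = np.argsort(-sim, axis=1)[:, :k]          # indices of the k most similar clips per text
    hits = (top_k == targets[:, None]).any(axis=1)   # was the target clip among them?
    return float(hits.mean())

def evaluate_regimes(encode_text, clip_emb, captions, queries, targets, k=10):
    """Score one retriever under the caption-based and query-based regimes.

    `encode_text` is a hypothetical callable returning L2-normalized text embeddings;
    `clip_emb` holds precomputed clip embeddings (visual, audio, or unified).
    """
    results = {}
    for regime, texts in (("caption-based", captions), ("query-based", queries)):
        text_emb = encode_text(texts)                # (num_texts, d)
        sim = text_emb @ clip_emb.T                  # cosine similarity for normalized vectors
        results[regime] = recall_at_k(sim, targets, k)
    return results
```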

What carries the argument

The hard bimodal constraint, which retains only queries that succeed solely when vision and audio evidence are combined.
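A minimal sketch of how that filter could be implemented, assuming the target clip's retrieval rank has already been computed under vision-only, audio-only, and unified settings; the success-rank threshold is a hypothetical parameter, not a value taken from the paper.

```python
def passes_hard_bimodal_constraint(rank_vision: int, rank_audio: int,
                                   rank_unified: int, success_rank: int = 10) -> bool:
    """Keep a cross-modal query only if each single modality fails but the combination succeeds.

    Retrieval "succeeds" when the target clip lands within the top `success_rank` results;
    the threshold here is illustrative, not the paper's setting.
    """
    fails_alone = rank_vision > success_rank and rank_audio > success_rank
    succeeds_together = rank_unified <= success_rank
    return fails_alone and succeeds_together

# Tiny worked example with hypothetical ranks for three candidate queries.
candidates = [
    {"query": "q1", "rank_v": 42, "rank_a": 57, "rank_av": 3},   # kept: both unimodal runs fail, unified succeeds
    {"query": "q2", "rank_v": 2,  "rank_a": 80, "rank_av": 1},   # dropped: vision alone already succeeds
    {"query": "q3", "rank_v": 33, "rank_a": 29, "rank_av": 25},  # dropped: unified retrieval also fails
]
kept = [c for c in candidates
        if passes_hard_bimodal_constraint(c["rank_v"], c["rank_a"], c["rank_av"])]
```

Sweeping the success-rank threshold would trade query difficulty against yield; the paper's own decision rule may differ.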

If this is right

  • Retrieval models require separate evaluation under query-based regimes rather than relying solely on caption-based scores.
  • Improvements in audio-language alignment are necessary before unified audiovisual retrieval can succeed at scale.
  • Benchmarks for video retrieval should incorporate long-form content and multiple modalities to reflect realistic conditions.
  • Caption-trained models may need additional query-style fine-tuning to maintain performance on natural user inputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Training pipelines for retrieval systems may benefit from incorporating diverse query formats during pre-training rather than captions alone.
  • The identified audio-alignment gap could constrain the utility of multimodal large language models in open-ended video search tasks.
  • Extending the benchmark to additional languages or domains would test whether the observed bottlenecks generalize beyond the current video sources.

Load-bearing premise

The user-simulated queries together with the hard bimodal constraint accurately represent real-world user intent and retrieval difficulty.

What would settle it

A study in which actual users issue queries against the same long videos and the resulting performance shifts and modality rankings differ from those observed with the simulated queries.

Figures

Figures reproduced from arXiv: 2605.10228 by Bohan Zeng, Hao Liang, Meiyi Qiang, Mingrui Chen, Qijie You, Wentao Zhang, Zhenhao Wong.

Figure 1. Data case. Due to space constraints, many parts are omitted.
Figure 2. Caption Construct: final human review. To guarantee that every retained clip is retrieval-friendly (short enough to carry a single coherent topic and exhibiting non-trivial audiovisual variation), clips still exceeding 2 minutes after automated segmentation are routed to human review, where annotators manually split them along the dominant modality or discard clips lacking meaningful audio-visual content (full…
Figure 3. Query Construct: multimodal query generation. In real retrieval scenarios, users often do not provide complete or exhaustive descriptions. To make FLARE closer to realistic search behavior, we construct a high-quality dataset of simulated user queries through the following steps. Candidate generation: for vision-only and audio-only queries, Qwen3-235B-A22B-Instruct rewrites the corresponding modality cap…
Figure 4. Dataset statistics distributions. (a) Video duration grouped by range. (b) Clip duration in…
Figure 5. Representative error cases. Left: a model retrieves the correct clip with a detailed caption…
Figure 6. Prompt for audio-driven modality triage.
Figure 7. Prompt for transcript-based semantic splitting.
Figure 8. Prompt for visual clip-level caption generation.
Figure 9. Prompt for audio clip-level caption generation.
Figure 10. Prompt for unified audiovisual caption generation.
Figure 11. Prompt for caption degeneration checking.
Figure 12. Prompt for video-level caption merging.
Figure 13. Prompt for single-modality query generation.
Figure 14. Prompt for cross-modal query generation.
read the original abstract

As video becomes increasingly central to information dissemination and multimodal large language models (MLLMs) continue to advance, evaluating video retrieval has become increasingly important. In realistic search scenarios, this requires matching short user queries to long-form content using both visual and auditory evidence. Yet existing retrieval benchmarks are still dominated by short clips, single modalities, and caption-based evaluation. We introduce FLARE, a full-modality long-video audiovisual retrieval benchmark with user-simulated queries. Built from 399 carefully screened Video-MME videos (10–60 min, 225.4 h) to ensure source quality and diversity, FLARE contains 87,697 clips annotated with vision, audio, and unified audiovisual captions, together with 274,933 user-style queries. Cross-modal queries are further filtered by a hard bimodal constraint, requiring retrieval to fail under either modality alone but succeed when both are combined. FLARE evaluates models under two regimes, caption-based and query-based retrieval, across vision, audio, and unified audiovisual settings. Experiments with 15 representative retrievers show that user-style queries substantially change model behavior, strong caption-based performance does not always transfer to query-based retrieval, and audio–language alignment remains a key bottleneck for unified audiovisual retrieval. Our code and data are released at https://flarebench.github.io/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces FLARE, a benchmark for full-modality long-video audiovisual retrieval with user-simulated queries. Built from 399 screened Video-MME videos (10-60 min, 225.4 h total), it provides 87,697 clips annotated with vision, audio, and unified audiovisual captions plus 274,933 user-style queries. Cross-modal queries are filtered via a hard bimodal constraint (retrieval must fail on either modality alone but succeed when both are available). The benchmark supports caption-based and query-based evaluation regimes across vision, audio, and unified settings. Experiments on 15 retrievers show that user-style queries alter model behavior, caption-based performance does not reliably transfer to query-based retrieval, and audio-language alignment remains a bottleneck for unified audiovisual retrieval. Code and data are released.

Significance. If the user-simulated queries and hard bimodal filter are faithful proxies for real-world intent and difficulty, FLARE would fill an important gap by moving video retrieval evaluation beyond short clips, single modalities, and caption-only protocols toward long-form, full-modality, query-driven settings. The open release of code and data supports reproducibility, and the empirical results across 15 retrievers supply concrete evidence of current limitations that could guide MLLM and retrieval research.

major comments (2)
  1. [Dataset construction] Dataset construction section: the generation process for the 274,933 user-simulated queries and the precise implementation of the hard bimodal constraint (failure on unimodal retrieval but success on bimodal) are described at a high level only. These choices are load-bearing for the central claims that user-style queries change model behavior and that the benchmark captures realistic difficulty; without explicit templates, LLM prompts, or validation statistics, it is difficult to assess whether the observed performance shifts are artifacts of the simulation rather than genuine user-intent differences.
  2. [Experiments] Experiments section: while results are reported for 15 retrievers, the paper provides limited error analysis or case studies showing concrete examples of queries that fail unimodally but succeed bimodally. This weakens the claim that audio-language alignment is the key bottleneck, as the quantitative tables alone do not isolate whether failures stem from alignment, retrieval architecture, or the filtering procedure itself.
minor comments (3)
  1. Abstract: the total number of queries (274,933) and clips (87,697) are stated, but average query length, modality distribution, or number of queries per video would help readers quickly gauge scale.
  2. Related work: a more explicit comparison table contrasting FLARE with prior long-video or audiovisual benchmarks (e.g., on duration, modality coverage, and query style) would strengthen positioning.
  3. Figures: performance plots comparing caption-based vs. query-based regimes should include error bars or statistical significance markers to support the claim of substantial behavioral change.
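One straightforward way to produce the uncertainty estimates asked for in the last point is a per-query bootstrap over recall@k; the sketch below assumes binary per-query hit indicators and is not drawn from the paper.

```python
import numpy as np

def bootstrap_recall_ci(hits, n_boot: int = 1000, alpha: float = 0.05, seed: int = 0):
    """Mean recall@k plus a bootstrap confidence interval from per-query hit indicators.

    `hits[i]` is 1 if the target clip appears in the top-k results for query i, else 0.
    """
    hits = np.asarray(hits, dtype=float)
    rng = np.random.default_rng(seed)
    resampled = rng.choice(hits, size=(n_boot, hits.size), replace=True).mean(axis=1)
    lo, hi = np.quantile(resampled, [alpha / 2, 1 - alpha / 2])
    return hits.mean(), (lo, hi)

# Hypothetical per-query outcomes for one retriever under one regime.
mean, (lo, hi) = bootstrap_recall_ci([1, 0, 1, 1, 0, 1, 0, 1, 1, 1])
print(f"recall@k = {mean:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```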

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive recommendation of minor revision and for the constructive comments on dataset construction and experimental analysis. We address each point below and will incorporate the suggested clarifications and additions in the revised manuscript.

read point-by-point responses
  1. Referee: [Dataset construction] Dataset construction section: the generation process for the 274,933 user-simulated queries and the precise implementation of the hard bimodal constraint (failure on unimodal retrieval but success on bimodal) are described at a high level only. These choices are load-bearing for the central claims that user-style queries change model behavior and that the benchmark captures realistic difficulty; without explicit templates, LLM prompts, or validation statistics, it is difficult to assess whether the observed performance shifts are artifacts of the simulation rather than genuine user-intent differences.

    Authors: We agree that additional transparency on the query generation and filtering pipeline is warranted. In the revised manuscript we will expand the Dataset Construction section to include the exact system and user prompts supplied to the LLM for simulating user-style queries, the concrete templates and decision rules used to enforce the hard bimodal constraint (including the retrieval failure thresholds applied to unimodal runs), and supplementary validation statistics such as the fraction of candidate queries retained after filtering and basic agreement metrics between automated and manual checks on a held-out subset. These additions will allow readers to evaluate the simulation fidelity directly. revision: yes

  2. Referee: [Experiments] Experiments section: while results are reported for 15 retrievers, the paper provides limited error analysis or case studies showing concrete examples of queries that fail unimodally but succeed bimodally. This weakens the claim that audio-language alignment is the key bottleneck, as the quantitative tables alone do not isolate whether failures stem from alignment, retrieval architecture, or the filtering procedure itself.

    Authors: We acknowledge that illustrative examples would strengthen the interpretation of the results. In the revised version we will add a dedicated error-analysis subsection containing 4–6 representative query–clip pairs that satisfy the hard bimodal constraint. For each pair we will report the unimodal and bimodal retrieval ranks, highlight the specific audiovisual evidence that enables success only when both modalities are available, and briefly discuss whether the failure mode appears attributable to alignment, model architecture, or filtering artifacts. This qualitative support will complement the quantitative tables and clarify the audio-alignment bottleneck claim. revision: yes
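A minimal sketch of how the promised unimodal and bimodal ranks could be computed for a single query-clip pair; the embedding interfaces and the random example are illustrative assumptions, not the authors' pipeline.

```python
import numpy as np

def target_rank(query_emb: np.ndarray, clip_embs: np.ndarray, target_idx: int) -> int:
    """1-based rank of the target clip when clips are sorted by similarity to the query."""
    sims = clip_embs @ query_emb            # cosine similarity for normalized embeddings
    order = np.argsort(-sims)               # best match first
    return int(np.where(order == target_idx)[0][0]) + 1

def rank_report(query_emb, vision_embs, audio_embs, unified_embs, target_idx):
    """Ranks under vision-only, audio-only, and unified settings for one query-clip pair."""
    return {
        "vision": target_rank(query_emb, vision_embs, target_idx),
        "audio": target_rank(query_emb, audio_embs, target_idx),
        "unified": target_rank(query_emb, unified_embs, target_idx),
    }

# Hypothetical example with random normalized embeddings for 100 clips.
rng = np.random.default_rng(0)
def normed(x): return x / np.linalg.norm(x, axis=-1, keepdims=True)
q = normed(rng.normal(size=16))
clips = normed(rng.normal(size=(100, 16)))
print(target_rank(q, clips, target_idx=7))
```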

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper is a benchmark-construction and empirical-evaluation work with no mathematical derivations, fitted parameters, or predictions. Central claims rest on dataset creation from external Video-MME videos, query simulation, and testing 15 independent retrievers under standard metrics. There are no load-bearing self-citation steps, no ansatz smuggling, and no reduction of results to inputs by construction; the evaluation is grounded in external benchmarks and independently developed retrievers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper creates a new evaluation resource rather than deriving results from equations or fitted models; the main premises are domain assumptions about data quality.

axioms (1)
  • domain assumption The 399 selected Video-MME videos provide sufficient quality and diversity for a long-video audiovisual benchmark.
    Paper states videos were carefully screened to ensure source quality and diversity.

pith-pipeline@v0.9.0 · 5555 in / 1228 out tokens · 51226 ms · 2026-05-12T05:18:32.760345+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

76 extracted references · 76 canonical work pages · 7 internal anchors

  1. [1]

    PySceneDetect

    PySceneDetect. URL https://www.scenedetect.com

  2. [2]

    MusicLM: Generating Music From Text

    Andrea Agostinelli, Timo I. Denk, Zalán Borsos, Jesse Engel, Mauro Verzetti, Antoine Caillon, Qingqing Huang, Aren Jansen, Adam Roberts, Marco Tagliasacchi, Matt Sharifi, Neil Zeghidour, and Christian Frank. MusicLM: Generating music from text, 2023. URL https://arxiv.org/abs/2301.11325

  3. [3]

    Lovr: A benchmark for long video retrieval in multimodal contexts. arXiv preprint arXiv:2505.13928, 2025

    Qifeng Cai, Hao Liang, Zhaoyang Han, Hejun Dong, Meiyi Qiang, Ruichuan An, Quanqing Xu, Bin Cui, and Wentao Zhang. Lovr: A benchmark for long video retrieval in multimodal contexts. arXiv preprint arXiv:2505.13928, 2025

  4. [4]

    Jointavbench: A benchmark for joint audio-visual reasoning evaluation. arXiv preprint arXiv:2512.12772, 2025

    Jianghan Chao, Jianzhang Gao, Wenhui Tan, Yuchong Sun, Ruihua Song, and Liyun Ru. Jointavbench: A benchmark for joint audio-visual reasoning evaluation. arXiv preprint arXiv:2512.12772, 2025

  5. [5]

    Collecting highly parallel data for paraphrase evaluation

    David Chen and William B Dolan. Collecting highly parallel data for paraphrase evaluation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2011

  6. [6]

    Vggsound: A large-scale audio-visual dataset

    Honglie Chen, Weidi Xie, Andrea Vedaldi, and Andrew Zisserman. Vggsound: A large-scale audio-visual dataset. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020

  7. [7]

    Bge m3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation, 2024

    Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. Bge m3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation, 2024

  8. [8]

    Meta clip 2: A worldwide scaling recipe. arXiv preprint arXiv:2507.22062

    Yung-Sung Chuang, Yang Li, Dong Wang, Ching-Feng Yeh, Kehan Lyu, Ramya Raghavendra, James Glass, Lifei Huang, Jason Weston, Luke Zettlemoyer, et al. Meta clip 2: A worldwide scaling recipe. arXiv preprint arXiv:2507.22062, 2025

  9. [9]

    Glap: General contrastive audio-text pretraining across domains and languages, 2025

    Heinrich Dinkel, Zhiyong Yan, Tianzi Wang, Yongqing Wang, Xingwei Sun, Yadong Niu, Jizhong Liu, Gang Li, Junbo Zhang, and Jian Luan. Glap: General contrastive audio-text pretraining across domains and languages, 2025

  10. [10]

    Clotho: An Audio Captioning Dataset

    Konstantinos Drossos, Samuel Lipping, and Tuomas Virtanen. Clotho: an audio captioning dataset. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 736–740, 2020. doi: 10.1109/ICASSP40776.2020.9052990

  11. [11]

    Clap learning audio concepts from natural language supervision

    Benjamin Elizalde, Soham Deshmukh, Mahmoud Al Ismail, and Huaming Wang. Clap learning audio concepts from natural language supervision. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023

  12. [12]

    Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. In CVPR, 2025

  13. [13]

    Imagebind: One embedding space to bind them all

    Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. Imagebind: One embedding space to bind them all. In CVPR, 2023

  14. [14]

    Brace: A benchmark for robust audio caption quality evaluation, 2025

    Tianyu Guo, Hongyu Chen, Hao Liang, Meiyi Qiang, Bohan Zeng, Linzhuang Sun, Bin Cui, and Wentao Zhang. Brace: A benchmark for robust audio caption quality evaluation, 2025. URL https://arxiv.org/abs/2512.10403

  15. [15]

    OmniCVR: A benchmark for omni-composed video retrieval with vision, audio, and text

    Junyang Ji, Shengjun Zhang, Da Li, Yuxiao Luo, Yan Wang, Di Xu, Biao Yang, Wei Yuan, Fan Yang, Zhihai He, and Wenming Yang. OmniCVR: A benchmark for omni-composed video retrieval with vision, audio, and text. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=KxxR7emO5K

  16. [16]

    Audiocaps: Generating captions for audios in the wild

    Chris Dongjoo Kim, Byeongchang Kim, Hyunmin Lee, and Gunhee Kim. Audiocaps: Generating captions for audios in the wild. In NAACL-HLT, 2019

  17. [17]

    Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking

    Mingxin Li, Yanzhao Zhang, Dingkun Long, Chen Keqin, Sibo Song, Shuai Bai, Zhibo Yang, Pengjun Xie, An Yang, Dayiheng Liu, Jingren Zhou, and Junyang Lin. Qwen3-vl-embedding and qwen3-vl-reranker: A unified framework for state-of-the-art multimodal retrieval and ranking. arXiv preprint arXiv:2601.04720, 2026

  18. [18]

    Vaq-bench: A benchmark for videoqa answer quality evaluation

    Hao Liang, Meiyi Qiang, Zimo Meng, and Wentao Zhang. Vaq-bench: A benchmark for videoqa answer quality evaluation. In Proceedings of the 34th ACM International Conference on Multimedia (ACM MM), 2026

  19. [19]

    ROUGE: A package for automatic evaluation of summaries

    Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/W04-1013

  20. [20]

    WavCaps: A ChatGPT-assisted weakly-labelled audio captioning dataset for audio-language multimodal research

    Xinhao Mei, Chutong Meng, Haohe Liu, Qiuqiang Kong, Tom Ko, Chengqi Zhao, Mark D. Plumbley, Yuexian Zou, and Wenwu Wang. WavCaps: A ChatGPT-assisted weakly-labelled audio captioning dataset for audio-language multimodal research. IEEE/ACM Transactions on Audio, Speech, and Language Processing, pages 1–15, 2024

  21. [21]

    M2D-CLAP: Masked Modeling Duo Meets CLAP for Learning General-purpose Audio-Language Representation

    Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Masahiro Yasuda, Shunsuke Tsubaki, and Keisuke Imoto. M2D-CLAP: Masked Modeling Duo Meets CLAP for Learning General-purpose Audio-Language Representation. In Interspeech, pages 57–61, 2024. doi: 10.21437/Interspeech.2024-29

  22. [22]

    Learning Transferable Visual Models From Natural Language Supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021. URL https://arxiv.org/abs/2103.00020

  23. [23]

    Qwen3-ASR Technical Report

    Xian Shi, Xiong Wang, Zhifang Guo, Yongqi Wang, Pei Zhang, Xinyu Zhang, Zishan Guo, Hongkun Hao, Yu Xi, Baosong Yang, et al. Qwen3-asr technical report. arXiv preprint arXiv:2601.21337, 2026

  24. [24]

    WAVE: Learning unified & versatile audio-visual embeddings with multimodal LLM

    Changli Tang, Qinfan Xiao, Ke Mei, Tianyi Wang, Fengyun Rao, and Chao Zhang. WAVE: Learning unified & versatile audio-visual embeddings with multimodal LLM. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=MiV3WXDYJb

  25. [26]

    Qwen3 Technical Report

    Qwen Team. Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388

  26. [27]

    Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features, 2025

    Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, and Xiaohua Zhai. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features, 20...

  27. [28]

    Pushing the frontier of audiovisual perception with large-scale multimodal correspondence learning, 2025

    Apoorv Vyas, Heng-Jui Chang, Cheng-Fu Yang, Po-Yao Huang, Luya Gao, Julius Richter, Sanyuan Chen, Matt Le, Piotr Dollár, Christoph Feichtenhofer, Ann Lee, and Wei-Ning Hsu. Pushing the frontier of audiovisual perception with large-scale multimodal correspondence learning, 2025. URL https://arxiv.org/abs/2512.19687

  28. [29]

    Videoclip-xl: Advancing long description understanding for video clip models, 2024

    Jiapeng Wang, Chengyu Wang, Kunzhe Huang, Jun Huang, and Lianwen Jin. Videoclip-xl: Advancing long description understanding for video clip models, 2024. URL https://arxiv.org/abs/2410.00741

  29. [30]

    Vatex: A large-scale, high-quality multilingual dataset for video-and-language research

    Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, and William Yang Wang. Vatex: A large-scale, high-quality multilingual dataset for video-and-language research. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019

  30. [31]

    Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation

    Yusong Wu*, Ke Chen*, Tianyu Zhang*, Yuchen Hui*, Taylor Berg-Kirkpatrick, and Shlomo Dubnov. Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, 2023

  31. [32]

    Scaling audio-text retrieval with multimodal large language models, 2026

    Jilan Xu, Carl Thomé, Danijela Horak, Weidi Xie, and Andrew Zisserman. Scaling audio-text retrieval with multimodal large language models, 2026. URL https://arxiv.org/abs/2602.18010

  32. [33]

    Qwen3-Omni Technical Report

    Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, Yuanjun Lv, Yongqi Wang, Dake Guo, He Wang, Linhan Ma, Pei Zhang, Xinyu Zhang, Hongkun Hao, Zishan Guo, Baosong Yang, Bin Zhang, Ziyang Ma, Xipin Wei, Shuai Bai, Keqin Chen, Xuejing Liu, Peng Wang, Mingkun Yang, Dayiheng Liu, Xingzhang Ren, Bo ...

  33. [34]

    Msr-vtt: A large video description dataset for bridging video and language

    Jun Xu, Tao Mei, Ting Yao, and Yong Rui. Msr-vtt: A large video description dataset for bridging video and language. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016

  34. [35]

    Languagebind: Extending video-language pretraining to n-modality by language-based semantic alignment, 2023

    Bin Zhu, Bin Lin, Munan Ning, Yang Yan, Jiaxi Cui, Wang HongFa, Yatian Pang, Wenhao Jiang, Junwu Zhang, Zongwei Li, Cai Wan Zhang, Zhifeng Li, Wei Liu, and Li Yuan. Languagebind: Extending video-language pretraining to n-modality by language-based semantic alignment, 2023
