FLARE: Full-Modality Long-Video Audiovisual Retrieval Benchmark with User-Simulated Queries
Pith reviewed 2026-05-12 05:18 UTC · model grok-4.3
The pith
A new benchmark shows that realistic user queries change how retrieval models perform on long audiovisual videos and that audio-language alignment remains a bottleneck.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FLARE supplies full-modality annotations for long videos and applies a hard bimodal constraint that keeps only those cross-modal queries for which retrieval fails on either modality alone but succeeds when both are available. Experiments with fifteen representative retrievers under caption-based and query-based settings establish that user-style queries substantially alter model behavior, that strong caption-based results do not reliably transfer to query-based retrieval, and that audio-language alignment constitutes a persistent bottleneck for unified audiovisual retrieval.
What carries the argument
The hard bimodal constraint, which retains only queries that succeed solely when vision and audio evidence are combined.
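Concretely, this filter can be sketched in a few lines, assuming a retriever that returns a ranked list of clip IDs and treating "success" as the target appearing in the top-k. The paper does not spell out its thresholds or interfaces; `hits`, the `k=10` default, and the function names below are illustrative, not the authors' implementation.

```python
def hits(retrieve, query, target_id, k=10):
    """True if the target clip appears in the retriever's top-k results."""
    return target_id in retrieve(query)[:k]

def hard_bimodal_filter(queries, retrieve_vision, retrieve_audio, retrieve_both, k=10):
    """Keep only (query, target) pairs that fail under either modality
    alone but succeed when vision and audio evidence are combined."""
    kept = []
    for query, target_id in queries:
        vision_only = hits(retrieve_vision, query, target_id, k)
        audio_only = hits(retrieve_audio, query, target_id, k)
        combined = hits(retrieve_both, query, target_id, k)
        if combined and not vision_only and not audio_only:
            kept.append((query, target_id))
    return kept
```

Note that tightening or loosening k changes how many queries survive, which is one reason the precise thresholds matter for interpreting the benchmark's difficulty.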
If this is right
- Retrieval models require separate evaluation under query-based regimes rather than relying solely on caption-based scores.
- Improvements in audio-language alignment are necessary before unified audiovisual retrieval can succeed at scale.
- Benchmarks for video retrieval should incorporate long-form content and multiple modalities to reflect realistic conditions.
- Caption-trained models may need additional query-style fine-tuning to maintain performance on natural user inputs.
Where Pith is reading between the lines
- Training pipelines for retrieval systems may benefit from incorporating diverse query formats during pre-training rather than captions alone.
- The identified audio-alignment gap could constrain the utility of multimodal large language models in open-ended video search tasks.
- Extending the benchmark to additional languages or domains would test whether the observed bottlenecks generalize beyond the current video sources.
Load-bearing premise
The user-simulated queries together with the hard bimodal constraint accurately represent real-world user intent and retrieval difficulty.
What would settle it
A study in which actual users issue queries against the same long videos and the resulting performance shifts and modality rankings differ from those observed with the simulated queries.
Original abstract
As video becomes increasingly central to information dissemination and multimodal large language models (MLLMs) continue to advance, evaluating video retrieval has become increasingly important. In realistic search scenarios, this requires matching short user queries to long-form content using both visual and auditory evidence. Yet existing retrieval benchmarks are still dominated by short clips, single modalities, and caption-based evaluation. We introduce FLARE, a full-modality long-video audiovisual retrieval benchmark with user-simulated queries. Built from 399 carefully screened Video-MME videos (10--60 min, 225.4 h) to ensure source quality and diversity, FLARE contains 87,697 clips annotated with vision, audio, and unified audiovisual captions, together with 274,933 user-style queries. Cross-modal queries are further filtered by a hard bimodal constraint, requiring retrieval to fail under either modality alone but succeed when both are combined. FLARE evaluates models under two regimes, caption-based and query-based retrieval, across vision, audio, and unified audiovisual settings. Experiments with 15 representative retrievers show that user-style queries substantially change model behavior, strong caption-based performance does not always transfer to query-based retrieval, and audio--language alignment remains a key bottleneck for unified audiovisual retrieval. Our code and data are released at https://flarebench.github.io/
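The two regimes described in the abstract differ only in the text fed to the retriever; the scoring machinery is shared. A minimal Recall@K computation, as one might implement it (a sketch with invented clip IDs, not the released code):

```python
def recall_at_k(ranked_lists, targets, k):
    """Fraction of queries whose ground-truth clip appears in the top-k.

    ranked_lists: one ranked list of clip IDs per query.
    targets: the ground-truth clip ID for each query.
    """
    found = sum(target in ranked[:k] for ranked, target in zip(ranked_lists, targets))
    return found / len(targets)

# The regimes differ only in the query text: full captions vs. user-style queries.
caption_runs = [["c1", "c7"], ["c3", "c2"]]  # rankings from caption queries
query_runs = [["c9", "c1"], ["c2", "c3"]]    # rankings from user-style queries
targets = ["c1", "c2"]
print(recall_at_k(caption_runs, targets, k=1))  # 0.5
print(recall_at_k(query_runs, targets, k=2))    # 1.0
```

Comparing the two runs at the same k is exactly the transfer question the paper probes: strong caption-based scores need not survive the switch to user-style queries.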
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces FLARE, a benchmark for full-modality long-video audiovisual retrieval with user-simulated queries. Built from 399 screened Video-MME videos (10-60 min, 225.4 h total), it provides 87,697 clips annotated with vision, audio, and unified audiovisual captions plus 274,933 user-style queries. Cross-modal queries are filtered via a hard bimodal constraint (retrieval must fail on either modality alone but succeed when both are available). The benchmark supports caption-based and query-based evaluation regimes across vision, audio, and unified settings. Experiments on 15 retrievers show that user-style queries alter model behavior, caption-based performance does not reliably transfer to query-based retrieval, and audio-language alignment remains a bottleneck for unified audiovisual retrieval. Code and data are released.
Significance. If the user-simulated queries and hard bimodal filter are faithful proxies for real-world intent and difficulty, FLARE would fill an important gap by moving video retrieval evaluation beyond short clips, single modalities, and caption-only protocols toward long-form, full-modality, query-driven settings. The open release of code and data supports reproducibility, and the empirical results across 15 retrievers supply concrete evidence of current limitations that could guide MLLM and retrieval research.
major comments (2)
- [Dataset construction] Dataset construction section: the generation process for the 274,933 user-simulated queries and the precise implementation of the hard bimodal constraint (failure on unimodal retrieval but success on bimodal) are described at a high level only. These choices are load-bearing for the central claims that user-style queries change model behavior and that the benchmark captures realistic difficulty; without explicit templates, LLM prompts, or validation statistics, it is difficult to assess whether the observed performance shifts are artifacts of the simulation rather than genuine user-intent differences.
- [Experiments] Experiments section: while results are reported for 15 retrievers, the paper provides limited error analysis or case studies showing concrete examples of queries that fail unimodally but succeed bimodally. This weakens the claim that audio-language alignment is the key bottleneck, as the quantitative tables alone do not isolate whether failures stem from alignment, retrieval architecture, or the filtering procedure itself.
minor comments (3)
- Abstract: the totals for queries (274,933) and clips (87,697) are stated, but average query length, modality distribution, or the number of queries per video would help readers quickly gauge scale.
- Related work: a more explicit comparison table contrasting FLARE with prior long-video or audiovisual benchmarks (e.g., on duration, modality coverage, and query style) would strengthen positioning.
- Figures: performance plots comparing caption-based vs. query-based regimes should include error bars or statistical significance markers to support the claim of substantial behavioral change.
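One concrete way to obtain the requested uncertainty estimates is a paired bootstrap over queries, resampling per-query hit indicators for the two regimes. The 95% level, 1,000 resamples, and fixed seed below are illustrative choices, not the paper's.

```python
import random

def paired_bootstrap_diff(hits_a, hits_b, n_boot=1000, seed=0):
    """95% bootstrap CI for the Recall@K gap between two regimes.

    hits_a, hits_b: 0/1 hit indicators per query, aligned so index i
    refers to the same query in both regimes (paired resampling).
    """
    rng = random.Random(seed)
    n = len(hits_a)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample queries with replacement
        diff = sum(hits_a[i] for i in idx) / n - sum(hits_b[i] for i in idx) / n
        diffs.append(diff)
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot)]
```

If the resulting interval excludes zero, the caption-vs-query gap for that retriever is unlikely to be resampling noise.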
Simulated Author's Rebuttal
We thank the referee for the positive recommendation of minor revision and for the constructive comments on dataset construction and experimental analysis. We address each point below and will incorporate the suggested clarifications and additions in the revised manuscript.
Point-by-point responses
- Referee: [Dataset construction] Dataset construction section: the generation process for the 274,933 user-simulated queries and the precise implementation of the hard bimodal constraint (failure on unimodal retrieval but success on bimodal) are described at a high level only. These choices are load-bearing for the central claims that user-style queries change model behavior and that the benchmark captures realistic difficulty; without explicit templates, LLM prompts, or validation statistics, it is difficult to assess whether the observed performance shifts are artifacts of the simulation rather than genuine user-intent differences.
Authors: We agree that additional transparency on the query generation and filtering pipeline is warranted. In the revised manuscript we will expand the Dataset Construction section to include the exact system and user prompts supplied to the LLM for simulating user-style queries, the concrete templates and decision rules used to enforce the hard bimodal constraint (including the retrieval failure thresholds applied to unimodal runs), and supplementary validation statistics such as the fraction of candidate queries retained after filtering and basic agreement metrics between automated and manual checks on a held-out subset. These additions will allow readers to evaluate the simulation fidelity directly. revision: yes
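The promised statistics are simple to compute once labels exist. A possible formulation of the retention fraction and a chance-corrected agreement score (Cohen's kappa) between automated and manual keep/drop decisions follows; the function names and interfaces are assumptions, not the authors' code.

```python
def retention_fraction(n_candidates, n_kept):
    """Share of candidate queries surviving the hard bimodal filter."""
    return n_kept / n_candidates

def cohens_kappa(auto_labels, manual_labels):
    """Agreement between automated and manual keep/drop decisions,
    corrected for chance. Labels are booleans aligned by query."""
    n = len(auto_labels)
    agree = sum(a == m for a, m in zip(auto_labels, manual_labels)) / n
    p_auto = sum(auto_labels) / n
    p_manual = sum(manual_labels) / n
    # Expected agreement if both labelers decided independently at these rates.
    chance = p_auto * p_manual + (1 - p_auto) * (1 - p_manual)
    return (agree - chance) / (1 - chance)
```

Reporting both numbers on a held-out subset would let readers judge simulation fidelity directly, as the rebuttal proposes.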
- Referee: [Experiments] Experiments section: while results are reported for 15 retrievers, the paper provides limited error analysis or case studies showing concrete examples of queries that fail unimodally but succeed bimodally. This weakens the claim that audio-language alignment is the key bottleneck, as the quantitative tables alone do not isolate whether failures stem from alignment, retrieval architecture, or the filtering procedure itself.
Authors: We acknowledge that illustrative examples would strengthen the interpretation of the results. In the revised version we will add a dedicated error-analysis subsection containing 4–6 representative query–clip pairs that satisfy the hard bimodal constraint. For each pair we will report the unimodal and bimodal retrieval ranks, highlight the specific audiovisual evidence that enables success only when both modalities are available, and briefly discuss whether the failure mode appears attributable to alignment, model architecture, or filtering artifacts. This qualitative support will complement the quantitative tables and clarify the audio-alignment bottleneck claim. revision: yes
Circularity Check
No significant circularity
Full rationale
The paper is a benchmark-construction and empirical-evaluation work with no mathematical derivations, fitted parameters, or theoretical predictions. Its central claims rest on dataset creation from external Video-MME videos, query simulation, and testing of 15 independent retrievers under standard metrics. There are no load-bearing self-citations, no smuggled ansatz, and no reduction of results to their inputs by construction; the evaluation is grounded in external benchmarks and retrievers rather than the paper's own prior outputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The 399 selected Video-MME videos provide sufficient quality and diversity for a long-video audiovisual benchmark.
Reference graph
Works this paper leans on
- [1]
- [2] Andrea Agostinelli, Timo I. Denk, Zalán Borsos, Jesse Engel, Mauro Verzetti, Antoine Caillon, Qingqing Huang, Aren Jansen, Adam Roberts, Marco Tagliasacchi, Matt Sharifi, Neil Zeghidour, and Christian Frank. MusicLM: Generating music from text, 2023. URL https://arxiv.org/abs/2301.11325
- [3] Qifeng Cai, Hao Liang, Zhaoyang Han, Hejun Dong, Meiyi Qiang, Ruichuan An, Quanqing Xu, Bin Cui, and Wentao Zhang. LoVR: A benchmark for long video retrieval in multimodal contexts. arXiv preprint arXiv:2505.13928, 2025
- [4] Jianghan Chao, Jianzhang Gao, Wenhui Tan, Yuchong Sun, Ruihua Song, and Liyun Ru. JointAVBench: A benchmark for joint audio-visual reasoning evaluation. arXiv preprint arXiv:2512.12772, 2025
- [5] David Chen and William B. Dolan. Collecting highly parallel data for paraphrase evaluation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2011
- [6] Honglie Chen, Weidi Xie, Andrea Vedaldi, and Andrew Zisserman. VGGSound: A large-scale audio-visual dataset. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020
- [7] Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. BGE M3-Embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation, 2024
- [8] Yung-Sung Chuang, Yang Li, Dong Wang, Ching-Feng Yeh, Kehan Lyu, Ramya Raghavendra, James Glass, Lifei Huang, Jason Weston, Luke Zettlemoyer, et al. Meta CLIP 2: A worldwide scaling recipe. arXiv preprint arXiv:2507.22062, 2025
- [9] Heinrich Dinkel, Zhiyong Yan, Tianzi Wang, Yongqing Wang, Xingwei Sun, Yadong Niu, Jizhong Liu, Gang Li, Junbo Zhang, and Jian Luan. GLAP: General contrastive audio-text pretraining across domains and languages, 2025
- [10] Konstantinos Drossos, Samuel Lipping, and Tuomas Virtanen. Clotho: An audio captioning dataset. In ICASSP 2020, pages 736–740, 2020. doi: 10.1109/ICASSP40776.2020.9052990
- [11] Benjamin Elizalde, Soham Deshmukh, Mahmoud Al Ismail, and Huaming Wang. CLAP: Learning audio concepts from natural language supervision. In ICASSP 2023, pages 1–5. IEEE, 2023
- [12] Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis. In CVPR, 2025
- [13] Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. ImageBind: One embedding space to bind them all. In CVPR, 2023
- [14] Tianyu Guo, Hongyu Chen, Hao Liang, Meiyi Qiang, Bohan Zeng, Linzhuang Sun, Bin Cui, and Wentao Zhang. BRACE: A benchmark for robust audio caption quality evaluation, 2025. URL https://arxiv.org/abs/2512.10403
- [15] Junyang Ji, Shengjun Zhang, Da Li, Yuxiao Luo, Yan Wang, Di Xu, Biao Yang, Wei Yuan, Fan Yang, Zhihai He, and Wenming Yang. OmniCVR: A benchmark for omni-composed video retrieval with vision, audio, and text. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=KxxR7emO5K
- [16] Chris Dongjoo Kim, Byeongchang Kim, Hyunmin Lee, and Gunhee Kim. AudioCaps: Generating captions for audios in the wild. In NAACL-HLT, 2019
- [17] Mingxin Li, Yanzhao Zhang, Dingkun Long, Chen Keqin, Sibo Song, Shuai Bai, Zhibo Yang, Pengjun Xie, An Yang, Dayiheng Liu, Jingren Zhou, and Junyang Lin. Qwen3-VL-Embedding and Qwen3-VL-Reranker: A unified framework for state-of-the-art multimodal retrieval and ranking. arXiv preprint arXiv:2601.04720, 2026
- [18] Hao Liang, Meiyi Qiang, Zimo Meng, and Wentao Zhang. VAQ-Bench: A benchmark for VideoQA answer quality evaluation. In Proceedings of the 34th ACM International Conference on Multimedia (ACM MM), 2026
- [19] Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/W04-1013
- [20] Xinhao Mei, Chutong Meng, Haohe Liu, Qiuqiang Kong, Tom Ko, Chengqi Zhao, Mark D. Plumbley, Yuexian Zou, and Wenwu Wang. WavCaps: A ChatGPT-assisted weakly-labelled audio captioning dataset for audio-language multimodal research. IEEE/ACM Transactions on Audio, Speech, and Language Processing, pages 1–15, 2024
- [21] Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Masahiro Yasuda, Shunsuke Tsubaki, and Keisuke Imoto. M2D-CLAP: Masked Modeling Duo meets CLAP for learning general-purpose audio-language representation. In Interspeech, pages 57–61, 2024. doi: 10.21437/Interspeech.2024-29
- [22] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021. URL https://arxiv.org/abs/2103.00020
- [23] Xian Shi, Xiong Wang, Zhifang Guo, Yongqi Wang, Pei Zhang, Xinyu Zhang, Zishan Guo, Hongkun Hao, Yu Xi, Baosong Yang, et al. Qwen3-ASR technical report. arXiv preprint arXiv:2601.21337, 2026
- [24] Changli Tang, Qinfan Xiao, Ke Mei, Tianyi Wang, Fengyun Rao, and Chao Zhang. WAVE: Learning unified & versatile audio-visual embeddings with multimodal LLM. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=MiV3WXDYJb
- [26] Qwen Team. Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388
- [27] Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, and Xiaohua Zhai. SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features, 2025
- [28] Apoorv Vyas, Heng-Jui Chang, Cheng-Fu Yang, Po-Yao Huang, Luya Gao, Julius Richter, Sanyuan Chen, Matt Le, Piotr Dollár, Christoph Feichtenhofer, Ann Lee, and Wei-Ning Hsu. Pushing the frontier of audiovisual perception with large-scale multimodal correspondence learning, 2025. URL https://arxiv.org/abs/2512.19687
- [29] Jiapeng Wang, Chengyu Wang, Kunzhe Huang, Jun Huang, and Lianwen Jin. VideoCLIP-XL: Advancing long description understanding for video CLIP models, 2024. URL https://arxiv.org/abs/2410.00741
- [30] Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, and William Yang Wang. VaTeX: A large-scale, high-quality multilingual dataset for video-and-language research. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019
- [31] Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, and Shlomo Dubnov. Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023
- [32] Jilan Xu, Carl Thomé, Danijela Horak, Weidi Xie, and Andrew Zisserman. Scaling audio-text retrieval with multimodal large language models, 2026. URL https://arxiv.org/abs/2602.18010
- [33] Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, Yuanjun Lv, Yongqi Wang, Dake Guo, He Wang, Linhan Ma, Pei Zhang, Xinyu Zhang, Hongkun Hao, Zishan Guo, Baosong Yang, Bin Zhang, Ziyang Ma, Xipin Wei, Shuai Bai, Keqin Chen, Xuejing Liu, Peng Wang, Mingkun Yang, Dayiheng Liu, Xingzhang Ren, Bo ... arXiv, 2025
- [34] Jun Xu, Tao Mei, Ting Yao, and Yong Rui. MSR-VTT: A large video description dataset for bridging video and language. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016
- [35] Bin Zhu, Bin Lin, Munan Ning, Yang Yan, Jiaxi Cui, Wang HongFa, Yatian Pang, Wenhao Jiang, Junwu Zhang, Zongwei Li, Cai Wan Zhang, Zhifeng Li, Wei Liu, and Li Yuan. LanguageBind: Extending video-language pretraining to N-modality by language-based semantic alignment, 2023