pith. machine review for the scientific record.

arxiv: 2605.14607 · v1 · submitted 2026-05-14 · 💻 cs.CV · cs.CY

Recognition: 1 theorem link

· Lean Theorem

ViMU: Benchmarking Video Metaphorical Understanding

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 04:48 UTC · model grok-4.3

classification 💻 cs.CV cs.CY
keywords video understanding · metaphorical understanding · benchmark · subtext inference · multimodal models · irony detection · social meaning

The pith

ViMU is the first benchmark to test whether video models can interpret metaphorical, ironic and social subtext beyond literal visuals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ViMU to measure how well frontier video models grasp the implicit layers of meaning in videos. Most current systems handle only explicit content such as objects, actions and temporal order. ViMU supplies hint-free open-ended and multiple-choice questions that require models to ground inferences about emotions, attitudes and social meanings in multimodal evidence. If the benchmark works, it will expose whether models truly understand subtext or merely match surface patterns. This distinction matters because real video communication often relies on unspoken cultural and social cues.

Core claim

The authors establish ViMU as a benchmark that tests whether video understanding models can move past literal perception and infer the implicit ideas, intentions, emotions, attitudes and social meanings embedded in a video's context, style and viewer experience. The evaluation relies on carefully designed hint-free questions in both open-ended and multiple-choice formats.

What carries the argument

The ViMU benchmark itself, which assesses subtext understanding through curated, hint-free questions that force models to extract metaphorical, ironic and social meanings from videos.
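The mechanics of such an evaluation are simple to state; the hard part is the curation. Below is a minimal sketch of a hint-free multiple-choice scorer. The item fields and example questions are hypothetical illustrations, not ViMU's actual schema or data.

```python
from dataclasses import dataclass

@dataclass
class MCItem:
    """One hypothetical multiple-choice item; field names are illustrative."""
    video_id: str
    question: str          # hint-free: the wording names no key evidence
    options: list[str]
    answer_idx: int        # index of the gold option

def mc_accuracy(items, predict):
    """predict(item) -> chosen option index; returns fraction correct."""
    correct = sum(1 for it in items if predict(it) == it.answer_idx)
    return correct / len(items)

# Hypothetical items; real ViMU questions and answers are not reproduced here.
items = [
    MCItem("v1", "What attitude does the creator convey?",
           ["sincere praise", "irony", "neutral reporting", "fear"], 1),
    MCItem("v2", "What social meaning does the closing shot carry?",
           ["celebration", "mockery", "nostalgia", "instruction"], 2),
]

def always_first(item):
    # A trivial baseline that ignores the video and question entirely.
    return 0

print(mc_accuracy(items, always_first))  # 0.0
```

In practice `predict` would wrap a multimodal model call; the point of the hint-free constraint is that nothing in `question` or `options` should let a video-blind predictor beat chance.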

If this is right

  • Video models that succeed on ViMU would demonstrate improved capacity to interpret real-world communications that rely on unspoken layers.
  • The benchmark supplies a standardized way to compare frontier models on their handling of context, style and social experience rather than surface visuals alone.
  • Future video systems will need explicit mechanisms for cultural and social inference in order to perform well on ViMU-style evaluations.
  • Passing ViMU would indicate models can ground interpretations in multimodal evidence instead of relying on disclosed hints.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The benchmark could be extended to test whether the same models handle subtext in static images or audio-only clips with comparable difficulty.
  • Poor performance on ViMU would motivate new training approaches that emphasize implicit reasoning over explicit visual classification.
  • If ViMU questions prove culturally biased, future versions might incorporate diverse viewer perspectives to strengthen the evaluation.

Load-bearing premise

That metaphorical and social subtext in videos can be reliably measured through a fixed set of curated hint-free questions that separate genuine understanding from pattern matching or guessing.

What would settle it

If top models were shown to reach high accuracy on ViMU by exploiting dataset statistics or by guessing, without actually processing the video content or its implicit meanings, that would show the benchmark does not measure the intended capability.
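One standard way to run that falsification test is a blind baseline: score the same questions with the video withheld and measure how much above-chance accuracy survives on question text alone. The sketch below assumes 4-option items; the numbers and the `shortcut_gap` measure are illustrative, not taken from the paper.

```python
def chance_accuracy(num_options: int) -> float:
    # Expected accuracy of uniform random guessing on an n-way item.
    return 1.0 / num_options

def shortcut_gap(acc_full: float, acc_blind: float, n_options: int = 4) -> float:
    """Fraction of the model's above-chance performance that survives when
    the video is withheld and only the question text is shown (a common
    language-prior probe). 0.0 means all signal comes from the video;
    1.0 means none of it does."""
    chance = chance_accuracy(n_options)
    if acc_full <= chance:
        return float("nan")  # model is at or below chance; gap undefined
    return (acc_blind - chance) / (acc_full - chance)

# Hypothetical numbers: a model at 70% with video, 40% question-only.
print(round(shortcut_gap(0.70, 0.40), 2))  # 0.33
```

A gap near zero would support the benchmark's validity; a gap near one would indicate the questions leak their answers, which is exactly the failure mode the "hint-free" design is meant to rule out.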

Figures

Figures reproduced from arXiv: 2605.14607 by Qi Li, Xinchao Wang.

Figure 1
Figure 1. Examples illustrating the large gap between observable content and underlying subtext [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Distribution of rhetorical mechanisms (left) and social value signals (right) in the dataset. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Distribution of evidence sources (left) and target subjects (right) in the dataset. The dataset [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Examples of three types of multiple-choice tasks in ViMU. From top to bottom: [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 5
Figure 5. An example of the open-ended interpretation task in ViMU. MLLMs are asked to interpret [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 6
Figure 6. Evidence grounding analysis. From left to right, we show the trade-off between evidence-selection conservatism and grounding quality, the composition of different error types across models, and the overall distortion in pairwise evidence relations relative to the gold co-occurrence structure. view at source ↗
Figure 7
Figure 7. PCA visualization of model similarity based on error signatures in the macro-5 taxonomy tasks. Each point denotes one model; distances reflect similarity in structured error profiles rather than overall score. [PITH_FULL_IMAGE:figures/full_fig_p007_7.png] view at source ↗
Figure 8
Figure 8. Taxonomy geometry analysis of EG and RM predictions. The top row compares the [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗
Figure 9
Figure 9. Model–option affinity bias without guidance. [PITH_FULL_IMAGE:figures/full_fig_p009_9.png] view at source ↗
Figure 10
Figure 10. Category-wise distribution of guidance-induced shifts in false positive rate (∆FPR). Each violin summarizes the distribution over models for a given category, with rhetoric (green) and social value (red) shown side by side. Points denote model-level values, while markers indicate mean shifts. view at source ↗
Figure 11
Figure 11. An illustration of the dataset curation process. [PITH_FULL_IMAGE:figures/full_fig_p013_11.png] view at source ↗
Figure 12
Figure 12. Model–option affinity bias with guidance. Positive values indicate over-prediction relative [PITH_FULL_IMAGE:figures/full_fig_p022_12.png] view at source ↗
read the original abstract

Any new medium, once it emerges, is used for more than the transmission of overt content alone. The information it carries typically operates on two levels: one is the content directly presented, while the other is the subtext beneath it: the implicit ideas and intentions the creator seeks to convey through the medium. Likewise, since video technologies became widely adopted, video has served not only as a powerful tool for recording and communicating visual information, but also as a vehicle for emotions, attitudes, and social meanings that are often difficult to articulate explicitly. Thus, the true meaning of many videos does not reside solely in what is shown on screen; it is often embedded in context, style of expression, and the viewer's social experience. Some forms of such video subtext are humorous, while others carry irony, mockery, or criticism. These implicit meanings can also be interpreted very differently across cultural backgrounds and social groups. However, most existing video understanding models still focus primarily on literal visual comprehension, such as recognizing objects, actions, or temporal relations, and lack a systematic ability to understand the metaphorical, ironic, and social meanings embedded in videos. To bridge this gap, we introduce ViMU, the first benchmark designed to systematically evaluate the subtext understanding capabilities of frontier models in videos. ViMU assesses whether video understanding models can go beyond literal perception to infer implicit meaning while grounding their interpretations in multimodal evidence and answering both open-ended and multiple-choice questions. Importantly, all questions are designed to be hint-free, ensuring that no key evidence is disclosed to models before answering.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

4 major / 2 minor

Summary. The paper introduces ViMU as the first benchmark for systematically evaluating frontier video understanding models on their ability to infer metaphorical, ironic, and social subtext, using hint-free open-ended and multiple-choice questions that require grounding interpretations in multimodal evidence rather than literal content.

Significance. If the benchmark is constructed with validated questions, human baselines, and demonstrated resistance to shortcuts, it would address a clear gap in video understanding evaluation, moving beyond literal perception tasks to implicit meaning inference across cultural contexts.

major comments (4)
  1. [Abstract] Abstract and introduction: the central claim that ViMU 'systematically evaluate[s] the subtext understanding capabilities' and that 'all questions are designed to be hint-free' is unsupported because the manuscript provides no concrete question examples, video descriptions, or annotation guidelines.
  2. [Benchmark Construction] Benchmark design section: no inter-annotator agreement scores, human performance baselines, or analysis of potential shortcuts (e.g., language priors or visual heuristics) are reported, which are required to establish that the questions distinguish genuine multimodal inference from guessing or pattern matching.
  3. [Experiments] Evaluation and results: the manuscript supplies no model results, comparisons to existing video benchmarks, or falsifiable predictions, leaving the claim that ViMU can assess frontier models without any empirical demonstration.
  4. [Question Design] Question validation: the assertion that questions are 'grounded in evidence' and reliably measure subtext lacks any reported validation procedure or pilot study data, undermining the measurement instrument's validity.
minor comments (2)
  1. [Dataset Statistics] Clarify the exact number of videos and questions in the benchmark and provide a data release plan or link.
  2. [Introduction] Ensure consistent use of terminology such as 'subtext' versus 'implicit meaning' across sections.
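The inter-annotator agreement the referee asks for in major comment 2 is typically reported as Cohen's kappa, which corrects raw agreement for chance. A self-contained sketch with hypothetical annotator labels (ViMU's actual annotations are not public here):

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' labels over the same items:
    (observed agreement - expected chance agreement) / (1 - expected)."""
    assert len(a) == len(b) and a
    n = len(a)
    # Observed agreement: fraction of items where both annotators agree.
    po = sum(x == y for x, y in zip(a, b)) / n
    # Expected chance agreement from each annotator's label frequencies.
    labels = set(a) | set(b)
    pe = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (po - pe) / (1 - pe) if pe < 1 else 1.0

# Hypothetical subtext labels for 8 items from two annotators.
ann1 = ["irony", "irony", "metaphor", "metaphor",
        "irony", "critique", "irony", "metaphor"]
ann2 = ["irony", "metaphor", "metaphor", "metaphor",
        "irony", "critique", "irony", "irony"]
print(round(cohens_kappa(ann1, ann2), 2))  # 0.58
```

For inherently interpretive labels like irony versus metaphor, even moderate kappa values can be informative; the point of reporting them is to show the categories are reproducible rather than annotator-specific.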

Simulated Authors' Rebuttal

4 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments identify important areas where additional transparency and evidence are needed to support the benchmark's validity and utility. We will revise the manuscript to address each point.

read point-by-point responses
  1. Referee: [Abstract] Abstract and introduction: the central claim that ViMU 'systematically evaluate[s] the subtext understanding capabilities' and that 'all questions are designed to be hint-free' is unsupported because the manuscript provides no concrete question examples, video descriptions, or annotation guidelines.

    Authors: We agree that the abstract and introduction would be strengthened by concrete support. In the revision we will insert specific question examples, brief video descriptions, and a summary of the annotation guidelines directly into these sections. revision: yes

  2. Referee: [Benchmark Construction] Benchmark design section: no inter-annotator agreement scores, human performance baselines, or analysis of potential shortcuts (e.g., language priors or visual heuristics) are reported, which are required to establish that the questions distinguish genuine multimodal inference from guessing or pattern matching.

    Authors: We accept that these quantitative validations are necessary. The revised manuscript will report inter-annotator agreement scores, human performance baselines on the full set, and a dedicated analysis of possible shortcuts including language priors and visual heuristics. revision: yes

  3. Referee: [Experiments] Evaluation and results: the manuscript supplies no model results, comparisons to existing video benchmarks, or falsifiable predictions, leaving the claim that ViMU can assess frontier models without any empirical demonstration.

    Authors: The current draft centers on benchmark construction. To provide the requested empirical demonstration we will add, in the revision, results from multiple frontier video models, direct comparisons against existing video benchmarks, and a short discussion of falsifiable predictions. revision: yes

  4. Referee: [Question Design] Question validation: the assertion that questions are 'grounded in evidence' and reliably measure subtext lacks any reported validation procedure or pilot study data, undermining the measurement instrument's validity.

    Authors: We will expand the question-design section to describe the full validation procedure, including pilot-study results and the criteria used to confirm that questions are grounded in multimodal evidence. revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark definition with no derivation chain

full rationale

The paper introduces ViMU as a new benchmark for video subtext understanding without any mathematical derivations, equations, fitted parameters, predictions, or self-citations that reduce the central claim to its own inputs. The contribution consists of benchmark curation and question design, which are presented as definitional rather than derived quantities. No load-bearing steps exist that match the enumerated circularity patterns; the manuscript is self-contained as an empirical evaluation resource.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The benchmark rests on the domain assumption that subtextual meaning can be isolated and tested through carefully constructed hint-free questions; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption Video subtext such as metaphor and irony can be systematically evaluated through hint-free questions grounded in multimodal evidence.
    This premise underpins the entire benchmark design and is stated in the abstract as the motivation for ViMU.

pith-pipeline@v0.9.0 · 5574 in / 1114 out tokens · 36531 ms · 2026-05-15T04:48:43.545861+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 7 internal anchors

  1. [1]

Mythologies (book). https://en.wikipedia.org/wiki/Mythologies_(book)

  2. [2]

Openrouter: Unified api for large language models. https://openrouter.ai

  3. [3]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  4. [4]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

  5. [5]

    Mecd: Unlocking multi-event causal discovery in video reasoning.Advances in neural information processing systems, 37:92554–92580, 2024

    Tieyuan Chen, Huabin Liu, Tianyao He, Yihang Chen, Chaofan Gan, Xiao Ma, Cheng Zhong, Yang Zhang, Yingxue Wang, Hui Lin, et al. Mecd: Unlocking multi-event causal discovery in video reasoning.Advances in neural information processing systems, 37:92554–92580, 2024

  6. [6]

    Mecd+: Unlocking event-level causal graph discovery for video reasoning.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

    Tieyuan Chen, Huabin Liu, Yi Wang, Yihang Chen, Tianyao He, Chaofan Gan, Huanyu He, and Weiyao Lin. Mecd+: Unlocking event-level causal graph discovery for video reasoning.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

  7. [7]

    Looking beyond visible cues: Implicit video question answering via dual-clue reasoning.arXiv preprint arXiv:2506.07811, 2025

Tieyuan Chen, Huabin Liu, Yi Wang, Chaofan Gan, Mingxi Lyu, Ziran Qin, Shijie Li, Liquan Shen, Junhui Hou, Zheng Wang, et al. Looking beyond visible cues: Implicit video question answering via dual-clue reasoning.arXiv preprint arXiv:2506.07811, 2025

  8. [8]

    Video2commonsense: Generating commonsense descriptions to enrich video captioning

    Zhiyuan Fang, Tejas Gokhale, Pratyay Banerjee, Chitta Baral, and Yezhou Yang. Video2commonsense: Generating commonsense descriptions to enrich video captioning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 840–860, 2020

  9. [9]

    Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24108–24118, 2025

  10. [10]

Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1.5-VL technical report.arXiv preprint arXiv:2505.07062, 2025

  11. [11]

    Encoding—decoding (1980)

    Stuart Hall. Encoding—decoding (1980). InCrime and media, pages 44–55. Routledge, 2019

  12. [12]

    Dynamic-superb: Towards a dynamic, collaborative, and comprehensive instruction-tuning benchmark for speech

    Chien-yu Huang, Ke-Han Lu, Shih-Heng Wang, Chi-Yuan Hsiao, Chun-Yi Kuan, Haibin Wu, Siddhant Arora, Kai-Wei Chang, Jiatong Shi, Yifan Peng, et al. Dynamic-superb: Towards a dynamic, collaborative, and comprehensive instruction-tuning benchmark for speech. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)...

  13. [13]

Avmeme exam: A multimodal multilingual multicultural benchmark for llms’ contextual and cultural knowledge and thinking.arXiv preprint arXiv:2601.17645, 2026

    Xilin Jiang, Qiaolin Wang, Junkai Wu, Xiaomin He, Zhongweiyang Xu, Yinghao Ma, Minshuo Piao, Kaiyi Yang, Xiuwen Zheng, Riki Shimizu, et al. Avmeme exam: A multimodal multilingual multicultural benchmark for llms’ contextual and cultural knowledge and thinking.arXiv preprint arXiv:2601.17645, 2026

  14. [14]

    The hateful memes challenge: Detecting hate speech in multimodal memes.Advances in neural information processing systems, 33:2611–2624, 2020

    Douwe Kiela, Hamed Firooz, Aravind Mohan, Vedanuj Goswami, Amanpreet Singh, Pratik Ringshia, and Davide Testuggine. The hateful memes challenge: Detecting hate speech in multimodal memes.Advances in neural information processing systems, 33:2611–2624, 2020

  15. [15]

    Routledge, 2020

    Gunther Kress and Theo Van Leeuwen.Reading images: The grammar of visual design. Routledge, 2020

  16. [16]

    Grant and Cutler, 1994

    Andrew N Leak.Barthes: mythologies. Grant and Cutler, 1994

  17. [17]

    Are vision-language models safe in the wild? a meme-based benchmark study

    DongGeon Lee, Joonwon Jang, Jihae Jeong, and Hwanjo Yu. Are vision-language models safe in the wild? a meme-based benchmark study. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 30533–30576, 2025

  18. [18]

    Mvbench: A comprehensive multi-modal video understanding benchmark

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195–22206, 2024

  19. [19]

    CoLA: A Choice Leakage Attack Framework to Expose Privacy Risks in Subset Training

    Qi Li, Cheng-Long Wang, Yinzhi Cao, and Di Wang. Cola: A choice leakage attack framework to expose privacy risks in subset training.arXiv preprint arXiv:2604.12342, 2026

  20. [20]

    Vid-sme: Membership inference attacks against large video understanding models.Advances in Neural Information Processing Systems, 38:111572– 111596, 2026

    Qi Li, Runpeng Yu, and Xinchao Wang. Vid-sme: Membership inference attacks against large video understanding models.Advances in Neural Information Processing Systems, 38:111572– 111596, 2026

  21. [21]

    Goat-bench: Safety insights to large multimodal models through meme-based social abuse.ACM Transactions on Intelligent Systems and Technology, 2024

    Hongzhan Lin, Ziyang Luo, Bo Wang, Ruichao Yang, and Jing Ma. Goat-bench: Safety insights to large multimodal models through meme-based social abuse.ACM Transactions on Intelligent Systems and Technology, 2024

  22. [22]

    Mmar: A challenging benchmark for deep reasoning in speech, audio, music, and their mix.arXiv preprint arXiv:2505.13032, 2025

    Ziyang Ma, Yinghao Ma, Yanqiao Zhu, Chen Yang, Yi-Wen Chao, Ruiyang Xu, Wenxi Chen, Yuanzhe Chen, Zhuo Chen, Jian Cong, et al. Mmar: A challenging benchmark for deep reasoning in speech, audio, music, and their mix.arXiv preprint arXiv:2505.13032, 2025

  23. [23]

Visualcomet: Reasoning about the dynamic context of a still image

    Jae Sung Park, Chandra Bhagavatula, Roozbeh Mottaghi, Ali Farhadi, and Yejin Choi. Visualcomet: Reasoning about the dynamic context of a still image. InEuropean Conference on Computer Vision, pages 508–524. Springer, 2020

  24. [24]

    MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark

    Sakshi Sakshi, Utkarsh Tyagi, Sonal Kumar, Ashish Seth, Ramaneswaran Selvakumar, Oriol Nieto, Ramani Duraiswami, Sreyan Ghosh, and Dinesh Manocha. Mmau: A massive multi-task audio understanding and reasoning benchmark.arXiv preprint arXiv:2410.19168, 2024

  25. [25]

    What do you meme? generating explanations for visual semantic role labelling in memes

    Shivam Sharma, Siddhant Agarwal, Tharun Suresh, Preslav Nakov, Md Shad Akhtar, and Tanmoy Chakraborty. What do you meme? generating explanations for visual semantic role labelling in memes. InProceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 9763–9771, 2023

  26. [26]

    V-hub: A visual-centric humor understanding benchmark for video llms.arXiv preprint arXiv:2509.25773, 2025

    Zhengpeng Shi, Hengli Li, Yanpeng Zhao, Jianqun Zhou, Yuxuan Wang, Qinrong Cui, Wei Bi, Songchun Zhu, Bo Zhao, and Zilong Zheng. V-hub: A visual-centric humor understanding benchmark for video llms.arXiv preprint arXiv:2509.25773, 2025

  27. [27]

    Vrr-qa: Visual relational reasoning in videos beyond explicit cues, 2026

    Sirnam Swetha, Rohit Gupta, Parth Parag Kulkarni, David G Shatwell, Jeffrey A Chan Santiago, Nyle Siddiqui, Joseph Fioresi, and Mubarak Shah. Vrr-qa: Visual relational reasoning in videos beyond explicit cues, 2026

  28. [28]

    Gemma: Open Models Based on Gemini Research and Technology

    Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on gemini research and technology.arXiv preprint arXiv:2403.08295, 2024

  29. [29]

    Audiobench: A universal benchmark for audio large language models

    Bin Wang, Xunlong Zou, Geyu Lin, Shuo Sun, Zhuohan Liu, Wenyu Zhang, Zhengyuan Liu, AiTi Aw, and Nancy Chen. Audiobench: A universal benchmark for audio large language models. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pa...

  30. [30]

    Towards lifecycle unlearning commitment management: Measuring sample-level unlearning completeness

    Cheng-Long Wang, Qi Li, Zihang Xiang, Yinzhi Cao, and Di Wang. Towards lifecycle unlearning commitment management: Measuring sample-level unlearning completeness. In 34th USENIX Security Symposium (USENIX Security 25), pages 6481–6500, 2025

  31. [31]

    Lvbench: An extreme long video understanding benchmark

    Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Ming Ding, Xiaotao Gu, Shiyu Huang, Bin Xu, et al. Lvbench: An extreme long video understanding benchmark. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 22958– 22967, 2025

  32. [32]

    Can i trust your answer? visually grounded video question answering

    Junbin Xiao, Angela Yao, Yicong Li, and Tat-Seng Chua. Can i trust your answer? visually grounded video question answering. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13204–13214, 2024

  33. [33]

    Funqa: Towards surprising video comprehension

    Binzhu Xie, Sicheng Zhang, Zitang Zhou, Bo Li, Yuanhan Zhang, Jack Hessel, Jingkang Yang, and Ziwei Liu. Funqa: Towards surprising video comprehension. InEuropean Conference on Computer Vision, pages 39–57. Springer, 2024

  34. [34]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  35. [35]

    Discrete diffusion in large language and multimodal models: A survey.arXiv preprint arXiv:2506.13759, 2025

    Runpeng Yu, Qi Li, and Xinchao Wang. Discrete diffusion in large language and multimodal models: A survey.arXiv preprint arXiv:2506.13759, 2025

  36. [36]

    Memereacon: Probing contextual meme understanding in large vision-language models

    Zhengyi Zhao, Shubo Zhang, Yuxi Zhang, Yanxi Zhao, Yifan Zhang, Zezhong Wang, Huimin Wang, Yutian Zhao, Bin Liang, Yefeng Zheng, et al. Memereacon: Probing contextual meme understanding in large vision-language models. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 3559–3582, 2025

  37. [37]

    Mlvu: Benchmarking multi-task long video understanding

Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Zhengyang Liang, Shitao Xiao, Minghao Qin, Xi Yang, Yongping Xiong, Bo Zhang, et al. Mlvu: Benchmarking multi-task long video understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13691–13701, 2025

  38. [38]

    - Do not assume any external posting context beyond the video itself

Transcript from audio ASR (may be empty, noisy, or partial): <transcript> Important constraints: - You must infer visible on-screen text directly from the frames when relevant. - Do not assume any external posting context beyond the video itself. - Separate literal content from intended meaning. - Use only evidence supported by frames, transcript, visible te...

  39. [39]

    Does the question leak the semantic field or sensitive framing?

  40. [40]

    Can the question be answered correctly using only surface-level description?

  41. [41]

    Does it force understanding of the intended meaning?

  42. [42]

    Is the difficulty appropriate given the video’s taxonomy?

  43. [43]

    Is the gold answer aligned with the intended meaning?

  44. [44]

    Is the rubric strong enough for later LLM judging? If the question is flawed, provide a better revised_question that is harder and less revealing while still evaluable. This prompt defines the structured validation step. Iterative Refinement Prompt (Augmented Generation) You are given the annotation for a video. Taxonomy JSON: <taxonomy json> Task:Create ...