MetaphorVU: Towards Metaphorical Video Understanding

Zhuoqun Li, Boxi Cao, Guiping Jiang, Fangrui Lv, Ruotong Pan, Jianan Wang, Xiangyu Wu, Hongyu Lin, Yaojie Lu, Yong Du, Ruyin Jia, Liyan, Tingting Gao, Han Li, Xianpei Han, Le Sun

arxiv: 2605.25461 · v1 · pith:3ZPJERFQnew · submitted 2026-05-25 · 💻 cs.CV

Pith reviewed 2026-06-29 22:41 UTC · model grok-4.3

classification 💻 cs.CV

keywords metaphorical video understanding · multimodal large language models · cross-domain mapping · metaphor knowledge graph · inference-time enhancement · video benchmark · high-order cognition

The pith

Multimodal models lag humans on metaphorical video understanding due to defective cross-domain mapping.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates the first dedicated benchmark for metaphorical video understanding to test high-order cognitive skills in multimodal large language models. It shows these models perform well below human levels, attributing the shortfall mainly to defective cross-domain mapping. The authors respond by building a metaphor knowledge graph for augmentation and introducing MetaphorBoost, an inference-time framework that delivers consistent gains. A sympathetic reader cares because metaphorical videos appear often in everyday communication to express complex ideas, so closing this gap would expand what such models can reliably interpret in real scenarios.

Core claim

We introduce MetaphorVU-Bench as the first systematic benchmark for metaphorical video understanding. Experiments reveal that current MLLMs struggle to interpret metaphorical videos accurately and lag far behind humans, primarily because of defective cross-domain mapping. We construct a metaphor knowledge graph to augment mapping and propose MetaphorBoost, an inference-time enhancement framework that produces consistent performance gains across models.

What carries the argument

MetaphorBoost, an inference-time enhancement framework that uses a metaphor knowledge graph as mapping augmentation to improve cross-domain handling in video understanding.
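As a rough illustration of what inference-time mapping augmentation could look like, the sketch below looks up candidate source-to-target mappings in a toy graph and prepends them to the question as hints. The graph contents, the function names, and the prompt wording are all hypothetical; this is not the paper's implementation or its prompts.

```python
# Hypothetical toy metaphor graph: source concept -> candidate target concepts.
TOY_GRAPH = {
    "lantern": ["hope", "remembrance"],
    "withered tree": ["decline"],
}


def lookup_mappings(keywords, graph):
    """Collect (source, target) pairs for the keywords present in the graph."""
    return [(k, t) for k in keywords for t in graph.get(k, [])]


def build_augmented_prompt(question, keywords, graph):
    """Prepend retrieved mappings to the question as optional hints."""
    pairs = lookup_mappings(keywords, graph)
    if not pairs:
        return question  # nothing retrieved: fall back to the plain prompt
    hints = "\n".join(f"- {src} may stand for {tgt}" for src, tgt in pairs)
    return f"Possible cross-domain mappings (hints only):\n{hints}\n\n{question}"
```

The model weights are never touched in this flow; only the prompt changes, which is what makes the approach an inference-time one.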

If this is right

Inference-time augmentation with knowledge graphs can raise MLLM performance on high-order video tasks without retraining.
Cross-domain mapping defects represent a central bottleneck that targeted augmentation can partially relieve.
The benchmark supplies a concrete yardstick for tracking progress on metaphorical and abstract reasoning in multimodal models.
Real-world applicability of MLLMs expands once they handle metaphorical videos more reliably.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same mapping-augmentation tactic could apply to other abstract or figurative reasoning tasks involving images or text.
Widespread use of the benchmark might push model developers toward architectures that handle cross-domain relations more natively.
Hybrid neural-symbolic systems could become more common if inference-time graph lookup proves effective across domains.

Load-bearing premise

The newly constructed benchmark accurately measures high-order cognitive capabilities for metaphorical video understanding and the observed gap stems from defective cross-domain mapping rather than benchmark artifacts or other factors.

What would settle it

A model that reaches near-human accuracy on the benchmark without any cross-domain mapping augmentation, or a demonstration that benchmark scores fail to correlate with separate human judgments of metaphorical content.
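The second test could be run as a rank correlation between benchmark scores and independent human ratings of the same outputs. The sketch below computes Spearman's rank correlation with the standard library only; the function names are illustrative, and how the human ratings would be collected is left open.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5  # undefined if either input is constant
```

A value near zero on a reasonably sized sample would suggest the benchmark score is tracking something other than what human raters perceive as metaphorical understanding.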

Figures

Figures reproduced from arXiv: 2605.25461 by Boxi Cao, Fangrui Lv, Guiping Jiang, Han Li, Hongyu Lin, Jianan Wang, Le Sun, Liyan, Ruotong Pan, Ruyin Jia, Tingting Gao, Xiangyu Wu, Xianpei Han, Yaojie Lu, Yong Du, Zhuoqun Li.

**Figure 1.** Metaphorical videos are prevalent across various real-world scenarios to convey many complex ideas, and metaphorical video understanding requires high-order cognitive capabilities.

Source excerpt: "… 1984; Zhang, 2021; Alnajjar et al., 2022). According to multimodal metaphor theory, human understanding of metaphorical videos is a high-order cognitive process that transforms perceived signals into deeper semantics, with the…"

**Figure 2.** MetaphorVU-Bench contains 8 types of video metaphor, enabling systematic evaluation of metaphorical video understanding. Note that most videos simultaneously contain multiple types of metaphor; we show only the dominant one in each case for illustration.

Source excerpt: "Cultural Symbol. Video conveys implicit meanings by symbolism of cultural artifacts, such as flying China Kongming lanterns or building a Christianity c…"

**Figure 3.** We construct MetaphorVU-Bench by using a real-world short-video platform as source, selecting metaphorical videos from a large-scale video pool through multi-stage filtration, and manually annotating video metaphor interpretations with rigorous quality control. MetaphorVU-Bench can effectively support systematic and comprehensive evaluation of metaphorical video understanding.
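The multi-stage filtration in Figure 3 can be pictured as a funnel in which each stage keeps a subset of what the previous stage passed. The sketch below is a generic funnel runner; the stage order (LLM, then MLLM, then human, following the prompt names in Figures 9 to 11), the record fields, and the thresholds are assumptions made for illustration only.

```python
def run_filtration(pool, stages):
    """Apply named keep/drop stages in order and record how many items survive each."""
    survivors = list(pool)
    funnel = [("pool", len(survivors))]
    for name, keep in stages:
        survivors = [item for item in survivors if keep(item)]
        funnel.append((name, len(survivors)))
    return survivors, funnel


# Hypothetical stand-ins for the LLM, MLLM, and human checks.
STAGES = [
    ("llm_filter", lambda v: "metaphor_hint" in v["title_tags"]),
    ("mllm_filter", lambda v: v["visual_score"] >= 0.5),
    ("human_review", lambda v: v["approved"]),
]

# Hypothetical toy pool of candidate videos.
POOL = [
    {"id": 1, "title_tags": ["metaphor_hint"], "visual_score": 0.9, "approved": True},
    {"id": 2, "title_tags": [], "visual_score": 0.9, "approved": True},
    {"id": 3, "title_tags": ["metaphor_hint"], "visual_score": 0.2, "approved": True},
    {"id": 4, "title_tags": ["metaphor_hint"], "visual_score": 0.7, "approved": False},
]

survivors, funnel = run_filtration(POOL, STAGES)
```

Recording the funnel counts alongside the survivors makes it easy to report how selective each stage was, which bears on the benchmark-artifact concern raised later in the review.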

**Figure 4.** Benchmark covers diverse video topics, enabling accurate evaluation of real-world metaphorical video understanding.

Source excerpt: "Reliable Manual Annotation. Since video metaphor interpretation is a flexible text, different annotators may produce varying linguistic styles and formats. Although these interpretations may all be substantively correct, such subjectivity and format inconsistency make it difficult to condu…"

**Figure 5.** Models perform worse on subsets requiring more cross-domain mapping, which supports the importance of mapping augmentation.

Source excerpt: "…[reinforce]ment learning optimized for recognition and description, such as VideoRFT and Vision-R1, only achieve marginal improvements over base model Qwen2.5-VL-Instruct. 3.3. Detailed Analysis. Error Analysis. To investigate the core deficiencies of MLLMs in detail, we manually observe and identify 4 common…"

**Figure 6.** We construct a metaphorical knowledge graph and then propose MetaphorBoost, improving MLLM performance on metaphorical video understanding via mapping augmentation.

Source excerpt: "where $N^{h}_{G}(k_i)$ denotes the nodes within $h$ hops from keyword $k_i$ in the metaphorical knowledge graph, $\deg(\cdot, K)$ represents the number of edges linking a target concept to the source keywords, and $R$ is the resulting set. Finally, with retrieved c…"
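The retrieval quantities named in the Figure 6 text (an h-hop neighborhood around each keyword and a degree count against the keyword set K) can be sketched as follows. The selection rule itself is cut off in the source, so the rule used here (rank the union of h-hop neighborhoods by keyword degree and keep the top few) is an assumption, as are the toy graph and all function names.

```python
from collections import deque

# Hypothetical undirected toy graph stored as symmetric adjacency sets.
TOY_KG = {
    "lantern": {"hope", "festival"},
    "night sky": {"hope", "vastness"},
    "hope": {"lantern", "night sky", "future"},
    "festival": {"lantern"},
    "vastness": {"night sky"},
    "future": {"hope"},
}


def nodes_within_hops(graph, start, h):
    """Breadth-first search: nodes reachable from `start` in at most h hops."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == h:
            continue  # do not expand beyond the hop limit
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    seen.discard(start)
    return seen


def degree_to_keywords(graph, concept, keywords):
    """Count the keywords that share an edge with `concept`."""
    return sum(1 for k in keywords if concept in graph.get(k, ()))


def retrieve(graph, keywords, h=2, top_n=3):
    """Assumed rule: rank the union of h-hop neighborhoods by keyword degree."""
    candidates = set()
    for k in keywords:
        candidates |= nodes_within_hops(graph, k, h)
    candidates -= set(keywords)
    ranked = sorted(
        candidates,
        key=lambda c: (-degree_to_keywords(graph, c, keywords), c),
    )
    return ranked[:top_n]
```

Under this assumed rule, a concept linked to several detected keywords at once (here "hope") outranks concepts that are merely near one of them, which is one plausible reading of why a degree term appears in the description.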

**Figure 7.** The number of all three kinds of bad mapping decreases, showing that MetaphorBoost can effectively enhance cross-domain mapping.

Source excerpt: "… “w/o external augmentation” means querying the MLLM itself for augmentation instead of using external knowledge. The performance drops compared to MetaphorBoost, indicating that external knowledge helps compensate for MLLMs' deficiency in cross-domain mapping. Knowledge graph provide…"

**Figure 8.** The green, orange, and blue highlights indicate missing mapping, superficial mapping, and improper mapping respectively; these deficiencies collectively lead to poor metaphorical video interpretation. MetaphorBoost effectively mitigates the three types of deficiencies, thereby improving MLLM performance on metaphorical video understanding.

**Figure 9.** Prompt for LLM filtration.

**Figure 10.** Prompt for MLLM filtration.

**Figure 11.** Prompt for human filtration.

**Figure 12.** Manual annotation guideline.

**Figure 13.** Prompt for evaluation.

**Figure 14.** Prompt for LLM judge.

**Figure 15.** Prompt for extracting metaphorical concept pairs.

**Figure 16.** Prompt for identifying visual elements.

Source excerpt: {video} {title} prompt = \ """<< Instruction >> Analyze the metaphorical logic in this video, i.e., what ideas are implicitly expressed through the content presented. {title} << Requirements >> 1. Thoroughly identify all video content that contains metaphors. 2. Analyze the underlying ideas of the metaphors deeply and accurately. 3. You may refer to the external kno…

**Figure 17.** Prompt for generating video metaphor interpretation.

**Figure 18.** Examples of Body Language. Note that most videos simultaneously contain multiple types of metaphor; we show only the dominant one in each case for convenient illustration.

**Figure 19.** Examples of Atmosphere Language. Note that most videos simultaneously contain multiple types of metaphor; we show only the dominant one in each case for convenient illustration.

**Figure 20.** Examples of Cultural Symbol. Note that most videos simultaneously contain multiple types of metaphor; we show only the dominant one in each case for convenient illustration.

**Figure 21.** Examples of Naturalistic Symbol. Note that most videos simultaneously contain multiple types of metaphor; we show only the dominant one in each case for convenient illustration.

**Figure 22.** Examples of Causal Montage. Note that most videos simultaneously contain multiple types of metaphor; we show only the dominant one in each case for convenient illustration.

**Figure 23.** Examples of Analogical Montage. Note that most videos simultaneously contain multiple types of metaphor; we show only the dominant one in each case for convenient illustration.

**Figure 24.** Examples of Surreal Narrative. Note that most videos simultaneously contain multiple types of metaphor; we show only the dominant one in each case for convenient illustration.

**Figure 25.** Examples of Performative Narrative. Note that most videos simultaneously contain multiple types of metaphor; we show only the dominant one in each case for convenient illustration.

Original abstract

Metaphorical videos are prevalent across various real-world scenarios to convey complex ideas, and understanding them typically requires high-order cognitive capabilities. The lack of systematic studies on metaphorical video understanding not only constrains the real-world applicability of MLLMs but also impedes the thorough assessment of their high-order cognitive capabilities. To bridge this gap, we propose MetaphorVU-Bench, the first systematic and comprehensive benchmark dedicated to metaphorical video understanding. Through experiments, we find current MLLMs struggle with accurate metaphorical video understanding, lagging far behind human level, primarily due to defective cross-domain mapping. Motivated by this finding, we construct a metaphor knowledge graph as mapping augmentation and propose MetaphorBoost, an inference-time enhancement framework achieving consistent performance improvement. Our benchmark, analysis, and method provide useful insights and a foundation for future research on advancing MLLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

MetaphorVU-Bench is the first dedicated test for metaphorical video understanding in MLLMs, but the claim that failures stem primarily from defective cross-domain mapping lacks the controls needed to hold up.

The main thing here is a new benchmark called MetaphorVU-Bench for testing how well MLLMs understand metaphorical videos, plus a simple inference-time fix using a metaphor knowledge graph.

It does a solid job of identifying an area where current models fall short of humans on something that matters in practice. Metaphorical videos show up in lots of places, and the high-order mapping they require is a good test of whether models are doing more than surface matching.

The soft spot is the causal story. The paper says the struggle comes primarily from defective cross-domain mapping, but without controls like literal baselines or a detailed error analysis, it's difficult to rule out other explanations such as basic multimodal alignment problems or issues with how the benchmark items were chosen. The abstract doesn't lay out the dataset construction or statistical checks either, so the evidence for that specific diagnosis is limited right now.

The MetaphorBoost approach is a direct response to the observed failures and reportedly improves results, which is worth noting even if the details are thin in the summary.

This kind of work is for people tracking progress on MLLM reasoning capabilities beyond literal description. A reader who wants to see new test sets for figurative understanding will find the benchmark useful to examine, and the method gives a baseline to build on.

I'd recommend sending it to peer review. The new benchmark and the reported gap are substantive enough to merit referee input on the validation steps, even though the current write-up leaves some questions about what exactly is being measured.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces MetaphorVU-Bench as the first systematic benchmark for metaphorical video understanding, reports that current MLLMs lag substantially behind human performance primarily due to defective cross-domain mapping, and presents MetaphorBoost, an inference-time framework that augments inputs via a constructed metaphor knowledge graph to yield consistent gains.

Significance. If the benchmark isolates cross-domain mapping as claimed and the performance gap is shown to arise from that specific deficit rather than other factors, the work would supply a needed evaluation resource and a practical augmentation technique for advancing MLLM capabilities on high-order, non-literal multimodal reasoning tasks.

major comments (2)

[Abstract] The central diagnosis that the observed gap is 'primarily due to defective cross-domain mapping' is load-bearing for the motivation of the knowledge-graph augmentation, yet the abstract (and the reader's available description) provides no controls such as matched literal-video baselines, temporal-vs-mapping ablations, or error analysis that would distinguish mapping failure from general multimodal deficits or benchmark artifacts.
[Benchmark construction and evaluation sections] The claim that MetaphorVU-Bench accurately measures high-order metaphorical capabilities requires explicit reporting of inter-annotator agreement on metaphor interpretations and evidence that items demand cross-domain mapping rather than other video or language skills; without these, the performance gap cannot be confidently attributed to the stated cause.
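For the agreement statistic requested above, one common choice is Cohen's kappa. The sketch below assumes two annotators each assign one categorical label per video (for example, the dominant metaphor type); that labeling setup is an assumption, and free-form interpretations would need a different measure.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)
```

Kappa corrects raw agreement for the agreement expected by chance, so two annotators who agree on half the items with balanced labels score 0 rather than 0.5.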

minor comments (1)

[Abstract] The abstract states 'consistent performance improvement' without reporting effect sizes, statistical significance, or comparison to stronger baselines.
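One way to back a claim of consistent improvement is a paired bootstrap over per-item scores. The sketch below computes a percentile confidence interval for the mean per-item gain; the score lists, the number of resamples, and the function name are illustrative assumptions rather than anything reported in the paper.

```python
import random


def paired_bootstrap_ci(base_scores, boosted_scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-item gain (boosted - base)."""
    diffs = [b - a for a, b in zip(base_scores, boosted_scores)]
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    means = []
    for _ in range(n_resamples):
        # Resample items with replacement, keeping each base/boosted pair together.
        sample = [rng.choice(diffs) for _ in diffs]
        means.append(sum(sample) / len(sample))
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / len(diffs), lo, hi
```

If the interval excludes zero for each model tested, the word "consistent" has quantitative support; if it straddles zero for some models, the gain is better described as mixed.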

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful comments. We agree that the abstract and benchmark sections can be strengthened to better support the central claims, and we outline specific revisions below.

Referee: [Abstract] The central diagnosis that the observed gap is 'primarily due to defective cross-domain mapping' is load-bearing for the motivation of the knowledge-graph augmentation, yet the abstract (and the reader's available description) provides no controls such as matched literal-video baselines, temporal-vs-mapping ablations, or error analysis that would distinguish mapping failure from general multimodal deficits or benchmark artifacts.

Authors: We acknowledge that the abstract does not explicitly reference the supporting analyses. The full manuscript includes an error analysis (Section 4.3) that breaks down model failures and attributes the majority to cross-domain mapping issues rather than general multimodal or temporal deficits. We will revise the abstract to concisely note this error analysis and the absence of comparable gaps on literal controls where applicable. Adding a full literal-video baseline would require new data collection and is not feasible within the current scope, but we will clarify the design rationale for focusing on metaphorical items.

revision: partial

Referee: [Benchmark construction and evaluation sections] The claim that MetaphorVU-Bench accurately measures high-order metaphorical capabilities requires explicit reporting of inter-annotator agreement on metaphor interpretations and evidence that items demand cross-domain mapping rather than other video or language skills; without these, the performance gap cannot be confidently attributed to the stated cause.

Authors: We will add explicit inter-annotator agreement statistics for the metaphor interpretation annotations in the revised benchmark construction section. The items were curated through a multi-stage process with domain experts to ensure they require cross-domain mapping (detailed in Section 3.2); we will expand the annotation guidelines and item selection criteria to provide clearer evidence distinguishing these from general video comprehension skills.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

The paper introduces MetaphorVU-Bench as a new benchmark and reports empirical performance gaps for MLLMs, then motivates and evaluates MetaphorBoost (knowledge-graph augmentation at inference time) as an improvement. No equations, fitted parameters renamed as predictions, self-definitional mappings, or load-bearing self-citations appear in the abstract or described chain. The central claims rest on experimental results rather than reducing to the inputs by construction. This is the expected non-finding for a benchmark-plus-method paper whose improvements are externally measurable.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no information on free parameters, axioms, or invented entities; ledger left empty.

pith-pipeline@v0.9.1-grok · 5720 in / 1097 out tokens · 36998 ms · 2026-06-29T22:41:05.173475+00:00 · methodology


[1] [1]

LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training

An, X., Xie, Y ., Yang, K., Zhang, W., Zhao, X., Cheng, Z., Wang, Y ., Xu, S., Chen, C., Zhu, D., et al. Llava- onevision-1.5: Fully open framework for democratized multimodal training.arXiv preprint arXiv:2509.23661,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Qwen3-VL Technical Report

Bai, S., Cai, Y ., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., Ge, W., Guo, Z., Huang, Q., Huang, J., Huang, F., Hui, B., Jiang, S., Li, Z., Li, M., Li, M., Li, K., Lin, Z., Lin, J., Liu, X., Liu, J., Liu, C., Liu, Y ., Liu, D., Liu, S., Lu, D., Luo, R., Lv, C., Men, R., Meng, L., Ren, X., Ren, X., Song, S., Sun, Y ., Tan...

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

F., Tevissen, Y ., Guetari, K., and Yacoubi, M

Brkic, M., Razzouki, A. F., Tevissen, Y ., Guetari, K., and Yacoubi, M. A. E. Frame sampling strategies matter: A benchmark for small vision language models.arXiv preprint arXiv:2509.14769,

work page arXiv

[4] [4]

Flute: Figurative language understanding through textual explanations

Chakrabarty, T., Saakyan, A., Ghosh, D., and Muresan, S. Flute: Figurative language understanding through textual explanations. InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 7139–7159,

2022

[5] [5]

Looking beyond visible cues: Implicit video ques- tion answering via dual-clue reasoning.arXiv preprint arXiv:2506.07811,

Chen, T., Liu, H., Wang, Y ., Gan, C., Lyu, M., Zou, G., and Lin, W. Looking beyond visible cues: Implicit video ques- tion answering via dual-clue reasoning.arXiv preprint arXiv:2506.07811,

work page arXiv

[6] [6]

Scivideobench: Bench- marking scientific video reasoning in large multimodal models.arXiv preprint arXiv:2510.08559,

Deng, A., Yang, T., Yu, S., Spencer, L., Bansal, M., Chen, C., Yeung-Levy, S., and Wang, X. Scivideobench: Bench- marking scientific video reasoning in large multimodal models.arXiv preprint arXiv:2510.08559,

work page arXiv

[7] [7]

A survey on in- context learning

Dong, Q., Li, L., Dai, D., Zheng, C., Ma, J., Li, R., Xia, H., Xu, J., Wu, Z., Chang, B., et al. A survey on in- context learning. InProceedings of the 2024 conference on empirical methods in natural language processing, pp. 1107–1128,

2024

[8] Google. Gemini-2.5-pro system card, 2025a. URL https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-2-5-Pro-Model-Card.pdf.

Google. Gemini-3-pro system card, 2025b. URL https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Pro-Model-Card.pdf.

Guo, D., Wu, F., Zhu, F., Leng, F., Shi, G., Chen, H., Fan, H., Wang, J., Jiang, J., ... Seed1.5-VL Technical Report.

[9] Hu, K., Wu, P., Pu, F., Xiao, W., Zhang, Y., Yue, X., Li, B., and Liu, Z. Video-MMMU: Evaluating knowledge acquisition from multi-discipline professional videos. arXiv preprint arXiv:2501.13826.

[10] Huang, W., Jia, B., Zhai, Z., Cao, S., Ye, Z., Zhao, F., Xu, Z., Hu, Y., and Lin, S. Vision-R1: Incentivizing reasoning capability in multimodal large language models. arXiv preprint arXiv:2503.06749.

[11] Kalarani, A. R., Bhattacharyya, P., and Shekhar, S. Unveiling the invisible: Captioning videos with metaphors. In Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 6306–6320, 2024.

[12] Kundu, M., Shekhar, S., and Bhattacharyya, P. Looking beyond the pixels: Evaluating visual metaphor understanding in vlms. In Findings of the Association for Computational Linguistics: EMNLP 2025, pp. 23137–23158, 2025.

[13] Li, Z., Lin, H., Lu, Y., Xiang, H., Han, X., and Sun, L. Meta-cognitive analysis: Evaluating declarative and procedural knowledge in datasets and large language models. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pp. 11222–11228, 2024.

[14] Li, Z., Chen, X., Lin, H., Lu, Y., Han, X., and Sun, L. Paperregister: Boosting flexible-grained paper search via hierarchical register indexing. arXiv preprint arXiv:2508.11116, 2025a.

Li, Z., Chen, X., Yu, H., Lin, H., Lu, Y., Tang, Q., Huang, F., Han, X., Sun, L., and Li, Y. Structrag: Boosting knowledge intensive reasoning of llms via inference-time...

[15] Liu, B., Qiao, P., Ma, M., Zhang, X., Tang, Y., Xu, P., Liu, K., and Yuan, T. Surveillancevqa-589k: A benchmark for comprehensive surveillance video-language understanding with large models. arXiv preprint arXiv:2505.12589.

[16] Mayfield, J., Yang, E., Lawrie, D., MacAvaney, S., McNamee, P., Oard, D. W., Soldaini, L., Soboroff, I., Weller, O., Kayi, E., et al. On the evaluation of machine-generated reports. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1904–1915, 2024.

[17] Qian, W., Hu, Z., Song, Z., and Li, J. Concept drift guided layernorm tuning for efficient multimodal metaphor identification. In Proceedings of the 2025 International Conference on Multimedia Retrieval, pp. 1100–1108, 2025.

[18] Saakyan, A., Kulkarni, S., Chakrabarty, T., and Muresan, S. Understanding figurative meaning through explainable visual entailment. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 1–23, 2025.

[19] Swetha, S., Gupta, R., Kulkarni, P. P., Shatwell, D. G., Santiago, J. A. C., Siddiqui, N., Fioresi, J., and Shah, M. Implicitqa: Going beyond frames towards implicit video reasoning. arXiv preprint arXiv:2506.21742.

[20] GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning. URL https://arxiv.org/abs/2507.01006.

Tian, Y., Zhang, R., Xu, N., and Mao, W. Bridging word-pair and token-level metaphor detection with explainable domain mining. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 13311–13325.

[21] Wang, P., Zhang, Y., Fei, H., Chen, Q., Wang, Y., Si, J., Lu, W., Li, M., and Qin, L. S3 agent: Unlocking the power of vllm for zero-shot multi-modal sarcasm detection. ACM Transactions on Multimedia Computing, Communications and Applications, 21(11):1–16, 2025a.

Wang, Q., Yu, Y., Yuan, Y., Mao, R., and Zhou, T. Videorft: Incentivizing video reasonin...

[22]

A. Theoretical Basis for Video Metaphor Taxonomy

To ensure reliable and principled evaluation, a systematic video metaphor taxonomy is essential for building the benchmark. Since no prior work has explored this kind of taxonomy, we draw on multimodal metaphor theory (Forceville et al., 2009; Forcev...

[23]

virtual performance

and its extensions in the video field (Bordwell, 2013b; Stam, 2017; Schechner, 2017; Chandler, 2022) to design the first systematic video metaphor taxonomy. The details are as follows. According to film mise-en-scène theory (Bordwell et al., 2004; Gibbs & Gibbs, 2002; Arnheim, 1957), video metaphors can be realized through visual element arr...

[24]

C.2. Prompt for LLM Judge

Since the output video metaphor interpretation in MetaphorVU-Bench is free-form text, rule-based metrics can hardly produce scores that align with human judgment (Mayfield et al., 2024; Li et al., 2025d). To this end, we follow the metrics in previous free-form QA evaluation works (Li et al., 2025b;e; Yu et al., 2025; L...

[25]

Note that a portion of the data was originally in Chinese. To ensure the universality of the metaphorical knowledge graph, we use GPT-5 to translate the original text into English.

D.2. Prompt for Extracting Metaphorical Concept Pairs

Since several previous works that have been widely recognized by the community have demonstrated that current LLMs possess...