v-HUB: A Benchmark for Video Humor Understanding from Vision and Sound

Bo Zhao; Jianqun Zhou; Qinrong Cui; Songchun Zhu; Wei Bi; Yanpeng Zhao; Yuxuan Wang; Zhengpeng Shi; Zilong Zheng

arxiv: 2509.25773 · v3 · pith:5TV6C3K4new · submitted 2025-09-30 · 💻 cs.CV · cs.AI · cs.CL

Zhengpeng Shi, Yanpeng Zhao, Jianqun Zhou, Yuxuan Wang, Qinrong Cui, Wei Bi, Songchun Zhu, Bo Zhao, Zilong Zheng

classification 💻 cs.CV cs.AI cs.CL

keywords humor understanding video v-hub mllms audio benchmark comprehending

AI models capable of comprehending humor hold real-world promise -- for example, enhancing engagement in human-machine interactions. To gauge and diagnose the capacity of multimodal large language models (MLLMs) for humor understanding, we introduce v-HUB, a novel video humor understanding benchmark. v-HUB comprises a curated collection of non-verbal short videos, reflecting real-world scenarios where humor can be appreciated purely through visual cues. We pair each video clip with rich annotations to support a variety of evaluation tasks and analyses, including a novel study of environmental sound that can enhance humor. To broaden its applicability, we construct an open-ended QA task, making v-HUB readily integrable into existing video understanding task suites. We evaluate a diverse set of MLLMs, from specialized Video-LLMs to versatile OmniLLMs that can natively process audio, covering both open-source and proprietary domains. The experimental results expose the difficulties MLLMs face in comprehending humor from visual cues alone. Our findings also demonstrate that incorporating audio helps with video humor understanding, highlighting the promise of integrating richer modalities for complex video understanding tasks.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ViMU: Benchmarking Video Metaphorical Understanding
cs.CV 2026-05 unverdicted novelty 8.0

ViMU is the first benchmark for evaluating video models on metaphorical and subtextual understanding using hint-free questions grounded in multimodal evidence.