"I See What You Did There": Can Large Vision-Language Models Understand Multimodal Puns?
2 Pith papers cite this work.
abstract
Puns are a common form of rhetorical wordplay that exploits polysemy and phonetic similarity to create humor. In multimodal puns, visual and textual elements synergize to ground the literal sense and evoke the figurative meaning simultaneously. Although Vision-Language Models (VLMs) are widely used in multimodal understanding and generation, their ability to understand puns has not been systematically studied due to a scarcity of rigorous benchmarks. To address this, we first propose a multimodal pun generation pipeline. We then introduce MultiPun, a dataset comprising diverse types of puns alongside adversarial non-pun distractors. Our evaluation reveals that most models struggle to distinguish genuine puns from these distractors. Moreover, we propose both prompt-level and model-level strategies to enhance pun comprehension, with an average improvement of 16.5% in F1 scores. Our findings provide valuable insights for developing future VLMs that master the subtleties of human-like humor via cross-modal reasoning.
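The headline metric for the proposed strategies is F1 on pun-vs-distractor classification. As a minimal sketch of how such a score is computed (the gold labels, predictions, and helper function below are invented for illustration, not taken from the paper):

```python
# Hypothetical example: scoring a VLM's pun-vs-distractor predictions with F1.
# Labels: 1 = genuine pun, 0 = adversarial non-pun distractor.

def f1_score(gold, pred):
    """Binary F1: harmonic mean of precision and recall on the positive class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

gold = [1, 1, 1, 0, 0, 0]   # three puns, three distractors (invented data)
pred = [1, 1, 0, 1, 0, 0]   # model misses one pun and flags one distractor
print(round(f1_score(gold, pred), 3))
```

Reporting F1 rather than raw accuracy matters here because the adversarial distractors make the task a balanced detection problem: a model that labels everything a pun would score high recall but poor precision.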
citing papers
- Beyond End-to-End: Dynamic Chain Optimization for Private LLM Adaptation on the Edge
  ChainFed achieves memory-efficient private LLM fine-tuning on edge devices through sequential layer-by-layer adapter training with dynamic co-tuning, perceptive optimization, and adaptive starting point selection, improving accuracy by up to 46.46%.
- Spotlight and Shadow: Attention-Guided Dual-Anchor Introspective Decoding for MLLM Hallucination Mitigation
  DaID mitigates MLLM hallucinations by attention-guided selection of dual layers that calibrate token generation using internal perceptual discrepancies.