Concept Drift Guided LayerNorm Tuning for Efficient Multimodal Metaphor Identification

Jia Li; Wenhao Qian; Zhenzhen Hu; Zijie Song

arxiv: 2505.11237 · v4 · submitted 2025-05-16 · 💻 cs.MM · cs.LG

Pith reviewed 2026-05-22 15:15 UTC · model grok-4.3

keywords multimodal metaphor identification · concept drift · layer norm tuning · CLIP embeddings · SLERP interpolation · MET-Meme benchmark · efficient multimodal learning · figurative language understanding

The pith

CDGLT identifies multimodal metaphors by applying SLERP to cross-modal CLIP embeddings to form a drifted concept, and reports state-of-the-art results on MET-Meme at lower training cost than generative baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Concept Drift Guided LayerNorm Tuning (CDGLT) as a training-efficient way to handle multimodal metaphor identification, where models must connect literal image-text features to implied figurative meanings in memes. It creates a drifted concept by applying spherical linear interpolation to cross-modal embeddings from a CLIP encoder, which narrows the gap between surface-level and abstract interpretations. The framework also adapts prompt-based feature extraction and fusion from pre-trained language models to suit this task. If the approach holds, it would deliver higher accuracy on the MET-Meme benchmark while avoiding the heavy compute demands of methods that rely on large language or text-to-image generators. Readers would care because it points toward practical systems that can parse creative, unconventional expressions in everyday online media without massive resources.

Core claim

The paper claims that Concept Drift Guided LayerNorm Tuning (CDGLT) narrows the literal-to-figurative gap in multimodal metaphor identification. It generates a divergent concept embedding through SLERP of CLIP cross-modal features, then uses that embedding to guide prompt construction and LayerNorm tuning of a pre-trained language model. The paper reports state-of-the-art performance on the MET-Meme benchmark together with substantially lower training costs than generative alternatives. According to the authors, ablation experiments show that both the drifted-concept step and the adapted tuning contribute to the gains.

What carries the argument

Concept Drift, the SLERP-based interpolation of CLIP cross-modal embeddings that produces a divergent concept embedding to bridge literal features and the figurative task.
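As an illustration of the interpolation step, here is a minimal sketch of spherical linear interpolation between two embedding vectors, using only the Python standard library. The function name `slerp`, the toy three-dimensional vectors, and the choice of which embedding plays the role of `u` or `v` are assumptions made for this example. The value `alpha=0.8` is the one quoted from the paper further down this page, but how the paper orients that weight is not stated here. This is not the authors' implementation.

```python
import math


def slerp(u, v, alpha):
    """Spherical linear interpolation between vectors u and v.

    alpha=0 returns the direction of u, alpha=1 the direction of v.
    Inputs are L2-normalised first, so the result lies on the unit sphere.
    """
    def normalise(x):
        norm = math.sqrt(sum(c * c for c in x))
        return [c / norm for c in x]

    u, v = normalise(u), normalise(v)
    # Clamp the dot product to guard against rounding just outside [-1, 1].
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(u, v))))
    theta = math.acos(dot)
    if theta < 1e-6:
        # Nearly parallel vectors: fall back to plain linear interpolation.
        return normalise([(1 - alpha) * a + alpha * b for a, b in zip(u, v)])
    if theta > math.pi - 1e-6:
        # Antiparallel vectors have no unique great-circle path.
        raise ValueError("slerp is ill-defined for antiparallel vectors")
    w_u = math.sin((1 - alpha) * theta) / math.sin(theta)
    w_v = math.sin(alpha * theta) / math.sin(theta)
    return [w_u * a + w_v * b for a, b in zip(u, v)]


# Toy stand-ins for a CLIP image embedding and a CLIP text embedding.
image_emb = [1.0, 0.0, 0.0]
text_emb = [0.0, 1.0, 0.0]
drifted = slerp(image_emb, text_emb, alpha=0.8)
```

Unlike a straight weighted average, the interpolated vector keeps unit length, which matters when downstream components compare embeddings by cosine similarity.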

If this is right

The method reaches state-of-the-art accuracy on the MET-Meme benchmark for multimodal metaphor identification.
Training costs drop significantly compared with generative approaches that depend on large language or text-to-image models.
Ablation results isolate the separate benefits of the Concept Drift step and the adapted LayerNorm tuning strategy.
The framework advances practical, lower-cost systems for interpreting unconventional multimodal expressions such as internet memes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The SLERP drift technique might be tested on related figurative tasks such as sarcasm or humor detection that also mix images and text.
Lower training demands could allow deployment of metaphor-aware tools on resource-limited platforms or in real-time moderation settings.
Similar drift mechanisms could be explored in other domains where literal and intended meanings diverge, for example in cross-cultural communication analysis.

Load-bearing premise

That interpolating CLIP embeddings with SLERP yields a drifted concept which actually helps close the gap between literal features and figurative understanding in metaphor identification.

What would settle it

If CDGLT is evaluated on the MET-Meme benchmark and either fails to exceed prior accuracy or shows no clear reduction in training cost relative to generative baselines, the central performance and efficiency claims would not hold.

Figures

Figures reproduced from arXiv: 2505.11237 by Jia Li, Wenhao Qian, Zhenzhen Hu, Zijie Song.

**Figure 1.** Concept Drift Phenomenon. Whether memes are metaphorical is closely related to the embedded text. (a) Before …

**Figure 2.** The architecture of CDGLT which is implemented with feature extraction, Concept Drift modeling, and LN tuning of …

**Figure 3.** Trend of accuracy and weighted F1-score with …

**Figure 4.** t-SNE visualizations of CLIP image, text and SLERP …
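Figure 3 tracks accuracy and weighted F1-score. For readers unfamiliar with the second metric, the sketch below computes both from scratch. The labels are invented for illustration (1 = metaphorical, 0 = literal) and are not drawn from MET-Meme; the function names are likewise hypothetical.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n - tp
        # F1 = 2*TP / (2*TP + FP + FN); the denominator is positive because n > 0.
        f1 = 2 * tp / (2 * tp + fp + fn)
        total += n * f1
    return total / len(y_true)


# Invented toy labels: three metaphorical samples, five literal ones.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0]

acc = accuracy(y_true, y_pred)
wf1 = weighted_f1(y_true, y_pred)
```

On this toy input the accuracy is 0.75 while the weighted F1 is about 0.708, because the minority class is recalled poorly. That gap is why imbalanced benchmarks usually report both numbers.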

Original abstract

Metaphorical imagination, the ability to connect seemingly unrelated concepts, is fundamental to human cognition and communication. While understanding linguistic metaphors has advanced significantly, grasping multimodal metaphors, such as those found in internet memes, presents unique challenges due to their unconventional expressions and implied meanings. Existing methods for multimodal metaphor identification often struggle to bridge the gap between literal and figurative interpretations. Additionally, generative approaches that utilize large language models or text-to-image models, while promising, suffer from high computational costs. This paper introduces Concept Drift Guided LayerNorm Tuning (CDGLT), a novel and training-efficient framework for multimodal metaphor identification. CDGLT incorporates two key innovations: (1) Concept Drift, a mechanism that leverages Spherical Linear Interpolation (SLERP) of cross-modal embeddings from a CLIP encoder to generate a new, divergent concept embedding. This drifted concept helps to alleviate the gap between literal features and the figurative task. (2) A prompt construction strategy, that adapts the method of feature extraction and fusion using pre-trained language models for the multimodal metaphor identification task. CDGLT achieves state-of-the-art performance on the MET-Meme benchmark while significantly reducing training costs compared to existing generative methods. Ablation studies demonstrate the effectiveness of both Concept Drift and our adapted LN Tuning approach. Our method represents a significant step towards efficient and accurate multimodal metaphor understanding. The code is available: https://github.com/Qianvenh/CDGLT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CDGLT combines SLERP drift on CLIP embeddings with LayerNorm tuning to cut training cost on MET-Meme while claiming SOTA, but the drift step lacks direct evidence that it produces task-relevant figurative shifts.

The main points worth knowing are that this work pairs spherical linear interpolation of cross-modal CLIP vectors to create a drifted concept embedding with an adapted prompt-based LayerNorm tuning strategy, and it reports stronger results than prior methods on the MET-Meme benchmark at lower training cost than generative baselines. The combination itself looks new relative to the cited literature, and the efficiency claim is a practical plus in a space where full LLM or diffusion pipelines are heavy. Ablations are said to show both the drift and the tuning steps add value, which is the kind of check that helps readers trust the design choices. The code release is also a clear positive for anyone who wants to inspect or extend the implementation.

On the weaker side, the central assumption that SLERP interpolation systematically moves embeddings toward figurative rather than literal regions is stated but not backed by embedding-space diagnostics, nearest-neighbor examples, or a controlled ablation that isolates drift effects on metaphor-specific mistakes. Without those, it remains possible that the observed gains trace more to the prompt construction or LN adaptation than to the drifted concept itself. The evaluation stays within a single benchmark, so how well the approach travels to other multimodal figurative tasks is still open.

This paper is mainly for researchers working on efficient adaptation techniques for multimodal understanding in social-media or meme settings. A reader already familiar with CLIP-based fusion and parameter-efficient tuning will find the prompt strategy and the drift mechanism easy to follow. It deserves a serious referee because the core idea is cleanly motivated, the efficiency angle is relevant, and the reported results are strong enough to warrant closer inspection of the experiments and the drift validation. Referees will likely ask for the missing embedding analysis and broader testing, but that is normal for work at this stage.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Concept Drift Guided LayerNorm Tuning (CDGLT) for multimodal metaphor identification. It proposes using SLERP to interpolate CLIP cross-modal embeddings and produce a drifted concept embedding that bridges literal and figurative interpretations, combined with an adapted prompt-based feature extraction/fusion strategy and LayerNorm tuning. The authors claim this yields state-of-the-art results on the MET-Meme benchmark, substantially lower training costs than generative baselines, and that ablations confirm the contribution of both the drift mechanism and the LN tuning.

Significance. If the performance claims and the utility of the SLERP-based drift are substantiated with proper controls, the work would offer a practical, low-compute route to multimodal figurative understanding. It could reduce reliance on expensive generative models for meme analysis and similar tasks while highlighting targeted adaptation of pre-trained encoders.

major comments (2)

[Abstract] Abstract (paragraph on Concept Drift): the claim that SLERP interpolation of CLIP cross-modal embeddings produces a 'drifted concept' that meaningfully alleviates the literal-to-figurative gap is presented without any embedding-space diagnostics, nearest-neighbor examples, or controlled ablation that isolates its effect on metaphor-specific error modes. This mechanism is central to the novelty and performance attribution.
[Experimental Results] Experimental section: the abstract asserts SOTA performance and effective ablations, yet the manuscript supplies no dataset statistics (split sizes, class balance), error bars, statistical significance tests, or full baseline re-implementation details, rendering the central empirical claims unverifiable from the reported evidence.
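The nearest-neighbor diagnostic requested in the first major comment could look roughly like the sketch below: rank a bank of reference embeddings by cosine similarity to a query, once for the original embedding and once for the drifted one, and compare the rankings. Everything here is hypothetical, including the function names, the two-dimensional vectors, and the bank labels. The toy values were chosen by hand to make the ranking flip visible and say nothing about how CLIP embeddings actually behave.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbours(query, bank, k=3):
    """Return the k (label, similarity) pairs most similar to the query."""
    scored = [(label, cosine(query, vec)) for label, vec in bank.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hand-made 2-D stand-ins for reference concept embeddings.
bank = {
    "literal caption": [1.0, 0.1],
    "figurative reading": [0.2, 1.0],
    "unrelated": [-1.0, 0.0],
}
original = [1.0, 0.0]   # stand-in for an embedding before drift
drifted = [0.3, 0.95]   # stand-in for the same sample after drift

top_before = nearest_neighbours(original, bank, k=1)[0][0]
top_after = nearest_neighbours(drifted, bank, k=1)[0][0]
```

A real diagnostic would replace the hand-made vectors with encoder outputs and report, over the whole test set, how often the top neighbors change after the drift step.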

minor comments (1)

[Method] The exact prompt templates used for feature extraction and fusion are described only at a high level; providing the literal strings would aid reproducibility.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and indicate the planned revisions to strengthen the manuscript.

Referee: [Abstract] Abstract (paragraph on Concept Drift): the claim that SLERP interpolation of CLIP cross-modal embeddings produces a 'drifted concept' that meaningfully alleviates the literal-to-figurative gap is presented without any embedding-space diagnostics, nearest-neighbor examples, or controlled ablation that isolates its effect on metaphor-specific error modes. This mechanism is central to the novelty and performance attribution.

Authors: We acknowledge that the current presentation would benefit from more direct evidence supporting the role of the SLERP-based drift. While the ablation studies already demonstrate that removing the Concept Drift component leads to a measurable drop in performance on the MET-Meme benchmark, we agree these results do not fully isolate metaphor-specific error modes or provide embedding-space intuition. In the revised manuscript we will add t-SNE visualizations comparing original and drifted embeddings, nearest-neighbor examples in the embedding space, and a targeted error analysis on literal-versus-figurative misclassifications to better substantiate the mechanism.
revision: yes

Referee: [Experimental Results] Experimental section: the abstract asserts SOTA performance and effective ablations, yet the manuscript supplies no dataset statistics (split sizes, class balance), error bars, statistical significance tests, or full baseline re-implementation details, rendering the central empirical claims unverifiable from the reported evidence.

Authors: We agree that these details are necessary for full reproducibility and verifiability. The revised version will include a table with dataset statistics (train/validation/test split sizes and class balance for MET-Meme), results reported with standard deviations across multiple random seeds, paired statistical significance tests against the main baselines, and an expanded experimental appendix with complete re-implementation details, hyperparameters, and training procedures for all compared methods.
revision: yes
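To make the promised reporting concrete, the sketch below shows one way to summarize scores across random seeds and to compute a paired t statistic between two systems evaluated on the same seeds. The score lists are placeholders invented for this example and are not results from the paper or from any baseline; the function names are hypothetical.

```python
import math
import statistics


def mean_std(scores):
    """Mean and sample standard deviation of a list of per-seed scores."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t_statistic(a, b):
    """t statistic for paired samples a and b (same seeds, same order)."""
    diffs = [x - y for x, y in zip(a, b)]
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / std_err


# Placeholder per-seed scores for two systems; invented, not measured.
scores_a = [0.84, 0.86, 0.85, 0.87, 0.83]
scores_b = [0.82, 0.83, 0.84, 0.84, 0.82]

mean_a, std_a = mean_std(scores_a)
t_stat = paired_t_statistic(scores_a, scores_b)
```

With five seeds the statistic has four degrees of freedom, so it would be compared against the two-sided 5% critical value of about 2.776 from a Student t table.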

Circularity Check

0 steps flagged

No circularity: derivation uses external CLIP embeddings and standard SLERP without self-referential reduction

The paper defines Concept Drift explicitly as SLERP interpolation between cross-modal CLIP embeddings to produce a divergent vector, then applies LayerNorm tuning and prompt adaptation. These steps are constructed from independent, pre-existing components (CLIP encoder, SLERP formula, standard LN tuning) rather than fitted parameters or self-citations that are then relabeled as predictions. Ablation studies test the added drift mechanism against baselines on the external MET-Meme benchmark. No equations or central claims reduce by construction to the inputs; the performance claim remains an empirical outcome rather than a definitional tautology.
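The "standard LN tuning" component can be pictured as a filter over parameter names: only LayerNorm parameters and position embeddings stay trainable, which matches the passage quoted further down this page ("Fine-tuning only its Layer Normalization (LN) parameters and position embedding components"). The sketch below is a framework-free illustration. The parameter names are hypothetical and modeled on GPT-2-style naming; this page does not state which language model the paper tunes, so that choice is an assumption.

```python
def select_trainable(param_names, ln_prefix="ln_", pos_name="wpe"):
    """Return the parameter names that stay trainable under LN tuning.

    A parameter is kept if any dot-separated segment of its name starts with
    ln_prefix (a LayerNorm module) or equals pos_name (position embeddings).
    """
    kept = []
    for name in param_names:
        segments = name.split(".")
        if any(s.startswith(ln_prefix) or s == pos_name for s in segments):
            kept.append(name)
    return kept


# Hypothetical GPT-2-style parameter names for a one-block model.
param_names = [
    "wte.weight",              # token embeddings
    "wpe.weight",              # position embeddings
    "h.0.ln_1.weight",
    "h.0.ln_1.bias",
    "h.0.attn.c_attn.weight",  # attention projection
    "h.0.mlp.c_fc.weight",     # feed-forward layer
    "h.0.ln_2.weight",
    "ln_f.weight",             # final LayerNorm
]

trainable = select_trainable(param_names)
frozen = [name for name in param_names if name not in trainable]
```

In a deep learning framework the same selection would typically be applied by switching off gradient tracking for every parameter in `frozen`, so that the optimizer only updates the small LayerNorm and position-embedding tensors.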

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The approach rests on pre-trained CLIP and language models plus two new mechanisms whose effectiveness is asserted via benchmark results; no explicit free parameters or axioms are stated in the abstract.

invented entities (1)

Concept Drift via SLERP (no independent evidence)
purpose: Generate a divergent concept embedding to bridge the literal-to-figurative gap
source: Introduced as the first key innovation in the abstract.

pith-pipeline@v0.9.0 · 5826 in / 1067 out tokens · 43526 ms · 2026-05-22T15:15:30.841387+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Concept Drift utilizes Spherical Linear Interpolation (SLERP) of two cross-modal embeddings from the CLIP encoder to produce an intermediate semantic embedding... α=0.8"

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · alpha_pin_under_high_calibration
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Fine-tuning only its Layer Normalization (LN) parameters and position embedding components"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

Arjun R Akula, Brendan Driscoll, Pradyumna Narayana, Soravit Changpinyo, Zhi- wei Jia, Suyash Damle, Garima Pruthi, Sugato Basu, Leonidas Guibas, William T Freeman, et al. 2023. Metaclue: Towards comprehensive visual metaphors re- search. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 23201–23211

work page 2023

[2] [2]

Elisabeth Camp. 2006. Metaphor in the Mind: The Cognition of Metaphor 1. Philosophy Compass1, 2 (2006), 154–170

work page 2006

[3] [3]

Jacob Devlin. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding.arXiv preprint arXiv:1810.04805(2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018

[4] [4]

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv:2010.11929 [cs.CV] https://arxiv.org/abs/2010.11929

work page internal anchor Pith review Pith/arXiv arXiv 2021

[5] [5]

Mengshi Ge, Rui Mao, and Erik Cambria. 2022. Explainable metaphor identi- fication inspired by conceptual metaphor theory. InProceedings of the AAAI conference on artificial intelligence, Vol. 36. 10681–10689

work page 2022

[6] [6]

Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. InProceedings of the thirteenth international conference on artificial intelligence and statistics. JMLR Workshop and Conference Proceedings, 249–256

work page 2010

[7] [7]

Pragglejaz Group. 2007. MIP: A method for identifying metaphorically used words in discourse.Metaphor and symbol22, 1 (2007), 1–39

work page 2007

[8] [8]

Meiqi Guo, Rebecca Hwa, and Adriana Kovashka. 2021. Detecting persuasive atypicality by modeling contextual compatibility. InProceedings of the IEEE/CVF International Conference on Computer Vision. 972–982

work page 2021

[9] [9]

Donghoon Han, Eunhwan Park, Gisang Lee, Adam Lee, and Nojun Kwak. 2024. MERLIN: Multimodal Embedding Refinement via LLM-based Iterative Navigation for Text-Video Retrieval-Rerank Pipeline.arXiv preprint arXiv:2407.12508(2024)

work page arXiv 2024

[10] [10]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep Residual Learning for Image Recognition. arXiv:1512.03385 [cs.CV] https://arxiv.org/abs/ 1512.03385

work page internal anchor Pith review Pith/arXiv arXiv 2015

[11] [11]

Xiaoyu He, Long Yu, Shengwei Tian, Qimeng Yang, and Jun Long. 2024. SC-Net: Multimodal metaphor detection using semantic conflicts.Neurocomputing594 (2024), 127825

work page 2024

[12] [12]

Xiaoyu He, Long Yu, Shengwei Tian, Qimeng Yang, Jun Long, and Bo Wang. 2024. VIEMF: Multimodal metaphor detection via visual information enhancement with multimodal fusion.Information Processing & Management61, 3 (2024), 103652

work page 2024

[13] [13]

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. InInternational conference on machine learning. PMLR, 2790–2799

work page 2019

[14] [14]

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation of large language models.arXiv preprint arXiv:2106.09685(2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021

[15] [15]

EunJeong Hwang and Vered Shwartz. 2023. Memecap: A dataset for captioning and interpreting memes.arXiv preprint arXiv:2305.13703(2023)

work page arXiv 2023

[16] [16]

Young Kyun Jang, Dat Huynh, Ashish Shah, Wen-Kai Chen, and Ser-Nam Lim

work page

[17] [17]

InEuropean Conference on Computer Vision

Spherical linear interpolation and text-anchoring for zero-shot composed image retrieval. InEuropean Conference on Computer Vision. Springer, 239–254

work page

[18] [18]

2008.Metaphors we live by

George Lakoff and Mark Johnson. 2008.Metaphors we live by. University of Chicago press

work page 2008

[19]

Ming Li, Jike Zhong, Chenxin Li, Liuzhuozheng Li, Nie Lin, and Masashi Sugiyama. 2024. Vision-language model fine-tuning via simple parameter-efficient modification. arXiv preprint arXiv:2409.16718 (2024).

[21]

Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190 (2021).

[22]

Yucheng Li, Shun Wang, Chenghua Lin, and Guerin Frank. 2023. Metaphor detection via explicit basic meanings modelling. arXiv preprint arXiv:2305.17268 (2023).

[23]

Ilya Loshchilov. 2017. Decoupled Weight Decay Regularization. arXiv preprint arXiv:1711.05101 (2017).

[24]

Ilya Loshchilov and Frank Hutter. 2016. SGDR: Stochastic Gradient Descent with Warm Restarts. arXiv preprint arXiv:1608.03983 (2016). https://arxiv.org/abs/1608.03983

[25]

Rui Mao, Kai He, Claudia Ong, Qian Liu, and Erik Cambria. 2024. MetaPro 2.0: Computational metaphor processing on the effectiveness of anomalous language modeling. In Findings of the Association for Computational Linguistics ACL 2024. 9891–9908.

[26]

Rui Mao, Xiao Li, Mengshi Ge, and Erik Cambria. 2022. MetaPro: A computational metaphor processing model for text pre-processing. Information Fusion 86 (2022), 30–43.

[27]

Adam Paszke, Sam Gross, Francisco Massa, et al. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. Advances in Neural Information Processing Systems 32 (2019).

[28]

Kalarani A R, Bhattacharyya P, and Shekhar S. 2024. Unveiling the Invisible: Captioning Videos with Metaphors. In Findings of the Association for Computational Linguistics: EMNLP 2024. 6306–6320.

[29]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning. PMLR, 8748–8763.

[30]

Alec Radford, Jeffrey Wu, Rewon Child, et al. 2019. Language models are unsupervised multitask learners. OpenAI Blog 1, 8 (2019), 9.

[31]

Ken Shoemake. 1985. Animating rotation with quaternion curves. In Proceedings of the 12th annual conference on Computer graphics and interactive techniques. 245–254.

[32]

Mohan Kumar Srirama, Sudeep Dasari, Shikhar Bahl, and Abhinav Gupta. 2024. Hrp: Human affordances for robotic pre-training. arXiv preprint arXiv:2407.18911 (2024).

[33]

Kevin Stowe, Tuhin Chakrabarty, Nanyun Peng, Smaranda Muresan, and Iryna Gurevych. 2021. Metaphor generation with conceptual mappings. arXiv preprint arXiv:2106.01228 (2021).

[34]

Yuan Tian, Minzheng Wang, Nan Xu, and Wenji Mao. 2025. ImaRA: An Imaginative Frame Augmented Method for Low-Resource Multimodal Metaphor Detection and Explanation. In Findings of the Association for Computational Linguistics: NAACL 2025. 3953–3967.

[35]

A Vaswani. 2017. Attention is all you need. Advances in Neural Information Processing Systems (2017).

[36]

Thomas W, Lysandre D, Victor S, et al. 2019. Transformers: State-of-the-art natural language processing. Google Scholar (2019).

[37]

Bingbing Wang, Shijue Huang, Bin Liang, Geng Tu, Min Yang, and Ruifeng Xu. 2024. What do they “meme”? A metaphor-aware multi-modal multi-task framework for fine-grained meme understanding. Knowledge-Based Systems 294 (2024), 111778.

[38]

Yorick Wilks. 1975. A preferential, pattern-seeking, semantics for natural language inference. Artificial intelligence 6, 1 (1975), 53–74.

[39]

Bo Xu, Tingting Li, Junzhe Zheng, Mehdi Naseriparsa, Zhehuan Zhao, Hongfei Lin, and Feng Xia. 2022. Met-meme: A multimodal meme dataset rich in metaphors. In Proceedings of the 45th international ACM SIGIR conference on research and development in information retrieval. 2887–2899.

[40]

Bo Xu, Junzhe Zheng, Jiayuan He, Yuxuan Sun, Hongfei Lin, Liang Zhao, and Feng Xia. 2024. Generating Multimodal Metaphorical Features for Meme Understanding. In Proceedings of the 32nd ACM International Conference on Multimedia. 447–455.

[41]

Yanzhi Xu, Yueying Hua, Shichen Li, and Zhongqing Wang. 2024. Exploring Chain-of-Thought for Multi-modal Metaphor Detection. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 91–101.

[42]

Ron Yosef, Yonatan Bitton, and Dafna Shahaf. 2023. Irfl: Image recognition of figurative language. arXiv preprint arXiv:2303.15445 (2023).

[43]

Jun Yu, Zerui Zhang, Zhihong Wei, Gongpeng Zhao, Zhongpeng Cai, Yongqi Wang, Guochen Xie, Jichao Zhu, Wangyuan Zhu, Qingsong Liu, et al. 2024. Aud-tgn: Advancing action unit detection with temporal convolution and gpt-2 in wild audiovisual contexts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4814–4821.

[44]

Dongyu Zhang, Jingwei Yu, Senyuan Jin, Liang Yang, and Hongfei Lin. 2023. MultiCMET: A Novel Chinese Benchmark for Understanding Multimodal Metaphor. In Findings of the Association for Computational Linguistics: EMNLP 2023. 6141–6154.

[45]

Dongyu Zhang, Minghao Zhang, Heting Zhang, Liang Yang, and Hongfei Lin. 2021. MultiMET: A multimodal dataset for metaphor understanding. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 3214–3225.

[47]

Linhao Zhang, Li Jin, Guangluan Xu, Xiaoyu Li, Cai Xu, Kaiwen Wei, Nayu Liu, and Haonan Liu. 2024. CAMEL: Capturing Metaphorical Alignment with Context Disentangling for Multimodal Emotion Recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 9341–9349.

[48]

Bingchen Zhao, Haoqin Tu, Chen Wei, Jieru Mei, and Cihang Xie. 2023. Tuning LayerNorm in Attention: Towards efficient multi-modal llm finetuning. arXiv preprint arXiv:2312.11420 (2023).

ICMR ’25, June 30-July 3, 2025, Chicago, IL, USA. Wenhao Qian, Zhenzhen Hu, Zijie Song, Jia Li

[49]

Li Zheng, Hao Fei, Ting Dai, Zuquan Peng, Fei Li, Huisheng Ma, Chong Teng, and Donghong Ji. 2025. Multi-Granular Multimodal Clue Fusion for Meme Understanding. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 26057–26065.

[50]

Tian Zhou, Peisong Niu, Liang Sun, Rong Jin, et al. 2023. One fits all: Power general time series analysis by pretrained lm. Advances in neural information processing systems 36 (2023), 43322–43355.