Fall into a Pit, Gain in a Wit: Cognitive-Guided Harmful Meme Detection via Misjudgment Risk Pattern Retrieval
Pith reviewed 2026-05-18 08:38 UTC · model grok-4.3
The pith
Retrieving misjudgment risk patterns guides multimodal models to detect harmful memes with fewer errors.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PatMD detects harmful memes by first constructing a knowledge base in which each meme is paired with a misjudgment risk pattern that explains whether it tends to produce false negatives or false positives, then retrieving the most relevant patterns for a target meme and using them to dynamically guide the reasoning steps of a multimodal large language model.
What carries the argument
Misjudgment risk pattern retrieval, which deconstructs each meme into an explicit account of why an MLLM is likely to overlook its harm or overread harm into benign content and then injects those accounts into the model's current reasoning trace.
If this is right
- Detection performance rises by roughly 8 percent in F1-score and 7.7 percent in accuracy across multiple harmful-meme benchmarks.
- The method maintains its advantage on memes that were never seen during training or that contain deliberate adversarial changes.
- Guidance from past error patterns reduces both missed harmful content and false alarms on harmless content.
- The same retrieval step can be applied to any multimodal large language model without retraining the underlying model.
Where Pith is reading between the lines
- The approach may transfer to other domains where models must interpret subtle or ironic multimodal signals, such as short videos or captioned images.
- Automating the initial construction of the misjudgment pattern base could reduce the manual effort required to scale the method.
- The technique highlights a general strategy of using recorded model failures as a form of external memory to improve future decisions.
Load-bearing premise
The patterns stored in the knowledge base must correctly identify the real reasons models err on memes, and feeding those patterns into the model must not introduce fresh mistakes or biases of its own.
What would settle it
Running PatMD on a fresh set of memes where the retrieved patterns produce no accuracy gain or even lower scores than the unaided model would demonstrate that the patterns are not the source of the reported improvement.
Figures
read the original abstract
Internet memes have emerged as a popular multimodal medium, yet they are increasingly weaponized to convey harmful opinions through subtle rhetorical devices like irony and metaphor. Existing detection approaches, including Multimodal Large Language Model (MLLM)-based techniques, struggle with these implicit expressions, leading to frequent misjudgments. This paper introduces PatMD, a novel approach that detects harmful memes by learning from and proactively mitigating these potential misjudgment risks. Our core idea is to move beyond superficial content-level matching and instead identify the underlying misjudgment risk patterns, proactively guiding the MLLMs to avoid known misjudgment pitfalls. We first construct a knowledge base where each meme is deconstructed into a misjudgment risk pattern explaining why it might be misjudged, either overlooking harmful undertones (false negative) or overinterpreting benign content (false positive). For a given target meme, PatMD retrieves relevant patterns and utilizes them to dynamically guide the MLLM's reasoning. Experiments on a benchmark of 6,626 memes across 5 harmful detection tasks show that PatMD outperforms state-of-the-art baselines, achieving an average of 8.30% improvement in F1-score and 7.71% improvement in accuracy, while exhibiting consistent robustness on unseen and adversarial memes.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PatMD, a method for harmful meme detection that constructs a knowledge base by deconstructing each meme into a misjudgment risk pattern (explaining potential false-negative overlooking of harm or false-positive overinterpretation of benign content) and retrieves relevant patterns to dynamically guide MLLM reasoning. It reports that this yields an average 8.30% F1-score and 7.71% accuracy improvement over state-of-the-art baselines on a 6,626-meme benchmark spanning 5 tasks, with added robustness to unseen and adversarial memes.
Significance. If the central mechanism is shown to be causally responsible for the gains, the work offers a cognitively motivated way to mitigate implicit rhetorical devices in multimodal content, which could improve reliability of MLLM-based moderation systems. The scale of the benchmark and focus on proactive error-pattern retrieval are strengths that, if substantiated, would advance the subfield beyond standard prompting or fine-tuning approaches.
major comments (3)
- [§3] §3 (Method, knowledge-base construction): The deconstruction of memes into misjudgment risk patterns lacks any reported validation against actual observed MLLM errors on held-out data (e.g., no correlation analysis, inter-annotator agreement with model outputs, or ablation measuring error reduction attributable to specific patterns). Without this, the patterns may function as post-hoc rationalizations, and the attribution of the 8.30% F1 gain specifically to risk-pattern retrieval cannot be verified.
- [§4.2] §4.2 (Experiments, baselines and ablations): The manuscript provides no statistical significance tests, confidence intervals, or controls for confounding factors such as variations in prompt engineering or retrieval hyperparameters; this makes it impossible to isolate whether the reported improvements and robustness on adversarial memes stem from the proposed mechanism rather than other implementation choices.
- [§4.1] §4.1 (Retrieval mechanism): The description of how patterns are retrieved and injected into the MLLM prompt is insufficiently detailed to assess whether the process introduces new biases or spurious guidance, which directly bears on the weakest assumption that retrieved patterns reliably mitigate the intended misjudgments.
minor comments (2)
- [Abstract] The abstract and §4 should explicitly name the five harmful detection tasks and the exact baseline models to allow readers to assess the scope of the claimed gains.
- [§3] Notation for false-negative vs. false-positive patterns could be made more consistent across figures and text to improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We have carefully reviewed each major comment and provide point-by-point responses below, outlining the revisions we plan to incorporate to address the concerns raised.
read point-by-point responses
-
Referee: [§3] §3 (Method, knowledge-base construction): The deconstruction of memes into misjudgment risk patterns lacks any reported validation against actual observed MLLM errors on held-out data (e.g., no correlation analysis, inter-annotator agreement with model outputs, or ablation measuring error reduction attributable to specific patterns). Without this, the patterns may function as post-hoc rationalizations, and the attribution of the 8.30% F1 gain specifically to risk-pattern retrieval cannot be verified.
Authors: We agree that direct validation of the misjudgment risk patterns against observed MLLM errors on held-out data would strengthen the causal link to the reported gains. The patterns in the current manuscript are constructed from cognitive principles of meme interpretation drawn from prior literature on rhetorical devices and misjudgment risks. In the revised manuscript, we will add a dedicated analysis that correlates retrieved patterns with specific error categories (false negatives and false positives) made by baseline MLLMs on a held-out portion of the benchmark. We will also include an ablation that quantifies error reduction when patterns are used versus withheld, to better demonstrate that the patterns proactively mitigate the targeted misjudgments rather than serving solely as post-hoc rationalizations. revision: yes
-
Referee: [§4.2] §4.2 (Experiments, baselines and ablations): The manuscript provides no statistical significance tests, confidence intervals, or controls for confounding factors such as variations in prompt engineering or retrieval hyperparameters; this makes it impossible to isolate whether the reported improvements and robustness on adversarial memes stem from the proposed mechanism rather than other implementation choices.
Authors: We acknowledge that the absence of statistical significance testing and explicit controls for confounding factors such as prompt variations and retrieval hyperparameters weakens the isolation of the proposed mechanism's contribution. In the revised manuscript, we will add paired statistical tests (e.g., McNemar's test) along with 95% confidence intervals for all F1 and accuracy metrics across the five tasks. We will further include additional ablation experiments that systematically vary prompt templates and retrieval hyperparameters (including top-k values and similarity thresholds) while holding other components fixed, to show that the observed gains and robustness to adversarial memes remain consistent and are primarily driven by the risk-pattern retrieval component. revision: yes
-
Referee: [§4.1] §4.1 (Retrieval mechanism): The description of how patterns are retrieved and injected into the MLLM prompt is insufficiently detailed to assess whether the process introduces new biases or spurious guidance, which directly bears on the weakest assumption that retrieved patterns reliably mitigate the intended misjudgments.
Authors: We thank the referee for highlighting the need for greater specificity in the retrieval and injection process. The original manuscript provides only a high-level description in Section 4.1. In the revised version, we will substantially expand this section with a precise algorithmic account, including the embedding model and similarity function used for retrieval, the exact prompt template for pattern injection, and concrete examples illustrating how a retrieved pattern steers the MLLM away from a targeted misjudgment. We will also add a short discussion of potential biases that could arise from retrieval and how the cognitive grounding of the patterns is intended to reduce the risk of spurious guidance. revision: yes
Circularity Check
No significant circularity; derivation relies on external benchmark evaluation
full rationale
The paper constructs a knowledge base of misjudgment risk patterns from memes and retrieves them to guide MLLM reasoning for harmful meme detection, with performance measured on an external benchmark of 6,626 memes across 5 tasks. No equations, self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citations appear in the abstract or description. The claimed improvements (8.30% F1, 7.71% accuracy) are presented as empirical outcomes rather than by-construction equivalences to inputs. The approach is self-contained against external benchmarks, consistent with a normal non-circular finding.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption MLLMs can be effectively guided by retrieved misjudgment risk patterns to reduce false negatives and false positives
Reference graph
Works this paper leans on
-
[1]
Piush Aggarwal, Pranit Chawla, Mithun Das, Punyajoy Saha, Binny Mathew, Torsten Zesch, and Animesh Mukherjee. 2023. Hateproof: Are hateful meme detection systems really robust?. InProceedings of the ACM Web Conference 2023. 3734–3743
work page 2023
-
[2]
Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi
-
[3]
Self-rag: Learning to retrieve, generate, and critique through self-reflection. (2024)
work page 2024
-
[4]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. 2025. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923(2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[5]
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners.Advances in neural information processing systems33 (2020), 1877–1901
work page 2020
-
[6]
Yitao Cai, Huiyu Cai, and Xiaojun Wan. 2019. Multi-modal sarcasm detection in twitter with hierarchical fusion model. InProceedings of the 57th annual meeting of the association for computational linguistics. 2506–2515
work page 2019
-
[7]
Rui Cao, Ming Shan Hee, Adriel Kuek, Wen-Haw Chong, Roy Ka-Wei Lee, and Jing Jiang. 2023. Pro-cap: Leveraging a frozen vision-language model for hateful meme detection. InProceedings of the 31st ACM international conference on multimedia. 5244–5252
work page 2023
-
[8]
Rui Cao, Roy Ka-Wei Lee, Wen-Haw Chong, and Jing Jiang. 2022. Prompting for Multimodal Hateful Meme Classification. InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 321–332
work page 2022
-
[9]
Rui Cao, Roy Ka-Wei Lee, and Jing Jiang. 2024. Modularized networks for few- shot hateful meme detection. InProceedings of the ACM Web Conference 2024. 4575–4584
work page 2024
-
[10]
Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2020. Uniter: Universal image-text representation learning. InEuropean conference on computer vision. Springer, 104–120
work page 2020
-
[11]
Zixin Chen, Hongzhan Lin, Kaixin Li, Ziyang Luo, Zhen Ye, Guang Chen, Zhiyong Huang, and Jing Ma. 2025. AdamMeme: Adaptively Probe the Reasoning Capacity of Multimodal Large Language Models on Harmfulness. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics. 4234–4253
work page 2025
-
[12]
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Se- bastian Gehrmann, et al. 2023. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research24, 240 (2023), 1–113
work page 2023
-
[13]
Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2024. Scaling instruction-finetuned language models.Journal of Machine Learning Research25, 70 (2024), 1–53
work page 2024
-
[14]
Elisabetta Fersini, Francesca Gasparini, Giulia Rizzi, Aurora Saibene, Berta Chulvi, Paolo Rosso, Alyssa Lees, and Jeffrey Sorensen. 2022. SemEval-2022 Task 5: Multi- media automatic misogyny identification. InProceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022). 533–549
work page 2022
-
[15]
Michael Günther, Saba Sturua, Mohammad Kalim Akram, Isabelle Mohr, Andrei Ungureanu, Bo Wang, Sedigheh Eslami, Scott Martens, Maximilian Werk, Nan Wang, et al. 2025. jina-embeddings-v4: Universal Embeddings for Multimodal Multilingual Retrieval.arXiv preprint arXiv:2506.18902(2025)
-
[16]
Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. 2020. Retrieval augmented language model pre-training. InInternational conference on machine learning. PMLR, 3929–3938
work page 2020
-
[17]
Ming Shan Hee, Roy Ka-Wei Lee, and Wen-Haw Chong. 2022. On explaining multimodal hateful meme detection models. InProceedings of the ACM web conference 2022. 3651–3655
work page 2022
-
[18]
Jianzhao Huang, Hongzhan Lin, Liu Ziyan, Ziyang Luo, Guang Chen, and Jing Ma. 2024. Towards Low-Resource Harmful Meme Detection with LMM Agents. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2269–2293
work page 2024
-
[19]
Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. 2024. Gpt-4o system card.arXiv preprint arXiv:2410.21276(2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[20]
Junhui Ji, Xuanrui Lin, and Usman Naseem. 2024. Capalign: Improving cross modal alignment via informative captioning for harmful meme detection. In Proceedings of the ACM Web Conference 2024. 4585–4594
work page 2024
- [21]
-
[22]
Zhengbao Jiang, Frank F Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023. Active retrieval augmented generation. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 7969–7992
work page 2023
- [23]
-
[24]
Douwe Kiela, Hamed Firooz, Aravind Mohan, Vedanuj Goswami, Amanpreet Singh, Pratik Ringshia, and Davide Testuggine. 2020. The hateful memes chal- lenge: Detecting hate speech in multimodal memes.Advances in neural informa- tion processing systems33 (2020), 2611–2624
work page 2020
-
[25]
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners.Advances in neural information processing systems35 (2022), 22199–22213
work page 2022
-
[26]
Gonzalez, Hao Zhang, and Ion Stoica
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Mem- ory Management for Large Language Model Serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles
work page 2023
-
[27]
Roy Ka-Wei Lee, Rui Cao, Ziqing Fan, Jing Jiang, and Wen-Haw Chong. 2021. Disentangling hate in online memes. InProceedings of the 29th ACM international conference on multimedia. 5138–5147
work page 2021
-
[28]
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems33 (2020), 9459–9474
work page 2020
-
[29]
Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang
-
[30]
VisualBERT: A Simple and Performant Baseline for Vision and Language
VisualBERT: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557(2019)
work page internal anchor Pith review Pith/arXiv arXiv 1908
-
[31]
Hongzhan Lin, Ziyang Luo, Jing Ma, and Long Chen. 2023. Beneath the Surface: Unveiling Harmful Memes with Multimodal Reasoning Distilled from Large Language Models. InFindings of the Association for Computational Linguistics: EMNLP 2023, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 9114–9128. doi:...
-
[32]
Hongzhan Lin, Ziyang Luo, Bo Wang, Ruichao Yang, and Jing Ma. 2024. Goat- bench: Safety insights to large multimodal models through meme-based social abuse.ACM Transactions on Intelligent Systems and Technology(2024)
work page 2024
-
[33]
Xuanrui Lin, Chao Jia, Junhui Ji, Hui Han, and Usman Naseem. 2025. Ask, acquire, understand: A multimodal agent-based framework for social abuse detection in memes. InProceedings of the ACM on Web Conference 2025. 4734–4744
work page 2025
- [34]
-
[35]
Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. 2024. Improved baselines with visual instruction tuning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 26296–26306
work page 2024
-
[36]
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach.arXiv preprint arXiv:1907.11692 (2019)
work page internal anchor Pith review Pith/arXiv arXiv 2019
-
[37]
Ziyan Liu, Chunxiao Fan, Haoran Lou, Yuexin Wu, and Kaiwei Deng. 2025. MIND: A Multi-agent Framework for Zero-shot Harmful Meme Detection. InProceedings Conference acronym ’XX, June 03–05, 2018, Woodstock, NY Wang et al. of the 63rd Annual Meeting of the Association for Computational Linguistics. 923– 947
work page 2025
-
[38]
Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretrain- ing task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in neural information processing systems32 (2019)
work page 2019
-
[39]
Abdullah Mazhar, Zuhair Hasan Shaik, Aseem Srivastava, Polly Ruhnke, La- vanya Vaddavalli, Sri Keshav Katragadda, Shweta Yadav, and Md Shad Akhtar
-
[40]
InProceedings of the ACM on Web Conference 2025
Figurative-cum-Commonsense Knowledge Infusion for Multimodal Mental Health Meme Classification. InProceedings of the ACM on Web Conference 2025. 637–648
work page 2025
-
[41]
Jingbiao Mei, Jinghong Chen, Guangyu Yang, Weizhe Lin, and Bill Byrne. 2025. Robust adaptation of large multimodal models for retrieval augmented hateful meme detection. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 23817–23839
work page 2025
- [42]
-
[43]
Shad Akhtar, Preslav Nakov, and Tanmoy Chakraborty
Shraman Pramanick, Shivam Sharma, Dimitar Dimitrov, Md. Shad Akhtar, Preslav Nakov, and Tanmoy Chakraborty. 2021. MOMENTA: A Multimodal Framework for Detecting Harmful Memes and Their Targets. InFindings of the Associa- tion for Computational Linguistics: EMNLP 2021. Association for Computational Linguistics, 4439–4455. doi:10.18653/v1/2021.findings-emnlp.379
-
[44]
Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, and Yoav Shoham. 2023. In-context retrieval-augmented language models.Transactions of the Association for Computational Linguistics11 (2023), 1316–1331
work page 2023
- [45]
- [46]
-
[47]
Shardul Suryawanshi, Bharathi Raja Chakravarthi, Mihael Arcan, and Paul Buite- laar. 2020. Multimodal meme dataset (MultiOFF) for identifying offensive content in image and text. InProceedings of the second workshop on trolling, aggression and cyberbullying. 32–41
work page 2020
-
[48]
Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. 2025. Internvl3.5: Ad- vancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265(2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[49]
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems35 (2022), 24824–24837
work page 2022
- [50]
-
[51]
Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. 2022. Automatic chain of thought prompting in large language models.arXiv preprint arXiv:2210.03493 (2022)
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[52]
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. 2025. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479(2025). A Analysis of Retrieved Samples (𝐾) 72 74 76 78 80 1 2 3 4 5 6 top-K 54 56 58 60 62 64 66 Ha...
work page internal anchor Pith review Pith/arXiv arXiv 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.