pith. machine review for the scientific record. sign in

arxiv: 2604.18091 · v1 · submitted 2026-04-20 · 💻 cs.CL · cs.CV

Recognition: unknown

Culture-Aware Humorous Captioning: Multimodal Humor Generation across Cultural Contexts

Authors on Pith no claims yet

Pith reviewed 2026-05-10 05:28 UTC · model grok-4.3

classification 💻 cs.CL cs.CV
keywords culture-aware captioningmultimodal humor generationpreference alignmentcultural context adaptationimage captioningreward hacking mitigationGRPO
0
0 comments X

The pith

A staged alignment pipeline lets multimodal models generate humorous image captions adapted to specific cultural contexts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines a new task called culture-aware humorous captioning, in which a model must produce funny captions for an image that are appropriate to a given cultural background; captions may differ in wording across cultures yet should share the same visual grounding and humorous logic. Existing multimodal large language models lack reliable control over explicit cultural context, so they often produce captions that are either irrelevant to the image, culturally mismatched, or not funny. The authors address this by building a six-dimensional evaluation rubric and a three-stage training process that starts with abundant Western supervision, applies judge-based preference optimization with a repulsion constraint to avoid reward hacking, and then adapts the model to an Eastern cultural context using limited data. If the approach works, it would mean AI systems can generate humor that stays grounded in the image while respecting cultural differences in what counts as funny or appropriate. Readers would care because humor is highly culture-dependent, and uncontrolled generation risks producing captions that feel tone-deaf or offensive when deployed globally.

Core claim

The central claim is that initializing a multimodal model with high-resource Western cultural supervision, followed by multi-dimensional preference alignment using judge-based GRPO together with a Degradation-aware Prototype Repulsion Constraint, and finally adapting the model to Eastern cultural context with a small amount of supervision, produces stronger overall scores on the six-dimensional evaluation framework, with especially large improvements in contextual fit and a better trade-off between image relevance and humor quality under the specified cultural constraints.

What carries the argument

The staged alignment framework that performs Western initialization, then judge-based GRPO preference alignment with a Degradation-aware Prototype Repulsion Constraint to curb reward hacking, and finally low-resource adaptation to the target cultural context.

If this is right

  • Captions generated under different cultural contexts share similar visual situations or humorous rationales even when their surface wording differs.
  • The method delivers particularly large gains on the contextual-fit dimension of the evaluation framework.
  • Models reach an improved balance between remaining relevant to the image and delivering humor once cultural constraints are enforced.
  • The Degradation-aware Prototype Repulsion Constraint reduces reward hacking during open-ended preference alignment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same staged initialization-plus-alignment-plus-adaptation pattern could be tested on other subjective generation attributes such as sentiment or politeness.
  • If the six dimensions prove incomplete, future work might need to add explicit checks for cultural offensiveness or stereotype reinforcement.
  • The approach implies that cultural adaptation in creative tasks benefits more from explicit multi-dimensional preference modeling than from simple continued fine-tuning alone.

Load-bearing premise

The six-dimensional evaluation framework accurately and without bias measures cultural appropriateness and humor quality, and the judge-based GRPO with the repulsion constraint reliably prevents reward hacking without introducing new biases.

What would settle it

A blinded human study in which raters from both Western and Eastern backgrounds independently score large sets of model outputs on the six dimensions and find no statistically significant gains in contextual fit or overall score compared with strong non-staged baselines.

Figures

Figures reproduced from arXiv: 2604.18091 by Jie Xu, Lu Li, Rongzhao Zhang, Run Xu.

Figure 1
Figure 1. Figure 1: Qualitative examples of culture-aware humorous captioning under different culture contexts. The [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overall three-stage framework of CuHAlign. We first perform SFT on a Western-culture dataset [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Comparison of three LLM judges on the judge validation set for assessing the reliability of au￾tomatic pairwise evaluation across languages and dif￾ficulty levels. 7.5 Judge Reliability and Agreement with Human Evaluation To assess the reliability of automatic evaluation, we further construct a dedicated judge validation set. Specifically, we sample 400 image instances from the test set and build pairwise … view at source ↗
read the original abstract

Recent multimodal large language models have shown promising ability in generating humorous captions for images, yet they still lack stable control over explicit cultural context, making it difficult to jointly maintain image relevance, contextual appropriateness, and humor quality under a specified cultural background. To address this limitation, we introduce a new multimodal generation task, culture-aware humorous captioning, which requires a model to generate a humorous caption conditioned on both an input image and a target cultural context. Captions generated under different cultural contexts are not expected to share the same surface form, but should remain grounded in similar visual situations or humorous rationales.To support this task, we establish a six-dimensional evaluation framework covering image relevance, contextual fit, semantic richness, reasonableness, humor, and creativity. We further propose a staged alignment framework that first initializes the model with high-resource supervision under the Western cultural context, then performs multi-dimensional preference alignment via judge-based GRPO with a Degradation-aware Prototype Repulsion Constraint to mitigate reward hacking in open-ended generation, and finally adapts the model to the Eastern cultural context with a small amount of supervision. Experimental results show that our method achieves stronger overall performance under the proposed evaluation framework, with particularly large gains in contextual fit and a better balance between image relevance and humor under cultural constraints.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces the task of culture-aware humorous captioning, requiring multimodal models to generate humorous image captions conditioned on a specified cultural context (e.g., Western to Eastern). It proposes a six-dimensional evaluation framework (image relevance, contextual fit, semantic richness, reasonableness, humor, creativity) and a staged alignment pipeline: Western high-resource initialization, judge-based GRPO with a Degradation-aware Prototype Repulsion Constraint to mitigate reward hacking, and low-resource Eastern adaptation. The central claim is stronger overall performance under this framework, with particularly large gains in contextual fit and improved balance between image relevance and humor.

Significance. If the results and framework hold under rigorous validation, the work would meaningfully advance multimodal humor generation by addressing cultural context control, a noted gap in current MLLMs. The new task definition, the staged alignment approach, and the specific repulsion constraint for open-ended creative generation represent potentially useful contributions to preference optimization in multimodal settings, provided they demonstrate independence from evaluation artifacts.

major comments (3)
  1. [Abstract] Abstract: The central claim of stronger overall performance (especially large gains in contextual fit) rests entirely on scores from the proposed six-dimensional framework and the judge-based GRPO procedure, yet the manuscript provides no details on datasets, baselines, statistical tests, error bars, inter-rater reliability, or exact implementation of the LLM judges and the Degradation-aware Prototype Repulsion Constraint. This absence makes it impossible to determine whether the data actually supports the claim.
  2. [Evaluation Framework] Evaluation framework description: The six-dimensional framework is load-bearing for all reported gains, but the paper supplies no information on scoring (human vs. LLM judges), calibration against cultural experts, or validation that the dimensions accurately and unbiasedly measure cultural appropriateness and humor quality without introducing bias. This leaves open the possibility that improvements in contextual fit are artifacts of the self-defined metric rather than genuine advances.
  3. [Method] Method section on GRPO: The Degradation-aware Prototype Repulsion Constraint is presented as mitigating reward hacking in open-ended generation, but no equations, ablations, or empirical evidence is given showing it avoids new biases. Given the use of LLM judges that may overlap with the optimized model family, this raises a concrete risk of circularity in the reward signal that directly affects the reliability of the reported balance between relevance and humor.
minor comments (1)
  1. [Method] The description of the staged alignment pipeline would benefit from a clear diagram or pseudocode to illustrate the transition between Western initialization, GRPO, and Eastern adaptation.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We are grateful to the referee for the constructive feedback and the suggestion for major revision. Below, we provide point-by-point responses to the major comments, outlining the revisions we plan to make to address the concerns raised.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim of stronger overall performance (especially large gains in contextual fit) rests entirely on scores from the proposed six-dimensional framework and the judge-based GRPO procedure, yet the manuscript provides no details on datasets, baselines, statistical tests, error bars, inter-rater reliability, or exact implementation of the LLM judges and the Degradation-aware Prototype Repulsion Constraint. This absence makes it impossible to determine whether the data actually supports the claim.

    Authors: We thank the referee for highlighting this issue. The abstract and main text in the submitted manuscript indeed omitted detailed descriptions of the experimental setup to maintain brevity. In the revised manuscript, we will include a comprehensive 'Implementation Details' section covering the datasets used, baseline models, statistical tests performed with error bars, inter-rater reliability measures, and the exact configurations for the LLM judges and the mathematical definition of the Degradation-aware Prototype Repulsion Constraint. This addition will substantiate our central claims. revision: yes

  2. Referee: [Evaluation Framework] Evaluation framework description: The six-dimensional framework is load-bearing for all reported gains, but the paper supplies no information on scoring (human vs. LLM judges), calibration against cultural experts, or validation that the dimensions accurately and unbiasedly measure cultural appropriateness and humor quality without introducing bias. This leaves open the possibility that improvements in contextual fit are artifacts of the self-defined metric rather than genuine advances.

    Authors: We agree that the evaluation framework requires more rigorous documentation to rule out metric artifacts. We will revise the 'Evaluation Framework' section to specify the scoring methodology (primarily LLM-based with human calibration), the calibration process against cultural experts, and validation experiments demonstrating the dimensions' accuracy and lack of bias. This will include quantitative metrics on agreement and expert validation results. revision: yes

  3. Referee: [Method] Method section on GRPO: The Degradation-aware Prototype Repulsion Constraint is presented as mitigating reward hacking in open-ended generation, but no equations, ablations, or empirical evidence is given showing it avoids new biases. Given the use of LLM judges that may overlap with the optimized model family, this raises a concrete risk of circularity in the reward signal that directly affects the reliability of the reported balance between relevance and humor.

    Authors: We recognize the importance of providing equations and evidence for the proposed constraint to address concerns about new biases and circularity. In the revision, we will add the full equations for the Degradation-aware Prototype Repulsion Constraint in the Method section, include ablation studies showing its impact on reward hacking, and clarify the choice of judge models to ensure they are independent from the optimized model family. Empirical results from our experiments will be presented to demonstrate the constraint's effectiveness in balancing relevance and humor. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces a new task (culture-aware humorous captioning), defines a six-dimensional evaluation framework, describes a staged training procedure using judge-based GRPO plus a custom repulsion constraint, and reports empirical performance gains on the new metrics. No equations, parameter fits, or self-citations are shown to reduce the central performance claim to a definitional identity or tautology. The evaluation framework and constraint are presented as independent design choices whose effectiveness is tested experimentally rather than assumed by construction. This is the normal case of a paper proposing new infrastructure and measuring outcomes on it.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based solely on the abstract, the central claim rests on the unverified validity of the six-dimensional evaluation framework and the assumption that the staged alignment procedure produces genuine cultural adaptation rather than artifacts of the training process. No free parameters, axioms, or invented entities are explicitly detailed.

axioms (1)
  • domain assumption The six-dimensional evaluation framework accurately captures contextual fit, humor, and cultural appropriateness without systematic bias from LLM judges.
    Invoked when claiming stronger overall performance and large gains in contextual fit.

pith-pipeline@v0.9.0 · 5526 in / 1465 out tokens · 44370 ms · 2026-05-10T05:28:55.636245+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

77 extracted references · 19 canonical work pages · 8 internal anchors

  1. [1]

    Improved baselines with vi- sual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with vi- sual instruction tuning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26296–26306, 2024

  2. [2]

    Osprey: Pixel understanding with visual instruction tuning

    Yuqian Yuan, Wentong Li, Jian Liu, Dongqi Tang, Xinjie Luo, Chi Qin, Lei Zhang, and Jianke Zhu. Osprey: Pixel understanding with visual instruction tuning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 28202–28211, 2024

  3. [3]

    Patch matters: Training-free 12 fine-grained image caption enhancement via lo- cal perception

    Ruotian Peng, Haiying He, Yake Wei, Yandong Wen, and Di Hu. Patch matters: Training-free 12 fine-grained image caption enhancement via lo- cal perception. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3963–3973, 2025

  4. [4]

    Benchmark- ing large vision-language models via directed scene graph for comprehensive image caption- ing

    Fan Lu, Wei Wu, Kecheng Zheng, Shuailei Ma, Biao Gong, Jiawei Liu, Wei Zhai, Yang Cao, Yujun Shen, and Zheng-Jun Zha. Benchmark- ing large vision-language models via directed scene graph for comprehensive image caption- ing. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 19618– 19627, 2025

  5. [5]

    Fleur: An explainable reference-free evaluation metric for image captioning using a large mul- timodal model

    Yebin Lee, Imseong Park, and Myungjoo Kang. Fleur: An explainable reference-free evaluation metric for image captioning using a large mul- timodal model. InProceedings of the 62nd An- nual Meeting of the Association for Computa- tional Linguistics (Volume 1: Long Papers), pages 3732–3746, 2024

  6. [6]

    Caparena: Benchmarking and analyzing de- tailed image captioning in the llm era

    Kanzhi Cheng, Wenpo Song, Jiaxin Fan, Zheng Ma, Qiushi Sun, Fangzhi Xu, Chenyang Yan, Nuo Chen, Jianbing Zhang, and Jiajun Chen. Caparena: Benchmarking and analyzing de- tailed image captioning in the llm era. InFind- ings of the Association for Computational Lin- guistics: ACL 2025, pages 14077–14094, 2025

  7. [7]

    Describe anything: Detailed localized image and video captioning

    Long Lian, Yifan Ding, Yunhao Ge, Sifei Liu, Hanzi Mao, Boyi Li, Marco Pavone, Ming-Yu Liu, Trevor Darrell, Adam Yala, et al. Describe anything: Detailed localized image and video captioning. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 21766–21777, 2025

  8. [8]

    Compcap: Improving multimodal large language models with compos- ite captions

    Xiaohui Chen, Satya Narayan Shukla, Mah- moud Azab, Aashu Singh, Qifan Wang, David Yang, ShengYun Peng, Hanchao Yu, Shen Yan, Xuewen Zhang, et al. Compcap: Improving multimodal large language models with compos- ite captions. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 23582–23592, 2025

  9. [9]

    Sc-captioner: Improving im- age captioning with self-correction by reinforce- ment learning

    Lin Zhang, Xianfang Zeng, Kangcong Li, Gang Yu, and Tao Chen. Sc-captioner: Improving im- age captioning with self-correction by reinforce- ment learning. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 23145–23155, 2025

  10. [10]

    Ode: Open- set evaluation of hallucinations in multimodal large language models

    Yahan Tu, Rui Hu, and Jitao Sang. Ode: Open- set evaluation of hallucinations in multimodal large language models. InProceedings of the Computer Vision and Pattern Recognition Con- ference, pages 19836–19845, 2025

  11. [11]

    Diffusion bridge: leveraging diffu- sion model to reduce the modality gap between text and vision for zero-shot image captioning

    Jeong Ryong Lee, Yejee Shin, Geonhui Son, and Dosik Hwang. Diffusion bridge: leveraging diffu- sion model to reduce the modality gap between text and vision for zero-shot image captioning. InProceedings of the Computer Vision and Pat- tern Recognition Conference, pages 4050–4059, 2025

  12. [12]

    Let’s think outside the box: Ex- ploring leap-of-thought in large language mod- els with creative humor generation

    Shanshan Zhong, Zhongzhan Huang, Shanghua Gao, Wushao Wen, Liang Lin, Marinka Zitnik, and Pan Zhou. Let’s think outside the box: Ex- ploring leap-of-thought in large language mod- els with creative humor generation. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13246– 13257, 2024

  13. [13]

    *** yes- but***: A high-quality annotated multimodal dataset for evaluating satire comprehension ca- pability of vision-language models

    Abhilash Nandy, Yash Agarwal, Ashish Patwa, Millon Madhur Das, Aman Bansal, Ankit Raj, Pawan Goyal, and Niloy Ganguly. *** yes- but***: A high-quality annotated multimodal dataset for evaluating satire comprehension ca- pability of vision-language models. InPro- ceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 168...

  14. [14]

    Under- standing figurative meaning through explainable visual entailment

    Arkadiy Saakyan, Shreyas Kulkarni, Tuhin Chakrabarty, and Smaranda Muresan. Under- standing figurative meaning through explainable visual entailment. InProceedings of the 2025 Conference of the Nations of the Americas Chap- ter of the Association for Computational Lin- guistics: Human Language Technologies (Vol- ume 1: Long Papers), pages 1–23, 2025. 13

  15. [15]

    Humor in pixels: Benchmarking large multimodal models understanding of online comics

    Yuriel Ryan, Rui Yang Tan, Kenny Tsu Wei Choo, and Roy Ka-Wei Lee. Humor in pix- els: Benchmarking large multimodal models un- derstanding of online comics.arXiv preprint arXiv:2509.12248, 2025

  16. [16]

    Humordb: Can ai un- derstand graphical humor? InProceedings of the IEEE/CVF International Conference on Com- puter Vision, pages 604–613, 2025

    Vedaant V Jain, Gabriel Kreiman, and Felipe dos Santos Alves Feitosa. Humordb: Can ai un- derstand graphical humor? InProceedings of the IEEE/CVF International Conference on Com- puter Vision, pages 604–613, 2025

  17. [17]

    Bottlehumor: Self-informed humor ex- planation using the information bottleneck prin- ciple

    EunJeong Hwang, Peter West, and Vered Shwartz. Bottlehumor: Self-informed humor ex- planation using the information bottleneck prin- ciple. InFindings of the Association for Com- putational Linguistics: ACL 2025, pages 22611– 22632, 2025

  18. [18]

    Understanding the capabilities and lim- itations of large language models for cultural commonsense

    Siqi Shen, Lajanugen Logeswaran, Moontae Lee, Honglak Lee, Soujanya Poria, and Rada Mihal- cea. Understanding the capabilities and lim- itations of large language models for cultural commonsense. InProceedings of the 2024 Con- ference of the North American Chapter of the Association for Computational Linguistics: Hu- man Language Technologies (Volume 1: L...

  19. [19]

    Towards mea- suring and modeling “culture” in llms: A survey

    Muhammad Farid Adilazuarda, Sagnik Mukher- jee, Pradhyumna Lavania, Siddhant Shivdutt Singh, Alham Fikri Aji, Jacki O’Neill, Ashutosh Modi, and Monojit Choudhury. Towards mea- suring and modeling “culture” in llms: A survey. InProceedings of the 2024 Conference on Em- pirical Methods in Natural Language Processing, pages 15763–15784, 2024

  20. [20]

    Investigating cul- tural alignment of large language models

    Badr AlKhamissi, Muhammad ElNokrashy, Mai Alkhamissi, and Mona Diab. Investigating cul- tural alignment of large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Vol- ume 1: Long Papers), pages 12404–12422, 2024

  21. [21]

    Culturellm: Incorporating cultural differences into large lan- guage models.Advances in Neural Information Processing Systems, 37:84799–84838, 2024

    Cheng Li, Mengzhuo Chen, Jindong Wang, Sunayana Sitaram, and Xing Xie. Culturellm: Incorporating cultural differences into large lan- guage models.Advances in Neural Information Processing Systems, 37:84799–84838, 2024

  22. [22]

    Cul- turepark: Boosting cross-cultural understanding inlargelanguagemodels.Advances in Neural In- formation Processing Systems, 37:65183–65216, 2024

    Cheng Li, Damien Teney, Linyi Yang, Qing- song Wen, Xing Xie, and Jindong Wang. Cul- turepark: Boosting cross-cultural understanding inlargelanguagemodels.Advances in Neural In- formation Processing Systems, 37:65183–65216, 2024

  23. [23]

    Extrinsic evaluation of cultural competence in large lan- guage models

    Shaily Bhatt and Fernando Diaz. Extrinsic evaluation of cultural competence in large lan- guage models. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 16055–16074, 2024

  24. [24]

    Culturally aware and adapted nlp: A taxonomy and a survey of the state of the art.Transactions of the Association for Compu- tational Linguistics, 13:652–689, 2025

    Chen Cecilia Liu, Iryna Gurevych, and Anna Korhonen. Culturally aware and adapted nlp: A taxonomy and a survey of the state of the art.Transactions of the Association for Compu- tational Linguistics, 13:652–689, 2025

  25. [25]

    Dr- ishtikon: A multimodal multilingual benchmark for testing language models’ understanding on indian culture

    Arijit Maji, Raghvendra Kumar, Akash Ghosh, Nemil Shah, Abhilekh Borah, Vanshika Shah, Nishant Mishra, Sriparna Saha, et al. Dr- ishtikon: A multimodal multilingual benchmark for testing language models’ understanding on indian culture. InProceedings of the 2025 Con- ference on Empirical Methods in Natural Lan- guage Processing, pages 1289–1313, 2025

  26. [26]

    Break the checkbox: challenging closed- style evaluations of cultural alignment in llms

    Mohsinul Kabir, Ajwad Abrar, and Sophia Ana- niadou. Break the checkbox: challenging closed- style evaluations of cultural alignment in llms. InProceedings of the 2025 Conference on Em- pirical Methods in Natural Language Processing, pages 24–51, 2025

  27. [27]

    Incorpo- rating diverse perspectives in cultural align- ment: Survey of evaluation benchmarks through a three-dimensional framework

    Meng-Chen Wu, Si-Chi Chin, Tess Wood, Ayush Goyal, and Narayanan Sadagopan. Incorpo- rating diverse perspectives in cultural align- ment: Survey of evaluation benchmarks through a three-dimensional framework. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 17037– 17072, 2025. 14

  28. [28]

    Socialcc: Interactive evalua- tion for cultural competence in language agents

    Jincenzi Wu, Jianxun Lian, Dingdong Wang, and Helen Meng. Socialcc: Interactive evalua- tion for cultural competence in language agents. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Vol- ume 1: Long Papers), pages 33242–33271, 2025

  29. [29]

    Care: Multilingual human pref- erence learning for cultural awareness

    Geyang Guo, Tarek Naous, Hiromi Wakaki, Yukiko Nishimura, Yuki Mitsufuji, Alan Ritter, and Wei Xu. Care: Multilingual human pref- erence learning for cultural awareness. InPro- ceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 32854–32883, 2025

  30. [30]

    Oxfordtvg-hic: Can ma- chine make humorous captions from images? InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 20293– 20303, 2023

    Runjia Li, Shuyang Sun, Mohamed Elhoseiny, and Philip Torr. Oxfordtvg-hic: Can ma- chine make humorous captions from images? InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 20293– 20303, 2023

  31. [31]

    Humor in ai: Massive scale crowd-sourced preferences and benchmarks for cartoon captioning.Advances in Neural Infor- mation Processing Systems, 37:125264–125286, 2024

    Jifan Zhang, Lalit Jain, Yang Guo, Jiayi Chen, Kuan L Zhou, Siddharth Suresh, Andrew Wa- genmaker, Scott Sievert, Timothy Rogers, Kevin Jamieson, et al. Humor in ai: Massive scale crowd-sourced preferences and benchmarks for cartoon captioning.Advances in Neural Infor- mation Processing Systems, 37:125264–125286, 2024

  32. [32]

    Meme- cap: A dataset for captioning and interpreting memes

    EunJeong Hwang and Vered Shwartz. Meme- cap: A dataset for captioning and interpreting memes. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Pro- cessing, pages 1433–1445, 2023

  33. [33]

    Pun- memecn: A benchmark to explore vision- language models’ understanding of chinese pun memes

    Zhijun Xu, Siyu Yuan, Yiqiao Zhang, Jingyu Sun, Tong Zheng, and Deqing Yang. Pun- memecn: A benchmark to explore vision- language models’ understanding of chinese pun memes. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Pro- cessing, pages 18705–18721, 2025

  34. [34]

    Bridging the creativity understanding gap: Small-scale human alignment enables expert-level humor ranking in llms

    Kuan Lok Zhou, Jiayi Chen, Siddharth Suresh, Reuben Narad, Timothy T Rogers, Lalit K Jain, Robert D Nowak, Bob Mankoff, and Ji- fan Zhang. Bridging the creativity understand- ing gap: Small-scale human alignment enables expert-level humor ranking in llms.arXiv preprint arXiv:2502.20356, 2025

  35. [35]

    Xmecap: Meme caption generation with sub-image adaptabil- ity

    Yuyan Chen, Songzhou Yan, Zhihong Zhu, Zhixu Li, and Yanghua Xiao. Xmecap: Meme caption generation with sub-image adaptabil- ity. InProceedings of the 32nd ACM Interna- tional Conference on Multimedia, pages 3352– 3361, 2024

  36. [36]

    Humorchain: Theory-guided multi-stage reasoning for interpretable multimodal humor generation.arXiv preprint arXiv:2511.21732, 2025

    Jiajun Zhang, Shijia Luo, Ruikang Zhang, and Qi Su. Humorchain: Theory-guided multi-stage reasoning for interpretable multimodal humor generation.arXiv preprint arXiv:2511.21732, 2025

  37. [37]

    On the wings of imagination: Con- flicting script-based multi-role framework for humor caption generation.arXiv preprint arXiv:2602.06423, 2026

    Wenbo Shang, Yuxi Sun, Jing Ma, and Xin Huang. On the wings of imagination: Con- flicting script-based multi-role framework for humor caption generation.arXiv preprint arXiv:2602.06423, 2026

  38. [38]

    Learning combina- torial prompts for universal controllable image captioning.International Journal of Computer Vision, 133(1):129–150, 2025

    Zhen Wang, Jun Xiao, Yueting Zhuang, Fei Gao, Jian Shao, and Long Chen. Learning combina- torial prompts for universal controllable image captioning.International Journal of Computer Vision, 133(1):129–150, 2025

  39. [39]

    Controlcap: Controllable region-level caption- ing

    Yuzhong Zhao, Yue Liu, Zonghao Guo, Weijia Wu, Chen Gong, Qixiang Ye, and Fang Wan. Controlcap: Controllable region-level caption- ing. InEuropean Conference on Computer Vi- sion, pages 21–38. Springer, 2024

  40. [40]

    Mcoca: To- wards fine-grained multimodal control in image captioning.Pattern Recognition, page 112381, 2025

    Shanshan Zhao, Teng Wang, Jinrui Zhang, Xi- angchen Wang, and Feng Zheng. Mcoca: To- wards fine-grained multimodal control in image captioning.Pattern Recognition, page 112381, 2025

  41. [41]

    Con- trollable contextualized image captioning: Di- recting the visual narrative through user-defined highlights

    Shunqi Mao, Chaoyi Zhang, Hang Su, Hwanjun Song, Igor Shalyminov, and Weidong Cai. Con- trollable contextualized image captioning: Di- recting the visual narrative through user-defined highlights. InEuropean Conference on Com- puter Vision, pages 464–481. Springer, 2024. 15

  42. [42]

    Repic: Reinforced post-training for personalizing multi-modal lan- guage models.arXiv:2506.18369, 2025

    Yeongtak Oh, Dohyun Chung, Juhyeon Shin, SanghaPark, JohanBarthelemy, JisooMok, and Sungroh Yoon. Repic: Reinforced post-training for personalizing multi-modal language models. arXiv preprint arXiv:2506.18369, 2025

  43. [43]

    Visual captioning at will: Describing images and videos guided by a few stylized sentences

    Dingyi Yang, Hongyu Chen, Xinglin Hou, Tiezheng Ge, Yuning Jiang, and Qin Jin. Visual captioning at will: Describing images and videos guided by a few stylized sentences. InProceed- ings of the 31st ACM international conference on multimedia, pages 5705–5715, 2023

  44. [44]

    Cap- tionsmiths: Flexibly controlling language pat- tern in image captioning

    Kuniaki Saito, Donghyun Kim, Kwanyong Park, Atsushi Hashimoto, and Yoshitaka Ushiku. Cap- tionsmiths: Flexibly controlling language pat- tern in image captioning. InProceedings of the IEEE/CVF International Conference on Com- puter Vision, pages 19872–19881, 2025

  45. [45]

    Any- cap project: A unified framework, dataset, and benchmark for controllable omni-modal caption- ing.arXiv preprint arXiv:2507.12841, 2025

    Yiming Ren, Zhiqiang Lin, Yu Li, Gao Meng, Weiyun Wang, Junjie Wang, Zicheng Lin, Jifeng Dai, Yujiu Yang, Wenhai Wang, et al. Any- cap project: A unified framework, dataset, and benchmark for controllable omni-modal caption- ing.arXiv preprint arXiv:2507.12841, 2025

  46. [46]

    Culturallearning-basedcultureadap- tation of language models

    Chen Cecilia Liu, Anna Korhonen, and Iryna Gurevych. Culturallearning-basedcultureadap- tation of language models. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pa- pers), pages 3114–3134, 2025

  47. [47]

    Mmbench: Is your multi-modal model an all- around player? InEuropean conference on com- puter vision, pages 216–233

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all- around player? InEuropean conference on com- puter vision, pages 216–233. Springer, 2024

  48. [48]

    SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

    Bohao Li, Rui Wang, Guangzhi Wang, Yuy- ing Ge, Yixiao Ge, and Ying Shan. Seed- bench: Benchmarking multimodal llms with generative comprehension.arXiv preprint arXiv:2307.16125, 2023

  49. [49]

    Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for ex- pert agi

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for ex- pert agi. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recogni- tion, pages 9556–9567, 2024

  50. [50]

    Judgelm: Fine-tuned large language models are scalable judges,

    Lianghui Zhu, Xinggang Wang, and Xinlong Wang. Judgelm: Fine-tuned large language models are scalable judges.arXiv preprint arXiv:2310.17631, 2023

  51. [51]

    Llava-critic: Learning to evalu- ate multimodal models

    Tianyi Xiong, Xiyao Wang, Dong Guo, Qinghao Ye, Haoqi Fan, Quanquan Gu, Heng Huang, and Chunyuan Li. Llava-critic: Learning to evalu- ate multimodal models. InProceedings of the Computer Vision and Pattern Recognition Con- ference, pages 13618–13628, 2025

  52. [52]

    From generation to judgment: Opportunities and challenges of llm- as-a-judge

    Dawei Li, Bohan Jiang, Liangjie Huang, Alimo- hammad Beigi, Chengshuai Zhao, Zhen Tan, Amrita Bhattacharjee, Yuxuan Jiang, Canyu Chen, Tianhao Wu, et al. From generation to judgment: Opportunities and challenges of llm- as-a-judge. InProceedings of the 2025 Confer- ence on Empirical Methods in Natural Language Processing, pages 2757–2791, 2025

  53. [53]

    Mllm-as-a-judge: Assessing multimodal llm-as-a-judge with vision-language benchmark

    Dongping Chen, Ruoxi Chen, Shilin Zhang, Yaochen Wang, Yinuo Liu, Huichi Zhou, Qi- hui Zhang, Yao Wan, Pan Zhou, and Lichao Sun. Mllm-as-a-judge: Assessing multimodal llm-as-a-judge with vision-language benchmark. InForty-first International Conference on Ma- chine Learning, 2024

  54. [54]

    Judging the judges: Evaluating alignment and vulnera- bilities in llms-as-judges

    Aman Singh Thakur, Kartik Choudhary, Venkat Srinik Ramayapally, Sankaran Vaidyanathan, and Dieuwke Hupkes. Judging the judges: Evaluating alignment and vulnera- bilities in llms-as-judges. InProceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM2), pages 404–430, 2025

  55. [55]

    Judge anything: Mllm as a judge 16 across any modality

    Shu Pu, Yaochen Wang, Dongping Chen, Yuhang Chen, Guohao Wang, Qi Qin, Zhongyi Zhang, Zhiyuan Zhang, Zetong Zhou, Shuang Gong, et al. Judge anything: Mllm as a judge 16 across any modality. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Dis- covery and Data Mining V. 2, pages 5742–5753, 2025

  56. [56]

    Crowd comparative reasoning: Un- locking comprehensive evaluations for llm-as-a- judge

    Qiyuan Zhang, Yufei Wang, Yuxin Jiang, Liangyou Li, Chuhan Wu, Yasheng Wang, Xin Jiang, Lifeng Shang, Ruiming Tang, Fuyuan Lyu, et al. Crowd comparative reasoning: Un- locking comprehensive evaluations for llm-as-a- judge. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguis- tics (Volume 1: Long Papers), pages 5059–5074, 2025

  57. [57]

    On evaluating llm alignment by evaluating llms as judges.arXiv preprint arXiv:2511.20604, 2025

    Yixin Liu, Pengfei Liu, and Arman Cohan. On evaluating llm alignment by evaluating llms as judges.arXiv preprint arXiv:2511.20604, 2025

  58. [58]

    Mammoth- vl: Eliciting multimodal reasoning with instruc- tion tuning at scale

    Jiawei Guo, Tianyu Zheng, Yizhi Li, Yuelin Bai, Bo Li, Yubo Wang, King Zhu, Graham Neu- big, Wenhu Chen, and Xiang Yue. Mammoth- vl: Eliciting multimodal reasoning with instruc- tion tuning at scale. InProceedings of the 63rd Annual Meeting of the Association for Compu- tational Linguistics (Volume 1: Long Papers), pages 13869–13920, 2025

  59. [59]

    Rlhf-v: Towards trustworthy mllms via behavior align- ment from fine-grained correctional human feed- back

    Tianyu Yu, Yuan Yao, Haoye Zhang, Taiwen He, Yifeng Han, Ganqu Cui, Jinyi Hu, Zhiyuan Liu, Hai-Tao Zheng, Maosong Sun, et al. Rlhf-v: Towards trustworthy mllms via behavior align- ment from fine-grained correctional human feed- back. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recogni- tion, pages 13807–13816, 2024

  60. [60]

    Multi-modal preference alignment remedies degradation of visual instruction tuning on lan- guage models

    Shengzhi Li, Rongyu Lin, and Shichao Pei. Multi-modal preference alignment remedies degradation of visual instruction tuning on lan- guage models. InProceedings of the 62nd An- nual Meeting of the Association for Computa- tional Linguistics (Volume 1: Long Papers), pages 14188–14200, 2024

  61. [61]

    Mm-rlhf: The next step forward in multimodal llm alignment,

    Yi-Fan Zhang, Tao Yu, Haochen Tian, Chaoyou Fu, Peiyan Li, Jianshu Zeng, Wulin Xie, Yang Shi, Huanyu Zhang, Junkang Wu, et al. Mm- rlhf: The next step forward in multimodal llm alignment.arXiv preprint arXiv:2502.10391, 2025

  62. [62]

    Llava steering: Visual instruction tuning with 500x fewer parameters through modality lin- ear representation-steering

    Jinhe Bi, Yujun Wang, Haokun Chen, Xun Xiao, Artur Hecker, Volker Tresp, and Yunpu Ma. Llava steering: Visual instruction tuning with 500x fewer parameters through modality lin- ear representation-steering. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pa- pers), pages 15230–15250, 2025

  63. [63]

    Commit: Coordinated multimodal instruction tuning

    Xintong Li, Junda Wu, Tong Yu, Rui Wang, Yu Wang, Xiang Chen, Jiuxiang Gu, Lina Yao, Julian McAuley, and Jingbo Shang. Commit: Coordinated multimodal instruction tuning. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 11533–11547, 2025

  64. [64]

    Task prefer- ence optimization: Improving multimodal large language models with vision task alignment

    Ziang Yan, Zhilin Li, Yinan He, Chenting Wang, Kunchang Li, Xinhao Li, Xiangyu Zeng, Zilei Wang, Yali Wang, Yu Qiao, et al. Task prefer- ence optimization: Improving multimodal large language models with vision task alignment. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 29880– 29892, 2025

  65. [65]

    Re- align: Aligning vision language models via retrieval-augmented direct preference optimiza- tion

    Shuo Xing, Peiran Li, Yuping Wang, Ruizheng Bai, Yueqi Wang, Chan-Wei Hu, Chengxuan Qian, Huaxiu Yao, and Zhengzhong Tu. Re- align: Aligning vision language models via retrieval-augmented direct preference optimiza- tion. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Process- ing, pages 2379–2397, 2025

  66. [66]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

  67. [67]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, 17 AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

  68. [68]

    Gemini 3 flash preview.https: //ai.google.dev/gemini-api/docs/models/ gemini-3-flash-preview, 2026

    Google. Gemini 3 flash preview.https: //ai.google.dev/gemini-api/docs/models/ gemini-3-flash-preview, 2026. Official model documentation, accessed 2026-04-02

  69. [69]

    Claude sonnet 4.5 system card.https://www.anthropic.com/ claude-sonnet-4-5-system-card, 2025

    Anthropic. Claude sonnet 4.5 system card.https://www.anthropic.com/ claude-sonnet-4-5-system-card, 2025. Official system card, accessed 2026-04-02

  70. [70]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Ex- ploring advanced training and test-time recipes for open-source multimodal models.arXiv preprint arXiv:2504.10479, 2025

  71. [71]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easyvisualtasktransfer.arXiv preprint arXiv:2408.03326, 2024

  72. [72]

    MiniCPM-V: A GPT-4V Level MLLM on Your Phone

    Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800, 2024

  73. [73]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: En- hancing vision-language understanding with ad- vanced large language models.arXiv preprint arXiv:2304.10592, 2023

  74. [74]

    ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

    Team Glm, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Dan Zhang, Diego Ro- jas, Guanyu Feng, Hanlin Zhao, et al. Chat- glm: A family of large language models from glm-130b to glm-4 all tools.arXiv preprint arXiv:2406.12793, 2024

  75. [75]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jian- qiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tian- bao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5- vl technical repo...

  76. [76]

    org/CorpusID:276449796

    URLhttps://api.semanticscholar. org/CorpusID:276449796

  77. [77]

    Cogvlm2: Visual language models for image and video un- derstanding

    Wenyi Hong, Weihan Wang, Ming Ding, Wen- meng Yu, Qingsong Lv, Yan Wang, Yean Cheng, Shiyu Huang, Junhui Ji, Zhao Xue, et al. Cogvlm2: Visual language models for image and video understanding.arXiv preprint arXiv:2408.16500, 2024. 18