Culture-Aware Humorous Captioning: Multimodal Humor Generation across Cultural Contexts
Pith reviewed 2026-05-10 05:28 UTC · model grok-4.3
The pith
A staged alignment pipeline lets multimodal models generate humorous image captions adapted to specific cultural contexts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a three-stage pipeline yields stronger overall scores on the six-dimensional evaluation framework: the model is first initialized with high-resource Western cultural supervision, then aligned to multi-dimensional preferences via judge-based GRPO with a Degradation-aware Prototype Repulsion Constraint, and finally adapted to an Eastern cultural context with a small amount of supervision. The reported gains are especially large in contextual fit, alongside a better trade-off between image relevance and humor quality under the specified cultural constraints.
What carries the argument
The staged alignment framework that performs Western initialization, then judge-based GRPO preference alignment with a Degradation-aware Prototype Repulsion Constraint to curb reward hacking, and finally low-resource adaptation to the target cultural context.
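The abstract gives no implementation details for this pipeline, so the judge-based GRPO step can only be sketched. Below is a minimal reading, assuming equal weighting of the six evaluation dimensions and standard group-relative normalization; the function names and the weighting are illustrative assumptions, not the authors' specification.

```python
from statistics import mean, pstdev

# The six dimensions named in the paper; equal weighting is an assumption,
# since the paper does not publish its aggregation scheme.
DIMENSIONS = ["image_relevance", "contextual_fit", "semantic_richness",
              "reasonableness", "humor", "creativity"]

def aggregate_reward(judge_scores: dict) -> float:
    """Collapse per-dimension judge scores into one scalar reward."""
    return mean(judge_scores[d] for d in DIMENSIONS)

def grpo_advantages(group_scores: list) -> list:
    """Group-relative advantages in the GRPO style: each sampled caption is
    scored against the mean and std of its own sampling group, so no learned
    value network is needed."""
    rewards = [aggregate_reward(s) for s in group_scores]
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0.0:
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]
```

Because advantages are normalized within each sampling group, a caption is rewarded only for beating its siblings, which is what lets per-dimension judge scores serve directly as the scalar training signal.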
If this is right
- Captions generated under different cultural contexts share similar visual situations or humorous rationales even when their surface wording differs.
- The method delivers particularly large gains on the contextual-fit dimension of the evaluation framework.
- Models reach an improved balance between remaining relevant to the image and delivering humor once cultural constraints are enforced.
- The Degradation-aware Prototype Repulsion Constraint reduces reward hacking during open-ended preference alignment.
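The Degradation-aware Prototype Repulsion Constraint named in the last bullet is not defined in the available text. One plausible mechanism, sketched here purely as an assumption, penalizes captions whose embeddings drift toward a prototype of known-degraded outputs (e.g. repetitive or template-like captions that fool the judges); the margin, weight, and embedding interface are hypothetical.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def repulsion_penalty(caption_emb, degradation_prototype, margin=0.3, weight=1.0):
    """Hypothetical penalty: zero inside the margin, linear cost once a
    caption's embedding gets too close to the degradation prototype."""
    sim = cosine(caption_emb, degradation_prototype)
    return weight * max(0.0, sim - margin)

def shaped_reward(judge_reward, caption_emb, degradation_prototype):
    # Reward hacking would show up as drift toward the prototype; subtracting
    # the penalty repels the policy from that region of embedding space.
    return judge_reward - repulsion_penalty(caption_emb, degradation_prototype)
```

Under this reading, the constraint reshapes the judge reward rather than filtering outputs, which would explain why it can act during open-ended preference alignment.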
Where Pith is reading between the lines
- The same staged initialization-plus-alignment-plus-adaptation pattern could be tested on other subjective generation attributes such as sentiment or politeness.
- If the six dimensions prove incomplete, future work might need to add explicit checks for cultural offensiveness or stereotype reinforcement.
- The approach implies that cultural adaptation in creative tasks benefits more from explicit multi-dimensional preference modeling than from simple continued fine-tuning alone.
Load-bearing premise
The six-dimensional evaluation framework measures cultural appropriateness and humor quality accurately and without bias, and judge-based GRPO with the repulsion constraint reliably prevents reward hacking without introducing new biases.
What would settle it
A blinded human study in which raters from both Western and Eastern backgrounds independently score large sets of model outputs on the six dimensions; the claim would be refuted if they found no statistically significant gains in contextual fit or overall score over strong non-staged baselines.
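Whatever rating interface such a study uses, the decision rule reduces to a two-sample comparison of blinded scores. A minimal permutation test for that comparison, in pure Python and purely illustrative:

```python
import random

def permutation_p_value(staged, baseline, n_perm=10_000, seed=0):
    """One-sided p-value for mean(staged) > mean(baseline) under random
    relabeling of the pooled blinded ratings."""
    rng = random.Random(seed)
    observed = sum(staged) / len(staged) - sum(baseline) / len(baseline)
    pooled = list(staged) + list(baseline)
    n = len(staged)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = sum(pooled[:n]) / n - sum(pooled[n:]) / len(baseline)
        if diff >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)
```

Given per-rater score lists for the staged model and a baseline on, say, contextual fit, a small p-value would support the claim; the blinding and the mixed-background rater pool are what protect such a test from the self-scoring circularity concern.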
Original abstract
Recent multimodal large language models have shown promising ability in generating humorous captions for images, yet they still lack stable control over explicit cultural context, making it difficult to jointly maintain image relevance, contextual appropriateness, and humor quality under a specified cultural background. To address this limitation, we introduce a new multimodal generation task, culture-aware humorous captioning, which requires a model to generate a humorous caption conditioned on both an input image and a target cultural context. Captions generated under different cultural contexts are not expected to share the same surface form, but should remain grounded in similar visual situations or humorous rationales. To support this task, we establish a six-dimensional evaluation framework covering image relevance, contextual fit, semantic richness, reasonableness, humor, and creativity. We further propose a staged alignment framework that first initializes the model with high-resource supervision under the Western cultural context, then performs multi-dimensional preference alignment via judge-based GRPO with a Degradation-aware Prototype Repulsion Constraint to mitigate reward hacking in open-ended generation, and finally adapts the model to the Eastern cultural context with a small amount of supervision. Experimental results show that our method achieves stronger overall performance under the proposed evaluation framework, with particularly large gains in contextual fit and a better balance between image relevance and humor under cultural constraints.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the task of culture-aware humorous captioning, requiring multimodal models to generate humorous image captions conditioned on a specified cultural context (e.g., Western to Eastern). It proposes a six-dimensional evaluation framework (image relevance, contextual fit, semantic richness, reasonableness, humor, creativity) and a staged alignment pipeline: Western high-resource initialization, judge-based GRPO with a Degradation-aware Prototype Repulsion Constraint to mitigate reward hacking, and low-resource Eastern adaptation. The central claim is stronger overall performance under this framework, with particularly large gains in contextual fit and improved balance between image relevance and humor.
Significance. If the results and framework hold under rigorous validation, the work would meaningfully advance multimodal humor generation by addressing cultural context control, a noted gap in current MLLMs. The new task definition, the staged alignment approach, and the specific repulsion constraint for open-ended creative generation represent potentially useful contributions to preference optimization in multimodal settings, provided they demonstrate independence from evaluation artifacts.
major comments (3)
- [Abstract] Abstract: The central claim of stronger overall performance (especially large gains in contextual fit) rests entirely on scores from the proposed six-dimensional framework and the judge-based GRPO procedure, yet the manuscript provides no details on datasets, baselines, statistical tests, error bars, inter-rater reliability, or exact implementation of the LLM judges and the Degradation-aware Prototype Repulsion Constraint. This absence makes it impossible to determine whether the data actually supports the claim.
- [Evaluation Framework] Evaluation framework description: The six-dimensional framework is load-bearing for all reported gains, but the paper supplies no information on scoring (human vs. LLM judges), calibration against cultural experts, or validation that the dimensions measure cultural appropriateness and humor quality accurately and without bias. This leaves open the possibility that improvements in contextual fit are artifacts of the self-defined metric rather than genuine advances.
- [Method] Method section on GRPO: The Degradation-aware Prototype Repulsion Constraint is presented as mitigating reward hacking in open-ended generation, but no equations, ablations, or empirical evidence are given showing that it avoids introducing new biases. Given the use of LLM judges that may overlap with the optimized model family, this raises a concrete risk of circularity in the reward signal that directly affects the reliability of the reported balance between relevance and humor.
minor comments (1)
- [Method] The description of the staged alignment pipeline would benefit from a clear diagram or pseudocode to illustrate the transition between Western initialization, GRPO, and Eastern adaptation.
Simulated Author's Rebuttal
We are grateful to the referee for the constructive feedback and the suggestion for major revision. Below, we provide point-by-point responses to the major comments, outlining the revisions we plan to make to address the concerns raised.
Point-by-point responses
Referee: [Abstract] Abstract: The central claim of stronger overall performance (especially large gains in contextual fit) rests entirely on scores from the proposed six-dimensional framework and the judge-based GRPO procedure, yet the manuscript provides no details on datasets, baselines, statistical tests, error bars, inter-rater reliability, or exact implementation of the LLM judges and the Degradation-aware Prototype Repulsion Constraint. This absence makes it impossible to determine whether the data actually supports the claim.
Authors: We thank the referee for highlighting this issue. The abstract and main text in the submitted manuscript indeed omitted detailed descriptions of the experimental setup to maintain brevity. In the revised manuscript, we will include a comprehensive 'Implementation Details' section covering the datasets used, baseline models, statistical tests performed with error bars, inter-rater reliability measures, and the exact configurations for the LLM judges and the mathematical definition of the Degradation-aware Prototype Repulsion Constraint. This addition will substantiate our central claims. revision: yes
Referee: [Evaluation Framework] Evaluation framework description: The six-dimensional framework is load-bearing for all reported gains, but the paper supplies no information on scoring (human vs. LLM judges), calibration against cultural experts, or validation that the dimensions measure cultural appropriateness and humor quality accurately and without bias. This leaves open the possibility that improvements in contextual fit are artifacts of the self-defined metric rather than genuine advances.
Authors: We agree that the evaluation framework requires more rigorous documentation to rule out metric artifacts. We will revise the 'Evaluation Framework' section to specify the scoring methodology (primarily LLM-based with human calibration), the calibration process against cultural experts, and validation experiments demonstrating the dimensions' accuracy and lack of bias. This will include quantitative metrics on agreement and expert validation results. revision: yes
Referee: [Method] Method section on GRPO: The Degradation-aware Prototype Repulsion Constraint is presented as mitigating reward hacking in open-ended generation, but no equations, ablations, or empirical evidence are given showing that it avoids introducing new biases. Given the use of LLM judges that may overlap with the optimized model family, this raises a concrete risk of circularity in the reward signal that directly affects the reliability of the reported balance between relevance and humor.
Authors: We recognize the importance of providing equations and evidence for the proposed constraint to address concerns about new biases and circularity. In the revision, we will add the full equations for the Degradation-aware Prototype Repulsion Constraint in the Method section, include ablation studies showing its impact on reward hacking, and clarify the choice of judge models to ensure they are independent from the optimized model family. Empirical results from our experiments will be presented to demonstrate the constraint's effectiveness in balancing relevance and humor. revision: yes
Circularity Check
No significant circularity detected
Full rationale
The paper introduces a new task (culture-aware humorous captioning), defines a six-dimensional evaluation framework, describes a staged training procedure using judge-based GRPO plus a custom repulsion constraint, and reports empirical performance gains on the new metrics. No equations, parameter fits, or self-citations are shown to reduce the central performance claim to a definitional identity or tautology. The evaluation framework and constraint are presented as independent design choices whose effectiveness is tested experimentally rather than assumed by construction. This is the normal case of a paper proposing new infrastructure and measuring outcomes on it.
Axiom & Free-Parameter Ledger
axioms (1)
- [domain assumption] The six-dimensional evaluation framework accurately captures contextual fit, humor, and cultural appropriateness without systematic bias from LLM judges.
Reference graph
Works this paper leans on
-
[1]
Improved baselines with vi- sual instruction tuning
Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with vi- sual instruction tuning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26296–26306, 2024
2024
-
[2]
Osprey: Pixel understanding with visual instruction tuning
Yuqian Yuan, Wentong Li, Jian Liu, Dongqi Tang, Xinjie Luo, Chi Qin, Lei Zhang, and Jianke Zhu. Osprey: Pixel understanding with visual instruction tuning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 28202–28211, 2024
2024
-
[3]
Patch matters: Training-free 12 fine-grained image caption enhancement via lo- cal perception
Ruotian Peng, Haiying He, Yake Wei, Yandong Wen, and Di Hu. Patch matters: Training-free 12 fine-grained image caption enhancement via lo- cal perception. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3963–3973, 2025
2025
-
[4]
Benchmark- ing large vision-language models via directed scene graph for comprehensive image caption- ing
Fan Lu, Wei Wu, Kecheng Zheng, Shuailei Ma, Biao Gong, Jiawei Liu, Wei Zhai, Yang Cao, Yujun Shen, and Zheng-Jun Zha. Benchmark- ing large vision-language models via directed scene graph for comprehensive image caption- ing. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 19618– 19627, 2025
2025
-
[5]
Fleur: An explainable reference-free evaluation metric for image captioning using a large mul- timodal model
Yebin Lee, Imseong Park, and Myungjoo Kang. Fleur: An explainable reference-free evaluation metric for image captioning using a large mul- timodal model. InProceedings of the 62nd An- nual Meeting of the Association for Computa- tional Linguistics (Volume 1: Long Papers), pages 3732–3746, 2024
2024
-
[6]
Caparena: Benchmarking and analyzing de- tailed image captioning in the llm era
Kanzhi Cheng, Wenpo Song, Jiaxin Fan, Zheng Ma, Qiushi Sun, Fangzhi Xu, Chenyang Yan, Nuo Chen, Jianbing Zhang, and Jiajun Chen. Caparena: Benchmarking and analyzing de- tailed image captioning in the llm era. InFind- ings of the Association for Computational Lin- guistics: ACL 2025, pages 14077–14094, 2025
2025
-
[7]
Describe anything: Detailed localized image and video captioning
Long Lian, Yifan Ding, Yunhao Ge, Sifei Liu, Hanzi Mao, Boyi Li, Marco Pavone, Ming-Yu Liu, Trevor Darrell, Adam Yala, et al. Describe anything: Detailed localized image and video captioning. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 21766–21777, 2025
2025
-
[8]
Compcap: Improving multimodal large language models with compos- ite captions
Xiaohui Chen, Satya Narayan Shukla, Mah- moud Azab, Aashu Singh, Qifan Wang, David Yang, ShengYun Peng, Hanchao Yu, Shen Yan, Xuewen Zhang, et al. Compcap: Improving multimodal large language models with compos- ite captions. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 23582–23592, 2025
2025
-
[9]
Sc-captioner: Improving im- age captioning with self-correction by reinforce- ment learning
Lin Zhang, Xianfang Zeng, Kangcong Li, Gang Yu, and Tao Chen. Sc-captioner: Improving im- age captioning with self-correction by reinforce- ment learning. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 23145–23155, 2025
2025
-
[10]
Ode: Open- set evaluation of hallucinations in multimodal large language models
Yahan Tu, Rui Hu, and Jitao Sang. Ode: Open- set evaluation of hallucinations in multimodal large language models. InProceedings of the Computer Vision and Pattern Recognition Con- ference, pages 19836–19845, 2025
2025
-
[11]
Diffusion bridge: leveraging diffu- sion model to reduce the modality gap between text and vision for zero-shot image captioning
Jeong Ryong Lee, Yejee Shin, Geonhui Son, and Dosik Hwang. Diffusion bridge: leveraging diffu- sion model to reduce the modality gap between text and vision for zero-shot image captioning. InProceedings of the Computer Vision and Pat- tern Recognition Conference, pages 4050–4059, 2025
2025
-
[12]
Let’s think outside the box: Ex- ploring leap-of-thought in large language mod- els with creative humor generation
Shanshan Zhong, Zhongzhan Huang, Shanghua Gao, Wushao Wen, Liang Lin, Marinka Zitnik, and Pan Zhou. Let’s think outside the box: Ex- ploring leap-of-thought in large language mod- els with creative humor generation. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13246– 13257, 2024
2024
-
[13]
*** yes- but***: A high-quality annotated multimodal dataset for evaluating satire comprehension ca- pability of vision-language models
Abhilash Nandy, Yash Agarwal, Ashish Patwa, Millon Madhur Das, Aman Bansal, Ankit Raj, Pawan Goyal, and Niloy Ganguly. *** yes- but***: A high-quality annotated multimodal dataset for evaluating satire comprehension ca- pability of vision-language models. InPro- ceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 168...
2024
-
[14]
Under- standing figurative meaning through explainable visual entailment
Arkadiy Saakyan, Shreyas Kulkarni, Tuhin Chakrabarty, and Smaranda Muresan. Under- standing figurative meaning through explainable visual entailment. InProceedings of the 2025 Conference of the Nations of the Americas Chap- ter of the Association for Computational Lin- guistics: Human Language Technologies (Vol- ume 1: Long Papers), pages 1–23, 2025. 13
2025
-
[15]
Humor in pixels: Benchmarking large multimodal models understanding of online comics
Yuriel Ryan, Rui Yang Tan, Kenny Tsu Wei Choo, and Roy Ka-Wei Lee. Humor in pix- els: Benchmarking large multimodal models un- derstanding of online comics.arXiv preprint arXiv:2509.12248, 2025
-
[16]
Humordb: Can ai un- derstand graphical humor? InProceedings of the IEEE/CVF International Conference on Com- puter Vision, pages 604–613, 2025
Vedaant V Jain, Gabriel Kreiman, and Felipe dos Santos Alves Feitosa. Humordb: Can ai un- derstand graphical humor? InProceedings of the IEEE/CVF International Conference on Com- puter Vision, pages 604–613, 2025
2025
-
[17]
Bottlehumor: Self-informed humor ex- planation using the information bottleneck prin- ciple
EunJeong Hwang, Peter West, and Vered Shwartz. Bottlehumor: Self-informed humor ex- planation using the information bottleneck prin- ciple. InFindings of the Association for Com- putational Linguistics: ACL 2025, pages 22611– 22632, 2025
2025
-
[18]
Understanding the capabilities and lim- itations of large language models for cultural commonsense
Siqi Shen, Lajanugen Logeswaran, Moontae Lee, Honglak Lee, Soujanya Poria, and Rada Mihal- cea. Understanding the capabilities and lim- itations of large language models for cultural commonsense. InProceedings of the 2024 Con- ference of the North American Chapter of the Association for Computational Linguistics: Hu- man Language Technologies (Volume 1: L...
2024
-
[19]
Towards mea- suring and modeling “culture” in llms: A survey
Muhammad Farid Adilazuarda, Sagnik Mukher- jee, Pradhyumna Lavania, Siddhant Shivdutt Singh, Alham Fikri Aji, Jacki O’Neill, Ashutosh Modi, and Monojit Choudhury. Towards mea- suring and modeling “culture” in llms: A survey. InProceedings of the 2024 Conference on Em- pirical Methods in Natural Language Processing, pages 15763–15784, 2024
2024
-
[20]
Investigating cul- tural alignment of large language models
Badr AlKhamissi, Muhammad ElNokrashy, Mai Alkhamissi, and Mona Diab. Investigating cul- tural alignment of large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Vol- ume 1: Long Papers), pages 12404–12422, 2024
2024
-
[21]
Culturellm: Incorporating cultural differences into large lan- guage models.Advances in Neural Information Processing Systems, 37:84799–84838, 2024
Cheng Li, Mengzhuo Chen, Jindong Wang, Sunayana Sitaram, and Xing Xie. Culturellm: Incorporating cultural differences into large lan- guage models.Advances in Neural Information Processing Systems, 37:84799–84838, 2024
2024
-
[22]
Cul- turepark: Boosting cross-cultural understanding inlargelanguagemodels.Advances in Neural In- formation Processing Systems, 37:65183–65216, 2024
Cheng Li, Damien Teney, Linyi Yang, Qing- song Wen, Xing Xie, and Jindong Wang. Cul- turepark: Boosting cross-cultural understanding inlargelanguagemodels.Advances in Neural In- formation Processing Systems, 37:65183–65216, 2024
2024
-
[23]
Extrinsic evaluation of cultural competence in large lan- guage models
Shaily Bhatt and Fernando Diaz. Extrinsic evaluation of cultural competence in large lan- guage models. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 16055–16074, 2024
2024
-
[24]
Culturally aware and adapted nlp: A taxonomy and a survey of the state of the art.Transactions of the Association for Compu- tational Linguistics, 13:652–689, 2025
Chen Cecilia Liu, Iryna Gurevych, and Anna Korhonen. Culturally aware and adapted nlp: A taxonomy and a survey of the state of the art.Transactions of the Association for Compu- tational Linguistics, 13:652–689, 2025
2025
-
[25]
Dr- ishtikon: A multimodal multilingual benchmark for testing language models’ understanding on indian culture
Arijit Maji, Raghvendra Kumar, Akash Ghosh, Nemil Shah, Abhilekh Borah, Vanshika Shah, Nishant Mishra, Sriparna Saha, et al. Dr- ishtikon: A multimodal multilingual benchmark for testing language models’ understanding on indian culture. InProceedings of the 2025 Con- ference on Empirical Methods in Natural Lan- guage Processing, pages 1289–1313, 2025
2025
-
[26]
Break the checkbox: challenging closed- style evaluations of cultural alignment in llms
Mohsinul Kabir, Ajwad Abrar, and Sophia Ana- niadou. Break the checkbox: challenging closed- style evaluations of cultural alignment in llms. InProceedings of the 2025 Conference on Em- pirical Methods in Natural Language Processing, pages 24–51, 2025
2025
-
[27]
Incorpo- rating diverse perspectives in cultural align- ment: Survey of evaluation benchmarks through a three-dimensional framework
Meng-Chen Wu, Si-Chi Chin, Tess Wood, Ayush Goyal, and Narayanan Sadagopan. Incorpo- rating diverse perspectives in cultural align- ment: Survey of evaluation benchmarks through a three-dimensional framework. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 17037– 17072, 2025. 14
2025
-
[28]
Socialcc: Interactive evalua- tion for cultural competence in language agents
Jincenzi Wu, Jianxun Lian, Dingdong Wang, and Helen Meng. Socialcc: Interactive evalua- tion for cultural competence in language agents. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Vol- ume 1: Long Papers), pages 33242–33271, 2025
2025
-
[29]
Care: Multilingual human pref- erence learning for cultural awareness
Geyang Guo, Tarek Naous, Hiromi Wakaki, Yukiko Nishimura, Yuki Mitsufuji, Alan Ritter, and Wei Xu. Care: Multilingual human pref- erence learning for cultural awareness. InPro- ceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 32854–32883, 2025
2025
-
[30]
Oxfordtvg-hic: Can ma- chine make humorous captions from images? InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 20293– 20303, 2023
Runjia Li, Shuyang Sun, Mohamed Elhoseiny, and Philip Torr. Oxfordtvg-hic: Can ma- chine make humorous captions from images? InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 20293– 20303, 2023
2023
-
[31]
Humor in ai: Massive scale crowd-sourced preferences and benchmarks for cartoon captioning.Advances in Neural Infor- mation Processing Systems, 37:125264–125286, 2024
Jifan Zhang, Lalit Jain, Yang Guo, Jiayi Chen, Kuan L Zhou, Siddharth Suresh, Andrew Wa- genmaker, Scott Sievert, Timothy Rogers, Kevin Jamieson, et al. Humor in ai: Massive scale crowd-sourced preferences and benchmarks for cartoon captioning.Advances in Neural Infor- mation Processing Systems, 37:125264–125286, 2024
2024
-
[32]
Meme- cap: A dataset for captioning and interpreting memes
EunJeong Hwang and Vered Shwartz. Meme- cap: A dataset for captioning and interpreting memes. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Pro- cessing, pages 1433–1445, 2023
2023
-
[33]
Pun- memecn: A benchmark to explore vision- language models’ understanding of chinese pun memes
Zhijun Xu, Siyu Yuan, Yiqiao Zhang, Jingyu Sun, Tong Zheng, and Deqing Yang. Pun- memecn: A benchmark to explore vision- language models’ understanding of chinese pun memes. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Pro- cessing, pages 18705–18721, 2025
2025
-
[34]
Kuan Lok Zhou, Jiayi Chen, Siddharth Suresh, Reuben Narad, Timothy T Rogers, Lalit K Jain, Robert D Nowak, Bob Mankoff, and Ji- fan Zhang. Bridging the creativity understand- ing gap: Small-scale human alignment enables expert-level humor ranking in llms.arXiv preprint arXiv:2502.20356, 2025
-
[35]
Xmecap: Meme caption generation with sub-image adaptabil- ity
Yuyan Chen, Songzhou Yan, Zhihong Zhu, Zhixu Li, and Yanghua Xiao. Xmecap: Meme caption generation with sub-image adaptabil- ity. InProceedings of the 32nd ACM Interna- tional Conference on Multimedia, pages 3352– 3361, 2024
2024
-
[36]
Jiajun Zhang, Shijia Luo, Ruikang Zhang, and Qi Su. Humorchain: Theory-guided multi-stage reasoning for interpretable multimodal humor generation.arXiv preprint arXiv:2511.21732, 2025
-
[37]
Wenbo Shang, Yuxi Sun, Jing Ma, and Xin Huang. On the wings of imagination: Con- flicting script-based multi-role framework for humor caption generation.arXiv preprint arXiv:2602.06423, 2026
-
[38]
Learning combina- torial prompts for universal controllable image captioning.International Journal of Computer Vision, 133(1):129–150, 2025
Zhen Wang, Jun Xiao, Yueting Zhuang, Fei Gao, Jian Shao, and Long Chen. Learning combina- torial prompts for universal controllable image captioning.International Journal of Computer Vision, 133(1):129–150, 2025
2025
-
[39]
Controlcap: Controllable region-level caption- ing
Yuzhong Zhao, Yue Liu, Zonghao Guo, Weijia Wu, Chen Gong, Qixiang Ye, and Fang Wan. Controlcap: Controllable region-level caption- ing. InEuropean Conference on Computer Vi- sion, pages 21–38. Springer, 2024
2024
-
[40]
Mcoca: To- wards fine-grained multimodal control in image captioning.Pattern Recognition, page 112381, 2025
Shanshan Zhao, Teng Wang, Jinrui Zhang, Xi- angchen Wang, and Feng Zheng. Mcoca: To- wards fine-grained multimodal control in image captioning.Pattern Recognition, page 112381, 2025
2025
-
[41]
Con- trollable contextualized image captioning: Di- recting the visual narrative through user-defined highlights
Shunqi Mao, Chaoyi Zhang, Hang Su, Hwanjun Song, Igor Shalyminov, and Weidong Cai. Con- trollable contextualized image captioning: Di- recting the visual narrative through user-defined highlights. InEuropean Conference on Com- puter Vision, pages 464–481. Springer, 2024. 15
2024
-
[42]
Yeongtak Oh, Dohyun Chung, Juhyeon Shin, SanghaPark, JohanBarthelemy, JisooMok, and Sungroh Yoon. Repic: Reinforced post-training for personalizing multi-modal language models. arXiv preprint arXiv:2506.18369, 2025
-
[43]
Visual captioning at will: Describing images and videos guided by a few stylized sentences
Dingyi Yang, Hongyu Chen, Xinglin Hou, Tiezheng Ge, Yuning Jiang, and Qin Jin. Visual captioning at will: Describing images and videos guided by a few stylized sentences. InProceed- ings of the 31st ACM international conference on multimedia, pages 5705–5715, 2023
2023
-
[44]
Cap- tionsmiths: Flexibly controlling language pat- tern in image captioning
Kuniaki Saito, Donghyun Kim, Kwanyong Park, Atsushi Hashimoto, and Yoshitaka Ushiku. Cap- tionsmiths: Flexibly controlling language pat- tern in image captioning. InProceedings of the IEEE/CVF International Conference on Com- puter Vision, pages 19872–19881, 2025
2025
-
[45]
Yiming Ren, Zhiqiang Lin, Yu Li, Gao Meng, Weiyun Wang, Junjie Wang, Zicheng Lin, Jifeng Dai, Yujiu Yang, Wenhai Wang, et al. Any- cap project: A unified framework, dataset, and benchmark for controllable omni-modal caption- ing.arXiv preprint arXiv:2507.12841, 2025
-
[46]
Culturallearning-basedcultureadap- tation of language models
Chen Cecilia Liu, Anna Korhonen, and Iryna Gurevych. Culturallearning-basedcultureadap- tation of language models. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pa- pers), pages 3114–3134, 2025
2025
-
[47]
Mmbench: Is your multi-modal model an all- around player? InEuropean conference on com- puter vision, pages 216–233
Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all- around player? InEuropean conference on com- puter vision, pages 216–233. Springer, 2024
2024
-
[48]
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
Bohao Li, Rui Wang, Guangzhi Wang, Yuy- ing Ge, Yixiao Ge, and Ying Shan. Seed- bench: Benchmarking multimodal llms with generative comprehension.arXiv preprint arXiv:2307.16125, 2023
work page internal anchor Pith review arXiv 2023
-
[49]
Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for ex- pert agi
Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for ex- pert agi. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recogni- tion, pages 9556–9567, 2024
2024
-
[50]
Judgelm: Fine-tuned large language models are scalable judges,
Lianghui Zhu, Xinggang Wang, and Xinlong Wang. Judgelm: Fine-tuned large language models are scalable judges.arXiv preprint arXiv:2310.17631, 2023
-
[51]
Llava-critic: Learning to evalu- ate multimodal models
Tianyi Xiong, Xiyao Wang, Dong Guo, Qinghao Ye, Haoqi Fan, Quanquan Gu, Heng Huang, and Chunyuan Li. Llava-critic: Learning to evalu- ate multimodal models. InProceedings of the Computer Vision and Pattern Recognition Con- ference, pages 13618–13628, 2025
2025
-
[52]
From generation to judgment: Opportunities and challenges of llm- as-a-judge
Dawei Li, Bohan Jiang, Liangjie Huang, Alimo- hammad Beigi, Chengshuai Zhao, Zhen Tan, Amrita Bhattacharjee, Yuxuan Jiang, Canyu Chen, Tianhao Wu, et al. From generation to judgment: Opportunities and challenges of llm- as-a-judge. InProceedings of the 2025 Confer- ence on Empirical Methods in Natural Language Processing, pages 2757–2791, 2025
2025
-
[53]
Mllm-as-a-judge: Assessing multimodal llm-as-a-judge with vision-language benchmark
Dongping Chen, Ruoxi Chen, Shilin Zhang, Yaochen Wang, Yinuo Liu, Huichi Zhou, Qi- hui Zhang, Yao Wan, Pan Zhou, and Lichao Sun. Mllm-as-a-judge: Assessing multimodal llm-as-a-judge with vision-language benchmark. InForty-first International Conference on Ma- chine Learning, 2024
2024
-
[54]
Judging the judges: Evaluating alignment and vulnera- bilities in llms-as-judges
Aman Singh Thakur, Kartik Choudhary, Venkat Srinik Ramayapally, Sankaran Vaidyanathan, and Dieuwke Hupkes. Judging the judges: Evaluating alignment and vulnera- bilities in llms-as-judges. InProceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM2), pages 404–430, 2025
2025
[55] Shu Pu, Yaochen Wang, Dongping Chen, Yuhang Chen, Guohao Wang, Qi Qin, Zhongyi Zhang, Zhiyuan Zhang, Zetong Zhou, Shuang Gong, et al. Judge anything: MLLM as a judge across any modality. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2, pages 5742–5753, 2025.
[56] Qiyuan Zhang, Yufei Wang, Yuxin Jiang, Liangyou Li, Chuhan Wu, Yasheng Wang, Xin Jiang, Lifeng Shang, Ruiming Tang, Fuyuan Lyu, et al. Crowd comparative reasoning: Unlocking comprehensive evaluations for LLM-as-a-judge. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5059–5074, 2025.
[57] Yixin Liu, Pengfei Liu, and Arman Cohan. On evaluating LLM alignment by evaluating LLMs as judges. arXiv preprint arXiv:2511.20604, 2025.
[58] Jiawei Guo, Tianyu Zheng, Yizhi Li, Yuelin Bai, Bo Li, Yubo Wang, King Zhu, Graham Neubig, Wenhu Chen, and Xiang Yue. MAmmoTH-VL: Eliciting multimodal reasoning with instruction tuning at scale. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13869–13920, 2025.
[59] Tianyu Yu, Yuan Yao, Haoye Zhang, Taiwen He, Yifeng Han, Ganqu Cui, Jinyi Hu, Zhiyuan Liu, Hai-Tao Zheng, Maosong Sun, et al. RLHF-V: Towards trustworthy MLLMs via behavior alignment from fine-grained correctional human feedback. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13807–13816, 2024.
[60] Shengzhi Li, Rongyu Lin, and Shichao Pei. Multi-modal preference alignment remedies degradation of visual instruction tuning on language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14188–14200, 2024.
[61] Yi-Fan Zhang, Tao Yu, Haochen Tian, Chaoyou Fu, Peiyan Li, Jianshu Zeng, Wulin Xie, Yang Shi, Huanyu Zhang, Junkang Wu, et al. MM-RLHF: The next step forward in multimodal LLM alignment. arXiv preprint arXiv:2502.10391, 2025.
[62] Jinhe Bi, Yujun Wang, Haokun Chen, Xun Xiao, Artur Hecker, Volker Tresp, and Yunpu Ma. LLaVA Steering: Visual instruction tuning with 500x fewer parameters through modality linear representation-steering. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15230–15250, 2025.
[63] Xintong Li, Junda Wu, Tong Yu, Rui Wang, Yu Wang, Xiang Chen, Jiuxiang Gu, Lina Yao, Julian McAuley, and Jingbo Shang. CoMMIT: Coordinated multimodal instruction tuning. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 11533–11547, 2025.
[64] Ziang Yan, Zhilin Li, Yinan He, Chenting Wang, Kunchang Li, Xinhao Li, Xiangyu Zeng, Zilei Wang, Yali Wang, Yu Qiao, et al. Task preference optimization: Improving multimodal large language models with vision task alignment. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 29880–29892, 2025.
[65] Shuo Xing, Peiran Li, Yuping Wang, Ruizheng Bai, Yueqi Wang, Chan-Wei Hu, Chengxuan Qian, Huaxiu Yao, and Zhengzhong Tu. Re-Align: Aligning vision language models via retrieval-augmented direct preference optimization. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 2379–2397, 2025.
[66] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025.
[67] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.
[68] Google. Gemini 3 flash preview. https://ai.google.dev/gemini-api/docs/models/gemini-3-flash-preview, 2026. Official model documentation, accessed 2026-04-02.
[69] Anthropic. Claude Sonnet 4.5 system card. https://www.anthropic.com/claude-sonnet-4-5-system-card, 2025. Official system card, accessed 2026-04-02.
[70] Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025.
[71] Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. LLaVA-OneVision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024.
[72] Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. MiniCPM-V: A GPT-4V level MLLM on your phone. arXiv preprint arXiv:2408.01800, 2024.
[73] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
[74] Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Dan Zhang, Diego Rojas, Guanyu Feng, Hanlin Zhao, et al. ChatGLM: A family of large language models from GLM-130B to GLM-4 All Tools. arXiv preprint arXiv:2406.12793, 2024.
[75] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL technical report. URL https://api.semanticscholar.org/CorpusID:276449796.
[77]
Cogvlm2: Visual language models for image and video un- derstanding
Wenyi Hong, Weihan Wang, Ming Ding, Wen- meng Yu, Qingsong Lv, Yan Wang, Yean Cheng, Shiyu Huang, Junhui Ji, Zhao Xue, et al. Cogvlm2: Visual language models for image and video understanding.arXiv preprint arXiv:2408.16500, 2024. 18