MM-MoralBench: A MultiModal Moral Evaluation Benchmark for Large Vision-Language Models
Pith reviewed 2026-05-23 07:11 UTC · model grok-4.3
The pith
Large vision-language models diverge significantly from human moral judgments in multimodal scenarios.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MM-MoralBench constructs multimodal scenarios by pairing synthesized visual contexts with character dialogues to simulate dilemmas where visual and linguistic information interact. Grounded in Moral Foundations Theory, the benchmark measures LVLMs across moral judgment, classification, and response tasks on six foundations. Evaluations of more than 20 models reveal pronounced moral alignment bias that deviates from aggregated human responses, with general scaling and structural changes producing diminishing returns while thinking paradigms can induce overthinking-induced failures.
What carries the argument
MM-MoralBench benchmark of synthesized multimodal scenarios that combine visual contexts with character dialogues to assess alignment on six moral foundations via judgment, classification, and response tasks.
If this is right
- Models exhibit pronounced moral alignment bias diverging significantly from human consensus.
- General scaling or structural improvements yield diminishing returns in moral alignment.
- Thinking paradigms may trigger overthinking-induced failures in moral contexts.
- Targeted moral alignment strategies are required rather than relying on general capability gains.
Where Pith is reading between the lines
- Deployment of LVLMs in roles involving ethical decisions could produce outcomes that conflict with public values unless alignment is addressed directly.
- The benchmark might be adapted to evaluate models on sequential visual changes, such as video clips, to test dynamic moral reasoning.
- Training methods could incorporate explicit constraints on reasoning depth when handling moral queries to reduce overthinking effects.
Load-bearing premise
The synthesized multimodal scenarios created by combining visual contexts with character dialogues validly capture real-world moral dilemmas in which visual and linguistic information interact dynamically, and aggregated human responses on these scenarios constitute the appropriate target for model alignment.
What would settle it
Demonstrating that multiple LVLMs produce moral judgments matching aggregated human responses on the benchmark scenarios, or that larger models show steadily higher alignment scores without targeted training, would falsify the central findings.
Figures
read the original abstract
The rapid integration of Large Vision-Language Models (LVLMs) into critical domains necessitates comprehensive moral evaluation to ensure their alignment with human values. While extensive research has addressed moral evaluation in LLMs, text-centric assessments cannot adequately capture the complex contextual nuances and ambiguities introduced by visual modalities. To bridge this gap, we introduce MM-MoralBench, a multimodal moral evaluation benchmark grounded in Moral Foundations Theory. We construct unique multimodal scenarios by combining synthesized visual contexts with character dialogues to simulate real-world dilemmas where visual and linguistic information interact dynamically. Our benchmark assesses models across six moral foundations through moral judgment, classification, and response tasks. Extensive evaluations of over 20 LVLMs reveal that models exhibit pronounced moral alignment bias, diverging significantly from human consensus. Furthermore, our analysis indicates that general scaling or structural improvements yield diminishing returns in moral alignment, and thinking paradigm may trigger overthinking-induced failures in moral contexts, highlighting the necessity for targeted moral alignment strategies. Our benchmark is publicly available.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MM-MoralBench, a multimodal benchmark grounded in Moral Foundations Theory that constructs scenarios by pairing synthesized visual contexts with character dialogues. It evaluates over 20 LVLMs on moral judgment, classification, and response tasks across six foundations, claiming that models exhibit pronounced divergence from human consensus, that scaling and structural improvements yield diminishing returns in moral alignment, and that thinking paradigms can induce overthinking failures.
Significance. A validated multimodal moral benchmark would address a clear gap left by text-only evaluations and could usefully document current LVLMs' limitations in handling visual-linguistic moral interactions. The reported scaling and paradigm findings would be of interest if the benchmark scenarios are shown to be faithful proxies rather than artifacts of synthesis.
major comments (2)
- [Benchmark construction (§3)] Benchmark construction (abstract and §3): The paper states that scenarios are 'synthesized' to 'simulate real-world dilemmas' but supplies no external validation—such as inter-rater reliability scores with domain experts, comparison against naturalistic image-dialogue corpora, or ablation studies on synthesis artifacts. Without this, the central claims of model-human divergence and scaling limits rest on an untested assumption that the constructed items are faithful proxies.
- [Evaluation protocol (§4)] Human consensus and evaluation protocol (abstract and §4): No information is given on how human responses were collected (participant pool, exclusion criteria, statistical tests for consensus, or agreement metrics). This absence directly undermines the reported 'pronounced moral alignment bias' and the conclusion that general scaling yields diminishing returns.
minor comments (2)
- [Abstract / Introduction] The abstract and introduction would benefit from an explicit statement of the number of scenarios per foundation and the exact task formats used for each of the three evaluation modes.
- [Results tables/figures] Figure captions and tables should include the precise number of models evaluated per category (e.g., open-source vs. closed) to allow readers to assess coverage.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which identify areas where additional methodological detail will improve the manuscript. We address each major comment below and will incorporate the requested information in the revised version.
read point-by-point responses
-
Referee: Benchmark construction (§3): The paper states that scenarios are 'synthesized' to 'simulate real-world dilemmas' but supplies no external validation—such as inter-rater reliability scores with domain experts, comparison against naturalistic image-dialogue corpora, or ablation studies on synthesis artifacts. Without this, the central claims of model-human divergence and scaling limits rest on an untested assumption that the constructed items are faithful proxies.
Authors: We agree that explicit external validation of the synthesized scenarios would strengthen the benchmark. The current §3 describes the construction process grounded in Moral Foundations Theory, but does not report inter-rater checks or comparisons to naturalistic corpora. In the revision we will add (i) a description of expert review of a subset of scenarios, (ii) inter-rater reliability statistics, and (iii) a brief comparison against existing text-only moral dilemma collections to address potential synthesis artifacts. revision: yes
-
Referee: Human consensus and evaluation protocol (abstract and §4): No information is given on how human responses were collected (participant pool, exclusion criteria, statistical tests for consensus, or agreement metrics). This absence directly undermines the reported 'pronounced moral alignment bias' and the conclusion that general scaling yields diminishing returns.
Authors: We acknowledge that §4 currently omits the requested details on human data collection. The manuscript reports aggregate human consensus but does not describe recruitment, exclusion rules, or agreement metrics. In the revision we will expand §4 to include participant demographics, exclusion criteria, the statistical procedure used to establish consensus, and agreement metrics (e.g., Fleiss’ kappa). This will provide a transparent basis for the reported model-human divergence. revision: yes
Circularity Check
No circularity detected; benchmark construction and external evaluations are independent
full rationale
The paper introduces MM-MoralBench as a newly constructed benchmark grounded in external Moral Foundations Theory, with scenarios synthesized from visual contexts and dialogues, then evaluated on over 20 pre-existing LVLMs against aggregated human responses. No self-definitional reductions, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The derivation chain consists of independent data synthesis followed by external model testing, remaining self-contained without any step reducing to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Moral Foundations Theory provides a suitable and sufficient framework for measuring moral alignment in vision-language models
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We construct unique multimodal scenarios by combining synthesized visual contexts with character dialogues... grounded in Moral Foundations Theory... moral judgement, moral classification, and moral response tasks.
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Extensive evaluations of over 20 LVLMs reveal that models exhibit pronounced moral alignment bias...
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
A study of gener- ative large language model for medical research and health- care
Cheng Peng, Xi Yang, Aokun Chen, Kaleb E Smith, Nima PourNejatian, Anthony B Costa, Cheryl Martin, Mona G Flores, Ying Zhang, Tanja Magoc, et al. A study of gener- ative large language model for medical research and health- care. NPJ digital medicine, 6(1):210, 2023. 1
work page 2023
-
[2]
Large language models in law: A survey
Jinqi Lai, Wensheng Gan, Jiayang Wu, Zhenlian Qi, and S Yu Philip. Large language models in law: A survey. AI Open, 2024
work page 2024
-
[3]
Large language models in finance: A survey
Yinheng Li, Shaofei Wang, Han Ding, and Hang Chen. Large language models in finance: A survey. InProceedings of the fourth ACM international conference on AI in finance, pages 374–382, 2023. 1
work page 2023
-
[4]
Large language model alignment: A survey
Tianhao Shen, Renren Jin, Yufei Huang, Chuang Liu, Wei- long Dong, Zishan Guo, Xinwei Wu, Yan Liu, and Deyi Xiong. Large language model alignment: A survey. arXiv preprint arXiv:2309.15025, 2023. 1
-
[5]
Intuitive ethics: How innately prepared intuitions generate culturally variable virtues
Jonathan Haidt and Craig Joseph. Intuitive ethics: How innately prepared intuitions generate culturally variable virtues. Daedalus, 133(4):55–66, 2004. 1, 2
work page 2004
-
[6]
Moral foun- dations theory: The pragmatic validity of moral pluralism
Jesse Graham, Jonathan Haidt, Sena Koleva, Matt Motyl, Ravi Iyer, Sean P Wojcik, and Peter H Ditto. Moral foun- dations theory: The pragmatic validity of moral pluralism. In Advances in experimental social psychology , volume 47, pages 55–130. Elsevier, 2013. 1, 2
work page 2013
-
[7]
Liberals and conservatives rely on different sets of moral foundations
Jesse Graham, Jonathan Haidt, and Brian A Nosek. Liberals and conservatives rely on different sets of moral foundations. Journal of personality and social psychology , 96(5):1029,
-
[8]
Scott Clifford, Vijeth Iyengar, Roberto Cabeza, and Walter Sinnott-Armstrong. Moral foundations vignettes: A stan- dardized stimulus database of scenarios based on moral foun- dations theory. Behavior research methods , 47(4):1178– 1198, 2015. 1, 2, 3, 7
work page 2015
-
[9]
Unpack- ing the ethical value alignment in big models
Xiaoyuan Yi, Jing Yao, Xiting Wang, and Xing Xie. Unpack- ing the ethical value alignment in big models. arXiv preprint arXiv:2310.17551, 2023. 1
-
[10]
Evaluating the moral beliefs encoded in llms
Nino Scherrer, Claudia Shi, Amir Feder, and David Blei. Evaluating the moral beliefs encoded in llms. Advances in Neural Information Processing Systems, 36, 2024. 1, 3, 5, 6, 7
work page 2024
-
[11]
Moralbench: Moral evaluation of llms
Jianchao Ji, Yutong Chen, Mingyu Jin, Wujiang Xu, Wenyue Hua, and Yongfeng Zhang. Moralbench: Moral evaluation of llms. arXiv preprint arXiv:2406.04428, 2024. 1, 3, 5
-
[12]
Aligning ai with shared human values
Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, and Jacob Steinhardt. Aligning ai with shared human values. InInternational Conference on Learning Representations, 2021. 1, 3
work page 2021
-
[13]
Scruples: A corpus of community ethical judgments on 32,000 real-life anecdotes
Nicholas Lourie, Ronan Le Bras, and Yejin Choi. Scruples: A corpus of community ethical judgments on 32,000 real-life anecdotes. In Proceedings of the AAAI Conference on Arti- ficial Intelligence, volume 35, pages 13470–13479, 2021. 3, 5
work page 2021
-
[14]
CMoralEval: A Moral Evalua- tion Benchmark for Chinese Large Language Models
Linhao Yu, Yongqi Leng, Yufei Huang, Shang Wu, Haixin Liu, Xinmeng Ji, Jiahui Zhao, Jinwang Song, Tingting Cui, Xiaoqing Cheng, et al. CMoralEval: A Moral Evalua- tion Benchmark for Chinese Large Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2024, Online or Conference Location, 2024. Asso- ciation for Computational L...
work page 2024
-
[15]
When to make exceptions: Exploring language models as accounts of human moral judgment
Zhijing Jin, Sydney Levine, Fernando Gonzalez Adauto, Ojasv Kamal, Maarten Sap, Mrinmaya Sachan, Rada Mi- halcea, Josh Tenenbaum, and Bernhard Sch ¨olkopf. When to make exceptions: Exploring language models as accounts of human moral judgment. Advances in neural information processing systems, 35:28458–28473, 2022. 3, 5
work page 2022
-
[16]
Dailydilemmas: Revealing value preferences of llms with quandaries of daily life
Yu Ying Chiu, Liwei Jiang, and Yejin Choi. Dailydilemmas: Revealing value preferences of llms with quandaries of daily life. arXiv preprint arXiv:2410.02683, 2024. 1, 3, 5
-
[17]
OpenAI. Hello gpt-4o. https://openai.com/index/hello-gpt- 4o/, 2024. 2, 3, 4, 6, 1
work page 2024
-
[18]
Scaling recti- fied flow transformers for high-resolution image synthesis
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling recti- fied flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learn- ing, 2024. 2, 4
work page 2024
-
[19]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023. 3, 6, 4
work page 2023
-
[20]
CogVLM: Visual Expert for Pretrained Language Models
Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, et al. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079, 2023. 3, 6, 4
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[21]
Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, et al. Internlm-xcomposer2: Mastering free-form text-image composition and compre- hension in vision-language large model. arXiv preprint arXiv:2401.16420, 2024. 3, 6, 4
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[22]
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800, 2024. 3, 6, 4
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[23]
mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration
Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Anwen Hu, Haowei Liu, Qi Qian, Ji Zhang, and Fei Huang. mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 13040–13051, 2024. 3, 6, 4
work page 2024
-
[24]
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadal- lah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219, 2024. 3, 6, 4
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[25]
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for un- derstanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023. 3, 6, 4
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[26]
Yi: Open Foundation Models by 01.AI
Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, et al. Yi: Open foundation models by 01. ai. arXiv preprint arXiv:2403.04652, 2024. 3, 6, 4 9
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[27]
Gemini: A Family of Highly Capable Multimodal Models
Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023. 3, 6, 4
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[28]
Large language models are not robust multi- ple choice selectors
Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, and Minlie Huang. Large language models are not robust multi- ple choice selectors. In International Conference on Learn- ing Representations, 2024. 6
work page 2024
-
[29]
Cheng Xu, Xiaofeng Hou, Jiacheng Liu, Chao Li, Tian- hao Huang, Xiaozhi Zhu, Mo Niu, Lingyu Sun, Peng Tang, Tongqiao Xu, Kwang-Ting Cheng, and Minyi Guo. Mm- bench: Benchmarking end-to-end multi-modal dnns and un- derstanding their hardware-software implications. In 2023 IEEE International Symposium on Workload Characteriza- tion (IISWC), 2023. 6
work page 2023
-
[30]
Equa- tion of state calculations by fast computing machines
Nicholas Metropolis, Arianna W Rosenbluth, Marshall N Rosenbluth, Augusta H Teller, and Edward Teller. Equa- tion of state calculations by fast computing machines. The journal of chemical physics, 21(6):1087–1092, 1953. 6 10 M³oralBench: A MultiModal Moral Benchmark for LVLMs Supplementary Material
work page 1953
-
[31]
Moral Foundations Theory 7.1. Moral Foundations We provide a more detailed explanation of the six moral foundations [5, 6] as follows: • Care/Harm: This foundation arises from the evolution- ary need to care for vulnerable offspring. It is triggered by visual and auditory signs of suffering, distress, or neediness, primarily from one’s own children but al...
-
[32]
Image Generation Details 8.1. Image Generation Prompts In image generation process, we utilize GPT-4o [17] to ex- pand the dataset by imitating similar scenarios from MFVs and converting these brief descriptions into detailed image prompts and main character dialogues. Additionally, to gen- erate two contrasting versions of the scenarios, a morally wrong ...
-
[33]
More Evaluation Examples More examples of M³oralBench evaluation are shown in Figure 10, Figure 11 and Figure 12. Foundation CLIP Similarity Overall 0.765 Care 0.771 Fairness 0.774 Loyalty 0.751 Authority 0.779 Sanctity 0.734 Liberty 0.777 Table 4. Average CLIP similarity scores between scenario image captions and original scenario texts for each moral foundation
-
[34]
Detailed Evaluation Results We provide the detailed evaluation results on M³oralBench across 3 moral tasks and 6 moral foundations in Table 5, Table 6 and Table 7. 2 Care Violation Instruction: How do you consider the behavior in the image from the perspective of morality? (A) Not morally wrong (B) Morally wrong Reference: (B) mPLUG-Owl2: A Phi-3-Vision: ...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.