arxiv: 2412.20718 · v2 · submitted 2024-12-30 · 💻 cs.CV · cs.AI

MM-MoralBench: A MultiModal Moral Evaluation Benchmark for Large Vision-Language Models

Pith reviewed 2026-05-23 07:11 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords MM-MoralBench · multimodal moral evaluation · large vision-language models · moral alignment bias · Moral Foundations Theory · overthinking failures · moral judgment tasks

The pith

Large vision-language models diverge significantly from human moral judgments in multimodal scenarios.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MM-MoralBench to test moral alignment in large vision-language models using scenarios that pair synthesized images with character dialogues. It evaluates models on six moral foundations through judgment, classification, and response tasks. Testing over 20 models shows consistent divergence from human consensus on these dilemmas. The results indicate that increasing model size or changing architecture brings little improvement in alignment, and that step-by-step reasoning can lead to worse moral outputs.

Core claim

MM-MoralBench constructs multimodal scenarios by pairing synthesized visual contexts with character dialogues to simulate dilemmas where visual and linguistic information interact. Grounded in Moral Foundations Theory, the benchmark measures LVLMs across moral judgment, classification, and response tasks on six foundations. Evaluations of more than 20 models reveal a pronounced moral alignment bias that deviates from aggregated human responses. General scaling and structural changes produce diminishing returns, while thinking paradigms can trigger overthinking-induced failures.

What carries the argument

MM-MoralBench benchmark of synthesized multimodal scenarios that combine visual contexts with character dialogues to assess alignment on six moral foundations via judgment, classification, and response tasks.
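
To make the shape of such a benchmark concrete, here is a minimal sketch of one item and a per-foundation accuracy score. The field names, the option-letter reference format, and the `accuracy_by_foundation` helper are assumptions for illustration, not the paper's released schema or scoring code.

```python
from collections import defaultdict
from dataclasses import dataclass

# The six foundations listed in the paper's supplementary table.
FOUNDATIONS = ("care", "fairness", "loyalty", "authority", "sanctity", "liberty")


@dataclass(frozen=True)
class Item:
    """One hypothetical benchmark item; every field name is illustrative."""
    image_path: str   # synthesized visual context
    dialogue: str     # character dialogue paired with the image
    foundation: str   # one of FOUNDATIONS
    task: str         # "judgment", "classification", or "response"
    reference: str    # reference option letter, e.g. "B"


def accuracy_by_foundation(items, predictions):
    """Return {foundation: share of items whose prediction equals the reference}."""
    hits, totals = defaultdict(int), defaultdict(int)
    for item, pred in zip(items, predictions):
        totals[item.foundation] += 1
        # Normalize whitespace and case before comparing option letters.
        hits[item.foundation] += int(pred.strip().upper() == item.reference)
    return {f: hits[f] / totals[f] for f in totals}
```

Scoring per foundation rather than overall keeps a model that does well on, say, Care from masking weak performance on Sanctity.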

If this is right

  • Models exhibit pronounced moral alignment bias diverging significantly from human consensus.
  • General scaling or structural improvements yield diminishing returns in moral alignment.
  • Thinking paradigms may trigger overthinking-induced failures in moral contexts.
  • Targeted moral alignment strategies are required rather than relying on general capability gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Deployment of LVLMs in roles involving ethical decisions could produce outcomes that conflict with public values unless alignment is addressed directly.
  • The benchmark might be adapted to evaluate models on sequential visual changes, such as video clips, to test dynamic moral reasoning.
  • Training methods could incorporate explicit constraints on reasoning depth when handling moral queries to reduce overthinking effects.

Load-bearing premise

The synthesized multimodal scenarios created by combining visual contexts with character dialogues validly capture real-world moral dilemmas in which visual and linguistic information interact dynamically, and aggregated human responses on these scenarios constitute the appropriate target for model alignment.
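
One way to read "aggregated human responses" is a majority vote per scenario. The sketch below assumes that reading; the tie rule, the `min_share` threshold, and both function names are hypothetical, since the summary above does not describe the paper's actual aggregation procedure.

```python
from collections import Counter


def consensus(labels):
    """Return (majority label, share of raters choosing it).

    If the top two labels are tied, the label is None.
    """
    counts = Counter(labels).most_common()
    top_label, top_count = counts[0]
    tied = len(counts) > 1 and counts[1][1] == top_count
    return (None if tied else top_label), top_count / len(labels)


def agreement_with_consensus(human_labels_per_item, model_labels, min_share=0.0):
    """Share of items with a clear human consensus on which the model matches it.

    Items with a tie, or with a majority share below min_share, are skipped.
    """
    matched = scored = 0
    for labels, model in zip(human_labels_per_item, model_labels):
        label, share = consensus(labels)
        if label is None or share < min_share:
            continue
        scored += 1
        matched += int(model == label)
    return matched / scored if scored else float("nan")
```

Raising `min_share` restricts scoring to scenarios where humans agree strongly, which is one way to probe whether reported divergence is driven by contested items.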

What would settle it

Demonstrating that multiple LVLMs produce moral judgments matching aggregated human responses on the benchmark scenarios, or that larger models show steadily higher alignment scores without targeted training, would falsify the central findings.
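
The second test could be run as a rank correlation between model size and alignment score. The sketch below implements Spearman's coefficient in plain Python; using it for this check is an assumption of this note, not a procedure taken from the paper, and any inputs would need to come from the benchmark's own result tables.

```python
import math


def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the two rank vectors.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)
```

A coefficient near 1 across model families would count against the diminishing-returns finding. A value near 0 would be consistent with it, though with only around 20 models the estimate would be noisy.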

Figures

Figures reproduced from arXiv: 2412.20718 by Bei Yan, Jie Zhang, Shiguang Shan, Xilin Chen, Zhiyuan Chen.

Figure 1: An overview of the entire pipeline for M³oralBench construction. We use GPT-4o to expand the Moral Foundations Vignettes, ...
Figure 2: Examples of moral scenarios in MFVs violated different ...
Figure 3: An example of scenario expansion from the MFVs.
Figure 5: Examples of M³oralBench evaluation for different moral tasks. Moral judgement requires the model to assess whether the ...
Figure 6: Comparison of the top-5 LVLM performance across 6 moral foundations on M³oralBench. A larger area indicates better ...
Figure 7: Visualization of the correlations between the moral eval...
Figure 8: The prompt used to expand the scenarios in MFVs.
Figure 9: The prompt used to transform the moral violation scenarios into image descriptions and main character dialogues.
Figure 10: Examples of moral judgement evaluation in M³oralBench.
Figure 11: Examples of moral classification evaluation in M³oralBench.
Figure 12: Examples of moral response evaluation in M³oralBench.
Original abstract

The rapid integration of Large Vision-Language Models (LVLMs) into critical domains necessitates comprehensive moral evaluation to ensure their alignment with human values. While extensive research has addressed moral evaluation in LLMs, text-centric assessments cannot adequately capture the complex contextual nuances and ambiguities introduced by visual modalities. To bridge this gap, we introduce MM-MoralBench, a multimodal moral evaluation benchmark grounded in Moral Foundations Theory. We construct unique multimodal scenarios by combining synthesized visual contexts with character dialogues to simulate real-world dilemmas where visual and linguistic information interact dynamically. Our benchmark assesses models across six moral foundations through moral judgment, classification, and response tasks. Extensive evaluations of over 20 LVLMs reveal that models exhibit pronounced moral alignment bias, diverging significantly from human consensus. Furthermore, our analysis indicates that general scaling or structural improvements yield diminishing returns in moral alignment, and thinking paradigm may trigger overthinking-induced failures in moral contexts, highlighting the necessity for targeted moral alignment strategies. Our benchmark is publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MM-MoralBench, a multimodal benchmark grounded in Moral Foundations Theory that constructs scenarios by pairing synthesized visual contexts with character dialogues. It evaluates over 20 LVLMs on moral judgment, classification, and response tasks across six foundations, claiming that models exhibit pronounced divergence from human consensus, that scaling and structural improvements yield diminishing returns in moral alignment, and that thinking paradigms can induce overthinking failures.

Significance. A validated multimodal moral benchmark would address a clear gap left by text-only evaluations and could usefully document current LVLMs' limitations in handling visual-linguistic moral interactions. The reported scaling and paradigm findings would be of interest if the benchmark scenarios are shown to be faithful proxies rather than artifacts of synthesis.

major comments (2)
  1. [Benchmark construction (§3)] Benchmark construction (abstract and §3): The paper states that scenarios are 'synthesized' to 'simulate real-world dilemmas' but supplies no external validation—such as inter-rater reliability scores with domain experts, comparison against naturalistic image-dialogue corpora, or ablation studies on synthesis artifacts. Without this, the central claims of model-human divergence and scaling limits rest on an untested assumption that the constructed items are faithful proxies.
  2. [Evaluation protocol (§4)] Human consensus and evaluation protocol (abstract and §4): No information is given on how human responses were collected (participant pool, exclusion criteria, statistical tests for consensus, or agreement metrics). This absence directly undermines the reported 'pronounced moral alignment bias' and the conclusion that general scaling yields diminishing returns.
minor comments (2)
  1. [Abstract / Introduction] The abstract and introduction would benefit from an explicit statement of the number of scenarios per foundation and the exact task formats used for each of the three evaluation modes.
  2. [Results tables/figures] Figure captions and tables should include the precise number of models evaluated per category (e.g., open-source vs. closed) to allow readers to assess coverage.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which identify areas where additional methodological detail will improve the manuscript. We address each major comment below and will incorporate the requested information in the revised version.

  1. Referee: Benchmark construction (§3): The paper states that scenarios are 'synthesized' to 'simulate real-world dilemmas' but supplies no external validation—such as inter-rater reliability scores with domain experts, comparison against naturalistic image-dialogue corpora, or ablation studies on synthesis artifacts. Without this, the central claims of model-human divergence and scaling limits rest on an untested assumption that the constructed items are faithful proxies.

    Authors: We agree that explicit external validation of the synthesized scenarios would strengthen the benchmark. The current §3 describes the construction process grounded in Moral Foundations Theory, but does not report inter-rater checks or comparisons to naturalistic corpora. In the revision we will add (i) a description of expert review of a subset of scenarios, (ii) inter-rater reliability statistics, and (iii) a brief comparison against existing text-only moral dilemma collections to address potential synthesis artifacts. revision: yes

  2. Referee: Human consensus and evaluation protocol (abstract and §4): No information is given on how human responses were collected (participant pool, exclusion criteria, statistical tests for consensus, or agreement metrics). This absence directly undermines the reported 'pronounced moral alignment bias' and the conclusion that general scaling yields diminishing returns.

    Authors: We acknowledge that §4 currently omits the requested details on human data collection. The manuscript reports aggregate human consensus but does not describe recruitment, exclusion rules, or agreement metrics. In the revision we will expand §4 to include participant demographics, exclusion criteria, the statistical procedure used to establish consensus, and agreement metrics (e.g., Fleiss’ kappa). This will provide a transparent basis for the reported model-human divergence. revision: yes
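
For reference, Fleiss' kappa, the agreement metric named in the response above, can be computed from a table of per-item category counts. This is a minimal sketch of the standard formula, not the authors' analysis code; the function name and input layout are assumptions.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j.

    Every item must have the same number of raters (at least 2). The result
    is undefined (ZeroDivisionError) when all ratings fall in one category.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_cats = len(counts[0])

    # p_j: overall proportion of ratings falling in category j.
    p_j = [sum(row[j] for row in counts) / (n_items * n_raters)
           for j in range(n_cats)]
    # P_i: observed pairwise agreement among raters on item i.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]

    p_bar = sum(p_i) / n_items      # mean observed agreement
    p_e = sum(p * p for p in p_j)   # agreement expected by chance
    return (p_bar - p_e) / (1 - p_e)
```

A value of 1 means perfect agreement and values at or below 0 mean agreement no better than chance, so reporting kappa per foundation would show where the human consensus target is weakest.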

Circularity Check

0 steps flagged

No circularity detected; benchmark construction and external evaluations are independent


The paper introduces MM-MoralBench as a newly constructed benchmark grounded in external Moral Foundations Theory, with scenarios synthesized from visual contexts and dialogues, then evaluated on over 20 pre-existing LVLMs against aggregated human responses. No self-definitional reductions, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The derivation chain consists of independent data synthesis followed by external model testing, remaining self-contained without any step reducing to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the untested validity of Moral Foundations Theory for multimodal AI evaluation and on the assumption that the authors' synthesized scenarios faithfully represent dynamic real-world moral contexts.

axioms (1)
  • domain assumption Moral Foundations Theory provides a suitable and sufficient framework for measuring moral alignment in vision-language models
    Benchmark construction and evaluation tasks are defined directly in terms of the six foundations without alternative frameworks or justification for this choice.

pith-pipeline@v0.9.0 · 5708 in / 1210 out tokens · 55624 ms · 2026-05-23T07:11:59.604987+00:00 · methodology


Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 7 internal anchors

  1. [1]

    A study of generative large language model for medical research and healthcare

    Cheng Peng, Xi Yang, Aokun Chen, Kaleb E Smith, Nima PourNejatian, Anthony B Costa, Cheryl Martin, Mona G Flores, Ying Zhang, Tanja Magoc, et al. A study of generative large language model for medical research and healthcare. NPJ digital medicine, 6(1):210, 2023. 1

  2. [2]

    Large language models in law: A survey

    Jinqi Lai, Wensheng Gan, Jiayang Wu, Zhenlian Qi, and S Yu Philip. Large language models in law: A survey. AI Open, 2024

  3. [3]

    Large language models in finance: A survey

    Yinheng Li, Shaofei Wang, Han Ding, and Hang Chen. Large language models in finance: A survey. In Proceedings of the fourth ACM international conference on AI in finance, pages 374–382, 2023. 1

  4. [4]

    Large language model alignment: A survey

    Tianhao Shen, Renren Jin, Yufei Huang, Chuang Liu, Weilong Dong, Zishan Guo, Xinwei Wu, Yan Liu, and Deyi Xiong. Large language model alignment: A survey. arXiv preprint arXiv:2309.15025, 2023. 1

  5. [5]

    Intuitive ethics: How innately prepared intuitions generate culturally variable virtues

    Jonathan Haidt and Craig Joseph. Intuitive ethics: How innately prepared intuitions generate culturally variable virtues. Daedalus, 133(4):55–66, 2004. 1, 2

  6. [6]

    Moral foundations theory: The pragmatic validity of moral pluralism

    Jesse Graham, Jonathan Haidt, Sena Koleva, Matt Motyl, Ravi Iyer, Sean P Wojcik, and Peter H Ditto. Moral foundations theory: The pragmatic validity of moral pluralism. In Advances in experimental social psychology, volume 47, pages 55–130. Elsevier, 2013. 1, 2

  7. [7]

    Liberals and conservatives rely on different sets of moral foundations

    Jesse Graham, Jonathan Haidt, and Brian A Nosek. Liberals and conservatives rely on different sets of moral foundations. Journal of personality and social psychology, 96(5):1029,

  8. [8]

    Moral foundations vignettes: A standardized stimulus database of scenarios based on moral foundations theory

    Scott Clifford, Vijeth Iyengar, Roberto Cabeza, and Walter Sinnott-Armstrong. Moral foundations vignettes: A standardized stimulus database of scenarios based on moral foundations theory. Behavior research methods, 47(4):1178–1198, 2015. 1, 2, 3, 7

  9. [9]

    Unpacking the ethical value alignment in big models

    Xiaoyuan Yi, Jing Yao, Xiting Wang, and Xing Xie. Unpacking the ethical value alignment in big models. arXiv preprint arXiv:2310.17551, 2023. 1

  10. [10]

    Evaluating the moral beliefs encoded in llms

    Nino Scherrer, Claudia Shi, Amir Feder, and David Blei. Evaluating the moral beliefs encoded in llms. Advances in Neural Information Processing Systems, 36, 2024. 1, 3, 5, 6, 7

  11. [11]

    Moralbench: Moral evaluation of llms

    Jianchao Ji, Yutong Chen, Mingyu Jin, Wujiang Xu, Wenyue Hua, and Yongfeng Zhang. Moralbench: Moral evaluation of llms. arXiv preprint arXiv:2406.04428, 2024. 1, 3, 5

  12. [12]

    Aligning ai with shared human values

    Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, and Jacob Steinhardt. Aligning ai with shared human values. In International Conference on Learning Representations, 2021. 1, 3

  13. [13]

    Scruples: A corpus of community ethical judgments on 32,000 real-life anecdotes

    Nicholas Lourie, Ronan Le Bras, and Yejin Choi. Scruples: A corpus of community ethical judgments on 32,000 real-life anecdotes. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 13470–13479, 2021. 3, 5

  14. [14]

    CMoralEval: A Moral Evaluation Benchmark for Chinese Large Language Models

    Linhao Yu, Yongqi Leng, Yufei Huang, Shang Wu, Haixin Liu, Xinmeng Ji, Jiahui Zhao, Jinwang Song, Tingting Cui, Xiaoqing Cheng, et al. CMoralEval: A Moral Evaluation Benchmark for Chinese Large Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2024, Online or Conference Location, 2024. Association for Computational L...

  15. [15]

    When to make exceptions: Exploring language models as accounts of human moral judgment

    Zhijing Jin, Sydney Levine, Fernando Gonzalez Adauto, Ojasv Kamal, Maarten Sap, Mrinmaya Sachan, Rada Mihalcea, Josh Tenenbaum, and Bernhard Schölkopf. When to make exceptions: Exploring language models as accounts of human moral judgment. Advances in neural information processing systems, 35:28458–28473, 2022. 3, 5

  16. [16]

    Dailydilemmas: Revealing value preferences of llms with quandaries of daily life

    Yu Ying Chiu, Liwei Jiang, and Yejin Choi. Dailydilemmas: Revealing value preferences of llms with quandaries of daily life. arXiv preprint arXiv:2410.02683, 2024. 1, 3, 5

  17. [17]

    Hello gpt-4o

    OpenAI. Hello gpt-4o. https://openai.com/index/hello-gpt-4o/, 2024. 2, 3, 4, 6, 1

  18. [18]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024. 2, 4

  19. [19]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023. 3, 6, 4

  20. [20]

    CogVLM: Visual Expert for Pretrained Language Models

    Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, et al. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079, 2023. 3, 6, 4

  21. [21]

    InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model

    Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, et al. Internlm-xcomposer2: Mastering free-form text-image composition and comprehension in vision-language large model. arXiv preprint arXiv:2401.16420, 2024. 3, 6, 4

  22. [22]

    MiniCPM-V: A GPT-4V Level MLLM on Your Phone

    Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800, 2024. 3, 6, 4

  23. [23]

    mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration

    Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Anwen Hu, Haowei Liu, Qi Qian, Ji Zhang, and Fei Huang. mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13040–13051, 2024. 3, 6, 4

  24. [24]

    Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

    Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219, 2024. 3, 6, 4

  25. [25]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023. 3, 6, 4

  26. [26]

    Yi: Open Foundation Models by 01.AI

    Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, et al. Yi: Open foundation models by 01.AI. arXiv preprint arXiv:2403.04652, 2024. 3, 6, 4

  27. [27]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023. 3, 6, 4

  28. [28]

    Large language models are not robust multiple choice selectors

    Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, and Minlie Huang. Large language models are not robust multiple choice selectors. In International Conference on Learning Representations, 2024. 6

  29. [29]

    Mmbench: Benchmarking end-to-end multi-modal dnns and understanding their hardware-software implications

    Cheng Xu, Xiaofeng Hou, Jiacheng Liu, Chao Li, Tianhao Huang, Xiaozhi Zhu, Mo Niu, Lingyu Sun, Peng Tang, Tongqiao Xu, Kwang-Ting Cheng, and Minyi Guo. Mmbench: Benchmarking end-to-end multi-modal dnns and understanding their hardware-software implications. In 2023 IEEE International Symposium on Workload Characterization (IISWC), 2023. 6

  30. [30]

    Equation of state calculations by fast computing machines

    Nicholas Metropolis, Arianna W Rosenbluth, Marshall N Rosenbluth, Augusta H Teller, and Edward Teller. Equation of state calculations by fast computing machines. The journal of chemical physics, 21(6):1087–1092, 1953. 6

M³oralBench: A MultiModal Moral Benchmark for LVLMs, Supplementary Material

  31. [31]

    Moral Foundations Theory

    7.1. Moral Foundations

    We provide a more detailed explanation of the six moral foundations [5, 6] as follows:

      • Care/Harm: This foundation arises from the evolutionary need to care for vulnerable offspring. It is triggered by visual and auditory signs of suffering, distress, or neediness, primarily from one’s own children but al...

  32. [32]

    Image Generation Details

    8.1. Image Generation Prompts

    In the image generation process, we utilize GPT-4o [17] to expand the dataset by imitating similar scenarios from MFVs and converting these brief descriptions into detailed image prompts and main character dialogues. Additionally, to generate two contrasting versions of the scenarios, a morally wrong ...

  33. [33]

    More Evaluation Examples

    More examples of M³oralBench evaluation are shown in Figure 10, Figure 11 and Figure 12.

    Table 4. Average CLIP similarity scores between scenario image captions and original scenario texts for each moral foundation.

    Foundation   CLIP Similarity
    Overall      0.765
    Care         0.771
    Fairness     0.774
    Loyalty      0.751
    Authority    0.779
    Sanctity     0.734
    Liberty      0.777

  34. [34]

    Detailed Evaluation Results

    We provide the detailed evaluation results on M³oralBench across 3 moral tasks and 6 moral foundations in Table 5, Table 6 and Table 7.

    Care Violation
    Instruction: How do you consider the behavior in the image from the perspective of morality? (A) Not morally wrong (B) Morally wrong
    Reference: (B)
    mPLUG-Owl2: A
    Phi-3-Vision: ...