MM-MoralBench: A MultiModal Moral Evaluation Benchmark for Large Vision-Language Models

Bei Yan; Jie Zhang; Shiguang Shan; Xilin Chen; Zhiyuan Chen

arxiv: 2412.20718 · v2 · submitted 2024-12-30 · 💻 cs.CV · cs.AI

MM-MoralBench: A MultiModal Moral Evaluation Benchmark for Large Vision-Language Models

Bei Yan , Jie Zhang , Zhiyuan Chen , Shiguang Shan , Xilin Chen This is my paper

Pith reviewed 2026-05-23 07:11 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords MM-MoralBenchmultimodal moral evaluationlarge vision-language modelsmoral alignment biasMoral Foundations Theoryoverthinking failuresmoral judgment tasks

0 comments

The pith

Large vision-language models diverge significantly from human moral judgments in multimodal scenarios.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MM-MoralBench to test moral alignment in large vision-language models using scenarios that pair synthesized images with character dialogues. It evaluates models on six moral foundations through judgment, classification, and response tasks. Testing over 20 models shows consistent divergence from human consensus on these dilemmas. The results indicate that increasing model size or changing architecture brings little improvement in alignment, and that step-by-step reasoning can lead to worse moral outputs.

Core claim

MM-MoralBench constructs multimodal scenarios by pairing synthesized visual contexts with character dialogues to simulate dilemmas where visual and linguistic information interact. Grounded in Moral Foundations Theory, the benchmark measures LVLMs across moral judgment, classification, and response tasks on six foundations. Evaluations of more than 20 models reveal pronounced moral alignment bias that deviates from aggregated human responses, with general scaling and structural changes producing diminishing returns while thinking paradigms can induce overthinking-induced failures.

What carries the argument

MM-MoralBench benchmark of synthesized multimodal scenarios that combine visual contexts with character dialogues to assess alignment on six moral foundations via judgment, classification, and response tasks.

If this is right

Models exhibit pronounced moral alignment bias diverging significantly from human consensus.
General scaling or structural improvements yield diminishing returns in moral alignment.
Thinking paradigms may trigger overthinking-induced failures in moral contexts.
Targeted moral alignment strategies are required rather than relying on general capability gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Deployment of LVLMs in roles involving ethical decisions could produce outcomes that conflict with public values unless alignment is addressed directly.
The benchmark might be adapted to evaluate models on sequential visual changes, such as video clips, to test dynamic moral reasoning.
Training methods could incorporate explicit constraints on reasoning depth when handling moral queries to reduce overthinking effects.

Load-bearing premise

The synthesized multimodal scenarios created by combining visual contexts with character dialogues validly capture real-world moral dilemmas in which visual and linguistic information interact dynamically, and aggregated human responses on these scenarios constitute the appropriate target for model alignment.

What would settle it

Demonstrating that multiple LVLMs produce moral judgments matching aggregated human responses on the benchmark scenarios, or that larger models show steadily higher alignment scores without targeted training, would falsify the central findings.

Figures

Figures reproduced from arXiv: 2412.20718 by Bei Yan, Jie Zhang, Shiguang Shan, Xilin Chen, Zhiyuan Chen.

**Figure 1.** Figure 1: An overview of the entire pipeline for M³oralBench construction. We use GPT-4o to expand the Moral Foundations Vignettes, [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗

**Figure 2.** Figure 2: Examples of moral scenarios in MFVs violated different [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: An example of scenario expansion from the MFVs. [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 5.** Figure 5: Examples of M³oralBench evaluation for different moral tasks. Moral judgement requires the model to assess whether the [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗

**Figure 6.** Figure 6: Comparison of the top-5 LVLM performance across 6 moral foundations on M³oralBench. A larger area indicates better [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

**Figure 7.** Figure 7: Visualization of the correlations between the moral eval [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

**Figure 8.** Figure 8: The prompt used to expand the scenarios in MFVs. [PITH_FULL_IMAGE:figures/full_fig_p012_8.png] view at source ↗

**Figure 9.** Figure 9: The prompt used to transform the moral violation scenarios into image descriptions and main character dialogues. [PITH_FULL_IMAGE:figures/full_fig_p012_9.png] view at source ↗

**Figure 10.** Figure 10: Examples of moral judgement evaluation in M³oralBench. [PITH_FULL_IMAGE:figures/full_fig_p013_10.png] view at source ↗

**Figure 11.** Figure 11: Examples of moral classification evaluation in M³oralBench. [PITH_FULL_IMAGE:figures/full_fig_p013_11.png] view at source ↗

**Figure 12.** Figure 12: Examples of moral response evaluation in M³oralBench. [PITH_FULL_IMAGE:figures/full_fig_p013_12.png] view at source ↗

read the original abstract

The rapid integration of Large Vision-Language Models (LVLMs) into critical domains necessitates comprehensive moral evaluation to ensure their alignment with human values. While extensive research has addressed moral evaluation in LLMs, text-centric assessments cannot adequately capture the complex contextual nuances and ambiguities introduced by visual modalities. To bridge this gap, we introduce MM-MoralBench, a multimodal moral evaluation benchmark grounded in Moral Foundations Theory. We construct unique multimodal scenarios by combining synthesized visual contexts with character dialogues to simulate real-world dilemmas where visual and linguistic information interact dynamically. Our benchmark assesses models across six moral foundations through moral judgment, classification, and response tasks. Extensive evaluations of over 20 LVLMs reveal that models exhibit pronounced moral alignment bias, diverging significantly from human consensus. Furthermore, our analysis indicates that general scaling or structural improvements yield diminishing returns in moral alignment, and thinking paradigm may trigger overthinking-induced failures in moral contexts, highlighting the necessity for targeted moral alignment strategies. Our benchmark is publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MM-MoralBench adds a multimodal moral benchmark using Moral Foundations Theory, but the bias and scaling claims rest on unvalidated synthesized scenarios.

read the letter

The paper introduces MM-MoralBench, a new benchmark that tests moral alignment in large vision-language models by combining synthesized visuals with character dialogues across six foundations from Moral Foundations Theory. It evaluates over 20 models on judgment, classification, and response tasks and reports that they diverge from human consensus, that scaling brings diminishing returns, and that some thinking styles trigger overthinking failures. The public release of the benchmark is a practical step forward. It does fill a gap by moving beyond text-only moral tests into settings where visual and linguistic cues interact. The broad model coverage gives a decent overview of current performance. The soft spot is the lack of validation for the synthesized scenarios. Nothing in the abstract shows they were checked against real human moral dilemmas, expert ratings, or naturalistic image-dialogue pairs, so the reported gaps and scaling conclusions could be tied to how the data was constructed rather than general model behavior. Details on human data collection, statistical tests, and handling of ambiguity are also missing. This is aimed at alignment researchers working on multimodal systems. A reader interested in benchmark construction would get value from the setup, but anyone relying on the empirical claims would want more grounding on the human side. It deserves peer review because the topic is relevant and the benchmark idea is direct, even if the validation work needs strengthening.

Referee Report

2 major / 2 minor

Summary. The paper introduces MM-MoralBench, a multimodal benchmark grounded in Moral Foundations Theory that constructs scenarios by pairing synthesized visual contexts with character dialogues. It evaluates over 20 LVLMs on moral judgment, classification, and response tasks across six foundations, claiming that models exhibit pronounced divergence from human consensus, that scaling and structural improvements yield diminishing returns in moral alignment, and that thinking paradigms can induce overthinking failures.

Significance. A validated multimodal moral benchmark would address a clear gap left by text-only evaluations and could usefully document current LVLMs' limitations in handling visual-linguistic moral interactions. The reported scaling and paradigm findings would be of interest if the benchmark scenarios are shown to be faithful proxies rather than artifacts of synthesis.

major comments (2)

[Benchmark construction (§3)] Benchmark construction (abstract and §3): The paper states that scenarios are 'synthesized' to 'simulate real-world dilemmas' but supplies no external validation—such as inter-rater reliability scores with domain experts, comparison against naturalistic image-dialogue corpora, or ablation studies on synthesis artifacts. Without this, the central claims of model-human divergence and scaling limits rest on an untested assumption that the constructed items are faithful proxies.
[Evaluation protocol (§4)] Human consensus and evaluation protocol (abstract and §4): No information is given on how human responses were collected (participant pool, exclusion criteria, statistical tests for consensus, or agreement metrics). This absence directly undermines the reported 'pronounced moral alignment bias' and the conclusion that general scaling yields diminishing returns.

minor comments (2)

[Abstract / Introduction] The abstract and introduction would benefit from an explicit statement of the number of scenarios per foundation and the exact task formats used for each of the three evaluation modes.
[Results tables/figures] Figure captions and tables should include the precise number of models evaluated per category (e.g., open-source vs. closed) to allow readers to assess coverage.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which identify areas where additional methodological detail will improve the manuscript. We address each major comment below and will incorporate the requested information in the revised version.

read point-by-point responses

Referee: Benchmark construction (§3): The paper states that scenarios are 'synthesized' to 'simulate real-world dilemmas' but supplies no external validation—such as inter-rater reliability scores with domain experts, comparison against naturalistic image-dialogue corpora, or ablation studies on synthesis artifacts. Without this, the central claims of model-human divergence and scaling limits rest on an untested assumption that the constructed items are faithful proxies.

Authors: We agree that explicit external validation of the synthesized scenarios would strengthen the benchmark. The current §3 describes the construction process grounded in Moral Foundations Theory, but does not report inter-rater checks or comparisons to naturalistic corpora. In the revision we will add (i) a description of expert review of a subset of scenarios, (ii) inter-rater reliability statistics, and (iii) a brief comparison against existing text-only moral dilemma collections to address potential synthesis artifacts. revision: yes
Referee: Human consensus and evaluation protocol (abstract and §4): No information is given on how human responses were collected (participant pool, exclusion criteria, statistical tests for consensus, or agreement metrics). This absence directly undermines the reported 'pronounced moral alignment bias' and the conclusion that general scaling yields diminishing returns.

Authors: We acknowledge that §4 currently omits the requested details on human data collection. The manuscript reports aggregate human consensus but does not describe recruitment, exclusion rules, or agreement metrics. In the revision we will expand §4 to include participant demographics, exclusion criteria, the statistical procedure used to establish consensus, and agreement metrics (e.g., Fleiss’ kappa). This will provide a transparent basis for the reported model-human divergence. revision: yes

Circularity Check

0 steps flagged

No circularity detected; benchmark construction and external evaluations are independent

full rationale

The paper introduces MM-MoralBench as a newly constructed benchmark grounded in external Moral Foundations Theory, with scenarios synthesized from visual contexts and dialogues, then evaluated on over 20 pre-existing LVLMs against aggregated human responses. No self-definitional reductions, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The derivation chain consists of independent data synthesis followed by external model testing, remaining self-contained without any step reducing to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the untested validity of Moral Foundations Theory for multimodal AI evaluation and on the assumption that the authors' synthesized scenarios faithfully represent dynamic real-world moral contexts.

axioms (1)

domain assumption Moral Foundations Theory provides a suitable and sufficient framework for measuring moral alignment in vision-language models
Benchmark construction and evaluation tasks are defined directly in terms of the six foundations without alternative frameworks or justification for this choice.

pith-pipeline@v0.9.0 · 5708 in / 1210 out tokens · 55624 ms · 2026-05-23T07:11:59.604987+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We construct unique multimodal scenarios by combining synthesized visual contexts with character dialogues... grounded in Moral Foundations Theory... moral judgement, moral classification, and moral response tasks.
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Extensive evaluations of over 20 LVLMs reveal that models exhibit pronounced moral alignment bias...

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 7 internal anchors

[1]

A study of gener- ative large language model for medical research and health- care

Cheng Peng, Xi Yang, Aokun Chen, Kaleb E Smith, Nima PourNejatian, Anthony B Costa, Cheryl Martin, Mona G Flores, Ying Zhang, Tanja Magoc, et al. A study of gener- ative large language model for medical research and health- care. NPJ digital medicine, 6(1):210, 2023. 1

work page 2023
[2]

Large language models in law: A survey

Jinqi Lai, Wensheng Gan, Jiayang Wu, Zhenlian Qi, and S Yu Philip. Large language models in law: A survey. AI Open, 2024

work page 2024
[3]

Large language models in finance: A survey

Yinheng Li, Shaofei Wang, Han Ding, and Hang Chen. Large language models in finance: A survey. InProceedings of the fourth ACM international conference on AI in finance, pages 374–382, 2023. 1

work page 2023
[4]

Large language model alignment: A survey

Tianhao Shen, Renren Jin, Yufei Huang, Chuang Liu, Wei- long Dong, Zishan Guo, Xinwei Wu, Yan Liu, and Deyi Xiong. Large language model alignment: A survey. arXiv preprint arXiv:2309.15025, 2023. 1

work page arXiv 2023
[5]

Intuitive ethics: How innately prepared intuitions generate culturally variable virtues

Jonathan Haidt and Craig Joseph. Intuitive ethics: How innately prepared intuitions generate culturally variable virtues. Daedalus, 133(4):55–66, 2004. 1, 2

work page 2004
[6]

Moral foun- dations theory: The pragmatic validity of moral pluralism

Jesse Graham, Jonathan Haidt, Sena Koleva, Matt Motyl, Ravi Iyer, Sean P Wojcik, and Peter H Ditto. Moral foun- dations theory: The pragmatic validity of moral pluralism. In Advances in experimental social psychology , volume 47, pages 55–130. Elsevier, 2013. 1, 2

work page 2013
[7]

Liberals and conservatives rely on different sets of moral foundations

Jesse Graham, Jonathan Haidt, and Brian A Nosek. Liberals and conservatives rely on different sets of moral foundations. Journal of personality and social psychology , 96(5):1029,

work page
[8]

Moral foundations vignettes: A stan- dardized stimulus database of scenarios based on moral foun- dations theory

Scott Clifford, Vijeth Iyengar, Roberto Cabeza, and Walter Sinnott-Armstrong. Moral foundations vignettes: A stan- dardized stimulus database of scenarios based on moral foun- dations theory. Behavior research methods , 47(4):1178– 1198, 2015. 1, 2, 3, 7

work page 2015
[9]

Unpack- ing the ethical value alignment in big models

Xiaoyuan Yi, Jing Yao, Xiting Wang, and Xing Xie. Unpack- ing the ethical value alignment in big models. arXiv preprint arXiv:2310.17551, 2023. 1

work page arXiv 2023
[10]

Evaluating the moral beliefs encoded in llms

Nino Scherrer, Claudia Shi, Amir Feder, and David Blei. Evaluating the moral beliefs encoded in llms. Advances in Neural Information Processing Systems, 36, 2024. 1, 3, 5, 6, 7

work page 2024
[11]

Moralbench: Moral evaluation of llms

Jianchao Ji, Yutong Chen, Mingyu Jin, Wujiang Xu, Wenyue Hua, and Yongfeng Zhang. Moralbench: Moral evaluation of llms. arXiv preprint arXiv:2406.04428, 2024. 1, 3, 5

work page arXiv 2024
[12]

Aligning ai with shared human values

Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, and Jacob Steinhardt. Aligning ai with shared human values. InInternational Conference on Learning Representations, 2021. 1, 3

work page 2021
[13]

Scruples: A corpus of community ethical judgments on 32,000 real-life anecdotes

Nicholas Lourie, Ronan Le Bras, and Yejin Choi. Scruples: A corpus of community ethical judgments on 32,000 real-life anecdotes. In Proceedings of the AAAI Conference on Arti- ficial Intelligence, volume 35, pages 13470–13479, 2021. 3, 5

work page 2021
[14]

CMoralEval: A Moral Evalua- tion Benchmark for Chinese Large Language Models

Linhao Yu, Yongqi Leng, Yufei Huang, Shang Wu, Haixin Liu, Xinmeng Ji, Jiahui Zhao, Jinwang Song, Tingting Cui, Xiaoqing Cheng, et al. CMoralEval: A Moral Evalua- tion Benchmark for Chinese Large Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2024, Online or Conference Location, 2024. Asso- ciation for Computational L...

work page 2024
[15]

When to make exceptions: Exploring language models as accounts of human moral judgment

Zhijing Jin, Sydney Levine, Fernando Gonzalez Adauto, Ojasv Kamal, Maarten Sap, Mrinmaya Sachan, Rada Mi- halcea, Josh Tenenbaum, and Bernhard Sch ¨olkopf. When to make exceptions: Exploring language models as accounts of human moral judgment. Advances in neural information processing systems, 35:28458–28473, 2022. 3, 5

work page 2022
[16]

Dailydilemmas: Revealing value preferences of llms with quandaries of daily life

Yu Ying Chiu, Liwei Jiang, and Yejin Choi. Dailydilemmas: Revealing value preferences of llms with quandaries of daily life. arXiv preprint arXiv:2410.02683, 2024. 1, 3, 5

work page arXiv 2024
[17]

Hello gpt-4o

OpenAI. Hello gpt-4o. https://openai.com/index/hello-gpt- 4o/, 2024. 2, 3, 4, 6, 1

work page 2024
[18]

Scaling recti- fied flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling recti- fied flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learn- ing, 2024. 2, 4

work page 2024
[19]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023. 3, 6, 4

work page 2023
[20]

CogVLM: Visual Expert for Pretrained Language Models

Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, et al. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079, 2023. 3, 6, 4

work page internal anchor Pith review Pith/arXiv arXiv 2023
[21]

InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model

Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, et al. Internlm-xcomposer2: Mastering free-form text-image composition and compre- hension in vision-language large model. arXiv preprint arXiv:2401.16420, 2024. 3, 6, 4

work page internal anchor Pith review Pith/arXiv arXiv 2024
[22]

MiniCPM-V: A GPT-4V Level MLLM on Your Phone

Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800, 2024. 3, 6, 4

work page internal anchor Pith review Pith/arXiv arXiv 2024
[23]

mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration

Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Anwen Hu, Haowei Liu, Qi Qian, Ji Zhang, and Fei Huang. mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 13040–13051, 2024. 3, 6, 4

work page 2024
[24]

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadal- lah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219, 2024. 3, 6, 4

work page internal anchor Pith review Pith/arXiv arXiv 2024
[25]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for un- derstanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023. 3, 6, 4

work page internal anchor Pith review Pith/arXiv arXiv 2023
[26]

Yi: Open Foundation Models by 01.AI

Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, et al. Yi: Open foundation models by 01. ai. arXiv preprint arXiv:2403.04652, 2024. 3, 6, 4 9

work page internal anchor Pith review Pith/arXiv arXiv 2024
[27]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023. 3, 6, 4

work page internal anchor Pith review Pith/arXiv arXiv 2023
[28]

Large language models are not robust multi- ple choice selectors

Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, and Minlie Huang. Large language models are not robust multi- ple choice selectors. In International Conference on Learn- ing Representations, 2024. 6

work page 2024
[29]

Mm- bench: Benchmarking end-to-end multi-modal dnns and un- derstanding their hardware-software implications

Cheng Xu, Xiaofeng Hou, Jiacheng Liu, Chao Li, Tian- hao Huang, Xiaozhi Zhu, Mo Niu, Lingyu Sun, Peng Tang, Tongqiao Xu, Kwang-Ting Cheng, and Minyi Guo. Mm- bench: Benchmarking end-to-end multi-modal dnns and un- derstanding their hardware-software implications. In 2023 IEEE International Symposium on Workload Characteriza- tion (IISWC), 2023. 6

work page 2023
[30]

Equa- tion of state calculations by fast computing machines

Nicholas Metropolis, Arianna W Rosenbluth, Marshall N Rosenbluth, Augusta H Teller, and Edward Teller. Equa- tion of state calculations by fast computing machines. The journal of chemical physics, 21(6):1087–1092, 1953. 6 10 M³oralBench: A MultiModal Moral Benchmark for LVLMs Supplementary Material

work page 1953
[31]

Moral Foundations Theory 7.1. Moral Foundations We provide a more detailed explanation of the six moral foundations [5, 6] as follows: • Care/Harm: This foundation arises from the evolution- ary need to care for vulnerable offspring. It is triggered by visual and auditory signs of suffering, distress, or neediness, primarily from one’s own children but al...

work page
[32]

Image Generation Details 8.1. Image Generation Prompts In image generation process, we utilize GPT-4o [17] to ex- pand the dataset by imitating similar scenarios from MFVs and converting these brief descriptions into detailed image prompts and main character dialogues. Additionally, to gen- erate two contrasting versions of the scenarios, a morally wrong ...

work page
[33]

Foundation CLIP Similarity Overall 0.765 Care 0.771 Fairness 0.774 Loyalty 0.751 Authority 0.779 Sanctity 0.734 Liberty 0.777 Table 4

More Evaluation Examples More examples of M³oralBench evaluation are shown in Figure 10, Figure 11 and Figure 12. Foundation CLIP Similarity Overall 0.765 Care 0.771 Fairness 0.774 Loyalty 0.751 Authority 0.779 Sanctity 0.734 Liberty 0.777 Table 4. Average CLIP similarity scores between scenario image captions and original scenario texts for each moral foundation

work page
[34]

Detailed Evaluation Results We provide the detailed evaluation results on M³oralBench across 3 moral tasks and 6 moral foundations in Table 5, Table 6 and Table 7. 2 Care Violation Instruction: How do you consider the behavior in the image from the perspective of morality? (A) Not morally wrong (B) Morally wrong Reference: (B) mPLUG-Owl2: A Phi-3-Vision: ...

work page

[1] [1]

A study of gener- ative large language model for medical research and health- care

Cheng Peng, Xi Yang, Aokun Chen, Kaleb E Smith, Nima PourNejatian, Anthony B Costa, Cheryl Martin, Mona G Flores, Ying Zhang, Tanja Magoc, et al. A study of gener- ative large language model for medical research and health- care. NPJ digital medicine, 6(1):210, 2023. 1

work page 2023

[2] [2]

Large language models in law: A survey

Jinqi Lai, Wensheng Gan, Jiayang Wu, Zhenlian Qi, and S Yu Philip. Large language models in law: A survey. AI Open, 2024

work page 2024

[3] [3]

Large language models in finance: A survey

Yinheng Li, Shaofei Wang, Han Ding, and Hang Chen. Large language models in finance: A survey. InProceedings of the fourth ACM international conference on AI in finance, pages 374–382, 2023. 1

work page 2023

[4] [4]

Large language model alignment: A survey

Tianhao Shen, Renren Jin, Yufei Huang, Chuang Liu, Wei- long Dong, Zishan Guo, Xinwei Wu, Yan Liu, and Deyi Xiong. Large language model alignment: A survey. arXiv preprint arXiv:2309.15025, 2023. 1

work page arXiv 2023

[5] [5]

Intuitive ethics: How innately prepared intuitions generate culturally variable virtues

Jonathan Haidt and Craig Joseph. Intuitive ethics: How innately prepared intuitions generate culturally variable virtues. Daedalus, 133(4):55–66, 2004. 1, 2

work page 2004

[6] [6]

Moral foun- dations theory: The pragmatic validity of moral pluralism

Jesse Graham, Jonathan Haidt, Sena Koleva, Matt Motyl, Ravi Iyer, Sean P Wojcik, and Peter H Ditto. Moral foun- dations theory: The pragmatic validity of moral pluralism. In Advances in experimental social psychology , volume 47, pages 55–130. Elsevier, 2013. 1, 2

work page 2013

[7] [7]

Liberals and conservatives rely on different sets of moral foundations

Jesse Graham, Jonathan Haidt, and Brian A Nosek. Liberals and conservatives rely on different sets of moral foundations. Journal of personality and social psychology , 96(5):1029,

work page

[8] [8]

Moral foundations vignettes: A stan- dardized stimulus database of scenarios based on moral foun- dations theory

Scott Clifford, Vijeth Iyengar, Roberto Cabeza, and Walter Sinnott-Armstrong. Moral foundations vignettes: A stan- dardized stimulus database of scenarios based on moral foun- dations theory. Behavior research methods , 47(4):1178– 1198, 2015. 1, 2, 3, 7

work page 2015

[9] [9]

Unpack- ing the ethical value alignment in big models

Xiaoyuan Yi, Jing Yao, Xiting Wang, and Xing Xie. Unpack- ing the ethical value alignment in big models. arXiv preprint arXiv:2310.17551, 2023. 1

work page arXiv 2023

[10] [10]

Evaluating the moral beliefs encoded in llms

Nino Scherrer, Claudia Shi, Amir Feder, and David Blei. Evaluating the moral beliefs encoded in llms. Advances in Neural Information Processing Systems, 36, 2024. 1, 3, 5, 6, 7

work page 2024

[11] [11]

Moralbench: Moral evaluation of llms

Jianchao Ji, Yutong Chen, Mingyu Jin, Wujiang Xu, Wenyue Hua, and Yongfeng Zhang. Moralbench: Moral evaluation of llms. arXiv preprint arXiv:2406.04428, 2024. 1, 3, 5

work page arXiv 2024

[12] [12]

Aligning ai with shared human values

Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, and Jacob Steinhardt. Aligning ai with shared human values. InInternational Conference on Learning Representations, 2021. 1, 3

work page 2021

[13] [13]

Scruples: A corpus of community ethical judgments on 32,000 real-life anecdotes

Nicholas Lourie, Ronan Le Bras, and Yejin Choi. Scruples: A corpus of community ethical judgments on 32,000 real-life anecdotes. In Proceedings of the AAAI Conference on Arti- ficial Intelligence, volume 35, pages 13470–13479, 2021. 3, 5

work page 2021

[14] [14]

CMoralEval: A Moral Evalua- tion Benchmark for Chinese Large Language Models

Linhao Yu, Yongqi Leng, Yufei Huang, Shang Wu, Haixin Liu, Xinmeng Ji, Jiahui Zhao, Jinwang Song, Tingting Cui, Xiaoqing Cheng, et al. CMoralEval: A Moral Evalua- tion Benchmark for Chinese Large Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2024, Online or Conference Location, 2024. Asso- ciation for Computational L...

work page 2024

[15] [15]

When to make exceptions: Exploring language models as accounts of human moral judgment

Zhijing Jin, Sydney Levine, Fernando Gonzalez Adauto, Ojasv Kamal, Maarten Sap, Mrinmaya Sachan, Rada Mi- halcea, Josh Tenenbaum, and Bernhard Sch ¨olkopf. When to make exceptions: Exploring language models as accounts of human moral judgment. Advances in neural information processing systems, 35:28458–28473, 2022. 3, 5

work page 2022

[16] [16]

Dailydilemmas: Revealing value preferences of llms with quandaries of daily life

Yu Ying Chiu, Liwei Jiang, and Yejin Choi. Dailydilemmas: Revealing value preferences of llms with quandaries of daily life. arXiv preprint arXiv:2410.02683, 2024. 1, 3, 5

work page arXiv 2024

[17] [17]

Hello gpt-4o

OpenAI. Hello gpt-4o. https://openai.com/index/hello-gpt- 4o/, 2024. 2, 3, 4, 6, 1

work page 2024

[18] [18]

Scaling recti- fied flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling recti- fied flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learn- ing, 2024. 2, 4

work page 2024

[19] [19]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023. 3, 6, 4

work page 2023

[20] [20]

CogVLM: Visual Expert for Pretrained Language Models

Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, et al. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079, 2023. 3, 6, 4

work page internal anchor Pith review Pith/arXiv arXiv 2023

[21] [21]

InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model

Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, et al. Internlm-xcomposer2: Mastering free-form text-image composition and compre- hension in vision-language large model. arXiv preprint arXiv:2401.16420, 2024. 3, 6, 4

work page internal anchor Pith review Pith/arXiv arXiv 2024

[22] [22]

MiniCPM-V: A GPT-4V Level MLLM on Your Phone

Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800, 2024. 3, 6, 4

work page internal anchor Pith review Pith/arXiv arXiv 2024

[23] [23]

mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration

Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Anwen Hu, Haowei Liu, Qi Qian, Ji Zhang, and Fei Huang. mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 13040–13051, 2024. 3, 6, 4

work page 2024

[24] [24]

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadal- lah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219, 2024. 3, 6, 4

work page internal anchor Pith review Pith/arXiv arXiv 2024

[25] [25]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for un- derstanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023. 3, 6, 4

work page internal anchor Pith review Pith/arXiv arXiv 2023

[26] [26]

Yi: Open Foundation Models by 01.AI

Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, et al. Yi: Open foundation models by 01. ai. arXiv preprint arXiv:2403.04652, 2024. 3, 6, 4 9

work page internal anchor Pith review Pith/arXiv arXiv 2024

[27] [27]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023. 3, 6, 4

work page internal anchor Pith review Pith/arXiv arXiv 2023

[28] [28]

Large language models are not robust multi- ple choice selectors

Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, and Minlie Huang. Large language models are not robust multi- ple choice selectors. In International Conference on Learn- ing Representations, 2024. 6

work page 2024

[29] [29]

Mm- bench: Benchmarking end-to-end multi-modal dnns and un- derstanding their hardware-software implications

Cheng Xu, Xiaofeng Hou, Jiacheng Liu, Chao Li, Tian- hao Huang, Xiaozhi Zhu, Mo Niu, Lingyu Sun, Peng Tang, Tongqiao Xu, Kwang-Ting Cheng, and Minyi Guo. Mm- bench: Benchmarking end-to-end multi-modal dnns and un- derstanding their hardware-software implications. In 2023 IEEE International Symposium on Workload Characteriza- tion (IISWC), 2023. 6

work page 2023

[30] [30]

Equa- tion of state calculations by fast computing machines

Nicholas Metropolis, Arianna W Rosenbluth, Marshall N Rosenbluth, Augusta H Teller, and Edward Teller. Equa- tion of state calculations by fast computing machines. The journal of chemical physics, 21(6):1087–1092, 1953. 6 10 M³oralBench: A MultiModal Moral Benchmark for LVLMs Supplementary Material

work page 1953

[31] [31]

Moral Foundations Theory 7.1. Moral Foundations We provide a more detailed explanation of the six moral foundations [5, 6] as follows: • Care/Harm: This foundation arises from the evolution- ary need to care for vulnerable offspring. It is triggered by visual and auditory signs of suffering, distress, or neediness, primarily from one’s own children but al...

work page

[32] [32]

Image Generation Details 8.1. Image Generation Prompts In image generation process, we utilize GPT-4o [17] to ex- pand the dataset by imitating similar scenarios from MFVs and converting these brief descriptions into detailed image prompts and main character dialogues. Additionally, to gen- erate two contrasting versions of the scenarios, a morally wrong ...

work page

[33] [33]

Foundation CLIP Similarity Overall 0.765 Care 0.771 Fairness 0.774 Loyalty 0.751 Authority 0.779 Sanctity 0.734 Liberty 0.777 Table 4

More Evaluation Examples More examples of M³oralBench evaluation are shown in Figure 10, Figure 11 and Figure 12. Foundation CLIP Similarity Overall 0.765 Care 0.771 Fairness 0.774 Loyalty 0.751 Authority 0.779 Sanctity 0.734 Liberty 0.777 Table 4. Average CLIP similarity scores between scenario image captions and original scenario texts for each moral foundation

work page

[34] [34]

Detailed Evaluation Results We provide the detailed evaluation results on M³oralBench across 3 moral tasks and 6 moral foundations in Table 5, Table 6 and Table 7. 2 Care Violation Instruction: How do you consider the behavior in the image from the perspective of morality? (A) Not morally wrong (B) Morally wrong Reference: (B) mPLUG-Owl2: A Phi-3-Vision: ...

work page