pith. sign in

arxiv: 2511.10287 · v4 · submitted 2025-11-13 · 💻 cs.LG · cs.CL

OutSafe-Bench: A Benchmark for Multimodal Offensive Content Detection in Large Language Models

Pith reviewed 2026-05-17 22:26 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords multimodal large language modelscontent safety benchmarkoffensive content detectionsafety evaluationrisk assessmentMLLM vulnerabilities
0
0 comments X

The pith

OutSafe-Bench reveals persistent safety vulnerabilities across nine state-of-the-art multimodal large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents OutSafe-Bench as the first comprehensive test suite for content safety in multimodal large language models. It supplies a dataset spanning text, images, audio, and video in English and Chinese, with more than 18,000 text prompts and thousands of other samples labeled across nine risk categories. A new Multidimensional Cross Risk Score tracks overlapping dangers while FairScore aggregates judgments from multiple top models to reduce single-judge bias. When the authors apply these tools to current leading MLLMs, the results show repeated failures to block toxic language, biased imagery, privacy leaks, and harmful misinformation. Readers should care because these models are already entering everyday tools and agents where unsafe outputs can reach many users quickly.

Core claim

OutSafe-Bench supplies a large-scale multimodal dataset with over 18,000 bilingual text prompts, 4,500 images, 450 audio clips, and 450 videos systematically annotated across nine critical content risk categories, together with the Multidimensional Cross Risk Score to model correlated risks and the FairScore multi-reviewer weighted aggregation framework that selects adaptive juries from top models; evaluation of nine state-of-the-art MLLMs with this suite demonstrates persistent and substantial safety vulnerabilities.

What carries the argument

OutSafe-Bench dataset combined with the FairScore automated multi-reviewer weighted aggregation framework that uses top-performing models as juries to mitigate single-model bias while the Multidimensional Cross Risk Score captures overlapping risks across categories.

If this is right

  • Current multimodal models require additional safeguards before safe deployment in everyday tools.
  • Evaluation must cover multiple modalities and languages to expose vulnerabilities that single-modality tests miss.
  • Automated multi-reviewer scoring can reduce bias compared with single-model judgment.
  • Persistent failures across nine risk categories indicate that safety training has not yet closed the gap in MLLMs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers could incorporate OutSafe-Bench directly into model training loops to penalize unsafe generations in real time.
  • The same dataset construction method could be extended to additional languages and emerging risk types such as deepfake misinformation.
  • Regulatory bodies might adopt similar multimodal benchmarks as minimum standards for releasing new MLLMs to the public.

Load-bearing premise

The nine risk categories and the automated FairScore aggregation accurately capture real-world safety failures without introducing new biases from the jury models themselves.

What would settle it

A controlled user study in which participants interact with the evaluated MLLMs on held-out prompts and report the actual frequency of unsafe outputs, then compare that frequency to the benchmark scores produced by OutSafe-Bench and FairScore.

Figures

Figures reproduced from arXiv: 2511.10287 by Lingjuan Lyu, Yaochu Jin, Yingchao Yu, Yuanshuai Li, Yuhan Xie, Yuping Yan.

Figure 1
Figure 1. Figure 1: Overview of the OutSafe-Bench framework. It includes content risk aspect division across four modalities, as well as the evalua [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Examples of dataset instances across nine content safety scenarios, showcasing Chinese text, English text, image, and video [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Cross-risk influence matrix for the nine content risk cat [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Each model’s performance is shown per modality across nine categories: [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Example for Chinese text input evaluation. [PITH_FULL_IMAGE:figures/full_fig_p017_5.png] view at source ↗
Figure 8
Figure 8. Figure 8: Example for audio input evaluation [PITH_FULL_IMAGE:figures/full_fig_p018_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Example for video input evaluation. 8 [PITH_FULL_IMAGE:figures/full_fig_p019_9.png] view at source ↗
read the original abstract

Since Multimodal Large Language Models (MLLMs) are increasingly being integrated into everyday tools and intelligent agents, growing concerns have arisen regarding their possible output of unsafe contents, ranging from toxic language and biased imagery to privacy violations and harmful misinformation. Current safety benchmarks remain highly limited in both modality coverage and performance evaluations, often neglecting the extensive landscape of content safety. In this work, we introduce OutSafe-Bench, the first most comprehensive content safety evaluation test suite designed for the multimodal era. OutSafe-Bench includes a large-scale dataset that spans four modalities, featuring over 18,000 bilingual (Chinese and English) text prompts, 4,500 images, 450 audio clips and 450 videos, all systematically annotated across nine critical content risk categories. In addition to the dataset, we introduce a Multidimensional Cross Risk Score (MCRS), a novel metric designed to model and assess overlapping and correlated content risks across different categories. To ensure fair and robust evaluation, we propose FairScore, an explainable automated multi-reviewer weighted aggregation framework. FairScore selects top-performing models as adaptive juries, thereby mitigating biases from single-model judgments and enhancing overall evaluation reliability. Our evaluation of nine state-of-the-art MLLMs reveals persistent and substantial safety vulnerabilities, underscoring the pressing need for robust safeguards in MLLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces OutSafe-Bench, the first comprehensive multimodal content safety benchmark for MLLMs. It provides a large-scale dataset spanning four modalities (over 18,000 bilingual text prompts, 4,500 images, 450 audio clips, and 450 videos) annotated across nine risk categories, proposes the Multidimensional Cross Risk Score (MCRS) to model overlapping risks, and introduces FairScore, an explainable automated multi-reviewer weighted aggregation framework that selects top-performing models as adaptive juries. Evaluation of nine state-of-the-art MLLMs reveals persistent and substantial safety vulnerabilities.

Significance. If the dataset construction, MCRS, and FairScore prove robust and free of circular bias, this benchmark could fill a critical gap in multimodal safety evaluation by covering multiple modalities, languages, and correlated risks. The large scale and inclusion of audio/video are strengths that could support reproducible safety research and drive development of better safeguards in MLLMs.

major comments (2)
  1. [§3.3] §3.3 (FairScore framework): The adaptive selection of top-performing models as juries for weighted aggregation risks circular bias when those models share correlated safety blind spots (e.g., under-flagging co-occurring privacy and misinformation risks). The manuscript provides no evidence of jury diversity verification or cross-validation of jury outputs against human raters on held-out data, which is load-bearing for the reliability of the reported MLLM rankings and the central claim of 'persistent and substantial safety vulnerabilities'.
  2. [§2.2] §2.2 (Dataset annotation): The description of how the nine risk categories were defined and how the 18k+ prompts plus multimodal items were annotated lacks details on inter-annotator agreement, annotator expertise, or quality control procedures. Without these, the ground-truth labels underpinning both MCRS and FairScore evaluations cannot be independently assessed for consistency.
minor comments (2)
  1. [Figure 2] Figure 2 (dataset distribution): The modality and language breakdown bars are difficult to read due to overlapping labels; consider adding exact counts in the caption or a supplementary table.
  2. [§3.1] Notation for MCRS: The formula for combining cross-risk terms is introduced without an explicit definition of the weighting parameters; a short appendix deriving or justifying the weights would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment point by point below, explaining our position and the revisions incorporated to improve clarity and rigor.

read point-by-point responses
  1. Referee: [§3.3] §3.3 (FairScore framework): The adaptive selection of top-performing models as juries for weighted aggregation risks circular bias when those models share correlated safety blind spots (e.g., under-flagging co-occurring privacy and misinformation risks). The manuscript provides no evidence of jury diversity verification or cross-validation of jury outputs against human raters on held-out data, which is load-bearing for the reliability of the reported MLLM rankings and the central claim of 'persistent and substantial safety vulnerabilities'.

    Authors: We appreciate the referee's concern about potential circular bias arising from correlated blind spots among selected jury models. The FairScore design adaptively chooses top-performing models to aggregate judgments and reduce single-model bias, with the nine evaluated MLLMs drawn from diverse developers and architectures to promote variation in safety detection capabilities. We acknowledge that the original manuscript did not include explicit verification of jury diversity or cross-validation against human raters. In the revised manuscript, we have expanded §3.3 to detail the jury selection criteria with attention to model diversity and added a human cross-validation study on held-out data to support the reliability of the aggregated scores and the reported safety vulnerabilities. revision: yes

  2. Referee: [§2.2] §2.2 (Dataset annotation): The description of how the nine risk categories were defined and how the 18k+ prompts plus multimodal items were annotated lacks details on inter-annotator agreement, annotator expertise, or quality control procedures. Without these, the ground-truth labels underpinning both MCRS and FairScore evaluations cannot be independently assessed for consistency.

    Authors: We agree that additional specifics on the annotation process are needed to allow independent assessment of label quality. The original manuscript outlined the nine risk categories and the general annotation of the large-scale multimodal dataset but omitted quantitative details on agreement, expertise, and controls. We have revised §2.2 to include a full description of how the risk categories were defined from established safety frameworks, the expertise of the annotators, inter-annotator agreement metrics, and the quality control procedures employed, such as multi-round reviews. These additions strengthen the foundation for MCRS and FairScore evaluations. revision: yes

Circularity Check

0 steps flagged

OutSafe-Bench benchmark and FairScore metric are self-contained with no circular reductions

full rationale

The paper constructs a new multimodal dataset spanning text, images, audio, and video with nine independently defined risk categories, then applies the novel MCRS metric and FairScore aggregation framework to evaluate nine MLLMs. No equations, self-citations, or definitional steps reduce the reported safety vulnerabilities to fitted parameters or inputs defined inside the paper itself; the adaptive jury selection is presented as an external mitigation technique rather than a tautological loop. The central claims rest on empirical application of these components to external models and data, making the derivation chain self-contained against external benchmarks without any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central contribution is an empirical benchmark and two new scoring procedures; no free parameters, mathematical axioms, or invented physical entities are required.

pith-pipeline@v0.9.0 · 5558 in / 1044 out tokens · 30580 ms · 2026-05-17T22:26:16.762426+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. VoxSafeBench: Not Just What Is Said, but Who, How, and Where

    cs.SD 2026-04 unverdicted novelty 8.0

    VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.

Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages · cited by 1 Pith paper · 11 internal anchors

  1. [1]

    Alireza Afzal Aghaei and Nadia Khodaei. Automated de- pression recognition using multimodal machine learning: A study on the daic-woz dataset.Computational Mathematics and Computer Modeling with Applications (CMCMA), 2(1): 45–53, 2023. 4

  2. [2]

    Meet claude, your thinking partner

    Anthropic. Meet claude, your thinking partner. https://www. anthropic.com/claude, 2023. 6

  3. [3]

    Doubao 1.5 pro: Api pricing & how to use doubao- 1.5-pro api

    Apidog. Doubao 1.5 pro: Api pricing & how to use doubao- 1.5-pro api. https://apidog.com/blog/doubao-1-5-pro-api/,

  4. [4]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023. 6

  5. [5]

    Introducing pcl-baidu wenxin (ernie 3.0 ti- tan), the world’s first knowledge enhanced multi-hundred- billion model

    Baidu Research. Introducing pcl-baidu wenxin (ernie 3.0 ti- tan), the world’s first knowledge enhanced multi-hundred- billion model. https://research.baidu.com/Blog/index-view? id=165, 2021. 6

  6. [6]

    A cross-modal approach to silent speech with llm-enhanced recognition.arXiv preprint arXiv:2403.05583, 2024

    Tyler Benster, Guy Wilson, Reshef Elisha, Francis R Wil- lett, and Shaul Druckmann. A cross-modal approach to silent speech with llm-enhanced recognition.arXiv preprint arXiv:2403.05583, 2024. 4

  7. [7]

    Murder without redress-the need for new legal solutions in the age of character-ai (cai).Avail- able at SSRN 5107942, 2024

    Barbara Pfeffer Billauer. Murder without redress-the need for new legal solutions in the age of character-ai (cai).Avail- able at SSRN 5107942, 2024. 1

  8. [8]

    Suppression of acoustic noise in speech us- ing spectral subtraction.IEEE Transactions on acoustics, speech, and signal processing, 27(2):113–120, 2003

    Steven Boll. Suppression of acoustic noise in speech us- ing spectral subtraction.IEEE Transactions on acoustics, speech, and signal processing, 27(2):113–120, 2003. 4

  9. [9]

    ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate

    Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. Chateval: Towards better llm-based evaluators through multi-agent de- bate.arXiv preprint arXiv:2308.07201, 2023. 3

  10. [10]

    Icdar 2019 robust reading challenge on scanned receipts ocr and information extraction.Web link: https://rrc

    R Reading Competition. Icdar 2019 robust reading challenge on scanned receipts ocr and information extraction.Web link: https://rrc. cvc. uab. es. 4

  11. [11]

    Holistic analysis of hallucination in gpt-4v (ision): Bias and interference chal- lenges.arXiv preprint arXiv:2311.03287, 2023

    Chenhang Cui, Yiyang Zhou, Xinyu Yang, Shirley Wu, Lin- jun Zhang, James Zou, and Huaxiu Yao. Holistic analysis of hallucination in gpt-4v (ision): Bias and interference chal- lenges.arXiv preprint arXiv:2311.03287, 2023. 1

  12. [12]

    Hatemm: A multi- modal dataset for hate video classification

    Mithun Das, Rohit Raj, Punyajoy Saha, Binny Mathew, Man- ish Gupta, and Animesh Mukherjee. Hatemm: A multi- modal dataset for hate video classification. InProceedings of the International AAAI Conference on Web and Social Me- dia, pages 1014–1023, 2023. 4

  13. [13]

    The pascal visual object classes (voc) challenge.International journal of computer vision, 88:303–338, 2010

    Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge.International journal of computer vision, 88:303–338, 2010. 4

  14. [14]

    guided mllm reasoning: Enhancing mllm with knowledge and visual notes for visual question answering

    Wenlong Fang, Qiaofeng Wu, Jing Chen, and Yun Xue. guided mllm reasoning: Enhancing mllm with knowledge and visual notes for visual question answering. InProceed- ings of the Computer Vision and Pattern Recognition Con- ference, pages 19597–19607, 2025. 1

  15. [15]

    Vitron: A unified pixel-level vision llm for understanding, generating, segmenting, editing.arXiv preprint arXiv:2412.19806, 2024

    Hao Fei, Shengqiong Wu, Hanwang Zhang, Tat-Seng Chua, and Shuicheng Yan. Vitron: A unified pixel-level vision llm for understanding, generating, segmenting, editing.arXiv preprint arXiv:2412.19806, 2024. 1

  16. [16]

    Find any sound you like

    Freesound. Find any sound you like. https://freesound.org/,

  17. [17]

    [Accessed 11-07-2025]. 4

  18. [18]

    Gemini: Our most intelligent ai models

    Google Deepmind. Gemini: Our most intelligent ai models. https://deepmind.google/models/gemini/, 2025. 6

  19. [19]

    Mllmguard: A multi-dimensional safety evalua- tion suite for multimodal large language models.Advances in Neural Information Processing Systems, 37:7256–7295,

    Tianle Gu, Zeyang Zhou, Kexin Huang, Liang Dandan, Yixu Wang, Haiquan Zhao, Yuanqi Yao, Yujiu Yang, Yan Teng, Yu Qiao, et al. Mllmguard: A multi-dimensional safety evalua- tion suite for multimodal large language models.Advances in Neural Information Processing Systems, 37:7256–7295,

  20. [20]

    A large-scale comprehensive dataset and copy-overlap aware evaluation protocol for segment-level video copy detection

    Sifeng He, Xudong Yang, Chen Jiang, Gang Liang, Wei Zhang, Tan Pan, Qing Wang, Furong Xu, Chunguang Li, JinXiong Liu, et al. A large-scale comprehensive dataset and copy-overlap aware evaluation protocol for segment-level video copy detection. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 21086–21095, 2022. 4

  21. [21]

    Deepfake detection using deep learning meth- ods: A systematic and comprehensive review.Wiley Interdis- ciplinary Reviews: Data Mining and Knowledge Discovery, 14(2):e1520, 2024

    Arash Heidari, Nima Jafari Navimipour, Hasan Dag, and Mehmet Unal. Deepfake detection using deep learning meth- ods: A systematic and comprehensive review.Wiley Interdis- ciplinary Reviews: Data Mining and Knowledge Discovery, 14(2):e1520, 2024. 4

  22. [22]

    Gpt-4o: The cutting-edge advancement in multimodal llm.Authorea Preprints, 2024

    Raisa Islam and Owana Marzia Moushi. Gpt-4o: The cutting-edge advancement in multimodal llm.Authorea Preprints, 2024. 6

  23. [23]

    Funsd: A dataset for form understanding in noisy scanned documents

    Guillaume Jaume, Hazim Kemal Ekenel, and Jean-Philippe Thiran. Funsd: A dataset for form understanding in noisy scanned documents. 2:1–6, 2019. 4

  24. [24]

    Swsr: A chinese dataset and lexicon for online sexism detec- tion.Online Social Networks and Media, 27:100182, 2022

    Aiqi Jiang, Xiaohan Yang, Yang Liu, and Arkaitz Zubiaga. Swsr: A chinese dataset and lexicon for online sexism detec- tion.Online Social Networks and Media, 27:100182, 2022. 4

  25. [25]

    Fairface: Face attribute dataset for balanced race, gender, and age for bias measurement and mitigation

    Kimmo Karkkainen and Jungseock Joo. Fairface: Face attribute dataset for balanced race, gender, and age for bias measurement and mitigation. InProceedings of the IEEE/CVF winter conference on applications of computer vision, pages 1548–1558, 2021. 4

  26. [26]

    Priv-qa: Privacy-preserving question answering for cloud large language models.arXiv preprint arXiv:2502.13564, 2025

    Guangwei Li, Yuansen Zhang, Yinggui Wang, Shoumeng Yan, Lei Wang, and Tao Wei. Priv-qa: Privacy-preserving question answering for cloud large language models.arXiv preprint arXiv:2502.13564, 2025. 1, 2

  27. [27]

    Col- laborative evaluation: Exploring the synergy of large lan- guage models and humans for open-ended generation eval- uation.arXiv e-prints, pages arXiv–2310, 2023

    Qintong Li, Leyang Cui, Lingpeng Kong, and Wei Bi. Col- laborative evaluation: Exploring the synergy of large lan- guage models and humans for open-ended generation eval- uation.arXiv e-prints, pages arXiv–2310, 2023. 3

  28. [28]

    Prd: Peer rank and discussion improve large language model based evaluations

    Ruosen Li, Teerth Patel, and Xinya Du. Prd: Peer rank and discussion improve large language model based evaluations. arXiv preprint arXiv:2307.02762, 2023. 3

  29. [29]

    Rule-based data selection for large language mod- els.arXiv preprint arXiv:2410.04715, 2024

    Xiaomin Li, Mingye Gao, Zhiwei Zhang, Chang Yue, and Hong Hu. Rule-based data selection for large language mod- els.arXiv preprint arXiv:2410.04715, 2024. 3 9

  30. [30]

    Mcfend: A multi-source benchmark dataset for chinese fake news de- tection

    Yupeng Li, Haorui He, Jin Bai, and Dacheng Wen. Mcfend: A multi-source benchmark dataset for chinese fake news de- tection. InProceedings of the ACM Web Conference 2024, pages 4018–4027, 2024. 4

  31. [31]

    Llm-eval: Unified multi-dimensional automatic evaluation for open-domain conversations with large language models.arXiv preprint arXiv:2305.13711, 2023

    Yen-Ting Lin and Yun-Nung Chen. Llm-eval: Unified multi-dimensional automatic evaluation for open-domain conversations with large language models.arXiv preprint arXiv:2305.13711, 2023. 3

  32. [32]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024. 6

  33. [33]

    Mm-safetybench: A benchmark for safety eval- uation of multimodal large language models

    Xin Liu, Yichen Zhu, Jindong Gu, Yunshi Lan, Chao Yang, and Yu Qiao. Mm-safetybench: A benchmark for safety eval- uation of multimodal large language models. InEuropean Conference on Computer Vision, pages 386–403. Springer,

  34. [34]

    G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment

    Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-eval: Nlg evaluation us- ing gpt-4 with better human alignment.arXiv preprint arXiv:2303.16634, 2023. 3

  35. [35]

    Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment

    Yang Liu, Yuanshun Yao, Jean-Francois Ton, Xiaoying Zhang, Ruocheng Guo, Hao Cheng, Yegor Klochkov, Muhammad Faaiz Taufiq, and Hang Li. Trustworthy llms: a survey and guideline for evaluating large language models’ alignment.arXiv preprint arXiv:2308.05374, 2023. 2

  36. [36]

    Pv-vtt: A privacy-centric dataset for mission- specific anomaly detection and natural language interpreta- tion

    Ryozo Masuakwa, Sanggeon Yun, Yoshiki Yamaguchi, and Mohsen Imani. Pv-vtt: A privacy-centric dataset for mission- specific anomaly detection and natural language interpreta- tion. In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 6415–6424. IEEE, 2025. 4

  37. [37]

    Ethos: a multi-label hate speech de- tection dataset.Complex & Intelligent Systems, 8(6):4663– 4678, 2022

    Ioannis Mollas, Zoe Chrysopoulou, Stamatis Karlos, and Grigorios Tsoumakas. Ethos: a multi-label hate speech de- tection dataset.Complex & Intelligent Systems, 8(6):4663– 4678, 2022. 4

  38. [38]

    Detecting potential violent be- havior using deep learning

    Dalton Chukwuezugo Owoh. Detecting potential violent be- havior using deep learning. 2023. 4

  39. [39]

    BBQ: A Hand-Built Bias Benchmark for Question Answering

    Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Pad- makumar, Jason Phang, Jana Thompson, Phu Mon Htut, and Samuel R Bowman. Bbq: A hand-built bias benchmark for question answering.arXiv preprint arXiv:2110.08193, 2021. 4

  40. [40]

    Fakesv: A multi- modal benchmark with rich social context for fake news de- tection on short video platforms

    Peng Qi, Yuyan Bu, Juan Cao, Wei Ji, Ruihao Shui, Junbin Xiao, Danding Wang, and Tat-Seng Chua. Fakesv: A multi- modal benchmark with rich social context for fake news de- tection on short video platforms. InProceedings of the AAAI Conference on Artificial Intelligence, pages 14444–14452,

  41. [41]

    Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!

    Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to!arXiv preprint arXiv:2310.03693, 2023. 4

  42. [42]

    Fine-tuning aligned language models compromises safety, even when users do not intend to!, 2023

    Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to!, 2023. 4

  43. [43]

    Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

    Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks.arXiv preprint arXiv:1908.10084, 2019. 5, 6

  44. [44]

    AudioSet

    Google Research. AudioSet. https://research.google.com/ audioset/, 2025. [Accessed 11-07-2025]. 4

  45. [45]

    Hate speech detection in the bengali lan- guage: A dataset and its baseline evaluation

    Nauros Romim, Mosahed Ahmed, Hriteshwar Talukder, and Md Saiful Islam. Hate speech detection in the bengali lan- guage: A dataset and its baseline evaluation. InProceedings of International Joint Conference on Advances in Computa- tional Intelligence: IJCACI 2020, pages 457–468. Springer,

  46. [46]

    Who validates the validators? aligning llm-assisted evaluation of llm outputs with human preferences

    Shreya Shankar, JD Zamfirescu-Pereira, Bj ¨orn Hartmann, Aditya Parameswaran, and Ian Arawjo. Who validates the validators? aligning llm-assisted evaluation of llm outputs with human preferences. InProceedings of the 37th Annual ACM Symposium on User Interface Software and Technol- ogy, pages 1–14, 2024. 3

  47. [47]

    Overview of ccl23-eval task 8: Chinese essay fluency eval- uation (cefe) task

    Xinshu Shen, Hongyi Wu, Xiaopeng Bai, Yuanbin Wu, Aimin Zhou, Shaoguang Mao, Tao Ge, and Yan Xia. Overview of ccl23-eval task 8: Chinese essay fluency eval- uation (cefe) task. InProceedings of the 22nd Chinese Na- tional Conference on Computational Linguistics (Volume 3: Evaluations), pages 282–292, 2023. 4

  48. [48]

    Real-world anomaly detection in surveillance videos

    Waqas Sultani, Chen Chen, and Mubarak Shah. Real-world anomaly detection in surveillance videos. InProceedings of the IEEE conference on computer vision and pattern recog- nition, pages 6479–6488, 2018. 4

  49. [49]

    Case-bench: Context-aware safety bench- mark for large language models

    Guangzhi Sun, Xiao Zhan, Shutong Feng, Phil Woodland, and Jose Such. Case-bench: Context-aware safety bench- mark for large language models. InForty-second Interna- tional Conference on Machine Learning, 2025. 6

  50. [50]

    arXiv preprint arXiv:2304.10436 , year =

    Hao Sun, Zhexin Zhang, Jiawen Deng, Jiale Cheng, and Min- lie Huang. Safety assessment of chinese large language mod- els.arXiv preprint arXiv:2304.10436, 2023. 4, 6

  51. [51]

    TrustLLM: Trustworthiness in Large Language Models

    Lichao Sun, Yue Huang, Haoran Wang, Siyuan Wu, Qi- hui Zhang, Chujie Gao, Yixin Huang, Wenhan Lyu, Yixuan Zhang, Xiner Li, et al. Trustllm: Trustworthiness in large lan- guage models.arXiv preprint arXiv:2401.05561, 3, 2024. 1, 2

  52. [52]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean- Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023. 6

  53. [53]

    How many unicorns are in this im- age? a safety evaluation benchmark for vision llms.arXiv preprint arXiv:2311.16101, 2023

    Haoqin Tu, Chenhang Cui, Zijun Wang, Yiyang Zhou, Bingchen Zhao, Junlin Han, Wangchunshu Zhou, Huaxiu Yao, and Cihang Xie. How many unicorns are in this im- age? a safety evaluation benchmark for vision llms.arXiv preprint arXiv:2311.16101, 2023. 2

  54. [54]

    A study on integrating machine learning tech- niques for waste management

    Ruchika Vaidya, Rahul Dattangire, Divya Biradar, and Pra- teek Verma. A study on integrating machine learning tech- niques for waste management. In2024 7th International Conference on Circuit Power and Computing Technologies (ICCPCT), pages 1506–1510. IEEE, 2024. 4

  55. [55]

    Belgian man dies by suicide following ex- changes with chatbot

    Lauren Walker. Belgian man dies by suicide following ex- changes with chatbot. the brussels times (march 2023), 2023. 1 10

  56. [56]

    Pdid: database of molecular-level puta- tive protein–drug interactions in the structural human pro- teome.Bioinformatics, 32(4):579–586, 2016

    Chen Wang, Gang Hu, Kui Wang, Michal Brylinski, Lei Xie, and Lukasz Kurgan. Pdid: database of molecular-level puta- tive protein–drug interactions in the structural human pro- teome.Bioinformatics, 32(4):579–586, 2016. 4

  57. [57]

    Multihateclip: A multilingual benchmark dataset for hateful video detection on youtube and bilibili

    Han Wang, Tan Rui Yang, Usman Naseem, and Roy Ka-Wei Lee. Multihateclip: A multilingual benchmark dataset for hateful video detection on youtube and bilibili. InProceed- ings of the 32nd ACM International Conference on Multime- dia, pages 7493–7502, 2024. 4

  58. [58]

    Cnn-generated images are surprisingly easy to spot

    Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A Efros. Cnn-generated images are surprisingly easy to spot... for now. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8695–8704, 2020. 4

  59. [59]

    Multimodal llm enhanced cross- lingual cross-modal retrieval

    Yabing Wang, Le Wang, Qiang Zhou, Zhibin Wang, Hao Li, Gang Hua, and Wei Tang. Multimodal llm enhanced cross- lingual cross-modal retrieval. InProceedings of the 32nd ACM International Conference on Multimedia, pages 8296– 8305, 2024. 1

  60. [60]

    Safebench: A benchmarking platform for safety evaluation of autonomous vehicles.Advances in Neural Information Processing Systems, 35:25667–25682, 2022

    Chejian Xu, Wenhao Ding, Weijie Lyu, Zuxin Liu, Shuai Wang, Yihan He, Hanjiang Hu, Ding Zhao, and Bo Li. Safebench: A benchmarking platform for safety evaluation of autonomous vehicles.Advances in Neural Information Processing Systems, 35:25667–25682, 2022. 1, 2, 3

  61. [61]

    Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models.IEEE Transactions on Pat- tern Analysis and Machine Intelligence, 2024

    Peng Xu, Wenqi Shao, Kaipeng Zhang, Peng Gao, Shuo Liu, Meng Lei, Fanqing Meng, Siyuan Huang, Yu Qiao, and Ping Luo. Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models.IEEE Transactions on Pat- tern Analysis and Machine Intelligence, 2024. 2

  62. [62]

    arXiv preprint arXiv:2012.14740 , year=

    Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, et al. Layoutlmv2: Multi-modal pre-training for visually-rich document understanding.arXiv preprint arXiv:2012.14740, 2020. 4

  63. [63]

    Lamm: Language-assisted multi- modal instruction-tuning dataset, framework, and bench- mark.Advances in Neural Information Processing Systems, 36:26650–26685, 2023

    Zhenfei Yin, Jiong Wang, Jianjian Cao, Zhelun Shi, Dingn- ing Liu, Mukai Li, Xiaoshui Huang, Zhiyong Wang, Lu Sheng, Lei Bai, et al. Lamm: Language-assisted multi- modal instruction-tuning dataset, framework, and bench- mark.Advances in Neural Information Processing Systems, 36:26650–26685, 2023. 2

  64. [64]

    Peter Young, Alice Lai, Micah Hodosh, and Julia Hocken- maier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descrip- tions.Transactions of the association for computational lin- guistics, 2:67–78, 2014. 1

  65. [65]

    Timesuite: Improving mllms for long video understanding via grounded tuning

    Xiangyu Zeng, Kunchang Li, Chenting Wang, Xinhao Li, Tianxiang Jiang, Ziang Yan, Songze Li, Yansong Shi, Zhen- grong Yue, Yi Wang, et al. Timesuite: Improving mllms for long video understanding via grounded tuning. InThe Thirteenth International Conference on Learning Represen- tations. 4

  66. [66]

    Exo2ego: Exocentric knowledge guided mllm for egocentric video understanding.arXiv preprint arXiv:2503.09143, 2025

    Haoyu Zhang, Qiaohui Chu, Meng Liu, Yunxiao Wang, Bin Wen, Fan Yang, Tingting Gao, Di Zhang, Yaowei Wang, and Liqiang Nie. Exo2ego: Exocentric knowledge guided mllm for egocentric video understanding.arXiv preprint arXiv:2503.09143, 2025. 1

  67. [67]

    Differential-perceptive and retrieval- augmented mllm for change captioning

    Xian Zhang, Haokun Wen, Jianlong Wu, Pengda Qin, Hui Xue’, and Liqiang Nie. Differential-perceptive and retrieval- augmented mllm for change captioning. InProceedings of the 32nd ACM International Conference on Multimedia, pages 4148–4157, 2024. 1

  68. [68]

    Multitrust: A comprehensive bench- mark towards trustworthy multimodal large language mod- els.Advances in Neural Information Processing Systems, 37:49279–49383, 2024

    Yichi Zhang, Yao Huang, Yitong Sun, Chang Liu, Zhe Zhao, Zhengwei Fang, Yifan Wang, Huanran Chen, Xiao Yang, Xingxing Wei, et al. Multitrust: A comprehensive bench- mark towards trustworthy multimodal large language mod- els.Advances in Neural Information Processing Systems, 37:49279–49383, 2024. 1, 2

  69. [69]

    Efficient motion-aware video mllm

    Zijia Zhao, Yuqi Huo, Tongtian Yue, Longteng Guo, Haoyu Lu, Bingning Wang, Weipeng Chen, and Jing Liu. Efficient motion-aware video mllm. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 24159– 24168, 2025. 1

  70. [70]

    Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in Neural Information Processing Systems, 36:46595–46623, 2023

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in Neural Information Processing Systems, 36:46595–46623, 2023. 3

  71. [71]

    Image-based table recognition: data, model, and evaluation

    Xu Zhong, Elaheh ShafieiBavani, and Antonio Ji- meno Yepes. Image-based table recognition: data, model, and evaluation. InEuropean conference on computer vision, pages 564–580. Springer, 2020. 4

  72. [72]

    Places: A 10 million image database for scene recognition.IEEE transactions on pattern analysis and machine intelligence, 40(6):1452–1464, 2017

    Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition.IEEE transactions on pattern analysis and machine intelligence, 40(6):1452–1464, 2017. 4

  73. [73]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mo- hamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models.arXiv preprint arXiv:2304.10592, 2023. 6 11 OutSafe-Bench: A Benchmark for Multimodal Offensive Content Detection in Large Language Models Supplementary Material

  74. [74]

    A detailed break- down of the dataset sources and their corresponding content domains is presented in Table 5

    Datasets generation OutSafe-Bench is constructed from a seed pool of 30 pub- licly available datasets, each selected to represent one or more of the nine safety-critical categories. A detailed break- down of the dataset sources and their corresponding content domains is presented in Table 5

  75. [75]

    role": "system

    Output prompt design & Quality Control 7.1. Output prompt design For text input, we use the default system settings for all models, and the input prompt is the text content. response = self.client.chat.completions.create( model = model_full_name, messages = [ {"role": "system", "content": "You are a helpful assistant"}, {"role": "user", "content": prompt}...

  76. [76]

    Privacy and Property

    Evaluation Prompts & Results 8.1. Evaluation Prompt Design To ensure fair and consistent judgment across diverse safety dimensions, we designed a set of evaluation prompts that guide the jury models to assess the risk level of each model output. f"""You are a content safety expert. Please evaluate and score the following text from nine content risk perspe...

  77. [77]

    This subset covers balanced distributions across nine risk categories and four modalities (text, image, audio and video), with detailed data shown in the table 4

    Ablation Experiments To further examine the reliability and interpretability of our results, we conduct additional validation and ablation ex- periments on a 936-sample subset with human-annotated safety ratings. This subset covers balanced distributions across nine risk categories and four modalities (text, image, audio and video), with detailed data sho...

  78. [78]

    Case Study To further illustrate how the proposed evaluation frame- work operates in practice, we present five representative case studies covering Chinese text, English text, image, au- dio, and video inputs. Each case is visualized as a unified evaluation panel (see Figure 5 to Figure 9), where the top part shows the input and model-generated output, an...