OutSafe-Bench: A Benchmark for Multimodal Offensive Content Detection in Large Language Models

Lingjuan Lyu; Yaochu Jin; Yingchao Yu; Yuanshuai Li; Yuhan Xie; Yuping Yan

arxiv: 2511.10287 · v4 · submitted 2025-11-13 · 💻 cs.LG · cs.CL

Yuping Yan, Yuhan Xie, Yuanshuai Li, Yingchao Yu, Lingjuan Lyu, Yaochu Jin

Pith reviewed 2026-05-17 22:26 UTC · model grok-4.3

classification 💻 cs.LG cs.CL

keywords multimodal large language models · content safety benchmark · offensive content detection · safety evaluation · risk assessment · MLLM vulnerabilities

The pith

OutSafe-Bench reveals persistent safety vulnerabilities across nine state-of-the-art multimodal large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents OutSafe-Bench as the first comprehensive test suite for content safety in multimodal large language models. It supplies a dataset spanning text, images, audio, and video, with more than 18,000 bilingual (English and Chinese) text prompts, 4,500 images, 450 audio clips, and 450 videos labeled across nine risk categories. A new Multidimensional Cross Risk Score (MCRS) tracks overlapping risks, while FairScore aggregates judgments from multiple top models to reduce single-judge bias. When the authors apply these tools to nine current MLLMs, the results show repeated failures to block toxic language, biased imagery, privacy leaks, and harmful misinformation. This matters because these models are already entering everyday tools and agents, where unsafe outputs can reach many users quickly.

Core claim

OutSafe-Bench supplies a large-scale multimodal dataset with over 18,000 bilingual text prompts, 4,500 images, 450 audio clips, and 450 videos systematically annotated across nine critical content risk categories, together with the Multidimensional Cross Risk Score to model correlated risks and the FairScore multi-reviewer weighted aggregation framework that selects adaptive juries from top models; evaluation of nine state-of-the-art MLLMs with this suite demonstrates persistent and substantial safety vulnerabilities.

What carries the argument

The OutSafe-Bench dataset, combined with two scoring tools: FairScore, an automated multi-reviewer weighted aggregation framework that uses top-performing models as juries to mitigate single-model bias, and the Multidimensional Cross Risk Score (MCRS), which captures overlapping risks across categories.
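
As a rough illustration of how a cross-risk score of this kind might behave, the sketch below adds a pairwise interaction term to per-category risk scores. The paper's actual MCRS formula is not reproduced on this page, so the function name `cross_risk_score`, the additive form, and the example `gamma` values are all assumptions made for illustration; only the idea of a cross-risk influence matrix comes from the paper.

```python
from typing import Sequence


def cross_risk_score(risks: Sequence[float],
                     gamma: Sequence[Sequence[float]]) -> float:
    """Hypothetical cross-risk score (NOT the paper's MCRS definition).

    risks: per-category risk scores, assumed to lie in [0, 1].
    gamma: n x n matrix of assumed pairwise influence weights;
           the diagonal is ignored.
    """
    n = len(risks)
    if len(gamma) != n or any(len(row) != n for row in gamma):
        raise ValueError("gamma must be an n x n matrix")
    # Independent contribution of each category.
    base = sum(risks)
    # Extra contribution when two categories are active together.
    interaction = sum(
        gamma[i][j] * risks[i] * risks[j]
        for i in range(n)
        for j in range(n)
        if i != j
    )
    return base + interaction
```

With all off-diagonal weights set to zero the score reduces to the plain sum of category risks, so any increase over that sum is attributable to co-occurring categories under these assumptions.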

If this is right

Current multimodal models require additional safeguards before safe deployment in everyday tools.
Evaluation must cover multiple modalities and languages to expose vulnerabilities that single-modality tests miss.
Automated multi-reviewer scoring can reduce bias compared with single-model judgment.
Persistent failures across nine risk categories indicate that safety training has not yet closed the gap in MLLMs.
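
The third point, that multi-reviewer scoring can dampen single-judge bias, can be sketched as a two-step procedure: pick a jury, then take a weighted average of its verdicts. The code below is a minimal illustration, assuming that jurors are ranked by accuracy on some calibration set and that the same accuracy serves as the aggregation weight. The names `select_jury` and `weighted_jury_score` are hypothetical, and the paper's FairScore may select and weight jurors differently.

```python
from typing import Dict, List


def select_jury(calibration_accuracy: Dict[str, float], k: int) -> List[str]:
    """Return the k model names with the highest calibration accuracy.

    Ties are broken alphabetically so the result is deterministic.
    """
    ranked = sorted(calibration_accuracy,
                    key=lambda name: (-calibration_accuracy[name], name))
    return ranked[:k]


def weighted_jury_score(scores: Dict[str, float],
                        weights: Dict[str, float]) -> float:
    """Weighted mean of per-juror risk scores for a single output."""
    total = sum(weights[name] for name in scores)
    if total <= 0:
        raise ValueError("jury weights must sum to a positive value")
    return sum(weights[name] * scores[name] for name in scores) / total


# Hypothetical calibration accuracies for three candidate jurors.
accuracy = {"model_a": 0.9, "model_b": 0.6, "model_c": 0.8}
jury = select_jury(accuracy, k=2)
# Hypothetical verdicts from the selected jurors (1.0 = unsafe, 0.0 = safe).
verdicts = {"model_a": 1.0, "model_c": 0.0}
aggregate = weighted_jury_score(verdicts, accuracy)
```

Note that nothing in this sketch guards against jurors that share the same blind spots; weighting only helps when juror errors are at least partly independent.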

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Developers could incorporate OutSafe-Bench directly into model training loops to penalize unsafe generations in real time.
The same dataset construction method could be extended to additional languages and emerging risk types such as deepfake misinformation.
Regulatory bodies might adopt similar multimodal benchmarks as minimum standards for releasing new MLLMs to the public.

Load-bearing premise

The nine risk categories and the automated FairScore aggregation accurately capture real-world safety failures without introducing new biases from the jury models themselves.
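
One simple way to probe this premise is to measure how much the jury models' mistakes overlap on a human-labeled set. The sketch below computes, for each pair of jurors, the Jaccard overlap of the items they get wrong; the function name `error_overlap` and the toy labels are assumptions for illustration, and the paper does not describe such a check.

```python
from itertools import combinations
from typing import Dict, Hashable, List, Sequence, Tuple


def error_overlap(gold: Sequence[Hashable],
                  predictions: Dict[str, List[Hashable]]
                  ) -> Dict[Tuple[str, str], float]:
    """Jaccard overlap of error sets for every pair of jurors.

    A value near 1.0 means two jurors fail on the same items,
    so adding the second juror contributes little independent signal.
    """
    errors = {
        name: {i for i, (g, p) in enumerate(zip(gold, labels)) if g != p}
        for name, labels in predictions.items()
    }
    overlap = {}
    for a, b in combinations(sorted(errors), 2):
        union = errors[a] | errors[b]
        # If neither juror makes an error, report zero overlap.
        overlap[(a, b)] = len(errors[a] & errors[b]) / len(union) if union else 0.0
    return overlap
```

High pairwise overlap among the top-ranked jurors would be a warning sign that the aggregated score inherits their shared blind spots.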

What would settle it

A controlled user study in which participants interact with the evaluated MLLMs on held-out prompts and report the actual frequency of unsafe outputs, followed by a comparison of that frequency with the benchmark scores produced by OutSafe-Bench and FairScore.
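
A rank correlation is one way to make that comparison concrete: if the benchmark orders models the same way the user study does, the correlation is close to 1. The sketch below computes Spearman's rho with the standard library only. Using Spearman here is an assumption of this illustration, not something the paper or this review prescribes, and the function names are hypothetical.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5
```

Here `x` would hold each model's benchmark score and `y` the unsafe-output rate observed for the same model in the user study.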

Figures

Figures reproduced from arXiv: 2511.10287 by Lingjuan Lyu, Yaochu Jin, Yingchao Yu, Yuanshuai Li, Yuhan Xie, Yuping Yan.

**Figure 1.** Overview of the OutSafe-Bench framework. It includes content risk aspect division across four modalities, as well as the evalua […]

**Figure 2.** Examples of dataset instances across nine content safety scenarios, showcasing Chinese text, English text, image, and video […]

**Figure 3.** Cross-risk influence matrix for the nine content risk cat […]

**Figure 4.** Each model’s performance is shown per modality across nine categories: […]

**Figure 5.** Example for Chinese text input evaluation.

**Figure 8.** Example for audio input evaluation.

**Figure 9.** Example for video input evaluation.

Original abstract

Since Multimodal Large Language Models (MLLMs) are increasingly being integrated into everyday tools and intelligent agents, growing concerns have arisen regarding their possible output of unsafe contents, ranging from toxic language and biased imagery to privacy violations and harmful misinformation. Current safety benchmarks remain highly limited in both modality coverage and performance evaluations, often neglecting the extensive landscape of content safety. In this work, we introduce OutSafe-Bench, the first most comprehensive content safety evaluation test suite designed for the multimodal era. OutSafe-Bench includes a large-scale dataset that spans four modalities, featuring over 18,000 bilingual (Chinese and English) text prompts, 4,500 images, 450 audio clips and 450 videos, all systematically annotated across nine critical content risk categories. In addition to the dataset, we introduce a Multidimensional Cross Risk Score (MCRS), a novel metric designed to model and assess overlapping and correlated content risks across different categories. To ensure fair and robust evaluation, we propose FairScore, an explainable automated multi-reviewer weighted aggregation framework. FairScore selects top-performing models as adaptive juries, thereby mitigating biases from single-model judgments and enhancing overall evaluation reliability. Our evaluation of nine state-of-the-art MLLMs reveals persistent and substantial safety vulnerabilities, underscoring the pressing need for robust safeguards in MLLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

OutSafe-Bench adds a sizable four-modality safety dataset and two new metrics, but the adaptive jury in FairScore risks reproducing the same model blind spots it aims to fix.

The paper's main contribution is a new benchmark dataset for multimodal safety in large language models that covers text, images, audio, and video, along with two new metrics for assessing risks. They've put together over 18,000 bilingual prompts, 4,500 images, 450 audio clips, and 450 videos, annotated across nine risk categories. This is broader than most current safety benchmarks, which usually stick to fewer modalities. The Multidimensional Cross Risk Score aims to handle overlapping risks, which is a useful addition for more realistic testing. FairScore uses an adaptive jury of top models to aggregate evaluations and reduce bias from any single judge. That approach tries to make the scoring more robust.

The main concern is whether FairScore actually avoids bias. Since it relies on the highest-scoring models as juries, any common weaknesses among them could lead to under-detecting certain risks, like combined privacy and misinformation issues. There's no clear evidence in the work that they verified jury diversity or cross-checked outputs with human raters on separate data. The annotation process for the dataset also lacks details on inter-annotator agreement or quality controls.

This work targets researchers and engineers building or auditing multimodal models for safety. People looking for expanded test suites in this area could get value from the dataset and taxonomy, assuming the construction is solid. It has enough new material and addresses a timely issue to warrant a serious referee. I'd recommend sending it for peer review, with specific asks for more on how the metrics were validated and how the annotations were done.

Referee Report

2 major / 2 minor

Summary. The paper introduces OutSafe-Bench, the first comprehensive multimodal content safety benchmark for MLLMs. It provides a large-scale dataset spanning four modalities (over 18,000 bilingual text prompts, 4,500 images, 450 audio clips, and 450 videos) annotated across nine risk categories, proposes the Multidimensional Cross Risk Score (MCRS) to model overlapping risks, and introduces FairScore, an explainable automated multi-reviewer weighted aggregation framework that selects top-performing models as adaptive juries. Evaluation of nine state-of-the-art MLLMs reveals persistent and substantial safety vulnerabilities.

Significance. If the dataset construction, MCRS, and FairScore prove robust and free of circular bias, this benchmark could fill a critical gap in multimodal safety evaluation by covering multiple modalities, languages, and correlated risks. The large scale and inclusion of audio/video are strengths that could support reproducible safety research and drive development of better safeguards in MLLMs.

major comments (2)

§3.3 (FairScore framework): The adaptive selection of top-performing models as juries for weighted aggregation risks circular bias when those models share correlated safety blind spots (e.g., under-flagging co-occurring privacy and misinformation risks). The manuscript provides no evidence of jury diversity verification or cross-validation of jury outputs against human raters on held-out data, which is load-bearing for the reliability of the reported MLLM rankings and the central claim of 'persistent and substantial safety vulnerabilities'.
§2.2 (Dataset annotation): The description of how the nine risk categories were defined and how the 18k+ prompts plus multimodal items were annotated lacks details on inter-annotator agreement, annotator expertise, or quality control procedures. Without these, the ground-truth labels underpinning both MCRS and FairScore evaluations cannot be independently assessed for consistency.
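
Inter-annotator agreement of the kind requested in the second major comment is commonly reported with Cohen's kappa. The following is a minimal sketch for two annotators who each assign one categorical label per item; the label names are hypothetical, and the paper may use a different statistic or more than two annotators.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable],
                 labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Fraction of items on which the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if each annotator labeled at random
    # according to their own label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

Values near 1 indicate strong agreement and values near 0 indicate agreement no better than chance, which is the kind of evidence a reader would need to judge the consistency of the ground-truth labels.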

minor comments (2)

Figure 2 (dataset distribution): The modality and language breakdown bars are difficult to read due to overlapping labels; consider adding exact counts in the caption or a supplementary table.
§3.1 (Notation for MCRS): The formula for combining cross-risk terms is introduced without an explicit definition of the weighting parameters; a short appendix deriving or justifying the weights would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment point by point below, explaining our position and the revisions incorporated to improve clarity and rigor.

Referee: §3.3 (FairScore framework): The adaptive selection of top-performing models as juries for weighted aggregation risks circular bias when those models share correlated safety blind spots (e.g., under-flagging co-occurring privacy and misinformation risks). The manuscript provides no evidence of jury diversity verification or cross-validation of jury outputs against human raters on held-out data, which is load-bearing for the reliability of the reported MLLM rankings and the central claim of 'persistent and substantial safety vulnerabilities'.

Authors: We appreciate the referee's concern about potential circular bias arising from correlated blind spots among selected jury models. The FairScore design adaptively chooses top-performing models to aggregate judgments and reduce single-model bias, with the nine evaluated MLLMs drawn from diverse developers and architectures to promote variation in safety detection capabilities. We acknowledge that the original manuscript did not include explicit verification of jury diversity or cross-validation against human raters. In the revised manuscript, we have expanded §3.3 to detail the jury selection criteria with attention to model diversity and added a human cross-validation study on held-out data to support the reliability of the aggregated scores and the reported safety vulnerabilities.
revision: yes

Referee: §2.2 (Dataset annotation): The description of how the nine risk categories were defined and how the 18k+ prompts plus multimodal items were annotated lacks details on inter-annotator agreement, annotator expertise, or quality control procedures. Without these, the ground-truth labels underpinning both MCRS and FairScore evaluations cannot be independently assessed for consistency.

Authors: We agree that additional specifics on the annotation process are needed to allow independent assessment of label quality. The original manuscript outlined the nine risk categories and the general annotation of the large-scale multimodal dataset but omitted quantitative details on agreement, expertise, and controls. We have revised §2.2 to include a full description of how the risk categories were defined from established safety frameworks, the expertise of the annotators, inter-annotator agreement metrics, and the quality control procedures employed, such as multi-round reviews. These additions strengthen the foundation for MCRS and FairScore evaluations.
revision: yes

Circularity Check

0 steps flagged

OutSafe-Bench benchmark and FairScore metric are self-contained with no circular reductions

The paper constructs a new multimodal dataset spanning text, images, audio, and video with nine independently defined risk categories, then applies the novel MCRS metric and FairScore aggregation framework to evaluate nine MLLMs. No equations, self-citations, or definitional steps reduce the reported safety vulnerabilities to fitted parameters or inputs defined inside the paper itself; the adaptive jury selection is presented as an external mitigation technique rather than a tautological loop. The central claims rest on empirical application of these components to external models and data, making the derivation chain self-contained against external benchmarks without any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central contribution is an empirical benchmark and two new scoring procedures; no free parameters, mathematical axioms, or invented physical entities are required.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "We propose a novel Multidimensional Cross Risk Score (MCRS), a function designed to capture and quantify the co-occurrence and interactions of multiple content risks within a single output... Cross-Risk Influence Matrix γ"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "FairScore selects top-performing models as adaptive juries... weighted ensemble approach"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

VoxSafeBench: Not Just What Is Said, but Who, How, and Where
cs.SD 2026-04 unverdicted novelty 8.0

VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.

Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages · cited by 1 Pith paper · 11 internal anchors

[1]

Alireza Afzal Aghaei and Nadia Khodaei. Automated de- pression recognition using multimodal machine learning: A study on the daic-woz dataset.Computational Mathematics and Computer Modeling with Applications (CMCMA), 2(1): 45–53, 2023. 4

work page 2023
[2]

Meet claude, your thinking partner

Anthropic. Meet claude, your thinking partner. https://www. anthropic.com/claude, 2023. 6

work page 2023
[3]

Doubao 1.5 pro: Api pricing & how to use doubao- 1.5-pro api

Apidog. Doubao 1.5 pro: Api pricing & how to use doubao- 1.5-pro api. https://apidog.com/blog/doubao-1-5-pro-api/,

work page
[4]

Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023. 6

work page internal anchor Pith review Pith/arXiv arXiv 2023
[5]

Introducing pcl-baidu wenxin (ernie 3.0 ti- tan), the world’s first knowledge enhanced multi-hundred- billion model

Baidu Research. Introducing pcl-baidu wenxin (ernie 3.0 ti- tan), the world’s first knowledge enhanced multi-hundred- billion model. https://research.baidu.com/Blog/index-view? id=165, 2021. 6

work page 2021
[6]

A cross-modal approach to silent speech with llm-enhanced recognition.arXiv preprint arXiv:2403.05583, 2024

Tyler Benster, Guy Wilson, Reshef Elisha, Francis R Wil- lett, and Shaul Druckmann. A cross-modal approach to silent speech with llm-enhanced recognition.arXiv preprint arXiv:2403.05583, 2024. 4

work page arXiv 2024
[7]

Murder without redress-the need for new legal solutions in the age of character-ai (cai).Avail- able at SSRN 5107942, 2024

Barbara Pfeffer Billauer. Murder without redress-the need for new legal solutions in the age of character-ai (cai).Avail- able at SSRN 5107942, 2024. 1

work page 2024
[8]

Suppression of acoustic noise in speech us- ing spectral subtraction.IEEE Transactions on acoustics, speech, and signal processing, 27(2):113–120, 2003

Steven Boll. Suppression of acoustic noise in speech us- ing spectral subtraction.IEEE Transactions on acoustics, speech, and signal processing, 27(2):113–120, 2003. 4

work page 2003
[9]

ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate

Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. Chateval: Towards better llm-based evaluators through multi-agent de- bate.arXiv preprint arXiv:2308.07201, 2023. 3

work page internal anchor Pith review Pith/arXiv arXiv 2023
[10]

Icdar 2019 robust reading challenge on scanned receipts ocr and information extraction.Web link: https://rrc

R Reading Competition. Icdar 2019 robust reading challenge on scanned receipts ocr and information extraction.Web link: https://rrc. cvc. uab. es. 4

work page 2019
[11]

Holistic analysis of hallucination in gpt-4v (ision): Bias and interference chal- lenges.arXiv preprint arXiv:2311.03287, 2023

Chenhang Cui, Yiyang Zhou, Xinyu Yang, Shirley Wu, Lin- jun Zhang, James Zou, and Huaxiu Yao. Holistic analysis of hallucination in gpt-4v (ision): Bias and interference chal- lenges.arXiv preprint arXiv:2311.03287, 2023. 1

work page arXiv 2023
[12]

Hatemm: A multi- modal dataset for hate video classification

Mithun Das, Rohit Raj, Punyajoy Saha, Binny Mathew, Man- ish Gupta, and Animesh Mukherjee. Hatemm: A multi- modal dataset for hate video classification. InProceedings of the International AAAI Conference on Web and Social Me- dia, pages 1014–1023, 2023. 4

work page 2023
[13]

The pascal visual object classes (voc) challenge.International journal of computer vision, 88:303–338, 2010

Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge.International journal of computer vision, 88:303–338, 2010. 4

work page 2010
[14]

guided mllm reasoning: Enhancing mllm with knowledge and visual notes for visual question answering

Wenlong Fang, Qiaofeng Wu, Jing Chen, and Yun Xue. guided mllm reasoning: Enhancing mllm with knowledge and visual notes for visual question answering. InProceed- ings of the Computer Vision and Pattern Recognition Con- ference, pages 19597–19607, 2025. 1

work page 2025
[15]

Vitron: A unified pixel-level vision llm for understanding, generating, segmenting, editing.arXiv preprint arXiv:2412.19806, 2024

Hao Fei, Shengqiong Wu, Hanwang Zhang, Tat-Seng Chua, and Shuicheng Yan. Vitron: A unified pixel-level vision llm for understanding, generating, segmenting, editing.arXiv preprint arXiv:2412.19806, 2024. 1

work page arXiv 2024
[16]

Find any sound you like

Freesound. Find any sound you like. https://freesound.org/,

work page
[17]

[Accessed 11-07-2025]. 4

work page 2025
[18]

Gemini: Our most intelligent ai models

Google Deepmind. Gemini: Our most intelligent ai models. https://deepmind.google/models/gemini/, 2025. 6

work page 2025
[19]

Mllmguard: A multi-dimensional safety evalua- tion suite for multimodal large language models.Advances in Neural Information Processing Systems, 37:7256–7295,

Tianle Gu, Zeyang Zhou, Kexin Huang, Liang Dandan, Yixu Wang, Haiquan Zhao, Yuanqi Yao, Yujiu Yang, Yan Teng, Yu Qiao, et al. Mllmguard: A multi-dimensional safety evalua- tion suite for multimodal large language models.Advances in Neural Information Processing Systems, 37:7256–7295,

work page
[20]

A large-scale comprehensive dataset and copy-overlap aware evaluation protocol for segment-level video copy detection

Sifeng He, Xudong Yang, Chen Jiang, Gang Liang, Wei Zhang, Tan Pan, Qing Wang, Furong Xu, Chunguang Li, JinXiong Liu, et al. A large-scale comprehensive dataset and copy-overlap aware evaluation protocol for segment-level video copy detection. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 21086–21095, 2022. 4

work page 2022
[21]

Deepfake detection using deep learning meth- ods: A systematic and comprehensive review.Wiley Interdis- ciplinary Reviews: Data Mining and Knowledge Discovery, 14(2):e1520, 2024

Arash Heidari, Nima Jafari Navimipour, Hasan Dag, and Mehmet Unal. Deepfake detection using deep learning meth- ods: A systematic and comprehensive review.Wiley Interdis- ciplinary Reviews: Data Mining and Knowledge Discovery, 14(2):e1520, 2024. 4

work page 2024
[22]

Gpt-4o: The cutting-edge advancement in multimodal llm.Authorea Preprints, 2024

Raisa Islam and Owana Marzia Moushi. Gpt-4o: The cutting-edge advancement in multimodal llm.Authorea Preprints, 2024. 6

work page 2024
[23]

Funsd: A dataset for form understanding in noisy scanned documents

Guillaume Jaume, Hazim Kemal Ekenel, and Jean-Philippe Thiran. Funsd: A dataset for form understanding in noisy scanned documents. 2:1–6, 2019. 4

work page 2019
[24]

Swsr: A chinese dataset and lexicon for online sexism detec- tion.Online Social Networks and Media, 27:100182, 2022

Aiqi Jiang, Xiaohan Yang, Yang Liu, and Arkaitz Zubiaga. Swsr: A chinese dataset and lexicon for online sexism detec- tion.Online Social Networks and Media, 27:100182, 2022. 4

work page 2022
[25]

Fairface: Face attribute dataset for balanced race, gender, and age for bias measurement and mitigation

Kimmo Karkkainen and Jungseock Joo. Fairface: Face attribute dataset for balanced race, gender, and age for bias measurement and mitigation. InProceedings of the IEEE/CVF winter conference on applications of computer vision, pages 1548–1558, 2021. 4

work page 2021
[26]

Priv-qa: Privacy-preserving question answering for cloud large language models.arXiv preprint arXiv:2502.13564, 2025

Guangwei Li, Yuansen Zhang, Yinggui Wang, Shoumeng Yan, Lei Wang, and Tao Wei. Priv-qa: Privacy-preserving question answering for cloud large language models.arXiv preprint arXiv:2502.13564, 2025. 1, 2

work page arXiv 2025
[27]

Col- laborative evaluation: Exploring the synergy of large lan- guage models and humans for open-ended generation eval- uation.arXiv e-prints, pages arXiv–2310, 2023

Qintong Li, Leyang Cui, Lingpeng Kong, and Wei Bi. Col- laborative evaluation: Exploring the synergy of large lan- guage models and humans for open-ended generation eval- uation.arXiv e-prints, pages arXiv–2310, 2023. 3

work page 2023
[28]

Prd: Peer rank and discussion improve large language model based evaluations

Ruosen Li, Teerth Patel, and Xinya Du. Prd: Peer rank and discussion improve large language model based evaluations. arXiv preprint arXiv:2307.02762, 2023. 3

work page arXiv 2023
[29]

Rule-based data selection for large language mod- els.arXiv preprint arXiv:2410.04715, 2024

Xiaomin Li, Mingye Gao, Zhiwei Zhang, Chang Yue, and Hong Hu. Rule-based data selection for large language mod- els.arXiv preprint arXiv:2410.04715, 2024. 3 9

work page arXiv 2024
[30]

Mcfend: A multi-source benchmark dataset for chinese fake news de- tection

Yupeng Li, Haorui He, Jin Bai, and Dacheng Wen. Mcfend: A multi-source benchmark dataset for chinese fake news de- tection. InProceedings of the ACM Web Conference 2024, pages 4018–4027, 2024. 4

work page 2024
[31]

Llm-eval: Unified multi-dimensional automatic evaluation for open-domain conversations with large language models.arXiv preprint arXiv:2305.13711, 2023

Yen-Ting Lin and Yun-Nung Chen. Llm-eval: Unified multi-dimensional automatic evaluation for open-domain conversations with large language models.arXiv preprint arXiv:2305.13711, 2023. 3

work page arXiv 2023
[32]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024. 6

work page internal anchor Pith review Pith/arXiv arXiv 2024
[33]

Mm-safetybench: A benchmark for safety eval- uation of multimodal large language models

Xin Liu, Yichen Zhu, Jindong Gu, Yunshi Lan, Chao Yang, and Yu Qiao. Mm-safetybench: A benchmark for safety eval- uation of multimodal large language models. InEuropean Conference on Computer Vision, pages 386–403. Springer,

work page
[34]

G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment

Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-eval: Nlg evaluation us- ing gpt-4 with better human alignment.arXiv preprint arXiv:2303.16634, 2023. 3

work page internal anchor Pith review Pith/arXiv arXiv 2023
[35]

Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment

Yang Liu, Yuanshun Yao, Jean-Francois Ton, Xiaoying Zhang, Ruocheng Guo, Hao Cheng, Yegor Klochkov, Muhammad Faaiz Taufiq, and Hang Li. Trustworthy llms: a survey and guideline for evaluating large language models’ alignment.arXiv preprint arXiv:2308.05374, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[36]

Pv-vtt: A privacy-centric dataset for mission- specific anomaly detection and natural language interpreta- tion

Ryozo Masuakwa, Sanggeon Yun, Yoshiki Yamaguchi, and Mohsen Imani. Pv-vtt: A privacy-centric dataset for mission- specific anomaly detection and natural language interpreta- tion. In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 6415–6424. IEEE, 2025. 4

work page 2025
[37]

Ethos: a multi-label hate speech de- tection dataset.Complex & Intelligent Systems, 8(6):4663– 4678, 2022

Ioannis Mollas, Zoe Chrysopoulou, Stamatis Karlos, and Grigorios Tsoumakas. Ethos: a multi-label hate speech de- tection dataset.Complex & Intelligent Systems, 8(6):4663– 4678, 2022. 4

work page 2022
[38]

Detecting potential violent be- havior using deep learning

Dalton Chukwuezugo Owoh. Detecting potential violent be- havior using deep learning. 2023. 4

work page 2023
[39]

BBQ: A Hand-Built Bias Benchmark for Question Answering

Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Pad- makumar, Jason Phang, Jana Thompson, Phu Mon Htut, and Samuel R Bowman. Bbq: A hand-built bias benchmark for question answering.arXiv preprint arXiv:2110.08193, 2021. 4

work page internal anchor Pith review Pith/arXiv arXiv 2021
[40]

Fakesv: A multi- modal benchmark with rich social context for fake news de- tection on short video platforms

Peng Qi, Yuyan Bu, Juan Cao, Wei Ji, Ruihao Shui, Junbin Xiao, Danding Wang, and Tat-Seng Chua. Fakesv: A multi- modal benchmark with rich social context for fake news de- tection on short video platforms. InProceedings of the AAAI Conference on Artificial Intelligence, pages 14444–14452,

work page
[41]

Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!

Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to!arXiv preprint arXiv:2310.03693, 2023. 4

work page internal anchor Pith review Pith/arXiv arXiv 2023
[42]

Fine-tuning aligned language models compromises safety, even when users do not intend to!, 2023

Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to!, 2023. 4

work page 2023
[43]

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks.arXiv preprint arXiv:1908.10084, 2019. 5, 6

work page internal anchor Pith review Pith/arXiv arXiv 1908
[44]

AudioSet

Google Research. AudioSet. https://research.google.com/ audioset/, 2025. [Accessed 11-07-2025]. 4

work page 2025
[45]

Hate speech detection in the bengali lan- guage: A dataset and its baseline evaluation

Nauros Romim, Mosahed Ahmed, Hriteshwar Talukder, and Md Saiful Islam. Hate speech detection in the bengali lan- guage: A dataset and its baseline evaluation. InProceedings of International Joint Conference on Advances in Computa- tional Intelligence: IJCACI 2020, pages 457–468. Springer,

work page 2020
[46]

Who validates the validators? aligning llm-assisted evaluation of llm outputs with human preferences

Shreya Shankar, JD Zamfirescu-Pereira, Bj ¨orn Hartmann, Aditya Parameswaran, and Ian Arawjo. Who validates the validators? aligning llm-assisted evaluation of llm outputs with human preferences. InProceedings of the 37th Annual ACM Symposium on User Interface Software and Technol- ogy, pages 1–14, 2024. 3

work page 2024
[47]

Overview of ccl23-eval task 8: Chinese essay fluency evaluation (cefe) task

Xinshu Shen, Hongyi Wu, Xiaopeng Bai, Yuanbin Wu, Aimin Zhou, Shaoguang Mao, Tao Ge, and Yan Xia. Overview of ccl23-eval task 8: Chinese essay fluency evaluation (cefe) task. In Proceedings of the 22nd Chinese National Conference on Computational Linguistics (Volume 3: Evaluations), pages 282–292, 2023.
[48]

Real-world anomaly detection in surveillance videos

Waqas Sultani, Chen Chen, and Mubarak Shah. Real-world anomaly detection in surveillance videos. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6479–6488, 2018.
[49]

Case-bench: Context-aware safety benchmark for large language models

Guangzhi Sun, Xiao Zhan, Shutong Feng, Phil Woodland, and Jose Such. Case-bench: Context-aware safety benchmark for large language models. In Forty-second International Conference on Machine Learning, 2025.
[50]

Safety assessment of chinese large language models

Hao Sun, Zhexin Zhang, Jiawen Deng, Jiale Cheng, and Minlie Huang. Safety assessment of chinese large language models. arXiv preprint arXiv:2304.10436, 2023.
[51]

TrustLLM: Trustworthiness in Large Language Models

Lichao Sun, Yue Huang, Haoran Wang, Siyuan Wu, Qihui Zhang, Chujie Gao, Yixin Huang, Wenhan Lyu, Yixuan Zhang, Xiner Li, et al. Trustllm: Trustworthiness in large language models. arXiv preprint arXiv:2401.05561, 3, 2024.
[52]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
[53]

How many unicorns are in this image? A safety evaluation benchmark for vision llms

Haoqin Tu, Chenhang Cui, Zijun Wang, Yiyang Zhou, Bingchen Zhao, Junlin Han, Wangchunshu Zhou, Huaxiu Yao, and Cihang Xie. How many unicorns are in this image? a safety evaluation benchmark for vision llms. arXiv preprint arXiv:2311.16101, 2023.
[54]

A study on integrating machine learning techniques for waste management

Ruchika Vaidya, Rahul Dattangire, Divya Biradar, and Prateek Verma. A study on integrating machine learning techniques for waste management. In 2024 7th International Conference on Circuit Power and Computing Technologies (ICCPCT), pages 1506–1510. IEEE, 2024.
[55]

Belgian man dies by suicide following exchanges with chatbot

Lauren Walker. Belgian man dies by suicide following exchanges with chatbot. The Brussels Times (March 2023), 2023.
[56]

Pdid: database of molecular-level putative protein–drug interactions in the structural human proteome

Chen Wang, Gang Hu, Kui Wang, Michal Brylinski, Lei Xie, and Lukasz Kurgan. Pdid: database of molecular-level putative protein–drug interactions in the structural human proteome. Bioinformatics, 32(4):579–586, 2016.
[57]

Multihateclip: A multilingual benchmark dataset for hateful video detection on youtube and bilibili

Han Wang, Tan Rui Yang, Usman Naseem, and Roy Ka-Wei Lee. Multihateclip: A multilingual benchmark dataset for hateful video detection on youtube and bilibili. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 7493–7502, 2024.
[58]

Cnn-generated images are surprisingly easy to spot... for now

Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A Efros. Cnn-generated images are surprisingly easy to spot... for now. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8695–8704, 2020.
[59]

Multimodal llm enhanced cross-lingual cross-modal retrieval

Yabing Wang, Le Wang, Qiang Zhou, Zhibin Wang, Hao Li, Gang Hua, and Wei Tang. Multimodal llm enhanced cross-lingual cross-modal retrieval. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 8296–8305, 2024.
[60]

Safebench: A benchmarking platform for safety evaluation of autonomous vehicles

Chejian Xu, Wenhao Ding, Weijie Lyu, Zuxin Liu, Shuai Wang, Yihan He, Hanjiang Hu, Ding Zhao, and Bo Li. Safebench: A benchmarking platform for safety evaluation of autonomous vehicles. Advances in Neural Information Processing Systems, 35:25667–25682, 2022.
[61]

Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models

Peng Xu, Wenqi Shao, Kaipeng Zhang, Peng Gao, Shuo Liu, Meng Lei, Fanqing Meng, Siyuan Huang, Yu Qiao, and Ping Luo. Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
[62]

Layoutlmv2: Multi-modal pre-training for visually-rich document understanding

Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, et al. Layoutlmv2: Multi-modal pre-training for visually-rich document understanding. arXiv preprint arXiv:2012.14740, 2020.
[63]

Lamm: Language-assisted multi-modal instruction-tuning dataset, framework, and benchmark

Zhenfei Yin, Jiong Wang, Jianjian Cao, Zhelun Shi, Dingning Liu, Mukai Li, Xiaoshui Huang, Zhiyong Wang, Lu Sheng, Lei Bai, et al. Lamm: Language-assisted multi-modal instruction-tuning dataset, framework, and benchmark. Advances in Neural Information Processing Systems, 36:26650–26685, 2023.
[64]

Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the association for computational linguistics, 2:67–78, 2014.
[65]

Timesuite: Improving mllms for long video understanding via grounded tuning

Xiangyu Zeng, Kunchang Li, Chenting Wang, Xinhao Li, Tianxiang Jiang, Ziang Yan, Songze Li, Yansong Shi, Zhengrong Yue, Yi Wang, et al. Timesuite: Improving mllms for long video understanding via grounded tuning. In The Thirteenth International Conference on Learning Representations.
[66]

Exo2ego: Exocentric knowledge guided mllm for egocentric video understanding

Haoyu Zhang, Qiaohui Chu, Meng Liu, Yunxiao Wang, Bin Wen, Fan Yang, Tingting Gao, Di Zhang, Yaowei Wang, and Liqiang Nie. Exo2ego: Exocentric knowledge guided mllm for egocentric video understanding. arXiv preprint arXiv:2503.09143, 2025.
[67]

Differential-perceptive and retrieval-augmented mllm for change captioning

Xian Zhang, Haokun Wen, Jianlong Wu, Pengda Qin, Hui Xue’, and Liqiang Nie. Differential-perceptive and retrieval-augmented mllm for change captioning. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 4148–4157, 2024.
[68]

Multitrust: A comprehensive benchmark towards trustworthy multimodal large language models

Yichi Zhang, Yao Huang, Yitong Sun, Chang Liu, Zhe Zhao, Zhengwei Fang, Yifan Wang, Huanran Chen, Xiao Yang, Xingxing Wei, et al. Multitrust: A comprehensive benchmark towards trustworthy multimodal large language models. Advances in Neural Information Processing Systems, 37:49279–49383, 2024.
[69]

Efficient motion-aware video mllm

Zijia Zhao, Yuqi Huo, Tongtian Yue, Longteng Guo, Haoyu Lu, Bingning Wang, Weipeng Chen, and Jing Liu. Efficient motion-aware video mllm. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 24159–24168, 2025.
[70]

Judging llm-as-a-judge with mt-bench and chatbot arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023.
[71]

Image-based table recognition: data, model, and evaluation

Xu Zhong, Elaheh ShafieiBavani, and Antonio Jimeno Yepes. Image-based table recognition: data, model, and evaluation. In European conference on computer vision, pages 564–580. Springer, 2020.
[72]

Places: A 10 million image database for scene recognition

Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE transactions on pattern analysis and machine intelligence, 40(6):1452–1464, 2017.
[73]

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.

OutSafe-Bench: A Benchmark for Multimodal Offensive Content Detection in Large Language Models
Supplementary Material
Datasets generation

OutSafe-Bench is constructed from a seed pool of 30 publicly available datasets, each selected to represent one or more of the nine safety-critical categories. A detailed breakdown of the dataset sources and their corresponding content domains is presented in Table 5.
Output prompt design & Quality Control

7.1. Output prompt design

For text input, we use the default system settings for all models, and the input prompt is the text content.

response = self.client.chat.completions.create(
    model = model_full_name,
    messages = [
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": prompt}...
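The text-input setup described in Section 7.1 can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes an OpenAI-compatible chat-completions client, and the helper names `build_messages` and `query_text_model` are hypothetical. The elided remainder of the original call is not reproduced; only the arguments visible in the excerpt are used.

```python
from typing import Any, Dict, List

# Default system prompt quoted in the Section 7.1 excerpt.
DEFAULT_SYSTEM_PROMPT = "You are a helpful assistant"


def build_messages(prompt: str) -> List[Dict[str, str]]:
    """Build the two-message payload: default system prompt, then the raw text prompt."""
    return [
        {"role": "system", "content": DEFAULT_SYSTEM_PROMPT},
        {"role": "user", "content": prompt},
    ]


def query_text_model(client: Any, model_full_name: str, prompt: str) -> str:
    """Send one text prompt and return the reply text.

    Assumption: `client` exposes the OpenAI-style
    `chat.completions.create(model=..., messages=...)` method and returns
    a response whose first choice carries the reply in `.message.content`.
    """
    response = client.chat.completions.create(
        model=model_full_name,
        messages=build_messages(prompt),
    )
    return response.choices[0].message.content
```

Keeping the message construction in its own function means every evaluated model receives the same system prompt, so differences in output are less likely to come from prompt wording.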
Evaluation Prompts & Results

8.1. Evaluation Prompt Design

To ensure fair and consistent judgment across diverse safety dimensions, we designed a set of evaluation prompts that guide the jury models to assess the risk level of each model output.

f"""You are a content safety expert. Please evaluate and score the following text from nine content risk perspe...
Ablation Experiments

To further examine the reliability and interpretability of our results, we conduct additional validation and ablation experiments on a 936-sample subset with human-annotated safety ratings. This subset covers balanced distributions across nine risk categories and four modalities (text, image, audio, and video), with detailed data shown in Table 4. ...
Case Study

To further illustrate how the proposed evaluation framework operates in practice, we present five representative case studies covering Chinese text, English text, image, audio, and video inputs. Each case is visualized as a unified evaluation panel (see Figure 5 to Figure 9), where the top part shows the input and model-generated output, an...


[42] [42]

Fine-tuning aligned language models compromises safety, even when users do not intend to!, 2023

Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to!, 2023. 4

work page 2023

[43] [43]

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks.arXiv preprint arXiv:1908.10084, 2019. 5, 6

work page internal anchor Pith review Pith/arXiv arXiv 1908

[44] [44]

AudioSet

Google Research. AudioSet. https://research.google.com/ audioset/, 2025. [Accessed 11-07-2025]. 4

work page 2025

[45] [45]

Hate speech detection in the bengali lan- guage: A dataset and its baseline evaluation

Nauros Romim, Mosahed Ahmed, Hriteshwar Talukder, and Md Saiful Islam. Hate speech detection in the bengali lan- guage: A dataset and its baseline evaluation. InProceedings of International Joint Conference on Advances in Computa- tional Intelligence: IJCACI 2020, pages 457–468. Springer,

work page 2020

[46] [46]

Who validates the validators? aligning llm-assisted evaluation of llm outputs with human preferences

Shreya Shankar, JD Zamfirescu-Pereira, Bj ¨orn Hartmann, Aditya Parameswaran, and Ian Arawjo. Who validates the validators? aligning llm-assisted evaluation of llm outputs with human preferences. InProceedings of the 37th Annual ACM Symposium on User Interface Software and Technol- ogy, pages 1–14, 2024. 3

work page 2024

[47] [47]

Overview of ccl23-eval task 8: Chinese essay fluency eval- uation (cefe) task

Xinshu Shen, Hongyi Wu, Xiaopeng Bai, Yuanbin Wu, Aimin Zhou, Shaoguang Mao, Tao Ge, and Yan Xia. Overview of ccl23-eval task 8: Chinese essay fluency eval- uation (cefe) task. InProceedings of the 22nd Chinese Na- tional Conference on Computational Linguistics (Volume 3: Evaluations), pages 282–292, 2023. 4

work page 2023

[48] [48]

Real-world anomaly detection in surveillance videos

Waqas Sultani, Chen Chen, and Mubarak Shah. Real-world anomaly detection in surveillance videos. InProceedings of the IEEE conference on computer vision and pattern recog- nition, pages 6479–6488, 2018. 4

work page 2018

[49] [49]

Case-bench: Context-aware safety bench- mark for large language models

Guangzhi Sun, Xiao Zhan, Shutong Feng, Phil Woodland, and Jose Such. Case-bench: Context-aware safety bench- mark for large language models. InForty-second Interna- tional Conference on Machine Learning, 2025. 6

work page 2025

[50] [50]

arXiv preprint arXiv:2304.10436 , year =

Hao Sun, Zhexin Zhang, Jiawen Deng, Jiale Cheng, and Min- lie Huang. Safety assessment of chinese large language mod- els.arXiv preprint arXiv:2304.10436, 2023. 4, 6

work page arXiv 2023

[51] [51]

TrustLLM: Trustworthiness in Large Language Models

Lichao Sun, Yue Huang, Haoran Wang, Siyuan Wu, Qi- hui Zhang, Chujie Gao, Yixin Huang, Wenhan Lyu, Yixuan Zhang, Xiner Li, et al. Trustllm: Trustworthiness in large lan- guage models.arXiv preprint arXiv:2401.05561, 3, 2024. 1, 2

work page internal anchor Pith review arXiv 2024

[52] [52]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean- Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023. 6

work page internal anchor Pith review Pith/arXiv arXiv 2023

[53] [53]

How many unicorns are in this im- age? a safety evaluation benchmark for vision llms.arXiv preprint arXiv:2311.16101, 2023

Haoqin Tu, Chenhang Cui, Zijun Wang, Yiyang Zhou, Bingchen Zhao, Junlin Han, Wangchunshu Zhou, Huaxiu Yao, and Cihang Xie. How many unicorns are in this im- age? a safety evaluation benchmark for vision llms.arXiv preprint arXiv:2311.16101, 2023. 2

work page arXiv 2023

[54] [54]

A study on integrating machine learning tech- niques for waste management

Ruchika Vaidya, Rahul Dattangire, Divya Biradar, and Pra- teek Verma. A study on integrating machine learning tech- niques for waste management. In2024 7th International Conference on Circuit Power and Computing Technologies (ICCPCT), pages 1506–1510. IEEE, 2024. 4

work page 2024

[55] [55]

Belgian man dies by suicide following ex- changes with chatbot

Lauren Walker. Belgian man dies by suicide following ex- changes with chatbot. the brussels times (march 2023), 2023. 1 10

work page 2023

[56] [56]

Pdid: database of molecular-level puta- tive protein–drug interactions in the structural human pro- teome.Bioinformatics, 32(4):579–586, 2016

Chen Wang, Gang Hu, Kui Wang, Michal Brylinski, Lei Xie, and Lukasz Kurgan. Pdid: database of molecular-level puta- tive protein–drug interactions in the structural human pro- teome.Bioinformatics, 32(4):579–586, 2016. 4

work page 2016

[57] [57]

Multihateclip: A multilingual benchmark dataset for hateful video detection on youtube and bilibili

Han Wang, Tan Rui Yang, Usman Naseem, and Roy Ka-Wei Lee. Multihateclip: A multilingual benchmark dataset for hateful video detection on youtube and bilibili. InProceed- ings of the 32nd ACM International Conference on Multime- dia, pages 7493–7502, 2024. 4

work page 2024

[58] [58]

Cnn-generated images are surprisingly easy to spot

Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A Efros. Cnn-generated images are surprisingly easy to spot... for now. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8695–8704, 2020. 4

work page 2020

[59] [59]

Multimodal llm enhanced cross- lingual cross-modal retrieval

Yabing Wang, Le Wang, Qiang Zhou, Zhibin Wang, Hao Li, Gang Hua, and Wei Tang. Multimodal llm enhanced cross- lingual cross-modal retrieval. InProceedings of the 32nd ACM International Conference on Multimedia, pages 8296– 8305, 2024. 1

work page 2024

[60] [60]

Safebench: A benchmarking platform for safety evaluation of autonomous vehicles.Advances in Neural Information Processing Systems, 35:25667–25682, 2022

Chejian Xu, Wenhao Ding, Weijie Lyu, Zuxin Liu, Shuai Wang, Yihan He, Hanjiang Hu, Ding Zhao, and Bo Li. Safebench: A benchmarking platform for safety evaluation of autonomous vehicles.Advances in Neural Information Processing Systems, 35:25667–25682, 2022. 1, 2, 3

work page 2022

[61] [61]

Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models.IEEE Transactions on Pat- tern Analysis and Machine Intelligence, 2024

Peng Xu, Wenqi Shao, Kaipeng Zhang, Peng Gao, Shuo Liu, Meng Lei, Fanqing Meng, Siyuan Huang, Yu Qiao, and Ping Luo. Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models.IEEE Transactions on Pat- tern Analysis and Machine Intelligence, 2024. 2

work page 2024

[62] [62]

arXiv preprint arXiv:2012.14740 , year=

Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, et al. Layoutlmv2: Multi-modal pre-training for visually-rich document understanding.arXiv preprint arXiv:2012.14740, 2020. 4

work page arXiv 2012

[63]

Zhenfei Yin, Jiong Wang, Jianjian Cao, Zhelun Shi, Dingning Liu, Mukai Li, Xiaoshui Huang, Zhiyong Wang, Lu Sheng, Lei Bai, et al. LAMM: Language-assisted multi-modal instruction-tuning dataset, framework, and benchmark. Advances in Neural Information Processing Systems, 36:26650–26685, 2023.

[64]

Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014.

[65]

Xiangyu Zeng, Kunchang Li, Chenting Wang, Xinhao Li, Tianxiang Jiang, Ziang Yan, Songze Li, Yansong Shi, Zhengrong Yue, Yi Wang, et al. TimeSuite: Improving MLLMs for long video understanding via grounded tuning. In The Thirteenth International Conference on Learning Representations.

[66]

Haoyu Zhang, Qiaohui Chu, Meng Liu, Yunxiao Wang, Bin Wen, Fan Yang, Tingting Gao, Di Zhang, Yaowei Wang, and Liqiang Nie. Exo2Ego: Exocentric knowledge guided MLLM for egocentric video understanding. arXiv preprint arXiv:2503.09143, 2025.

[67]

Xian Zhang, Haokun Wen, Jianlong Wu, Pengda Qin, Hui Xue’, and Liqiang Nie. Differential-perceptive and retrieval-augmented MLLM for change captioning. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 4148–4157, 2024.

[68]

Yichi Zhang, Yao Huang, Yitong Sun, Chang Liu, Zhe Zhao, Zhengwei Fang, Yifan Wang, Huanran Chen, Xiao Yang, Xingxing Wei, et al. MultiTrust: A comprehensive benchmark towards trustworthy multimodal large language models. Advances in Neural Information Processing Systems, 37:49279–49383, 2024.

[69]

Zijia Zhao, Yuqi Huo, Tongtian Yue, Longteng Guo, Haoyu Lu, Bingning Wang, Weipeng Chen, and Jing Liu. Efficient motion-aware video MLLM. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 24159–24168, 2025.

[70]

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023.

[71]

Xu Zhong, Elaheh ShafieiBavani, and Antonio Jimeno Yepes. Image-based table recognition: Data, model, and evaluation. In European Conference on Computer Vision, pages 564–580. Springer, 2020.

[72]

Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6):1452–1464, 2017.

[73]

Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.

OutSafe-Bench: A Benchmark for Multimodal Offensive Content Detection in Large Language Models

Supplementary Material

Datasets generation

OutSafe-Bench is constructed from a seed pool of 30 publicly available datasets, each selected to represent one or more of the nine safety-critical categories. A detailed breakdown of the dataset sources and their corresponding content domains is presented in Table 5.

Output prompt design & Quality Control

7.1. Output prompt design

For text input, we use the default system settings for all models, and the input prompt is the text content.

    response = self.client.chat.completions.create(
        model=model_full_name,
        messages=[
            {"role": "system", "content": "You are a helpful assistant"},
            {"role": "user", "content": prompt}...
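The call above passes a two-turn message list: a fixed system turn followed by the raw text prompt as the user turn. A minimal sketch of building that list is given here; the helper name `build_messages` and the sample prompt are hypothetical and are not taken from the authors' code, and the part of the original call that is cut off is not reconstructed.

```python
from typing import Dict, List


def build_messages(
    prompt: str,
    system_prompt: str = "You are a helpful assistant",
) -> List[Dict[str, str]]:
    """Return a chat-style message list: one system turn, then one user turn.

    Hypothetical helper for illustration; the default system prompt is the
    string quoted in the text, everything else is an assumption.
    """
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": prompt},
    ]


if __name__ == "__main__":
    # Illustrative prompt only; not a sample from the benchmark.
    for message in build_messages("Describe the weather today."):
        print(message["role"], "->", message["content"])
```

Keeping the message construction in one small function would make it easy to confirm that every evaluated model receives the same system turn for text inputs.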

Evaluation Prompts & Results

8.1. Evaluation Prompt Design

To ensure fair and consistent judgment across diverse safety dimensions, we designed a set of evaluation prompts that guide the jury models to assess the risk level of each model output.

    f"""You are a content safety expert. Please evaluate and score the following text from nine content risk perspe...
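Since several jury models each score the same output per risk category, their scores have to be combined somehow. The sketch below assumes a plain weighted mean per category; it is an illustration only and not the paper's FairScore definition, which this section does not reproduce. The function name, jury names, weights, and scores are all hypothetical.

```python
from typing import Dict, List


def aggregate_jury_scores(
    jury_scores: Dict[str, List[float]],
    jury_weights: Dict[str, float],
) -> List[float]:
    """Combine per-category scores from several juries by weighted mean.

    jury_scores maps a jury name to its list of per-category scores.
    jury_weights maps the same names to non-negative weights.
    Assumption: a simple weighted average, chosen only for illustration.
    """
    if not jury_scores:
        raise ValueError("at least one jury is required")
    lengths = {len(scores) for scores in jury_scores.values()}
    if len(lengths) != 1:
        raise ValueError("all juries must score the same number of categories")
    total_weight = sum(jury_weights[name] for name in jury_scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    n_categories = lengths.pop()
    return [
        sum(jury_weights[name] * scores[i] for name, scores in jury_scores.items())
        / total_weight
        for i in range(n_categories)
    ]


if __name__ == "__main__":
    # Two hypothetical juries scoring two categories.
    scores = {"jury_a": [0.0, 1.0], "jury_b": [1.0, 1.0]}
    weights = {"jury_a": 1.0, "jury_b": 3.0}
    print(aggregate_jury_scores(scores, weights))
```

With nine risk categories each jury would supply a nine-element list, and the result would be one aggregated score per category.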

Ablation Experiments

To further examine the reliability and interpretability of our results, we conduct additional validation and ablation experiments on a 936-sample subset with human-annotated safety ratings. This subset covers balanced distributions across nine risk categories and four modalities (text, image, audio, and video), with detailed data shown in Table 4.

Case Study

To further illustrate how the proposed evaluation framework operates in practice, we present five representative case studies covering Chinese text, English text, image, audio, and video inputs. Each case is visualized as a unified evaluation panel (see Figures 5 to 9), where the top part shows the input and model-generated output, an...
