OutSafe-Bench: A Benchmark for Multimodal Offensive Content Detection in Large Language Models
Pith reviewed 2026-05-17 22:26 UTC · model grok-4.3
The pith
OutSafe-Bench reveals persistent safety vulnerabilities across nine state-of-the-art multimodal large language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OutSafe-Bench supplies a large-scale multimodal dataset with over 18,000 bilingual text prompts, 4,500 images, 450 audio clips, and 450 videos systematically annotated across nine critical content risk categories, together with the Multidimensional Cross Risk Score to model correlated risks and the FairScore multi-reviewer weighted aggregation framework that selects adaptive juries from top models; evaluation of nine state-of-the-art MLLMs with this suite demonstrates persistent and substantial safety vulnerabilities.
What carries the argument
OutSafe-Bench dataset combined with the FairScore automated multi-reviewer weighted aggregation framework that uses top-performing models as juries to mitigate single-model bias while the Multidimensional Cross Risk Score captures overlapping risks across categories.
If this is right
- Current multimodal models require additional safeguards before safe deployment in everyday tools.
- Evaluation must cover multiple modalities and languages to expose vulnerabilities that single-modality tests miss.
- Automated multi-reviewer scoring can reduce bias compared with single-model judgment.
- Persistent failures across nine risk categories indicate that safety training has not yet closed the gap in MLLMs.
Where Pith is reading between the lines
- Developers could incorporate OutSafe-Bench directly into model training loops to penalize unsafe generations in real time.
- The same dataset construction method could be extended to additional languages and emerging risk types such as deepfake misinformation.
- Regulatory bodies might adopt similar multimodal benchmarks as minimum standards for releasing new MLLMs to the public.
Load-bearing premise
The nine risk categories and the automated FairScore aggregation accurately capture real-world safety failures without introducing new biases from the jury models themselves.
What would settle it
A controlled user study in which participants interact with the evaluated MLLMs on held-out prompts and report the actual frequency of unsafe outputs, then compare that frequency to the benchmark scores produced by OutSafe-Bench and FairScore.
Figures
read the original abstract
Since Multimodal Large Language Models (MLLMs) are increasingly being integrated into everyday tools and intelligent agents, growing concerns have arisen regarding their possible output of unsafe contents, ranging from toxic language and biased imagery to privacy violations and harmful misinformation. Current safety benchmarks remain highly limited in both modality coverage and performance evaluations, often neglecting the extensive landscape of content safety. In this work, we introduce OutSafe-Bench, the first most comprehensive content safety evaluation test suite designed for the multimodal era. OutSafe-Bench includes a large-scale dataset that spans four modalities, featuring over 18,000 bilingual (Chinese and English) text prompts, 4,500 images, 450 audio clips and 450 videos, all systematically annotated across nine critical content risk categories. In addition to the dataset, we introduce a Multidimensional Cross Risk Score (MCRS), a novel metric designed to model and assess overlapping and correlated content risks across different categories. To ensure fair and robust evaluation, we propose FairScore, an explainable automated multi-reviewer weighted aggregation framework. FairScore selects top-performing models as adaptive juries, thereby mitigating biases from single-model judgments and enhancing overall evaluation reliability. Our evaluation of nine state-of-the-art MLLMs reveals persistent and substantial safety vulnerabilities, underscoring the pressing need for robust safeguards in MLLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces OutSafe-Bench, the first comprehensive multimodal content safety benchmark for MLLMs. It provides a large-scale dataset spanning four modalities (over 18,000 bilingual text prompts, 4,500 images, 450 audio clips, and 450 videos) annotated across nine risk categories, proposes the Multidimensional Cross Risk Score (MCRS) to model overlapping risks, and introduces FairScore, an explainable automated multi-reviewer weighted aggregation framework that selects top-performing models as adaptive juries. Evaluation of nine state-of-the-art MLLMs reveals persistent and substantial safety vulnerabilities.
Significance. If the dataset construction, MCRS, and FairScore prove robust and free of circular bias, this benchmark could fill a critical gap in multimodal safety evaluation by covering multiple modalities, languages, and correlated risks. The large scale and inclusion of audio/video are strengths that could support reproducible safety research and drive development of better safeguards in MLLMs.
major comments (2)
- [§3.3] §3.3 (FairScore framework): The adaptive selection of top-performing models as juries for weighted aggregation risks circular bias when those models share correlated safety blind spots (e.g., under-flagging co-occurring privacy and misinformation risks). The manuscript provides no evidence of jury diversity verification or cross-validation of jury outputs against human raters on held-out data, which is load-bearing for the reliability of the reported MLLM rankings and the central claim of 'persistent and substantial safety vulnerabilities'.
- [§2.2] §2.2 (Dataset annotation): The description of how the nine risk categories were defined and how the 18k+ prompts plus multimodal items were annotated lacks details on inter-annotator agreement, annotator expertise, or quality control procedures. Without these, the ground-truth labels underpinning both MCRS and FairScore evaluations cannot be independently assessed for consistency.
minor comments (2)
- [Figure 2] Figure 2 (dataset distribution): The modality and language breakdown bars are difficult to read due to overlapping labels; consider adding exact counts in the caption or a supplementary table.
- [§3.1] Notation for MCRS: The formula for combining cross-risk terms is introduced without an explicit definition of the weighting parameters; a short appendix deriving or justifying the weights would improve clarity.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment point by point below, explaining our position and the revisions incorporated to improve clarity and rigor.
read point-by-point responses
-
Referee: [§3.3] §3.3 (FairScore framework): The adaptive selection of top-performing models as juries for weighted aggregation risks circular bias when those models share correlated safety blind spots (e.g., under-flagging co-occurring privacy and misinformation risks). The manuscript provides no evidence of jury diversity verification or cross-validation of jury outputs against human raters on held-out data, which is load-bearing for the reliability of the reported MLLM rankings and the central claim of 'persistent and substantial safety vulnerabilities'.
Authors: We appreciate the referee's concern about potential circular bias arising from correlated blind spots among selected jury models. The FairScore design adaptively chooses top-performing models to aggregate judgments and reduce single-model bias, with the nine evaluated MLLMs drawn from diverse developers and architectures to promote variation in safety detection capabilities. We acknowledge that the original manuscript did not include explicit verification of jury diversity or cross-validation against human raters. In the revised manuscript, we have expanded §3.3 to detail the jury selection criteria with attention to model diversity and added a human cross-validation study on held-out data to support the reliability of the aggregated scores and the reported safety vulnerabilities. revision: yes
-
Referee: [§2.2] §2.2 (Dataset annotation): The description of how the nine risk categories were defined and how the 18k+ prompts plus multimodal items were annotated lacks details on inter-annotator agreement, annotator expertise, or quality control procedures. Without these, the ground-truth labels underpinning both MCRS and FairScore evaluations cannot be independently assessed for consistency.
Authors: We agree that additional specifics on the annotation process are needed to allow independent assessment of label quality. The original manuscript outlined the nine risk categories and the general annotation of the large-scale multimodal dataset but omitted quantitative details on agreement, expertise, and controls. We have revised §2.2 to include a full description of how the risk categories were defined from established safety frameworks, the expertise of the annotators, inter-annotator agreement metrics, and the quality control procedures employed, such as multi-round reviews. These additions strengthen the foundation for MCRS and FairScore evaluations. revision: yes
Circularity Check
OutSafe-Bench benchmark and FairScore metric are self-contained with no circular reductions
full rationale
The paper constructs a new multimodal dataset spanning text, images, audio, and video with nine independently defined risk categories, then applies the novel MCRS metric and FairScore aggregation framework to evaluate nine MLLMs. No equations, self-citations, or definitional steps reduce the reported safety vulnerabilities to fitted parameters or inputs defined inside the paper itself; the adaptive jury selection is presented as an external mitigation technique rather than a tautological loop. The central claims rest on empirical application of these components to external models and data, making the derivation chain self-contained against external benchmarks without any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We propose a novel Multidimensional Cross Risk Score (MCRS), a function designed to capture and quantify the co-occurrence and interactions of multiple content risks within a single output... Cross-Risk Influence Matrix γ
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
FairScore selects top-performing models as adaptive juries... weighted ensemble approach
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
VoxSafeBench: Not Just What Is Said, but Who, How, and Where
VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.
Reference graph
Works this paper leans on
-
[1]
Alireza Afzal Aghaei and Nadia Khodaei. Automated de- pression recognition using multimodal machine learning: A study on the daic-woz dataset.Computational Mathematics and Computer Modeling with Applications (CMCMA), 2(1): 45–53, 2023. 4
work page 2023
-
[2]
Meet claude, your thinking partner
Anthropic. Meet claude, your thinking partner. https://www. anthropic.com/claude, 2023. 6
work page 2023
-
[3]
Doubao 1.5 pro: Api pricing & how to use doubao- 1.5-pro api
Apidog. Doubao 1.5 pro: Api pricing & how to use doubao- 1.5-pro api. https://apidog.com/blog/doubao-1-5-pro-api/,
-
[4]
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023. 6
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[5]
Baidu Research. Introducing pcl-baidu wenxin (ernie 3.0 ti- tan), the world’s first knowledge enhanced multi-hundred- billion model. https://research.baidu.com/Blog/index-view? id=165, 2021. 6
work page 2021
-
[6]
Tyler Benster, Guy Wilson, Reshef Elisha, Francis R Wil- lett, and Shaul Druckmann. A cross-modal approach to silent speech with llm-enhanced recognition.arXiv preprint arXiv:2403.05583, 2024. 4
-
[7]
Barbara Pfeffer Billauer. Murder without redress-the need for new legal solutions in the age of character-ai (cai).Avail- able at SSRN 5107942, 2024. 1
work page 2024
-
[8]
Steven Boll. Suppression of acoustic noise in speech us- ing spectral subtraction.IEEE Transactions on acoustics, speech, and signal processing, 27(2):113–120, 2003. 4
work page 2003
-
[9]
ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate
Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. Chateval: Towards better llm-based evaluators through multi-agent de- bate.arXiv preprint arXiv:2308.07201, 2023. 3
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[10]
R Reading Competition. Icdar 2019 robust reading challenge on scanned receipts ocr and information extraction.Web link: https://rrc. cvc. uab. es. 4
work page 2019
-
[11]
Chenhang Cui, Yiyang Zhou, Xinyu Yang, Shirley Wu, Lin- jun Zhang, James Zou, and Huaxiu Yao. Holistic analysis of hallucination in gpt-4v (ision): Bias and interference chal- lenges.arXiv preprint arXiv:2311.03287, 2023. 1
-
[12]
Hatemm: A multi- modal dataset for hate video classification
Mithun Das, Rohit Raj, Punyajoy Saha, Binny Mathew, Man- ish Gupta, and Animesh Mukherjee. Hatemm: A multi- modal dataset for hate video classification. InProceedings of the International AAAI Conference on Web and Social Me- dia, pages 1014–1023, 2023. 4
work page 2023
-
[13]
Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge.International journal of computer vision, 88:303–338, 2010. 4
work page 2010
-
[14]
guided mllm reasoning: Enhancing mllm with knowledge and visual notes for visual question answering
Wenlong Fang, Qiaofeng Wu, Jing Chen, and Yun Xue. guided mllm reasoning: Enhancing mllm with knowledge and visual notes for visual question answering. InProceed- ings of the Computer Vision and Pattern Recognition Con- ference, pages 19597–19607, 2025. 1
work page 2025
-
[15]
Hao Fei, Shengqiong Wu, Hanwang Zhang, Tat-Seng Chua, and Shuicheng Yan. Vitron: A unified pixel-level vision llm for understanding, generating, segmenting, editing.arXiv preprint arXiv:2412.19806, 2024. 1
- [16]
-
[17]
[Accessed 11-07-2025]. 4
work page 2025
-
[18]
Gemini: Our most intelligent ai models
Google Deepmind. Gemini: Our most intelligent ai models. https://deepmind.google/models/gemini/, 2025. 6
work page 2025
-
[19]
Tianle Gu, Zeyang Zhou, Kexin Huang, Liang Dandan, Yixu Wang, Haiquan Zhao, Yuanqi Yao, Yujiu Yang, Yan Teng, Yu Qiao, et al. Mllmguard: A multi-dimensional safety evalua- tion suite for multimodal large language models.Advances in Neural Information Processing Systems, 37:7256–7295,
-
[20]
Sifeng He, Xudong Yang, Chen Jiang, Gang Liang, Wei Zhang, Tan Pan, Qing Wang, Furong Xu, Chunguang Li, JinXiong Liu, et al. A large-scale comprehensive dataset and copy-overlap aware evaluation protocol for segment-level video copy detection. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 21086–21095, 2022. 4
work page 2022
-
[21]
Arash Heidari, Nima Jafari Navimipour, Hasan Dag, and Mehmet Unal. Deepfake detection using deep learning meth- ods: A systematic and comprehensive review.Wiley Interdis- ciplinary Reviews: Data Mining and Knowledge Discovery, 14(2):e1520, 2024. 4
work page 2024
-
[22]
Gpt-4o: The cutting-edge advancement in multimodal llm.Authorea Preprints, 2024
Raisa Islam and Owana Marzia Moushi. Gpt-4o: The cutting-edge advancement in multimodal llm.Authorea Preprints, 2024. 6
work page 2024
-
[23]
Funsd: A dataset for form understanding in noisy scanned documents
Guillaume Jaume, Hazim Kemal Ekenel, and Jean-Philippe Thiran. Funsd: A dataset for form understanding in noisy scanned documents. 2:1–6, 2019. 4
work page 2019
-
[24]
Aiqi Jiang, Xiaohan Yang, Yang Liu, and Arkaitz Zubiaga. Swsr: A chinese dataset and lexicon for online sexism detec- tion.Online Social Networks and Media, 27:100182, 2022. 4
work page 2022
-
[25]
Kimmo Karkkainen and Jungseock Joo. Fairface: Face attribute dataset for balanced race, gender, and age for bias measurement and mitigation. InProceedings of the IEEE/CVF winter conference on applications of computer vision, pages 1548–1558, 2021. 4
work page 2021
-
[26]
Guangwei Li, Yuansen Zhang, Yinggui Wang, Shoumeng Yan, Lei Wang, and Tao Wei. Priv-qa: Privacy-preserving question answering for cloud large language models.arXiv preprint arXiv:2502.13564, 2025. 1, 2
-
[27]
Qintong Li, Leyang Cui, Lingpeng Kong, and Wei Bi. Col- laborative evaluation: Exploring the synergy of large lan- guage models and humans for open-ended generation eval- uation.arXiv e-prints, pages arXiv–2310, 2023. 3
work page 2023
-
[28]
Prd: Peer rank and discussion improve large language model based evaluations
Ruosen Li, Teerth Patel, and Xinya Du. Prd: Peer rank and discussion improve large language model based evaluations. arXiv preprint arXiv:2307.02762, 2023. 3
-
[29]
Rule-based data selection for large language mod- els.arXiv preprint arXiv:2410.04715, 2024
Xiaomin Li, Mingye Gao, Zhiwei Zhang, Chang Yue, and Hong Hu. Rule-based data selection for large language mod- els.arXiv preprint arXiv:2410.04715, 2024. 3 9
-
[30]
Mcfend: A multi-source benchmark dataset for chinese fake news de- tection
Yupeng Li, Haorui He, Jin Bai, and Dacheng Wen. Mcfend: A multi-source benchmark dataset for chinese fake news de- tection. InProceedings of the ACM Web Conference 2024, pages 4018–4027, 2024. 4
work page 2024
-
[31]
Yen-Ting Lin and Yun-Nung Chen. Llm-eval: Unified multi-dimensional automatic evaluation for open-domain conversations with large language models.arXiv preprint arXiv:2305.13711, 2023. 3
-
[32]
Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024. 6
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[33]
Mm-safetybench: A benchmark for safety eval- uation of multimodal large language models
Xin Liu, Yichen Zhu, Jindong Gu, Yunshi Lan, Chao Yang, and Yu Qiao. Mm-safetybench: A benchmark for safety eval- uation of multimodal large language models. InEuropean Conference on Computer Vision, pages 386–403. Springer,
-
[34]
G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment
Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-eval: Nlg evaluation us- ing gpt-4 with better human alignment.arXiv preprint arXiv:2303.16634, 2023. 3
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[35]
Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment
Yang Liu, Yuanshun Yao, Jean-Francois Ton, Xiaoying Zhang, Ruocheng Guo, Hao Cheng, Yegor Klochkov, Muhammad Faaiz Taufiq, and Hang Li. Trustworthy llms: a survey and guideline for evaluating large language models’ alignment.arXiv preprint arXiv:2308.05374, 2023. 2
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[36]
Ryozo Masuakwa, Sanggeon Yun, Yoshiki Yamaguchi, and Mohsen Imani. Pv-vtt: A privacy-centric dataset for mission- specific anomaly detection and natural language interpreta- tion. In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 6415–6424. IEEE, 2025. 4
work page 2025
-
[37]
Ioannis Mollas, Zoe Chrysopoulou, Stamatis Karlos, and Grigorios Tsoumakas. Ethos: a multi-label hate speech de- tection dataset.Complex & Intelligent Systems, 8(6):4663– 4678, 2022. 4
work page 2022
-
[38]
Detecting potential violent be- havior using deep learning
Dalton Chukwuezugo Owoh. Detecting potential violent be- havior using deep learning. 2023. 4
work page 2023
-
[39]
BBQ: A Hand-Built Bias Benchmark for Question Answering
Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Pad- makumar, Jason Phang, Jana Thompson, Phu Mon Htut, and Samuel R Bowman. Bbq: A hand-built bias benchmark for question answering.arXiv preprint arXiv:2110.08193, 2021. 4
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[40]
Peng Qi, Yuyan Bu, Juan Cao, Wei Ji, Ruihao Shui, Junbin Xiao, Danding Wang, and Tat-Seng Chua. Fakesv: A multi- modal benchmark with rich social context for fake news de- tection on short video platforms. InProceedings of the AAAI Conference on Artificial Intelligence, pages 14444–14452,
-
[41]
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to!arXiv preprint arXiv:2310.03693, 2023. 4
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[42]
Fine-tuning aligned language models compromises safety, even when users do not intend to!, 2023
Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to!, 2023. 4
work page 2023
-
[43]
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks.arXiv preprint arXiv:1908.10084, 2019. 5, 6
work page internal anchor Pith review Pith/arXiv arXiv 1908
- [44]
-
[45]
Hate speech detection in the bengali lan- guage: A dataset and its baseline evaluation
Nauros Romim, Mosahed Ahmed, Hriteshwar Talukder, and Md Saiful Islam. Hate speech detection in the bengali lan- guage: A dataset and its baseline evaluation. InProceedings of International Joint Conference on Advances in Computa- tional Intelligence: IJCACI 2020, pages 457–468. Springer,
work page 2020
-
[46]
Who validates the validators? aligning llm-assisted evaluation of llm outputs with human preferences
Shreya Shankar, JD Zamfirescu-Pereira, Bj ¨orn Hartmann, Aditya Parameswaran, and Ian Arawjo. Who validates the validators? aligning llm-assisted evaluation of llm outputs with human preferences. InProceedings of the 37th Annual ACM Symposium on User Interface Software and Technol- ogy, pages 1–14, 2024. 3
work page 2024
-
[47]
Overview of ccl23-eval task 8: Chinese essay fluency eval- uation (cefe) task
Xinshu Shen, Hongyi Wu, Xiaopeng Bai, Yuanbin Wu, Aimin Zhou, Shaoguang Mao, Tao Ge, and Yan Xia. Overview of ccl23-eval task 8: Chinese essay fluency eval- uation (cefe) task. InProceedings of the 22nd Chinese Na- tional Conference on Computational Linguistics (Volume 3: Evaluations), pages 282–292, 2023. 4
work page 2023
-
[48]
Real-world anomaly detection in surveillance videos
Waqas Sultani, Chen Chen, and Mubarak Shah. Real-world anomaly detection in surveillance videos. InProceedings of the IEEE conference on computer vision and pattern recog- nition, pages 6479–6488, 2018. 4
work page 2018
-
[49]
Case-bench: Context-aware safety bench- mark for large language models
Guangzhi Sun, Xiao Zhan, Shutong Feng, Phil Woodland, and Jose Such. Case-bench: Context-aware safety bench- mark for large language models. InForty-second Interna- tional Conference on Machine Learning, 2025. 6
work page 2025
-
[50]
arXiv preprint arXiv:2304.10436 , year =
Hao Sun, Zhexin Zhang, Jiawen Deng, Jiale Cheng, and Min- lie Huang. Safety assessment of chinese large language mod- els.arXiv preprint arXiv:2304.10436, 2023. 4, 6
-
[51]
TrustLLM: Trustworthiness in Large Language Models
Lichao Sun, Yue Huang, Haoran Wang, Siyuan Wu, Qi- hui Zhang, Chujie Gao, Yixin Huang, Wenhan Lyu, Yixuan Zhang, Xiner Li, et al. Trustllm: Trustworthiness in large lan- guage models.arXiv preprint arXiv:2401.05561, 3, 2024. 1, 2
work page internal anchor Pith review arXiv 2024
-
[52]
Gemini: A Family of Highly Capable Multimodal Models
Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean- Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023. 6
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[53]
Haoqin Tu, Chenhang Cui, Zijun Wang, Yiyang Zhou, Bingchen Zhao, Junlin Han, Wangchunshu Zhou, Huaxiu Yao, and Cihang Xie. How many unicorns are in this im- age? a safety evaluation benchmark for vision llms.arXiv preprint arXiv:2311.16101, 2023. 2
-
[54]
A study on integrating machine learning tech- niques for waste management
Ruchika Vaidya, Rahul Dattangire, Divya Biradar, and Pra- teek Verma. A study on integrating machine learning tech- niques for waste management. In2024 7th International Conference on Circuit Power and Computing Technologies (ICCPCT), pages 1506–1510. IEEE, 2024. 4
work page 2024
-
[55]
Belgian man dies by suicide following ex- changes with chatbot
Lauren Walker. Belgian man dies by suicide following ex- changes with chatbot. the brussels times (march 2023), 2023. 1 10
work page 2023
-
[56]
Chen Wang, Gang Hu, Kui Wang, Michal Brylinski, Lei Xie, and Lukasz Kurgan. Pdid: database of molecular-level puta- tive protein–drug interactions in the structural human pro- teome.Bioinformatics, 32(4):579–586, 2016. 4
work page 2016
-
[57]
Multihateclip: A multilingual benchmark dataset for hateful video detection on youtube and bilibili
Han Wang, Tan Rui Yang, Usman Naseem, and Roy Ka-Wei Lee. Multihateclip: A multilingual benchmark dataset for hateful video detection on youtube and bilibili. InProceed- ings of the 32nd ACM International Conference on Multime- dia, pages 7493–7502, 2024. 4
work page 2024
-
[58]
Cnn-generated images are surprisingly easy to spot
Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A Efros. Cnn-generated images are surprisingly easy to spot... for now. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8695–8704, 2020. 4
work page 2020
-
[59]
Multimodal llm enhanced cross- lingual cross-modal retrieval
Yabing Wang, Le Wang, Qiang Zhou, Zhibin Wang, Hao Li, Gang Hua, and Wei Tang. Multimodal llm enhanced cross- lingual cross-modal retrieval. InProceedings of the 32nd ACM International Conference on Multimedia, pages 8296– 8305, 2024. 1
work page 2024
-
[60]
Chejian Xu, Wenhao Ding, Weijie Lyu, Zuxin Liu, Shuai Wang, Yihan He, Hanjiang Hu, Ding Zhao, and Bo Li. Safebench: A benchmarking platform for safety evaluation of autonomous vehicles.Advances in Neural Information Processing Systems, 35:25667–25682, 2022. 1, 2, 3
work page 2022
-
[61]
Peng Xu, Wenqi Shao, Kaipeng Zhang, Peng Gao, Shuo Liu, Meng Lei, Fanqing Meng, Siyuan Huang, Yu Qiao, and Ping Luo. Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models.IEEE Transactions on Pat- tern Analysis and Machine Intelligence, 2024. 2
work page 2024
-
[62]
arXiv preprint arXiv:2012.14740 , year=
Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, et al. Layoutlmv2: Multi-modal pre-training for visually-rich document understanding.arXiv preprint arXiv:2012.14740, 2020. 4
-
[63]
Zhenfei Yin, Jiong Wang, Jianjian Cao, Zhelun Shi, Dingn- ing Liu, Mukai Li, Xiaoshui Huang, Zhiyong Wang, Lu Sheng, Lei Bai, et al. Lamm: Language-assisted multi- modal instruction-tuning dataset, framework, and bench- mark.Advances in Neural Information Processing Systems, 36:26650–26685, 2023. 2
work page 2023
-
[64]
Peter Young, Alice Lai, Micah Hodosh, and Julia Hocken- maier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descrip- tions.Transactions of the association for computational lin- guistics, 2:67–78, 2014. 1
work page 2014
-
[65]
Timesuite: Improving mllms for long video understanding via grounded tuning
Xiangyu Zeng, Kunchang Li, Chenting Wang, Xinhao Li, Tianxiang Jiang, Ziang Yan, Songze Li, Yansong Shi, Zhen- grong Yue, Yi Wang, et al. Timesuite: Improving mllms for long video understanding via grounded tuning. InThe Thirteenth International Conference on Learning Represen- tations. 4
-
[66]
Haoyu Zhang, Qiaohui Chu, Meng Liu, Yunxiao Wang, Bin Wen, Fan Yang, Tingting Gao, Di Zhang, Yaowei Wang, and Liqiang Nie. Exo2ego: Exocentric knowledge guided mllm for egocentric video understanding.arXiv preprint arXiv:2503.09143, 2025. 1
-
[67]
Differential-perceptive and retrieval- augmented mllm for change captioning
Xian Zhang, Haokun Wen, Jianlong Wu, Pengda Qin, Hui Xue’, and Liqiang Nie. Differential-perceptive and retrieval- augmented mllm for change captioning. InProceedings of the 32nd ACM International Conference on Multimedia, pages 4148–4157, 2024. 1
work page 2024
-
[68]
Yichi Zhang, Yao Huang, Yitong Sun, Chang Liu, Zhe Zhao, Zhengwei Fang, Yifan Wang, Huanran Chen, Xiao Yang, Xingxing Wei, et al. Multitrust: A comprehensive bench- mark towards trustworthy multimodal large language mod- els.Advances in Neural Information Processing Systems, 37:49279–49383, 2024. 1, 2
work page 2024
-
[69]
Efficient motion-aware video mllm
Zijia Zhao, Yuqi Huo, Tongtian Yue, Longteng Guo, Haoyu Lu, Bingning Wang, Weipeng Chen, and Jing Liu. Efficient motion-aware video mllm. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 24159– 24168, 2025. 1
work page 2025
-
[70]
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in Neural Information Processing Systems, 36:46595–46623, 2023. 3
work page 2023
-
[71]
Image-based table recognition: data, model, and evaluation
Xu Zhong, Elaheh ShafieiBavani, and Antonio Ji- meno Yepes. Image-based table recognition: data, model, and evaluation. InEuropean conference on computer vision, pages 564–580. Springer, 2020. 4
work page 2020
-
[72]
Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition.IEEE transactions on pattern analysis and machine intelligence, 40(6):1452–1464, 2017. 4
work page 2017
-
[73]
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mo- hamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models.arXiv preprint arXiv:2304.10592, 2023. 6 11 OutSafe-Bench: A Benchmark for Multimodal Offensive Content Detection in Large Language Models Supplementary Material
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[74]
Datasets generation OutSafe-Bench is constructed from a seed pool of 30 pub- licly available datasets, each selected to represent one or more of the nine safety-critical categories. A detailed break- down of the dataset sources and their corresponding content domains is presented in Table 5
-
[75]
Output prompt design & Quality Control 7.1. Output prompt design For text input, we use the default system settings for all models, and the input prompt is the text content. response = self.client.chat.completions.create( model = model_full_name, messages = [ {"role": "system", "content": "You are a helpful assistant"}, {"role": "user", "content": prompt}...
-
[76]
Evaluation Prompts & Results 8.1. Evaluation Prompt Design To ensure fair and consistent judgment across diverse safety dimensions, we designed a set of evaluation prompts that guide the jury models to assess the risk level of each model output. f"""You are a content safety expert. Please evaluate and score the following text from nine content risk perspe...
-
[77]
Ablation Experiments To further examine the reliability and interpretability of our results, we conduct additional validation and ablation ex- periments on a 936-sample subset with human-annotated safety ratings. This subset covers balanced distributions across nine risk categories and four modalities (text, image, audio and video), with detailed data sho...
-
[78]
Case Study To further illustrate how the proposed evaluation frame- work operates in practice, we present five representative case studies covering Chinese text, English text, image, au- dio, and video inputs. Each case is visualized as a unified evaluation panel (see Figure 5 to Figure 9), where the top part shows the input and model-generated output, an...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.