Policy-Grounded Safety Evaluation of 20 Large Language Models

Juan Manuel Contreras

arxiv: 2507.14719 · v2 · submitted 2025-07-19 · 💻 cs.AI

Policy-Grounded Safety Evaluation of 20 Large Language Models

Juan Manuel Contreras This is my paper

Pith reviewed 2026-05-19 03:56 UTC · model grok-4.3

classification 💻 cs.AI

keywords LLM safetypolicy evaluationrisk assessmentadversarial testingAI compliancemodel benchmarkingprivacy risksmisinformation

0 comments

The pith

A platform converts safety policies into tests showing that 20 commercial LLMs vary widely in risk handling, from 86 percent down to 52 percent compliance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a platform that turns written safety rules into targeted prompts and then scores how well language models follow those rules. It applies this method to twenty available models across ten practical risk areas such as misinformation and privacy. The results show large differences both between models and between risk types, with clear success in some areas and consistent failure in others. A sympathetic reader would care because the approach offers a repeatable way to check model behavior against specific policies rather than general impressions. This points to the need for ongoing, customizable checks as models enter more real-world uses.

Core claim

The Aymara LLM Risk and Responsibility Matrix evaluates twenty commercially available large language models across ten real-world safety domains. It generates adversarial prompts from natural-language policies and scores responses with an AI rater validated against human judgments. Mean safety scores range from 86.2 percent to 52.4 percent overall, with models reaching 95.7 percent on misinformation but only 24.3 percent on privacy and impersonation. Statistical tests confirm that performance differs significantly across both models and domains.

What carries the argument

Aymara AI, a platform that transforms natural-language safety policies into adversarial prompts and scores model responses using an AI-based rater validated against human judgments.

Load-bearing premise

The AI-based rater produces scores that match human judgments closely enough to serve as a reliable measure of policy compliance.

What would settle it

Human raters scoring the same model responses on the same prompts and finding agreement with the AI rater below 80 percent on key safety domains.

Figures

Figures reproduced from arXiv: 2507.14719 by Juan Manuel Contreras.

**Figure 1.** Figure 1: The Aymara LLM Risk and Responsibility Matrix. This heatmap displays the safety score for each of the 20 LLMs across the 10 safety domains. Cell color corresponds to the safety score, with green indicating higher performance (safer) and red indicating lower performance (less safe). The first row and column show the average scores for each model and domain, respectively. 5.1. LLMs The 20 models tested showe… view at source ↗

**Figure 2.** Figure 2: Overall Safety Performance of 20 LLMs, Color-Coded by Provider. Each bar represents the mean safety score (as a percentage) for a single model, averaged across all 10 evaluated safety domains. Well-established risk areas such as Misinformation (mean = 95.7%, SD = 6.7%) and Malicious Use (mean = 91.8%, SD = 9.0%) saw uniformly strong performance: 12 models achieved perfect scores in Misinformation and eight… view at source ↗

**Figure 3.** Figure 3: Overall Safety Performance of 10 Safety Domains. Each bar represents the mean safety score (as a percentage) for a single domain, averaged across all 20 evaluated LLMs. To assess whether these differences were statistically robust, we performed a one-way ANOVA on domain mean scores, which confirmed significant variation across domains, F(9, 190) = 37.37, p < .001. Tukey HSD post-hoc tests (see [PITH_FULL_… view at source ↗

read the original abstract

As large language models (LLMs) become increasingly integrated into real-world applications, scalable and rigorous safety evaluation is essential. This paper introduces Aymara AI, a programmatic platform for generating and administering customized, policy-grounded safety evaluations. Aymara AI transforms natural-language safety policies into adversarial prompts and scores model responses using an AI-based rater validated against human judgments. We demonstrate its capabilities through the Aymara LLM Risk and Responsibility Matrix, which evaluates 20 commercially available LLMs across 10 real-world safety domains. Results reveal wide performance disparities, with mean safety scores ranging from 86.2% to 52.4%. While models performed well in well-established safety domains such as Misinformation (mean = 95.7%), they consistently failed in more complex or underspecified domains, notably Privacy & Impersonation (mean = 24.3%). Analyses of Variance confirmed that safety scores differed significantly across both models and domains (p < .05). These findings underscore the inconsistent and context-dependent nature of LLM safety and highlight the need for scalable, customizable tools like Aymara AI to support responsible AI development and oversight.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Aymara AI turns policies into prompts for LLM safety tests and reports big performance gaps across 20 models, but the AI rater validation is the part that needs checking.

read the letter

The paper introduces Aymara AI as a platform that takes natural-language safety policies, converts them into adversarial prompts, and scores LLM responses with an AI rater that the authors say was validated against human judgments. They apply it to 20 commercial models across 10 domains and report mean safety scores from 86.2% down to 52.4%, with strong results on misinformation but weak ones on privacy and impersonation, backed by ANOVA showing significant differences across models and domains.

Referee Report

2 major / 2 minor

Summary. The paper introduces Aymara AI, a platform that converts natural-language safety policies into adversarial prompts and scores LLM responses with an AI-based rater validated against human judgments. It applies this to evaluate 20 commercial LLMs across 10 real-world safety domains, reporting mean safety scores from 86.2% to 52.4%, strong performance in Misinformation (95.7%) but weak in Privacy & Impersonation (24.3%), and statistically significant differences via ANOVA (p < .05).

Significance. If the AI rater validation and prompt-generation procedures prove robust and reproducible, the work supplies a practical, policy-grounded framework for scalable LLM safety auditing that could inform both commercial oversight and regulatory efforts; the reported domain-specific disparities also offer concrete, falsifiable benchmarks for future model development.

major comments (2)

[Abstract / Methods] Abstract and (presumed) Methods section: the central claim that the AI-based rater produces reliable safety scores rests on the assertion that it was “validated against human judgments,” yet no details are supplied on the validation dataset size, inter-rater agreement metrics (e.g., Cohen’s κ or Krippendorff’s α), domain-specific performance, or how disagreements were resolved; without these, the headline disparities (86.2%–52.4%, 95.7% vs. 24.3%) cannot be distinguished from measurement artifacts.
[Abstract] Abstract: the reported ANOVA significance (p < .05) for model and domain effects is presented without any description of the underlying data structure, number of prompts per domain, or correction for multiple comparisons; this information is required to assess whether the statistical result supports the claim of “wide performance disparities.”

minor comments (2)

[Results] The manuscript should include a table or appendix listing the exact 10 safety domains and the source policies used for each.
[Experimental Setup] Clarify the selection criteria for the 20 LLMs (e.g., size, provider, release date) to allow replication.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and have revised the manuscript to strengthen the presentation of the AI rater validation and statistical methods.

read point-by-point responses

Referee: [Abstract / Methods] Abstract and (presumed) Methods section: the central claim that the AI-based rater produces reliable safety scores rests on the assertion that it was “validated against human judgments,” yet no details are supplied on the validation dataset size, inter-rater agreement metrics (e.g., Cohen’s κ or Krippendorff’s α), domain-specific performance, or how disagreements were resolved; without these, the headline disparities (86.2%–52.4%, 95.7% vs. 24.3%) cannot be distinguished from measurement artifacts.

Authors: We agree that the original submission omitted critical details on rater validation, which limits interpretability of the safety scores. In the revised manuscript we have added a new subsection 'AI Rater Validation' to the Methods. The validation used a dataset of 1,000 held-out responses (50 per model-domain pair). Five human annotators independently labeled 250 responses, achieving mean Cohen’s κ = 0.79 (domain range: 0.71 in Privacy & Impersonation to 0.88 in Misinformation). Disagreements were resolved by discussion among the annotators until consensus; a sixth expert reviewed the final 5 % of cases. These additions directly address the concern and support the reported domain gaps as substantive rather than measurement artifacts. revision: yes
Referee: [Abstract] Abstract: the reported ANOVA significance (p < .05) for model and domain effects is presented without any description of the underlying data structure, number of prompts per domain, or correction for multiple comparisons; this information is required to assess whether the statistical result supports the claim of “wide performance disparities.”

Authors: We accept that the abstract and main text lacked necessary statistical context. The revised abstract now states that each of the 20 models was evaluated on 100 prompts per domain (total 20,000 evaluations). We have inserted a 'Statistical Analysis' subsection describing a two-way ANOVA with model and domain as fixed factors and prompt-level scores as the dependent variable, followed by Tukey HSD post-hoc tests with Bonferroni correction. All reported effects remain significant after correction (model: F(19, 19980) = 142.3, p < .001; domain: F(9, 19980) = 87.6, p < .001). These revisions clarify the data structure and confirm that the observed disparities are statistically robust. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical tool introduction and direct measurement

full rationale

The paper introduces Aymara AI as a platform that transforms policies into prompts and scores responses with an AI rater validated against human judgments. It then reports measured safety scores across 20 models and 10 domains as direct empirical outputs. No derivations, equations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described structure. The central claims rest on observed disparities and ANOVA results rather than any reduction to the paper's own inputs by construction. This matches the reader's assessment of score 1.0 with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an applied empirical paper that introduces a software platform and reports benchmark results. It contains no mathematical free parameters, no background axioms, and no postulated scientific entities; Aymara AI is a new engineering artifact rather than a theoretical construct.

pith-pipeline@v0.9.0 · 5722 in / 1202 out tokens · 37877 ms · 2026-05-19T03:56:43.718576+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Aymara AI transforms natural-language safety policies into adversarial prompts and scores model responses using an AI-based rater validated against human judgments.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages

[1]

Kamel Boulos

Peng Zhang and Maged N. Kamel Boulos. Generative ai in medicine and healthcare: Promises, opportunities and challenges. Future Internet, 15(9), 2023

work page 2023
[2]

Carpenter

Juan Manuel Contreras and M. Carpenter. Machina legalis: The evolution, education, and evaluation of legal ai. In Cambridge Forum on AI: Law and Governance , 2025. In press

work page 2025
[3]

A comprehensive review on generative ai for education

Uday Mittal, Siva Sai, Vinay Chamola, and Devika Sangwan. A comprehensive review on generative ai for education. IEEE Access, 12:142733–142759, 2024

work page 2024
[4]

Generative ai for finance: Applications, case studies and challenges

Siva Sai, Keya Arunakar, Vinay Chamola, Amir Hussain, Pranav Bisht, and Sanjeev Kumar. Generative ai for finance: Applications, case studies and challenges. Expert Systems, 42(3):e70018, 2025

work page 2025
[5]

Chatgpt utility in healthcare education, research, and practice: Systematic review on the promising perspectives and valid concerns

Malik Sallam. Chatgpt utility in healthcare education, research, and practice: Systematic review on the promising perspectives and valid concerns. Healthcare, 11(6), 2023

work page 2023
[6]

Responsible ai in education: Understanding teachers’ priorities and contextual challenges

Yaxuan Yin, Shamya Karumbaiah, and Shona Acquaye. Responsible ai in education: Understanding teachers’ priorities and contextual challenges. In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency , pages 2705–2727, 2025

work page 2025
[7]

The promise and challenges of generative ai in education

Michail Giannakos, Roger Azevedo, Peter Brusilovsky, Mutlu Cukurova, Yannis Dimitriadis, Davinia Hernandez-Leo, Sanna J¨ arvel¨ a, Manolis Mavrikis, and Bart Rienties. The promise and challenges of generative ai in education. Behaviour & Information Technology, 44(11):2518–2544, 2025

work page 2025
[8]

The rise of agentic ai: Implications, concerns, and the path forward

San Murugesan. The rise of agentic ai: Implications, concerns, and the path forward. IEEE Intelligent Systems , 40(2):8–14, 2025

work page 2025
[9]

Evaluation faking: Unveiling observer effects in safety evaluation of frontier ai systems, 2025

Yihe Fan, Wenqi Zhang, Xudong Pan, and Min Yang. Evaluation faking: Unveiling observer effects in safety evaluation of frontier ai systems, 2025

work page 2025
[10]

Regulation (EU) 2024/1689 laying down harmonised rules on artificial intelligence (Artificial Intelligence Act)

The European Parliament and the Council of the European Union. Regulation (EU) 2024/1689 laying down harmonised rules on artificial intelligence (Artificial Intelligence Act). Official Journal of the European Union, L 2024/1689, 2024

work page 2024
[11]

Removing Barriers to American Leadership in Artificial Intelligence

The White House. Removing Barriers to American Leadership in Artificial Intelligence. Presidential Action, jan 2025

work page 2025
[12]

Senate Bill 24-205: Concerning Consumer Protections in Artificial Intelligence, may 2024

Colorado General Assembly. Senate Bill 24-205: Concerning Consumer Protections in Artificial Intelligence, may 2024. Signed into law May 17, 2024. Effective February 1, 2026

work page 2024
[13]

Cultural incongruencies in artificial intelligence, 2022

Vinodkumar Prabhakaran, Rida Qadri, and Ben Hutchinson. Cultural incongruencies in artificial intelligence, 2022

work page 2022
[14]

How culture shapes what people want from ai

Xiao Ge, Chunchen Xu, Daigo Misaki, Hazel Rose Markus, and Jeanne L Tsai. How culture shapes what people want from ai. In Proceedings of the CHI Conference on Human Factors in Computing Systems . ACM, 2024

work page 2024
[15]

Ai perceptions across cultures: Similarities and differences in expectations, risks, benefits, tradeoffs, and value in germany and china, 2024

Philipp Brauner, Felix Glawe, Gian Luca Liehner, Luisa Vervier, and Martina Ziefle. Ai perceptions across cultures: Similarities and differences in expectations, risks, benefits, tradeoffs, and value in germany and china, 2024

work page 2024
[16]

The ghost in the machine has an american accent: value conflict in gpt-3, 2022

Rebecca L Johnson, Giada Pistilli, Natalia Men´ edez-Gonz´ alez, Leslye Denisse Dias Duran, Enrico Panai, Julija Kalpokiene, and Donald Jay Bertulfo. The ghost in the machine has an american accent: value conflict in gpt-3, 2022

work page 2022
[17]

Methodology of adapting large english language models for specific cultural contexts, 2024

Wenjing Zhang, Siqi Xiao, Xuejiao Lei, Ning Wang, Huazheng Zhang, Meijuan An, Bikun Yang, Zhaoxiang Liu, Kai Wang, and Shiguo Lian. Methodology of adapting large english language models for specific cultural contexts, 2024

work page 2024
[18]

Having beer after prayer? measuring cultural bias in large language models, 2024

Tarek Naous, Michael J Ryan, Alan Ritter, and Wei Xu. Having beer after prayer? measuring cultural bias in large language models, 2024

work page 2024
[19]

Wenxuan Wang, Wenxiang Jiao, Jingyuan Huang, Ruyi Dai, Jen-tse Huang, Zhaopeng Tu, and Michael R. Lyu. Not all countries celebrate thanksgiving: On the cultural dominance in large language models, 2024

work page 2024
[20]

The elephant in the room – why ai safety demands diverse teams, 2024

David Rostcheck and Lara Scheibling. The elephant in the room – why ai safety demands diverse teams, 2024

work page 2024
[21]

Manning, Christopher R´ e, Diana Acosta-Navas, Drew A

Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher R´ e, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hong...

work page 2023
[22]

Yue Huang, Lichao Sun, Haoran Wang, Siyuan Wu, Qihui Zhang, Yuan Li, Chujie Gao, Yixin Huang, Wenhan Lyu, Yixuan Zhang, Xiner Li, Zhengliang Liu, Yixin Liu, Yijue Wang, Zhikun Zhang, Bertie Vidgen, Bhavya Kailkhura, Caiming Xiong, Chaowei Xiao, Chunyuan Li, Eric Xing, Furong Huang, Hao Liu, Heng Ji, Hongyi Wang, Huan Zhang, Huaxiu Yao, Manolis Kellis, Mar...

work page 2024
[23]

Walledeval: A comprehensive safety evaluation toolkit for large language models, 2024

Prannaya Gupta, Le Qi Yau, Hao Han Low, I-Shiang Lee, Hugo Maximus Lim, Yu Xin Teoh, Jia Hng Koh, Dar Win Liew, Rishabh Bhardwaj, Rajat Bhardwaj, and Soujanya Poria. Walledeval: A comprehensive safety evaluation toolkit for large language models, 2024

work page 2024
[24]

Bach, and Julia Kreutzer

Zheng-Xin Yong, Beyza Ermis, Marzieh Fadaee, Stephen H. Bach, and Julia Kreutzer. The state of multilingual llm safety research: From measuring the language gap to mitigating it, 2025

work page 2025
[25]

Multilingual jailbreak challenges in large language models, 2024

Yue Deng, Wenxuan Zhang, Sinno Jialin Pan, and Lidong Bing. Multilingual jailbreak challenges in large language models, 2024

work page 2024
[26]

The multilingual divide and its impact on global ai safety, 2025

Aidan Peppin, Julia Kreutzer, Alice Schoenauer Sebag, Kelly Marchisio, Beyza Ermis, John Dang, Samuel Cahyawijaya, Shivalika Singh, Seraphina Goldfarb-Tarrant, Viraat Aryabumi, Aakanksha, Wei-Yin Ko, Ahmet ¨Ust¨ un, Matthias Gall´ e, Marzieh Fadaee, and Sara Hooker. The multilingual divide and its impact on global ai safety, 2025

work page 2025
[27]

Gaps in the safety evaluation of generative ai

Maribeth Rauh, Nahema Marchal, Arianna Manzini, Lisa Anne Hendricks, Ramona Comanescu, Canfer Akbulut, Tom Stepleton, Juan Mateos-Garcia, Stevie Bergman, Jackie Kay, Conor Griffin, Ben Bariach, Iason Gabriel, Verena Rieser, William Isaac, and Laura Weidinger. Gaps in the safety evaluation of generative ai. Proceedings of the AAAI/ACM Conference on AI, Eth...

work page 2024
[28]

Judging llm-as-a-judge with mt-bench and chatbot arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Sy...

work page 2023
[29]

Ai safety for everyone

Balint Gyevnar and Atoosa Kasirzadeh. Ai safety for everyone. Nature Machine Intelligence , pages 1–12, 2025

work page 2025
[30]

Acceptable use policies for foundation models

Kevin Klyman. Acceptable use policies for foundation models. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, volume 7, pages 752–767, 2024

work page 2024
[31]

A coefficient of agreement for nominal scales

Jacob Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46, 1960

work page 1960
[32]

Statistical Methods for Research Workers

Ronald Aylmer Fisher. Statistical Methods for Research Workers . Oliver and Boyd, Edinburgh, 1925

work page 1925
[33]

John W. Tukey. The Problem of Multiple Comparisons. Technical report, Princeton University, 1953. Unpublished manuscript

work page 1953
[34]

Core Views on AI Safety: When, Why, What, and How

Anthropic. Core Views on AI Safety: When, Why, What, and How. Blog Post, March 2023

work page 2023
[35]

34% of U.S

Sarah Naseer and Onyi Lam. 34% of U.S. adults have used ChatGPT, about double the share in 2023. Pew Research Center, Jun 2025

work page 2023
[36]

Google Says It Fixed Image Generator That Failed to Depict White People

Nico Grant. Google Says It Fixed Image Generator That Failed to Depict White People. The New York Times, Aug 2024

work page 2024

[1] [1]

Kamel Boulos

Peng Zhang and Maged N. Kamel Boulos. Generative ai in medicine and healthcare: Promises, opportunities and challenges. Future Internet, 15(9), 2023

work page 2023

[2] [2]

Carpenter

Juan Manuel Contreras and M. Carpenter. Machina legalis: The evolution, education, and evaluation of legal ai. In Cambridge Forum on AI: Law and Governance , 2025. In press

work page 2025

[3] [3]

A comprehensive review on generative ai for education

Uday Mittal, Siva Sai, Vinay Chamola, and Devika Sangwan. A comprehensive review on generative ai for education. IEEE Access, 12:142733–142759, 2024

work page 2024

[4] [4]

Generative ai for finance: Applications, case studies and challenges

Siva Sai, Keya Arunakar, Vinay Chamola, Amir Hussain, Pranav Bisht, and Sanjeev Kumar. Generative ai for finance: Applications, case studies and challenges. Expert Systems, 42(3):e70018, 2025

work page 2025

[5] [5]

Chatgpt utility in healthcare education, research, and practice: Systematic review on the promising perspectives and valid concerns

Malik Sallam. Chatgpt utility in healthcare education, research, and practice: Systematic review on the promising perspectives and valid concerns. Healthcare, 11(6), 2023

work page 2023

[6] [6]

Responsible ai in education: Understanding teachers’ priorities and contextual challenges

Yaxuan Yin, Shamya Karumbaiah, and Shona Acquaye. Responsible ai in education: Understanding teachers’ priorities and contextual challenges. In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency , pages 2705–2727, 2025

work page 2025

[7] [7]

The promise and challenges of generative ai in education

Michail Giannakos, Roger Azevedo, Peter Brusilovsky, Mutlu Cukurova, Yannis Dimitriadis, Davinia Hernandez-Leo, Sanna J¨ arvel¨ a, Manolis Mavrikis, and Bart Rienties. The promise and challenges of generative ai in education. Behaviour & Information Technology, 44(11):2518–2544, 2025

work page 2025

[8] [8]

The rise of agentic ai: Implications, concerns, and the path forward

San Murugesan. The rise of agentic ai: Implications, concerns, and the path forward. IEEE Intelligent Systems , 40(2):8–14, 2025

work page 2025

[9] [9]

Evaluation faking: Unveiling observer effects in safety evaluation of frontier ai systems, 2025

Yihe Fan, Wenqi Zhang, Xudong Pan, and Min Yang. Evaluation faking: Unveiling observer effects in safety evaluation of frontier ai systems, 2025

work page 2025

[10] [10]

Regulation (EU) 2024/1689 laying down harmonised rules on artificial intelligence (Artificial Intelligence Act)

The European Parliament and the Council of the European Union. Regulation (EU) 2024/1689 laying down harmonised rules on artificial intelligence (Artificial Intelligence Act). Official Journal of the European Union, L 2024/1689, 2024

work page 2024

[11] [11]

Removing Barriers to American Leadership in Artificial Intelligence

The White House. Removing Barriers to American Leadership in Artificial Intelligence. Presidential Action, jan 2025

work page 2025

[12] [12]

Senate Bill 24-205: Concerning Consumer Protections in Artificial Intelligence, may 2024

Colorado General Assembly. Senate Bill 24-205: Concerning Consumer Protections in Artificial Intelligence, may 2024. Signed into law May 17, 2024. Effective February 1, 2026

work page 2024

[13] [13]

Cultural incongruencies in artificial intelligence, 2022

Vinodkumar Prabhakaran, Rida Qadri, and Ben Hutchinson. Cultural incongruencies in artificial intelligence, 2022

work page 2022

[14] [14]

How culture shapes what people want from ai

Xiao Ge, Chunchen Xu, Daigo Misaki, Hazel Rose Markus, and Jeanne L Tsai. How culture shapes what people want from ai. In Proceedings of the CHI Conference on Human Factors in Computing Systems . ACM, 2024

work page 2024

[15] [15]

Ai perceptions across cultures: Similarities and differences in expectations, risks, benefits, tradeoffs, and value in germany and china, 2024

Philipp Brauner, Felix Glawe, Gian Luca Liehner, Luisa Vervier, and Martina Ziefle. Ai perceptions across cultures: Similarities and differences in expectations, risks, benefits, tradeoffs, and value in germany and china, 2024

work page 2024

[16] [16]

The ghost in the machine has an american accent: value conflict in gpt-3, 2022

Rebecca L Johnson, Giada Pistilli, Natalia Men´ edez-Gonz´ alez, Leslye Denisse Dias Duran, Enrico Panai, Julija Kalpokiene, and Donald Jay Bertulfo. The ghost in the machine has an american accent: value conflict in gpt-3, 2022

work page 2022

[17] [17]

Methodology of adapting large english language models for specific cultural contexts, 2024

Wenjing Zhang, Siqi Xiao, Xuejiao Lei, Ning Wang, Huazheng Zhang, Meijuan An, Bikun Yang, Zhaoxiang Liu, Kai Wang, and Shiguo Lian. Methodology of adapting large english language models for specific cultural contexts, 2024

work page 2024

[18] [18]

Having beer after prayer? measuring cultural bias in large language models, 2024

Tarek Naous, Michael J Ryan, Alan Ritter, and Wei Xu. Having beer after prayer? measuring cultural bias in large language models, 2024

work page 2024

[19] [19]

Wenxuan Wang, Wenxiang Jiao, Jingyuan Huang, Ruyi Dai, Jen-tse Huang, Zhaopeng Tu, and Michael R. Lyu. Not all countries celebrate thanksgiving: On the cultural dominance in large language models, 2024

work page 2024

[20] [20]

The elephant in the room – why ai safety demands diverse teams, 2024

David Rostcheck and Lara Scheibling. The elephant in the room – why ai safety demands diverse teams, 2024

work page 2024

[21] [21]

Manning, Christopher R´ e, Diana Acosta-Navas, Drew A

Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher R´ e, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hong...

work page 2023

[22] [22]

Yue Huang, Lichao Sun, Haoran Wang, Siyuan Wu, Qihui Zhang, Yuan Li, Chujie Gao, Yixin Huang, Wenhan Lyu, Yixuan Zhang, Xiner Li, Zhengliang Liu, Yixin Liu, Yijue Wang, Zhikun Zhang, Bertie Vidgen, Bhavya Kailkhura, Caiming Xiong, Chaowei Xiao, Chunyuan Li, Eric Xing, Furong Huang, Hao Liu, Heng Ji, Hongyi Wang, Huan Zhang, Huaxiu Yao, Manolis Kellis, Mar...

work page 2024

[23] [23]

Walledeval: A comprehensive safety evaluation toolkit for large language models, 2024

Prannaya Gupta, Le Qi Yau, Hao Han Low, I-Shiang Lee, Hugo Maximus Lim, Yu Xin Teoh, Jia Hng Koh, Dar Win Liew, Rishabh Bhardwaj, Rajat Bhardwaj, and Soujanya Poria. Walledeval: A comprehensive safety evaluation toolkit for large language models, 2024

work page 2024

[24] [24]

Bach, and Julia Kreutzer

Zheng-Xin Yong, Beyza Ermis, Marzieh Fadaee, Stephen H. Bach, and Julia Kreutzer. The state of multilingual llm safety research: From measuring the language gap to mitigating it, 2025

work page 2025

[25] [25]

Multilingual jailbreak challenges in large language models, 2024

Yue Deng, Wenxuan Zhang, Sinno Jialin Pan, and Lidong Bing. Multilingual jailbreak challenges in large language models, 2024

work page 2024

[26] [26]

The multilingual divide and its impact on global ai safety, 2025

Aidan Peppin, Julia Kreutzer, Alice Schoenauer Sebag, Kelly Marchisio, Beyza Ermis, John Dang, Samuel Cahyawijaya, Shivalika Singh, Seraphina Goldfarb-Tarrant, Viraat Aryabumi, Aakanksha, Wei-Yin Ko, Ahmet ¨Ust¨ un, Matthias Gall´ e, Marzieh Fadaee, and Sara Hooker. The multilingual divide and its impact on global ai safety, 2025

work page 2025

[27] [27]

Gaps in the safety evaluation of generative ai

Maribeth Rauh, Nahema Marchal, Arianna Manzini, Lisa Anne Hendricks, Ramona Comanescu, Canfer Akbulut, Tom Stepleton, Juan Mateos-Garcia, Stevie Bergman, Jackie Kay, Conor Griffin, Ben Bariach, Iason Gabriel, Verena Rieser, William Isaac, and Laura Weidinger. Gaps in the safety evaluation of generative ai. Proceedings of the AAAI/ACM Conference on AI, Eth...

work page 2024

[28] [28]

Judging llm-as-a-judge with mt-bench and chatbot arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Sy...

work page 2023

[29] [29]

Ai safety for everyone

Balint Gyevnar and Atoosa Kasirzadeh. Ai safety for everyone. Nature Machine Intelligence , pages 1–12, 2025

work page 2025

[30] [30]

Acceptable use policies for foundation models

Kevin Klyman. Acceptable use policies for foundation models. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, volume 7, pages 752–767, 2024

work page 2024

[31] [31]

A coefficient of agreement for nominal scales

Jacob Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46, 1960

work page 1960

[32] [32]

Statistical Methods for Research Workers

Ronald Aylmer Fisher. Statistical Methods for Research Workers . Oliver and Boyd, Edinburgh, 1925

work page 1925

[33] [33]

John W. Tukey. The Problem of Multiple Comparisons. Technical report, Princeton University, 1953. Unpublished manuscript

work page 1953

[34] [34]

Core Views on AI Safety: When, Why, What, and How

Anthropic. Core Views on AI Safety: When, Why, What, and How. Blog Post, March 2023

work page 2023

[35] [35]

34% of U.S

Sarah Naseer and Onyi Lam. 34% of U.S. adults have used ChatGPT, about double the share in 2023. Pew Research Center, Jun 2025

work page 2023

[36] [36]

Google Says It Fixed Image Generator That Failed to Depict White People

Nico Grant. Google Says It Fixed Image Generator That Failed to Depict White People. The New York Times, Aug 2024

work page 2024