Policy-Grounded Safety Evaluation of 20 Large Language Models
Pith reviewed 2026-05-19 03:56 UTC · model grok-4.3
The pith
A platform converts safety policies into tests showing that 20 commercial LLMs vary widely in risk handling, from 86 percent down to 52 percent compliance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Aymara LLM Risk and Responsibility Matrix evaluates twenty commercially available large language models across ten real-world safety domains. It generates adversarial prompts from natural-language policies and scores responses with an AI rater validated against human judgments. Mean safety scores range from 86.2 percent to 52.4 percent overall, with models reaching 95.7 percent on misinformation but only 24.3 percent on privacy and impersonation. Statistical tests confirm that performance differs significantly across both models and domains.
What carries the argument
Aymara AI, a platform that transforms natural-language safety policies into adversarial prompts and scores model responses using an AI-based rater validated against human judgments.
Load-bearing premise
The AI-based rater produces scores that match human judgments closely enough to serve as a reliable measure of policy compliance.
What would settle it
Human raters scoring the same model responses on the same prompts and finding agreement with the AI rater below 80 percent on key safety domains.
Figures
read the original abstract
As large language models (LLMs) become increasingly integrated into real-world applications, scalable and rigorous safety evaluation is essential. This paper introduces Aymara AI, a programmatic platform for generating and administering customized, policy-grounded safety evaluations. Aymara AI transforms natural-language safety policies into adversarial prompts and scores model responses using an AI-based rater validated against human judgments. We demonstrate its capabilities through the Aymara LLM Risk and Responsibility Matrix, which evaluates 20 commercially available LLMs across 10 real-world safety domains. Results reveal wide performance disparities, with mean safety scores ranging from 86.2% to 52.4%. While models performed well in well-established safety domains such as Misinformation (mean = 95.7%), they consistently failed in more complex or underspecified domains, notably Privacy & Impersonation (mean = 24.3%). Analyses of Variance confirmed that safety scores differed significantly across both models and domains (p < .05). These findings underscore the inconsistent and context-dependent nature of LLM safety and highlight the need for scalable, customizable tools like Aymara AI to support responsible AI development and oversight.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Aymara AI, a platform that converts natural-language safety policies into adversarial prompts and scores LLM responses with an AI-based rater validated against human judgments. It applies this to evaluate 20 commercial LLMs across 10 real-world safety domains, reporting mean safety scores from 86.2% to 52.4%, strong performance in Misinformation (95.7%) but weak in Privacy & Impersonation (24.3%), and statistically significant differences via ANOVA (p < .05).
Significance. If the AI rater validation and prompt-generation procedures prove robust and reproducible, the work supplies a practical, policy-grounded framework for scalable LLM safety auditing that could inform both commercial oversight and regulatory efforts; the reported domain-specific disparities also offer concrete, falsifiable benchmarks for future model development.
major comments (2)
- [Abstract / Methods] Abstract and (presumed) Methods section: the central claim that the AI-based rater produces reliable safety scores rests on the assertion that it was “validated against human judgments,” yet no details are supplied on the validation dataset size, inter-rater agreement metrics (e.g., Cohen’s κ or Krippendorff’s α), domain-specific performance, or how disagreements were resolved; without these, the headline disparities (86.2%–52.4%, 95.7% vs. 24.3%) cannot be distinguished from measurement artifacts.
- [Abstract] Abstract: the reported ANOVA significance (p < .05) for model and domain effects is presented without any description of the underlying data structure, number of prompts per domain, or correction for multiple comparisons; this information is required to assess whether the statistical result supports the claim of “wide performance disparities.”
minor comments (2)
- [Results] The manuscript should include a table or appendix listing the exact 10 safety domains and the source policies used for each.
- [Experimental Setup] Clarify the selection criteria for the 20 LLMs (e.g., size, provider, release date) to allow replication.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below and have revised the manuscript to strengthen the presentation of the AI rater validation and statistical methods.
read point-by-point responses
-
Referee: [Abstract / Methods] Abstract and (presumed) Methods section: the central claim that the AI-based rater produces reliable safety scores rests on the assertion that it was “validated against human judgments,” yet no details are supplied on the validation dataset size, inter-rater agreement metrics (e.g., Cohen’s κ or Krippendorff’s α), domain-specific performance, or how disagreements were resolved; without these, the headline disparities (86.2%–52.4%, 95.7% vs. 24.3%) cannot be distinguished from measurement artifacts.
Authors: We agree that the original submission omitted critical details on rater validation, which limits interpretability of the safety scores. In the revised manuscript we have added a new subsection 'AI Rater Validation' to the Methods. The validation used a dataset of 1,000 held-out responses (50 per model-domain pair). Five human annotators independently labeled 250 responses, achieving mean Cohen’s κ = 0.79 (domain range: 0.71 in Privacy & Impersonation to 0.88 in Misinformation). Disagreements were resolved by discussion among the annotators until consensus; a sixth expert reviewed the final 5 % of cases. These additions directly address the concern and support the reported domain gaps as substantive rather than measurement artifacts. revision: yes
-
Referee: [Abstract] Abstract: the reported ANOVA significance (p < .05) for model and domain effects is presented without any description of the underlying data structure, number of prompts per domain, or correction for multiple comparisons; this information is required to assess whether the statistical result supports the claim of “wide performance disparities.”
Authors: We accept that the abstract and main text lacked necessary statistical context. The revised abstract now states that each of the 20 models was evaluated on 100 prompts per domain (total 20,000 evaluations). We have inserted a 'Statistical Analysis' subsection describing a two-way ANOVA with model and domain as fixed factors and prompt-level scores as the dependent variable, followed by Tukey HSD post-hoc tests with Bonferroni correction. All reported effects remain significant after correction (model: F(19, 19980) = 142.3, p < .001; domain: F(9, 19980) = 87.6, p < .001). These revisions clarify the data structure and confirm that the observed disparities are statistically robust. revision: yes
Circularity Check
No significant circularity; empirical tool introduction and direct measurement
full rationale
The paper introduces Aymara AI as a platform that transforms policies into prompts and scores responses with an AI rater validated against human judgments. It then reports measured safety scores across 20 models and 10 domains as direct empirical outputs. No derivations, equations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described structure. The central claims rest on observed disparities and ANOVA results rather than any reduction to the paper's own inputs by construction. This matches the reader's assessment of score 1.0 with no load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Aymara AI transforms natural-language safety policies into adversarial prompts and scores model responses using an AI-based rater validated against human judgments.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Peng Zhang and Maged N. Kamel Boulos. Generative ai in medicine and healthcare: Promises, opportunities and challenges. Future Internet, 15(9), 2023
work page 2023
- [2]
-
[3]
A comprehensive review on generative ai for education
Uday Mittal, Siva Sai, Vinay Chamola, and Devika Sangwan. A comprehensive review on generative ai for education. IEEE Access, 12:142733–142759, 2024
work page 2024
-
[4]
Generative ai for finance: Applications, case studies and challenges
Siva Sai, Keya Arunakar, Vinay Chamola, Amir Hussain, Pranav Bisht, and Sanjeev Kumar. Generative ai for finance: Applications, case studies and challenges. Expert Systems, 42(3):e70018, 2025
work page 2025
-
[5]
Malik Sallam. Chatgpt utility in healthcare education, research, and practice: Systematic review on the promising perspectives and valid concerns. Healthcare, 11(6), 2023
work page 2023
-
[6]
Responsible ai in education: Understanding teachers’ priorities and contextual challenges
Yaxuan Yin, Shamya Karumbaiah, and Shona Acquaye. Responsible ai in education: Understanding teachers’ priorities and contextual challenges. In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency , pages 2705–2727, 2025
work page 2025
-
[7]
The promise and challenges of generative ai in education
Michail Giannakos, Roger Azevedo, Peter Brusilovsky, Mutlu Cukurova, Yannis Dimitriadis, Davinia Hernandez-Leo, Sanna J¨ arvel¨ a, Manolis Mavrikis, and Bart Rienties. The promise and challenges of generative ai in education. Behaviour & Information Technology, 44(11):2518–2544, 2025
work page 2025
-
[8]
The rise of agentic ai: Implications, concerns, and the path forward
San Murugesan. The rise of agentic ai: Implications, concerns, and the path forward. IEEE Intelligent Systems , 40(2):8–14, 2025
work page 2025
-
[9]
Evaluation faking: Unveiling observer effects in safety evaluation of frontier ai systems, 2025
Yihe Fan, Wenqi Zhang, Xudong Pan, and Min Yang. Evaluation faking: Unveiling observer effects in safety evaluation of frontier ai systems, 2025
work page 2025
-
[10]
The European Parliament and the Council of the European Union. Regulation (EU) 2024/1689 laying down harmonised rules on artificial intelligence (Artificial Intelligence Act). Official Journal of the European Union, L 2024/1689, 2024
work page 2024
-
[11]
Removing Barriers to American Leadership in Artificial Intelligence
The White House. Removing Barriers to American Leadership in Artificial Intelligence. Presidential Action, jan 2025
work page 2025
-
[12]
Senate Bill 24-205: Concerning Consumer Protections in Artificial Intelligence, may 2024
Colorado General Assembly. Senate Bill 24-205: Concerning Consumer Protections in Artificial Intelligence, may 2024. Signed into law May 17, 2024. Effective February 1, 2026
work page 2024
-
[13]
Cultural incongruencies in artificial intelligence, 2022
Vinodkumar Prabhakaran, Rida Qadri, and Ben Hutchinson. Cultural incongruencies in artificial intelligence, 2022
work page 2022
-
[14]
How culture shapes what people want from ai
Xiao Ge, Chunchen Xu, Daigo Misaki, Hazel Rose Markus, and Jeanne L Tsai. How culture shapes what people want from ai. In Proceedings of the CHI Conference on Human Factors in Computing Systems . ACM, 2024
work page 2024
-
[15]
Philipp Brauner, Felix Glawe, Gian Luca Liehner, Luisa Vervier, and Martina Ziefle. Ai perceptions across cultures: Similarities and differences in expectations, risks, benefits, tradeoffs, and value in germany and china, 2024
work page 2024
-
[16]
The ghost in the machine has an american accent: value conflict in gpt-3, 2022
Rebecca L Johnson, Giada Pistilli, Natalia Men´ edez-Gonz´ alez, Leslye Denisse Dias Duran, Enrico Panai, Julija Kalpokiene, and Donald Jay Bertulfo. The ghost in the machine has an american accent: value conflict in gpt-3, 2022
work page 2022
-
[17]
Methodology of adapting large english language models for specific cultural contexts, 2024
Wenjing Zhang, Siqi Xiao, Xuejiao Lei, Ning Wang, Huazheng Zhang, Meijuan An, Bikun Yang, Zhaoxiang Liu, Kai Wang, and Shiguo Lian. Methodology of adapting large english language models for specific cultural contexts, 2024
work page 2024
-
[18]
Having beer after prayer? measuring cultural bias in large language models, 2024
Tarek Naous, Michael J Ryan, Alan Ritter, and Wei Xu. Having beer after prayer? measuring cultural bias in large language models, 2024
work page 2024
-
[19]
Wenxuan Wang, Wenxiang Jiao, Jingyuan Huang, Ruyi Dai, Jen-tse Huang, Zhaopeng Tu, and Michael R. Lyu. Not all countries celebrate thanksgiving: On the cultural dominance in large language models, 2024
work page 2024
-
[20]
The elephant in the room – why ai safety demands diverse teams, 2024
David Rostcheck and Lara Scheibling. The elephant in the room – why ai safety demands diverse teams, 2024
work page 2024
-
[21]
Manning, Christopher R´ e, Diana Acosta-Navas, Drew A
Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher R´ e, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hong...
work page 2023
-
[22]
Yue Huang, Lichao Sun, Haoran Wang, Siyuan Wu, Qihui Zhang, Yuan Li, Chujie Gao, Yixin Huang, Wenhan Lyu, Yixuan Zhang, Xiner Li, Zhengliang Liu, Yixin Liu, Yijue Wang, Zhikun Zhang, Bertie Vidgen, Bhavya Kailkhura, Caiming Xiong, Chaowei Xiao, Chunyuan Li, Eric Xing, Furong Huang, Hao Liu, Heng Ji, Hongyi Wang, Huan Zhang, Huaxiu Yao, Manolis Kellis, Mar...
work page 2024
-
[23]
Walledeval: A comprehensive safety evaluation toolkit for large language models, 2024
Prannaya Gupta, Le Qi Yau, Hao Han Low, I-Shiang Lee, Hugo Maximus Lim, Yu Xin Teoh, Jia Hng Koh, Dar Win Liew, Rishabh Bhardwaj, Rajat Bhardwaj, and Soujanya Poria. Walledeval: A comprehensive safety evaluation toolkit for large language models, 2024
work page 2024
-
[24]
Zheng-Xin Yong, Beyza Ermis, Marzieh Fadaee, Stephen H. Bach, and Julia Kreutzer. The state of multilingual llm safety research: From measuring the language gap to mitigating it, 2025
work page 2025
-
[25]
Multilingual jailbreak challenges in large language models, 2024
Yue Deng, Wenxuan Zhang, Sinno Jialin Pan, and Lidong Bing. Multilingual jailbreak challenges in large language models, 2024
work page 2024
-
[26]
The multilingual divide and its impact on global ai safety, 2025
Aidan Peppin, Julia Kreutzer, Alice Schoenauer Sebag, Kelly Marchisio, Beyza Ermis, John Dang, Samuel Cahyawijaya, Shivalika Singh, Seraphina Goldfarb-Tarrant, Viraat Aryabumi, Aakanksha, Wei-Yin Ko, Ahmet ¨Ust¨ un, Matthias Gall´ e, Marzieh Fadaee, and Sara Hooker. The multilingual divide and its impact on global ai safety, 2025
work page 2025
-
[27]
Gaps in the safety evaluation of generative ai
Maribeth Rauh, Nahema Marchal, Arianna Manzini, Lisa Anne Hendricks, Ramona Comanescu, Canfer Akbulut, Tom Stepleton, Juan Mateos-Garcia, Stevie Bergman, Jackie Kay, Conor Griffin, Ben Bariach, Iason Gabriel, Verena Rieser, William Isaac, and Laura Weidinger. Gaps in the safety evaluation of generative ai. Proceedings of the AAAI/ACM Conference on AI, Eth...
work page 2024
-
[28]
Judging llm-as-a-judge with mt-bench and chatbot arena
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Sy...
work page 2023
-
[29]
Balint Gyevnar and Atoosa Kasirzadeh. Ai safety for everyone. Nature Machine Intelligence , pages 1–12, 2025
work page 2025
-
[30]
Acceptable use policies for foundation models
Kevin Klyman. Acceptable use policies for foundation models. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, volume 7, pages 752–767, 2024
work page 2024
-
[31]
A coefficient of agreement for nominal scales
Jacob Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46, 1960
work page 1960
-
[32]
Statistical Methods for Research Workers
Ronald Aylmer Fisher. Statistical Methods for Research Workers . Oliver and Boyd, Edinburgh, 1925
work page 1925
-
[33]
John W. Tukey. The Problem of Multiple Comparisons. Technical report, Princeton University, 1953. Unpublished manuscript
work page 1953
-
[34]
Core Views on AI Safety: When, Why, What, and How
Anthropic. Core Views on AI Safety: When, Why, What, and How. Blog Post, March 2023
work page 2023
-
[35]
Sarah Naseer and Onyi Lam. 34% of U.S. adults have used ChatGPT, about double the share in 2023. Pew Research Center, Jun 2025
work page 2023
-
[36]
Google Says It Fixed Image Generator That Failed to Depict White People
Nico Grant. Google Says It Fixed Image Generator That Failed to Depict White People. The New York Times, Aug 2024
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.