Ethical Implications of Training Deceptive AI
Pith reviewed 2026-05-15 12:50 UTC · model grok-4.3
The pith
This paper proposes a Deception Research Levels framework that classifies deceptive AI research by risk across five ethical dimensions and scales safeguards accordingly.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The DRL framework scores each deceptive mechanism on five dimensions drawn from the AI4People ethical framework and assigns one of four risk levels under a highest-dimension-wins rule. Safeguards are cumulative, running from standard documentation at DRL-1 to regulatory notification and third-party security audits at DRL-4. A dual-development mandate at DRL-3 and above requires that detection and mitigation methods be developed alongside any deceptive capability.
What carries the argument
The Deception Research Levels (DRL) system, which scores deceptive mechanisms on five dimensions and assigns the single highest risk level indicated by any dimension to determine required safeguards.
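The highest-dimension-wins rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's procedure: the integer 1-to-4 scale per dimension, the dictionary keys, and the names `classify_drl` and `requires_dual_development` are hypothetical, introduced here only to make the rule concrete.

```python
from typing import Mapping

# The five dimension names come from the paper; the snake_case keys are
# an assumption made for this sketch.
DIMENSIONS = (
    "pillar_implication",
    "severity",
    "reversibility",
    "scale",
    "vulnerability",
)


def classify_drl(scores: Mapping[str, int]) -> int:
    """Return the DRL level as the highest score across the five dimensions.

    Assumes each dimension is scored on an integer scale from 1 to 4 that
    maps directly onto DRL-1 through DRL-4 (a hypothetical scale).
    """
    missing = set(DIMENSIONS) - set(scores)
    extra = set(scores) - set(DIMENSIONS)
    if missing or extra:
        raise ValueError(
            f"expected exactly the five dimensions; "
            f"missing={sorted(missing)}, extra={sorted(extra)}"
        )
    for name, value in scores.items():
        if value not in (1, 2, 3, 4):
            raise ValueError(f"{name} must be scored 1-4, got {value!r}")
    # Highest dimension wins: one high score decides the level.
    return max(scores.values())


def requires_dual_development(level: int) -> bool:
    """The dual-development mandate applies at DRL-3 and above."""
    return level >= 3


if __name__ == "__main__":
    # Hypothetical scores for an imaginary project.
    example = {
        "pillar_implication": 1,
        "severity": 2,
        "reversibility": 3,
        "scale": 1,
        "vulnerability": 2,
    }
    level = classify_drl(example)
    print(f"DRL-{level}, dual development required: "
          f"{requires_dual_development(level)}")
```

In this example a single dimension scored at 3 places the whole project at DRL-3 and triggers the dual-development mandate, even though every other dimension scores lower.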
Load-bearing premise
The five dimensions grounded in the AI4People framework, together with the highest-dimension-wins rule, supply a sufficient and non-arbitrary basis for assigning proportional safeguards.
What would settle it
A demonstration that two research projects with matching dimension scores produce materially different real-world harms, or that following the framework's safeguards still permits significant harm, would undermine the classification method.
Original abstract
Deceptive behavior in AI systems is no longer theoretical: large language models strategically mislead without producing false statements, maintain deceptive strategies through safety training, and coordinate deception in multi-agent settings. While the European Union's AI Act prohibits deployment of deceptive AI systems, it explicitly exempts research and development, creating a necessary but unstructured space in which no established framework governs how deception research should be conducted or how risk should scale with capability. This paper proposes a Deception Research Levels (DRL) framework, a classification system for deceptive algorithm research modeled on the Biosafety Level system used in biological research. The DRL framework classifies research by risk profile rather than researcher intent, assessing deceptive mechanisms across five dimensions grounded in the AI4People ethical framework: Pillar Implication, Severity, Reversibility, Scale, and Vulnerability. Classification follows a "highest dimension wins" approach, assigning one of four risk levels with cumulative safeguards ranging from standard documentation at DRL-1 to regulatory notification and third-party security audits at DRL-4. A dual-development mandate at DRL-3 and above requires that detection and mitigation methods be developed alongside any deceptive capability. We apply the framework to eight case studies spanning all four levels and demonstrate that ecological validity of the deceptive mechanism emerges as a consistent, non-independent indicator of classification level. The DRL framework is intended to fill the governance gap between regulated deployment and unstructured research, supporting both beneficial applications and defensive research under conditions where safeguards are proportional to the potential for harm.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a Deception Research Levels (DRL) framework for classifying deceptive AI research, modeled on biosafety levels. It assesses risk across five dimensions drawn from the AI4People ethical framework (Pillar Implication, Severity, Reversibility, Scale, Vulnerability) using a highest-dimension-wins rule to assign one of four levels, each with cumulative safeguards ranging from basic documentation at DRL-1 to regulatory notification and third-party audits at DRL-4. A dual-development mandate requires concurrent work on detection and mitigation methods at DRL-3 and above. The framework is illustrated via eight case studies spanning all levels, with ecological validity of the deceptive mechanism identified as a consistent indicator of classification.
Significance. If equipped with reproducible scoring criteria, the DRL framework would provide a structured, proportional governance mechanism for the regulatory gap between prohibited deceptive AI deployment (e.g., under the EU AI Act) and unstructured research. The biosafety-level analogy and dual-mandate element offer a practical model for balancing innovation with harm prevention; the case studies supply concrete illustrations that could guide oversight bodies if the classification procedure is made operational.
major comments (2)
- [DRL framework description] DRL framework definition: The five dimensions are presented without scoring rubrics, quantitative thresholds, or decision procedures for rating Pillar Implication, Severity, Reversibility, Scale, or Vulnerability. Consequently the highest-dimension-wins rule reduces to unguided qualitative judgment; two independent assessors can defensibly assign different levels to identical research, so the claimed proportionality of safeguards cannot be guaranteed.
- [Case studies section] Case studies application: The eight case studies report final DRL assignments and note ecological validity as an indicator, but supply no explicit per-dimension scores or step-by-step justification for how the highest-dimension rule was applied. This leaves the framework's practical reproducibility untested.
minor comments (2)
- [Abstract] The abstract states the framework is 'grounded in the AI4People ethical framework' but does not specify which elements of AI4People are adopted versus adapted; a brief mapping table would improve traceability.
- [Framework overview] Notation for the four levels (DRL-1 through DRL-4) and the dual-development mandate is introduced clearly but could be summarized in a single table for quick reference.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on the Deception Research Levels (DRL) framework. We agree that greater specificity in the classification criteria would improve the framework's utility and reproducibility. Below we respond to each major comment and outline the revisions we will make.
Point-by-point responses
- Referee: DRL framework definition: The five dimensions are presented without scoring rubrics, quantitative thresholds, or decision procedures for rating Pillar Implication, Severity, Reversibility, Scale, or Vulnerability. Consequently the highest-dimension-wins rule reduces to unguided qualitative judgment; two independent assessors can defensibly assign different levels to identical research, so the claimed proportionality of safeguards cannot be guaranteed.
  Authors: We recognize the validity of this observation. The framework as presented relies on expert judgment informed by the five dimensions, similar to other ethical assessment tools. However, to enhance reproducibility, we will add a new subsection detailing scoring rubrics for each dimension. These rubrics will provide qualitative criteria and examples drawn from the case studies to guide consistent application of the highest-dimension-wins rule. We will also discuss potential inter-rater reliability in the limitations section.
  Revision: yes
- Referee: Case studies application: The eight case studies report final DRL assignments and note ecological validity as an indicator, but supply no explicit per-dimension scores or step-by-step justification for how the highest-dimension rule was applied. This leaves the framework's practical reproducibility untested.
  Authors: We agree that the case studies lack sufficient detail on the scoring process. We will revise this section to include explicit ratings for each of the five dimensions for every case study, along with a narrative justification for the final level assignment. This will demonstrate the application of the framework and address concerns about reproducibility.
  Revision: yes
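The inter-rater reliability raised in this exchange could be quantified with a standard agreement statistic. The sketch below computes Cohen's kappa for two assessors' final DRL assignments. The choice of Cohen's kappa, the function name `cohens_kappa`, and the example ratings are assumptions for illustration only; none of them come from the paper or the rebuttal.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_a: Sequence[int], rater_b: Sequence[int]) -> float:
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each rater's
    marginal label frequencies.
    """
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; agreement is total.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical DRL levels assigned by two assessors to eight projects.
    # These numbers are placeholders, not the paper's case-study results.
    assessor_1 = [1, 1, 2, 2, 3, 3, 4, 4]
    assessor_2 = [1, 1, 2, 3, 3, 3, 4, 4]
    print(f"kappa = {cohens_kappa(assessor_1, assessor_2):.3f}")
```

Because the level is the maximum over five dimensions, a one-step disagreement on a single dimension is enough to change the final assignment, so agreement on the final level is arguably the quantity worth reporting.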
Circularity Check
No significant circularity in the DRL framework proposal
Full rationale
The paper proposes the DRL classification system as an explicit new construction modeled on the external Biosafety Level framework and grounded in the AI4People ethical dimensions (Pillar Implication, Severity, Reversibility, Scale, Vulnerability). The 'highest dimension wins' rule and cumulative safeguards are presented as definitional choices for the framework itself, with no equations, fitted parameters, or self-citations that reduce the classification to its own inputs by construction. The eight case studies apply the framework illustratively but do not serve as load-bearing derivations or create self-referential loops. The central claim remains a governance proposal whose content is independent of any internal reduction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: The AI4People ethical framework supplies an appropriate and complete basis for defining risk dimensions in deceptive AI research.
- ad hoc to paper: A highest-dimension-wins rule across the five dimensions produces classifications that are proportional to actual harm potential.
invented entities (1)
- Deception Research Levels (DRL) framework: no independent evidence
Reference graph
Works this paper leans on
- [1] Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 laying down harmonised rules on artificial intelligence (Artificial Intelligence Act), 2024
- [2] Matthew Aitchison, Lyndon Benke, and Penny Sweetser. Learning to deceive in multi-agent hidden role games. In Stefan Sarkadi, Benjamin Wright, Peta Masters, and Peter McBurney, editors, Deceptive AI: First International Workshop, DeceptECAI 2020, and Second International Workshop, DeceptAI 2021, Proceedings, volume 1296 of Communications in Computer and I...
- [3] Chittaranjan Andrade. Internal, external, and ecological validity in research design, conduct, and evaluation. Indian Journal of Psychological Medicine, 40(5):498–499, 2018
- [4] Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, and Owain Evans. Emergent misalignment: Narrow finetuning can produce broadly misaligned LLMs, 2025
- [5] Thomas L. Carson. Lying and Deception: Theory and Practice. Oxford University Press, 2010
- [6] Pedro M. P. Curvo. The traitors: Deception and trust in multi-agent language model simulations, 2025
- [7] Carson Denison, Monte MacDiarmid, Fazl Barez, David Duvenaud, Shauna Kravec, Samuel Marks, Nicholas Schiefer, Ryan Soklaski, Alex Tamkin, Jared Kaplan, Buck Shlegeris, Samuel R. Bowman, Ethan Perez, and Evan Hubinger. Sycophancy to subterfuge: Investigating reward-tampering in large language models, 2024
- [8] Richard Dewey, János Botyánszki, Ciamac C. Moallemi, and Andrew T. Zheng. Outbidding and outbluffing elite humans: Mastering liar’s poker via self-play and reinforcement learning, 2025
- [9] Atharvan Dogra, Krishna Pillutla, Ameet Deshpande, Ananya B. Sai, John J. Nay, Tanmay Rajpurohit, Ashwin Kalyan, and Balaraman Ravindran. Language models can subtly deceive without lying: A case study on strategic phrasing in legislation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pa...
- [10] Luciano Floridi, Josh Cowls, Monica Beltrametti, Raja Chatila, Patrice Chazerand, Virginia Dignum, Christoph Luetge, Robert Madelin, Ugo Pagallo, Francesca Rossi, Burkhard Schafer, Peggy Valcke, and Effy Vayena. AI4People—an ethical framework for a good AI society: Opportunities, risks, principles, and recommendations. Minds and Machines, 28(4):689–707, 2018
- [11] Thilo Hagendorff. Deception abilities emerged in large language models. Proceedings of the National Academy of Sciences, 121(24):e2317967121, 2024
- [12] Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam Jermyn, Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova DasSarma, Roger Grosse, Shauna ...
- [13]
- [14] Scenario forecast published by the AI Futures Project
- [15] James Edwin Mahon. The definition of lying and deception. In Edward N. Zalta, editor, The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University, winter 2016 edition, 2016
- [16] Alexander Meinke, Bronson Schoen, Jérémy Scheurer, Mikita Balesni, Rusheb Shah, and Marius Hobbhahn. Frontier models are capable of in-context scheming, 2024
- [17] Meta Fundamental AI Research Diplomacy Team (FAIR), Anton Bakhtin, Noam Brown, Emily Dinan, Gabriele Farina, Colin Flaherty, Daniel Fried, Andrew Goff, Jonathan Gray, Hengyuan Hu, Athul Paul Jacob, Mojtaba Komeili, Karthik Konath, Minae Kwon, Adam Lerer, Mike Lewis, Alexander H. Miller, Sasha Mitts, Adithya Renduchintala, Stephen Roller, Dirk Rowe, Weiyan...
- [18] Peter S. Park, Simon Goldstein, Aidan O’Gara, Michael Chen, and Dan Hendrycks. AI deception: A survey of examples, risks, and potential solutions, 2023
- [19] Benedict Schoen, Ekaterina Nitishinskaya, Mikita Balesni, Alexander Højmark, Fabien Hofstätter, Jérémy Scheurer, Alexander Meinke, James Wolfe, Tom van der Weij, Alex Lloyd, Nicholas Goldowsky-Dill, Andy Fan, Alexander Matveiakin, Rusheb Shah, Murray Williams, Amelia Glaese, Boaz Barak, Wojciech Zaremba, and Marius Hobbhahn. Stress testing deliberative ...
- [20] William R. Shadish, Thomas D. Cook, and Donald T. Campbell. Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Houghton Mifflin, 2002
- [21] Jason Starace and Terence Soule. Behavioral inference at scale: The fundamental asymmetry between motivations and belief systems, 2026
- [22] Jason Starace and Terence Soule. Intentional deception as controllable capability in LLM agents, 2026
- [23] Jason Starace, Jennie Tafoya, Anmol Singh, and Terence Soule. Deceptive algorithms in games: A systematic literature review. Entertainment Computing, page 101078, 2026
- [24] U.S. Department of Health and Human Services, Centers for Disease Control and Prevention, and National Institutes of Health. Biosafety in Microbiological and Biomedical Laboratories, 6th edition, 2020
- [25] Francis Ward, Francesca Toni, Francesco Belardinelli, and Tom Everitt. Honesty is the best policy: Defining and mitigating AI deception. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 2313–2341. Curran Associates, Inc., 2023
- [26] Zelai Xu, Wanjun Gu, Chao Yu, Yi Wu, and Yu Wang. Learning strategic language agents in the werewolf game with iterative latent space policy optimization. In Proceedings of the 42nd International Conference on Machine Learning (ICML), 2025
A Key Definitions
A.1 Deception
Ward et al. [24] provide a functional definition of deception grounded in the phil...
- [27] The algorithm’s outputs intentionally influence the target’s actions or acceptance of information
- [28] The target accepts ϕ and ϕ is false
- [29] The algorithm does not operate as though ϕ is true; its behavior would differ if ϕ were actually true.
This definition is functional: “acceptance” means acting as though a proposition is true, not possessing a mental belief. The definition includes algorithms that deceive through direct false statements, strategic omission, technically true but misleading...
- [30] The agent acts as though it observes ϕ is true;
- [31] The agent would act differently if it observed ϕ is false.
If an agent does not respond differently to ϕ being true or false, its belief about ϕ is unidentifiable from its behavior. An agent accepts ϕ if it acts as though it is certain ϕ is true. This formulation avoids Theory of Mind claims about AI systems and provides a natural distinction between belief...