arxiv: 2604.03250 · v1 · submitted 2026-03-10 · 💻 cs.CY

Ethical Implications of Training Deceptive AI

Pith reviewed 2026-05-15 12:50 UTC · model grok-4.3

classification 💻 cs.CY
keywords deceptive AI · AI ethics · research governance · risk classification · AI safety · biosafety levels · ethical frameworks

The pith

This paper proposes a Deception Research Levels framework that classifies deceptive AI research by risk across five ethical dimensions and scales safeguards accordingly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to fill the regulatory gap where the EU AI Act bans deployment of deceptive AI but leaves research largely unstructured and exempt. It introduces a four-tier system modeled on biosafety levels, evaluating each research project on Pillar Implication, Severity, Reversibility, Scale, and Vulnerability. Classification uses a highest-dimension-wins rule, triggering cumulative protections from basic documentation at the lowest tier to regulatory notification and third-party audits at the highest. A dual-development rule at tiers three and four requires researchers to build detection and mitigation tools at the same time as any deceptive capability. Readers would care because the framework aims to permit beneficial and defensive work while keeping the potential for harm proportional to the oversight required.

Core claim

The DRL framework classifies deceptive algorithm research into one of four risk levels by scoring mechanisms on five dimensions drawn from the AI4People ethical framework and applying a highest-dimension-wins rule; levels carry cumulative safeguards from standard documentation at DRL-1 to regulatory notification and third-party security audits at DRL-4, with a dual-development mandate requiring simultaneous creation of detection and mitigation methods at DRL-3 and above.

What carries the argument

The Deception Research Levels (DRL) system, which scores deceptive mechanisms on five dimensions and assigns the single highest risk level indicated by any dimension to determine required safeguards.
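
The aggregation step described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: it assumes each dimension has already been rated on an integer scale from 1 to 4 (the summary does not say how those ratings are produced), and the names `DIMENSIONS`, `SAFEGUARDS_ADDED`, `classify`, and `required_safeguards` are hypothetical. Only the safeguards named in this summary are listed; the DRL-2 entry is left empty because its safeguards are not stated here.

```python
from enum import IntEnum

# The five dimensions named in the summary (identifier spellings are assumptions).
DIMENSIONS = ("pillar_implication", "severity", "reversibility", "scale", "vulnerability")


class DRL(IntEnum):
    DRL_1 = 1
    DRL_2 = 2
    DRL_3 = 3
    DRL_4 = 4


# Safeguards that each level adds on top of the levels below it.
# Only items mentioned in the summary appear; DRL-2 is not specified there.
SAFEGUARDS_ADDED = {
    DRL.DRL_1: ["standard documentation"],
    DRL.DRL_2: [],
    DRL.DRL_3: ["dual-development mandate"],
    DRL.DRL_4: ["regulatory notification", "third-party security audit"],
}


def classify(scores: dict) -> DRL:
    """Highest-dimension-wins: the level is the maximum per-dimension rating."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimension scores: {missing}")
    # DRL(...) raises ValueError if a rating falls outside 1..4.
    return DRL(max(scores[d] for d in DIMENSIONS))


def required_safeguards(level: DRL) -> list:
    """Cumulative safeguards: everything added at this level and below."""
    required = []
    for tier in DRL:
        if tier <= level:
            required.extend(SAFEGUARDS_ADDED[tier])
    return required


# Hypothetical ratings for an imaginary project.
example = {
    "pillar_implication": 1,
    "severity": 2,
    "reversibility": 3,
    "scale": 1,
    "vulnerability": 1,
}
level = classify(example)
print(level.name, required_safeguards(level))
```

In this sketch a single dimension rated 3 is enough to place the project at DRL-3, regardless of how low the other four ratings are, which is the conservative behaviour the highest-dimension-wins rule is meant to produce.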

Load-bearing premise

The five dimensions grounded in the AI4People framework, together with the highest-dimension-wins rule, supply a sufficient and non-arbitrary basis for assigning proportional safeguards.

What would settle it

A demonstration that two research projects with matching dimension scores produce materially different real-world harms, or that following the framework's safeguards still permits significant harm, would undermine the classification method.

Original abstract

Deceptive behavior in AI systems is no longer theoretical: large language models strategically mislead without producing false statements, maintain deceptive strategies through safety training, and coordinate deception in multi-agent settings. While the European Union's AI Act prohibits deployment of deceptive AI systems, it explicitly exempts research and development, creating a necessary but unstructured space in which no established framework governs how deception research should be conducted or how risk should scale with capability. This paper proposes a Deception Research Levels (DRL) framework, a classification system for deceptive algorithm research modeled on the Biosafety Level system used in biological research. The DRL framework classifies research by risk profile rather than researcher intent, assessing deceptive mechanisms across five dimensions grounded in the AI4People ethical framework: Pillar Implication, Severity, Reversibility, Scale, and Vulnerability. Classification follows a "highest dimension wins" approach, assigning one of four risk levels with cumulative safeguards ranging from standard documentation at DRL-1 to regulatory notification and third-party security audits at DRL-4. A dual-development mandate at DRL-3 and above requires that detection and mitigation methods be developed alongside any deceptive capability. We apply the framework to eight case studies spanning all four levels and demonstrate that ecological validity of the deceptive mechanism emerges as a consistent, non-independent indicator of classification level. The DRL framework is intended to fill the governance gap between regulated deployment and unstructured research, supporting both beneficial applications and defensive research under conditions where safeguards are proportional to the potential for harm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a Deception Research Levels (DRL) framework for classifying deceptive AI research, modeled on biosafety levels. It assesses risk across five dimensions drawn from the AI4People ethical framework (Pillar Implication, Severity, Reversibility, Scale, Vulnerability) using a highest-dimension-wins rule to assign one of four levels, each with cumulative safeguards ranging from basic documentation at DRL-1 to regulatory notification and third-party audits at DRL-4. A dual-development mandate requires concurrent work on detection and mitigation methods at DRL-3 and above. The framework is illustrated via eight case studies spanning all levels, with ecological validity of the deceptive mechanism identified as a consistent indicator of classification.

Significance. If equipped with reproducible scoring criteria, the DRL framework would provide a structured, proportional governance mechanism for the regulatory gap between prohibited deceptive AI deployment (e.g., under the EU AI Act) and unstructured research. The biosafety-level analogy and dual-mandate element offer a practical model for balancing innovation with harm prevention; the case studies supply concrete illustrations that could guide oversight bodies if the classification procedure is made operational.

major comments (2)
  1. [DRL framework description] DRL framework definition: The five dimensions are presented without scoring rubrics, quantitative thresholds, or decision procedures for rating Pillar Implication, Severity, Reversibility, Scale, or Vulnerability. Consequently the highest-dimension-wins rule reduces to unguided qualitative judgment; two independent assessors can defensibly assign different levels to identical research, so the claimed proportionality of safeguards cannot be guaranteed.
  2. [Case studies section] Case studies application: The eight case studies report final DRL assignments and note ecological validity as an indicator, but supply no explicit per-dimension scores or step-by-step justification for how the highest-dimension rule was applied. This leaves the framework's practical reproducibility untested.
minor comments (2)
  1. [Abstract] The abstract states the framework is 'grounded in the AI4People ethical framework' but does not specify which elements of AI4People are adopted versus adapted; a brief mapping table would improve traceability.
  2. [Framework overview] Notation for the four levels (DRL-1 through DRL-4) and the dual-development mandate is introduced clearly but could be summarized in a single table for quick reference.
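
The first major comment turns on how sensitive a maximum-based rule is to disagreement on a single dimension. The sketch below makes that concrete under stated assumptions: ratings are integers from 1 to 4, both sets of ratings are invented for illustration, and the helper names `level_and_drivers` and `compare_assessors` are hypothetical rather than anything defined in the paper. It also records which dimensions drive the final level, which is the kind of per-dimension trace the second major comment asks for.

```python
# Dimension names follow the summary; identifier spellings are assumptions.
DIMENSIONS = ("pillar_implication", "severity", "reversibility", "scale", "vulnerability")


def level_and_drivers(scores):
    """Return the highest rating and the dimensions that reach it."""
    level = max(scores[d] for d in DIMENSIONS)
    drivers = [d for d in DIMENSIONS if scores[d] == level]
    return level, drivers


def compare_assessors(scores_a, scores_b):
    """Summarise where two assessors disagree and whether the final level changes."""
    level_a, drivers_a = level_and_drivers(scores_a)
    level_b, drivers_b = level_and_drivers(scores_b)
    return {
        "level_a": level_a,
        "drivers_a": drivers_a,
        "level_b": level_b,
        "drivers_b": drivers_b,
        "disagreements": [d for d in DIMENSIONS if scores_a[d] != scores_b[d]],
        "levels_differ": level_a != level_b,
    }


# Hypothetical ratings of the same project by two assessors.
assessor_a = {"pillar_implication": 2, "severity": 2, "reversibility": 1, "scale": 2, "vulnerability": 2}
assessor_b = {"pillar_implication": 2, "severity": 2, "reversibility": 1, "scale": 2, "vulnerability": 3}

report = compare_assessors(assessor_a, assessor_b)
for key, value in report.items():
    print(f"{key}: {value}")
```

Here the assessors agree on four of five dimensions, yet the one-step difference on vulnerability moves the project from level 2 to level 3. Under the framework as summarised, that is also the boundary at which the dual-development mandate begins to apply.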

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on the Deception Research Levels (DRL) framework. We agree that greater specificity in the classification criteria would improve the framework's utility and reproducibility. Below we respond to each major comment and outline the revisions we will make.

Point-by-point responses
  1. Referee: DRL framework definition: The five dimensions are presented without scoring rubrics, quantitative thresholds, or decision procedures for rating Pillar Implication, Severity, Reversibility, Scale, or Vulnerability. Consequently the highest-dimension-wins rule reduces to unguided qualitative judgment; two independent assessors can defensibly assign different levels to identical research, so the claimed proportionality of safeguards cannot be guaranteed.

    Authors: We recognize the validity of this observation. The framework as presented relies on expert judgment informed by the five dimensions, similar to other ethical assessment tools. However, to enhance reproducibility, we will add a new subsection detailing scoring rubrics for each dimension. These rubrics will provide qualitative criteria and examples drawn from the case studies to guide consistent application of the highest-dimension-wins rule. We will also discuss potential inter-rater reliability in the limitations section. revision: yes

  2. Referee: Case studies application: The eight case studies report final DRL assignments and note ecological validity as an indicator, but supply no explicit per-dimension scores or step-by-step justification for how the highest-dimension rule was applied. This leaves the framework's practical reproducibility untested.

    Authors: We agree that the case studies lack sufficient detail on the scoring process. We will revise this section to include explicit ratings for each of the five dimensions for every case study, along with a narrative justification for the final level assignment. This will demonstrate the application of the framework and address concerns about reproducibility. revision: yes
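
The rebuttal mentions discussing inter-rater reliability. One common way to quantify it is Cohen's kappa, which compares observed agreement between two raters against the agreement expected by chance. The sketch below is an assumption about how such a check could look, not something the paper or the rebuttal specifies: the function name `cohens_kappa` is hypothetical and the two lists of level assignments for eight case studies are invented. Plain kappa treats levels as unordered categories; a weighted variant may suit ordinal levels better.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning one category per item."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must rate the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same level by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: sum over categories of the product of marginal frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical category; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical DRL assignments for eight case studies by two independent assessors.
assessor_a = [1, 1, 2, 2, 3, 3, 4, 4]
assessor_b = [1, 1, 2, 3, 3, 3, 4, 4]

kappa = cohens_kappa(assessor_a, assessor_b)
print(f"kappa = {kappa:.3f}")
```

With only eight items, any such estimate would carry wide uncertainty, so it is better read as a sanity check on the rubrics than as evidence of reliability.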

Circularity Check

0 steps flagged

No significant circularity in the DRL framework proposal

Full rationale

The paper proposes the DRL classification system as an explicit new construction modeled on the external Biosafety Level framework and grounded in the AI4People ethical dimensions (Pillar Implication, Severity, Reversibility, Scale, Vulnerability). The 'highest dimension wins' rule and cumulative safeguards are presented as definitional choices for the framework itself, with no equations, fitted parameters, or self-citations that reduce the classification to its own inputs by construction. The eight case studies apply the framework illustratively but do not serve as load-bearing derivations or create self-referential loops. The central claim remains a governance proposal whose content is independent of any internal reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The proposal rests on the assumption that the AI4People ethical pillars can be operationalized into the five risk dimensions and that the highest-wins aggregation rule yields defensible classifications; no numerical parameters are fitted.

axioms (2)
  • domain assumption: The AI4People ethical framework supplies an appropriate and complete basis for defining risk dimensions in deceptive AI research.
    The five dimensions are explicitly grounded in this external framework.
  • ad hoc to paper: A highest-dimension-wins rule across the five dimensions produces classifications that are proportional to actual harm potential.
    This aggregation method is introduced by the paper without further justification in the abstract.
invented entities (1)
  • Deception Research Levels (DRL) framework (no independent evidence)
    purpose: To classify deceptive AI research by risk and assign cumulative safeguards.
    Newly proposed four-level system not present in prior work.

pith-pipeline@v0.9.0 · 5564 in / 1486 out tokens · 47984 ms · 2026-05-15T12:50:15.786160+00:00 · methodology

