Ethical Implications of Training Deceptive AI

Bert Baumgaertner; Jason Starace; Terence Soule

arxiv: 2604.03250 · v1 · submitted 2026-03-10 · 💻 cs.CY

Pith reviewed 2026-05-15 12:50 UTC · model grok-4.3

classification 💻 cs.CY

keywords deceptive AI · AI ethics · research governance · risk classification · AI safety · biosafety levels · ethical frameworks

The pith

This paper proposes a Deception Research Levels framework that classifies deceptive AI research by risk across five ethical dimensions and scales safeguards accordingly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to fill the regulatory gap left by the EU AI Act, which bans deployment of deceptive AI but exempts research and leaves it largely unstructured. It introduces a four-tier system modeled on biosafety levels, evaluating each research project on Pillar Implication, Severity, Reversibility, Scale, and Vulnerability. Classification uses a highest-dimension-wins rule, triggering cumulative protections from basic documentation at the lowest tier to regulatory notification and third-party audits at the highest. A dual-development rule at tiers three and four requires researchers to build detection and mitigation tools at the same time as any deceptive capability. Readers would care because the framework aims to permit beneficial and defensive work while keeping oversight proportional to the potential for harm.

Core claim

The DRL framework classifies deceptive algorithm research into one of four risk levels by scoring mechanisms on five dimensions drawn from the AI4People ethical framework and applying a highest-dimension-wins rule; levels carry cumulative safeguards from standard documentation at DRL-1 to regulatory notification and third-party security audits at DRL-4, with a dual-development mandate requiring simultaneous creation of detection and mitigation methods at DRL-3 and above.
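
The highest-dimension-wins rule can be sketched in a few lines of Python. This is a minimal illustration, not the authors' procedure: it assumes each dimension has already been rated on the same 1 to 4 scale as the DRL levels, which this summary does not confirm, and the names `DIMENSIONS` and `classify_drl` are hypothetical.

```python
from typing import Dict

# The five dimensions named in the paper. The snake_case keys are illustrative.
DIMENSIONS = (
    "pillar_implication",
    "severity",
    "reversibility",
    "scale",
    "vulnerability",
)


def classify_drl(ratings: Dict[str, int]) -> int:
    """Return the DRL level as the highest rating across all five dimensions.

    Assumption: every dimension is rated 1..4, matching DRL-1..DRL-4.
    """
    missing = [d for d in DIMENSIONS if d not in ratings]
    if missing:
        raise ValueError(f"missing ratings for: {missing}")
    for d in DIMENSIONS:
        if ratings[d] not in (1, 2, 3, 4):
            raise ValueError(f"rating for {d} must be 1..4, got {ratings[d]!r}")
    # Highest dimension wins: one high-risk dimension sets the overall level.
    return max(ratings[d] for d in DIMENSIONS)
```

Under these assumptions, a project rated 1 on four dimensions and 3 on Vulnerability is classified DRL-3, so a single high-risk dimension cannot be averaged away by low ratings elsewhere.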

What carries the argument

The Deception Research Levels (DRL) system, which scores deceptive mechanisms on five dimensions and assigns the single highest risk level indicated by any dimension to determine required safeguards.
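
How an assigned level could translate into cumulative safeguards is sketched below. Only the DRL-1 and DRL-4 safeguards and the dual-development mandate are named in this summary, so the DRL-2 and DRL-3 entries are left as explicit placeholders rather than filled in. The names `LEVEL_SAFEGUARDS` and `required_safeguards` are hypothetical.

```python
from typing import Dict, List

# Safeguards introduced at each level. DRL-2 and DRL-3 are placeholders:
# this summary does not say what those levels add.
LEVEL_SAFEGUARDS: Dict[int, List[str]] = {
    1: ["standard documentation"],
    2: ["<DRL-2 safeguards: not specified in this summary>"],
    3: ["<DRL-3 safeguards: not specified in this summary>"],
    4: ["regulatory notification", "third-party security audit"],
}

# The dual-development mandate applies at DRL-3 and above.
DUAL_DEVELOPMENT_FROM = 3
DUAL_DEVELOPMENT = (
    "dual development: detection and mitigation methods "
    "built alongside the deceptive capability"
)


def required_safeguards(level: int) -> List[str]:
    """Collect safeguards cumulatively: level N inherits levels 1..N-1."""
    if level not in LEVEL_SAFEGUARDS:
        raise ValueError(f"level must be 1..4, got {level!r}")
    required: List[str] = []
    for lower in range(1, level + 1):
        required.extend(LEVEL_SAFEGUARDS[lower])
    if level >= DUAL_DEVELOPMENT_FROM:
        required.append(DUAL_DEVELOPMENT)
    return required
```

The loop over lower levels is what makes the safeguards cumulative: a DRL-4 project in this sketch still owes the standard documentation required at DRL-1.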

Load-bearing premise

The five dimensions grounded in the AI4People framework, together with the highest-dimension-wins rule, supply a sufficient and non-arbitrary basis for assigning proportional safeguards.

What would settle it

A demonstration that two research projects with matching dimension scores produce materially different real-world harms, or that following the framework's safeguards still permits significant harm, would undermine the classification method.

Original abstract

Deceptive behavior in AI systems is no longer theoretical: large language models strategically mislead without producing false statements, maintain deceptive strategies through safety training, and coordinate deception in multi-agent settings. While the European Union's AI Act prohibits deployment of deceptive AI systems, it explicitly exempts research and development, creating a necessary but unstructured space in which no established framework governs how deception research should be conducted or how risk should scale with capability. This paper proposes a Deception Research Levels (DRL) framework, a classification system for deceptive algorithm research modeled on the Biosafety Level system used in biological research. The DRL framework classifies research by risk profile rather than researcher intent, assessing deceptive mechanisms across five dimensions grounded in the AI4People ethical framework: Pillar Implication, Severity, Reversibility, Scale, and Vulnerability. Classification follows a "highest dimension wins" approach, assigning one of four risk levels with cumulative safeguards ranging from standard documentation at DRL-1 to regulatory notification and third-party security audits at DRL-4. A dual-development mandate at DRL-3 and above requires that detection and mitigation methods be developed alongside any deceptive capability. We apply the framework to eight case studies spanning all four levels and demonstrate that ecological validity of the deceptive mechanism emerges as a consistent, non-independent indicator of classification level. The DRL framework is intended to fill the governance gap between regulated deployment and unstructured research, supporting both beneficial applications and defensive research under conditions where safeguards are proportional to the potential for harm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper gives a workable structure for scaling safeguards on deception research, but its classification rules are too vague to apply reliably.

The main thing here is a Deception Research Levels framework that sorts AI deception work into four tiers using five dimensions drawn from the AI4People ethics list, with the highest dimension setting the level and a requirement to build detection methods at the upper tiers. It fills the gap the EU AI Act leaves open between banned deployment and unregulated research, and the eight case studies show how the tiers might land on real examples. The note that ecological validity tracks with higher risk levels is a concrete observation worth keeping. That part is new enough and the logic holds together internally.

The weakness is that none of the five dimensions comes with scoring rubrics, thresholds, or decision steps. Without those, two labs can look at the same project and land on different DRL numbers, so the promised proportional safeguards do not actually follow from the rules. The case studies illustrate outcomes but do not check whether the system produces consistent or safer decisions in practice. This is a normative design paper, not a tested method. It is aimed at people who set lab policies or write governance rules for AI. The thinking is clear and the problem is real, so it should go to peer review, where reviewers can push on the missing operational details and see whether the framework can be made usable.

Referee Report

2 major / 2 minor

Summary. The paper proposes a Deception Research Levels (DRL) framework for classifying deceptive AI research, modeled on biosafety levels. It assesses risk across five dimensions drawn from the AI4People ethical framework (Pillar Implication, Severity, Reversibility, Scale, Vulnerability) using a highest-dimension-wins rule to assign one of four levels, each with cumulative safeguards ranging from basic documentation at DRL-1 to regulatory notification and third-party audits at DRL-4. A dual-development mandate requires concurrent work on detection and mitigation methods at DRL-3 and above. The framework is illustrated via eight case studies spanning all levels, with ecological validity of the deceptive mechanism identified as a consistent indicator of classification.

Significance. If equipped with reproducible scoring criteria, the DRL framework would provide a structured, proportional governance mechanism for the regulatory gap between prohibited deceptive AI deployment (e.g., under the EU AI Act) and unstructured research. The biosafety-level analogy and dual-mandate element offer a practical model for balancing innovation with harm prevention; the case studies supply concrete illustrations that could guide oversight bodies if the classification procedure is made operational.

major comments (2)

[DRL framework description] DRL framework definition: The five dimensions are presented without scoring rubrics, quantitative thresholds, or decision procedures for rating Pillar Implication, Severity, Reversibility, Scale, or Vulnerability. Consequently the highest-dimension-wins rule reduces to unguided qualitative judgment; two independent assessors can defensibly assign different levels to identical research, so the claimed proportionality of safeguards cannot be guaranteed.
[Case studies section] Case studies application: The eight case studies report final DRL assignments and note ecological validity as an indicator, but supply no explicit per-dimension scores or step-by-step justification for how the highest-dimension rule was applied. This leaves the framework's practical reproducibility untested.
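
The per-dimension reporting requested in the second major comment could take a form like the sketch below, which records every rating and marks the dimension or dimensions that set the level. It assumes ratings on a 1 to 4 scale matching the DRL levels. The function names are hypothetical, and any ratings passed in are supplied by the assessor rather than taken from the paper.

```python
from typing import Dict, List, Tuple


def explain_classification(ratings: Dict[str, int]) -> Tuple[int, List[str]]:
    """Return the level and the dimensions whose rating equals that level."""
    if not ratings:
        raise ValueError("at least one dimension rating is required")
    level = max(ratings.values())
    # More than one dimension can tie for the highest rating.
    deciding = sorted(name for name, score in ratings.items() if score == level)
    return level, deciding


def format_trace(case_name: str, ratings: Dict[str, int]) -> str:
    """Render a short, auditable trace of how the level was reached."""
    level, deciding = explain_classification(ratings)
    lines = [f"{case_name}: DRL-{level}"]
    for name in sorted(ratings):
        marker = " (sets the level)" if name in deciding else ""
        lines.append(f"  {name}: {ratings[name]}{marker}")
    return "\n".join(lines)
```

A trace of this kind lets a second assessor see exactly where a disagreement would change the outcome: only a different rating on a marked dimension, or a higher rating on an unmarked one, moves the level.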

minor comments (2)

[Abstract] The abstract states the framework is 'grounded in the AI4People ethical framework' but does not specify which elements of AI4People are adopted versus adapted; a brief mapping table would improve traceability.
[Framework overview] Notation for the four levels (DRL-1 through DRL-4) and the dual-development mandate is introduced clearly but could be summarized in a single table for quick reference.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on the Deception Research Levels (DRL) framework. We agree that greater specificity in the classification criteria would improve the framework's utility and reproducibility. Below we respond to each major comment and outline the revisions we will make.

Referee: DRL framework definition: The five dimensions are presented without scoring rubrics, quantitative thresholds, or decision procedures for rating Pillar Implication, Severity, Reversibility, Scale, or Vulnerability. Consequently the highest-dimension-wins rule reduces to unguided qualitative judgment; two independent assessors can defensibly assign different levels to identical research, so the claimed proportionality of safeguards cannot be guaranteed.

Authors: We recognize the validity of this observation. The framework as presented relies on expert judgment informed by the five dimensions, similar to other ethical assessment tools. However, to enhance reproducibility, we will add a new subsection detailing scoring rubrics for each dimension. These rubrics will provide qualitative criteria and examples drawn from the case studies to guide consistent application of the highest-dimension-wins rule. We will also discuss inter-rater reliability in the limitations section.

Revision: yes

Referee: Case studies application: The eight case studies report final DRL assignments and note ecological validity as an indicator, but supply no explicit per-dimension scores or step-by-step justification for how the highest-dimension rule was applied. This leaves the framework's practical reproducibility untested.

Authors: We agree that the case studies lack sufficient detail on the scoring process. We will revise this section to include explicit ratings for each of the five dimensions for every case study, along with a narrative justification for the final level assignment. This will demonstrate the application of the framework and address concerns about reproducibility.

Revision: yes
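
The inter-rater reliability mentioned in the first response could be quantified with Cohen's kappa, a standard chance-corrected agreement statistic for two raters. The rebuttal does not say which statistic the authors intend to use, so the choice of kappa here is an assumption, and `cohens_kappa` is a hypothetical helper written for illustration.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_a: Sequence[int], rater_b: Sequence[int]) -> float:
    """Cohen's kappa for two raters assigning DRL levels to the same projects."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must rate the same non-empty set of projects")
    n = len(rater_a)
    # Observed agreement: share of projects given the same level by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement by chance, from each rater's marginal level frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[level] * counts_b[level] for level in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical level throughout; kappa is undefined,
        # and this sketch reports perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

A kappa of 1 means the two assessors always assigned the same level, while a value near 0 means they agreed no more often than chance would predict, which is the failure mode the referee describes.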

Circularity Check

0 steps flagged

No significant circularity in the DRL framework proposal

The paper proposes the DRL classification system as an explicit new construction modeled on the external Biosafety Level framework and grounded in the AI4People ethical dimensions (Pillar Implication, Severity, Reversibility, Scale, Vulnerability). The 'highest dimension wins' rule and cumulative safeguards are presented as definitional choices for the framework itself, with no equations, fitted parameters, or self-citations that reduce the classification to its own inputs by construction. The eight case studies apply the framework illustratively but do not serve as load-bearing derivations or create self-referential loops. The central claim remains a governance proposal whose content is independent of any internal reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The proposal rests on the assumption that the AI4People ethical pillars can be operationalized into the five risk dimensions and that the highest-wins aggregation rule yields defensible classifications; no numerical parameters are fitted.

axioms (2)

domain assumption: The AI4People ethical framework supplies an appropriate and complete basis for defining risk dimensions in deceptive AI research.
The five dimensions are explicitly grounded in this external framework.
ad hoc to paper: A highest-dimension-wins rule across the five dimensions produces classifications that are proportional to actual harm potential.
This aggregation method is introduced by the paper without further justification in the abstract.

invented entities (1)

Deception Research Levels (DRL) framework · no independent evidence
purpose: To classify deceptive AI research by risk and assign cumulative safeguards.
Newly proposed four-level system not present in prior work.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages

[1] Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 laying down harmonised rules on artificial intelligence (Artificial Intelligence Act), 2024.

[2] Matthew Aitchison, Lyndon Benke, and Penny Sweetser. Learning to deceive in multi-agent hidden role games. In Stefan Sarkadi, Benjamin Wright, Peta Masters, and Peter McBurney, editors, Deceptive AI: First International Workshop, DeceptECAI 2020, and Second International Workshop, DeceptAI 2021, Proceedings, volume 1296 of Communications in Computer and I...

[3] Chittaranjan Andrade. Internal, external, and ecological validity in research design, conduct, and evaluation. Indian Journal of Psychological Medicine, 40(5):498–499, 2018.

[4] Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, and Owain Evans. Emergent misalignment: Narrow finetuning can produce broadly misaligned LLMs, 2025.

[5] Thomas L. Carson. Lying and Deception: Theory and Practice. Oxford University Press, 2010.

[6] Pedro M. P. Curvo. The traitors: Deception and trust in multi-agent language model simulations, 2025.

[7] Carson Denison, Monte MacDiarmid, Fazl Barez, David Duvenaud, Shauna Kravec, Samuel Marks, Nicholas Schiefer, Ryan Soklaski, Alex Tamkin, Jared Kaplan, Buck Shlegeris, Samuel R. Bowman, Ethan Perez, and Evan Hubinger. Sycophancy to subterfuge: Investigating reward-tampering in large language models, 2024.

[8] Richard Dewey, János Botyánszki, Ciamac C. Moallemi, and Andrew T. Zheng. Outbidding and outbluffing elite humans: Mastering liar’s poker via self-play and reinforcement learning, 2025.

[9] Atharvan Dogra, Krishna Pillutla, Ameet Deshpande, Ananya B. Sai, John J. Nay, Tanmay Rajpurohit, Ashwin Kalyan, and Balaraman Ravindran. Language models can subtly deceive without lying: A case study on strategic phrasing in legislation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pa...

[10] Luciano Floridi, Josh Cowls, Monica Beltrametti, Raja Chatila, Patrice Chazerand, Virginia Dignum, Christoph Luetge, Robert Madelin, Ugo Pagallo, Francesca Rossi, Burkhard Schafer, Peggy Valcke, and Effy Vayena. AI4People: an ethical framework for a good AI society: Opportunities, risks, principles, and recommendations. Minds and Machines, 28(4):689–707, 2018.

[11] Thilo Hagendorff. Deception abilities emerged in large language models. Proceedings of the National Academy of Sciences, 121(24):e2317967121, 2024.

[12] Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam Jermyn, Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova DasSarma, Roger Grosse, Shauna ...

[13] Daniel Kokotajlo, Scott Alexander, Thomas Larsen, Eli Lifland, and Romeo Dean. AI 2027,

[14] Scenario forecast published by the AI Futures Project

[15] James Edwin Mahon. The definition of lying and deception. In Edward N. Zalta, editor, The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University, winter 2016 edition, 2016.

[16] Alexander Meinke, Bronson Schoen, Jérémy Scheurer, Mikita Balesni, Rusheb Shah, and Marius Hobbhahn. Frontier models are capable of in-context scheming. 2024.

[17] Meta Fundamental AI Research Diplomacy Team (FAIR), Anton Bakhtin, Noam Brown, Emily Dinan, Gabriele Farina, Colin Flaherty, Daniel Fried, Andrew Goff, Jonathan Gray, Hengyuan Hu, Athul Paul Jacob, Mojtaba Komeili, Karthik Konath, Minae Kwon, Adam Lerer, Mike Lewis, Alexander H. Miller, Sasha Mitts, Adithya Renduchintala, Stephen Roller, Dirk Rowe, Weiyan...

[18] Peter S. Park, Simon Goldstein, Aidan O’Gara, Michael Chen, and Dan Hendrycks. AI deception: A survey of examples, risks, and potential solutions, 2023.

[19] Benedict Schoen, Ekaterina Nitishinskaya, Mikita Balesni, Alexander Højmark, Fabien Hofstätter, Jérémy Scheurer, Alexander Meinke, James Wolfe, Tom van der Weij, Alex Lloyd, Nicholas Goldowsky-Dill, Andy Fan, Alexander Matveiakin, Rusheb Shah, Murray Williams, Amelia Glaese, Boaz Barak, Wojciech Zaremba, and Marius Hobbhahn. Stress testing deliberative ...

[20] William R. Shadish, Thomas D. Cook, and Donald T. Campbell. Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Houghton Mifflin, 2002.

[21] Jason Starace and Terence Soule. Behavioral inference at scale: The fundamental asymmetry between motivations and belief systems, 2026.

[22] Jason Starace and Terence Soule. Intentional deception as controllable capability in LLM agents, 2026.

[23] Jason Starace, Jennie Tafoya, Anmol Singh, and Terence Soule. Deceptive algorithms in games: A systematic literature review. Entertainment Computing, page 101078, 2026.

[24] U.S. Department of Health and Human Services, Centers for Disease Control and Prevention, and National Institutes of Health. Biosafety in Microbiological and Biomedical Laboratories, 6th edition, 2020.

[25] Francis Ward, Francesca Toni, Francesco Belardinelli, and Tom Everitt. Honesty is the best policy: Defining and mitigating AI deception. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 2313–2341. Curran Associates, Inc., 2023.

[26] Zelai Xu, Wanjun Gu, Chao Yu, Yi Wu, and Yu Wang. Learning strategic language agents in the werewolf game with iterative latent space policy optimization. In Proceedings of the 42nd International Conference on Machine Learning (ICML), 2025.

A Key Definitions

A.1 Deception

Ward et al. [24] provide a functional definition of deception grounded in the phil...

[27] The algorithm’s outputs intentionally influence the target’s actions or acceptance of information

[28] The target accepts ϕ and ϕ is false

[29] The algorithm does not operate as though ϕ is true; its behavior would differ if ϕ were actually true. This definition is functional: “acceptance” means acting as though a proposition is true, not possessing a mental belief. The definition includes algorithms that deceive through direct false statements, strategic omission, technically true but misleading...

[30] The agent acts as though it observes ϕ is true;

[31] The agent would act differently if it observed ϕ is false. If an agent does not respond differently to ϕ being true or false, its belief about ϕ is unidentifiable from its behavior. An agent accepts ϕ if it acts as though it is certain ϕ is true. This formulation avoids Theory of Mind claims about AI systems and provides a natural distinction between belief...
