Modeling AGI Safety Frameworks with Causal Influence Diagrams

Ramana Kumar; Shane Legg; Tom Everitt; Victoria Krakovna

arxiv: 1906.08663 · v1 · pith:YIKYFSZWnew · submitted 2019-06-20 · 💻 cs.AI

Tom Everitt, Ramana Kumar, Victoria Krakovna, Shane Legg

Pith reviewed 2026-05-25 19:45 UTC · model grok-4.3

classification 💻 cs.AI

keywords: AGI safety frameworks · causal influence diagrams · optimization objectives · causal assumptions · AI safety · framework comparison

The pith

Causal influence diagrams model the optimization objectives and causal assumptions of AGI safety frameworks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper models several AGI safety frameworks using causal influence diagrams. These diagrams make explicit the optimization objective that the framework is pursuing and the causal assumptions about how components interact. The unified format allows direct comparison of different frameworks and their assumptions. This approach is intended to provide an accessible visual introduction to the main proposals in AGI safety research.

Core claim

By representing AGI safety frameworks as causal influence diagrams, the paper shows the optimization objective and causal assumptions of each framework in a way that permits easy comparison between them.

What carries the argument

Causal influence diagrams that capture the training and interaction of system components to display optimization objectives and causal assumptions.
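
As a rough illustration of what such a diagram contains, here is a minimal Python sketch of an influence diagram as a typed directed acyclic graph. The `CID` class, its method names, and the toy `state`/`action`/`reward` diagram are hypothetical and are not taken from the paper. The sketch assumes the usual influence-diagram convention of chance, decision, and utility nodes, with edges into a decision node read as information links.

```python
from dataclasses import dataclass, field

# Node kinds used in influence diagrams: chance (random variable),
# decision (chosen by the agent), utility (what the agent optimizes).
KINDS = {"chance", "decision", "utility"}


@dataclass
class CID:
    kinds: dict = field(default_factory=dict)  # node name -> kind
    edges: set = field(default_factory=set)    # (parent, child) pairs

    def add_node(self, name, kind):
        if kind not in KINDS:
            raise ValueError(f"unknown node kind: {kind}")
        self.kinds[name] = kind

    def add_edge(self, parent, child):
        if parent not in self.kinds or child not in self.kinds:
            raise KeyError("both endpoints must be declared first")
        self.edges.add((parent, child))

    def parents(self, node):
        return {p for p, c in self.edges if c == node}

    def information_links(self):
        # Edges into decision nodes: what the agent can observe when acting.
        return {(p, c) for p, c in self.edges if self.kinds[c] == "decision"}

    def objective(self):
        # The optimization objective is read off the utility nodes.
        return {n for n, k in self.kinds.items() if k == "utility"}

    def is_acyclic(self):
        # Kahn's algorithm: a valid diagram must be a DAG.
        indegree = {n: 0 for n in self.kinds}
        for _, child in self.edges:
            indegree[child] += 1
        queue = [n for n, d in indegree.items() if d == 0]
        visited = 0
        while queue:
            node = queue.pop()
            visited += 1
            for parent, child in self.edges:
                if parent == node:
                    indegree[child] -= 1
                    if indegree[child] == 0:
                        queue.append(child)
        return visited == len(self.kinds)


# Hypothetical toy diagram (not from the paper): an agent observes a state,
# picks an action, and receives a reward that depends on both.
toy = CID()
toy.add_node("state", "chance")
toy.add_node("action", "decision")
toy.add_node("reward", "utility")
toy.add_edge("state", "action")
toy.add_edge("state", "reward")
toy.add_edge("action", "reward")
```

Reading `toy.objective()` and `toy.information_links()` corresponds, loosely, to reading the optimization objective and the informational assumptions off a diagram.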

If this is right

The diagrams enable straightforward visual comparison of different AGI safety frameworks.
Assumptions in each framework become explicit and comparable.
The models serve as an accessible introduction to AGI safety ideas.
Frameworks can be analyzed for their causal structure without needing full implementation details.
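
One way to picture the comparison claimed above is as a set difference over edges, sketched below in Python. The two diagrams and all node names are hypothetical placeholders rather than frameworks from the paper, and the sketch assumes that node names have already been aligned across diagrams, without which the comparison is not meaningful.

```python
def compare(a, b):
    """Return edges shared by both diagrams and edges unique to each."""
    edges_a, edges_b = set(a["edges"]), set(b["edges"])
    return {
        "shared": edges_a & edges_b,
        "only_a": edges_a - edges_b,
        "only_b": edges_b - edges_a,
    }


# Hypothetical placeholder diagrams; node names are illustrative only.
framework_a = {"edges": [("state", "action"), ("action", "reward")]}
framework_b = {
    "edges": [
        ("state", "action"),
        ("action", "feedback"),
        ("feedback", "reward"),
    ]
}

diff = compare(framework_a, framework_b)
```

Here `diff["only_b"]` isolates the assumption that the reward is mediated by a feedback variable, which is the kind of structural difference a side-by-side diagram is meant to expose.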

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This modeling technique could be applied to evaluate emerging AGI safety proposals as they appear.
Diagrams might reveal opportunities to combine strengths from multiple frameworks into new designs.
The method could aid in communicating complex safety ideas to non-experts in the field.

Load-bearing premise

Key elements of AGI safety frameworks can be faithfully represented in causal influence diagrams without losing important details on training dynamics or failure modes.

What would settle it

Demonstrating an AGI safety framework whose structure or risks cannot be adequately captured in a causal influence diagram.

Original abstract

Proposals for safe AGI systems are typically made at the level of frameworks, specifying how the components of the proposed system should be trained and interact with each other. In this paper, we model and compare the most promising AGI safety frameworks using causal influence diagrams. The diagrams show the optimization objective and causal assumptions of the framework. The unified representation permits easy comparison of frameworks and their assumptions. We hope that the diagrams will serve as an accessible and visual introduction to the main AGI safety frameworks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper applies causal influence diagrams to existing AGI safety frameworks to create a shared visual comparison tool, which is a modest organizing step rather than a technical advance.

The core contribution is a set of causal influence diagrams that lay out the optimization goals and causal structure for several AGI safety proposals side by side. This lets readers see at a glance how the frameworks differ in what they treat as the objective and which variables they connect.

The diagrams are built from the publicly described versions of the frameworks, so the work stays grounded in existing material rather than introducing new entities or fitted parameters. That makes the representation reproducible from the source papers. The approach is useful for communication because it turns verbal descriptions into a consistent notation that highlights assumptions about what influences what. For someone already familiar with the original frameworks, the diagrams can serve as a quick reference when checking whether two proposals rest on compatible causal pictures.

The main limitation is that the diagrams are presented as illustrative. Frameworks with iterative training loops, recursive human feedback, or capability assumptions that unfold over multiple rounds may lose detail when forced into standard CID nodes. The paper does not include an explicit mapping or check against the source texts to confirm nothing structurally important was dropped, which means the comparisons rest on the authors' modeling choices. This is not a fatal issue for an organizing paper, but it caps how much weight the diagrams can carry when used to evaluate or combine frameworks.

The work is aimed at the AGI safety community rather than a general AI audience. Readers who already know the main proposals will get the most value from the diagrams as a visual shorthand; newcomers will still need the original papers. The modeling is clear and the citation pattern is appropriate for a survey-style piece. I would send it to peer review. The contribution is narrow, but the execution is careful enough on its own terms that referees could usefully comment on the fidelity of the diagrams and whether additional frameworks should be included.

Referee Report

1 major / 3 minor

Summary. The paper claims that causal influence diagrams (CIDs) provide a unified visual representation for modeling prominent AGI safety frameworks (e.g., iterated amplification, debate). The diagrams are said to display each framework's optimization objective and causal assumptions, enabling straightforward comparison across proposals; the work is positioned as an accessible introduction rather than a formal proof or empirical validation.

Significance. If the CIDs accurately encode the core objectives and structures, the contribution lies in offering a standardized, visual language for comparing AGI safety proposals. This could aid identification of differing assumptions without requiring readers to consult multiple source papers. The approach is conceptual and illustrative, with no new theorems, code, or falsifiable predictions claimed.

major comments (1)

[§4] The central claim that the diagrams 'show the optimization objective and causal assumptions' (abstract) rests on faithful representation. However, the skeptic's concern about omitted training dynamics is load-bearing: for iterated amplification (modeled in §4), the CID treats the amplification step as a single decision node without the explicit recursive human-feedback loops or capability escalation paths described in the source literature. This risks under-representing causal dependencies that affect safety claims, and no explicit mapping table or validation step against the original framework descriptions is provided to confirm completeness.
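
The concern about collapsing a multi-round process into one node can be made concrete with a small Python sketch. Everything here is a hypothetical illustration: the `unrolled` and `collapse` helpers, the `model_t` and `feedback_t` node names, and the three-round loop are assumptions for the example and are not the paper's §4 diagram or a faithful model of iterated amplification.

```python
def unrolled(rounds):
    """Edges of a hypothetical multi-round loop: each round's model is
    trained on feedback that depends on the previous round's model."""
    edges = []
    for t in range(1, rounds + 1):
        edges.append((f"model_{t - 1}", f"feedback_{t}"))
        edges.append((f"feedback_{t}", f"model_{t}"))
    edges.append((f"model_{rounds}", "utility"))
    return edges


def collapse(edges, prefix_to_node):
    """Merge every node whose name starts with a given prefix into one
    node, dropping the self-loops that the merge creates."""
    def rename(name):
        for prefix, target in prefix_to_node.items():
            if name.startswith(prefix):
                return target
        return name

    merged = {(rename(p), rename(c)) for p, c in edges}
    return {(p, c) for p, c in merged if p != c}


full = unrolled(3)
coarse = collapse(
    full, {"model_": "amplification", "feedback_": "amplification"}
)
```

In this toy case the seven edges of the unrolled loop reduce to a single `amplification -> utility` edge, so every round-to-round dependency disappears from the coarse diagram. Whether that loss matters for a given safety claim is exactly what a mapping against the source description would need to establish.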

minor comments (3)

[§2.1] Figure 2 (debate CID) uses non-standard arrow styles for information edges; clarify the CID notation conventions in §2.1 to ensure readers can interpret all diagrams consistently.
The paper cites the original framework papers but does not include a reference list entry for the CID formalism itself (e.g., the 2019 or earlier CID literature); add this for completeness.
[Table 1] Table 1 summarizing framework assumptions is useful but omits a column for 'iterative training' or 'human oversight nodes'; expanding it would strengthen the comparison claim without altering the diagrams.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive feedback. The manuscript uses causal influence diagrams as high-level abstractions to illustrate optimization objectives and causal assumptions across AGI safety frameworks. We address the specific concern about the iterated amplification model below.

Referee: [§4] The central claim that the diagrams 'show the optimization objective and causal assumptions' (abstract) rests on faithful representation. However, the skeptic's concern about omitted training dynamics is load-bearing: for iterated amplification (modeled in §4), the CID treats the amplification step as a single decision node without the explicit recursive human-feedback loops or capability escalation paths described in the source literature. This risks under-representing causal dependencies that affect safety claims, and no explicit mapping table or validation step against the original framework descriptions is provided to confirm completeness.

Authors: We agree that the CID for iterated amplification is a deliberate high-level abstraction that collapses recursive human feedback and capability escalation into a single amplification decision node. This choice follows from the paper's goal of providing a unified visual language for comparing frameworks at the level of their stated objectives and core causal structure, rather than a full dynamical simulation of training. The diagram still encodes the key optimization objective (human utility over the final output) and the assumption that amplification preserves alignment. To strengthen the claim of faithful representation, we will add an explicit mapping table in §4 that links each CID node and edge to the corresponding concepts in the source literature on iterated amplification. This will clarify the abstraction level and allow readers to evaluate completeness directly. We view this as a minor but useful clarification.

revision: partial
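
A mapping table of the kind proposed in the response could be checked mechanically. The Python sketch below assumes a hypothetical three-node diagram and placeholder mapping entries; none of the node names or entries come from the paper, and the `unmapped` helper is invented for illustration.

```python
def unmapped(nodes, edges, mapping):
    """Return diagram elements that have no non-empty mapping entry."""
    elements = list(nodes) + list(edges)
    return [e for e in elements if not mapping.get(e, "").strip()]


# Hypothetical diagram elements; names are illustrative only.
nodes = ["question", "amplification", "utility"]
edges = [("question", "amplification"), ("amplification", "utility")]

# Placeholder entries standing in for pointers into the source literature.
mapping = {
    "question": "placeholder: where the source describes the input",
    "amplification": "placeholder: where the source describes amplification",
    "utility": "placeholder: where the source describes the objective",
    ("question", "amplification"): "placeholder: why the input informs the decision",
    # ("amplification", "utility") is deliberately left unmapped.
}

missing = unmapped(nodes, edges, mapping)
```

An empty `missing` list would only show that every element has some entry, not that the entries are accurate, so a check like this complements rather than replaces a manual read against the source texts.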

Circularity Check

0 steps flagged

No circularity; descriptive modeling from external framework descriptions

The paper constructs causal influence diagrams to represent existing AGI safety frameworks (e.g., iterated amplification, debate) based on their publicly described components, objectives, and causal structures. No equations, parameters, or derivations are present that reduce a claimed result to fitted inputs or self-definitions. The unified representation is a modeling choice, not a prediction derived from prior results in the paper. Self-citations, if any, are not load-bearing for the core claim of faithful representation and comparison. The derivation chain is self-contained against external benchmarks (the source frameworks), with no reduction by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on the assumption that causal influence diagrams are an appropriate and sufficiently lossless representation for the purposes of comparison; no free parameters or new entities are introduced.

axioms (1)

domain assumption: Causal influence diagrams can accurately capture the optimization objectives and causal assumptions of AGI safety frameworks. This premise is required for the diagrams to serve as a useful comparison tool.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · 6 internal anchors

[1] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. CoRR, abs/1606.06565, 2016.
[2] Stuart Armstrong. Good and safe uses of AI oracles. CoRR, abs/1711.05541, 2017.
[3] Alexander Balke and Judea Pearl. Probabilistic evaluation of counterfactual queries. In Association for the Advancement of Artificial Intelligence (AAAI), pages 230–237, 1994.
[4] Nick Bostrom. Superintelligence: Paths, Dangers, Strategies. Oxford University Press, 2014.
[5] Gwern Branwen. Why tool AIs want to be agent AIs, 2016. https://www.gwern.net/Tool-AI
[6] Paul Christiano, Buck Shlegeris, and Dario Amodei. Supervising strong learners by amplifying weak experts. CoRR, abs/1810.08575, 2018.
[7] Paul Christiano. Act-based agents, 2015. https://ai-alignment.com/act-based-agents-8ec926c79e9c
[8] Daniel Dennett. The Intentional Stance. MIT Press, 1987.
[9] K Eric Drexler. Reframing superintelligence: Comprehensive AI services as general intelligence. Technical report, Future of Humanity Institute, University of Oxford, 2019.
[10] Tom Everitt, Daniel Filan, Mayank Daswani, and Marcus Hutter. Self-modification of policy and utility function in rational agents. In Artificial General Intelligence, volume LNAI 9782, pages 1–11, 2016.
[11] Tom Everitt, Gary Lea, and Marcus Hutter. AGI safety literature review. In International Joint Conference on AI (IJCAI), 2018.
[12] Tom Everitt, Pedro Ortega, Elizabeth Barnes, and Shane Legg. Understanding agent incentives using causal influence diagrams. Part I: Single action settings. CoRR, abs/1902.09980, 2019.
[13] Tom Everitt. Towards Safe Artificial General Intelligence. PhD thesis, Australian National University, May 2018.
[14] Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, and Stuart Russell. Cooperative inverse reinforcement learning. In Neural Information Processing Systems (NIPS), 2016.
[15] Bill Hibbard. Model-based utility functions. Journal of Artificial General Intelligence, 3(1):1–24, 2012.
[16] Erik Hoel, Larissa Albantakis, and Giulio Tononi. Quantifying causal emergence shows that macro can beat micro. In Proceedings of the National Academy of Sciences, volume 110, pages 19790–19795. National Academy of Sciences, 2013.
[17] Ronald A Howard and James E Matheson. Influence diagrams. Readings on the Principles and Applications of Decision Analysis, pages 721–762, 1984.
[18] Geoffrey Irving, Paul Christiano, and Dario Amodei. AI safety via debate. CoRR, abs/1805.00899, 2018.
[19] Daphne Koller and Brian Milch. Multi-agent influence diagrams for representing and solving games. Games and Economic Behavior, 45(1):181–221, 2003.
[20] Jan Leike, David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, and Shane Legg. Scalable agent alignment via reward modeling: a research direction. CoRR, abs/1811.07871, 2018.
[21] Laurent Orseau and Mark Ring. Self-modification and mortality in artificial agents. In Artificial General Intelligence, volume 6830 LNAI, pages 1–10, 2011.
[22] Laurent Orseau, Simon McGregor McGill, and Shane Legg. Agents and devices: A relative definition of agency. CoRR, abs/1805.12387, 2018.
[23] Judea Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2nd edition, 2009.
[24] Jürgen Schmidhuber. Gödel machines: Self-referential universal problem solvers making provably optimal self-improvements. In Artificial General Intelligence. Springer, 2007.
[25] Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 2nd edition, 2018.
[26] Jane Wang, Zeb Kurth-Nelson, Hubert Soyer, Joel Leibo, Dhruva Tirumala, Rémi Munos, Charles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn. In Proceedings of the 39th Annual Meeting of the Cognitive Science Society (CogSci), 2017.
