
arxiv: 1906.08663 · v1 · pith:YIKYFSZWnew · submitted 2019-06-20 · 💻 cs.AI

Modeling AGI Safety Frameworks with Causal Influence Diagrams

Pith reviewed 2026-05-25 19:45 UTC · model grok-4.3

classification 💻 cs.AI
keywords: AGI safety frameworks, causal influence diagrams, optimization objectives, causal assumptions, AI safety, framework comparison

The pith

Causal influence diagrams model the optimization objectives and causal assumptions of AGI safety frameworks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper models several AGI safety frameworks using causal influence diagrams. These diagrams make explicit the optimization objective that the framework is pursuing and the causal assumptions about how components interact. The unified format allows direct comparison of different frameworks and their assumptions. This approach is intended to provide an accessible visual introduction to the main proposals in AGI safety research.

Core claim

By representing AGI safety frameworks as causal influence diagrams, the paper shows the optimization objective and causal assumptions of each framework in a way that permits easy comparison between them.

What carries the argument

Causal influence diagrams that capture the training and interaction of system components to display optimization objectives and causal assumptions.
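
As a rough illustration of what such a diagram encodes, the Python sketch below represents a causal influence diagram as a directed graph whose nodes are typed as chance, decision, or utility. The class `CID`, its method names, and the four-node toy diagram are assumptions made for this example only; they are not taken from the paper and do not reproduce any of its figures. The sketch reads off the optimization objective (the utility nodes), the information available at each decision (its parents), and whether a decision has a directed path to the objective.

```python
from dataclasses import dataclass, field

CHANCE, DECISION, UTILITY = "chance", "decision", "utility"


@dataclass
class CID:
    """Hypothetical minimal causal influence diagram: typed nodes plus directed edges."""
    kinds: dict = field(default_factory=dict)  # node name -> node kind
    edges: set = field(default_factory=set)    # (source, target) pairs

    def add_node(self, name, kind):
        if kind not in (CHANCE, DECISION, UTILITY):
            raise ValueError(f"unknown node kind: {kind}")
        self.kinds[name] = kind

    def add_edge(self, src, dst):
        for name in (src, dst):
            if name not in self.kinds:
                raise KeyError(f"unknown node: {name}")
        self.edges.add((src, dst))

    def parents(self, node):
        return sorted(s for s, d in self.edges if d == node)

    def objective(self):
        # The utility nodes are what the modeled agent is taken to optimize.
        return sorted(n for n, k in self.kinds.items() if k == UTILITY)

    def information_links(self):
        # Parents of a decision node are what the agent observes when deciding.
        return {n: self.parents(n) for n, k in self.kinds.items() if k == DECISION}

    def can_influence(self, src, dst):
        # Depth-first search for a directed path from src to dst.
        stack, seen = [src], set()
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(d for s, d in self.edges if s == node)
        return False


# Toy diagram (an assumption for illustration, not a framework from the paper).
toy = CID()
for name, kind in [("State", CHANCE), ("Action", DECISION),
                   ("Outcome", CHANCE), ("Reward", UTILITY)]:
    toy.add_node(name, kind)
for src, dst in [("State", "Action"), ("State", "Outcome"),
                 ("Action", "Outcome"), ("Outcome", "Reward")]:
    toy.add_edge(src, dst)

print(toy.objective())                         # ['Reward']
print(toy.information_links())                 # {'Action': ['State']}
print(toy.can_influence("Action", "Reward"))   # True
```

Keeping node types and edges as plain data means two diagrams can be compared by diffing their node kinds and edge sets, which is the kind of side-by-side reading of assumptions that a unified representation is meant to support.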

If this is right

  • The diagrams enable straightforward visual comparison of different AGI safety frameworks.
  • Assumptions in each framework become explicit and comparable.
  • The models serve as an accessible introduction to AGI safety ideas.
  • Frameworks can be analyzed for their causal structure without needing full implementation details.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This modeling technique could be applied to evaluate emerging AGI safety proposals as they appear.
  • Diagrams might reveal opportunities to combine strengths from multiple frameworks into new designs.
  • The method could aid in communicating complex safety ideas to non-experts in the field.

Load-bearing premise

Key elements of AGI safety frameworks can be faithfully represented in causal influence diagrams without losing important details on training dynamics or failure modes.

What would settle it

Demonstrating an AGI safety framework whose structure or risks cannot be adequately captured in a causal influence diagram.

Original abstract

Proposals for safe AGI systems are typically made at the level of frameworks, specifying how the components of the proposed system should be trained and interact with each other. In this paper, we model and compare the most promising AGI safety frameworks using causal influence diagrams. The diagrams show the optimization objective and causal assumptions of the framework. The unified representation permits easy comparison of frameworks and their assumptions. We hope that the diagrams will serve as an accessible and visual introduction to the main AGI safety frameworks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper claims that causal influence diagrams (CIDs) provide a unified visual representation for modeling prominent AGI safety frameworks (e.g., iterated amplification, debate). The diagrams are said to display each framework's optimization objective and causal assumptions, enabling straightforward comparison across proposals; the work is positioned as an accessible introduction rather than a formal proof or empirical validation.

Significance. If the CIDs accurately encode the core objectives and structures, the contribution lies in offering a standardized, visual language for comparing AGI safety proposals. This could aid identification of differing assumptions without requiring readers to consult multiple source papers. The approach is conceptual and illustrative, with no new theorems, code, or falsifiable predictions claimed.

major comments (1)
  1. [§4] The central claim that the diagrams 'show the optimization objective and causal assumptions' (abstract) rests on faithful representation. However, the skeptic's concern about omitted training dynamics is load-bearing: for iterated amplification (modeled in §4), the CID treats the amplification step as a single decision node, without the explicit recursive human-feedback loops or capability-escalation paths described in the source literature. This risks under-representing causal dependencies that affect safety claims, and the paper provides no explicit mapping table or validation step against the original framework descriptions to confirm completeness.
minor comments (3)
  1. [§2.1] Figure 2 (debate CID) uses non-standard arrow styles for information edges; clarify the CID notation conventions in §2.1 to ensure readers can interpret all diagrams consistently.
  2. The paper cites the original framework papers but does not include a reference list entry for the CID formalism itself (e.g., the 2019 or earlier CID literature); add this for completeness.
  3. [Table 1] Table 1 summarizing framework assumptions is useful but omits a column for 'iterative training' or 'human oversight nodes'; expanding it would strengthen the comparison claim without altering the diagrams.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive feedback. The manuscript uses causal influence diagrams as high-level abstractions to illustrate optimization objectives and causal assumptions across AGI safety frameworks. We address the specific concern about the iterated amplification model below.

point-by-point responses
  1. Referee: [§4] The central claim that the diagrams 'show the optimization objective and causal assumptions' (abstract) rests on faithful representation. However, the skeptic's concern about omitted training dynamics is load-bearing: for iterated amplification (modeled in §4), the CID treats the amplification step as a single decision node, without the explicit recursive human-feedback loops or capability-escalation paths described in the source literature. This risks under-representing causal dependencies that affect safety claims, and the paper provides no explicit mapping table or validation step against the original framework descriptions to confirm completeness.

    Authors: We agree that the CID for iterated amplification is a deliberate high-level abstraction that collapses recursive human feedback and capability escalation into a single amplification decision node. This choice follows from the paper's goal of providing a unified visual language for comparing frameworks at the level of their stated objectives and core causal structure, rather than a full dynamical simulation of training. The diagram still encodes the key optimization objective (human utility over the final output) and the assumption that amplification preserves alignment. To strengthen the claim of faithful representation, we will add an explicit mapping table in §4 that links each CID node and edge to the corresponding concepts in the source literature on iterated amplification. This will clarify the abstraction level and allow readers to evaluate completeness directly. We view this as a minor but useful clarification.

    revision: partial

Circularity Check

0 steps flagged

No circularity; descriptive modeling from external framework descriptions

full rationale

The paper constructs causal influence diagrams to represent existing AGI safety frameworks (e.g., iterated amplification, debate) based on their publicly described components, objectives, and causal structures. No equations, parameters, or derivations are present that reduce a claimed result to fitted inputs or self-definitions. The unified representation is a modeling choice, not a prediction derived from prior results in the paper. Self-citations, if any, are not load-bearing for the core claim of faithful representation and comparison. The derivation chain is self-contained against external benchmarks (the source frameworks), with no reduction by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on the assumption that causal influence diagrams are an appropriate and sufficiently lossless representation for the purposes of comparison; no free parameters or new entities are introduced.

axioms (1)
  • domain assumption: Causal influence diagrams can accurately capture the optimization objectives and causal assumptions of AGI safety frameworks.
    This premise is required for the diagrams to serve as a useful comparison tool.

pith-pipeline@v0.9.0 · 5599 in / 1096 out tokens · 20067 ms · 2026-05-25T19:45:08.957790+00:00 · methodology


Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · 6 internal anchors

  1. [1]

    Concrete Problems in AI Safety

    Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. CoRR, abs/1606.06565, 2016

  2. [2]

    Good and safe uses of AI Oracles

    Stuart Armstrong. Good and safe uses of AI oracles. CoRR, abs/1711.05541, 2017

  3. [3]

    Probabilistic evaluation of counterfactual queries

    Alexander Balke and Judea Pearl. Probabilistic evaluation of counterfactual queries. In Association for the Advancement of Artificial Intelligence (AAAI), pages 230–237, 1994

  4. [4]

    Superintelligence: Paths, Dangers, Strategies

    Nick Bostrom. Superintelligence: Paths, Dangers, Strategies. Oxford University Press, 2014

  5. [5]

    Why tool AIs want to be agent AIs

    Gwern Branwen. Why tool AIs want to be agent AIs, 2016. https://www.gwern.net/Tool-AI

  6. [6]

    Supervising strong learners by amplifying weak experts

    Paul Christiano, Buck Shlegeris, and Dario Amodei. Supervising strong learners by amplifying weak experts. CoRR, abs/1810.08575, 2018

  7. [7]

    Act-based agents

    Paul Christiano. Act-based agents, 2015. https://ai-alignment.com/act-based-agents-8ec926c79e9c

  8. [8]

    The Intentional Stance

    Daniel Dennett. The Intentional Stance. MIT Press, 1987

  9. [9]

    Reframing superintelligence: Comprehensive AI services as general intelligence

    K Eric Drexler. Reframing superintelligence: Comprehensive AI services as general intelligence. Technical report, Future of Humanity Institute, University of Oxford, 2019

  10. [10]

    Self-modification of policy and utility function in rational agents

    Tom Everitt, Daniel Filan, Mayank Daswani, and Marcus Hutter. Self-modification of policy and utility function in rational agents. In Artificial General Intelligence, volume LNAI 9782, pages 1–11, 2016

  11. [11]

    AGI safety literature review

    Tom Everitt, Gary Lea, and Marcus Hutter. AGI safety literature review. In International Joint Conference on AI (IJCAI), 2018

  12. [12]

    Understanding agent incentives using causal influence diagrams

    Tom Everitt, Pedro Ortega, Elizabeth Barnes, and Shane Legg. Understanding agent incentives using causal influence diagrams. Part I: Single action settings. CoRR, abs/1902.09980, 2019

  13. [13]

    Towards Safe Artificial General Intelligence

    Tom Everitt. Towards Safe Artificial General Intelligence. PhD thesis, Australian National University, May 2018

  14. [14]

    Cooperative inverse reinforcement learning

    Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, and Stuart Russell. Cooperative inverse reinforcement learning. In Neural Information Processing Systems (NIPS), 2016

  15. [15]

    Model-based utility functions

    Bill Hibbard. Model-based utility functions. Journal of Artificial General Intelligence, 3(1):1–24, 2012

  16. [16]

    Quantifying causal emergence shows that macro can beat micro

    Erik Hoel, Larissa Albantakis, and Giulio Tononi. Quantifying causal emergence shows that macro can beat micro. In Proceedings of the National Academy of Sciences, volume 110, pages 19790–19795. National Academy of Sciences, 2013

  17. [17]

    Influence diagrams

    Ronald A Howard and James E Matheson. Influence diagrams. Readings on the Principles and Applications of Decision Analysis, pages 721–762, 1984

  18. [18]

    AI safety via debate

    Geoffrey Irving, Paul Christiano, and Dario Amodei. AI safety via debate. CoRR, abs/1805.00899, 2018

  19. [19]

    Multi-agent influence diagrams for representing and solving games

    Daphne Koller and Brian Milch. Multi-agent influence diagrams for representing and solving games. Games and Economic Behavior, 45(1):181–221, 2003

  20. [20]

    Scalable agent alignment via reward modeling: a research direction

    Jan Leike, David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, and Shane Legg. Scalable agent alignment via reward modeling: a research direction. CoRR, abs/1811.07871, 2018

  21. [21]

    Self-modification and mortality in artificial agents

    Laurent Orseau and Mark Ring. Self-modification and mortality in artificial agents. In Artificial General Intelligence, volume 6830 LNAI, pages 1–10, 2011

  22. [22]

    Agents and Devices: A Relative Definition of Agency

    Laurent Orseau, Simon McGregor McGill, and Shane Legg. Agents and devices: A relative definition of agency. CoRR, abs/1805.12387, 2018

  23. [23]

    Causality: Models, Reasoning, and Inference

    Judea Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2nd edition, 2009

  24. [24]

    Gödel machines: Self-referential universal problem solvers making provably optimal self-improvements

    Jürgen Schmidhuber. Gödel machines: Self-referential universal problem solvers making provably optimal self-improvements. In Artificial General Intelligence. Springer, 2007

  25. [25]

    Reinforcement Learning: An Introduction

    Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 2nd edition, 2018

  26. [26]

    Learning to reinforcement learn

    Jane Wang, Zeb Kurth-Nelson, Hubert Soyer, Joel Leibo, Dhruva Tirumala, Rémi Munos, Charles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn. In Proceedings of the 39th Annual Meeting of the Cognitive Science Society (CogSci), 2017