Modeling AGI Safety Frameworks with Causal Influence Diagrams
Pith reviewed 2026-05-25 19:45 UTC · model grok-4.3
The pith
Causal influence diagrams model the optimization objectives and causal assumptions of AGI safety frameworks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By representing AGI safety frameworks as causal influence diagrams, the paper shows the optimization objective and causal assumptions of each framework in a way that permits easy comparison between them.
What carries the argument
Causal influence diagrams that capture the training and interaction of system components to display optimization objectives and causal assumptions.
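As a rough illustration of the machinery, a causal influence diagram is a directed acyclic graph whose nodes are typed as chance, decision, or utility, and whose edges into decision nodes are read as information links. The sketch below is a minimal data structure under those assumptions. The class `CID`, its methods, and the toy node names are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field
from enum import Enum


class NodeKind(Enum):
    CHANCE = "chance"
    DECISION = "decision"
    UTILITY = "utility"


@dataclass
class CID:
    """Minimal causal influence diagram: typed nodes plus directed edges."""
    kinds: dict = field(default_factory=dict)  # node name -> NodeKind
    edges: set = field(default_factory=set)    # (parent, child) pairs

    def add_node(self, name, kind):
        self.kinds[name] = kind

    def add_edge(self, parent, child):
        if parent not in self.kinds or child not in self.kinds:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges.add((parent, child))

    def parents(self, node):
        return sorted(p for p, c in self.edges if c == node)

    def information_links(self):
        """Edges into decision nodes: what each decision can observe."""
        return sorted((p, c) for p, c in self.edges
                      if self.kinds[c] is NodeKind.DECISION)

    def is_acyclic(self):
        """Kahn's algorithm: the diagram must be a directed acyclic graph."""
        indegree = {n: 0 for n in self.kinds}
        for _, child in self.edges:
            indegree[child] += 1
        ready = [n for n, d in indegree.items() if d == 0]
        visited = 0
        while ready:
            node = ready.pop()
            visited += 1
            for parent, child in self.edges:
                if parent == node:
                    indegree[child] -= 1
                    if indegree[child] == 0:
                        ready.append(child)
        return visited == len(self.kinds)


# Hypothetical toy diagram (not from the paper): a question is observed,
# the system chooses an answer, and a utility node scores the outcome.
toy = CID()
toy.add_node("question", NodeKind.CHANCE)
toy.add_node("answer", NodeKind.DECISION)
toy.add_node("human_utility", NodeKind.UTILITY)
toy.add_edge("question", "answer")
toy.add_edge("question", "human_utility")
toy.add_edge("answer", "human_utility")
```

In this toy, the parents of the utility node stand in for the optimization objective, and `information_links()` lists what the decision is assumed to observe. Those are the two things the diagrams are said to display.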
If this is right
- The diagrams enable straightforward visual comparison of different AGI safety frameworks.
- Assumptions in each framework become explicit and comparable.
- The models serve as an accessible introduction to AGI safety ideas.
- Frameworks can be analyzed for their causal structure without needing full implementation details.
Where Pith is reading between the lines
- This modeling technique could be applied to evaluate emerging AGI safety proposals as they appear.
- Diagrams might reveal opportunities to combine strengths from multiple frameworks into new designs.
- The method could aid in communicating complex safety ideas to non-experts in the field.
Load-bearing premise
Key elements of AGI safety frameworks can be faithfully represented in causal influence diagrams without losing important details on training dynamics or failure modes.
What would settle it
Demonstrating an AGI safety framework whose structure or risks cannot be adequately captured in a causal influence diagram.
Proposals for safe AGI systems are typically made at the level of frameworks, specifying how the components of the proposed system should be trained and interact with each other. In this paper, we model and compare the most promising AGI safety frameworks using causal influence diagrams. The diagrams show the optimization objective and causal assumptions of the framework. The unified representation permits easy comparison of frameworks and their assumptions. We hope that the diagrams will serve as an accessible and visual introduction to the main AGI safety frameworks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that causal influence diagrams (CIDs) provide a unified visual representation for modeling prominent AGI safety frameworks (e.g., iterated amplification, debate). The diagrams are said to display each framework's optimization objective and causal assumptions, enabling straightforward comparison across proposals; the work is positioned as an accessible introduction rather than a formal proof or empirical validation.
Significance. If the CIDs accurately encode the core objectives and structures, the contribution lies in offering a standardized, visual language for comparing AGI safety proposals. This could aid identification of differing assumptions without requiring readers to consult multiple source papers. The approach is conceptual and illustrative, with no new theorems, code, or falsifiable predictions claimed.
major comments (1)
- [§4] The central claim that the diagrams 'show the optimization objective and causal assumptions' (abstract) rests on faithful representation. However, the skeptic concern about omitted training dynamics is load-bearing: for iterated amplification (modeled in §4), the CID treats the amplification step as a single decision node without explicit recursive human-feedback loops or capability escalation paths described in the source literature. This risks under-representing causal dependencies that affect safety claims, and no explicit mapping table or validation step against the original framework descriptions is provided to confirm completeness.
minor comments (3)
- [§2.1] Figure 2 (debate CID) uses non-standard arrow styles for information edges; clarify the CID notation conventions in §2.1 to ensure readers can interpret all diagrams consistently.
- The paper cites the original framework papers but does not include a reference list entry for the CID formalism itself (e.g., the 2019 or earlier CID literature); add this for completeness.
- [Table 1] Table 1 summarizing framework assumptions is useful but omits a column for 'iterative training' or 'human oversight nodes'; expanding it would strengthen the comparison claim without altering the diagrams.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. The manuscript uses causal influence diagrams as high-level abstractions to illustrate optimization objectives and causal assumptions across AGI safety frameworks. We address the specific concern about the iterated amplification model below.
- Referee: [§4] The central claim that the diagrams 'show the optimization objective and causal assumptions' (abstract) rests on faithful representation. However, the skeptic concern about omitted training dynamics is load-bearing: for iterated amplification (modeled in §4), the CID treats the amplification step as a single decision node without explicit recursive human-feedback loops or capability escalation paths described in the source literature. This risks under-representing causal dependencies that affect safety claims, and no explicit mapping table or validation step against the original framework descriptions is provided to confirm completeness.
Authors: We agree that the CID for iterated amplification is a deliberate high-level abstraction that collapses recursive human feedback and capability escalation into a single amplification decision node. This choice follows from the paper's goal of providing a unified visual language for comparing frameworks at the level of their stated objectives and core causal structure, rather than a full dynamical simulation of training. The diagram still encodes the key optimization objective (human utility over the final output) and the assumption that amplification preserves alignment. To strengthen the claim of faithful representation, we will add an explicit mapping table in §4 that links each CID node and edge to the corresponding concepts in the source literature on iterated amplification. This will clarify the abstraction level and allow readers to evaluate completeness directly. We view this as a minor but useful clarification.
revision: partial
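A mapping table of the kind discussed in this exchange could also be checked mechanically: every node and edge of a diagram should have a non-empty entry pointing to the source literature. The sketch below shows one way to flag gaps. The function `unmapped_elements`, the node names, and the placeholder table entries are all hypothetical and do not reproduce the paper's iterated amplification diagram.

```python
def unmapped_elements(nodes, edges, mapping):
    """Return diagram nodes and edges that lack a non-empty mapping entry."""
    elements = list(nodes) + list(edges)
    return [e for e in elements if not mapping.get(e)]


# Hypothetical diagram elements, for illustration only.
nodes = ["question", "amplification", "human_utility"]
edges = [("question", "amplification"), ("amplification", "human_utility")]

# Hypothetical mapping table: element -> pointer into the source literature.
mapping = {
    "question": "placeholder: passage describing the input",
    "amplification": "placeholder: passage describing the amplification step",
    ("question", "amplification"): "placeholder: passage for this dependency",
}

# Elements a reader could not trace back to the source framework.
missing = unmapped_elements(nodes, edges, mapping)
```

Here `missing` holds the utility node and the edge into it, the two elements left out of the placeholder table. A check like this only confirms that the table covers the diagram. It says nothing about whether the diagram itself omits dependencies present in the source framework, which is the referee's deeper concern.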
Circularity Check
No circularity; descriptive modeling from external framework descriptions
The paper constructs causal influence diagrams to represent existing AGI safety frameworks (e.g., iterated amplification, debate) based on their publicly described components, objectives, and causal structures. No equations, parameters, or derivations are present that reduce a claimed result to fitted inputs or self-definitions. The unified representation is a modeling choice, not a prediction derived from prior results in the paper. Self-citations, if any, are not load-bearing for the core claim of faithful representation and comparison. The derivation chain is self-contained against external benchmarks (the source frameworks), with no reduction by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Causal influence diagrams can accurately capture the optimization objectives and causal assumptions of AGI safety frameworks