pith. machine review for the scientific record.

arxiv: 2605.12963 · v1 · submitted 2026-05-13 · 💻 cs.AI

Recognition: unknown

Sustaining AI safety: Control-theoretic external impossibility, intrinsic necessity, and structural requirements

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 20:12 UTC · model grok-4.3

classification 💻 cs.AI
keywords AI safety · control theory · external enforcement · intrinsic safety · reachability condition · self-modification · structural requirements

The pith

Control theory proves that no externally enforced strategy can sustain AI safety once system effects exceed bounded external counteraction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper applies control-theoretic modeling to show that safety strategies relying on external enforcement fail structurally when an AI system's effects grow beyond what limited external controls can counteract. A sympathetic reader cares because this rules out an entire class of approaches—such as ongoing oversight or external alignment checks—once the reachability condition holds, rather than depending on flaws in any single tactic. If the result stands, viable safety must instead be intrinsic, meaning the system's own terminal objective starts safety-compatible, stays stable through self-modification, and continues to hold as capabilities increase. The work supplies formal structure for this limit without offering a full intrinsic strategy.
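A minimal formalization of the structure being claimed, in notation chosen for this sketch rather than taken from the paper (the state x, drift f, bounded control u, and safe set S are assumptions of the sketch):

\[
  \dot{x}(t) \;=\; f(x(t)) + u(t), \qquad \|u(t)\| \le U_{\max} \ \text{ for all } t,
\]

with safety meaning x(t) \in S for all t. In this notation the reachability premise says the system can steer itself to a boundary state where its own outward drift exceeds the control bound,

\[
  \exists\, x^{*} \in \mathcal{R}(x_0) \cap \partial S : \quad \langle f(x^{*}),\, n(x^{*}) \rangle \;>\; U_{\max},
\]

with n(x) the unit outward normal of S. At such a state, \langle \dot{x}, n \rangle \ge \langle f, n \rangle - U_{\max} > 0 for every admissible u, so the trajectory exits S no matter which external policy is running; this is the sense in which the failure is class-wide rather than strategy-specific.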

Core claim

Under explicit premises including a reachability condition, once the system's effects exceed what bounded external control can counteract, no strategy that depends in any degree on continued external enforcement can sustain AI safety. This failure is structural across the entire externally enforced class. If at least one candidate safety-sustaining strategy remains, then all such remaining strategies must be intrinsic and must satisfy four structural requirements: safety may not depend on continued external enforcement; the system's terminal objective must be safety-compatible when first formed; that objective must remain stable under self-modification; and safety must continue to be preserved as capability grows.

What carries the argument

Control-theoretic reachability condition that bounds external effects while leaving system effects unbounded, establishing class-wide impossibility for external enforcement.

Load-bearing premise

The modeling premise that external control effects stay bounded while system effects do not, once the reachability condition is met.
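A one-line scalar instance of this premise, constructed here for illustration (α and u_max are sketch parameters, not the paper's):

\[
  \dot{x} \;=\; \alpha x + u, \qquad \alpha > 0, \qquad |u| \le u_{\max} .
\]

Once x > u_{\max}/\alpha, every admissible control gives \dot{x} \ge \alpha x - u_{\max} > 0, so x grows monotonically and leaves any bounded safe set; within the bound, no external policy can reverse the escape.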

What would settle it

An explicit construction or simulation of an AI system in which bounded external interventions continue to counteract all growing system effects indefinitely without escape.
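A toy harness of the kind such a test might start from, sketched here under assumed scalar dynamics (the model and the constants ALPHA, U_MAX, and X_SAFE are hypothetical choices for illustration, not parameters from the paper):

# Toy check of bounded external control against a self-amplifying system.
# A settling counterexample would have to keep states like these contained
# indefinitely; in this scalar model, containment fails past a threshold.

import numpy as np

ALPHA = 0.5    # system self-amplification rate (hypothetical)
U_MAX = 1.0    # hard bound on external control authority (hypothetical)
X_SAFE = 5.0   # safe set is |x| <= X_SAFE
DT = 0.01      # Euler integration step

def simulate(x0: float, t_end: float = 40.0) -> tuple[float, bool]:
    """Integrate dx/dt = ALPHA*x + u with the strongest admissible feedback
    u = -clip(ALPHA*x, -U_MAX, U_MAX); return (final state, stayed safe)."""
    x = x0
    for _ in range(int(t_end / DT)):
        u = -np.clip(ALPHA * x, -U_MAX, U_MAX)  # bounded counteraction
        x += (ALPHA * x + u) * DT
        if abs(x) > X_SAFE:
            return x, False
    return x, True

for x0 in (1.9, 2.1):
    x_final, safe = simulate(x0)
    print(f"x0={x0}: stayed safe={safe}, final x={x_final:.2f}")

Below the escape threshold u_max/alpha = 2.0, the bounded controller holds the state exactly in place; above it, every admissible control loses and the state exits the safe set in finite time. A construction that settled the question in favor of external enforcement would need dynamics in which no such threshold is ever reachable.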

Figures

Figures reproduced from arXiv: 2605.12963 by James M. Mazzu.

Figure 1: Main results and their logical relationship. Red shows what is ruled out; green shows what any remaining strategy must satisfy.
original abstract

As AI systems become increasingly capable, safety strategies must be evaluated not only by how much they reduce present risk, but by whether they could sustain safety once external control can no longer reliably constrain system behavior. This paper addresses that problem by using control theory to clarify, at a structural level, whether externally enforced safety-sustaining strategies can succeed and, if not, what any alternative strategy would have to satisfy in order to be viable. It establishes two main results. First, under explicit premises including a reachability condition, it proves a class-wide external impossibility result: once the system's effects exceed what bounded external control can counteract, no strategy that depends in any degree on continued external enforcement can sustain AI safety. This failure is structural across the entire externally enforced class rather than contingent on any particular strategy. Second, it establishes a conditional class-level necessity result: if at least one candidate safety-sustaining strategy remains after that elimination, then all such remaining strategies must be intrinsic. It then states four structural requirements for viability: safety may not depend on continued external enforcement; the system's terminal objective must be safety-compatible when first formed; that objective must remain stable under self-modification; and safety must continue to be preserved as capability grows. The paper does not propose a complete strategy for sustaining AI safety. Its contribution is to give formal structure to a widely held concern about the limits of external control. It does so by deriving explicit conditional results that identify which safety-sustaining strategies are ruled out and what any remaining strategies must satisfy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper applies control theory to derive two main results on AI safety: under explicit premises including a reachability condition, it proves a class-wide external impossibility result showing that no strategy relying on continued external enforcement can sustain safety once system effects exceed bounded external control; it then derives a conditional necessity result that any remaining viable strategies must be intrinsic, and states four structural requirements (no dependence on external enforcement, safety-compatible terminal objective at formation, stability under self-modification, and preservation with capability growth). The contribution is a formal structural analysis rather than a concrete strategy.

Significance. If the derivations hold under the stated premises, the work provides a rigorous control-theoretic framework that rules out entire classes of external safety strategies and identifies necessary conditions for intrinsic alternatives. This offers formal structure to concerns about the limits of external control, which is a strength for the theoretical AI safety literature; the conditional nature of the results and absence of parameter fitting or self-referential definitions further support its value as a clarifying contribution.

major comments (2)
  1. [premises and reachability condition] Reachability condition (premises section): the central impossibility result depends on this condition strictly bounding external control effects while allowing unbounded system effects; the manuscript should provide an explicit justification or example derivation showing why system effects cannot be matched by external control under realistic AI dynamics, as this separation is load-bearing for the class-wide claim.
  2. [§5] §5, structural requirements: the necessity that the terminal objective must be safety-compatible when first formed and remain stable under self-modification is asserted as required for viability, but the derivation does not address potential hybrid cases where limited external oversight could stabilize an otherwise intrinsic objective, which could weaken the strict necessity result.
minor comments (2)
  1. [Abstract] Abstract: the four structural requirements are summarized but not enumerated; listing them explicitly would improve immediate clarity without altering the technical content.
  2. [early sections / notation] Notation and definitions: control-theoretic terms such as 'reachability condition' and 'bounded external control' should include a brief formal definition or reference to standard control theory results in an early section to aid readers from the AI safety community.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and constructive comments. The feedback helps clarify the presentation of the reachability condition and the scope of the necessity result. We respond to each major comment below and will revise the manuscript accordingly.

point-by-point responses
  1. Referee: [premises and reachability condition] Reachability condition (premises section): the central impossibility result depends on this condition strictly bounding external control effects while allowing unbounded system effects; the manuscript should provide an explicit justification or example derivation showing why system effects cannot be matched by external control under realistic AI dynamics, as this separation is load-bearing for the class-wide claim.

    Authors: The reachability condition encodes the modeling premise that external enforcement mechanisms operate with bounded reachability (e.g., limited sensors, actuators, or computational resources), while the AI system's effects become unbounded once capability growth allows access to resources or state-space regions outside that bound. This separation follows directly from standard control-theoretic assumptions on actuator limits versus plant dynamics. We will add a short explicit derivation and a concrete example (an AI system that can initiate self-replication or remote resource acquisition beyond external monitoring) in the premises section to make the justification self-contained. revision: yes

  2. Referee: [§5] §5, structural requirements: the necessity that the terminal objective must be safety-compatible when first formed and remain stable under self-modification is asserted as required for viability, but the derivation does not address potential hybrid cases where limited external oversight could stabilize an otherwise intrinsic objective, which could weaken the strict necessity result.

    Authors: The necessity result is conditional on the prior elimination of any strategy that depends on continued external enforcement. A hybrid that relies on limited external oversight for stabilization still depends on external enforcement and is therefore already ruled out by the class-wide impossibility result. We will revise §5 to explicitly note this point, showing that any residual external dependence reintroduces the reachability violation and therefore does not constitute a counter-example to the necessity claim. revision: yes
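Following up on response 1, one standard control-theoretic instance of the actuator-limits-versus-plant-dynamics separation described there, constructed here for illustration (the matrices and the unstable rate b are assumptions of this sketch): an unstable mode lying outside the subspace the control input can affect,

\[
  \dot{x} \;=\; \begin{pmatrix} a & 0 \\ 0 & b \end{pmatrix} x \;+\; \begin{pmatrix} 1 \\ 0 \end{pmatrix} u, \qquad b > 0 .
\]

The second coordinate obeys \dot{x}_2 = b x_2 independently of u, so x_2(t) = x_2(0) e^{bt} diverges under every control law once x_2(0) \neq 0. The unactuated, unmonitored direction plays the role the authors assign to self-replication or remote resource acquisition beyond external monitoring.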

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper derives its class-wide external impossibility result and conditional intrinsic necessity result from explicit control-theoretic premises, including a stated reachability condition that separates bounded external effects from unbounded system effects. These conclusions are presented as conditional on the premises holding and do not reduce to self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations. The four structural requirements follow as logical consequences of eliminating externally enforced strategies rather than being smuggled in via ansatz or prior-author uniqueness theorems. The argument remains self-contained relative to its stated assumptions and does not, by construction, reduce to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on control-theoretic modeling assumptions including a reachability condition that bounds external control effects while allowing unbounded system effects; these are treated as domain assumptions rather than derived.

axioms (1)
  • domain assumption — Reachability condition: external control effects are bounded while system effects are not
    Invoked to establish the external impossibility result once system effects exceed bounded external countermeasures.

pith-pipeline@v0.9.0 · 5569 in / 1276 out tokens · 80729 ms · 2026-05-14T20:12:03.672610+00:00 · methodology

discussion (0)

