arxiv: 2604.12986 · v1 · submitted 2026-04-14 · 💻 cs.CR · cs.AI

Recognition: unknown

Parallax: Why AI Agents That Think Must Never Act

Joel Fokou

Authors on Pith no claims yet

Pith reviewed 2026-05-10 15:08 UTC · model grok-4.3

classification 💻 cs.CR cs.AI

keywords AI agentsagent securityprompt guardrailscognitive-executive separationadversarial validationinformation flow controlreversible executionassume-compromise evaluation

0 comments

The pith

AI agents require structural separation between reasoning and execution to remain secure even when the reasoning process is fully compromised.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that prompt-level guardrails cannot protect AI agents capable of real-world actions because those guardrails reside inside the same system an adversary can take over. It introduces Parallax as an architecture that keeps the reasoning component from ever directly triggering execution, instead routing every action through an independent validator, sensitivity labels, and reversible state capture. This matters because autonomous agents are moving into enterprise infrastructure where a single successful prompt injection or model takeover can cause irreversible damage. The evaluation tests the boundary by assuming the reasoning system is already controlled and shows the external mechanisms still stop the vast majority of attacks. A sympathetic reader sees this as shifting agent safety from an unreliable linguistic defense to a hard architectural one.

Core claim

When the reasoning system is compromised, prompt-level guardrails provide zero protection because they exist only within the compromised system; Parallax's architectural boundary holds regardless. The boundary is created by Cognitive-Executive Separation that prevents the reasoning module from executing actions, Adversarial Validation with Graduated Determinism that interposes an independent multi-tiered checker, Information Flow Control that propagates sensitivity labels to catch context-dependent threats, and Reversible Execution that preserves pre-action state for rollback. In the OpenParallax Go reference implementation, Assume-Compromise Evaluation across 280 adversarial cases in nine攻击

What carries the argument

Cognitive-Executive Separation, which structurally prevents the reasoning system from executing actions and requires every action to pass through an independent validator before any change occurs.

Load-bearing premise

The validator, sensitivity labels, and rollback mechanisms stay independent and uncompromised even when an adversary completely controls the reasoning component.

What would settle it

Replace the reasoning module with fully adversarial code that attempts to issue a destructive command such as deleting files or exfiltrating data; if the validator permits the action to complete, the claim that the architectural boundary survives reasoning compromise is falsified.

Figures

Figures reproduced from arXiv: 2604.12986 by Joel Fokou.

**Figure 1.** Figure 1: The Parallax architecture. The agent (untrusted, sandboxed) proposes actions via gRPC. Shield validates each action through four tiers of graduated determinism; actions that fail validation are blocked. Only validated actions reach the executor. IFC tags propagate data sensitivity into Shield for context-aware evaluation. Chronicle captures state before destructive actions. The agent cannot access or modif… view at source ↗

read the original abstract

Autonomous AI agents are rapidly transitioning from experimental tools to operational infrastructure, with projections that 80% of enterprise applications will embed AI copilots by the end of 2026. As agents gain the ability to execute real-world actions (reading files, running commands, making network requests, modifying databases), a fundamental security gap has emerged. The dominant approach to agent safety relies on prompt-level guardrails: natural language instructions that operate at the same abstraction level as the threats they attempt to mitigate. This paper argues that prompt-based safety is architecturally insufficient for agents with execution capability and introduces Parallax, a paradigm for safe autonomous AI execution grounded in four principles: Cognitive-Executive Separation, which structurally prevents the reasoning system from executing actions; Adversarial Validation with Graduated Determinism, which interposes an independent, multi-tiered validator between reasoning and execution; Information Flow Control, which propagates data sensitivity labels through agent workflows to detect context-dependent threats; and Reversible Execution, which captures pre-destructive state to enable rollback when validation fails. We present OpenParallax, an open-source reference implementation in Go, and evaluate it using Assume-Compromise Evaluation, a methodology that bypasses the reasoning system entirely to test the architectural boundary under full agent compromise. Across 280 adversarial test cases in nine attack categories, Parallax blocks 98.9% of attacks with zero false positives under its default configuration, and 100% of attacks under its maximum-security configuration. When the reasoning system is compromised, prompt-level guardrails provide zero protection because they exist only within the compromised system; Parallax's architectural boundary holds regardless.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that prompt-level guardrails are architecturally insufficient for AI agents with real-world execution capabilities because they reside within the same system that can be compromised. It introduces Parallax, an architecture based on four principles—Cognitive-Executive Separation, Adversarial Validation with Graduated Determinism, Information Flow Control, and Reversible Execution—along with an open-source Go reference implementation (OpenParallax). Using Assume-Compromise Evaluation that bypasses the reasoning system, the work reports 98.9% attack blocking across 280 test cases in nine categories with zero false positives under default configuration (and 100% under maximum-security configuration), arguing that the architectural boundary holds even under full compromise of the reasoning component.

Significance. If the independence of the validator, label propagation, and rollback mechanisms can be rigorously established, this work offers a substantive alternative to prompt-based safety for autonomous agents by shifting to structural isolation. The open-source reference implementation and the Assume-Compromise Evaluation methodology are explicit strengths that support reproducibility and allow independent scrutiny of the claimed separation.

major comments (1)

[Evaluation / Assume-Compromise Evaluation] Evaluation / Assume-Compromise Evaluation: The central claim that Parallax's boundary holds 'regardless' of reasoning-system compromise rests on the validator, information-flow labels, and rollback logic remaining independent and uncompromised. The evaluation bypasses the reasoner by construction and reports 98.9% blocking with zero false positives, but provides no isolation proof, enumeration of entry points (shared state, IPC, file descriptors, environment variables, or database connections), or test that the validator itself cannot be reached or influenced by a compromised reasoner. This assumption is load-bearing; without it the reported figures only confirm behavior under the maintained assumption rather than validating the assumption.

minor comments (2)

[Abstract] Abstract: The abstract states the 98.9% blocking result and zero false positives but supplies no methodology details, test-case definitions, attack-category specifications, or error analysis, which limits immediate assessment of evaluation soundness.
The manuscript would benefit from explicit discussion of how the four principles interact under partial rather than total compromise scenarios.

Simulated Author's Rebuttal

1 responses · 1 unresolved

We thank the referee for the constructive feedback and for highlighting the strengths of the Assume-Compromise Evaluation and open-source implementation. We address the major comment on validating the architectural boundary below. We agree that the independence of the validator is a load-bearing assumption and will revise the manuscript to provide greater clarity and implementation details on this point.

read point-by-point responses

Referee: The central claim that Parallax's boundary holds 'regardless' of reasoning-system compromise rests on the validator, information-flow labels, and rollback logic remaining independent and uncompromised. The evaluation bypasses the reasoner by construction and reports 98.9% blocking with zero false positives, but provides no isolation proof, enumeration of entry points (shared state, IPC, file descriptors, environment variables, or database connections), or test that the validator itself cannot be reached or influenced by a compromised reasoner. This assumption is load-bearing; without it the reported figures only confirm behavior under the maintained assumption rather than validating the assumption.

Authors: We agree that the evaluation tests Parallax under the maintained assumption of separation rather than proving the separation itself. The Assume-Compromise Evaluation is explicitly constructed to bypass the reasoning system, as described in the paper, to isolate the behavior of the validator, label propagation, and rollback mechanisms. In the revised manuscript we will add a new subsection in the OpenParallax Implementation section that enumerates potential entry points—including shared state, IPC channels, file descriptors, environment variables, and database connections—and details the concrete controls used in the reference implementation (separate processes with restricted OS-level permissions, one-way action-request channels, read-only label propagation, and no writable paths from the reasoner to validator state). We will also add an explicit limitations paragraph clarifying that the reported 98.9% (default) and 100% (maximum-security) blocking rates validate the design assuming the boundary is maintained, rather than constituting a formal proof of isolation. A complete formal proof or model-checked verification of isolation against arbitrary compromise lies outside the scope of this work, which focuses on architectural principles and empirical evaluation; we will state this limitation directly. revision: partial

standing simulated objections not resolved

A formal isolation proof or model-checking verification that the validator cannot be reached or influenced by a compromised reasoner.

Circularity Check

0 steps flagged

No circularity: claims rest on explicit architectural definitions and empirical tests without reduction to inputs

full rationale

The paper defines Parallax via four architectural principles (Cognitive-Executive Separation, Adversarial Validation, Information Flow Control, Reversible Execution) and evaluates the resulting Go implementation directly with Assume-Compromise Evaluation across 280 cases. No equations, fitted parameters, self-citations, or derived predictions appear in the provided text. The 98.9% blocking rate is reported as an empirical outcome of the implementation under the stated test methodology rather than a quantity forced by redefinition or prior self-referential results. The central assertion that prompt guardrails fail under compromise while the architectural boundary holds is presented as a direct consequence of the separation design, not a circular renaming or smuggling of an ansatz.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the validator layer remaining outside the compromised reasoning system and on the practical enforceability of the four principles in deployed code.

axioms (1)

domain assumption The validator, label propagation, and rollback components stay independent and uncompromised when the reasoning system is fully controlled by an attacker.
This assumption underpins the Assume-Compromise Evaluation and the claim that architectural boundaries survive reasoning-system compromise.

pith-pipeline@v0.9.0 · 5583 in / 1227 out tokens · 70374 ms · 2026-05-10T15:08:59.629229+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

52 extracted references · 9 canonical work pages · 3 internal anchors

[1]

Predicts 2026: AI Agents Transform Enterprise Applications

Gartner, Inc. Predicts 2026: AI Agents Transform Enterprise Applications. Tech- nical Report, 2025

2026
[2]

FutureScape: Worldwide AI and Automa- tion 2026 Predictions

International Data Corporation (IDC). FutureScape: Worldwide AI and Automa- tion 2026 Predictions. IDC FutureScape, Doc #US51739524, 2025

2026
[3]

State of AI 2026: How AI Is Driving Revenue, Cutting Costs and Boosting Productivity for Every Industry

NVIDIA Corporation. State of AI 2026: How AI Is Driving Revenue, Cutting Costs and Boosting Productivity for Every Industry. NVIDIA Blog, March 2026

2026
[4]

OWASP Top 10 for LLM Applications 2025

OWASP Foundation. OWASP Top 10 for LLM Applications 2025. Version 2.0, 2025

2025
[5]

OWASP Top 10 for Agentic Applications

OWASP Foundation. OWASP Top 10 for Agentic Applications. Version 1.0, De- cember 2025

2025
[6]

A Practical Guide to Building Agents: Safety

OpenAI. A Practical Guide to Building Agents: Safety. OpenAI Developer Docu- mentation, 2026

2026
[7]

AI Agent Hijacking: Strengthening Evaluations for Autonomous AI Systems

National Institute of Standards and Technology (NIST). AI Agent Hijacking: Strengthening Evaluations for Autonomous AI Systems. NIST AI 100-series, February 2026

2026
[8]

Prompt Injection Attack Trends in Enterprise AI Systems, Q4 2025

Wiz Research. Prompt Injection Attack Trends in Enterprise AI Systems, Q4 2025. Wiz Threat Research Report, 2026

2025
[9]

Prompt Injection Statistics 2026: Hidden Risks Now

SQ Magazine. Prompt Injection Statistics 2026: Hidden Risks Now. March 2026

2026
[10]

Fooling AI Agents: Web-Based Indirect Prompt Injection Observed in the Wild

Palo Alto Networks, Unit 42. Fooling AI Agents: Web-Based Indirect Prompt Injection Observed in the Wild. Unit 42 Threat Research, March 2026

2026
[11]

S. S. Srivastava and H. He. MemoryGraft: Implanting False Experiences in AI Agent Memory. Preprint, December 2025

2025
[12]

M. A. Ferrag, N. Tihanyi, D. Hamouda, L. Maglaras, A. Lakas, and M. Debbah. From Prompt Injections to Protocol Exploits: Threats in LLM-Powered AI Agent Workflows.ScienceDirect, 2025

2025
[13]

LLM Security Risks in 2026: Prompt Injection, RAG, and Shadow AI

Sombra Inc. LLM Security Risks in 2026: Prompt Injection, RAG, and Shadow AI. Sombra Security Blog, January 2026

2026
[14]

CVE-2026-25253: OpenClaw Critical Vul- nerability and Supply Chain Attack

NIST National Vulnerability Database. CVE-2026-25253: OpenClaw Critical Vul- nerability and Supply Chain Attack. NIST National Vulnerability Database, 2026

2026
[15]

Prompt Injection Attacks 2026: AI Security Crisis Escalates

MarkAICode. Prompt Injection Attacks 2026: AI Security Crisis Escalates. March 2026

2026
[16]

Provos, M

N. Provos, M. Friedl, and P. Honeyman. Preventing Privilege Escalation. InPro- ceedings of the 12th USENIX Security Symposium, pp. 231–242, 2003

2003
[17]

D. E. Bell and L. J. LaPadula. Secure Computer Systems: Unified Exposition and Multics Interpretation. Technical Report MTR-2997 Rev. 1, The MITRE Corpora- tion, 1976

1976
[18]

TPM 2.0 Library Specification

Trusted Computing Group. TPM 2.0 Library Specification. Family “2.0”, Level 00, Revision 01.38, 2016

2016
[19]

J. H. Saltzer and M. D. Schroeder. The Protection of Information in Computer Systems.Proceedings of the IEEE, 63(9):1278–1308, 1975

1975
[20]

Regulation (EU) 2024/1689 Laying Down Harmonised Rules on Artificial Intelligence (AI Act)

European Parliament and Council of the European Union. Regulation (EU) 2024/1689 Laying Down Harmonised Rules on Artificial Intelligence (AI Act). Official Journal of the European Union, August 2024

2024
[21]

Model AI Gover- nance Framework for Agentic AI

Infocomm Media Development Authority (IMDA), Singapore. Model AI Gover- nance Framework for Agentic AI. January 2026

2026
[22]

P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei. Deep Rein- forcement Learning from Human Preferences. InAdvances in Neural Information Processing Systems (NeurIPS), pp. 4299–4307, 2017

2017
[23]

Constitutional AI: Harmlessness from AI Feedback

Y. Baiet al.Constitutional AI: Harmlessness from AI Feedback.arXiv preprint arXiv:2212.08073, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[24]

Willison

S. Willison. The Lethal Trifecta. https://simonwillison.net/, 2025

2025
[25]

Addressing the OWASP Top 10 Risks in Agentic AI with Microsoft Copilot Studio

Microsoft Security Blog. Addressing the OWASP Top 10 Risks in Agentic AI with Microsoft Copilot Studio. March 2026

2026
[26]

A. C. Myers and B. Liskov. A Decentralized Model for Information Flow Control. InProceedings of the 16th ACM Symposium on Operating Systems Principles (SOSP), pp. 129–142, 1997

1997
[27]

ATLAS: Adversarial Threat Landscape for Artificial-Intelli- gence Systems, v5.4.0

MITRE Corporation. ATLAS: Adversarial Threat Landscape for Artificial-Intelli- gence Systems, v5.4.0. February 2026

2026
[28]

V. S. Narajala and V. Anca. Securing Agentic AI: A Comprehensive Threat Model and Mitigation Framework for Generative AI Agents.arXiv preprint arXiv:2504.19956, 2025

work page arXiv 2025
[29]

Top Agentic AI Security Threats in Late 2026

Stellar Cyber. Top Agentic AI Security Threats in Late 2026. Stellar Cyber Threat Research, March 2026

2026
[30]

A. Baby. The Privilege Escalation Kill Chain: How AI Agents Self-Grant Permis- sions and Persist Across Sessions. Technical Analysis, March 2026

2026
[31]

Securing AI Agents: The Defining Cybersecurity Challenge of 2026

Bessemer Venture Partners. Securing AI Agents: The Defining Cybersecurity Challenge of 2026. Bessemer Atlas, March 2026

2026
[32]

2026 Global Threat Report: Evasive Adversary Wields AI

CrowdStrike. 2026 Global Threat Report: Evasive Adversary Wields AI. February 2026

2026
[33]

Christodorescu, E

M. Christodorescu, E. Fernandes, A. Hooda, S. Jha, J. Rehberger, and K. Shams. Systems Security Foundations for Agentic Computing.IACR ePrint Archive, Report 2025/2173, 2025

2025
[34]

Security Considerations for Artificial Intelligence Agents

J. Maet al.Security Considerations for Artificial Intelligence Agents.arXiv preprint arXiv:2603.12230, March 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[35]

Securing the AI Agent Revolution: A Practical Guide to Model Context Protocol Security

Coalition for Secure AI (CoSAI). Securing the AI Agent Revolution: A Practical Guide to Model Context Protocol Security. January 2026

2026
[36]

Pierucci, M

F. Pierucci, M. Galisai, M. S. Bracale, M. Prandi, P. Bisconti, F. Giarrusso, O. Soroko- letova, V. Suriani, and D. Nardi. Institutional AI: A Governance Framework for Distributional AGI Safety.arXiv preprint arXiv:2601.10599, January 2026

work page arXiv 2026
[37]

AI Safety vs AI Security in LLM Applications: What Teams Must Know

Promptfoo. AI Safety vs AI Security in LLM Applications: What Teams Must Know. August 2025

2025
[38]

Chuet al.Jailbreaking LLMs: A Survey of Attacks, Defenses and Evaluation

E. Chuet al.Jailbreaking LLMs: A Survey of Attacks, Defenses and Evaluation. TechRxiv Preprint, 2026

2026
[39]

Agentic AI on Kubernetes and GKE: Introducing Agent Sandbox

Google Cloud. Agentic AI on Kubernetes and GKE: Introducing Agent Sandbox. Google Cloud Blog, November 2025

2025
[40]

R. F. Del Rosario, K. Krawiecka, and C. Schroeder de Witt. Architecting Re- silient LLM Agents: A Guide to Secure Plan-then-Execute Patterns.arXiv preprint arXiv:2509.08646, 2025. Joel Fokou

work page arXiv 2025
[41]

A. Nelson. The Mirage of AI Deregulation.Science, 391(6782), January 2026

2026
[42]

Zhang, J

H. Zhang, J. Huang, K. Mei,et al.Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents. InProceedings of the International Conference on Learning Representations (ICLR), 2025

2025
[43]

Andriushchenko, A

M. Andriushchenko, A. Souly, M. Dziemian,et al.AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents. InProceedings of the International Conference on Learning Representations (ICLR), 2025

2025
[44]

Bazinska, M

J. Bazinska, M. Mathys, F. Casucci,et al.Breaking Agent Backbones: Evaluating the Security of Backbone LLMs in AI Agents. InProceedings of the International Conference on Learning Representations (ICLR), 2026

2026
[45]

Tool Use with Claude: ToolSearch for MCP Servers

Anthropic. Tool Use with Claude: ToolSearch for MCP Servers. Anthropic Devel- oper Documentation, 2025

2025
[46]

Lumer, A

E. Lumer, A. Gulati, V. K. Subbiah, P. H. Basavaraju, and J. A. Burke. MemTool: Optimizing Short-Term Memory Management for Dynamic Tool Retrieval and In- vocation in LLM Agent Multi-Turn Conversations. InProceedings of the European Conference on Information Retrieval (ECIR), 2026

2026
[47]

Prompt Injection and Jailbreak Detection Dataset

NeurAlchemy. Prompt Injection and Jailbreak Detection Dataset. HuggingFace,
[48]

https://huggingface.co/datasets/neuralchemy/Prompt-injection-dataset
[49]

Y. Wu, F. Roesner, T. Kohno, N. Zhang, and U. Iqbal. IsolateGPT: An Execution Isolation Architecture for LLM-Based Agentic Systems. InProceedings of the Network and Distributed System Security Symposium (NDSS), 2025.arXiv preprint arXiv:2403.04960

work page arXiv 2025
[50]

E. Li, T. Mallick, E. Rose, W. Robertson, A. Oprea, and C. Nita-Rotaru. ACE: A Security Architecture for LLM-Integrated App Systems.arXiv preprint arXiv:2504.20984, 2025

work page internal anchor Pith review arXiv 2025
[51]

Zhang, Z

K. Zhang, Z. Su, P.-Y. Chen, E. Bertino, X. Zhang, and N. Li. LLM Agents Should Employ Security Principles.arXiv preprint arXiv:2505.24019, 2025

work page arXiv 2025
[52]

Z. Ji, D. Wu, W. Jiang, P. Ma, Z. Li, Y. Gao, S. Wang, and Y. Li. Taming Various Privilege Escalation in LLM-Based Agent Systems: A Mandatory Access Control Framework.arXiv preprint arXiv:2601.11893, 2026

work page arXiv 2026