Recognition: unknown
Parallax: Why AI Agents That Think Must Never Act
Pith reviewed 2026-05-10 15:08 UTC · model grok-4.3
The pith
AI agents require structural separation between reasoning and execution to remain secure even when the reasoning process is fully compromised.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When the reasoning system is compromised, prompt-level guardrails provide zero protection because they exist only within the compromised system; Parallax's architectural boundary holds regardless. The boundary is created by Cognitive-Executive Separation that prevents the reasoning module from executing actions, Adversarial Validation with Graduated Determinism that interposes an independent multi-tiered checker, Information Flow Control that propagates sensitivity labels to catch context-dependent threats, and Reversible Execution that preserves pre-action state for rollback. In the OpenParallax Go reference implementation, Assume-Compromise Evaluation across 280 adversarial cases in nine攻击
What carries the argument
Cognitive-Executive Separation, which structurally prevents the reasoning system from executing actions and requires every action to pass through an independent validator before any change occurs.
Load-bearing premise
The validator, sensitivity labels, and rollback mechanisms stay independent and uncompromised even when an adversary completely controls the reasoning component.
What would settle it
Replace the reasoning module with fully adversarial code that attempts to issue a destructive command such as deleting files or exfiltrating data; if the validator permits the action to complete, the claim that the architectural boundary survives reasoning compromise is falsified.
Figures
read the original abstract
Autonomous AI agents are rapidly transitioning from experimental tools to operational infrastructure, with projections that 80% of enterprise applications will embed AI copilots by the end of 2026. As agents gain the ability to execute real-world actions (reading files, running commands, making network requests, modifying databases), a fundamental security gap has emerged. The dominant approach to agent safety relies on prompt-level guardrails: natural language instructions that operate at the same abstraction level as the threats they attempt to mitigate. This paper argues that prompt-based safety is architecturally insufficient for agents with execution capability and introduces Parallax, a paradigm for safe autonomous AI execution grounded in four principles: Cognitive-Executive Separation, which structurally prevents the reasoning system from executing actions; Adversarial Validation with Graduated Determinism, which interposes an independent, multi-tiered validator between reasoning and execution; Information Flow Control, which propagates data sensitivity labels through agent workflows to detect context-dependent threats; and Reversible Execution, which captures pre-destructive state to enable rollback when validation fails. We present OpenParallax, an open-source reference implementation in Go, and evaluate it using Assume-Compromise Evaluation, a methodology that bypasses the reasoning system entirely to test the architectural boundary under full agent compromise. Across 280 adversarial test cases in nine attack categories, Parallax blocks 98.9% of attacks with zero false positives under its default configuration, and 100% of attacks under its maximum-security configuration. When the reasoning system is compromised, prompt-level guardrails provide zero protection because they exist only within the compromised system; Parallax's architectural boundary holds regardless.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that prompt-level guardrails are architecturally insufficient for AI agents with real-world execution capabilities because they reside within the same system that can be compromised. It introduces Parallax, an architecture based on four principles—Cognitive-Executive Separation, Adversarial Validation with Graduated Determinism, Information Flow Control, and Reversible Execution—along with an open-source Go reference implementation (OpenParallax). Using Assume-Compromise Evaluation that bypasses the reasoning system, the work reports 98.9% attack blocking across 280 test cases in nine categories with zero false positives under default configuration (and 100% under maximum-security configuration), arguing that the architectural boundary holds even under full compromise of the reasoning component.
Significance. If the independence of the validator, label propagation, and rollback mechanisms can be rigorously established, this work offers a substantive alternative to prompt-based safety for autonomous agents by shifting to structural isolation. The open-source reference implementation and the Assume-Compromise Evaluation methodology are explicit strengths that support reproducibility and allow independent scrutiny of the claimed separation.
major comments (1)
- [Evaluation / Assume-Compromise Evaluation] Evaluation / Assume-Compromise Evaluation: The central claim that Parallax's boundary holds 'regardless' of reasoning-system compromise rests on the validator, information-flow labels, and rollback logic remaining independent and uncompromised. The evaluation bypasses the reasoner by construction and reports 98.9% blocking with zero false positives, but provides no isolation proof, enumeration of entry points (shared state, IPC, file descriptors, environment variables, or database connections), or test that the validator itself cannot be reached or influenced by a compromised reasoner. This assumption is load-bearing; without it the reported figures only confirm behavior under the maintained assumption rather than validating the assumption.
minor comments (2)
- [Abstract] Abstract: The abstract states the 98.9% blocking result and zero false positives but supplies no methodology details, test-case definitions, attack-category specifications, or error analysis, which limits immediate assessment of evaluation soundness.
- The manuscript would benefit from explicit discussion of how the four principles interact under partial rather than total compromise scenarios.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for highlighting the strengths of the Assume-Compromise Evaluation and open-source implementation. We address the major comment on validating the architectural boundary below. We agree that the independence of the validator is a load-bearing assumption and will revise the manuscript to provide greater clarity and implementation details on this point.
read point-by-point responses
-
Referee: The central claim that Parallax's boundary holds 'regardless' of reasoning-system compromise rests on the validator, information-flow labels, and rollback logic remaining independent and uncompromised. The evaluation bypasses the reasoner by construction and reports 98.9% blocking with zero false positives, but provides no isolation proof, enumeration of entry points (shared state, IPC, file descriptors, environment variables, or database connections), or test that the validator itself cannot be reached or influenced by a compromised reasoner. This assumption is load-bearing; without it the reported figures only confirm behavior under the maintained assumption rather than validating the assumption.
Authors: We agree that the evaluation tests Parallax under the maintained assumption of separation rather than proving the separation itself. The Assume-Compromise Evaluation is explicitly constructed to bypass the reasoning system, as described in the paper, to isolate the behavior of the validator, label propagation, and rollback mechanisms. In the revised manuscript we will add a new subsection in the OpenParallax Implementation section that enumerates potential entry points—including shared state, IPC channels, file descriptors, environment variables, and database connections—and details the concrete controls used in the reference implementation (separate processes with restricted OS-level permissions, one-way action-request channels, read-only label propagation, and no writable paths from the reasoner to validator state). We will also add an explicit limitations paragraph clarifying that the reported 98.9% (default) and 100% (maximum-security) blocking rates validate the design assuming the boundary is maintained, rather than constituting a formal proof of isolation. A complete formal proof or model-checked verification of isolation against arbitrary compromise lies outside the scope of this work, which focuses on architectural principles and empirical evaluation; we will state this limitation directly. revision: partial
- A formal isolation proof or model-checking verification that the validator cannot be reached or influenced by a compromised reasoner.
Circularity Check
No circularity: claims rest on explicit architectural definitions and empirical tests without reduction to inputs
full rationale
The paper defines Parallax via four architectural principles (Cognitive-Executive Separation, Adversarial Validation, Information Flow Control, Reversible Execution) and evaluates the resulting Go implementation directly with Assume-Compromise Evaluation across 280 cases. No equations, fitted parameters, self-citations, or derived predictions appear in the provided text. The 98.9% blocking rate is reported as an empirical outcome of the implementation under the stated test methodology rather than a quantity forced by redefinition or prior self-referential results. The central assertion that prompt guardrails fail under compromise while the architectural boundary holds is presented as a direct consequence of the separation design, not a circular renaming or smuggling of an ansatz.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The validator, label propagation, and rollback components stay independent and uncompromised when the reasoning system is fully controlled by an attacker.
Reference graph
Works this paper leans on
-
[1]
Predicts 2026: AI Agents Transform Enterprise Applications
Gartner, Inc. Predicts 2026: AI Agents Transform Enterprise Applications. Tech- nical Report, 2025
2026
-
[2]
FutureScape: Worldwide AI and Automa- tion 2026 Predictions
International Data Corporation (IDC). FutureScape: Worldwide AI and Automa- tion 2026 Predictions. IDC FutureScape, Doc #US51739524, 2025
2026
-
[3]
State of AI 2026: How AI Is Driving Revenue, Cutting Costs and Boosting Productivity for Every Industry
NVIDIA Corporation. State of AI 2026: How AI Is Driving Revenue, Cutting Costs and Boosting Productivity for Every Industry. NVIDIA Blog, March 2026
2026
-
[4]
OWASP Top 10 for LLM Applications 2025
OWASP Foundation. OWASP Top 10 for LLM Applications 2025. Version 2.0, 2025
2025
-
[5]
OWASP Top 10 for Agentic Applications
OWASP Foundation. OWASP Top 10 for Agentic Applications. Version 1.0, De- cember 2025
2025
-
[6]
A Practical Guide to Building Agents: Safety
OpenAI. A Practical Guide to Building Agents: Safety. OpenAI Developer Docu- mentation, 2026
2026
-
[7]
AI Agent Hijacking: Strengthening Evaluations for Autonomous AI Systems
National Institute of Standards and Technology (NIST). AI Agent Hijacking: Strengthening Evaluations for Autonomous AI Systems. NIST AI 100-series, February 2026
2026
-
[8]
Prompt Injection Attack Trends in Enterprise AI Systems, Q4 2025
Wiz Research. Prompt Injection Attack Trends in Enterprise AI Systems, Q4 2025. Wiz Threat Research Report, 2026
2025
-
[9]
Prompt Injection Statistics 2026: Hidden Risks Now
SQ Magazine. Prompt Injection Statistics 2026: Hidden Risks Now. March 2026
2026
-
[10]
Fooling AI Agents: Web-Based Indirect Prompt Injection Observed in the Wild
Palo Alto Networks, Unit 42. Fooling AI Agents: Web-Based Indirect Prompt Injection Observed in the Wild. Unit 42 Threat Research, March 2026
2026
-
[11]
S. S. Srivastava and H. He. MemoryGraft: Implanting False Experiences in AI Agent Memory. Preprint, December 2025
2025
-
[12]
M. A. Ferrag, N. Tihanyi, D. Hamouda, L. Maglaras, A. Lakas, and M. Debbah. From Prompt Injections to Protocol Exploits: Threats in LLM-Powered AI Agent Workflows.ScienceDirect, 2025
2025
-
[13]
LLM Security Risks in 2026: Prompt Injection, RAG, and Shadow AI
Sombra Inc. LLM Security Risks in 2026: Prompt Injection, RAG, and Shadow AI. Sombra Security Blog, January 2026
2026
-
[14]
CVE-2026-25253: OpenClaw Critical Vul- nerability and Supply Chain Attack
NIST National Vulnerability Database. CVE-2026-25253: OpenClaw Critical Vul- nerability and Supply Chain Attack. NIST National Vulnerability Database, 2026
2026
-
[15]
Prompt Injection Attacks 2026: AI Security Crisis Escalates
MarkAICode. Prompt Injection Attacks 2026: AI Security Crisis Escalates. March 2026
2026
-
[16]
Provos, M
N. Provos, M. Friedl, and P. Honeyman. Preventing Privilege Escalation. InPro- ceedings of the 12th USENIX Security Symposium, pp. 231–242, 2003
2003
-
[17]
D. E. Bell and L. J. LaPadula. Secure Computer Systems: Unified Exposition and Multics Interpretation. Technical Report MTR-2997 Rev. 1, The MITRE Corpora- tion, 1976
1976
-
[18]
TPM 2.0 Library Specification
Trusted Computing Group. TPM 2.0 Library Specification. Family “2.0”, Level 00, Revision 01.38, 2016
2016
-
[19]
J. H. Saltzer and M. D. Schroeder. The Protection of Information in Computer Systems.Proceedings of the IEEE, 63(9):1278–1308, 1975
1975
-
[20]
Regulation (EU) 2024/1689 Laying Down Harmonised Rules on Artificial Intelligence (AI Act)
European Parliament and Council of the European Union. Regulation (EU) 2024/1689 Laying Down Harmonised Rules on Artificial Intelligence (AI Act). Official Journal of the European Union, August 2024
2024
-
[21]
Model AI Gover- nance Framework for Agentic AI
Infocomm Media Development Authority (IMDA), Singapore. Model AI Gover- nance Framework for Agentic AI. January 2026
2026
-
[22]
P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei. Deep Rein- forcement Learning from Human Preferences. InAdvances in Neural Information Processing Systems (NeurIPS), pp. 4299–4307, 2017
2017
-
[23]
Constitutional AI: Harmlessness from AI Feedback
Y. Baiet al.Constitutional AI: Harmlessness from AI Feedback.arXiv preprint arXiv:2212.08073, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[24]
Willison
S. Willison. The Lethal Trifecta. https://simonwillison.net/, 2025
2025
-
[25]
Addressing the OWASP Top 10 Risks in Agentic AI with Microsoft Copilot Studio
Microsoft Security Blog. Addressing the OWASP Top 10 Risks in Agentic AI with Microsoft Copilot Studio. March 2026
2026
-
[26]
A. C. Myers and B. Liskov. A Decentralized Model for Information Flow Control. InProceedings of the 16th ACM Symposium on Operating Systems Principles (SOSP), pp. 129–142, 1997
1997
-
[27]
ATLAS: Adversarial Threat Landscape for Artificial-Intelli- gence Systems, v5.4.0
MITRE Corporation. ATLAS: Adversarial Threat Landscape for Artificial-Intelli- gence Systems, v5.4.0. February 2026
2026
- [28]
-
[29]
Top Agentic AI Security Threats in Late 2026
Stellar Cyber. Top Agentic AI Security Threats in Late 2026. Stellar Cyber Threat Research, March 2026
2026
-
[30]
A. Baby. The Privilege Escalation Kill Chain: How AI Agents Self-Grant Permis- sions and Persist Across Sessions. Technical Analysis, March 2026
2026
-
[31]
Securing AI Agents: The Defining Cybersecurity Challenge of 2026
Bessemer Venture Partners. Securing AI Agents: The Defining Cybersecurity Challenge of 2026. Bessemer Atlas, March 2026
2026
-
[32]
2026 Global Threat Report: Evasive Adversary Wields AI
CrowdStrike. 2026 Global Threat Report: Evasive Adversary Wields AI. February 2026
2026
-
[33]
Christodorescu, E
M. Christodorescu, E. Fernandes, A. Hooda, S. Jha, J. Rehberger, and K. Shams. Systems Security Foundations for Agentic Computing.IACR ePrint Archive, Report 2025/2173, 2025
2025
-
[34]
Security Considerations for Artificial Intelligence Agents
J. Maet al.Security Considerations for Artificial Intelligence Agents.arXiv preprint arXiv:2603.12230, March 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[35]
Securing the AI Agent Revolution: A Practical Guide to Model Context Protocol Security
Coalition for Secure AI (CoSAI). Securing the AI Agent Revolution: A Practical Guide to Model Context Protocol Security. January 2026
2026
-
[36]
F. Pierucci, M. Galisai, M. S. Bracale, M. Prandi, P. Bisconti, F. Giarrusso, O. Soroko- letova, V. Suriani, and D. Nardi. Institutional AI: A Governance Framework for Distributional AGI Safety.arXiv preprint arXiv:2601.10599, January 2026
-
[37]
AI Safety vs AI Security in LLM Applications: What Teams Must Know
Promptfoo. AI Safety vs AI Security in LLM Applications: What Teams Must Know. August 2025
2025
-
[38]
Chuet al.Jailbreaking LLMs: A Survey of Attacks, Defenses and Evaluation
E. Chuet al.Jailbreaking LLMs: A Survey of Attacks, Defenses and Evaluation. TechRxiv Preprint, 2026
2026
-
[39]
Agentic AI on Kubernetes and GKE: Introducing Agent Sandbox
Google Cloud. Agentic AI on Kubernetes and GKE: Introducing Agent Sandbox. Google Cloud Blog, November 2025
2025
- [40]
-
[41]
A. Nelson. The Mirage of AI Deregulation.Science, 391(6782), January 2026
2026
-
[42]
Zhang, J
H. Zhang, J. Huang, K. Mei,et al.Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents. InProceedings of the International Conference on Learning Representations (ICLR), 2025
2025
-
[43]
Andriushchenko, A
M. Andriushchenko, A. Souly, M. Dziemian,et al.AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents. InProceedings of the International Conference on Learning Representations (ICLR), 2025
2025
-
[44]
Bazinska, M
J. Bazinska, M. Mathys, F. Casucci,et al.Breaking Agent Backbones: Evaluating the Security of Backbone LLMs in AI Agents. InProceedings of the International Conference on Learning Representations (ICLR), 2026
2026
-
[45]
Tool Use with Claude: ToolSearch for MCP Servers
Anthropic. Tool Use with Claude: ToolSearch for MCP Servers. Anthropic Devel- oper Documentation, 2025
2025
-
[46]
Lumer, A
E. Lumer, A. Gulati, V. K. Subbiah, P. H. Basavaraju, and J. A. Burke. MemTool: Optimizing Short-Term Memory Management for Dynamic Tool Retrieval and In- vocation in LLM Agent Multi-Turn Conversations. InProceedings of the European Conference on Information Retrieval (ECIR), 2026
2026
-
[47]
Prompt Injection and Jailbreak Detection Dataset
NeurAlchemy. Prompt Injection and Jailbreak Detection Dataset. HuggingFace,
-
[48]
https://huggingface.co/datasets/neuralchemy/Prompt-injection-dataset
- [49]
-
[50]
E. Li, T. Mallick, E. Rose, W. Robertson, A. Oprea, and C. Nita-Rotaru. ACE: A Security Architecture for LLM-Integrated App Systems.arXiv preprint arXiv:2504.20984, 2025
work page internal anchor Pith review arXiv 2025
- [51]
- [52]
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.