Beyond Static Sandboxing: Learned Capability Governance for Autonomous AI Agents
Pith reviewed 2026-05-10 15:42 UTC · model grok-4.3
The pith
A governance layer can learn the minimum viable set of tools for each task from audit logs, instead of giving agents blanket access to all capabilities.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that a reinforcement learning policy trained on accumulated audit logs can reliably determine the minimum viable capability set for each task type, which is then enforced by scoping tool visibility before a session begins and by intercepting calls during execution.
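The paper supplies no implementation, but the enforcement half of this claim is simple to sketch. Below is a minimal, hypothetical illustration, with tool names and data structures of our own invention, of how a learned per-task mask could scope tool visibility before a session starts:

```python
# Hypothetical sketch: tool names, types, and the mask format are ours, not
# Aethelgard's. The learned policy is reduced here to a static mapping from
# task type to allowed tools.
from dataclasses import dataclass

ALL_TOOLS = ["read_file", "write_file", "shell_exec", "spawn_subagent",
             "credential_access", "http_get", "summarize"]

@dataclass
class Session:
    task_type: str
    visible_tools: list[str]

def scope_session(task_type: str, policy_mask: dict[str, set[str]]) -> Session:
    """Expose only the tools the learned policy selected for this task type."""
    # Unseen task types fall back to the permissive default the paper criticizes.
    allowed = policy_mask.get(task_type, set(ALL_TOOLS))
    return Session(task_type, [t for t in ALL_TOOLS if t in allowed])

# A trained policy might map a summarization task to two tools instead of seven.
mask = {"summarize_doc": {"read_file", "summarize"}}
assert "credential_access" not in scope_session("summarize_doc", mask).visible_tools
```

Everything difficult is concentrated in producing policy_mask; that is the role of the Layer 2 policy discussed next.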
What carries the argument
The RL Learning Policy layer that trains a PPO model on audit logs to derive task-specific minimum capability sets.
If this is right
- Agents would receive awareness of and access to only the tools required for their current task type.
- The default 15x overprovision ratio would be replaced by dynamically determined minimal sets.
- High-risk capabilities such as credential access would be removed from sessions where they are not needed.
- The system would refine its restrictions over time as additional audit data is collected without requiring new manual rules.
Where Pith is reading between the lines
- The learned policy could be combined with existing container isolation to create a hybrid containment layer.
- Task clustering on the audit data might allow policies to generalize across similar but not identical jobs (see the sketch after this list).
- Deployment would need separate measurement of both reduced security incidents and maintained task completion rates.
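Picking up the clustering point above: a minimal sketch, in which the embedding source, dimensionality, and cluster count are pure assumptions, of routing a new task to the capability policy of its nearest audit-log cluster:

```python
# Hypothetical sketch: cluster task embeddings derived from audit logs so a
# capability policy learned for one task type can transfer to near neighbors.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
task_embeddings = rng.normal(size=(500, 32))   # stand-in for real embeddings

clusters = KMeans(n_clusters=8, n_init="auto", random_state=0).fit(task_embeddings)

def policy_key(embedding: np.ndarray) -> int:
    """Route a new task to the capability policy of its nearest cluster."""
    return int(clusters.predict(embedding.reshape(1, -1))[0])

print(policy_key(task_embeddings[0]))
```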
Load-bearing premise
Records of past agent sessions contain enough signal to train a policy that selects the smallest necessary set of tools for new tasks without blocking required actions too often.
What would settle it
Applying the trained policy to a set of held-out tasks and checking whether it blocks a tool required for successful completion in more than a small fraction of cases.
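That test reduces to a false-block rate over held-out tasks. A minimal sketch, assuming each held-out task records the tools its successful completion actually required (both data structures are ours, for illustration):

```python
# `heldout` pairs each task with the tools a successful run is known to
# require; `scope` is the policy's per-task allow-list.
def false_block_rate(heldout: list[tuple[str, set[str]]],
                     scope: dict[str, set[str]]) -> float:
    blocked = sum(1 for task, required in heldout
                  if not required <= scope.get(task, set()))
    return blocked / len(heldout)

heldout = [("summarize_doc", {"read_file", "summarize"}),
           ("deploy_code", {"shell_exec", "write_file"})]
scope = {"summarize_doc": {"read_file", "summarize"},
         "deploy_code": {"write_file"}}          # shell_exec missing: a false block
print(false_block_rate(heldout, scope))          # 0.5
```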
Original abstract
Autonomous AI agents built on open-source runtimes such as OpenClaw expose every available tool to every session by default, regardless of the task. A summarization task receives the same shell execution, subagent spawning, and credential access capabilities as a code deployment task, a 15x overprovision ratio that we call the capability overprovisioning problem. Existing defenses, including the NemoClaw container sandbox and the Cisco DefenseClaw skill scanner, address containment and threat detection but do not learn the minimum viable capability set for each task type. We present Aethelgard, a four layer adaptive governance framework that enforces least privilege for AI agents through a learned policy. Layer 1, the Capability Governor, dynamically scopes which tools the agent is aware of in each session. Layer 3, the Safety Router, intercepts tool calls before execution using a hybrid rule based and fine tuned classifier. Layer 2, the RL Learning Policy, trains a PPO policy on the accumulated audit log to learn the minimum viable skill set for each task type.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents Aethelgard, a four-layer adaptive governance framework for autonomous AI agents that aims to solve the capability overprovisioning problem (15x overprovision ratio) in open-source runtimes such as OpenClaw. It proposes a Capability Governor (Layer 1) to dynamically scope tools per session, an RL Learning Policy (Layer 2) that trains a PPO policy on accumulated audit logs to learn the minimum viable skill set for each task type, and a Safety Router (Layer 3) that intercepts tool calls via a hybrid rule-based and fine-tuned classifier. The work positions this learned approach as an advance over static sandboxing (e.g., NemoClaw) and threat detection (e.g., Cisco DefenseClaw).
Significance. If the PPO policy could be shown to reliably extract minimal viable capability sets from audit data while preserving task success, the framework would offer a meaningful improvement in reducing the attack surface for AI agents beyond static containment. The idea of using reinforcement learning on real session logs for dynamic least-privilege enforcement is conceptually promising for the field of AI security. However, the manuscript supplies only an architectural sketch with no empirical results, reward definitions, or validation, so its significance remains prospective rather than demonstrated.
major comments (3)
- Abstract (RL Learning Policy description): The claim that a PPO policy trained on audit logs learns the 'minimum viable skill set' for each task type lacks any specification of state representation, action space, reward function, or training procedure. Audit logs from overprovisioned sessions record observed tool usage rather than ground-truth necessity; without negative examples or a penalty for excess capabilities that still allows task completion, the policy has no basis for distinguishing necessary from superfluous tools, directly undermining the least-privilege enforcement assertion.
- Abstract (framework structure): The paper states that Aethelgard is a 'four layer' framework yet only describes three layers (Capability Governor as Layer 1, Safety Router as Layer 3, RL Learning Policy as Layer 2). The missing fourth layer and the non-sequential numbering prevent assessment of the complete data flow and interactions required for the adaptive governance claim.
- Abstract (Safety Router): The hybrid rule-based and fine-tuned classifier for intercepting tool calls is introduced without any description of training data, evaluation metrics, or how its decisions integrate with the output of the RL policy. This component is load-bearing for the overall enforcement mechanism but remains unspecified.
minor comments (2)
- The abstract introduces Layer 3 before Layer 2, disrupting the logical presentation of the layered architecture and making it harder to follow the intended component ordering.
- No citations to related work on capability-based access control, RL for security policies, or prior AI-agent sandboxing systems are provided, which would help clarify the novelty of the four-layer design.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review. We appreciate the acknowledgment that the core idea of applying reinforcement learning to audit logs for dynamic least-privilege enforcement is conceptually promising. The comments correctly identify that the current manuscript is an architectural sketch lacking key technical specifications. We will incorporate the requested details and clarifications in the revised version. Our point-by-point responses to the major comments are below.
Point-by-point responses
- Referee: Abstract (RL Learning Policy description): The claim that a PPO policy trained on audit logs learns the 'minimum viable skill set' for each task type lacks any specification of state representation, action space, reward function, or training procedure. Audit logs from overprovisioned sessions record observed tool usage rather than ground-truth necessity; without negative examples or a penalty for excess capabilities that still allows task completion, the policy has no basis for distinguishing necessary from superfluous tools, directly undermining the least-privilege enforcement assertion.
Authors: We agree that the manuscript provides only a high-level description and does not specify the RL components. In the revision we will add a dedicated subsection detailing the state representation (task-type embeddings concatenated with a vector of historically observed tool calls), the action space (per-tool binary enable/disable decisions), the reward function (task-success indicator minus a scaled penalty on the count of provisioned but unused capabilities, where usage is verified via post-session audit), and the training procedure (PPO with experience replay drawn from both real overprovisioned logs and augmented datasets containing expert-defined minimal sets). This formulation supplies the negative signal the referee correctly notes is missing from raw audit logs alone. revision: yes
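The promised formulation is concrete enough to prototype. Since the paper's reference list includes Stable-Baselines3, a toy version could look like the following; the environment dynamics, the success oracle, and the penalty scale are our stand-ins, not the authors' specification:

```python
# Toy sketch of the rebuttal's RL formulation. In this simulation, "required"
# ground truth and the success bit are generated; real audit logs may not
# contain either, which is exactly the referee's objection.
import numpy as np
import gymnasium as gym
from gymnasium import spaces
from stable_baselines3 import PPO

N_TOOLS, EMB_DIM, LAM = 7, 16, 0.1

class CapabilityEnv(gym.Env):
    """One decision per session: pick a tool mask, observe success and waste."""
    def __init__(self):
        super().__init__()
        self.observation_space = spaces.Box(-np.inf, np.inf, (EMB_DIM + N_TOOLS,))
        self.action_space = spaces.MultiBinary(N_TOOLS)   # per-tool enable/disable
        self.rng = np.random.default_rng(0)

    def reset(self, *, seed=None, options=None):
        # Observation: task-type embedding plus noisy historical tool-usage rates.
        self.required = self.rng.random(N_TOOLS) < 0.3    # hidden ground truth
        obs = np.concatenate([self.rng.normal(size=EMB_DIM),
                              self.required + self.rng.normal(0, 0.2, N_TOOLS)])
        return obs.astype(np.float32), {}

    def step(self, action):
        granted = np.asarray(action, dtype=bool)
        success = bool(np.all(granted | ~self.required))  # no required tool blocked
        unused = int(np.sum(granted & ~self.required))    # provisioned, never used
        reward = float(success) - LAM * unused            # the rebuttal's reward
        obs, _ = self.reset()
        return obs, reward, True, False, {}

model = PPO("MlpPolicy", CapabilityEnv(), verbose=0).learn(total_timesteps=2_048)
```

Note that the negative signal here comes from the simulated `required` vector; in the authors' plan it would have to come from post-session audits and expert-defined minimal sets.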
- Referee: Abstract (framework structure): The paper states that Aethelgard is a 'four layer' framework yet only describes three layers (Capability Governor as Layer 1, Safety Router as Layer 3, RL Learning Policy as Layer 2). The missing fourth layer and the non-sequential numbering prevent assessment of the complete data flow and interactions required for the adaptive governance claim.
Authors: The referee is correct; the abstract contains an inconsistency. We will revise the abstract and add an explicit framework overview section that presents the four layers in sequential order (Layer 1: Capability Governor, Layer 2: RL Learning Policy, Layer 3: Safety Router, Layer 4: Audit and Feedback Layer). The new section will describe Layer 4's role in collecting execution outcomes and feeding them back for policy updates, together with a data-flow diagram showing all inter-layer interactions. revision: yes
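Taken with the rebuttal's Layer 4, the data flow is a loop rather than a stack. A schematic, runnable sketch with placeholder internals (class and method names are ours; only the layer ordering comes from the text):

```python
# Schematic of the four-layer loop described in the rebuttal.
class RLPolicy:                                   # Layer 2
    def capability_scope(self, task_type): return {"read_file", "summarize"}
    def update(self, audit_log): pass             # PPO update elided

class SafetyRouter:                               # Layer 3
    def permits(self, call, scope): return call in scope

def run_session(task_type, policy, router, audit_log, calls):
    scope = policy.capability_scope(task_type)    # Layer 1: governor scopes tools
    outcome = [(c, router.permits(c, scope)) for c in calls]
    audit_log.append((task_type, outcome))        # Layer 4: audit and feedback
    policy.update(audit_log)                      # feeds Layer 2's next update
    return outcome

log = []
print(run_session("summarize_doc", RLPolicy(), SafetyRouter(), log,
                  ["read_file", "credential_access"]))
# [('read_file', True), ('credential_access', False)]
```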
- Referee: Abstract (Safety Router): The hybrid rule-based and fine-tuned classifier for intercepting tool calls is introduced without any description of training data, evaluation metrics, or how its decisions integrate with the output of the RL policy. This component is load-bearing for the overall enforcement mechanism but remains unspecified.
Authors: We acknowledge the Safety Router description is incomplete. The revised manuscript will specify the training data (labeled benign and adversarial tool-call traces extracted from audit logs plus synthetic attack examples), the evaluation metrics (precision, recall, and false-positive rate on a held-out test set), and the integration logic (the RL policy first produces a per-session capability scope that the router treats as a dynamic allow-list; rule-based checks are applied only to actions outside this scope or flagged as high-risk). revision: yes
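That integration logic is easy to state as a decision cascade; the sketch below assumes the allow-list semantics described in the response, with an illustrative high-risk set and threshold of our own choosing:

```python
# Decision cascade assumed from the response: the RL scope is a dynamic
# allow-list, and the rule layer plus fine-tuned classifier examine only
# out-of-scope or high-risk calls.
HIGH_RISK = {"shell_exec", "credential_access", "spawn_subagent"}

def route(call: str, args: str, scope: set[str], rules, classifier,
          threshold: float = 0.5) -> bool:
    if call in scope and call not in HIGH_RISK:
        return True                               # in-scope, low-risk fast path
    if any(rule(call, args) for rule in rules):
        return False                              # a rule-based check fired
    return classifier(call, args) < threshold     # classifier has the last word

# Example: one substring rule plus a stub risk classifier.
rules = [lambda call, args: "rm -rf" in args]
clf = lambda call, args: 0.9 if "credential" in call else 0.1
print(route("shell_exec", "ls", {"shell_exec"}, rules, clf))        # True
print(route("shell_exec", "rm -rf /", {"shell_exec"}, rules, clf))  # False
```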
Circularity Check
No circularity; high-level system proposal without derivations or fitted quantities.
full rationale
The manuscript proposes the Aethelgard four-layer framework and describes training a PPO policy on audit logs to learn minimum viable capabilities, but supplies no equations, reward functions, state representations, or experimental results that could reduce to self-referential definitions or fitted inputs called predictions. No self-citations, uniqueness theorems, or ansatzes are invoked in the provided text. The central claim is a design assertion whose validity depends on external evaluation of the logs' signal quality rather than any internal reduction to its own inputs by construction. This matches the default expectation of a non-circular system paper.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Audit logs from agent sessions contain sufficient labeled signal to train a policy that distinguishes necessary from unnecessary capabilities for future tasks.
invented entities (1)
- Aethelgard four-layer framework (no independent evidence)
Forward citations
Cited by 1 Pith paper
- Sovereign Agentic Loops: Decoupling AI Reasoning from Execution in Real-World Systems
  Sovereign Agentic Loops decouple LLM reasoning from execution by emitting validated intents through a control plane with obfuscation and evidence chains, blocking 93% of unsafe actions in a cloud prototype while addin...
Reference graph
Works this paper leans on
- [1] arXiv. EduClaw: AgentProfile-based skill scaling for educational agents. arXiv:2603.11709, 2026.
- [2] arXiv. XSkill: Dual-stream continual learning from experiences and skills. arXiv:2603.12056, 2026.
- [3] Cisco AI Defense. DefenseClaw: Open-source security framework for OpenClaw. RSAC 2026, March 27, 2026. URL https://github.com/cisco-ai-defense/defenseclaw.
- [4] CrowdStrike Intelligence. ClawHavoc: Supply chain attack on OpenClaw skill registry. RSAC 2026 Keynote, March 2026. (The paper's primary case study.)
- [5] EmergentMind. Curriculum-based agentic training with adaptive scheduling. EmergentMind, 2026.
- [6] Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you've signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection. arXiv preprint arXiv:2302.12173, 2023.
- [7] Wei Li et al. Adaptive network security policy via reinforcement learning. IEEE S&P, 2024.
- [8] NVD. CVE-2026-25253: Remote code execution in OpenClaw. https://nvd.nist.gov/vuln/detail/CVE-2026-25253, 2026.
- [9] NVIDIA. NemoClaw: Secure agent runtime with OpenShell. NVIDIA GTC 2026, March 16, 2026.
- [10] Fábio Perez and Ian Ribeiro. Ignore previous prompt: Attack techniques for language models. arXiv preprint arXiv:2211.09527, 2022.
- [11] Yujia Qin et al. ToolLLM: Facilitating large language models to master 16000+ real-world APIs. arXiv preprint arXiv:2307.16789, 2023. URL https://arxiv.org/abs/2307.16789.
- [12] Antonin Raffin, Ashley Hill, Adam Gleave, Anssi Kanervisto, Maximilian Ernestus, and Noah Dormann. Stable-Baselines3: Reliable reinforcement learning implementations. Journal of Machine Learning Research, 22(268):1–8, 2021.
- [13] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- [14] VentureBeat. OpenClaw has 500,000 instances and no enterprise kill switch. VentureBeat, March 31, 2026.