Beyond Static Sandboxing: Learned Capability Governance for Autonomous AI Agents
Pith reviewed 2026-05-10 15:42 UTC · model grok-4.3
The pith
A governance layer can learn the minimum viable set of tools for each task from audit logs, instead of giving agents blanket access to all capabilities.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that a reinforcement learning policy trained on accumulated audit logs can reliably determine the minimum viable capability set for each task type, which is then enforced by scoping tool visibility before a session begins and by intercepting calls during execution.
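The paper supplies no implementation, but the enforcement half of this claim is simple to sketch. Below is a minimal, hypothetical illustration, with tool names and data structures of our own invention, of how a learned per-task mask could scope tool visibility before a session starts:

```python
# Hypothetical sketch: tool names, types, and the mask format are ours, not
# Aethelgard's. The learned policy is reduced here to a static mapping from
# task type to allowed tools.
from dataclasses import dataclass

ALL_TOOLS = ["read_file", "write_file", "shell_exec", "spawn_subagent",
             "credential_access", "http_get", "summarize"]

@dataclass
class Session:
    task_type: str
    visible_tools: list[str]

def scope_session(task_type: str, policy_mask: dict[str, set[str]]) -> Session:
    """Expose only the tools the learned policy selected for this task type."""
    # Unseen task types fall back to the permissive default the paper criticizes.
    allowed = policy_mask.get(task_type, set(ALL_TOOLS))
    return Session(task_type, [t for t in ALL_TOOLS if t in allowed])

# A trained policy might map a summarization task to two tools instead of seven.
mask = {"summarize_doc": {"read_file", "summarize"}}
assert "credential_access" not in scope_session("summarize_doc", mask).visible_tools
```

Everything difficult is concentrated in producing policy_mask; that is the role of the Layer 2 policy discussed next.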
What carries the argument
The RL Learning Policy layer that trains a PPO model on audit logs to derive task-specific minimum capability sets.
If this is right
- Agents would receive awareness of and access to only the tools required for their current task type.
- The default 15x overprovision ratio would be replaced by dynamically determined minimal sets.
- High-risk capabilities such as credential access would be removed from sessions where they are not needed.
- The system would refine its restrictions over time as additional audit data is collected without requiring new manual rules.
Where Pith is reading between the lines
- The learned policy could be combined with existing container isolation to create a hybrid containment layer.
- Task clustering on the audit data might allow policies to generalize across similar but not identical jobs (see the sketch after this list).
- Deployment would need separate measurement of both reduced security incidents and maintained task completion rates.
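Picking up the clustering point above: a minimal sketch, in which the embedding source, dimensionality, and cluster count are pure assumptions, of routing a new task to the capability policy of its nearest audit-log cluster:

```python
# Hypothetical sketch: cluster task embeddings derived from audit logs so a
# capability policy learned for one task type can transfer to near neighbors.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
task_embeddings = rng.normal(size=(500, 32))   # stand-in for real embeddings

clusters = KMeans(n_clusters=8, n_init="auto", random_state=0).fit(task_embeddings)

def policy_key(embedding: np.ndarray) -> int:
    """Route a new task to the capability policy of its nearest cluster."""
    return int(clusters.predict(embedding.reshape(1, -1))[0])

print(policy_key(task_embeddings[0]))
```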
Load-bearing premise
Records of past agent sessions contain enough signal to train a policy that selects the smallest necessary set of tools for new tasks without blocking required actions too often.
What would settle it
Applying the trained policy to a set of held-out tasks and checking whether it blocks a tool required for successful completion in more than a small fraction of cases.
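That test reduces to a false-block rate over held-out tasks. A minimal sketch, assuming each held-out task records the tools its successful completion actually required (both data structures are ours, for illustration):

```python
# `heldout` pairs each task with the tools a successful run is known to
# require; `scope` is the policy's per-task allow-list.
def false_block_rate(heldout: list[tuple[str, set[str]]],
                     scope: dict[str, set[str]]) -> float:
    blocked = sum(1 for task, required in heldout
                  if not required <= scope.get(task, set()))
    return blocked / len(heldout)

heldout = [("summarize_doc", {"read_file", "summarize"}),
           ("deploy_code", {"shell_exec", "write_file"})]
scope = {"summarize_doc": {"read_file", "summarize"},
         "deploy_code": {"write_file"}}          # shell_exec missing: a false block
print(false_block_rate(heldout, scope))          # 0.5
```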
Original abstract
Autonomous AI agents built on open-source runtimes such as OpenClaw expose every available tool to every session by default, regardless of the task. A summarization task receives the same shell execution, subagent spawning, and credential access capabilities as a code deployment task, a 15x overprovision ratio that we call the capability overprovisioning problem. Existing defenses, including the NemoClaw container sandbox and the Cisco DefenseClaw skill scanner, address containment and threat detection but do not learn the minimum viable capability set for each task type. We present Aethelgard, a four layer adaptive governance framework that enforces least privilege for AI agents through a learned policy. Layer 1, the Capability Governor, dynamically scopes which tools the agent is aware of in each session. Layer 3, the Safety Router, intercepts tool calls before execution using a hybrid rule based and fine tuned classifier. Layer 2, the RL Learning Policy, trains a PPO policy on the accumulated audit log to learn the minimum viable skill set for each task type.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents Aethelgard, a four-layer adaptive governance framework for autonomous AI agents that aims to solve the capability overprovisioning problem (15x overprovision ratio) in open-source runtimes such as OpenClaw. It proposes a Capability Governor (Layer 1) to dynamically scope tools per session, an RL Learning Policy (Layer 2) that trains a PPO policy on accumulated audit logs to learn the minimum viable skill set for each task type, and a Safety Router (Layer 3) that intercepts tool calls via a hybrid rule-based and fine-tuned classifier. The work positions this learned approach as an advance over static sandboxing (e.g., NemoClaw) and threat detection (e.g., Cisco DefenseClaw).
Significance. If the PPO policy could be shown to reliably extract minimal viable capability sets from audit data while preserving task success, the framework would offer a meaningful improvement in reducing the attack surface for AI agents beyond static containment. The idea of using reinforcement learning on real session logs for dynamic least-privilege enforcement is conceptually promising for the field of AI security. However, the manuscript supplies only an architectural sketch with no empirical results, reward definitions, or validation, so its significance remains prospective rather than demonstrated.
major comments (3)
- Abstract (RL Learning Policy description): The claim that a PPO policy trained on audit logs learns the 'minimum viable skill set' for each task type lacks any specification of state representation, action space, reward function, or training procedure. Audit logs from overprovisioned sessions record observed tool usage rather than ground-truth necessity; without negative examples or a penalty for excess capabilities that still allows task completion, the policy has no basis for distinguishing necessary from superfluous tools, directly undermining the least-privilege enforcement assertion.
- Abstract (framework structure): The paper states that Aethelgard is a 'four layer' framework yet only describes three layers (Capability Governor as Layer 1, Safety Router as Layer 3, RL Learning Policy as Layer 2). The missing fourth layer and the non-sequential numbering prevent assessment of the complete data flow and interactions required for the adaptive governance claim.
- Abstract (Safety Router): The hybrid rule-based and fine-tuned classifier for intercepting tool calls is introduced without any description of training data, evaluation metrics, or how its decisions integrate with the output of the RL policy. This component is load-bearing for the overall enforcement mechanism but remains unspecified.
minor comments (2)
- The abstract introduces Layer 3 before Layer 2, disrupting the logical presentation of the layered architecture and making it harder to follow the intended component ordering.
- No citations to related work on capability-based access control, RL for security policies, or prior AI-agent sandboxing systems are provided, which would help clarify the novelty of the four-layer design.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review. We appreciate the acknowledgment that the core idea of applying reinforcement learning to audit logs for dynamic least-privilege enforcement is conceptually promising. The comments correctly identify that the current manuscript is an architectural sketch lacking key technical specifications. We will incorporate the requested details and clarifications in the revised version. Our point-by-point responses to the major comments are below.
Point-by-point responses
- Referee: Abstract (RL Learning Policy description): The claim that a PPO policy trained on audit logs learns the 'minimum viable skill set' for each task type lacks any specification of state representation, action space, reward function, or training procedure. Audit logs from overprovisioned sessions record observed tool usage rather than ground-truth necessity; without negative examples or a penalty for excess capabilities that still allows task completion, the policy has no basis for distinguishing necessary from superfluous tools, directly undermining the least-privilege enforcement assertion.
Authors: We agree that the manuscript provides only a high-level description and does not specify the RL components. In the revision we will add a dedicated subsection detailing the state representation (task-type embeddings concatenated with a vector of historically observed tool calls), the action space (per-tool binary enable/disable decisions), the reward function (task-success indicator minus a scaled penalty on the count of provisioned but unused capabilities, where usage is verified via post-session audit), and the training procedure (PPO with experience replay drawn from both real overprovisioned logs and augmented datasets containing expert-defined minimal sets). This formulation supplies the negative signal the referee correctly notes is missing from raw audit logs alone. revision: yes
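The promised formulation is concrete enough to prototype. Since the paper's reference list includes Stable-Baselines3, a toy version could look like the following; the environment dynamics, the success oracle, and the penalty scale are our stand-ins, not the authors' specification:

```python
# Toy sketch of the rebuttal's RL formulation. In this simulation, "required"
# ground truth and the success bit are generated; real audit logs may not
# contain either, which is exactly the referee's objection.
import numpy as np
import gymnasium as gym
from gymnasium import spaces
from stable_baselines3 import PPO

N_TOOLS, EMB_DIM, LAM = 7, 16, 0.1

class CapabilityEnv(gym.Env):
    """One decision per session: pick a tool mask, observe success and waste."""
    def __init__(self):
        super().__init__()
        self.observation_space = spaces.Box(-np.inf, np.inf, (EMB_DIM + N_TOOLS,))
        self.action_space = spaces.MultiBinary(N_TOOLS)   # per-tool enable/disable
        self.rng = np.random.default_rng(0)

    def reset(self, *, seed=None, options=None):
        # Observation: task-type embedding plus noisy historical tool-usage rates.
        self.required = self.rng.random(N_TOOLS) < 0.3    # hidden ground truth
        obs = np.concatenate([self.rng.normal(size=EMB_DIM),
                              self.required + self.rng.normal(0, 0.2, N_TOOLS)])
        return obs.astype(np.float32), {}

    def step(self, action):
        granted = np.asarray(action, dtype=bool)
        success = bool(np.all(granted | ~self.required))  # no required tool blocked
        unused = int(np.sum(granted & ~self.required))    # provisioned, never used
        reward = float(success) - LAM * unused            # the rebuttal's reward
        obs, _ = self.reset()
        return obs, reward, True, False, {}

model = PPO("MlpPolicy", CapabilityEnv(), verbose=0).learn(total_timesteps=2_048)
```

Note that the negative signal here comes from the simulated `required` vector; in the authors' plan it would have to come from post-session audits and expert-defined minimal sets.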
- Referee: Abstract (framework structure): The paper states that Aethelgard is a 'four layer' framework yet only describes three layers (Capability Governor as Layer 1, Safety Router as Layer 3, RL Learning Policy as Layer 2). The missing fourth layer and the non-sequential numbering prevent assessment of the complete data flow and interactions required for the adaptive governance claim.
Authors: The referee is correct; the abstract contains an inconsistency. We will revise the abstract and add an explicit framework overview section that presents the four layers in sequential order (Layer 1: Capability Governor, Layer 2: RL Learning Policy, Layer 3: Safety Router, Layer 4: Audit and Feedback Layer). The new section will describe Layer 4's role in collecting execution outcomes and feeding them back for policy updates, together with a data-flow diagram showing all inter-layer interactions. revision: yes
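Taken with the rebuttal's Layer 4, the data flow is a loop rather than a stack. A schematic, runnable sketch with placeholder internals (class and method names are ours; only the layer ordering comes from the text):

```python
# Schematic of the four-layer loop described in the rebuttal.
class RLPolicy:                                   # Layer 2
    def capability_scope(self, task_type): return {"read_file", "summarize"}
    def update(self, audit_log): pass             # PPO update elided

class SafetyRouter:                               # Layer 3
    def permits(self, call, scope): return call in scope

def run_session(task_type, policy, router, audit_log, calls):
    scope = policy.capability_scope(task_type)    # Layer 1: governor scopes tools
    outcome = [(c, router.permits(c, scope)) for c in calls]
    audit_log.append((task_type, outcome))        # Layer 4: audit and feedback
    policy.update(audit_log)                      # feeds Layer 2's next update
    return outcome

log = []
print(run_session("summarize_doc", RLPolicy(), SafetyRouter(), log,
                  ["read_file", "credential_access"]))
# [('read_file', True), ('credential_access', False)]
```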
- Referee: Abstract (Safety Router): The hybrid rule-based and fine-tuned classifier for intercepting tool calls is introduced without any description of training data, evaluation metrics, or how its decisions integrate with the output of the RL policy. This component is load-bearing for the overall enforcement mechanism but remains unspecified.
Authors: We acknowledge the Safety Router description is incomplete. The revised manuscript will specify the training data (labeled benign and adversarial tool-call traces extracted from audit logs plus synthetic attack examples), the evaluation metrics (precision, recall, and false-positive rate on a held-out test set), and the integration logic (the RL policy first produces a per-session capability scope that the router treats as a dynamic allow-list; rule-based checks are applied only to actions outside this scope or flagged as high-risk). revision: yes
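That integration logic is easy to state as a decision cascade; the sketch below assumes the allow-list semantics described in the response, with an illustrative high-risk set and threshold of our own choosing:

```python
# Decision cascade assumed from the response: the RL scope is a dynamic
# allow-list, and the rule layer plus fine-tuned classifier examine only
# out-of-scope or high-risk calls.
HIGH_RISK = {"shell_exec", "credential_access", "spawn_subagent"}

def route(call: str, args: str, scope: set[str], rules, classifier,
          threshold: float = 0.5) -> bool:
    if call in scope and call not in HIGH_RISK:
        return True                               # in-scope, low-risk fast path
    if any(rule(call, args) for rule in rules):
        return False                              # a rule-based check fired
    return classifier(call, args) < threshold     # classifier has the last word

# Example: one substring rule plus a stub risk classifier.
rules = [lambda call, args: "rm -rf" in args]
clf = lambda call, args: 0.9 if "credential" in call else 0.1
print(route("shell_exec", "ls", {"shell_exec"}, rules, clf))        # True
print(route("shell_exec", "rm -rf /", {"shell_exec"}, rules, clf))  # False
```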
Circularity Check
No circularity; high-level system proposal without derivations or fitted quantities.
full rationale
The manuscript proposes the Aethelgard four-layer framework and describes training a PPO policy on audit logs to learn minimum viable capabilities, but supplies no equations, reward functions, state representations, or experimental results that could reduce to self-referential definitions or fitted inputs called predictions. No self-citations, uniqueness theorems, or ansatzes are invoked in the provided text. The central claim is a design assertion whose validity depends on external evaluation of the logs' signal quality rather than any internal reduction to its own inputs by construction. This matches the default expectation of a non-circular system paper.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Audit logs from agent sessions contain sufficient labeled signal to train a policy that distinguishes necessary from unnecessary capabilities for future tasks.
invented entities (1)
- Aethelgard four-layer framework (no independent evidence)
Forward citations
Cited by 1 Pith paper
- Sovereign Agentic Loops: Decoupling AI Reasoning from Execution in Real-World Systems
  Sovereign Agentic Loops decouple LLM reasoning from execution by emitting validated intents through a control plane with obfuscation and evidence chains, blocking 93% of unsafe actions in a cloud prototype while addin...
Reference graph
Works this paper leans on
- [1] arXiv. EduClaw: AgentProfile-based skill scaling for educational agents. arXiv:2603.11709, 2026.
- [2] arXiv. XSkill: Dual-stream continual learning from experiences and skills. arXiv:2603.12056, 2026.
- [3] Cisco AI Defense. DefenseClaw: Open-source security framework for OpenClaw. RSAC 2026, March 27, 2026. URL https://github.com/cisco-ai-defense/defenseclaw.
- [4] CrowdStrike Intelligence. ClawHavoc: Supply chain attack on OpenClaw skill registry. RSAC 2026 Keynote, March 2026. (The paper's primary case study.)
- [5] EmergentMind. Curriculum-based agentic training with adaptive scheduling. EmergentMind, 2026.
- [6] Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you've signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection. arXiv preprint arXiv:2302.12173, 2023.
- [7] Wei Li et al. Adaptive network security policy via reinforcement learning. IEEE S&P, 2024.
- [8] NVD. CVE-2026-25253: Remote code execution in OpenClaw. https://nvd.nist.gov/vuln/detail/CVE-2026-25253, 2026.
- [9] NVIDIA. NemoClaw: Secure agent runtime with OpenShell. NVIDIA GTC 2026, March 16, 2026.
- [10] Fábio Perez and Ian Ribeiro. Ignore previous prompt: Attack techniques for language models. arXiv preprint arXiv:2211.09527, 2022.
- [11] Yujia Qin et al. ToolLLM: Facilitating large language models to master 16000+ real-world APIs. arXiv preprint arXiv:2307.16789, 2023. URL https://arxiv.org/abs/2307.16789.
- [12] Antonin Raffin, Ashley Hill, Adam Gleave, Anssi Kanervisto, Maximilian Ernestus, and Noah Dormann. Stable-Baselines3: Reliable reinforcement learning implementations. Journal of Machine Learning Research, 22(268):1–8, 2021.
- [13] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- [14] VentureBeat. OpenClaw has 500,000 instances and no enterprise kill switch. VentureBeat, March 31, 2026.