Policy-Invisible Violations in LLM-Based Agents
Pith reviewed 2026-05-10 16:14 UTC · model grok-4.3
The pith
LLM-based agents can commit policy violations that remain invisible because the necessary facts are missing from their context.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is that policy-invisible violations occur when an agent's visible context lacks the entity attributes, contextual state, or session history required for correct policy judgment. The PhantomPolicy benchmark spans eight violation categories with balanced violation and safe-control cases, and all tool responses carry clean business data without policy metadata. Sentinel enforces policy by counterfactual graph simulation: each agent action is treated as a proposed mutation to an organizational knowledge graph, the post-action state is materialized by speculative execution, and graph-structural invariants determine an Allow, Block, or Clarify decision. Human review of all 600 traces corrected 5.3 percent of labels, and Sentinel substantially outperforms a content-only DLP baseline (93.0% vs. 68.8% accuracy).
What carries the argument
Sentinel's counterfactual graph simulation on an organizational knowledge graph, which materializes post-action world states and verifies structural invariants to make enforcement decisions.
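To make the mechanism concrete, here is a minimal sketch of the counterfactual-simulation loop in Python, assuming a networkx organizational graph, a hypothetical INVARIANTS registry of predicate functions, and a toy confidentiality rule; the paper's actual data model, invariant language, and Clarify criteria are not specified here, so every name below is illustrative.

```python
# Minimal sketch of counterfactual graph enforcement (illustrative, not the
# paper's implementation). Invariants are predicates over the materialized
# post-action graph; None signals that a policy-relevant fact is missing.
from enum import Enum

import networkx as nx


class Decision(Enum):
    ALLOW = "allow"
    BLOCK = "block"
    CLARIFY = "clarify"


def no_external_share_of_confidential(g: nx.DiGraph) -> bool | None:
    """Toy invariant: confidential documents must not reach external parties."""
    for doc, recipient in g.edges():
        if g.nodes[doc].get("kind") != "document":
            continue
        classification = g.nodes[doc].get("classification")
        if classification is None:
            return None  # deciding fact absent from world state
        if classification == "confidential" and g.nodes[recipient].get("external"):
            return False
    return True


INVARIANTS = [no_external_share_of_confidential]  # hypothetical registry


def enforce(world: nx.DiGraph, proposed_edge: tuple[str, str]) -> Decision:
    """Treat the action as a proposed mutation, simulate, then verify."""
    post_state = world.copy()            # speculative copy of the world
    post_state.add_edge(*proposed_edge)  # materialize the action's effect
    results = [check(post_state) for check in INVARIANTS]
    if any(r is False for r in results):
        return Decision.BLOCK
    if any(r is None for r in results):
        return Decision.CLARIFY
    return Decision.ALLOW
```

The load-bearing design choice is that enforce judges the simulated post-action state rather than the action's visible payload, which is what lets it catch violations the agent's own context cannot reveal.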
If this is right
- Trace-level human review is required, as it alters 5.3 percent of case-level annotations.
- Content-only inspection misses violations that surface only after world-state changes are simulated (see the sketch after this list).
- Graph-based enforcement can deliver high accuracy and precision on most violation categories.
- Some violation categories remain challenging even with full world-state access.
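To illustrate the second point above, here is a toy contrast between a content-only DLP scan and a post-state check, under the benchmark's assumption that tool payloads carry clean business data while the deciding attributes live in hidden world state; PAYLOAD, WORLD, and the pattern list are all invented for illustration.

```python
# Toy contrast (illustrative only): a content-only scan of the action payload
# versus a check on the simulated post-action state. The payload carries clean
# business data, so keyword-style DLP sees nothing to flag.
PAYLOAD = {"tool": "send_file", "doc": "Q3_forecast.xlsx", "to": "pat@partner.example"}

DLP_PATTERNS = ("confidential", "secret", "ssn")  # hypothetical rule set


def content_only_dlp(payload: dict) -> bool:
    """Flags only what is literally present in the visible payload."""
    text = " ".join(str(v) for v in payload.values()).lower()
    return any(p in text for p in DLP_PATTERNS)


# Hidden world state the agent never sees: the deciding facts live here.
WORLD = {
    "Q3_forecast.xlsx": {"classification": "confidential"},
    "pat@partner.example": {"external": True},
}


def post_state_check(payload: dict) -> bool:
    """Flags the violation once hidden attributes are consulted."""
    doc = WORLD.get(payload["doc"], {})
    recipient = WORLD.get(payload["to"], {})
    return doc.get("classification") == "confidential" and recipient.get("external", False)


assert content_only_dlp(PAYLOAD) is False  # DLP passes the action
assert post_state_check(PAYLOAD) is True   # simulation reveals the breach
```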
Where Pith is reading between the lines
- If organizations maintain accurate knowledge graphs, this enforcement could be integrated into agent runtimes to prevent policy breaches in real time.
- The approach suggests that separating the policy world model from the agent's operating context is necessary for reliable compliance.
- Similar simulation methods could help with other hidden-state problems in agent systems, such as privacy leaks or safety constraints.
- Policy encoding would shift from natural-language rules to explicit graph invariants that can be checked mechanically.
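A minimal sketch of what such mechanical policy encoding might look like, assuming a hypothetical rule format of forbidden edge patterns over node attributes; nothing here mirrors Sentinel's actual invariant language.

```python
# Hypothetical declarative encoding of policies as graph invariants: each rule
# names a forbidden edge pattern over node attributes, so compliance becomes a
# mechanical pattern check rather than natural-language interpretation.
from dataclasses import dataclass


@dataclass
class ForbiddenEdge:
    rule_id: str
    src_attrs: dict  # attributes the source node must match
    dst_attrs: dict  # attributes the destination node must match


POLICY = [
    ForbiddenEdge("P-12", {"classification": "confidential"}, {"external": True}),
    ForbiddenEdge("P-31", {"kind": "credential"}, {"kind": "chat_channel"}),
]


def violations(edges, nodes, policy=POLICY):
    """Return IDs of rules violated by any edge in the post-action graph."""
    def matches(attrs, required):
        return all(attrs.get(k) == v for k, v in required.items())

    return [
        rule.rule_id
        for rule in policy
        for src, dst in edges
        if matches(nodes[src], rule.src_attrs) and matches(nodes[dst], rule.dst_attrs)
    ]
```

Checking then reduces to pattern matching over the materialized post-action graph, which is what makes the compliance decision mechanical rather than interpretive.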
Load-bearing premise
That all policy-relevant facts can be captured in an organizational knowledge graph whose structural invariants suffice to decide Allow/Block/Clarify for every violation category, and that the graph can be kept accurate enough for speculative execution.
What would settle it
A collection of policy violations where the deciding facts cannot be encoded as graph structure or invariants, causing the simulation to approve actions that should be blocked.
Original abstract
LLM-based agents can execute actions that are syntactically valid, user-sanctioned, and semantically appropriate, yet still violate organizational policy because the facts needed for correct policy judgment are hidden at decision time. We call this failure mode policy-invisible violations: cases in which compliance depends on entity attributes, contextual state, or session history absent from the agent's visible context. We present PhantomPolicy, a benchmark spanning eight violation categories with balanced violation and safe-control cases, in which all tool responses contain clean business data without policy metadata. We manually review all 600 model traces produced by five frontier models and evaluate them using human-reviewed trace labels. Manual review changes 32 labels (5.3%) relative to the original case-level annotations, confirming the need for trace-level human review. To demonstrate what world-state-grounded enforcement can achieve under favorable conditions, we introduce Sentinel, an enforcement framework based on counterfactual graph simulation. Sentinel treats every agent action as a proposed mutation to an organizational knowledge graph, performs speculative execution to materialize the post-action world state, and verifies graph-structural invariants to decide Allow/Block/Clarify. Against human-reviewed trace labels, Sentinel substantially outperforms a content-only DLP baseline (93.0% vs. 68.8% accuracy) while maintaining high precision, though it still leaves room for improvement on certain violation categories. These results demonstrate what becomes achievable once policy-relevant world state is made available to the enforcement layer.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper defines policy-invisible violations as cases where LLM agents produce syntactically valid, user-approved actions that nonetheless breach organizational policy because required facts (entity attributes, state, or history) are absent from the agent's context. It introduces the PhantomPolicy benchmark (eight violation categories, balanced violation/safe cases, 600 traces from five frontier models with all tool responses stripped of policy metadata), reports a manual review of all traces that altered 32 labels (5.3%), and presents Sentinel, an enforcement layer that performs counterfactual simulation on an organizational knowledge graph, materializes post-action states, and checks structural invariants to output Allow/Block/Clarify decisions. On the human-reviewed labels Sentinel reaches 93.0% accuracy versus 68.8% for a content-only DLP baseline while preserving high precision.
Significance. If the empirical comparison holds, the work is significant for surfacing a concrete, previously under-studied failure mode in agentic systems and for showing that world-state grounding via an organizational knowledge graph can produce a large, measurable lift in policy compliance. The transparent reporting of the 5.3% label change after trace-level review and the direct baseline comparison are positive features that strengthen the central claim.
major comments (2)
- [§5] §5 (Evaluation and Results): The central accuracy figures (93.0% for Sentinel, 68.8% for the DLP baseline) rest on human-reviewed trace labels, yet the manuscript provides no information on how the 600 traces were sampled, what instructions were given to reviewers, or whether inter-annotator agreement was measured. Because 32 labels were altered during review, these missing details are load-bearing for assessing the reliability of the reported performance gap.
- [§4.2] §4.2 (Sentinel enforcement layer): The claim that graph-structural invariants suffice to decide Allow/Block/Clarify for every violation category assumes a complete, accurate, and up-to-date organizational knowledge graph whose invariants capture all policy-relevant facts. While the paper scopes the result to “favorable conditions,” it does not quantify sensitivity to missing nodes, stale attributes, or incomplete invariant definitions, which directly affects whether the 93.0% figure generalizes beyond the benchmark.
minor comments (1)
- [Abstract] Abstract and §3: The five frontier models used to generate the 600 traces are not named; adding the model identifiers would improve reproducibility without altering the central claims.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive evaluation of the paper's significance. We address each major comment below and will revise the manuscript to improve clarity and address the raised concerns.
Point-by-point responses
- Referee: [§5] §5 (Evaluation and Results): The central accuracy figures (93.0% for Sentinel, 68.8% for the DLP baseline) rest on human-reviewed trace labels, yet the manuscript provides no information on how the 600 traces were sampled, what instructions were given to reviewers, or whether inter-annotator agreement was measured. Because 32 labels were altered during review, these missing details are load-bearing for assessing the reliability of the reported performance gap.
Authors: We agree that these details are necessary to evaluate label reliability. The manuscript reports that all 600 traces were manually reviewed with 32 labels (5.3%) altered relative to initial case-level annotations, but does not describe the sampling or review protocol. In the revised manuscript, we will add a subsection in §5 detailing the stratified sampling across the eight violation categories and five models, the review guidelines (which instructed reviewers to assess policy compliance using the full hidden context), and clarification that the review was performed collaboratively by the authors with consensus on ambiguous cases. We will also note that inter-annotator agreement was not computed due to the collaborative nature of the review. These additions will allow better assessment of the performance gap. revision: yes
- Referee: [§4.2] §4.2 (Sentinel enforcement layer): The claim that graph-structural invariants suffice to decide Allow/Block/Clarify for every violation category assumes a complete, accurate, and up-to-date organizational knowledge graph whose invariants capture all policy-relevant facts. While the paper scopes the result to “favorable conditions,” it does not quantify sensitivity to missing nodes, stale attributes, or incomplete invariant definitions, which directly affects whether the 93.0% figure generalizes beyond the benchmark.
Authors: The manuscript explicitly scopes the Sentinel results to favorable conditions with a complete and accurate organizational knowledge graph, as constructed in the PhantomPolicy benchmark, and does not claim broader generalization. We agree that a sensitivity analysis would strengthen the presentation. In the revision, we will add a discussion paragraph in §4.2 and/or an appendix that qualitatively addresses potential impacts of missing nodes, stale attributes, and incomplete invariants, along with suggestions for future empirical sensitivity tests. This maintains the core contribution while providing better context on limitations. revision: partial
Circularity Check
No significant circularity in empirical evaluation chain
Full rationale
The paper presents an empirical benchmark (PhantomPolicy) with 600 manually reviewed traces across eight violation categories, followed by direct accuracy comparison of Sentinel (93.0%) against a content-only DLP baseline (68.8%) on human-reviewed labels. Manual review changes only 5.3% of labels and is reported transparently. Sentinel is introduced as a counterfactual graph-simulation enforcement layer whose performance is measured against independent human annotations rather than derived from any internal fit, self-definition, or self-citation. No equations appear in the provided text, no self-citations are load-bearing for the accuracy claims, and the reported improvement is conditioned explicitly on the favorable assumption of a complete knowledge graph without reducing the measurement itself to that assumption. The derivation chain is therefore self-contained against external human benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Organizational policies can be expressed as invariants on a knowledge-graph representation of world state.
invented entities (3)
- policy-invisible violations: no independent evidence
- PhantomPolicy: no independent evidence
- Sentinel: no independent evidence