Policy-Invisible Violations in LLM-Based Agents
Pith reviewed 2026-05-10 16:14 UTC · model grok-4.3
The pith
LLM-based agents can commit policy violations that remain invisible because the necessary facts are missing from their context.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is that policy-invisible violations occur when an agent's visible context lacks the entity attributes, contextual state, or session history required for correct policy judgment. The PhantomPolicy benchmark spans eight violation categories with balanced violation and safe-control cases, and all tool responses carry clean business data without policy metadata. Sentinel enforces policy by counterfactual graph simulation: each agent action is treated as a proposed mutation to an organizational knowledge graph, the post-action state is materialized by speculative execution, and graph-structural invariants determine an Allow, Block, or Clarify decision. Human review of all 600 traces corrected 5.3 percent of labels, and Sentinel substantially outperforms a content-only DLP baseline (93.0% vs. 68.8% accuracy).
What carries the argument
Sentinel's counterfactual graph simulation on an organizational knowledge graph, which materializes post-action world states and verifies structural invariants to make enforcement decisions.
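To make the mechanism concrete, here is a minimal sketch of the counterfactual-simulation loop in Python, assuming a networkx organizational graph, a hypothetical INVARIANTS registry of predicate functions, and a toy confidentiality rule; the paper's actual data model, invariant language, and Clarify criteria are not specified here, so every name below is illustrative.

```python
# Minimal sketch of counterfactual graph enforcement (illustrative, not the
# paper's implementation). Invariants are predicates over the materialized
# post-action graph; None signals that a policy-relevant fact is missing.
from enum import Enum

import networkx as nx


class Decision(Enum):
    ALLOW = "allow"
    BLOCK = "block"
    CLARIFY = "clarify"


def no_external_share_of_confidential(g: nx.DiGraph) -> bool | None:
    """Toy invariant: confidential documents must not reach external parties."""
    for doc, recipient in g.edges():
        if g.nodes[doc].get("kind") != "document":
            continue
        classification = g.nodes[doc].get("classification")
        if classification is None:
            return None  # deciding fact absent from world state
        if classification == "confidential" and g.nodes[recipient].get("external"):
            return False
    return True


INVARIANTS = [no_external_share_of_confidential]  # hypothetical registry


def enforce(world: nx.DiGraph, proposed_edge: tuple[str, str]) -> Decision:
    """Treat the action as a proposed mutation, simulate, then verify."""
    post_state = world.copy()            # speculative copy of the world
    post_state.add_edge(*proposed_edge)  # materialize the action's effect
    results = [check(post_state) for check in INVARIANTS]
    if any(r is False for r in results):
        return Decision.BLOCK
    if any(r is None for r in results):
        return Decision.CLARIFY
    return Decision.ALLOW
```

The load-bearing design choice is that enforce judges the simulated post-action state rather than the action's visible payload, which is what lets it catch violations the agent's own context cannot reveal.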
If this is right
- Trace-level human review is required, as it alters 5.3 percent of case-level annotations.
- Content-only inspection misses violations that surface only after world-state changes are simulated (see the sketch after this list).
- Graph-based enforcement can deliver high accuracy and precision on most violation categories.
- Some violation categories remain challenging even with full world-state access.
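To illustrate the second point above, here is a toy contrast between a content-only DLP scan and a post-state check, under the benchmark's assumption that tool payloads carry clean business data while the deciding attributes live in hidden world state; PAYLOAD, WORLD, and the pattern list are all invented for illustration.

```python
# Toy contrast (illustrative only): a content-only scan of the action payload
# versus a check on the simulated post-action state. The payload carries clean
# business data, so keyword-style DLP sees nothing to flag.
PAYLOAD = {"tool": "send_file", "doc": "Q3_forecast.xlsx", "to": "pat@partner.example"}

DLP_PATTERNS = ("confidential", "secret", "ssn")  # hypothetical rule set


def content_only_dlp(payload: dict) -> bool:
    """Flags only what is literally present in the visible payload."""
    text = " ".join(str(v) for v in payload.values()).lower()
    return any(p in text for p in DLP_PATTERNS)


# Hidden world state the agent never sees: the deciding facts live here.
WORLD = {
    "Q3_forecast.xlsx": {"classification": "confidential"},
    "pat@partner.example": {"external": True},
}


def post_state_check(payload: dict) -> bool:
    """Flags the violation once hidden attributes are consulted."""
    doc = WORLD.get(payload["doc"], {})
    recipient = WORLD.get(payload["to"], {})
    return doc.get("classification") == "confidential" and recipient.get("external", False)


assert content_only_dlp(PAYLOAD) is False  # DLP passes the action
assert post_state_check(PAYLOAD) is True   # simulation reveals the breach
```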
Where Pith is reading between the lines
- If organizations maintain accurate knowledge graphs, this enforcement could be integrated into agent runtimes to prevent policy breaches in real time.
- The approach suggests that separating the policy world model from the agent's operating context is necessary for reliable compliance.
- Similar simulation methods could help with other hidden-state problems in agent systems, such as privacy leaks or safety constraints.
- Policy encoding would shift from natural-language rules to explicit graph invariants that can be checked mechanically.
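A minimal sketch of what such mechanical policy encoding might look like, assuming a hypothetical rule format of forbidden edge patterns over node attributes; nothing here mirrors Sentinel's actual invariant language.

```python
# Hypothetical declarative encoding of policies as graph invariants: each rule
# names a forbidden edge pattern over node attributes, so compliance becomes a
# mechanical pattern check rather than natural-language interpretation.
from dataclasses import dataclass


@dataclass
class ForbiddenEdge:
    rule_id: str
    src_attrs: dict  # attributes the source node must match
    dst_attrs: dict  # attributes the destination node must match


POLICY = [
    ForbiddenEdge("P-12", {"classification": "confidential"}, {"external": True}),
    ForbiddenEdge("P-31", {"kind": "credential"}, {"kind": "chat_channel"}),
]


def violations(edges, nodes, policy=POLICY):
    """Return IDs of rules violated by any edge in the post-action graph."""
    def matches(attrs, required):
        return all(attrs.get(k) == v for k, v in required.items())

    return [
        rule.rule_id
        for rule in policy
        for src, dst in edges
        if matches(nodes[src], rule.src_attrs) and matches(nodes[dst], rule.dst_attrs)
    ]
```

Checking then reduces to pattern matching over the materialized post-action graph, which is what makes the compliance decision mechanical rather than interpretive.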
Load-bearing premise
That all policy-relevant facts can be captured in an organizational knowledge graph whose structural invariants suffice to decide Allow/Block/Clarify for every violation category, and that the graph can be kept accurate enough for speculative execution.
What would settle it
A collection of policy violations where the deciding facts cannot be encoded as graph structure or invariants, causing the simulation to approve actions that should be blocked.
Original abstract
LLM-based agents can execute actions that are syntactically valid, user-sanctioned, and semantically appropriate, yet still violate organizational policy because the facts needed for correct policy judgment are hidden at decision time. We call this failure mode policy-invisible violations: cases in which compliance depends on entity attributes, contextual state, or session history absent from the agent's visible context. We present PhantomPolicy, a benchmark spanning eight violation categories with balanced violation and safe-control cases, in which all tool responses contain clean business data without policy metadata. We manually review all 600 model traces produced by five frontier models and evaluate them using human-reviewed trace labels. Manual review changes 32 labels (5.3%) relative to the original case-level annotations, confirming the need for trace-level human review. To demonstrate what world-state-grounded enforcement can achieve under favorable conditions, we introduce Sentinel, an enforcement framework based on counterfactual graph simulation. Sentinel treats every agent action as a proposed mutation to an organizational knowledge graph, performs speculative execution to materialize the post-action world state, and verifies graph-structural invariants to decide Allow/Block/Clarify. Against human-reviewed trace labels, Sentinel substantially outperforms a content-only DLP baseline (93.0% vs. 68.8% accuracy) while maintaining high precision, though it still leaves room for improvement on certain violation categories. These results demonstrate what becomes achievable once policy-relevant world state is made available to the enforcement layer.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper defines policy-invisible violations as cases where LLM agents produce syntactically valid, user-approved actions that nonetheless breach organizational policy because required facts (entity attributes, state, or history) are absent from the agent's context. It introduces the PhantomPolicy benchmark (eight violation categories, balanced violation/safe cases, 600 traces from five frontier models with all tool responses stripped of policy metadata), reports a manual review of all traces that altered 32 labels (5.3%), and presents Sentinel, an enforcement layer that performs counterfactual simulation on an organizational knowledge graph, materializes post-action states, and checks structural invariants to output Allow/Block/Clarify decisions. On the human-reviewed labels Sentinel reaches 93.0% accuracy versus 68.8% for a content-only DLP baseline while preserving high precision.
Significance. If the empirical comparison holds, the work is significant for surfacing a concrete, previously under-studied failure mode in agentic systems and for showing that world-state grounding via an organizational knowledge graph can produce a large, measurable lift in policy compliance. The transparent reporting of the 5.3% label change after trace-level review and the direct baseline comparison are positive features that strengthen the central claim.
major comments (2)
- [§5] §5 (Evaluation and Results): The central accuracy figures (93.0% for Sentinel, 68.8% for the DLP baseline) rest on human-reviewed trace labels, yet the manuscript provides no information on how the 600 traces were sampled, what instructions were given to reviewers, or whether inter-annotator agreement was measured. Because 32 labels were altered during review, these missing details are load-bearing for assessing the reliability of the reported performance gap.
- [§4.2] §4.2 (Sentinel enforcement layer): The claim that graph-structural invariants suffice to decide Allow/Block/Clarify for every violation category assumes a complete, accurate, and up-to-date organizational knowledge graph whose invariants capture all policy-relevant facts. While the paper scopes the result to “favorable conditions,” it does not quantify sensitivity to missing nodes, stale attributes, or incomplete invariant definitions, which directly affects whether the 93.0% figure generalizes beyond the benchmark.
minor comments (1)
- [Abstract] Abstract and §3: The five frontier models used to generate the 600 traces are not named; adding the model identifiers would improve reproducibility without altering the central claims.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive evaluation of the paper's significance. We address each major comment below and will revise the manuscript to improve clarity and address the raised concerns.
Point-by-point responses
- Referee: [§5] §5 (Evaluation and Results): The central accuracy figures (93.0% for Sentinel, 68.8% for the DLP baseline) rest on human-reviewed trace labels, yet the manuscript provides no information on how the 600 traces were sampled, what instructions were given to reviewers, or whether inter-annotator agreement was measured. Because 32 labels were altered during review, these missing details are load-bearing for assessing the reliability of the reported performance gap.
Authors: We agree that these details are necessary to evaluate label reliability. The manuscript reports that all 600 traces were manually reviewed with 32 labels (5.3%) altered relative to initial case-level annotations, but does not describe the sampling or review protocol. In the revised manuscript, we will add a subsection in §5 detailing the stratified sampling across the eight violation categories and five models, the review guidelines (which instructed reviewers to assess policy compliance using the full hidden context), and clarification that the review was performed collaboratively by the authors with consensus on ambiguous cases. We will also note that inter-annotator agreement was not computed due to the collaborative nature of the review. These additions will allow better assessment of the performance gap. revision: yes
- Referee: [§4.2] §4.2 (Sentinel enforcement layer): The claim that graph-structural invariants suffice to decide Allow/Block/Clarify for every violation category assumes a complete, accurate, and up-to-date organizational knowledge graph whose invariants capture all policy-relevant facts. While the paper scopes the result to “favorable conditions,” it does not quantify sensitivity to missing nodes, stale attributes, or incomplete invariant definitions, which directly affects whether the 93.0% figure generalizes beyond the benchmark.
Authors: The manuscript explicitly scopes the Sentinel results to favorable conditions with a complete and accurate organizational knowledge graph, as constructed in the PhantomPolicy benchmark, and does not claim broader generalization. We agree that a sensitivity analysis would strengthen the presentation. In the revision, we will add a discussion paragraph in §4.2 and/or an appendix that qualitatively addresses potential impacts of missing nodes, stale attributes, and incomplete invariants, along with suggestions for future empirical sensitivity tests. This maintains the core contribution while providing better context on limitations. revision: partial
Circularity Check
No significant circularity in empirical evaluation chain
Full rationale
The paper presents an empirical benchmark (PhantomPolicy) with 600 manually reviewed traces across eight violation categories, followed by direct accuracy comparison of Sentinel (93.0%) against a content-only DLP baseline (68.8%) on human-reviewed labels. Manual review changes only 5.3% of labels and is reported transparently. Sentinel is introduced as a counterfactual graph-simulation enforcement layer whose performance is measured against independent human annotations rather than derived from any internal fit, self-definition, or self-citation. No equations appear in the provided text, no self-citations are load-bearing for the accuracy claims, and the reported improvement is conditioned explicitly on the favorable assumption of a complete knowledge graph without reducing the measurement itself to that assumption. The derivation chain is therefore self-contained against external human benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Organizational policies can be expressed as invariants on a knowledge-graph representation of world state.
invented entities (3)
- policy-invisible violations: no independent evidence
- PhantomPolicy: no independent evidence
- Sentinel: no independent evidence