pith. machine review for the scientific record.

arxiv: 2605.05501 · v1 · submitted 2026-05-06 · 💻 cs.CR

Recognition: unknown

SOCpilot: Verifying Policy Compliance for LLM-Assisted Incident Response

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 16:09 UTC · model grok-4.3

classification 💻 cs.CR
keywords SOC · LLM copilot · policy compliance · incident response · deterministic verifier · approval gates · SOAR · compliance checking

The pith

A deterministic verifier removes 466 non-compliant approval-gated actions from LLM-generated incident plans without lowering recall on core tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

SOCpilot fixes the incident data, action catalog, policy rules, and a deterministic verifier so that compliance can be checked directly on the sequence of actions an LLM proposes for a security incident. In a case study with two LLM providers and 200 real incidents from a financial-sector SOC, the verifier strips out 466 actions that violate approval requirements, while the remaining actions still cover the same baseline tasks that human analysts handled in the paired reference cases. The same fixed setup produces stable aggregate rates when the corpus is re-run three times.

This matters because SOC teams are starting to rely on LLMs to draft response plans, yet those plans can still propose steps that skip mandatory gates or ordering even when the individual actions look valid in isolation. The work isolates the compliance question at the plan boundary and releases the full artifact so others can re-derive the public counts without private incident data.

Core claim

SOCpilot shows that by holding the incident package, action catalog, policy rules, verifier, and evidence surface constant, a deterministic checker can measure and enforce compliance on LLM-proposed action traces. Applied to plans from two providers across 200 incidents, the verifier removes 466 actions that breach approval gates, particularly for recovery and containment, while baseline-task recall stays level with analyst-authored SOAR references. Aggregate compliance rates remain unchanged across three reruns of the identical corpus. The system also exposes zero-cost checks for mandatory-step and ordering repairs.

What carries the argument

The deterministic verifier that inspects the full action trace against fixed policy rules encoding mandatory steps, required ordering, and explicit approval gates.
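The verifier described above can be pictured as a pure function over an ordered action trace. The sketch below is illustrative only: the field names, rule structure, and violation messages are assumptions for exposition, not the schema of the released artifact.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Action:
    name: str
    approved: bool = False  # whether an analyst approval is attached

@dataclass
class Policy:
    mandatory: set[str] = field(default_factory=set)            # steps every plan must contain
    order: list[tuple[str, str]] = field(default_factory=list)  # (earlier, later) precedence pairs
    approval_gated: set[str] = field(default_factory=set)       # actions requiring approval

def verify(trace: list[Action], policy: Policy) -> tuple[list[Action], list[str]]:
    """Deterministically check a proposed trace against fixed rules:
    strip unapproved gated actions, then report mandatory-step and
    ordering violations on what remains."""
    violations: list[str] = []
    # Approval gates: drop gated actions that carry no approval.
    kept = [a for a in trace
            if not (a.name in policy.approval_gated and not a.approved)]
    removed = len(trace) - len(kept)
    if removed:
        violations.append(f"removed {removed} unapproved gated action(s)")
    names = [a.name for a in kept]
    # Mandatory steps: every required step must appear somewhere.
    for step in sorted(policy.mandatory - set(names)):
        violations.append(f"missing mandatory step: {step}")
    # Ordering: the first occurrence of 'earlier' must precede 'later'.
    for earlier, later in policy.order:
        if earlier in names and later in names and names.index(earlier) > names.index(later):
            violations.append(f"ordering violated: {earlier} must precede {later}")
    return kept, violations
```

Because the function has no randomness and no model calls, rerunning it on a fixed corpus necessarily reproduces the same counts, which is consistent with the paper's stability claim across three reruns.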

If this is right

  • Compliance becomes checkable at the plan boundary before any analyst sees the LLM output.
  • The same fixed rules and verifier can be applied to outputs from different LLM providers to compare their compliance behavior under identical policy text.
  • Focus on approval-gated actions for recovery and containment isolates the highest-risk decisions in the response workflow.
  • Releasing the runnable artifact makes the compliance counts independently reproducible without access to private incident data.
  • Zero-cost readiness checks for ordering and mandatory-step repairs can be run on any new plan without additional LLM calls.
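The "zero-cost" framing suggests the readiness checks are pure functions of the plan and the fixed rules, with no additional model invocations. A minimal sketch under that assumption (function and rule names are hypothetical, not the artifact's actual interface):

```python
def suggest_repairs(names: list[str],
                    mandatory: set[str],
                    order: list[tuple[str, str]]) -> list[str]:
    """Propose mandatory-step and ordering repairs for a plan, computed
    only from the fixed policy rules; no LLM call is made."""
    suggestions: list[str] = []
    # Mandatory-step repairs: any required step absent from the plan.
    for step in sorted(mandatory - set(names)):
        suggestions.append(f"insert mandatory step '{step}'")
    # Ordering repairs: the first occurrence of 'earlier' must precede 'later'.
    for earlier, later in order:
        if earlier in names and later in names and names.index(earlier) > names.index(later):
            suggestions.append(f"move '{earlier}' before '{later}'")
    return suggestions
```

A SOAR integration could surface these suggestions next to the plan for one-click accept or override, which is the workflow the editorial extensions below anticipate.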

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The fixed-corpus design could be reused to test whether prompt changes or fine-tuning reduce the number of flagged actions before the verifier runs.
  • Similar boundary verifiers could be built for other regulated domains where LLMs generate action sequences that must respect approval or ordering constraints.
  • Integrating the repair suggestions directly into a SOAR platform would let analysts accept or override the deterministic fixes in one step.

Load-bearing premise

The policy rules and action catalog given to the verifier correctly and completely capture every mandatory compliance requirement that applies to the 200 incidents.

What would settle it

An independent audit by the original SOC analysts determining whether any of the 466 removed actions were actually performed in the paired human reference plans, or were required under the real operational policy.

Figures

Figures reproduced from arXiv: 2605.05501 by Ágney Lopes Roth Ferraz, Leonardo Vaz de Meneses, Lourenço Alves Pereira Júnior, and Sidnei Barbieri.

Figure 1
Figure 1: SOCpilot’s declared evaluation object. Private data stops at the release boundary. The released object contains canonical packages, the action catalog […] view at source ↗
Figure 2
Figure 2: Provider-level violation rates before and after deterministic verification. view at source ↗
read the original abstract

Security operations centers (SOCs) are beginning to use large language models (LLMs) as copilots to draft incident-response plans. These plans may include actions that are valid per the catalog but still violate mandatory steps, required ordering, or approval gates before analyst review. SOCpilot makes this compliance question measurable at the plan boundary. It fixes the incident package, action catalog, policy rules, verifier, and public evidence surface. Next, it verifies the copilot's proposed action trace. We evaluate two LLM providers on 200 real incidents from an anonymized production SOC in a financial-sector case study. We compare their plans to paired analyst-authored references from the same security orchestration, automation, and response (SOAR) cases. An identical inline policy text moves the two providers in opposite directions. A deterministic verifier removes 466 non-compliant, approval-gated actions, without reducing baseline-task recall. Aggregate rates remain stable across 3 reruns of the fixed corpus. The official evidence focuses on approval-gated decisions regarding recovery and containment. Separately, the artifact exposes zero-cost readiness checks for mandatory and ordering repairs. We release the runnable artifact so independent reviewers can rederive the public results without access to private incident data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces SOCpilot as a framework to make policy compliance measurable for LLM-generated incident-response plans in SOCs. It fixes the incident package, action catalog, inline policy text, and deterministic verifier, then evaluates two LLM providers on 200 real incidents from an anonymized financial-sector production SOC. Plans are compared against paired analyst-authored SOAR references; the verifier removes 466 non-compliant approval-gated actions without reducing baseline recall, with aggregate rates stable across three reruns of the fixed corpus. The artifact is released to allow independent rederivation of public results.

Significance. If the central results hold, the work supplies a practical, reproducible method for quantifying compliance gaps in LLM copilots for security operations, using real production incidents and analyst references. The release of the runnable artifact is a clear strength, enabling external verification of the reported counts and stability without access to private data.

major comments (2)
  1. [Abstract] Abstract and evaluation description: the reported removal of 466 non-compliant approval-gated actions (and the claim of no recall reduction) is produced by a deterministic verifier whose encoding of mandatory steps, ordering constraints, and approval gates is not specified or validated. The manuscript states that an identical inline policy text is used but provides no derivation, pseudocode, or cross-check against actual production SOC mandatory requirements, so any mismatch in the fixed rules directly alters the 466 count and stability claim.
  2. [Evaluation] Evaluation section: the stability of aggregate rates across three reruns is asserted for the fixed corpus, yet no description is given of how the verifier handles ordering or approval gates in the action trace, nor are error bars or per-incident breakdowns supplied. This leaves the load-bearing quantitative claim dependent on an unreviewed implementation.
minor comments (1)
  1. The abstract refers to 'zero-cost readiness checks for mandatory and ordering repairs' without defining what these checks consist of or how they are exposed in the artifact.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their positive assessment of the work's significance and the value of the released artifact. We agree that the manuscript would benefit from greater detail on the verifier and will revise accordingly to make the quantitative claims more transparent and self-contained.

read point-by-point responses
  1. Referee: [Abstract] Abstract and evaluation description: the reported removal of 466 non-compliant approval-gated actions (and the claim of no recall reduction) is produced by a deterministic verifier whose encoding of mandatory steps, ordering constraints, and approval gates is not specified or validated. The manuscript states that an identical inline policy text is used but provides no derivation, pseudocode, or cross-check against actual production SOC mandatory requirements, so any mismatch in the fixed rules directly alters the 466 count and stability claim.

    Authors: We acknowledge the need for more explicit specification. In the revised manuscript we will add a dedicated subsection to the Evaluation section containing: (1) pseudocode for the deterministic verifier, (2) a precise description of how mandatory steps, ordering constraints, and approval gates are encoded and enforced on action traces, and (3) the process by which the inline policy text was derived from the anonymized SOC's documented procedures. The complete verifier source, exact policy text, and all public evaluation artifacts are already released, permitting independent confirmation of the 466 count and recall preservation. We cannot release the original confidential internal SOC documents, but the inline text is a direct encoding of the mandatory requirements used in the study. revision: yes

  2. Referee: [Evaluation] Evaluation section: the stability of aggregate rates across three reruns is asserted for the fixed corpus, yet no description is given of how the verifier handles ordering or approval gates in the action trace, nor are error bars or per-incident breakdowns supplied. This leaves the load-bearing quantitative claim dependent on an unreviewed implementation.

    Authors: We will expand the Evaluation section to describe the verifier's handling of ordering and approval gates (cross-referenced to the new pseudocode subsection). Because the verifier is deterministic and the corpus of plans is fixed, aggregate rates are identical across the three reruns; we will state this explicitly. We will also add per-incident compliance breakdowns (as a table or supplementary figure) and note that error bars are inapplicable for a deterministic verifier on fixed input. These changes will make the stability claim fully reviewable from the manuscript itself while the artifact continues to support full re-execution. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation uses external incidents and deterministic verification.

full rationale

The paper reports an empirical evaluation on 200 real incidents from an anonymized production SOC, comparing LLM-generated plans against paired analyst-authored references using a fixed, deterministic verifier and inline policy text. No equations, derivations, or fitted parameters are described that would reduce the reported compliance counts (e.g., removal of 466 actions) or recall metrics to the inputs by construction. The central claims rest on direct application of the verifier to external data and references, with the artifact released for independent rederivation. This matches the default expectation of no circularity for papers grounded in external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that the supplied policy rules are authoritative and complete for the incidents tested; no free parameters are described, and the verifier is presented as deterministic.

axioms (1)
  • domain assumption: The fixed policy rules and action catalog accurately represent all mandatory compliance constraints for the evaluated incidents.
    Invoked when counting non-compliant actions; if false, the 466 removals lose meaning.
invented entities (1)
  • SOCpilot verifier · no independent evidence
    purpose: Deterministic checker for policy compliance on LLM action traces
    New system component introduced to make compliance measurable at the plan boundary.

pith-pipeline@v0.9.0 · 5533 in / 1272 out tokens · 20410 ms · 2026-05-08T16:09:18.965625+00:00 · methodology


Reference graph

Works this paper leans on

40 extracted references · 22 canonical work pages · 3 internal anchors

  1. [1]

    Matched and mismatched socs: A qualitative study on security operations center issues,

    F. B. Kokulu, A. Soneji, T. Bao, Y . Shoshitaishvili, Z. Zhao, A. Doupe, and G.-J. Ahn, “Matched and mismatched socs: A qualitative study on security operations center issues,” inProceedings of the ACM SIGSAC Conference on Computer and Communications Security. New York, NY , USA: Association for Computing Machinery, 2019, pp. 1955–1970. [Online]. Availabl...

  2. [2]

    99% false positives: A qualitative study of soc analysts’ perspectives on security alarms,

    B. A. AlAhmadi, L. Axon, and I. Martinovic, “99% false positives: A qualitative study of soc analysts’ perspectives on security alarms,” inUSENIX Security Symposium. Berkeley, CA, USA: USENIX Association, 2022, pp. 2783–2800. [Online]. Available: https: //www.usenix.org/conference/usenixsecurity22/presentation/alahmadi

  3. [3]

    In: Proceed- ings of the 2023 ACM SIGSAC Conference on Computer and Communica- tions Security

    M. Vermeer, N. Kadenko, M. van Eeten, C. Ganan, and S. Parkin, “Alert alchemy: Soc workflows and decisions in the management of nids rules,” inProceedings of the ACM SIGSAC Conference on Computer and Communications Security. New York, NY , USA: Association for Computing Machinery, 2023, pp. 2770–2784. [Online]. Available: https://doi.org/10.1145/3576915.3616581

  4. [4]

    Lessons lost: Incident response in the age of cyber insurance and breach attorneys,

    D. W. Woods, R. B ¨ohme, J. Wolff, and D. Schwarcz, “Lessons lost: Incident response in the age of cyber insurance and breach attorneys,” inUSENIX Security Symposium. Berkeley, CA, USA: USENIX Association, 2023, pp. 2259–2273. [Online]. Available: https://www.usenix.org/conference/usenixsecurity23/presentation/woods

  5. [5]

    Do you play it by the books? a study on incident response playbooks and influencing factors,

    D. Schlette, P. Empl, M. Caselli, T. Schreck, and G. Pernul, “Do you play it by the books? a study on incident response playbooks and influencing factors,” inIEEE Symposium on Security and Privacy (S&P). Los Alamitos, CA, USA: IEEE, 2024, pp. 3625–3643. [Online]. Available: https://doi.org/10.1109/SP54263.2024.00060

  6. [6]

    Pentestgpt: Evaluating and harnessing large language models for automated penetration testing,

    G. Deng, Y . Liu, V . M. Vilches, P. Liu, Y . Li, Y . Xu, M. Pinzger, S. Rass, T. Zhang, and Y . Liu, “Pentestgpt: Evaluating and harnessing large language models for automated penetration testing,” inUSENIX Security Symposium. Berkeley, CA, USA: USENIX Association, 2024. [Online]. Available: https : / / www. usenix . org / conference/usenixsecurity24/pre...

  7. [7]

    Yurascanner: Leveraging llms for task-driven web app scanning,

    A. Stafeev, T. Recktenwald, G. D. Stefano, S. Khodayari, and G. Pellegrino, “Yurascanner: Leveraging llms for task-driven web app scanning,” inProceedings of the Network and Distributed System Security Symposium (NDSS). Reston, V A, USA: The Internet Society, 2025, pp. 1–16. [Online]. Available: https://www.ndss- symposium. org/ndss-paper/yurascanner-leve...

  8. [8]

    Prompt inversion attack against collaborative inference of large language models,

    D. Wang, G. Zhou, X. Li, Y . Bai, L. Chen, T. Qin, J. Sun, and D. Li, “The digital cybersecurity expert: How far have we come?” inIEEE Symposium on Security and Privacy (S&P). Los Alamitos, CA, USA: IEEE, 2025, pp. 3273–3290. [Online]. Available: https://doi.org/10.1109/SP61157.2025.00198

  9. [9]

    Progent: Securing AI Agents with Privilege Control

    T. Shi, J. He, Z. Wang, H. Li, L. Wu, W. Guo, and D. Song, “Pro- gent: Programmable privilege control for llm agents,”arXiv preprint arXiv:2504.11703, 2025

  10. [10]

    AgentSpec: Customizable Runtime Enforcement for Safe and Reliable LLM Agents

    H. Wang, C. M. Poskitt, and J. Sun, “Agentspec: Customizable runtime enforcement for safe and reliable llm agents,” 2025. [Online]. Available: https://arxiv.org/abs/2503.18666

  11. [11]

    Cohen,Statistical Power Analysis for the Behavioral Sciences, 2nd ed

    J. Cohen,Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Hillsdale, NJ: Lawrence Erlbaum Associates, 1988

  12. [12]

    Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection,

    K. Greshake, S. Abdelnabi, S. Mishra, C. Endres, T. Holz, and M. Fritz, “Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection,” inProceedings of the 16th ACM Workshop on Artificial Intelligence and Security (AISec). New York, NY , USA: Association for Computing Machinery, 2023, pp. 79–90. [Onlin...

  13. [13]

    I know what you asked: Prompt leakage via kv- cache sharing in multi-tenant llm serving,

    G. Wu, Z. Zhang, Y . Zhang, W. Wang, J. Niu, Y . Wu, and Y . Zhang, “I know what you asked: Prompt leakage via kv- cache sharing in multi-tenant llm serving,” inProceedings of the Network and Distributed System Security Symposium (NDSS). Reston, V A, USA: The Internet Society, 2025. [Online]. Available: https: //www.ndss-symposium.org/wp-content/uploads/2...

  14. [14]

    Prompt inversion attack against collaborative inference of large language models,

    W. Qu, Y . Zhou, Y . Wu, T. Xiao, B. Yuan, Y . Li, and J. Zhang, “Prompt inversion attack against collaborative inference of large language models,” inIEEE Symposium on Security and Privacy (S&P). Los Alamitos, CA, USA: IEEE, 2025, pp. 1695–1712. [Online]. Available: https://doi.org/10.1109/SP61157.2025.00160

  15. [15]

    The philosopher’s stone: Trojaning plugins of large language models,

    T. Dong, M. Xue, G. Chen, R. Holland, Y . Meng, S. Li, Z. Liu, and H. Zhu, “The philosopher’s stone: Trojaning plugins of large language models,” inProceedings of the Network and Distributed System Security Symposium (NDSS). Reston, V A, USA: The Internet Society, 2025. [Online]. Available: https://www.ndss-symposium.org/ndss-paper/the- philosophers-stone...

  16. [16]

    Isolategpt: An execution isolation architecture for llm-based agentic systems,

    Y . Wu, F. Roesner, T. Kohno, N. Zhang, and U. Iqbal, “Isolategpt: An execution isolation architecture for llm-based agentic systems,” in Proceedings of the Network and Distributed System Security Symposium (NDSS). Reston, V A, USA: The Internet Society, 2025, pp. 1–20. [Online]. Available: https://www.ndss- symposium.org/wp- content/ uploads/2025-1131-paper.pdf

  17. [17]

    Enforceable security policies,

    F. B. Schneider, “Enforceable security policies,”ACM Transactions on Information and System Security (TISSEC), vol. 3, no. 1, pp. 30–50,

  18. [18]

    Available: https://doi.org/10.1145/353323.353382

    [Online]. Available: https://doi.org/10.1145/353323.353382

  19. [19]

    Edit automata: Enforcement mechanisms for run-time security policies,

    J. Ligatti, L. Bauer, and D. Walker, “Edit automata: Enforcement mechanisms for run-time security policies,”International Journal of Information Security, vol. 4, no. 1, pp. 2–16, 2005. [Online]. Available: https://doi.org/10.1007/s10207-004-0046-8

  20. [20]

    Kinetic: Verifiable dynamic network control,

    H. Kim, J. Reich, A. Gupta, M. Shahbaz, N. Feamster, and R. Clark, “Kinetic: Verifiable dynamic network control,” inUSENIX Symposium on Networked Systems Design and Implementation (NSDI). Berkeley, CA, USA: USENIX Association, 2015, pp. 59–72. [Online]. Available: https://www.usenix.org/conference/nsdi15/technical-sessions/ presentation/kim

  21. [21]

    Psi: Precise security instrumentation for enterprise networks,

    T. Yu, S. K. Fayaz, M. Collins, V . Sekar, and S. Seshan, “Psi: Precise security instrumentation for enterprise networks,” inProceedings of the Network and Distributed System Security Symposium (NDSS). Reston, V A, USA: The Internet Society, 2017, pp. 1–15. [Online]. Available: https://www.ndss-symposium.org/ndss2017/ndss-2017-programme/psi- precise-secur...

  22. [22]

    Sok: Towards a unified approach to applied replicability for computer security,

    D. Olszewski, T. Tucker, K. R. B. Butler, and P. Traynor, “Sok: Towards a unified approach to applied replicability for computer security,” inUSENIX Security Symposium. Berkeley, CA, USA: USENIX Association, 2025, pp. 469–488. [Online]. Available: https: //www.usenix.org/conference/usenixsecurity25/presentation/olszewski

  23. [23]

    Probable inference, the law of succession, and statistical inference,

    E. B. Wilson, “Probable inference, the law of succession, and statistical inference,”Journal of the American Statistical Association, vol. 22, no. 158, pp. 209–212, 1927. [Online]. Available: https: //doi.org/10.1080/01621459.1927.10502953

  24. [24]

    Psychometrika12(2), 153–157 (1947)https://doi

    Q. McNemar, “Note on the sampling error of the difference between correlated proportions or percentages,”Psychometrika, vol. 12, no. 2, pp. 153–157, 1947. [Online]. Available: https://doi.org/10.1007/BF02295996

  25. [25]

    A simple sequentially rejective multiple test procedure.Scan- dinavian Journal of Statistics, 6(2):65–70, 1979

    S. Holm, “A simple sequentially rejective multiple test procedure,” Scandinavian Journal of Statistics, vol. 6, no. 2, pp. 65–70, 1979. [Online]. Available: https://www.jstor.org/stable/4615733

  26. [26]

    Work-from-home and covid-19: Trajectories of endpoint security management in a security operations center,

    K. R. Jones, D. A. Brucker-Hahn, B. Fidler, and A. G. Bardas, “Work-from-home and covid-19: Trajectories of endpoint security management in a security operations center,” inUSENIX Security Symposium. Berkeley, CA, USA: USENIX Association, 2023, pp. 2293–2310. [Online]. Available: https://www.usenix.org/conference/ usenixsecurity23/presentation/jones

  27. [27]

    Ruling the rules: Quantifying the evolution of rulesets, alerts and incidents in network intrusion detection,

    M. Vermeer, M. van Eeten, and C. Ga ˜n´an, “Ruling the rules: Quantifying the evolution of rulesets, alerts and incidents in network intrusion detection,” inProceedings of the ACM Asia Conference on Computer and Communications Security (AsiaCCS). New York, NY , USA: Association for Computing Machinery, 2022, pp. 799–814. [Online]. Available: https://doi.o...

  28. [28]

    True attacks, attack attempts, or benign triggers? an empirical measurement of network alerts in a security operations center,

    L. Yang, Z. Chen, C. Wang, Z. Zhang, S. Booma, P. Cao, C. Adam, A. Withers, Z. Kalbarczyk, R. K. Iyer, and G. Wang, “True attacks, attack attempts, or benign triggers? an empirical measurement of network alerts in a security operations center,” inUSENIX Security Symposium. Berkeley, CA, USA: USENIX Association, 2024, pp. 1525–1542. [Online]. Available: ht...

  29. [29]

    Provg-searcher: A graph representation learning approach for efficient provenance graph search,

    E. Altinisik, F. Deniz, and H. T. Sencar, “Provg-searcher: A graph representation learning approach for efficient provenance graph search,” inProceedings of the ACM SIGSAC Conference on Computer and Communications Security. New York, NY , USA: Association for Computing Machinery, 2023, pp. 2247–2261. [Online]. Available: https://doi.org/10.1145/3576915.3623187

  30. [30]

    Tapas: An efficient online apt detection with task-guided process provenance graph segmentation and analysis,

    B. Zhang, Y . Gao, C. Yu, B. Kuang, Z. Zhang, H. Kim, and A. Fu, “Tapas: An efficient online apt detection with task-guided process provenance graph segmentation and analysis,” inUSENIX Security Symposium. Berkeley, CA, USA: USENIX Association, 2025, pp. 607–624. [Online]. Available: https://www.usenix.org/conference/ usenixsecurity25/presentation/zhang-bo-tapas

  31. [31]

    In: Proceed- ings of the 2023 ACM SIGSAC Conference on Computer and Communica- tions Security

    F. Dong, S. Li, P. Jiang, D. Li, H. Wang, L. Huang, X. Xiao, J. Chen, X. Luo, Y . Guo, and X. Chen, “Are we there yet? an industrial viewpoint on provenance-based endpoint detection and response tools,” inProceedings of the ACM SIGSAC Conference on Computer and Communications Security. New York, NY , USA: Association for Computing Machinery, 2023, pp. 239...

  32. [32]

    Ocr-apt: Reconstructing apt stories from audit logs using subgraph anomaly detection and llms,

    A. Aly, E. Mansour, and A. M. Youssef, “Ocr-apt: Reconstructing apt stories from audit logs using subgraph anomaly detection and llms,” inProceedings of the ACM SIGSAC Conference on Computer and Communications Security. New York, NY , USA: Association for Computing Machinery, 2025, pp. 261–275. [Online]. Available: https://doi.org/10.1145/3719027.3765219

  33. [33]

    Raconteur: A knowledgeable, insightful, and portable llm-powered shell command explainer,

    J. Deng, X. Li, Y . Chen, Y . Bai, H. Weng, Y . Liu, T. Wei, and W. Xu, “Raconteur: A knowledgeable, insightful, and portable llm-powered shell command explainer,” inProceedings of the Network and Distributed System Security Symposium (NDSS). Reston, V A, USA: The Internet Society, 2025, pp. 1–18. [Online]. Available: https: //www.ndss- symposium.org/ndss...

  34. [34]

    CyberSecEval 2: A wide-ranging cybersecurity evaluation suite for large language models.arXiv preprint arXiv:2404.13161, 2024

    M. Bhatt, S. Chennabasappa, Y . Li, C. Nikolaidis, D. Song, S. Wan, F. Ahmad, C. Aschermann, Y . Chen, D. Kapil, D. Molnar, S. Whitman, and J. Saxe, “Cyberseceval 2: A wide-ranging cybersecurity evaluation suite for large language models,” 2024. [Online]. Available: https://arxiv.org/abs/2404.13161

  35. [35]

    Towards enforcing company policy adherence in agentic workflows,

    N. Zwerdling, D. Boaz, D. Amid, E. Rabinovich, A. Anaby-Tavor, and G. Uziel, “Towards enforcing company policy adherence in agentic workflows,” inProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track, 2025. [Online]. Available: https://aclanthology.org/2025.emnlp-industry.41/

  36. [36]

    Formal Policy Enforcement for Real-World Agentic Systems

    N. Palumbo, S. Choudhary, J. Choi, P. Chalasani, and S. Jha, “Policy compiler for secure agentic systems,” 2026. [Online]. Available: https://arxiv.org/abs/2602.16708

  37. [37]

    Veriguard: Enhancing llm agent safety via verified code generation,

    L. Miculicich, M. Parmar, H. Palangi, K. D. Dvijotham, M. Montanari, T. Pfister, and L. T. Le, “Veriguard: Enhancing llm agent safety via verified code generation,” 2025. [Online]. Available: https : //arxiv.org/abs/2510.05156

  38. [38]

    Shieldagent: Shielding agents via verifiable safety policy reasoning

    Z. Chen, M. Kang, and B. Li, “Shieldagent: Shielding agents via verifiable safety policy reasoning,” 2025. [Online]. Available: https://arxiv.org/abs/2503.22738

  39. [39]

    Evaluating implicit regulatory compliance in llm tool invocation via logic-guided synthesis.arXiv preprint arXiv:2601.08196,

    D. Song, Y . Huang, B. Chen, T. Cong, R. Goebel, L. Ma, and F. Khomh, “Evaluating implicit regulatory compliance in llm tool invocation via logic-guided synthesis,” 2026. [Online]. Available: https://arxiv.org/abs/2601.08196

  40. [40]

    Cacao security playbooks version 2.0,

    B. Jordan and A. Thomson, “Cacao security playbooks version 2.0,” OASIS Committee Specification 01, Nov. 2023, oASIS Open. [Online]. Available: https://docs.oasis-open.org/cacao/security-playbooks/v2.0/ cs01/security-playbooks-v2.0-cs01.html APPENDIX OPENSCIENCE The paper’s main claims are carried in the body. This appendix is supplementary: it states the...