pith. machine review for the scientific record.

arxiv: 2605.02346 · v1 · submitted 2026-05-04 · 💻 cs.CR · cs.AI

APIOT: Autonomous Vulnerability Management Across Bare-Metal Industrial OT Networks

Pith reviewed 2026-05-08 17:57 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords autonomous vulnerability management · bare-metal OT · industrial control systems · LLM agents · purple teaming · Modbus/TCP · Zephyr RTOS · security automation

The pith

An LLM framework autonomously discovers, exploits, patches, and verifies fixes for vulnerabilities in bare-metal OT devices.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces APIOT, a system that lets large language models handle the complete cycle of finding, exploiting, patching, and verifying fixes for vulnerabilities on bare-metal operational technology (OT) devices. These devices are microcontrollers that run industrial protocols directly, without the shells or filesystems that earlier AI security tools rely on, so the agents must work at the level of protocol fields and message formats. Across 290 experiment runs on Zephyr RTOS setups, the paper reports a 90.0 percent mission success rate when a governance layer is present; removing that layer leads to agents repeating actions or getting stuck. If the approach holds up, industrial control systems can be tested and secured with far less manual effort than before.
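To make "protocol fields and message formats" concrete, the following is a minimal Python sketch of a Modbus/TCP request frame as laid out in the public Modbus specification. The function names are illustrative assumptions and are not taken from the paper, whose tooling is not described on this page.

```python
import struct


def build_read_holding_registers(transaction_id: int, unit_id: int,
                                 start_addr: int, quantity: int) -> bytes:
    """Build a Modbus/TCP Read Holding Registers (function 0x03) request."""
    # PDU: function code (1 byte), starting address (2 bytes), quantity (2 bytes).
    pdu = struct.pack(">BHH", 0x03, start_addr, quantity)
    # MBAP header: transaction id, protocol id (0 for Modbus), length, unit id.
    # The length field counts the unit id byte plus the PDU.
    mbap = struct.pack(">HHHB", transaction_id, 0, len(pdu) + 1, unit_id)
    return mbap + pdu


def parse_frame(frame: bytes) -> dict:
    """Split a frame into MBAP fields and report declared vs. actual length."""
    if len(frame) < 8:
        raise ValueError("frame too short for MBAP header plus function code")
    transaction_id, protocol_id, length, unit_id = struct.unpack(">HHHB", frame[:7])
    return {
        "transaction_id": transaction_id,
        "protocol_id": protocol_id,
        "unit_id": unit_id,
        "function_code": frame[7],
        "declared_length": length,
        # Bytes that actually follow the 6-byte prefix (ids and length field).
        "actual_length": len(frame) - 6,
    }


frame = build_read_holding_registers(transaction_id=1, unit_id=1,
                                     start_addr=0, quantity=10)
fields = parse_frame(frame)
```

With no shell available, fields like `declared_length` and `function_code` are the only surface an agent can observe and manipulate, which is why the comparison between declared and actual length is exposed here.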

Core claim

APIOT is the first large language model framework demonstrating an autonomous attack and remediation of bare-metal OT devices, achieving the full discovery to exploitation to patching to verification cycle without step-by-step human intervention, with a 90.0 percent mission success rate on Zephyr RTOS firmware across heterogeneous IIoT topologies.

What carries the argument

The runtime governance layer called the overseer, which monitors LLM agent actions to block repetition loops, missing crash checks, and reconnaissance deadlocks while agents reason over protocol fields.
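As a rough illustration of what such a governance layer could look like, here is a minimal Python sketch. It is an assumption-laden toy, not the paper's overseer: the class name, thresholds, and return strings are hypothetical, and the tool categories (exploit, verify, recon) are borrowed from the Figure 3 caption.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Overseer:
    """Hypothetical runtime guard that reviews each proposed tool call."""
    max_repeats: int = 3        # identical calls tolerated in the recent window
    max_recon_streak: int = 5   # consecutive recon calls tolerated
    recent: deque = field(default_factory=lambda: deque(maxlen=16))
    recon_streak: int = 0
    awaiting_crash_check: bool = False

    def review(self, category: str, call: str) -> str:
        """Return 'allow' or a short reason for blocking the call."""
        # Missing crash check: an exploit must be followed by a verify call.
        if self.awaiting_crash_check and category != "verify":
            return "block: verify the previous exploit first"
        # Repetition loop: the same call has already appeared too often.
        if list(self.recent).count(call) >= self.max_repeats:
            return "block: repeated call"
        # Reconnaissance deadlock: too many recon calls in a row.
        if category == "recon" and self.recon_streak >= self.max_recon_streak:
            return "block: reconnaissance limit reached"
        # The call is allowed; update the bookkeeping.
        self.recent.append(call)
        self.recon_streak = self.recon_streak + 1 if category == "recon" else 0
        self.awaiting_crash_check = category == "exploit"
        return "allow"


overseer = Overseer()
decisions = [overseer.review("recon", "scan_registers") for _ in range(4)]
```

The guard only bounds what the agent may do next; it does not alter the model's tendencies, which matches the behaviour described in the Figure 6 caption.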

If this is right

  • Attacker expertise is no longer the binding constraint on bare-metal OT exploitation.
  • Defender threat models must now assume LLM-augmented adversaries capable of executing autonomous discovery-through-remediation cycles against industrial firmware.
  • Without the overseer governance layer, agents exhibit systematic degenerate patterns including repetition loops and reconnaissance deadlocks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The protocol-level reasoning design could be reused for other constrained embedded systems that lack standard operating system interfaces.
  • Security teams for industrial networks may need to add automated LLM-based red teaming to their regular assessment practices.

Load-bearing premise

The Zephyr RTOS testbed and the chosen Modbus/TCP and CoAP implementations represent the range of constraints found in actual bare-metal industrial OT deployments.

What would settle it

Running the framework on production bare-metal OT hardware with a different microcontroller or protocol and observing failure to complete the full cycle without human help would show that the reported success rate and the necessity of the overseer do not generalize.

Figures

Figures reproduced from arXiv: 2605.02346 by Adel ElZemity, Budi Arief, Calvin Brierley, George Oikonomou, Haoxiang Li, James Pope, Shujun Li, Yichao Wang, Yuxiang Huang.

Figure 1: High-level architecture of the APIOT autonomous purple-teaming framework and its interface with the IoT Virtual Lab testbed (see Section 3 for details).
Figure 2: Mean turn counts to three milestones: the first exploit attempt, the red-to-blue phase transition (attack complete, defence begins), and the mission completion. The recon-to-exploit gap (turns 3–4) is consistently short, and most of the mission is spent in the exploit-and-verify loop rather than the blue-team phase.
Figure 3: Fraction of tool calls by category (exploit, verify, recon, blue team) per protocol × topology condition. Exploit and recon together dominate in all conditions; blue-team calls are consistently the smallest category (5.0–19.0%).
Figure 4: Mean turns to mission completion vs. network impairment level (all runs successful at 100%). CoAP turn count is stable across all conditions (≈12 turns); Modbus increases from 23 turns at baseline to 34 turns under heavy impairment (20% loss, 200 ms latency), indicating efficiency degradation without mission failure.
Figure 5: Model sensitivity in mission success rate by model and condition (n = 10 per cell). All models achieve 100% on easy (CoAP, guided); the hard condition (MQTT, multi-step) separates models, with GPT-5.4 failing all ten runs; the blind condition (CoAP, no protocol hints) shows Claude and MiniMax M2.5 as models below 100%.
Figure 6: Overseer ON vs. OFF comparison on T1 runs. Left: mean turns to completion, where ON is 20.5% lower, reflecting the overseer intercepting runaway loops before they accumulate. Right: redundant-call rate is nearly identical across arms (37.1% vs. 38.7%), because the LLM exhibits the same repetition tendencies with or without oversight; the overseer does not change model behaviour, it bounds its consequences.…

Bare-metal operational technology (OT) devices -- especially the microcontrollers running Modbus/TCP and CoAP at the base of industrial control systems -- have remained outside the reach of autonomous security attacks. Prior autonomous pentesting studies target Linux and web systems, whose shells and filesystems are familiar to LLM agents. Bare-metal OT has neither, so agents must reason directly over protocol fields and parser semantics. This requires new action-space designs and runtime controls, and opens new research questions about protocol-level exploit reasoning and its deployment envelope. We present APIOT (Autonomous Purple-teaming for Industrial OT), the first large language model (LLM) framework demonstrating an autonomous attack and remediation of bare-metal OT devices, achieving the full discovery -> exploitation -> patching -> verification cycle without step-by-step human intervention. We implemented and evaluated this framework on Zephyr RTOS firmware across heterogeneous industrial IoT (IIoT) topologies. Through 290 experiment runs spanning five frontier LLMs, three network topologies, two impairment levels, and guided versus unguided conditions, APIOT achieved a mission success rate of 90.0% on the full attack-remediation cycle. We found that the runtime governance layer (which we call an overseer) is a critical engineering variable: without it, agents exhibit systematic degenerate patterns, including repetition loops, missing crash verification, and reconnaissance deadlocks. Together, these findings carry two implications beyond our testbed. Attacker expertise is no longer the binding constraint on bare-metal OT exploitation, and defender threat models must now assume LLM-augmented adversaries capable of executing autonomous discovery-through-remediation cycles against industrial firmware.
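For readers unfamiliar with CoAP, the kind of "protocol fields and parser semantics" the abstract refers to can be sketched with the fixed 4-byte header defined in RFC 7252. This is a minimal Python illustration; the helper names are assumptions, and it says nothing about how APIOT itself builds or parses messages.

```python
import struct

COAP_VERSION = 1
TYPE_CONFIRMABLE = 0
CODE_GET = 0x01  # class 0, detail 01


def build_coap_header(msg_type: int, code: int, message_id: int,
                      token: bytes = b"") -> bytes:
    """Pack the fixed CoAP header followed by the token."""
    if len(token) > 8:
        raise ValueError("token length must be 0-8 bytes")
    # First byte: version (2 bits), type (2 bits), token length (4 bits).
    first = (COAP_VERSION << 6) | (msg_type << 4) | len(token)
    return struct.pack(">BBH", first, code, message_id) + token


def parse_coap_header(data: bytes) -> dict:
    """Unpack the fixed header and reject lengths a parser must not trust."""
    if len(data) < 4:
        raise ValueError("message shorter than the 4-byte header")
    first, code, message_id = struct.unpack(">BBH", data[:4])
    token_length = first & 0x0F
    if token_length > 8:
        # RFC 7252 reserves token lengths 9-15 and treats them as format errors.
        raise ValueError("reserved token length")
    if len(data) < 4 + token_length:
        raise ValueError("token truncated")
    return {
        "version": first >> 6,
        "type": (first >> 4) & 0x03,
        "token_length": token_length,
        "code_class": code >> 5,
        "code_detail": code & 0x1F,
        "message_id": message_id,
        "token": data[4:4 + token_length],
    }


request = build_coap_header(TYPE_CONFIRMABLE, CODE_GET, 0x1234, token=b"\xaa")
parsed = parse_coap_header(request)
```

The two length checks in `parse_coap_header` are examples of the parser-level decisions that a firmware implementation has to get right and that a protocol-level agent would probe.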

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces APIOT, an LLM framework for autonomous end-to-end vulnerability management (discovery, exploitation, patching, verification) on bare-metal OT devices lacking shells or filesystems. It evaluates the system on Zephyr RTOS firmware implementing Modbus/TCP and CoAP across heterogeneous IIoT topologies, reporting a 90% mission success rate over 290 runs spanning five frontier LLMs, three topologies, two impairment levels, and guided/unguided conditions. The work identifies a runtime governance layer ('overseer') as essential to prevent degenerate agent behaviors such as repetition loops and reconnaissance deadlocks.

Significance. If the results hold under rigorous scrutiny, the paper would establish the first demonstrated autonomous LLM-driven attack-remediation cycle on protocol-level bare-metal OT, with direct implications for shifting defender threat models to assume LLM-augmented adversaries. The multi-LLM, multi-topology empirical design and explicit ablation of the overseer component provide concrete, falsifiable evidence rather than purely theoretical claims.

major comments (3)
  1. [Testbed description and evaluation] The central generalization that 'attacker expertise is no longer the binding constraint' rests on the Zephyr RTOS testbed sufficiently embodying the paper's premise of devices with neither shells nor filesystems and minimal firmware. Zephyr supplies an RTOS kernel, standard stacks, and timing guarantees that may provide more structure and API surfaces than typical hand-written bare-metal microcontroller firmware; this risks overstating applicability to real industrial OT deployments (abstract and evaluation sections).
  2. [Experimental methodology] The reported 90.0% mission success rate over 290 runs is load-bearing for the primary claim, yet the manuscript provides insufficient detail on data exclusion rules, crash-verification procedures, and whether overseer parameters or prompt engineering were adjusted post-hoc. Without these, reproducibility and robustness cannot be assessed (abstract and experimental results sections).
  3. [Overseer ablation and results] The necessity of the overseer is demonstrated via ablation-style comparisons, but the paper should quantify the frequency and types of degenerate patterns (repetition loops, missing crash verification, reconnaissance deadlocks) in the without-overseer condition to make the engineering variable claim precise rather than qualitative.
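The quantification requested in major comment 3 could be as simple as a per-run tally. The Python sketch below assumes each run has already been labelled with the set of degenerate patterns it exhibited; the label strings and the example data are hypothetical and are not results from the paper.

```python
from collections import Counter

# Hypothetical labels for the three patterns named in the paper.
PATTERNS = ("repetition_loop", "missing_crash_verification", "recon_deadlock")


def pattern_frequencies(runs: list[set[str]]) -> dict[str, float]:
    """Return the fraction of runs that exhibit each degenerate pattern.

    Each run is a set, so a pattern counts at most once per run, and a
    single run may contribute to several patterns.
    """
    if not runs:
        raise ValueError("no runs to summarise")
    counts = Counter(p for run in runs for p in run if p in PATTERNS)
    return {p: counts[p] / len(runs) for p in PATTERNS}


# Made-up labels for four without-overseer runs, for illustration only.
example_runs = [
    {"repetition_loop"},
    {"repetition_loop", "recon_deadlock"},
    set(),
    {"missing_crash_verification"},
]
frequencies = pattern_frequencies(example_runs)
```

Reporting fractions per pattern, rather than one aggregate failure rate, is what would turn the qualitative claim into a measurement.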
minor comments (2)
  1. [Abstract and evaluation metrics] Clarify the precise definition of 'mission success' (e.g., whether partial cycle completion counts) and how verification of patching is performed in the absence of a shell.
  2. [Introduction] The literature review should more explicitly delineate differences from prior LLM-based pentesting frameworks targeting Linux/web environments to strengthen the 'first' claim for bare-metal OT.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. These have helped us identify areas where the manuscript can be strengthened for clarity, reproducibility, and precision. We address each major comment point by point below.

  1. Referee: [Testbed description and evaluation] The central generalization that 'attacker expertise is no longer the binding constraint' rests on the Zephyr RTOS testbed sufficiently embodying the paper's premise of devices with neither shells nor filesystems and minimal firmware. Zephyr supplies an RTOS kernel, standard stacks, and timing guarantees that may provide more structure and API surfaces than typical hand-written bare-metal microcontroller firmware; this risks overstating applicability to real industrial OT deployments (abstract and evaluation sections).

    Authors: We agree that Zephyr RTOS includes a kernel and standard protocol stacks, which differentiates it from purely hand-written bare-metal firmware lacking any OS layer. However, our implementation explicitly disables shells, filesystems, and higher-level interfaces, compelling agents to reason directly over protocol fields and parser behavior, which is the central technical challenge for bare-metal OT. Zephyr is widely deployed on IIoT and OT microcontrollers, making it a representative platform for protocol-level interactions. We will revise the abstract and evaluation sections to more precisely characterize the testbed as 'minimal RTOS firmware on microcontrollers with no shell or filesystem access' and add an explicit limitations paragraph discussing reduced applicability to fully custom bare-metal code without any RTOS. With this narrower characterization, the claim that attacker expertise is no longer the binding constraint for protocol-level OT exploitation is supported rather than overstated. revision: partial

  2. Referee: [Experimental methodology] The reported 90.0% mission success rate over 290 runs is load-bearing for the primary claim, yet the manuscript provides insufficient detail on data exclusion rules, crash-verification procedures, and whether overseer parameters or prompt engineering were adjusted post-hoc. Without these, reproducibility and robustness cannot be assessed (abstract and experimental results sections).

    Authors: We concur that additional methodological transparency is required. In the revised manuscript we will expand the experimental results section with: (1) explicit data exclusion rules (e.g., runs terminated by unrelated hardware faults or network configuration errors), (2) crash-verification procedures (independent network probes combined with firmware log inspection), and (3) a statement confirming that overseer parameters and prompt templates were fixed prior to the start of the 290 runs with no post-hoc tuning. These additions will enable independent reproduction and evaluation of the reported success rate. revision: yes

  3. Referee: [Overseer ablation and results] The necessity of the overseer is demonstrated via ablation-style comparisons, but the paper should quantify the frequency and types of degenerate patterns (repetition loops, missing crash verification, reconnaissance deadlocks) in the without-overseer condition to make the engineering variable claim precise rather than qualitative.

    Authors: We will augment the overseer ablation analysis with quantitative results drawn from the experimental logs. The revised section will report the percentage of runs in the without-overseer condition that exhibited each degenerate pattern (repetition loops, missing crash verification, reconnaissance deadlocks) together with representative trace excerpts. This will convert the current qualitative description into precise, falsifiable measurements of the overseer's engineering impact. revision: yes
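Response 2 above mentions independent network probes as part of crash verification. A minimal Python sketch of one such probe follows, assuming the target exposes Modbus/TCP on a TCP port (502 is the registered default) and that a refused or timed-out connection is treated as evidence of a crash. The function name and timeout are hypothetical, and the authors' actual procedure may differ; a CoAP target would need a UDP-based check instead.

```python
import socket


def tcp_port_alive(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) can be established."""
    try:
        # create_connection raises OSError on refusal, timeout, or no route.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def verify_crash(host: str, port: int = 502, attempts: int = 3) -> bool:
    """Treat the target as crashed only if every probe attempt fails."""
    return not any(tcp_port_alive(host, port) for _ in range(attempts))
```

Requiring several failed attempts before declaring a crash is a design choice that reduces false positives under the lossy network conditions the evaluation includes.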

Circularity Check

0 steps flagged

No circularity: empirical success rates derived from external testbed runs

The paper's central result is an empirical demonstration of a 90.0% mission success rate across 290 runs on a Zephyr RTOS testbed implementing Modbus/TCP and CoAP. No equations, fitted parameters, or self-referential definitions are used to derive this rate; it is obtained by direct execution of the APIOT framework under varied conditions (five LLMs, three topologies, two impairment levels, guided/unguided). The overseer component is validated through ablation-style comparisons showing degenerate behaviors without it, rather than being presupposed. The derivation chain is therefore self-contained against the external benchmark of the implemented testbed and does not reduce to any of the enumerated circular patterns.
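Because the headline figure is an empirical proportion, its sampling uncertainty can be estimated directly. The Python sketch below computes a Wilson score interval; it assumes the 90.0% rate is taken over all 290 runs (that is, 261 successes), which is an inference from the abstract rather than a count reported on this page.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# Assumed counts: 261 successes out of 290 runs (90.0%).
low, high = wilson_interval(261, 290)
```

Under that assumption the interval spans roughly 0.86 to 0.93. It treats runs as independent draws, which ignores the fact that they are grouped by model, topology, and impairment level.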

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the empirical demonstration that LLM agents can reason over protocol fields when equipped with an overseer; no free parameters are fitted in the abstract, but the success metric depends on the chosen testbed and impairment levels.

axioms (1)
  • domain assumption: Bare-metal OT devices expose only protocol fields and parser semantics rather than shells or filesystems.
    Stated in the opening paragraph as the reason prior LLM agents cannot be applied directly.
invented entities (1)
  • Overseer runtime governance layer (no independent evidence)
    purpose: Prevent degenerate agent behaviors such as repetition loops and reconnaissance deadlocks during autonomous OT exploitation.
    Introduced as a critical engineering variable whose absence produces systematic failures; no independent falsifiable prediction is given beyond the reported ablation.

pith-pipeline@v0.9.0 · 5630 in / 1419 out tokens · 33598 ms · 2026-05-08T17:57:12.626084+00:00 · methodology

Reference graph

Works this paper leans on

35 extracted references · 19 canonical work pages · 2 internal anchors

  1. [1] Alghamdi, T.A., Lasebae, A., Aiash, M.: Security analysis of the Constrained Application Protocol in the Internet of Things. In: Proceedings of the 2nd International Conference on Future Generation Communication Technologies. pp. 163– IEEE (2013). https://doi.org/10.1109/FGCT.2013.6767217

  2. [2] ARM Limited: Cortex-M3 Technical Reference Manual, Revision r2p0 (2010), https://developer.arm.com/documentation/ddi0337/latest/

  3. [3] Arvind, S., Narayanan, A.: An overview of security in CoAP: Attack and analysis. In: Proceedings of the 2019 5th International Conference on Advanced Computing & Communication Systems. pp. 655–660. IEEE (2019). https://doi.org/10.1109/ICACCS.2019.8728533

  4. [4] Banks, A., Gupta, R.: MQTT version 3.1.1. OASIS Standard (2014), https://docs.oasis-open.org/mqtt/mqtt/v3.1.1/os/mqtt-v3.1.1-os.html

  5. [5] Bellard, F.: QEMU, a fast and portable dynamic translator. In: Proceedings of the 2005 USENIX Annual Technical Conference. USENIX Association (2005), https://www.usenix.org/conference/2005-usenix-annual-technical-conference/qemu-fast-and-portable-dynamic-translator

  6. [6] Bhamare, D., Zolanvari, M., Erbad, A., et al.: Cybersecurity for industrial control systems: A survey. Computers & Security 89, 101677:1–101677:12 (2020). https://doi.org/10.1016/j.cose.2019.101677

  7. [7] Chen, D.D., Woo, M., Brumley, D., Egele, M.: Towards automated dynamic analysis for Linux-based embedded firmware. In: Proceedings of the 2016 Network and Distributed System Security Symposium. Internet Society (2016). https://doi.org/10.14722/ndss.2016.23415

  8. [8] Clements, A.A., Gustafson, E., Scharnowski, T., et al.: HALucinator: Firmware re-hosting through abstraction layer emulation. In: Proceedings of the 29th USENIX Conference on Security Symposium. USENIX Association (2020), https://www.usenix.org/conference/usenixsecurity20/presentation/clements

  9. [9] Das, B.C., Amini, M.H., Wu, Y.: Security and Privacy Challenges of Large Language Models: A Survey. ACM Computing Surveys 57(6) (Feb 2025). https://doi.org/10.1145/3712001

  10. [10] Deng, G., Liu, Y., Mayoral-Vilches, V., et al.: PentestGPT: Evaluating and harnessing large language models for automated penetration testing. In: Proceedings of the 33rd USENIX Security Symposium. pp. 847–864. USENIX Association (2024), https://www.usenix.org/conference/usenixsecurity24/presentation/deng

  11. [11] Fang, R., Bindu, R., Gupta, A., Kang, D.: LLM agents can autonomously exploit one-day vulnerabilities. arXiv:2404.08144 [cs.CR] (2024). https://doi.org/10.48550/arXiv.2404.08144

  12. [12] Feng, B., Mera, A., Lu, L.: P2IM: Scalable and hardware-independent firmware testing via automatic peripheral interface modeling. In: Proceedings of the 29th USENIX Security Symposium. pp. 1237–1254. USENIX Association (2020), https://www.usenix.org/conference/usenixsecurity20/presentation/feng

  13. [13] Ghazanfar, S., Hussain, F.B., ur Rehman, A., Fayyaz, U.U., Shahzad, F., Shah, G.A.: IoT-Flock: An open-source framework for IoT traffic generation. Proceedings of the 2020 International Conference on Emerging Trends in Smart Technologies (2020). https://doi.org/10.1109/ICETST49965.2020.9080732

  14. [14] Happe, A., Cito, J.: Getting pwn’d by AI: Penetration testing with large language models. In: Proceedings of the 2023 ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. pp. 2082– ACM (2023). https://doi.org/10.1145/3611643.3613083

  15. [15] Happe, A., Kaplan, A., Cito, J.: LLMs as hackers: Autonomous Linux privilege escalation attacks. Empirical Software Engineering 31, 70:1–70:50 (2025). https://doi.org/10.1007/s10664-025-10758-3

  16. [16] Kim, A., Kim, D., Kim, E., Kim, S., Jang, Y., Kim, Y.: FirmAE: Towards large-scale emulation of IoT firmware for dynamic analysis. In: Proceedings of the 2020 Annual Computer Security Applications Conference. pp. 733–744. ACM (2020). https://doi.org/10.1145/3427228.3427294

  17. [17] Lai, Y., Gao, H., Liu, J.: Vulnerability mining method for the Modbus TCP using an anti-sample fuzzer. Sensors 20, 2040:1–2040:20 (2020). https://doi.org/10.3390/s20072040

  18. [18] Li, M.Q., Fung, B.C.M.: Security concerns for large language models: A survey. Journal of Information Security and Applications 95, 104284 (2025). https://doi.org/10.1016/j.jisa.2025.104284

  19. [19] Liu, X., et al.: AgentBench: Evaluating LLMs as agents. In: Proceedings of the 2024 International Conference on Learning Representations (2024)

  20. [20] McLaughlin, S., Konstantinou, C., Wang, X., Davi, L., Sadeghi, A.R., Maniatakos, M., Karri, R.: The cybersecurity landscape in industrial control systems. Proceedings of the IEEE 104(5), 1039–1057 (2016). https://doi.org/10.1109/JPROC.2015.2512235

  21. [21] Modbus Organization: Modbus application protocol specification v1.1b3. Protocol specification (2012), https://modbus.org/docs/Modbus_Application_Protocol_V1_1b3.pdf

  22. [22] Mohan, N., Kangasharju, J.: Edge-Fog Cloud: A distributed cloud for internet of things computations. In: Proceedings of the 2016 Congress on Cloudification of the Internet of Things. IEEE (2016). https://doi.org/10.1109/CIOT.2016.7872914

  23. [23] Mou, Y., Xue, Z., Li, L., Liu, P., Zhang, S., Ye, W., Shao, J.: ToolSafe: Enhancing tool invocation safety of LLM-based agents via proactive step-level guardrail and feedback. arXiv:2601.10156 [cs.CL] (2026). https://doi.org/10.48550/arXiv.2601.10156

  24. [24] Raptis, E.K., Kapoutsis, A.C., Kosmatopoulos, E.B.: Agentic LLM-based robotic systems for real-world applications: a review on their agenticness and ethics. Frontiers in Robotics and AI 12 (2025). https://doi.org/10.3389/frobt.2025.1605405

  25. [25] Real Time Engineers Ltd: FreeRTOS - market leading RTOS for embedded systems. Website (2023), https://www.freertos.org/, accessed: 2025-01-15

  26. [26] Scharnowski, T., Bars, N., Schloegel, M., Gustafson, E., Muench, M., Vigna, G., Kruegel, C., Holz, T., Abbasi, A.: Fuzzware: Using precise MMIO modeling for effective firmware fuzzing. In: Proceedings of the 31st USENIX Security Symposium. pp. 1239–1256. USENIX Association (2022), https://www.usenix.org/conference/usenixsecurity22/presentation/scharnowski

  27. [27] Shelby, Z., Hartke, K., Bormann, C.: The Constrained Application Protocol (CoAP). RFC 7252, IETF (2014). https://doi.org/10.17487/RFC7252

  28. [28] Stouffer, K., Pease, M., Tang, C., Zimmerman, T., Pillitteri, V., Lightman, S., Hahn, A., Saravia, S., Sherule, A., Thompson, M.: Guide to operational technology (OT) security. NIST Special Publication 800-82 Rev. 3, National Institute of Standards and Technology, Gaithersburg, MD (2023). https://doi.org/10.6028/NIST.SP.800-82r3

  29. [29] Wang, H., Poskitt, C.M., Sun, J.: AgentSpec: Customizable runtime enforcement for safe and reliable LLM agents. arXiv:2503.18666 [cs.AI] (2025). https://doi.org/10.48550/arXiv.2503.18666, accepted to the 48th IEEE/ACM International Conference on Software Engineering (ICSE 2026)

  30. [30] Williams, T.J.: The Purdue enterprise reference architecture. Computers in Industry 24(2–3), 141–158 (1994). https://doi.org/10.1016/0166-3615(94)90017-5

  31. [31] Xu, J., Stokes, J.W., McDonald, G., Bai, X., Marshall, D., Wang, S., Swaminathan, A., Li, Z.: AutoAttacker: A large language model guided system to implement automatic cyber-attacks. arXiv:2403.01038 [cs.CR] (2024). https://doi.org/10.48550/arXiv.2403.01038

  32. [32] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing reasoning and acting in language models. In: Proceedings of the 2023 International Conference on Learning Representations (2023), https://openreview.net/forum?id=WE_vluYUL-X

  33. [33] Zephyr Project: Zephyr RTOS documentation, v3.7.0. Website (2024), https://docs.zephyrproject.org/