SHIELDS: Automating OS Hardening with Iterative Multi-Agent Remediation

Aldehir Rojas; Andrew Hamara; Dwight Horne; Lawrence Wong; Nicholas Turoci; Sophie Lamothe; Timothy Kurniawan; Vishal Suresh

arxiv: 2606.05476 · v1 · pith:D5U7BMLUnew · submitted 2026-06-03 · 💻 cs.CR · cs.MA

Pith reviewed 2026-06-28 05:06 UTC · model grok-4.3

classification 💻 cs.CR cs.MA

keywords: OS hardening · multi-agent systems · LLM remediation · security compliance · STIG automation · iterative feedback · vulnerability remediation

The pith

The SHIELDS multi-agent system remediates up to 73% of OS security scan findings by refining LLM-proposed fixes with feedback from execution and validation scans.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SHIELDS as a multi-agent LLM system that approaches OS hardening as an iterative process, proposing fixes and refining them based on execution feedback and validation scans instead of relying on static corrective actions. This is relevant because manual compliance with standards like DISA STIGs is tedious and expensive, and current automation tools are limited to pre-written scripts. Evaluations on multiple virtual machine setups with LLMs from 20B to 400B parameters show remediation success up to 73%. The results indicate that effective tool use and information gathering matter more than model parameter count for success in this task.

Core claim

SHIELDS uses large language models in a multi-agent setup to treat OS hardening as an iterative, feedback-driven process. Instead of fixed remediations, it continuously proposes fixes and refines them based on target system execution and validation scans. Across evaluations, it successfully remediates up to 73% of scan findings, with success depending less on model size than on effective tool use and information gathering.

What carries the argument

The iterative multi-agent remediation loop where LLMs propose, execute, and validate fixes using feedback from the target system and scans.
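
A minimal sketch of such a loop may help make the idea concrete. The abstract does not describe the agent interfaces, so everything named here is an assumption: `propose_fix`, `execute`, and `scan` are hypothetical callables standing in for the LLM agent, the target-system executor, and the validation scanner, `max_rounds` is an illustrative budget, and the finding ID and toy host are invented so the example runs without an LLM.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Attempt:
    """One round of the loop: the command tried and what came back."""
    command: str
    output: str
    passed: bool


@dataclass
class RemediationResult:
    finding_id: str
    remediated: bool
    attempts: List[Attempt] = field(default_factory=list)


def remediate(
    finding_id: str,
    propose_fix: Callable[[str, List[Attempt]], str],  # hypothetical LLM agent
    execute: Callable[[str], str],                     # hypothetical target executor
    scan: Callable[[str], bool],                       # hypothetical validation scan
    max_rounds: int = 3,                               # illustrative budget
) -> RemediationResult:
    """Propose, execute, validate; pass earlier attempts into the next proposal."""
    result = RemediationResult(finding_id=finding_id, remediated=False)
    for _ in range(max_rounds):
        command = propose_fix(finding_id, result.attempts)
        output = execute(command)
        passed = scan(finding_id)
        result.attempts.append(Attempt(command, output, passed))
        if passed:
            result.remediated = True
            break
    return result


# Toy stand-ins (all hypothetical) so the sketch runs on its own.
state = {"PermitRootLogin": "yes"}


def toy_propose(finding_id: str, history: List[Attempt]) -> str:
    # The first guess is wrong; the second reacts to the failed attempt.
    return "set PermitRootLogin no" if history else "set PermitRootLogin off"


def toy_execute(command: str) -> str:
    _, key, value = command.split()
    state[key] = value
    return f"{key}={value}"


def toy_scan(finding_id: str) -> bool:
    return state["PermitRootLogin"] == "no"


outcome = remediate("sshd_disable_root_login", toy_propose, toy_execute, toy_scan)
print(outcome.remediated, len(outcome.attempts))
```

Keeping the attempt history as an explicit argument to the proposer is one way to express "refine based on feedback"; the real system may structure this differently.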

If this is right

Automates compliance tasks that currently require manual effort or static tools.
Allows effective use of smaller LLMs in security-sensitive environments.
Supports local model deployment where privacy or compute limits apply.
Reduces burden of maintaining OS compliance with standards like STIGs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar iterative approaches could extend to other security domains beyond OS hardening.
If the feedback loop proves reliable, it might minimize the need for human oversight in remediation.
Testing on real-world production systems rather than VMs could reveal additional challenges.
The method might integrate with existing compliance tools to enhance their capabilities.

Load-bearing premise

That the iterative feedback from system execution and scans is sufficient for LLMs to produce correct fixes without introducing new vulnerabilities or needing human intervention.

What would settle it

An experiment showing that after SHIELDS remediation, a validation scan reports new or additional findings not present before, or that fixes cause system instability.
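
One way to check for new findings is to diff the scan results taken before and after remediation. The sketch below assumes each scan can be reduced to a mapping from rule ID to `"pass"` or `"fail"`; that representation, the function name `diff_scans`, and the rule IDs are illustrative assumptions, not the paper's format.

```python
from typing import Dict, Set


def diff_scans(before: Dict[str, str], after: Dict[str, str]) -> Dict[str, Set[str]]:
    """Compare two scans given as {rule_id: "pass" | "fail"} mappings."""
    failed_before = {rule for rule, status in before.items() if status == "fail"}
    failed_after = {rule for rule, status in after.items() if status == "fail"}
    return {
        "remediated": failed_before - failed_after,
        "unresolved": failed_before & failed_after,
        # Failing now but not before: the signal this test looks for.
        "regressions": failed_after - failed_before,
    }


# Placeholder scan data, not results from the paper.
before = {"rule_a": "fail", "rule_b": "fail", "rule_c": "pass"}
after = {"rule_a": "pass", "rule_b": "fail", "rule_c": "fail"}
report = diff_scans(before, after)
print(report)
```

A non-empty `regressions` set after a run would be direct evidence against the premise; system instability would need a separate health check that a compliance scan does not cover.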

Original abstract

Security misconfigurations remain a leading cause of OS-level compromise, and manually keeping systems compliant with standards like Defense Information Systems Agency (DISA) Security Technical Implementation Guides (STIGs) is a tedious and expensive process. Existing compliance automation tools can reduce some of this burden, but they depend on static, pre-written corrective actions. In this paper, we introduce SHIELDS, a multi-agent system that uses large language models (LLMs) to approach OS hardening as an iterative, feedback-driven process. Instead of applying fixed remediations, SHIELDS continuously proposes fixes and refines them based on feedback from target system execution and validation scans. We evaluate the system across multiple virtual machine configurations using six contemporary LLMs ranging from 20B to 400B parameters, and find that SHIELDS successfully remediates up to 73% of scan findings. Our results also suggest that success in this setting depends less on model size (parameter count) than on effective tool use and information gathering, paving a practical path toward reducing the burden of security compliance in environments where compute is limited or security and privacy needs drive local model use.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

SHIELDS shows an iterative LLM agent loop can hit 73% remediation on OS scans with tool use mattering more than size, but the evaluation details are too thin to judge the result.

SHIELDS treats OS hardening as a feedback loop where agents propose fixes, execute them on the target, and refine using scan results and system responses. It reports up to 73% success across six LLMs and multiple VM setups, and the data suggest tool use and information gathering drive outcomes more than raw parameter count.

The distinct element is the shift away from static corrective scripts toward repeated proposal and adjustment. That matches how compliance work actually happens when initial changes create new issues or fail validation.

The evaluation runs the system on models from 20B to 400B parameters and reports the remediation rate, which gives a basic sense of whether the approach scales across compute budgets.

The soft spot is the missing experimental controls. The abstract states the 73% figure and the tool-use observation but supplies no baselines, trial counts, variance measures, or description of how runs were selected or failures handled. Without those, the central numbers are hard to interpret or compare.

The work is empirical rather than theoretical, with no fitted parameters or circular definitions.

This paper is for researchers and engineers working on automated security configuration or LLM agents for sysadmin tasks. A reader who wants concrete examples of multi-agent remediation on real compliance standards would get practical ideas from the workflow, even if the numbers require more backing.

It deserves peer review. The iterative framing is testable and addresses a known operational cost, but any referee will need the full methods section before the 73% claim or the tool-use conclusion can be assessed.

Referee Report

2 major / 2 minor

Summary. The paper introduces SHIELDS, a multi-agent LLM system for OS hardening that treats compliance as an iterative, feedback-driven process: agents propose fixes, execute them on target VMs, and refine based on execution outcomes and validation scans against standards such as DISA STIGs. Across six LLMs (20B–400B parameters) and multiple VM configurations, the system is reported to remediate up to 73% of scan findings, with the central empirical claim that success depends more on effective tool use and information gathering than on model parameter count.

Significance. If the empirical results prove robust under controlled conditions, the work offers a practical route to reducing manual effort in security compliance, particularly for local or privacy-sensitive deployments where smaller models are preferred. The finding that tool-use effectiveness outweighs scale would be a useful contribution to the design of agentic security tools.

major comments (2)

[Evaluation / Results] The abstract and evaluation description report a 73% remediation rate but supply no information on the number of trials per configuration, statistical error bars or variance across runs, baseline comparisons against static remediation scripts or existing compliance tools, or the protocol for handling post-hoc selection of successful runs. These omissions are load-bearing for the central claim that SHIELDS achieves reliable remediation.
[Methodology / Evaluation] The evaluation protocol relies on the assumption that iterative feedback from target-system execution and validation scans is sufficient for LLMs to produce correct, non-regressive fixes without introducing new vulnerabilities. No description is given of safeguards, rollback mechanisms, or systematic failure-mode analysis that would substantiate this assumption.
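
The per-configuration statistics requested in the first comment could be reported along these lines. This is a sketch under stated assumptions: each trial is recorded as a hypothetical `(fixed, total)` pair, the helper names are invented, and the numbers are placeholders rather than results from the paper.

```python
from statistics import mean, stdev
from typing import List, Tuple


def remediation_rate(fixed: int, total: int) -> float:
    """Fraction of scan findings that pass after remediation."""
    if total <= 0:
        raise ValueError("total findings must be positive")
    return fixed / total


def summarize(trials: List[Tuple[int, int]]) -> Tuple[float, float]:
    """Return (mean rate, sample standard deviation) across repeated trials."""
    rates = [remediation_rate(fixed, total) for fixed, total in trials]
    spread = stdev(rates) if len(rates) > 1 else 0.0
    return mean(rates), spread


# Placeholder trials for one model/VM configuration: (fixed, total findings).
trials = [(6, 10), (7, 10), (8, 10)]
avg, spread = summarize(trials)
print(f"mean rate {avg:.2f}, std dev {spread:.2f} over {len(trials)} trials")
```

Reporting the mean and spread over all trials, rather than the best run, is what separates an "up to 73%" headline from a reliability claim.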

minor comments (2)

[Abstract] The abstract states that success 'depends less on model size than on effective tool use' but does not quantify this comparison (e.g., via ablation on tool availability or information-gathering steps).
[System Design] Notation for agent roles, tool interfaces, and scan-result representations is introduced without a consolidated table or diagram, making it harder to follow the multi-agent workflow.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the evaluation and methodology. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of results and safeguards.

Referee: [Evaluation / Results] The abstract and evaluation description report a 73% remediation rate but supply no information on the number of trials per configuration, statistical error bars or variance across runs, baseline comparisons against static remediation scripts or existing compliance tools, or the protocol for handling post-hoc selection of successful runs. These omissions are load-bearing for the central claim that SHIELDS achieves reliable remediation.

Authors: We agree that these details are essential to support the central claims. In the revised manuscript we will expand the evaluation section to report the number of trials per configuration, include statistical measures such as variance and error bars across runs, add baseline comparisons to static remediation scripts and existing compliance tools, and explicitly describe the result-reporting protocol (with no post-hoc selection of runs).

Revision: yes

Referee: [Methodology / Evaluation] The evaluation protocol relies on the assumption that iterative feedback from target-system execution and validation scans is sufficient for LLMs to produce correct, non-regressive fixes without introducing new vulnerabilities. No description is given of safeguards, rollback mechanisms, or systematic failure-mode analysis that would substantiate this assumption.

Authors: We acknowledge the need to substantiate this assumption. The revised manuscript will add a subsection detailing the safeguards employed (including rollback via VM snapshots), post-fix validation scans to detect regressions, and a systematic failure-mode analysis of observed errors and non-regressive outcomes from the experiments.

Revision: yes
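
The snapshot-and-rollback safeguard named in this response can be illustrated with a small guard around each fix. This is only a sketch: an in-memory dictionary stands in for the VM, `copy.deepcopy` stands in for a hypervisor snapshot, and `apply_with_rollback`, the toy rules, and the toy fixes are hypothetical names rather than anything taken from the paper.

```python
import copy
from typing import Callable, Dict, Set

Config = Dict[str, str]


def apply_with_rollback(
    config: Config,
    fix: Callable[[Config], None],
    failing_rules: Callable[[Config], Set[str]],
) -> bool:
    """Apply a fix in place; restore the snapshot if it raises or adds failures."""
    snapshot = copy.deepcopy(config)   # stand-in for a VM snapshot
    baseline = failing_rules(config)
    try:
        fix(config)
        regressed = failing_rules(config) - baseline
    except Exception:
        regressed = {"<fix raised>"}
    if regressed:
        config.clear()
        config.update(snapshot)        # stand-in for reverting the snapshot
        return False
    return True


def toy_failing(config: Config) -> Set[str]:
    """Toy validation scan over the in-memory host."""
    failing = set()
    if config.get("PermitRootLogin") != "no":
        failing.add("root_login")
    if config.get("sshd") != "running":
        failing.add("sshd_down")
    return failing


def bad_fix(config: Config) -> None:
    config["PermitRootLogin"] = "no"
    config["sshd"] = "stopped"         # fixes one rule, breaks another


def good_fix(config: Config) -> None:
    config["PermitRootLogin"] = "no"


host = {"PermitRootLogin": "yes", "sshd": "running"}
kept_bad = apply_with_rollback(host, bad_fix, toy_failing)
after_bad = dict(host)
kept_good = apply_with_rollback(host, good_fix, toy_failing)
print(kept_bad, after_bad, kept_good, host)
```

The guard rejects a fix that resolves its target rule while breaking another, which is the non-regression property the referee asks the authors to substantiate.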

Circularity Check

0 steps flagged

No significant circularity; empirical measurement only

The paper reports an empirical evaluation of the SHIELDS multi-agent system on virtual machines across six LLMs. The central result (up to 73% remediation of scan findings) is a direct experimental measurement, not a quantity derived from equations, fitted parameters, or self-referential definitions. No load-bearing steps reduce by construction to the paper's own inputs. The work is self-contained against external benchmarks (DISA STIG scans on VMs) with no self-citation chains or ansatzes invoked for the outcome.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central performance claim rests on the domain assumption that LLMs can reliably interpret scanner output and generate safe, effective configuration changes through iteration. No free parameters or invented entities are described in the abstract.

axioms (1)

Domain assumption: LLMs supplied with execution feedback and scanner results can iteratively produce correct OS configuration fixes.
This assumption underpins both the 73% remediation figure and the claim that tool use matters more than model size.
