Recognition: 2 theorem links · Lean Theorem
Position: AI Security Policy Should Target Systems, Not Models
Pith reviewed 2026-05-12 04:14 UTC · model grok-4.3
The pith
Swarm systems of small open models can replicate restricted AI attack capabilities at effectively zero cost.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Both safety bypass of frontier models and software vulnerability discovery, the capability class that motivated the restricted release of Anthropic's Mythos Preview, are achievable at effectively zero cost using commodity hardware and openly available models. The swarm-attack framework, in which multiple lightweight LLM agents coordinate through shared memory, parallel exploration, and evolutionary optimization, compensates for the limited reasoning capacity of small individual models, as evidenced by the sharp performance drop when the scaffolds are disabled.
What carries the argument
The swarm-attack framework, where multiple lightweight LLM agents coordinate through shared memory, parallel exploration, and evolutionary optimization to perform adversarial testing.
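As a rough illustration (not the authors' implementation), that coordination pattern could be realized along the following lines; the model call, target call, harm scorer, and strategy pool below are placeholder assumptions for exposition, not the paper's code.

```python
import random
from concurrent.futures import ThreadPoolExecutor

# Placeholder assumptions: agent_generate stands in for a 1.2B local model,
# target_respond for a frontier-model API, score for a severity classifier.
def agent_generate(parent: str) -> str:
    return parent + " [mutated variant]"

def target_respond(attack: str) -> str:
    return "refused"

def score(response: str) -> float:
    return 0.0 if "refused" in response else 1.0

# Shared memory: a strategy pool every agent reads from and writes to.
shared_memory = ["seed strategy A", "seed strategy B"]

def run_agent(_: int) -> tuple:
    parent = random.choice(shared_memory)        # select a known strategy
    attack = agent_generate(parent)              # mutate it
    return attack, score(target_respond(attack))

for generation in range(10):                     # evolutionary optimization loop
    with ThreadPoolExecutor(max_workers=5) as pool:  # five agents explore in parallel
        results = list(pool.map(run_agent, range(5)))
    shared_memory.extend(a for a, f in results if f > 0.5)  # successes persist
```

On this sketch, no individual agent needs deep reasoning: the pool accumulates whatever works, which is the mechanism the ablation results point to.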
Load-bearing premise
The reported Effective Harm Rate and 100% recall in the planted-CWE setup accurately reflect real-world adversarial capabilities rather than depending on the specific experimental design and hand-crafted seeds.
What would settle it
An independent run of the same swarm that yields a near-zero Effective Harm Rate against GPT-4o or misses most vulnerabilities in the C application would falsify the zero-cost reproducibility claim, just as the base models already do when the coordination components are removed.
Original abstract
We present swarm-attack, an open-source adversarial testing framework in which multiple lightweight LLM agents coordinate through shared memory, parallel exploration, and evolutionary optimization. Together, our results demonstrate that both safety bypass of frontier models and software vulnerability discovery, i.e., the capability class that motivated restricted release of Anthropic's Mythos Preview, are achievable at effectively zero cost using commodity hardware and openly available models. We report two experiments. In the first, five instances of a 1.2 billion parameter model conducted 225 jailbreak attacks each against GPT-4o and Claude Sonnet 4. Against GPT-4o, the swarm achieved an Effective Harm Rate of 45.8%, producing 49 critical-severity breaches; against Claude Sonnet 4, the Effective Harm Rate was 0% despite a 40% technical success rate. In the second experiment, the same models performed combined source code analysis and binary fuzzing against a vulnerable C application with 9 planted CWEs. With a hand-crafted exploit seed corpus, regex pattern detection, and AddressSanitizer-based crash classification, the pipeline recovers 9 of 9 vulnerabilities (100% recall) in approximately four minutes on a consumer MacBook. With those scaffold components disabled, the same model recovers 0 of 9 by crash verification and 2 of 9 by citation. The capability class that motivated restricted release of Anthropic's Mythos Preview is therefore reproducible at effectively zero cost; the important enabler is the system scaffold itself, which compensates for the limited reasoning capacity of small individual models.
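For orientation, the headline numbers above are ratios over labeled attack outcomes. A minimal sketch of how they might be computed, assuming each run is logged with a technical-success flag and a severity label (the field names and the reading of Effective Harm Rate are illustrative assumptions, not the paper's schema):

```python
# Field names are illustrative assumptions, not the paper's schema.
runs = [
    {"technical_success": True,  "severity": "critical"},
    {"technical_success": True,  "severity": "none"},      # bypass without real harm
    {"technical_success": False, "severity": "none"},
]

total = len(runs)
technical_success_rate = sum(r["technical_success"] for r in runs) / total

# One plausible reading of Effective Harm Rate: the attack both bypassed the
# target and yielded output judged genuinely harmful. This reading would also
# explain the Claude Sonnet 4 result (40% technical success, 0% EHR).
effective_harm_rate = sum(
    r["technical_success"] and r["severity"] != "none" for r in runs
) / total

# Recall in the planted-CWE experiment: recovered over planted vulnerabilities.
recall = 9 / 9   # scaffolded run; 0/9 with scaffolds disabled
```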
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a position that AI security policy should target systems rather than models. It supports this with an open-source swarm-attack framework using multiple lightweight LLM agents coordinating via shared memory, parallel exploration, and evolutionary optimization. Two experiments are reported: (1) five 1.2B-parameter model instances conducting 225 jailbreak attacks each, achieving 45.8% Effective Harm Rate (49 critical-severity breaches) against GPT-4o but 0% EHR against Claude Sonnet-4; (2) the same models performing source code analysis and binary fuzzing on a vulnerable C application with 9 planted CWEs, recovering all 9 (100% recall) in ~4 minutes on a MacBook when using a hand-crafted exploit seed corpus, regex pattern detection, and AddressSanitizer classification, but 0/9 by crash verification when those scaffold components are disabled. The authors conclude that the capability class motivating restricted releases like Anthropic's Mythos Preview is reproducible at effectively zero cost, with the system scaffold as the key enabler.
Significance. If the results hold, this work provides concrete quantitative evidence that small open models with system-level orchestration can replicate notable adversarial capabilities previously associated with frontier models, supporting a shift in policy focus toward deployment systems and interaction scaffolds rather than model-weight restrictions. The open-source swarm-attack framework is a clear strength, offering a reproducible tool for adversarial testing in AI security. This could inform ongoing debates on model release policies by highlighting the role of composable systems. The significance is reduced, however, by limitations in experimental generalization.
Major comments (2)
- Vulnerability discovery experiment: The 100% recall is achieved only with a hand-crafted exploit seed corpus, regex pattern detection, and AddressSanitizer-based crash classification; disabling these yields 0/9 by crash verification (and 2/9 by citation). This setup supplies the vulnerability patterns in advance, converting the task into scaffold-orchestrated search rather than open-ended autonomous discovery of unknown vulnerabilities (a caricature of these scaffold components appears after this list). This is load-bearing for the central 'both' claim and the policy conclusion that the capability class motivating Mythos Preview is reproducible at zero cost with open models.
- Abstract and jailbreak experiment: The reported Effective Harm Rate of 45.8% (producing 49 critical-severity breaches) against GPT-4o lacks full method details, statistical analysis, or raw data. This makes it difficult to evaluate the swarm results for post-hoc choices, measurement validity of the harm rate, or robustness of the 225-attack design per model instance.
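To make the first objection concrete, the scaffold components in question can be caricatured as follows; the seeds, patterns, and triage logic are illustrative assumptions, not the paper's code.

```python
import re
import subprocess

# Hand-crafted exploit seed corpus: inputs already shaped toward the planted bugs.
SEED_CORPUS = [b"A" * 256, b"%n%n%n%n", b"../../../etc/passwd"]

# Regex pattern detection: source-level signatures keyed to specific CWE families.
CWE_PATTERNS = {
    "CWE-120": re.compile(r"\b(strcpy|gets|sprintf)\s*\("),  # classic overflow sinks
    "CWE-134": re.compile(r"printf\s*\(\s*[A-Za-z_]"),       # format-string misuse
}

def scan_source(source: str) -> set:
    """Flag CWE families whose signature appears in the source text."""
    return {cwe for cwe, pattern in CWE_PATTERNS.items() if pattern.search(source)}

def asan_verified_crash(binary_path: str, testcase: bytes) -> bool:
    """Feed a seed to an ASan-instrumented binary; an ASan report on stderr
    counts as a verified crash."""
    proc = subprocess.run([binary_path], input=testcase, capture_output=True)
    return b"AddressSanitizer" in proc.stderr
```

On this reading, the seeds and patterns, not the 1.2B models, encode where the bugs are; the swarm orchestrates search over answers the scaffold already contains, which is exactly the gap the 9/9 versus 0/9 ablation exposes.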
Minor comments (3)
- Provide a precise definition of Effective Harm Rate and the criteria for classifying 'critical-severity breaches' to improve clarity and reproducibility.
- The manuscript could expand on how the evolutionary optimization and shared memory in the swarm-attack framework operate, including pseudocode or parameter settings; a hedged guess at such pseudocode follows this list.
- Consider adding a limitations section explicitly addressing the planted-CWE setup and the extent to which results generalize to unplanted, real-world codebases.
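On minor comment 2, the requested pseudocode would presumably center on the variation operators over the shared strategy pool. A hedged guess at what prompt-level mutation and crossover could look like (purely illustrative; the paper's actual operators are not specified here):

```python
import random

# Illustrative strategy transformations; the paper's operator set is unknown.
MUTATIONS = [
    lambda p: p + " Answer as a character in a thriller novel.",  # persona reframing
    lambda p: "Summarize first, then elaborate in detail: " + p,  # staged indirection
    lambda p: p.replace("explain", "explain step by step"),       # specificity push
]

def mutate(prompt: str) -> str:
    """Apply one randomly chosen transformation to a parent strategy."""
    return random.choice(MUTATIONS)(prompt)

def crossover(a: str, b: str) -> str:
    """Splice the opening of one strategy onto the close of another."""
    return a[: len(a) // 2] + b[len(b) // 2 :]
```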
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive review. We address each major comment below with clarifications and note the revisions we will make to strengthen the manuscript.
Point-by-point responses
- Referee: Vulnerability discovery experiment: The 100% recall is achieved only with a hand-crafted exploit seed corpus, regex pattern detection, and AddressSanitizer-based crash classification; disabling these yields 0/9 by crash verification (and 2/9 by citation). This setup supplies the vulnerability patterns in advance, converting the task into scaffold-orchestrated search rather than open-ended autonomous discovery of unknown vulnerabilities. This is load-bearing for the central 'both' claim and the policy conclusion that the capability class motivating Mythos Preview is reproducible at zero cost with open models.
Authors: We agree that the experiment uses a hand-crafted exploit seed corpus along with regex pattern detection and AddressSanitizer classification, and that the ablation (0/9 by crash verification without these) is central to isolating the scaffold's contribution. The manuscript already reports this ablation explicitly. The swarm still performs code analysis, mutation via evolutionary optimization, and result classification; the seeds provide starting points rather than complete solutions. We acknowledge the referee's point that this constitutes guided recovery of planted CWEs rather than fully open-ended discovery of unknown vulnerabilities in arbitrary code. This does not invalidate the core claim that the capability class is reproducible at zero cost with open models plus scaffold, but we will revise the discussion to explicitly distinguish scaffold-assisted recovery from zero-knowledge autonomous discovery and temper related policy language accordingly.
Revision: partial
- Referee: Abstract and jailbreak experiment: The reported Effective Harm Rate of 45.8% (producing 49 critical-severity breaches) against GPT-4o lacks full method details, statistical analysis, or raw data. This makes it difficult to evaluate the swarm results for post-hoc choices, measurement validity of the harm rate, or robustness of the 225-attack design per model instance.
Authors: The full manuscript includes a methods section describing the swarm-attack framework, agent coordination through shared memory, parallel exploration, and evolutionary optimization, as well as the definition of Effective Harm Rate via independent severity classification. We accept that the abstract is a high-level summary and that additional statistical details (e.g., confidence intervals, variance across the five model instances) and transparency on the 225-attack design would aid evaluation. We will expand the methods and results sections with these elements and add a note on data availability (aggregated results or repository link) to allow assessment of robustness and measurement validity.
Revision: yes
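If the promised confidence intervals follow standard practice, a Wilson score interval on the per-attack proportion is one natural choice. A sketch under the strong assumption that the 225 attacks against a target are independent trials, an assumption the swarm's shared memory arguably violates:

```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# 45.8% EHR over 225 attacks corresponds to roughly 103 effective harms.
low, high = wilson_interval(103, 225)
print(f"95% CI for EHR: [{low:.3f}, {high:.3f}]")   # about [0.394, 0.523]
```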
Circularity Check
No circularity: empirical results stand independently
Full rationale
The paper presents two direct experimental measurements using the described swarm-attack framework: jailbreak success rates (45.8% EHR on GPT-4o) and vulnerability recovery rates (9/9 with scaffolds, 0/9 without) on a test application containing planted CWEs. These are reported outcomes of concrete runs on commodity hardware, not mathematical derivations, parameter fits relabeled as predictions, or self-referential definitions. No equations appear, no uniqueness theorems are invoked, and no load-bearing self-citations reduce the central claim to prior author work. The policy position follows from the observed gap between scaffold-enabled and scaffold-disabled performance, which is a substantive empirical distinction rather than a tautology.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Linked passage: "swarm of 1.2B parameter models... coordinated attack swarm... evolutionary optimization... assisted pipeline with hand-crafted seed corpus, regex pattern detection, and AddressSanitizer"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Linked passage: "the important enabler is the system scaffold itself"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Liquid AI. LFM2 technical report. arXiv preprint arXiv:2511.23404, 2025.
- [2] Manish Bhatt, Sahana Chennabasappa, Yue Li, Cyrus Nikolaidis, Daniel Song, Shengye Wan, Faizan Ahmad, Cornelius Aschermann, Yaohui Chen, Dhaval Kapil, David Molnar, Spencer Whitman, and Joshua Saxe. CyberSecEval 2: A wide-ranging cybersecurity evaluation suite for large language models. arXiv preprint arXiv:2404.13161, 2024.
- [3] Big Sleep Team. From Naptime to Big Sleep: Using large language models to catch vulnerabilities in real-world code. Google Project Zero blog, November 2024. https://projectzero.google/2024/10/from-naptime-to-big-sleep.html
- [4] Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. Jailbreaking black box large language models in twenty queries. arXiv preprint arXiv:2310.08419, 2023.
- [5] Yangruibo Ding, Yanjun Fu, Omniyyah Ibrahim, Chawin Sitawarin, Xinyun Chen, Basel Alomair, David Wagner, Baishakhi Ray, and Yizheng Chen. Vulnerability detection with code language models: How far are we? In Proceedings of the IEEE/ACM 47th International Conference on Software Engineering (ICSE), 2025.
- [6] Yue Li, Xiao Li, Hao Wu, Minghui Xu, Yue Zhang, Xiuzhen Cheng, Fengyuan Xu, and Sheng Zhong. Everything you wanted to know about LLM-based vulnerability detection but were afraid to ask, 2025.
- [7] Hanzhi Liu, Chaofan Shou, Xiaonan Liu, Hongbo Wen, Yanju Chen, Ryan Jingyang Fang, and Yu Feng. Synthesizing multi-agent harnesses for vulnerability discovery, 2026.
- [8] Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. In Proceedings of the 41st International Conference on Machine Learning (ICML), 2024.
- [9] Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi. Tree of attacks: Jailbreaking black-box LLMs automatically. In Advances in Neural Information Processing Systems (NeurIPS), 2024.
- [10] Leonardo Piano, Claudia Battistin, Jeriek Van den Abeele, and Livio Pompianu. Small but dangerous: Evaluating and mitigating jailbreak vulnerabilities in small language models. In Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2025), volume 2840 of Communications in Computer and Information Science, pages 500–51...
- [11] UK AI Security Institute. Our evaluation of Claude Mythos Preview's cyber capabilities. https://www.aisi.gov.uk/blog/our-evaluation-of-claude-mythos-previews-cyber-capabilities, April 2026.
- [12] UK AI Security Institute. Our evaluation of OpenAI's GPT-5.5 cyber capabilities. https://www.aisi.gov.uk/blog/our-evaluation-of-openais-gpt-5-5-cyber-capabilities, May 2026.
- [13] Chunqiu Steven Xia, Matteo Paltenghi, Jia Le Tian, Michael Pradel, and Lingming Zhang. Fuzz4All: Universal fuzzing with large language models. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering (ICSE), 2024.
- [14] Xin Zhou, Martin Weyssow, Ratnadira Widyasari, Ting Zhang, Junda He, Yunbo Lyu, Jianming Chang, Beiqi Zhang, Dan Huang, and David Lo. LessLeak-Bench: A first investigation of data leakage in LLMs across 83 software engineering benchmarks, 2025.
- [15] Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023.