Recognition: 2 theorem links · Lean Theorem
Position: AI Security Policy Should Target Systems, Not Models
Pith reviewed 2026-05-12 04:14 UTC · model grok-4.3
The pith
Swarm systems of small open models can replicate restricted AI attack capabilities at effectively zero cost.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Both safety bypass of frontier models and software vulnerability discovery, the capability class that motivated the restricted release of Anthropic's Mythos Preview, are achievable at effectively zero cost using commodity hardware and openly available models. The swarm-attack framework, in which multiple lightweight LLM agents coordinate through shared memory, parallel exploration, and evolutionary optimization, compensates for the limited reasoning capacity of small individual models, as evidenced by the sharp performance drop when the scaffolds are disabled.
What carries the argument
The swarm-attack framework, where multiple lightweight LLM agents coordinate through shared memory, parallel exploration, and evolutionary optimization to perform adversarial testing.
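As a rough illustration (not the authors' implementation), that coordination pattern could be realized along the following lines; the model call, target call, harm scorer, and strategy pool below are placeholder assumptions for exposition, not the paper's code.

```python
import random
from concurrent.futures import ThreadPoolExecutor

# Placeholder assumptions: agent_generate stands in for a 1.2B local model,
# target_respond for a frontier-model API, score for a severity classifier.
def agent_generate(parent: str) -> str:
    return parent + " [mutated variant]"

def target_respond(attack: str) -> str:
    return "refused"

def score(response: str) -> float:
    return 0.0 if "refused" in response else 1.0

# Shared memory: a strategy pool every agent reads from and writes to.
shared_memory = ["seed strategy A", "seed strategy B"]

def run_agent(_: int) -> tuple:
    parent = random.choice(shared_memory)        # select a known strategy
    attack = agent_generate(parent)              # mutate it
    return attack, score(target_respond(attack))

for generation in range(10):                     # evolutionary optimization loop
    with ThreadPoolExecutor(max_workers=5) as pool:  # five agents explore in parallel
        results = list(pool.map(run_agent, range(5)))
    shared_memory.extend(a for a, f in results if f > 0.5)  # successes persist
```

On this sketch, no individual agent needs deep reasoning: the pool accumulates whatever works, which is the mechanism the ablation results point to.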
Load-bearing premise
The reported Effective Harm Rate and 100% recall in the planted-CWE setup accurately reflect real-world adversarial capabilities rather than depending on the specific experimental design and hand-crafted seeds.
What would settle it
An independent run of the same swarm that yields a near-zero Effective Harm Rate against GPT-4o or misses most vulnerabilities in the C application would falsify the zero-cost reproducibility claim, just as the base models already do when the coordination components are removed.
Original abstract
We present swarm-attack, an open-source adversarial testing framework in which multiple lightweight LLM agents coordinate through shared memory, parallel exploration, and evolutionary optimization. Together, our results demonstrate that both safety bypass of frontier models and software vulnerability discovery, i.e., the capability class that motivated restricted release of Anthropic's Mythos Preview, are achievable at effectively zero cost using commodity hardware and openly available models. We report two experiments. In the first, five instances of a 1.2 billion parameter model conducted 225 jailbreak attacks each against GPT-4o and Claude Sonnet 4. Against GPT-4o, the swarm achieved an Effective Harm Rate of 45.8%, producing 49 critical-severity breaches; against Claude Sonnet 4, the Effective Harm Rate was 0% despite a 40% technical success rate. In the second experiment, the same models performed combined source code analysis and binary fuzzing against a vulnerable C application with 9 planted CWEs. With a hand-crafted exploit seed corpus, regex pattern detection, and AddressSanitizer-based crash classification, the pipeline recovers 9 of 9 vulnerabilities (100% recall) in approximately four minutes on a consumer MacBook. With those scaffold components disabled, the same model recovers 0 of 9 by crash verification and 2 of 9 by citation. The capability class that motivated restricted release of Anthropic's Mythos Preview is therefore reproducible at effectively zero cost; the important enabler is the system scaffold itself, which compensates for the limited reasoning capacity of small individual models.
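For orientation, the headline numbers above are ratios over labeled attack outcomes. A minimal sketch of how they might be computed, assuming each run is logged with a technical-success flag and a severity label (the field names and the reading of Effective Harm Rate are illustrative assumptions, not the paper's schema):

```python
# Field names are illustrative assumptions, not the paper's schema.
runs = [
    {"technical_success": True,  "severity": "critical"},
    {"technical_success": True,  "severity": "none"},      # bypass without real harm
    {"technical_success": False, "severity": "none"},
]

total = len(runs)
technical_success_rate = sum(r["technical_success"] for r in runs) / total

# One plausible reading of Effective Harm Rate: the attack both bypassed the
# target and yielded output judged genuinely harmful. This reading would also
# explain the Claude Sonnet 4 result (40% technical success, 0% EHR).
effective_harm_rate = sum(
    r["technical_success"] and r["severity"] != "none" for r in runs
) / total

# Recall in the planted-CWE experiment: recovered over planted vulnerabilities.
recall = 9 / 9   # scaffolded run; 0/9 with scaffolds disabled
```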
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a position that AI security policy should target systems rather than models. It supports this with an open-source swarm-attack framework using multiple lightweight LLM agents coordinating via shared memory, parallel exploration, and evolutionary optimization. Two experiments are reported: (1) five 1.2B-parameter model instances conducting 225 jailbreak attacks each, achieving 45.8% Effective Harm Rate (49 critical-severity breaches) against GPT-4o but 0% EHR against Claude Sonnet-4; (2) the same models performing source code analysis and binary fuzzing on a vulnerable C application with 9 planted CWEs, recovering all 9 (100% recall) in ~4 minutes on a MacBook when using a hand-crafted exploit seed corpus, regex pattern detection, and AddressSanitizer classification, but 0/9 by crash verification when those scaffold components are disabled. The authors conclude that the capability class motivating restricted releases like Anthropic's Mythos Preview is reproducible at effectively zero cost, with the system scaffold as the key enabler.
Significance. If the results hold, this work provides concrete quantitative evidence that small open models with system-level orchestration can replicate notable adversarial capabilities previously associated with frontier models, supporting a shift in policy focus toward deployment systems and interaction scaffolds rather than model-weight restrictions. The open-source swarm-attack framework is a clear strength, offering a reproducible tool for adversarial testing in AI security. This could inform ongoing debates on model release policies by highlighting the role of composable systems. The significance is reduced, however, by limitations in experimental generalization.
Major comments (2)
- Vulnerability discovery experiment: The 100% recall is achieved only with a hand-crafted exploit seed corpus, regex pattern detection, and AddressSanitizer-based crash classification; disabling these yields 0/9 by crash verification (and 2/9 by citation). This setup supplies the vulnerability patterns in advance, converting the task into scaffold-orchestrated search rather than open-ended autonomous discovery of unknown vulnerabilities (a caricature of these scaffold components appears after this list). This is load-bearing for the central 'both' claim and the policy conclusion that the capability class motivating Mythos Preview is reproducible at zero cost with open models.
- Abstract and jailbreak experiment: The reported Effective Harm Rate of 45.8% (producing 49 critical-severity breaches) against GPT-4o lacks full method details, statistical analysis, or raw data. This makes it difficult to evaluate the swarm results for post-hoc choices, measurement validity of the harm rate, or robustness of the 225-attack design per model instance.
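To make the first objection concrete, the scaffold components in question can be caricatured as follows; the seeds, patterns, and triage logic are illustrative assumptions, not the paper's code.

```python
import re
import subprocess

# Hand-crafted exploit seed corpus: inputs already shaped toward the planted bugs.
SEED_CORPUS = [b"A" * 256, b"%n%n%n%n", b"../../../etc/passwd"]

# Regex pattern detection: source-level signatures keyed to specific CWE families.
CWE_PATTERNS = {
    "CWE-120": re.compile(r"\b(strcpy|gets|sprintf)\s*\("),  # classic overflow sinks
    "CWE-134": re.compile(r"printf\s*\(\s*[A-Za-z_]"),       # format-string misuse
}

def scan_source(source: str) -> set:
    """Flag CWE families whose signature appears in the source text."""
    return {cwe for cwe, pattern in CWE_PATTERNS.items() if pattern.search(source)}

def asan_verified_crash(binary_path: str, testcase: bytes) -> bool:
    """Feed a seed to an ASan-instrumented binary; an ASan report on stderr
    counts as a verified crash."""
    proc = subprocess.run([binary_path], input=testcase, capture_output=True)
    return b"AddressSanitizer" in proc.stderr
```

On this reading, the seeds and patterns, not the 1.2B models, encode where the bugs are; the swarm orchestrates search over answers the scaffold already contains, which is exactly the gap the 9/9 versus 0/9 ablation exposes.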
Minor comments (3)
- Provide a precise definition of Effective Harm Rate and the criteria for classifying 'critical-severity breaches' to improve clarity and reproducibility.
- The manuscript could expand on how the evolutionary optimization and shared memory in the swarm-attack framework operate, including pseudocode or parameter settings; a hedged guess at such pseudocode follows this list.
- Consider adding a limitations section explicitly addressing the planted-CWE setup and the extent to which results generalize to unplanted, real-world codebases.
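On minor comment 2, the requested pseudocode would presumably center on the variation operators over the shared strategy pool. A hedged guess at what prompt-level mutation and crossover could look like (purely illustrative; the paper's actual operators are not specified here):

```python
import random

# Illustrative strategy transformations; the paper's operator set is unknown.
MUTATIONS = [
    lambda p: p + " Answer as a character in a thriller novel.",  # persona reframing
    lambda p: "Summarize first, then elaborate in detail: " + p,  # staged indirection
    lambda p: p.replace("explain", "explain step by step"),       # specificity push
]

def mutate(prompt: str) -> str:
    """Apply one randomly chosen transformation to a parent strategy."""
    return random.choice(MUTATIONS)(prompt)

def crossover(a: str, b: str) -> str:
    """Splice the opening of one strategy onto the close of another."""
    return a[: len(a) // 2] + b[len(b) // 2 :]
```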
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive review. We address each major comment below with clarifications and note the revisions we will make to strengthen the manuscript.
Point-by-point responses
- Referee: Vulnerability discovery experiment: The 100% recall is achieved only with a hand-crafted exploit seed corpus, regex pattern detection, and AddressSanitizer-based crash classification; disabling these yields 0/9 by crash verification (and 2/9 by citation). This setup supplies the vulnerability patterns in advance, converting the task into scaffold-orchestrated search rather than open-ended autonomous discovery of unknown vulnerabilities. This is load-bearing for the central 'both' claim and the policy conclusion that the capability class motivating Mythos Preview is reproducible at zero cost with open models.
Authors: We agree that the experiment uses a hand-crafted exploit seed corpus along with regex pattern detection and AddressSanitizer classification, and that the ablation (0/9 by crash verification without these) is central to isolating the scaffold's contribution. The manuscript already reports this ablation explicitly. The swarm still performs code analysis, mutation via evolutionary optimization, and result classification; the seeds provide starting points rather than complete solutions. We acknowledge the referee's point that this constitutes guided recovery of planted CWEs rather than fully open-ended discovery of unknown vulnerabilities in arbitrary code. This does not invalidate the core claim that the capability class is reproducible at zero cost with open models plus scaffold, but we will revise the discussion to explicitly distinguish scaffold-assisted recovery from zero-knowledge autonomous discovery and temper related policy language accordingly.
Revision: partial
- Referee: Abstract and jailbreak experiment: The reported Effective Harm Rate of 45.8% (producing 49 critical-severity breaches) against GPT-4o lacks full method details, statistical analysis, or raw data. This makes it difficult to evaluate the swarm results for post-hoc choices, measurement validity of the harm rate, or robustness of the 225-attack design per model instance.
Authors: The full manuscript includes a methods section describing the swarm-attack framework, agent coordination through shared memory, parallel exploration, and evolutionary optimization, as well as the definition of Effective Harm Rate via independent severity classification. We accept that the abstract is a high-level summary and that additional statistical details (e.g., confidence intervals, variance across the five model instances) and transparency on the 225-attack design would aid evaluation. We will expand the methods and results sections with these elements and add a note on data availability (aggregated results or repository link) to allow assessment of robustness and measurement validity.
Revision: yes
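If the promised confidence intervals follow standard practice, a Wilson score interval on the per-attack proportion is one natural choice. A sketch under the strong assumption that the 225 attacks against a target are independent trials, an assumption the swarm's shared memory arguably violates:

```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# 45.8% EHR over 225 attacks corresponds to roughly 103 effective harms.
low, high = wilson_interval(103, 225)
print(f"95% CI for EHR: [{low:.3f}, {high:.3f}]")   # about [0.394, 0.523]
```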
Circularity Check
No circularity: empirical results stand independently
Full rationale
The paper presents two direct experimental measurements using the described swarm-attack framework: jailbreak success rates (45.8% EHR on GPT-4o) and vulnerability recovery rates (9/9 with scaffolds, 0/9 without) on a test application containing planted CWEs. These are reported outcomes of concrete runs on commodity hardware, not mathematical derivations, parameter fits relabeled as predictions, or self-referential definitions. No equations appear, no uniqueness theorems are invoked, and no load-bearing self-citations reduce the central claim to prior author work. The policy position follows from the observed gap between scaffold-enabled and scaffold-disabled performance, which is a substantive empirical distinction rather than a tautology.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Linked passage: "swarm of 1.2B parameter models... coordinated attack swarm... evolutionary optimization... assisted pipeline with hand-crafted seed corpus, regex pattern detection, and AddressSanitizer"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Linked passage: "the important enabler is the system scaffold itself"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Liquid AI. LFM2 technical report. arXiv preprint arXiv:2511.23404, 2025.
- [2] Manish Bhatt, Sahana Chennabasappa, Yue Li, Cyrus Nikolaidis, Daniel Song, Shengye Wan, Faizan Ahmad, Cornelius Aschermann, Yaohui Chen, Dhaval Kapil, David Molnar, Spencer Whitman, and Joshua Saxe. CyberSecEval 2: A wide-ranging cybersecurity evaluation suite for large language models. arXiv preprint arXiv:2404.13161, 2024.
- [3] Big Sleep Team. From Naptime to Big Sleep: Using large language models to catch vulnerabilities in real-world code. Google Project Zero blog, November 2024. https://projectzero.google/2024/10/from-naptime-to-big-sleep.html
- [4] Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. Jailbreaking black box large language models in twenty queries. arXiv preprint arXiv:2310.08419, 2023.
- [5] Yangruibo Ding, Yanjun Fu, Omniyyah Ibrahim, Chawin Sitawarin, Xinyun Chen, Basel Alomair, David Wagner, Baishakhi Ray, and Yizheng Chen. Vulnerability detection with code language models: How far are we? In Proceedings of the IEEE/ACM 47th International Conference on Software Engineering (ICSE), 2025.
- [6] Yue Li, Xiao Li, Hao Wu, Minghui Xu, Yue Zhang, Xiuzhen Cheng, Fengyuan Xu, and Sheng Zhong. Everything you wanted to know about LLM-based vulnerability detection but were afraid to ask, 2025.
- [7] Hanzhi Liu, Chaofan Shou, Xiaonan Liu, Hongbo Wen, Yanju Chen, Ryan Jingyang Fang, and Yu Feng. Synthesizing multi-agent harnesses for vulnerability discovery, 2026.
- [8] Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. In Proceedings of the 41st International Conference on Machine Learning (ICML), 2024.
- [9] Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi. Tree of attacks: Jailbreaking black-box LLMs automatically. In Advances in Neural Information Processing Systems (NeurIPS), 2024.
- [10] Leonardo Piano, Claudia Battistin, Jeriek Van den Abeele, and Livio Pompianu. Small but dangerous: Evaluating and mitigating jailbreak vulnerabilities in small language models. In Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2025), volume 2840 of Communications in Computer and Information Science, pages 500–51...
- [11] UK AI Security Institute. Our evaluation of Claude Mythos Preview's cyber capabilities. https://www.aisi.gov.uk/blog/our-evaluation-of-claude-mythos-previews-cyber-capabilities, April 2026.
- [12] UK AI Security Institute. Our evaluation of OpenAI's GPT-5.5 cyber capabilities. https://www.aisi.gov.uk/blog/our-evaluation-of-openais-gpt-5-5-cyber-capabilities, May 2026.
- [13] Chunqiu Steven Xia, Matteo Paltenghi, Jia Le Tian, Michael Pradel, and Lingming Zhang. Fuzz4All: Universal fuzzing with large language models. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering (ICSE), 2024.
- [14] Xin Zhou, Martin Weyssow, Ratnadira Widyasari, Ting Zhang, Junda He, Yunbo Lyu, Jianming Chang, Beiqi Zhang, Dan Huang, and David Lo. LessLeak-Bench: A first investigation of data leakage in LLMs across 83 software engineering benchmarks, 2025.
- [15] Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023.