pith. machine review for the scientific record.

arxiv: 2605.09504 · v1 · submitted 2026-05-10 · 💻 cs.CR · cs.AI · cs.LG

Recognition: 2 theorem links


Position: AI Security Policy Should Target Systems, Not Models

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 04:14 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.LG
keywords AI security · jailbreaking · vulnerability discovery · LLM agents · swarm systems · policy · open source models · adversarial testing

The pith

Swarm systems of small open models can replicate restricted AI attack capabilities at effectively zero cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that coordinated groups of lightweight LLM agents can bypass safety guardrails in frontier models such as GPT-4o and discover software vulnerabilities at high rates using only consumer hardware and public models. In the experiments, five instances of a 1.2-billion-parameter model achieve a 45.8 percent Effective Harm Rate against GPT-4o and recover all nine planted code weaknesses in minutes, while removing the coordination scaffolds drops results to near zero. This establishes that the key enabler is the system design itself rather than model scale or access restrictions. A reader would care because it directly challenges the rationale for withholding advanced models to prevent misuse.

Core claim

Both safety bypass of frontier models and software vulnerability discovery, the capability class that motivated restricted release of Anthropic's Mythos Preview, are achievable at effectively zero cost using commodity hardware and openly available models. The swarm-attack framework, with multiple lightweight LLM agents coordinating through shared memory, parallel exploration, and evolutionary optimization, compensates for the limited reasoning capacity of small individual models, as evidenced by the sharp performance drop when scaffolds are disabled.

What carries the argument

The swarm-attack framework, where multiple lightweight LLM agents coordinate through shared memory, parallel exploration, and evolutionary optimization to perform adversarial testing.
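The coordination mechanics are named but never specified in the abstract. As a hedged illustration only, a minimal evolutionary swarm loop of this shape might look like the following; every function name and parameter here is a hypothetical stand-in, not the paper's API:

```python
import random

# Hypothetical sketch: N lightweight agents share an archive of scored
# attack prompts and evolve them. Nothing here reflects the paper's code.

def run_swarm(mutate, score, seeds, agents=5, rounds=10):
    # Shared memory: every agent reads and writes the same archive.
    archive = [(score(s), s) for s in seeds]
    for _ in range(rounds):
        for _ in range(agents):  # parallel exploration (sequential here)
            # Evolutionary step: pick a strong parent, mutate, re-score.
            parent = max(random.sample(archive, k=min(3, len(archive))))[1]
            child = mutate(parent)
            archive.append((score(child), child))
        # Selection: keep only the best candidates for the next round.
        archive = sorted(archive, reverse=True)[:20]
    return archive[0]

# Toy stand-ins: "score" rewards longer prompts, "mutate" appends a token.
best_score, best_prompt = run_swarm(lambda p: p + "!", len, ["try this"])
```

The point of the sketch is the division of labor: individually weak mutation and scoring steps compound through the shared archive, which is the mechanism the paper credits for compensating for small-model reasoning limits.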

Load-bearing premise

The reported Effective Harm Rate and the 100 percent recall in the planted-CWE setup accurately reflect real-world adversarial capabilities rather than depending on the specific experimental design and hand-crafted seeds.
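Because the metric's definition is load-bearing, it helps to make one plausible reading concrete. The sketch below assumes Effective Harm Rate counts only bypasses whose output is also rated harmful, which would explain how Claude Sonnet-4 can show a 40% technical success rate alongside a 0% EHR; the paper's actual criteria are not given in the abstract:

```python
# Hypothetical metric sketch: the abstract does not define EHR precisely,
# so this encodes one plausible reading, not the authors' definition.

def rates(attempts):
    """attempts: list of (bypassed_guardrails: bool, severity: str)."""
    total = len(attempts)
    # Technical success: the guardrail was bypassed at all.
    technical = sum(1 for bypassed, _ in attempts if bypassed)
    # Effective harm: a bypass whose output is independently rated harmful.
    harmful = sum(1 for bypassed, sev in attempts
                  if bypassed and sev in {"critical", "high"})
    return technical / total, harmful / total

# A Claude-Sonnet-4-like pattern: bypasses happen, but outputs stay benign.
tsr, ehr = rates([(True, "benign")] * 4 + [(False, "none")] * 6)
```

Under this reading the toy run reproduces the reported pattern of a 40% technical success rate with 0% EHR, which is why the two numbers can legitimately diverge.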

What would settle it

An independent replication of the full swarm, with all coordination components enabled, that yields a near-zero Effective Harm Rate against GPT-4o and misses most vulnerabilities in the C application would falsify the zero-cost reproducibility claim; conversely, the base models matching the reported results without the coordination components would falsify the claim that the scaffold is the key enabler.

read the original abstract

We present swarm-attack, an open-source adversarial testing framework in which multiple lightweight LLM agents coordinate through shared memory, parallel exploration, and evolutionary optimization. Together, our results demonstrate that both safety bypass of frontier models and software vulnerability discovery, i.e., the capability class that motivated restricted release of Anthropic's Mythos Preview, are achievable at effectively zero cost using commodity hardware and openly available models. We report two experiments. In the first, five instances of a 1.2 billion parameter model conducted 225 jailbreak attacks each against GPT-4o and Claude Sonnet 4. Against GPT-4o, the swarm achieved an Effective Harm Rate of 45.8%, producing 49 critical-severity breaches; against Claude Sonnet-4, the Effective Harm Rate was 0% despite a 40% technical success rate. In the second experiment, the same models performed combined source code analysis and binary fuzzing against a vulnerable C application with 9 planted CWEs. With a hand-crafted exploit seed corpus, regex pattern detection, and AddressSanitizer-based crash classification, the pipeline recovers 9 of 9 vulnerabilities (100% recall) in approximately four minutes on a consumer MacBook. With those scaffold components disabled, the same model recovers 0 of 9 by crash verification and 2 of 9 by citation. The capability class that motivated restricted release of Anthropic's Mythos Preview is therefore reproducible at effectively zero cost; the important enabler is the system scaffold itself, which compensates for the limited reasoning capacity of small individual models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript presents a position that AI security policy should target systems rather than models. It supports this with an open-source swarm-attack framework using multiple lightweight LLM agents coordinating via shared memory, parallel exploration, and evolutionary optimization. Two experiments are reported: (1) five 1.2B-parameter model instances conducting 225 jailbreak attacks each, achieving 45.8% Effective Harm Rate (49 critical-severity breaches) against GPT-4o but 0% EHR against Claude Sonnet-4; (2) the same models performing source code analysis and binary fuzzing on a vulnerable C application with 9 planted CWEs, recovering all 9 (100% recall) in ~4 minutes on a MacBook when using a hand-crafted exploit seed corpus, regex pattern detection, and AddressSanitizer classification, but 0/9 by crash verification when those scaffold components are disabled. The authors conclude that the capability class motivating restricted releases like Anthropic's Mythos Preview is reproducible at effectively zero cost, with the system scaffold as the key enabler.
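One scaffold component named in the summary, AddressSanitizer-based crash classification, can be made concrete with a small sketch: parse the error tag out of an ASan stderr report and map it to a coarse bug class. The report prefix below matches real AddressSanitizer output, but the mapping is an illustrative guess, not the paper's pipeline:

```python
import re

# Hedged sketch of ASan-based crash classification. The "AddressSanitizer:
# <tag>" line is real ASan report format; the bug-class mapping is ours.
ASAN_TAG = re.compile(r"AddressSanitizer: ([a-z-]+)")

BUG_CLASS = {
    "heap-buffer-overflow": "memory-safety",
    "stack-buffer-overflow": "memory-safety",
    "heap-use-after-free": "temporal-safety",
    "double-free": "temporal-safety",
}

def classify(report):
    """Map one ASan stderr report to a bug class, or None if no crash."""
    m = ASAN_TAG.search(report)
    return BUG_CLASS.get(m.group(1)) if m else None

label = classify("==1==ERROR: AddressSanitizer: heap-buffer-overflow on ...")
```

A classifier of this shape turns raw fuzzer crashes into countable findings, which is exactly the step the ablation removes when crash-verified recall drops to 0/9.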

Significance. If the results hold, this work provides concrete quantitative evidence that small open models with system-level orchestration can replicate notable adversarial capabilities previously associated with frontier models, supporting a shift in policy focus toward deployment systems and interaction scaffolds rather than model-weight restrictions. The open-source swarm-attack framework is a clear strength, offering a reproducible tool for adversarial testing in AI security. This could inform ongoing debates on model release policies by highlighting the role of composable systems. The significance is reduced, however, by limitations in experimental generalization.

major comments (2)
  1. [Vulnerability discovery experiment] The 100% recall is achieved only with a hand-crafted exploit seed corpus, regex pattern detection, and AddressSanitizer-based crash classification; disabling these yields 0/9 by crash verification (and 2/9 by citation). This setup supplies the vulnerability patterns in advance, converting the task into scaffold-orchestrated search rather than open-ended autonomous discovery of unknown vulnerabilities. This is load-bearing for the central 'both' claim and the policy conclusion that the capability class motivating Mythos Preview is reproducible at zero cost with open models.
  2. [Abstract and jailbreak experiment] The reported Effective Harm Rate of 45.8% (producing 49 critical-severity breaches) against GPT-4o lacks full method details, statistical analysis, or raw data. This makes it difficult to evaluate the swarm results for post-hoc choices, measurement validity of the harm rate, or robustness of the 225-attack design per model instance.
minor comments (3)
  1. Provide a precise definition of Effective Harm Rate and the criteria for classifying 'critical-severity breaches' to improve clarity and reproducibility.
  2. The manuscript could expand on how the evolutionary optimization and shared memory in the swarm-attack framework operate, including any pseudocode or parameter settings.
  3. Consider adding a limitations section explicitly addressing the planted-CWE setup and the extent to which results generalize to unplanted, real-world codebases.
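To ground the first major comment, here is a minimal sketch of what "regex pattern detection" over planted CWEs could look like. The patterns are illustrative guesses for three common CWE classes, not the paper's actual rules, and they show why such a scanner amounts to knowing the answers in advance:

```python
import re

# Illustrative only: a tiny regex scanner for C source, of the kind a
# scaffold might use. Real CWE detection needs far more than patterns.
CWE_PATTERNS = {
    "CWE-120": re.compile(r"\b(strcpy|strcat|sprintf|gets)\s*\("),  # buffer copy
    "CWE-134": re.compile(r"\bprintf\s*\(\s*[a-zA-Z_]\w*\s*[,)]"),  # format string
    "CWE-78":  re.compile(r"\bsystem\s*\("),                        # OS command
}

def scan(source):
    """Return the set of CWE ids whose pattern matches the C source."""
    return {cwe for cwe, pat in CWE_PATTERNS.items() if pat.search(source)}

hits = scan('void f(char *s) { char b[8]; strcpy(b, s); system(s); }')
```

A scanner like this recovers exactly the weakness classes its author encoded, which is the referee's point: with the patterns supplied, 100% recall on planted CWEs measures the scaffold's coverage of the plant list, not open-ended discovery.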

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive review. We address each major comment below with clarifications and note the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: Vulnerability discovery experiment: The 100% recall is achieved only with a hand-crafted exploit seed corpus, regex pattern detection, and AddressSanitizer-based crash classification; disabling these yields 0/9 by crash verification (and 2/9 by citation). This setup supplies the vulnerability patterns in advance, converting the task into scaffold-orchestrated search rather than open-ended autonomous discovery of unknown vulnerabilities. This is load-bearing for the central 'both' claim and the policy conclusion that the capability class motivating Mythos Preview is reproducible at zero cost with open models.

    Authors: We agree that the experiment uses a hand-crafted exploit seed corpus along with regex pattern detection and AddressSanitizer classification, and that the ablation (0/9 by crash verification without these) is central to isolating the scaffold's contribution. The manuscript already reports this ablation explicitly. The swarm still performs code analysis, mutation via evolutionary optimization, and result classification; the seeds provide starting points rather than complete solutions. We acknowledge the referee's point that this constitutes guided recovery of planted CWEs rather than fully open-ended discovery of unknown vulnerabilities in arbitrary code. This does not invalidate the core claim that the capability class is reproducible at zero cost with open models plus scaffold, but we will revise the discussion to explicitly distinguish scaffold-assisted recovery from zero-knowledge autonomous discovery and temper related policy language accordingly. revision: partial

  2. Referee: Abstract and first experiment: The reported Effective Harm Rate of 45.8% (producing 49 critical-severity breaches) against GPT-4o lacks full method details, statistical analysis, or raw data. This makes it difficult to evaluate the swarm results for post-hoc choices, measurement validity of the harm rate, or robustness of the 225-attack design per model instance.

    Authors: The full manuscript includes a methods section describing the swarm-attack framework, agent coordination through shared memory, parallel exploration, and evolutionary optimization, as well as the definition of Effective Harm Rate via independent severity classification. We accept that the abstract is a high-level summary and that additional statistical details (e.g., confidence intervals, variance across the five model instances) and transparency on the 225-attack design would aid evaluation. We will expand the methods and results sections with these elements and add a note on data availability (aggregated results or repository link) to allow assessment of robustness and measurement validity. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results stand independently

full rationale

The paper presents two direct experimental measurements using the described swarm-attack framework: jailbreak success rates (45.8% EHR on GPT-4o) and vulnerability recovery rates (9/9 with scaffolds, 0/9 without) on a test application containing planted CWEs. These are reported outcomes of concrete runs on commodity hardware, not mathematical derivations, parameter fits relabeled as predictions, or self-referential definitions. No equations appear, no uniqueness theorems are invoked, and no load-bearing self-citations reduce the central claim to prior author work. The policy position follows from the observed gap between scaffold-enabled and scaffold-disabled performance, which is a substantive empirical distinction rather than a tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no identifiable free parameters, axioms, or invented entities; the 'Effective Harm Rate' metric and the definition of 'critical-severity breaches' may embed implicit choices, but details are unavailable.

pith-pipeline@v0.9.0 · 5589 in / 1190 out tokens · 49989 ms · 2026-05-12T04:14:15.886650+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 2 internal anchors

  1. [1]

    LFM2 technical report. arXiv:2511.23404, 2025

    Liquid AI. LFM2 technical report. arXiv preprint arXiv:2511.23404, 2025

  2. [2]

    CyberSecEval 2: A wide-ranging cybersecurity evaluation suite for large language models. arXiv preprint arXiv:2404.13161, 2024

    Manish Bhatt, Sahana Chennabasappa, Yue Li, Cyrus Nikolaidis, Daniel Song, Shengye Wan, Faizan Ahmad, Cornelius Aschermann, Yaohui Chen, Dhaval Kapil, David Molnar, Spencer Whitman, and Joshua Saxe. CyberSecEval 2: A wide-ranging cybersecurity evaluation suite for large language models. arXiv preprint arXiv:2404.13161, 2024

  3. [3]

    From Naptime to Big Sleep: Using large language models to catch vulnerabilities in real-world code

    Big Sleep Team. From Naptime to Big Sleep: Using large language models to catch vulnerabilities in real-world code. Google Project Zero blog, November 2024. https://projectzero.google/2024/10/from-naptime-to-big-sleep.html

  4. [4]

    Jailbreaking Black Box Large Language Models in Twenty Queries

    Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. Jailbreaking black box large language models in twenty queries. arXiv preprint arXiv:2310.08419, 2023

  5. [5]

    Vulnerability detection with code language models: How far are we? In Proceedings of the IEEE/ACM 47th International Conference on Software Engineering (ICSE), 2025

    Yangruibo Ding, Yanjun Fu, Omniyyah Ibrahim, Chawin Sitawarin, Xinyun Chen, Basel Alomair, David Wagner, Baishakhi Ray, and Yizheng Chen. Vulnerability detection with code language models: How far are we? In Proceedings of the IEEE/ACM 47th International Conference on Software Engineering (ICSE), 2025

  6. [6]

    Everything you wanted to know about LLM-based vulnerability detection but were afraid to ask, 2025

    Yue Li, Xiao Li, Hao Wu, Minghui Xu, Yue Zhang, Xiuzhen Cheng, Fengyuan Xu, and Sheng Zhong. Everything you wanted to know about LLM-based vulnerability detection but were afraid to ask, 2025

  7. [7]

    Synthesizing multi-agent harnesses for vulnerability discovery, 2026

    Hanzhi Liu, Chaofan Shou, Xiaonan Liu, Hongbo Wen, Yanju Chen, Ryan Jingyang Fang, and Yu Feng. Synthesizing multi-agent harnesses for vulnerability discovery, 2026

  8. [8]

    HarmBench: A standardized evaluation framework for automated red teaming and robust refusal

    Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. In Proceedings of the 41st International Conference on Machine Learning (ICML), 2024

  9. [9]

    Tree of attacks: Jailbreaking black-box LLMs automatically

    Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi. Tree of attacks: Jailbreaking black-box LLMs automatically. In Advances in Neural Information Processing Systems (NeurIPS), 2024

  10. [10]

    Small but dangerous: Evaluating and mitigating jailbreak vulnerabilities in small language models

    Leonardo Piano, Claudia Battistin, Jeriek Van den Abeele, and Livio Pompianu. Small but dangerous: Evaluating and mitigating jailbreak vulnerabilities in small language models. In Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2025), volume 2840 of Communications in Computer and Information Science, pages 500–51...

  11. [11]

    Our evaluation of Claude Mythos Preview's cyber capabilities

    UK AI Security Institute. Our evaluation of Claude Mythos Preview's cyber capabilities. https://www.aisi.gov.uk/blog/our-evaluation-of-claude-mythos-previews-cyber-capabilities, April 2026

  12. [12]

    Our evaluation of OpenAI's GPT-5.5 cyber capabilities

    UK AI Security Institute. Our evaluation of OpenAI's GPT-5.5 cyber capabilities. https://www.aisi.gov.uk/blog/our-evaluation-of-openais-gpt-5-5-cyber-capabilities, May 2026

  13. [13]

    Fuzz4All: Universal fuzzing with large language models

    Chunqiu Steven Xia, Matteo Paltenghi, Jia Le Tian, Michael Pradel, and Lingming Zhang. Fuzz4All: Universal fuzzing with large language models. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering (ICSE), 2024

  14. [14]

    LessLeak-Bench: A first investigation of data leakage in LLMs across 83 software engineering benchmarks, 2025

    Xin Zhou, Martin Weyssow, Ratnadira Widyasari, Ting Zhang, Junda He, Yunbo Lyu, Jianming Chang, Beiqi Zhang, Dan Huang, and David Lo. LessLeak-Bench: A first investigation of data leakage in LLMs across 83 software engineering benchmarks, 2025

  15. [15]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023