Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs
We show that AI agents are capable of discovering novel algorithms for adversarial attacks against LLMs, advancing the state of the art on white-box jailbreaking and prompt injection evaluations. We deploy frontier agents, such as Claude Code and Codex, in an autoresearch loop with access to a library of 30+ prior methods and an evaluation script with a fixed compute budget. We show this pipeline to be effective in jailbreaking OpenAI's GPT-OSS-Safeguard-20B and in prompt injections against Meta-SecAlign-70B, an adversarially robust model. For GPT-OSS-Safeguard, the best agent-discovered method achieves up to 80% attack success rate (ASR) on CBRN queries, compared to <50% for existing methods. For SecAlign, it achieves 100% ASR, while the best prior automated methods only achieve 82%. Notably, in our setting, attack methods are developed on unrelated surrogate models for a pure random-target token-forcing task, yet generalize directly to prompt injection on the adversarially trained model. Finally, we trace the lineage of methods developed during autoresearch, characterizing the agents' strategies and failure modes. Adversarial ML has long held that defenses must be evaluated against attacks tailored to them; autoresearch automates this principle, and we argue it should be the minimum bar for defense evaluation going forward.
Forward citations
Cited by 3 Pith papers
- WMAttack: Automated Attack Search for Adversarial Evaluation of World-Model Agents
  WMAttack automates finite-budget attack search for world-model agents via SCAS and RGAR, reporting higher normalized reward drops than baselines on Atari and DMC tasks.
- The Evaluation Game: Beyond Static LLM Benchmarking
  Presents a game-theoretic model with group actions for data augmentation in LLM adversarial evaluation, demonstrating local generalization from fine-tuning on three model families and redefining benchmarks as orbits u...
- MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI
  MLS-Bench shows that current AI agents fall short of reliably inventing generalizable ML methods, with engineering tuning easier than genuine invention.