arxiv: 2603.24511 · v2 · pith:5JZ4ZJ6Snew · submitted 2026-03-25 · 💻 cs.LG · cs.AI · cs.CR

Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs

classification 💻 cs.LG · cs.AI · cs.CR
keywords methods · autoresearch · adversarial · agents · attack · prompt · achieves · adversarially

We show that AI agents are capable of discovering novel algorithms for adversarial attacks against LLMs, advancing the state of the art on white-box jailbreaking and prompt injection evaluations. We deploy frontier agents, such as Claude Code and Codex, in an autoresearch loop with access to a library of 30+ prior methods and an evaluation script with a fixed compute budget. We show this pipeline to be effective in jailbreaking OpenAI's GPT-OSS-Safeguard-20B and in prompt injections against Meta-SecAlign-70B, an adversarially robust model. For GPT-OSS-Safeguard, the best agent-discovered method achieves up to 80% attack success rate (ASR) on CBRN queries, compared to <50% for existing methods. For SecAlign, it achieves 100% ASR, while the best prior automated methods only achieve 82%. Notably, in our setting, attack methods are developed on unrelated surrogate models for a pure random-target token-forcing task, yet generalize directly to prompt injection on the adversarially trained model. Finally, we trace the lineage of methods developed during autoresearch, characterizing the agents' strategies and failure modes. Adversarial ML has long held that defenses must be evaluated against attacks tailored to them; autoresearch automates this principle, and we argue it should be the minimum bar for defense evaluation going forward.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. WMAttack: Automated Attack Search for Adversarial Evaluation of World-Model Agents

    cs.LG · 2026-05 · unverdicted · novelty 7.0

    WMAttack automates finite-budget attack search for world-model agents via SCAS and RGAR, reporting higher normalized reward drops than baselines on Atari and DMC tasks.

  2. The Evaluation Game: Beyond Static LLM Benchmarking

    cs.LG · 2026-05 · unverdicted · novelty 6.0

    Presents a game-theoretic model with group actions for data augmentation in LLM adversarial evaluation, demonstrating local generalization from fine-tuning on three model families and redefining benchmarks as orbits u...

  3. MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI

    cs.LG · 2026-05 · unverdicted · novelty 6.0

    MLS-Bench shows that current AI agents fall short of reliably inventing generalizable ML methods, with engineering tuning easier than genuine invention.