{"paper":{"title":"Baseline Defenses for Adversarial Attacks Against Aligned Language Models","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Weak discrete optimizers and high optimization costs make baseline defenses effective against jailbreaking attacks on aligned language models.","cross_cats":["cs.CL","cs.CR"],"primary_cat":"cs.LG","authors_text":"Aniruddha Saha, Avi Schwarzschild, Gowthami Somepalli, John Kirchenbauer, Jonas Geiping, Micah Goldblum, Neel Jain, Ping-yeh Chiang, Tom Goldstein, Yuxin Wen","submitted_at":"2023-09-01T17:59:44Z","abstract_excerpt":"As Large Language Models quickly become ubiquitous, it becomes critical to understand their security vulnerabilities. Recent work shows that text optimizers can produce jailbreaking prompts that bypass moderation and alignment. Drawing from the rich body of work on adversarial machine learning, we approach these attacks with three questions: What threat models are practically useful in this domain? How do baseline defense techniques perform in this new domain? How does LLM security differ from computer vision?\n  We evaluate several baseline defense strategies against leading adversarial attack"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"the weakness of existing discrete optimizers for text, combined with the relatively high costs of optimization, makes standard adaptive attacks more challenging for LLMs","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the specific attacks and threat models tested are representative of practical, real-world jailbreaking attempts against deployed LLMs.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Baseline defenses including perplexity-based detection, input preprocessing, and adversarial training offer partial robustness to text adversarial attacks on LLMs, with challenges arising from weak discrete optimizers.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Weak discrete optimizers and high optimization costs make baseline defenses effective against jailbreaking attacks on aligned language models.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"14840adc32c50f6fc6dae2c72b17646925396770d108c3975ce438bfb935f9c5"},"source":{"id":"2309.00614","kind":"arxiv","version":2},"verdict":{"id":"b4632dee-79ac-4fd1-889c-d3cd8c69698a","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-13T23:20:42.902387Z","strongest_claim":"the weakness of existing discrete optimizers for text, combined with the relatively high costs of optimization, makes standard adaptive attacks more challenging for LLMs","one_line_summary":"Baseline defenses including perplexity-based detection, input preprocessing, and adversarial training offer partial robustness to text adversarial attacks on LLMs, with challenges arising from weak discrete optimizers.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the specific attacks and threat models tested are representative of practical, real-world jailbreaking attempts against deployed LLMs.","pith_extraction_headline":"Weak discrete optimizers and high optimization costs make baseline defenses effective against jailbreaking attacks on aligned language models."},"references":{"count":67,"sample":[{"doi":"","year":2018,"title":"Obfuscated Gradients Give a False Sense of Security : Circumventing Defenses to Adversarial Examples","work_id":"eebd7578-ce01-480d-a386-5fbb918c2787","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback","work_id":"a1f2574b-a899-4713-be60-c87ba332656c","ref_index":2,"cited_arxiv_id":"2204.05862","is_internal_anchor":true},{"doi":"","year":2022,"title":"Constitutional AI: Harmlessness from AI Feedback","work_id":"faaaa4e0-2676-4fac-a0b4-99aef10d2095","ref_index":3,"cited_arxiv_id":"2212.08073","is_internal_anchor":true},{"doi":"","year":2018,"title":"Enhancing robustness of machine learning systems via data transformations","work_id":"1b243d8b-8004-43e8-ac31-33a73d07c18a","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"10.1145/3128572.3140444","year":2017,"title":"Adversarial Examples Are Not Easily Detected : Bypassing Ten Detection Methods","work_id":"33da68d1-6d34-4d18-94b5-fb9b0cd941e2","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":67,"snapshot_sha256":"d073becfb2c21fc5bfd8dc7d188ea4fdfe543748d78887d85065b61eb339393e","internal_anchors":16},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}