{"paper":{"title":"SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"SmoothLLM defends large language models against jailbreaking by perturbing input prompts at the character level and aggregating multiple responses.","cross_cats":["cs.AI","stat.ML"],"primary_cat":"cs.LG","authors_text":"Alexander Robey, Eric Wong, George J. Pappas, Hamed Hassani","submitted_at":"2023-10-05T17:01:53Z","abstract_excerpt":"Despite efforts to align large language models (LLMs) with human intentions, widely-used LLMs such as GPT, Llama, and Claude are susceptible to jailbreaking attacks, wherein an adversary fools a targeted LLM into generating objectionable content. To address this vulnerability, we propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks. Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversar"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Across a range of popular LLMs, SmoothLLM sets the state-of-the-art for robustness against the GCG, PAIR, RandomSearch, and AmpleGCG jailbreaks.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"Adversarially-generated prompts are brittle to character-level changes, which is the core empirical finding used to justify random perturbation and aggregation.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"SmoothLLM mitigates jailbreaking attacks on LLMs by randomly perturbing multiple copies of a prompt at the character level and aggregating the outputs to detect adversarial inputs.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"SmoothLLM defends large language models against jailbreaking by perturbing input prompts at the character level and aggregating multiple responses.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"793c580c3c59968310459d493ce95a6d2cf89bb537d5efd5929a73d0aea503aa"},"source":{"id":"2310.03684","kind":"arxiv","version":4},"verdict":{"id":"36bd2177-17e8-4757-b4a4-86798714b5be","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-14T17:05:31.423908Z","strongest_claim":"Across a range of popular LLMs, SmoothLLM sets the state-of-the-art for robustness against the GCG, PAIR, RandomSearch, and AmpleGCG jailbreaks.","one_line_summary":"SmoothLLM mitigates jailbreaking attacks on LLMs by randomly perturbing multiple copies of a prompt at the character level and aggregating the outputs to detect adversarial inputs.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"Adversarially-generated prompts are brittle to character-level changes, which is the core empirical finding used to justify random perturbation and aggregation.","pith_extraction_headline":"SmoothLLM defends large language models against jailbreaking by perturbing input prompts at the character level and aggregating multiple responses."},"references":{"count":91,"sample":[{"doi":"","year":2009,"title":"RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models","work_id":"6a137b3a-68fe-4f2e-aad1-ca042346408f","ref_index":1,"cited_arxiv_id":"2009.11462","is_internal_anchor":true},{"doi":"","year":2016,"title":"The ai alignment problem: why it is hard, and where to start","work_id":"afbc50a2-46bb-4a39-9aca-47eeb613457a","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2020,"title":"Artificial intelligence, values, and alignment","work_id":"d9c231dd-dfd0-4b73-bda4-8fc8c7ad2a5f","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"The alignment problem: Machine learning and human values","work_id":"13d2d97d-4163-4819-9541-d3968ab50a98","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Regulating chatgpt and other large generative ai models","work_id":"2dcc9e2c-5741-43a9-8f9f-c90551ceb9aa","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":91,"snapshot_sha256":"99b01a8e219fb2b6416486402c9771452dae43eea4271e1604b827848230119b","internal_anchors":23},"formal_canon":{"evidence_count":2,"snapshot_sha256":"498243ea3a4e56c45c2fc2e8d519270374d343ca0189021052f3c8335a926eae"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}