{"paper":{"title":"Detecting Language Model Attacks with Perplexity","license":"http://creativecommons.org/licenses/by-nc-sa/4.0/","headline":"Adversarial jailbreak suffixes produce high perplexity under GPT-2, allowing a classifier on perplexity and length to catch most attacks.","cross_cats":["cs.AI","cs.CR","cs.LG"],"primary_cat":"cs.CL","authors_text":"Gabriel Alon, Michael Kamfonas","submitted_at":"2023-08-27T15:20:06Z","abstract_excerpt":"A novel hack involving Large Language Models (LLMs) has emerged, exploiting adversarial suffixes to deceive models into generating perilous responses. Such jailbreaks can trick LLMs into providing intricate instructions to a malicious user for creating explosives, orchestrating a bank heist, or facilitating the creation of offensive content. By evaluating the perplexity of queries with adversarial suffixes using an open-source LLM (GPT-2), we found that they have exceedingly high perplexity values. As we explored a broad range of regular (non-adversarial) prompt varieties, we concluded that fa"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"By evaluating the perplexity of queries with adversarial suffixes using an open-source LLM (GPT-2), we found that they have exceedingly high perplexity values. [...] A Light-GBM trained on perplexity and token length resolved the false positives and correctly detected most adversarial attacks in the test set.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the distribution of regular (non-adversarial) prompts used to measure false positives is representative of real-world usage and that future attackers will not adapt suffixes to also produce low perplexity under GPT-2.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Jailbreak prompts with adversarial suffixes have high GPT-2 perplexity, and a LightGBM model on perplexity and length detects most attacks.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Adversarial jailbreak suffixes produce high perplexity under GPT-2, allowing a classifier on perplexity and length to catch most attacks.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"e5a6f0fd51229d0f1a136796718c5ecc2ee6b912899a58e60e629baa18324bd4"},"source":{"id":"2308.14132","kind":"arxiv","version":3},"verdict":{"id":"e55be181-2a61-4cb2-8954-7bca598aa37f","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T13:55:22.051094Z","strongest_claim":"By evaluating the perplexity of queries with adversarial suffixes using an open-source LLM (GPT-2), we found that they have exceedingly high perplexity values. [...] A Light-GBM trained on perplexity and token length resolved the false positives and correctly detected most adversarial attacks in the test set.","one_line_summary":"Jailbreak prompts with adversarial suffixes have high GPT-2 perplexity, and a LightGBM model on perplexity and length detects most attacks.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the distribution of regular (non-adversarial) prompts used to measure false positives is representative of real-world usage and that future attackers will not adapt suffixes to also produce low perplexity under GPT-2.","pith_extraction_headline":"Adversarial jailbreak suffixes produce high perplexity under GPT-2, allowing a classifier on perplexity and length to catch most attacks."},"references":{"count":83,"sample":[{"doi":"","year":2022,"title":"Training a helpful and harmless assistant with reinforcement learning from human feedback, 2022","work_id":"6b9cf002-ab59-4c61-ae20-2d2f7b0eecaf","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2019,"title":"Boolq: Exploring the surprising difficulty of natural yes/no questions","work_id":"95712603-7f1e-44dd-825a-da30fd36d3aa","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2019,"title":"Certified adversarial robustness via randomized smoothing","work_id":"d07eec87-9b6a-4a0e-b8f8-aee82051d662","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2012,"title":"Monitor alarm fatigue: an integrative review","work_id":"95443dd5-754f-4923-a4fa-453decbf764d","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"Improving alignment of dialogue agents via targeted human judgments, 2022","work_id":"279da37f-eca7-468c-9da9-7d42921a9b93","ref_index":6,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":83,"snapshot_sha256":"09a310974125e1a6db0726b75ef3260df968cc1eb1c26603f3445e630ce39dbf","internal_anchors":4},"formal_canon":{"evidence_count":2,"snapshot_sha256":"c87afabb31051e2af7cd34f5765435442bab6e4a22119dc4d57870ea2efe33f8"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}