{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2023:BPZWJ3U2IYPSLIDGVUX5K5S7AM","short_pith_number":"pith:BPZWJ3U2","schema_version":"1.0","canonical_sha256":"0bf364ee9a461f25a066ad2fd5765f0308a44f99511bb1d389d3ad1988a2a258","source":{"kind":"arxiv","id":"2309.00614","version":2},"attestation_state":"computed","paper":{"title":"Baseline Defenses for Adversarial Attacks Against Aligned Language Models","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Weak discrete optimizers and high optimization costs make baseline defenses effective against jailbreaking attacks on aligned language models.","cross_cats":["cs.CL","cs.CR"],"primary_cat":"cs.LG","authors_text":"Aniruddha Saha, Avi Schwarzschild, Gowthami Somepalli, John Kirchenbauer, Jonas Geiping, Micah Goldblum, Neel Jain, Ping-yeh Chiang, Tom Goldstein, Yuxin Wen","submitted_at":"2023-09-01T17:59:44Z","abstract_excerpt":"As Large Language Models quickly become ubiquitous, it becomes critical to understand their security vulnerabilities. Recent work shows that text optimizers can produce jailbreaking prompts that bypass moderation and alignment. Drawing from the rich body of work on adversarial machine learning, we approach these attacks with three questions: What threat models are practically useful in this domain? How do baseline defense techniques perform in this new domain? How does LLM security differ from computer vision?\n  We evaluate several baseline defense strategies against leading adversarial attack"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":false},"canonical_record":{"source":{"id":"2309.00614","kind":"arxiv","version":2},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.LG","submitted_at":"2023-09-01T17:59:44Z","cross_cats_sorted":["cs.CL","cs.CR"],"title_canon_sha256":"30c0897e6adbc00f6ac72b025b734528e1a6a01a369e621269727a5984e6ae75","abstract_canon_sha256":"712e24533eaa0f724504ea6f532c35cf63171f1c848d58739133e131d492cffa"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-18T03:45:00.709974Z","signature_b64":"aGUzp0GwJde3Sl4EvkrX1ws6dq9WHEThWyCnJYqyJ1gpZ/+zd//A+tHuSh/jKdHpcc766cShYS9SWAF87NEwDg==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"0bf364ee9a461f25a066ad2fd5765f0308a44f99511bb1d389d3ad1988a2a258","last_reissued_at":"2026-05-18T03:45:00.709211Z","signature_status":"signed_v1","first_computed_at":"2026-05-18T03:45:00.709211Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Baseline Defenses for Adversarial Attacks Against Aligned Language Models","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Weak discrete optimizers and high optimization costs make baseline defenses effective against jailbreaking attacks on aligned language models.","cross_cats":["cs.CL","cs.CR"],"primary_cat":"cs.LG","authors_text":"Aniruddha Saha, Avi Schwarzschild, Gowthami Somepalli, John Kirchenbauer, Jonas Geiping, Micah Goldblum, Neel Jain, Ping-yeh Chiang, Tom Goldstein, Yuxin Wen","submitted_at":"2023-09-01T17:59:44Z","abstract_excerpt":"As Large Language Models quickly become ubiquitous, it becomes critical to understand their security vulnerabilities. Recent work shows that text optimizers can produce jailbreaking prompts that bypass moderation and alignment. Drawing from the rich body of work on adversarial machine learning, we approach these attacks with three questions: What threat models are practically useful in this domain? How do baseline defense techniques perform in this new domain? How does LLM security differ from computer vision?\n  We evaluate several baseline defense strategies against leading adversarial attack"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"the weakness of existing discrete optimizers for text, combined with the relatively high costs of optimization, makes standard adaptive attacks more challenging for LLMs","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the specific attacks and threat models tested are representative of practical, real-world jailbreaking attempts against deployed LLMs.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Baseline defenses including perplexity-based detection, input preprocessing, and adversarial training offer partial robustness to text adversarial attacks on LLMs, with challenges arising from weak discrete optimizers.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Weak discrete optimizers and high optimization costs make baseline defenses effective against jailbreaking attacks on aligned language models.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"14840adc32c50f6fc6dae2c72b17646925396770d108c3975ce438bfb935f9c5"},"source":{"id":"2309.00614","kind":"arxiv","version":2},"verdict":{"id":"b4632dee-79ac-4fd1-889c-d3cd8c69698a","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-13T23:20:42.902387Z","strongest_claim":"the weakness of existing discrete optimizers for text, combined with the relatively high costs of optimization, makes standard adaptive attacks more challenging for LLMs","one_line_summary":"Baseline defenses including perplexity-based detection, input preprocessing, and adversarial training offer partial robustness to text adversarial attacks on LLMs, with challenges arising from weak discrete optimizers.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the specific attacks and threat models tested are representative of practical, real-world jailbreaking attempts against deployed LLMs.","pith_extraction_headline":"Weak discrete optimizers and high optimization costs make baseline defenses effective against jailbreaking attacks on aligned language models."},"references":{"count":67,"sample":[{"doi":"","year":2018,"title":"Obfuscated Gradients Give a False Sense of Security : Circumventing Defenses to Adversarial Examples","work_id":"eebd7578-ce01-480d-a386-5fbb918c2787","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback","work_id":"a1f2574b-a899-4713-be60-c87ba332656c","ref_index":2,"cited_arxiv_id":"2204.05862","is_internal_anchor":true},{"doi":"","year":2022,"title":"Constitutional AI: Harmlessness from AI Feedback","work_id":"faaaa4e0-2676-4fac-a0b4-99aef10d2095","ref_index":3,"cited_arxiv_id":"2212.08073","is_internal_anchor":true},{"doi":"","year":2018,"title":"Enhancing robustness of machine learning systems via data transformations","work_id":"1b243d8b-8004-43e8-ac31-33a73d07c18a","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"10.1145/3128572.3140444","year":2017,"title":"Adversarial Examples Are Not Easily Detected : Bypassing Ten Detection Methods","work_id":"33da68d1-6d34-4d18-94b5-fb9b0cd941e2","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":67,"snapshot_sha256":"d073becfb2c21fc5bfd8dc7d188ea4fdfe543748d78887d85065b61eb339393e","internal_anchors":16},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2309.00614","created_at":"2026-05-18T03:45:00.709333+00:00"},{"alias_kind":"arxiv_version","alias_value":"2309.00614v2","created_at":"2026-05-18T03:45:00.709333+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2309.00614","created_at":"2026-05-18T03:45:00.709333+00:00"},{"alias_kind":"pith_short_12","alias_value":"BPZWJ3U2IYPS","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_16","alias_value":"BPZWJ3U2IYPSLIDG","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_8","alias_value":"BPZWJ3U2","created_at":"2026-05-18T12:33:33.725879+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":31,"internal_anchor_count":31,"sample":[{"citing_arxiv_id":"2508.04204","citing_title":"ReasoningGuard: Safeguarding Large Reasoning Models with Inference-time Safety Aha Moments","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2508.20325","citing_title":"GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs","ref_index":51,"is_internal_anchor":true},{"citing_arxiv_id":"2510.20129","citing_title":"SAID: Safety-Aware Intent Defense via Prefix Probing for Large Language Models","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2510.23883","citing_title":"Agentic AI Security: Threats, Defenses, Evaluation, and Open Challenges","ref_index":66,"is_internal_anchor":true},{"citing_arxiv_id":"2504.19793","citing_title":"Prompt Injection Attack to Tool Selection in LLM Agents","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2602.02280","citing_title":"RACC: Representation-Aware Coverage Criteria for LLM Safety Testing","ref_index":25,"is_internal_anchor":true},{"citing_arxiv_id":"2603.17368","citing_title":"Towards Safer Large Reasoning Models by Promoting Safety Decision-Making before Chain-of-Thought Generation","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2404.01318","citing_title":"JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2407.04295","citing_title":"Jailbreak Attacks and Defenses Against Large Language Models: A Survey","ref_index":37,"is_internal_anchor":true},{"citing_arxiv_id":"2310.03684","citing_title":"SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks","ref_index":35,"is_internal_anchor":true},{"citing_arxiv_id":"2604.01473","citing_title":"SelfGrader: Stable Jailbreak Detection for Large Language Models using Token-Level Logits","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11996","citing_title":"BadSKP: Backdoor Attacks on Knowledge Graph-Enhanced LLMs with Soft Prompts","ref_index":49,"is_internal_anchor":true},{"citing_arxiv_id":"2605.03095","citing_title":"Revisiting JBShield: Breaking and Rebuilding Representation-Level Jailbreak Defenses","ref_index":18,"is_internal_anchor":true},{"citing_arxiv_id":"2310.08419","citing_title":"Jailbreaking Black Box Large Language Models in Twenty Queries","ref_index":37,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08427","citing_title":"The Attacker in the Mirror: Breaking Self-Consistency in Safety via Anchored Bipolicy Self-Play","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10611","citing_title":"Re-Triggering Safeguards within LLMs for Jailbreak Detection","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08277","citing_title":"Mitigating Many-shot Jailbreak Attacks with One Single Demonstration","ref_index":22,"is_internal_anchor":true},{"citing_arxiv_id":"2605.03378","citing_title":"ARGUS: Defending LLM Agents Against Context-Aware Prompt Injection","ref_index":129,"is_internal_anchor":true},{"citing_arxiv_id":"2605.01899","citing_title":"Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety Alignment","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2605.01078","citing_title":"A Sentence Relation-Based Approach to Sanitizing Malicious Instructions","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2605.00741","citing_title":"Self-Adaptive Multi-Agent LLM-Based Security Pattern Selection for IoT Systems","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2605.00236","citing_title":"Attention Is Where You Attack","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2604.21700","citing_title":"Stealthy Backdoor Attacks against LLMs Based on Natural Style Triggers","ref_index":49,"is_internal_anchor":true},{"citing_arxiv_id":"2604.19657","citing_title":"An AI Agent Execution Environment to Safeguard User Data","ref_index":27,"is_internal_anchor":true},{"citing_arxiv_id":"2604.18874","citing_title":"How Adversarial Environments Mislead Agentic AI?","ref_index":39,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":0,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/BPZWJ3U2IYPSLIDGVUX5K5S7AM","json":"https://pith.science/pith/BPZWJ3U2IYPSLIDGVUX5K5S7AM.json","graph_json":"https://pith.science/api/pith-number/BPZWJ3U2IYPSLIDGVUX5K5S7AM/graph.json","events_json":"https://pith.science/api/pith-number/BPZWJ3U2IYPSLIDGVUX5K5S7AM/events.json","paper":"https://pith.science/paper/BPZWJ3U2"},"agent_actions":{"view_html":"https://pith.science/pith/BPZWJ3U2IYPSLIDGVUX5K5S7AM","download_json":"https://pith.science/pith/BPZWJ3U2IYPSLIDGVUX5K5S7AM.json","view_paper":"https://pith.science/paper/BPZWJ3U2","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2309.00614&json=true","fetch_graph":"https://pith.science/api/pith-number/BPZWJ3U2IYPSLIDGVUX5K5S7AM/graph.json","fetch_events":"https://pith.science/api/pith-number/BPZWJ3U2IYPSLIDGVUX5K5S7AM/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/BPZWJ3U2IYPSLIDGVUX5K5S7AM/action/timestamp_anchor","attest_storage":"https://pith.science/pith/BPZWJ3U2IYPSLIDGVUX5K5S7AM/action/storage_attestation","attest_author":"https://pith.science/pith/BPZWJ3U2IYPSLIDGVUX5K5S7AM/action/author_attestation","sign_citation":"https://pith.science/pith/BPZWJ3U2IYPSLIDGVUX5K5S7AM/action/citation_signature","submit_replication":"https://pith.science/pith/BPZWJ3U2IYPSLIDGVUX5K5S7AM/action/replication_record"}},"created_at":"2026-05-18T03:45:00.709333+00:00","updated_at":"2026-05-18T03:45:00.709333+00:00"}