{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2024:PQSCZURRQSYBR5I56YTFM325O6","short_pith_number":"pith:PQSCZURR","schema_version":"1.0","canonical_sha256":"7c242cd23184b018f51df626566f5d77a643f2c1653587b310e29909b38bfe48","source":{"kind":"arxiv","id":"2404.01318","version":5},"attestation_state":"computed","paper":{"title":"JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"JailbreakBench supplies an open repository of adversarial prompts, a 100-behavior dataset, a fixed evaluation framework, and a public leaderboard to make jailbreak comparisons reproducible across models.","cross_cats":["cs.LG"],"primary_cat":"cs.CR","authors_text":"Alexander Robey, Edgar Dobriban, Edoardo Debenedetti, Eric Wong, Florian Tramer, Francesco Croce, George J. Pappas, Hamed Hassani, Maksym Andriushchenko, Nicolas Flammarion, Patrick Chao, Vikash Sehwag","submitted_at":"2024-03-28T02:44:02Z","abstract_excerpt":"Jailbreak attacks cause large language models (LLMs) to generate harmful, unethical, or otherwise objectionable content. Evaluating these attacks presents a number of challenges, which the current collection of benchmarks and evaluation techniques do not adequately address. First, there is no clear standard of practice regarding jailbreaking evaluation. Second, existing works compute costs and success rates in incomparable ways. And third, numerous works are not reproducible, as they withhold adversarial prompts, involve closed-source code, or rely on evolving proprietary APIs. To address thes"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2404.01318","kind":"arxiv","version":5},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CR","submitted_at":"2024-03-28T02:44:02Z","cross_cats_sorted":["cs.LG"],"title_canon_sha256":"567337dfe199e91f48ff0e9b7da157d811d7b1d3e7f9d6d2aef3b5f19080f0e0","abstract_canon_sha256":"705445e4a0882b24468b88e0d56f75d406be1010542225a2e711fbe7e30a8ec4"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:53.303636Z","signature_b64":"wqeugG9DiNN331Et7SFx5dMG2s0j87DmDtYCFVKxg/309m/dCGmKCQBNdVHmyqVJIs4tFxgtGj5Du0/25GysCw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"7c242cd23184b018f51df626566f5d77a643f2c1653587b310e29909b38bfe48","last_reissued_at":"2026-05-17T23:38:53.302991Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:53.302991Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"JailbreakBench supplies an open repository of adversarial prompts, a 100-behavior dataset, a fixed evaluation framework, and a public leaderboard to make jailbreak comparisons reproducible across models.","cross_cats":["cs.LG"],"primary_cat":"cs.CR","authors_text":"Alexander Robey, Edgar Dobriban, Edoardo Debenedetti, Eric Wong, Florian Tramer, Francesco Croce, George J. Pappas, Hamed Hassani, Maksym Andriushchenko, Nicolas Flammarion, Patrick Chao, Vikash Sehwag","submitted_at":"2024-03-28T02:44:02Z","abstract_excerpt":"Jailbreak attacks cause large language models (LLMs) to generate harmful, unethical, or otherwise objectionable content. Evaluating these attacks presents a number of challenges, which the current collection of benchmarks and evaluation techniques do not adequately address. First, there is no clear standard of practice regarding jailbreaking evaluation. Second, existing works compute costs and success rates in incomparable ways. And third, numerous works are not reproducible, as they withhold adversarial prompts, involve closed-source code, or rely on evolving proprietary APIs. To address thes"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"To address these challenges, we introduce JailbreakBench, an open-sourced benchmark with the following components: (1) an evolving repository of state-of-the-art adversarial prompts, which we refer to as jailbreak artifacts; (2) a jailbreaking dataset comprising 100 behaviors; (3) a standardized evaluation framework; and (4) a leaderboard.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the selected 100 behaviors, threat model, system prompts, and scoring functions sufficiently capture real-world jailbreaking risks and success without introducing systematic bias in evaluation.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"JailbreakBench supplies an evolving set of jailbreak prompts, a 100-behavior dataset aligned with usage policies, a standardized evaluation framework, and a leaderboard to enable comparable assessments of attacks and defenses on LLMs.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"JailbreakBench supplies an open repository of adversarial prompts, a 100-behavior dataset, a fixed evaluation framework, and a public leaderboard to make jailbreak comparisons reproducible across models.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"f95ec46d9b24077809065c67c58de33ff5086f87d2977b6a87a9ca3241da3595"},"source":{"id":"2404.01318","kind":"arxiv","version":5},"verdict":{"id":"ad8cda48-d057-48a1-8810-a2345dd287a0","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T06:04:40.711130Z","strongest_claim":"To address these challenges, we introduce JailbreakBench, an open-sourced benchmark with the following components: (1) an evolving repository of state-of-the-art adversarial prompts, which we refer to as jailbreak artifacts; (2) a jailbreaking dataset comprising 100 behaviors; (3) a standardized evaluation framework; and (4) a leaderboard.","one_line_summary":"JailbreakBench supplies an evolving set of jailbreak prompts, a 100-behavior dataset aligned with usage policies, a standardized evaluation framework, and a leaderboard to enable comparable assessments of attacks and defenses on LLMs.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the selected 100 behaviors, threat model, system prompts, and scoring functions sufficiently capture real-world jailbreaking risks and success without introducing systematic bias in evaluation.","pith_extraction_headline":"JailbreakBench supplies an open repository of adversarial prompts, a 100-behavior dataset, a fixed evaluation framework, and a public leaderboard to make jailbreak comparisons reproducible across models."},"references":{"count":64,"sample":[{"doi":"","year":2024,"title":"Are you still on track!? catching llm task drift with activations","work_id":"f913aa64-dddb-4ce5-9f2c-5e314589aa1a","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Llama 3 model card","work_id":"f46e0736-9f8c-49ee-8a86-552ab9905bf6","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"10.1145/3650203.3663326","year":2024,"title":"Croissant: A Metadata Format for ML-Ready Datasets","work_id":"b13e2013-4762-4e9a-97b5-74aa550ddbde","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Jailbreak chat","work_id":"5d243cb8-eac6-42fe-9c14-a649fb943b5e","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Detecting Language Model Attacks with Perplexity","work_id":"8fac4469-dd8b-4784-9ff6-13d2e74e57fb","ref_index":5,"cited_arxiv_id":"2308.14132","is_internal_anchor":true}],"resolved_work":64,"snapshot_sha256":"62a36ba2a5522e98301885f59ce4d838a44ce5e875388b9d08d1c828ce9c1efb","internal_anchors":19},"formal_canon":{"evidence_count":2,"snapshot_sha256":"f81ed52f054c6750d39a3678c52491d084c9abe2a74f2bf2953db603e09da597"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2404.01318","created_at":"2026-05-17T23:38:53.303091+00:00"},{"alias_kind":"arxiv_version","alias_value":"2404.01318v5","created_at":"2026-05-17T23:38:53.303091+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2404.01318","created_at":"2026-05-17T23:38:53.303091+00:00"},{"alias_kind":"pith_short_12","alias_value":"PQSCZURRQSYB","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"PQSCZURRQSYBR5I5","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"PQSCZURR","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":38,"internal_anchor_count":38,"sample":[{"citing_arxiv_id":"2605.22643","citing_title":"Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2405.13068","citing_title":"Uncovering Logit Suppression Vulnerabilities in LLM Safety Alignment","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2503.02574","citing_title":"LLM-Safety Evaluations Lack Robustness","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2603.14987","citing_title":"Beyond Benchmark Islands: Toward Representative Trustworthiness Evaluation for Agentic AI","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2605.21602","citing_title":"Benchmarking and Improving Monitors for Out-Of-Distribution Alignment Failure in LLMs","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2605.21602","citing_title":"Benchmarking and Improving Monitors for Out-Of-Distribution Alignment Failure in LLMs","ref_index":93,"is_internal_anchor":true},{"citing_arxiv_id":"2605.22643","citing_title":"Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2511.12710","citing_title":"Evolve the Method, Not the Prompts: Evolutionary Synthesis of Jailbreak Attacks on LLMs","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2603.04459","citing_title":"Benchmark of Benchmarks: Unpacking Influence and Code Repository Quality in LLM Safety Benchmarks","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17413","citing_title":"Ablating Safety: Mechanisms for Removing Alignment in Language Models for Security Applications","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14087","citing_title":"Measuring and Mitigating Toxicity in Large Language Models: A Comprehensive Replication Study","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2506.01770","citing_title":"ReGA: Model-Based Safeguard for LLMs via Representation-Guided Abstraction","ref_index":76,"is_internal_anchor":true},{"citing_arxiv_id":"2506.06414","citing_title":"Benchmarking Misuse Mitigation Against Covert Adversaries","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2506.12382","citing_title":"Exploring the Secondary Risks of Large Language Models","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2506.13510","citing_title":"Safe-Child-LLM: A Developmental Benchmark for Evaluating LLM Safety in Child-LLM Interactions","ref_index":36,"is_internal_anchor":true},{"citing_arxiv_id":"2507.21540","citing_title":"PRISM: Programmatic Reasoning with Image Sequence Manipulation for LVLM Jailbreaking","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2508.20325","citing_title":"GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs","ref_index":49,"is_internal_anchor":true},{"citing_arxiv_id":"2602.02280","citing_title":"RACC: Representation-Aware Coverage Criteria for LLM Safety Testing","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14087","citing_title":"Measuring and Mitigating Toxicity in Large Language Models: A Comprehensive Replication Study","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2407.04295","citing_title":"Jailbreak Attacks and Defenses Against Large Language Models: A Survey","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14418","citing_title":"The Great Pretender: A Stochasticity Problem in LLM Jailbreak","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2310.03684","citing_title":"SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks","ref_index":27,"is_internal_anchor":true},{"citing_arxiv_id":"2406.11717","citing_title":"Refusal in Language Models Is Mediated by a Single Direction","ref_index":125,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11712","citing_title":"Toward Stable Value Alignment: Introducing Independent Modules for Consistent Value Guidance","ref_index":80,"is_internal_anchor":true},{"citing_arxiv_id":"2406.13352","citing_title":"AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents","ref_index":7,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/PQSCZURRQSYBR5I56YTFM325O6","json":"https://pith.science/pith/PQSCZURRQSYBR5I56YTFM325O6.json","graph_json":"https://pith.science/api/pith-number/PQSCZURRQSYBR5I56YTFM325O6/graph.json","events_json":"https://pith.science/api/pith-number/PQSCZURRQSYBR5I56YTFM325O6/events.json","paper":"https://pith.science/paper/PQSCZURR"},"agent_actions":{"view_html":"https://pith.science/pith/PQSCZURRQSYBR5I56YTFM325O6","download_json":"https://pith.science/pith/PQSCZURRQSYBR5I56YTFM325O6.json","view_paper":"https://pith.science/paper/PQSCZURR","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2404.01318&json=true","fetch_graph":"https://pith.science/api/pith-number/PQSCZURRQSYBR5I56YTFM325O6/graph.json","fetch_events":"https://pith.science/api/pith-number/PQSCZURRQSYBR5I56YTFM325O6/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/PQSCZURRQSYBR5I56YTFM325O6/action/timestamp_anchor","attest_storage":"https://pith.science/pith/PQSCZURRQSYBR5I56YTFM325O6/action/storage_attestation","attest_author":"https://pith.science/pith/PQSCZURRQSYBR5I56YTFM325O6/action/author_attestation","sign_citation":"https://pith.science/pith/PQSCZURRQSYBR5I56YTFM325O6/action/citation_signature","submit_replication":"https://pith.science/pith/PQSCZURRQSYBR5I56YTFM325O6/action/replication_record"}},"created_at":"2026-05-17T23:38:53.303091+00:00","updated_at":"2026-05-17T23:38:53.303091+00:00"}