{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2024:24JPX53PX6EE2BI7P5VBMAC4IY","short_pith_number":"pith:24JPX53P","schema_version":"1.0","canonical_sha256":"d712fbf76fbf884d051f7f6a16005c462a4b5c0178c08fb4ceb8a3814444ef34","source":{"kind":"arxiv","id":"2402.10260","version":2},"attestation_state":"computed","paper":{"title":"A StrongREJECT for Empty Jailbreaks","license":"http://creativecommons.org/licenses/by/4.0/","headline":"The StrongREJECT benchmark and evaluator match human judgments on jailbreak effectiveness more closely than prior methods and show that existing evaluations overstate success rates.","cross_cats":["cs.CL","cs.CR"],"primary_cat":"cs.LG","authors_text":"Alexandra Souly, Dillon Bowen, Elvis Hsieh, Justin Svegliato, Olivia Watkins, Pieter Abbeel, Qingyuan Lu, Sam Toyer, Sana Pandey, Scott Emmons, Tu Trinh","submitted_at":"2024-02-15T18:58:09Z","abstract_excerpt":"Most jailbreak papers claim the jailbreaks they propose are highly effective, often boasting near-100% attack success rates. However, it is perhaps more common than not for jailbreak developers to substantially exaggerate the effectiveness of their jailbreaks. We suggest this problem arises because jailbreak researchers lack a standard, high-quality benchmark for evaluating jailbreak performance, leaving researchers to create their own. To create a benchmark, researchers must choose a dataset of forbidden prompts to which a victim model will respond, along with an evaluation method that scores"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2402.10260","kind":"arxiv","version":2},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.LG","submitted_at":"2024-02-15T18:58:09Z","cross_cats_sorted":["cs.CL","cs.CR"],"title_canon_sha256":"991e809fe481e050656d5a79c357f8a24e4f0c2f9ac32ef723a66c8f72f1efd9","abstract_canon_sha256":"3feb38ad9a6d4b8a115403d4d6c3460070d9069053358a27d297f587a84c0f97"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:46.519681Z","signature_b64":"P/qpJWO9TPxdOmE5GMfnClgzxrsHpJRrm5ghDAYijXeD57J6RLoetS8QjOdsucq421xN62KZjfeD+8jidX21CQ==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"d712fbf76fbf884d051f7f6a16005c462a4b5c0178c08fb4ceb8a3814444ef34","last_reissued_at":"2026-05-17T23:38:46.519126Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:46.519126Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"A StrongREJECT for Empty Jailbreaks","license":"http://creativecommons.org/licenses/by/4.0/","headline":"The StrongREJECT benchmark and evaluator match human judgments on jailbreak effectiveness more closely than prior methods and show that existing evaluations overstate success rates.","cross_cats":["cs.CL","cs.CR"],"primary_cat":"cs.LG","authors_text":"Alexandra Souly, Dillon Bowen, Elvis Hsieh, Justin Svegliato, Olivia Watkins, Pieter Abbeel, Qingyuan Lu, Sam Toyer, Sana Pandey, Scott Emmons, Tu Trinh","submitted_at":"2024-02-15T18:58:09Z","abstract_excerpt":"Most jailbreak papers claim the jailbreaks they propose are highly effective, often boasting near-100% attack success rates. However, it is perhaps more common than not for jailbreak developers to substantially exaggerate the effectiveness of their jailbreaks. We suggest this problem arises because jailbreak researchers lack a standard, high-quality benchmark for evaluating jailbreak performance, leaving researchers to create their own. To create a benchmark, researchers must choose a dataset of forbidden prompts to which a victim model will respond, along with an evaluation method that scores"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"The StrongREJECT evaluator achieves state-of-the-art agreement with human judgments of jailbreak effectiveness, and existing evaluation methods significantly overstate jailbreak effectiveness compared to human judgments and the StrongREJECT evaluator.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the chosen dataset of forbidden prompts is representative enough of real-world harmful queries and that the automated evaluator's scoring rules capture the full notion of 'useful harmful information' without introducing new biases.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"StrongREJECT provides a standardized benchmark and evaluator for jailbreak attacks that aligns better with human judgments than prior methods and reveals that successful jailbreaks often reduce model capabilities.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"The StrongREJECT benchmark and evaluator match human judgments on jailbreak effectiveness more closely than prior methods and show that existing evaluations overstate success rates.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"8e1d38713acbdfa8d8a5eef3c30656c865e6e17c1911a4e2cc3f2a8bbd291a72"},"source":{"id":"2402.10260","kind":"arxiv","version":2},"verdict":{"id":"6fb6fe11-e7c1-439e-8992-5cf7bc78d897","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T21:25:12.824989Z","strongest_claim":"The StrongREJECT evaluator achieves state-of-the-art agreement with human judgments of jailbreak effectiveness, and existing evaluation methods significantly overstate jailbreak effectiveness compared to human judgments and the StrongREJECT evaluator.","one_line_summary":"StrongREJECT provides a standardized benchmark and evaluator for jailbreak attacks that aligns better with human judgments than prior methods and reveals that successful jailbreaks often reduce model capabilities.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the chosen dataset of forbidden prompts is representative enough of real-world harmful queries and that the automated evaluator's scoring rules capture the full notion of 'useful harmful information' without introducing new biases.","pith_extraction_headline":"The StrongREJECT benchmark and evaluator match human judgments on jailbreak effectiveness more closely than prior methods and show that existing evaluations overstate success rates."},"references":{"count":74,"sample":[{"doi":"","year":2023,"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","ref_index":1,"cited_arxiv_id":"2303.08774","is_internal_anchor":true},{"doi":"","year":2023,"title":"Shield and spear: Jailbreaking aligned LLMs with generative prompting","work_id":"dc88de70-3346-4b20-9416-45a6cb3586e5","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"arXiv preprint arXiv:2309.00236 , year=","work_id":"a629c7a2-b381-4af9-97e3-fcc9a6e8de84","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Jailbreaking Black Box Large Language Models in Twenty Queries","work_id":"38678cda-6595-4ca3-916b-066c00cce063","ref_index":4,"cited_arxiv_id":"2310.08419","is_internal_anchor":true},{"doi":"","year":2022,"title":"Y . Chen, H. Gao, G. Cui, F. Qi, L. Huang, Z. Liu, and M. Sun. Why should adversarial perturbations be imperceptible? rethink the research paradigm in adversarial nlp. arXiv preprint arXiv:2210.10683,","work_id":"8cfc21f9-ad70-4867-a284-c47761f51655","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":74,"snapshot_sha256":"08f49a9b49416c0a41c525642174f4dc7681acbb8b07b16fcd5ae84cc0e0b627","internal_anchors":13},"formal_canon":{"evidence_count":2,"snapshot_sha256":"a2dd8d60bf3e0868da62df3aaa7777ae5715eb15b929d4be9291753c1de227eb"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2402.10260","created_at":"2026-05-17T23:38:46.519216+00:00"},{"alias_kind":"arxiv_version","alias_value":"2402.10260v2","created_at":"2026-05-17T23:38:46.519216+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2402.10260","created_at":"2026-05-17T23:38:46.519216+00:00"},{"alias_kind":"pith_short_12","alias_value":"24JPX53PX6EE","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"24JPX53PX6EE2BI7","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"24JPX53P","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":28,"internal_anchor_count":28,"sample":[{"citing_arxiv_id":"2602.01694","citing_title":"Beyond the Single Turn: Reframing Refusals as Dynamic Experiences Embedded in the Context of Mental Health Support Interactions with LLMs","ref_index":65,"is_internal_anchor":true},{"citing_arxiv_id":"2605.22643","citing_title":"Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety","ref_index":77,"is_internal_anchor":true},{"citing_arxiv_id":"2412.16720","citing_title":"OpenAI o1 System Card","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2605.21602","citing_title":"Benchmarking and Improving Monitors for Out-Of-Distribution Alignment Failure in LLMs","ref_index":36,"is_internal_anchor":true},{"citing_arxiv_id":"2605.21602","citing_title":"Benchmarking and Improving Monitors for Out-Of-Distribution Alignment Failure in LLMs","ref_index":76,"is_internal_anchor":true},{"citing_arxiv_id":"2605.22643","citing_title":"Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety","ref_index":77,"is_internal_anchor":true},{"citing_arxiv_id":"2605.20296","citing_title":"Spectral Unforgetting: Post-Hoc Recovery of Damaged Capabilities Without Retraining","ref_index":38,"is_internal_anchor":true},{"citing_arxiv_id":"2506.00166","citing_title":"Disentangled Safety Adapters Enable Efficient Guardrails and Flexible Inference-Time Alignment","ref_index":40,"is_internal_anchor":true},{"citing_arxiv_id":"2506.06414","citing_title":"Benchmarking Misuse Mitigation Against Covert Adversaries","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2506.12382","citing_title":"Exploring the Secondary Risks of Large Language Models","ref_index":42,"is_internal_anchor":true},{"citing_arxiv_id":"2601.03267","citing_title":"OpenAI GPT-5 System Card","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2602.01694","citing_title":"Beyond the Single Turn: Reframing Refusals as Dynamic Experiences Embedded in the Context of Mental Health Support Interactions with LLMs","ref_index":65,"is_internal_anchor":true},{"citing_arxiv_id":"2404.01318","citing_title":"JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models","ref_index":47,"is_internal_anchor":true},{"citing_arxiv_id":"2407.04295","citing_title":"Jailbreak Attacks and Defenses Against Large Language Models: A Survey","ref_index":84,"is_internal_anchor":true},{"citing_arxiv_id":"2406.11717","citing_title":"Refusal in Language Models Is Mediated by a Single Direction","ref_index":185,"is_internal_anchor":true},{"citing_arxiv_id":"2604.27861","citing_title":"TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10639","citing_title":"Navigating the Sea of LLM Evaluation: Investigating Bias in Toxicity Benchmarks","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2605.00267","citing_title":"Jailbroken Frontier Models Retain Their Capabilities","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2604.18976","citing_title":"STAR-Teaming: A Strategy-Response Multiplex Network Approach to Automated LLM Red Teaming","ref_index":70,"is_internal_anchor":true},{"citing_arxiv_id":"2604.18510","citing_title":"Different Paths to Harmful Compliance: Behavioral Side Effects and Mechanistic Divergence Across LLM Jailbreaks","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2604.07655","citing_title":"Guardian-as-an-Advisor: Advancing Next-Generation Guardian Models for Trustworthy LLMs","ref_index":69,"is_internal_anchor":true},{"citing_arxiv_id":"2604.08524","citing_title":"What Drives Representation Steering? A Mechanistic Case Study on Steering Refusal","ref_index":30,"is_internal_anchor":true},{"citing_arxiv_id":"2604.05872","citing_title":"Swiss-Bench 003: Evaluating LLM Reliability and Adversarial Security for Swiss Regulatory Contexts","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2508.10925","citing_title":"gpt-oss-120b & gpt-oss-20b Model Card","ref_index":31,"is_internal_anchor":true},{"citing_arxiv_id":"2604.15415","citing_title":"HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents?","ref_index":63,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/24JPX53PX6EE2BI7P5VBMAC4IY","json":"https://pith.science/pith/24JPX53PX6EE2BI7P5VBMAC4IY.json","graph_json":"https://pith.science/api/pith-number/24JPX53PX6EE2BI7P5VBMAC4IY/graph.json","events_json":"https://pith.science/api/pith-number/24JPX53PX6EE2BI7P5VBMAC4IY/events.json","paper":"https://pith.science/paper/24JPX53P"},"agent_actions":{"view_html":"https://pith.science/pith/24JPX53PX6EE2BI7P5VBMAC4IY","download_json":"https://pith.science/pith/24JPX53PX6EE2BI7P5VBMAC4IY.json","view_paper":"https://pith.science/paper/24JPX53P","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2402.10260&json=true","fetch_graph":"https://pith.science/api/pith-number/24JPX53PX6EE2BI7P5VBMAC4IY/graph.json","fetch_events":"https://pith.science/api/pith-number/24JPX53PX6EE2BI7P5VBMAC4IY/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/24JPX53PX6EE2BI7P5VBMAC4IY/action/timestamp_anchor","attest_storage":"https://pith.science/pith/24JPX53PX6EE2BI7P5VBMAC4IY/action/storage_attestation","attest_author":"https://pith.science/pith/24JPX53PX6EE2BI7P5VBMAC4IY/action/author_attestation","sign_citation":"https://pith.science/pith/24JPX53PX6EE2BI7P5VBMAC4IY/action/citation_signature","submit_replication":"https://pith.science/pith/24JPX53PX6EE2BI7P5VBMAC4IY/action/replication_record"}},"created_at":"2026-05-17T23:38:46.519216+00:00","updated_at":"2026-05-17T23:38:46.519216+00:00"}