Pith Number

pith:24JPX53P

pith:2024:24JPX53PX6EE2BI7P5VBMAC4IY

not attested · not anchored · not stored · refs resolved

A StrongREJECT for Empty Jailbreaks

Alexandra Souly, Dillon Bowen, Elvis Hsieh, Justin Svegliato, Olivia Watkins, Pieter Abbeel, Qingyuan Lu, Sam Toyer, Sana Pandey, Scott Emmons, Tu Trinh

The StrongREJECT benchmark and evaluator match human judgments on jailbreak effectiveness more closely than prior methods and show that existing evaluations overstate success rates.

arxiv:2402.10260 v2 · 2024-02-15 · cs.LG · cs.CL · cs.CR


Add to your LaTeX paper

\usepackage{pith}
\pithnumber{24JPX53PX6EE2BI7P5VBMAC4IY}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

The StrongREJECT evaluator achieves state-of-the-art agreement with human judgments of jailbreak effectiveness, and existing evaluation methods significantly overstate jailbreak effectiveness compared to human judgments and the StrongREJECT evaluator.

C2 · weakest assumption

That the chosen dataset of forbidden prompts is representative enough of real-world harmful queries and that the automated evaluator's scoring rules capture the full notion of 'useful harmful information' without introducing new biases.

C3 · one-line summary

StrongREJECT provides a standardized benchmark and evaluator for jailbreak attacks that aligns better with human judgments than prior methods and reveals that successful jailbreaks often reduce model capabilities.

References

74 extracted · 74 resolved · 13 Pith anchors

[1] GPT-4 Technical Report 2023 · arXiv:2303.08774

[2] Shield and spear: Jailbreaking aligned LLMs with generative prompting 2023

[3] arXiv preprint arXiv:2309.00236, 2023

[4] Jailbreaking Black Box Large Language Models in Twenty Queries 2023 · arXiv:2310.08419

[5] Y. Chen, H. Gao, G. Cui, F. Qi, L. Huang, Z. Liu, and M. Sun. Why should adversarial perturbations be imperceptible? Rethink the research paradigm in adversarial NLP. arXiv preprint arXiv:2210.10683, 2022

Formal links

2 machine-checked theorem links

Cited by

28 papers in Pith

Beyond the Single Turn: Reframing Refusals as Dynamic Experiences Embedded in the Context of Mental Health Support Interactions with LLMs

Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

OpenAI o1 System Card

Benchmarking and Improving Monitors for Out-Of-Distribution Alignment Failure in LLMs

Receipt and verification

First computed	2026-05-17T23:38:46.519126Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

d712fbf76fbf884d051f7f6a16005c462a4b5c0178c08fb4ceb8a3814444ef34

Aliases

arxiv: 2402.10260 · arxiv_version: 2402.10260v2 · doi: 10.48550/arxiv.2402.10260 · pith_short_12: 24JPX53PX6EE · pith_short_16: 24JPX53PX6EE2BI7 · pith_short_8: 24JPX53P
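The short aliases above are prefixes of the full identifier. In this record the full identifier also coincides with the unpadded base32 encoding of the first 16 bytes of the canonical hash. The sketch below reproduces both relationships in Python. The derivation rule is inferred from the values on this page, not taken from a published Pith specification, and the function names are hypothetical.

```python
import base64

# Canonical hash as printed in this record.
CANONICAL_HASH = "d712fbf76fbf884d051f7f6a16005c462a4b5c0178c08fb4ceb8a3814444ef34"


def pith_id_from_hash(canonical_hash_hex: str) -> str:
    """Base32-encode the first 16 bytes of the hash and drop '=' padding.

    Assumption: this rule is inferred from this one record, not from a spec.
    """
    digest = bytes.fromhex(canonical_hash_hex)
    return base64.b32encode(digest[:16]).decode("ascii").rstrip("=")


def short_aliases(pith_id: str, lengths=(8, 12, 16)) -> dict:
    """Build the pith_short_N aliases as plain prefixes of the identifier."""
    return {f"pith_short_{n}": pith_id[:n] for n in lengths}


pith_id = pith_id_from_hash(CANONICAL_HASH)
aliases = short_aliases(pith_id)

print(pith_id)  # 24JPX53PX6EE2BI7P5VBMAC4IY
for name, value in sorted(aliases.items()):
    print(name, value)
```

Because the short forms are prefixes, a shorter alias can only be resolved reliably while no other record shares the same prefix.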

Agent API

Resolver JSON · Graph JSON · Events JSON · Schema · Signing key

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/24JPX53PX6EE2BI7P5VBMAC4IY \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: d712fbf76fbf884d051f7f6a16005c462a4b5c0178c08fb4ceb8a3814444ef34

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "3feb38ad9a6d4b8a115403d4d6c3460070d9069053358a27d297f587a84c0f97",
    "cross_cats_sorted": [
      "cs.CL",
      "cs.CR"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2024-02-15T18:58:09Z",
    "title_canon_sha256": "991e809fe481e050656d5a79c357f8a24e4f0c2f9ac32ef723a66c8f72f1efd9"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2402.10260",
    "kind": "arxiv",
    "version": 2
  }
}
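
The canonical hash can also be recomputed offline from the record printed above, without calling the resolver. The sketch below is a minimal Python version that uses the same serialization as the verification one-liner (sorted keys, compact separators, UTF-8, no ASCII escaping). The function name `canonical_hash` is hypothetical, and the script reports whether the digest matches rather than assuming it does.

```python
import hashlib
import json

# Hash printed under "Canonical hash" in this record.
EXPECTED = "d712fbf76fbf884d051f7f6a16005c462a4b5c0178c08fb4ceb8a3814444ef34"

# Canonical record copied from this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "3feb38ad9a6d4b8a115403d4d6c3460070d9069053358a27d297f587a84c0f97",
        "cross_cats_sorted": ["cs.CL", "cs.CR"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.LG",
        "submitted_at": "2024-02-15T18:58:09Z",
        "title_canon_sha256": "991e809fe481e050656d5a79c357f8a24e4f0c2f9ac32ef723a66c8f72f1efd9",
    },
    "schema_version": "1.0",
    "source": {"id": "2402.10260", "kind": "arxiv", "version": 2},
}


def canonical_hash(rec: dict) -> str:
    """SHA-256 over the record serialized with sorted keys and no whitespace."""
    payload = json.dumps(
        rec, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


digest = canonical_hash(record)
print(digest)
print("match" if digest == EXPECTED else "mismatch")
```

Sorting keys before hashing makes the digest independent of the order in which a mirror happens to store the fields. The order of list elements such as `cross_cats_sorted` still matters.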