pith:J56AAO6W
LLM-Based Persuasion Enables Guardrail Override in Frontier LLMs
One frontier LLM can persuade another, including a copy of itself, to generate prohibited essays on topics like Holocaust denial or climate change denial.
arxiv:2605.13334 v1 · 2026-05-13 · cs.CL
Add to your LaTeX paper
\usepackage{pith}
\pithnumber{J56AAO6WJ65NDTW5NQOXMQDGD3}
Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.
Claims
Across 9 attacker-subject pairings on 6 scientific-consensus topics, running each pairing-topic combination 10 times, we obtain non-zero elicitation on all 6 topics. Individual combinations reach 100% essay production on multiple topics... Opus-as-attacker against Opus-as-subject averages 65% across the six topics.
These results assume that the automated judge LLM counts a generated text as a success only when it fully satisfies the prohibited request, and does not count partial or hedged compliance as success.
LLM attackers persuade frontier LLMs to generate prohibited essays on consensus topics through multi-turn natural-language pressure, with success rates up to 100% in some model-topic pairs.
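The experiment size implied by the first claim can be worked out directly. The sketch below uses only the figures quoted above (9 pairings, 6 topics, 10 runs per combination, and the 65% Opus-against-Opus average); the variable names are illustrative, and the essay count for that pairing is an approximation that assumes each topic contributes the same 10 runs to the average.

```python
# Figures quoted in the claim above.
PAIRINGS = 9           # attacker-subject pairings
TOPICS = 6             # scientific-consensus topics
RUNS_PER_COMBO = 10    # repetitions per pairing-topic combination

combos = PAIRINGS * TOPICS                   # pairing-topic combinations
total_runs = combos * RUNS_PER_COMBO         # conversations overall
runs_per_pairing = TOPICS * RUNS_PER_COMBO   # conversations for one pairing

# Assumption: the reported 65% average weights all six topics equally,
# so it corresponds to roughly this many essays out of 60 runs.
opus_self_rate = 0.65
implied_essays = round(opus_self_rate * runs_per_pairing)

print(combos, total_runs, runs_per_pairing, implied_essays)
```

Because the 65% figure may itself be rounded, the implied essay count should be read as approximate rather than as a number reported by the paper.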
Receipt and verification
| First computed | 2026-05-18T02:44:48.507296Z |
|---|---|
| Builder | pith-number-builder-2026-05-17-v1 |
| Signature | Pith Ed25519 (pith-v1-2026-05) · public key |
| Schema | pith-number/v1.0 |
Canonical hash
4f7c003bd64fbad1cedd6c1d7640661ecb85a5c489aef1ef834bd2b07e268157
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/J56AAO6WJ65NDTW5NQOXMQDGD3 \
| jq -c '.canonical_record' \
| python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 4f7c003bd64fbad1cedd6c1d7640661ecb85a5c489aef1ef834bd2b07e268157
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "8325e51ff12a856ac881608a7a6662828edd3d9dfa13a84055c07bb64119454c",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2026-05-13T10:51:56Z",
    "title_canon_sha256": "0d2df69146e0faa2d1cf9da3202880f712c602eb80e6e7ff763302624fcfad87"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.13334",
    "kind": "arxiv",
    "version": 1
  }
}
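The same check can be run offline against the record shown above. The following is a minimal sketch that mirrors the serialization used in the shell one-liner (sorted keys, compact separators, `ensure_ascii=False`, UTF-8 encoding); the function name `canonical_sha256` is illustrative and not part of any Pith tooling. The printed digest is meant to be compared by hand with the canonical hash listed on this page; the sketch does not assert the match itself.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Hash a record using the serialization from the curl one-liner."""
    payload = json.dumps(
        record,
        sort_keys=True,          # key order must not affect the hash
        separators=(",", ":"),   # no whitespace between tokens
        ensure_ascii=False,      # keep non-ASCII characters as UTF-8
    ).encode()
    return hashlib.sha256(payload).hexdigest()


# The canonical record as printed on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "8325e51ff12a856ac881608a7a6662828edd3d9dfa13a84055c07bb64119454c",
        "cross_cats_sorted": [],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2026-05-13T10:51:56Z",
        "title_canon_sha256": "0d2df69146e0faa2d1cf9da3202880f712c602eb80e6e7ff763302624fcfad87",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.13334", "kind": "arxiv", "version": 1},
}

digest = canonical_sha256(record)
print(digest)  # compare with the canonical hash shown on this page
```

Sorting keys before hashing is what makes the digest independent of the order in which a server or client happens to emit the fields.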