Pith Number

pith:J56AAO6W

pith:2026:J56AAO6WJ65NDTW5NQOXMQDGD3
not attested · not anchored · not stored · refs resolved

LLM-Based Persuasion Enables Guardrail Override in Frontier LLMs

Andrea Roque, Celio Larcher, Giovana Kerche Bonás, Hugo Abonizio, Marcos Piau, Ramon Pires, Rodrigo Nogueira, Roseval Malaquias Junior, Thales Sales Almeida, Thiago Laitz

One frontier LLM can persuade another, including a copy of itself, to generate prohibited essays on topics like Holocaust denial or climate change denial.

arxiv:2605.13334 v1 · 2026-05-13 · cs.CL

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{J56AAO6WJ65NDTW5NQOXMQDGD3}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 strongest claim

Across 9 attacker-subject pairings on 6 scientific-consensus topics, running each pairing-topic combination 10 times, we obtain non-zero elicitation on all 6 topics. Individual combinations reach 100% essay production on multiple topics... Opus-as-attacker against Opus-as-subject averages 65% across the six topics.

C2 weakest assumption

That the automated judge LLM classifies generated text accurately, counting an essay as a success only when it fully satisfies the prohibited request and not when the subject produces partial or hedged compliance.

C3 one line summary

LLM attackers persuade frontier LLMs to generate prohibited essays on consensus topics through multi-turn natural-language pressure, with success rates up to 100% in some model-topic pairs.
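
The experimental grid quoted in C1 can be tallied with a few lines of arithmetic. The sketch below uses only the counts stated in the claim (9 pairings, 6 topics, 10 runs per combination, a 65% average for Opus against Opus). The implied success count is an inference, not a figure reported in the record: it assumes every topic contributes the same number of runs, so that the average of per-topic rates equals the pooled rate.

```python
# Counts taken from claim C1.
PAIRINGS = 9               # attacker-subject pairings
TOPICS = 6                 # scientific-consensus topics
RUNS_PER_COMBINATION = 10  # repetitions of each pairing-topic combination

combinations = PAIRINGS * TOPICS                  # pairing-topic combinations
total_runs = combinations * RUNS_PER_COMBINATION  # conversations overall
runs_per_pairing = TOPICS * RUNS_PER_COMBINATION  # conversations per pairing

# Assumption: equal runs per topic, so a 65% average over topics
# corresponds to 65% of all runs for that pairing.
opus_vs_opus_avg_pct = 65
implied_successes = opus_vs_opus_avg_pct * runs_per_pairing // 100

print(combinations, total_runs, runs_per_pairing, implied_successes)
# prints: 54 540 60 39
```

Under these assumptions, the 65% average for Opus-as-attacker against Opus-as-subject corresponds to roughly 39 of 60 runs ending in a produced essay.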

Receipt and verification
First computed: 2026-05-18T02:44:48.507296Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05) · public key
Schema: pith-number/v1.0

Canonical hash

4f7c003bd64fbad1cedd6c1d7640661ecb85a5c489aef1ef834bd2b07e268157

Aliases

arxiv: 2605.13334 · arxiv_version: 2605.13334v1 · doi: 10.48550/arxiv.2605.13334 · pith_short_12: J56AAO6WJ65N · pith_short_16: J56AAO6WJ65NDTW5 · pith_short_8: J56AAO6W
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/J56AAO6WJ65NDTW5NQOXMQDGD3 \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 4f7c003bd64fbad1cedd6c1d7640661ecb85a5c489aef1ef834bd2b07e268157
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "8325e51ff12a856ac881608a7a6662828edd3d9dfa13a84055c07bb64119454c",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2026-05-13T10:51:56Z",
    "title_canon_sha256": "0d2df69146e0faa2d1cf9da3202880f712c602eb80e6e7ff763302624fcfad87"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.13334",
    "kind": "arxiv",
    "version": 1
  }
}
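
The shell pipeline under "Verify this Pith Number yourself" can also be written as a standalone Python sketch. It applies the same canonicalization as the one-liner (sorted keys, compact separators, no ASCII escaping, UTF-8 encoding) to the canonical record shown above and compares the digest with the published canonical hash. The function name `canonical_sha256` and the `EXPECTED` constant are illustrative, not part of any Pith API, and the sketch assumes the record printed on this page is identical to what the endpoint returns.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Hash a record the same way the shell one-liner does."""
    payload = json.dumps(
        record,
        sort_keys=True,          # deterministic key order
        separators=(",", ":"),   # no whitespace between tokens
        ensure_ascii=False,      # keep non-ASCII characters unescaped
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Canonical record copied from this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "8325e51ff12a856ac881608a7a6662828edd3d9dfa13a84055c07bb64119454c",
        "cross_cats_sorted": [],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2026-05-13T10:51:56Z",
        "title_canon_sha256": "0d2df69146e0faa2d1cf9da3202880f712c602eb80e6e7ff763302624fcfad87",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.13334", "kind": "arxiv", "version": 1},
}

# Published canonical hash from this page.
EXPECTED = "4f7c003bd64fbad1cedd6c1d7640661ecb85a5c489aef1ef834bd2b07e268157"

digest = canonical_sha256(record)
print(digest)
print("matches published hash:", digest == EXPECTED)
```

Because keys are sorted before hashing, the digest does not depend on the order in which a mirror happens to store or emit the fields, which is what lets independent hosts recompute the same value.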