Pith Number

pith:J56AAO6W

pith:2026:J56AAO6WJ65NDTW5NQOXMQDGD3

not attested · not anchored · not stored · refs resolved

LLM-Based Persuasion Enables Guardrail Override in Frontier LLMs

Andrea Roque, Celio Larcher, Giovana Kerche Bonás, Hugo Abonizio, Marcos Piau, Ramon Pires, Rodrigo Nogueira, Roseval Malaquias Junior, Thales Sales Almeida, Thiago Laitz

One frontier LLM can persuade another, including a copy of itself, to generate prohibited essays on topics like Holocaust denial or climate change denial.

arxiv:2605.13334 v1 · 2026-05-13 · cs.CL

Add to your LaTeX paper

\usepackage{pith}
\pithnumber{J56AAO6WJ65NDTW5NQOXMQDGD3}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

Across 9 attacker-subject pairings on 6 scientific-consensus topics, running each pairing-topic combination 10 times, we obtain non-zero elicitation on all 6 topics. Individual combinations reach 100% essay production on multiple topics... Opus-as-attacker against Opus-as-subject averages 65% across the six topics.
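The experiment size implied by C1 can be worked out directly from the quoted counts. The sketch below assumes a full-factorial design (every pairing run on every topic) and that the 65% Opus-vs-Opus figure weights each run equally; neither assumption is confirmed by the quoted text, and the variable names are illustrative.

```python
# Counts quoted in claim C1.
PAIRINGS = 9
TOPICS = 6
RUNS_PER_COMBINATION = 10

# Assumption: every pairing is evaluated on every topic.
combinations = PAIRINGS * TOPICS
total_runs = combinations * RUNS_PER_COMBINATION

# Opus-as-attacker against Opus-as-subject: one pairing across all six topics.
opus_runs = TOPICS * RUNS_PER_COMBINATION
# Assumption: the 65% average weights every run equally.
opus_essays = round(0.65 * opus_runs)

print(f"{combinations} combinations, {total_runs} runs in total")
print(f"Opus vs Opus: about {opus_essays} essays out of {opus_runs} runs")
```

Under these assumptions a single run moves a per-combination rate by 10 percentage points, so individual 100% results rest on 10 runs each.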

C2 · weakest assumption

That the automated judge LLM accurately identifies generated text that fully satisfies the prohibited request, and does not count partial or hedged compliance as success.

C3 · one line summary

LLM attackers persuade frontier LLMs to generate prohibited essays on consensus topics through multi-turn natural-language pressure, with success rates up to 100% in some model-topic pairs.

References

37 extracted · 37 resolved · 5 Pith anchors

Receipt and verification

First computed	2026-05-18T02:44:48.507296Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

4f7c003bd64fbad1cedd6c1d7640661ecb85a5c489aef1ef834bd2b07e268157

Aliases

arxiv: 2605.13334 · arxiv_version: 2605.13334v1 · doi: 10.48550/arxiv.2605.13334 · pith_short_12: J56AAO6WJ65N · pith_short_16: J56AAO6WJ65NDTW5 · pith_short_8: J56AAO6W
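The identifiers on this page are related in a way that can be checked locally. In this record, the short aliases are prefixes of the full Pith Number, and the full Pith Number equals the first 26 characters of the RFC 4648 Base32 encoding of the canonical hash. That second relationship is an observation from the values shown here, not a documented rule of the scheme; the helper name `base32_prefix` is illustrative.

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "4f7c003bd64fbad1cedd6c1d7640661ecb85a5c489aef1ef834bd2b07e268157"
PITH_NUMBER = "J56AAO6WJ65NDTW5NQOXMQDGD3"


def base32_prefix(hex_digest: str, length: int) -> str:
    """Return the first `length` characters of the RFC 4648 Base32 encoding."""
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


# Observed relationship: Pith Number == Base32(canonical hash)[:26].
derived = base32_prefix(CANONICAL_HASH, len(PITH_NUMBER))

# The pith_short_N aliases are the first N characters of the full identifier.
short_aliases = {n: PITH_NUMBER[:n] for n in (8, 12, 16)}

print("matches canonical hash:", derived == PITH_NUMBER)
for n, alias in short_aliases.items():
    print(f"pith_short_{n}: {alias}")
```

Because the identifier is a truncation of the hash encoding, a shorter alias is more convenient but carries fewer bits, so it is better suited to display than to verification.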

Agent API

Resolver JSON · Graph JSON · Events JSON · Schema · Signing key

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/J56AAO6WJ65NDTW5NQOXMQDGD3 \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 4f7c003bd64fbad1cedd6c1d7640661ecb85a5c489aef1ef834bd2b07e268157

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "8325e51ff12a856ac881608a7a6662828edd3d9dfa13a84055c07bb64119454c",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2026-05-13T10:51:56Z",
    "title_canon_sha256": "0d2df69146e0faa2d1cf9da3202880f712c602eb80e6e7ff763302624fcfad87"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.13334",
    "kind": "arxiv",
    "version": 1
  }
}
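The same check as the curl pipeline above can be run offline against the record printed here. This is a minimal sketch that reuses the serialization from that one-liner (sorted keys, compact separators, UTF-8); the function name `canonical_sha256` is illustrative. Whether the digest equals the published canonical hash depends on this page's copy of the record matching what the resolver serves.

```python
import hashlib
import json

# Canonical record copied from this page.
RECORD = {
    "metadata": {
        "abstract_canon_sha256": "8325e51ff12a856ac881608a7a6662828edd3d9dfa13a84055c07bb64119454c",
        "cross_cats_sorted": [],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2026-05-13T10:51:56Z",
        "title_canon_sha256": "0d2df69146e0faa2d1cf9da3202880f712c602eb80e6e7ff763302624fcfad87",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.13334", "kind": "arxiv", "version": 1},
}

EXPECTED = "4f7c003bd64fbad1cedd6c1d7640661ecb85a5c489aef1ef834bd2b07e268157"


def canonical_sha256(record: dict) -> str:
    """Hash a record using the serialization from the page's one-liner."""
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


digest = canonical_sha256(RECORD)
print("computed:", digest)
print("expected:", EXPECTED)
print("match:", digest == EXPECTED)
```

Sorting keys and fixing the separators makes the digest independent of key order and whitespace in the transmitted JSON, which is what allows a mirror to recompute it.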