Pith Number

pith:ZFOZZZP7

pith:2026:ZFOZZZP7LMECGCNMCJHQ46TN53
not attested · not anchored · not stored · refs pending

SelfGrader: LLM Jailbreak Detection via Anchored Token-Level Logits

Jiahao Xu, Olivera Kotevska, Rui Hu, Zikai Zhang

SelfGrader detects jailbreaks by grading queries with logits over digits 0-9

arxiv:2604.01473 v3 · 2026-04-01 · cs.CR · cs.AI

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{ZFOZZZP7LMECGCNMCJHQ46TN53}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

SelfGrader achieves up to a 22.66% reduction in attack success rate (ASR) on LLaMA-3-8B, while maintaining significantly lower memory overhead (up to 173x) and latency (up to 26x).

C2 · weakest assumption

That the logit distribution over a fixed set of numerical tokens (0-9) provides a stable, human-aligned signal of query maliciousness without requiring full response generation or access to internal model features.

C3 · one-line summary

SelfGrader detects LLM jailbreaks by interpreting logit distributions over numerical tokens with a dual maliciousness-benignness score, cutting attack success rates by up to 22.66% while using up to 173x less memory and up to 26x lower latency.
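The grading idea in the summary can be pictured with a small Python sketch. It is illustrative only and rests on assumptions that this record does not confirm: the function names (`digit_probs`, `expected_grade`, `dual_score`, `is_jailbreak`), the normalization of the grade to [0, 1], the averaging rule that combines the two scores, and the 0.5 threshold are all hypothetical, and the paper's actual scoring may differ. Logits are passed in as plain lists of ten floats (one per digit token) rather than read from a model.

```python
import math
from typing import Sequence


def digit_probs(logits: Sequence[float]) -> list[float]:
    """Softmax restricted to the ten digit tokens '0'..'9'."""
    if len(logits) != 10:
        raise ValueError("expected exactly one logit per digit 0-9")
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_grade(logits: Sequence[float]) -> float:
    """Probability-weighted digit, rescaled from 0..9 to 0..1 (assumed rule)."""
    probs = digit_probs(logits)
    return sum(digit * p for digit, p in enumerate(probs)) / 9.0


def dual_score(mal_logits: Sequence[float], ben_logits: Sequence[float]) -> float:
    """Combine a maliciousness grade and a benignness grade (assumed rule).

    A high maliciousness grade and a low benignness grade both push the
    score toward 1.
    """
    mal = expected_grade(mal_logits)
    ben = expected_grade(ben_logits)
    return 0.5 * (mal + (1.0 - ben))


def is_jailbreak(mal_logits: Sequence[float], ben_logits: Sequence[float],
                 threshold: float = 0.5) -> bool:
    """Flag the query when the combined score exceeds a hypothetical threshold."""
    return dual_score(mal_logits, ben_logits) > threshold


# Toy inputs: uniform logits carry no signal; peaked logits are confident.
uniform = [0.0] * 10
peaked_at_9 = [0.0] * 9 + [50.0]
peaked_at_0 = [50.0] + [0.0] * 9

neutral_score = dual_score(uniform, uniform)
attack_score = dual_score(peaked_at_9, peaked_at_0)
```

Because only ten logits from a single forward pass are needed, a scheme of this shape avoids generating a full response, which is consistent with the memory and latency claims above.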

Formal links

2 machine-checked theorem links

Receipt and verification
First computed: 2026-05-29T01:05:08.581411Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05) · public key
Schema: pith-number/v1.0

Canonical hash

c95d9ce5ff5b082309ac124f0e7a6deef92057d54fb060d1c743dd089ef43c28

Aliases

arxiv: 2604.01473 · arxiv_version: 2604.01473v3 · doi: 10.48550/arxiv.2604.01473 · pith_short_12: ZFOZZZP7LMEC · pith_short_16: ZFOZZZP7LMECGCNM · pith_short_8: ZFOZZZP7
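The identifiers on this page appear to be related in a simple way: the 26-character Pith Number matches the first 26 characters of the RFC 4648 Base32 encoding of the canonical hash bytes, and the `pith_short_*` aliases are prefixes of it. This is an observation drawn from the values shown here, not a documented guarantee of the schema. A minimal Python sketch, with the hypothetical helper names `pith_from_hash` and `short_alias`:

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "c95d9ce5ff5b082309ac124f0e7a6deef92057d54fb060d1c743dd089ef43c28"
PITH_NUMBER = "ZFOZZZP7LMECGCNMCJHQ46TN53"


def pith_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the raw digest bytes and keep the leading characters.

    The 26-character length is inferred from this record, not from a spec.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


def short_alias(pith_number: str, n: int) -> str:
    """Return the n-character prefix used as a short alias."""
    return pith_number[:n]


derived = pith_from_hash(CANONICAL_HASH)
aliases = {n: short_alias(derived, n) for n in (8, 12, 16)}

print(derived == PITH_NUMBER)
print(aliases)
```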
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/ZFOZZZP7LMECGCNMCJHQ46TN53 \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: c95d9ce5ff5b082309ac124f0e7a6deef92057d54fb060d1c743dd089ef43c28
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "63e7128930bcca93eca5aaaa87c62467681600e5d0567086b2166f3d6f636b96",
    "cross_cats_sorted": [
      "cs.AI"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CR",
    "submitted_at": "2026-04-01T23:29:12Z",
    "title_canon_sha256": "3a5d1bbce9842ed427c5a1c3ad2092e21d42b6c59d525b8fb56d38e4ae1f4a0e"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2604.01473",
    "kind": "arxiv",
    "version": 3
  }
}
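The shell one-liner under "Verify this Pith Number yourself" can also be written as a small offline Python function. The sketch below applies the same serialization settings as that command (sorted keys, compact separators, `ensure_ascii=False`, UTF-8) to a copy of the record shown above. The function name `canonical_sha256` is hypothetical, and whether the printed digest equals the canonical hash depends on the record being reproduced exactly, so no expected output is asserted here.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and no extra whitespace, then hash as UTF-8."""
    canonical = json.dumps(
        record,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Copy of the canonical record JSON from this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "63e7128930bcca93eca5aaaa87c62467681600e5d0567086b2166f3d6f636b96",
        "cross_cats_sorted": ["cs.AI"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.CR",
        "submitted_at": "2026-04-01T23:29:12Z",
        "title_canon_sha256": "3a5d1bbce9842ed427c5a1c3ad2092e21d42b6c59d525b8fb56d38e4ae1f4a0e",
    },
    "schema_version": "1.0",
    "source": {"id": "2604.01473", "kind": "arxiv", "version": 3},
}

digest = canonical_sha256(record)
print(digest)  # compare by eye with the canonical hash listed on this page
```

Sorting keys before hashing means the digest does not depend on the order in which a mirror happens to store the fields, which is what lets independent copies recompute the same value.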