Pith Number

pith:UUGCHFME

pith:2023:UUGCHFMEAGXRULYILDSIDJ35RX

not attested · not anchored · not stored · refs resolved

XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models

Bertie Vidgen, Dirk Hovy, Federico Bianchi, Giuseppe Attanasio, Hannah Rose Kirk, Paul Röttger

Large language models refuse safe prompts that resemble unsafe requests.

arxiv:2308.01263 v3 · 2023-08-02 · cs.CL · cs.AI


Add to your LaTeX paper

\usepackage{pith}
\pithnumber{UUGCHFMEAGXRULYILDSIDJ35RX}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

we introduce a new test suite called XSTest to identify such eXaggerated Safety behaviours in a systematic way. XSTest comprises 250 safe prompts across ten prompt types that well-calibrated models should not refuse to comply with, and 200 unsafe prompts as contrasts that models, for most applications, should refuse.

C2 · weakest assumption

That the 250 prompts selected by the authors are unambiguously safe and that model refusals on them reliably indicate exaggerated safety rather than other factors such as capability limits or prompt ambiguity.

C3 · one-line summary

XSTest is a benchmark for detecting exaggerated safety refusals in large language models on clearly safe prompts.
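The claims above imply a simple measurement: a refusal rate over the safe prompts (where a high rate suggests exaggerated safety) and a refusal rate over the unsafe contrast prompts (where a high rate is, for most applications, desired). The following is a minimal sketch of that tally, not the authors' evaluation code. The `(label, refused)` record shape, the `refusal_rates` helper, and the toy data are all assumptions made for illustration.

```python
from collections import Counter

def refusal_rates(records):
    """Return the fraction of refused prompts per label.

    `records` is an iterable of (label, refused) pairs, where label is
    "safe" or "unsafe" and refused is a bool. This shape is hypothetical.
    """
    totals, refusals = Counter(), Counter()
    for label, refused in records:
        totals[label] += 1
        if refused:
            refusals[label] += 1
    return {label: refusals[label] / totals[label] for label in totals}

# Toy data only: four safe prompts (one refused), two unsafe prompts (both refused).
toy = [
    ("safe", False), ("safe", True), ("safe", False), ("safe", False),
    ("unsafe", True), ("unsafe", True),
]
rates = refusal_rates(toy)
print(rates)
```

How a response gets labelled as a refusal is the hard part and is deliberately left out of this sketch.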

References

14 extracted · 14 resolved · 3 Pith anchors

[1] A General Language Assistant as a Laboratory for Alignment 2021 · arXiv:2112.00861

[2] Improving alignment of dialogue agents via targeted human judgements 2022 · arXiv:2209.14375

[3] Cohn, Nigel Shadbolt, and Michael Wooldridge 2023

[4] Johannes Welbl, Amelia Glaese, Jonathan Uesato, Sumanth Dathathri, John Mellor, Lisa Anne Hendricks, Kirsty Anderson, Pushmeet Kohli, Ben Coppin, and Po-Sen Huang 2021

[5] Universal and Transferable Adversarial Attacks on Aligned Language Models 2023 · arXiv:2307.15043

Formal links

2 machine-checked theorem links

Cited by

25 papers in Pith

LPG: Balancing Efficiency and Policy Reasoning in Latent Policy Guardrails

ORFuzz: Fuzzing the "Other Side" of LLM Safety -- Testing Over-Refusal

Beyond I'm Sorry, I Can't: Dissecting Large Language Model Refusal

WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs

GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts

Receipt and verification

First computed	2026-05-17T23:38:53.209339Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

a50c23958401af1a2f0858e481a77d8dfd1538538de98387ade099064474869e

Aliases

arxiv: 2308.01263 · arxiv_version: 2308.01263v3 · doi: 10.48550/arxiv.2308.01263 · pith_short_12: UUGCHFMEAGXR · pith_short_16: UUGCHFMEAGXRULYI · pith_short_8: UUGCHFME
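The identifiers on this page appear to be mechanically related. The 26-character Pith Number matches the first 26 characters of the RFC 4648 Base32 encoding of the canonical hash, and each `pith_short_N` alias is an N-character prefix of it. The page does not state this derivation, so the sketch below records an observation about this one record rather than the documented algorithm. The helper name `derive_aliases` and the `pith_full` key are hypothetical.

```python
import base64

# Canonical hash as listed on this page.
CANONICAL_HASH = "a50c23958401af1a2f0858e481a77d8dfd1538538de98387ade099064474869e"

def derive_aliases(sha256_hex, full_len=26, short_lens=(8, 12, 16)):
    """Hypothetical helper: Base32-encode a SHA-256 digest and cut prefixes.

    The lengths are assumptions read off this page, not a published spec.
    """
    encoded = base64.b32encode(bytes.fromhex(sha256_hex)).decode("ascii")
    full = encoded[:full_len]
    aliases = {f"pith_short_{n}": full[:n] for n in short_lens}
    aliases["pith_full"] = full
    return aliases

aliases = derive_aliases(CANONICAL_HASH)
print(aliases["pith_full"])  # UUGCHFMEAGXRULYILDSIDJ35RX
```

If the relationship holds in general, any alias can be checked against a record's canonical hash without contacting the resolver.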

Agent API

Resolver JSON · Graph JSON · Events JSON · Schema · Signing key

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/UUGCHFMEAGXRULYILDSIDJ35RX \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: a50c23958401af1a2f0858e481a77d8dfd1538538de98387ade099064474869e

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "dcb2f0de1688e0d8877977724715ab901970b310f67a94791a0b805d0afb6017",
    "cross_cats_sorted": [
      "cs.AI"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2023-08-02T16:30:40Z",
    "title_canon_sha256": "60bdeef85f4cf639f393480b8b495ace355dbbf4deb3d84130db1d1dd184504a"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2308.01263",
    "kind": "arxiv",
    "version": 3
  }
}
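The verification command on this page canonicalizes the record with sorted keys, compact separators, and no ASCII escaping before hashing it. The sketch below applies the same serialization offline to the record as printed above, which makes the canonical byte string visible. Whether the resulting digest equals the listed canonical hash depends on this page copy being byte-faithful to the served record, so compare the output yourself rather than assuming a match. The function name `canonical_bytes` is an assumption.

```python
import hashlib
import json

# The canonical record exactly as printed on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "dcb2f0de1688e0d8877977724715ab901970b310f67a94791a0b805d0afb6017",
        "cross_cats_sorted": ["cs.AI"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2023-08-02T16:30:40Z",
        "title_canon_sha256": "60bdeef85f4cf639f393480b8b495ace355dbbf4deb3d84130db1d1dd184504a",
    },
    "schema_version": "1.0",
    "source": {"id": "2308.01263", "kind": "arxiv", "version": 3},
}

def canonical_bytes(obj):
    """Serialize as the page's one-liner does: sorted keys, compact, UTF-8."""
    text = json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")

canonical = canonical_bytes(record)
digest = hashlib.sha256(canonical).hexdigest()
print(canonical.decode("utf-8"))
print(digest)
```

Because keys are sorted before serialization, the digest does not depend on the key order in which a mirror happens to store the record.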