Pith Number

pith:UUGCHFME

pith:2023:UUGCHFMEAGXRULYILDSIDJ35RX
not attested · not anchored · not stored · refs resolved

XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models

Bertie Vidgen, Dirk Hovy, Federico Bianchi, Giuseppe Attanasio, Hannah Rose Kirk, Paul Röttger

Large language models refuse safe prompts that resemble unsafe requests.

arxiv:2308.01263 v3 · 2023-08-02 · cs.CL · cs.AI

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{UUGCHFMEAGXRULYILDSIDJ35RX}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
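
The record does not spell out the merge algorithm, so the following is only a minimal sketch of one way a merge can be made deterministic: deduplicate the events, sort them by a fixed total order, and fold them into a state. The `Event` fields (`timestamp`, `event_id`, `key`, `value`) and the last-writer-wins rule are assumptions made for illustration, not the Pith bundle format, and signature checking is left out entirely.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Event:
    # All four fields are hypothetical; the real bundle schema is not shown in the record.
    timestamp: str  # ISO 8601 UTC string in one fixed format, so string order is time order
    event_id: str   # unique id, used only to break timestamp ties
    key: str
    value: str


def merge(events: Iterable[Event]) -> Dict[str, str]:
    """Fold events into a state that does not depend on arrival order."""
    state: Dict[str, str] = {}
    # set() drops duplicate events; sorting by (timestamp, event_id) gives every
    # mirror the same replay order, so the last write per key is always the same.
    for event in sorted(set(events), key=lambda e: (e.timestamp, e.event_id)):
        state[event.key] = event.value
    return state


a = Event("2026-05-17T23:38:53Z", "e1", "citations", "open")
b = Event("2026-05-18T00:00:00Z", "e2", "citations", "closed")
c = Event("2026-05-18T00:00:00Z", "e3", "replications", "open")

# Two mirrors that received the same events in a different order (one with a duplicate).
print(merge([a, b, c]) == merge([c, a, b, a]))
```

Sorting on a total order before folding is what makes the result independent of how the events reached a given mirror; any rule that leaves ties unresolved would break that property.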

Claims

C1 strongest claim

we introduce a new test suite called XSTest to identify such eXaggerated Safety behaviours in a systematic way. XSTest comprises 250 safe prompts across ten prompt types that well-calibrated models should not refuse to comply with, and 200 unsafe prompts as contrasts that models, for most applications, should refuse.

C2 weakest assumption

That the 250 prompts selected by the authors are unambiguously safe and that model refusals on them reliably indicate exaggerated safety rather than other factors such as capability limits or prompt ambiguity.

C3 one line summary

XSTest is a benchmark for detecting exaggerated safety refusals in large language models on clearly safe prompts.

References

14 extracted · 14 resolved · 3 Pith anchors

[1] A General Language Assistant as a Laboratory for Alignment 2021 · arXiv:2112.00861
[2] Improving alignment of dialogue agents via targeted human judgements 2022 · arXiv:2209.14375
[3] Cohn, Nigel Shadbolt, and Michael Wooldridge 2023
[4] Johannes Welbl, Amelia Glaese, Jonathan Uesato, Sumanth Dathathri, John Mellor, Lisa Anne Hendricks, Kirsty Anderson, Pushmeet Kohli, Ben Coppin, and Po-Sen Huang 2021
[5] Universal and Transferable Adversarial Attacks on Aligned Language Models 2023 · arXiv:2307.15043

Formal links

2 machine-checked theorem links

Cited by

25 papers in Pith

Receipt and verification
First computed 2026-05-17T23:38:53.209339Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

a50c23958401af1a2f0858e481a77d8dfd1538538de98387ade099064474869e

Aliases

arxiv: 2308.01263 · arxiv_version: 2308.01263v3 · doi: 10.48550/arxiv.2308.01263 · pith_short_12: UUGCHFMEAGXR · pith_short_16: UUGCHFMEAGXRULYI · pith_short_8: UUGCHFME
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/UUGCHFMEAGXRULYILDSIDJ35RX \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: a50c23958401af1a2f0858e481a77d8dfd1538538de98387ade099064474869e
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "dcb2f0de1688e0d8877977724715ab901970b310f67a94791a0b805d0afb6017",
    "cross_cats_sorted": [
      "cs.AI"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2023-08-02T16:30:40Z",
    "title_canon_sha256": "60bdeef85f4cf639f393480b8b495ace355dbbf4deb3d84130db1d1dd184504a"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2308.01263",
    "kind": "arxiv",
    "version": 3
  }
}
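
The shell one-liner under "Verify this Pith Number yourself" can also be reproduced offline in plain Python. The sketch below applies the same serialization the one-liner uses (sorted keys, compact separators, `ensure_ascii=False`, UTF-8 bytes) to the canonical record shown above and compares the digest with the published canonical hash. The function name `canonical_sha256` is an illustrative choice, and whether the comparison succeeds depends on the record above being exactly what the service hashes, which has not been checked here.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """SHA-256 over a key-sorted, whitespace-free JSON serialization."""
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Canonical record copied from the page above.
RECORD = {
    "metadata": {
        "abstract_canon_sha256": "dcb2f0de1688e0d8877977724715ab901970b310f67a94791a0b805d0afb6017",
        "cross_cats_sorted": ["cs.AI"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2023-08-02T16:30:40Z",
        "title_canon_sha256": "60bdeef85f4cf639f393480b8b495ace355dbbf4deb3d84130db1d1dd184504a",
    },
    "schema_version": "1.0",
    "source": {"id": "2308.01263", "kind": "arxiv", "version": 3},
}

# Canonical hash as published on the page.
EXPECTED = "a50c23958401af1a2f0858e481a77d8dfd1538538de98387ade099064474869e"

digest = canonical_sha256(RECORD)
print(digest)
print("matches published hash:", digest == EXPECTED)
```

Because keys are sorted and separators are fixed, the digest does not depend on key order or whitespace in the JSON as received. It does still depend on how values are typed (for example `3` versus `"3"`), so a mirror must preserve value types when it stores the record.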