{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2023:UUGCHFMEAGXRULYILDSIDJ35RX","short_pith_number":"pith:UUGCHFME","schema_version":"1.0","canonical_sha256":"a50c23958401af1a2f0858e481a77d8dfd1538538de98387ade099064474869e","source":{"kind":"arxiv","id":"2308.01263","version":3},"attestation_state":"computed","paper":{"title":"XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Large language models refuse safe prompts that resemble unsafe requests.","cross_cats":["cs.AI"],"primary_cat":"cs.CL","authors_text":"Bertie Vidgen, Dirk Hovy, Federico Bianchi, Giuseppe Attanasio, Hannah Rose Kirk, Paul R\\\"ottger","submitted_at":"2023-08-02T16:30:40Z","abstract_excerpt":"Without proper safeguards, large language models will readily follow malicious instructions and generate toxic content. This risk motivates safety efforts such as red-teaming and large-scale feedback learning, which aim to make models both helpful and harmless. However, there is a tension between these two objectives, since harmlessness requires models to refuse to comply with unsafe prompts, and thus not be helpful. 
Recent anecdotal evidence suggests that some models may have struck a poor balance, so that even clearly safe prompts are refused if they use similar language to unsafe prompts or"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2308.01263","kind":"arxiv","version":3},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.CL","submitted_at":"2023-08-02T16:30:40Z","cross_cats_sorted":["cs.AI"],"title_canon_sha256":"60bdeef85f4cf639f393480b8b495ace355dbbf4deb3d84130db1d1dd184504a","abstract_canon_sha256":"dcb2f0de1688e0d8877977724715ab901970b310f67a94791a0b805d0afb6017"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:53.210103Z","signature_b64":"ha4i978q8d6e/5E/Zgqcnv2o2i7/WRCT663WAsXBvN/L0FreLX2/zRR/tW/xZQkF4HgzrS/LUFRRUWufncsoCQ==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"a50c23958401af1a2f0858e481a77d8dfd1538538de98387ade099064474869e","last_reissued_at":"2026-05-17T23:38:53.209339Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:53.209339Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Large language models refuse safe prompts that resemble unsafe requests.","cross_cats":["cs.AI"],"primary_cat":"cs.CL","authors_text":"Bertie Vidgen, Dirk Hovy, Federico Bianchi, Giuseppe Attanasio, Hannah Rose Kirk, Paul 
R\\\"ottger","submitted_at":"2023-08-02T16:30:40Z","abstract_excerpt":"Without proper safeguards, large language models will readily follow malicious instructions and generate toxic content. This risk motivates safety efforts such as red-teaming and large-scale feedback learning, which aim to make models both helpful and harmless. However, there is a tension between these two objectives, since harmlessness requires models to refuse to comply with unsafe prompts, and thus not be helpful. Recent anecdotal evidence suggests that some models may have struck a poor balance, so that even clearly safe prompts are refused if they use similar language to unsafe prompts or"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"we introduce a new test suite called XSTest to identify such eXaggerated Safety behaviours in a systematic way. XSTest comprises 250 safe prompts across ten prompt types that well-calibrated models should not refuse to comply with, and 200 unsafe prompts as contrasts that models, for most applications, should refuse.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the 250 prompts selected by the authors are unambiguously safe and that model refusals on them reliably indicate exaggerated safety rather than other factors such as capability limits or prompt ambiguity.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"XSTest is a benchmark for detecting exaggerated safety refusals in large language models on clearly safe prompts.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Large language models refuse safe prompts that resemble unsafe 
requests.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"f33d4877d7b0ab790ee89fce7d8bcb01f75d5f99befe384094a17e1088e855c6"},"source":{"id":"2308.01263","kind":"arxiv","version":3},"verdict":{"id":"ef95a451-20f6-45d0-82bf-5a7fd6d7c1c8","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T06:47:08.468756Z","strongest_claim":"we introduce a new test suite called XSTest to identify such eXaggerated Safety behaviours in a systematic way. XSTest comprises 250 safe prompts across ten prompt types that well-calibrated models should not refuse to comply with, and 200 unsafe prompts as contrasts that models, for most applications, should refuse.","one_line_summary":"XSTest is a benchmark for detecting exaggerated safety refusals in large language models on clearly safe prompts.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the 250 prompts selected by the authors are unambiguously safe and that model refusals on them reliably indicate exaggerated safety rather than other factors such as capability limits or prompt ambiguity.","pith_extraction_headline":"Large language models refuse safe prompts that resemble unsafe requests."},"references":{"count":14,"sample":[{"doi":"","year":2021,"title":"A General Language Assistant as a Laboratory for Alignment","work_id":"a43f9ea0-01be-47d5-b8ee-a1a9f73381c5","ref_index":1,"cited_arxiv_id":"2112.00861","is_internal_anchor":true},{"doi":"","year":2020,"title":"Improving alignment of dialogue agents via targeted human judgements","work_id":"6ad5970e-7550-4ae8-a158-7084dec7e3cc","ref_index":2,"cited_arxiv_id":"2209.14375","is_internal_anchor":true},{"doi":"","year":2023,"title":"Cohn, Nigel Shadbolt, and Michael Wooldridge","work_id":"77a119ca-85b9-49db-ad4b-a702b4d3ce9e","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2021,"title":"Johannes Welbl, Amelia Glaese, Jonathan Uesato, Sumanth 
Dathathri, John Mellor, Lisa Anne Hendricks, Kirsty Anderson, Pushmeet Kohli, Ben Coppin, and Po-Sen Huang","work_id":"9153601a-4aa0-4bc6-a9eb-1b9705a06843","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2020,"title":"Universal and Transferable Adversarial Attacks on Aligned Language Models","work_id":"3322fa86-1768-4677-8425-dd326b45e078","ref_index":5,"cited_arxiv_id":"2307.15043","is_internal_anchor":true}],"resolved_work":14,"snapshot_sha256":"c04996e82d354bc25fc9989ab392132738a4159c4be4c9507de8a29dea51940b","internal_anchors":3},"formal_canon":{"evidence_count":2,"snapshot_sha256":"cad4631ca6def732f0fc75f15329eed16309e00a68a760fbea48a67112225618"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2308.01263","created_at":"2026-05-17T23:38:53.209465+00:00"},{"alias_kind":"arxiv_version","alias_value":"2308.01263v3","created_at":"2026-05-17T23:38:53.209465+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2308.01263","created_at":"2026-05-17T23:38:53.209465+00:00"},{"alias_kind":"pith_short_12","alias_value":"UUGCHFMEAGXR","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"UUGCHFMEAGXRULYI","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"UUGCHFME","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":25,"internal_anchor_count":25,"sample":[{"citing_arxiv_id":"2605.17329","citing_title":"LPG: Balancing Efficiency and Policy Reasoning in Latent Policy Guardrails","ref_index":27,"is_internal_anchor":true},{"citing_arxiv_id":"2508.11222","citing_title":"ORFuzz: Fuzzing the \"Other Side\" of LLM Safety -- Testing 
Over-Refusal","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2509.09708","citing_title":"Beyond I'm Sorry, I Can't: Dissecting Large Language Model Refusal","ref_index":28,"is_internal_anchor":true},{"citing_arxiv_id":"2406.18495","citing_title":"WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs","ref_index":30,"is_internal_anchor":true},{"citing_arxiv_id":"2309.10253","citing_title":"GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts","ref_index":51,"is_internal_anchor":true},{"citing_arxiv_id":"2404.01318","citing_title":"JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models","ref_index":43,"is_internal_anchor":true},{"citing_arxiv_id":"2407.04295","citing_title":"Jailbreak Attacks and Defenses Against Large Language Models: A Survey","ref_index":74,"is_internal_anchor":true},{"citing_arxiv_id":"2604.14168","citing_title":"SAGE Celer 2.6 Technical Card","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12726","citing_title":"Before the Last Token: Diagnosing Final-Token Safety Probe Failures","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08504","citing_title":"A Single Layer to Explain Them All:Understanding Massive Activations in Large Language Models","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12843","citing_title":"Bayesian Model Merging","ref_index":61,"is_internal_anchor":true},{"citing_arxiv_id":"2605.03217","citing_title":"Moral Sensitivity in LLMs: A Tiered Evaluation of Contextual Bias via Behavioral Profiling and Mechanistic Interpretability","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08504","citing_title":"A Single Layer to Explain Them All:Understanding Massive Activations in Large Language Models","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10639","citing_title":"Navigating the Sea of LLM Evaluation: Investigating Bias in 
Toxicity Benchmarks","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2605.01899","citing_title":"Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety Alignment","ref_index":53,"is_internal_anchor":true},{"citing_arxiv_id":"2605.07982","citing_title":"GLiGuard: Schema-Conditioned Classification for LLM Safeguard","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2605.07284","citing_title":"Instruction Tuning Changes How Upstream State Conditions Late Readout: A Cross-Patching Diagnostic","ref_index":22,"is_internal_anchor":true},{"citing_arxiv_id":"2604.07754","citing_title":"The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training","ref_index":49,"is_internal_anchor":true},{"citing_arxiv_id":"2604.07709","citing_title":"IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures","ref_index":30,"is_internal_anchor":true},{"citing_arxiv_id":"2604.07727","citing_title":"TrajGuard: Streaming Hidden-state Trajectory Detection for Decoding-time Jailbreak Defense","ref_index":30,"is_internal_anchor":true},{"citing_arxiv_id":"2604.07883","citing_title":"An Agentic Evaluation Architecture for Historical Bias Detection in Educational Textbooks","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2604.18519","citing_title":"LLM Safety From Within: Detecting Harmful Content with Internal Representations","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2604.19049","citing_title":"Refute-or-Promote: An Adversarial Stage-Gated Multi-Agent Review Methodology for High-Precision LLM-Assisted Defect Discovery","ref_index":31,"is_internal_anchor":true},{"citing_arxiv_id":"2604.18946","citing_title":"Reasoning Structure Matters for Safety Alignment of Reasoning Models","ref_index":22,"is_internal_anchor":true},{"citing_arxiv_id":"2605.03179","citing_title":"A Validated Prompt Bank for Malicious Code Generation: Separating Executable Weapons 
from Security Knowledge in 1,554 Consensus-Labeled Prompts","ref_index":18,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/UUGCHFMEAGXRULYILDSIDJ35RX","json":"https://pith.science/pith/UUGCHFMEAGXRULYILDSIDJ35RX.json","graph_json":"https://pith.science/api/pith-number/UUGCHFMEAGXRULYILDSIDJ35RX/graph.json","events_json":"https://pith.science/api/pith-number/UUGCHFMEAGXRULYILDSIDJ35RX/events.json","paper":"https://pith.science/paper/UUGCHFME"},"agent_actions":{"view_html":"https://pith.science/pith/UUGCHFMEAGXRULYILDSIDJ35RX","download_json":"https://pith.science/pith/UUGCHFMEAGXRULYILDSIDJ35RX.json","view_paper":"https://pith.science/paper/UUGCHFME","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2308.01263&json=true","fetch_graph":"https://pith.science/api/pith-number/UUGCHFMEAGXRULYILDSIDJ35RX/graph.json","fetch_events":"https://pith.science/api/pith-number/UUGCHFMEAGXRULYILDSIDJ35RX/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/UUGCHFMEAGXRULYILDSIDJ35RX/action/timestamp_anchor","attest_storage":"https://pith.science/pith/UUGCHFMEAGXRULYILDSIDJ35RX/action/storage_attestation","attest_author":"https://pith.science/pith/UUGCHFMEAGXRULYILDSIDJ35RX/action/author_attestation","sign_citation":"https://pith.science/pith/UUGCHFMEAGXRULYILDSIDJ35RX/action/citation_signature","submit_replication":"https://pith.science/pith/UUGCHFMEAGXRULYILDSIDJ35RX/action/replication_record"}},"created_at":"2026-05-17T23:38:53.209465+00:00","updated_at":"2026-05-17T23:38:53.209465+00:00"}