{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2020:KPXUNR6THD3NXDGYDRYVC6QNO2","short_pith_number":"pith:KPXUNR6T","schema_version":"1.0","canonical_sha256":"53ef46c7d338f6db8cd81c71517a0d768edd406efdeedc909e2b1c1243e8fbf3","source":{"kind":"arxiv","id":"2009.11462","version":2},"attestation_state":"computed","paper":{"title":"RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Pretrained language models can generate toxic text from seemingly innocuous prompts, and no current control method prevents it reliably.","cross_cats":[],"primary_cat":"cs.CL","authors_text":"Maarten Sap, Noah A. Smith, Samuel Gehman, Suchin Gururangan, Yejin Choi","submitted_at":"2020-09-24T03:17:19Z","abstract_excerpt":"Pretrained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language which hinders their safe deployment. We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration. We create and release RealToxicityPrompts, a dataset of 100K naturally occurring, sentence-level prompts derived from a large corpus of English web text, paired with toxicity scores from a widely-used toxicity classifier. Using RealToxicityPrompts, we find "},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2009.11462","kind":"arxiv","version":2},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CL","submitted_at":"2020-09-24T03:17:19Z","cross_cats_sorted":[],"title_canon_sha256":"d92c209d59272778bc18f45a0692e5200a239338ee2718041aa4910328593b2b","abstract_canon_sha256":"3749c25aaae21dcfecfa070717e38d7d70f9e1fbaa64207045cf52e3f8b4d422"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:50.603637Z","signature_b64":"wu0RJEDg1thKew0NVwTxrsB+5n2fXRI4KKvGCzi1dIcMEoF+EcQYJNpvCVZRuFUuGh9K4DvWkFaWBXrHnunWAg==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"53ef46c7d338f6db8cd81c71517a0d768edd406efdeedc909e2b1c1243e8fbf3","last_reissued_at":"2026-05-17T23:38:50.603114Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:50.603114Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Pretrained language models can generate toxic text from seemingly innocuous prompts, and no current control method prevents it reliably.","cross_cats":[],"primary_cat":"cs.CL","authors_text":"Maarten Sap, Noah A. Smith, Samuel Gehman, Suchin Gururangan, Yejin Choi","submitted_at":"2020-09-24T03:17:19Z","abstract_excerpt":"Pretrained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language which hinders their safe deployment. We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration. We create and release RealToxicityPrompts, a dataset of 100K naturally occurring, sentence-level prompts derived from a large corpus of English web text, paired with toxicity scores from a widely-used toxicity classifier. Using RealToxicityPrompts, we find "},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Using RealToxicityPrompts, we find that pretrained LMs can degenerate into toxic text even from seemingly innocuous prompts... no current method is failsafe against neural toxic degeneration.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the automated toxicity classifier produces scores that reliably correspond to human judgments of toxicity across diverse prompts and generations.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Language models produce toxic text from innocuous prompts, and no tested control method fully prevents it, demonstrated via a new 100K-prompt web-derived dataset.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Pretrained language models can generate toxic text from seemingly innocuous prompts, and no current control method prevents it reliably.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"4b35b50293afaad2f088eb6c524f31a83d9fa50b6fc03b1d3193fee25a3882b1"},"source":{"id":"2009.11462","kind":"arxiv","version":2},"verdict":{"id":"8d0a36ef-b202-4602-baae-9b86514c0835","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T18:16:59.965453Z","strongest_claim":"Using RealToxicityPrompts, we find that pretrained LMs can degenerate into toxic text even from seemingly innocuous prompts... no current method is failsafe against neural toxic degeneration.","one_line_summary":"Language models produce toxic text from innocuous prompts, and no tested control method fully prevents it, demonstrated via a new 100K-prompt web-derived dataset.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the automated toxicity classifier produces scores that reliably correspond to human judgments of toxicity across diverse prompts and generations.","pith_extraction_headline":"Pretrained language models can generate toxic text from seemingly innocuous prompts, and no current control method prevents it reliably."},"references":{"count":12,"sample":[{"doi":"","year":2018,"title":"In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pages 33–39, Florence, Italy","work_id":"10cede94-a482-403f-93ff-42ba663eb54a","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2016,"title":"Enriching word vectors with subword information","work_id":"8d8270fc-359a-49cb-948b-94200e98ccb1","ref_index":2,"cited_arxiv_id":"1607.04606","is_internal_anchor":true},{"doi":"","year":2020,"title":"In Proceedings of the 51st Annual Meeting of the Association for Compu- tational Linguistics (V olume 1: Long Papers), pages 250–259, Soﬁa, Bulgaria","work_id":"2b1d67a3-0dd6-4754-933a-37e1ec84650e","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2018,"title":"Lucas Dixon, John Li, Jeffrey Scott Sorensen, Nithum Thain, and Lucy Vasserman","work_id":"931e7131-2846-4dfe-90d6-99bec022b2b4","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"In Proceedings of the 28th International Conference on International Conference on Machine Learning , ICML’11, page 10411048, Madison, WI, USA","work_id":"ca3997fc-540b-408d-be33-1037560236a9","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":12,"snapshot_sha256":"110fd506bac8dd5f6dd040ced38f0803a301d2321ff610a7144027b1e5073b91","internal_anchors":1},"formal_canon":{"evidence_count":2,"snapshot_sha256":"ebc6b1d49076462c75a569a2f3f06601923192c18ccec6bef262004d5b1b8215"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2009.11462","created_at":"2026-05-17T23:38:50.603195+00:00"},{"alias_kind":"arxiv_version","alias_value":"2009.11462v2","created_at":"2026-05-17T23:38:50.603195+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2009.11462","created_at":"2026-05-17T23:38:50.603195+00:00"},{"alias_kind":"pith_short_12","alias_value":"KPXUNR6THD3N","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_16","alias_value":"KPXUNR6THD3NXDGY","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_8","alias_value":"KPXUNR6T","created_at":"2026-05-18T12:33:33.725879+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":26,"internal_anchor_count":26,"sample":[{"citing_arxiv_id":"2605.17128","citing_title":"New Wide-Net-Casting Jailbreak Attacks Risk Large Models","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2605.19940","citing_title":"Robotics-Inspired Guardrails for Foundation Models in Socially Sensitive Domains","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2506.13727","citing_title":"Attribution-Guided Pruning for Insight and Control: Circuit Discovery and Targeted Correction in Small-scale LLMs","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2401.05561","citing_title":"TrustLLM: Trustworthiness in Large Language Models","ref_index":250,"is_internal_anchor":true},{"citing_arxiv_id":"2305.16264","citing_title":"Scaling Data-Constrained Language Models","ref_index":36,"is_internal_anchor":true},{"citing_arxiv_id":"2304.06767","citing_title":"RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment","ref_index":100,"is_internal_anchor":true},{"citing_arxiv_id":"2310.02446","citing_title":"Low-Resource Languages Jailbreak GPT-4","ref_index":18,"is_internal_anchor":true},{"citing_arxiv_id":"2512.12283","citing_title":"Large Language Models have Chain-of-Affect","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2309.10253","citing_title":"GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2310.03684","citing_title":"SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2211.09085","citing_title":"Galactica: A Large Language Model for Science","ref_index":165,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08116","citing_title":"The Safety-Aware Denoiser for Text Diffusion Models","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10582","citing_title":"Guaranteed Jailbreaking Defense via Disrupt-and-Rectify Smoothing","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10639","citing_title":"Navigating the Sea of LLM Evaluation: Investigating Bias in Toxicity Benchmarks","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2209.07858","citing_title":"Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned","ref_index":25,"is_internal_anchor":true},{"citing_arxiv_id":"2112.04359","citing_title":"Ethical and social risks of harm from Language Models","ref_index":89,"is_internal_anchor":true},{"citing_arxiv_id":"2112.00861","citing_title":"A General Language Assistant as a Laboratory for Alignment","ref_index":222,"is_internal_anchor":true},{"citing_arxiv_id":"2211.09527","citing_title":"Ignore Previous Prompt: Attack Techniques For Language Models","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2604.19018","citing_title":"Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control","ref_index":56,"is_internal_anchor":true},{"citing_arxiv_id":"2604.09212","citing_title":"SPASM: Stable Persona-driven Agent Simulation for Multi-turn Dialogue Generation","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2605.07096","citing_title":"Query-efficient model evaluation using cached responses","ref_index":39,"is_internal_anchor":true},{"citing_arxiv_id":"2605.07063","citing_title":"Dr. Post-Training: A Data Regularization Perspective on LLM Post-Training","ref_index":135,"is_internal_anchor":true},{"citing_arxiv_id":"2308.10248","citing_title":"Steering Language Models With Activation Engineering","ref_index":148,"is_internal_anchor":true},{"citing_arxiv_id":"2604.04410","citing_title":"Relative Density Ratio Optimization for Stable and Statistically Consistent Model Alignment","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2604.07369","citing_title":"The Role of Emotional Stimuli and Intensity in Shaping Large Language Model Behavior","ref_index":3,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/KPXUNR6THD3NXDGYDRYVC6QNO2","json":"https://pith.science/pith/KPXUNR6THD3NXDGYDRYVC6QNO2.json","graph_json":"https://pith.science/api/pith-number/KPXUNR6THD3NXDGYDRYVC6QNO2/graph.json","events_json":"https://pith.science/api/pith-number/KPXUNR6THD3NXDGYDRYVC6QNO2/events.json","paper":"https://pith.science/paper/KPXUNR6T"},"agent_actions":{"view_html":"https://pith.science/pith/KPXUNR6THD3NXDGYDRYVC6QNO2","download_json":"https://pith.science/pith/KPXUNR6THD3NXDGYDRYVC6QNO2.json","view_paper":"https://pith.science/paper/KPXUNR6T","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2009.11462&json=true","fetch_graph":"https://pith.science/api/pith-number/KPXUNR6THD3NXDGYDRYVC6QNO2/graph.json","fetch_events":"https://pith.science/api/pith-number/KPXUNR6THD3NXDGYDRYVC6QNO2/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/KPXUNR6THD3NXDGYDRYVC6QNO2/action/timestamp_anchor","attest_storage":"https://pith.science/pith/KPXUNR6THD3NXDGYDRYVC6QNO2/action/storage_attestation","attest_author":"https://pith.science/pith/KPXUNR6THD3NXDGYDRYVC6QNO2/action/author_attestation","sign_citation":"https://pith.science/pith/KPXUNR6THD3NXDGYDRYVC6QNO2/action/citation_signature","submit_replication":"https://pith.science/pith/KPXUNR6THD3NXDGYDRYVC6QNO2/action/replication_record"}},"created_at":"2026-05-17T23:38:50.603195+00:00","updated_at":"2026-05-17T23:38:50.603195+00:00"}