{"work":{"id":"7ee4de98-0bdd-47ab-abe6-1865cb65b1ae","openalex_id":null,"doi":null,"arxiv_id":"2311.17035","raw_key":null,"title":"Scalable Extraction of Training Data from (Production) Language Models","authors":null,"authors_text":"Milad Nasr, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A. Feder Cooper, Daphne Ippolito","year":2023,"venue":"cs.LG","abstract":"This paper studies extractable memorization: training data that an adversary can efficiently extract by querying a machine learning model without prior knowledge of the training dataset. We show an adversary can extract gigabytes of training data from open-source language models like Pythia or GPT-Neo, semi-open models like LLaMA or Falcon, and closed models like ChatGPT. Existing techniques from the literature suffice to attack unaligned models; in order to attack the aligned ChatGPT, we develop a new divergence attack that causes the model to diverge from its chatbot-style generations and emit training data at a rate 150x higher than when behaving properly. Our methods show practical attacks can recover far more data than previously thought, and reveal that current alignment techniques do not eliminate memorization.","external_url":"https://arxiv.org/abs/2311.17035","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-24T00:13:39.606411+00:00","pith_arxiv_id":"2311.17035","created_at":"2026-05-08T22:04:18.024344+00:00","updated_at":"2026-05-24T00:13:39.606411+00:00","title_quality_ok":true,"display_title":"Scalable Extraction of Training Data from (Production) Language Models","render_title":"Scalable Extraction of Training Data from (Production) Language Models"},"hub":{"state":{"work_id":"7ee4de98-0bdd-47ab-abe6-1865cb65b1ae","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":36,"external_cited_by_count":null,"distinct_field_count":6,"first_pith_cited_at":"2024-03-13T06:59:16+00:00","last_pith_cited_at":"2026-05-19T04:41:39+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-04T16:07:45.250310+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":10},{"context_role":"method","n":1}],"polarity_counts":[{"context_polarity":"background","n":8},{"context_polarity":"support","n":2},{"context_polarity":"use_method","n":1}],"runs":{},"summary":{},"graph":{},"authors":[]}}