pith. machine review for the scientific record.
Pith Number

pith:AVYX74M7

pith:2023:AVYX74M7QB6UASGVPFJM7RADFX
not attested · not anchored · not stored · refs resolved

Detecting Pretraining Data from Large Language Models

Anirudh Ajith, Danqi Chen, Daogao Liu, Luke Zettlemoyer, Mengzhou Xia, Terra Blevins, Weijia Shi, Yangsibo Huang

Min-K% Prob detects whether a text was in an LLM's pretraining data by averaging the log-probabilities of its k% lowest-probability tokens.

arxiv:2310.16789 v3 · 2023-10-25 · cs.CL · cs.CR · cs.LG

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

Min-K% Prob achieves a 7.4% improvement on WIKIMIA over previous methods. We apply Min-K% Prob to three real-world scenarios (copyrighted book detection, contaminated downstream example detection, and privacy auditing of machine unlearning) and find it a consistently effective solution.

C2 · weakest assumption

An unseen example is likely to contain a few outlier words with low probabilities under the LLM, while a seen example is less likely to have words with such low probabilities.
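
The assumption above can be made concrete with a minimal sketch of the scoring step. It assumes per-token log-probabilities have already been obtained from the model; the function name `min_k_prob`, the `k_percent` default of 20, and the sample values are illustrative choices, not taken from this record.

```python
def min_k_prob(token_log_probs, k_percent=20):
    """Average log-probability of the k% lowest-probability tokens.

    token_log_probs: one log-probability per token, as scored by the LLM.
    k_percent: share of tokens to keep (illustrative default, an assumption).
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    # Keep at least one token so short texts still get a score.
    n = max(1, len(token_log_probs) * k_percent // 100)
    lowest = sorted(token_log_probs)[:n]
    return sum(lowest) / n


# Made-up values: the "unseen" text has two outlier low-probability tokens.
seen_like = [-0.2, -0.1, -0.4, -0.3, -0.2, -0.1, -0.5, -0.3, -0.2, -0.1]
unseen_like = [-0.2, -0.1, -6.0, -0.3, -0.2, -0.1, -5.0, -0.3, -0.2, -0.1]

seen_score = min_k_prob(seen_like)
unseen_score = min_k_prob(unseen_like)
```

A higher (less negative) score suggests the text is more likely to have been seen; where to place the decision threshold is left open here and would need to be calibrated on held-out data.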

C3 · one line summary

Min-K% Prob detects pretraining data in LLMs by flagging outlier low-probability words in text, achieving 7.4% better performance than prior methods on the new WIKIMIA benchmark.

References

130 extracted · 130 resolved · 7 Pith anchors

[1] Stability of stochastic gradient descent on nonsmooth convex losses 2020
[2] Pythia: A suite for analyzing large language models across training and scaling 2023
[3] GPT-NeoX-20B: An open-source autoregressive language model 2022 · arXiv:2204.06745
[4] Machine unlearning 2021
[5] Language models are few-shot learners 2020

Cited by

20 papers in Pith

Receipt and verification
First computed 2026-05-17T23:38:13.446899Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

05717ff19f807d4048d57952cfc4032dfa072bf49732ab3f13cb7f60ae8cfb4a

Aliases

arxiv: 2310.16789 · arxiv_version: 2310.16789v3 · doi: 10.48550/arxiv.2310.16789 · pith_short_12: AVYX74M7QB6U · pith_short_16: AVYX74M7QB6UASGV · pith_short_8: AVYX74M7
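
The values on this page appear to be related: the 26-character Pith Number matches the first 26 characters of the Base32 encoding of the canonical hash, and each `pith_short_*` alias is a prefix of the Pith Number. The sketch below reproduces that relationship. It is inferred from the values shown here rather than from a documented specification, and the helper name `pith_number_from_hash` is hypothetical.

```python
import base64

# Canonical hash as printed on this page.
CANONICAL_HASH = "05717ff19f807d4048d57952cfc4032dfa072bf49732ab3f13cb7f60ae8cfb4a"


def pith_number_from_hash(hex_digest, length=26):
    """Base32-encode the raw digest bytes and keep the first `length` characters.

    Assumption: this mirrors how the identifier on this page relates to the
    hash; it is not a documented Pith algorithm.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


pith_number = pith_number_from_hash(CANONICAL_HASH)

# Short aliases are plain prefixes of the full identifier.
short_aliases = {n: pith_number[:n] for n in (8, 12, 16)}
```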
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/AVYX74M7QB6UASGVPFJM7RADFX \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 05717ff19f807d4048d57952cfc4032dfa072bf49732ab3f13cb7f60ae8cfb4a
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "c0edcfe54c48ba8be05af857532100278a4dad7c829824b0fd3ffbd5584273ec",
    "cross_cats_sorted": [
      "cs.CR",
      "cs.LG"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2023-10-25T17:21:23Z",
    "title_canon_sha256": "eab3e6f8bd1ab37ec5727705f226b4a49b4c5d1b21a3097644eb670981474e65"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2310.16789",
    "kind": "arxiv",
    "version": 3
  }
}
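
For readers without `curl` and `jq`, the same canonicalization can be sketched in Python against a copy of the record above. It uses the serialization settings from the shell command (sorted keys, compact separators, `ensure_ascii=False`, UTF-8). The function name `canonical_sha256` is hypothetical, and the digest is only expected to match the canonical hash if the copied record is identical to the one the service returns.

```python
import hashlib
import json


def canonical_sha256(record):
    """Serialize with sorted keys and compact separators, then SHA-256 the UTF-8 bytes."""
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Record copied from this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "c0edcfe54c48ba8be05af857532100278a4dad7c829824b0fd3ffbd5584273ec",
        "cross_cats_sorted": ["cs.CR", "cs.LG"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2023-10-25T17:21:23Z",
        "title_canon_sha256": "eab3e6f8bd1ab37ec5727705f226b4a49b4c5d1b21a3097644eb670981474e65",
    },
    "schema_version": "1.0",
    "source": {"id": "2310.16789", "kind": "arxiv", "version": 3},
}

digest = canonical_sha256(record)
print(digest)  # compare by eye with the canonical hash listed on this page
```

Because keys are sorted before hashing, the order in which the fields are typed does not affect the digest; only the values and the serialization settings do.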