pith:L2VHQACI
Scaling Synthetic Data Creation with 1,000,000,000 Personas
A hub of one billion web-curated personas lets an LLM generate diverse synthetic data across math, instructions, knowledge texts, NPCs, and tools at scale.
arxiv:2406.20094 v3 · 2024-06-28 · cs.CL · cs.LG
Add to your LaTeX paper
\usepackage{pith}
\pithnumber{L2VHQACIZP7GT2TXWP7FF5ABRZ}
Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.
Claims
These 1 billion personas (~13% of the world's total population), acting as distributed carriers of world knowledge, can tap into almost every perspective encapsulated within the LLM, thereby facilitating the creation of diverse synthetic data at scale for various scenarios.
These claims rest on the assumption that automatically curated web personas are sufficiently diverse, unbiased, and faithfully simulable by the LLM, without introducing repetition or hallucinated perspectives that degrade data quality.
A curated set of one billion personas enables scalable, diverse synthetic data generation for LLM training across reasoning, instructions, knowledge, NPCs, and tools.
Receipt and verification
| Field | Value |
|---|---|
| First computed | 2026-05-17T23:38:49.676680Z |
| Builder | pith-number-builder-2026-05-17-v1 |
| Signature | Pith Ed25519 (pith-v1-2026-05) · public key |
| Schema | pith-number/v1.0 |
Canonical hash
5eaa780048cbfe69ea77b3fe52f4018e4ccfb186efec30f7795f934cd756d5b8
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/L2VHQACIZP7GT2TXWP7FF5ABRZ \
| jq -c '.canonical_record' \
| python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 5eaa780048cbfe69ea77b3fe52f4018e4ccfb186efec30f7795f934cd756d5b8
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "2c7f02e414d1271b02b929d305adaf70bc07f6b89eec240f428b7c3baac3ed37",
    "cross_cats_sorted": [
      "cs.LG"
    ],
    "license": "http://creativecommons.org/licenses/by-nc-sa/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2024-06-28T17:59:01Z",
    "title_canon_sha256": "7ff707acefed87c3ecc5c9283dcc8d2a1c6be1b76f5e79bd4f5d5d552177ff6d"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2406.20094",
    "kind": "arxiv",
    "version": 3
  }
}
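The verification command can also be run offline against the record printed above. The following is a minimal Python sketch, not the service's own tooling: it applies the same canonicalization as the one-liner (sorted keys, compact separators, `ensure_ascii=False`, UTF-8 encoding) and prints a SHA-256 digest. The function name `canonical_sha256` and the variable `record` are illustrative, and the sketch assumes the JSON shown on this page is exactly the `canonical_record` the endpoint returns.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and compact separators, then hash the UTF-8 bytes."""
    canonical = json.dumps(
        record,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    ).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


# The canonical record as printed on this page (assumed complete).
record = {
    "metadata": {
        "abstract_canon_sha256": "2c7f02e414d1271b02b929d305adaf70bc07f6b89eec240f428b7c3baac3ed37",
        "cross_cats_sorted": ["cs.LG"],
        "license": "http://creativecommons.org/licenses/by-nc-sa/4.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2024-06-28T17:59:01Z",
        "title_canon_sha256": "7ff707acefed87c3ecc5c9283dcc8d2a1c6be1b76f5e79bd4f5d5d552177ff6d",
    },
    "schema_version": "1.0",
    "source": {"id": "2406.20094", "kind": "arxiv", "version": 3},
}

digest = canonical_sha256(record)
print(digest)
```

Compare the printed digest with the value listed under "Canonical hash". Because the keys are sorted before hashing, whitespace and key order in the displayed JSON do not affect the result; any change to a value does.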