Pith Number

pith:L2VHQACI

pith:2024:L2VHQACIZP7GT2TXWP7FF5ABRZ
not attested · not anchored · not stored · refs resolved

Scaling Synthetic Data Creation with 1,000,000,000 Personas

Dian Yu, Dong Yu, Haitao Mi, Tao Ge, Xiaoyang Wang, Xin Chan

A hub of one billion web-curated personas lets an LLM generate diverse synthetic data across math, instructions, knowledge texts, NPCs, and tools at scale.

arxiv:2406.20094 v3 · 2024-06-28 · cs.CL · cs.LG

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{L2VHQACIZP7GT2TXWP7FF5ABRZ}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle: live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
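
The page does not specify the merge algorithm, so the following is only a minimal sketch of what an order-independent merge over events could look like. The `Event` fields, the last-writer-wins rule, and the `(timestamp, event_id)` ordering are all assumptions made for illustration; signature checking is omitted, and two events with the same `event_id` are assumed to be identical copies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    # Hypothetical fields, not the actual Pith event schema.
    event_id: str
    timestamp: str  # ISO 8601 UTC, so string order matches time order
    field: str
    value: str


def merge(events):
    """Fold events into a state dict, independent of input order."""
    # Drop duplicate deliveries (same id is assumed to mean same content).
    unique = {e.event_id: e for e in events}
    # Impose a total order so every mirror replays events identically.
    ordered = sorted(unique.values(), key=lambda e: (e.timestamp, e.event_id))
    state = {}
    for e in ordered:
        state[e.field] = e.value  # last writer wins under the total order
    return state
```

Because the result depends only on the set of events and not on arrival order, two mirrors holding the same bundle would compute the same state under these assumptions.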

Claims

C1 strongest claim

These 1 billion personas (~13% of the world's total population), acting as distributed carriers of world knowledge, can tap into almost every perspective encapsulated within the LLM, thereby facilitating the creation of diverse synthetic data at scale for various scenarios.

C2 weakest assumption

That automatically curated web personas are sufficiently diverse, unbiased, and faithfully simulable by the LLM without introducing repetition or hallucinated perspectives that degrade data quality.

C3 one-line summary

A curated set of one billion personas enables scalable, diverse synthetic data generation for LLM training across reasoning, instructions, knowledge, NPCs, and tools.

References

29 extracted · 29 resolved · 12 Pith anchors

[1] Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone · arXiv:2404.14219
[2] GPT-4 Technical Report · arXiv:2303.08774
[3] COIG-CQIA: Quality is All You Need for Chinese Instruction Fine-tuning
[4] arXiv:2401.02524
[5] DeepSeek LLM: Scaling Open-Source Language Models with Longtermism · arXiv:2401.02954

Formal links

3 machine-checked theorem links

Cited by

26 papers in Pith

Receipt and verification
First computed: 2026-05-17T23:38:49.676680Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05)
Schema: pith-number/v1.0

Canonical hash

5eaa780048cbfe69ea77b3fe52f4018e4ccfb186efec30f7795f934cd756d5b8

Aliases

arxiv: 2406.20094 · arxiv_version: 2406.20094v3 · doi: 10.48550/arxiv.2406.20094 · pith_short_12: L2VHQACIZP7G · pith_short_16: L2VHQACIZP7GT2TX · pith_short_8: L2VHQACI
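
The identifiers on this page appear to be related in a simple way: the full Pith Number matches the first 26 characters of the RFC 4648 base32 encoding of the canonical hash, and the short aliases are prefixes of that string. The page does not document this derivation, so the sketch below is an observation about these particular values rather than a statement of the schema; `pith_number_from_hash` is a hypothetical helper name.

```python
import base64

# Canonical hash as shown on this page.
CANONICAL_HASH = "5eaa780048cbfe69ea77b3fe52f4018e4ccfb186efec30f7795f934cd756d5b8"


def pith_number_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the digest bytes and keep the leading characters.

    The 26-character length is inferred from this page, not from a spec.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


full = pith_number_from_hash(CANONICAL_HASH)
aliases = {n: full[:n] for n in (8, 12, 16)}
print(full)     # L2VHQACIZP7GT2TXWP7FF5ABRZ
print(aliases)  # prefixes of length 8, 12, and 16
```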
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/L2VHQACIZP7GT2TXWP7FF5ABRZ \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 5eaa780048cbfe69ea77b3fe52f4018e4ccfb186efec30f7795f934cd756d5b8
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "2c7f02e414d1271b02b929d305adaf70bc07f6b89eec240f428b7c3baac3ed37",
    "cross_cats_sorted": [
      "cs.LG"
    ],
    "license": "http://creativecommons.org/licenses/by-nc-sa/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2024-06-28T17:59:01Z",
    "title_canon_sha256": "7ff707acefed87c3ecc5c9283dcc8d2a1c6be1b76f5e79bd4f5d5d552177ff6d"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2406.20094",
    "kind": "arxiv",
    "version": 3
  }
}
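
For an offline check, the canonicalization step from the shell pipeline above can be reproduced in a short script. This is a sketch that assumes the record shown here is exactly what the API returns under `canonical_record`; `canonical_sha256` is a name introduced for illustration, and the script reports whether the digest matches instead of assuming that it does.

```python
import hashlib
import json

# Canonical record copied from this page.
RECORD = {
    "metadata": {
        "abstract_canon_sha256": "2c7f02e414d1271b02b929d305adaf70bc07f6b89eec240f428b7c3baac3ed37",
        "cross_cats_sorted": ["cs.LG"],
        "license": "http://creativecommons.org/licenses/by-nc-sa/4.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2024-06-28T17:59:01Z",
        "title_canon_sha256": "7ff707acefed87c3ecc5c9283dcc8d2a1c6be1b76f5e79bd4f5d5d552177ff6d",
    },
    "schema_version": "1.0",
    "source": {"id": "2406.20094", "kind": "arxiv", "version": 3},
}

# Hash the page says to expect.
EXPECTED = "5eaa780048cbfe69ea77b3fe52f4018e4ccfb186efec30f7795f934cd756d5b8"


def canonical_sha256(record) -> str:
    """Serialize with sorted keys and compact separators, then hash as UTF-8."""
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


digest = canonical_sha256(RECORD)
print(digest)
print("matches page:", digest == EXPECTED)
```

Sorting keys and fixing the separators means the digest does not depend on how the JSON was pretty-printed or on the order in which keys were written.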