Pith Number

pith:QXYMIEQD

pith:2026:QXYMIEQDSE3D2RQXG2QGJED4FX
not attested · not anchored · not stored · refs pending

Self-Driving Datasets: From 20 Million Papers to Nuanced Biomedical Knowledge at Scale

Alden Rose, Cesar de la Fuente-Nunez, Haydn Jones, Jacob R. Gardner, Jiaming Liang, Kaiwen Wu, Li S. Yifei, Maggie Ziyu Huan, Mark Yatskar, Osbert Bastani, Yimeng Zeng, Yining Huang, Yoseph Barash, Zachary Ives

PubMed can be autonomously turned into structured biomedical datasets larger, more nuanced, and more accurate than the curated databases they replace.

arxiv:2605.07022 v2 · 2026-05-07 · cs.LG

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{QXYMIEQDSE3D2RQXG2QGJED4FX}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

PubMed itself can be autonomously and cost-effectively turned into structured datasets that are larger, more nuanced, and more accurate than the curated databases they replace.

C2 · weakest assumption

That frontier-model rejection rates on the extracted records are a reliable proxy for actual correctness, and that the multi-agent extraction process introduces no systematic biases that those checks miss.

C3 · one-line summary

Starling uses LLMs and agents to turn 22.5M PubMed papers into 6.3M nuanced structured records across six tasks with 0.6-7.7% frontier-model rejection rates, lower than error rates on existing curated databases.

Receipt and verification
First computed: 2026-05-20T00:02:12.599667Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05)
Schema: pith-number/v1.0

Canonical hash

85f0c4120391363d461736a064907c2dfd47c8b5b15737f9d1667bc8ee7640bd

Aliases

arxiv: 2605.07022 · arxiv_version: 2605.07022v2 · doi: 10.48550/arxiv.2605.07022 · pith_short_12: QXYMIEQDSE3D · pith_short_16: QXYMIEQDSE3D2RQX · pith_short_8: QXYMIEQD
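
The identifiers on this page appear to follow directly from the canonical hash: Base32-encoding the SHA-256 digest bytes (RFC 4648 alphabet) and keeping the first 26 characters reproduces this record's Pith Number, and the short aliases are its 8-, 12-, and 16-character prefixes. The sketch below checks that relationship. The rule is inferred from this single record rather than from a published specification, and the function names are illustrative.

```python
import base64

# Canonical hash as listed on this page.
CANONICAL_HASH = "85f0c4120391363d461736a064907c2dfd47c8b5b15737f9d1667bc8ee7640bd"


def pith_number_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the digest bytes and keep the first `length` characters.

    Assumption: this derivation is inferred from the record shown here,
    not taken from an official Pith specification.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


def short_aliases(pith_number: str) -> dict:
    """Return the prefix aliases (hypothetical helper; key names mirror the page)."""
    return {f"pith_short_{n}": pith_number[:n] for n in (8, 12, 16)}


pith_number = pith_number_from_hash(CANONICAL_HASH)
aliases = short_aliases(pith_number)
print(pith_number)
print(aliases)
```

If the inferred rule holds for other records, a client could recover the full identifier from the canonical hash alone, without a lookup.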
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/QXYMIEQDSE3D2RQXG2QGJED4FX \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 85f0c4120391363d461736a064907c2dfd47c8b5b15737f9d1667bc8ee7640bd
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "534fe76e0dfbf7c293edda4666f9e3c40e246050a72cfa8d9e2bd17bae455c6d",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2026-05-07T23:08:18Z",
    "title_canon_sha256": "4b83343cffc37ad48fdb397209199bc51669e3276667c6f46ce5b357c140b019"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.07022",
    "kind": "arxiv",
    "version": 2
  }
}
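
The shell one-liner above can also be written as a small Python function, which makes the canonicalization rule easier to read: keys sorted, no whitespace between tokens, non-ASCII characters left unescaped, UTF-8 encoded, then hashed with SHA-256. This is a sketch that assumes the record displayed on this page is exactly what the endpoint returns under `canonical_record`. The names `canonical_bytes` and `canonical_sha256` are illustrative and not part of any Pith API.

```python
import hashlib
import json

# Expected value as listed under "Canonical hash" on this page.
EXPECTED_HASH = "85f0c4120391363d461736a064907c2dfd47c8b5b15737f9d1667bc8ee7640bd"

# Canonical record copied from this page.
RECORD = {
    "metadata": {
        "abstract_canon_sha256": "534fe76e0dfbf7c293edda4666f9e3c40e246050a72cfa8d9e2bd17bae455c6d",
        "cross_cats_sorted": [],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.LG",
        "submitted_at": "2026-05-07T23:08:18Z",
        "title_canon_sha256": "4b83343cffc37ad48fdb397209199bc51669e3276667c6f46ce5b357c140b019",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.07022", "kind": "arxiv", "version": 2},
}


def canonical_bytes(record: dict) -> bytes:
    """Serialize with the same json.dumps options as the shell one-liner."""
    text = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return text.encode("utf-8")


def canonical_sha256(record: dict) -> str:
    """Hex SHA-256 of the canonical serialization."""
    return hashlib.sha256(canonical_bytes(record)).hexdigest()


digest = canonical_sha256(RECORD)
print(digest)
print("matches published hash:", digest == EXPECTED_HASH)
```

Because keys are sorted before hashing, the digest does not depend on the order in which a mirror stores or serves the fields.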