pith:QXYMIEQD
Self-Driving Datasets: From 20 Million Papers to Nuanced Biomedical Knowledge at Scale
PubMed can be autonomously turned into structured biomedical datasets larger, more nuanced, and more accurate than the curated databases they replace.
arxiv:2605.07022 v2 · 2026-05-07 · cs.LG
Add to your LaTeX paper
\usepackage{pith}
\pithnumber{QXYMIEQDSE3D2RQXG2QGJED4FX}
Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.
Claims
PubMed itself can be autonomously and cost-effectively turned into structured datasets that are larger, more nuanced, and more accurate than the curated databases they replace.
These claims rest on the assumption that frontier-model rejection rates on the extracted records are a reliable proxy for actual correctness, and that the multi-agent extraction process introduces no systematic biases that those checks miss.
Starling uses LLMs and agents to turn 22.5M PubMed papers into 6.3M nuanced structured records across six tasks, with frontier-model rejection rates of 0.6% to 7.7%, lower than the error rates of existing curated databases.
Receipt and verification
| Field | Value |
|---|---|
| First computed | 2026-05-20T00:02:12.599667Z |
| Builder | pith-number-builder-2026-05-17-v1 |
| Signature | Pith Ed25519 (pith-v1-2026-05) |
| Schema | pith-number/v1.0 |
Canonical hash
85f0c4120391363d461736a064907c2dfd47c8b5b15737f9d1667bc8ee7640bd
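The Pith Number on this page appears to be derived from the canonical hash. Base32-encoding the SHA-256 digest above (RFC 4648 alphabet) and keeping the first 26 characters reproduces `QXYMIEQDSE3D2RQXG2QGJED4FX`, and the first 8 characters give the short form `QXYMIEQD`. This is an observation from this one record, not a documented rule, and the helper name `pith_number_from_hash` is hypothetical.

```python
import base64

# Canonical hash exactly as printed on this record.
CANONICAL_HASH = "85f0c4120391363d461736a064907c2dfd47c8b5b15737f9d1667bc8ee7640bd"


def pith_number_from_hash(hex_digest: str, length: int = 26) -> str:
    """Hypothetical helper: Base32-encode a hex digest and keep a prefix.

    Assumption: the identifier is a truncated RFC 4648 Base32 encoding of
    the SHA-256 digest. This matches the record shown here but is not
    confirmed by the source as the general scheme.
    """
    raw = bytes.fromhex(hex_digest)
    return base64.b32encode(raw).decode("ascii")[:length]


print(pith_number_from_hash(CANONICAL_HASH))     # QXYMIEQDSE3D2RQXG2QGJED4FX
print(pith_number_from_hash(CANONICAL_HASH, 8))  # QXYMIEQD
```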
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/QXYMIEQDSE3D2RQXG2QGJED4FX \
| jq -c '.canonical_record' \
| python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 85f0c4120391363d461736a064907c2dfd47c8b5b15737f9d1667bc8ee7640bd
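The same check can be sketched as a small Python module, which may be easier to reuse than the one-liner. It mirrors the serialization used in the command above (sorted keys, compact separators, `ensure_ascii=False`, UTF-8 bytes). The function names `canonical_sha256` and `matches` are illustrative, not part of any Pith API, and fetching the record is left to the caller.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Hash a record the way the shell pipeline does.

    Keys are sorted and whitespace is removed so that two logically equal
    records always serialize to the same bytes before hashing.
    """
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def matches(record: dict, expected_hex: str) -> bool:
    """Return True if the record hashes to the expected hex digest."""
    return canonical_sha256(record) == expected_hex.strip().lower()


# Key order in the input does not affect the digest.
a = {"source": {"kind": "arxiv", "id": "2605.07022"}, "schema_version": "1.0"}
b = {"schema_version": "1.0", "source": {"id": "2605.07022", "kind": "arxiv"}}
print(canonical_sha256(a) == canonical_sha256(b))  # True
```

To verify this record, pass the parsed `canonical_record` object and the canonical hash printed on this page to `matches`.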
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "534fe76e0dfbf7c293edda4666f9e3c40e246050a72cfa8d9e2bd17bae455c6d",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2026-05-07T23:08:18Z",
    "title_canon_sha256": "4b83343cffc37ad48fdb397209199bc51669e3276667c6f46ce5b357c140b019"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.07022",
    "kind": "arxiv",
    "version": 2
  }
}