{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2024:L2VHQACIZP7GT2TXWP7FF5ABRZ","short_pith_number":"pith:L2VHQACI","schema_version":"1.0","canonical_sha256":"5eaa780048cbfe69ea77b3fe52f4018e4ccfb186efec30f7795f934cd756d5b8","source":{"kind":"arxiv","id":"2406.20094","version":3},"attestation_state":"computed","paper":{"title":"Scaling Synthetic Data Creation with 1,000,000,000 Personas","license":"http://creativecommons.org/licenses/by-nc-sa/4.0/","headline":"A hub of one billion web-curated personas lets an LLM generate diverse synthetic data across math, instructions, knowledge texts, NPCs, and tools at scale.","cross_cats":["cs.LG"],"primary_cat":"cs.CL","authors_text":"Dian Yu, Dong Yu, Haitao Mi, Tao Ge, Xiaoyang Wang, Xin Chan","submitted_at":"2024-06-28T17:59:01Z","abstract_excerpt":"We propose a novel persona-driven data synthesis methodology that leverages various perspectives within a large language model (LLM) to create diverse synthetic data. To fully exploit this methodology at scale, we introduce Persona Hub -- a collection of 1 billion diverse personas automatically curated from web data. These 1 billion personas (~13% of the world's total population), acting as distributed carriers of world knowledge, can tap into almost every perspective encapsulated within the LLM, thereby facilitating the creation of diverse synthetic data at scale for various scenarios. By sho"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2406.20094","kind":"arxiv","version":3},"metadata":{"license":"http://creativecommons.org/licenses/by-nc-sa/4.0/","primary_cat":"cs.CL","submitted_at":"2024-06-28T17:59:01Z","cross_cats_sorted":["cs.LG"],"title_canon_sha256":"7ff707acefed87c3ecc5c9283dcc8d2a1c6be1b76f5e79bd4f5d5d552177ff6d","abstract_canon_sha256":"2c7f02e414d1271b02b929d305adaf70bc07f6b89eec240f428b7c3baac3ed37"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:49.677274Z","signature_b64":"MEuukhOYJ3Aq0DO3U5V0J/9p8dqAQxRhHf1Col0XgsV8V+wWr2lZuSVZBqcf2nXI8AigAGJVpua4H0P7laS8BQ==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"5eaa780048cbfe69ea77b3fe52f4018e4ccfb186efec30f7795f934cd756d5b8","last_reissued_at":"2026-05-17T23:38:49.676680Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:49.676680Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Scaling Synthetic Data Creation with 1,000,000,000 Personas","license":"http://creativecommons.org/licenses/by-nc-sa/4.0/","headline":"A hub of one billion web-curated personas lets an LLM generate diverse synthetic data across math, instructions, knowledge texts, NPCs, and tools at scale.","cross_cats":["cs.LG"],"primary_cat":"cs.CL","authors_text":"Dian Yu, Dong Yu, Haitao Mi, Tao Ge, Xiaoyang Wang, Xin Chan","submitted_at":"2024-06-28T17:59:01Z","abstract_excerpt":"We propose a novel persona-driven data synthesis methodology that leverages various perspectives within a large language model (LLM) to create diverse synthetic data. To fully exploit this methodology at scale, we introduce Persona Hub -- a collection of 1 billion diverse personas automatically curated from web data. These 1 billion personas (~13% of the world's total population), acting as distributed carriers of world knowledge, can tap into almost every perspective encapsulated within the LLM, thereby facilitating the creation of diverse synthetic data at scale for various scenarios. By sho"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"These 1 billion personas (~13% of the world's total population), acting as distributed carriers of world knowledge, can tap into almost every perspective encapsulated within the LLM, thereby facilitating the creation of diverse synthetic data at scale for various scenarios.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That automatically curated web personas are sufficiently diverse, unbiased, and faithfully simulable by the LLM without introducing repetition or hallucinated perspectives that degrade data quality.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"A curated set of one billion personas enables scalable, diverse synthetic data generation for LLM training across reasoning, instructions, knowledge, NPCs, and tools.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A hub of one billion web-curated personas lets an LLM generate diverse synthetic data across math, instructions, knowledge texts, NPCs, and tools at scale.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"6d4d700fdf7c88394be83ca0ef9bf75ad16019307c0c60de7394cd420b3b8ade"},"source":{"id":"2406.20094","kind":"arxiv","version":3},"verdict":{"id":"4abaef22-c6d5-45a3-b733-2fe6de6760a5","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T00:00:27.724368Z","strongest_claim":"These 1 billion personas (~13% of the world's total population), acting as distributed carriers of world knowledge, can tap into almost every perspective encapsulated within the LLM, thereby facilitating the creation of diverse synthetic data at scale for various scenarios.","one_line_summary":"A curated set of one billion personas enables scalable, diverse synthetic data generation for LLM training across reasoning, instructions, knowledge, NPCs, and tools.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That automatically curated web personas are sufficiently diverse, unbiased, and faithfully simulable by the LLM without introducing repetition or hallucinated perspectives that degrade data quality.","pith_extraction_headline":"A hub of one billion web-curated personas lets an LLM generate diverse synthetic data across math, instructions, knowledge texts, NPCs, and tools at scale."},"references":{"count":29,"sample":[{"doi":"","year":null,"title":"Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone","work_id":"feef9556-a016-493c-abd2-0c97a23a7ebf","ref_index":1,"cited_arxiv_id":"2404.14219","is_internal_anchor":true},{"doi":"","year":null,"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","ref_index":2,"cited_arxiv_id":"2303.08774","is_internal_anchor":true},{"doi":"","year":null,"title":"Coig-cqia: Quality is all you need for chinese instruction fine-tuning","work_id":"4d58f189-c89b-4190-b567-6fc2fba5e0b7","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"arXiv preprint arXiv:2401.02524 , year=","work_id":"7c3b201d-cc93-4fcb-bcb6-2583f29edbbf","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"DeepSeek LLM: Scaling Open-Source Language Models with Longtermism","work_id":"01b10587-025b-499d-8ba3-7a538d24c2d6","ref_index":5,"cited_arxiv_id":"2401.02954","is_internal_anchor":true}],"resolved_work":29,"snapshot_sha256":"f5adcbed9c271f68d69e260d12c3db9a3e38a5d9b74a51eb3b9dc8255144b34d","internal_anchors":12},"formal_canon":{"evidence_count":3,"snapshot_sha256":"3a102509548c9d31ff3da7862f6a216bcbc51b936ee73aca69f74f4950069157"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2406.20094","created_at":"2026-05-17T23:38:49.676799+00:00"},{"alias_kind":"arxiv_version","alias_value":"2406.20094v3","created_at":"2026-05-17T23:38:49.676799+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2406.20094","created_at":"2026-05-17T23:38:49.676799+00:00"},{"alias_kind":"pith_short_12","alias_value":"L2VHQACIZP7G","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"L2VHQACIZP7GT2TX","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"L2VHQACI","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":26,"internal_anchor_count":26,"sample":[{"citing_arxiv_id":"2504.20605","citing_title":"TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models","ref_index":31,"is_internal_anchor":true},{"citing_arxiv_id":"2508.18167","citing_title":"DiscussLLM: Teaching Large Language Models When to Speak","ref_index":35,"is_internal_anchor":true},{"citing_arxiv_id":"2602.08167","citing_title":"Self-Supervised Bootstrapping of Action-Predictive Embodied Reasoning","ref_index":62,"is_internal_anchor":true},{"citing_arxiv_id":"2605.20876","citing_title":"Terminal-World: Scaling Terminal-Agent Environments via Agent Skills","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2508.00414","citing_title":"Cognitive Kernel-Pro: A Framework for Deep Research Agents and Agent Foundation Models Training","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2511.08565","citing_title":"Moral Susceptibility and Robustness under Persona Role-Play in Large Language Models","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2604.06204","citing_title":"SensorPersona: An LLM-Empowered System for Continual Persona Extraction from Longitudinal Mobile Sensor Streams","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2604.13074","citing_title":"PersonaVLM: Long-Term Personalized Multimodal LLMs","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09530","citing_title":"MemPrivacy: Privacy-Preserving Personalized Memory Management for Edge-Cloud Agents","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2409.17146","citing_title":"Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models","ref_index":35,"is_internal_anchor":true},{"citing_arxiv_id":"2603.24326","citing_title":"Boosting Document Parsing Efficiency and Performance with Coarse-to-Fine Visual Processing","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12894","citing_title":"Beyond Cooperative Simulators: Generating Realistic User Personas for Robust Evaluation of LLM Agents","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2604.02522","citing_title":"Opal: Private Memory for Personal AI","ref_index":81,"is_internal_anchor":true},{"citing_arxiv_id":"2604.02794","citing_title":"CharTool: Tool-Integrated Visual Reasoning for Chart Understanding","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2502.02737","citing_title":"SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model","ref_index":173,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09530","citing_title":"MemPrivacy: Privacy-Preserving Personalized Memory Management for Edge-Cloud Agents","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09808","citing_title":"Quantifying the Utility of User Simulators for Building Collaborative LLM Assistants","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09530","citing_title":"MemPrivacy: Privacy-Preserving Personalized Memory Management for Edge-Cloud Agents","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08766","citing_title":"UserGPT Technical Report","ref_index":62,"is_internal_anchor":true},{"citing_arxiv_id":"2605.04018","citing_title":"Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2605.03547","citing_title":"Erase Persona, Forget Lore: Benchmarking Multimodal Copyright Unlearning in Large Vision Language Models","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2604.10733","citing_title":"Too Nice to Tell the Truth: Quantifying Agreeableness-Driven Sycophancy in Role-Playing Language Models","ref_index":18,"is_internal_anchor":true},{"citing_arxiv_id":"2604.08516","citing_title":"MolmoWeb: Open Visual Web Agent and Open Data for the Open Web","ref_index":73,"is_internal_anchor":true},{"citing_arxiv_id":"2605.07699","citing_title":"DRIP-R: A Benchmark for Decision-Making and Reasoning Under Real-World Policy Ambiguity in the Retail Domain","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2506.05176","citing_title":"Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models","ref_index":3,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":3,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/L2VHQACIZP7GT2TXWP7FF5ABRZ","json":"https://pith.science/pith/L2VHQACIZP7GT2TXWP7FF5ABRZ.json","graph_json":"https://pith.science/api/pith-number/L2VHQACIZP7GT2TXWP7FF5ABRZ/graph.json","events_json":"https://pith.science/api/pith-number/L2VHQACIZP7GT2TXWP7FF5ABRZ/events.json","paper":"https://pith.science/paper/L2VHQACI"},"agent_actions":{"view_html":"https://pith.science/pith/L2VHQACIZP7GT2TXWP7FF5ABRZ","download_json":"https://pith.science/pith/L2VHQACIZP7GT2TXWP7FF5ABRZ.json","view_paper":"https://pith.science/paper/L2VHQACI","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2406.20094&json=true","fetch_graph":"https://pith.science/api/pith-number/L2VHQACIZP7GT2TXWP7FF5ABRZ/graph.json","fetch_events":"https://pith.science/api/pith-number/L2VHQACIZP7GT2TXWP7FF5ABRZ/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/L2VHQACIZP7GT2TXWP7FF5ABRZ/action/timestamp_anchor","attest_storage":"https://pith.science/pith/L2VHQACIZP7GT2TXWP7FF5ABRZ/action/storage_attestation","attest_author":"https://pith.science/pith/L2VHQACIZP7GT2TXWP7FF5ABRZ/action/author_attestation","sign_citation":"https://pith.science/pith/L2VHQACIZP7GT2TXWP7FF5ABRZ/action/citation_signature","submit_replication":"https://pith.science/pith/L2VHQACIZP7GT2TXWP7FF5ABRZ/action/replication_record"}},"created_at":"2026-05-17T23:38:49.676799+00:00","updated_at":"2026-05-17T23:38:49.676799+00:00"}