{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2024:PNR7WYFDY56BHRSVDE7CYNLHTD","short_pith_number":"pith:PNR7WYFD","schema_version":"1.0","canonical_sha256":"7b63fb60a3c77c13c655193e2c356798d5009e3bb0cd862eebc10e7ca5dd0fcf","source":{"kind":"arxiv","id":"2401.01335","version":3},"attestation_state":"computed","paper":{"title":"Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Self-play fine-tuning turns a weak supervised LLM into a strong one by iteratively contrasting its own generations against fixed human data.","cross_cats":["cs.AI","cs.CL","stat.ML"],"primary_cat":"cs.LG","authors_text":"Huizhuo Yuan, Kaixuan Ji, Quanquan Gu, Yihe Deng, Zixiang Chen","submitted_at":"2024-01-02T18:53:13Z","abstract_excerpt":"Harnessing the power of human-annotated data through Supervised Fine-Tuning (SFT) is pivotal for advancing Large Language Models (LLMs). In this paper, we delve into the prospect of growing a strong LLM out of a weak one without the need for acquiring additional human-annotated data. We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN), which starts from a supervised fine-tuned model. At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself. More specifically, the LLM generates its own training data from its"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2401.01335","kind":"arxiv","version":3},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.LG","submitted_at":"2024-01-02T18:53:13Z","cross_cats_sorted":["cs.AI","cs.CL","stat.ML"],"title_canon_sha256":"2f69f69cbc581696e830d29dd6d32aeed783be8aefed4b103ddfce31006cb938","abstract_canon_sha256":"925cc3c9884b19ea31170356b7ee90c6ebd9eec1148b0fe5e311970cc28cec29"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:39:21.380819Z","signature_b64":"vdnVIMlMaZ1abElcz96sIdaCMA0pS+S5a+4RrB+IUN81qRm7+LAiAlNhqU7BAw/gJeKCxe/idKAat5ptYbiFDw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"7b63fb60a3c77c13c655193e2c356798d5009e3bb0cd862eebc10e7ca5dd0fcf","last_reissued_at":"2026-05-17T23:39:21.380083Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:39:21.380083Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Self-play fine-tuning turns a weak supervised LLM into a strong one by iteratively contrasting its own generations against fixed human data.","cross_cats":["cs.AI","cs.CL","stat.ML"],"primary_cat":"cs.LG","authors_text":"Huizhuo Yuan, Kaixuan Ji, Quanquan Gu, Yihe Deng, Zixiang Chen","submitted_at":"2024-01-02T18:53:13Z","abstract_excerpt":"Harnessing the power of human-annotated data through Supervised Fine-Tuning (SFT) is pivotal for advancing Large Language Models (LLMs). In this paper, we delve into the prospect of growing a strong LLM out of a weak one without the need for acquiring additional human-annotated data. We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN), which starts from a supervised fine-tuned model. At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself. More specifically, the LLM generates its own training data from its"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"The global optimum to the training objective function of our method is achieved only when the LLM policy aligns with the target data distribution. Empirically, SPIN can significantly improve the LLM's performance across a variety of benchmarks and even outperform models trained through direct preference optimization (DPO) supplemented with extra GPT-4 preference data.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the self-generated responses from earlier model iterations provide useful contrastive signals without introducing persistent biases or distribution shifts that would prevent steady improvement toward the human data distribution.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"SPIN lets weak LLMs become strong by self-generating training data from previous model versions and training to prefer human-annotated responses over its own outputs, outperforming DPO even with extra GPT-4 data on benchmarks.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Self-play fine-tuning turns a weak supervised LLM into a strong one by iteratively contrasting its own generations against fixed human data.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"1fb24955f25072ffaec90ebea7881b857f408f4952d66d46e44d10a14e891512"},"source":{"id":"2401.01335","kind":"arxiv","version":3},"verdict":{"id":"ec21d135-e217-45c5-99a8-e10bd1e22e20","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-14T22:55:44.631426Z","strongest_claim":"The global optimum to the training objective function of our method is achieved only when the LLM policy aligns with the target data distribution. Empirically, SPIN can significantly improve the LLM's performance across a variety of benchmarks and even outperform models trained through direct preference optimization (DPO) supplemented with extra GPT-4 preference data.","one_line_summary":"SPIN lets weak LLMs become strong by self-generating training data from previous model versions and training to prefer human-annotated responses over its own outputs, outperforming DPO even with extra GPT-4 data on benchmarks.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the self-generated responses from earlier model iterations provide useful contrastive signals without introducing persistent biases or distribution shifts that would prevent steady improvement toward the human data distribution.","pith_extraction_headline":"Self-play fine-tuning turns a weak supervised LLM into a strong one by iteratively contrasting its own generations against fixed human data."},"references":{"count":300,"sample":[{"doi":"","year":null,"title":"arXiv preprint arXiv:2306.05268 , year=","work_id":"e13e1f36-db48-4928-a01c-93be2a7c0380","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":1909,"title":"Fine-Tuning Language Models from Human Preferences","work_id":"4f54aad1-f3b6-404f-b9c7-e21ba0a33b99","ref_index":2,"cited_arxiv_id":"1909.08593","is_internal_anchor":true},{"doi":"","year":null,"title":"Self-Rewarding Language Models","work_id":"b3903c9e-1bc7-4267-a171-6311902be2a4","ref_index":3,"cited_arxiv_id":"2401.10020","is_internal_anchor":true},{"doi":"","year":null,"title":"RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback","work_id":"81d8781d-2933-4e89-97ee-9bbfc6d4ca0c","ref_index":4,"cited_arxiv_id":"2309.00267","is_internal_anchor":true},{"doi":"","year":null,"title":"Advances in Neural Information Processing Systems , volume=","work_id":"b202dcb7-0590-4a1b-b41f-72cd5085cc57","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":300,"snapshot_sha256":"09a3c73f2d8dd24982478fab3c5cdd113394c2139b01ac03a902ec4965797f8e","internal_anchors":47},"formal_canon":{"evidence_count":2,"snapshot_sha256":"c2e3df07e74a1dbc4ca1b89fc8d9b27213935534b8c2019da09bb7939cc25fc2"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2401.01335","created_at":"2026-05-17T23:39:21.380204+00:00"},{"alias_kind":"arxiv_version","alias_value":"2401.01335v3","created_at":"2026-05-17T23:39:21.380204+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2401.01335","created_at":"2026-05-17T23:39:21.380204+00:00"},{"alias_kind":"pith_short_12","alias_value":"PNR7WYFDY56B","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"PNR7WYFDY56BHRSV","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"PNR7WYFD","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":34,"internal_anchor_count":34,"sample":[{"citing_arxiv_id":"2504.13898","citing_title":"Social Human Robot Embodied Conversation (SHREC) Dataset: Benchmarking Foundational Models' Social Reasoning","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2505.07527","citing_title":"Kalman Filter Enhanced GRPO for Reinforcement Learning-Based Language Model Reasoning","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2605.16299","citing_title":"ACE: Self-Evolving LLM Coding Framework via Adversarial Unit Test Generation and Preference Optimization","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2605.21931","citing_title":"EvoVid: Temporal-Centric Self-Evolution for Video Large Language Models","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2509.03526","citing_title":"Enhancing Speech Large Language Models through Reinforced Behavior Alignment","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2605.16299","citing_title":"ACE: Self-Evolving LLM Coding Framework via Adversarial Unit Test Generation and Preference Optimization","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2604.15034","citing_title":"Autogenesis: A Self-Evolving Agent Protocol","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2605.15113","citing_title":"Learning from Language Feedback via Variational Policy Distillation","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2509.07177","citing_title":"Towards EnergyGPT: A Large Language Model Specialized for the Energy Sector","ref_index":27,"is_internal_anchor":true},{"citing_arxiv_id":"2509.23102","citing_title":"Multiplayer Nash Preference Optimization","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2409.12917","citing_title":"Training Language Models to Self-Correct via Reinforcement Learning","ref_index":105,"is_internal_anchor":true},{"citing_arxiv_id":"2503.12937","citing_title":"R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2603.00918","citing_title":"Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2507.21046","citing_title":"A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence","ref_index":187,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12741","citing_title":"Learning with Rare Success but Rich Feedback via Reflection-Enhanced Self-Distillation","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11679","citing_title":"Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion","ref_index":61,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13643","citing_title":"Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2604.03472","citing_title":"Vocabulary Dropout for Curriculum Diversity in LLM Co-Evolution","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2401.10020","citing_title":"Self-Rewarding Language Models","ref_index":91,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11679","citing_title":"Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion","ref_index":61,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11636","citing_title":"Seir\\^enes: Adversarial Self-Play with Evolving Distractions for LLM Reasoning","ref_index":61,"is_internal_anchor":true},{"citing_arxiv_id":"2402.01306","citing_title":"KTO: Model Alignment as Prospect Theoretic Optimization","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08703","citing_title":"RewardHarness: Self-Evolving Agentic Post-Training","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09959","citing_title":"G-Zero: Self-Play for Open-Ended Generation from Zero Data","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09214","citing_title":"Fast Rates for Offline Contextual Bandits with Forward-KL Regularization under Single-Policy Concentrability","ref_index":41,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/PNR7WYFDY56BHRSVDE7CYNLHTD","json":"https://pith.science/pith/PNR7WYFDY56BHRSVDE7CYNLHTD.json","graph_json":"https://pith.science/api/pith-number/PNR7WYFDY56BHRSVDE7CYNLHTD/graph.json","events_json":"https://pith.science/api/pith-number/PNR7WYFDY56BHRSVDE7CYNLHTD/events.json","paper":"https://pith.science/paper/PNR7WYFD"},"agent_actions":{"view_html":"https://pith.science/pith/PNR7WYFDY56BHRSVDE7CYNLHTD","download_json":"https://pith.science/pith/PNR7WYFDY56BHRSVDE7CYNLHTD.json","view_paper":"https://pith.science/paper/PNR7WYFD","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2401.01335&json=true","fetch_graph":"https://pith.science/api/pith-number/PNR7WYFDY56BHRSVDE7CYNLHTD/graph.json","fetch_events":"https://pith.science/api/pith-number/PNR7WYFDY56BHRSVDE7CYNLHTD/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/PNR7WYFDY56BHRSVDE7CYNLHTD/action/timestamp_anchor","attest_storage":"https://pith.science/pith/PNR7WYFDY56BHRSVDE7CYNLHTD/action/storage_attestation","attest_author":"https://pith.science/pith/PNR7WYFDY56BHRSVDE7CYNLHTD/action/author_attestation","sign_citation":"https://pith.science/pith/PNR7WYFDY56BHRSVDE7CYNLHTD/action/citation_signature","submit_replication":"https://pith.science/pith/PNR7WYFDY56BHRSVDE7CYNLHTD/action/replication_record"}},"created_at":"2026-05-17T23:39:21.380204+00:00","updated_at":"2026-05-17T23:39:21.380204+00:00"}