{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2025:O6KD46D5N4AYG3WDJMRUPETDND","short_pith_number":"pith:O6KD46D5","schema_version":"1.0","canonical_sha256":"77943e787d6f01836ec34b2347926368c1e8d65c863c19e419cb78384dbde901","source":{"kind":"arxiv","id":"2507.01352","version":3},"attestation_state":"computed","paper":{"title":"Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Human-AI synergy curates 40 million preference pairs to train state-of-the-art reward models.","cross_cats":["cs.AI","cs.LG"],"primary_cat":"cs.CL","authors_text":"Chaojie Wang, Chris Yuhao Liu, Fuxiang Zhang, Jiacai Liu, Jiacheng Xu, Jujie He, Liang Zeng, Rui Yan, Wei Shen, Yahui Zhou, Yang Liu, Yuzhen Xiao","submitted_at":"2025-07-02T04:40:29Z","abstract_excerpt":"Despite the critical role of reward models (RMs) in Reinforcement Learning from Human Feedback (RLHF), current state-of-the-art open RMs perform poorly on most existing evaluation benchmarks, failing to capture nuanced human preferences. We hypothesize that this brittleness stems primarily from limitations in preference datasets, which are often narrowly scoped, synthetically labeled, or lack rigorous quality control. To address these challenges, we present SynPref-40M, a large-scale preference dataset comprising 40 million preference pairs. To enable data curation at scale, we design a human-"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2507.01352","kind":"arxiv","version":3},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.CL","submitted_at":"2025-07-02T04:40:29Z","cross_cats_sorted":["cs.AI","cs.LG"],"title_canon_sha256":"6f033713f4cab0f8de5ba8bf4b556e999b6b3850ba518e493c47fec4a93cd745","abstract_canon_sha256":"eed245eae4f5619efa15a39473bb27c79db95695756a3e819b98bcf16f934774"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:46.414238Z","signature_b64":"5FLtFz7X9nonuDgoiFFjgHVkm3Vuig40Fq9VPtamdr3JmY1X003kPK4qHQSvuhkWYs1bf52xlCZdOIqC5UmDAw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"77943e787d6f01836ec34b2347926368c1e8d65c863c19e419cb78384dbde901","last_reissued_at":"2026-05-17T23:38:46.413654Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:46.413654Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Human-AI synergy curates 40 million preference pairs to train state-of-the-art reward models.","cross_cats":["cs.AI","cs.LG"],"primary_cat":"cs.CL","authors_text":"Chaojie Wang, Chris Yuhao Liu, Fuxiang Zhang, Jiacai Liu, Jiacheng Xu, Jujie He, Liang Zeng, Rui Yan, Wei Shen, Yahui Zhou, Yang Liu, Yuzhen Xiao","submitted_at":"2025-07-02T04:40:29Z","abstract_excerpt":"Despite the critical role of reward models (RMs) in Reinforcement Learning from Human Feedback (RLHF), current state-of-the-art open RMs perform poorly on most existing evaluation benchmarks, failing to capture nuanced human preferences. We hypothesize that this brittleness stems primarily from limitations in preference datasets, which are often narrowly scoped, synthetically labeled, or lack rigorous quality control. To address these challenges, we present SynPref-40M, a large-scale preference dataset comprising 40 million preference pairs. To enable data curation at scale, we design a human-"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Skywork-Reward-V2 models achieve state-of-the-art performance across seven major reward model benchmarks, outperform generative reward models, and demonstrate strong downstream performance.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The brittleness of current reward models stems primarily from limitations in preference datasets, and the human-AI synergistic pipeline produces measurably higher-quality data that directly causes the reported benchmark gains.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Skywork-Reward-V2 models trained on 26 million human-AI curated preference pairs set new state-of-the-art results on seven major reward model benchmarks.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Human-AI synergy curates 40 million preference pairs to train state-of-the-art reward models.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"99b168e6256d8384e58df7bf6887432d8156cff65909acd93761ad687a084c27"},"source":{"id":"2507.01352","kind":"arxiv","version":3},"verdict":{"id":"80bb961c-ece1-4f6d-bbb6-d119aab4c401","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T22:13:47.494891Z","strongest_claim":"Skywork-Reward-V2 models achieve state-of-the-art performance across seven major reward model benchmarks, outperform generative reward models, and demonstrate strong downstream performance.","one_line_summary":"Skywork-Reward-V2 models trained on 26 million human-AI curated preference pairs set new state-of-the-art results on seven major reward model benchmarks.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The brittleness of current reward models stems primarily from limitations in preference datasets, and the human-AI synergistic pipeline produces measurably higher-quality data that directly causes the reported benchmark gains.","pith_extraction_headline":"Human-AI synergy curates 40 million preference pairs to train state-of-the-art reward models."},"references":{"count":13,"sample":[{"doi":"","year":2023,"title":"Most BT-based models fall under the sequence classifier category, while generative models primarily include LLM-as-a-Judge approaches","work_id":"112ea89c-bff6-4bbb-bfef-157919436666","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"This stratification identifies objective/low-controversial versus subjective/high- controversial regions, where intransitivity is more common","work_id":"843045db-9b8b-47d0-8b30-a624baed982e","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Error-driven adaptive retrieval focuses on “unstable” regions.In Stage 1, we repeatedly train an RM, evaluate it on human-verified gold data, and use error-driven adaptive retrieval to pull in new exa","work_id":"bc2f160c-bde8-43f8-93e1-af92ec24f101","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Stage 2 dual-RM consistency filtering targets contradictory signals.Stage 2 introduces a consistency filter: we train a gold RM on cumulative human-verified samples and use it together with the Stage-","work_id":"5f8ba03c-8b02-45b9-942b-57e2c763faa3","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Human annotators may not be experts in all types of math and coding problems","work_id":"d72fb68a-3e6b-484f-8884-df2369714391","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":13,"snapshot_sha256":"f3ae2e6a6817d99eb2b25ae3606bec9c875e44c35f925c22882207ae6f7c6400","internal_anchors":0},"formal_canon":{"evidence_count":2,"snapshot_sha256":"f281721ff41d90b9b9cf489b4666903c8c846284d674d1eb3a0ccaeed1e8de96"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2507.01352","created_at":"2026-05-17T23:38:46.413730+00:00"},{"alias_kind":"arxiv_version","alias_value":"2507.01352v3","created_at":"2026-05-17T23:38:46.413730+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2507.01352","created_at":"2026-05-17T23:38:46.413730+00:00"},{"alias_kind":"pith_short_12","alias_value":"O6KD46D5N4AY","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"O6KD46D5N4AYG3WD","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"O6KD46D5","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":22,"internal_anchor_count":22,"sample":[{"citing_arxiv_id":"2605.20473","citing_title":"Code Generation by Differential Test Time Scaling","ref_index":70,"is_internal_anchor":true},{"citing_arxiv_id":"2605.16339","citing_title":"Preference Instability in Reward Models: Detection and Mitigation via Sparse Autoencoders","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2605.15464","citing_title":"GRLO: Towards Generalizable Reinforcement Learning in Open-Ended Environments from Zero","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12667","citing_title":"ODRPO: Ordinal Decompositions of Discrete Rewards for Robust Policy Optimization","ref_index":25,"is_internal_anchor":true},{"citing_arxiv_id":"2510.23868","citing_title":"GIFT: Group-Relative Implicit Fine-Tuning Integrates GRPO with DPO and UNA","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2601.02535","citing_title":"ModeX: Evaluator-Free Best-of-N Selection for Open-Ended Generation","ref_index":43,"is_internal_anchor":true},{"citing_arxiv_id":"2602.12125","citing_title":"Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14186","citing_title":"LLMs Know When They Know, but Do Not Act on It: A Metacognitive Harness for Test-time Scaling","ref_index":33,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14098","citing_title":"Pause and Reflect: Conformal Aggregation for Chain-of-Thought Reasoning","ref_index":60,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12667","citing_title":"ODRPO: Ordinal Decompositions of Discrete Rewards for Robust Policy Optimization","ref_index":25,"is_internal_anchor":true},{"citing_arxiv_id":"2604.02686","citing_title":"Beyond Semantic Manipulation: Token-Space Attacks on Reward Models","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2604.02766","citing_title":"Random Is Hard to Beat: Active Selection in online DPO with Modern LLMs","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2604.26644","citing_title":"When to Vote, When to Rewrite: Disagreement-Guided Strategy Routing for Test-Time Scaling","ref_index":25,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08472","citing_title":"Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models","ref_index":29,"is_internal_anchor":true},{"citing_arxiv_id":"2604.25872","citing_title":"When Errors Can Be Beneficial: A Categorization of Imperfect Rewards for Policy Gradient","ref_index":43,"is_internal_anchor":true},{"citing_arxiv_id":"2512.13564","citing_title":"Memory in the Age of AI Agents","ref_index":124,"is_internal_anchor":true},{"citing_arxiv_id":"2604.19544","citing_title":"DT2IT-MRM: Debiased Preference Construction and Iterative Training for Multimodal Reward Modeling","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2604.18176","citing_title":"QuantumQA: Enhancing Scientific Reasoning via Physics-Consistent Dataset and Verification-Aware Reinforcement Learning","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2605.07461","citing_title":"Think-with-Rubrics: From External Evaluator to Internal Reasoning Guidance","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2604.15602","citing_title":"GroupDPO: Memory efficient Group-wise Direct Preference Optimization","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2604.16004","citing_title":"AgentV-RL: Scaling Reward Modeling with Agentic Verifier","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2604.17501","citing_title":"CoAct: Co-Active LLM Preference Learning with Human-AI Synergy","ref_index":4,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/O6KD46D5N4AYG3WDJMRUPETDND","json":"https://pith.science/pith/O6KD46D5N4AYG3WDJMRUPETDND.json","graph_json":"https://pith.science/api/pith-number/O6KD46D5N4AYG3WDJMRUPETDND/graph.json","events_json":"https://pith.science/api/pith-number/O6KD46D5N4AYG3WDJMRUPETDND/events.json","paper":"https://pith.science/paper/O6KD46D5"},"agent_actions":{"view_html":"https://pith.science/pith/O6KD46D5N4AYG3WDJMRUPETDND","download_json":"https://pith.science/pith/O6KD46D5N4AYG3WDJMRUPETDND.json","view_paper":"https://pith.science/paper/O6KD46D5","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2507.01352&json=true","fetch_graph":"https://pith.science/api/pith-number/O6KD46D5N4AYG3WDJMRUPETDND/graph.json","fetch_events":"https://pith.science/api/pith-number/O6KD46D5N4AYG3WDJMRUPETDND/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/O6KD46D5N4AYG3WDJMRUPETDND/action/timestamp_anchor","attest_storage":"https://pith.science/pith/O6KD46D5N4AYG3WDJMRUPETDND/action/storage_attestation","attest_author":"https://pith.science/pith/O6KD46D5N4AYG3WDJMRUPETDND/action/author_attestation","sign_citation":"https://pith.science/pith/O6KD46D5N4AYG3WDJMRUPETDND/action/citation_signature","submit_replication":"https://pith.science/pith/O6KD46D5N4AYG3WDJMRUPETDND/action/replication_record"}},"created_at":"2026-05-17T23:38:46.413730+00:00","updated_at":"2026-05-17T23:38:46.413730+00:00"}