Pith Number

pith:O6KD46D5

pith:2025:O6KD46D5N4AYG3WDJMRUPETDND
not attested · not anchored · not stored · refs resolved

Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy

Chaojie Wang, Chris Yuhao Liu, Fuxiang Zhang, Jiacai Liu, Jiacheng Xu, Jujie He, Liang Zeng, Rui Yan, Wei Shen, Yahui Zhou, Yang Liu, Yuzhen Xiao

Human-AI synergy curates a 40-million-pair preference dataset, of which a 26-million-pair subset trains state-of-the-art reward models.

arxiv:2507.01352 v3 · 2025-07-02 · cs.CL · cs.AI · cs.LG

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{O6KD46D5N4AYG3WDJMRUPETDND}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

Skywork-Reward-V2 models achieve state-of-the-art performance across seven major reward model benchmarks, outperform generative reward models, and demonstrate strong downstream performance.

C2 · weakest assumption

The brittleness of current reward models stems primarily from limitations in preference datasets, and the human-AI synergistic pipeline produces measurably higher-quality data that directly causes the reported benchmark gains.

C3 · one-line summary

Skywork-Reward-V2 models trained on 26 million human-AI curated preference pairs set new state-of-the-art results on seven major reward model benchmarks.

References

13 extracted · 13 resolved · 0 Pith anchors

[1] Most BT-based models fall under the sequence classifier category, while generative models primarily include LLM-as-a-Judge approaches · 2023
[2] This stratification identifies objective/low-controversial versus subjective/high-controversial regions, where intransitivity is more common
[3] Error-driven adaptive retrieval focuses on “unstable” regions. In Stage 1, we repeatedly train an RM, evaluate it on human-verified gold data, and use error-driven adaptive retrieval to pull in new exa
[4] Stage 2 dual-RM consistency filtering targets contradictory signals. Stage 2 introduces a consistency filter: we train a gold RM on cumulative human-verified samples and use it together with the Stage- · 2024
[5] Human annotators may not be experts in all types of math and coding problems

Formal links

2 machine-checked theorem links

Cited by

22 papers in Pith

Receipt and verification
First computed 2026-05-17T23:38:46.413654Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

77943e787d6f01836ec34b2347926368c1e8d65c863c19e419cb78384dbde901

Aliases

arxiv: 2507.01352 · arxiv_version: 2507.01352v3 · doi: 10.48550/arxiv.2507.01352 · pith_short_12: O6KD46D5N4AY · pith_short_16: O6KD46D5N4AYG3WD · pith_short_8: O6KD46D5
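The short aliases above are prefixes of the full Pith Number. For this record, the full number also matches the first 26 characters of the RFC 4648 base32 encoding of the canonical hash listed above. That second relationship is an observation about this one record, not a rule stated on this page, so the sketch below treats it as an assumption. The function names `pith_number_from_hash` and `short_aliases` are hypothetical.

```python
import base64


def pith_number_from_hash(canonical_hash_hex: str, length: int = 26) -> str:
    """Assumed derivation: base32-encode the SHA-256 digest, keep a prefix.

    This matches the record shown here; it is not a documented rule.
    """
    digest = bytes.fromhex(canonical_hash_hex)
    encoded = base64.b32encode(digest).decode("ascii")
    return encoded[:length]


def short_aliases(pith_number: str, lengths=(8, 12, 16)) -> dict:
    """Build the pith_short_N aliases as plain prefixes of the full number."""
    return {f"pith_short_{n}": pith_number[:n] for n in lengths}


canonical_hash = "77943e787d6f01836ec34b2347926368c1e8d65c863c19e419cb78384dbde901"
full = pith_number_from_hash(canonical_hash)
aliases = short_aliases(full)

print(full)
for name, value in aliases.items():
    print(name, value)
```

If the assumption holds for other records, a client could recompute a Pith Number from a canonical hash without contacting the server. Treat a mismatch as a sign that the derivation differs, not necessarily that the record is invalid.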
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/O6KD46D5N4AYG3WDJMRUPETDND \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 77943e787d6f01836ec34b2347926368c1e8d65c863c19e419cb78384dbde901
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "eed245eae4f5619efa15a39473bb27c79db95695756a3e819b98bcf16f934774",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.LG"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2025-07-02T04:40:29Z",
    "title_canon_sha256": "6f033713f4cab0f8de5ba8bf4b556e999b6b3850ba518e493c47fec4a93cd745"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2507.01352",
    "kind": "arxiv",
    "version": 3
  }
}
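The verification one-liner above can be unpacked into a small offline check. This is a minimal sketch that assumes the canonical form is exactly what the inline Python in that command produces: keys sorted, compact separators, no ASCII escaping, UTF-8 bytes. The function name `canonical_sha256` is hypothetical, and the record is copied from the JSON shown above.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and compact separators, then hash as UTF-8."""
    canonical = json.dumps(
        record,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Canonical record as shown on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "eed245eae4f5619efa15a39473bb27c79db95695756a3e819b98bcf16f934774",
        "cross_cats_sorted": ["cs.AI", "cs.LG"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2025-07-02T04:40:29Z",
        "title_canon_sha256": "6f033713f4cab0f8de5ba8bf4b556e999b6b3850ba518e493c47fec4a93cd745",
    },
    "schema_version": "1.0",
    "source": {"id": "2507.01352", "kind": "arxiv", "version": 3},
}

digest = canonical_sha256(record)
print(digest)
# Compare the printed digest with the canonical hash listed on this page.
```

Because the keys are sorted before hashing, the digest does not depend on the order in which a mirror stores or returns the fields.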