Pith Number

pith:O6KD46D5

pith:2025:O6KD46D5N4AYG3WDJMRUPETDND

not attested · not anchored · not stored · refs resolved

Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy

Chaojie Wang, Chris Yuhao Liu, Fuxiang Zhang, Jiacai Liu, Jiacheng Xu, Jujie He, Liang Zeng, Rui Yan, Wei Shen, Yahui Zhou, Yang Liu, Yuzhen Xiao

Human-AI synergy curates a dataset of 40 million preference pairs, a 26-million-pair subset of which trains state-of-the-art reward models.

arxiv:2507.01352 v3 · 2025-07-02 · cs.CL · cs.AI · cs.LG


Add to your LaTeX paper

\usepackage{pith}
\pithnumber{O6KD46D5N4AYG3WDJMRUPETDND}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

Skywork-Reward-V2 models achieve state-of-the-art performance across seven major reward model benchmarks, outperform generative reward models, and demonstrate strong downstream performance.

C2 · weakest assumption

The brittleness of current reward models stems primarily from limitations in preference datasets, and the human-AI synergistic pipeline produces measurably higher-quality data that directly causes the reported benchmark gains.

C3 · one-line summary

Skywork-Reward-V2 models trained on 26 million human-AI curated preference pairs set new state-of-the-art results on seven major reward model benchmarks.

References

13 extracted · 13 resolved · 0 Pith anchors

[1] Most BT-based models fall under the sequence classifier category, while generative models primarily include LLM-as-a-Judge approaches 2023

[2] This stratification identifies objective/low-controversial versus subjective/high-controversial regions, where intransitivity is more common

[3] Error-driven adaptive retrieval focuses on “unstable” regions. In Stage 1, we repeatedly train an RM, evaluate it on human-verified gold data, and use error-driven adaptive retrieval to pull in new exa

[4] Stage 2 dual-RM consistency filtering targets contradictory signals. Stage 2 introduces a consistency filter: we train a gold RM on cumulative human-verified samples and use it together with the Stage- 2024

[5] Human annotators may not be experts in all types of math and coding problems

Formal links

2 machine-checked theorem links

Cited by

22 papers in Pith

Code Generation by Differential Test Time Scaling

Preference Instability in Reward Models: Detection and Mitigation via Sparse Autoencoders

GRLO: Towards Generalizable Reinforcement Learning in Open-Ended Environments from Zero

ODRPO: Ordinal Decompositions of Discrete Rewards for Robust Policy Optimization

GIFT: Group-Relative Implicit Fine-Tuning Integrates GRPO with DPO and UNA

Receipt and verification

First computed	2026-05-17T23:38:46.413654Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

77943e787d6f01836ec34b2347926368c1e8d65c863c19e419cb78384dbde901

Aliases

arxiv: 2507.01352 · arxiv_version: 2507.01352v3 · doi: 10.48550/arxiv.2507.01352 · pith_short_12: O6KD46D5N4AY · pith_short_16: O6KD46D5N4AYG3WD · pith_short_8: O6KD46D5
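The aliases above relate to the canonical hash in a way that can be checked locally. The sketch below rests on an observation, not on anything this page states: the 26-character identifier matches the RFC 4648 base32 encoding of the raw SHA-256 digest, truncated to 26 characters, and the short aliases are prefixes of it. The function names `pith_number_from_hash` and `short_aliases` are hypothetical helpers introduced for illustration.

```python
import base64

# Canonical hash as listed on this page.
CANONICAL_HASH = "77943e787d6f01836ec34b2347926368c1e8d65c863c19e419cb78384dbde901"


def pith_number_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the raw digest bytes and keep the first `length` characters.

    Assumption: this mirrors how the identifier is derived; the page does not say so.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


def short_aliases(pith_number: str) -> dict[str, str]:
    """Build the 8-, 12-, and 16-character prefix aliases."""
    return {f"pith_short_{n}": pith_number[:n] for n in (8, 12, 16)}


full = pith_number_from_hash(CANONICAL_HASH)
print(full)                 # O6KD46D5N4AYG3WDJMRUPETDND
print(short_aliases(full))
```

If the printed values agree with the aliases listed above, the identifier and the hash are consistent with each other for this record; that says nothing about how other records are numbered.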

Agent API

Resolver JSON · Graph JSON · Events JSON · Schema · Signing key

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/O6KD46D5N4AYG3WDJMRUPETDND \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 77943e787d6f01836ec34b2347926368c1e8d65c863c19e419cb78384dbde901

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "eed245eae4f5619efa15a39473bb27c79db95695756a3e819b98bcf16f934774",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.LG"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2025-07-02T04:40:29Z",
    "title_canon_sha256": "6f033713f4cab0f8de5ba8bf4b556e999b6b3850ba518e493c47fec4a93cd745"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2507.01352",
    "kind": "arxiv",
    "version": 3
  }
}
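For an offline check, the same canonicalization used in the curl pipeline (sorted keys, compact separators, UTF-8, no ASCII escaping) can be applied to the record shown above. This is a minimal sketch: it assumes the JSON printed on this page is identical to what the resolver serves under `canonical_record`, which the page does not guarantee. The helper name `canonical_sha256` is hypothetical.

```python
import hashlib
import json

# The canonical record as displayed on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "eed245eae4f5619efa15a39473bb27c79db95695756a3e819b98bcf16f934774",
        "cross_cats_sorted": ["cs.AI", "cs.LG"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2025-07-02T04:40:29Z",
        "title_canon_sha256": "6f033713f4cab0f8de5ba8bf4b556e999b6b3850ba518e493c47fec4a93cd745",
    },
    "schema_version": "1.0",
    "source": {"id": "2507.01352", "kind": "arxiv", "version": 3},
}


def canonical_sha256(obj: dict) -> str:
    """Serialize with sorted keys and compact separators, then hash the UTF-8 bytes."""
    canonical = json.dumps(
        obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


digest = canonical_sha256(record)
print(digest)  # compare against the canonical hash listed on this page
```

Because keys are sorted before hashing, the digest does not depend on the order in which the fields were typed or received; only a change in a key or value alters it.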