Pith Number

pith:PNR7WYFD

pith:2024:PNR7WYFDY56BHRSVDE7CYNLHTD
not attested · not anchored · not stored · refs resolved

Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models

Huizhuo Yuan, Kaixuan Ji, Quanquan Gu, Yihe Deng, Zixiang Chen

Self-play fine-tuning turns a weak supervised LLM into a strong one by iteratively contrasting its own generations against fixed human data.

arxiv:2401.01335 v3 · 2024-01-02 · cs.LG · cs.AI · cs.CL · stat.ML

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{PNR7WYFDY56BHRSVDE7CYNLHTD}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
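How a mirror might recompute the same state can be illustrated with a minimal sketch. Everything in it is hypothetical: the event fields (`id`, `ts`, `kind`, `value`), the `merge_state` helper, and the latest-event-wins rule are assumptions made for illustration, not Pith's actual bundle format or merge algorithm. The only property it is meant to show is that deduplicating and sorting on a total order before folding makes the result independent of the order in which events arrive.

```python
from typing import Iterable


def merge_state(events: Iterable[dict]) -> dict:
    """Fold events into a state dict, independent of arrival order.

    Hypothetical event shape: {"id": str, "ts": str, "kind": str, "value": object}.
    """
    # Deduplicate by event id so a replayed event cannot change the result.
    unique = {e["id"]: e for e in events}
    # Sort by (timestamp, id): a total order, so ties break the same way everywhere.
    ordered = sorted(unique.values(), key=lambda e: (e["ts"], e["id"]))
    state: dict = {}
    for e in ordered:
        state[e["kind"]] = e["value"]  # assumed rule: the latest event per kind wins
    return state


# Made-up events, used only to exercise the function.
events = [
    {"id": "a1", "ts": "2026-05-17T23:39:21Z", "kind": "citations", "value": "open"},
    {"id": "b2", "ts": "2026-05-18T08:00:00Z", "kind": "citations", "value": "closed"},
    {"id": "c3", "ts": "2026-05-18T08:00:00Z", "kind": "replications", "value": "open"},
]

# Two mirrors that receive the events in different orders reach the same state.
assert merge_state(events) == merge_state(list(reversed(events)))
```

Sorting on `(ts, id)` rather than `ts` alone matters in this sketch: two events with the same timestamp would otherwise be applied in arrival order, and mirrors could diverge.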

Claims

C1 · strongest claim

The global optimum to the training objective function of our method is achieved only when the LLM policy aligns with the target data distribution. Empirically, SPIN can significantly improve the LLM's performance across a variety of benchmarks and even outperform models trained through direct preference optimization (DPO) supplemented with extra GPT-4 preference data.

C2 · weakest assumption

That the self-generated responses from earlier model iterations provide useful contrastive signals without introducing persistent biases or distribution shifts that would prevent steady improvement toward the human data distribution.

C3 · one line summary

SPIN turns a weak LLM into a strong one by generating training data from its own previous iteration and training the model to prefer human-annotated responses over those self-generated outputs. On benchmarks, it can outperform models trained with DPO, even when DPO is supplemented with extra GPT-4 preference data.
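The contrast described in the claims above can be written as a per-example loss. The sketch below assumes the logistic-loss form of the objective, applied to the difference in log-probability gains between the human response and the self-generated response. This page gives only the verbal description, so the function name `spin_pair_loss`, its argument names, and the default `lam=0.1` are illustrative placeholders rather than the authors' implementation.

```python
import math


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def spin_pair_loss(
    logp_human_new: float,
    logp_human_old: float,
    logp_self_new: float,
    logp_self_old: float,
    lam: float = 0.1,  # placeholder scale, not a value taken from this page
) -> float:
    """Logistic loss on one (prompt, human response, self-generated response) triple.

    Each argument is a sequence log-probability log p(response | prompt):
    *_new under the model being trained, *_old under the frozen previous iteration.
    """
    # How much more the new model favors each response than the old model did.
    human_gain = logp_human_new - logp_human_old
    self_gain = logp_self_new - logp_self_old
    margin = lam * (human_gain - self_gain)
    return softplus(-margin)  # log(1 + exp(-margin))
```

At the start of an iteration the trained model equals the previous one, so both gains are zero and the loss is log 2. It falls below log 2 once the model raises the likelihood of the human response relative to its own earlier generation, which is the direction the one-line summary describes.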

References

300 extracted · 300 resolved · 47 Pith anchors

[1] arXiv preprint arXiv:2306.05268
[2] Fine-Tuning Language Models from Human Preferences · arXiv:1909.08593
[3] Self-Rewarding Language Models · arXiv:2401.10020
[4] RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback · arXiv:2309.00267
[5] Advances in Neural Information Processing Systems

Formal links

2 machine-checked theorem links

Cited by

34 papers in Pith

Receipt and verification
First computed: 2026-05-17T23:39:21.380083Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05) · public key
Schema: pith-number/v1.0

Canonical hash

7b63fb60a3c77c13c655193e2c356798d5009e3bb0cd862eebc10e7ca5dd0fcf

Aliases

arxiv: 2401.01335 · arxiv_version: 2401.01335v3 · doi: 10.48550/arxiv.2401.01335 · pith_short_12: PNR7WYFDY56B · pith_short_16: PNR7WYFDY56BHRSV · pith_short_8: PNR7WYFD
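For this record, the Pith Number coincides with the first 26 characters of the RFC 4648 Base32 encoding of the canonical hash, and the `pith_short_*` aliases are its 8-, 12- and 16-character prefixes. The page does not state that this is the defined derivation, so the sketch below should be read as an observation about this one record, not as a specification.

```python
import base64

canonical_hash = "7b63fb60a3c77c13c655193e2c356798d5009e3bb0cd862eebc10e7ca5dd0fcf"

# RFC 4648 Base32 of the 32 hash bytes, truncated to 26 characters.
# Assumption: truncation to 26 characters is inferred from this record only.
b32 = base64.b32encode(bytes.fromhex(canonical_hash)).decode("ascii")
pith_number = b32[:26]

# The short aliases are plain prefixes of the full number.
aliases = {f"pith_short_{n}": pith_number[:n] for n in (8, 12, 16)}

print(pith_number)
print(aliases)
```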
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/PNR7WYFDY56BHRSVDE7CYNLHTD \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 7b63fb60a3c77c13c655193e2c356798d5009e3bb0cd862eebc10e7ca5dd0fcf
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "925cc3c9884b19ea31170356b7ee90c6ebd9eec1148b0fe5e311970cc28cec29",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.CL",
      "stat.ML"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2024-01-02T18:53:13Z",
    "title_canon_sha256": "2f69f69cbc581696e830d29dd6d32aeed783be8aefed4b103ddfce31006cb938"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2401.01335",
    "kind": "arxiv",
    "version": 3
  }
}
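The hashing step of the verification command can also be run offline against a saved copy of the record. The sketch below mirrors the serialization used in the one-liner (sorted keys, compact separators, non-ASCII kept as-is, UTF-8 bytes); the helper name `canonical_sha256` is an assumption for illustration. Whether the record as displayed on this page hashes to the canonical hash depends on the page showing the exact canonical record, which is not checked here.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """SHA-256 of the record serialized with sorted keys and no extra whitespace."""
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# The canonical record as displayed on this page.
record = json.loads("""
{
  "metadata": {
    "abstract_canon_sha256": "925cc3c9884b19ea31170356b7ee90c6ebd9eec1148b0fe5e311970cc28cec29",
    "cross_cats_sorted": ["cs.AI", "cs.CL", "stat.ML"],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2024-01-02T18:53:13Z",
    "title_canon_sha256": "2f69f69cbc581696e830d29dd6d32aeed783be8aefed4b103ddfce31006cb938"
  },
  "schema_version": "1.0",
  "source": {"id": "2401.01335", "kind": "arxiv", "version": 3}
}
""")

# Compare this digest with the canonical hash listed on the page.
print(canonical_sha256(record))
```

Because keys are sorted before hashing, the digest does not depend on the key order or whitespace of the JSON file a mirror happens to store.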