Pith Number

pith:PNR7WYFD

pith:2024:PNR7WYFDY56BHRSVDE7CYNLHTD

not attested · not anchored · not stored · refs resolved

Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models

Huizhuo Yuan, Kaixuan Ji, Quanquan Gu, Yihe Deng, Zixiang Chen

Self-play fine-tuning turns a weak supervised LLM into a strong one by iteratively contrasting its own generations against fixed human data.

arxiv:2401.01335 v3 · 2024-01-02 · cs.LG · cs.AI · cs.CL · stat.ML

Add to your LaTeX paper

\usepackage{pith}
\pithnumber{PNR7WYFDY56BHRSVDE7CYNLHTD}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

The global optimum to the training objective function of our method is achieved only when the LLM policy aligns with the target data distribution. Empirically, SPIN can significantly improve the LLM's performance across a variety of benchmarks and even outperform models trained through direct preference optimization (DPO) supplemented with extra GPT-4 preference data.

C2 · weakest assumption

That the self-generated responses from earlier model iterations provide useful contrastive signals without introducing persistent biases or distribution shifts that would prevent steady improvement toward the human data distribution.

C3 · one line summary

SPIN lets a weak LLM become strong by generating training data from its previous iteration and training to prefer human-annotated responses over its own outputs; on benchmarks it outperforms DPO even when DPO is supplemented with extra GPT-4 preference data.
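
The preference described in the claims above can be illustrated with a per-example loss. The following is a minimal sketch, assuming the logistic-loss form of the SPIN objective, in which the policy being trained is compared with the previous iteration on a human response and on a self-generated response. The function name `spin_loss`, its argument names, and the default `lam` are hypothetical; the inputs are sequence log-probabilities that a real training loop would obtain from the two models.

```python
import math


def spin_loss(logp_human_new, logp_human_old, logp_self_new, logp_self_old, lam=0.1):
    """Logistic loss on the margin between a human and a self-generated response.

    Each argument is a sequence log-probability log p(y | x):
    "new" is the policy being trained, "old" is the previous iteration,
    "human" is the annotated response, "self" is the old model's own output.
    lam is a hypothetical regularization weight.
    """
    margin = lam * ((logp_human_new - logp_human_old) - (logp_self_new - logp_self_old))
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the new policy equals the old one, the margin is zero and the loss is log(2).
start = spin_loss(-12.0, -12.0, -9.0, -9.0)
# Raising the human response's likelihood and lowering the self-generated one reduces the loss.
improved = spin_loss(-10.0, -12.0, -11.0, -9.0)
print(start, improved)
```

Only log-probability differences between the two model iterations enter the loss, so the margin is zero whenever the new policy assigns the same likelihoods as the old one.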

References

300 extracted · 300 resolved · 47 Pith anchors

[1] arXiv preprint arXiv:2306.05268

[2] Fine-Tuning Language Models from Human Preferences · arXiv:1909.08593

[3] Self-Rewarding Language Models · arXiv:2401.10020

[4] RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback · arXiv:2309.00267

[5] Advances in Neural Information Processing Systems

Formal links

2 machine-checked theorem links

Cited by

34 papers in Pith

Social Human Robot Embodied Conversation (SHREC) Dataset: Benchmarking Foundational Models' Social Reasoning

Kalman Filter Enhanced GRPO for Reinforcement Learning-Based Language Model Reasoning

ACE: Self-Evolving LLM Coding Framework via Adversarial Unit Test Generation and Preference Optimization

EvoVid: Temporal-Centric Self-Evolution for Video Large Language Models

Enhancing Speech Large Language Models through Reinforced Behavior Alignment

Receipt and verification

First computed	2026-05-17T23:39:21.380083Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

7b63fb60a3c77c13c655193e2c356798d5009e3bb0cd862eebc10e7ca5dd0fcf

Aliases

arxiv: 2401.01335 · arxiv_version: 2401.01335v3 · doi: 10.48550/arxiv.2401.01335 · pith_short_12: PNR7WYFDY56B · pith_short_16: PNR7WYFDY56BHRSV · pith_short_8: PNR7WYFD
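
In this record the aliases are consistent with one another: each `pith_short_*` value is a prefix of the full identifier, and the identifier itself matches the first 26 characters of the RFC 4648 base32 encoding of the canonical hash listed above. The sketch below checks both observations for this record only. That the builder derives identifiers this way in general is an assumption, and the helper name `base32_prefix` is hypothetical.

```python
import base64

CANONICAL_HASH = "7b63fb60a3c77c13c655193e2c356798d5009e3bb0cd862eebc10e7ca5dd0fcf"
PITH_ID = "PNR7WYFDY56BHRSVDE7CYNLHTD"
SHORT_ALIASES = {8: "PNR7WYFD", 12: "PNR7WYFDY56B", 16: "PNR7WYFDY56BHRSV"}


def base32_prefix(hex_digest, length):
    """Return the first `length` base32 characters of a hex digest (assumed derivation)."""
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


# Compare the derived prefix with the identifier published in this record.
derived = base32_prefix(CANONICAL_HASH, len(PITH_ID))
print("identifier matches hash prefix:", derived == PITH_ID)

# Each short alias should be a plain prefix of the full identifier.
for n, alias in SHORT_ALIASES.items():
    print(n, PITH_ID[:n] == alias)
```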

Agent API

Resolver JSON · Graph JSON · Events JSON · Schema · Signing key

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/PNR7WYFDY56BHRSVDE7CYNLHTD \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 7b63fb60a3c77c13c655193e2c356798d5009e3bb0cd862eebc10e7ca5dd0fcf

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "925cc3c9884b19ea31170356b7ee90c6ebd9eec1148b0fe5e311970cc28cec29",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.CL",
      "stat.ML"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2024-01-02T18:53:13Z",
    "title_canon_sha256": "2f69f69cbc581696e830d29dd6d32aeed783be8aefed4b103ddfce31006cb938"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2401.01335",
    "kind": "arxiv",
    "version": 3
  }
}
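
The shell pipeline in the verification step canonicalizes the record by serializing it with sorted keys and compact separators before hashing. A minimal offline sketch of the same step in Python, using the canonical record as printed here, might look like the following. The function name `canonical_sha256` is hypothetical, and the sketch assumes the printed JSON is exactly what the resolver returns under `canonical_record`; the digest it prints should be compared against the canonical hash rather than taken for granted.

```python
import hashlib
import json

# Canonical record copied from this page (assumed to match the resolver output).
record = {
    "metadata": {
        "abstract_canon_sha256": "925cc3c9884b19ea31170356b7ee90c6ebd9eec1148b0fe5e311970cc28cec29",
        "cross_cats_sorted": ["cs.AI", "cs.CL", "stat.ML"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.LG",
        "submitted_at": "2024-01-02T18:53:13Z",
        "title_canon_sha256": "2f69f69cbc581696e830d29dd6d32aeed783be8aefed4b103ddfce31006cb938",
    },
    "schema_version": "1.0",
    "source": {"id": "2401.01335", "kind": "arxiv", "version": 3},
}


def canonical_sha256(obj):
    """Hash the sorted-key, compact-separator JSON form, as the shell one-liner does."""
    payload = json.dumps(
        obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode()
    return hashlib.sha256(payload).hexdigest()


digest = canonical_sha256(record)
print(digest)
```

Because keys are sorted at every nesting level before hashing, the digest does not depend on the order in which a mirror happens to store the fields.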