RSPO: Regularized self-play alignment of large language models

Tang, X · 2025 · arXiv 2503.00030

3 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

unclear 1

representative citing papers

IRIS: Interpolative Rényi Iterative Self-play for Large Language Model Fine-Tuning

cs.LG · 2026-04-22 · unverdicted · novelty 7.0

IRIS unifies self-play fine-tuning under an interpolative Rényi objective with adaptive alpha scheduling and reports better benchmark scores than baselines while surpassing full supervised fine-tuning with only 13% of the annotated data.

GDSD: Reinforcement Learning as Guided Denoiser Self-Distillation for Diffusion Language Models

cs.LG · 2026-05-28 · unverdicted · novelty 6.0

GDSD reduces RL for dLLMs to likelihood-free self-distillation via a normalization-free logit-matching objective, outperforming ELBO methods with more stable training on LLaDA-8B and Dream-7B.

S-SPPO: Semantic-Calibrated Self-Play Preference Optimization

cs.AI · 2026-06-01 · unverdicted · novelty 5.0

S-SPPO stabilizes SPPO via semantic calibration in supervision and representation spaces, reporting 52.19% win rate on AlpacaEval 2.0 with Llama-3-8B.

citing papers explorer

IRIS: Interpolative Rényi Iterative Self-play for Large Language Model Fine-Tuning · cs.LG · 2026-04-22 · unverdicted · none · ref 68
GDSD: Reinforcement Learning as Guided Denoiser Self-Distillation for Diffusion Language Models · cs.LG · 2026-05-28 · unverdicted · none · ref 49
S-SPPO: Semantic-Calibrated Self-Play Preference Optimization · cs.AI · 2026-06-01 · unverdicted · none · ref 18
