pith. machine review for the scientific record.

arxiv: 2604.20933 · v1 · submitted 2026-04-22 · 💻 cs.LG · cs.AI

Recognition: unknown

IRIS: Interpolative Rényi Iterative Self-play for Large Language Model Fine-Tuning


Pith reviewed 2026-05-10 01:12 UTC · model grok-4.3

classification: 💻 cs.LG · cs.AI
keywords: self-play fine-tuning · Rényi divergence · large language models · adaptive alpha · iterative training · distributional gap · supervised fine-tuning

The pith

An adjustable Rényi order parameter lets self-play fine-tuning adapt its objective as the model closes the gap to the target distribution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces IRIS as a self-play fine-tuning framework for large language models that replaces fixed divergence measures with a continuously tunable Rényi objective. The order parameter alpha governs exponential importance weights on annotated versus self-generated responses and is scheduled according to the current distributional gap. This produces a unified view in which prior methods appear as special cases at particular alpha values while allowing the training to move from sharper weighting early to smoother refinement later. Experiments indicate that the resulting procedure reaches higher average benchmark scores than fixed-divergence baselines and can exceed standard supervised fine-tuning even when trained on only a small fraction of the annotated data.

Core claim

IRIS decomposes the self-play objective into two independent tilted risk terms, one over annotated data and one over synthetic data, whose relative weighting is controlled by the Rényi order alpha through exponential importance factors. An adaptive schedule adjusts alpha to the distributional gap between the current model and the target; the method is proved to possess a fixed-point property; and alpha is shown to modulate gradient concentration during updates.
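
To make the decomposition concrete, here is a minimal sketch of an exponentially tilted risk term in the spirit of tilted empirical risk minimization (reference [42] in the graph below); the wiring of the two terms and the mapping from the Rényi order alpha to the tilt t are illustrative assumptions, not the paper's exact equations.

```python
import numpy as np

def tilted_risk(losses: np.ndarray, t: float) -> float:
    """Tilted empirical risk: (1/t) * log(mean(exp(t * losses))).

    Each example carries effective weight proportional to exp(t * loss), so
    larger |t| sharpens the weighting and t -> 0 recovers the plain mean.
    Uses a log-sum-exp reduction for numerical stability.
    """
    if abs(t) < 1e-8:
        return float(losses.mean())
    return float((np.logaddexp.reduce(t * losses) - np.log(losses.size)) / t)

def iris_style_objective(annotated_losses, synthetic_losses, alpha):
    # Hypothetical combination: one independent tilted term per data source,
    # with a tilt that grows with (alpha - 1). Only the qualitative behavior
    # (larger alpha -> sharper importance weighting) is taken from the paper.
    t = alpha - 1.0
    return (tilted_risk(np.asarray(annotated_losses, dtype=float), t)
            + tilted_risk(np.asarray(synthetic_losses, dtype=float), t))
```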

What carries the argument

The Rényi order parameter alpha that interpolates between divergence regimes and is scheduled adaptively to the distributional gap between model and target.
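
A minimal sketch of what such a schedule could look like, assuming the gap has already been summarized as a single non-negative scalar; the exponential form, the bounds, and the scaling constant c are illustrative assumptions rather than the paper's formula (the paper's Figure 4 reports sensitivity to a scaling constant, so some such knob exists).

```python
import math

def alpha_schedule(gap: float, alpha_min: float = 1.0,
                   alpha_max: float = 3.0, c: float = 2.0) -> float:
    """Map an estimated distributional gap to a Rényi order alpha.

    A large gap early in training yields alpha near alpha_max (sharper
    importance weighting); as the gap shrinks near convergence, alpha decays
    toward alpha_min (smoother refinement). c sets how fast it saturates.
    """
    g = max(0.0, gap)
    return alpha_min + (alpha_max - alpha_min) * (1.0 - math.exp(-c * g))
```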

If this is right

  • Existing self-play methods such as SPIN, SPACE, and SPIF correspond to particular fixed values of alpha.
  • Training dynamics remain stable because the importance weighting shifts from sharp early in training to smoother near convergence.
  • Higher final performance becomes reachable with substantially fewer annotated samples than are needed for full supervised fine-tuning.
  • Gradient updates can be made more or less concentrated by direct choice of the alpha schedule.
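
On the last bullet, the mechanism is easy to exhibit in the tilted-risk form sketched above: the gradient of a tilted risk is a softmax-weighted average of per-example gradients, so the tilt (and hence alpha, under the assumed mapping) directly sets how concentrated each update is.

```python
import numpy as np

def gradient_weights(losses: np.ndarray, t: float) -> np.ndarray:
    """Per-example weights in the gradient of tilted_risk(losses, t).

    d/dtheta [(1/t) log mean exp(t * loss_i)] = sum_i w_i * d(loss_i)/dtheta,
    where w = softmax(t * losses). Larger t concentrates the update on
    high-loss examples; t -> 0 gives uniform weights 1/n.
    """
    z = t * losses
    z = z - z.max()          # stabilize the softmax
    w = np.exp(z)
    return w / w.sum()
```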

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar adaptive interpolation between divergence measures may stabilize iterative self-improvement loops in other machine-learning settings.
  • The unification suggests that choosing the divergence regime according to training stage is a general principle worth testing in reinforcement-learning-from-human-feedback pipelines.
  • The method raises the possibility that fewer human annotations overall could suffice for alignment tasks if the adaptive schedule generalizes across model scales.

Load-bearing premise

The distributional gap between model and target can be measured reliably enough to set alpha without introducing instability or requiring per-task retuning.
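
Read as code, the premise amounts to trusting an estimator like the following; this sketch assumes the gap is a KL-style expectation of log-probability ratios over sampled responses, which may differ from the paper's estimator, and its Monte Carlo noise is exactly where instability could enter.

```python
def estimated_gap(target_logps: list[float], model_logps: list[float]) -> float:
    """Estimate E[log p_target(y|x) - log p_model(y|x)] over a batch.

    Inputs are per-response sequence log-probabilities on the same (x, y)
    pairs. The max(0, .) clamp and any smoothing across iterations (e.g. an
    exponential moving average) are illustrative stabilizers, not the paper's.
    """
    n = len(target_logps)
    return max(0.0, sum(t - m for t, m in zip(target_logps, model_logps)) / n)
```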

What would settle it

A side-by-side run, on the same data splits and model, of IRIS with the adaptive alpha schedule against the same method using any single fixed alpha value: if the adaptive version shows no consistent performance gain, the claimed benefit of the schedule is falsified.

Figures

Figures reproduced from arXiv: 2604.20933 by Liangjie Zhao, Like Wu, Shigeru Fujimura, Shihui Xu, Wenjie Liao.

Figure 1: Overview of self-play fine-tuning paradigms: (a) the standard framework trains …
Figure 2: Performance comparison between training at iteration 0 for multiple epochs on fixed …
Figure 3: Training dynamics at Iter 0 on Zephyr-7B: (a) reward of synthetic responses; (b) reward of …
Figure 4: Ablation analysis on Zephyr-7B: (a) sensitivity to scaling constant …
Figure 5: Performance comparison across self-play iterations on Qwen2.5-3B-Instruct for three …
Original abstract

Self-play fine-tuning enables large language models to improve beyond supervised fine-tuning without additional human annotations by contrasting annotated responses with self-generated ones. Many existing methods rely on a fixed divergence regime. SPIN is closely related to a KL-based regime, SPACE to a Jensen-Shannon-style objective via noise contrastive estimation, and SPIF to χ²-regularized self-play. Since these divergences exhibit different strengths depending on the distributional gap between model and target, no single choice appears to provide favorable learning dynamics across training stages. We propose IRIS (Interpolative Rényi Iterative Self-play), a Rényi-based self-play fine-tuning framework with a continuously adjustable objective. IRIS decomposes into two independent tilted risk terms over annotated and synthetic data, with exponential importance weights controlled by the order parameter α. We show that several self-play objectives can be interpreted as limiting or representative regimes at particular values of α, providing a unified theoretical perspective on these methods. An adaptive order schedule further adjusts α to the distributional gap, shifting from sharper importance weighting early in training to smoother refinement near convergence. Theoretically, we establish the fixed-point property of IRIS and analyze how α controls gradient concentration. Experiments on Zephyr-7B and Qwen2.5-3B across ten benchmarks show that IRIS improves upon baselines, reaching a 44.57% average score with gains across iterations. In our setting, IRIS with only 26k annotated samples surpasses standard supervised fine-tuning trained on the full 200k dataset.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes IRIS, a Rényi-divergence-based self-play fine-tuning framework for LLMs that uses an order parameter α to interpolate between divergence regimes. It unifies SPIN (KL), SPACE (JS-style), and SPIF (χ²) as special cases of α, introduces an adaptive schedule that adjusts α according to the distributional gap between model and target (sharper early, smoother later), proves a fixed-point property, and analyzes α's effect on gradient concentration. Experiments on Zephyr-7B and Qwen2.5-3B across ten benchmarks report an average score of 44.57% with iterative gains, claiming that IRIS trained on only 26k annotated samples outperforms standard SFT on the full 200k dataset.

Significance. If the empirical results hold under proper validation, IRIS offers a unified theoretical lens on self-play methods and a practical route to improved data efficiency by reducing reliance on large annotated sets. The fixed-point property and gradient analysis are clear theoretical strengths; the adaptive schedule is a plausible design choice for handling varying distributional gaps across training stages.

major comments (3)
  1. [Experiments] Experiments section: The central data-efficiency claim (IRIS at 26k annotated samples surpassing SFT at 200k) and the 44.57% average score are reported without error bars, standard deviations across runs, statistical significance tests, or details on the 26k subset selection protocol. This undermines assessment of whether the iterative gains are reliable or reproducible.
  2. [Method and Experiments] Adaptive order schedule (described in the method and experiments): The headline performance depends on the adaptive α schedule that shifts based on distributional gap, yet no ablations isolate its contribution versus fixed-α variants, test stability across tasks/models, or demonstrate that no per-task retuning is needed. This is load-bearing for the practical unification and efficiency claims.
  3. [Theoretical Analysis] Theoretical analysis: The fixed-point property and gradient concentration analysis are established for IRIS, but the manuscript provides no explicit proof details, assumptions on the adaptive schedule's interaction with the fixed point, or conditions under which the unification holds when α varies dynamically.
minor comments (2)
  1. [Method] Notation for the two tilted risk terms and exponential importance weights could be clarified with an explicit equation relating them to the Rényi objective.
  2. [Experiments] The abstract and experiments would benefit from a table summarizing per-benchmark scores and baselines for direct comparison.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and insightful comments, which highlight important areas for strengthening the empirical validation, ablation studies, and theoretical exposition in our manuscript. We address each major comment below and commit to revisions that improve clarity and rigor without altering the core contributions of IRIS.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: The central data-efficiency claim (IRIS at 26k annotated samples surpassing SFT at 200k) and the 44.57% average score are reported without error bars, standard deviations across runs, statistical significance tests, or details on the 26k subset selection protocol. This undermines assessment of whether the iterative gains are reliable or reproducible.

    Authors: We agree that the lack of error bars, standard deviations, and statistical tests weakens the assessment of result reliability. In the revised manuscript, we will rerun the key experiments on Zephyr-7B and Qwen2.5-3B with at least three random seeds, reporting means and standard deviations for the average score and per-benchmark results. We will add paired t-tests or Wilcoxon tests to establish statistical significance against baselines. For the 26k subset, we will explicitly state that it was obtained by stratified random sampling from the full 200k dataset to preserve the distribution of query difficulty and domains; this protocol will be detailed in the Experiments section along with a new appendix table showing per-run scores. revision: yes

  2. Referee: [Method and Experiments] Adaptive order schedule (described in the method and experiments): The headline performance depends on the adaptive α schedule that shifts based on distributional gap, yet no ablations isolate its contribution versus fixed-α variants, test stability across tasks/models, or demonstrate that no per-task retuning is needed. This is load-bearing for the practical unification and efficiency claims.

    Authors: We acknowledge the value of isolating the adaptive schedule's contribution. In the revision, we will add a dedicated ablation subsection comparing the full adaptive schedule against fixed-α baselines (α = 0.5, 1.0, 2.0) on the same benchmarks and models. We will also report results on an additional held-out task to assess cross-task stability and confirm that the schedule requires no per-task hyperparameter retuning, as α is computed automatically from the distributional gap at each iteration. These results will be presented with the same evaluation protocol as the main experiments. revision: yes

  3. Referee: [Theoretical Analysis] Theoretical analysis: The fixed-point property and gradient concentration analysis are established for IRIS, but the manuscript provides no explicit proof details, assumptions on the adaptive schedule's interaction with the fixed point, or conditions under which the unification holds when α varies dynamically.

    Authors: We agree that explicit proof details and discussion of the adaptive schedule are needed. In the revised manuscript, we will expand the theoretical section and add a full appendix containing the complete proof of the fixed-point property, including all assumptions (e.g., the objective is minimized via gradient steps and the distributional gap is non-increasing). We will clarify that the unification holds for any fixed α and that the adaptive schedule preserves the fixed-point property at convergence because α stabilizes as the gap approaches zero; we will also state the conditions under which dynamic α does not violate the unification (namely, that α remains within the valid range [0, ∞) throughout training). revision: yes

Circularity Check

0 steps flagged

No circularity: unification is algebraic observation; adaptive schedule is independent design choice

Full rationale

The paper derives IRIS by expressing the objective as two tilted risks with exponential weights controlled by α, then algebraically shows that SPIN (KL), SPACE (JS via NCE), and SPIF (χ²) arise at specific α values or limits. This unification is a direct mathematical rewriting, not a fitted parameter or self-referential definition. The fixed-point property and gradient-concentration analysis are stated as independent theoretical results. The adaptive α schedule is introduced as an additional heuristic that adjusts to the distributional gap; no equation shows that downstream performance predictions are forced by the schedule itself or by prior self-citations. Empirical claims (44.57% average, 26k samples beating 200k SFT) rest on reported runs rather than definitional equivalence. No load-bearing step reduces to its own inputs by construction.
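
For readers checking the algebra, the standard definition and the two limits the unification leans on are below (van Erven and Harremoës, reference [73] in the graph); which exact α values the paper assigns to SPIN, SPACE, and SPIF is stated in the paper itself and not reproduced here.

$$
D_\alpha(P \,\|\, Q) = \frac{1}{\alpha-1}\,\log \mathbb{E}_{x\sim Q}\!\left[\left(\frac{P(x)}{Q(x)}\right)^{\alpha}\right],
\qquad
\lim_{\alpha\to 1} D_\alpha(P \,\|\, Q) = \mathrm{KL}(P \,\|\, Q),
\qquad
D_2(P \,\|\, Q) = \log\bigl(1+\chi^2(P \,\|\, Q)\bigr).
$$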

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on the existence of a fixed-point property for the IRIS objective and on the effectiveness of the distributional-gap heuristic for scheduling alpha; both are treated as domain assumptions rather than derived from first principles.

free parameters (1)
  • alpha
    Order parameter of the Rényi divergence that controls importance weighting and is adjusted via an adaptive schedule during training.
axioms (2)
  • domain assumption IRIS objective possesses a fixed-point property
    Invoked as a theoretical guarantee of the method.
  • domain assumption Distributional gap between model and target can be reliably estimated and used to adapt alpha
    Underlies the adaptive schedule described in the abstract.

pith-pipeline@v0.9.0 · 5596 in / 1463 out tokens · 89731 ms · 2026-05-10T01:12:10.471692+00:00 · methodology


Reference graph

Works this paper leans on

94 extracted references · 26 canonical work pages · 8 internal anchors

  [1] Reda Alami, Abdalgader Abubaker, Mastane Achab, Mohamed El Amine Seddik, and Salem Lahlou. Investigating regularization of self-play language models. arXiv preprint arXiv:2404.04291, 2024.
  [2] Syed Mumtaz Ali and Samuel D. Silvey. A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society: Series B (Methodological), 28(1):131–142, 1966.
  [3] Shun-ichi Amari. Information Geometry and Its Applications. Springer, 2016.
  [4] Anthropic. Claude 3.5 Sonnet. https://www.anthropic.com/news/claude-3-5-sonnet. Accessed: 2026-04-04.
  [6] Mohammad Gheshlaghi Azar, Zhaohan Daniel Guo, Bilal Piot, Remi Munos, Mark Rowland, Michal Valko, and Daniele Calandriello. A general theoretical paradigm to understand learning from human preferences. In International Conference on Artificial Intelligence and Statistics, pages 4447–4455. PMLR, 2024.
  [7] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.
  [8] Edward Beeching, Clémentine Fourrier, Nathan Habib, Sheon Han, Nathan Lambert, Nazneen Rajani, Omar Sanseviero, Lewis Tunstall, and Thomas Wolf. Open LLM Leaderboard (2023–2024). https://huggingface.co/spaces/open-llm-leaderboard-old/open_llm_leaderboard, 2023. Accessed: 2026-04-04.
  [9] Anil Kumar Bhattacharyya. On a measure of divergence between two statistical populations defined by their probability distributions. Bulletin of the Calcutta Mathematical Society, 35:99–109, 1943.
  [10] Jeremiah Birrell, Paul Dupuis, Markos A. Katsoulakis, Luc Rey-Bellet, and Jie Wang. Variational representations and neural network estimation of Rényi divergences. SIAM Journal on Mathematics of Data Science, 4(2):773–811, 2022.
  [11] Pierre Blanchard, Desmond J. Higham, and Nicholas J. Higham. Accurately computing the log-sum-exp and softmax functions. IMA Journal of Numerical Analysis, 41(4):2311–2330, 2021.
  [12] Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu. Self-play fine-tuning converts weak language models to strong language models. arXiv preprint arXiv:2401.01335, 2024.
  [13] Paul F. Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30, 2017.
  [14] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1–53, 2024.
  [15] Andrzej Cichocki and Shun-ichi Amari. Families of alpha- beta- and gamma-divergences: Flexible and robust measures of similarities. Entropy, 12(6):1532–1568, 2010.
  [16] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
  [17] Imre Csiszár. Information-type measures of difference of probability distributions and indirect observations. Studia Scientiarum Mathematicarum Hungarica, 2:299–318, 1967.
  [18] Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. In International Conference on Learning Representations, 2024.
  [19] Mucong Ding, Souradip Chakraborty, Vibhu Agrawal, Zora Che, Alec Koppel, Mengdi Wang, Amrit Bedi, and Furong Huang. SAIL: Self-improving efficient online alignment of large language models. arXiv preprint arXiv:2406.15567, 2024.
  [20] Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Zhi Zheng, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233, 2023.
  [21] Guanting Dong, Hongyi Yuan, Keming Lu, Chengpeng Li, Mingfeng Xue, Dayiheng Liu, Wei Wang, Zheng Yuan, Chang Zhou, and Jingren Zhou. How abilities in large language models are affected by supervised fine-tuning data composition. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 177–198, 2024.
  [22] Monroe D. Donsker and S. R. Srinivasa Varadhan. Asymptotic evaluation of certain Markov process expectations for large time. IV. Communications on Pure and Applied Mathematics, 36(2):183–212, 1983.
  [23] Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. KTO: Model alignment as prospect theoretic optimization. In Proceedings of the 41st International Conference on Machine Learning, 2024.
  [24] Duanyu Feng, Bowen Qin, Chen Huang, Zheng Zhang, and Wenqiang Lei. Towards analyzing and understanding the limitations of DPO: A theoretical perspective. arXiv preprint arXiv:2404.04626, 2024.
  [25] Chongming Gao, Ruijun Chen, Shuai Yuan, Kexin Huang, Yuanqing Yu, and Xiangnan He. SPRec: Leveraging self-play to debias preference alignment for large language model-based recommendations. arXiv e-prints, 2024.
  [26] Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac'h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework for few-shot language model evaluation. Zenodo, 2023.
  [27] Manuel Gil, Fady Alajaji, and Tamás Linder. Rényi divergence measures for commonly used univariate continuous distributions. Information Sciences, 249:124–131, 2013.
  [28] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, 27, 2014.
  [29] Google. Gemini 2.5: Our most intelligent AI model. https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/, 2025. Accessed: 2026-04-04.
  [30] Sylvain Gugger, Lysandre Debut, Thomas Wolf, Philipp Schmid, Zachary Mueller, Sourab Mangrulkar, Marc Sun, and Benjamin Bossan. Accelerate: Training and inference at scale made simple, efficient and adaptable. 2022.
  [31] Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 297–304. JMLR Workshop and Conference Proceedings, 2010.
  [32] Michael U. Gutmann and Aapo Hyvärinen. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of Machine Learning Research, 13(2), 2012.
  [33] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. International Conference on Learning Representations, 2021.
  [34] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. Advances in Neural Information Processing Systems, 34:7294–7306, 2021.
  [35] José Miguel Hernández-Lobato, Yingzhen Li, Mark Rowland, Thang Bui, Daniel Hernández-Lobato, and Richard E. Turner. Black-box α-divergence minimization. In Proceedings of the 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1511–1520, 2016.
  [36] Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky. Neural networks for machine learning: Lecture 6a, overview of mini-batch gradient descent. Technical report, University of Toronto, 2012.
  [37] Jiwoo Hong, Noah Lee, and James Thorne. ORPO: Monolithic preference optimization without reference model. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024.
  [38] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.
  [39] Xiang Ji, Sanjeev Kulkarni, Mengdi Wang, and Tengyang Xie. Self-play with adversarial critic: Provable and scalable offline alignment for language models. arXiv preprint arXiv:2406.04274, 2024.
  [40] Harrison Lee, Samrat Phatale, Hassan Mansoor, Thomas Mesnard, Johan Ferret, Kellie Lu, Colton Bishop, Ethan Hall, Victor Carbune, Abhinav Rastogi, and Sushant Prakash. RLAIF vs. RLHF: Scaling reinforcement learning from human feedback with AI feedback. In Proceedings of the 41st International Conference on Machine Learning, volume 235. PMLR, 2024.
  [41] Shangzhe Li, Xuchao Zhang, Chetan Bansal, and Weitong Zhang. Your self-play algorithm is secretly an adversarial imitator: Understanding LLM self-play through the lens of imitation learning. arXiv preprint arXiv:2602.01357, 2026.
  [42] Tian Li, Ahmad Beirami, Maziar Sanjabi, and Virginia Smith. Tilted empirical risk minimization. In International Conference on Learning Representations, 2021.
  [43] Yingzhen Li and Richard E. Turner. Rényi divergence variational inference. In Advances in Neural Information Processing Systems, volume 29, 2016.
  [44] Wenjie Liao, Xiaohui Song, and Haonan Lu. Drift: Difference-aware reinforcement through iterative fine-tuning for language model. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 31988–31996, 2026.
  [45] Wei Liu, Weihao Zeng, Keqing He, Yong Jiang, and Junxian He. What makes good data for alignment? A comprehensive study of automatic data selection in instruction tuning. arXiv preprint arXiv:2312.15685, 2023.
  [46] Yu Meng, Mengzhou Xia, and Danqi Chen. SimPO: Simple preference optimization with a reference-free reward. In Advances in Neural Information Processing Systems, 2024.
  [47] Ilya Mironov. Rényi differential privacy. In 2017 IEEE 30th Computer Security Foundations Symposium (CSF), pages 263–275. IEEE, 2017.
  [48] Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. Cross-task generalization via natural language crowdsourcing instructions. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3470–3487, 2022.
  [49] Jerzy Neyman. Contribution to the theory of the chi-square test. In Proceedings of the First Berkeley Symposium on Mathematical Statistics and Probability, pages 239–273. University of California Press, 1949.
  [50] XuanLong Nguyen, Martin J. Wainwright, and Michael I. Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, 2010.
  [51] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, volume 29, 2016.
  [52] OpenAI. GPT-4. https://openai.com/index/gpt-4-research/, 2023. Accessed: 2026-04-04.
  [53] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  [54] Arka Pal, Deep Karkhanis, Samuel Dooley, Manley Roberts, Siddartha Naidu, and Colin White. Smaug: Fixing failure modes of preference optimisation with DPO-positive. arXiv preprint arXiv:2402.13228, 2024.
  [55] Karl Pearson. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 50(302):157–175, 1900.
  [56] Rafael Rafailov, Joey Hejna, Ryan Park, and Chelsea Finn. From r to Q*: Your language model is secretly a Q-function. In First Conference on Language Modeling, 2024.
  [57] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023.
  [58] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. ZeRO: Memory optimizations toward training trillion parameter models. In International Conference for High Performance Computing, Networking, Storage and Analysis, 2020.
  [59] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. arXiv preprint arXiv:2311.12022, 2023.
  [60] Mengjie Ren, Boxi Cao, Hongyu Lin, Cao Liu, Xianpei Han, Ke Zeng, Wan Guanglu, Xunliang Cai, and Le Sun. Learning or self-aligning? Rethinking instruction fine-tuning. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6090–6105, 2024.
  [61] Alfréd Rényi. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 547–561. University of California Press, 1961.
  [62] Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, et al. Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.
  [63] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An adversarial Winograd schema challenge at scale. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 13726–13734, 2021.
  [64] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.
  [65] Zayne Sprague, Xi Ye, Kaj Bostrom, Swarat Chaudhuri, and Greg Durrett. MuSR: Testing the limits of chain-of-thought with multistep soft reasoning. In International Conference on Learning Representations, 2024.
  [66] Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V. Le, Ed H. Chi, Denny Zhou, and Jason Wei. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. Findings of the Association for Computational Linguistics: ACL 2023, pages 13003–13051, 2023.
  [67] Zhen Tan, Dawei Li, Song Wang, Alimohammad Beigi, Bohan Jiang, Amrita Bhattacharjee, Mansooreh Karami, Jundong Li, Lu Cheng, and Huan Liu. Large language models for data annotation and synthesis: A survey. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 930–957, 2024.
  [68] Xiaohang Tang, Sangwoong Yoon, Seongho Son, Huizhuo Yuan, Quanquan Gu, and Ilija Bogunovic. RSPO: Regularized self-play alignment of large language models. arXiv preprint arXiv:2503.00030, 2025.
  [69] Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. LaMDA: Language models for dialog applications. arXiv preprint arXiv:2201.08239, 2022.
  [70] Songjun Tu, Jiahao Lin, Xiangyu Tian, Qichao Zhang, Linjing Li, Yuqian Fu, Nan Xu, Wei He, Xiangyuan Lan, Dongmei Jiang, et al. Enhancing LLM reasoning with iterative DPO: A comprehensive empirical investigation. arXiv preprint arXiv:2503.12854, 2025.
  [71] Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Shengyi Huang, Kashif Rasul, Alvaro Bartolome, Alexander M. Rush, and Thomas Wolf. The alignment handbook. GitHub repository, 2024.
  [72] Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro von Werra, Clémentine Fourrier, Nathan Habib, et al. Zephyr: Direct distillation of LM alignment. arXiv preprint arXiv:2310.16944, 2023.
  [73] Tim van Erven and Peter Harremoës. Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory, 60(7):3797–3820, 2014.
  [74] Chaoqi Wang, Yibo Jiang, Chenghao Yang, Han Liu, and Yuxin Chen. Beyond reverse KL: Generalizing direct preference optimization with diverse divergence constraints. In The Twelfth International Conference on Learning Representations, 2024.
  [75] Mingzhi Wang, Chengdong Ma, Qizhi Chen, Linjian Meng, Yang Han, Jiancong Xiao, Zhaowei Zhang, Jing Huo, Weijie J. Su, and Yaodong Yang. Magnetic preference optimization: Achieving last-iterate convergence for language model alignment. In International Conference on Learning Representations, 2025.
  [76] Yibo Wang, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, and Lijun Zhang. SPACE: Noise contrastive estimation stabilizes self-play fine-tuning for large language models. arXiv preprint arXiv:2512.07175, 2025.
  [77] Yibo Wang, Hai-Long Sun, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, and Lijun Zhang. Triplets better than pairs: Towards stable and effective self-play fine-tuning for LLMs. arXiv preprint arXiv:2601.08198, 2026.
  [78] Yizhong Wang, Hamish Ivison, Pradeep Dasigi, Jack Hessel, Tushar Khot, Khyathi Chandu, David Wadden, Kelsey MacMillan, Noah A. Smith, Iz Beltagy, et al. How far can camels go? Exploring the state of instruction tuning on open resources. Advances in Neural Information Processing Systems, 36:74764–74786, 2023.
  [79] Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. In Advances in Neural Information Processing Systems, 2024.
  [80] Shiguang Wu, Yaqing Wang, and Quanming Yao. Self-generative adversarial fine-tuning for large language models. arXiv preprint arXiv:2602.01137, 2026.

Showing first 80 references.