Post-Training is About States, Not Tokens: A State Distribution View of SFT, RL, and On-Policy Distillation

Dong Nie

arxiv: 2605.22731 · v1 · pith:PBNEBYWPnew · submitted 2026-05-21 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-22 07:23 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords post-training · supervised fine-tuning · reinforcement learning · on-policy distillation · state distribution · language model fine-tuning · GSM8K

The pith

Post-training of language models depends as much on the distribution of training states as on the supervision signal itself.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that post-training methods for autoregressive language models should be analyzed through the states they supervise, where a state consists of a prompt plus the generated prefix. SFT applies supervision to fixed dataset states, while RL and on-policy distillation apply it to states generated by the current model. Controlled experiments on Qwen3-0.6B-Base with GSM8K demonstrate that on-policy approaches can improve task performance and reduce forgetting compared to certain SFT regimes, even when starting from a degraded teacher. A reader would care because this reframes post-training design around controlling state source and locality rather than solely tuning loss functions.
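
To make the notion of a state concrete, here is a minimal sketch that enumerates (prompt, prefix) pairs from a fixed dataset and from sampled rollouts. The function names and `toy_policy` are hypothetical stand-ins invented for illustration; they are not the paper's code, and a real setup would sample from the language model being trained.

```python
import random

def prefix_states(prompt, completion_tokens):
    """Return the (prompt, prefix) state seen before each token of the completion."""
    return [(prompt, tuple(completion_tokens[:t])) for t in range(len(completion_tokens))]

def sft_states(dataset):
    """Fixed-dataset states: prefixes of the reference completions."""
    states = []
    for prompt, reference in dataset:
        states.extend(prefix_states(prompt, reference))
    return states

def on_policy_states(prompts, sample_fn, rng):
    """Learner-induced states: prefixes of completions sampled from the current policy."""
    states = []
    for prompt in prompts:
        states.extend(prefix_states(prompt, sample_fn(prompt, rng)))
    return states

def toy_policy(prompt, rng, max_len=4):
    """Hypothetical stand-in for a language model: emits random tokens from a tiny vocabulary."""
    vocab = ["2", "+", "3", "=", "5", "<eos>"]
    out = []
    for _ in range(max_len):
        tok = rng.choice(vocab)
        out.append(tok)
        if tok == "<eos>":
            break
    return out

dataset = [("What is 2+3?", ["2", "+", "3", "=", "5", "<eos>"])]
fixed = sft_states(dataset)
induced = on_policy_states([p for p, _ in dataset], toy_policy, random.Random(0))
```

Both lists start from the same empty-prefix state, but later states differ: `fixed` only ever contains prefixes of the reference answer, while `induced` contains whatever prefixes the sampler happened to produce.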

Core claim

Post-training is about states, not tokens: the source and locality of training states can be as important as the form of the supervision signal. In small-scale experiments, a mild SFT improves GSM8K with little forgetting while a stress SFT causes substantial retention loss; on-policy distillation from a degraded SFT teacher surpasses that teacher across GSM8K, TruthfulQA, and MMLU; and a lightweight on-policy RL run improves GSM8K while preserving retention.

What carries the argument

State-distribution shaping: the source and locality of states (prompt plus generated prefix) determine the outcomes of SFT, RL, and on-policy distillation.

If this is right

Mild SFT on fixed states can improve task accuracy with minimal retention loss on other benchmarks.
On-policy distillation can exceed the performance of its teacher model even when that teacher is degraded.
Lightweight on-policy RL improves the target task while better preserving performance on retention evaluations than stress SFT.
Controlling whether states come from a fixed dataset or the current learner affects both capability gains and forgetting.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Designers could treat state locality as an explicit hyperparameter to trade off task improvement against retention.
The state-centric view may extend to other autoregressive domains such as code generation or multimodal models.
Hybrid training loops that deliberately mix fixed and on-policy states could combine the strengths of SFT and RL.
State distribution analysis might help diagnose why certain post-training recipes succeed or fail across model scales.
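
The first and third extensions above can be sketched as a sampler with a single mixing coefficient. This is an illustrative assumption, not something the paper proposes: the name `locality` and the function `mixed_state_batch` are hypothetical, and the states are treated as opaque items.

```python
import random

def mixed_state_batch(dataset_states, on_policy_states, locality, batch_size, rng):
    """Draw a batch in which each state is on-policy with probability `locality`."""
    if not 0.0 <= locality <= 1.0:
        raise ValueError("locality must be in [0, 1]")
    batch = []
    for _ in range(batch_size):
        # rng.random() lies in [0, 1), so locality=0 never picks on-policy
        # states and locality=1 always does.
        pool = on_policy_states if rng.random() < locality else dataset_states
        batch.append(rng.choice(pool))
    return batch
```

Setting `locality=0.0` recovers pure fixed-dataset training states and `locality=1.0` recovers pure on-policy states; intermediate values give the hybrid loop described above.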

Load-bearing premise

The observed performance differences between SFT, OPD, and RL are driven primarily by differences in state distribution rather than by variations in training hyperparameters or optimization dynamics.

What would settle it

Re-run the GSM8K experiments while forcing SFT, RL, and OPD to sample from identical state distributions and check whether the performance and retention gaps disappear.
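
Checking that two regimes really train on identical state distributions needs some distance between the visited states. A minimal sketch, assuming states are hashable and the state space is small enough for exact counting, is the total variation distance between empirical distributions. This is an illustrative choice and not a measure the paper is known to use; for real language-model prefixes, exact-match counts would be far too sparse.

```python
from collections import Counter

def empirical_distribution(states):
    """Map each distinct state to its relative frequency."""
    if not states:
        raise ValueError("need at least one state")
    counts = Counter(states)
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}

def total_variation(states_a, states_b):
    """Total variation distance between two empirical state distributions."""
    p = empirical_distribution(states_a)
    q = empirical_distribution(states_b)
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(s, 0.0) - q.get(s, 0.0)) for s in support)
```

A distance of 0 means the two runs visited the same states with the same frequencies; a distance of 1 means they shared no states at all.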

Original abstract

Large language model post-training methods such as supervised fine-tuning (SFT), reinforcement learning (RL), and distillation are often analyzed through their loss functions: maximum likelihood, policy gradients, forward KL, reverse KL, or related objective-level variants. We study a complementary factor: the state distribution on which supervision is applied. For an autoregressive policy, a state is a prompt plus generated prefix. SFT trains on fixed dataset states, while RL and on-policy distillation (OPD) train on states induced by the current learner. We formalize post-training as state-distribution shaping and run a controlled small-scale study using Qwen3-0.6B-Base on GSM8K, with TruthfulQA and MMLU as retention evaluations. Our results show three phenomena. First, a mild SFT run improves GSM8K with little forgetting, while a stress SFT run causes substantial retention loss. Second, OPD from a degraded SFT teacher surpasses that teacher on GSM8K, TruthfulQA, and MMLU, despite using the teacher as its only supervision source. Third, a lightweight on-policy RL run improves GSM8K while preserving retention. These results support a state-centric view of post-training: the source and locality of training states can be as important as the form of the supervision signal.
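
The contrast the abstract draws can be written down in a few symbols. The notation below is assumed for illustration and is not taken from the paper: $d_{\mathcal{D}}$ is the distribution over states obtained from prefixes of dataset completions, $d_{\pi_\theta}$ is the distribution over states reached by rolling out the learner $\pi_\theta$, $a^{\star}(s)$ is the dataset's next token at state $s$, and $\pi_T$ is the teacher. Reverse KL is shown as one common choice for on-policy distillation; the abstract names both forward and reverse KL without saying which the study uses.

```latex
% Notation assumed for illustration; not taken from the paper.
\begin{align}
\mathcal{L}_{\mathrm{SFT}}(\theta)
  &= \mathbb{E}_{s \sim d_{\mathcal{D}}}\!\left[-\log \pi_\theta\!\left(a^{\star}(s) \mid s\right)\right], \\
\mathcal{L}_{\mathrm{OPD}}(\theta)
  &= \mathbb{E}_{s \sim d_{\pi_\theta}}\!\left[\mathrm{KL}\!\left(\pi_\theta(\cdot \mid s)\,\Vert\,\pi_T(\cdot \mid s)\right)\right].
\end{align}
```

The two objectives differ in the per-state loss and in the distribution under the expectation. The state-centric view asks how much of the observed behavior comes from the second difference, which also changes as $\theta$ is updated because $d_{\pi_\theta}$ depends on the learner.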

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

State distribution offers a useful complementary lens on post-training, but the small study does not yet isolate it cleanly from optimization differences.

The main point worth knowing is that this paper pushes a state-distribution framing for post-training: SFT uses fixed dataset states while RL and on-policy distillation generate states from the current learner. The author backs it with three observations from a 0.6B Qwen model on GSM8K plus retention checks on TruthfulQA and MMLU: mild SFT improves the target task with little forgetting while stress SFT hurts retention; on-policy distillation beats its degraded teacher; and light on-policy RL improves the task while keeping retention. The framing itself is the clearest addition here, shifting attention from loss functions alone to where the supervision actually lands during training.

Referee Report

2 major / 1 minor

Summary. The paper argues that post-training of large language models is better understood as shaping the distribution of states (prompt plus generated prefix) on which supervision is applied, rather than focusing primarily on the form of the loss function. It supports this view with a controlled small-scale empirical study using Qwen3-0.6B-Base on GSM8K (with TruthfulQA and MMLU for retention), reporting that mild SFT improves task performance with little forgetting, stress SFT causes substantial retention loss, on-policy distillation (OPD) from a degraded teacher can surpass the teacher, and lightweight on-policy RL improves GSM8K while preserving retention. The central claim is that the source and locality of training states can be as important as the supervision signal.

Significance. If the empirical phenomena hold under tighter controls, the work offers a useful complementary perspective that could guide more effective post-training by emphasizing state distribution matching or shaping. The small-scale study provides concrete, falsifiable observations (mild vs. stress SFT, OPD surpassing teacher, RL retention) that are worth testing at larger scales. No machine-checked proofs or parameter-free derivations are present, but the directional results constitute an initial empirical test of the state-centric framing.

major comments (2)

The manuscript describes a 'controlled small-scale study' comparing SFT, OPD, and RL but provides no explicit details on equalization of optimization hyperparameters (learning rate, batch size, total gradient steps, optimizer, or loss scaling) across regimes. In a 0.6B model on GSM8K, unmatched dynamics could drive the reported gaps (e.g., OPD surpassing teacher or RL retention) independently of state locality; without matching or ablations, the attribution to state distribution remains unisolated.
The claim that performance differences are driven by state distribution (fixed dataset vs. learner-induced states) would be strengthened by an ablation that applies the same loss but with off-policy or mismatched states; the current results do not rule out that the observed advantages of OPD and RL stem from on-policy sampling mechanics rather than the state distribution per se.
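
The ablation requested in the second comment amounts to holding the loss fixed and swapping only the states it is evaluated on. A toy sketch under stated assumptions: the tabular `student` and `teacher` policies and the state labels are hypothetical, and reverse KL stands in for whatever distillation loss the study actually uses.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two distributions given as dicts over the same token set."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p if p[t] > 0.0)

def distillation_loss(student, teacher, states):
    """Mean reverse KL, KL(student || teacher), over the supplied states."""
    return sum(kl_divergence(student(s), teacher(s)) for s in states) / len(states)

# Hypothetical tabular policies: each maps a state to a next-token distribution.
def student(state):
    return {"a": 0.5, "b": 0.5} if state == "s_dataset" else {"a": 0.9, "b": 0.1}

def teacher(state):
    return {"a": 0.5, "b": 0.5} if state == "s_dataset" else {"a": 0.6, "b": 0.4}

# Same loss function, different state source.
off_policy_loss = distillation_loss(student, teacher, ["s_dataset"])
on_policy_loss = distillation_loss(student, teacher, ["s_learner"])
```

In this toy case the student already matches the teacher on the dataset state, so the off-policy loss is zero and gives no training signal, while the same loss on the learner-visited state is positive. Whether a gap of this kind explains the paper's results is exactly what the ablation would test.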

minor comments (1)

The abstract and results section would benefit from a brief table or quantitative summary of the exact performance deltas and retention metrics to make the directional claims easier to evaluate without full text access.
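
A summary of the kind requested here could be generated from per-run scores with a few lines of code. The sketch below is illustrative only: the function names are hypothetical, and the numbers are placeholders that are NOT results from the paper.

```python
def delta_table(base, runs):
    """Return {run: {benchmark: score minus base score}} for every benchmark in `base`."""
    return {
        name: {bench: round(scores[bench] - base[bench], 4) for bench in base}
        for name, scores in runs.items()
    }

def format_table(deltas, benchmarks):
    """Render the deltas as a fixed-width plain-text table with signed values."""
    header = "run".ljust(12) + "".join(b.rjust(12) for b in benchmarks)
    rows = [
        name.ljust(12) + "".join(f"{d[b]:+.2f}".rjust(12) for b in benchmarks)
        for name, d in deltas.items()
    ]
    return "\n".join([header] + rows)

# Placeholder numbers for illustration only; they are NOT results from the paper.
base = {"GSM8K": 10.0, "TruthfulQA": 20.0, "MMLU": 30.0}
runs = {"run_a": {"GSM8K": 15.0, "TruthfulQA": 19.0, "MMLU": 30.0}}
deltas = delta_table(base, runs)
table = format_table(deltas, ["GSM8K", "TruthfulQA", "MMLU"])
```

A positive GSM8K delta alongside near-zero TruthfulQA and MMLU deltas would correspond to the "improves the task while preserving retention" pattern the paper describes.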

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's detailed review and constructive suggestions for improving our manuscript on the state distribution view of post-training. We address each major comment below, providing clarifications and outlining planned revisions to enhance the rigor of our empirical study.

Referee: The manuscript describes a 'controlled small-scale study' comparing SFT, OPD, and RL but provides no explicit details on equalization of optimization hyperparameters (learning rate, batch size, total gradient steps, optimizer, or loss scaling) across regimes. In a 0.6B model on GSM8K, unmatched dynamics could drive the reported gaps (e.g., OPD surpassing teacher or RL retention) independently of state locality; without matching or ablations, the attribution to state distribution remains unisolated.

Authors: We agree that providing explicit details on hyperparameter settings is essential to support the claim that differences arise from state distributions rather than optimization dynamics. In our controlled study, we matched the optimizer (AdamW with the same beta parameters), learning rate (with identical warmup and decay schedules), batch size, and total number of gradient steps across the SFT, OPD, and RL regimes. Loss scaling was also kept consistent where applicable. We regret that these details were not included in the initial submission and will add a comprehensive table or subsection detailing all hyperparameters for each method in the revised manuscript.
revision: yes

Referee: The claim that performance differences are driven by state distribution (fixed dataset vs. learner-induced states) would be strengthened by an ablation that applies the same loss but with off-policy or mismatched states; the current results do not rule out that the observed advantages of OPD and RL stem from on-policy sampling mechanics rather than the state distribution per se.

Authors: This is a valid point for further isolating the role of state distributions. Our design already varies the state source while using method-appropriate losses: SFT applies the standard next-token prediction loss on a fixed dataset of states, whereas OPD and RL apply their respective objectives on states generated on-policy by the learner. The surprising result that OPD from a degraded teacher exceeds the teacher's performance on multiple benchmarks is hard to explain without invoking the benefit of training on the learner's own state distribution. That said, we acknowledge that an additional experiment applying, for example, the SFT loss on on-policy states or a distillation loss on off-policy states would provide a cleaner isolation. We will include a discussion of this potential ablation and its interpretation in the revised version; performing the full set of new runs may be reported as future work given the scope of the current small-scale study.
revision: partial
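
The hyperparameter matching claimed in the first response is the kind of thing that can be checked mechanically before any run. A minimal sketch, assuming each regime's settings are collected into one record: the `OptimConfig` fields mirror the items the response lists, but the class, the function, and every value below are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, asdict, replace

@dataclass(frozen=True)
class OptimConfig:
    optimizer: str
    betas: tuple
    learning_rate: float
    warmup_steps: int
    batch_size: int
    total_steps: int

def mismatched_fields(configs):
    """Return {field: {regime: value}} for every field that differs across regimes."""
    as_dicts = {name: asdict(cfg) for name, cfg in configs.items()}
    fields = next(iter(as_dicts.values())).keys()
    return {
        f: {name: d[f] for name, d in as_dicts.items()}
        for f in fields
        if len({d[f] for d in as_dicts.values()}) > 1
    }

# Hypothetical values for illustration; the study's actual settings are not given here.
shared = OptimConfig("AdamW", (0.9, 0.999), 1e-5, 100, 32, 1000)
configs = {"sft": shared, "opd": shared, "rl": shared}
```

An empty result from `mismatched_fields(configs)` means the regimes share every listed setting; any non-empty entry names a confound that would need an ablation or a justification.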

Circularity Check

0 steps flagged

Empirical comparisons of training regimes with no derivation reducing to self-inputs

The paper conducts a controlled small-scale empirical study on Qwen3-0.6B-Base using GSM8K with TruthfulQA and MMLU retention checks. It compares outcomes across fixed-dataset SFT, on-policy RL, and on-policy distillation, attributing differences to state locality. No equations, fitted parameters, or uniqueness theorems are invoked that reduce the reported phenomena to inputs defined by the paper itself. The central claims rest on observed performance gaps from distinct training runs rather than any self-definitional construction or self-citation chain, rendering the analysis self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work rests on standard machine-learning assumptions that the chosen benchmarks are valid proxies for capability and retention and that the small-scale controlled runs isolate the intended variable; no new free parameters, axioms, or invented entities are introduced in the abstract.

pith-pipeline@v0.9.0 · 5772 in / 1263 out tokens · 56492 ms · 2026-05-22T07:23:01.349604+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Each tag gives the relation between the paper passage and the cited Recognition theorem.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "We formalize post-training as state-distribution shaping... SFT trains on fixed dataset states, while RL and on-policy distillation (OPD) train on states induced by the current learner."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "$d_{k+1}(s) = T(d_k(s), \text{signal})$; $q_k$ is the training state distribution"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

