Post-Training is About States, Not Tokens: A State Distribution View of SFT, RL, and On-Policy Distillation
Pith reviewed 2026-05-22 07:23 UTC · model grok-4.3
The pith
Post-training of language models depends as much on the distribution of training states as on the supervision signal itself.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Post-training is about states, not tokens: the source and locality of training states can be as important as the form of the supervision signal. In small-scale experiments, a mild SFT improves GSM8K with little forgetting while a stress SFT causes substantial retention loss; on-policy distillation from a degraded SFT teacher surpasses that teacher across GSM8K, TruthfulQA, and MMLU; and a lightweight on-policy RL run improves GSM8K while preserving retention.
What carries the argument
State distribution shaping, the mechanism by which the source and locality of states (prompt plus generated prefix) determine the outcomes of SFT, RL, and on-policy distillation.
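The distinction between fixed and learner-induced states can be sketched in a few lines. This is a minimal illustration, not the paper's code: the names `prefix_states`, `fixed_dataset_states`, `on_policy_states`, and the `sample` callback are hypothetical, and tokens are plain strings for readability.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

# A state is the prompt tokens followed by the tokens generated so far.
State = Tuple[str, ...]


def prefix_states(prompt: Sequence[str], completion: Sequence[str]) -> List[State]:
    """Return every state visited while producing `completion` after `prompt`.

    The state at step t is the prompt plus the first t completion tokens,
    i.e. the context on which the token at position t would be supervised.
    """
    return [tuple(prompt) + tuple(completion[:t]) for t in range(len(completion))]


def fixed_dataset_states(
    dataset: Iterable[Tuple[Sequence[str], Sequence[str]]],
) -> List[State]:
    """SFT-style source: states come from reference completions in the dataset."""
    states: List[State] = []
    for prompt, reference in dataset:
        states.extend(prefix_states(prompt, reference))
    return states


def on_policy_states(
    prompts: Iterable[Sequence[str]],
    sample: Callable[[Sequence[str]], Sequence[str]],
) -> List[State]:
    """RL/OPD-style source: states come from completions the learner samples.

    `sample` is a stand-in (assumption) for drawing a completion from the
    current learner; any callable from prompt to token list works here.
    """
    states: List[State] = []
    for prompt in prompts:
        states.extend(prefix_states(prompt, sample(prompt)))
    return states
```

Both sources share the first state (the bare prompt) and diverge as soon as the learner's sampled tokens differ from the reference, which is the sense in which the state distribution depends on who generated the prefix.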
If this is right
- Mild SFT on fixed states can improve task accuracy with minimal retention loss on other benchmarks.
- On-policy distillation can exceed the performance of its teacher model even when that teacher is degraded.
- Lightweight on-policy RL improves the target task while better preserving performance on retention evaluations than stress SFT.
- Controlling whether states come from a fixed dataset or the current learner affects both capability gains and forgetting.
Where Pith is reading between the lines
- Designers could treat state locality as an explicit hyperparameter to trade off task improvement against retention.
- The state-centric view may extend to other autoregressive domains such as code generation or multimodal models.
- Hybrid training loops that deliberately mix fixed and on-policy states could combine the strengths of SFT and RL.
- State distribution analysis might help diagnose why certain post-training recipes succeed or fail across model scales.
Load-bearing premise
The observed performance differences between SFT, OPD, and RL are driven primarily by differences in state distribution rather than by variations in training hyperparameters or optimization dynamics.
What would settle it
Re-run the GSM8K experiments while forcing SFT, RL, and OPD to sample from identical state distributions and check whether the performance and retention gaps disappear.
Original abstract
Large language model post-training methods such as supervised fine-tuning (SFT), reinforcement learning (RL), and distillation are often analyzed through their loss functions: maximum likelihood, policy gradients, forward KL, reverse KL, or related objective-level variants. We study a complementary factor: the state distribution on which supervision is applied. For an autoregressive policy, a state is a prompt plus generated prefix. SFT trains on fixed dataset states, while RL and on-policy distillation (OPD) train on states induced by the current learner. We formalize post-training as state-distribution shaping and run a controlled small-scale study using Qwen3-0.6B-Base on GSM8K, with TruthfulQA and MMLU as retention evaluations. Our results show three phenomena. First, a mild SFT run improves GSM8K with little forgetting, while a stress SFT run causes substantial retention loss. Second, OPD from a degraded SFT teacher surpasses that teacher on GSM8K, TruthfulQA, and MMLU, despite using the teacher as its only supervision source. Third, a lightweight on-policy RL run improves GSM8K while preserving retention. These results support a state-centric view of post-training: the source and locality of training states can be as important as the form of the supervision signal.
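One way to write down the state-centric view described in the abstract is as an expectation over a training state distribution. The notation below is an illustrative assumption rather than the paper's own formalization: $q_k$ denotes the training state distribution at step $k$, $\ell$ is a generic per-state loss, $d_{\mathcal{D}}$ is the distribution of states in the fixed dataset, and $d_{\pi_{\theta_k}}$ is the distribution of states induced by the current learner.

```latex
\[
\mathcal{L}_k(\theta)
  \;=\; \mathbb{E}_{s \sim q_k}\!\left[\,
        \ell\big(\pi_\theta(\cdot \mid s),\, \mathrm{signal}(s)\big) \right],
\qquad
q_k =
\begin{cases}
  d_{\mathcal{D}}        & \text{(SFT: fixed dataset states)} \\
  d_{\pi_{\theta_k}}     & \text{(RL, OPD: states induced by the current learner)}
\end{cases}
\]
```

Under this reading, changing the objective alters $\ell$ and the supervision signal, while changing the state source alters $q_k$, so the two can in principle be varied independently.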
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper argues that post-training of large language models is better understood as shaping the distribution of states (prompt plus generated prefix) on which supervision is applied, rather than focusing primarily on the form of the loss function. It supports this view with a controlled small-scale empirical study using Qwen3-0.6B-Base on GSM8K (with TruthfulQA and MMLU for retention), reporting that mild SFT improves task performance with little forgetting, stress SFT causes substantial retention loss, on-policy distillation (OPD) from a degraded teacher can surpass the teacher, and lightweight on-policy RL improves GSM8K while preserving retention. The central claim is that the source and locality of training states can be as important as the supervision signal.
Significance. If the empirical phenomena hold under tighter controls, the work offers a useful complementary perspective that could guide more effective post-training by emphasizing state distribution matching or shaping. The small-scale study provides concrete, falsifiable observations (mild vs. stress SFT, OPD surpassing teacher, RL retention) that are worth testing at larger scales. No machine-checked proofs or parameter-free derivations are present, but the directional results constitute an initial empirical test of the state-centric framing.
major comments (2)
- The manuscript describes a 'controlled small-scale study' comparing SFT, OPD, and RL but provides no explicit details on equalization of optimization hyperparameters (learning rate, batch size, total gradient steps, optimizer, or loss scaling) across regimes. In a 0.6B model on GSM8K, unmatched dynamics could drive the reported gaps (e.g., OPD surpassing teacher or RL retention) independently of state locality; without matching or ablations, the attribution to state distribution remains unisolated.
- The claim that performance differences are driven by state distribution (fixed dataset vs. learner-induced states) would be strengthened by an ablation that applies the same loss but with off-policy or mismatched states; the current results do not rule out that the observed advantages of OPD and RL stem from on-policy sampling mechanics rather than the state distribution per se.
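The hyperparameter-matching concern in the first major comment can be made mechanical. The sketch below is an assumption about how such a check might look, not something the paper provides: `OptimConfig` and `mismatched_fields` are hypothetical names, and the fields mirror the quantities the referee lists (optimizer, learning rate, batch size, total gradient steps, loss scaling).

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Set


@dataclass(frozen=True)
class OptimConfig:
    """Optimization settings that should match across training regimes."""
    optimizer: str
    learning_rate: float
    batch_size: int
    total_steps: int
    loss_scale: float


def mismatched_fields(configs: Dict[str, OptimConfig]) -> Dict[str, Set[Any]]:
    """Map each hyperparameter that differs across regimes to its set of values.

    An empty result means every listed field is identical in all regimes.
    """
    if not configs:
        return {}
    rows = [asdict(cfg) for cfg in configs.values()]
    diffs: Dict[str, Set[Any]] = {}
    for field in rows[0]:
        values = {row[field] for row in rows}
        if len(values) > 1:
            diffs[field] = values
    return diffs


# Placeholder values for illustration only; they are NOT taken from the paper.
example_configs = {
    "sft": OptimConfig("adamw", 1e-5, 32, 1000, 1.0),
    "opd": OptimConfig("adamw", 1e-5, 32, 1000, 1.0),
    "rl": OptimConfig("adamw", 1e-5, 32, 500, 1.0),
}
```

Calling `mismatched_fields(example_configs)` flags `total_steps` as the only unmatched field in this made-up example; a report of this kind per regime would address the referee's request directly.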
minor comments (1)
- The abstract and results section would benefit from a brief table or quantitative summary of the exact performance deltas and retention metrics to make the directional claims easier to evaluate without full text access.
Simulated Author's Rebuttal
We appreciate the referee's detailed review and constructive suggestions for improving our manuscript on the state distribution view of post-training. We address each major comment below, providing clarifications and outlining planned revisions to enhance the rigor of our empirical study.
Point-by-point responses
- Referee: The manuscript describes a 'controlled small-scale study' comparing SFT, OPD, and RL but provides no explicit details on equalization of optimization hyperparameters (learning rate, batch size, total gradient steps, optimizer, or loss scaling) across regimes. In a 0.6B model on GSM8K, unmatched dynamics could drive the reported gaps (e.g., OPD surpassing teacher or RL retention) independently of state locality; without matching or ablations, the attribution to state distribution remains unisolated.
  Authors: We agree that providing explicit details on hyperparameter settings is essential to support the claim that differences arise from state distributions rather than optimization dynamics. In our controlled study, we matched the optimizer (AdamW with the same beta parameters), learning rate (with identical warmup and decay schedules), batch size, and total number of gradient steps across the SFT, OPD, and RL regimes. Loss scaling was also kept consistent where applicable. We regret that these details were not included in the initial submission and will add a comprehensive table or subsection detailing all hyperparameters for each method in the revised manuscript.
  Revision: yes
- Referee: The claim that performance differences are driven by state distribution (fixed dataset vs. learner-induced states) would be strengthened by an ablation that applies the same loss but with off-policy or mismatched states; the current results do not rule out that the observed advantages of OPD and RL stem from on-policy sampling mechanics rather than the state distribution per se.
  Authors: This is a valid point for further isolating the role of state distributions. Our design already varies the state source while using method-appropriate losses: SFT applies the standard next-token prediction loss on a fixed dataset of states, whereas OPD and RL apply their respective objectives on states generated on-policy by the learner. The surprising result that OPD from a degraded teacher exceeds the teacher's performance on multiple benchmarks is hard to explain without invoking the benefit of training on the learner's own state distribution. That said, we acknowledge that an additional experiment applying, for example, the SFT loss on on-policy states or a distillation loss on off-policy states would provide a cleaner isolation. We will include a discussion of this potential ablation and its interpretation in the revised version; given the scope of the current small-scale study, the full set of new runs may be left to future work.
  Revision: partial
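The ablation discussed in this exchange amounts to crossing the loss with the state source and running the cells the original design leaves empty. The sketch below enumerates those cells; the labels (`sft_nll`, `distill_kl`, `rl_policy_gradient`, `fixed_dataset`, `on_policy`) are hypothetical names chosen for illustration, and the `COVERED` set encodes only the pairings the abstract describes.

```python
from itertools import product
from typing import List, Tuple

# Hypothetical labels for the two factors being crossed.
LOSSES = ("sft_nll", "distill_kl", "rl_policy_gradient")
STATE_SOURCES = ("fixed_dataset", "on_policy")

# Pairings already covered by the three regimes described in the abstract:
# SFT on fixed dataset states, OPD and RL on learner-induced states.
COVERED = {
    ("sft_nll", "fixed_dataset"),
    ("distill_kl", "on_policy"),
    ("rl_policy_gradient", "on_policy"),
}


def ablation_cells() -> List[Tuple[str, str]]:
    """Return the (loss, state source) pairs not covered by the original design."""
    return [cell for cell in product(LOSSES, STATE_SOURCES) if cell not in COVERED]
```

The first two missing cells, the SFT loss on on-policy states and a distillation loss on fixed (off-policy) states, are the ones the authors name in their response; the policy-gradient loss on fixed states appears only because the grid is complete, and it would need an off-policy correction to be meaningful.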
Circularity Check
Empirical comparisons of training regimes with no derivation reducing to self-inputs
Full rationale
The paper conducts a controlled small-scale empirical study on Qwen3-0.6B-Base using GSM8K with TruthfulQA and MMLU retention checks. It compares outcomes across fixed-dataset SFT, on-policy RL, and on-policy distillation, attributing differences to state locality. No equations, fitted parameters, or uniqueness theorems are invoked that reduce the reported phenomena to inputs defined by the paper itself. The central claims rest on observed performance gaps from distinct training runs rather than any self-definitional construction or self-citation chain, rendering the analysis self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: We formalize post-training as state-distribution shaping... SFT trains on fixed dataset states, while RL and on-policy distillation (OPD) train on states induced by the current learner.
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: $d_{k+1}(s) = T(d_k(s), \text{signal})$; $q_k$ is the training state distribution
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.