
arxiv: 2605.22731 · v1 · pith:PBNEBYWPnew · submitted 2026-05-21 · 💻 cs.LG · cs.AI

Post-Training is About States, Not Tokens: A State Distribution View of SFT, RL, and On-Policy Distillation

Pith reviewed 2026-05-22 07:23 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords post-training · supervised fine-tuning · reinforcement learning · on-policy distillation · state distribution · language model fine-tuning · GSM8K

The pith

Post-training of language models depends as much on the distribution of training states as on the supervision signal itself.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that post-training methods for autoregressive language models should be analyzed through the states they supervise, where a state consists of a prompt plus the generated prefix. SFT applies supervision to fixed dataset states, while RL and on-policy distillation apply it to states generated by the current model. Controlled experiments on Qwen3-0.6B-Base with GSM8K demonstrate that on-policy approaches can improve task performance and reduce forgetting compared to certain SFT regimes, even when starting from a degraded teacher. A reader would care because this reframes post-training design around controlling state source and locality rather than solely tuning loss functions.

Core claim

Post-training is about states, not tokens: the source and locality of training states can be as important as the form of the supervision signal. In small-scale experiments, a mild SFT improves GSM8K with little forgetting while a stress SFT causes substantial retention loss; on-policy distillation from a degraded SFT teacher surpasses that teacher across GSM8K, TruthfulQA, and MMLU; and a lightweight on-policy RL run improves GSM8K while preserving retention.

What carries the argument

State distribution shaping, the mechanism by which the source and locality of states (prompt plus generated prefix) determine the outcomes of SFT, RL, and on-policy distillation.
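
To make the distinction between state sources concrete, the sketch below enumerates training states for a single prompt under both regimes. It is an illustration only, not the paper's code: the `Policy` interface, the helper names, the token strings, and the uniform-random `toy_policy` are all assumptions introduced here.

```python
import random
from typing import Callable, List, Sequence, Tuple

# A state is (prompt, prefix generated so far), following the paper's definition.
State = Tuple[str, Tuple[str, ...]]
# Hypothetical policy interface: (prompt, prefix, rng) -> next token.
Policy = Callable[[str, Sequence[str], random.Random], str]


def dataset_states(prompt: str, reference: Sequence[str]) -> List[State]:
    """Fixed states (SFT-style): one state per prefix of a dataset completion."""
    return [(prompt, tuple(reference[:t])) for t in range(len(reference))]


def on_policy_states(prompt: str, policy: Policy, max_len: int,
                     rng: random.Random) -> List[State]:
    """Learner-induced states (RL/OPD-style): prefixes the policy itself samples."""
    prefix: List[str] = []
    states: List[State] = []
    for _ in range(max_len):
        states.append((prompt, tuple(prefix)))
        token = policy(prompt, prefix, rng)
        if token == "<eos>":
            break
        prefix.append(token)
    return states


def toy_policy(prompt: str, prefix: Sequence[str], rng: random.Random) -> str:
    """Made-up stand-in for a language model: samples a token uniformly."""
    return rng.choice(["2", "+", "3", "=", "5", "<eos>"])


fixed = dataset_states("What is 2 + 3?", ["2", "+", "3", "=", "5"])
sampled = on_policy_states("What is 2 + 3?", toy_policy, max_len=5,
                           rng=random.Random(0))
print(len(fixed), "fixed states;", len(sampled), "sampled states")
```

The fixed states never change as the learner is updated, whereas the sampled states shift whenever the policy does. That difference is what the page refers to as the source and locality of training states.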

If this is right

  • Mild SFT on fixed states can improve task accuracy with minimal retention loss on other benchmarks.
  • On-policy distillation can exceed the performance of its teacher model even when that teacher is degraded.
  • Lightweight on-policy RL improves the target task while better preserving performance on retention evaluations than stress SFT.
  • Controlling whether states come from a fixed dataset or the current learner affects both capability gains and forgetting.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Designers could treat state locality as an explicit hyperparameter to trade off task improvement against retention.
  • The state-centric view may extend to other autoregressive domains such as code generation or multimodal models.
  • Hybrid training loops that deliberately mix fixed and on-policy states could combine the strengths of SFT and RL.
  • State distribution analysis might help diagnose why certain post-training recipes succeed or fail across model scales.

Load-bearing premise

The observed performance differences between SFT, OPD, and RL are driven primarily by differences in state distribution rather than by variations in training hyperparameters or optimization dynamics.

What would settle it

Re-run the GSM8K experiments while forcing SFT, RL, and OPD to sample from identical state distributions and check whether the performance and retention gaps disappear.

Original abstract

Large language model post-training methods such as supervised fine-tuning (SFT), reinforcement learning (RL), and distillation are often analyzed through their loss functions: maximum likelihood, policy gradients, forward KL, reverse KL, or related objective-level variants. We study a complementary factor: the state distribution on which supervision is applied. For an autoregressive policy, a state is a prompt plus generated prefix. SFT trains on fixed dataset states, while RL and on-policy distillation (OPD) train on states induced by the current learner. We formalize post-training as state-distribution shaping and run a controlled small-scale study using Qwen3-0.6B-Base on GSM8K, with TruthfulQA and MMLU as retention evaluations. Our results show three phenomena. First, a mild SFT run improves GSM8K with little forgetting, while a stress SFT run causes substantial retention loss. Second, OPD from a degraded SFT teacher surpasses that teacher on GSM8K, TruthfulQA, and MMLU, despite using the teacher as its only supervision source. Third, a lightweight on-policy RL run improves GSM8K while preserving retention. These results support a state-centric view of post-training: the source and locality of training states can be as important as the form of the supervision signal.
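
One compact way to write the state-centric view is given below. The notation is assumed here for illustration and is not taken from the paper: $x$ is a prompt, $y_{<t}$ a generated prefix, $s_t = (x, y_{<t})$ a state, $d$ the distribution from which training states are drawn, $\pi_\theta$ the learner, $\pi_T$ a teacher, and $D$ an unspecified divergence (the abstract mentions both forward and reverse KL as options).

```latex
\mathcal{L}(\theta)
  = \mathbb{E}_{s \sim d}\big[\, \ell_\theta(s) \,\big],
\qquad
d =
\begin{cases}
  d_{\mathcal{D}}   & \text{SFT: prefixes of fixed dataset completions,} \\
  d_{\pi_\theta}    & \text{RL and OPD: prefixes sampled from the current learner,}
\end{cases}
\qquad
\ell_\theta(s_t) =
\begin{cases}
  -\log \pi_\theta(y_t \mid s_t)
      & \text{SFT, with } y_t \text{ the dataset token,} \\
  D\big(\pi_T(\cdot \mid s_t),\, \pi_\theta(\cdot \mid s_t)\big)
      & \text{distillation from a teacher } \pi_T .
\end{cases}
```

Under this reading, the choice of $\ell_\theta$ is the usual loss-function axis, while the choice of $d$ is the state-distribution axis the abstract emphasizes. RL would replace $\ell_\theta$ with a reward-weighted term, which is omitted here because the source does not specify it.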

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper argues that post-training of large language models is better understood as shaping the distribution of states (prompt plus generated prefix) on which supervision is applied, rather than focusing primarily on the form of the loss function. It supports this view with a controlled small-scale empirical study using Qwen3-0.6B-Base on GSM8K (with TruthfulQA and MMLU for retention), reporting that mild SFT improves task performance with little forgetting, stress SFT causes substantial retention loss, on-policy distillation (OPD) from a degraded teacher can surpass the teacher, and lightweight on-policy RL improves GSM8K while preserving retention. The central claim is that the source and locality of training states can be as important as the supervision signal.

Significance. If the empirical phenomena hold under tighter controls, the work offers a useful complementary perspective that could guide more effective post-training by emphasizing state distribution matching or shaping. The small-scale study provides concrete, falsifiable observations (mild vs. stress SFT, OPD surpassing teacher, RL retention) that are worth testing at larger scales. No machine-checked proofs or parameter-free derivations are present, but the directional results constitute an initial empirical test of the state-centric framing.

major comments (2)
  1. The manuscript describes a 'controlled small-scale study' comparing SFT, OPD, and RL but provides no explicit details on equalization of optimization hyperparameters (learning rate, batch size, total gradient steps, optimizer, or loss scaling) across regimes. In a 0.6B model on GSM8K, unmatched dynamics could drive the reported gaps (e.g., OPD surpassing teacher or RL retention) independently of state locality; without matching or ablations, the attribution to state distribution remains unisolated.
  2. The claim that performance differences are driven by state distribution (fixed dataset vs. learner-induced states) would be strengthened by an ablation that applies the same loss but with off-policy or mismatched states; the current results do not rule out that the observed advantages of OPD and RL stem from on-policy sampling mechanics rather than the state distribution per se.
minor comments (1)
  1. The abstract and results section would benefit from a brief table or quantitative summary of the exact performance deltas and retention metrics to make the directional claims easier to evaluate without full text access.
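
The hyperparameter-parity concern in major comment 1 can be checked mechanically. The sketch below is one hedged way to do so: the key names in `SHARED_KEYS` mirror the items the referee lists, but the function name and every numeric value in `configs` are placeholders invented for illustration, not settings reported by the paper.

```python
from typing import Any, Dict, List

# Hyperparameters the referee asks to be equalized across regimes.
SHARED_KEYS = ["optimizer", "learning_rate", "batch_size",
               "gradient_steps", "loss_scale"]


def mismatched_hyperparameters(
    configs: Dict[str, Dict[str, Any]], keys: List[str]
) -> Dict[str, Dict[str, Any]]:
    """Return, for each key whose value differs across regimes, the per-regime values."""
    mismatches: Dict[str, Dict[str, Any]] = {}
    for key in keys:
        values = {name: cfg.get(key) for name, cfg in configs.items()}
        if len(set(values.values())) > 1:
            mismatches[key] = values
    return mismatches


# Placeholder values only; they are NOT the paper's settings.
base = {"optimizer": "adamw", "learning_rate": 1e-5, "batch_size": 32,
        "gradient_steps": 1000, "loss_scale": 1.0}
configs = {
    "sft": dict(base),
    "opd": dict(base),
    "rl": {**base, "gradient_steps": 500},  # deliberately mismatched example
}
print(mismatched_hyperparameters(configs, SHARED_KEYS))
```

An empty result would indicate that the listed settings match across regimes; any non-empty entry is a confound to report or ablate before attributing a gap to state locality.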

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's detailed review and constructive suggestions for improving our manuscript on the state distribution view of post-training. We address each major comment below, providing clarifications and outlining planned revisions to enhance the rigor of our empirical study.

  1. Referee: The manuscript describes a 'controlled small-scale study' comparing SFT, OPD, and RL but provides no explicit details on equalization of optimization hyperparameters (learning rate, batch size, total gradient steps, optimizer, or loss scaling) across regimes. In a 0.6B model on GSM8K, unmatched dynamics could drive the reported gaps (e.g., OPD surpassing teacher or RL retention) independently of state locality; without matching or ablations, the attribution to state distribution remains unisolated.

    Authors: We agree that providing explicit details on hyperparameter settings is essential to support the claim that differences arise from state distributions rather than optimization dynamics. In our controlled study, we matched the optimizer (AdamW with the same beta parameters), learning rate (with identical warmup and decay schedules), batch size, and total number of gradient steps across the SFT, OPD, and RL regimes. Loss scaling was also kept consistent where applicable. We regret that these details were not included in the initial submission and will add a comprehensive table or subsection detailing all hyperparameters for each method in the revised manuscript.

    revision: yes

  2. Referee: The claim that performance differences are driven by state distribution (fixed dataset vs. learner-induced states) would be strengthened by an ablation that applies the same loss but with off-policy or mismatched states; the current results do not rule out that the observed advantages of OPD and RL stem from on-policy sampling mechanics rather than the state distribution per se.

    Authors: This is a valid point for further isolating the role of state distributions. Our design already varies the state source while using method-appropriate losses: SFT applies the standard next-token prediction loss on a fixed dataset of states, whereas OPD and RL apply their respective objectives on states generated on-policy by the learner. The surprising result that OPD from a degraded teacher exceeds the teacher's performance on multiple benchmarks is hard to explain without invoking the benefit of training on the learner's own state distribution. That said, we acknowledge that an additional experiment applying, for example, the SFT loss on on-policy states or a distillation loss on off-policy states would provide a cleaner isolation. We will include a discussion of this potential ablation and its interpretation in the revised version; performing the full set of new runs may be reported as future work given the scope of the current small-scale study.

    revision: partial
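
The ablation discussed in this exchange amounts to crossing the loss with the state source. The sketch below lays out that grid under assumed labels: the strings in `LOSSES` and `STATE_SOURCES`, and the mapping of the three studied regimes onto cells, are a reading of the text above rather than a design taken from the paper.

```python
from itertools import product
from typing import Dict, List, Tuple

# Assumed labels for the two axes discussed by the referee and authors.
LOSSES = ["next_token_nll", "distillation", "policy_gradient"]
STATE_SOURCES = ["fixed_dataset", "on_policy"]

# Cells that correspond to the regimes described as already studied.
STUDIED: Dict[Tuple[str, str], str] = {
    ("next_token_nll", "fixed_dataset"): "SFT",
    ("distillation", "on_policy"): "OPD",
    ("policy_gradient", "on_policy"): "RL",
}


def ablation_grid() -> List[Dict[str, str]]:
    """Enumerate every (loss, state source) pair and mark the unstudied cells."""
    rows = []
    for loss, source in product(LOSSES, STATE_SOURCES):
        label = STUDIED.get((loss, source), "proposed ablation")
        rows.append({"loss": loss, "state_source": source, "label": label})
    return rows


for row in ablation_grid():
    print(f"{row['loss']:<16} {row['state_source']:<14} {row['label']}")
```

The off-diagonal cells, such as the next-token loss on on-policy states or the distillation loss on fixed dataset states, hold the loss constant while changing only the state source, which is the isolation the referee asks for. How targets would be obtained for the next-token loss on learner-sampled prefixes is left open here, since the source does not say.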

Circularity Check

0 steps flagged

Empirical comparisons of training regimes with no derivation reducing to self-inputs


The paper conducts a controlled small-scale empirical study on Qwen3-0.6B-Base using GSM8K with TruthfulQA and MMLU retention checks. It compares outcomes across fixed-dataset SFT, on-policy RL, and on-policy distillation, attributing differences to state locality. No equations, fitted parameters, or uniqueness theorems are invoked that reduce the reported phenomena to inputs defined by the paper itself. The central claims rest on observed performance gaps from distinct training runs rather than any self-definitional construction or self-citation chain, rendering the analysis self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work rests on standard machine-learning assumptions that the chosen benchmarks are valid proxies for capability and retention and that the small-scale controlled runs isolate the intended variable; no new free parameters, axioms, or invented entities are introduced in the abstract.

pith-pipeline@v0.9.0 · 5772 in / 1263 out tokens · 56492 ms · 2026-05-22T07:23:01.349604+00:00 · methodology


