Recognition: 2 theorem links
Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning
Pith reviewed 2026-05-10 18:24 UTC · model grok-4.3
The pith
LLM post-training is best understood as structured intervention on model behavior, divided into off-policy and on-policy regimes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Post-training is best understood as structured intervention on model behavior. The field is organized first by trajectory provenance into off-policy learning on externally supplied trajectories and on-policy learning on learner-generated rollouts, then interpreted through the roles of effective support expansion, policy reshaping, and behavioral consolidation to diagnose bottlenecks and guide stage composition.
What carries the argument
Trajectory provenance, which distinguishes off-policy learning on external trajectories from on-policy learning on learner-generated rollouts, together with the three roles of effective support expansion, policy reshaping, and behavioral consolidation.
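To make the provenance split concrete, here is a minimal toy sketch (ours, not the paper's) contrasting the two regimes on the same softmax policy: off-policy, a gradient step on the negative log-likelihood of externally supplied actions (SFT-style); on-policy, a REINFORCE step on rollouts the policy samples itself. The four-action task, the reward, and the demo data are illustrative assumptions.

```python
# Toy illustration (not from the paper) of trajectory provenance:
# the same softmax policy updated off-policy (NLL on external demos,
# SFT-style) versus on-policy (REINFORCE on its own sampled rollouts).
import numpy as np

rng = np.random.default_rng(0)

def probs(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def offpolicy_step(z, external_actions, lr=0.5):
    # Gradient of mean NLL over externally supplied actions: the update
    # direction is fixed by the external data, not by what the policy samples.
    p = probs(z)
    empirical = np.bincount(external_actions, minlength=len(z)) / len(external_actions)
    return z - lr * (p - empirical)

def onpolicy_step(z, reward_fn, n=256, lr=0.5):
    # REINFORCE: sample actions from the *current* policy, reweight by reward.
    p = probs(z)
    a = rng.choice(len(z), size=n, p=p)
    adv = reward_fn(a) - reward_fn(a).mean()  # mean baseline for variance reduction
    grad = np.zeros_like(z)
    for ai, advi in zip(a, adv):
        g = -p.copy()
        g[ai] += 1.0                          # gradient of log pi(ai) w.r.t. logits
        grad += advi * g
    return z + lr * grad / n

reward = lambda a: (a == 3).astype(float)     # action 3 is the "good" behavior
demos = np.array([2, 2, 3, 2])                # external demos mostly show action 2

logits = np.zeros(4)
for _ in range(50):
    logits = offpolicy_step(logits, demos)
print("off-policy:", probs(logits).round(2))  # pulled toward the demo distribution

logits = np.zeros(4)
for _ in range(50):
    logits = onpolicy_step(logits, reward)
print("on-policy:", probs(logits).round(2))   # concentrates on the rewarded action
```

The off-policy update can only pull the policy toward whatever the external data exhibits, while the on-policy update concentrates mass on whatever the policy's own rollouts get rewarded for; that asymmetry is what the roles of support expansion and policy reshaping are meant to diagnose.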
If this is right
- SFT may serve either support expansion or policy reshaping depending on the trajectories used.
- Preference optimization functions mainly as off-policy reshaping, though online variants shift toward on-policy states.
- On-policy RL primarily improves behavior on learner-generated states but can also make hard-to-reach paths reachable with stronger guidance.
- Distillation is better interpreted as behavioral consolidation across model transitions than as pure compression.
- Hybrid pipelines succeed when designed as coordinated multi-stage compositions rather than sequences of independent objectives.
Where Pith is reading between the lines
- This organization suggests that future pipelines could dynamically select off-policy or on-policy components based on detected support gaps rather than fixed schedules.
- The framework may extend naturally to non-language sequential tasks where behavior consolidation across model sizes is a bottleneck.
- Controlled ablations that isolate support expansion from reshaping in a single training run would test whether the two roles are truly separable in practice.
Load-bearing premise
The proposed roles of effective support expansion, policy reshaping, and behavioral consolidation provide a more useful diagnostic lens for composing post-training stages than categorizations by method labels or objectives.
What would settle it
An experiment that applies the same post-training pipeline with and without guidance from this trajectory-and-role framework and finds no measurable difference in final model capability or stage efficiency would falsify the claim.
Original abstract
Post-training has become central to turning pretrained large language models (LLMs) into aligned, capable, and deployable systems. Recent progress spans supervised fine-tuning (SFT), preference optimization, reinforcement learning (RL), process supervision, verifier-guided methods, distillation, and multi-stage pipelines. Yet these methods are often discussed in fragmented ways, organized by labels or objectives rather than by the behavioral bottlenecks they address. This survey argues that LLM post-training is best understood as structured intervention on model behavior. We organize the field first by trajectory provenance, which defines two primary regimes: off-policy learning on externally supplied trajectories and on-policy learning on learner-generated rollouts. We then interpret methods through two recurring roles -- effective support expansion, which makes useful behaviors more reachable, and policy reshaping, which improves behavior within already reachable regions -- together with a complementary systems-level role, behavioral consolidation, which preserves, transfers, and amortizes useful behavior across stages and model transitions. Under this view, SFT may serve either support expansion or policy reshaping; preference optimization is usually off-policy reshaping, though online variants move closer to learner-generated states. On-policy RL often improves behavior on learner-generated states, but stronger guidance can also make hard-to-reach reasoning paths reachable. Distillation is often better understood as consolidation rather than only compression, and hybrid pipelines emerge as coordinated multi-stage compositions. Overall, the framework helps diagnose post-training bottlenecks and reason about stage composition, suggesting that progress increasingly depends on coordinated systems design rather than any single dominant objective.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript is a survey of LLM post-training methods (SFT, preference optimization, RL, process supervision, distillation, and multi-stage pipelines). It organizes the field first by trajectory provenance into off-policy learning on externally supplied trajectories and on-policy learning on learner-generated rollouts. Methods are interpreted through the recurring roles of effective support expansion (making useful behaviors reachable), policy reshaping (improving behavior in reachable regions), and behavioral consolidation (preserving and amortizing behavior across stages). The central claim is that this view provides a sharper diagnostic lens for bottlenecks and stage composition than categorizations by method labels or objectives, with online variants and hybrids creating movement between regimes.
Significance. If the distinctions prove useful in practice, the framework could help researchers design more coordinated post-training pipelines by focusing on behavioral interventions rather than isolated objectives. As a purely conceptual synthesis without new experiments, formal proofs, or parameter-free derivations, its value is interpretive and depends on whether the provenance axis and three roles yield clearer reasoning about hybrids and multi-stage systems than existing taxonomies.
Major comments (1)
- [Abstract] The claim that trajectory provenance defines 'two primary regimes' and serves as the primary axis is load-bearing for the central claim of a 'sharper diagnostic lens,' yet the text only notes in passing that 'online variants and stronger guidance create movement between regimes' and lists examples such as rejection sampling from the current model, process supervision on self-rollouts, and online DPO with model-generated pairs. No explicit classification rules, decision criteria, or boundary conditions are supplied for these hybrids, leaving the partition potentially fuzzy and the framework open to post-hoc labeling rather than reliable guidance for stage composition.
Minor comments (1)
- A summary table explicitly mapping representative methods (SFT, DPO, PPO, distillation, verifier-guided RL) to the off/on-policy regimes and the three roles would improve readability and allow readers to test the framework against known pipelines.
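For illustration, here is one way the requested mapping might be encoded and rendered. The regime and role assignments below are our reading of the abstract's own claims, not a table supplied by the authors.

```python
# A sketch of the summary table the referee requests; per-method labels are
# inferred from the abstract (e.g., "SFT may serve either support expansion
# or policy reshaping"), not taken from the paper itself.
methods = {
    # method:                 (regime,       roles)
    "SFT (external demos)": ("off-policy", ["support expansion", "policy reshaping"]),
    "DPO":                  ("off-policy", ["policy reshaping"]),
    "online DPO":           ("hybrid",     ["policy reshaping"]),
    "PPO / on-policy RL":   ("on-policy",  ["policy reshaping", "support expansion"]),
    "verifier-guided RL":   ("on-policy",  ["support expansion"]),
    "distillation":         ("off-policy", ["behavioral consolidation"]),
}

print(f"{'method':<24}{'regime':<12}roles")
for name, (regime, roles) in methods.items():
    print(f"{name:<24}{regime:<12}{', '.join(roles)}")
```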
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive review. We appreciate the recognition that the provenance-based organization and the three behavioral roles offer an interpretive lens for post-training pipelines. We address the major comment on the clarity of regime boundaries below and commit to revisions that strengthen the framework's guidance for hybrid methods.
Point-by-point responses
- Referee: [Abstract] The claim that trajectory provenance defines 'two primary regimes' and serves as the primary axis is load-bearing for the central claim of a 'sharper diagnostic lens,' yet the text only notes in passing that 'online variants and stronger guidance create movement between regimes' and lists examples such as rejection sampling from the current model, process supervision on self-rollouts, and online DPO with model-generated pairs. No explicit classification rules, decision criteria, or boundary conditions are supplied for these hybrids, leaving the partition potentially fuzzy and the framework open to post-hoc labeling rather than reliable guidance for stage composition.
Authors: We agree that the two-regime distinction is central and that the current treatment of hybrids is illustrative rather than prescriptive. The manuscript defines the regimes by trajectory provenance (externally supplied vs. learner-generated) and notes transitions via examples, but does not supply explicit decision criteria or boundary conditions. In the revision we will add a dedicated subsection in the framework overview that formalizes classification rules: a method is on-policy when its training trajectories are sampled from the learner's current policy; off-policy when trajectories originate from external sources, prior model versions, or fixed datasets; and hybrid when a pipeline mixes both sources or transitions between them (e.g., initial off-policy SFT followed by on-policy RL). We will include a decision table that maps representative methods and pipelines to these categories, together with guidance on how to diagnose bottlenecks under each regime. This addition will make the diagnostic lens more reliable for stage composition while preserving the survey's conceptual character.
Revision: yes
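As a reading aid, a minimal sketch of the classification rule the authors promise above, assuming exactly the rebuttal's wording; the Source labels and function names are ours, not the paper's.

```python
# Our encoding of the rebuttal's promised rule: a stage is on-policy iff its
# trajectories are sampled from the learner's current policy; off-policy if
# they come from external sources, prior checkpoints, or fixed datasets;
# a pipeline mixing both regimes is hybrid. Labels are hypothetical.
from typing import Literal

Source = Literal["current_policy", "external", "prior_checkpoint", "fixed_dataset"]

def classify_stage(source: Source) -> str:
    return "on-policy" if source == "current_policy" else "off-policy"

def classify_pipeline(stage_sources: list[Source]) -> str:
    regimes = {classify_stage(s) for s in stage_sources}
    return regimes.pop() if len(regimes) == 1 else "hybrid"

# e.g., SFT on curated data followed by RL on the model's own rollouts:
print(classify_pipeline(["fixed_dataset", "current_policy"]))  # -> hybrid
```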
Circularity Check
Conceptual survey with no mathematical derivations or self-referential reductions
Full rationale
The paper is a survey proposing an organizational lens for LLM post-training based on trajectory provenance (off-policy vs. on-policy) and roles such as support expansion, policy reshaping, and behavioral consolidation. No equations, fitted parameters, predictions, or derivation chains appear in the abstract or described content. The framework is interpretive, mapping existing methods onto the proposed categories without reducing any claim to a self-definition, fitted input renamed as prediction, or load-bearing self-citation. It remains self-contained as a conceptual taxonomy drawing distinctions from prior literature.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: Post-training methods can be meaningfully categorized by whether trajectories are externally supplied or generated by the learner.
- Ad hoc to paper: Methods serve one or more of the roles of effective support expansion, policy reshaping, or behavioral consolidation.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Unclear: relation between the paper passage and the cited Recognition theorem.
"We organize the field first by trajectory provenance, which defines two primary regimes: off-policy learning on externally supplied trajectories and on-policy learning on learner-generated rollouts."
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Unclear: relation between the paper passage and the cited Recognition theorem.
"effective support expansion, which makes useful behaviors more reachable, and policy reshaping"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.