Recognition: 2 theorem links
Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning
Pith reviewed 2026-05-10 18:24 UTC · model grok-4.3
The pith
LLM post-training is best understood as structured intervention on model behavior, divided into off-policy and on-policy regimes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Post-training is best understood as structured intervention on model behavior. The field is organized first by trajectory provenance into off-policy learning on externally supplied trajectories and on-policy learning on learner-generated rollouts, then interpreted through the roles of effective support expansion, policy reshaping, and behavioral consolidation to diagnose bottlenecks and guide stage composition.
What carries the argument
Trajectory provenance, which distinguishes off-policy learning on external trajectories from on-policy learning on learner-generated rollouts, together with the three roles of effective support expansion, policy reshaping, and behavioral consolidation.
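To make the provenance split concrete, here is a minimal toy sketch (ours, not the paper's) contrasting the two regimes on the same softmax policy: off-policy, a gradient step on the negative log-likelihood of externally supplied actions (SFT-style); on-policy, a REINFORCE step on rollouts the policy samples itself. The four-action task, the reward, and the demo data are illustrative assumptions.

```python
# Toy illustration (not from the paper) of trajectory provenance:
# the same softmax policy updated off-policy (NLL on external demos,
# SFT-style) versus on-policy (REINFORCE on its own sampled rollouts).
import numpy as np

rng = np.random.default_rng(0)

def probs(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def offpolicy_step(z, external_actions, lr=0.5):
    # Gradient of mean NLL over externally supplied actions: the update
    # direction is fixed by the external data, not by what the policy samples.
    p = probs(z)
    empirical = np.bincount(external_actions, minlength=len(z)) / len(external_actions)
    return z - lr * (p - empirical)

def onpolicy_step(z, reward_fn, n=256, lr=0.5):
    # REINFORCE: sample actions from the *current* policy, reweight by reward.
    p = probs(z)
    a = rng.choice(len(z), size=n, p=p)
    adv = reward_fn(a) - reward_fn(a).mean()  # mean baseline for variance reduction
    grad = np.zeros_like(z)
    for ai, advi in zip(a, adv):
        g = -p.copy()
        g[ai] += 1.0                          # gradient of log pi(ai) w.r.t. logits
        grad += advi * g
    return z + lr * grad / n

reward = lambda a: (a == 3).astype(float)     # action 3 is the "good" behavior
demos = np.array([2, 2, 3, 2])                # external demos mostly show action 2

logits = np.zeros(4)
for _ in range(50):
    logits = offpolicy_step(logits, demos)
print("off-policy:", probs(logits).round(2))  # pulled toward the demo distribution

logits = np.zeros(4)
for _ in range(50):
    logits = onpolicy_step(logits, reward)
print("on-policy:", probs(logits).round(2))   # concentrates on the rewarded action
```

The off-policy update can only pull the policy toward whatever the external data exhibits, while the on-policy update concentrates mass on whatever the policy's own rollouts get rewarded for; that asymmetry is what the roles of support expansion and policy reshaping are meant to diagnose.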
If this is right
- SFT may serve either support expansion or policy reshaping depending on the trajectories used.
- Preference optimization functions mainly as off-policy reshaping, though online variants shift toward on-policy states.
- On-policy RL primarily improves behavior on learner-generated states but can also make hard-to-reach paths reachable with stronger guidance.
- Distillation is better interpreted as behavioral consolidation across model transitions than as pure compression.
- Hybrid pipelines succeed when designed as coordinated multi-stage compositions rather than sequences of independent objectives.
Where Pith is reading between the lines
- This organization suggests that future pipelines could dynamically select off-policy or on-policy components based on detected support gaps rather than fixed schedules.
- The framework may extend naturally to non-language sequential tasks where behavior consolidation across model sizes is a bottleneck.
- Controlled ablations that isolate support expansion from reshaping in a single training run would test whether the two roles are truly separable in practice.
Load-bearing premise
The proposed roles of effective support expansion, policy reshaping, and behavioral consolidation provide a more useful diagnostic lens for composing post-training stages than categorizations by method labels or objectives.
What would settle it
An experiment that applies the same post-training pipeline with and without guidance from this trajectory-and-role framework and finds no measurable difference in final model capability or stage efficiency would falsify the claim.
Original abstract
Post-training has become central to turning pretrained large language models (LLMs) into aligned, capable, and deployable systems. Recent progress spans supervised fine-tuning (SFT), preference optimization, reinforcement learning (RL), process supervision, verifier-guided methods, distillation, and multi-stage pipelines. Yet these methods are often discussed in fragmented ways, organized by labels or objectives rather than by the behavioral bottlenecks they address. This survey argues that LLM post-training is best understood as structured intervention on model behavior. We organize the field first by trajectory provenance, which defines two primary regimes: off-policy learning on externally supplied trajectories and on-policy learning on learner-generated rollouts. We then interpret methods through two recurring roles -- effective support expansion, which makes useful behaviors more reachable, and policy reshaping, which improves behavior within already reachable regions -- together with a complementary systems-level role, behavioral consolidation, which preserves, transfers, and amortizes useful behavior across stages and model transitions. Under this view, SFT may serve either support expansion or policy reshaping; preference optimization is usually off-policy reshaping, though online variants move closer to learner-generated states. On-policy RL often improves behavior on learner-generated states, but stronger guidance can also make hard-to-reach reasoning paths reachable. Distillation is often better understood as consolidation rather than only compression, and hybrid pipelines emerge as coordinated multi-stage compositions. Overall, the framework helps diagnose post-training bottlenecks and reason about stage composition, suggesting that progress increasingly depends on coordinated systems design rather than any single dominant objective.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript is a survey of LLM post-training methods (SFT, preference optimization, RL, process supervision, distillation, and multi-stage pipelines). It organizes the field first by trajectory provenance into off-policy learning on externally supplied trajectories and on-policy learning on learner-generated rollouts. Methods are interpreted through the recurring roles of effective support expansion (making useful behaviors reachable), policy reshaping (improving behavior in reachable regions), and behavioral consolidation (preserving and amortizing behavior across stages). The central claim is that this view provides a sharper diagnostic lens for bottlenecks and stage composition than categorizations by method labels or objectives, with online variants and hybrids creating movement between regimes.
Significance. If the distinctions prove useful in practice, the framework could help researchers design more coordinated post-training pipelines by focusing on behavioral interventions rather than isolated objectives. As a purely conceptual synthesis without new experiments, formal proofs, or parameter-free derivations, its value is interpretive and depends on whether the provenance axis and three roles yield clearer reasoning about hybrids and multi-stage systems than existing taxonomies.
Major comments (1)
- [Abstract] The claim that trajectory provenance defines 'two primary regimes' and serves as the primary axis is load-bearing for the central claim of a 'sharper diagnostic lens,' yet the text only notes in passing that 'online variants and stronger guidance create movement between regimes' and lists examples such as rejection sampling from the current model, process supervision on self-rollouts, and online DPO with model-generated pairs. No explicit classification rules, decision criteria, or boundary conditions are supplied for these hybrids, leaving the partition potentially fuzzy and the framework open to post-hoc labeling rather than reliable guidance for stage composition.
Minor comments (1)
- A summary table explicitly mapping representative methods (SFT, DPO, PPO, distillation, verifier-guided RL) to the off/on-policy regimes and the three roles would improve readability and allow readers to test the framework against known pipelines.
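For illustration, here is one way the requested mapping might be encoded and rendered. The regime and role assignments below are our reading of the abstract's own claims, not a table supplied by the authors.

```python
# A sketch of the summary table the referee requests; per-method labels are
# inferred from the abstract (e.g., "SFT may serve either support expansion
# or policy reshaping"), not taken from the paper itself.
methods = {
    # method:                 (regime,       roles)
    "SFT (external demos)": ("off-policy", ["support expansion", "policy reshaping"]),
    "DPO":                  ("off-policy", ["policy reshaping"]),
    "online DPO":           ("hybrid",     ["policy reshaping"]),
    "PPO / on-policy RL":   ("on-policy",  ["policy reshaping", "support expansion"]),
    "verifier-guided RL":   ("on-policy",  ["support expansion"]),
    "distillation":         ("off-policy", ["behavioral consolidation"]),
}

print(f"{'method':<24}{'regime':<12}roles")
for name, (regime, roles) in methods.items():
    print(f"{name:<24}{regime:<12}{', '.join(roles)}")
```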
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive review. We appreciate the recognition that the provenance-based organization and the three behavioral roles offer an interpretive lens for post-training pipelines. We address the major comment on the clarity of regime boundaries below and commit to revisions that strengthen the framework's guidance for hybrid methods.
Point-by-point responses
- Referee: [Abstract] The claim that trajectory provenance defines 'two primary regimes' and serves as the primary axis is load-bearing for the central claim of a 'sharper diagnostic lens,' yet the text only notes in passing that 'online variants and stronger guidance create movement between regimes' and lists examples such as rejection sampling from the current model, process supervision on self-rollouts, and online DPO with model-generated pairs. No explicit classification rules, decision criteria, or boundary conditions are supplied for these hybrids, leaving the partition potentially fuzzy and the framework open to post-hoc labeling rather than reliable guidance for stage composition.
Authors: We agree that the two-regime distinction is central and that the current treatment of hybrids is illustrative rather than prescriptive. The manuscript defines the regimes by trajectory provenance (externally supplied vs. learner-generated) and notes transitions via examples, but does not supply explicit decision criteria or boundary conditions. In the revision we will add a dedicated subsection in the framework overview that formalizes classification rules: a method is on-policy when its training trajectories are sampled from the learner's current policy; off-policy when trajectories originate from external sources, prior model versions, or fixed datasets; and hybrid when a pipeline mixes both sources or transitions between them (e.g., initial off-policy SFT followed by on-policy RL). We will include a decision table that maps representative methods and pipelines to these categories, together with guidance on how to diagnose bottlenecks under each regime. This addition will make the diagnostic lens more reliable for stage composition while preserving the survey's conceptual character.
Revision: yes
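As a reading aid, a minimal sketch of the classification rule the authors promise above, assuming exactly the rebuttal's wording; the Source labels and function names are ours, not the paper's.

```python
# Our encoding of the rebuttal's promised rule: a stage is on-policy iff its
# trajectories are sampled from the learner's current policy; off-policy if
# they come from external sources, prior checkpoints, or fixed datasets;
# a pipeline mixing both regimes is hybrid. Labels are hypothetical.
from typing import Literal

Source = Literal["current_policy", "external", "prior_checkpoint", "fixed_dataset"]

def classify_stage(source: Source) -> str:
    return "on-policy" if source == "current_policy" else "off-policy"

def classify_pipeline(stage_sources: list[Source]) -> str:
    regimes = {classify_stage(s) for s in stage_sources}
    return regimes.pop() if len(regimes) == 1 else "hybrid"

# e.g., SFT on curated data followed by RL on the model's own rollouts:
print(classify_pipeline(["fixed_dataset", "current_policy"]))  # -> hybrid
```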
Circularity Check
Conceptual survey with no mathematical derivations or self-referential reductions
Full rationale
The paper is a survey proposing an organizational lens for LLM post-training based on trajectory provenance (off-policy vs. on-policy) and roles such as support expansion, policy reshaping, and behavioral consolidation. No equations, fitted parameters, predictions, or derivation chains appear in the abstract or described content. The framework is interpretive, mapping existing methods onto the proposed categories without reducing any claim to a self-definition, fitted input renamed as prediction, or load-bearing self-citation. It remains self-contained as a conceptual taxonomy drawing distinctions from prior literature.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: Post-training methods can be meaningfully categorized by whether trajectories are externally supplied or generated by the learner.
- Ad hoc to paper: Methods serve one or more of the roles of effective support expansion, policy reshaping, or behavioral consolidation.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Unclear: relation between the paper passage and the cited Recognition theorem.
"We organize the field first by trajectory provenance, which defines two primary regimes: off-policy learning on externally supplied trajectories and on-policy learning on learner-generated rollouts."
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Unclear: relation between the paper passage and the cited Recognition theorem.
"effective support expansion, which makes useful behaviors more reachable, and policy reshaping"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.