Self-Improving Pretraining: using post-trained models to pretrain better models

Danwei Li; Ellen Xiaoqing Tan; Ilia Kulikov; Jack Lanchantin; Jason Weston; Jing Xu; Olga Golovneva; Ping Yu; Sainbayar Sukhbaatar; Shehzaad Dhuliawala

arxiv: 2601.21343 · v3 · submitted 2026-01-29 · 💻 cs.CL · cs.AI · cs.LG


Ellen Xiaoqing Tan, Jack Lanchantin, Shehzaad Dhuliawala, Danwei Li, Thao Nguyen, Jing Xu, Ping Yu, Ilia Kulikov, Sainbayar Sukhbaatar, Jason Weston, Xian Li, Olga Golovneva


Pith reviewed 2026-05-16 10:05 UTC · model grok-4.3


keywords: self-improving pretraining · post-trained models · data rewriting · rollout judgment · LLM pretraining · reinforcement signals · model quality


The pith

A post-trained model can rewrite pretraining data and judge rollouts to embed safety, factuality, and reasoning earlier in LLM training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models are classically pretrained on raw text before post-training adds safety, factuality, reasoning, and other behaviors. This staged separation means early training lacks signals for desirable traits that later stages must compensate for. The paper proposes using an existing strong post-trained model both to rewrite the pretraining corpus and to judge policy rollouts, thereby injecting reinforcement signals into the earlier phases. Experiments demonstrate resulting gains in overall quality, safety, factuality, and reasoning. A sympathetic reader would care because the approach questions whether the standard pretrain-then-posttrain pipeline is necessary and points toward more integrated training.

Core claim

By utilizing an existing strong, post-trained model to both rewrite pretraining data and to judge policy model rollouts, reinforcement can be incorporated earlier in training, leading to strong gains in quality, safety, factuality and reasoning.

What carries the argument

The mechanism of having a post-trained model rewrite raw pretraining data and evaluate rollouts to move post-training behaviors into the pretraining phase.
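As a rough illustration of the two roles described above, the sketch below treats the post-trained model as two plain Python callables: one that rewrites documents and one that scores rollouts. Every name here (`rewrite_corpus`, `score_rollouts`, `toy_rewriter`, `toy_judge`) is hypothetical and not taken from the paper, whose abstract does not describe the procedure at this level of detail. The toy functions only stand in for calls to a real model.

```python
from typing import Callable, List, Tuple

Rewriter = Callable[[str], str]      # hypothetical: post-trained model used as rewriter
Judge = Callable[[str, str], float]  # hypothetical: (prefix, rollout) -> score


def rewrite_corpus(docs: List[str], rewriter: Rewriter) -> List[str]:
    """Pass every raw document through the rewriter."""
    return [rewriter(doc) for doc in docs]


def score_rollouts(prefix: str, rollouts: List[str], judge: Judge) -> List[Tuple[str, float]]:
    """Attach a judge score to each policy rollout and sort best first."""
    scored = [(rollout, judge(prefix, rollout)) for rollout in rollouts]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy stand-ins so the sketch runs without any model (assumptions, not the paper's).
def toy_rewriter(doc: str) -> str:
    return " ".join(doc.split())  # only collapses whitespace


def toy_judge(prefix: str, rollout: str) -> float:
    # Zero for a flagged rollout, otherwise the word count.
    return 0.0 if "unsafe" in rollout else float(len(rollout.split()))


clean_docs = rewrite_corpus(["raw   text  here"], toy_rewriter)
ranked = score_rollouts("Once upon", ["a time there was", "unsafe content"], toy_judge)
```

The judge scores could then serve as a reward signal for the policy model. How they would be turned into a training update is not specified in the material summarized here.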

Load-bearing premise

That an existing post-trained model can reliably rewrite pretraining data and judge rollouts without introducing its own biases or errors that propagate into the new model.

What would settle it

Train two models on the same base data—one with the rewriting and judging process and one without—then compare their scores on safety, factuality, and reasoning benchmarks; equal or worse performance in the rewritten version would falsify the central claim.
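The comparison described above can be sketched as a per-benchmark check. This is a minimal illustration that assumes higher scores are better on every benchmark. The function name `claim_survives`, the `margin` parameter, and all numbers are placeholders invented for the example, not results or code from the paper.

```python
from typing import Dict


def claim_survives(
    treated: Dict[str, float],
    control: Dict[str, float],
    margin: float = 0.0,
) -> Dict[str, bool]:
    """Per benchmark, True if the treated model beats the control by more than `margin`."""
    shared = treated.keys() & control.keys()
    return {name: treated[name] - control[name] > margin for name in sorted(shared)}


# Placeholder numbers for illustration only; these are not results from the paper.
control_scores = {"safety": 0.50, "factuality": 0.50, "reasoning": 0.50}
treated_scores = {"safety": 0.60, "factuality": 0.50, "reasoning": 0.40}

verdict = claim_survives(treated_scores, control_scores)
```

A benchmark marked `False` is one where the rewritten-and-judged model is equal to or worse than the control, which is the outcome the test above treats as evidence against the claim. A real comparison would also need multiple seeds and a significance test, which this sketch omits.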

Original abstract

Large language models are classically trained in stages: pretraining on raw text followed by post-training for instruction following and reasoning. However, this separation creates a fundamental limitation: many desirable behaviors such as safety, factuality, overall generation quality, and reasoning ability are only added at a late stage, even though the patterns learned earlier strongly shape a model's capabilities. To tackle this issue, we introduce a new way to pretrain and mid-train models that incorporates these behaviors earlier. We utilize an existing strong, post-trained model to both rewrite pretraining data and to judge policy model rollouts, thus using reinforcement earlier in training. In our experiments, we show this can give strong gains in quality, safety, factuality and reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The idea of feeding post-trained model outputs back into pretraining data and early rollouts is straightforward but the abstract supplies no experimental details, so the claimed gains in safety and reasoning cannot be evaluated yet.


The main move here is to let an existing post-trained model rewrite raw pretraining text and score policy rollouts so that safety, factuality, and reasoning patterns appear earlier than usual. That reverses the standard pipeline where those behaviors only arrive late. The approach is new in its specific pairing of rewriting plus early reinforcement, even if it sits near other self-improvement and data-filtering work. If the gains hold, it could change how people think about when to inject capability signals and might improve training efficiency. The abstract states clear improvements in quality, safety, factuality, and reasoning, which is the kind of practical outcome worth checking.

The evidence presented is thin. No dataset sizes, no baseline models, no metrics, and no controls for the rewriting step are described, so it is impossible to tell whether any measured lift comes from the method or from implicit curation. The risk that the post-trained model’s own biases or reduced diversity get baked into the new pretraining data is real and unaddressed in the visible text. Because the starting model is independent, circularity stays low, but that does not remove the attribution problem.

This paper is aimed at groups that run large-scale pretraining and want to experiment with mixing stages. A reader looking for reproducible recipes or strong ablations will find little to use right now. It still deserves a serious referee because the proposal is simple to test and targets a genuine limitation in current staged training; the current draft just needs the missing experimental backbone before it can be assessed properly.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Self-Improving Pretraining, a method that employs an existing post-trained model both to rewrite raw pretraining corpora and to judge policy rollouts, thereby injecting signals for safety, factuality, quality, and reasoning into earlier training stages rather than only during post-training. The authors report that this yields strong experimental gains across those dimensions.

Significance. If the claimed gains are reproducible and attributable to the method rather than post-trained-model artifacts, the work would be significant for LLM training pipelines by demonstrating that post-training behaviors can be usefully folded into pretraining dynamics, potentially improving sample efficiency and final model properties.

major comments (2)

[Abstract] Abstract: the central claim of 'strong gains in quality, safety, factuality and reasoning' supplies no quantitative metrics, baseline models, data scales, or controls, so the support for the headline result cannot be evaluated.
[Method] Method (rewriting and judging steps): the approach assumes an existing post-trained model can rewrite pretraining data and score rollouts without propagating its own alignment biases, refusals, or distributional shifts; no ablation, control corpus, or error analysis is described to isolate the contribution of the proposed procedure from implicit curation effects.

minor comments (1)

[Introduction] The title refers to 'Self-Improving' pretraining, yet the procedure relies on an external post-trained model rather than a closed loop from the policy model's own outputs; this distinction should be clarified in the introduction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each major point below and have incorporated revisions to improve clarity and rigor.


Referee: [Abstract] Abstract: the central claim of 'strong gains in quality, safety, factuality and reasoning' supplies no quantitative metrics, baseline models, data scales, or controls, so the support for the headline result cannot be evaluated.

Authors: We agree that the abstract would benefit from greater specificity. In the revised manuscript, we have updated the abstract to include key quantitative results from our experiments, such as relative improvements over standard pretraining baselines, model scales, and dataset sizes, while retaining the high-level summary of the method.
revision: yes

Referee: [Method] Method (rewriting and judging steps): the approach assumes an existing post-trained model can rewrite pretraining data and score rollouts without propagating its own alignment biases, refusals, or distributional shifts; no ablation, control corpus, or error analysis is described to isolate the contribution of the proposed procedure from implicit curation effects.

Authors: This concern is well-taken. We have added new ablation studies comparing our rewritten corpus against a control corpus generated without the post-trained model, along with quantitative error analysis of bias propagation, refusal rates, and distributional shifts in the judging step. These results are presented in the revised method section and appendix to better isolate the procedure's contributions.
revision: yes
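To make the kind of check discussed in this exchange concrete, the sketch below computes two crude corpus statistics that could be compared between an original and a rewritten corpus: a keyword-based refusal rate and a distinct-n lexical diversity ratio. This is an assumption-laden illustration, not the authors' analysis. The marker list, function names, and example sentences are all invented here, and keyword matching is only a rough proxy for refusals.

```python
from typing import Iterable, List, Tuple

# Hypothetical marker phrases; a real analysis would need a vetted list or a classifier.
REFUSAL_MARKERS: Tuple[str, ...] = ("i cannot", "i can't", "i'm sorry")


def refusal_rate(docs: List[str]) -> float:
    """Fraction of documents containing any refusal marker (case-insensitive)."""
    if not docs:
        return 0.0
    hits = sum(any(marker in doc.lower() for marker in REFUSAL_MARKERS) for doc in docs)
    return hits / len(docs)


def distinct_n(docs: Iterable[str], n: int = 2) -> float:
    """Unique n-grams divided by total n-grams, a crude lexical diversity proxy."""
    total = 0
    unique = set()
    for doc in docs:
        tokens = doc.lower().split()
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0


# Invented example corpora, for illustration only.
original = ["the cat sat on the mat", "dogs chase the cat"]
rewritten = ["the cat sat on the mat", "I'm sorry, the cat sat on the mat"]

shift = {
    "refusal_rate": (refusal_rate(original), refusal_rate(rewritten)),
    "distinct_2": (distinct_n(original), distinct_n(rewritten)),
}
```

In this toy case the rewritten corpus shows a higher refusal rate and a lower distinct-2 ratio than the original, which is the direction of shift the referee's concern is about.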

Circularity Check

0 steps flagged

No significant circularity; method uses independent external post-trained model with empirical validation


The paper's core proposal applies an existing strong post-trained model (independent of the training run) to rewrite raw pretraining corpora and judge rollouts, then measures downstream gains experimentally. No equations, derivations, or self-citations are presented that reduce the claimed improvements to a fitted parameter, self-definition, or closed-loop renaming of inputs. The central results remain falsifiable via external benchmarks and do not rely on a load-bearing self-citation chain or uniqueness theorem imported from the authors' prior work. This is the standard case of an honest empirical method whose assumptions (quality of the external model) are stated separately from the derivation itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that post-trained models provide high-quality guidance for earlier training stages without circular dependency or bias injection.

axioms (1)

domain assumption: Post-trained models can provide reliable signals for rewriting pretraining data and judging policy rollouts.
Invoked in the description of the method using an existing strong post-trained model.

pith-pipeline@v0.9.0 · 5467 in / 1094 out tokens · 24218 ms · 2026-05-16T10:05:13.645828+00:00 · methodology
