Small LLMs: Pruning vs. Training from Scratch

Jiachen Zhu; Kunjun Li; Mingjie Sun; Taiming Lu; Yufeng Xu; Zhuang Liu

arxiv: 2606.14150 · v3 · pith:QVMLCX7Cnew · submitted 2026-06-12 · 💻 cs.LG · cs.CL

Pith reviewed 2026-06-27 05:03 UTC · model grok-4.3

keywords: LLM pruning · small language models · training from scratch · model compression · token budget · Llama-3.1 · structured pruning

The pith

Pruning a large LLM to create small ones outperforms training from scratch when the training token budget is limited.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares pruning Llama-3.1-8B at 50-80 percent reduction ratios using six methods spanning depth, width and sparse granularities against training the resulting small models from random initialization. In one setting the training token count is matched between the two approaches; pruned initialization wins. In the second setting training from scratch receives the entire token count consumed by the full pruning-plus-retraining pipeline; finer-grained pruning still wins while coarser structured pruning loses its edge or is overtaken. The results support a practical rule: a large pretrained parent is useful mainly when the downstream training budget is constrained.
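The three granularities named above can be pictured on a toy stack of weight matrices. The sketch below is illustrative only: the function names, the matrix shapes, and the magnitude criterion used for the sparse case are assumptions, and none of them is taken from the paper, whose six methods are not named in this summary.

```python
# Toy illustration of three pruning granularities on a stack of weight matrices.
# Shapes, values, and the magnitude criterion are assumptions for illustration.

def make_layers(num_layers=4, rows=3, cols=4):
    """Build toy layers whose weights are the distinct values 1..rows*cols."""
    return [[[float(r * cols + c + 1) for c in range(cols)] for r in range(rows)]
            for _ in range(num_layers)]

def depth_prune(layers, keep_layers):
    """Depth granularity: remove whole layers, keeping the listed indices."""
    return [layers[i] for i in sorted(keep_layers)]

def width_prune(layers, keep_rows):
    """Width granularity: remove whole rows (units) from every layer."""
    return [[layer[r] for r in sorted(keep_rows)] for layer in layers]

def sparse_prune(layers, ratio):
    """Sparse granularity: zero the smallest-magnitude weights inside each layer."""
    pruned = []
    for layer in layers:
        cells = [(r, c) for r in range(len(layer)) for c in range(len(layer[0]))]
        cells.sort(key=lambda rc: abs(layer[rc[0]][rc[1]]))
        drop = set(cells[:int(len(cells) * ratio)])  # positions to zero out
        pruned.append([[0.0 if (r, c) in drop else w for c, w in enumerate(row)]
                       for r, row in enumerate(layer)])
    return pruned

def count_nonzero(layers):
    """Count the weights that survive pruning."""
    return sum(1 for layer in layers for row in layer for w in row if w != 0.0)

layers = make_layers()
print(count_nonzero(layers))                        # unpruned stack
print(count_nonzero(depth_prune(layers, {0, 2})))   # half of the layers kept
print(count_nonzero(width_prune(layers, {0, 1})))   # two of three rows kept
print(count_nonzero(sparse_prune(layers, 0.5)))     # half of each layer zeroed
```

Depth and width pruning change the shape of the model, while sparse pruning keeps the shape and zeroes individual entries, which is the sense in which it is the finest of the three granularities.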

Core claim

Under matched training token budgets, pruned initialization from the parent model consistently outperforms random initialization. When training from scratch is instead allotted the full token budget of the pruning pipeline, pruning at finer granularities retains an advantage while coarser structured pruning can be matched or surpassed. This indicates that the parent model transfers knowledge that additional training tokens alone cannot fully recover, but only when pruning operates at fine granularity.

What carries the argument

Token-matched experimental settings that isolate the effect of pruned initialization from a large parent versus random initialization across six pruning methods at ratios 0.5-0.8.
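A minimal sketch of the two token-matched settings, assuming the pipeline budget splits into a pruning stage and a retraining stage. The class name, field names, and the numbers are hypothetical placeholders in arbitrary units, not values from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PipelineBudget:
    """Token budget of a pruning pipeline (hypothetical two-stage split)."""
    pruning_tokens: int      # tokens consumed by the pruning stage
    retraining_tokens: int   # tokens spent retraining the pruned model

    @property
    def total(self) -> int:
        return self.pruning_tokens + self.retraining_tokens

def scratch_budget(pipeline: PipelineBudget, setting: int) -> int:
    """Tokens granted to the from-scratch baseline under each setting."""
    if setting == 1:
        # Setting 1: match only the training tokens of the pruned model.
        return pipeline.retraining_tokens
    if setting == 2:
        # Setting 2: match everything the pruning pipeline consumed.
        return pipeline.total
    raise ValueError("setting must be 1 or 2")

budget = PipelineBudget(pruning_tokens=40, retraining_tokens=10)  # placeholders
print(scratch_budget(budget, 1), scratch_budget(budget, 2))
```

Setting 2 is the harder test for pruning, because the baseline is compensated for every token the pipeline spent rather than only the retraining share.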

If this is right

When training tokens are scarce, pruning a large parent is the stronger route to a small model.
When training tokens are abundant, training from scratch becomes competitive for coarse structured pruning.
The performance gap between pruned and random initialization narrows as the pruning ratio increases.
Knowledge transferred from the parent model cannot be fully recovered by additional tokens alone when pruning is fine-grained.
A large pretrained parent is not always required if the practitioner can afford a large training budget.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

For extremely large future token budgets the marginal value of maintaining and pruning very large parent models may decline.
Hardware or data-center operators who can allocate long training runs may safely skip the cost of keeping oversized parent models.
The granularity dependence suggests that hybrid pipelines mixing coarse structured pruning with later fine unstructured pruning deserve direct comparison.

Load-bearing premise

The token budgets are accurately matched between the pruning-plus-retraining pipeline and the pure from-scratch condition, and the six tested pruning methods are representative of the broader space of possible techniques.

What would settle it

A controlled replication in which training from scratch with the full pipeline token budget matches or exceeds every pruned model at every granularity and every ratio would falsify the retained advantage of fine-grained pruning.
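The falsification condition can be written as a predicate over a results table. This is a sketch under stated assumptions: the table layout, the function names, and the scores are hypothetical, and higher scores are assumed to be better.

```python
def scratch_matches_everywhere(results):
    """True if scratch training matches or exceeds the pruned model in every cell.

    `results` maps (granularity, ratio) -> (pruned_score, scratch_score).
    """
    return all(scratch >= pruned for pruned, scratch in results.values())

def retained_advantages(results):
    """List the (granularity, ratio) cells where the pruned model still wins."""
    return sorted(key for key, (pruned, scratch) in results.items()
                  if pruned > scratch)

# Hypothetical scores, invented purely to exercise the functions.
results = {
    ("sparse", 0.5): (0.61, 0.58),
    ("width", 0.5): (0.55, 0.56),
    ("depth", 0.8): (0.40, 0.43),
}
print(scratch_matches_everywhere(results))
print(retained_advantages(results))
```

A single cell in which the pruned model wins is enough to keep the predicate false, so the claim is only refuted by a replication that is clean across every granularity and ratio.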

Figures

Figures reproduced from arXiv: 2606.14150 by Jiachen Zhu, Kunjun Li, Mingjie Sun, Taiming Lu, Yufeng Xu, Zhuang Liu.

**Figure 1.** Initialization by pruning provides a strong advantage over random initialization, but this advantage diminishes as training continues. Left: under the same training token budget, pruning initialization beats random initialization, although the advantage decreases with longer training. Right: when the random initialization baseline is trained with the full token budget used by the entire pruning pipeline, i…

**Figure 2.** Pruning granularity and method overview.

**Figure 3.** Four structured pruning methods across retraining token budgets (Llama-3.1-8B …

**Figure 4.** S50 vs. P200-R50 across model sizes for depth and width pruning.

**Figure 5.** P200-R50 versus S250 across pruning methods.

Original abstract

Pruning promises a shortcut to strong small language models. In this work, we examine this promise by pruning Llama-3.1-8B at pruning ratios of 0.5-0.8 with six methods spanning depth, width, and sparse granularities, under two controlled token-matched settings. (1) With the same training token budget, pruned initialization consistently outperforms random initialization. This shows that the parent model provides a strong starting point, although the advantage narrows as the training token budget grows and as the pruning ratio rises, nearly vanishing at the highest pruning ratio we study. (2) When training from scratch is instead given the full token budget consumed by the whole pipeline, pruning at finer granularities still retains an advantage, while coarser structured pruning can be matched or surpassed. This suggests that the parent model transfers knowledge that additional training tokens alone cannot fully recover, but only at fine granularity. Taken together, our results yield a clear recommendation: with a large pretrained model in hand and a limited training token budget, pruning is better than training from scratch; when the training budget is not limited, training from scratch can be competitive for coarser pruning, so a large pretrained parent is not always necessary.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague.

Pruning holds an edge over scratch training only under tight token budgets and fine granularity; the token-matching claim needs full methods to confirm.

The main takeaway is that under explicitly token-matched conditions, pruning Llama-3.1-8B at 0.5-0.8 ratios beats random initialization when the post-pruning training budget is small, but the gap shrinks as the token budget grows or the pruning ratio rises. When scratch training gets the full pipeline token count instead, finer-grained pruning still wins while coarser structured methods can be matched or beaten. This supplies a practical rule of thumb for when a parent model is worth keeping.

What the work does cleanly is run the same six pruning methods across two controlled regimes and report how the advantage changes with budget and ratio. That controlled narrowing is the incremental empirical piece that prior pruning-versus-scratch comparisons often left loose.

The soft spot is exactly the one flagged as the load-bearing premise: the abstract asserts token matching but gives no line-item accounting of pretraining tokens, epochs, sequence lengths, or optimizer steps. Without those details it is impossible to know whether the two settings are truly equivalent or whether hidden differences in data quality or training dynamics drive the reported trends. No error bars or statistical tests are mentioned either, so the directional claims rest on point estimates alone.

The paper is for practitioners deciding whether to prune an existing 8B model or start fresh when token budgets are constrained. It deserves a serious referee once the methods section is checked for the token arithmetic; the core question is well-posed and the experiments are set up to answer it even if the current write-up leaves the controls opaque.

Referee Report

1 major / 2 minor

Summary. The paper compares pruning Llama-3.1-8B at ratios 0.5-0.8 using six methods (depth, width, sparse) against training from scratch, under two controlled token-matched settings. Setting (1) uses the same post-pruning training tokens and finds pruned init outperforms random init, with the gap narrowing at higher budgets/ratios. Setting (2) gives scratch training the full pipeline token count and finds finer pruning retains advantage while coarser structured pruning can be matched or surpassed. The recommendation is that pruning is preferable with limited budgets but a large parent is not always necessary for coarser pruning with unlimited budgets.

Significance. If the token budgets are verifiably matched and the six methods are representative, the work supplies practical guidance on when knowledge inherited from a pruned parent cannot be recovered by extra tokens alone. The two-setting design with held-out performance measurement is a positive feature for an empirical study in this area.

major comments (1)

[Abstract] The central recommendation rests on the two settings being 'token-matched,' yet the abstract provides no explicit accounting of how pretraining tokens are included in the full pipeline budget, how epochs/sequence lengths are normalized across conditions, or whether batch sizes and optimizer steps are identical. This equivalence is load-bearing for the claim that pruning's advantage at limited budgets is due to initialization rather than budget mismatch.

minor comments (2)

[Abstract] The six pruning methods are described only at the level of 'spanning depth, width, and sparse granularities' without names or citations; the main text should list them explicitly with references for reproducibility.
[Abstract] Directional trends are reported without mention of error bars, number of runs, or statistical tests; these should be added to the results sections to allow assessment of the reported narrowing of advantages.
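One way the requested uncertainty reporting could look is sketched below, assuming several independently seeded runs per condition. The scores are hypothetical, and treating the two conditions as independent when combining standard errors is an assumption of this sketch, not something the paper states.

```python
from math import sqrt
from statistics import mean, stdev

def summarize(scores):
    """Return (mean, sample standard deviation, standard error) over runs."""
    m = mean(scores)
    s = stdev(scores)                  # sample std, needs at least two runs
    return m, s, s / sqrt(len(scores))

def gap_with_error(pruned_runs, scratch_runs):
    """Mean advantage of pruned over scratch, with a combined standard error.

    Assumes the two sets of runs are independent.
    """
    mean_p, _, se_p = summarize(pruned_runs)
    mean_s, _, se_s = summarize(scratch_runs)
    return mean_p - mean_s, sqrt(se_p ** 2 + se_s ** 2)

# Hypothetical accuracy scores from three seeds per condition.
pruned_runs = [0.50, 0.52, 0.54]
scratch_runs = [0.48, 0.50, 0.52]
gap, se = gap_with_error(pruned_runs, scratch_runs)
print(f"gap = {gap:.3f} +/- {se:.3f}")
```

Reporting the gap together with its standard error makes it possible to judge whether a narrowing advantage is a real trend or falls within run-to-run noise.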

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the abstract's description of the token-matching procedure. The concern is well-taken, as precise budget accounting is central to the claims. We address the point below and will revise the abstract accordingly.

Referee: [Abstract] The central recommendation rests on the two settings being 'token-matched,' yet the abstract provides no explicit accounting of how pretraining tokens are included in the full pipeline budget, how epochs/sequence lengths are normalized across conditions, or whether batch sizes and optimizer steps are identical. This equivalence is load-bearing for the claim that pruning's advantage at limited budgets is due to initialization rather than budget mismatch.

Authors: We agree the abstract would benefit from greater precision on these points. In the revised version we will add a concise clause clarifying that (i) Setting (2) assigns the scratch model the sum of the parent model's original pretraining tokens plus the post-pruning training tokens, (ii) all runs use identical batch size, sequence length, and optimizer hyperparameters, and (iii) the number of optimizer steps is therefore matched once sequence length and batch size are fixed. The full experimental protocol, including these normalizations, is already detailed in Section 3; the abstract revision will simply surface the key equivalences without lengthening the summary excessively.

Revision: yes
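Point (iii) of the response follows from simple arithmetic, sketched here under the assumption that every optimizer step consumes exactly batch size times sequence length tokens. The function name and the numbers are hypothetical.

```python
def optimizer_steps(total_tokens: int, batch_size: int, seq_len: int) -> int:
    """Number of full optimizer steps a token budget pays for.

    Assumes each step consumes batch_size * seq_len tokens.
    """
    tokens_per_step = batch_size * seq_len
    return total_tokens // tokens_per_step

# Two runs with the same token budget, batch size, and sequence length
# necessarily take the same number of steps.
pruned_steps = optimizer_steps(4_194_304, batch_size=4, seq_len=1024)
scratch_steps = optimizer_steps(4_194_304, batch_size=4, seq_len=1024)
print(pruned_steps, scratch_steps)
```

If either batch size or sequence length differed between conditions, equal token budgets would no longer imply equal step counts, which is why the referee asks for both to be reported.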

Circularity Check

0 steps flagged

No circularity: purely empirical comparison with independent measurements

The paper conducts controlled experiments comparing pruned initializations against random initialization and full-pipeline token budgets, reporting held-out performance metrics. No equations, fitted parameters, uniqueness theorems, or self-citations are used to derive the central claims; the token-matching conditions are stated as experimental controls rather than definitions, and results are falsifiable against external benchmarks. This is a standard empirical study with no load-bearing reductions to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical machine-learning study; no mathematical derivations or new theoretical constructs. Relies only on standard assumptions of LLM training such as the validity of next-token prediction loss and the representativeness of the chosen evaluation metrics.
