Procedural-skill SFT across capacity tiers: A W-Shaped pre-SFT Trajectory and Regime-Asymmetric Mechanism on 0.8B-4B Qwen3.5 Models

Igor Strozzi

arxiv: 2605.11907 · v2 · submitted 2026-05-12 · 💻 cs.LG

Pith reviewed 2026-05-15 05:45 UTC · model grok-4.3

classification 💻 cs.LG

keywords: procedural skills, supervised fine-tuning, SFT, model scaling, Qwen, LLM evaluation, W-shaped trajectory, regime asymmetry

The pith

Procedural SFT produces roughly uniform gains across 0.8B to 4B models, with the largest absolute lifts where pre-SFT performance is weakest.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper measures how supervised fine-tuning on procedural skills changes pass rates on a 200-task holdout when evaluation uses matched LLM judges instead of format extractors. It finds the SFT-attributable lift stays similar in size at the three tested scales while the final post-SFT scores vary mainly because the untuned base models already follow a W-shaped performance curve on the target procedure. A reader would care because the result replaces two earlier claims, one that SFT only teaches format at small scales and one that SFT gains shrink at larger scales, with a single size-dependent pattern that can be checked at 8B and 14B.

Core claim

Under matched-path LLM-only scoring the SFT-attributable procedural-Δ lift is roughly uniform across sizes (+0.070 / +0.040 / +0.075 at 0.8B / 2B / 4B). Variation in post-SFT Δ is dominated by a W-shaped pre-SFT base trajectory that hurts 0.8B and 4B while helping 2B; SFT therefore supplies its largest absolute correction precisely where the base model struggles with the five-step procedure.
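The arithmetic behind this claim can be checked directly. The sketch below assumes that procedural-Δ is the curated pass rate minus the baseline pass rate (consistent with the 0.510 / 0.565 / +0.055 fragment quoted from the paper's appendix) and that the SFT-attributable lift is the post-SFT Δ minus the pre-SFT Δ. The numbers are the ones reported in the abstract; the dictionary layout and function name are illustrative, not the author's code.

```python
# Values copied from the paper's abstract; names and layout are illustrative.
PRE_SFT_DELTA = {"0.8B": -0.075, "2B": 0.060, "4B": -0.010}
POST_SFT_DELTA = {"0.8B": -0.005, "2B": 0.100, "4B": 0.065}


def sft_lift(pre, post):
    """Assumed definition: lift = post-SFT delta minus pre-SFT delta, per model size."""
    return {size: round(post[size] - pre[size], 3) for size in pre}


lifts = sft_lift(PRE_SFT_DELTA, POST_SFT_DELTA)
for size, lift in lifts.items():
    print(
        f"{size}: pre={PRE_SFT_DELTA[size]:+.3f} "
        f"post={POST_SFT_DELTA[size]:+.3f} lift={lift:+.3f}"
    )
```

Under these assumptions the subtraction reproduces the reported lifts, which shows that the spread in post-SFT Δ comes mostly from the pre-SFT term rather than from the lift.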

What carries the argument

The W-shaped pre-SFT base trajectory, which determines where SFT supplies the largest absolute procedural lift.

If this is right

SFT contribution to procedural performance stays stable rather than shrinking between 2B and 4B.
Pre-SFT performance on the target procedure is non-monotonic across the tested capacity range.
The largest absolute SFT gains occur exactly at the capacity tiers where the base model already performs worst on the procedure.
Earlier conclusions that SFT at 0.8B is only format learning or that SFT value drops at 4B were produced by mismatched evaluation paths.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the W-shape persists, training recipes for mid-size models could deliberately target the capacity tiers where base procedural competence dips.
The pattern suggests that scaling studies for procedural skills should track base trajectory shape rather than assume monotonic improvement.
A direct test at 8B and 14B would distinguish whether the asymmetry is a fixed feature of this procedure family or a transient effect limited to the 0.8B-4B window.

Load-bearing premise

The 200-task holdout and 353 demonstration rows stand for general procedural skills, and the LLM judges produce unbiased pass/fail decisions without format or model-family bias.

What would settle it

Measuring the same 5-step procedure on 8B or 14B Qwen3.5 models and finding either a non-uniform SFT lift or a pre-SFT trajectory that is not W-shaped would falsify the claimed regime-asymmetric mechanism.

Figures

Figures reproduced from arXiv: 2605.11907 by Igor Strozzi.

**Figure 1.** Baseline and curated pass rates across all eleven evaluated configurations: pre-SFT controls, the five 0.8B SFT iterations (§6), and the model-size pivot to 2B and 4B.

§3.2 W-shaped pre-SFT base trajectory. The pre-SFT base trajectory of procedural responsiveness is non-monotone with two negative pockets: • pre-SFT 0.8B (HF, LLM-only): ∆ = −0.075 (deepest trough) • pre-SFT 2B (HF, LLM-on…

**Figure 2.** Per-skill procedural-∆ at v2.0 (n=5 per skill, 40 skills). 11 skills lift on curated; 25 are flat (mostly because baseline is already at ceiling); 4 regress. The lift cluster concentrates on interpretive and spatial-social skills; the regression cluster on deductive-logic skills where the model has ceiling competence on baseline.

**Figure 3.** Per-skill procedural-∆ at v2.0 versus per-skill baseline pass rate (LLM-only judging, n=5 tasks per skill, 40 skills). Spearman ρ = −0.227. Skills where the base is already at ceiling on baseline cannot benefit from procedure injection (flat cluster); skills with lower baseline have the most ∆ headroom (lift cluster). The negative slope is qualitative evidence that SFT’s distinctive value is in scaffolding…

**Figure 4.** v1.9 lift over pre-SFT 0.8B decomposed into base-model scaling (gray; measured between the…

**Figure 5.** Left: bootstrap distributions of the procedural-…

Original abstract

We measure procedural-skill SFT contribution across three Qwen3.5 dense scales (0.8B, 2B, 4B) on a 200-task / 40-skill holdout, with Claude Haiku 4.5 as a frontier reference. The corpus is 353 rows of (task + procedural-skill block, Opus chain-of-thought, judge-pass) demonstrations. Main finding. Under matched-path LLM-only scoring, the SFT-attributable procedural-$\Delta$ lift is roughly uniform across sizes: $+0.070 / +0.040 / +0.075$ at 0.8B / 2B / 4B. Variation in post-SFT $\Delta$ ($-0.005$, $+0.100$, $+0.065$) is dominated by a W-shaped pre-SFT base trajectory ($-0.075$, $+0.060$, $-0.010$, Haiku-4-5 at $+0.030$): the 5-step procedure hurts 0.8B and 4B, helps 2B, and helps frontier Haiku modestly. SFT works hardest in absolute terms where the base struggles with the procedure -- a regime-asymmetric pattern with a falsifiable prediction at 8B/14B. Methodology. (i) A bench format-compliance artifact: 83.5% of the holdout uses a deterministic ANSWER-line extractor that under-counts free-form-prose conclusions; our LLM-only re-judge reveals it was systematically biased against the curated condition. (ii) A negative-iteration sequence at 0.8B: three well-formed recipe variants cluster post-SFT curated pass-rate within a 2 pp band, constraining the absolute-pass-rate ceiling to base capacity rather than recipe. Cross-family judge validation. GPT-5.4 via OpenRouter on all 7 configurations (2800 paired episodes) agrees on the direction of every per-student finding: Cohen's $\kappa \geq 0.754$, agreement $\geq 93.25\%$, max headline $\Delta$ shift $\leq 0.035$ pp. Two earlier framings -- "format-only learning at 0.8B" and "SFT contribution shrinks at 4B" -- were path-mismatch artifacts; this paper supersedes both. Single-seed evaluation; threats itemised in the paper.
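To make the format-compliance artifact in point (i) concrete, here is a minimal sketch of how a deterministic ANSWER-line extractor can fail a response whose conclusion is correct but written as free-form prose. The regular expression, the function names, and the two sample responses are all hypothetical; the paper's actual extractor is not shown in this page.

```python
import re

# Hypothetical pattern: a line of the form "ANSWER: <text>". Not the paper's extractor.
ANSWER_LINE = re.compile(r"^ANSWER:[ \t]*(.+)$", re.MULTILINE)


def extract_answer(response):
    """Return the text after the last ANSWER line, or None if no such line exists."""
    matches = ANSWER_LINE.findall(response)
    return matches[-1].strip() if matches else None


def extractor_pass(response, gold):
    """Deterministic scoring: pass only if an ANSWER line exists and matches the gold answer."""
    answer = extract_answer(response)
    return answer is not None and answer.lower() == gold.lower()


# Two made-up responses that reach the same conclusion.
formatted = "Step 5: combine the partial results.\nANSWER: 42"
prose = "Working through the five steps, the result is therefore 42."

print(extractor_pass(formatted, "42"))  # format-compliant response
print(extractor_pass(prose, "42"))      # same conclusion, no ANSWER line
```

If one condition produces more prose-style conclusions than the other, a scorer of this kind will under-count it, which is the direction of bias the abstract describes for the curated condition; an LLM judge that reads the whole response does not depend on the ANSWER line.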

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The W-shaped pre-SFT curve and roughly uniform SFT lifts across 0.8B-4B are the real observations here, but single-seed runs leave the uniformity claim shaky.

The paper's core contribution is an empirical measurement: once format artifacts are controlled for, SFT on procedural skills produces similar absolute lifts (+0.070 / +0.040 / +0.075) across the three Qwen3.5 sizes. The post-SFT differences mostly trace back to a W-shaped pre-SFT baseline in which the 5-step procedure hurts the 0.8B and 4B models but helps the 2B. That regime-asymmetric pattern is the new piece: SFT appears to deliver more where the base model is weakest on the procedure.

The author also fixed an earlier format-compliance bias by switching to LLM-only judging and checked the result against GPT-5.4, getting high agreement and no direction flips. That part is solid and directly addresses the author's own prior claims about format-only learning and shrinking returns. The 353-row demonstration set and 200-task holdout are narrow but internally consistent after the correction.

The main weakness is the single-seed design. The reported deltas are close in size to the judge-shift bound (stated as ≤ 0.035 pp in the abstract): the 2B lift is +0.040, and the gap between the 2B and 4B lifts is 0.035. With only 353 examples, a training run can easily move pass rates by several points through initialization or data ordering alone. No standard deviations or repeated runs are given, so it is hard to know whether the uniform lift and the W-shape are stable or tied to the particular seed. The falsifiable prediction at 8B/14B is useful, but it will need multi-seed confirmation to carry weight.

This work is aimed at researchers running SFT on small-to-medium models for procedural tasks. A reader who cares about scaling laws for fine-tuning will get something concrete from the trajectory data and the judge-validation numbers. It is worth sending to peer review because the observation is falsifiable and the artifact fix is reproducible, even though the current evidence needs more runs to pin down the effect sizes.

Referee Report

2 major / 1 minor

Summary. The paper measures the contribution of procedural-skill supervised fine-tuning (SFT) across Qwen3.5 model scales (0.8B, 2B, 4B) on a 200-task/40-skill holdout set, with Claude Haiku 4.5 as a frontier reference. Under matched-path LLM-only scoring it finds roughly uniform SFT-attributable lifts of +0.070, +0.040, and +0.075 respectively. It attributes variation in post-SFT performance to a W-shaped pre-SFT trajectory and corrects a format-compliance artifact in the evaluation, a correction validated by high agreement with a GPT-5.4 judge.

Significance. If substantiated, these findings highlight a regime-asymmetric mechanism where SFT provides greater absolute benefits for models struggling with procedural tasks, offering a falsifiable prediction for 8B and 14B scales. The work's strengths lie in its empirical approach with cross-judge validation (κ ≥ 0.754, agreement ≥ 93.25%) and correction of an evaluation bias that affected prior interpretations, contributing to more reliable assessment of SFT effects in small-to-medium models.
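For readers who want to see what the agreement statistics measure, the following is a minimal sketch of raw agreement and Cohen's κ for two judges issuing binary pass/fail labels. The function name and the ten toy labels are invented for illustration; they are not the paper's 2800 paired episodes.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters with binary labels (1 = pass, 0 = fail)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of episodes where both judges give the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each judge's marginal pass rate.
    p_a = sum(labels_a) / n
    p_b = sum(labels_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        return 1.0  # both judges are constant and identical
    return (observed - expected) / (1 - expected)


# Toy labels (not the paper's data): the judges disagree on one of ten episodes.
judge_a = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
judge_b = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

agreement = sum(a == b for a, b in zip(judge_a, judge_b)) / len(judge_a)
kappa = cohens_kappa(judge_a, judge_b)
print(f"agreement={agreement:.2f} kappa={kappa:.2f}")
```

κ discounts the agreement two judges would reach by chance given their marginal pass rates, which is why it sits below raw agreement in both the toy example and the reported figures.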

major comments (2)

[Abstract, main finding] The reported SFT-Δ lifts (+0.070 / +0.040 / +0.075) and the W-shaped pre-SFT trajectory (-0.075, +0.060, -0.010) are obtained from single-seed training and evaluation. Given that the 0.035 difference between the 2B and 4B lifts equals the maximum headline Δ shift from judge swapping, the claims of uniformity and regime-asymmetry require supporting multi-seed statistics or variance estimates to rule out run-to-run stochasticity.
[Methodology] The construction of the 353-row demonstration corpus and the 200-task holdout is described at a high level; without public release of the data or detailed holdout sampling procedure, independent verification of the bias correction and generalizability of the procedural skills is limited.
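One inexpensive variance estimate that needs no retraining is a paired bootstrap over holdout tasks. The sketch below is an assumption-laden illustration: the per-task outcomes are synthetic, the function name is hypothetical, and the percentile interval is one of several reasonable choices. It captures task-sampling noise only; seed-to-seed training variance, which is the referee's main concern, would still require repeated runs.

```python
import random


def paired_bootstrap_delta(baseline, curated, n_resamples=2000, seed=0):
    """Percentile 95% interval for mean(curated) - mean(baseline), resampling tasks in pairs."""
    if len(baseline) != len(curated) or not baseline:
        raise ValueError("need two equal-length, non-empty outcome lists")
    rng = random.Random(seed)
    n = len(baseline)
    # Per-task paired differences: +1 curated-only pass, -1 baseline-only pass, 0 otherwise.
    diffs = [c - b for b, c in zip(baseline, curated)]
    point = sum(diffs) / n
    samples = []
    for _ in range(n_resamples):
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        samples.append(sum(resample) / n)
    samples.sort()
    lo = samples[int(0.025 * n_resamples)]
    hi = samples[int(0.975 * n_resamples) - 1]
    return point, lo, hi


# Synthetic outcomes for 200 tasks (NOT the paper's data): 1 = pass, 0 = fail.
data_rng = random.Random(1)
baseline = [1 if data_rng.random() < 0.50 else 0 for _ in range(200)]
curated = [1 if data_rng.random() < 0.55 else 0 for _ in range(200)]

point, lo, hi = paired_bootstrap_delta(baseline, curated)
print(f"delta={point:+.3f} 95% interval=[{lo:+.3f}, {hi:+.3f}]")
```

Comparing the width of such an interval with the differences between the reported lifts gives a first sense of whether those differences exceed task-sampling noise.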

minor comments (1)

[Abstract] The notation for deltas (e.g., procedural-Δ) is clear in context but could be defined more explicitly upon first use for readers unfamiliar with the prior framings mentioned.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and valuable feedback on our manuscript. We address each major comment in detail below, providing clarifications and indicating planned revisions to strengthen the paper.

Referee: [Abstract, main finding] The reported SFT-Δ lifts (+0.070 / +0.040 / +0.075) and the W-shaped pre-SFT trajectory (-0.075, +0.060, -0.010) are obtained from single-seed training and evaluation. Given that the 0.035 difference between the 2B and 4B lifts equals the maximum headline Δ shift from judge swapping, the claims of uniformity and regime-asymmetry require supporting multi-seed statistics or variance estimates to rule out run-to-run stochasticity.

Authors: We concur that single-seed training and evaluation limits the ability to quantify run-to-run variance, and the noted 0.035 difference highlights the need for caution. The manuscript explicitly states 'Single-seed evaluation; threats itemised in the paper' and discusses the judge-swapping sensitivity. The high inter-judge agreement (κ ≥ 0.754, ≥93.25% agreement) across all configurations supports the directional consistency of our findings. To address this, we will include additional multi-seed results for the 2B model on a reduced task set in the appendix of the revised manuscript, providing variance estimates to support the uniformity claim.

Revision: yes

Referee: [Methodology] The construction of the 353-row demonstration corpus and the 200-task holdout is described at a high level; without public release of the data or detailed holdout sampling procedure, independent verification of the bias correction and generalizability of the procedural skills is limited.

Authors: We appreciate the concern for reproducibility. In the revised manuscript, we will provide a more granular description of the holdout construction, including the sampling strategy to ensure coverage of the 40 skills across 200 tasks and the curation process for the 353 demonstration rows. While we cannot release the full dataset due to licensing restrictions on the source materials, we will include representative examples and detailed pseudocode for the sampling procedure to facilitate independent verification.

Revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical measurements on holdout tasks with external judges

The paper reports observed performance deltas from SFT training runs on a fixed 353-row demonstration corpus evaluated against a 200-task holdout using independent LLM judges (Claude Haiku 4.5 and GPT-5.4). No equations, fitted parameters, ansatzes, or uniqueness theorems appear in the derivation chain. The reported lifts (+0.070 / +0.040 / +0.075) and W-shaped pre-SFT trajectory are direct outputs of the experimental protocol rather than reductions to self-referential inputs or self-citations. The work is self-contained against external benchmarks and falsifiable at larger scales.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the representativeness of the 200-task holdout and the reliability of the LLM-as-judge protocol; no free parameters or invented entities are introduced.

axioms (2)

domain assumption: The 200-task / 40-skill holdout set and 353 demonstration rows capture general procedural skills without selection bias from the specific corpus. All reported deltas are computed on this holdout.
domain assumption: LLM judges (Claude Haiku 4.5 and GPT-5.4) produce consistent, unbiased pass/fail labels across model sizes and conditions. All procedural-Δ values depend on these judgments.

pith-pipeline@v0.9.0 · 5779 in / 1489 out tokens · 53608 ms · 2026-05-15T05:45:50.453223+00:00 · methodology

Reference graph

Works this paper leans on

10 extracted references · 10 canonical work pages · 2 internal anchors

[1] D. Yu, S. Kaur, A. Gupta, J. Brown-Cohen, A. Goyal, S. Arora. Skill-Mix: A Flexible and Expandable Family of Evaluations for AI Models. NeurIPS 2024.
[2] S. Arora, A. Goyal. A Theory for Emergence of Complex Skills in Language Models. 2023.
[3] Li et al. SkillsBench: Benchmarking the Effectiveness of Skill Injection on LLMs. 2026.
[4] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen. LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022.
[5] T. Dettmers, A. Pagnoni, A. Holtzman, L. Zettlemoyer. QLoRA: Efficient Finetuning of Quantized LLMs. NeurIPS 2023.
[6] L. Tunstall, E. Beeching, et al. TRL: Transformer Reinforcement Learning library, version ≥ 0.18.
[7] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, I. Stoica. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS Datasets and Benchmarks 2023.
[8] S. Biderman, H. Schoelkopf, et al. Lessons from the Trenches on Reproducible Evaluation of Language Models. arXiv:2405.14782, 2024.
[9] A. Yang, A. Li, B. Yang, et al. Qwen3 Technical Report. arXiv:2505.09388, 2025.
[10] Qwen Team. Qwen3.5: Accelerating Productivity with Native Multimodal Agents. Release blog and model cards, 2026. https://huggingface.co/Qwen/Qwen3.5-4B-Base

Appendix A, Path-mismatch resolution for pre-SFT 0.8B. The original pre-SFT 0.8B baseline (baseline 0.510 / curated 0.565 / ∆ +0.055) was run via Ollama, not the same HuggingFace transformers path used for 2B/4B. Q...
