Procedural-skill SFT across capacity tiers: A W-Shaped pre-SFT Trajectory and Regime-Asymmetric Mechanism on 0.8B-4B Qwen3.5 Models
Pith reviewed 2026-05-15 05:45 UTC · model grok-4.3
The pith
Procedural SFT produces roughly uniform gains across 0.8B to 4B models, with the largest absolute lifts where pre-SFT performance is weakest.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under matched-path LLM-only scoring the SFT-attributable procedural-Δ lift is roughly uniform across sizes (+0.070 / +0.040 / +0.075 at 0.8B / 2B / 4B). Variation in post-SFT Δ is dominated by a W-shaped pre-SFT base trajectory that hurts 0.8B and 4B while helping 2B; SFT therefore supplies its largest absolute correction precisely where the base model struggles with the five-step procedure.
What carries the argument
The W-shaped pre-SFT base trajectory, which determines where SFT supplies the largest absolute procedural lift.
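The decomposition behind this claim can be checked with a few lines of arithmetic. The sketch below uses the figures quoted in the abstract and assumes that post-SFT Δ equals pre-SFT Δ plus the SFT-attributable lift; the variable names are illustrative and do not come from the paper.

```python
# Pre-SFT procedural-Δ, SFT-attributable lift, and reported post-SFT Δ
# per model size, copied from the abstract. Names are illustrative only.
pre_sft_delta = {"0.8B": -0.075, "2B": 0.060, "4B": -0.010}
sft_lift = {"0.8B": 0.070, "2B": 0.040, "4B": 0.075}
reported_post = {"0.8B": -0.005, "2B": 0.100, "4B": 0.065}

# Assumed identity: post-SFT Δ = pre-SFT Δ + SFT lift.
post_sft_delta = {
    size: round(pre_sft_delta[size] + sft_lift[size], 3)
    for size in pre_sft_delta
}

for size, value in post_sft_delta.items():
    print(f"{size}: post-SFT Δ = {value:+.3f} (reported {reported_post[size]:+.3f})")

# Spread of each component across the three sizes.
lift_range = round(max(sft_lift.values()) - min(sft_lift.values()), 3)
base_range = round(max(pre_sft_delta.values()) - min(pre_sft_delta.values()), 3)
print(f"lift range {lift_range:.3f}, base range {base_range:.3f}")
```

Under that assumption the sums reproduce the reported post-SFT values, and the lift spans a range of 0.035 while the pre-SFT base spans 0.135, which is the sense in which the base trajectory dominates the variation.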
If this is right
- SFT contribution to procedural performance stays stable rather than shrinking between 2B and 4B.
- Pre-SFT performance on the target procedure is non-monotonic across the tested capacity range.
- The largest absolute SFT gains occur exactly at the capacity tiers where the base model already performs worst on the procedure.
- Earlier conclusions that SFT at 0.8B is only format learning or that SFT value drops at 4B were produced by mismatched evaluation paths.
Where Pith is reading between the lines
- If the W-shape persists, training recipes for mid-size models could deliberately target the capacity tiers where base procedural competence dips.
- The pattern suggests that scaling studies for procedural skills should track base trajectory shape rather than assume monotonic improvement.
- A direct test at 8B and 14B would distinguish whether the asymmetry is a fixed feature of this procedure family or a transient effect limited to the 0.8B-4B window.
Load-bearing premise
The 200-task holdout and the 353 demonstration rows are representative of general procedural skills, and the LLM judges produce unbiased pass/fail decisions without format or model-family bias.
What would settle it
Measuring the same 5-step procedure on 8B or 14B Qwen3.5 models and finding either a non-uniform SFT lift or a pre-SFT trajectory that is not W-shaped would falsify the claimed regime-asymmetric mechanism.
Original abstract
We measure procedural-skill SFT contribution across three Qwen3.5 dense scales (0.8B, 2B, 4B) on a 200-task / 40-skill holdout, with Claude Haiku 4.5 as a frontier reference. The corpus is 353 rows of (task + procedural-skill block, Opus chain-of-thought, judge-pass) demonstrations. Main finding. Under matched-path LLM-only scoring, the SFT-attributable procedural-$\Delta$ lift is roughly uniform across sizes: $+0.070 / +0.040 / +0.075$ at 0.8B / 2B / 4B. Variation in post-SFT $\Delta$ ($-0.005$, $+0.100$, $+0.065$) is dominated by a W-shaped pre-SFT base trajectory ($-0.075$, $+0.060$, $-0.010$, Haiku-4-5 at $+0.030$): the 5-step procedure hurts 0.8B and 4B, helps 2B, and helps frontier Haiku modestly. SFT works hardest in absolute terms where the base struggles with the procedure -- a regime-asymmetric pattern with a falsifiable prediction at 8B/14B. Methodology. (i) A bench format-compliance artifact: 83.5% of the holdout uses a deterministic ANSWER-line extractor that under-counts free-form-prose conclusions; our LLM-only re-judge reveals it was systematically biased against the curated condition. (ii) A negative-iteration sequence at 0.8B: three well-formed recipe variants cluster post-SFT curated pass-rate within a 2 pp band, constraining the absolute-pass-rate ceiling to base capacity rather than recipe. Cross-family judge validation. GPT-5.4 via OpenRouter on all 7 configurations (2800 paired episodes) agrees on the direction of every per-student finding: Cohen's $\kappa \geq 0.754$, agreement $\geq 93.25\%$, max headline $\Delta$ shift $\leq 0.035$ pp. Two earlier framings -- "format-only learning at 0.8B" and "SFT contribution shrinks at 4B" -- were path-mismatch artifacts; this paper supersedes both. Single-seed evaluation; threats itemised in the paper.
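The format-compliance artifact is easier to see with a toy example. The sketch below is a hypothetical ANSWER-line extractor, not the benchmark's actual one (the abstract does not give its pattern); `ANSWER_LINE`, `extract_answer`, and `deterministic_pass` are illustrative names. It shows how a response that reaches the right conclusion in free-form prose is scored as a failure.

```python
import re
from typing import Optional

# Hypothetical pattern: the benchmark's real extractor is not specified
# in the abstract, so this only illustrates the failure mode.
ANSWER_LINE = re.compile(r"^ANSWER:\s*(.+)$", re.MULTILINE)


def extract_answer(response: str) -> Optional[str]:
    """Return the text after the last 'ANSWER:' line, or None if absent."""
    matches = ANSWER_LINE.findall(response)
    return matches[-1].strip() if matches else None


def deterministic_pass(response: str, gold: str) -> bool:
    """Pass only when an ANSWER line exists and matches the gold answer."""
    answer = extract_answer(response)
    return answer is not None and answer.lower() == gold.lower()


formatted = "Step 5: verify the total.\nANSWER: 42"
prose = "Working through the five steps, the total comes to 42."

print(deterministic_pass(formatted, "42"))  # True
print(deterministic_pass(prose, "42"))      # False, though the conclusion is correct
```

If one condition elicits more prose-style conclusions than the other, an extractor of this kind under-counts that condition, which is the bias the LLM-only re-judge is described as removing.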
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper measures the contribution of procedural-skill supervised fine-tuning (SFT) across three Qwen3.5 model scales (0.8B, 2B, 4B) on a 200-task/40-skill holdout set, with Claude Haiku 4.5 as a frontier reference. Under matched-path LLM-only scoring it finds roughly uniform SFT-attributable lifts of +0.070, +0.040, and +0.075, respectively. It attributes the variation in post-SFT performance to a W-shaped pre-SFT trajectory, corrects for a format-compliance artifact in the evaluation, and validates the re-judged results through high agreement with a cross-family GPT-5.4 judge.
Significance. If substantiated, these findings highlight a regime-asymmetric mechanism where SFT provides greater absolute benefits for models struggling with procedural tasks, offering a falsifiable prediction for 8B and 14B scales. The work's strengths lie in its empirical approach with cross-judge validation (κ ≥ 0.754, agreement ≥ 93.25%) and correction of an evaluation bias that affected prior interpretations, contributing to more reliable assessment of SFT effects in small-to-medium models.
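For readers unfamiliar with the agreement statistics quoted above, the following is a minimal sketch of how raw agreement and Cohen's κ are computed for two judges issuing binary pass/fail labels. The function name and the ten example labels are made up for illustration; they are not data from the paper.

```python
from typing import List, Tuple


def cohen_kappa(judge_a: List[bool], judge_b: List[bool]) -> Tuple[float, float]:
    """Return (raw agreement, Cohen's kappa) for two binary label lists."""
    if len(judge_a) != len(judge_b) or not judge_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(judge_a)
    # Observed agreement: fraction of episodes where both judges agree.
    observed = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    # Chance agreement from each judge's marginal pass rate.
    p_a = sum(judge_a) / n
    p_b = sum(judge_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        return observed, 1.0  # degenerate case: both judges use a single label
    return observed, (observed - expected) / (1 - expected)


# Made-up labels for ten episodes (True = pass).
labels_a = [True, True, True, True, True, False, False, False, False, False]
labels_b = [True, True, True, True, False, False, False, False, False, True]

agreement, kappa = cohen_kappa(labels_a, labels_b)
print(f"agreement = {agreement:.2f}, kappa = {kappa:.2f}")
```

κ discounts the agreement expected by chance, so it is always at or below raw agreement when the judges agree more often than chance. The abstract's 2800 paired episodes over 7 configurations correspond to 400 episodes per configuration.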
major comments (2)
- [Abstract, main finding] The reported SFT-Δ lifts (+0.070 / +0.040 / +0.075) and the W-shaped pre-SFT trajectory (-0.075, +0.060, -0.010) are obtained from single-seed training and evaluation. Given that the 0.035 difference between the 2B and 4B lifts equals the maximum headline Δ shift from judge swapping, the claims of uniformity and regime-asymmetry require supporting multi-seed statistics or variance estimates to rule out run-to-run stochasticity.
- [Methodology] The construction of the 353-row demonstration corpus and the 200-task holdout is described at a high level; without public release of the data or detailed holdout sampling procedure, independent verification of the bias correction and generalizability of the procedural skills is limited.
minor comments (1)
- [Abstract] The notation for deltas (e.g., procedural-Δ) is clear in context but could be defined more explicitly upon first use for readers unfamiliar with the prior framings mentioned.
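The first major comment can be made concrete with a resampling sketch. The code below assumes Δ is the curated-condition pass rate minus the baseline pass rate over the same tasks, and estimates a paired-bootstrap percentile interval for Δ on 200 tasks. The pass/fail vectors are synthetic and `bootstrap_delta_ci` is an illustrative name. The interval reflects task-sampling noise only; it says nothing about training-seed variance, which would need repeated runs.

```python
import random
from typing import List, Tuple


def bootstrap_delta_ci(
    baseline: List[bool],
    curated: List[bool],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (Δ, lower, upper) using a paired bootstrap over tasks."""
    if len(baseline) != len(curated) or not baseline:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(baseline)
    # Per-task paired difference: +1, 0, or -1.
    diffs = [int(c) - int(b) for b, c in zip(baseline, curated)]
    point = sum(diffs) / n
    resampled = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        resampled.append(sum(sample) / n)
    resampled.sort()
    lower = resampled[int((alpha / 2) * n_resamples)]
    upper = resampled[int((1 - alpha / 2) * n_resamples) - 1]
    return point, lower, upper


# Synthetic outcomes for 200 tasks (not the paper's data):
# baseline passes tasks 0..99, curated passes tasks 10..123.
baseline = [i < 100 for i in range(200)]
curated = [10 <= i < 124 for i in range(200)]

point, lower, upper = bootstrap_delta_ci(baseline, curated)
print(f"Δ = {point:+.3f}, 95% interval [{lower:+.3f}, {upper:+.3f}]")
```

On synthetic data of this size the interval is several times wider than 0.035, which illustrates why the referee asks for variance estimates before reading a 0.035 gap between lifts as meaningful.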
Simulated Author's Rebuttal
We thank the referee for their thorough review and valuable feedback on our manuscript. We address each major comment in detail below, providing clarifications and indicating planned revisions to strengthen the paper.
- Referee: [Abstract, main finding] The reported SFT-Δ lifts (+0.070 / +0.040 / +0.075) and the W-shaped pre-SFT trajectory (-0.075, +0.060, -0.010) are obtained from single-seed training and evaluation. Given that the 0.035 difference between the 2B and 4B lifts equals the maximum headline Δ shift from judge swapping, the claims of uniformity and regime-asymmetry require supporting multi-seed statistics or variance estimates to rule out run-to-run stochasticity.
  Authors: We concur that single-seed training and evaluation limits the ability to quantify run-to-run variance, and the noted 0.035 difference highlights the need for caution. The manuscript explicitly states 'Single-seed evaluation; threats itemised in the paper' and discusses the judge-swapping sensitivity. The high inter-judge agreement (κ ≥ 0.754, ≥93.25% agreement) across all configurations supports the directional consistency of our findings. To address this, we will include additional multi-seed results for the 2B model on a reduced task set in the appendix of the revised manuscript, providing variance estimates to support the uniformity claim.
  revision: yes
- Referee: [Methodology] The construction of the 353-row demonstration corpus and the 200-task holdout is described at a high level; without public release of the data or detailed holdout sampling procedure, independent verification of the bias correction and generalizability of the procedural skills is limited.
  Authors: We appreciate the concern for reproducibility. In the revised manuscript, we will provide a more granular description of the holdout construction, including the sampling strategy to ensure coverage of the 40 skills across 200 tasks and the curation process for the 353 demonstration rows. While we cannot release the full dataset due to licensing restrictions on the source materials, we will include representative examples and a detailed pseudocode for the sampling procedure to facilitate independent verification.
  revision: yes
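One way a skill-covering holdout could be drawn is sketched below. This is an assumption-laden illustration, not the authors' procedure: it assumes every task carries a single skill label and that the holdout is balanced at 5 tasks per skill (200 / 40), neither of which the source confirms. `sample_holdout` and the synthetic task pool are hypothetical.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def sample_holdout(
    tasks: List[Tuple[str, str]], per_skill: int, seed: int = 0
) -> List[Tuple[str, str]]:
    """Draw `per_skill` (task_id, skill) pairs from every skill, without replacement."""
    rng = random.Random(seed)
    by_skill: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for task_id, skill in tasks:
        by_skill[skill].append((task_id, skill))
    holdout: List[Tuple[str, str]] = []
    for skill in sorted(by_skill):  # sorted for a reproducible draw order
        pool = by_skill[skill]
        if len(pool) < per_skill:
            raise ValueError(f"skill {skill!r} has only {len(pool)} tasks")
        holdout.extend(rng.sample(pool, per_skill))
    return holdout


# Synthetic pool: 40 skills with 12 candidate tasks each (illustrative only).
task_pool = [
    (f"skill{s:02d}-task{t:02d}", f"skill{s:02d}")
    for s in range(40)
    for t in range(12)
]

holdout = sample_holdout(task_pool, per_skill=5)
skill_counts = Counter(skill for _, skill in holdout)
print(len(holdout), len(skill_counts))
```

Fixing the seed and iterating skills in sorted order makes the draw reproducible, which is the property an independent verifier would need from the promised pseudocode.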
Circularity Check
No circularity: direct empirical measurements on holdout tasks with external judges
The paper reports observed performance deltas from SFT training runs on a fixed 353-row demonstration corpus, evaluated against a 200-task holdout by LLM judges, with GPT-5.4 serving as an independent cross-family check. No equations, fitted parameters, ansatzes, or uniqueness theorems appear in the derivation chain. The reported lifts (+0.070 / +0.040 / +0.075) and the W-shaped pre-SFT trajectory are direct outputs of the experimental protocol rather than reductions to self-referential inputs or self-citations. The evidence rests on external judges rather than on quantities derived within the paper, and the claims are falsifiable at larger scales.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: The 200-task / 40-skill holdout set and the 353 demonstration rows capture general procedural skills without selection bias from the specific corpus.
- domain assumption: The LLM judges (the primary judge and the cross-family GPT-5.4 judge) produce consistent, unbiased pass/fail labels across model sizes and conditions.
Reference graph
Works this paper leans on
- [1] D. Yu, S. Kaur, A. Gupta, J. Brown-Cohen, A. Goyal, S. Arora. Skill-Mix: A Flexible and Expandable Family of Evaluations for AI Models. NeurIPS 2024
- [2]
- [3] Li et al. SkillsBench: Benchmarking the Effectiveness of Skill Injection on LLMs. 2026
- [4] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen. LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022
- [5] T. Dettmers, A. Pagnoni, A. Holtzman, L. Zettlemoyer. QLoRA: Efficient Finetuning of Quantized LLMs. NeurIPS 2023
- [6] L. Tunstall, E. Beeching, et al. TRL: Transformer Reinforcement Learning library, version ≥ 0.18
- [7] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, I. Stoica. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS Datasets and Benchmarks 2023
- [8] S. Biderman, H. Schoelkopf, et al. Lessons from the Trenches on Reproducible Evaluation of Language Models. arXiv:2405.14782, 2024
- [9] A. Yang, A. Li, B. Yang, et al. Qwen3 Technical Report. arXiv:2505.09388, 2025
- [10] Qwen Team. Qwen3.5: Accelerating Productivity with Native Multimodal Agents. Release blog and model cards, 2026. https://huggingface.co/Qwen/Qwen3.5-4B-Base.
A Path-mismatch resolution for pre-SFT 0.8B
The original pre-SFT 0.8B baseline (baseline 0.510 / curated 0.565 / Δ +0.055) was run via Ollama, not the same HuggingFace transformers path used for 2B/4B. Q...