
arxiv: 2512.04072 · v2 · submitted 2025-12-03 · 💻 cs.CL · cs.AI

SkillFactory: Self-Distillation For Learning Cognitive Behaviors

Pith reviewed 2026-05-17 01:53 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords self-distillation · cognitive skills · reinforcement learning · reasoning models · supervised fine-tuning · backtracking · language models

The pith

SkillFactory rearranges a model's own outputs into skill-formatted traces to prime it for using cognitive behaviors like backtracking during later RL.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SkillFactory as a self-distillation technique that takes samples generated by the base language model and rearranges them into training examples that demonstrate cognitive skills such as verification, backtracking, and retrying alternate methods. This creates imperfect silver SFT data that instills an inductive bias for these behaviors before reinforcement learning begins. The evaluation finds that models initialized with this SFT step generalize better to harder task variants after RL, actively deploy the skills, and show greater robustness against performance drops on out-of-domain tasks than models that skip the SkillFactory stage. The core idea is that pre-RL supervised training structured around cognitive skills makes subsequent RL more effective at teaching reliable reasoning behaviors.

Core claim

SkillFactory generates silver SFT traces by rearranging the model's self-generated reasoning samples into formats that exhibit cognitive skills such as answer verification, backtracking, and retrying by alternate methods. These traces, though imperfect, supply an inductive bias that allows reinforcement learning to more effectively teach robust use of those skills, resulting in better generalization to harder task variants and reduced regression on out-of-domain tasks.

What carries the argument

The SkillFactory rearrangement process, which converts model-generated reasoning traces into supervised examples formatted to demonstrate specific cognitive skills before reinforcement learning.
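
The rearrangement step can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Attempt` record, the `build_trace` function, the `<think>`/`<sample>`/`<reflect>`/`<answer>` tag layout, and the ordering rule (incorrect attempts first, ending on one correct attempt, keeping only attempts whose reflection verdict agrees with the gold label) are all hypothetical choices made for the example. The toy data reuses a Countdown-style problem with target 880.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Attempt:
    text: str        # reasoning sampled from the base model
    answer: str      # answer extracted from that reasoning
    reflection: str  # model-written critique of the attempt
    verdict: bool    # correctness predicted by the reflection


def build_trace(attempts: List[Attempt], gold: str) -> Optional[str]:
    """Rearrange sampled attempts into one skill-formatted trace.

    Assumed rule (hypothetical): keep only attempts whose reflection verdict
    agrees with the gold label, put wrong attempts first, end on a correct one.
    Returns None when no usable correct attempt exists.
    """
    usable = [a for a in attempts if a.verdict == (a.answer == gold)]
    wrong = [a for a in usable if a.answer != gold]
    right = [a for a in usable if a.answer == gold]
    if not right:
        return None
    ordered = wrong + right[:1]
    parts = []
    for a in ordered:
        parts.append(f"<sample>\n{a.text}\n</sample>")
        parts.append(f"<reflect>\n{a.reflection}\n</reflect>")
    body = "\n".join(parts)
    return f"<think>\n{body}\n</think>\n<answer>{ordered[-1].answer}</answer>"


# Toy usage: one failed attempt followed by a successful retry.
attempts = [
    Attempt("60 * 11 + 20 = 680", "680", "680 is not 880, so this is incorrect.", False),
    Attempt("(60 + 20) * 11 * 1 = 880", "880", "80 * 11 = 880, so this is correct.", True),
]
trace = build_trace(attempts, gold="880")
print(trace)
```

The point of the sketch is that every token in the resulting trace comes from the model's own samples and reflections; only the arrangement is imposed from outside, which is what makes the data "silver" rather than distilled from a stronger teacher.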

If this is right

  • Models that begin RL from a SkillFactory SFT initialization generalize to harder variants of reasoning tasks more effectively than those that do not.
  • The RL-trained SkillFactory models actively use cognitive skills such as verification and backtracking during problem solving.
  • SkillFactory models exhibit less performance regression on out-of-domain tasks after RL compared with standard base models.
  • Pre-RL supervised training that structures data around cognitive skills produces more robust skill use than RL alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The rearrangement approach might be tested on domains outside the current reasoning tasks to check if the same priming effect appears.
  • Pairing SkillFactory initialization with larger base models could increase the observed gains in generalization and robustness.
  • This method offers a way to reduce dependence on external high-quality traces from stronger models when teaching reasoning behaviors.

Load-bearing premise

The rearranged self-generated silver traces provide a useful inductive bias that helps the model acquire cognitive skills during RL without introducing noise or biases that cannot be overcome.

What would settle it

Train an otherwise identical model with standard SFT on the same data but without the skill rearrangement step, apply the same RL, then measure whether it shows smaller gains on harder task variants and less robustness on out-of-domain tasks than the SkillFactory version.

Figures

Figures reproduced from arXiv: 2512.04072 by Greg Durrett, Jack Lu, Manya Wadhwa, Mengye Ren, Sedrick Keh, Zayne Sprague.

Figure 2: Trace from Countdown exhibiting implicit reflection and retrying. (Adjacent paper text: "Second, there are advantages to explicitly marking skills' usage. Our format, shown in…")
Figure 3: SkillFactory training trace with self-reflection and retry. (Adjacent paper text: "Reflection Generation. Next, we prompt M to reflect on each solution attempt using a reflection prompt p_reflect. A reflection r critiques the reasoning in solution y and predicts its correctness, correct(y, a_i). We use verdict(r) to extract this prediction from the reflection text. Just like with the answer tags, SkillFactory also prompts the mode…")
Figure 4: Results showing performance of different models trained using SkillFactory. Left: Av…
Figure 5: Token length distribution for three tasks for responses given by (a) RL Baseline, (b) R1…
Figure 6: Distribution of token response of all responses given by two models: RL Baseline and…
Figure 7: Distribution of token response of all responses given by two models: R1 Distillation and…
Figure 8: Distribution of token response of all responses given by two models: BOLT and SkillFac…
Figure 9: Distribution of token response of all responses given by two models: STaR and SkillFac…
Original abstract

Reasoning models leveraging long chains of thought employ various cognitive skills, such as verification of their answers, backtracking, retrying by an alternate method, and more. Previous work has shown that when a base language model exhibits these skills, training that model further with reinforcement learning (RL) can learn to leverage them. How can we get models to leverage skills that aren't exhibited by base models? Our work, SkillFactory, is a method for fine-tuning models to roughly learn these skills during a supervised fine-tuning (SFT) stage prior to RL. Our approach does not rely on distillation from a stronger model, but instead uses samples from the model itself, rearranged to provide training data in the format of those skills. These "silver" SFT traces may be imperfect, but are nevertheless effective for priming a model to acquire skills during RL. Our evaluation shows that (1) starting from SkillFactory SFT initialization helps a model to generalize to harder variants of a task post-RL, despite lower performance pre-RL; (2) cognitive skills are indeed used by the model; (3) RLed SkillFactory models are more robust to regression on out-of-domain tasks than RLed base models. Our work suggests that inductive biases learned prior to RL help models learn robust cognitive skill use.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces SkillFactory, a self-distillation approach that generates reasoning traces from a base model and rearranges them into SFT training data formatted to exhibit cognitive behaviors such as verification, backtracking, and retrying. These 'silver' traces are used to initialize the model prior to RL, with the central claim being that this pre-RL inductive bias enables better generalization to harder task variants post-RL (despite lower pre-RL performance), demonstrable use of the cognitive skills, and greater robustness against regression on out-of-domain tasks compared to RL applied directly to base models.

Significance. If the reported results hold, the work provides a concrete empirical path for instilling cognitive skills in reasoning models without distillation from stronger teachers, relying instead on restructured self-generated data. This is a useful contribution to RL-based reasoning training, as it isolates the value of pre-RL initialization and supplies supporting evidence across generalization, skill-usage analysis, and robustness checks.

major comments (1)
  1. [Evaluation] Evaluation section (supporting the three claims): the manuscript reports that SkillFactory-initialized models generalize better post-RL and are more robust out-of-domain, yet the strength of these conclusions would be clearer with explicit reporting of the number of runs, variance, or statistical tests used to establish the differences versus base-model RL baselines.
minor comments (2)
  1. [Method] The description of the rearrangement procedure for creating silver SFT traces would benefit from a concise pseudocode or step-by-step example to make the inductive bias more transparent.
  2. [Abstract] Abstract: the high-level summary of results is clear, but a single sentence noting the primary tasks or model sizes used would help readers contextualize the claims without needing to reach the full evaluation section.

Simulated Author's Rebuttal

1 response · 0 unresolved

Thank you for the positive assessment of our work and the recommendation for minor revision. We appreciate the feedback on strengthening the evaluation section and address the major comment below.

Point-by-point responses
  1. Referee: [Evaluation] Evaluation section (supporting the three claims): the manuscript reports that SkillFactory-initialized models generalize better post-RL and are more robust out-of-domain, yet the strength of these conclusions would be clearer with explicit reporting of the number of runs, variance, or statistical tests used to establish the differences versus base-model RL baselines.

    Authors: We thank the referee for this constructive suggestion. We agree that explicitly reporting the number of runs would improve clarity. Due to the substantial computational cost of RL training on reasoning tasks, all experiments in the manuscript were conducted with a single run per configuration. In the revised manuscript, we will update the Evaluation section to state this explicitly and note that the improvements in post-RL generalization to harder task variants and out-of-domain robustness were observed consistently across multiple tasks, model scales, and evaluation settings. This provides supporting evidence for the reliability of the trends even without multi-run statistics. We will also add a brief discussion of this experimental constraint.

    Revision: yes
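
With a single training run per configuration, run-to-run variance cannot be estimated, but the uncertainty that comes from a finite evaluation set still can. The sketch below shows one way to do that, assuming per-example 0/1 correctness vectors for two models scored on the same test items. The function name `paired_bootstrap_ci`, the 95% level, the resample count, and the toy vectors are illustrative choices, not something the paper or the rebuttal specifies.

```python
import random
from typing import List, Tuple


def paired_bootstrap_ci(a: List[int], b: List[int], n_resamples: int = 2000,
                        alpha: float = 0.05, seed: int = 0) -> Tuple[float, float, float]:
    """Return (observed difference, lower, upper) for accuracy(a) - accuracy(b).

    a[i] and b[i] are 0/1 correctness flags for the same test item i.
    Items are resampled with replacement; pairing is preserved.
    """
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty, equally long correctness vectors")
    rng = random.Random(seed)
    n = len(a)
    diffs = [x - y for x, y in zip(a, b)]
    observed = sum(diffs) / n
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return observed, lo, hi


# Toy usage with made-up correctness flags for two models on eight items.
model_a = [1, 0, 1, 1, 0, 1, 1, 1]
model_b = [0, 0, 1, 0, 0, 1, 0, 1]
diff, lower, upper = paired_bootstrap_ci(model_a, model_b)
print(f"accuracy difference {diff:.3f}, interval [{lower:.3f}, {upper:.3f}]")
```

An interval of this kind reflects test-set sampling noise only. It says nothing about variation across RL seeds, so it complements rather than replaces the multi-run statistics the referee asked about.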

Circularity Check

0 steps flagged

No significant circularity; empirical method with independent experimental grounding

full rationale

The paper describes an empirical pipeline: generate silver traces from the base model, rearrange them into skill-formatted SFT data, perform SFT, then RL, and measure generalization, skill usage, and OOD robustness via direct experiments. No equations, fitted parameters, or first-principles derivations are presented that could reduce to their own inputs by construction. Claims rest on reported ablation results and comparisons rather than self-definitional loops or load-bearing self-citations. The silver-trace assumption is explicitly noted as imperfect yet tested through outcomes, satisfying the criteria for a self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that imperfect self-generated traces can still provide effective priming for cognitive skills. No explicit free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption: Imperfect self-generated traces rearranged into skill formats can prime models to acquire and use cognitive behaviors during RL.
    Explicitly stated in the abstract as the basis for why silver SFT data works despite imperfections.

pith-pipeline@v0.9.0 · 5541 in / 1276 out tokens · 30501 ms · 2026-05-17T01:53:45.854758+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
