SkillFactory: Self-Distillation For Learning Cognitive Behaviors
Pith reviewed 2026-05-17 01:53 UTC · model grok-4.3
The pith
SkillFactory rearranges a model's own outputs into skill-formatted traces to prime it for using cognitive behaviors like backtracking during later RL.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SkillFactory generates silver SFT traces by rearranging the model's self-generated reasoning samples into formats that exhibit cognitive skills such as answer verification, backtracking, and retrying by alternate methods. These traces, though imperfect, supply an inductive bias that allows reinforcement learning to more effectively teach robust use of those skills, resulting in better generalization to harder task variants and reduced regression on out-of-domain tasks.
What carries the argument
The SkillFactory rearrangement process, which converts model-generated reasoning traces into supervised examples formatted to demonstrate specific cognitive skills before reinforcement learning.
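The rearrangement step can be pictured with a minimal sketch. It is an illustration under stated assumptions, not the authors' pipeline: the `Sample` container, the function name `build_silver_trace`, the ordering rule (wrong attempts first, one correct attempt last), and the templated reflection text are all hypothetical. Only the `<think>`, `<sample>`, `<reflect>`, and `<answer>` tags follow the format of the model outputs shown in the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    """One self-generated attempt (hypothetical container)."""
    reasoning: str
    answer: str


def build_silver_trace(samples: List[Sample], gold_answer: str,
                       max_wrong: int = 2) -> str:
    """Assumed recipe: wrong attempts first, each followed by a reflection
    that rejects it, then one correct attempt whose reflection accepts it."""
    wrong = [s for s in samples if s.answer != gold_answer][:max_wrong]
    right = [s for s in samples if s.answer == gold_answer][:1]
    if not right:
        raise ValueError("need at least one correct sample to end the trace")

    parts = ["<think>"]
    for s in wrong + right:
        verdict = "correct" if s.answer == gold_answer else "incorrect"
        parts.append(
            f"<sample>\n{s.reasoning}\n<answer>{s.answer}</answer>\n</sample>"
        )
        # Placeholder reflection; a real pipeline would likely sample this
        # text from the model rather than use a fixed template.
        parts.append(
            f"<reflect>\nLet us verify this answer: it is {verdict}.\n</reflect>"
        )
    parts.append("</think>")
    parts.append(f"<answer>{right[0].answer}</answer>")
    return "\n".join(parts)
```

Placing rejected attempts before the accepted one is what makes the trace look like backtracking and retrying, even though each attempt was originally sampled independently.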
If this is right
- Models that begin RL from a SkillFactory SFT initialization generalize to harder variants of reasoning tasks more effectively than those that do not.
- The RL-trained SkillFactory models actively use cognitive skills such as verification and backtracking during problem solving.
- SkillFactory models exhibit less performance regression on out-of-domain tasks after RL compared with standard base models.
- Pre-RL supervised training that structures data around cognitive skills produces more robust skill use than RL alone.
Where Pith is reading between the lines
- The rearrangement approach might be tested on domains outside the current reasoning tasks to check if the same priming effect appears.
- Pairing SkillFactory initialization with larger base models could increase the observed gains in generalization and robustness.
- This method offers a way to reduce dependence on external high-quality traces from stronger models when teaching reasoning behaviors.
Load-bearing premise
The rearranged self-generated silver traces provide a useful inductive bias that helps the model acquire cognitive skills during RL without introducing noise or biases that cannot be overcome.
What would settle it
Train an otherwise identical model with standard SFT on the same data but without the skill rearrangement step, apply the same RL, then measure whether it shows smaller gains on harder task variants and less robustness on out-of-domain tasks than the SkillFactory version.
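The comparison described above can be written as a small decision rule. This is a sketch under assumptions: the `ArmScores` fields, the reading of "gain" as post-RL minus pre-RL accuracy, and the reading of "robustness" as the drop in out-of-domain accuracy after RL are all hypothetical choices, not definitions taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class ArmScores:
    """Accuracies in [0, 1] for one training arm (hypothetical field names)."""
    hard_pre_rl: float   # harder task variants, before RL
    hard_post_rl: float  # harder task variants, after RL
    ood_pre_rl: float    # out-of-domain tasks, before RL
    ood_post_rl: float   # out-of-domain tasks, after RL


def hard_gain(arm: ArmScores) -> float:
    """Assumed definition: improvement on harder variants due to RL."""
    return arm.hard_post_rl - arm.hard_pre_rl


def ood_regression(arm: ArmScores) -> float:
    """Assumed definition: out-of-domain accuracy lost during RL (never negative)."""
    return max(0.0, arm.ood_pre_rl - arm.ood_post_rl)


def rearrangement_supported(skillfactory: ArmScores, control: ArmScores) -> bool:
    """True if the control arm (plain SFT, no rearrangement) gains less on
    harder variants and regresses more out of domain than SkillFactory."""
    return (hard_gain(control) < hard_gain(skillfactory)
            and ood_regression(control) > ood_regression(skillfactory))
```

With a single run per arm, a raw comparison like this says nothing about noise, so any real use would want a margin or an uncertainty estimate alongside it.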
Original abstract
Reasoning models leveraging long chains of thought employ various cognitive skills, such as verification of their answers, backtracking, retrying by an alternate method, and more. Previous work has shown that when a base language model exhibits these skills, training that model further with reinforcement learning (RL) can learn to leverage them. How can we get models to leverage skills that aren't exhibited by base models? Our work, SkillFactory, is a method for fine-tuning models to roughly learn these skills during a supervised fine-tuning (SFT) stage prior to RL. Our approach does not rely on distillation from a stronger model, but instead uses samples from the model itself, rearranged to provide training data in the format of those skills. These "silver" SFT traces may be imperfect, but are nevertheless effective for priming a model to acquire skills during RL. Our evaluation shows that (1) starting from SkillFactory SFT initialization helps a model to generalize to harder variants of a task post-RL, despite lower performance pre-RL; (2) cognitive skills are indeed used by the model; (3) RLed SkillFactory models are more robust to regression on out-of-domain tasks than RLed base models. Our work suggests that inductive biases learned prior to RL help models learn robust cognitive skill use.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SkillFactory, a self-distillation approach that generates reasoning traces from a base model and rearranges them into SFT training data formatted to exhibit cognitive behaviors such as verification, backtracking, and retrying. These 'silver' traces are used to initialize the model prior to RL, with the central claim being that this pre-RL inductive bias enables better generalization to harder task variants post-RL (despite lower pre-RL performance), demonstrable use of the cognitive skills, and greater robustness against regression on out-of-domain tasks compared to RL applied directly to base models.
Significance. If the reported results hold, the work provides a concrete empirical path for instilling cognitive skills in reasoning models without distillation from stronger teachers, relying instead on restructured self-generated data. This is a useful contribution to RL-based reasoning training, as it isolates the value of pre-RL initialization and supplies supporting evidence across generalization, skill-usage analysis, and robustness checks.
major comments (1)
- [Evaluation] Evaluation section (supporting the three claims): the manuscript reports that SkillFactory-initialized models generalize better post-RL and are more robust out-of-domain, yet the strength of these conclusions would be clearer with explicit reporting of the number of runs, variance, or statistical tests used to establish the differences versus base-model RL baselines.
minor comments (2)
- [Method] The description of the rearrangement procedure for creating silver SFT traces would benefit from a concise pseudocode or step-by-step example to make the inductive bias more transparent.
- [Abstract] Abstract: the high-level summary of results is clear, but a single sentence noting the primary tasks or model sizes used would help readers contextualize the claims without needing to reach the full evaluation section.
Simulated Author's Rebuttal
Thank you for the positive assessment of our work and the recommendation for minor revision. We appreciate the feedback on strengthening the evaluation section and address the major comment below.
Point-by-point responses
- Referee: [Evaluation] Evaluation section (supporting the three claims): the manuscript reports that SkillFactory-initialized models generalize better post-RL and are more robust out-of-domain, yet the strength of these conclusions would be clearer with explicit reporting of the number of runs, variance, or statistical tests used to establish the differences versus base-model RL baselines.
  Authors: We thank the referee for this constructive suggestion. We agree that explicitly reporting the number of runs would improve clarity. Due to the substantial computational cost of RL training on reasoning tasks, all experiments in the manuscript were conducted with a single run per configuration. In the revised manuscript, we will update the Evaluation section to state this explicitly and note that the improvements in post-RL generalization to harder task variants and out-of-domain robustness were observed consistently across multiple tasks, model scales, and evaluation settings. This provides supporting evidence for the reliability of the trends even without multi-run statistics. We will also add a brief discussion of this experimental constraint.
  Revision: yes
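When only one training run per configuration exists, one partial remedy is a paired bootstrap over evaluation items. The sketch below is an assumption-laden illustration, not something the manuscript or the rebuttal describes: the function name `paired_bootstrap_ci` and its parameters are hypothetical, and the inputs are per-item 0/1 correctness lists for two models scored on the same test items.

```python
import random
from typing import List, Tuple


def paired_bootstrap_ci(correct_a: List[int], correct_b: List[int],
                        n_resamples: int = 2000, alpha: float = 0.05,
                        seed: int = 0) -> Tuple[float, float, float]:
    """Return (observed accuracy difference A - B, CI low, CI high)."""
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("need two non-empty, equally long 0/1 lists")
    n = len(correct_a)
    rng = random.Random(seed)
    # Per-item differences keep the pairing between the two models.
    item_diffs = [a - b for a, b in zip(correct_a, correct_b)]
    observed = sum(item_diffs) / n

    resampled = []
    for _ in range(n_resamples):
        # Resample evaluation items with replacement.
        total = sum(item_diffs[rng.randrange(n)] for _ in range(n))
        resampled.append(total / n)
    resampled.sort()

    low = resampled[int((alpha / 2) * n_resamples)]
    high = resampled[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return observed, low, high
```

An interval of this kind reflects sampling noise in the evaluation set only. It does not capture variance across RL seeds, which is the quantity the referee's comment is ultimately about.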
Circularity Check
No significant circularity; empirical method with independent experimental grounding
The paper describes an empirical pipeline: generate silver traces from the base model, rearrange them into skill-formatted SFT data, perform SFT, then RL, and measure generalization, skill usage, and OOD robustness via direct experiments. No equations, fitted parameters, or first-principles derivations are presented that could reduce to their own inputs by construction. Claims rest on reported ablation results and comparisons rather than self-definitional loops or load-bearing self-citations. The silver-trace assumption is explicitly noted as imperfect yet tested through outcomes, satisfying the criteria for a self-contained empirical study.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Imperfect self-generated traces, rearranged into skill formats, can prime models to acquire and use cognitive behaviors during RL.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation · washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "Our approach does not rely on distillation from a stronger model, but instead uses samples from the model itself, rearranged to provide training data in the format of those skills. These 'silver' SFT traces may be imperfect, but are nevertheless effective for priming a model to acquire skills during RL."
- IndisputableMonolith/Foundation/RealityFromDistinction · reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "SkillFactory initialization enables generalization to harder task variants and novel domains post-RL"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.