Hán Dān Xué Bù (Mimicry) or Qīng Chū Yú Lán (Mastery)? A Cognitive Perspective on Reasoning Distillation in Large Language Models
Pith reviewed 2026-05-16 16:17 UTC · model grok-4.3
The pith
Distillation of reasoning traces via supervised fine-tuning causes student models to lose alignment with human difficulty scaling.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
While teacher models trained via reinforcement learning mirror human difficulty scaling at an average correlation of 0.64, distilled students degrade to 0.34 and frequently exhibit negative transfer by underperforming their own pre-distillation baselines. The analysis attributes this to a cargo-cult effect in which supervised fine-tuning reproduces the linguistic form and verbosity of reasoning without transmitting the teacher's dynamic resource-allocation policy.
What carries the argument
Functional Alignment Collapse: the measured drop in correlation between model accuracy and human-rated task difficulty after distillation, which severs the link between computational cost and cognitive demand.
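As a rough illustration of the metric described above, the following minimal sketch computes a per-model Pearson correlation against human difficulty ratings and then averages across models. The per-task model measure (for example, reasoning-token counts), the function names, and the data layout are assumptions made for illustration; the paper's actual pipeline is not described on this page.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


def mean_alignment(per_model_measure, human_difficulty):
    """Average the per-model correlations with human difficulty.

    per_model_measure: dict mapping a (hypothetical) model name to its
    per-task measure, in the same task order as human_difficulty.
    """
    rs = [pearson_r(values, human_difficulty)
          for values in per_model_measure.values()]
    return sum(rs) / len(rs)
```

Under these assumptions, a "collapse" would show up as a lower `mean_alignment` for the distilled students than for their teachers on the same task set.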
If this is right
- Distilled models decouple computational effort from actual cognitive demand.
- Human-like alignment with task difficulty requires active reinforcement rather than passive imitation of traces.
- Negative transfer occurs when students perform worse after distillation than before.
- The linguistic form of reasoning can be replicated without the underlying allocation policy.
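The negative-transfer implication above amounts to a before/after comparison per student. The sketch below flags students whose score after distillation falls below their own pre-distillation baseline. The score could be an alignment correlation or an accuracy; the model names, the `tol` parameter, and the data layout are hypothetical.

```python
def negative_transfer_models(baseline, distilled, tol=0.0):
    """Return the models whose distilled score is below their baseline.

    baseline, distilled: dicts mapping a (hypothetical) model name to one
    score, such as correlation with human difficulty or task accuracy.
    tol: margin by which the distilled score must fall short to be flagged.
    """
    flagged = []
    for name, before in baseline.items():
        after = distilled[name]
        if after < before - tol:
            flagged.append(name)
    return sorted(flagged)
```

A nonzero `tol` is one simple way to avoid flagging differences that fall within measurement noise; a significance test would be the more careful choice.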
Where Pith is reading between the lines
- Hybrid training that adds reinforcement learning after initial SFT might recover the lost alignment.
- The same collapse pattern could appear when distilling other emergent behaviors that rely on internal policy rather than surface statistics.
- Benchmark design should separate models that imitate output patterns from those that dynamically allocate resources.
Load-bearing premise
That the correlation between model performance and human difficulty ratings directly measures transmission of an internal dynamic resource allocation policy rather than surface-level output patterns.
What would settle it
A new set of reasoning tasks on which a distilled student model recovers a correlation of approximately 0.64 with human difficulty ratings, and shows no negative transfer relative to its pre-distillation baseline, would refute the Functional Alignment Collapse claim.
Original abstract
Recent Large Reasoning Models trained via reinforcement learning exhibit a "natural" alignment with human cognitive costs. However, we show that the prevailing paradigm of reasoning distillation, training student models to mimic these traces via Supervised Fine-Tuning (SFT), fails to transmit this cognitive structure. Testing the "Hán Dān Xué Bù" (Superficial Mimicry) hypothesis across 14 models, we find that distillation induces a "Functional Alignment Collapse": while teacher models mirror human difficulty scaling (r̄=0.64), distilled students significantly degrade this alignment (r̄=0.34), often underperforming their own pre-distillation baselines ("Negative Transfer"). Our analysis suggests that SFT induces a "Cargo Cult" effect, where students ritualistically replicate the linguistic form of reasoning (verbosity) without internalizing the teacher's dynamic resource allocation policy. Consequently, reasoning distillation decouples computational cost from cognitive demand, revealing that human-like cognition is an emergent property of active reinforcement, not passive imitation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that reasoning distillation via supervised fine-tuning (SFT) on traces from large reasoning models induces a 'Functional Alignment Collapse': teacher models exhibit alignment with human difficulty scaling (average Pearson r̄=0.64), but distilled students degrade this alignment (r̄=0.34) and frequently show negative transfer relative to their pre-distillation baselines. The authors interpret this as SFT transmitting only surface linguistic form (e.g., verbosity) without the teacher's internal 'dynamic resource allocation policy,' labeling the outcome a 'Cargo Cult' effect and concluding that human-like cognition is an emergent property of reinforcement learning rather than passive imitation. The claim is tested across 14 models.
Significance. If the reported correlation degradation and negative transfer are robustly demonstrated, the result would be significant for the field: it would provide empirical grounds for preferring reinforcement learning over SFT when the goal is to preserve cognitively aligned reasoning behavior, and it would highlight a concrete limitation of current distillation pipelines. The work also supplies a falsifiable metric (change in correlation with human difficulty) that could be adopted by others studying reasoning transfer.
major comments (2)
- [Abstract and empirical evaluation section] The headline result (teachers r̄=0.64 → students r̄=0.34 plus negative transfer) is presented without any description of how human difficulty scores were obtained, which models were included among the 14, what statistical controls or exclusion criteria were applied, or whether the correlations are accompanied by p-values or confidence intervals. These omissions are load-bearing because the entire 'Functional Alignment Collapse' claim rests on the reliability of those specific correlation values.
- [Discussion section] The interpretation that the observed drop reflects failure to transmit an internal 'dynamic resource allocation policy' (rather than surface-level trace mimicry) is not supported by any mechanistic evidence such as layer-wise activation differences, attention entropy conditioned on difficulty, or an ablation that holds trace format fixed while varying content. Without such probes the result remains equally consistent with students overfitting to the surface statistics of the distilled traces.
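One of the probes named in the second major comment, attention entropy conditioned on difficulty, could be sketched roughly as follows. This is an assumption-laden illustration, not a procedure from the paper: it takes attention weights as plain lists already extracted from a model, groups them by a hypothetical difficulty label, and reports the mean Shannon entropy per group.

```python
from math import log


def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution.

    weights: non-negative attention weights for a single query position;
    they are renormalised here in case they do not sum exactly to 1.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must have a positive sum")
    probs = [w / total for w in weights]
    return -sum(p * log(p) for p in probs if p > 0)


def mean_entropy_by_difficulty(rows):
    """Average attention entropy within each difficulty group.

    rows: iterable of (difficulty_label, weights) pairs, where the label
    is a hypothetical bucket such as "easy" or "hard".
    """
    buckets = {}
    for label, weights in rows:
        buckets.setdefault(label, []).append(attention_entropy(weights))
    return {label: sum(vals) / len(vals) for label, vals in buckets.items()}
```

If teachers showed entropy that varies systematically with difficulty while students did not, that would be one piece of evidence bearing on the policy-transmission account; the sketch itself says nothing about what the result would be.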
minor comments (2)
- The average-correlation notation r̄ is used repeatedly but never explicitly defined on first use; a brief parenthetical definition would improve clarity.
- The Chinese terms in the title are given with tone marks but receive no gloss or consistent romanization in the abstract; adding a short parenthetical translation on first appearance would aid accessibility.
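For the first minor comment, a plausible explicit definition of r̄ is the arithmetic mean of per-model Pearson correlations. This is an assumed reading; the paper may aggregate differently (for example, by averaging Fisher z-transformed values). Here M is the number of models in a group, x_{m,i} is model m's measure on task i, and y_i is the human difficulty rating of task i, all of which are notation introduced for this sketch.

```latex
\[
  \bar{r} \;=\; \frac{1}{M} \sum_{m=1}^{M} r_m ,
  \qquad
  r_m \;=\;
  \frac{\sum_{i} (x_{m,i} - \bar{x}_m)(y_i - \bar{y})}
       {\sqrt{\sum_{i} (x_{m,i} - \bar{x}_m)^2}\;
        \sqrt{\sum_{i} (y_i - \bar{y})^2}}
\]
```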
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which highlight important areas for improving the clarity and rigor of our work. We address each major comment below and indicate revisions made to the manuscript.
-
Referee: [Abstract and empirical evaluation section] The headline result (teachers r̄=0.64 → students r̄=0.34 plus negative transfer) is presented without any description of how human difficulty scores were obtained, which models were included among the 14, what statistical controls or exclusion criteria were applied, or whether the correlations are accompanied by p-values or confidence intervals. These omissions are load-bearing because the entire 'Functional Alignment Collapse' claim rests on the reliability of those specific correlation values.
Authors: We agree that the original manuscript omitted critical methodological details necessary to evaluate the headline results. In the revised manuscript, we have added a new subsection in the empirical evaluation section that specifies: (1) the provenance of human difficulty scores (derived from a public cognitive benchmark of timed reasoning tasks with established difficulty norms); (2) the complete list of all 14 models with their parameter counts and training details; (3) the statistical controls and exclusion criteria (e.g., minimum number of data points per model for reliable correlation estimation and outlier removal based on Cook's distance); and (4) p-values together with 95% bootstrap confidence intervals for every reported Pearson correlation. These additions directly address the load-bearing nature of the correlation values.
Revision: yes
-
Referee: [Discussion section] The interpretation that the observed drop reflects failure to transmit an internal 'dynamic resource allocation policy' (rather than surface-level trace mimicry) is not supported by any mechanistic evidence such as layer-wise activation differences, attention entropy conditioned on difficulty, or an ablation that holds trace format fixed while varying content. Without such probes the result remains equally consistent with students overfitting to the surface statistics of the distilled traces.
Authors: We acknowledge that the manuscript provides no direct mechanistic evidence (e.g., activation or attention analyses) to distinguish a failure to transmit an internal policy from surface-level overfitting. The core empirical observations (systematic degradation of human-difficulty correlation and frequent negative transfer) remain robust and difficult to explain solely by surface mimicry, yet we agree that alternative accounts cannot be excluded without further probes. In the revised discussion we have: (a) explicitly noted this limitation, (b) reframed the 'Cargo Cult' account as a hypothesis rather than a proven mechanism, and (c) added a paragraph outlining the suggested mechanistic experiments as valuable future work. No new experiments were performed, as they lie beyond the scope of the present study.
Revision: partial
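The 95% bootstrap confidence intervals promised in the first response could be computed along the following lines. This is a minimal percentile-bootstrap sketch under assumed inputs (paired per-task values for one model); the function names, the resample count, and the seed handling are illustrative choices, not the authors' procedure.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


def bootstrap_ci(xs, ys, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for Pearson r over paired data."""
    pearson_r(xs, ys)  # fail early if the full sample is degenerate
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(xs)
    stats = []
    while len(stats) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]  # resample task pairs
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        if len(set(bx)) < 2 or len(set(by)) < 2:
            continue  # constant resample has no defined correlation; redraw
        stats.append(pearson_r(bx, by))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

Resampling whole (x, y) pairs keeps each task's model measure matched to its human rating. With few tasks per model the percentile interval can be unstable, so a bias-corrected variant might be preferable in practice.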
Circularity Check
No significant circularity; results are direct empirical measurements
The paper reports observed Pearson correlations (r̄=0.64 for teachers vs. r̄=0.34 for distilled students) and negative transfer on accuracy vs. human difficulty scores across 14 models. These are computed statistical quantities from experimental runs, not quantities obtained by fitting parameters inside the paper's own equations and then relabeling the fit as a prediction. No self-citations are invoked as load-bearing uniqueness theorems, no ansatz is smuggled, and no renaming of known results occurs. The 'Functional Alignment Collapse' interpretation follows from the data patterns rather than reducing to them by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Human difficulty scaling can be reliably measured via performance degradation and serves as a proxy for cognitive resource allocation.
invented entities (2)
- Functional Alignment Collapse: no independent evidence
- Cargo Cult effect: no independent evidence
Forward citations
Cited by 1 Pith paper
-
Effort as Ceiling, Not Dial: Reasoning Budget Does Not Modulate Cognitive Cost Alignment Between Humans and Large Reasoning Models
Reasoning budget in LRMs functions as a generation ceiling rather than a real-time dial, leaving cognitive cost alignment with humans invariant across effort levels and supporting a training-time compiled account.