
arxiv: 2601.05019 · v2 · submitted 2026-01-08 · 💻 cs.CL · cs.AI · q-bio.NC

Hán Dān Xué Bù (Mimicry) or Qīng Chū Yú Lán (Mastery)? A Cognitive Perspective on Reasoning Distillation in Large Language Models

Pith reviewed 2026-05-16 16:17 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · q-bio.NC
keywords reasoning distillation · functional alignment collapse · negative transfer · large language models · human difficulty scaling · supervised fine-tuning · cognitive alignment

The pith

Distillation of reasoning traces via supervised fine-tuning causes student models to lose alignment with human difficulty scaling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether the cognitive structure that makes reinforcement-trained reasoning models match human task difficulty can be passed to smaller models through imitation. Across 14 models it shows teachers retain strong correlation with human judgments while distilled students drop sharply and often perform worse than they did before training. This pattern points to superficial copying of reasoning language without the internal policy for matching effort to demand. The authors conclude that human-like cognition arises from active reinforcement rather than passive trace imitation.

Core claim

While teacher models trained via reinforcement learning mirror human difficulty scaling at an average correlation of 0.64, distilled students degrade to 0.34 and frequently exhibit negative transfer by underperforming their own pre-distillation baselines. The analysis attributes this to a cargo-cult effect in which supervised fine-tuning reproduces the linguistic form and verbosity of reasoning without transmitting the teacher's dynamic resource-allocation policy.

What carries the argument

Functional Alignment Collapse: the measured drop, after distillation, in the correlation between a model's reasoning cost (token counts) and human response times on the same tasks, which severs the link between computational cost and cognitive demand.
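
As a rough illustration of the metric, the sketch below computes a per-model Pearson correlation between reasoning cost and human response times (the pairing named in the Figure 3 caption) and averages it across models to get r̄. The function names and every number are hypothetical, chosen only to show the calculation; they are not the paper's code or data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)

def mean_alignment(human_rt, tokens_by_model):
    """Average the per-model correlations with human response times (r-bar)."""
    rs = [pearson_r(human_rt, toks) for toks in tokens_by_model.values()]
    return sum(rs) / len(rs)

# Made-up per-task values for illustration only (not from the paper).
human_rt = [1.0, 2.0, 3.0, 4.0]          # mean human response time per task
tokens = {
    "teacher": [100, 200, 300, 400],     # cost rises with human difficulty
    "student": [300, 300, 310, 290],     # verbose, but flat across tasks
}

r_teacher = pearson_r(human_rt, tokens["teacher"])
r_student = pearson_r(human_rt, tokens["student"])
r_bar = mean_alignment(human_rt, tokens)
```

In this toy case the student spends more tokens than the teacher on every easy task yet tracks difficulty far less closely, which is the pattern the collapse label describes.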

If this is right

  • Distilled models decouple computational effort from actual cognitive demand.
  • Human-like alignment with task difficulty requires active reinforcement rather than passive imitation of traces.
  • Negative transfer occurs when students perform worse after distillation than before.
  • The linguistic form of reasoning can be replicated without the underlying allocation policy.
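
The negative-transfer bullet above can be sketched as a simple comparison of each student's alignment before and after distillation. The helper name, the model names, and the correlation values are all hypothetical placeholders, not results from the paper.

```python
def find_negative_transfer(models):
    """Return the names of students whose alignment fell below their own base.

    `models` maps a model name to a (r_base, r_distilled) pair, where each
    value is that model's correlation with human difficulty.
    """
    return sorted(
        name
        for name, (r_base, r_distilled) in models.items()
        if r_distilled < r_base
    )

# Hypothetical correlations for illustration only.
models = {
    "student_a": (0.50, 0.30),  # alignment dropped after distillation
    "student_b": (0.20, 0.35),  # alignment improved after distillation
}

flagged = find_negative_transfer(models)
```

Comparing each student against its own pre-distillation baseline, rather than against the teacher, is what separates negative transfer from a mere failure to reach teacher-level alignment.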

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hybrid training that adds reinforcement learning after initial SFT might recover the lost alignment.
  • The same collapse pattern could appear when distilling other emergent behaviors that rely on internal policy rather than surface statistics.
  • Benchmark design should separate models that imitate output patterns from those that dynamically allocate resources.

Load-bearing premise

That the correlation between a model's reasoning cost and human difficulty measures directly reflects transmission of an internal dynamic resource allocation policy rather than surface-level output patterns.

What would settle it

A new set of reasoning tasks in which a distilled student model recovers a correlation of approximately 0.64 with human difficulty ratings and shows no negative transfer relative to its pre-distillation baseline would refute the functional alignment collapse.

Figures

Figures reproduced from arXiv: 2601.05019 by Hanqi Wang, Shuting Peng, Tianhong Wang, Xinyang Peng, Yueqing Hu.

Figure 2: Sample chain-of-thought produced by R1.
Figure 3: Combined Analysis of Cognitive Alignment and Reasoning Cost. (Top) Correlation (r) between reasoning cost and human RTs. (Bottom) Comparison of human RTs (solid, left axis) and model token counts (dashed, right axis) across tasks.
Figure 5: The Cognitive Waterfall. Models define a topology of failed imitation defined by Teacher Similarity (x-axis) and Human Alignment (y-axis). Instead of converging to the Teacher (top-right), distilled models collapse into a "Spurious Valley" characterized by simultaneous loss of alignment (The Drop) and structural similarity (The Drift).
Figure 6: Surface-Level Mimicry Landscape. Models migrate to the high-verbosity "Mimicry Region" (top-left). Bubble sizes indicate similarity to Teacher (Top) and Human Alignment (Bottom). The shrinking bubbles in the bottom panel reveal that superficial mimicry (low D_KL) decouples computational effort from cognitive validity.
Figure 7: Inverse Efficiency Analysis. (A) Distilled models (red triangles) exhibit consistently worse efficiency (tokens/correct) than their base counterparts (grey circles). (B) Regression confirms a Linear Inflation Law: distillation imposes a systematic multiplicative penalty (N ≈ 2.44) on the base model's inefficiency, independent of intrinsic reasoning heuristics.
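
The "Linear Inflation Law" in the Figure 7 caption describes distilled inefficiency as roughly N times base inefficiency. A minimal sketch of estimating such a multiplier follows, assuming a least-squares line through the origin; whether the paper's regression includes an intercept is not stated here, so that choice is an assumption. The function names are hypothetical, and the inputs are toy values constructed to reproduce the reported N ≈ 2.44.

```python
def tokens_per_correct(total_tokens, num_correct):
    """Inverse efficiency: tokens spent per correctly solved item."""
    if num_correct <= 0:
        raise ValueError("num_correct must be positive")
    return total_tokens / num_correct

def fit_multiplier(base_cost, distilled_cost):
    """Least-squares slope N for distilled ~= N * base (no intercept)."""
    if len(base_cost) != len(distilled_cost) or not base_cost:
        raise ValueError("need two non-empty sequences of equal length")
    sxx = sum(b * b for b in base_cost)
    sxy = sum(b * d for b, d in zip(base_cost, distilled_cost))
    if sxx == 0:
        raise ValueError("base costs are all zero")
    return sxy / sxx

# Toy inputs built to land on 2.44; not measurements from the paper.
base = [tokens_per_correct(t, c) for t, c in [(1000, 10), (4000, 20), (8000, 20)]]
distilled = [244.0, 488.0, 976.0]

n_hat = fit_multiplier(base, distilled)
```

A slope well above 1 with small residuals would indicate a uniform inflation of cost, whereas large residuals would suggest the penalty varies by model.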
Original abstract

Recent Large Reasoning Models trained via reinforcement learning exhibit a "natural" alignment with human cognitive costs. However, we show that the prevailing paradigm of reasoning distillation, training student models to mimic these traces via Supervised Fine-Tuning (SFT), fails to transmit this cognitive structure. Testing the "Hán Dān Xué Bù" (Superficial Mimicry) hypothesis across 14 models, we find that distillation induces a "Functional Alignment Collapse": while teacher models mirror human difficulty scaling (r̄=0.64), distilled students significantly degrade this alignment (r̄=0.34), often underperforming their own pre-distillation baselines ("Negative Transfer"). Our analysis suggests that SFT induces a "Cargo Cult" effect, where students ritualistically replicate the linguistic form of reasoning (verbosity) without internalizing the teacher's dynamic resource allocation policy. Consequently, reasoning distillation decouples computational cost from cognitive demand, revealing that human-like cognition is an emergent property of active reinforcement, not passive imitation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that reasoning distillation via supervised fine-tuning (SFT) on traces from large reasoning models induces a 'Functional Alignment Collapse': teacher models exhibit alignment with human difficulty scaling (average Pearson r̄=0.64), but distilled students degrade this alignment (r̄=0.34) and frequently show negative transfer relative to their pre-distillation baselines. The authors interpret this as SFT transmitting only surface linguistic form (e.g., verbosity) without the teacher's internal 'dynamic resource allocation policy,' labeling the outcome a 'Cargo Cult' effect and concluding that human-like cognition is an emergent property of reinforcement learning rather than passive imitation. The claim is tested across 14 models.

Significance. If the reported correlation degradation and negative transfer are robustly demonstrated, the result would be significant for the field: it would provide empirical grounds for preferring reinforcement learning over SFT when the goal is to preserve cognitively aligned reasoning behavior, and it would highlight a concrete limitation of current distillation pipelines. The work also supplies a falsifiable metric (change in correlation with human difficulty) that could be adopted by others studying reasoning transfer.

major comments (2)
  1. [Abstract and empirical evaluation section] The headline result (teachers r̄=0.64 → students r̄=0.34 plus negative transfer) is presented without any description of how human difficulty scores were obtained, which models were included among the 14, what statistical controls or exclusion criteria were applied, or whether the correlations are accompanied by p-values or confidence intervals. These omissions are load-bearing because the entire 'Functional Alignment Collapse' claim rests on the reliability of those specific correlation values.
  2. [Discussion section] The interpretation that the observed drop reflects failure to transmit an internal 'dynamic resource allocation policy' (rather than surface-level trace mimicry) is not supported by any mechanistic evidence such as layer-wise activation differences, attention entropy conditioned on difficulty, or an ablation that holds trace format fixed while varying content. Without such probes the result remains equally consistent with students overfitting to the surface statistics of the distilled traces.
minor comments (2)
  1. The average-correlation notation r̄ is used repeatedly but never explicitly defined on first use; a brief parenthetical definition would improve clarity.
  2. The Chinese terms in the title are given with tone marks but receive no gloss or consistent romanization in the abstract; adding a short parenthetical translation on first appearance would aid accessibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important areas for improving the clarity and rigor of our work. We address each major comment below and indicate revisions made to the manuscript.

point-by-point responses
  1. Referee: [Abstract and empirical evaluation section] The headline result (teachers r̄=0.64 → students r̄=0.34 plus negative transfer) is presented without any description of how human difficulty scores were obtained, which models were included among the 14, what statistical controls or exclusion criteria were applied, or whether the correlations are accompanied by p-values or confidence intervals. These omissions are load-bearing because the entire 'Functional Alignment Collapse' claim rests on the reliability of those specific correlation values.

    Authors: We agree that the original manuscript omitted critical methodological details necessary to evaluate the headline results. In the revised manuscript, we have added a new subsection in the empirical evaluation section that specifies: (1) the provenance of human difficulty scores (derived from a public cognitive benchmark of timed reasoning tasks with established difficulty norms); (2) the complete list of all 14 models with their parameter counts and training details; (3) the statistical controls and exclusion criteria (e.g., minimum number of data points per model for reliable correlation estimation and outlier removal based on Cook's distance); and (4) p-values together with 95% bootstrap confidence intervals for every reported Pearson correlation. These additions directly address the load-bearing nature of the correlation values. revision: yes

  2. Referee: [Discussion section] The interpretation that the observed drop reflects failure to transmit an internal 'dynamic resource allocation policy' (rather than surface-level trace mimicry) is not supported by any mechanistic evidence such as layer-wise activation differences, attention entropy conditioned on difficulty, or an ablation that holds trace format fixed while varying content. Without such probes the result remains equally consistent with students overfitting to the surface statistics of the distilled traces.

    Authors: We acknowledge that the manuscript provides no direct mechanistic evidence (e.g., activation or attention analyses) to distinguish failure to transmit an internal policy from surface-level overfitting. The core empirical observations (systematic degradation of human-difficulty correlation and frequent negative transfer) remain robust and difficult to explain solely by surface mimicry, yet we agree that alternative accounts cannot be excluded without further probes. In the revised discussion we have: (a) explicitly noted this limitation, (b) reframed the 'Cargo Cult' account as a hypothesis rather than a proven mechanism, and (c) added a paragraph outlining the suggested mechanistic experiments as valuable future work. No new experiments were performed, as they lie beyond the scope of the present study. revision: partial
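
The first response promises 95% bootstrap confidence intervals for every reported Pearson correlation. One common way to obtain such an interval is the percentile bootstrap over tasks, sketched below under the assumption that tasks are the resampling unit; the rebuttal does not specify the procedure, so this is illustrative only. All names and data are hypothetical.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation; assumes both inputs are non-constant."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def bootstrap_ci(xs, ys, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for Pearson r, resampling paired items."""
    if len(xs) != len(ys) or len(set(xs)) < 2 or len(set(ys)) < 2:
        raise ValueError("need paired, non-constant sequences")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(xs)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        if len(set(bx)) < 2 or len(set(by)) < 2:
            continue  # r is undefined for a constant resample; draw again
        stats.append(pearson_r(bx, by))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Made-up per-task values for illustration only.
human_rt = [1, 2, 3, 4, 5, 6, 7, 8]
tokens = [200, 100, 400, 300, 600, 500, 800, 700]

lo, hi = bootstrap_ci(human_rt, tokens, n_boot=500, seed=1)
```

With only a handful of tasks per model the interval can be wide, which is one reason a minimum number of data points per model matters for the comparison between 0.64 and 0.34.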

Circularity Check

0 steps flagged

No significant circularity; results are direct empirical measurements

full rationale

The paper reports observed Pearson correlations (r̄=0.64 for teachers vs. r̄=0.34 for distilled students) and negative transfer on accuracy vs. human difficulty scores across 14 models. These are computed statistical quantities from experimental runs, not quantities obtained by fitting parameters inside the paper's own equations and then relabeling the fit as a prediction. No self-citations are invoked as load-bearing uniqueness theorems, no ansatz is smuggled, and no renaming of known results occurs. The 'Functional Alignment Collapse' interpretation follows from the data patterns rather than reducing to them by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim depends on the unstated premise that human difficulty scaling provides a valid external benchmark for cognitive structure and on the introduction of descriptive terms without independent falsifiable evidence.

axioms (1)
  • domain assumption: Human difficulty scaling can be reliably measured via performance degradation and serves as a proxy for cognitive resource allocation.
    Invoked to define 'alignment' and 'functional alignment collapse'.
invented entities (2)
  • Functional Alignment Collapse (no independent evidence)
    purpose: Label for the observed degradation in cognitive alignment after distillation.
    New descriptive term introduced to characterize the result.
  • Cargo Cult effect (no independent evidence)
    purpose: Explanation for students replicating linguistic form without internalizing resource policy.
    Metaphorical framing without additional empirical test.

pith-pipeline@v0.9.0 · 5527 in / 1229 out tokens · 26264 ms · 2026-05-16T16:17:35.684404+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Effort as Ceiling, Not Dial: Reasoning Budget Does Not Modulate Cognitive Cost Alignment Between Humans and Large Reasoning Models

    cs.CL · 2026-05 · unverdicted · novelty 4.0

    Reasoning budget in LRMs functions as a generation ceiling rather than a real-time dial, leaving cognitive cost alignment with humans invariant across effort levels and supporting a training-time compiled account.
