Interpretable Traces, Unexpected Outcomes: Investigating the Disconnect in Trace-Based Knowledge Distillation
Pith reviewed 2026-05-22 14:16 UTC · model grok-4.3
The pith
Correct reasoning traces lead to accurate final answers in only 28 percent of cases.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When LLMs are fine-tuned on QA problems that always supply the correct final answer but pair it with either correct or incorrect traces, correct traces produce correct solutions in only 28 percent of test cases and incorrect traces do not consistently lower accuracy. Fine-tuning on verbose R1 traces yields the strongest model performance, yet human participants rate those same traces lowest on interpretability. Simpler decomposed traces are rated easier to understand, but models fine-tuned on them reach lower downstream accuracy.
What carries the argument
Rule-based problem decomposition that generates paired datasets of verifiably correct or incorrect reasoning sub-steps while holding the final answer fixed, allowing isolation of trace semantics from answer correctness.
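The paired-dataset idea can be illustrated with a minimal sketch. It is not the authors' generator: toy arithmetic problems stand in for the paper's QA tasks, and the names `Example` and `make_pair` are hypothetical. The only property it is meant to show is that the two examples share a question and a correct final answer and differ only in whether the intermediate sub-steps are right.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    question: str
    trace: tuple[str, ...]   # intermediate sub-steps shown to the student model
    answer: int              # final answer, always the correct one
    trace_is_correct: bool


def make_pair(a: int, b: int, c: int, rng: random.Random) -> tuple[Example, Example]:
    """Build a correct-trace and an incorrect-trace example for (a + b) * c."""
    question = f"What is ({a} + {b}) * {c}?"
    partial = a + b
    answer = partial * c

    good_trace = (f"{a} + {b} = {partial}", f"{partial} * {c} = {answer}")

    # Shift the intermediate value by a nonzero offset so the first sub-step
    # is verifiably wrong, while the final answer is left untouched.
    wrong_partial = partial + rng.choice([-3, -2, -1, 1, 2, 3])
    bad_trace = (
        f"{a} + {b} = {wrong_partial}",
        f"{wrong_partial} * {c} = {answer}",
    )

    return (
        Example(question, good_trace, answer, True),
        Example(question, bad_trace, answer, False),
    )
```

Calling `make_pair(2, 3, 4, random.Random(0))` gives two examples whose `answer` fields are both 20; only the `trace` and `trace_is_correct` fields differ.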
If this is right
- Model training can use traces whose only requirement is that they improve final-answer accuracy.
- User-facing explanations can be designed separately from the traces used during distillation.
- Interpretability metrics and accuracy metrics should be measured on different trace variants.
- Current assumptions linking trace validity to performance gains need re-examination in other reasoning domains.
Where Pith is reading between the lines
- Separate optimization pipelines could produce one trace style for the model and a different style for human readers.
- The observed disconnect may appear in non-QA tasks such as code generation or mathematical proof construction.
Load-bearing premise
The rule-based decomposition accurately reflects the reasoning steps that matter for real LLM tasks.
What would settle it
Re-running the same training and evaluation protocol on a held-out set of problems whose reasoning steps were generated directly by a large model rather than by the rule-based decomposer.
Original abstract
Recent advances in reasoning-focused Large Language Models (LLMs) have introduced Chain-of-Thought (CoT) traces - intermediate reasoning steps generated before a final answer. These traces, as in DeepSeek R1, guide inference and train smaller models. A common but under-examined assumption is that these traces are both semantically correct and interpretable to end-users. While intermediate reasoning steps are believed to improve accuracy, we question whether they are actually valid and understandable. To isolate the effect of trace semantics, we design experiments in Question Answering (QA) using rule-based problem decomposition, creating fine-tuning datasets where each problem is paired with either verifiably correct or incorrect traces, while always providing the correct final answer. Trace correctness is evaluated by checking the accuracy of every reasoning sub-step. To assess interpretability, we fine-tune LLMs on three additional trace types: R1 traces, R1 trace summaries, and post-hoc explanations, and conduct a human study with 100 participants rating each type on a Likert scale. We find: (1) Trace correctness does not reliably predict correct final answers - correct traces led to correct solutions in only 28% of test cases, while incorrect traces did not consistently degrade accuracy. (2) Fine-tuning on verbose R1 traces yielded the best model performance, but users rated them least interpretable (3.39 interpretability, 4.59 cognitive load on a 5-point scale), whereas more interpretable decomposed traces did not achieve comparable accuracy. Together, these findings challenge the assumption in question suggesting that researchers and practitioners should decouple model supervision objectives from end-user-facing trace design.
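One way to read the 28% figure is as a conditional rate: among test cases where the trace is correct, the share whose final answer is also correct. The sketch below computes that rate for each trace condition. The record format and the function name `answer_accuracy_by_trace` are assumptions made for illustration, and the paper's exact definition may differ.

```python
from collections import Counter


def answer_accuracy_by_trace(records):
    """Return final-answer accuracy grouped by trace correctness.

    records: iterable of (trace_correct: bool, answer_correct: bool) pairs,
    one per test case. This layout is hypothetical.
    """
    totals = Counter()
    hits = Counter()
    for trace_ok, answer_ok in records:
        totals[trace_ok] += 1
        hits[trace_ok] += int(answer_ok)
    # Only conditions that actually occur appear in the result.
    return {trace_ok: hits[trace_ok] / totals[trace_ok] for trace_ok in totals}
```

If the two rates returned for `True` and `False` are close, trace correctness carries little information about final-answer correctness, which is the disconnect the abstract describes.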
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper examines the assumption that semantically correct and interpretable Chain-of-Thought traces improve both model performance in knowledge distillation and end-user understanding. Using rule-based problem decomposition on QA tasks, the authors construct fine-tuning datasets pairing problems with verifiably correct or incorrect traces (while always using the correct final answer), fine-tune LLMs on these plus R1 traces/summaries/explanations, and run a human study with 100 participants rating interpretability on Likert scales. Central empirical claims are that correct traces yield correct final answers in only 28% of cases and that verbose R1 traces produce the strongest fine-tuned models yet receive the lowest interpretability ratings (3.39) and highest cognitive load (4.59).
Significance. If the results hold, the work provides concrete evidence that trace correctness and interpretability are not reliable proxies for downstream accuracy or user comprehension in trace-based distillation. The controlled isolation of trace semantics via rule-based decomposition combined with direct human judgments supplies falsifiable, quantitative data (including the 28% figure) that can inform whether supervision objectives should be decoupled from user-facing explanations.
major comments (2)
- [§3] §3 (Dataset Creation and Trace Generation): the rule-based problem decomposition is presented as isolating trace correctness effects, yet no validation is reported showing that the generated sub-steps and error patterns match the semantic structure or failure modes of naturally occurring LLM CoT traces (e.g., DeepSeek R1). Without such a comparison or ablation, the finding that correct traces yield correct final answers in only 28% of test cases risks being an artifact of the synthetic construction rather than a general property of trace-based distillation.
- [§4.2] §4.2 (Fine-tuning and Evaluation Results): the claim that incorrect traces 'did not consistently degrade accuracy' is central to the disconnect finding, but the manuscript provides no statistical tests, confidence intervals, or per-problem breakdowns for the accuracy differences between correct-trace and incorrect-trace conditions. This detail is load-bearing for the conclusion that trace correctness does not reliably predict outcomes.
minor comments (2)
- [§3.1] The description of how trace correctness is verified for every sub-step would benefit from an explicit example or pseudocode to clarify the rule-based checking procedure.
- [Figure 3] Figure captions and axis labels for the human-study Likert results could be expanded to include exact question wording and scale anchors for reproducibility.
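The kind of example the first minor comment asks for might look like the sketch below. It is not the authors' procedure. It assumes, purely for illustration, that each sub-step is a line of the form `x op y = z` over integers, and the names `step_is_correct` and `trace_is_correct` are hypothetical. A trace counts as correct only when every sub-step checks out.

```python
import operator
import re

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

# Matches lines such as "2 + 3 = 5" or "-2 * 4 = -8".
STEP = re.compile(r"^\s*(-?\d+)\s*([+\-*])\s*(-?\d+)\s*=\s*(-?\d+)\s*$")


def step_is_correct(step: str) -> bool:
    """Recompute one sub-step and compare with the value it claims."""
    match = STEP.match(step)
    if match is None:
        return False  # steps that cannot be parsed are treated as incorrect
    lhs, op, rhs, claimed = match.groups()
    return OPS[op](int(lhs), int(rhs)) == int(claimed)


def trace_is_correct(trace: list[str]) -> bool:
    """A trace is correct only if it is non-empty and every sub-step is correct."""
    return bool(trace) and all(step_is_correct(step) for step in trace)
```

Treating unparseable steps as incorrect is a design choice of this sketch; a real checker would need a rule for each sub-step type the decomposition can emit.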
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our manuscript. We address each of the major comments point by point below, providing clarifications and indicating revisions where appropriate.
- Referee: [§3] §3 (Dataset Creation and Trace Generation): the rule-based problem decomposition is presented as isolating trace correctness effects, yet no validation is reported showing that the generated sub-steps and error patterns match the semantic structure or failure modes of naturally occurring LLM CoT traces (e.g., DeepSeek R1). Without such a comparison or ablation, the finding that correct traces yield correct final answers in only 28% of test cases risks being an artifact of the synthetic construction rather than a general property of trace-based distillation.
  Authors: Our use of rule-based problem decomposition was intended to provide a controlled and verifiable way to manipulate trace correctness while holding the final answer constant, thereby isolating the impact of intermediate reasoning steps on model performance. This synthetic construction allows for precise error injection and correctness verification that would be difficult with naturally generated LLM traces. We recognize that this may not fully capture all characteristics of LLM-generated CoT traces from models such as DeepSeek R1. To address this, we will include in the revised manuscript a discussion of the design rationale and a qualitative comparison of example traces from our method versus R1 outputs to highlight similarities and differences in structure.
  Revision: partial
- Referee: [§4.2] §4.2 (Fine-tuning and Evaluation Results): the claim that incorrect traces 'did not consistently degrade accuracy' is central to the disconnect finding, but the manuscript provides no statistical tests, confidence intervals, or per-problem breakdowns for the accuracy differences between correct-trace and incorrect-trace conditions. This detail is load-bearing for the conclusion that trace correctness does not reliably predict outcomes.
  Authors: We agree that the inclusion of statistical tests, confidence intervals, and additional breakdowns would provide stronger support for our findings. In the revised manuscript, we will add these analyses, including appropriate significance tests for the accuracy differences and 95% confidence intervals, as well as per-problem performance breakdowns to illustrate the variability across instances.
  Revision: yes
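The rebuttal does not say which tests the authors will use. As a rough sketch of what a 95% confidence interval and a significance test for an accuracy difference could look like, the code below implements a Wilson score interval and a pooled two-proportion z-test using only the standard library. The function names are hypothetical, and any counts passed in are the caller's own; none come from the paper.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


def two_proportion_z(s1: int, n1: int, s2: int, n2: int) -> tuple[float, float]:
    """Pooled two-proportion z-test; returns (z statistic, two-sided p-value)."""
    pooled = (s1 + s2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        raise ValueError("test is undefined when all outcomes are identical")
    z = (s1 / n1 - s2 / n2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value
```

For example, `two_proportion_z(56, 200, 50, 200)` compares two hypothetical conditions with 56 and 50 correct answers out of 200 each. If both conditions are scored on the same test problems, a paired test such as McNemar's would be a better fit than this unpaired one.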
Circularity Check
No circularity: purely empirical study with direct experimental grounding
The paper conducts an empirical investigation using rule-based problem decomposition to create QA datasets with controlled correct/incorrect traces, followed by fine-tuning experiments on multiple trace types and a human study with Likert-scale ratings. No mathematical derivations, parameter fittings, or self-referential definitions appear in the described methodology or findings. Claims about trace correctness not predicting final answers (28% success rate) and interpretability-accuracy trade-offs rest on direct experimental outcomes and external human judgments rather than any reduction to inputs by construction. The work is self-contained against its benchmarks with no load-bearing self-citations or ansatzes that collapse the central results.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The rule-based problem decomposition creates traces that meaningfully test the semantic correctness assumption in LLM reasoning.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "correct traces led to correct solutions in only 28% of test cases, while incorrect traces did not consistently degrade accuracy"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- Evaluating the False Trust Engendered by LLM Explanations
  A user study finds that LLM reasoning traces and post-hoc explanations create false trust by increasing acceptance of incorrect answers, whereas contrastive dual explanations improve users' ability to detect errors.
- Evaluating the False Trust Engendered by LLM Explanations
  LLM reasoning traces and post-hoc explanations increase false trust in incorrect predictions, whereas contrastive dual explanations enhance users' ability to distinguish correct from incorrect AI outputs.