
arxiv: 2505.13792 · v2 · submitted 2025-05-20 · 💻 cs.CL · cs.AI

Interpretable Traces, Unexpected Outcomes: Investigating the Disconnect in Trace-Based Knowledge Distillation

Pith reviewed 2026-05-22 14:16 UTC · model grok-4.3

keywords chain-of-thought · knowledge distillation · trace interpretability · LLM reasoning · question answering · human evaluation

The pith

Correct reasoning traces lead to accurate final answers in only 28 percent of cases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests the assumption that semantically correct Chain-of-Thought traces will produce correct final answers and remain understandable to users. Using rule-based decomposition on question-answering problems, the authors create training data where every example has the right final answer but the intermediate trace is either verifiably correct or incorrect. Models fine-tuned on these datasets show that trace correctness does not reliably predict whether the model reaches the right answer. Human raters find verbose traces from models like DeepSeek R1 least interpretable even when those traces improve accuracy. The work therefore questions whether trace design for model training should be tied to trace design for human consumption.

Core claim

When LLMs are fine-tuned on QA problems that always supply the correct final answer but pair it with either correct or incorrect traces, correct traces produce correct solutions in only 28 percent of test cases, and incorrect traces do not consistently lower accuracy. Fine-tuning on verbose R1 traces yields the strongest model performance, yet human participants rate those same traces lowest on interpretability. Simpler decomposed traces are rated more understandable, but models fine-tuned on them do not reach comparable accuracy.
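
To make the 28 percent figure concrete, here is a minimal Python sketch of one plausible reading of it: among test outputs whose trace is fully correct, the share that also have a correct final answer. The record layout and function names are hypothetical, and the summary above does not state whether the paper computes the figure exactly this way.

```python
from collections import Counter


def trace_answer_table(records):
    """Count test items by (trace_correct, answer_correct).

    `records` is an iterable of boolean pairs, one per test item.
    This layout is an assumption made for illustration.
    """
    return Counter((bool(t), bool(a)) for t, a in records)


def rate_correct_answer_given_correct_trace(records):
    """Share of correct-trace items that also reach the correct answer."""
    table = trace_answer_table(records)
    n_correct_trace = table[(True, True)] + table[(True, False)]
    if n_correct_trace == 0:
        return None  # undefined when no trace was fully correct
    return table[(True, True)] / n_correct_trace


# Toy values for illustration only, not results from the paper.
toy_records = [
    (True, True), (True, False), (True, False), (True, False),
    (False, True), (False, False),
]
toy_rate = rate_correct_answer_given_correct_trace(toy_records)  # 1 of 4 -> 0.25
```

Reporting the full two-by-two table alongside the conditional rate also shows how often incorrect traces still end in a correct answer, which is the other half of the claimed disconnect.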

What carries the argument

Rule-based problem decomposition that generates paired datasets of verifiably correct or incorrect reasoning sub-steps while holding the final answer fixed, allowing isolation of trace semantics from answer correctness.
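
The idea of holding the final answer fixed while varying trace correctness can be sketched as follows. This is a toy illustration only: the two-step arithmetic task, the `decompose` rule, and the `Example` record are hypothetical stand-ins, since the summary above does not describe the paper's actual QA domain or decomposition rules.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    question: str
    trace: tuple[str, ...]
    answer: int
    trace_is_correct: bool


def decompose(a: int, b: int, c: int):
    """Hypothetical rule: solve (a + b) * c in two verifiable sub-steps."""
    s = a + b
    p = s * c
    return [(f"{a} + {b}", s), (f"{s} * {c}", p)], p


def make_pair(a: int, b: int, c: int, rng: random.Random):
    """Build a correct-trace and an incorrect-trace example for one problem.

    Both examples carry the same, correct final answer; only one
    sub-step of the second trace is corrupted.
    """
    steps, answer = decompose(a, b, c)
    question = f"What is ({a} + {b}) * {c}?"
    good = tuple(f"{expr} = {val}" for expr, val in steps)
    i = rng.randrange(len(steps))           # which sub-step to corrupt
    expr, val = steps[i]
    offset = rng.choice([-2, -1, 1, 2])     # nonzero, so the step is wrong
    bad = good[:i] + (f"{expr} = {val + offset}",) + good[i + 1:]
    return (
        Example(question, good, answer, True),
        Example(question, bad, answer, False),
    )


good_example, bad_example = make_pair(2, 3, 4, random.Random(0))
```

Because the two examples differ only in one sub-step, any accuracy gap between models fine-tuned on the two variants can be attributed to trace content rather than to the answer label.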

If this is right

  • Model training can use traces whose only requirement is that they improve final-answer accuracy.
  • User-facing explanations can be designed separately from the traces used during distillation.
  • Interpretability metrics and accuracy metrics should be measured on different trace variants.
  • Current assumptions linking trace validity to performance gains need re-examination in other reasoning domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Separate optimization pipelines could produce one trace style for the model and a different style for human readers.
  • The observed disconnect may appear in non-QA tasks such as code generation or mathematical proof construction.

Load-bearing premise

The rule-based decomposition accurately reflects the reasoning steps that matter for real LLM tasks.

What would settle it

Re-running the same training and evaluation protocol on a held-out set of problems whose reasoning steps were generated directly by a large model rather than by the rule-based decomposer.

Original abstract

Recent advances in reasoning-focused Large Language Models (LLMs) have introduced Chain-of-Thought (CoT) traces - intermediate reasoning steps generated before a final answer. These traces, as in DeepSeek R1, guide inference and train smaller models. A common but under-examined assumption is that these traces are both semantically correct and interpretable to end-users. While intermediate reasoning steps are believed to improve accuracy, we question whether they are actually valid and understandable. To isolate the effect of trace semantics, we design experiments in Question Answering (QA) using rule-based problem decomposition, creating fine-tuning datasets where each problem is paired with either verifiably correct or incorrect traces, while always providing the correct final answer. Trace correctness is evaluated by checking the accuracy of every reasoning sub-step. To assess interpretability, we fine-tune LLMs on three additional trace types: R1 traces, R1 trace summaries, and post-hoc explanations, and conduct a human study with 100 participants rating each type on a Likert scale. We find: (1) Trace correctness does not reliably predict correct final answers - correct traces led to correct solutions in only 28% of test cases, while incorrect traces did not consistently degrade accuracy. (2) Fine-tuning on verbose R1 traces yielded the best model performance, but users rated them least interpretable (3.39 interpretability, 4.59 cognitive load on a 5-point scale), whereas more interpretable decomposed traces did not achieve comparable accuracy. Together, these findings challenge the assumption in question suggesting that researchers and practitioners should decouple model supervision objectives from end-user-facing trace design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper examines the assumption that semantically correct and interpretable Chain-of-Thought traces improve both model performance in knowledge distillation and end-user understanding. Using rule-based problem decomposition on QA tasks, the authors construct fine-tuning datasets pairing problems with verifiably correct or incorrect traces (while always using the correct final answer), fine-tune LLMs on these plus R1 traces/summaries/explanations, and run a human study with 100 participants rating interpretability on Likert scales. Central empirical claims are that correct traces yield correct final answers in only 28% of cases and that verbose R1 traces produce the strongest fine-tuned models yet receive the lowest interpretability ratings (3.39) and highest cognitive load (4.59).

Significance. If the results hold, the work provides concrete evidence that trace correctness and interpretability are not reliable proxies for downstream accuracy or user comprehension in trace-based distillation. The controlled isolation of trace semantics via rule-based decomposition combined with direct human judgments supplies falsifiable, quantitative data (including the 28% figure) that can inform whether supervision objectives should be decoupled from user-facing explanations.

major comments (2)
  1. [§3] §3 (Dataset Creation and Trace Generation): the rule-based problem decomposition is presented as isolating trace correctness effects, yet no validation is reported showing that the generated sub-steps and error patterns match the semantic structure or failure modes of naturally occurring LLM CoT traces (e.g., DeepSeek R1). Without such a comparison or ablation, the finding that correct traces led to correct final answers in only 28% of test cases risks being an artifact of the synthetic construction rather than a general property of trace-based distillation.
  2. [§4.2] §4.2 (Fine-tuning and Evaluation Results): the claim that incorrect traces 'did not consistently degrade accuracy' is central to the disconnect finding, but the manuscript provides no statistical tests, confidence intervals, or per-problem breakdowns for the accuracy differences between correct-trace and incorrect-trace conditions. This detail is load-bearing for the conclusion that trace correctness does not reliably predict outcomes.
minor comments (2)
  1. [§3.1] The description of how trace correctness is verified for every sub-step would benefit from an explicit example or pseudocode to clarify the rule-based checking procedure.
  2. [Figure 3] Figure captions and axis labels for the human-study Likert results could be expanded to include exact question wording and scale anchors for reproducibility.
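
As an illustration of the kind of example minor comment 1 asks for, a per-sub-step check might look like the Python sketch below. It assumes a hypothetical trace format in which each step is a line such as `5 * 4 = 20`; the paper's own format and verification rules are not described here, so this is not the authors' procedure.

```python
import operator
import re

# Hypothetical step format: "<int> <op> <int> = <int>"
STEP = re.compile(r"^\s*(-?\d+)\s*([+\-*])\s*(-?\d+)\s*=\s*(-?\d+)\s*$")
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def check_step(step: str) -> bool:
    """Return True if the step's stated result matches its expression."""
    m = STEP.match(step)
    if m is None:
        return False  # unparseable steps are treated as incorrect
    left, op, right, claimed = m.groups()
    return OPS[op](int(left), int(right)) == int(claimed)


def trace_is_correct(trace) -> bool:
    """A trace counts as correct only if it is non-empty and every step checks."""
    return len(trace) > 0 and all(check_step(s) for s in trace)
```

This sketch verifies each step in isolation. It does not check that a later step actually reuses the result of an earlier one, which a full verifier would likely also need to do.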

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each of the major comments point by point below, providing clarifications and indicating revisions where appropriate.

  1. Referee: [§3] §3 (Dataset Creation and Trace Generation): the rule-based problem decomposition is presented as isolating trace correctness effects, yet no validation is reported showing that the generated sub-steps and error patterns match the semantic structure or failure modes of naturally occurring LLM CoT traces (e.g., DeepSeek R1). Without such a comparison or ablation, the finding that correct traces led to correct final answers in only 28% of test cases risks being an artifact of the synthetic construction rather than a general property of trace-based distillation.

    Authors: Our use of rule-based problem decomposition was intended to provide a controlled and verifiable way to manipulate trace correctness while holding the final answer constant, thereby isolating the impact of intermediate reasoning steps on model performance. This synthetic construction allows for precise error injection and correctness verification that would be difficult with naturally generated LLM traces. We recognize that this may not fully capture all characteristics of LLM-generated CoT traces from models such as DeepSeek R1. To address this, we will include in the revised manuscript a discussion of the design rationale and a qualitative comparison of example traces from our method versus R1 outputs to highlight similarities and differences in structure. revision: partial

  2. Referee: [§4.2] §4.2 (Fine-tuning and Evaluation Results): the claim that incorrect traces 'did not consistently degrade accuracy' is central to the disconnect finding, but the manuscript provides no statistical tests, confidence intervals, or per-problem breakdowns for the accuracy differences between correct-trace and incorrect-trace conditions. This detail is load-bearing for the conclusion that trace correctness does not reliably predict outcomes.

    Authors: We agree that the inclusion of statistical tests, confidence intervals, and additional breakdowns would provide stronger support for our findings. In the revised manuscript, we will add these analyses, including appropriate significance tests for the accuracy differences and 95% confidence intervals, as well as per-problem performance breakdowns to illustrate the variability across instances. revision: yes
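
The analyses promised in the second response could take many forms. As one hedged possibility, the Python sketch below computes a Wilson 95% confidence interval for a single accuracy and an exact two-sided McNemar test for two models evaluated on the same test items. These method choices and all function names are assumptions made for illustration; the rebuttal does not say which tests the authors will use.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def mcnemar_exact(correct_a, correct_b) -> float:
    """Exact two-sided McNemar p-value for paired per-item correctness.

    `correct_a[i]` and `correct_b[i]` say whether models A and B answered
    test item i correctly. Only items where the two models disagree
    carry information about a difference in accuracy.
    """
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = b + c
    if n == 0:
        return 1.0  # the models never disagree
    tail = sum(math.comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy per-item outcomes for illustration only, not results from the paper.
model_correct_traces = [True, True, False, True, False, True, True, False]
model_incorrect_traces = [True, False, False, True, True, True, False, False]
p_value = mcnemar_exact(model_correct_traces, model_incorrect_traces)
low, high = wilson_interval(sum(model_correct_traces), len(model_correct_traces))
```

A paired test fits this setting because both fine-tuned models are scored on the same test problems, and the per-item disagreement counts it relies on double as the per-problem breakdown the referee requests.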

Circularity Check

0 steps flagged

No circularity: purely empirical study with direct experimental grounding


The paper conducts an empirical investigation using rule-based problem decomposition to create QA datasets with controlled correct/incorrect traces, followed by fine-tuning experiments on multiple trace types and a human study with Likert-scale ratings. No mathematical derivations, parameter fittings, or self-referential definitions appear in the described methodology or findings. Claims about trace correctness not predicting final answers (28% success rate) and interpretability-accuracy trade-offs rest on direct experimental outcomes and external human judgments rather than any reduction to inputs by construction. The work is self-contained against its benchmarks with no load-bearing self-citations or ansatzes that collapse the central results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Empirical paper with no free parameters or invented entities. Relies on the domain assumption that rule-based decompositions validly test trace semantics.

axioms (1)
  • domain assumption: The rule-based problem decomposition creates traces that meaningfully test the semantic correctness assumption in LLM reasoning.
    This underpins the creation of correct and incorrect trace datasets.

pith-pipeline@v0.9.0 · 5842 in / 1249 out tokens · 71836 ms · 2026-05-22T14:16:02.864335+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Evaluating the False Trust Engendered by LLM Explanations

    cs.HC 2026-05 unverdicted novelty 6.0

    A user study finds that LLM reasoning traces and post-hoc explanations create false trust by increasing acceptance of incorrect answers, whereas contrastive dual explanations improve users' ability to detect errors.

  2. Evaluating the False Trust Engendered by LLM Explanations

    cs.HC 2026-05 unverdicted novelty 5.0

    LLM reasoning traces and post-hoc explanations increase false trust in incorrect predictions, whereas contrastive dual explanations enhance users' ability to distinguish correct from incorrect AI outputs.