Can Aha Moments Be Fake? Towards Quantifying Decorative and True Thinking in Chain-of-Thought
Pith reviewed 2026-05-18 02:31 UTC · model grok-4.3
The pith
Most steps in chain-of-thought reasoning have little causal effect on the model's final answer.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLMs often interleave true-thinking steps that are genuinely used to compute the final output with decorative-thinking steps that give the appearance of reasoning but have minimal causal influence. Only a small subset of the total reasoning steps causally drive the model's prediction. On AIME, for example, only an average of 2.3% of reasoning steps in CoT have a TTS of 0.7 or higher for Qwen-2.5. Self-verification steps can be decorative, while steering along the TrueThinking direction can force internal reasoning over these steps.
What carries the argument
True Thinking Score (TTS), which quantifies the causal contribution of each step in the chain-of-thought to the final prediction by isolating its effect.
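As a rough illustration of what "isolating the effect of a step" can mean, the sketch below scores a step by how much the model's confidence in its answer moves when that step is perturbed. This is an assumption made for illustration only: the function names and the absolute-difference scoring rule are hypothetical and are not the paper's actual TTS definition.

```python
def step_effect(p_intact: float, p_perturbed: float) -> float:
    """Absolute change in answer confidence when a single step is perturbed.

    Hypothetical scoring rule for illustration; not the paper's TTS formula.
    """
    for p in (p_intact, p_perturbed):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return abs(p_intact - p_perturbed)


def score_steps(confidence_pairs):
    """Score every step from (p_intact, p_perturbed) pairs, one pair per step."""
    return [step_effect(intact, perturbed) for intact, perturbed in confidence_pairs]
```

Under this toy rule, a step whose perturbation leaves the confidence unchanged scores 0 (decorative in the page's terms), while a step whose perturbation collapses the confidence scores close to 1.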
If this is right
- Only a small fraction of verbalized steps, such as 2.3 percent on AIME problems, show high causal contribution to the answer.
- Self-verification steps and apparent insights in CoT can be decorative and lack internal effect.
- An identified direction allows steering the model to internally follow or disregard specific verbalized steps.
- Chain-of-thought traces may therefore be neither efficient nor fully trustworthy representations of internal reasoning.
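The headline statistic in the first bullet is a simple proportion: the share of steps whose score reaches a threshold. A minimal sketch, assuming per-step scores are already available as a list of floats (the function name and the example scores are hypothetical):

```python
def high_score_fraction(scores, threshold=0.7):
    """Fraction of steps whose score is at or above the threshold."""
    if not scores:
        raise ValueError("need at least one step score")
    return sum(score >= threshold for score in scores) / len(scores)


# Made-up scores for a four-step trace; only the first step clears 0.7.
example_scores = [0.9, 0.1, 0.0, 0.05]
fraction = high_score_fraction(example_scores)
```

The 0.7 default mirrors the threshold quoted on this page; the reported 2.3% is an average of such fractions and is not reproduced by this toy input.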
Where Pith is reading between the lines
- Identifying and retaining only high-TTS steps could allow shorter reasoning traces that preserve accuracy while reducing computation.
- Future training methods might explicitly reward increases in the proportion of true-thinking steps rather than longer traces.
- The same distinction between causal and decorative content could apply to other generated explanations such as code comments or proof sketches.
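The first point above, keeping only high-scoring steps, can be sketched as a filter that retains the top-scoring fraction of steps while preserving their original order. The function name, the `keep_fraction` parameter, and the ceiling-based rounding are assumptions for illustration, not the paper's pruning procedure.

```python
import math


def prune_steps(steps, scores, keep_fraction=0.5):
    """Keep the highest-scoring fraction of steps, in their original order."""
    if len(steps) != len(scores):
        raise ValueError("steps and scores must have the same length")
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    if not steps:
        return []
    n_keep = math.ceil(len(steps) * keep_fraction)
    # Rank step indices by score, highest first, and keep the top n_keep.
    ranked = sorted(range(len(steps)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:n_keep])  # restore original ordering
    return [steps[i] for i in kept]
```

Whether a pruned trace preserves accuracy is an empirical question; this sketch only performs the selection.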
Load-bearing premise
Intervening on a single verbalized step to measure its contribution does not alter the model's other internal computations in ways the score cannot detect.
What would settle it
An experiment showing that removing a low-TTS step alters the final answer as often and as strongly as removing a high-TTS step would indicate that the score does not isolate causal contribution.
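A minimal sketch of how such a comparison could be tallied, assuming one final answer per problem before and after removing a step. The function names and the `margin` parameter are hypothetical, and a real experiment would also need a significance test.

```python
def flip_rate(original_answers, ablated_answers):
    """Share of problems whose final answer changes after a step is removed."""
    if len(original_answers) != len(ablated_answers):
        raise ValueError("answer lists must have the same length")
    if not original_answers:
        raise ValueError("need at least one problem")
    flips = sum(a != b for a, b in zip(original_answers, ablated_answers))
    return flips / len(original_answers)


def score_separates(low_score_rate, high_score_rate, margin=0.0):
    """True if removing high-score steps flips answers more often than low-score steps."""
    return high_score_rate - low_score_rate > margin
```

If `score_separates` came back false on a large sample, that would be the disconfirming outcome described above.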
Original abstract
Large language models can generate long chain-of-thought (CoT) reasoning, yet prior work suggests that CoT can be post-hoc rationalization rather than a faithful reflection of the computation through explicitly designed settings. In this work, we go further and propose a True Thinking Score (TTS) to quantify the causal contribution of each step in CoT to the model's final prediction in realistic reasoning problems. Across eleven models ranging from 1.5B to 1.1T parameters on common reasoning benchmarks, we find that CoTs often interleave true-thinking steps, which causally affect the final answer, with decorative-thinking steps, which appear useful but have little causal influence. Such decorative steps remain prevalent even for frontier models: over 30% of steps in Kimi-K2.6 are decorative on MATH with TTS <= 0.005. Furthermore, TTS enables effective CoT pruning: removing the 50% of CoT steps with the lowest TTS can largely maintain performance. Self-training on these pruned CoTs reduces reasoning length by 66% while preserving performance on Nemotron3-Nano-30B. Finally, we provide a mechanistic analysis showing that LLMs can be steered in the latent space to engage or disengage with reasoning steps. Overall, our results reveal that frontier LLMs often verbalize reasoning steps that are not causally used, challenging both the efficiency and the trustworthiness of CoT.
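Steering in latent space is commonly implemented by adding a scaled direction vector to a hidden state. The sketch below shows that generic operation on plain Python lists; the function name, the unit-normalization, and the sign convention for `alpha` are assumptions for illustration, and the abstract does not say how the authors extract or apply their direction.

```python
import math


def steer(hidden, direction, alpha):
    """Shift a hidden-state vector along a unit-normalized direction.

    Positive alpha pushes toward the direction, negative alpha away from it.
    Generic activation-steering arithmetic, not the paper's exact procedure.
    """
    if len(hidden) != len(direction):
        raise ValueError("hidden and direction must have the same length")
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [h + alpha * d / norm for h, d in zip(hidden, direction)]
```

In practice the same arithmetic would be applied to a model's residual-stream activations at chosen layers and token positions, which this toy version leaves out.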
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that chain-of-thought (CoT) reasoning in large language models often includes 'decorative' steps that do not causally influence the final prediction, in contrast to 'true-thinking' steps. It introduces a True Thinking Score (TTS) to quantify the causal contribution of each step and reports that, for example, on the AIME benchmark, only an average of 2.3% of reasoning steps in CoT have a TTS of at least 0.7 for the Qwen-2.5 model. The work also demonstrates that models can be steered using an identified 'TrueThinking' direction to internally follow or disregard specific steps, including self-verification 'aha moments' which may otherwise be decorative.
Significance. Should the TTS metric prove robust and the steering results replicable with appropriate controls, this research would be highly significant. It would provide evidence that verbalized CoT does not reliably reflect internal model computations, raising questions about the interpretability and efficiency of current reasoning approaches in LLMs. The steering mechanism could offer a new tool for enhancing the trustworthiness of model outputs by ensuring internal alignment with verbalized reasoning.
major comments (2)
- [Abstract] The computation of the True Thinking Score (TTS) is not described, nor are the specific interventions (such as masking, deletion, or activation steering) used to measure causal contribution. This omission is load-bearing for the central claim, as the reported 2.3% figure on AIME cannot be evaluated without knowing how TTS isolates the effect of each step without the intervention confounding the model's internal states or subsequent reasoning path.
- [Abstract] No information is provided on experimental setup details, including controls for confounding variables, statistical error bars, number of samples, or validation experiments confirming that the intervention does not alter the model's computation in unaccounted ways. Without these, it is difficult to determine if the distinction between true and decorative steps is causally valid or an artifact of the measurement procedure.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify the presentation of our central claims. We address each major comment below and commit to revisions that improve the self-contained nature of the abstract without altering the underlying results.
Point-by-point responses
- Referee: [Abstract] The computation of the True Thinking Score (TTS) is not described, nor are the specific interventions (such as masking, deletion, or activation steering) used to measure causal contribution. This omission is load-bearing for the central claim, as the reported 2.3% figure on AIME cannot be evaluated without knowing how TTS isolates the effect of each step without the intervention confounding the model's internal states or subsequent reasoning path.
  Authors: We agree that the abstract's brevity leaves the precise definition of TTS and the causal interventions unspecified. The full manuscript defines TTS as the normalized difference in the model's output probability for the correct answer when a given reasoning step is causally intervened upon (via targeted activation editing while preserving the remainder of the CoT). We will revise the abstract to include a concise clause describing this measurement approach and the intervention type, ensuring the 2.3% statistic can be interpreted directly from the abstract.
  Revision: yes
- Referee: [Abstract] No information is provided on experimental setup details, including controls for confounding variables, statistical error bars, number of samples, or validation experiments confirming that the intervention does not alter the model's computation in unaccounted ways. Without these, it is difficult to determine if the distinction between true and decorative steps is causally valid or an artifact of the measurement procedure.
  Authors: We acknowledge that the abstract omits these experimental details. The full manuscript reports results aggregated across the AIME benchmark with multiple independent runs per problem, includes statistical error bars on all quantitative claims, and validates interventions by confirming that baseline model accuracy is preserved and that random interventions produce measurably different effects. We will add a brief summary of sample scale and validation checks to the abstract and ensure all reported figures in the revision display error bars and controls.
  Revision: yes
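For readers unfamiliar with the error bars mentioned in the response above, one common convention is the mean plus or minus the standard error across independent runs. A minimal sketch under that assumption (the function name is hypothetical, and the rebuttal does not state which error-bar convention the manuscript uses):

```python
import math
import statistics


def mean_and_sem(run_values):
    """Mean and standard error of the mean across independent runs."""
    if len(run_values) < 2:
        raise ValueError("need at least two runs to estimate a standard error")
    mean = statistics.mean(run_values)
    sem = statistics.stdev(run_values) / math.sqrt(len(run_values))
    return mean, sem
```

A reported figure such as a per-benchmark percentage would then be shown as the mean with the standard error as its error bar.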
Circularity Check
No circularity: TTS is an interventional causal metric whose empirical distribution is not forced by definition
Full rationale
The abstract defines TTS as a score that quantifies the causal contribution of each CoT step to the final prediction and reports the empirical observation that only 2.3% of steps reach TTS >= 0.7 on AIME. No equations, parameter-fitting procedure, self-citation chain, or ansatz are supplied that would make the reported percentage or the true-vs-decorative distinction reduce to the inputs by construction. The central claim therefore rests on an external measurement whose validity can be checked independently of the paper's own outputs.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · tag: unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We propose a True Thinking Score (TTS) to quantify the causal contribution of each step in CoT... ATEnec(1) and ATEsuf(0)"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 3 Pith papers
- When Chain-of-Thought Fails, the Solution Hides in the Hidden States
  Activation patching shows individual CoT tokens encode sufficient task-relevant information to recover correct answers on GSM8K, often outperforming both direct prompting and the original (sometimes incorrect) CoT trace.
- Measuring and curing reasoning rigidity: from decorative chain-of-thought to genuine faithfulness
  SLRC quantifies genuine step necessity in LLM reasoning as a causal estimator, LC-CoSR training reduces rigidity with stability guarantees, and evaluations reveal a faithfulness-sycophancy paradox across frontier models.
- Decoding the Critique Mechanism in Large Reasoning Models
  By injecting arithmetic mistakes into CoT reasoning, the paper identifies a hidden critique ability in LRMs and extracts a steerable critique vector that enhances self-correction across model scales.