Can Aha Moments Be Fake? Towards Quantifying Decorative and True Thinking in Chain-of-Thought

Dawn Song; Jiachen Zhao; Weiyan Shi; Yiyou Sun

arxiv: 2510.24941 · v4 · pith:B2FBVKLUnew · submitted 2025-10-28 · 💻 cs.LG

Pith reviewed 2026-05-18 02:31 UTC · model grok-4.3

classification 💻 cs.LG

keywords chain of thought · true thinking score · decorative steps · causal contribution · large language models · reasoning steps · self-verification · aha moments

The pith

Most steps in chain-of-thought reasoning have little causal effect on the model's final answer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the True Thinking Score to measure the causal contribution of each verbalized step in a model's chain-of-thought. It shows that LLMs commonly mix a few true-thinking steps that actually shape the output with many decorative steps that add little causal influence. This distinction matters because long reasoning traces may waste effort and give a misleading picture of how the model reached its conclusion. The authors further demonstrate that a specific direction in the model's latent space can steer it to internally respect or ignore particular steps.

Core claim

LLMs often interleave true-thinking steps that are genuinely used to compute the final output with decorative-thinking steps that give the appearance of reasoning but have minimal causal influence. Only a small subset of the total reasoning steps causally drive the model's prediction. On AIME, for example, only an average of 2.3% of reasoning steps in CoT have a TTS of 0.7 or higher for Qwen-2.5. Self-verification steps can be decorative, while steering along the TrueThinking direction can force internal reasoning over these steps.

What carries the argument

True Thinking Score (TTS), which quantifies the causal contribution of each step in the chain-of-thought to the final prediction by isolating its effect.

If this is right

Only a small fraction of verbalized steps, such as 2.3 percent on AIME problems, show high causal contribution to the answer.
Self-verification steps and apparent insights in CoT can be decorative and lack internal effect.
An identified direction allows steering the model to internally follow or disregard specific verbalized steps.
Chain-of-thought traces may therefore be neither efficient nor fully trustworthy representations of internal reasoning.
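
The first implication is a thresholded proportion. The following is a minimal sketch of that arithmetic, assuming per-step TTS values are already available as floats in [0, 1]. The function name and the sample scores are hypothetical; only the 0.7 cut-off comes from the text above.

```python
def high_tts_fraction(scores, threshold=0.7):
    """Return the share of steps whose score is at or above `threshold`."""
    if not scores:
        return 0.0
    return sum(1 for s in scores if s >= threshold) / len(scores)


# Hypothetical scores for a 10-step trace (illustrative, not from the paper).
example_scores = [0.0, 0.01, 0.9, 0.0, 0.2, 0.0, 0.05, 0.0, 0.0, 0.0]
fraction = high_tts_fraction(example_scores)
print(f"{fraction:.1%} of steps have TTS >= 0.7")
```

Averaging this per-trace fraction over a benchmark would give a dataset-level figure comparable in form to the 2.3% quoted for AIME.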

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Identifying and retaining only high-TTS steps could allow shorter reasoning traces that preserve accuracy while reducing computation.
Future training methods might explicitly reward increases in the proportion of true-thinking steps rather than longer traces.
The same distinction between causal and decorative content could apply to other generated explanations such as code comments or proof sketches.

Load-bearing premise

Intervening on a single verbalized step to measure its contribution does not alter the model's other internal computations in ways the score cannot detect.

What would settle it

An experiment showing that removing a low-TTS step alters the final answer as often and as strongly as removing a high-TTS step would indicate that the score does not isolate causal contribution.
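
The comparison described above can be sketched as follows, assuming each record stores a step's TTS and whether removing that step changed the final answer. The record layout, function names, and sample data are hypothetical. The sketch covers only how often the answer changes, not how strongly.

```python
def flip_rate(records):
    """Fraction of step removals that changed the final answer."""
    if not records:
        return 0.0
    return sum(1 for r in records if r["answer_changed"]) / len(records)


def compare_flip_rates(records, threshold=0.7):
    """Return (low-TTS flip rate, high-TTS flip rate) for a TTS cut-off."""
    high = [r for r in records if r["tts"] >= threshold]
    low = [r for r in records if r["tts"] < threshold]
    return flip_rate(low), flip_rate(high)


# Hypothetical removal outcomes (illustrative, not from the paper).
records = [
    {"tts": 0.9, "answer_changed": True},
    {"tts": 0.8, "answer_changed": True},
    {"tts": 0.0, "answer_changed": False},
    {"tts": 0.1, "answer_changed": False},
    {"tts": 0.0, "answer_changed": True},
    {"tts": 0.05, "answer_changed": False},
]
low_rate, high_rate = compare_flip_rates(records)
print(f"low-TTS flip rate: {low_rate:.2f}, high-TTS flip rate: {high_rate:.2f}")
```

If the two rates came out roughly equal on real data, that would count against the score isolating causal contribution.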

Figures

Figures reproduced from arXiv: 2510.24941 by Dawn Song, Jiachen Zhao, Weiyan Shi, Yiyou Sun.

**Figure 2.** (a) Illustration of different modes in thinking steps within chain-of-thought (CoT) reasoning.

**Figure 3.** We uncover the TrueThinking direction in LLMs which is extracted as the difference…

**Figure 4.** (a) The dataset-level distribution of the…

**Figure 5.** An example of unfaithful self-verification steps (highlighted in blue) where the…

**Figure 6.** Layer-wise results of steering with the TrueThinking vector. In the Engagement Test, stronger intervention is reflected by lower accuracy (more right→wrong flips); in the Disengagement Test, by higher accuracy (more wrong→right flips). Figures (a–b): layer-wise results on AMC for DeepSeek-R1-Distill-Qwen-7B and its 1.5B variant under the Engagement Test and the Disengagement Test. Figures (c–d): cross-doma…

**Figure 7.** Normalized attention scores of the step in the…

**Figure 8.** Performance after steering the model to truly think over the self-verification part, where initially the accuracy is zero. We find that steering along the TrueThinking direction can at best reverse 52% of the unfaithful self-verification steps in CoT (layer-wise results shown in…

**Figure 9.** Test results of Nemotron on the Engagement Test where TrueThinking directions are extracted between examples with zero TTS (as decorative-thinking steps sDT) and examples of different ranges of TTS (as true-thinking steps sTT), and the lower accuracy means stronger steering effects.

**Figure 10.** Distribution of TTS on different datasets.
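
The captions for Figures 3 and 9 describe a direction extracted as a difference between true-thinking and decorative-thinking examples. One common way to build such a direction is a difference of mean activations, sketched below in plain Python. This is an assumption for illustration, not the paper's confirmed procedure: the function names, the toy two-dimensional activations, the `alpha` scale, and the sign convention (positive to engage, negative to disengage) are all hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def difference_of_means(true_acts, decorative_acts):
    """Direction from the decorative-step mean to the true-thinking mean."""
    mu_true = mean_vector(true_acts)
    mu_decorative = mean_vector(decorative_acts)
    return [t - d for t, d in zip(mu_true, mu_decorative)]


def steer(hidden, direction, alpha):
    """Shift a hidden state along `direction`, scaled by `alpha`."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy activations (illustrative only; real hidden states are much larger).
true_acts = [[1.0, 2.0], [3.0, 2.0]]
decorative_acts = [[0.0, 1.0], [2.0, 1.0]]
direction = difference_of_means(true_acts, decorative_acts)

engaged = steer([0.5, 0.5], direction, alpha=2.0)      # assumed: push toward engagement
disengaged = steer([0.5, 0.5], direction, alpha=-2.0)  # assumed: push toward disengagement
print(direction, engaged, disengaged)
```

In a real model the shift would be applied to the residual stream at a chosen layer, which is where the layer-wise results in Figure 6 would come in.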

Original abstract

Large language models can generate long chain-of-thought (CoT) reasoning, yet prior work suggests that CoT can be post-hoc rationalization rather than a faithful reflection of the computation through explicitly designed settings. In this work, we go further and propose a True Thinking Score (TTS) to quantify the causal contribution of each step in CoT to the model's final prediction in realistic reasoning problems. Across eleven models ranging from 1.5B to 1.1T parameters on common reasoning benchmarks, we find that CoTs often interleave true-thinking steps, which causally affect the final answer, with decorative-thinking steps, which appear useful but have little causal influence. Such decorative steps remain prevalent even for frontier models: over 30% of steps in Kimi-K2.6 are decorative on MATH with TTS <= 0.005. Furthermore, TTS enables effective CoT pruning: removing 50% of CoT steps with the lowest TTS can largely maintain the performance. Self-training on these pruned CoTs reduces reasoning length by 66% while preserving performance on Nemotron3-Nano-30B. Finally, we provide a mechanistic analysis showing that LLMs can be steered in the latent space to engage or disengage with reasoning steps. Overall, our results reveal that frontier LLMs often verbalize reasoning steps that are not causally used, challenging both the efficiency and the trustworthiness of CoT.
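
The pruning idea in the abstract, dropping the half of the steps with the lowest TTS, can be sketched as below. This is a minimal illustration assuming steps are strings with precomputed scores. The function name, tie-breaking rule, and sample trace are hypothetical, and the paper's actual pruning pipeline may differ.

```python
def prune_lowest_tts(steps, scores, drop_fraction=0.5):
    """Drop the lowest-scoring fraction of steps, keeping original order."""
    if len(steps) != len(scores):
        raise ValueError("steps and scores must have the same length")
    n_drop = int(len(steps) * drop_fraction)
    # Rank step indices by score, breaking ties by position in the trace.
    order = sorted(range(len(steps)), key=lambda i: (scores[i], i))
    dropped = set(order[:n_drop])
    return [step for i, step in enumerate(steps) if i not in dropped]


# Hypothetical trace and scores (illustrative, not from the paper).
steps = [
    "restate problem",
    "set up equation",
    "wait, let me double-check",
    "solve for x",
]
scores = [0.0, 0.8, 0.01, 0.9]
pruned = prune_lowest_tts(steps, scores)
print(pruned)
```

Keeping the surviving steps in their original order matters here, since a reordered trace would no longer read as a coherent chain of thought.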

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper flags that most CoT steps may not causally matter but the measurement method needs more detail.

This paper is trying to quantify how much each step in a chain-of-thought actually influences the model's final output versus just looking like reasoning. They introduce a True Thinking Score to do that and report that most steps have little causal effect.

What seems new is the specific measurement of causal contribution per step and the finding that self-verification or aha moments can be decorative. The steering along a TrueThinking direction to make the model follow or ignore steps is also a concrete addition. It does a good job highlighting a practical problem: if CoT is mostly for show, then relying on it for better performance or explanations has limits. The numbers like 2.3% on AIME make the point sharply.

The soft spots are clear from the abstract alone. There's no description of how the True Thinking Score is actually computed or what the intervention looks like for testing each step. That leaves open whether the low percentages reflect real internal behavior or come from how the test is set up. The stress-test point about the intervention potentially altering hidden states or trajectories is a real issue here – without controls or validation that the rest of the computation stays the same, the causal claims are hard to trust fully. No error bars or details on the models and datasets beyond the example make it preliminary.

This is for people studying LLM reasoning and interpretability. A reader working on CoT faithfulness would find the high-level distinction useful to think about. I would send it for peer review. The question matters enough that getting the methods and results properly checked is worthwhile, even with the current gaps.

Referee Report

2 major / 0 minor

Summary. The paper claims that chain-of-thought (CoT) reasoning in large language models often includes 'decorative' steps that do not causally influence the final prediction, in contrast to 'true-thinking' steps. It introduces a True Thinking Score (TTS) to quantify the causal contribution of each step and reports that, for example, on the AIME benchmark, only an average of 2.3% of reasoning steps in CoT have a TTS of at least 0.7 for the Qwen-2.5 model. The work also demonstrates that models can be steered using an identified 'TrueThinking' direction to internally follow or disregard specific steps, including self-verification 'aha moments' which may otherwise be decorative.

Significance. Should the TTS metric prove robust and the steering results replicable with appropriate controls, this research would be highly significant. It would provide evidence that verbalized CoT does not reliably reflect internal model computations, raising questions about the interpretability and efficiency of current reasoning approaches in LLMs. The steering mechanism could offer a new tool for enhancing the trustworthiness of model outputs by ensuring internal alignment with verbalized reasoning.

major comments (2)

[Abstract] The computation of the True Thinking Score (TTS) is not described, nor are the specific interventions (such as masking, deletion, or activation steering) used to measure causal contribution. This omission is load-bearing for the central claim, as the reported 2.3% figure on AIME cannot be evaluated without knowing how TTS isolates the effect of each step without the intervention confounding the model's internal states or subsequent reasoning path.
[Abstract] No information is provided on experimental setup details, including controls for confounding variables, statistical error bars, number of samples, or validation experiments confirming that the intervention does not alter the model's computation in unaccounted ways. Without these, it is difficult to determine if the distinction between true and decorative steps is causally valid or an artifact of the measurement procedure.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the presentation of our central claims. We address each major comment below and commit to revisions that improve the self-contained nature of the abstract without altering the underlying results.

Referee: [Abstract] The computation of the True Thinking Score (TTS) is not described, nor are the specific interventions (such as masking, deletion, or activation steering) used to measure causal contribution. This omission is load-bearing for the central claim, as the reported 2.3% figure on AIME cannot be evaluated without knowing how TTS isolates the effect of each step without the intervention confounding the model's internal states or subsequent reasoning path.

Authors: We agree that the abstract's brevity leaves the precise definition of TTS and the causal interventions unspecified. The full manuscript defines TTS as the normalized difference in the model's output probability for the correct answer when a given reasoning step is causally intervened upon (via targeted activation editing while preserving the remainder of the CoT). We will revise the abstract to include a concise clause describing this measurement approach and the intervention type, ensuring the 2.3% statistic can be interpreted directly from the abstract.

revision: yes

Referee: [Abstract] No information is provided on experimental setup details, including controls for confounding variables, statistical error bars, number of samples, or validation experiments confirming that the intervention does not alter the model's computation in unaccounted ways. Without these, it is difficult to determine if the distinction between true and decorative steps is causally valid or an artifact of the measurement procedure.

Authors: We acknowledge that the abstract omits these experimental details. The full manuscript reports results aggregated across the AIME benchmark with multiple independent runs per problem, includes statistical error bars on all quantitative claims, and validates interventions by confirming that baseline model accuracy is preserved and that random interventions produce measurably different effects. We will add a brief summary of sample scale and validation checks to the abstract and ensure all reported figures in the revision display error bars and controls.

revision: yes
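
The first simulated response describes TTS as a normalized difference in the probability of the correct answer before and after intervening on a step. A minimal sketch of one possible reading follows. It assumes the score is simply the absolute change in that probability, which already lies in [0, 1]. The normalization, the function name, and the sample probabilities are hypothetical and are not taken from the manuscript.

```python
def probability_shift_score(p_intact, p_intervened):
    """Absolute change in correct-answer probability caused by an intervention."""
    for p in (p_intact, p_intervened):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return abs(p_intact - p_intervened)


# Hypothetical probabilities of the correct answer (illustrative only).
p_intact = 0.875
p_after_intervention = {"step 1": 0.875, "step 2": 0.125, "step 3": 0.75}

step_scores = {
    step: probability_shift_score(p_intact, p)
    for step, p in p_after_intervention.items()
}
print(step_scores)
```

Under this reading, a step whose intervention leaves the probability untouched scores 0 and would count as decorative, while a step whose intervention moves it sharply would count as true thinking.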

Circularity Check

0 steps flagged

No circularity: TTS is an interventional causal metric whose empirical distribution is not forced by definition

The abstract defines TTS as a score that quantifies the causal contribution of each CoT step to the final prediction and reports the empirical observation that only 2.3% of steps reach TTS >= 0.7 on AIME. No equations, parameter-fitting procedure, self-citation chain, or ansatz are supplied that would make the reported percentage or the true-vs-decorative distinction reduce to the inputs by construction. The central claim therefore rests on an external measurement whose validity can be checked independently of the paper's own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, background axioms, or invented entities beyond the named TTS and TrueThinking direction; these are treated as introduced constructs whose definitions and independence cannot be audited from available text.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "We propose a True Thinking Score (TTS) to quantify the causal contribution of each step in CoT... ATEnec(1) and ATEsuf(0)"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

When Chain-of-Thought Fails, the Solution Hides in the Hidden States
cs.CL · 2026-04 · unverdicted · novelty 7.0

Activation patching shows individual CoT tokens encode sufficient task-relevant information to recover correct answers on GSM8K, often outperforming both direct prompting and the original (sometimes incorrect) CoT trace.

Measuring and curing reasoning rigidity: from decorative chain-of-thought to genuine faithfulness
cs.CL · 2026-03 · unverdicted · novelty 6.0

SLRC quantifies genuine step necessity in LLM reasoning as a causal estimator, LC-CoSR training reduces rigidity with stability guarantees, and evaluations reveal a faithfulness-sycophancy paradox across frontier models.

Decoding the Critique Mechanism in Large Reasoning Models
cs.LG · 2026-03 · unverdicted · novelty 6.0

By injecting arithmetic mistakes into CoT reasoning, the paper identifies a hidden critique ability in LRMs and extracts a steerable critique vector that enhances self-correction across model scales.