{"paper":{"title":"Can Aha Moments Be Fake? Towards Quantifying Decorative and True Thinking in Chain-of-Thought","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Most steps in chain-of-thought reasoning have little causal effect on the model's final answer.","cross_cats":[],"primary_cat":"cs.LG","authors_text":"Dawn Song, Jiachen Zhao, Weiyan Shi, Yiyou Sun","submitted_at":"2025-10-28T20:14:02Z","abstract_excerpt":"Large language models can generate long chain-of-thought (CoT) reasoning, yet prior work suggests that CoT can be post-hoc rationalization rather than a faithful reflection of the computation through explicitly designed settings. In this work, we go further and propose a True Thinking Score (TTS) to quantify the causal contribution of each step in CoT to the model's final prediction in realistic reasoning problems. Across eleven models ranging from 1.5B to 1.1T parameters on common reasoning benchmarks, we find that CoTs often interleave true-thinking steps, which causally affect the final ans"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"LLMs often interleave between true-thinking steps (which are genuinely used to compute the final output) and decorative-thinking steps (which give the appearance of reasoning but have minimal causal influence). Only a small subset of the total reasoning steps causally drive the model's prediction: e.g., on AIME, only an average of 2.3% of reasoning steps in CoT have a TTS >= 0.7 for Qwen-2.5.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The method used to compute the True Thinking Score isolates the causal contribution of each verbalized step without the intervention itself changing the model's internal computation in unaccounted ways.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"LLMs interleave true causal reasoning steps with decorative ones in CoT, with only ~2.3% of steps having high causal impact on AIME for Qwen-2.5, and a steering direction can force internal use of specific steps.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Most steps in chain-of-thought reasoning have little causal effect on the model's final answer.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"88106005bae30e20f94c7b2216c52ff938c6bd40bc53b998f6a925ba9809230d"},"source":{"id":"2510.24941","kind":"arxiv","version":4},"verdict":{"id":"8b0d9199-88e6-472e-b6ac-4c5e9473b16f","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-18T02:31:58.590742Z","strongest_claim":"LLMs often interleave between true-thinking steps (which are genuinely used to compute the final output) and decorative-thinking steps (which give the appearance of reasoning but have minimal causal influence). Only a small subset of the total reasoning steps causally drive the model's prediction: e.g., on AIME, only an average of 2.3% of reasoning steps in CoT have a TTS >= 0.7 for Qwen-2.5.","one_line_summary":"LLMs interleave true causal reasoning steps with decorative ones in CoT, with only ~2.3% of steps having high causal impact on AIME for Qwen-2.5, and a steering direction can force internal use of specific steps.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The method used to compute the True Thinking Score isolates the causal contribution of each verbalized step without the intervention itself changing the model's internal computation in unaccounted ways.","pith_extraction_headline":"Most steps in chain-of-thought reasoning have little causal effect on the model's final answer."},"integrity":{"clean":true,"summary":{"advisory":0,"critical":0,"by_detector":{},"informational":0},"endpoint":"/pith/2510.24941/integrity.json","findings":[],"available":true,"detectors_run":[],"snapshot_sha256":"c28c3603d3b5d939e8dc4c7e95fa8dfce3d595e45f758748cecf8e644a296938"},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":1,"snapshot_sha256":"7c0f09e0107a2c8b6ac25c448f9ff60b404027e1a60465135be9510a58f98531"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}