Mechanistic Evidence for Faithfulness Decay in Chain-of-Thought Reasoning

Donald Ye; Jonas Rohweder; Linus Wong; Max Loffgren; Om Kotadia

arxiv: 2602.11201 · v2 · pith:S33Y7MIXnew · submitted 2026-02-04 · 💻 cs.CL

Mechanistic Evidence for Faithfulness Decay in Chain-of-Thought Reasoning

Donald Ye , Max Loffgren , Om Kotadia , Linus Wong , Jonas Rohweder This is my paper

classification 💻 cs.CL

keywords modelreasoninganswernlddwhetheracrossactuallychain

0 comments

read the original abstract

Chain-of-Thought (CoT) explanations are widely used to interpret how language models solve complex problems, yet it remains unclear whether these step-by-step explanations reflect how the model actually reaches its answer, or merely post-hoc justifications. We propose Normalized Logit Difference Decay (NLDD), a metric that measures whether individual reasoning steps are faithful to the model's decision-making process. Our approach corrupts individual reasoning steps from the explanation and measures how much the model's confidence in its answer drops, to determine if a step is truly important. By standardizing these measurements, NLDD enables rigorous cross-model comparison across different architectures. Testing three model families across syntactic, logical, and arithmetic tasks, we discover a consistent Reasoning Horizon (k*) at 70--85% of chain length, beyond which reasoning tokens have little or negative effect on the final answer. We also find that models can encode correct internal representations while completely failing the task. These results show that accuracy alone does not reveal whether a model actually reasons through its chain. NLDD offers a way to measure when CoT matters.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

When Reasoning Traces Become Performative: Step-Level Evidence that Chain-of-Thought Is an Imperfect Oversight Channel
cs.AI 2026-05 unverdicted novelty 6.0

CoT traces align with internal answer commitment in only 61.9% of steps on average, dominated by confabulated continuations after commitment has stabilized.
Measuring and curing reasoning rigidity: from decorative chain-of-thought to genuine faithfulness
cs.CL 2026-03 unverdicted novelty 6.0

SLRC quantifies genuine step necessity in LLM reasoning as a causal estimator, LC-CoSR training reduces rigidity with stability guarantees, and evaluations reveal a faithfulness-sycophancy paradox across frontier models.
LLM Reasoning Is Latent, Not the Chain of Thought
cs.AI 2026-04 unverdicted novelty 5.0

LLM reasoning is primarily mediated by latent-state trajectories rather than by explicit surface chain-of-thought outputs.