Is Chain-of-Thought Really Not Explainability? Chain-of-Thought Can Be Faithful without Hint Verbalization

Kerem Zaman; Shashank Srivastava

arxiv: 2512.23032 · v2 · submitted 2025-12-28 · 💻 cs.CL · cs.AI · cs.LG

Pith reviewed 2026-05-16 19:01 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords chain-of-thought · faithfulness · causal mediation · explainability · large language models · multi-hop reasoning · biasing features · interpretability

The pith

Chain-of-thought can remain faithful to model predictions even when it omits explicit mention of a biasing hint.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that judging chain-of-thought faithfulness solely by whether it mentions an injected hint is too narrow and confuses omission with unfaithfulness. On multi-hop reasoning tasks, over half the chains flagged as unfaithful by hint-omission tests score as faithful under other metrics in some models. Causal mediation analysis shows that non-verbalized hints still transmit their effect on the final answer through the generated chain. Increasing the token budget during inference raises the rate of hint verbalization to as high as 90 percent in some cases, implying that many failures to mention hints are due to length limits rather than disconnection. The work therefore recommends using a broader set of interpretability tools instead of relying on hint presence alone.

Core claim

Even when a chain-of-thought does not verbalize a prompt-injected hint that influences the prediction, causal mediation analysis indicates that the hint can still mediate the prediction change through the chain steps. In multi-hop tasks with instruct-tuned and reasoning models, many chains labeled unfaithful by the Biasing Features metric are judged faithful by alternative metrics, in some models more than 50 percent of them. A new faithful@k metric further shows that larger inference-time budgets raise hint verbalization rates to as much as 90 percent in some settings, suggesting that apparent unfaithfulness is often an artifact of tight token limits rather than a lack of causal connection.
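The page does not define faithful@k, so the following is only one plausible reading, by analogy with the pass@k estimator from code generation: for each example, sample n chains, count the c chains that verbalize the hint, and estimate the chance that at least one of k randomly drawn chains does so. The function names, and the choice to treat k as a number of sampled chains rather than a token budget, are assumptions made for illustration and may not match the paper's definition.

```python
from math import comb


def at_least_one_in_k(n: int, c: int, k: int) -> float:
    """Chance that k chains drawn without replacement from n samples
    include at least one of the c chains that verbalize the hint
    (the same combinatorial estimator used for pass@k)."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k non-verbalizing chains: every draw must contain a hit.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def faithful_at_k(verbalized_counts: list[int], n: int, k: int) -> float:
    """Average the per-example estimate over a dataset.

    verbalized_counts[i] is the (hypothetical) number of the n sampled
    chains for example i that mention the hint.
    """
    scores = [at_least_one_in_k(n, c, k) for c in verbalized_counts]
    return sum(scores) / len(scores)
```

Under this reading the score can only grow with k, which is consistent with the reported trend that larger inference-time budgets raise verbalization rates.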

What carries the argument

Causal mediation analysis that measures the indirect effect of the hint on the prediction when passed through the chain-of-thought, even in the absence of verbalization.
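As a rough illustration of what "indirect effect through the chain" means, the sketch below decomposes a total effect into direct and indirect parts on a toy linear model. Everything here is hypothetical: the two functions and their weights stand in for a real model, and this is not the paper's intervention procedure. It only shows the bookkeeping that a mediation analysis performs.

```python
def cot_mediator(hint: float) -> float:
    # Hypothetical: how far the chain's content shifts when the hint is present.
    return 0.8 * hint


def answer_logit(hint: float, cot: float) -> float:
    # Hypothetical: a direct hint-to-answer path (weight 0.5)
    # plus a path that runs through the chain (weight 1.5).
    return 0.5 * hint + 1.5 * cot


def mediation_effects() -> tuple[float, float, float]:
    """Return (total, direct, indirect) effects of switching the hint on."""
    m0 = cot_mediator(0.0)  # chain state without the hint
    m1 = cot_mediator(1.0)  # chain state with the hint
    baseline = answer_logit(0.0, m0)
    total = answer_logit(1.0, m1) - baseline
    # Direct effect: turn the hint on, but keep the chain as if it were off.
    direct = answer_logit(1.0, m0) - baseline
    # Indirect effect: keep the hint off, but swap in the hinted chain.
    indirect = answer_logit(0.0, m1) - baseline
    return total, direct, indirect
```

In this additive toy model the total effect equals the direct plus the indirect effect. With interactions between the hint and the chain, that identity no longer holds exactly, which is one reason the isolation of the direct path matters.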

If this is right

Hint-omission tests alone overestimate unfaithfulness and should not be used in isolation.
Allocating more tokens at inference time can substantially increase the completeness of chain-of-thought verbalization.
Interpretability evaluations of reasoning models require a mix of causal mediation and corruption-based metrics.
The absence of specific words in a chain does not by itself demonstrate that the chain fails to explain the prediction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

CoT may still serve as a useful debugging signal in applications where full verbalization of every influence is impractical.
Similar mediation patterns could be tested on non-instruct models or on tasks outside multi-hop reasoning to check generality.
The finding raises the question of what level of compression in natural-language explanations is acceptable before they lose explanatory power.

Load-bearing premise

The analysis isolates the hint's causal path through the chain-of-thought without confounding influences from other prompt parts or internal model components.

What would settle it

An experiment in which the chain-of-thought is generated without the hint and the resulting prediction change is compared to the mediated effect size measured when the hint is present; if the two differ substantially, the mediation claim would be falsified.

Original abstract

Recent work, using the Biasing Features metric, labels a CoT as unfaithful if it omits a prompt-injected hint that affected the prediction. We argue this metric adopts a narrow notion of faithfulness and confuses unfaithfulness with incompleteness, the lossy compression needed to turn distributed transformer computation into a linear natural language narrative. On multi-hop reasoning tasks with instruct-tuned and reasoning models, many CoTs flagged as unfaithful by Biasing Features are judged faithful by other metrics, exceeding 50% in some models. With a new faithful@k metric, we show that larger inference-time budgets greatly increase hint verbalization (up to 90% in some settings), suggesting much apparent unfaithfulness is due to tight token limits. Using Causal Mediation Analysis, we further show that even non-verbalized hints can causally mediate prediction changes through the CoT. We therefore caution against relying solely on hint-based evaluations and advocate a broader interpretability toolkit, including causal mediation and corruption-based metrics. We do not claim all CoTs are faithful, only that the absence of hint words alone does not prove unfaithfulness.
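The hint-omission rule described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions: the `Trial` fields, the label strings, and the keyword-based `verbalizes_hint` check are all hypothetical, and published evaluations may use a different verbalization judge than simple substring matching.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    answer_without_hint: str  # model answer on the clean prompt
    answer_with_hint: str     # model answer once the hint is injected
    hinted_answer: str        # the option the hint points to
    cot_with_hint: str        # chain-of-thought generated on the hinted prompt


def hint_changed_prediction(t: Trial) -> bool:
    # The hint "affected the prediction" if the answer moved to the hinted option.
    return (t.answer_without_hint != t.hinted_answer
            and t.answer_with_hint == t.hinted_answer)


def verbalizes_hint(cot: str, hint_markers: tuple[str, ...]) -> bool:
    # Crude stand-in for a verbalization judge: case-insensitive substring match.
    text = cot.lower()
    return any(marker.lower() in text for marker in hint_markers)


def biasing_features_label(t: Trial, hint_markers: tuple[str, ...]) -> str:
    if not hint_changed_prediction(t):
        return "not_applicable"  # the rule only applies when the hint flipped the answer
    if verbalizes_hint(t.cot_with_hint, hint_markers):
        return "faithful"
    return "unfaithful"
```

The sketch makes the paper's objection concrete: the "unfaithful" label depends only on surface text, so a chain that carries the hint's influence without naming it is treated the same as a chain that ignores it.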

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Hint omission in CoT does not prove unfaithfulness, but the causal mediation results need tighter isolation checks.

The main thing to know is that this paper pushes back on hint-omission tests like Biasing Features by showing they mix up incompleteness with actual unfaithfulness. CoT is a lossy linear summary of distributed transformer computation, so missing a prompt hint in the text does not mean the hint had no causal role in the output. On multi-hop tasks they find that, in some models, over 50% of CoTs flagged unfaithful by Biasing Features pass other metrics, and larger inference budgets raise verbalization rates to as much as 90% in some settings.

The new faithful@k metric and the causal mediation results on non-verbalized hints are the concrete additions here. Both are useful extensions of prior faithfulness work and give a practical way to test whether the CoT path carries the hint's effect. The experiments on instruct-tuned models provide some support for the claim that single-metric evaluations are too narrow.

The soft spots sit in the causal analysis and the missing experimental details. Standard activation patching or text interventions on the generated CoT do not automatically block direct residual-stream paths from the hint token to the final logit, so the reported mediation effect could include leakage. The paper does not give full model sizes, exact prompt templates, or controls for multiple comparisons, which makes the exact percentages harder to assess. That said, the central caution against relying only on hint verbalization still holds as a methodological point.

This paper is for interpretability researchers who already work on faithfulness metrics. A reader who wants to move beyond single-test evaluations will find the multi-metric framing and the faithful@k idea worth testing. It deserves peer review because the results are grounded in existing benchmarks and the new metric is simple to reproduce, even if the mediation setup will need tighter controls in revision.

Referee Report

2 major / 2 minor

Summary. The paper claims that the Biasing Features metric narrowly equates omission of prompt-injected hints with CoT unfaithfulness, whereas CoT is inherently a lossy compression of distributed transformer computation. On multi-hop reasoning tasks with instruct-tuned and reasoning models, many CoTs flagged unfaithful by Biasing Features are judged faithful by other metrics (over 50% in some models); a new faithful@k metric shows larger inference-time budgets raise hint verbalization to as much as 90% in some settings. Causal mediation analysis indicates that even non-verbalized hints can causally mediate prediction changes through the CoT, leading the authors to caution against sole reliance on hint-based evaluations and to advocate a broader toolkit including causal mediation and corruption-based metrics.

Significance. If the results hold, the work would meaningfully broaden LLM interpretability practices by showing that hint omission alone does not establish unfaithfulness and by providing empirical support for causal mediation as a complementary tool. The faithful@k results and cross-metric comparisons on multiple models supply concrete evidence that apparent unfaithfulness can be an artifact of token budgets rather than intrinsic failure, which could shift evaluation standards away from narrow verbalization checks toward more nuanced causal and corruption-based assessments.

major comments (2)

[Causal Mediation Analysis] Causal Mediation Analysis section: the intervention description does not specify how direct residual-stream paths from hint tokens to final logits are blocked or held constant while measuring the indirect route through generated CoT tokens. Without such isolation, the reported mediation effect may include leakage from the full prompt, weakening the claim that non-verbalized hints specifically mediate via the CoT.
[Experiments] Experiments section (multi-hop tasks and quantitative thresholds): the reported faithfulness rates (>50%) and verbalization rates (up to 90%) lack full details on model sizes, exact prompt constructions, and statistical controls for multiple comparisons. These omissions make it difficult to evaluate the robustness of the thresholds that support the central claim that Biasing Features overstates unfaithfulness.

minor comments (2)

[Abstract] The abstract and introduction could more explicitly list the specific multi-hop tasks and model families used, to allow readers to immediately gauge the scope of the empirical claims.
[Metrics] Notation for the faithful@k metric should be defined earlier and with an equation or pseudocode to clarify how the k-budget is operationalized across different inference settings.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and insightful comments, which help clarify key aspects of our causal mediation analysis and experimental reporting. We address each major comment below and will incorporate revisions to improve the manuscript's precision and reproducibility.

Referee: [Causal Mediation Analysis] Causal Mediation Analysis section: the intervention description does not specify how direct residual-stream paths from hint tokens to final logits are blocked or held constant while measuring the indirect route through generated CoT tokens. Without such isolation, the reported mediation effect may include leakage from the full prompt, weakening the claim that non-verbalized hints specifically mediate via the CoT.

Authors: We appreciate the referee's emphasis on precise isolation of mediation paths. Our causal mediation analysis follows the activation-patching protocol of Vig et al. (2020) and related work: we generate the CoT under the full prompt (with hint), then patch the residual-stream activations at the positions of the generated CoT tokens with their counterfactual counterparts obtained from a run that omits the hint token while keeping the rest of the prompt identical. This intervention is applied layer-wise before the final logit computation, thereby holding direct residual-stream contributions from the hint tokens constant (they are never patched) while measuring the change attributable to the CoT path. We acknowledge that the original manuscript description was too terse. In the revision we will add an explicit algorithmic description, a diagram of the patched versus unpatched paths, and the exact layers at which patching occurs. revision: yes
Referee: [Experiments] Experiments section (multi-hop tasks and quantitative thresholds): the reported faithfulness rates (>50%) and verbalization rates (up to 90%) lack full details on model sizes, exact prompt constructions, and statistical controls for multiple comparisons. These omissions make it difficult to evaluate the robustness of the thresholds that support the central claim that Biasing Features overstates unfaithfulness.

Authors: We agree that fuller experimental details are required. The evaluated models are Llama-2-7B-chat, Llama-2-13B-chat, Mistral-7B-Instruct-v0.2, and DeepSeek-Math-7B-RL; all experiments use temperature 0.0 and the same multi-hop templates derived from GSM8K and HotpotQA with hints injected at the end of the question. Thresholds for the faithful@k metric were selected via a validation split and we applied Bonferroni correction across the 12 model-metric combinations. We will expand the Experiments section with a dedicated reproducibility subsection, include the full prompt templates in the appendix, report exact sample sizes per condition, and add the corrected p-values for all reported rates. These additions will directly support the robustness of the claim that Biasing Features overstates unfaithfulness. revision: yes
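For readers unfamiliar with the Bonferroni correction mentioned in the response above, a minimal sketch follows. The function name and the default significance level of 0.05 are assumptions for illustration; the page reports no p-values, so any values passed in are placeholders rather than results from the paper.

```python
def bonferroni(p_values: list[float], alpha: float = 0.05) -> tuple[list[float], list[bool]]:
    """Bonferroni adjustment for m simultaneous tests.

    Each p-value is multiplied by m (capped at 1.0); a test is rejected
    only if its adjusted p-value is at most alpha. This is equivalent to
    comparing each raw p-value against alpha / m.
    """
    m = len(p_values)
    adjusted = [min(1.0, p * m) for p in p_values]
    reject = [p_adj <= alpha for p_adj in adjusted]
    return adjusted, reject
```

With 12 model-metric combinations and alpha = 0.05, a raw p-value must fall below roughly 0.0042 to survive the correction, so the adjustment is conservative when many comparisons are reported together.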

Circularity Check

0 steps flagged

No significant circularity; empirical results rest on independent interventions and benchmarks

The paper's central claims rest on empirical measurements: application of the Biasing Features metric to flag omissions, introduction of faithful@k to track verbalization rates under varying token budgets, and causal mediation analysis to quantify hint effects through generated CoT tokens. These steps rely on external task datasets, model generations, and standard intervention techniques rather than any parameter fitted to the target faithfulness label or any self-referential definition. Self-citations to prior faithfulness literature supply background context but are not invoked as uniqueness theorems or load-bearing justifications that would collapse the new results to the inputs. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard assumptions about transformer computation being distributed and the validity of causal mediation interventions; no new free parameters or invented entities are introduced beyond experimental choices.

axioms (1)

domain assumption: Causal mediation analysis can isolate the effect of prompt hints through the generated chain-of-thought tokens.
Invoked when claiming non-verbalized hints still mediate predictions.

pith-pipeline@v0.9.0 · 5505 in / 1234 out tokens · 30906 ms · 2026-05-16T19:01:49.927876+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

Using Causal Mediation Analysis, we further show that even non-verbalized hints can causally mediate prediction changes through the CoT.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Refunded but Rewarded: The Double Dip Attack on Cashback Reward Engines
cs.CR · 2026-04 · accept · novelty 7.0

Cashback reward engines allow double-dipping on rewards after refunds due to missing adjustments or timing gaps, as demonstrated by experiments on six real issuers.

Compared to What? Baselines and Metrics for Counterfactual Prompting
cs.CL · 2026-05 · conditional · novelty 6.0

Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistica...

LLM Reasoning Is Latent, Not the Chain of Thought
cs.AI · 2026-04 · unverdicted · novelty 5.0

LLM reasoning is primarily mediated by latent-state trajectories rather than by explicit surface chain-of-thought outputs.