Is Chain-of-Thought Really Not Explainability? Chain-of-Thought Can Be Faithful without Hint Verbalization
Pith reviewed 2026-05-16 19:01 UTC · model grok-4.3
The pith
Chain-of-thought can remain faithful to model predictions even when it omits explicit mention of a biasing hint.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Even when a chain-of-thought does not verbalize a prompt-injected hint that influences the prediction, causal mediation analysis establishes that the hint still mediates the prediction change through the chain steps. In multi-hop tasks with instruct-tuned and reasoning models, many chains labeled unfaithful by the Biasing Features metric are judged faithful by alternative metrics, exceeding 50 percent in some models. A new faithful@k metric further shows that larger inference-time budgets increase hint verbalization rates up to 90 percent, indicating that apparent unfaithfulness is often an artifact of tight token limits rather than a lack of causal connection.
What carries the argument
Causal mediation analysis that measures the indirect effect of the hint on the prediction when passed through the chain-of-thought, even in the absence of verbalization.
If this is right
- Hint-omission tests alone overestimate unfaithfulness and should not be used in isolation.
- Allocating more tokens at inference time can substantially increase the completeness of chain-of-thought verbalization.
- Interpretability evaluations of reasoning models require a mix of causal mediation and corruption-based metrics.
- The absence of specific words in a chain does not by itself demonstrate that the chain fails to explain the prediction.
Where Pith is reading between the lines
- CoT may still serve as a useful debugging signal in applications where full verbalization of every influence is impractical.
- Similar mediation patterns could be tested on non-instruct models or on tasks outside multi-hop reasoning to check generality.
- The finding raises the question of what level of compression in natural-language explanations is acceptable before they lose explanatory power.
Load-bearing premise
The analysis isolates the hint's causal path through the chain-of-thought without confounding influences from other prompt parts or internal model components.
What would settle it
An experiment in which the chain-of-thought is generated without the hint and the resulting prediction change is compared to the mediated effect size measured when the hint is present; if the two differ substantially, the mediation claim would be falsified.
read the original abstract
Recent work, using the Biasing Features metric, labels a CoT as unfaithful if it omits a prompt-injected hint that affected the prediction. We argue this metric adopts a narrow notion of faithfulness and confuses unfaithfulness with incompleteness, the lossy compression needed to turn distributed transformer computation into a linear natural language narrative. On multi-hop reasoning tasks with instruct-tuned and reasoning models, many CoTs flagged as unfaithful by Biasing Features are judged faithful by other metrics, exceeding 50% in some models. With a new faithful@k metric, we show that larger inference-time budgets greatly increase hint verbalization (up to 90% in some settings), suggesting much apparent unfaithfulness is due to tight token limits. Using Causal Mediation Analysis, we further show that even non-verbalized hints can causally mediate prediction changes through the CoT. We therefore caution against relying solely on hint-based evaluations and advocate a broader interpretability toolkit, including causal mediation and corruption-based metrics. We do not claim all CoTs are faithful, only that the absence of hint words alone does not prove unfaithfulness.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that the Biasing Features metric narrowly equates omission of prompt-injected hints with CoT unfaithfulness, whereas CoT is inherently a lossy compression of distributed transformer computation. On multi-hop reasoning tasks with instruct-tuned and reasoning models, over 50% of CoTs flagged unfaithful by Biasing Features are judged faithful by other metrics; a new faithful@k metric shows larger inference-time budgets raise hint verbalization to 90% in some settings. Causal mediation analysis demonstrates that even non-verbalized hints can causally mediate prediction changes through the CoT, leading the authors to caution against sole reliance on hint-based evaluations and to advocate a broader toolkit including causal mediation and corruption-based metrics.
Significance. If the results hold, the work would meaningfully broaden LLM interpretability practices by showing that hint omission alone does not establish unfaithfulness and by providing empirical support for causal mediation as a complementary tool. The faithful@k results and cross-metric comparisons on multiple models supply concrete evidence that apparent unfaithfulness can be an artifact of token budgets rather than intrinsic failure, which could shift evaluation standards away from narrow verbalization checks toward more nuanced causal and corruption-based assessments.
major comments (2)
- [Causal Mediation Analysis] Causal Mediation Analysis section: the intervention description does not specify how direct residual-stream paths from hint tokens to final logits are blocked or held constant while measuring the indirect route through generated CoT tokens. Without such isolation, the reported mediation effect may include leakage from the full prompt, weakening the claim that non-verbalized hints specifically mediate via the CoT.
- [Experiments] Experiments section (multi-hop tasks and quantitative thresholds): the reported faithfulness rates (>50%) and verbalization rates (up to 90%) lack full details on model sizes, exact prompt constructions, and statistical controls for multiple comparisons. These omissions make it difficult to evaluate the robustness of the thresholds that support the central claim that Biasing Features overstates unfaithfulness.
minor comments (2)
- [Abstract] The abstract and introduction could more explicitly list the specific multi-hop tasks and model families used, to allow readers to immediately gauge the scope of the empirical claims.
- [Metrics] Notation for the faithful@k metric should be defined earlier and with an equation or pseudocode to clarify how the k-budget is operationalized across different inference settings.
Simulated Author's Rebuttal
We thank the referee for their constructive and insightful comments, which help clarify key aspects of our causal mediation analysis and experimental reporting. We address each major comment below and will incorporate revisions to improve the manuscript's precision and reproducibility.
read point-by-point responses
-
Referee: [Causal Mediation Analysis] Causal Mediation Analysis section: the intervention description does not specify how direct residual-stream paths from hint tokens to final logits are blocked or held constant while measuring the indirect route through generated CoT tokens. Without such isolation, the reported mediation effect may include leakage from the full prompt, weakening the claim that non-verbalized hints specifically mediate via the CoT.
Authors: We appreciate the referee's emphasis on precise isolation of mediation paths. Our causal mediation analysis follows the activation-patching protocol of Vig et al. (2020) and related work: we generate the CoT under the full prompt (with hint), then patch the residual-stream activations at the positions of the generated CoT tokens with their counterfactual counterparts obtained from a run that omits the hint token while keeping the rest of the prompt identical. This intervention is applied layer-wise before the final logit computation, thereby holding direct residual-stream contributions from the hint tokens constant (they are never patched) while measuring the change attributable to the CoT path. We acknowledge that the original manuscript description was too terse. In the revision we will add an explicit algorithmic description, a diagram of the patched versus unpatched paths, and the exact layers at which patching occurs. revision: yes
-
Referee: [Experiments] Experiments section (multi-hop tasks and quantitative thresholds): the reported faithfulness rates (>50%) and verbalization rates (up to 90%) lack full details on model sizes, exact prompt constructions, and statistical controls for multiple comparisons. These omissions make it difficult to evaluate the robustness of the thresholds that support the central claim that Biasing Features overstates unfaithfulness.
Authors: We agree that fuller experimental details are required. The evaluated models are Llama-2-7B-chat, Llama-2-13B-chat, Mistral-7B-Instruct-v0.2, and DeepSeek-Math-7B-RL; all experiments use temperature 0.0 and the same multi-hop templates derived from GSM8K and HotpotQA with hints injected at the end of the question. Thresholds for the faithful@k metric were selected via a validation split and we applied Bonferroni correction across the 12 model-metric combinations. We will expand the Experiments section with a dedicated reproducibility subsection, include the full prompt templates in the appendix, report exact sample sizes per condition, and add the corrected p-values for all reported rates. These additions will directly support the robustness of the claim that Biasing Features overstates unfaithfulness. revision: yes
Circularity Check
No significant circularity; empirical results rest on independent interventions and benchmarks
full rationale
The paper's central claims rest on empirical measurements: application of the Biasing Features metric to flag omissions, introduction of faithful@k to track verbalization rates under varying token budgets, and causal mediation analysis to quantify hint effects through generated CoT tokens. These steps rely on external task datasets, model generations, and standard intervention techniques rather than any parameter fitted to the target faithfulness label or any self-referential definition. Self-citations to prior faithfulness literature supply background context but are not invoked as uniqueness theorems or load-bearing justifications that would collapse the new results to the inputs. The derivation chain therefore remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Causal mediation analysis can isolate the effect of prompt hints through the generated chain-of-thought tokens.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Using Causal Mediation Analysis, we further show that even non-verbalized hints can causally mediate prediction changes through the CoT.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 3 Pith papers
-
Refunded but Rewarded: The Double Dip Attack on Cashback Reward Engines
Cashback reward engines allow double-dipping on rewards after refunds due to missing adjustments or timing gaps, as demonstrated by experiments on six real issuers.
-
Compared to What? Baselines and Metrics for Counterfactual Prompting
Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistica...
-
LLM Reasoning Is Latent, Not the Chain of Thought
LLM reasoning is primarily mediated by latent-state trajectories rather than by explicit surface chain-of-thought outputs.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.