Chain-of-Thought Reasoning In The Wild Is Not Always Faithful

Arthur Conmy; Iv\'an Arcuschin; Jett Janiak; Neel Nanda; Robert Krzyzanowski; Senthooran Rajamanoharan

arxiv: 2503.08679 · v5 · pith:35Y2AJGVnew · submitted 2025-03-11 · 💻 cs.AI · cs.CL· cs.LG

Chain-of-Thought Reasoning In The Wild Is Not Always Faithful

Iv\'an Arcuschin , Jett Janiak , Robert Krzyzanowski , Senthooran Rajamanoharan , Neel Nanda , Arthur Conmy This is my paper

classification 💻 cs.AI cs.CLcs.LG

keywords modelsbiasesreasoningbiggerchain-of-thoughtfaithfulillogicalimplicit

0 comments

read the original abstract

Recent studies indicate that when faced with explicit biases in prompts, models often omit mentioning these biases in their Chain-of-Thought (CoT) output, revealing that verbalized reasoning can give an incorrect picture of how models arrive at conclusions (unfaithfulness). In this work, we show that unfaithful CoT also occurs on naturally worded, non-adversarial prompts without adding artificial biases or editing model outputs. We find that when separately presented with the questions "Is X bigger than Y?" and "Is Y bigger than X?", models sometimes produce superficially coherent arguments to justify systematically answering Yes to both or No to both, despite the contradiction. We present preliminary evidence that this is due to models' implicit biases towards Yes or No, labeling this Implicit Post-Hoc Rationalization. Our results reveal rates up to 13% for production models, and while frontier models are more faithful, none are entirely so, including thinking models like DeepSeek R1 (0.37%) and Sonnet 3.7 with thinking (0.04%). We also investigate Unfaithful Illogical Shortcuts, where models use subtly illogical reasoning to make speculative answers to hard math problems seem rigorously proven. Our findings indicate that while CoT can be useful for assessing outputs, it is not a complete account of the internal process that produced the model's answer and should be used with caution in agentic or safety-critical settings.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

The Readout Shortcut: Positional Number Copying Dominates Arithmetic CoT Readout in Small Language Models
cs.LG 2026-05 unverdicted novelty 7.0

In 1-3B instruction-tuned LMs on GSM8K, arithmetic CoT readout is dominated by positional copying of the trailing number before the answer delimiter, accounting for 54-92 percentage points of accuracy.
Deep Reasoning in General Purpose Agents via Structured Meta-Cognition
cs.CL 2026-05 unverdicted novelty 7.0

DOLORES, an agent using a formal language for meta-reasoning to construct adaptive scaffolds on the fly, outperforms prior scaffolding methods by 24.8% on average across four hard benchmarks and multiple model sizes.
Scaling Latent Reasoning via Looped Language Models
cs.CL 2025-10 unverdicted novelty 7.0

Looped language models with latent iterative computation and entropy-regularized depth allocation achieve performance matching up to 12B standard LLMs through superior knowledge manipulation.
Diagnosing Multi-step Reasoning Failures in Black-box LLMs via Stepwise Confidence Attribution
cs.CL 2026-05 unverdicted novelty 6.0

SCA framework applies Information Bottleneck to assign step-level confidence in black-box LLM reasoning traces, flagging errors and boosting self-correction success by up to 13.5% on math and QA tasks.
When Reasoning Traces Become Performative: Step-Level Evidence that Chain-of-Thought Is an Imperfect Oversight Channel
cs.AI 2026-05 unverdicted novelty 6.0

CoT traces align with internal answer commitment in only 61.9% of steps on average, dominated by confabulated continuations after commitment has stabilized.
Evaluating the False Trust Engendered by LLM Explanations
cs.HC 2026-05 unverdicted novelty 6.0

A user study finds that LLM reasoning traces and post-hoc explanations create false trust by increasing acceptance of incorrect answers, whereas contrastive dual explanations improve users' ability to detect errors.
Confidence-Aware Alignment Makes Reasoning LLMs More Reliable
cs.AI 2026-05 unverdicted novelty 6.0

CASPO trains LLMs via iterative direct preference optimization so that token-level confidence tracks step-wise correctness, then applies Confidence-aware Thought pruning at inference to improve both reliability and sp...
Compared to What? Baselines and Metrics for Counterfactual Prompting
cs.CL 2026-05 conditional novelty 6.0

Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistica...
ReSS: Learning Reasoning Models for Tabular Data Prediction via Symbolic Scaffold
cs.AI 2026-04 unverdicted novelty 6.0

ReSS extracts decision paths from trees as scaffolds to guide LLM reasoning generation, fine-tunes the LLM on the resulting dataset with scaffold-invariant augmentation, and reports up to 10% gains on medical and fina...
ReSS: Learning Reasoning Models for Tabular Data Prediction via Symbolic Scaffold
cs.AI 2026-04 unverdicted novelty 6.0

ReSS uses decision-tree scaffolds to fine-tune LLMs for faithful tabular reasoning, reporting up to 10% gains over baselines on medical and financial data.
When AI Persuades: Adversarial Explanation Attacks on Human Trust in AI-Assisted Decision Making
cs.AI 2026-02 unverdicted novelty 6.0

Adversarial explanation attacks preserve nearly all human trust in wrong AI outputs by using persuasive framing, shown in a study varying reasoning, evidence, style, and format with over 200 participants.
Interpretability from the Ground Up: Stakeholder-Centric Design of Automated Scoring in Educational Assessments
cs.CL 2025-11 unverdicted novelty 6.0

AnalyticScore applies new FGTI interpretability principles to text-based scoring and achieves accuracy within 0.06 QWK of uninterpretable state-of-the-art while matching human featurization on the ASAP-SAS dataset.
Can Aha Moments Be Fake? Towards Quantifying Decorative and True Thinking in Chain-of-Thought
cs.LG 2025-10 unverdicted novelty 6.0

LLMs interleave true causal reasoning steps with decorative ones in CoT, with only ~2.3% of steps having high causal impact on AIME for Qwen-2.5, and a steering direction can force internal use of specific steps.
TinyTroupe: An LLM-powered Multiagent Persona Simulation Toolkit
cs.MA 2025-07 accept novelty 6.0

TinyTroupe provides a toolkit for fine-grained persona-based LLM multi-agent simulations with built-in support for population sampling, experimentation, and validation.
Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs
cs.AI 2025-06 unverdicted novelty 6.0

RLVR incentivizes correct reasoning in base LLMs, extending reasoning boundaries on math and coding tasks as shown by CoT-Pass@K evaluations and a theoretical incentive framework.
Evaluating the False Trust Engendered by LLM Explanations
cs.HC 2026-05 unverdicted novelty 5.0

LLM reasoning traces and post-hoc explanations increase false trust in incorrect predictions, whereas contrastive dual explanations enhance users' ability to distinguish correct from incorrect AI outputs.
LLM Reasoning Is Latent, Not the Chain of Thought
cs.AI 2026-04 unverdicted novelty 5.0

LLM reasoning is primarily mediated by latent-state trajectories rather than by explicit surface chain-of-thought outputs.
KnowRL: Exploring Knowledgeable Reinforcement Learning for Factuality
cs.AI 2025-06 unverdicted novelty 5.0

KnowRL integrates a knowledge-verification factuality reward into RL training to enforce fact-based reasoning steps and lower hallucination rates in LLMs.
Phi-4-reasoning Technical Report
cs.AI 2025-04 unverdicted novelty 4.0

A 14B reasoning model trained via supervised fine-tuning on selected prompts and o3-mini traces, plus outcome RL, outperforms larger open models like DeepSeek-R1-Distill-Llama-70B on math, coding, planning and related...
A Survey of Reinforcement Learning for Large Reasoning Models
cs.CL 2025-09 accept novelty 3.0

A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.