Chain-of-Thought Reasoning Without Prompting

Denny Zhou; Xuezhi Wang

arxiv: 2402.10200 · v2 · pith:PXBOOGWFnew · submitted 2024-02-15 · 💻 cs.CL

classification 💻 cs.CL

keywords reasoning, prompting, decoding, llms, effectively, paths, approach, capabilities

Abstract

In enhancing the reasoning capabilities of large language models (LLMs), prior research primarily focuses on specific prompting techniques such as few-shot or zero-shot chain-of-thought (CoT) prompting. These methods, while effective, often involve manually intensive prompt engineering. Our study takes a novel approach by asking: Can LLMs reason effectively without prompting? Our findings reveal that, intriguingly, CoT reasoning paths can be elicited from pre-trained LLMs by simply altering the decoding process. Rather than conventional greedy decoding, we investigate the top-k alternative tokens, uncovering that CoT paths are frequently inherent in these sequences. This approach not only bypasses the confounders of prompting but also allows us to assess the LLMs' intrinsic reasoning abilities. Moreover, we observe that the presence of a CoT in the decoding path correlates with a higher confidence in the model's decoded answer. This confidence metric effectively differentiates between CoT and non-CoT paths. Extensive empirical studies on various reasoning benchmarks show that the proposed CoT-decoding effectively elicits reasoning capabilities from language models, which were previously obscured by standard greedy decoding.
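The decoding idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and the abstract does not define the confidence metric. The sketch assumes three things: branching happens only at the first decoding step, each branch is then continued greedily, and confidence is the probability margin between the chosen answer token and its best competitor, with the last numeric token in a path treated as the answer. The names `toy_model`, `TOY`, `margin`, `decode_branch`, and `cot_decode` are hypothetical, and the lookup table stands in for a real language model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Dist = Dict[str, float]
NextTokenFn = Callable[[Tuple[str, ...]], Dist]  # hypothetical model interface
EOS = "<eos>"


@dataclass
class Path:
    tokens: List[str]
    answer: Optional[str]
    confidence: float


def margin(dist: Dist, chosen: str) -> float:
    """Probability of the chosen token minus that of its best competitor."""
    others = [p for tok, p in dist.items() if tok != chosen]
    return dist[chosen] - (max(others) if others else 0.0)


def decode_branch(next_token: NextTokenFn, first: str, first_dist: Dist,
                  max_len: int = 32) -> Path:
    """Force `first` as the first token, then continue greedily until EOS."""
    tokens = [first]
    margins = [margin(first_dist, first)]
    while len(tokens) < max_len:
        dist = next_token(tuple(tokens))
        tok = max(dist, key=dist.get)
        if tok == EOS:
            break
        tokens.append(tok)
        margins.append(margin(dist, tok))
    # Assumption: the last numeric token in the path is the answer.
    answer_positions = [i for i, t in enumerate(tokens) if t.isdigit()]
    if not answer_positions:
        return Path(tokens, None, 0.0)
    last = answer_positions[-1]
    return Path(tokens, tokens[last], margins[last])


def cot_decode(next_token: NextTokenFn, k: int) -> List[Path]:
    """Branch on the top-k first tokens and rank paths by answer confidence."""
    first_dist = next_token(())
    candidates = sorted(first_dist, key=first_dist.get, reverse=True)[:k]
    paths = [decode_branch(next_token, tok, first_dist) for tok in candidates]
    return sorted(paths, key=lambda p: p.confidence, reverse=True)


# Toy next-token table (made-up probabilities, for illustration only).
TOY: Dict[Tuple[str, ...], Dist] = {
    (): {"5": 0.40, "I": 0.35, "We": 0.25},
    ("5",): {EOS: 1.0},
    ("I",): {"3": 0.9, "5": 0.1},
    ("I", "3"): {"+": 1.0},
    ("I", "3", "+"): {"5": 0.9, "2": 0.1},
    ("I", "3", "+", "5"): {"=": 1.0},
    ("I", "3", "+", "5", "="): {"8": 0.95, "7": 0.05},
    ("I", "3", "+", "5", "=", "8"): {EOS: 1.0},
    ("We",): {"have": 1.0},
    ("We", "have"): {"5": 0.5, "8": 0.45, "7": 0.05},
    ("We", "have", "5"): {EOS: 1.0},
}


def toy_model(prefix: Tuple[str, ...]) -> Dist:
    return TOY[prefix]


greedy = cot_decode(toy_model, k=1)[0]  # k=1 reduces to plain greedy decoding
best = cot_decode(toy_model, k=3)[0]    # highest-confidence path among 3 branches
print("greedy:", " ".join(greedy.tokens), "->", greedy.answer)
print("top-3 :", " ".join(best.tokens), "->", best.answer)
```

In this toy table, greedy decoding commits to a direct answer with a narrow margin, while a lower-ranked first token leads to a path that works through intermediate steps and ends on an answer token with a wide margin. Ranking by that margin surfaces the step-by-step path, which is the behaviour the abstract describes.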

Forward citations

Cited by 8 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Agentic Transformers Provably Learn to Search via Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 7.0
In a stochastic k-ary tree, a two-head transformer learns randomized DFS via policy gradient under depth-wise curriculum, generalizes to deeper trees, and adapts to imbalanced goals via discounting.

BOOKMARKS: Efficient Active Storyline Memory for Role-playing
cs.CL 2026-05 unverdicted novelty 7.0
BOOKMARKS introduces searchable bookmarks as reusable answers to storyline questions, enabling active initialization and passive synchronization for more consistent role-playing agent memory than recurrent summarization.

Compared to What? Baselines and Metrics for Counterfactual Prompting
cs.CL 2026-05 conditional novelty 6.0
Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistica...

HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering
cs.AI 2026-04 unverdicted novelty 6.0
HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.

Toxic Subword Pruning for Dialogue Response Generation on Large Language Models
cs.CL 2024-10 unverdicted novelty 6.0
ToxPrune prunes toxic subwords from BPE tokenizers in LLMs to mitigate toxic dialogue responses and improve diversity on both toxic and non-toxic models.

ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning
cs.CV 2025-07 unverdicted novelty 5.0
ThinkAct introduces reinforced visual latent planning in a dual VLA system to enable better long-horizon reasoning and adaptation for embodied tasks.

The Periodic Table of LLM Reasoning: A Structured Survey of Reasoning Paradigms, Methods, and Failure Modes
cs.CL 2026-06 unverdicted novelty 4.0
A literature survey that introduces a taxonomy for LLM reasoning paradigms, analyzes methodological trends, and synthesizes failure modes from over 300 papers.

A Survey of Scaling in Large Language Model Reasoning
cs.AI 2025-04 unverdicted novelty 3.0
A survey categorizing scaling in LLM reasoning across input size, steps, rounds, training, and future directions, noting that scaling can negatively affect performance.