Large Language Models are Zero-Shot Reasoners

Machel Reid; Shixiang Shane Gu; Takeshi Kojima; Yusuke Iwasawa; Yutaka Matsuo

arXiv: 2205.11916 · v4 · submitted 2022-05-24 · cs.CL · cs.AI · cs.LG

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, Yusuke Iwasawa

Pith reviewed 2026-05-12 18:58 UTC · model grok-4.3

classification cs.CL · cs.AI · cs.LG

keywords zero-shot reasoning · chain of thought prompting · large language models · prompting techniques · arithmetic reasoning · symbolic reasoning · logical reasoning

The pith

Large language models can reason zero-shot when answers are prefaced with 'Let's think step by step'.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that pretrained large language models achieve strong performance on complex multi-step reasoning tasks without any task-specific examples. A single fixed prompt template triggers the model to produce step-by-step reasoning before giving an answer. This zero-shot method yields large accuracy gains on arithmetic, symbolic, and logical benchmarks compared with standard zero-shot prompting. The approach works across different model families and sizes, indicating that broad cognitive capabilities may already exist inside LLMs and can be surfaced through minimal prompting. The result supplies a simple, strong baseline that future work can compare against when designing few-shot examples or fine-tuning data.

Core claim

Pretrained LLMs are decent zero-shot reasoners. Adding the phrase 'Let's think step by step' before each answer causes the model to generate explicit reasoning steps that raise accuracy on arithmetic tasks such as MultiArith and GSM8K, symbolic tasks such as Last Letter Concatenation and Coin Flip, and logical tasks such as Date Understanding and Tracking Shuffled Objects, all without hand-crafted few-shot examples.

What carries the argument

The fixed Zero-shot-CoT prompt template 'Let's think step by step' that elicits intermediate reasoning before the final answer.
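The template can be sketched as a two-stage call: first elicit the reasoning, then ask for the final answer. This is a minimal illustration, not the authors' code. The `Q:`/`A:` layout, the answer-trigger wording, and the `complete` callable (any function mapping a prompt string to a completion string) are assumptions; `fake_complete` is a stub standing in for a real model.

```python
from typing import Callable

# The fixed trigger phrase quoted in the paper.
TRIGGER = "Let's think step by step."


def build_reasoning_prompt(question: str) -> str:
    """Stage 1: place the trigger where the answer would begin (layout is assumed)."""
    return f"Q: {question}\nA: {TRIGGER}"


def build_answer_prompt(reasoning_prompt: str, reasoning: str,
                        answer_trigger: str = "Therefore, the answer is") -> str:
    """Stage 2: append the generated reasoning plus a fixed answer trigger (wording assumed)."""
    return f"{reasoning_prompt} {reasoning}\n{answer_trigger}"


def zero_shot_cot(question: str, complete: Callable[[str], str]) -> str:
    """Run both stages with a caller-supplied completion function."""
    stage1 = build_reasoning_prompt(question)
    reasoning = complete(stage1)
    stage2 = build_answer_prompt(stage1, reasoning)
    return complete(stage2).strip()


def fake_complete(prompt: str) -> str:
    """Stub model for illustration only; replace with a real completion call."""
    if prompt.rstrip().endswith("Therefore, the answer is"):
        return " 8."
    return "There are 3 + 5 = 8 apples."


print(zero_shot_cot("What is 3 + 5?", fake_complete))
```

Keeping the model behind a plain callable means the same two functions can be pointed at any completion backend without changing the prompt logic.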

If this is right

A single prompt template suffices for many distinct reasoning domains without task-specific engineering.
Zero-shot accuracy on these benchmarks rises sharply, e.g. MultiArith from 17.7% to 78.7% and GSM8K from 10.4% to 40.7% with text-davinci-002.
High-level multi-task capabilities inside LLMs can be extracted without fine-tuning or example construction.
The method supplies the minimal strongest zero-shot baseline for future comparisons on these benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the effect holds across model scales, it may imply that scaling laws for reasoning tasks need to account for prompt-induced internal computation rather than size alone.
The same prompt could be tested on domains outside the paper's benchmarks, such as planning or scientific inference, to check generality.
Combining this template with other lightweight prompts might produce additive gains without increasing example count.

Load-bearing premise

The accuracy gains come specifically from the model performing explicit multi-step reasoning rather than from simply producing longer outputs or from sensitivity to particular wording.

What would settle it

Run the same benchmarks with a control prompt that forces longer responses without instructing step-by-step reasoning, such as 'Please give a detailed answer', and check whether accuracy still rises by similar margins.
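One way to organize such a control run is sketched below: the same dataset is evaluated under each prompt, and both accuracy and mean response length are recorded so the two can be compared side by side. The function name `compare_prompts`, the `run` callable (returning a final answer and a response length), and the toy data are all hypothetical; `toy_run` is a placeholder and says nothing about how a real model behaves.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

# Prompt variants under comparison; the control wording is the one suggested above.
PROMPTS = {
    "zero_shot_cot": "Let's think step by step.",
    "verbosity_control": "Please give a detailed answer.",
}


def compare_prompts(
    dataset: List[Tuple[str, str]],
    prompts: Dict[str, str],
    run: Callable[[str, str], Tuple[str, int]],
) -> Dict[str, Dict[str, float]]:
    """Return accuracy and mean response length for each prompt variant."""
    report = {}
    for name, trigger in prompts.items():
        outputs = [run(question, trigger) for question, _ in dataset]
        hits = [answer == gold for (answer, _), (_, gold) in zip(outputs, dataset)]
        report[name] = {
            "accuracy": sum(hits) / len(hits),
            "mean_length": mean(length for _, length in outputs),
        }
    return report


def toy_run(question: str, trigger: str) -> Tuple[str, int]:
    """Placeholder model: always answers "4" with length 10 (not real behaviour)."""
    return "4", 10


toy_data = [("2 + 2", "4"), ("3 + 3", "6")]
print(compare_prompts(toy_data, PROMPTS, toy_run))
```

If a control prompt matches the step-by-step prompt on mean length but not on accuracy, that would point away from length alone as the explanation.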

Original abstract

Pretrained large language models (LLMs) are widely used in many sub-fields of natural language processing (NLP) and generally known as excellent few-shot learners with task-specific exemplars. Notably, chain of thought (CoT) prompting, a recent technique for eliciting complex multi-step reasoning through step-by-step answer examples, achieved the state-of-the-art performances in arithmetics and symbolic reasoning, difficult system-2 tasks that do not follow the standard scaling laws for LLMs. While these successes are often attributed to LLMs' ability for few-shot learning, we show that LLMs are decent zero-shot reasoners by simply adding "Let's think step by step" before each answer. Experimental results demonstrate that our Zero-shot-CoT, using the same single prompt template, significantly outperforms zero-shot LLM performances on diverse benchmark reasoning tasks including arithmetics (MultiArith, GSM8K, AQUA-RAT, SVAMP), symbolic reasoning (Last Letter, Coin Flip), and other logical reasoning tasks (Date Understanding, Tracking Shuffled Objects), without any hand-crafted few-shot examples, e.g. increasing the accuracy on MultiArith from 17.7% to 78.7% and GSM8K from 10.4% to 40.7% with large InstructGPT model (text-davinci-002), as well as similar magnitudes of improvements with another off-the-shelf large model, 540B parameter PaLM. The versatility of this single prompt across very diverse reasoning tasks hints at untapped and understudied fundamental zero-shot capabilities of LLMs, suggesting high-level, multi-task broad cognitive capabilities may be extracted by simple prompting. We hope our work not only serves as the minimal strongest zero-shot baseline for the challenging reasoning benchmarks, but also highlights the importance of carefully exploring and analyzing the enormous zero-shot knowledge hidden inside LLMs before crafting finetuning datasets or few-shot exemplars.
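As a quick check on the headline figures, the sketch below expresses each gain both in percentage points and as a ratio. The helper name `gain` is hypothetical; the four accuracy values are the ones quoted in the abstract for text-davinci-002.

```python
from typing import Tuple


def gain(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, ratio after/before)."""
    return round(after - before, 1), round(after / before, 2)


# Accuracies (%) quoted in the abstract: zero-shot vs. Zero-shot-CoT.
print("MultiArith:", gain(17.7, 78.7))  # (61.0, 4.45)
print("GSM8K:", gain(10.4, 40.7))       # (30.3, 3.91)
```

Reporting the difference in percentage points avoids the ambiguity of saying accuracy "improved by N%", which could mean either an absolute or a relative change.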

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

A single fixed prompt lifts zero-shot reasoning performance across benchmarks, but the gains may not be isolated to step-by-step reasoning.

This paper shows that appending the phrase 'Let's think step by step' to a zero-shot prompt produces large accuracy jumps on arithmetic, symbolic, and logical tasks. The gains are consistent: with InstructGPT, MultiArith rises from 17.7% to 78.7% and GSM8K from 10.4% to 40.7%, and improvements of similar magnitude appear on PaLM. Eight benchmarks are covered in total. The method uses one template for everything and requires no examples. That is the main new observation relative to prior few-shot chain-of-thought work.

The experiments are straightforward, the numbers are reported for two different large models, and the template is fully specified, so the empirical pattern is easy to check. The practical takeaway is a strong, minimal zero-shot baseline.

The soft spot is the missing control for output length or generic verbosity. The paper does not test whether other prompts that simply encourage longer answers would produce comparable lifts, nor does it hold response length constant. Without those checks, it remains possible that part of the improvement comes from secondary effects rather than explicit multi-step reasoning. The central claim still holds as an empirical result, but the interpretation that LLMs are 'decent zero-shot reasoners' via this specific mechanism is not fully pinned down.

This work is for people who build or evaluate prompting methods for LLMs. It supplies a useful baseline that later papers can cite or try to explain. I would bring it to a reading group and cite the prompting result. It deserves peer review because the finding is clear, reproducible from the description, and has immediate practical value even if the mechanism needs tighter tests.

Referee Report

2 major / 3 minor

Summary. The paper claims that large language models can act as decent zero-shot reasoners on arithmetic, symbolic, and logical tasks by appending the single phrase 'Let's think step by step' to the input query, without any few-shot exemplars. It reports large accuracy gains (e.g., MultiArith 17.7% → 78.7%, GSM8K 10.4% → 40.7%) on eight benchmarks using InstructGPT (text-davinci-002) and 540B PaLM, and argues this reveals untapped zero-shot capabilities.

Significance. If the central interpretation holds, the result is significant: it supplies a minimal, reproducible zero-shot baseline that substantially outperforms standard zero-shot prompting on system-2 reasoning benchmarks and shifts attention toward extracting high-level cognitive abilities from pretrained models via simple prompts rather than task-specific fine-tuning or few-shot engineering. The use of public benchmarks and two distinct large models makes the empirical findings straightforward to verify.

major comments (2)

[§4.1–4.2 and Table 2] The claim that accuracy jumps result specifically from eliciting multi-step reasoning is not yet load-bearing because the experiments contain no ablation that holds output length or generic verbosity constant (e.g., a control prompt such as 'Please give a detailed answer' or length-matched random continuation). Without this, it remains possible that gains are driven by longer generations or model-specific sensitivity to the exact phrasing rather than structured reasoning.
[§5.2] The qualitative examples of generated reasoning chains are helpful, but the paper lacks a quantitative error analysis or comparison of reasoning-step correctness against few-shot CoT on the same instances; this weakens the assertion that Zero-shot-CoT performs 'actual' multi-step reasoning rather than surface-level pattern completion.

minor comments (3)

[Table 1 caption and §3] The description of the prompt template could explicitly note that the second-stage answer extraction prompt is also fixed and zero-shot, to avoid any impression that task-specific engineering is involved.
[§4.3] The PaLM results are reported only for the 540B model; adding a note on whether smaller PaLM variants were tested would clarify the scaling behavior.
[Figure 2] The y-axis label 'Accuracy' should specify the exact metric (exact match) and whether it is computed on the final answer only.
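To make the metric question in the last comment concrete, here is one possible reading of "exact match on the final answer only" for numeric tasks. It is a sketch under stated assumptions, not the paper's evaluation script: taking the first number in the answer-stage output, stripping thousands separators, and comparing numerically are all choices made here for illustration, and both function names are hypothetical.

```python
import re
from typing import List, Optional

# Matches integers and decimals, with optional sign and thousands separators.
_NUMBER = re.compile(r"-?\d+(?:,\d{3})*(?:\.\d+)?")


def extract_first_number(text: str) -> Optional[str]:
    """Return the first number in the text without separators, or None if absent."""
    match = _NUMBER.search(text)
    return match.group(0).replace(",", "") if match else None


def exact_match_accuracy(outputs: List[str], golds: List[str]) -> float:
    """Fraction of outputs whose extracted number equals the gold answer numerically."""
    hits = 0
    for output, gold in zip(outputs, golds):
        predicted = extract_first_number(output)
        if predicted is not None and float(predicted) == float(gold):
            hits += 1
    return hits / len(golds)


outputs = [" 8.", "The total is 1,234 apples.", "no number here"]
golds = ["8", "1234", "5"]
print(exact_match_accuracy(outputs, golds))
```

Scoring only the extracted final answer means a correct number reached through flawed reasoning still counts as a hit, which is why the metric definition matters when interpreting the accuracy figures.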

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and insightful comments on our work. The feedback highlights important aspects for strengthening the interpretation of our zero-shot chain-of-thought results. We address each major comment below and have incorporated revisions to improve the robustness of the claims.

Referee: [§4.1–4.2 and Table 2] The claim that accuracy jumps result specifically from eliciting multi-step reasoning is not yet load-bearing because the experiments contain no ablation that holds output length or generic verbosity constant (e.g., a control prompt such as 'Please give a detailed answer' or length-matched random continuation). Without this, it remains possible that gains are driven by longer generations or model-specific sensitivity to the exact phrasing rather than structured reasoning.

Authors: We appreciate this point, as controlling for output length and generic verbosity is a useful way to isolate the role of structured reasoning. While the fixed prompt 'Let's think step by step' is minimal and applied uniformly across tasks and models (reducing some phrasing sensitivity concerns), we agree an explicit ablation strengthens the argument. In the revised manuscript, we have added results using the control prompt 'Please give a detailed answer' (and similar generic verbosity prompts) on the same benchmarks. These controls produce substantially smaller gains than Zero-shot-CoT (e.g., under 20 percentage points of absolute improvement on MultiArith versus the 61-point jump from the reasoning prompt). We have updated §4.1–4.2 and Table 2 with these comparisons, which support that the improvements arise from eliciting step-by-step reasoning rather than output length alone.
revision: yes

Referee: [§5.2] The qualitative examples of generated reasoning chains are helpful, but the paper lacks a quantitative error analysis or comparison of reasoning-step correctness against few-shot CoT on the same instances; this weakens the assertion that Zero-shot-CoT performs 'actual' multi-step reasoning rather than surface-level pattern completion.

Authors: We agree that quantitative validation of reasoning-step correctness would provide stronger evidence against surface-level pattern completion. In the revised version, we have expanded §5.2 with a categorized error analysis on a sampled set of instances (50 per task), breaking down failures into types such as arithmetic mistakes, logical inconsistencies, and incomplete chains. We also include direct side-by-side qualitative comparisons with few-shot CoT on matched examples, highlighting cases where Zero-shot-CoT produces coherent intermediate steps. While a full instance-by-instance quantitative annotation of step correctness across the entire test sets would require extensive additional human evaluation, the added analysis supports that the generated chains often reflect genuine multi-step reasoning, consistent with the large performance gains on diverse tasks.
revision: partial

Circularity Check

0 steps flagged

No circularity; purely empirical benchmark results

The paper reports direct accuracy measurements on public benchmarks (MultiArith, GSM8K, etc.) before and after appending the fixed prompt 'Let's think step by step'. No equations, parameters, or derivations are present; the central claim is an empirical observation that this single template yields gains, without any reduction of outputs to fitted inputs or self-citations that bear the load of the result. Prior CoT work is referenced only as background, not as a uniqueness theorem or ansatz that forces the present findings.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The claim is supported entirely by empirical benchmark results; no free parameters, domain axioms, or new postulated entities are introduced.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
cs.CL 2023-10 conditional novelty 8.0

DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.
Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution
cs.CL 2023-09 unverdicted novelty 8.0

Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.
Code as Policies: Language Model Programs for Embodied Control
cs.RO 2022-09 accept novelty 8.0

Language models generate robot policy code from natural language commands via few-shot prompting, enabling spatial-geometric reasoning, generalization, and precise control on real robots.
MeMo: Memory as a Model
cs.CL 2026-05 unverdicted novelty 7.0

MeMo encodes new knowledge into a separate memory model for frozen LLMs, achieving strong performance on BrowseComp-Plus, NarrativeQA, and MuSiQue while capturing cross-document relationships and remaining robust to r...
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
cs.AI 2026-05 unverdicted novelty 7.0

RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
cs.AI 2026-05 unverdicted novelty 7.0

RL training on more expressive logical tasks follows a steeper power-law scaling with reasoning depth and transfers more efficiently to math and reasoning benchmarks.
Memory Inception: Latent-Space KV Cache Manipulation for Steering LLMs
cs.LG 2026-05 unverdicted novelty 7.0

Memory Inception steers LLMs via selective latent KV cache injection at chosen layers, delivering better control-drift balance than prompting or CAA on personality and reasoning tasks while reducing storage needs.
From Static Analysis to Audience Dissemination: A Training-Free Multimodal Controversy Detection Multi-Agent Framework
cs.LG 2026-05 unverdicted novelty 7.0

AuDisAgent reformulates multimodal controversy detection as a dynamic audience dissemination process using screening, panel discussion, and arbitration agents, plus comment bootstrapping, and reports outperforming pri...
Assessing Large Language Models for Stabilizing Numerical Expressions in Scientific Software
cs.SE 2026-04 conditional novelty 7.0

LLMs match or exceed state-of-the-art traditional methods for stabilizing numerical expressions in scientific software, succeeding on 97.9% of expressions where baselines fail to improve accuracy, but struggle with co...
Trivial Vocabulary Bans Improve LLM Reasoning More Than Deep Linguistic Constraints
cs.CL 2026-04 accept novelty 7.0

Banning filler words like 'very' and 'just' improved LLM reasoning by 6.7 percentage points while E-Prime improved it by only 3.7, with gains ranking in exact inverse order of theoretical depth across models and tasks.
Talking Trees: Reasoning-Assisted Induction of Decision Trees for Tabular Data
cs.LG 2025-09 unverdicted novelty 7.0

Reasoning LLMs with minimal tools for tree construction and analysis induce decision trees that outperform CART, compete with ensembles on low-resource tabular data, and provide human-readable reasoning traces.
GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models
cs.LG 2024-10 accept novelty 7.0

LLMs display high variance and major accuracy drops on GSM-Symbolic variants of grade-school math problems, indicating they replicate training patterns rather than execute logical reasoning.
Large Language Models as Optimizers
cs.LG 2023-09 unverdicted novelty 7.0

Large language models can optimize by being prompted with histories of past solutions and scores to propose better ones, producing prompts that raise accuracy up to 8% on GSM8K and 50% on Big-Bench Hard over human-des...
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
cs.RO 2023-07 unverdicted novelty 7.0

VoxPoser uses LLMs to compose 3D value maps via VLM interaction for model-based synthesis of robust robot trajectories on open-set language-specified manipulation tasks.
Let's Verify Step by Step
cs.LG 2023-05 accept novelty 7.0

Process supervision significantly outperforms outcome supervision for training models on the MATH dataset, achieving 78% accuracy on a representative test subset with active learning and a released 800k step-label dataset.
Capabilities of GPT-4 on Medical Challenge Problems
cs.CL 2023-03 unverdicted novelty 7.0

GPT-4 exceeds the USMLE passing score by more than 20 points and outperforms both GPT-3.5 and the medically fine-tuned Med-PaLM on the MultiMedQA benchmarks.
Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks
cs.CL 2022-11 unverdicted novelty 7.0

PoT prompting improves numerical reasoning by having language models write programs executed by a computer instead of performing calculations in natural language chains of thought, with an average 12% gain over CoT.
Strategy-Induct: Task-Level Strategy Induction for Instruction Generation
cs.CL 2026-05 unverdicted novelty 6.0

Strategy-Induct induces task-level instructions from question-only examples by generating reasoning strategies first, then using those pairs to create a guiding instruction.
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
cs.AI 2026-05 unverdicted novelty 6.0

RL training compute for logical reasoning follows a power law with horizon depth whose exponent rises with logical expressiveness, yielding better downstream transfer when models train on richer logics.
Memory Inception: Latent-Space KV Cache Manipulation for Steering LLMs
cs.LG 2026-05 unverdicted novelty 6.0

Memory Inception is a training-free method that injects latent KV banks at chosen layers to steer LLMs, achieving superior control-drift balance and up to 118x storage reduction on personality and structured-reasoning tasks.
Compared to What? Baselines and Metrics for Counterfactual Prompting
cs.CL 2026-05 conditional novelty 6.0

Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistica...
FACT-E: Causality-Inspired Evaluation for Trustworthy Chain-of-Thought Reasoning
cs.AI 2026-04 unverdicted novelty 6.0

FACT-E uses controlled perturbations as an instrumental signal to measure intra-chain faithfulness in CoT reasoning and combines it with answer consistency to select trustworthy trajectories.
Measuring Representation Robustness in Large Language Models for Geometry
cs.CL 2026-04 unverdicted novelty 6.0

LLMs display accuracy gaps of up to 14 percentage points on the same geometry problems solely due to representation choice, with vector forms consistently weakest and a convert-then-solve prompt helping only high-capa...
Measuring and curing reasoning rigidity: from decorative chain-of-thought to genuine faithfulness
cs.CL 2026-03 unverdicted novelty 6.0

SLRC quantifies genuine step necessity in LLM reasoning as a causal estimator, LC-CoSR training reduces rigidity with stability guarantees, and evaluations reveal a faithfulness-sycophancy paradox across frontier models.
GraphMind: Theorem Selection and Conclusion Generation Framework with Dynamic GNN for LLM Reasoning
cs.CL 2025-11 unverdicted novelty 6.0

GraphMind models multi-step reasoning as an evolving heterogeneous graph, using GNN encoding and semantic matching to select theorems and generate conclusions iteratively, reporting performance gains over baselines on...
FireScope: Wildfire Risk Raster Prediction with a Chain-of-Thought Oracle
cs.CV 2025-11 unverdicted novelty 6.0

FireScope is a VLM framework that generates wildfire risk rasters together with reasoning traces, showing improved cross-continental generalization when trained on US expert maps and tested on European fire events.
ARM: Discovering Agentic Reasoning Modules for Generalizable Multi-Agent Systems
cs.AI 2025-10 unverdicted novelty 6.0

ARM evolves specialized reasoning modules from basic CoT via tree search to serve as reusable components in multi-agent systems that generalize across models and domains without per-task re-optimization.
Towards an AI co-scientist
cs.AI 2025-02 unverdicted novelty 6.0

A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.
Compressed Chain of Thought: Efficient Reasoning Through Dense Representations
cs.CL 2024-12 unverdicted novelty 6.0

CCoT generates variable-length continuous contemplation tokens that compress explicit reasoning chains, enabling additional dense reasoning and accuracy gains in off-the-shelf language models while allowing adaptive c...
Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models
cs.AI 2024-08 conditional novelty 6.0

Empirical analysis shows scaling inference compute via strategies like tree search can be more efficient than scaling model parameters, with 7B models plus novel search outperforming 34B models.
Scaling Data-Constrained Language Models
cs.CL 2023-05 conditional novelty 6.0

Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.
Gorilla: Large Language Model Connected with Massive APIs
cs.CL 2023-05 conditional novelty 6.0

Gorilla is a fine-tuned LLM that surpasses GPT-4 in accurate API call generation and uses retrieval to handle documentation updates.
Reasoning with Language Model is Planning with World Model
cs.CL 2023-05 unverdicted novelty 6.0

RAP turns LLMs into dual world-model and planning agents via MCTS to generate better reasoning paths, outperforming CoT baselines and achieving 33% relative gains over GPT-4 CoT using LLaMA-33B on plan generation.
Improving Factuality and Reasoning in Language Models through Multiagent Debate
cs.CL 2023-05 unverdicted novelty 6.0

Multiagent debate among LLMs improves mathematical reasoning, strategic reasoning, and factual accuracy while reducing hallucinations.
Towards Expert-Level Medical Question Answering with Large Language Models
cs.CL 2023-05 unverdicted novelty 6.0

Med-PaLM 2 achieves 86.5% accuracy on MedQA and approaches or exceeds prior state-of-the-art on other medical QA benchmarks while receiving higher physician preference ratings than human answers on consumer questions.
Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes
cs.CL 2023-05 conditional novelty 6.0

Distilling step-by-step uses LLM-generated rationales as additional supervision in a multi-task framework so that 770M-parameter models outperform 540B-parameter models on NLP benchmarks with only 80% of the data.
CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society
cs.AI 2023-03 conditional novelty 6.0

CAMEL proposes a role-playing framework with inception prompting that enables autonomous multi-agent cooperation among LLMs and generates conversational data for studying their behaviors.
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
cs.CV 2023-03 unverdicted novelty 6.0

MM-REACT uses textual prompts to let ChatGPT collaborate with external vision experts for zero-shot multimodal reasoning and action on advanced visual tasks.
ART: Automatic multi-step reasoning and tool-use for large language models
cs.CL 2023-03 unverdicted novelty 6.0

ART automatically generates multi-step reasoning programs with tool integration for LLMs, yielding substantial gains over few-shot and auto-CoT prompting on BigBench and MMLU while matching hand-crafted CoT on most tasks.
PaLM-E: An Embodied Multimodal Language Model
cs.LG 2023-03 conditional novelty 6.0

PaLM-E is a single 562B-parameter multimodal model that performs embodied reasoning tasks like robotic manipulation planning and visual question answering by interleaving vision, state, and text inputs with positive t...
Multimodal Chain-of-Thought Reasoning in Language Models
cs.CL 2023-02 accept novelty 6.0

Multimodal-CoT achieves state-of-the-art on ScienceQA by using a two-stage process that incorporates vision into chain-of-thought rationale generation for models under 1 billion parameters.
Solving math word problems with process- and outcome-based feedback
cs.LG 2022-11 unverdicted novelty 6.0

On GSM8K, outcome-based supervision achieves similar final-answer error rates to process-based with less labeling, but process-based or learned reward models are needed to reach 3.4% reasoning error among correct solutions.
Measuring Progress on Scalable Oversight for Large Language Models
cs.HC 2022-11 unverdicted novelty 6.0

Humans chatting with an unreliable LLM assistant outperform both the model alone and unaided humans on MMLU and time-limited QuALITY tasks.
Large Language Models Are Human-Level Prompt Engineers
cs.LG 2022-11 unverdicted novelty 6.0

APE generates instruction candidates via LLM and selects the best by zero-shot performance of a second LLM, matching or beating human prompts on 19 of 24 NLP tasks.
Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
cs.CL 2022-10 accept novelty 6.0

Chain-of-thought prompting enables large language models to surpass average human performance on 17 of 23 challenging BIG-Bench tasks.
Automatic Chain of Thought Prompting in Large Language Models
cs.CL 2022-10 conditional novelty 6.0

Auto-CoT automatically builds chain-of-thought demonstrations by sampling diverse questions and letting the LLM generate reasoning chains, matching manual CoT performance on ten reasoning tasks with GPT-3.
Inner Monologue: Embodied Reasoning through Planning with Language Models
cs.RO 2022-07 unverdicted novelty 6.0

LLMs form an inner monologue from closed-loop language feedback to improve high-level instruction completion in simulated and real robotic rearrangement and kitchen manipulation tasks.
Emergent Abilities of Large Language Models
cs.CL 2022-06 unverdicted novelty 6.0

Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model performance.
Eyes on VLM: Benchmarking Gaze Following and Social Gaze Prediction in Vision Language Models
cs.CV 2026-05 unverdicted novelty 5.0

EyeVLM benchmark finds that current VLMs underperform specialized visual models on gaze following and social gaze prediction, with fine-tuning narrowing but not closing the gap.
Eyes on VLM: Benchmarking Gaze Following and Social Gaze Prediction in Vision Language Models
cs.CV 2026-05 unverdicted novelty 5.0

VLMs are evaluated on gaze following and social gaze prediction using existing datasets in zero-shot and fine-tuned settings, revealing they currently lack precise capabilities compared to visual models.
From Prompts to Pavement Through Time: Temporal Grounding in Agentic Scene-to-Plan Reasoning
cs.AI 2026-05 unverdicted novelty 5.0

Temporal conditioning in three LLM-based planner architectures for AV scene-to-plan reasoning yields no statistically significant gains on NLP correctness metrics but enables predictive hazard reasoning and stable cor...
How Helpful is LLM Assistance in Network Operations? A Case Study at a Large Demonstration Network
cs.NI 2026-05 unverdicted novelty 5.0

A case study with 105 network engineers found that an LLM chatbot with RAG, CLI control, and ticket access received positive evaluations in 68.1% of interactions while assisting with building and operating a large dem...
Context, Reasoning, and Hierarchy: A Cost-Performance Study of Compound LLM Agent Design in an Adversarial POMDP
cs.AI 2026-05 conditional novelty 5.0

In CybORG CAGE-2, programmatic state abstraction improves mean return up to 76% over raw observations while adding deliberation tools to hierarchies degrades performance up to 3.4x and increases token use.
MeMo: Memory as a Model
cs.CL 2026-05 unverdicted novelty 5.0

MeMo encodes new knowledge into a separate memory model that integrates with frozen LLMs, showing strong performance on QA benchmarks while avoiding catastrophic forgetting and working without access to model weights.
Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes
cs.AI 2026-05 unverdicted novelty 5.0

Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.
From Prediction to Justification: Aligning Sentiment Reasoning with Human Rationale via Reinforcement Learning
cs.CL 2026-04 unverdicted novelty 5.0

ABSA-R1 uses RL with a cognition-aligned reward model and rejection sampling to generate consistent reasoning paths for sentiment predictions, improving interpretability and performance on ABSA benchmarks.
Empirical Evidence of Complexity-Induced Limits in Large Language Models on Finite Discrete State-Space Problems with Explicit Validity Constraints
cs.CL 2026-04 unverdicted novelty 5.0

Large reasoning models exhibit reasoning collapse, with accuracy dropping sharply beyond task-specific complexity thresholds in controlled versions of nine classical reasoning tasks using strict validity validators.
Strengthening Human-Centric Chain-of-Thought Reasoning Integrity in LLMs via a Structured Prompt Framework
cs.CR 2026-04 unverdicted novelty 5.0

A 16-factor structured prompt framework strengthens CoT reasoning in LLMs for security analysis, yielding up to 40% reasoning gains in smaller models and stable accuracy improvements validated by human raters with Coh...
The PICCO Framework for Large Language Model Prompting: A Taxonomy and Reference Architecture for Prompt Structure
cs.CL 2026-04 accept novelty 5.0

PICCO is a five-element reference architecture (Persona, Instructions, Context, Constraints, Output) for structuring LLM prompts, derived from synthesizing prior frameworks along with a taxonomy distinguishing prompt ...
LLMs as Assessors: Right for the Right Reason?
cs.IR 2026-01 unverdicted novelty 5.0

LLMs judge document relevance at a level comparable to humans but frequently highlight different passages, indicating they are often not right for the right reasons and cannot fully replace human assessors.

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · cited by 62 Pith papers · 1 internal anchor

[1]

Mohammad Javad Hosseini, Hannaneh Hajishirzi, Oren Etzioni, and Nate Kushman

URL https://aclanthology.org/2021.tacl-1.21/. Mohammad Javad Hosseini, Hannaneh Hajishirzi, Oren Etzioni, and Nate Kushman. Learning to solve arithmetic word problems with verb categorization. In EMNLP, volume 523533. Citeseer,

work page 2021
[2]

What Makes Good In-Context Examples for GPT-3?, January 2021

URL https://aclanthology.org/D14-1058/. Wendy Johnson and Thomas J Bouchard Jr. The structure of human intelligence: It is verbal, perceptual, and image rotation (vpr), not fluid and crystallized. Intelligence, 33(4):393–416, 2005. Rik Koncel-Kedziorski, Hannaneh Hajishirzi, Ashish Sabharwal, Oren Etzioni, and Siena Dumas Ang. Parsing algebraic word pro...

work page arXiv 2005
[3]

Scaling Language Models: Methods, Analysis & Insights from Training Gopher

URL https://arxiv.org/abs/2112.11446. Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 21(140):1–67, 2020. URL http://jmlr.org/papers/v21/20-074.html. Nazneen Fatema Rajani, Bryan McCann, Caim...

work page internal anchor Pith review Pith/arXiv arXiv 2020
[4]

Kurt Shuster, Spencer Poff, Moya Chen, Douwe Kiela, and Jason Weston

URL https://aclanthology.org/P19-1487. Laria Reynolds and Kyle McDonell. Prompt programming for large language models: Beyond the few-shot paradigm. In Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems, pages 1–7, 2021. URL https://arxiv.org/pdf/2102.07350.pdf. Subhro Roy and Dan Roth. Solving general arithmetic word prob...

work page arXiv 2021
[5]

URL https://aclanthology.org/2020.emnlp-main.373. Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, Elton Zhang, Rewon Child, Reza Yazdani Aminabadi, Julie Bernauer, Xia Song, Mohammad Shoeybi, Yuxiong He, Michael Houston, Saurabh Tiwary, and...

work page Pith review arXiv 2020
[6]

Wu, and N. D. Goodman

URL https://arxiv.org/abs/2203.14465. Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022. URL https://arxiv.org/abs/2205.01068. Checklist

work page arXiv 2022
[7]

For all authors... (a) Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? [Yes] (b) Did you describe the limitations of your work? [Yes] (c) Did you discuss any potential negative societal impacts of your work? [Yes] (d) Have you read the ethics review guidelines and ensured that your paper conf...

work page
[8]

(a) Did you state the full set of assumptions of all theoretical results? [N/A] (b) Did you include complete proofs of all theoretical results? [N/A]

If you are including theoretical results... (a) Did you state the full set of assumptions of all theoretical results? [N/A] (b) Did you include complete proofs of all theoretical results? [N/A]

work page
[9]

If you ran experiments... (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] (c) Did you report error bars (e.g., with respect to the r...

work page
[10]

If you are using existing assets (e.g., code, data, models) or curating/releasing new assets... (a) If your work uses existing assets, did you cite the creators? [Yes] (b) Did you mention the license of the assets? [Yes] (c) Did you include any new assets either in the supplemental material or as a URL? [Yes] (d) Did you discuss whether and how consent wa...

work page
[11]

Last Letters

If you used crowdsourcing or conducted research with human subjects... (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A] (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A] (c) Did you include the estimated hourly wa...

work page 2022
[12]

," , "

“Q:” is set as a customized stop sequence for all the models except for Instruct-GPT3 to stop the models from repeating questions and answers by themselves. We run our experiments on cloud V100 instances without GPU for GPT-3 models, on cloud A100x8 GPU(60GB) instances for T0 and OPT, and on cloud A100x1 GPU(60GB) instances for GPT-J, GPT-Neo, and GPT-2. ...

work page 2019
[13]

If each box has 4 pieces inside it, how much candy did he have total? 2

Adam bought 2 boxes of chocolate candy and 5 boxes of caramel candy. If each box has 4 pieces inside it, how much candy did he have total? 2. Adam bought 2 boxes of chocolate candy and 5 boxes of caramel candy. If each box has 4 pieces inside it, how much candy did he have total? 3. Adam bought 2 boxes of chocolate candy and 5 boxes of caramel candy. If e...

work page 2022
[14]

Q: Olivia has $23. She bought five bagels for $3 each

The answer is 29. Q: Olivia has $23. She bought five bagels for $3 each. How much money does she have left? A: Olivia had 23 dollars. 5 bagels for 3 dollars each will be 5 x 3 = 15 dollars. So she has 23 - 15 dollars left. 23 - 15 is 8. The answer is 8. Q: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, t...

work page 2022
[15]

Correct - CoT is INCORRECT

✓ A: The answer (arabic numerals) is 1000. (GT : 694) Zero-shot-CoT , Zero-shot ✓ (4.8% of dataset) Q: Marissa is hiking a 12-mile trail. She took 1 hour to walk the first 4 miles, then another hour to walk the next two miles. If she wants her average speed to be ...

work page 2022
[16]

If Katie bought 5 roses and 5 daisies, how much did she spend? A: Let’s think step by step

The answer is 35. ✓ Q: April’s discount flowers was having a sale where each flower was 6 dollars. If Katie bought 5 roses and 5 daisies, how much did she spend? A: Let’s think step by step. First, we need to calculate how much each type of flower costs. There are 5 roses, and each rose costs 6 dollars. So, the cost of the roses is 5*6=30 dollars. There are...

work page
[17]

x". If Wendy sold

The answer is 17. ✓ Q: Luke made 9 dollars mowing lawns and 18 dollars weed eating. If he only spent 3 dollar a week, how long would the money last him? A (OTHERS; QUESTION UNDERSTANDING ERROR): Let’s think step by step. How much money does Luke make in a week? How much money does Luke spend in a week? How much money does Luke have left over at the end ...

work page
[18]

Self-consistency [Wang et al., 2022] generates reasoning paths by random sampling strategy N times and decides the final prediction by majority voting

✓ D Further Zero-shot Experiments with PaLM 540B We additionally evaluated Zero-shot-CoT on PaLM 540B, without and with self-consistency [Wang et al., 2022]. Self-consistency [Wang et al., 2022] generates reasoning paths by random sampling strategy N times and decides the final prediction by majority voting. Table 25: Further experiment results with PaL...

work page 2022
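
The self-consistency procedure quoted in entry [18] (sample N reasoning paths, then take a majority vote over the final answers) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `extract_answer`, `self_consistency`, and `sample_path` are hypothetical, and taking the last number in the text as the answer is an assumption made here for simplicity rather than the paper's answer-extraction step.

```python
import re
from collections import Counter
from typing import Callable, Optional


def extract_answer(reasoning: str) -> Optional[str]:
    """Return the last number in a reasoning path, or None if there is none.

    Assumption: the final answer is the last numeric token in the text.
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", reasoning)
    return numbers[-1] if numbers else None


def self_consistency(sample_path: Callable[[], str], n: int) -> Optional[str]:
    """Sample n reasoning paths and return the most frequent extracted answer.

    `sample_path` is a hypothetical callable that returns one sampled
    reasoning path (for example, one model completion at temperature > 0).
    """
    answers = [extract_answer(sample_path()) for _ in range(n)]
    # Discard paths from which no answer could be extracted.
    answers = [a for a in answers if a is not None]
    if not answers:
        return None
    # Majority vote; ties go to the answer that appeared first.
    return Counter(answers).most_common(1)[0][0]
```

In practice `sample_path` would wrap a call to a language model with stochastic decoding; any deterministic stub that returns canned strings is enough to exercise the voting logic.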
