pith. sign in

super hub Mixed citations

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Mixed citation behavior. Most common role is background (65%).

307 Pith papers citing it
Background 65% of classified citations
abstract

We explore how generating a chain of thought -- a series of intermediate reasoning steps -- significantly improves the ability of large language models to perform complex reasoning. In particular, we show how such reasoning abilities emerge naturally in sufficiently large language models via a simple method called chain of thought prompting, where a few chain of thought demonstrations are provided as exemplars in prompting. Experiments on three large language models show that chain of thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks. The empirical gains can be striking. For instance, prompting a 540B-parameter language model with just eight chain of thought exemplars achieves state of the art accuracy on the GSM8K benchmark of math word problems, surpassing even finetuned GPT-3 with a verifier.

hub tools

citation-role summary

background 50 method 13 baseline 5

citation-polarity summary

claims ledger

  • abstract We explore how generating a chain of thought -- a series of intermediate reasoning steps -- significantly improves the ability of large language models to perform complex reasoning. In particular, we show how such reasoning abilities emerge naturally in sufficiently large language models via a simple method called chain of thought prompting, where a few chain of thought demonstrations are provided as exemplars in prompting. Experiments on three large language models show that chain of thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks. Th

authors

co-cited works

clear filters

representative citing papers

Slot Machines: How LLMs Keep Track of Multiple Entities

cs.CL · 2026-04-22 · unverdicted · novelty 8.0

LLM activations encode current and prior entities in orthogonal slots, but models only use the current slot for explicit factual retrieval despite prior-slot information being linearly decodable.

Stability and Generalization in Looped Transformers

cs.LG · 2026-04-16 · unverdicted · novelty 8.0

Looped transformers with recall and outer normalization produce reachable, input-dependent fixed points with stable gradients, enabling generalization, while those without recall cannot; a new internal recall variant performs competitively or better.

GIANTS: Generative Insight Anticipation from Scientific Literature

cs.CL · 2026-04-10 · unverdicted · novelty 8.0

GIANTS-4B, trained with RL on a new 17k-example benchmark of parent-to-child paper insights, achieves 34% relative improvement over gemini-3-pro in LM-judge similarity and is rated higher-impact by a citation predictor.

Instruction Tuning with GPT-4

cs.CL · 2023-04-06 · unverdicted · novelty 8.0

GPT-4-generated instruction data produces superior zero-shot performance in finetuned LLaMA models versus prior state-of-the-art data.

PAL: Program-aided Language Models

cs.CL · 2022-11-18 · conditional · novelty 8.0

PAL improves few-shot reasoning accuracy by having LLMs generate executable programs rather than text-based chains of thought, outperforming much larger models on math and logic benchmarks.

What Drives Interactive Improvement from Feedback?

cs.AI · 2026-06-29 · unverdicted · novelty 7.0

Controlled student-teacher experiments across four benchmarks show interactive gains are driven more by the student's ability to use feedback than by teacher quality, with self-feedback adding little beyond unguided retries.

The Regularizing Power of Language-Training Deepfake Detectors

cs.CV · 2026-05-29 · unverdicted · novelty 7.0

A dual-encoder deepfake detector pairs a frozen specialist with a LoRA-tuned MLLM, trained first via binary alignment then via RL to reward explain-then-classify behavior, yielding improved cross-dataset performance and interpretability.

citing papers explorer

Showing 8 of 8 citing papers after filters.

  • Tree of Thoughts: Deliberate Problem Solving with Large Language Models cs.CL · 2023-05-17 · accept · none · ref 39 · internal anchor

    Tree of Thoughts enables language models to solve complex planning tasks by generating, evaluating, and searching over coherent intermediate thoughts in a tree, raising Game of 24 success from 4% to 74% with GPT-4.

  • IE as Cache: Information Extraction Enhanced Agentic Reasoning cs.CL · 2026-04-16 · unverdicted · none · ref 41 · internal anchor

    IE-as-Cache framework repurposes information extraction as a dynamic cognitive cache to improve agentic reasoning accuracy in LLMs on challenging benchmarks.

  • Measuring Representation Robustness in Large Language Models for Geometry cs.CL · 2026-04-03 · unverdicted · none · ref 36 · internal anchor

    LLMs display accuracy gaps of up to 14 percentage points on the same geometry problems solely due to representation choice, with vector forms consistently weakest and a convert-then-solve prompt helping only high-capacity models.

  • Scaling Data-Constrained Language Models cs.CL · 2023-05-25 · conditional · none · ref 126 · internal anchor

    Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.

  • Gorilla: Large Language Model Connected with Massive APIs cs.CL · 2023-05-24 · conditional · none · ref 44 · internal anchor

    Gorilla is a fine-tuned LLM that surpasses GPT-4 in accurate API call generation and uses retrieval to handle documentation updates.

  • Towards Expert-Level Medical Question Answering with Large Language Models cs.CL · 2023-05-16 · unverdicted · none · ref 1 · internal anchor

    Med-PaLM 2 achieves 86.5% accuracy on MedQA and approaches or exceeds prior state-of-the-art on other medical QA benchmarks while receiving higher physician preference ratings than human answers on consumer questions.

  • Information Extraction from Electricity Invoices with General-Purpose Large Language Models cs.CL · 2026-04-01 · unverdicted · none · ref 25 · internal anchor

    Few-shot prompting lifts F1 scores above 96 percent on electricity-invoice extraction for Gemini 1.5 Pro and Mistral-small, while hyperparameter changes produce only marginal gains.

  • A Survey of Large Language Models cs.CL · 2023-03-31 · accept · none · ref 35 · internal anchor

    This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.