super hub Mixed citations

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Brian Ichter, Dale Schuurmans, Fei Xia, Jason Wei, Maarten Bosma, Xuezhi Wang · 2022 · cs.CL · arXiv 2201.11903

Mixed citation behavior. Most common role is background (65%).

285 Pith papers citing it

Background 65% of classified citations

open full Pith review browse 285 citing papers more from Brian Ichter arXiv PDF

abstract

We explore how generating a chain of thought -- a series of intermediate reasoning steps -- significantly improves the ability of large language models to perform complex reasoning. In particular, we show how such reasoning abilities emerge naturally in sufficiently large language models via a simple method called chain of thought prompting, where a few chain of thought demonstrations are provided as exemplars in prompting. Experiments on three large language models show that chain of thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks. The empirical gains can be striking. For instance, prompting a 540B-parameter language model with just eight chain of thought exemplars achieves state of the art accuracy on the GSM8K benchmark of math word problems, surpassing even finetuned GPT-3 with a verifier.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 50 method 13 baseline 5

citation-polarity summary

background 44 use method 13 baseline 5 support 4 unclear 2

claims ledger

abstract We explore how generating a chain of thought -- a series of intermediate reasoning steps -- significantly improves the ability of large language models to perform complex reasoning. In particular, we show how such reasoning abilities emerge naturally in sufficiently large language models via a simple method called chain of thought prompting, where a few chain of thought demonstrations are provided as exemplars in prompting. Experiments on three large language models show that chain of thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks. Th

authors

Brian Ichter Dale Schuurmans Fei Xia Jason Wei Maarten Bosma Xuezhi Wang

co-cited works

representative citing papers

Slot Machines: How LLMs Keep Track of Multiple Entities

cs.CL · 2026-04-22 · unverdicted · novelty 8.0

LLM activations encode current and prior entities in orthogonal slots, but models only use the current slot for explicit factual retrieval despite prior-slot information being linearly decodable.

Stability and Generalization in Looped Transformers

cs.LG · 2026-04-16 · unverdicted · novelty 8.0

Looped transformers with recall and outer normalization produce reachable, input-dependent fixed points with stable gradients, enabling generalization, while those without recall cannot; a new internal recall variant performs competitively or better.

GIANTS: Generative Insight Anticipation from Scientific Literature

cs.CL · 2026-04-10 · unverdicted · novelty 8.0

GIANTS-4B, trained with RL on a new 17k-example benchmark of parent-to-child paper insights, achieves 34% relative improvement over gemini-3-pro in LM-judge similarity and is rated higher-impact by a citation predictor.

On the Mirage of Long-Range Dependency, with an Application to Integer Multiplication

cs.LG · 2026-03-30 · unverdicted · novelty 8.0

Long-range dependency in integer multiplication is a mirage from 1D representation; a 2D grid reduces it to local 3x3 operations, letting a 321-parameter neural cellular automaton generalize perfectly to inputs 683 times longer than training while Transformers fail.

Evaluating Large Language Models in Scientific Discovery

cs.AI · 2025-12-17 · unverdicted · novelty 8.0

The SDE benchmark shows LLMs lag on scientific discovery tasks relative to general science tests, with diminishing scaling returns and shared weaknesses across models.

Agentic AI for Multi-Stage Physics Experiments at a Large-Scale User Facility Particle Accelerator

physics.acc-ph · 2025-09-21 · unverdicted · novelty 8.0

A language-model-driven agentic AI system autonomously executes multi-stage physics experiments at a production synchrotron light source, reducing preparation time by two orders of magnitude while upholding safety constraints.

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

cs.CL · 2023-10-05 · conditional · novelty 8.0

DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.

Tree of Thoughts: Deliberate Problem Solving with Large Language Models

cs.CL · 2023-05-17 · accept · novelty 8.0

Tree of Thoughts enables language models to solve complex planning tasks by generating, evaluating, and searching over coherent intermediate thoughts in a tree, raising Game of 24 success from 4% to 74% with GPT-4.

Generative Agents: Interactive Simulacra of Human Behavior

cs.HC · 2023-04-07 · accept · novelty 8.0

Generative agents with memory streams, reflection, and planning using LLMs exhibit believable individual and emergent social behaviors in a simulated town.

Instruction Tuning with GPT-4

cs.CL · 2023-04-06 · unverdicted · novelty 8.0

GPT-4-generated instruction data produces superior zero-shot performance in finetuned LLaMA models versus prior state-of-the-art data.

Progress measures for grokking via mechanistic interpretability

cs.LG · 2023-01-12 · accept · novelty 8.0

Grokking arises from gradual amplification of a Fourier-based circuit in the weights followed by removal of memorizing components.

PAL: Program-aided Language Models

cs.CL · 2022-11-18 · conditional · novelty 8.0

PAL improves few-shot reasoning accuracy by having LLMs generate executable programs rather than text-based chains of thought, outperforming much larger models on math and logic benchmarks.

Code as Policies: Language Model Programs for Embodied Control

cs.RO · 2022-09-16 · accept · novelty 8.0

Language models generate robot policy code from natural language commands via few-shot prompting, enabling spatial-geometric reasoning, generalization, and precise control on real robots.

Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?

cs.CL · 2022-02-25 · accept · novelty 8.0

Randomly replacing labels in in-context demonstrations barely hurts performance, showing that label space, input distribution, and sequence format drive in-context learning more than ground-truth labels.

MultiUAV-Plat: An LLM-Oriented Platform, Benchmark and Framework for Multi-UAV Collaborative Task Planning

cs.AI · 2026-06-30 · unverdicted · novelty 7.0

MultiUAV-Plat supplies a new RESTful simulation platform and 1500-task benchmark where Agent4Drone reaches 57.9% task pass rate versus 30.6% for ReAct baseline across 75 multi-UAV missions.

What Drives Interactive Improvement from Feedback?

cs.AI · 2026-06-29 · unverdicted · novelty 7.0

Controlled student-teacher experiments across four benchmarks show interactive gains are driven more by the student's ability to use feedback than by teacher quality, with self-feedback adding little beyond unguided retries.

Lost in Aggregation: A Multi-Scale Diagnostic Benchmark for LLM Spatial Navigation

physics.soc-ph · 2026-06-20 · unverdicted · novelty 7.0

A new diagnostic benchmark decomposes LLM spatial navigation into three cognitive scales and shows that cross-scale aggregation, not single-level deficits, causes failure beyond small mazes.

Toward Calibrated, Fair, and accurate Deepfake Detection

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

Face-Feature Tuning is a label-free logit remapping method that reduces FPR/TPR gaps across groups in deepfake detection while preserving overall accuracy.

The Regularizing Power of Language-Training Deepfake Detectors

cs.CV · 2026-05-29 · unverdicted · novelty 7.0

A dual-encoder deepfake detector pairs a frozen specialist with a LoRA-tuned MLLM, trained first via binary alignment then via RL to reward explain-then-classify behavior, yielding improved cross-dataset performance and interpretability.

Social Reasoning in Machines: Investigating Collective Truth-Seeking Dynamics in Large Language Model Debate

cs.MA · 2026-05-28 · unverdicted · novelty 7.0

LLM multi-agent debate with epistemically diverse models improves truth-seeking performance on questionnaires and is claimed to be mechanistically grounded in the Argumentative Theory of Reasoning.

Memory Architectures for Multi-Turn Text-to-SQL: A Benchmark and Empirical Study

cs.CL · 2026-05-25 · unverdicted · novelty 7.0

EnterpriseMem-Bench shows stateless multi-turn Text-to-SQL accuracy drops to zero by turn 3, working memory is the main driver of gains, and additional memory components yield model- and dataset-dependent effects from +14 to -16 percentage points.

ARBITER: Reasoning Trajectory Basins and Majority Vote Failures in Test-Time Sampling

cs.LG · 2026-05-25 · unverdicted · novelty 7.0

ARBITER models reasoning trajectory basins in test-time sampling and uses model-internal signals to correct majority-vote failures, recovering part of the oracle gap on math benchmarks.

Spectral Retrieval: Multi-Scale Sinc Convolution over Token Embeddings for Localized Retrieval in LLM Multi-Agent Systems

cs.IR · 2026-05-23 · unverdicted · novelty 7.0

Spectral Retrieval uses multi-scale sinc convolutions on token embeddings to interpolate between per-token MaxSim and mean-pooling, achieving large gains on synthetic and LIMIT-small benchmarks for localized retrieval.

Inductive Deductive Synthesis: Enabling AI to Generate Formally Verified Systems

cs.AI · 2026-05-22 · unverdicted · novelty 7.0

IDS is an agentic LLM system that incrementally synthesizes both implementation and proof for distributed key-value stores, succeeding on all 7 specs where prior agents succeeded on only 2.

citing papers explorer

Showing 3 of 3 citing papers after filters.

Spatiotemporal Hidden-State Dynamics as a Signature of Internal Reasoning in Large Language Models cs.CL · 2026-05-03 · unverdicted · none · ref 1 · internal anchor
Large reasoning models show measurable hidden-state dynamics that a new statistic can use to distinguish correct reasoning trajectories without labels.
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena cs.CL · 2023-06-09 · accept · none · ref 47 · internal anchor
GPT-4 as an LLM judge achieves over 80% agreement with human preferences on MT-Bench and Chatbot Arena, matching human agreement levels and providing a scalable evaluation method.
Galactica: A Large Language Model for Science cs.CL · 2022-11-16 · unverdicted · none · ref 19 · 2 links · internal anchor
Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer