hub Mixed citations

Least-to-Most Prompting Enables Complex Reasoning in Large Language Models

· 2022 · cs.AI · arXiv 2205.10625

Mixed citation behavior. Most common role is background (60%).

92 Pith papers citing it

Background 60% of classified citations

open full Pith review browse 92 citing papers arXiv PDF

abstract

Chain-of-thought prompting has demonstrated remarkable performance on various natural language reasoning tasks. However, it tends to perform poorly on tasks which requires solving problems harder than the exemplars shown in the prompts. To overcome this challenge of easy-to-hard generalization, we propose a novel prompting strategy, least-to-most prompting. The key idea in this strategy is to break down a complex problem into a series of simpler subproblems and then solve them in sequence. Solving each subproblem is facilitated by the answers to previously solved subproblems. Our experimental results on tasks related to symbolic manipulation, compositional generalization, and math reasoning reveal that least-to-most prompting is capable of generalizing to more difficult problems than those seen in the prompts. A notable finding is that when the GPT-3 code-davinci-002 model is used with least-to-most prompting, it can solve the compositional generalization benchmark SCAN in any split (including length split) with an accuracy of at least 99% using just 14 exemplars, compared to only 16% accuracy with chain-of-thought prompting. This is particularly noteworthy because neural-symbolic models in the literature that specialize in solving SCAN are trained on the entire training set containing over 15,000 examples. We have included prompts for all the tasks in the Appendix.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 14 method 4 dataset 1 other 1

citation-polarity summary

background 12 use method 4 support 2 unclear 1 use dataset 1

claims ledger

abstract Chain-of-thought prompting has demonstrated remarkable performance on various natural language reasoning tasks. However, it tends to perform poorly on tasks which requires solving problems harder than the exemplars shown in the prompts. To overcome this challenge of easy-to-hard generalization, we propose a novel prompting strategy, least-to-most prompting. The key idea in this strategy is to break down a complex problem into a series of simpler subproblems and then solve them in sequence. Solving each subproblem is facilitated by the answers to previously solved subproblems. Our experimental

co-cited works

representative citing papers

Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution

cs.CL · 2023-09-28 · unverdicted · novelty 8.0

Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.

Tree of Thoughts: Deliberate Problem Solving with Large Language Models

cs.CL · 2023-05-17 · accept · novelty 8.0

Tree of Thoughts enables language models to solve complex planning tasks by generating, evaluating, and searching over coherent intermediate thoughts in a tree, raising Game of 24 success from 4% to 74% with GPT-4.

PAL: Program-aided Language Models

cs.CL · 2022-11-18 · conditional · novelty 8.0

PAL improves few-shot reasoning accuracy by having LLMs generate executable programs rather than text-based chains of thought, outperforming much larger models on math and logic benchmarks.

Code as Policies: Language Model Programs for Embodied Control

cs.RO · 2022-09-16 · accept · novelty 8.0

Language models generate robot policy code from natural language commands via few-shot prompting, enabling spatial-geometric reasoning, generalization, and precise control on real robots.

LEAD: Length-Efficient Adaptive and Dynamic Reasoning for Large Language Models

cs.LG · 2026-05-10 · unverdicted · novelty 7.0

LEAD uses online adaptive mechanisms including Potential-Scaled Instability and symmetric efficiency rewards based on correct rollouts to achieve higher accuracy-efficiency scores with substantially shorter reasoning outputs than base models on math benchmarks.

Think-with-Rubrics: From External Evaluator to Internal Reasoning Guidance

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

Think-with-Rubrics has LLMs generate rubrics internally before responding, outperforming external rubric-as-reward baselines by 3.87 points on average across benchmarks.

A Multi-View Media Profiling Suite: Resources, Evaluation, and Analysis

cs.CL · 2026-05-02 · unverdicted · novelty 7.0

Presents MBFC-2025 dataset and multi-view embeddings with fusion methods for media bias and factuality, reporting SOTA results on ACL-2020 and new benchmarks on MBFC-2025.

Incisor: Ex Ante Cloud Instance Selection for HPC Jobs

cs.DC · 2026-04-27 · unverdicted · novelty 7.0

Incisor uses program analysis and frontier LLMs to select working AWS EC2 instances ex ante for 100% of first-time HPC runs of C/C++/Fortran and Python codes, cutting runtime 54% and costs 44% versus an expert-constrained SkyPilot baseline.

Discovering a Shared Logical Subspace: Steering LLM Logical Reasoning via Alignment of Natural-Language and Symbolic Views

cs.CL · 2026-04-21 · unverdicted · novelty 7.0

Applying Canonical Correlation Analysis to paired residual activations from natural-language and symbolic reasoning chains in LLMs reveals a low-dimensional shared logical subspace that can steer the model's reasoning for up to 11 percentage point accuracy gains on logical benchmarks.

Self-Correcting RAG: Enhancing Faithfulness via MMKP Context Selection and NLI-Guided MCTS

cs.CL · 2026-04-12 · unverdicted · novelty 7.0

Self-Correcting RAG formalizes retrieval as MMKP to maximize information density under token limits and uses NLI-guided MCTS to validate faithfulness, raising accuracy and cutting hallucinations on six multi-hop QA and fact-checking datasets.

iTAG: Inverse Design for Natural Text Generation with Accurate Causal Graph Annotations

cs.CL · 2026-04-08 · unverdicted · novelty 7.0

iTAG generates natural text paired with accurate causal graph annotations by framing concept assignment as an inverse problem and refining selections via chain-of-thought reasoning until the text's relations align with the target causal structure.

Evaluating the Environmental Impact of using SLMs and Prompt Engineering for Code Generation

cs.SE · 2026-04-03 · unverdicted · novelty 7.0

Chain-of-Thought prompting balances high accuracy with low energy use in small language models for code generation, while multi-sampling strategies add high energy costs for small accuracy gains.

LEAD: Breaking the No-Recovery Bottleneck in Long-Horizon Reasoning

cs.AI · 2026-03-06 · unverdicted · novelty 7.0

LEAD lets LLMs solve checkers jumping puzzles up to size 13 by using lookahead to recover from irreversible errors on hard steps that break extreme decomposition.

Video-R1: Reinforcing Video Reasoning in MLLMs

cs.CV · 2025-03-27 · conditional · novelty 7.0

Video-R1 uses temporal-aware RL and mixed datasets to boost video reasoning in MLLMs, with a 7B model reaching 37.1% on VSI-Bench and surpassing GPT-4o.

Training Large Language Models to Reason in a Continuous Latent Space

cs.CL · 2024-12-09 · unverdicted · novelty 7.0

Coconut lets LLMs perform reasoning directly in continuous latent space by recycling hidden states as inputs, outperforming standard chain-of-thought on search-intensive logical tasks with better accuracy-efficiency trade-offs.

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

cs.LG · 2024-01-19 · conditional · novelty 7.0

Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.

Large Language Models as Optimizers

cs.LG · 2023-09-07 · unverdicted · novelty 7.0

Large language models can optimize by being prompted with histories of past solutions and scores to propose better ones, producing prompts that raise accuracy up to 8% on GSM8K and 50% on Big-Bench Hard over human-designed baselines.

Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

cs.CV · 2023-03-08 · accept · novelty 7.0

Visual ChatGPT integrates visual foundation models with ChatGPT via prompts to enable multi-step image understanding, generation, and editing in conversational interactions.

A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT

cs.SE · 2023-02-21 · accept · novelty 7.0

The authors present a catalog of prompt patterns that provide reusable solutions to common problems in generating and interacting with outputs from LLMs.

Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks

cs.CL · 2022-11-22 · unverdicted · novelty 7.0

PoT prompting improves numerical reasoning by having language models write programs executed by a computer instead of performing calculations in natural language chains of thought, with an average 12% gain over CoT.

Latent Reward Steering: An Adaptive Inference-Time Framework that Implicitly Promotes Cognitive Behaviors in Reasoning LLMs

cs.AI · 2026-05-30 · unverdicted · novelty 6.0

LRS trains a latent reward model on final-answer correctness to steer SAE states during inference, improving reasoning performance and implicitly encouraging better cognitive behaviors.

Efficient Table QA via TableGrid Navigation and Progressive Inference Prompting

cs.IR · 2026-05-18 · unverdicted · novelty 6.0

Introduces TableGrid Navigation (TGN) and Progressive Inference Prompting (PIP) as training-free structured prompting frameworks that improve LLM performance on table question answering over baselines on TableBench and achieve SOTA on FeTaQa.

MT-JailBench: A Modular Benchmark for Understanding Multi-Turn Jailbreak Attacks

cs.CR · 2026-05-10 · unverdicted · novelty 6.0

MT-JailBench is a modular benchmark that standardizes evaluation of multi-turn jailbreaks to identify key success drivers and enable stronger combined attacks.

PaT: Planning-after-Trial for Efficient Test-Time Code Generation

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

PaT defers planning until after failed trials in LLM code generation, enabling heterogeneous cheap-plus-powerful model setups that match large-model performance at roughly 69% lower cost.

citing papers explorer

Showing 42 of 92 citing papers.

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code cs.SE · 2024-03-12 · unverdicted · none · ref 122 · internal anchor
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
Reasoning with Language Model is Planning with World Model cs.CL · 2023-05-24 · unverdicted · none · ref 76 · internal anchor
RAP turns LLMs into dual world-model and planning agents via MCTS to generate better reasoning paths, outperforming CoT baselines and achieving 33% relative gains over GPT-4 CoT using LLaMA-33B on plan generation.
Towards Expert-Level Medical Question Answering with Large Language Models cs.CL · 2023-05-16 · unverdicted · none · ref 32 · internal anchor
Med-PaLM 2 achieves 86.5% accuracy on MedQA and approaches or exceeds prior state-of-the-art on other medical QA benchmarks while receiving higher physician preference ratings than human answers on consumer questions.
FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance cs.LG · 2023-05-09 · accept · none · ref 24 · internal anchor
FrugalGPT learns query-specific cascades across heterogeneous LLM APIs to match or exceed top-model accuracy at far lower cost.
CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society cs.AI · 2023-03-31 · conditional · none · ref 134 · internal anchor
CAMEL proposes a role-playing framework with inception prompting that enables autonomous multi-agent cooperation among LLMs and generates conversational data for studying their behaviors.
ART: Automatic multi-step reasoning and tool-use for large language models cs.CL · 2023-03-16 · unverdicted · none · ref 127 · internal anchor
ART automatically generates multi-step reasoning programs with tool integration for LLMs, yielding substantial gains over few-shot and auto-CoT prompting on BigBench and MMLU while matching hand-crafted CoT on most tasks.
Multimodal Chain-of-Thought Reasoning in Language Models cs.CL · 2023-02-02 · accept · none · ref 42 · internal anchor
Multimodal-CoT achieves state-of-the-art on ScienceQA by using a two-stage process that incorporates vision into chain-of-thought rationale generation for models under 1 billion parameters.
Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them cs.CL · 2022-10-17 · accept · none · ref 34 · internal anchor
Chain-of-thought prompting enables large language models to surpass average human performance on 17 of 23 challenging BIG-Bench tasks.
Automatic Chain of Thought Prompting in Large Language Models cs.CL · 2022-10-07 · conditional · none · ref 13 · internal anchor
Auto-CoT automatically builds chain-of-thought demonstrations by sampling diverse questions and letting the LLM generate reasoning chains, matching manual CoT performance on ten reasoning tasks with GPT-3.
Emergent Abilities of Large Language Models cs.CL · 2022-06-15 · unverdicted · none · ref 102 · internal anchor
Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model performance.
Thinking Economically: A Hierarchical Framework for Adaptive-Complexity Reasoning in LLMs cs.CL · 2026-05-31 · unverdicted · none · ref 20 · internal anchor
HAB applies coarse-to-fine budgeting to LLM reasoning, predicting per-problem depth and learning intra-step token budgets via PPL comparisons and adaptive Pareto optimization, yielding higher accuracy and lower token use than standard CoT on GSM8K and MATH500.
DynaGraph: Lightweight Multi-Model Interaction Framework via Dynamic Topological Reconfiguration cs.MA · 2026-05-28 · unverdicted · none · ref 8 · internal anchor
DynaGraph is a multi-model framework that multiplexes PEFT adapters on a shared base model with evaluator-driven dynamic topology reconfiguration and hierarchical self-healing to achieve near-72B performance on reasoning benchmarks using an 8B model while reducing latency and tokens.
Bridging the Detection-to-Abstention Gap in Reasoning Models under Insufficient Information cs.AI · 2026-05-27 · unverdicted · none · ref 52 · internal anchor
JTS trains reasoning models via supervised warm-up and missing-premise RL to make an explicit answerability commitment that triggers early termination on unanswerable inputs, raising Abstention@Detection near saturation.
Latent Action Reparameterization for Efficient Agent Inference cs.AI · 2026-05-18 · unverdicted · none · ref 50 · internal anchor
LAR learns a compact latent action space from trajectories that shortens the effective decision horizon for LLM agents, reducing token count and inference time while preserving task success.
Runtime-Structured Task Decomposition for Agentic Coding Systems cs.SE · 2026-05-14 · unverdicted · none · ref 17 · internal anchor
Runtime-structured task decomposition reduces retry costs in agentic coding systems by up to 51.7% versus monolithic prompts by rerunning only failed subtasks on two software engineering workloads.
Agentic AIs Are the Missing Paradigm for Out-of-Distribution Generalization in Foundation Models cs.LG · 2026-05-07 · unverdicted · none · ref 55 · internal anchor
Agentic AI systems are required to overcome the parameter coverage ceiling that prevents foundation models from handling certain out-of-distribution cases.
From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work cs.AI · 2026-05-07 · conditional · none · ref 8 · internal anchor
Execution lineage models AI-native work as a DAG of computations with explicit dependencies, achieving perfect state preservation in controlled update tasks where loop-based agents introduce churn and contamination.
NoisyCoconut: Counterfactual Consensus via Latent Space Reasoning cs.LG · 2026-05-06 · unverdicted · none · ref 30 · internal anchor
Injecting noise into LLM latent trajectories creates diverse reasoning paths whose agreement acts as a confidence signal for selective abstention, cutting error rates from 40-70% to under 15% on math tasks.
Explanation Quality Assessment as Ranking with Listwise Rewards cs.AI · 2026-04-27 · unverdicted · none · ref 4 · internal anchor
Explanation quality assessment is recast as ranking with listwise and pairwise losses that outperform regression, allow small models to match large ones on curated data, and enable stable convergence in reinforcement learning.
Bilevel Optimization of Agent Skills via Monte Carlo Tree Search cs.AI · 2026-04-17 · unverdicted · none · ref 20 · internal anchor
Bilevel optimization with outer-loop MCTS for skill structure and inner-loop LLM refinement improves agent accuracy on an operations-research question-answering dataset.
LACE: Lattice Attention for Cross-thread Exploration cs.AI · 2026-04-16 · unverdicted · none · ref 56 · 3 links · internal anchor
LACE enables concurrent reasoning paths in LLMs to interact via lattice attention and a synthetic training pipeline, raising accuracy more than 7 points over independent parallel search.
H-Probes: Extracting Hierarchical Structures From Latent Representations of Language Models cs.CL · 2026-04-15 · unverdicted · none · ref 23 · internal anchor
H-probes locate low-dimensional subspaces encoding hierarchy in LLM activations for synthetic tree tasks, show causal importance and generalization, and detect weaker signals in mathematical reasoning traces.
Semantic-Aware Logical Reasoning via a Semiotic Framework cs.AI · 2025-09-29 · conditional · none · ref 59 · internal anchor
LogicAgent uses a semiotic-square-guided approach to enhance logical reasoning in LLMs on the new RepublicQA benchmark and others, reporting average gains of 6.25% and 7.05% respectively.
MAC: Masked Agent Collaboration Boosts Large Language Model Medical Decision-Making cs.AI · 2025-07-25 · unverdicted · none · ref 26 · internal anchor
MAC framework selects Pareto-optimal LLM agents and masks low cross-consistency outputs for adaptive collaboration in medical decision-making.
Small Language Models are the Future of Agentic AI cs.AI · 2025-06-02 · unverdicted · none · ref 87 · internal anchor
Small language models are sufficiently capable, more suitable, and far more economical than large models for the repetitive tasks that dominate agentic AI systems.
Efficient Reasoning with Hidden Thinking cs.CL · 2025-01-31 · unverdicted · none · ref 30 · internal anchor
Heima compresses verbose CoT into hidden thinking tokens via information-theoretic analysis and an adaptive interpreter, claiming maintained or improved zero-shot accuracy on reasoning benchmarks.
CHESS: Contextual Harnessing for Efficient SQL Synthesis cs.LG · 2024-05-27 · conditional · none · ref 69 · internal anchor
CHESS deploys four LLM agents to retrieve information, prune schemas, generate refined SQL candidates, and validate via unit tests, reporting up to 71.10% accuracy on BIRD with 83% fewer calls than leading proprietary baselines.
StarCoder: may the source be with you! cs.CL · 2023-05-09 · accept · none · ref 118 · internal anchor
StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.
SGR: A Stepwise Reasoning Framework for LLMs with External Subgraph Generation cs.CL · 2026-05-15 · unverdicted · none · ref 20 · internal anchor
SGR enhances LLM reasoning accuracy by generating external subgraphs from knowledge bases and guiding progressive inference over them, yielding consistent gains over baselines on benchmarks.
ClinQueryAgent: A Conversational Agent for Population Health Management cs.IR · 2026-04-13 · unverdicted · none · ref 293 · internal anchor
The paper introduces ClinQueryAgent, a conversational agent that converts natural language queries into database queries for population health management while keeping patient data secure, and reports its use by 128 staff across 15 NHS practices covering 148,319 patients.
Syntax Is Easy, Semantics Is Hard: Evaluating LLMs for LTL Translation cs.LO · 2026-04-08 · unverdicted · none · ref 71 · internal anchor
LLMs handle LTL syntax better than semantics, improve with detailed prompts, and perform substantially better when the task is reframed as Python code completion.
LLM4Log: A Systematic Review of Large Language Model-based Log Analysis cs.SE · 2026-03-18 · unverdicted · none · ref 220 · 2 links · internal anchor
Systematic review of 145 papers on LLM-based log analysis, providing a unified taxonomy, common design patterns, evaluation practices, and challenges for deployment under drift and limited labels.
Agentic Reasoning for Large Language Models cs.AI · 2026-01-18 · unverdicted · none · ref 2 · internal anchor
The survey structures agentic reasoning for LLMs into foundational, self-evolving, and collective multi-agent layers while distinguishing in-context orchestration from post-training optimization and reviewing applications across domains.
The Prompt Engineering Report Distilled: Quick Start Guide for Life Sciences cs.CL · 2025-09-14 · unverdicted · none · ref 152 · 2 links · internal anchor
The paper reduces a broad set of prompt engineering techniques to six core approaches and applies them to life sciences use cases while addressing common LLM pitfalls.
From System 1 to System 2: A Survey of Reasoning Large Language Models cs.AI · 2025-02-24 · accept · none · ref 241 · internal anchor
The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.
Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models cs.AI · 2025-01-16 · unverdicted · none · ref 198 · internal anchor
The paper surveys reinforced reasoning techniques for LLMs, covering automated data construction, learning-to-reason methods, and test-time scaling as steps toward Large Reasoning Models.
Retrieval-Augmented Generation for Large Language Models: A Survey cs.CL · 2023-12-18 · unverdicted · none · ref 92 · internal anchor
A survey of RAG paradigms, components, benchmarks, and challenges for improving LLMs on knowledge-intensive tasks.
A Survey on Multimodal Large Language Models cs.CV · 2023-06-23 · accept · none · ref 209 · internal anchor
This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.
RAG over Thinking Traces Can Improve Reasoning Tasks cs.IR · 2026-05-05 · unreviewed · ref 30 · internal anchor
VIDA: A dataset for Visually Dependent Ambiguity in Multimodal Machine Translation cs.CL · 2026-05-03 · unreviewed · ref 34 · internal anchor
MobiBench: Multi-Branch, Modular Benchmark for Mobile GUI Agents cs.AI · 2025-12-14 · unreviewed · ref 56 · internal anchor
Discovering New Theorems via LLMs with In-Context Proof Learning in Lean cs.LG · 2025-09-16 · unreviewed · ref 15 · internal anchor

Least-to-Most Prompting Enables Complex Reasoning in Large Language Models

hub tools

citation-role summary

citation-polarity summary

claims ledger

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer