Recognition: 2 theorem links
· Lean Theorem
Least-to-Most Prompting Enables Complex Reasoning in Large Language Models
Pith reviewed 2026-05-11 08:39 UTC · model grok-4.3
The pith
Least-to-most prompting lets large language models solve complex reasoning problems by breaking them into simpler subproblems solved in sequence.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Least-to-most prompting first asks the model to produce a decomposition of the target problem into a sequence of simpler subproblems, then solves those subproblems in order while conditioning each new solution on all previous answers. This structure enables the model to reach problems that are harder than any shown in the prompt examples, producing at least 99 percent accuracy on every split of the SCAN compositional generalization benchmark with the code-davinci-002 model and only 14 exemplars.
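The two-stage procedure can be sketched as a short loop. This is a minimal sketch, not the paper's code: `decompose` and `solve` are deterministic stand-ins for the two prompted LLM calls, demonstrated on last-letter concatenation, one of the paper's symbolic manipulation tasks.

```python
def least_to_most(problem, decompose, solve):
    """Generic least-to-most loop: decompose the problem, then solve the
    subproblems in order, letting each step condition on earlier answers."""
    answers = []  # (subproblem, answer) pairs accumulated so far
    for sub in decompose(problem):
        answers.append((sub, solve(sub, answers)))
    return answers[-1][1]  # answer to the final, hardest subproblem

# Deterministic stand-ins for the two LLM stages (toy task: concatenate
# the last letter of each word).
def decompose(words):
    # progressively longer prefixes: each subproblem extends the previous one
    return [words[:i + 1] for i in range(len(words))]

def solve(sub, answers):
    prev = answers[-1][1] if answers else ""
    return prev + sub[-1][-1]  # previous answer + last letter of the new word

print(least_to_most(["think", "machine", "learning"], decompose, solve))  # -> keg
```

In the paper both stages are few-shot prompted LLM calls, and the accumulated Q/A pairs are appended to the solver's prompt context rather than passed programmatically.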
What carries the argument
least-to-most prompting, which decomposes a complex problem into simpler subproblems and solves them sequentially while using prior answers to condition later steps
Load-bearing premise
The model can generate a correct decomposition and solve each subproblem without errors from earlier steps compounding into later ones.
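A back-of-envelope calculation shows why this premise carries weight. Assuming, purely for illustration (the paper reports no per-step accuracies), that step errors are independent, end-to-end accuracy decays geometrically in chain length:

```python
# Illustrative error-compounding estimate: even 99% per-step accuracy
# erodes over a 20-step chain if errors are independent.
p_decomp = 0.99   # hypothetical probability of a correct decomposition
p_step = 0.99     # hypothetical per-subproblem accuracy
k = 20            # hypothetical number of subproblems in the chain

end_to_end = p_decomp * p_step ** k
print(f"{end_to_end:.3f}")  # -> 0.810
```

Under these assumptions, a 99% end-to-end figure on long commands would require per-step reliability well above 99%, which is part of what makes the reported result notable.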
What would settle it
A test set of compositional tasks where every valid decomposition still produces subproblems whose correct solutions depend on information that only appears in later subproblems, causing accuracy to fall below 50 percent even with perfect decompositions supplied in the prompt.
read the original abstract
Chain-of-thought prompting has demonstrated remarkable performance on various natural language reasoning tasks. However, it tends to perform poorly on tasks which require solving problems harder than the exemplars shown in the prompts. To overcome this challenge of easy-to-hard generalization, we propose a novel prompting strategy, least-to-most prompting. The key idea in this strategy is to break down a complex problem into a series of simpler subproblems and then solve them in sequence. Solving each subproblem is facilitated by the answers to previously solved subproblems. Our experimental results on tasks related to symbolic manipulation, compositional generalization, and math reasoning reveal that least-to-most prompting is capable of generalizing to more difficult problems than those seen in the prompts. A notable finding is that when the GPT-3 code-davinci-002 model is used with least-to-most prompting, it can solve the compositional generalization benchmark SCAN in any split (including length split) with an accuracy of at least 99% using just 14 exemplars, compared to only 16% accuracy with chain-of-thought prompting. This is particularly noteworthy because neural-symbolic models in the literature that specialize in solving SCAN are trained on the entire training set containing over 15,000 examples. We have included prompts for all the tasks in the Appendix.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes least-to-most prompting, a technique that decomposes complex problems into simpler subproblems solved sequentially, with each step conditioned on prior solutions. It evaluates the method on symbolic manipulation, compositional generalization (SCAN benchmark), and math reasoning tasks. The central empirical claim is that GPT-3 code-davinci-002 with least-to-most prompting achieves at least 99% accuracy on every SCAN split (including length) using only 14 exemplars, versus 16% with chain-of-thought prompting; the paper supplies the prompts in the appendix.
Significance. If the results are robust, the work is significant for demonstrating that a simple prompting decomposition strategy can elicit compositional generalization in LLMs on a benchmark where prior neural-symbolic systems required full training sets of >15k examples. The consistent gains across tasks and the provision of full prompts for reproducibility are strengths. The approach directly targets the easy-to-hard generalization limitation of chain-of-thought prompting.
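For context on what SCAN actually demands, its command language can be captured by a small interpreter. This is a sketch written from the benchmark's published grammar, not code from the paper under review; the length split trains on short commands and tests on longer ones, so a model must compose these rules rather than pattern-match outputs.

```python
ACT = {"walk": "WALK", "look": "LOOK", "run": "RUN", "jump": "JUMP"}

def scan(cmd):
    """Expand a SCAN command into its action sequence."""
    if " and " in cmd:                       # "X and Y": run X, then Y
        a, b = cmd.split(" and ", 1)
        return scan(a) + scan(b)
    if " after " in cmd:                     # "X after Y": run Y, then X
        a, b = cmd.split(" after ", 1)
        return scan(b) + scan(a)
    for word, n in (("twice", 2), ("thrice", 3)):
        if cmd.endswith(" " + word):
            return scan(cmd[: -len(word) - 1]) * n
    words = cmd.split()
    if words[-1] in ("left", "right"):
        turn = ["TURN " + words[-1].upper()]
        body = [] if words[0] == "turn" else [ACT[words[0]]]
        mod = words[1] if len(words) == 3 else None
        if mod == "around":                  # turn-and-act, four times
            return (turn + body) * 4
        if mod == "opposite":                # turn twice, then act
            return turn * 2 + body
        return turn + body
    return [ACT[words[0]]]

print(scan("jump around right thrice") == ["TURN RIGHT", "JUMP"] * 12)  # -> True
```

Chain-of-thought must emit such long sequences in one pass; least-to-most first reduces a long command to shorter ones (e.g. "jump around right") whose expansions it has already produced.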
major comments (1)
- [Abstract and SCAN results section] The 99% end-to-end accuracy on the length split is reported without a separate metric or ablation for decomposition-step correctness on the held-out length-split commands. Because the length split specifically tests whether the few-shot decomposition prompt itself generalizes compositionally, the absence of this intermediate accuracy leaves open whether the final number is explained by reliable subproblem generation or by other factors.
minor comments (2)
- [Results] The paper does not report variance across multiple runs or random seeds for the SCAN results, which would help assess stability of the 99% figure.
- [Method] While prompts are included in the appendix, a brief description in the main text of how the decomposition and solver exemplars were selected or constructed would improve clarity.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and positive assessment of the work's significance. We address the single major comment below and will revise the manuscript accordingly.
read point-by-point responses
-
Referee: the 99% end-to-end accuracy on the length split is reported without a separate metric or ablation for decomposition-step correctness on the held-out length-split commands. Because the length split specifically tests whether the few-shot decomposition prompt itself generalizes compositionally, the absence of this intermediate accuracy leaves open whether the final number is explained by reliable subproblem generation or by other factors.
Authors: We agree that reporting the accuracy of the decomposition steps on the length split would strengthen the evidence that the few-shot prompt itself generalizes compositionally. The manuscript emphasizes end-to-end accuracy as the primary result, but we acknowledge that this leaves some ambiguity regarding the source of the performance. In the revised manuscript we will add an ablation or table that reports decomposition-step correctness separately on the held-out length commands, directly addressing whether the high end-to-end accuracy arises from reliable subproblem generation.
revision: yes
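The requested ablation is cheap to specify. A sketch of such a harness follows; `decompose_fn` and `solve_fn` are hypothetical stand-ins for the two prompted stages, and the field names are illustrative, not taken from the paper.

```python
def evaluate(examples, decompose_fn, solve_fn):
    """Score decomposition-step correctness separately from end-to-end
    accuracy, so the two failure modes can be told apart."""
    decomp_hits = e2e_hits = 0
    for ex in examples:  # each ex: input, gold_decomposition, gold_output
        pred_decomp = decompose_fn(ex["input"])
        decomp_hits += pred_decomp == ex["gold_decomposition"]
        e2e_hits += solve_fn(ex["input"], pred_decomp) == ex["gold_output"]
    n = len(examples)
    return {"decomposition_acc": decomp_hits / n, "end_to_end_acc": e2e_hits / n}
```

A high end-to-end score paired with a low decomposition score would indicate the solver is compensating for bad decompositions, which is exactly the ambiguity the referee flags.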
Circularity Check
No circularity: purely empirical prompting results on external benchmarks
full rationale
The paper introduces least-to-most prompting as an empirical technique and validates it through accuracy measurements on fixed benchmarks (SCAN, math word problems, etc.) against baselines such as chain-of-thought. No equations, derivations, fitted parameters, uniqueness theorems, or self-referential definitions appear; all reported numbers are direct experimental outcomes using the same model and prompt templates shown in the appendix. The central claim (99% SCAN accuracy with 14 exemplars) is an observed performance figure, not a prediction derived from prior results within the paper.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Large language models can follow instructions to solve subproblems sequentially when prompted appropriately.
Forward citations
Cited by 60 Pith papers
-
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Tree of Thoughts enables language models to solve complex planning tasks by generating, evaluating, and searching over coherent intermediate thoughts in a tree, raising Game of 24 success from 4% to 74% with GPT-4.
-
PAL: Program-aided Language Models
PAL improves few-shot reasoning accuracy by having LLMs generate executable programs rather than text-based chains of thought, outperforming much larger models on math and logic benchmarks.
-
Code as Policies: Language Model Programs for Embodied Control
Language models generate robot policy code from natural language commands via few-shot prompting, enabling spatial-geometric reasoning, generalization, and precise control on real robots.
-
LEAD: Length-Efficient Adaptive and Dynamic Reasoning for Large Language Models
LEAD uses online adaptive mechanisms including Potential-Scaled Instability and symmetric efficiency rewards based on correct rollouts to achieve higher accuracy-efficiency scores with substantially shorter reasoning ...
-
Think-with-Rubrics: From External Evaluator to Internal Reasoning Guidance
Think-with-Rubrics has LLMs generate rubrics internally before responding, outperforming external rubric-as-reward baselines by 3.87 points on average across benchmarks.
-
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.
-
A Multi-View Media Profiling Suite: Resources, Evaluation, and Analysis
Presents MBFC-2025 dataset and multi-view embeddings with fusion methods for media bias and factuality, reporting SOTA results on ACL-2020 and new benchmarks on MBFC-2025.
-
Incisor: Ex Ante Cloud Instance Selection for HPC Jobs
Incisor uses program analysis and frontier LLMs to select working AWS EC2 instances ex ante for 100% of first-time HPC runs of C/C++/Fortran and Python codes, cutting runtime 54% and costs 44% versus an expert-constra...
-
Discovering a Shared Logical Subspace: Steering LLM Logical Reasoning via Alignment of Natural-Language and Symbolic Views
Applying Canonical Correlation Analysis to paired residual activations from natural-language and symbolic reasoning chains in LLMs reveals a low-dimensional shared logical subspace that can steer the model's reasoning...
-
Self-Correcting RAG: Enhancing Faithfulness via MMKP Context Selection and NLI-Guided MCTS
Self-Correcting RAG formalizes retrieval as MMKP to maximize information density under token limits and uses NLI-guided MCTS to validate faithfulness, raising accuracy and cutting hallucinations on six multi-hop QA an...
-
iTAG: Inverse Design for Natural Text Generation with Accurate Causal Graph Annotations
iTAG generates natural text paired with accurate causal graph annotations by framing concept assignment as an inverse problem and refining selections via chain-of-thought reasoning until the text's relations align wit...
-
Evaluating the Environmental Impact of using SLMs and Prompt Engineering for Code Generation
Chain-of-Thought prompting balances high accuracy with low energy use in small language models for code generation, while multi-sampling strategies add high energy costs for small accuracy gains.
-
LLM4Log: A Systematic Review of Large Language Model-based Log Analysis
LLM4Log is a systematic review of 145 papers on LLM-based log analysis that delivers a unified taxonomy, design patterns, and open challenges for reliable adoption in AIOps.
-
Video-R1: Reinforcing Video Reasoning in MLLMs
Video-R1 uses temporal-aware RL and mixed datasets to boost video reasoning in MLLMs, with a 7B model reaching 37.1% on VSI-Bench and surpassing GPT-4o.
-
Training Large Language Models to Reason in a Continuous Latent Space
Coconut lets LLMs perform reasoning directly in continuous latent space by recycling hidden states as inputs, outperforming standard chain-of-thought on search-intensive logical tasks with better accuracy-efficiency t...
-
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
-
Large Language Models as Optimizers
Large language models can optimize by being prompted with histories of past solutions and scores to propose better ones, producing prompts that raise accuracy up to 8% on GSM8K and 50% on Big-Bench Hard over human-des...
-
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
Visual ChatGPT integrates visual foundation models with ChatGPT via prompts to enable multi-step image understanding, generation, and editing in conversational interactions.
-
A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT
The authors present a catalog of prompt patterns that provide reusable solutions to common problems in generating and interacting with outputs from LLMs.
-
Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks
PoT prompting improves numerical reasoning by having language models write programs executed by a computer instead of performing calculations in natural language chains of thought, with an average 12% gain over CoT.
-
MT-JailBench: A Modular Benchmark for Understanding Multi-Turn Jailbreak Attacks
MT-JailBench is a modular benchmark that standardizes evaluation of multi-turn jailbreaks to identify key success drivers and enable stronger combined attacks.
-
PaT: Planning-after-Trial for Efficient Test-Time Code Generation
PaT defers planning until after failed trials in LLM code generation, enabling heterogeneous cheap-plus-powerful model setups that match large-model performance at roughly 69% lower cost.
-
Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models
Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing ...
-
Learning Agent Routing From Early Experience
BoundaryRouter routes queries to LLM or agent using early experience memory from a seed set, cutting inference time 60.6% versus always using agents and raising performance 28.6% versus always using direct LLM inference.
-
ReFlect: An Effective Harness System for Complex Long-Horizon LLM Reasoning
ReFlect is a harness that wraps LLMs to detect and recover from reasoning errors, achieving 7-29 pp gains over direct CoT on long-horizon tasks and improving code patch quality to 82-87%.
-
RAG over Thinking Traces Can Improve Reasoning Tasks
RAG over structured thinking traces boosts LLM reasoning on AIME, LiveCodeBench, and GPQA, with relative gains up to 56% and little added cost.
-
A Multimodal Dataset for Visually Grounded Ambiguity in Machine Translation
VIDA provides 2,500 visually-dependent ambiguous MT instances and LLM-judge metrics; chain-of-thought SFT improves disambiguation accuracy over standard SFT, especially out-of-distribution.
-
HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering
HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.
-
LACE: Lattice Attention for Cross-thread Exploration
LACE enables parallel reasoning paths in LLMs to communicate via lattice attention and error-correct using synthetic training data, improving accuracy by over 7 points over standard parallel search.
-
C-voting: Confidence-Based Test-Time Voting without Explicit Energy Functions
C-voting improves recurrent reasoning models by selecting among multiple latent trajectories the one with highest average top-1 probability, achieving 4.9% better Sudoku-hard accuracy than energy-based voting and outp...
-
Structured Safety Auditing for Balancing Code Correctness and Content Safety in LLM-Generated Code
Dual Reasoning with explicit safety audits improves the new SUDS metric by 1.32x to 3.42x over baselines on code generation benchmarks containing injected harmful keywords.
-
When Verification Fails: How Compositionally Infeasible Claims Escape Rejection
AI claim verification models rely on salient-constraint shortcuts instead of full compositional reasoning under the closed-world assumption, as revealed by their over-acceptance of claims with supported salient constr...
-
FACT-E: Causality-Inspired Evaluation for Trustworthy Chain-of-Thought Reasoning
FACT-E uses controlled perturbations as an instrumental signal to measure intra-chain faithfulness in CoT reasoning and combines it with answer consistency to select trustworthy trajectories.
-
From Perception to Planning: Evolving Ego-Centric Task-Oriented Spatiotemporal Reasoning via Curriculum Learning
EgoTSR applies a three-stage curriculum on a 46-million-sample dataset to build egocentric spatiotemporal reasoning, reaching 92.4% accuracy on long-horizon tasks and reducing chronological biases.
-
Reason Only When Needed: Efficient Generative Reward Modeling via Model-Internal Uncertainty
E-GRM triggers CoT reasoning in generative reward models only when parallel generations show high uncertainty, reducing inference cost and raising accuracy on reasoning benchmarks via a hybrid regression-ranking scorer.
-
Generative AI Agent Empowered Power Allocation for HAP Propulsion and Communication Systems
A generative AI agent creates a realistic HAP propulsion power model including aerodynamic interference and enables a Q3E beamforming algorithm that improves QoS and energy efficiency.
-
ExecTune: Effective Steering of Black-Box LLMs with Guide Models
ExecTune trains guide models via acceptance sampling, supervised fine-tuning, and structure-aware RL to boost executability of strategies for black-box LLMs, yielding up to 9.2% higher accuracy and 22.4% lower cost on...
-
SeLaR: Selective Latent Reasoning in Large Language Models
SeLaR selectively applies latent soft reasoning in LLMs via entropy gating and contrastive regularization, outperforming standard CoT on five benchmarks without training.
-
Measuring Representation Robustness in Large Language Models for Geometry
LLMs display accuracy gaps of up to 14 percentage points on the same geometry problems solely due to representation choice, with vector forms consistently weakest and a convert-then-solve prompt helping only high-capa...
-
Sustainability Analysis of Prompt Strategies for SLM-based Automated Test Generation
Prompt strategies for SLM-based automated test generation vary widely in energy consumption and carbon emissions, with simpler strategies delivering competitive coverage at markedly lower environmental cost.
-
Oblivion: Self-Adaptive Agentic Memory Control through Decay-Driven Activation
Oblivion is a decay-driven memory framework that decouples read and write paths in LLM agents to enable adaptive forgetting and reinforcement for better long-horizon reasoning.
-
LiFT: Does Instruction Fine-Tuning Improve In-Context Learning for Longitudinal Modelling by Large Language Models?
LiFT instruction fine-tunes LLMs with a temporal curriculum to improve in-context learning on longitudinal NLP tasks, yielding gains on out-of-distribution data and rare change events across multiple model sizes.
-
Towards an AI co-scientist
A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.
-
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
-
FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance
FrugalGPT learns query-specific cascades across heterogeneous LLM APIs to match or exceed top-model accuracy at far lower cost.
-
CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society
CAMEL proposes a role-playing framework with inception prompting that enables autonomous multi-agent cooperation among LLMs and generates conversational data for studying their behaviors.
-
Multimodal Chain-of-Thought Reasoning in Language Models
Multimodal-CoT achieves state-of-the-art on ScienceQA by using a two-stage process that incorporates vision into chain-of-thought rationale generation for models under 1 billion parameters.
-
Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
Chain-of-thought prompting enables large language models to surpass average human performance on 17 of 23 challenging BIG-Bench tasks.
-
Emergent Abilities of Large Language Models
Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model performance.
-
Agentic AIs Are the Missing Paradigm for Out-of-Distribution Generalization in Foundation Models
Agentic AI systems are required to overcome the parameter coverage ceiling that prevents foundation models from handling certain out-of-distribution cases.
-
From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work
Execution lineage models AI-native work as a DAG of computations with explicit dependencies, achieving perfect state preservation in controlled update tasks where loop-based agents introduce churn and contamination.
-
NoisyCoconut: Counterfactual Consensus via Latent Space Reasoning
Injecting noise into LLM latent trajectories creates diverse reasoning paths whose agreement acts as a confidence signal for selective abstention, cutting error rates from 40-70% to under 15% on math tasks.
-
Explanation Quality Assessment as Ranking with Listwise Rewards
Explanation quality assessment is recast as ranking with listwise and pairwise losses that outperform regression, allow small models to match large ones on curated data, and enable stable convergence in reinforcement ...
-
Bilevel Optimization of Agent Skills via Monte Carlo Tree Search
Bilevel optimization with outer-loop MCTS for skill structure and inner-loop LLM refinement improves agent accuracy on an operations-research question-answering dataset.
-
H-Probes: Extracting Hierarchical Structures From Latent Representations of Language Models
H-probes locate low-dimensional subspaces encoding hierarchy in LLM activations for synthetic tree tasks, show causal importance and generalization, and detect weaker signals in mathematical reasoning traces.
-
StarCoder: may the source be with you!
StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.
-
Syntax Is Easy, Semantics Is Hard: Evaluating LLMs for LTL Translation
LLMs handle LTL syntax better than semantics, improve with detailed prompts, and perform substantially better when the task is reframed as Python code completion.