pith. machine review for the scientific record.

arxiv: 2205.10625 · v3 · submitted 2022-05-21 · 💻 cs.AI · cs.CL

Recognition: 2 theorem links

· Lean Theorem

Least-to-Most Prompting Enables Complex Reasoning in Large Language Models

Authors on Pith · no claims yet

Pith reviewed 2026-05-11 08:39 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords least-to-most prompting · chain-of-thought prompting · compositional generalization · SCAN benchmark · large language models · reasoning · symbolic manipulation

The pith

Least-to-most prompting lets large language models solve complex reasoning problems by breaking them into simpler subproblems solved in sequence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces least-to-most prompting as a way to overcome the easy-to-hard generalization limits of chain-of-thought prompting. The method decomposes a hard problem into a chain of easier subproblems and solves them one by one, feeding each answer forward to help with the next. Experiments show this works across symbolic manipulation, compositional generalization, and math reasoning tasks. The standout result is near-perfect performance on the SCAN benchmark in every split, including the challenging length split, using only 14 examples.

Core claim

Least-to-most prompting first asks the model to decompose the target problem into a sequence of simpler subproblems, then solves those subproblems in order, conditioning each new solution on all previous answers. This structure lets the model solve problems harder than any shown in the prompt exemplars, reaching at least 99 percent accuracy on every split of the SCAN compositional generalization benchmark with the code-davinci-002 model and only 14 exemplars.
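The two stages can be sketched as a plain control loop (a minimal sketch with hypothetical helper names; the paper's actual prompts are reproduced in its appendix). The toy stand-ins below mimic the paper's last-letter-concatenation task, where the decomposition is the sequence of ever-longer prefixes of the word list:

```python
# Minimal sketch of the least-to-most control flow. `decompose` and
# `solve` stand in for two prompted LLM calls; here they are toy
# deterministic functions so the flow is runnable.

def least_to_most(problem, decompose, solve):
    subproblems = decompose(problem)      # stage 1: problem -> subproblems
    context = []
    for sub in subproblems:               # stage 2: solve in order
        answer = solve(sub, context)      # conditions on prior answers
        context.append((sub, answer))
    return context[-1][1]                 # answer to the final subproblem

# Toy stand-ins for the model, using last-letter concatenation:
# each subproblem is a longer prefix of the word list.
def toy_decompose(words):
    return [words[:i] for i in range(2, len(words) + 1)]

def toy_solve(sub_words, context):
    prev = context[-1][1] if context else sub_words[0][-1]
    return prev + sub_words[-1][-1]       # extend with the new last letter

print(least_to_most(["think", "machine", "learning"],
                    toy_decompose, toy_solve))  # prints "keg"
```

The key design point the sketch preserves is that `context` grows monotonically, so every later subproblem is answered with all earlier question-answer pairs in scope.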

What carries the argument

least-to-most prompting, which decomposes a complex problem into simpler subproblems and solves them sequentially while using prior answers to condition later steps

Load-bearing premise

The model can generate a correct decomposition and solve each subproblem without errors from earlier steps compounding into later ones.
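One way to see why this premise is load-bearing (a back-of-envelope illustration, not a figure from the paper): if each of k sequential subproblems is solved correctly with independent per-step probability p, the whole chain succeeds with probability p**k, so even small per-step error rates compound over long decompositions:

```python
# Error compounding under an independence assumption (illustrative only).
def chain_accuracy(p, k):
    """Probability the full chain of k steps is correct if each step
    independently succeeds with probability p."""
    return p ** k

print(round(chain_accuracy(0.99, 10), 3))  # prints 0.904
print(round(chain_accuracy(0.95, 10), 3))  # prints 0.599
```

The near-perfect SCAN results therefore imply per-step reliability well above what the 0.95 row suggests, which is exactly what makes the premise worth probing.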

What would settle it

A test set of compositional tasks where every valid decomposition still produces subproblems whose correct solutions depend on information that only appears in later subproblems, causing accuracy to fall below 50 percent even with perfect decompositions supplied in the prompt.

read the original abstract

Chain-of-thought prompting has demonstrated remarkable performance on various natural language reasoning tasks. However, it tends to perform poorly on tasks that require solving problems harder than the exemplars shown in the prompts. To overcome this challenge of easy-to-hard generalization, we propose a novel prompting strategy, least-to-most prompting. The key idea in this strategy is to break down a complex problem into a series of simpler subproblems and then solve them in sequence. Solving each subproblem is facilitated by the answers to previously solved subproblems. Our experimental results on tasks related to symbolic manipulation, compositional generalization, and math reasoning reveal that least-to-most prompting is capable of generalizing to more difficult problems than those seen in the prompts. A notable finding is that when the GPT-3 code-davinci-002 model is used with least-to-most prompting, it can solve the compositional generalization benchmark SCAN in any split (including length split) with an accuracy of at least 99% using just 14 exemplars, compared to only 16% accuracy with chain-of-thought prompting. This is particularly noteworthy because neural-symbolic models in the literature that specialize in solving SCAN are trained on the entire training set containing over 15,000 examples. We have included prompts for all the tasks in the Appendix.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes least-to-most prompting, a technique that decomposes complex problems into simpler subproblems solved sequentially, with each step conditioned on prior solutions. It evaluates the method on symbolic manipulation, compositional generalization (SCAN benchmark), and math reasoning tasks. The central empirical claim is that GPT-3 code-davinci-002 with least-to-most prompting achieves at least 99% accuracy on every SCAN split (including length) using only 14 exemplars, versus 16% with chain-of-thought prompting; the paper supplies the prompts in the appendix.

Significance. If the results are robust, the work is significant for demonstrating that a simple prompting decomposition strategy can elicit compositional generalization in LLMs on a benchmark where prior neural-symbolic systems required full training sets of >15k examples. The consistent gains across tasks and the provision of full prompts for reproducibility are strengths. The approach directly targets the easy-to-hard generalization limitation of chain-of-thought prompting.

major comments (1)
  1. [Abstract and SCAN results section] The 99% end-to-end accuracy on the length split is reported without a separate metric or ablation for decomposition-step correctness on the held-out length-split commands. Because the length split specifically tests whether the few-shot decomposition prompt itself generalizes compositionally, the absence of this intermediate accuracy leaves open whether the final number is explained by reliable subproblem generation or by other factors.
minor comments (2)
  1. [Results] The paper does not report variance across multiple runs or random seeds for the SCAN results, which would help assess stability of the 99% figure.
  2. [Method] While prompts are included in the appendix, a brief description in the main text of how the decomposition and solver exemplars were selected or constructed would improve clarity.
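A minimal sketch of the ablation the major comment requests, scoring decomposition correctness separately from end-to-end accuracy (the toy commands, gold decompositions, and the `split_after` rule are all illustrative, not the paper's actual SCAN prompts; in SCAN semantics, "X after Y" executes Y before X):

```python
# Score the decomposition step in isolation: fraction of held-out
# commands for which the predicted decomposition matches the gold one.
def decomposition_accuracy(examples, decompose):
    correct = sum(decompose(cmd) == gold for cmd, gold in examples)
    return correct / len(examples)

# Toy held-out set: gold decompositions split a command at "after",
# with the clause after "after" executed first.
toy = [("jump after walk", ["walk", "jump"]),
       ("run twice after look", ["look", "run twice"])]

def split_after(cmd):
    head, _, tail = cmd.partition(" after ")
    return [tail, head]

print(decomposition_accuracy(toy, split_after))  # prints 1.0
```

Reporting this number alongside end-to-end accuracy on the length split would separate decomposition failures from solver failures, which is the distinction the referee's comment turns on.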

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and positive assessment of the work's significance. We address the single major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: the 99% end-to-end accuracy on the length split is reported without a separate metric or ablation for decomposition-step correctness on the held-out length-split commands. Because the length split specifically tests whether the few-shot decomposition prompt itself generalizes compositionally, the absence of this intermediate accuracy leaves open whether the final number is explained by reliable subproblem generation or by other factors.

    Authors: We agree that reporting the accuracy of the decomposition steps on the length split would strengthen the evidence that the few-shot prompt itself generalizes compositionally. The manuscript emphasizes end-to-end accuracy as the primary result, but we acknowledge that this leaves some ambiguity regarding the source of the performance. In the revised manuscript we will add a new ablation or table that reports decomposition-step correctness separately on the held-out length commands. This addition will directly address whether the high end-to-end accuracy arises from reliable subproblem generation. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical prompting results on external benchmarks

full rationale

The paper introduces least-to-most prompting as an empirical technique and validates it through accuracy measurements on fixed benchmarks (SCAN, math word problems, etc.) against baselines such as chain-of-thought. No equations, derivations, fitted parameters, uniqueness theorems, or self-referential definitions appear; all reported numbers are direct experimental outcomes using the same model and prompt templates shown in the appendix. The central claim (99% SCAN accuracy with 14 exemplars) is an observed performance figure, not a prediction derived from prior results within the paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that LLMs can execute sequential subproblem solving when instructed, with no free parameters or new entities introduced.

axioms (1)
  • domain assumption Large language models can follow instructions to solve subproblems sequentially when prompted appropriately.
    Invoked to explain why decomposition improves performance over direct chain-of-thought on harder instances.

pith-pipeline@v0.9.0 · 5558 in / 1127 out tokens · 48995 ms · 2026-05-11T08:39:19.947898+00:00 · methodology

discussion (0)


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Tree of Thoughts: Deliberate Problem Solving with Large Language Models

    cs.CL 2023-05 accept novelty 8.0

    Tree of Thoughts enables language models to solve complex planning tasks by generating, evaluating, and searching over coherent intermediate thoughts in a tree, raising Game of 24 success from 4% to 74% with GPT-4.

  2. PAL: Program-aided Language Models

    cs.CL 2022-11 conditional novelty 8.0

    PAL improves few-shot reasoning accuracy by having LLMs generate executable programs rather than text-based chains of thought, outperforming much larger models on math and logic benchmarks.

  3. Code as Policies: Language Model Programs for Embodied Control

    cs.RO 2022-09 accept novelty 8.0

    Language models generate robot policy code from natural language commands via few-shot prompting, enabling spatial-geometric reasoning, generalization, and precise control on real robots.

  4. LEAD: Length-Efficient Adaptive and Dynamic Reasoning for Large Language Models

    cs.LG 2026-05 unverdicted novelty 7.0

    LEAD uses online adaptive mechanisms including Potential-Scaled Instability and symmetric efficiency rewards based on correct rollouts to achieve higher accuracy-efficiency scores with substantially shorter reasoning ...

  5. Think-with-Rubrics: From External Evaluator to Internal Reasoning Guidance

    cs.CL 2026-05 unverdicted novelty 7.0

    Think-with-Rubrics has LLMs generate rubrics internally before responding, outperforming external rubric-as-reward baselines by 3.87 points on average across benchmarks.

  6. Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

    cs.AI 2026-05 unverdicted novelty 7.0

    RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.

  7. Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

    cs.AI 2026-05 unverdicted novelty 7.0

    RL training on more expressive logical tasks follows a steeper power-law scaling with reasoning depth and transfers more efficiently to math and reasoning benchmarks.

  8. A Multi-View Media Profiling Suite: Resources, Evaluation, and Analysis

    cs.CL 2026-05 unverdicted novelty 7.0

    Presents MBFC-2025 dataset and multi-view embeddings with fusion methods for media bias and factuality, reporting SOTA results on ACL-2020 and new benchmarks on MBFC-2025.

  9. Incisor: Ex Ante Cloud Instance Selection for HPC Jobs

    cs.DC 2026-04 unverdicted novelty 7.0

    Incisor uses program analysis and frontier LLMs to select working AWS EC2 instances ex ante for 100% of first-time HPC runs of C/C++/Fortran and Python codes, cutting runtime 54% and costs 44% versus an expert-constra...

  10. Discovering a Shared Logical Subspace: Steering LLM Logical Reasoning via Alignment of Natural-Language and Symbolic Views

    cs.CL 2026-04 unverdicted novelty 7.0

    Applying Canonical Correlation Analysis to paired residual activations from natural-language and symbolic reasoning chains in LLMs reveals a low-dimensional shared logical subspace that can steer the model's reasoning...

  11. Self-Correcting RAG: Enhancing Faithfulness via MMKP Context Selection and NLI-Guided MCTS

    cs.CL 2026-04 unverdicted novelty 7.0

    Self-Correcting RAG formalizes retrieval as MMKP to maximize information density under token limits and uses NLI-guided MCTS to validate faithfulness, raising accuracy and cutting hallucinations on six multi-hop QA an...

  12. iTAG: Inverse Design for Natural Text Generation with Accurate Causal Graph Annotations

    cs.CL 2026-04 unverdicted novelty 7.0

    iTAG generates natural text paired with accurate causal graph annotations by framing concept assignment as an inverse problem and refining selections via chain-of-thought reasoning until the text's relations align wit...

  13. Evaluating the Environmental Impact of using SLMs and Prompt Engineering for Code Generation

    cs.SE 2026-04 unverdicted novelty 7.0

    Chain-of-Thought prompting balances high accuracy with low energy use in small language models for code generation, while multi-sampling strategies add high energy costs for small accuracy gains.

  14. LLM4Log: A Systematic Review of Large Language Model-based Log Analysis

    cs.SE 2026-03 accept novelty 7.0

    LLM4Log is a systematic review of 145 papers on LLM-based log analysis that delivers a unified taxonomy, design patterns, and open challenges for reliable adoption in AIOps.

  15. Video-R1: Reinforcing Video Reasoning in MLLMs

    cs.CV 2025-03 conditional novelty 7.0

    Video-R1 uses temporal-aware RL and mixed datasets to boost video reasoning in MLLMs, with a 7B model reaching 37.1% on VSI-Bench and surpassing GPT-4o.

  16. Training Large Language Models to Reason in a Continuous Latent Space

    cs.CL 2024-12 unverdicted novelty 7.0

    Coconut lets LLMs perform reasoning directly in continuous latent space by recycling hidden states as inputs, outperforming standard chain-of-thought on search-intensive logical tasks with better accuracy-efficiency t...

  17. Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

    cs.LG 2024-01 conditional novelty 7.0

    Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.

  18. Large Language Models as Optimizers

    cs.LG 2023-09 unverdicted novelty 7.0

    Large language models can optimize by being prompted with histories of past solutions and scores to propose better ones, producing prompts that raise accuracy up to 8% on GSM8K and 50% on Big-Bench Hard over human-des...

  19. Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

    cs.CV 2023-03 accept novelty 7.0

    Visual ChatGPT integrates visual foundation models with ChatGPT via prompts to enable multi-step image understanding, generation, and editing in conversational interactions.

  20. A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT

    cs.SE 2023-02 accept novelty 7.0

    The authors present a catalog of prompt patterns that provide reusable solutions to common problems in generating and interacting with outputs from LLMs.

  21. Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks

    cs.CL 2022-11 unverdicted novelty 7.0

    PoT prompting improves numerical reasoning by having language models write programs executed by a computer instead of performing calculations in natural language chains of thought, with an average 12% gain over CoT.

  22. MT-JailBench: A Modular Benchmark for Understanding Multi-Turn Jailbreak Attacks

    cs.CR 2026-05 unverdicted novelty 6.0

    MT-JailBench is a modular benchmark that standardizes evaluation of multi-turn jailbreaks to identify key success drivers and enable stronger combined attacks.

  23. PaT: Planning-after-Trial for Efficient Test-Time Code Generation

    cs.CL 2026-05 unverdicted novelty 6.0

    PaT defers planning until after failed trials in LLM code generation, enabling heterogeneous cheap-plus-powerful model setups that match large-model performance at roughly 69% lower cost.

  24. Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models

    cs.LG 2026-05 unverdicted novelty 6.0

    Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing ...

  25. Learning Agent Routing From Early Experience

    cs.CL 2026-05 unverdicted novelty 6.0

    BoundaryRouter routes queries to LLM or agent using early experience memory from a seed set, cutting inference time 60.6% versus always using agents and raising performance 28.6% versus always using direct LLM inference.

  26. ReFlect: An Effective Harness System for Complex Long-Horizon LLM Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    ReFlect is a harness that wraps LLMs to detect and recover from reasoning errors, achieving 7-29 pp gains over direct CoT on long-horizon tasks and improving code patch quality to 82-87%.

  27. RAG over Thinking Traces Can Improve Reasoning Tasks

    cs.IR 2026-05 unverdicted novelty 6.0

    RAG over structured thinking traces boosts LLM reasoning on AIME, LiveCodeBench, and GPQA, with relative gains up to 56% and little added cost.

  28. A Multimodal Dataset for Visually Grounded Ambiguity in Machine Translation

    cs.CL 2026-05 unverdicted novelty 6.0

    VIDA provides 2,500 visually-dependent ambiguous MT instances and LLM-judge metrics; chain-of-thought SFT improves disambiguation accuracy over standard SFT, especially out-of-distribution.

  29. HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering

    cs.AI 2026-04 unverdicted novelty 6.0

    HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.

  30. LACE: Lattice Attention for Cross-thread Exploration

    cs.AI 2026-04 unverdicted novelty 6.0

    LACE enables parallel reasoning paths in LLMs to communicate via lattice attention and error-correct using synthetic training data, improving accuracy by over 7 points over standard parallel search.

  31. C-voting: Confidence-Based Test-Time Voting without Explicit Energy Functions

    cs.LG 2026-04 unverdicted novelty 6.0

    C-voting improves recurrent reasoning models by selecting among multiple latent trajectories the one with highest average top-1 probability, achieving 4.9% better Sudoku-hard accuracy than energy-based voting and outp...

  32. Structured Safety Auditing for Balancing Code Correctness and Content Safety in LLM-Generated Code

    cs.SE 2026-04 unverdicted novelty 6.0

    Dual Reasoning with explicit safety audits improves the new SUDS metric by 1.32x to 3.42x over baselines on code generation benchmarks containing injected harmful keywords.

  33. When Verification Fails: How Compositionally Infeasible Claims Escape Rejection

    cs.CL 2026-04 unverdicted novelty 6.0

    AI claim verification models rely on salient-constraint shortcuts instead of full compositional reasoning under the closed-world assumption, as revealed by their over-acceptance of claims with supported salient constr...

  34. FACT-E: Causality-Inspired Evaluation for Trustworthy Chain-of-Thought Reasoning

    cs.AI 2026-04 unverdicted novelty 6.0

    FACT-E uses controlled perturbations as an instrumental signal to measure intra-chain faithfulness in CoT reasoning and combines it with answer consistency to select trustworthy trajectories.

  35. From Perception to Planning: Evolving Ego-Centric Task-Oriented Spatiotemporal Reasoning via Curriculum Learning

    cs.AI 2026-04 unverdicted novelty 6.0

    EgoTSR applies a three-stage curriculum on a 46-million-sample dataset to build egocentric spatiotemporal reasoning, reaching 92.4% accuracy on long-horizon tasks and reducing chronological biases.

  36. Reason Only When Needed: Efficient Generative Reward Modeling via Model-Internal Uncertainty

    cs.CL 2026-04 unverdicted novelty 6.0

    E-GRM triggers CoT reasoning in generative reward models only when parallel generations show high uncertainty, reducing inference cost and raising accuracy on reasoning benchmarks via a hybrid regression-ranking scorer.

  37. Generative AI Agent Empowered Power Allocation for HAP Propulsion and Communication Systems

    cs.NI 2026-04 unverdicted novelty 6.0

    A generative AI agent creates a realistic HAP propulsion power model including aerodynamic interference and enables a Q3E beamforming algorithm that improves QoS and energy efficiency.

  38. ExecTune: Effective Steering of Black-Box LLMs with Guide Models

    cs.LG 2026-04 unverdicted novelty 6.0

    ExecTune trains guide models via acceptance sampling, supervised fine-tuning, and structure-aware RL to boost executability of strategies for black-box LLMs, yielding up to 9.2% higher accuracy and 22.4% lower cost on...

  39. SeLaR: Selective Latent Reasoning in Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    SeLaR selectively applies latent soft reasoning in LLMs via entropy gating and contrastive regularization, outperforming standard CoT on five benchmarks without training.

  40. Measuring Representation Robustness in Large Language Models for Geometry

    cs.CL 2026-04 unverdicted novelty 6.0

    LLMs display accuracy gaps of up to 14 percentage points on the same geometry problems solely due to representation choice, with vector forms consistently weakest and a convert-then-solve prompt helping only high-capa...

  41. Sustainability Analysis of Prompt Strategies for SLM-based Automated Test Generation

    cs.SE 2026-04 unverdicted novelty 6.0

    Prompt strategies for SLM-based automated test generation vary widely in energy consumption and carbon emissions, with simpler strategies delivering competitive coverage at markedly lower environmental cost.

  42. Oblivion: Self-Adaptive Agentic Memory Control through Decay-Driven Activation

    cs.CL 2026-03 unverdicted novelty 6.0

    Oblivion is a decay-driven memory framework that decouples read and write paths in LLM agents to enable adaptive forgetting and reinforcement for better long-horizon reasoning.

  43. LiFT: Does Instruction Fine-Tuning Improve In-Context Learning for Longitudinal Modelling by Large Language Models?

    cs.CL 2026-03 unverdicted novelty 6.0

    LiFT instruction fine-tunes LLMs with a temporal curriculum to improve in-context learning on longitudinal NLP tasks, yielding gains on out-of-distribution data and rare change events across multiple model sizes.

  44. Towards an AI co-scientist

    cs.AI 2025-02 unverdicted novelty 6.0

    A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.

  45. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    cs.SE 2024-03 unverdicted novelty 6.0

    LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.

  46. FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance

    cs.LG 2023-05 accept novelty 6.0

    FrugalGPT learns query-specific cascades across heterogeneous LLM APIs to match or exceed top-model accuracy at far lower cost.

  47. CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society

    cs.AI 2023-03 conditional novelty 6.0

    CAMEL proposes a role-playing framework with inception prompting that enables autonomous multi-agent cooperation among LLMs and generates conversational data for studying their behaviors.

  48. Multimodal Chain-of-Thought Reasoning in Language Models

    cs.CL 2023-02 accept novelty 6.0

    Multimodal-CoT achieves state-of-the-art on ScienceQA by using a two-stage process that incorporates vision into chain-of-thought rationale generation for models under 1 billion parameters.

  49. Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

    cs.CL 2022-10 accept novelty 6.0

    Chain-of-thought prompting enables large language models to surpass average human performance on 17 of 23 challenging BIG-Bench tasks.

  50. Emergent Abilities of Large Language Models

    cs.CL 2022-06 unverdicted novelty 6.0

    Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model performance.

  51. Agentic AIs Are the Missing Paradigm for Out-of-Distribution Generalization in Foundation Models

    cs.LG 2026-05 unverdicted novelty 5.0

    Agentic AI systems are required to overcome the parameter coverage ceiling that prevents foundation models from handling certain out-of-distribution cases.

  52. From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work

    cs.AI 2026-05 conditional novelty 5.0

    Execution lineage models AI-native work as a DAG of computations with explicit dependencies, achieving perfect state preservation in controlled update tasks where loop-based agents introduce churn and contamination.

  53. NoisyCoconut: Counterfactual Consensus via Latent Space Reasoning

    cs.LG 2026-05 unverdicted novelty 5.0

    Injecting noise into LLM latent trajectories creates diverse reasoning paths whose agreement acts as a confidence signal for selective abstention, cutting error rates from 40-70% to under 15% on math tasks.

  54. Explanation Quality Assessment as Ranking with Listwise Rewards

    cs.AI 2026-04 unverdicted novelty 5.0

    Explanation quality assessment is recast as ranking with listwise and pairwise losses that outperform regression, allow small models to match large ones on curated data, and enable stable convergence in reinforcement ...

  55. Bilevel Optimization of Agent Skills via Monte Carlo Tree Search

    cs.AI 2026-04 unverdicted novelty 5.0

    Bilevel optimization with outer-loop MCTS for skill structure and inner-loop LLM refinement improves agent accuracy on an operations-research question-answering dataset.

  56. LACE: Lattice Attention for Cross-thread Exploration

    cs.AI 2026-04 unverdicted novelty 5.0

    LACE adds lattice attention to let parallel LLM reasoning threads interact and correct errors, raising accuracy over 7 points versus standard independent sampling.

  57. LACE: Lattice Attention for Cross-thread Exploration

    cs.AI 2026-04 unverdicted novelty 5.0

    LACE enables concurrent reasoning paths in LLMs to interact via lattice attention and a synthetic training pipeline, raising accuracy more than 7 points over independent parallel search.

  58. H-Probes: Extracting Hierarchical Structures From Latent Representations of Language Models

    cs.CL 2026-04 unverdicted novelty 5.0

    H-probes locate low-dimensional subspaces encoding hierarchy in LLM activations for synthetic tree tasks, show causal importance and generalization, and detect weaker signals in mathematical reasoning traces.

  59. StarCoder: may the source be with you!

    cs.CL 2023-05 accept novelty 5.0

    StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.

  60. Syntax Is Easy, Semantics Is Hard: Evaluating LLMs for LTL Translation

    cs.LO 2026-04 unverdicted novelty 4.0

    LLMs handle LTL syntax better than semantics, improve with detailed prompts, and perform substantially better when the task is reframed as Python code completion.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · cited by 58 Pith papers
