arxiv: 2211.10435 · v2 · submitted 2022-11-18 · 💻 cs.CL · cs.AI

PAL: Program-aided Language Models

Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, Graham Neubig

Pith reviewed 2026-05-15 04:58 UTC · model grok-4.3

keywords program-aided language models, few-shot prompting, chain-of-thought, math word problems, GSM8K, Python interpreter, reasoning benchmarks, neuro-symbolic methods

The pith

LLMs generate programs as reasoning steps and let a Python interpreter execute them to solve math and symbolic problems more accurately than much larger models using chain-of-thought.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models often decompose reasoning problems correctly but then make arithmetic and logic errors when solving the steps themselves. PAL changes this by having the LLM produce only an executable program that captures the intended reasoning, after which a Python interpreter runs the program to obtain the answer. This division lets the model focus solely on turning natural language into runnable steps while the interpreter handles precise computation. Across thirteen benchmarks the method yields higher few-shot accuracy than chain-of-thought prompting even when the comparison model is much larger, including a 15-point gain on the GSM8K math-word-problem set.

Core claim

PAL uses the LLM to read natural language problems and generate programs as the intermediate reasoning steps, but offloads the solution step to a runtime such as a Python interpreter. With PAL, decomposing the natural language problem into runnable steps remains the only learning task for the LLM, while solving is delegated to the interpreter.

What carries the argument

LLM-generated program that encodes the full reasoning trace and is executed by a Python interpreter to produce the final answer.
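
A minimal sketch of this pattern, under stated assumptions: a hand-written string stands in for the LLM's output (no model is called), and the program is assumed to define a `solution()` function. The word problem, the variable names, and the `run_generated_program` helper are illustrative and not taken from the paper.

```python
# Stand-in for an LLM completion: the reasoning steps written as Python.
# Hypothetical problem: "A baker has 23 rolls, sells 8, then bakes 3 trays
# of 12 rolls each. How many rolls does the baker have now?"
GENERATED_PROGRAM = '''
def solution():
    rolls_initial = 23
    rolls_sold = 8
    trays_baked = 3
    rolls_per_tray = 12
    rolls_left = rolls_initial - rolls_sold
    rolls_baked = trays_baked * rolls_per_tray
    return rolls_left + rolls_baked
'''


def run_generated_program(program_text):
    """Execute the program text and return the value of its solution() function."""
    namespace = {}
    # The interpreter, not the model, performs the arithmetic.
    exec(program_text, namespace)
    return namespace["solution"]()


answer = run_generated_program(GENERATED_PROGRAM)
print(answer)  # prints 51
```

Calling `exec` on model-written text is only reasonable inside a sandbox; the sketch omits that isolation for brevity.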

If this is right

On GSM8K, Codex-powered PAL reaches state-of-the-art few-shot accuracy and exceeds PaLM-540B chain-of-thought by 15 absolute points.
The same program-plus-interpreter pattern improves accuracy across the paper's thirteen mathematical, symbolic, and algorithmic tasks, drawn from BIG-Bench Hard and other benchmarks.
The LLM no longer needs to perform arithmetic or symbolic execution inside its own generations, reducing a major source of error.
Smaller models paired with an interpreter can outperform much larger models that attempt both decomposition and solution internally.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The approach may extend naturally to any domain where a reliable interpreter exists for the operations the model must perform.
Combining program generation with other prompting techniques could further reduce remaining decomposition errors.
The separation of concerns suggests that future models could be trained primarily to emit correct programs rather than to simulate execution.

Load-bearing premise

The language model will produce programs whose logic exactly matches the intended reasoning and that run without introducing coding or planning mistakes of its own.

What would settle it

A held-out set of word problems on which the generated programs execute cleanly yet return systematically wrong answers because the program logic diverges from the correct decomposition.
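
The failure mode described here can be separated from ordinary crashes with a three-way tally. The sketch below assumes each generated program defines `solution()` and that a gold answer is available; the ticket problem, the three program strings, and the `classify` helper are hypothetical examples written for illustration.

```python
def classify(program_text, gold_answer):
    """Return 'execution_error', 'wrong_answer', or 'correct' for one program."""
    namespace = {}
    try:
        exec(program_text, namespace)
        predicted = namespace["solution"]()
    except Exception:
        return "execution_error"
    return "correct" if predicted == gold_answer else "wrong_answer"


# Hypothetical problem: "Tickets cost $12 each; a group of 5 gets $10 off
# the total. What does the group pay?"
GOLD = 50

CASES = {
    # Matches the intended decomposition.
    "faithful": "def solution():\n    return 12 * 5 - 10\n",
    # Runs cleanly, but applies the discount per ticket instead of once.
    "misread": "def solution():\n    return (12 - 10) * 5\n",
    # Fails at run time because `price` is never defined.
    "broken": "def solution():\n    return price * 5 - 10\n",
}

outcomes = {name: classify(text, GOLD) for name, text in CASES.items()}
print(outcomes)
```

Only the `wrong_answer` bucket bears on the premise above: those programs execute without error, so the interpreter cannot flag them.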

Original abstract

Large language models (LLMs) have recently demonstrated an impressive ability to perform arithmetic and symbolic reasoning tasks, when provided with a few examples at test time ("few-shot prompting"). Much of this success can be attributed to prompting methods such as "chain-of-thought", which employ LLMs for both understanding the problem description by decomposing it into steps, as well as solving each step of the problem. While LLMs seem to be adept at this sort of step-by-step decomposition, LLMs often make logical and arithmetic mistakes in the solution part, even when the problem is decomposed correctly. In this paper, we present Program-Aided Language models (PAL): a novel approach that uses the LLM to read natural language problems and generate programs as the intermediate reasoning steps, but offloads the solution step to a runtime such as a Python interpreter. With PAL, decomposing the natural language problem into runnable steps remains the only learning task for the LLM, while solving is delegated to the interpreter. We demonstrate this synergy between a neural LLM and a symbolic interpreter across 13 mathematical, symbolic, and algorithmic reasoning tasks from BIG-Bench Hard and other benchmarks. In all these natural language reasoning tasks, generating code using an LLM and reasoning using a Python interpreter leads to more accurate results than much larger models. For example, PAL using Codex achieves state-of-the-art few-shot accuracy on the GSM8K benchmark of math word problems, surpassing PaLM-540B which uses chain-of-thought by absolute 15% top-1. Our code and data are publicly available at http://reasonwithpal.com/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PAL gets better reasoning accuracy by having the LLM generate Python code and offloading execution to an interpreter, with a reported 15-point gain on GSM8K over PaLM-540B CoT.

The main thing here is that PAL improves results on math and symbolic reasoning by restricting the LLM to writing executable programs, then handing the actual solving to a Python interpreter. This keeps the model from making arithmetic mistakes when the decomposition is right, and the abstract reports consistent gains over chain-of-thought across 13 tasks, including a 15-point absolute lift on GSM8K with Codex over the much larger PaLM-540B baseline.

Referee Report

2 major / 2 minor

Summary. The paper introduces Program-Aided Language Models (PAL), where an LLM generates Python programs as intermediate reasoning steps for natural language problems and delegates execution to a runtime interpreter. It evaluates the approach across 13 mathematical, symbolic, and algorithmic reasoning tasks drawn from BIG-Bench Hard and other benchmarks, claiming consistent accuracy gains over strong baselines and a 15-point absolute improvement on GSM8K few-shot accuracy relative to PaLM-540B using chain-of-thought.

Significance. If the results hold, the work is significant because it demonstrates a practical hybrid neural-symbolic method that improves reasoning accuracy without requiring larger model scale. The public release of code and data is a clear strength that supports reproducibility and further research on this paradigm.

major comments (2)

[Experimental Results] Experimental section: the reported 15% absolute GSM8K gain over PaLM-540B CoT is presented without error bars, multiple random seeds, or statistical significance tests, making it impossible to assess whether the improvement is reliable or could be explained by prompt variance.
[Method] Method and results: no ablation is provided that isolates the contribution of the interpreter execution from the LLM's program-generation quality (e.g., by comparing PAL against an LLM that generates the same programs but solves them internally), which is load-bearing for the central claim that off-loading computation improves accuracy.
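
One way the uncertainty raised in the first comment could be quantified is sketched below, under stated assumptions: both systems are scored on test sets of equal size, a normal approximation is acceptable, and the two samples are treated as independent (a simplification when both systems see the same problems). The counts are placeholders chosen to give a 15-point gap; they are not results from the paper.

```python
import math


def accuracy_interval(correct, total, z=1.96):
    """Normal-approximation 95% confidence interval for an accuracy."""
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p - half_width, p + half_width


def two_proportion_z(correct_a, correct_b, total):
    """z statistic for the gap between two accuracies on equal-size test sets."""
    p_a, p_b = correct_a / total, correct_b / total
    pooled = (correct_a + correct_b) / (2 * total)
    standard_error = math.sqrt(2 * pooled * (1 - pooled) / total)
    return (p_a - p_b) / standard_error


# Placeholder counts on a hypothetical 1,000-problem test set.
TOTAL = 1000
low, high = accuracy_interval(700, TOTAL)
z_stat = two_proportion_z(700, 550, TOTAL)
print(f"system A accuracy, 95% CI: [{low:.3f}, {high:.3f}]")
print(f"z statistic for the gap: {z_stat:.2f}")
```

With per-problem outputs for both systems, a paired test would be the more appropriate choice; this sketch only needs aggregate counts.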

minor comments (2)

[Abstract] The abstract states results on '13 tasks' but does not enumerate them; adding a short list or reference to the table that defines the suite would improve readability.
[Figure 1] Figure 1 (or equivalent diagram) would benefit from clearer labeling of the exact interface between the LLM output and the Python interpreter call.
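
One possible shape for the interface mentioned in the Figure 1 comment is sketched below. It assumes the completion is plain Python source that prints its answer and that it runs in a fresh interpreter process with a timeout. The `execute_completion` helper and its result fields are hypothetical; the paper's actual interface may differ.

```python
import subprocess
import sys


def execute_completion(completion, timeout_seconds=5.0):
    """Run model-written source in a separate Python process and capture its output."""
    try:
        finished = subprocess.run(
            [sys.executable, "-c", completion],
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "answer": None, "error": "timeout"}
    if finished.returncode != 0:
        return {"ok": False, "answer": None, "error": finished.stderr.strip()}
    return {"ok": True, "answer": finished.stdout.strip(), "error": None}


# Hand-written stand-in for an LLM completion.
completion = "total_minutes = 3 * 45\nprint(total_minutes / 60)\n"
result = execute_completion(completion)
print(result)
```

A separate process keeps a crashing or looping program from taking down the caller, at the cost of passing the answer back as text.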

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for their positive summary and recommendation for major revision. The comments highlight important aspects for improving the clarity and rigor of our experimental results and method. We address each point below and have incorporated revisions accordingly.

Referee: [Experimental Results] Experimental section: the reported 15% absolute GSM8K gain over PaLM-540B CoT is presented without error bars, multiple random seeds, or statistical significance tests, making it impossible to assess whether the improvement is reliable or could be explained by prompt variance.

Authors: We agree with the referee that the lack of error bars and statistical tests makes it difficult to fully assess the reliability of the reported improvement. In the revised manuscript, we now include results from multiple runs with different random seeds for the few-shot prompt ordering on GSM8K. We report the mean and standard deviation, and perform a statistical test to confirm the significance of the 15-point gain over PaLM-540B CoT. This revision directly addresses the concern regarding prompt variance.

revision: yes

Referee: [Method] Method and results: no ablation is provided that isolates the contribution of the interpreter execution from the LLM's program-generation quality (e.g., by comparing PAL against an LLM that generates the same programs but solves them internally), which is load-bearing for the central claim that off-loading computation improves accuracy.

Authors: We appreciate the suggestion to include an ablation isolating the interpreter's contribution. While our original comparisons to chain-of-thought already demonstrate the advantage of using programs over text-based reasoning, we have added the requested ablation in the revision. We compare against a setting where the LLM generates the program and then attempts to solve it by simulating the execution in its own generations. The results show that this internal solving leads to lower accuracy due to arithmetic errors, whereas the interpreter ensures correctness, thereby validating the benefit of off-loading to the runtime.

revision: yes
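
The ablation described in this exchange could be scored with a harness along the following lines. This is a sketch under assumptions, not the authors' setup: the same programs are answered twice, once by the interpreter and once by answers the model wrote when asked to trace the code itself. The programs, the recorded model answers, and both helper names are placeholders for illustration; no model is called and no measured result is implied.

```python
def interpreter_answer(program_text):
    """Answer obtained by actually executing the program."""
    namespace = {}
    exec(program_text, namespace)
    return namespace["solution"]()


def ablation_accuracy(items, model_simulated_answers):
    """Accuracy of interpreter execution versus model-simulated execution.

    items: list of (program_text, gold_answer) pairs.
    model_simulated_answers: one answer per item, as written by the model
    when asked to trace the same program without running it.
    """
    total = len(items)
    interpreter_hits = sum(interpreter_answer(p) == gold for p, gold in items)
    simulated_hits = sum(
        guess == gold for guess, (_, gold) in zip(model_simulated_answers, items)
    )
    return interpreter_hits / total, simulated_hits / total


# Placeholder items; the gold answers are the true values of each expression.
ITEMS = [
    ("def solution():\n    return 1234 * 5678\n", 7006652),
    ("def solution():\n    return sum(range(1, 101))\n", 5050),
]
# Placeholder "model-simulated" answers with one arithmetic slip, for illustration only.
SIMULATED = [7006552, 5050]

interp_acc, sim_acc = ablation_accuracy(ITEMS, SIMULATED)
print(interp_acc, sim_acc)
```

Because both conditions share the same programs, any accuracy gap is attributable to how the program is evaluated rather than to how it was written.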

Circularity Check

0 steps flagged

No significant circularity; empirical results stand on direct benchmark comparisons

The paper introduces PAL as a prompting technique where an LLM generates executable programs for reasoning tasks and delegates execution to an interpreter. No equations, fitted parameters, or self-referential definitions appear in the provided text. Central claims rest on reported few-shot accuracies across 13 tasks (e.g., GSM8K surpassing PaLM-540B CoT by 15%), which are externally falsifiable via public benchmarks and code. No load-bearing self-citations, uniqueness theorems, or ansatzes reduce the method to its inputs by construction. The derivation chain is self-contained through experimental validation rather than mathematical reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical performance of existing LLMs at code generation for these tasks; no new free parameters, axioms beyond standard LLM capabilities, or invented entities are introduced.

axioms (1)

domain assumption: Large language models can generate correct and executable programs for the described reasoning tasks when given few-shot examples.
This is the core premise that allows the method to work; it is stated implicitly throughout the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.DAlembert.Inevitability · bilinear_family_forced · unclear

PAL using Codex achieves state-of-the-art few-shot accuracy on the GSM8K benchmark of math word problems, surpassing PaLM-540B which uses chain-of-thought by absolute 15% top-1.

IndisputableMonolith.Cost.FunctionalEquation · washburn_uniqueness_aczel · unclear

uses the LLM to read natural language problems and generate programs as the intermediate reasoning steps, but offloads the solution step to a runtime such as a Python interpreter

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Teaching Language Models to Think in Code
cs.CL 2026-05 unverdicted novelty 7.0

ThinC trains small models to reason primarily in code rather than natural language, outperforming tool-integrated baselines and even larger models on competition math benchmarks.
Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards
cs.AI 2026-05 unverdicted novelty 7.0

TraceLift trains reasoning planners with executor-grounded rewards that multiply a rubric-based reasoning quality score by measured uplift on a frozen executor, outperforming execution-only training on math and code b...
The Cost of Consensus: Isolated Self-Correction Prevails Over Unguided Homogeneous Multi-Agent Debate
cs.MA 2026-04 unverdicted novelty 7.0

Homogeneous multi-agent debate introduces sycophantic conformity, contextual fragility, and consensus collapse, leading to equal or lower accuracy than isolated self-correction at 2.1-3.4x higher token cost on GSM-Har...
Assessing Large Language Models for Stabilizing Numerical Expressions in Scientific Software
cs.SE 2026-04 conditional novelty 7.0

LLMs match or exceed state-of-the-art traditional methods for stabilizing numerical expressions in scientific software, succeeding on 97.9% of expressions where baselines fail to improve accuracy, but struggle with co...
LLM+P: Empowering Large Language Models with Optimal Planning Proficiency
cs.AI 2023-04 accept novelty 7.0

LLM+P lets LLMs solve planning problems optimally by converting them to PDDL for classical planners and back to natural language.
Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks
cs.CL 2022-11 unverdicted novelty 7.0

PoT prompting improves numerical reasoning by having language models write programs executed by a computer instead of performing calculations in natural language chains of thought, with an average 12% gain over CoT.
LLMs Know When They Know, but Do Not Act on It: A Metacognitive Harness for Test-time Scaling
cs.LG 2026-05 conditional novelty 6.0

A metacognitive harness uses LLMs' pre- and post-solution self-monitoring signals to control test-time reasoning, raising pooled accuracy from 48.3% to 56.9% on text, code, and multimodal benchmarks.
Continuous Discovery of Vulnerabilities in LLM Serving Systems with Fuzzing
cs.CR 2026-05 unverdicted novelty 6.0

GRIEF fuzzer finds 15 vulnerabilities including 2 CVEs in vLLM and SGLang by testing concurrent workloads for KV-cache isolation failures and cross-request interference.
Teaching Language Models to Think in Code
cs.CL 2026-05 unverdicted novelty 6.0

ThinC trains smaller language models to reason entirely in code after minimal NL planning, outperforming tool-integrated baselines and even much larger models on competition math benchmarks.
Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards
cs.AI 2026-05 unverdicted novelty 6.0

TraceLift trains reasoning planners using rewards that credit traces for both rubric quality and actual performance gains on a frozen executor, outperforming final-answer-only training on math and code tasks.
Towards Verifiable and Self-Correcting AI Physicists for Quantum Many-Body Simulations
physics.comp-ph 2026-03 unverdicted novelty 6.0

QMP-Bench supplies a realistic test set for AI on quantum many-body problems while PhysVEC uses integrated verifiers to turn unreliable LLM generations into code that passes both syntax and physics checks, outperformi...
ReTool: Reinforcement Learning for Strategic Tool Use in LLMs
cs.CL 2025-04 unverdicted novelty 6.0

ReTool uses outcome-driven RL to train 32B LLMs to dynamically use code tools during reasoning, reaching 72.5% accuracy on AIME and surpassing o1-preview.
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
cs.LG 2024-08 unverdicted novelty 6.0

An adaptive compute-optimal strategy for scaling LLM test-time compute achieves over 4x efficiency gains versus best-of-N and lets smaller models outperform 14x larger ones on some problems.
Gorilla: Large Language Model Connected with Massive APIs
cs.CL 2023-05 conditional novelty 6.0

Gorilla is a fine-tuned LLM that surpasses GPT-4 in accurate API call generation and uses retrieval to handle documentation updates.
Teaching Large Language Models to Self-Debug
cs.CL 2023-04 unverdicted novelty 6.0

Self-Debugging teaches LLMs to identify and fix their own code errors through rubber-duck-style natural language explanations and execution feedback, delivering 2-12% gains over baselines on Spider, TransCoder, and MBPP.
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face
cs.CL 2023-03 unverdicted novelty 6.0

HuggingGPT is an agent system where ChatGPT plans and orchestrates calls to Hugging Face models to solve complex multi-modal AI tasks.
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
cs.CV 2023-03 unverdicted novelty 6.0

MM-REACT uses textual prompts to let ChatGPT collaborate with external vision experts for zero-shot multimodal reasoning and action on advanced visual tasks.
Reinforcement Learning for Tool-Calling Agents in Fast Healthcare Interoperability Resources (FHIR)
cs.LG 2026-05 unverdicted novelty 5.0

RL post-training lifts answer correctness on FHIR-AgentBench from 50% (o4-mini) to 77% with a cheaper Qwen3-8B CodeAct agent.
LLMs with in-context learning for Algorithmic Theoretical Physics
cs.LG 2026-05 unverdicted novelty 5.0

Frontier LLMs with in-context learning and CAS integration solve most algorithmic tasks in theoretical physics when supplied with worked examples.
The Cartesian Cut in Agentic AI
cs.AI 2026-04 unverdicted novelty 5.0

LLM agents use a Cartesian split between learned prediction and engineered control, enabling modularity but creating sensitivity and bottlenecks unlike integrated biological systems.
StarCoder: may the source be with you!
cs.CL 2023-05 accept novelty 5.0

StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.
Self-Refine: Iterative Refinement with Self-Feedback
cs.CL 2023-03 unverdicted novelty 5.0

Self-Refine boosts LLM outputs by ~20% on average across seven tasks by having the same model iteratively generate, critique, and refine its own responses.
SciFi: A Safe, Lightweight, User-Friendly, and Fully Autonomous Agentic AI Workflow for Scientific Applications
cs.AI 2026-04 unverdicted novelty 4.0

SciFi is a safe, lightweight agentic AI framework that automates structured scientific tasks with minimal human intervention via isolated environments and layered self-assessing agents.
