pith. machine review for the scientific record.

arxiv: 2211.10435 · v2 · submitted 2022-11-18 · 💻 cs.CL · cs.AI

PAL: Program-aided Language Models

Pith reviewed 2026-05-15 04:58 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords program-aided language models · few-shot prompting · chain-of-thought · math word problems · GSM8K · Python interpreter · reasoning benchmarks · neuro-symbolic methods

The pith

LLMs generate programs as reasoning steps and let a Python interpreter execute them to solve math and symbolic problems more accurately than much larger models using chain-of-thought.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models often decompose reasoning problems correctly but then make arithmetic and logic errors when solving the steps themselves. PAL changes this by having the LLM produce only an executable program that captures the intended reasoning, after which a Python interpreter runs the program to obtain the answer. This division lets the model focus solely on turning natural language into runnable steps while the interpreter handles precise computation. Across thirteen benchmarks the method yields higher few-shot accuracy than chain-of-thought prompting even when the comparison model is much larger, including a 15-point gain on the GSM8K math-word-problem set.

Core claim

PAL uses the LLM to read natural language problems and generate programs as the intermediate reasoning steps, but offloads the solution step to a runtime such as a Python interpreter. With PAL, decomposing the natural language problem into runnable steps remains the only learning task for the LLM, while solving is delegated to the interpreter.
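
The division of labor described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the word problem, the `GENERATED_PROGRAM` string (a hand-written stand-in for what an LLM might emit), the helper name `run_generated_program`, and the convention that the result lives in a variable called `answer` are all assumptions made for this sketch.

```python
# Hypothetical word problem, written for this sketch:
#   "A baker fills 3 trays with 12 rolls each and sells 20 rolls.
#    How many rolls are left?"

# Hand-written stand-in for LLM output. In PAL the model would emit
# text like this; here it is hard-coded so the sketch runs offline.
GENERATED_PROGRAM = """
trays = 3
rolls_per_tray = 12
rolls_baked = trays * rolls_per_tray
rolls_sold = 20
answer = rolls_baked - rolls_sold
"""


def run_generated_program(source: str, result_name: str = "answer"):
    """Execute program text and return the variable named result_name.

    Assumption of this sketch: the program stores its final result in
    a variable called `answer`.
    """
    namespace = {}
    exec(source, namespace)  # the interpreter, not the LLM, does the arithmetic
    return namespace[result_name]


result = run_generated_program(GENERATED_PROGRAM)
print(result)
```

Calling `exec` on untrusted model output is unsafe; a real system would presumably run generated programs in a sandbox or a separate process with a timeout.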

What carries the argument

LLM-generated program that encodes the full reasoning trace and is executed by a Python interpreter to produce the final answer.

If this is right

  • On GSM8K, Codex-powered PAL reaches state-of-the-art few-shot accuracy and exceeds PaLM-540B chain-of-thought by 15 absolute points.
  • The same program-plus-interpreter pattern improves accuracy across the thirteen mathematical, symbolic, and algorithmic tasks evaluated, drawn from BIG-Bench Hard and other benchmarks.
  • The LLM no longer needs to perform arithmetic or symbolic execution inside its own generations, reducing a major source of error.
  • Smaller models paired with an interpreter can outperform much larger models that attempt both decomposition and solution internally.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach may extend naturally to any domain where a reliable interpreter exists for the operations the model must perform.
  • Combining program generation with other prompting techniques could further reduce remaining decomposition errors.
  • The separation of concerns suggests that future models could be trained primarily to emit correct programs rather than to simulate execution.

Load-bearing premise

The language model will produce programs whose logic exactly matches the intended reasoning and that run without introducing coding or planning mistakes of its own.

What would settle it

A held-out set of word problems on which the generated programs execute cleanly yet return systematically wrong answers because the program logic diverges from the correct decomposition.
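
A toy case shows the failure mode this test targets. The problem, both program strings, and the helper `execute` are hypothetical and written for this sketch; neither program comes from the paper. Both run without error, but only one encodes the correct decomposition, and the interpreter has no way to tell them apart.

```python
# Hypothetical problem, written for this sketch:
#   "A shirt costs 40 dollars. It is discounted by 25%, and then a
#    5 dollar coupon is applied. What is the final price?"

CORRECT_PROGRAM = """
price = 40
discounted = price * 0.75   # apply the 25% discount first
answer = discounted - 5     # then subtract the coupon
"""

# Same operations in the wrong order: a decomposition error,
# not an arithmetic or syntax error.
DIVERGENT_PROGRAM = """
price = 40
after_coupon = price - 5
answer = after_coupon * 0.75
"""


def execute(source: str):
    """Run program text and return its `answer` variable (sketch convention)."""
    namespace = {}
    exec(source, namespace)
    return namespace["answer"]


REFERENCE = 25.0  # worked by hand: 40 * 0.75 - 5

for name, source in [("correct", CORRECT_PROGRAM), ("divergent", DIVERGENT_PROGRAM)]:
    value = execute(source)
    print(name, value, "matches reference:", value == REFERENCE)
```

Clean execution is therefore weak evidence of correctness; only a comparison against reference answers exposes the divergent program.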

Original abstract

Large language models (LLMs) have recently demonstrated an impressive ability to perform arithmetic and symbolic reasoning tasks, when provided with a few examples at test time ("few-shot prompting"). Much of this success can be attributed to prompting methods such as "chain-of-thought", which employ LLMs for both understanding the problem description by decomposing it into steps, as well as solving each step of the problem. While LLMs seem to be adept at this sort of step-by-step decomposition, LLMs often make logical and arithmetic mistakes in the solution part, even when the problem is decomposed correctly. In this paper, we present Program-Aided Language models (PAL): a novel approach that uses the LLM to read natural language problems and generate programs as the intermediate reasoning steps, but offloads the solution step to a runtime such as a Python interpreter. With PAL, decomposing the natural language problem into runnable steps remains the only learning task for the LLM, while solving is delegated to the interpreter. We demonstrate this synergy between a neural LLM and a symbolic interpreter across 13 mathematical, symbolic, and algorithmic reasoning tasks from BIG-Bench Hard and other benchmarks. In all these natural language reasoning tasks, generating code using an LLM and reasoning using a Python interpreter leads to more accurate results than much larger models. For example, PAL using Codex achieves state-of-the-art few-shot accuracy on the GSM8K benchmark of math word problems, surpassing PaLM-540B which uses chain-of-thought by absolute 15% top-1. Our code and data are publicly available at http://reasonwithpal.com/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Program-Aided Language Models (PAL), where an LLM generates Python programs as intermediate reasoning steps for natural language problems and delegates execution to a runtime interpreter. It evaluates the approach across 13 mathematical, symbolic, and algorithmic reasoning tasks drawn from BIG-Bench Hard and other benchmarks, claiming consistent accuracy gains over strong baselines and a 15-point absolute improvement on GSM8K few-shot accuracy relative to PaLM-540B using chain-of-thought.

Significance. If the results hold, the work is significant because it demonstrates a practical hybrid neural-symbolic method that improves reasoning accuracy without requiring larger model scale. The public release of code and data is a clear strength that supports reproducibility and further research on this paradigm.

major comments (2)
  1. [Experimental Results] Experimental section: the reported 15% absolute GSM8K gain over PaLM-540B CoT is presented without error bars, multiple random seeds, or statistical significance tests, making it impossible to assess whether the improvement is reliable or could be explained by prompt variance.
  2. [Method] Method and results: no ablation is provided that isolates the contribution of the interpreter execution from the LLM's program-generation quality (e.g., by comparing PAL against an LLM that generates the same programs but solves them internally), which is load-bearing for the central claim that off-loading computation improves accuracy.
minor comments (2)
  1. [Abstract] The abstract states results on '13 tasks' but does not enumerate them; adding a short list or reference to the table that defines the suite would improve readability.
  2. [Figure 1] Figure 1 (or equivalent diagram) would benefit from clearer labeling of the exact interface between the LLM output and the Python interpreter call.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for their positive summary and recommendation for major revision. The comments highlight important aspects for improving the clarity and rigor of our experimental results and method. We address each point below and have incorporated revisions accordingly.

Point-by-point responses
  1. Referee: [Experimental Results] Experimental section: the reported 15% absolute GSM8K gain over PaLM-540B CoT is presented without error bars, multiple random seeds, or statistical significance tests, making it impossible to assess whether the improvement is reliable or could be explained by prompt variance.

    Authors: We agree with the referee that the lack of error bars and statistical tests makes it difficult to fully assess the reliability of the reported improvement. In the revised manuscript, we now include results from multiple runs with different random seeds for the few-shot prompt ordering on GSM8K. We report the mean and standard deviation, and perform a statistical test to confirm the significance of the 15-point gain over PaLM-540B CoT. This revision directly addresses the concern regarding prompt variance.

    revision: yes

  2. Referee: [Method] Method and results: no ablation is provided that isolates the contribution of the interpreter execution from the LLM's program-generation quality (e.g., by comparing PAL against an LLM that generates the same programs but solves them internally), which is load-bearing for the central claim that off-loading computation improves accuracy.

    Authors: We appreciate the suggestion to include an ablation isolating the interpreter's contribution. While our original comparisons to chain-of-thought already demonstrate the advantage of using programs over text-based reasoning, we have added the requested ablation in the revision. We compare against a setting where the LLM generates the program and then attempts to solve it by simulating the execution in its own generations. The results show that this internal solving leads to lower accuracy due to arithmetic errors, whereas the interpreter ensures correctness, thereby validating the benefit of off-loading to the runtime.

    revision: yes
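
The first response mentions a mean, a standard deviation, and a statistical test without naming the test. One way such a check could look is sketched below, using only the Python standard library. The paired bootstrap is a technique swapped in for illustration, not the test named in the rebuttal; the function names are hypothetical, and the sketch assumes per-problem correctness flags are available for both systems on the same test set, which may not hold for a published baseline.

```python
import random
import statistics


def summarize(accuracies):
    """Return (mean, sample standard deviation) of per-seed accuracies."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)


def paired_bootstrap(correct_a, correct_b, n_resamples=1000, seed=0):
    """Fraction of bootstrap resamples in which system A beats system B.

    correct_a and correct_b are equal-length lists of 0/1 flags, one per
    test problem, for two systems scored on the same problems.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both systems must be scored on the same problems")
    rng = random.Random(seed)
    n = len(correct_a)
    wins = 0
    for _ in range(n_resamples):
        # Resample problem indices with replacement, keeping pairs aligned.
        indices = [rng.randrange(n) for _ in range(n)]
        difference = sum(correct_a[i] - correct_b[i] for i in indices)
        if difference > 0:
            wins += 1
    return wins / n_resamples


# Placeholder flags for illustration only; these are not results.
TOY_A = [1, 1, 1, 0, 1, 1]
TOY_B = [0, 1, 0, 0, 1, 0]
print(paired_bootstrap(TOY_A, TOY_B))
```

A fraction close to 1.0 would suggest that the advantage of system A is stable under resampling of the test set; the per-seed summary from `summarize` addresses prompt-ordering variance separately.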

Circularity Check

0 steps flagged

No significant circularity; empirical results stand on direct benchmark comparisons

Full rationale

The paper introduces PAL as a prompting technique where an LLM generates executable programs for reasoning tasks and delegates execution to an interpreter. No equations, fitted parameters, or self-referential definitions appear in the provided text. Central claims rest on reported few-shot accuracies across 13 tasks (e.g., GSM8K surpassing PaLM-540B CoT by 15%), which are externally falsifiable via public benchmarks and code. No load-bearing self-citations, uniqueness theorems, or ansatzes reduce the method to its inputs by construction. The derivation chain is self-contained through experimental validation rather than mathematical reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical performance of existing LLMs at code generation for these tasks; no new free parameters, axioms beyond standard LLM capabilities, or invented entities are introduced.

axioms (1)
  • domain assumption: Large language models can generate correct and executable programs for the described reasoning tasks when given few-shot examples
    This is the core premise that allows the method to work; it is stated implicitly throughout the abstract.
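
The "executable" half of this assumption can be measured directly, independent of answer correctness. The sketch below sorts program strings into outcome categories; the category names, the helper names `classify_program` and `executable_rate`, and the convention that a program stores its result in `answer` are assumptions of this sketch rather than anything specified by the paper.

```python
import ast


def classify_program(source: str, result_name: str = "answer") -> str:
    """Label a program as syntax_error, runtime_error, no_answer, or ok."""
    try:
        ast.parse(source)  # catches malformed code before running it
    except SyntaxError:
        return "syntax_error"
    namespace = {}
    try:
        exec(source, namespace)
    except Exception:
        return "runtime_error"
    if result_name not in namespace:
        return "no_answer"  # ran cleanly but never produced a result
    return "ok"


def executable_rate(programs) -> float:
    """Fraction of programs that run and define the result variable."""
    labels = [classify_program(p) for p in programs]
    return labels.count("ok") / len(labels)


# Hand-written examples, one per category.
SAMPLES = [
    "answer = 6 * 7",
    "answer = (6 * 7",
    "answer = 1 / 0",
    "total = 6 * 7",
]
for sample in SAMPLES:
    print(repr(sample), "->", classify_program(sample))
```

An "ok" label only says the program ran and produced a value; whether that value is right still has to be checked against a reference answer.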

pith-pipeline@v0.9.0 · 5604 in / 1119 out tokens · 35927 ms · 2026-05-15T04:58:42.378977+00:00 · methodology

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Teaching Language Models to Think in Code

    cs.CL 2026-05 unverdicted novelty 7.0

    ThinC trains small models to reason primarily in code rather than natural language, outperforming tool-integrated baselines and even larger models on competition math benchmarks.

  2. Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards

    cs.AI 2026-05 unverdicted novelty 7.0

    TraceLift trains reasoning planners with executor-grounded rewards that multiply a rubric-based reasoning quality score by measured uplift on a frozen executor, outperforming execution-only training on math and code b...

  3. The Cost of Consensus: Isolated Self-Correction Prevails Over Unguided Homogeneous Multi-Agent Debate

    cs.MA 2026-04 unverdicted novelty 7.0

    Homogeneous multi-agent debate introduces sycophantic conformity, contextual fragility, and consensus collapse, leading to equal or lower accuracy than isolated self-correction at 2.1-3.4x higher token cost on GSM-Har...

  4. Assessing Large Language Models for Stabilizing Numerical Expressions in Scientific Software

    cs.SE 2026-04 conditional novelty 7.0

    LLMs match or exceed state-of-the-art traditional methods for stabilizing numerical expressions in scientific software, succeeding on 97.9% of expressions where baselines fail to improve accuracy, but struggle with co...

  5. LLM+P: Empowering Large Language Models with Optimal Planning Proficiency

    cs.AI 2023-04 accept novelty 7.0

    LLM+P lets LLMs solve planning problems optimally by converting them to PDDL for classical planners and back to natural language.

  6. Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks

    cs.CL 2022-11 unverdicted novelty 7.0

    PoT prompting improves numerical reasoning by having language models write programs executed by a computer instead of performing calculations in natural language chains of thought, with an average 12% gain over CoT.

  7. LLMs Know When They Know, but Do Not Act on It: A Metacognitive Harness for Test-time Scaling

    cs.LG 2026-05 conditional novelty 6.0

    A metacognitive harness uses LLMs' pre- and post-solution self-monitoring signals to control test-time reasoning, raising pooled accuracy from 48.3% to 56.9% on text, code, and multimodal benchmarks.

  8. Continuous Discovery of Vulnerabilities in LLM Serving Systems with Fuzzing

    cs.CR 2026-05 unverdicted novelty 6.0

    GRIEF fuzzer finds 15 vulnerabilities including 2 CVEs in vLLM and SGLang by testing concurrent workloads for KV-cache isolation failures and cross-request interference.

  9. Teaching Language Models to Think in Code

    cs.CL 2026-05 unverdicted novelty 6.0

    ThinC trains smaller language models to reason entirely in code after minimal NL planning, outperforming tool-integrated baselines and even much larger models on competition math benchmarks.

  10. Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards

    cs.AI 2026-05 unverdicted novelty 6.0

    TraceLift trains reasoning planners using rewards that credit traces for both rubric quality and actual performance gains on a frozen executor, outperforming final-answer-only training on math and code tasks.

  11. Towards Verifiable and Self-Correcting AI Physicists for Quantum Many-Body Simulations

    physics.comp-ph 2026-03 unverdicted novelty 6.0

    QMP-Bench supplies a realistic test set for AI on quantum many-body problems while PhysVEC uses integrated verifiers to turn unreliable LLM generations into code that passes both syntax and physics checks, outperformi...

  12. ReTool: Reinforcement Learning for Strategic Tool Use in LLMs

    cs.CL 2025-04 unverdicted novelty 6.0

    ReTool uses outcome-driven RL to train 32B LLMs to dynamically use code tools during reasoning, reaching 72.5% accuracy on AIME and surpassing o1-preview.

  13. Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

    cs.LG 2024-08 unverdicted novelty 6.0

    An adaptive compute-optimal strategy for scaling LLM test-time compute achieves over 4x efficiency gains versus best-of-N and lets smaller models outperform 14x larger ones on some problems.

  14. Gorilla: Large Language Model Connected with Massive APIs

    cs.CL 2023-05 conditional novelty 6.0

    Gorilla is a fine-tuned LLM that surpasses GPT-4 in accurate API call generation and uses retrieval to handle documentation updates.

  15. Teaching Large Language Models to Self-Debug

    cs.CL 2023-04 unverdicted novelty 6.0

    Self-Debugging teaches LLMs to identify and fix their own code errors through rubber-duck-style natural language explanations and execution feedback, delivering 2-12% gains over baselines on Spider, TransCoder, and MBPP.

  16. HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face

    cs.CL 2023-03 unverdicted novelty 6.0

    HuggingGPT is an agent system where ChatGPT plans and orchestrates calls to Hugging Face models to solve complex multi-modal AI tasks.

  17. MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

    cs.CV 2023-03 unverdicted novelty 6.0

    MM-REACT uses textual prompts to let ChatGPT collaborate with external vision experts for zero-shot multimodal reasoning and action on advanced visual tasks.

  18. Reinforcement Learning for Tool-Calling Agents in Fast Healthcare Interoperability Resources (FHIR)

    cs.LG 2026-05 unverdicted novelty 5.0

    RL post-training lifts answer correctness on FHIR-AgentBench from 50% (o4-mini) to 77% with a cheaper Qwen3-8B CodeAct agent.

  19. LLMs with in-context learning for Algorithmic Theoretical Physics

    cs.LG 2026-05 unverdicted novelty 5.0

    Frontier LLMs with in-context learning and CAS integration solve most algorithmic tasks in theoretical physics when supplied with worked examples.

  20. The Cartesian Cut in Agentic AI

    cs.AI 2026-04 unverdicted novelty 5.0

    LLM agents use a Cartesian split between learned prediction and engineered control, enabling modularity but creating sensitivity and bottlenecks unlike integrated biological systems.

  21. StarCoder: may the source be with you!

    cs.CL 2023-05 accept novelty 5.0

    StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.

  22. Self-Refine: Iterative Refinement with Self-Feedback

    cs.CL 2023-03 unverdicted novelty 5.0

    Self-Refine boosts LLM outputs by ~20% on average across seven tasks by having the same model iteratively generate, critique, and refine its own responses.

  23. SciFi: A Safe, Lightweight, User-Friendly, and Fully Autonomous Agentic AI Workflow for Scientific Applications

    cs.AI 2026-04 unverdicted novelty 4.0

    SciFi is a safe, lightweight agentic AI framework that automates structured scientific tasks with minimal human intervention via isolated environments and layered self-assessing agents.
