{"paper":{"title":"PAL: Program-aided Language Models","license":"http://creativecommons.org/publicdomain/zero/1.0/","headline":"LLMs generate programs as reasoning steps and let a Python interpreter execute them to solve math and symbolic problems more accurately than much larger models using chain-of-thought.","cross_cats":["cs.AI"],"primary_cat":"cs.CL","authors_text":"Aman Madaan, Graham Neubig, Jamie Callan, Luyu Gao, Pengfei Liu, Shuyan Zhou, Uri Alon, Yiming Yang","submitted_at":"2022-11-18T18:56:13Z","abstract_excerpt":"Large language models (LLMs) have recently demonstrated an impressive ability to perform arithmetic and symbolic reasoning tasks, when provided with a few examples at test time (\"few-shot prompting\"). Much of this success can be attributed to prompting methods such as \"chain-of-thought'', which employ LLMs for both understanding the problem description by decomposing it into steps, as well as solving each step of the problem. While LLMs seem to be adept at this sort of step-by-step decomposition, LLMs often make logical and arithmetic mistakes in the solution part, even when the problem is dec"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"PAL using Codex achieves state-of-the-art few-shot accuracy on the GSM8K benchmark of math word problems, surpassing PaLM-540B which uses chain-of-thought by absolute 15% top-1.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the LLM will reliably generate correct, executable programs whose logic matches the intended reasoning without introducing its own coding or planning errors.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"PAL improves few-shot reasoning accuracy by having LLMs generate executable programs rather than text-based chains of thought, outperforming much larger models on 
math and logic benchmarks.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"LLMs generate programs as reasoning steps and let a Python interpreter execute them to solve math and symbolic problems more accurately than much larger models using chain-of-thought.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"c40e8ad7960f510d2b8f7cbb245a628c3a1fb13bbd99eccde6c19a821ec691f3"},"source":{"id":"2211.10435","kind":"arxiv","version":2},"verdict":{"id":"0a44b9dd-a9d7-4379-8e62-79055d9e279b","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T04:58:42.378977Z","strongest_claim":"PAL using Codex achieves state-of-the-art few-shot accuracy on the GSM8K benchmark of math word problems, surpassing PaLM-540B which uses chain-of-thought by absolute 15% top-1.","one_line_summary":"PAL improves few-shot reasoning accuracy by having LLMs generate executable programs rather than text-based chains of thought, outperforming much larger models on math and logic benchmarks.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the LLM will reliably generate correct, executable programs whose logic matches the intended reasoning without introducing its own coding or planning errors.","pith_extraction_headline":"LLMs generate programs as reasoning steps and let a Python interpreter execute them to solve math and symbolic problems more accurately than much larger models using chain-of-thought."},"references":{"count":44,"sample":[{"doi":"","year":2022,"title":"Do As I Can, Not As I Say: Grounding Language in Robotic Affordances","work_id":"037320f1-b0a9-4cbe-a639-bfb25409ce71","ref_index":1,"cited_arxiv_id":"2204.01691","is_internal_anchor":true},{"doi":"","year":2019,"title":"https://aclanthology.org/N19-1245 MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based 
Formalisms","work_id":"95ff4a33-2a6e-4326-9caa-ac6d568e3241","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2019,"title":"Giving BERT a calculator: Finding operations and arguments with reading comprehension","work_id":"ab124ee4-c511-4a61-8482-bad3e558fe10","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2020,"title":"Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ram","work_id":"96eceee9-e1b2-4c6f-9f77-b5dc792fb8eb","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2021,"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","ref_index":6,"cited_arxiv_id":"2107.03374","is_internal_anchor":true}],"resolved_work":44,"snapshot_sha256":"827a1b1a9cdfb4ec9e24c12d0324d53ae30956e876a8ac611fe71a5ff83db1be","internal_anchors":18},"formal_canon":{"evidence_count":2,"snapshot_sha256":"66d2c61de709373853cae67ddcdc89d3e34752de848492c973d8eabef145e072"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}