{"work":{"id":"fd241a05-03b9-4de2-9588-9d77ce176125","openalex_id":null,"doi":null,"arxiv_id":"2108.07732","raw_key":null,"title":"Program Synthesis with Large Language Models","authors":null,"authors_text":"Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al","year":2021,"venue":"cs.PL","abstract":"This paper explores the limits of the current generation of large language models for program synthesis in general purpose programming languages. We evaluate a collection of such models (with between 244M and 137B parameters) on two new benchmarks, MBPP and MathQA-Python, in both the few-shot and fine-tuning regimes. Our benchmarks are designed to measure the ability of these models to synthesize short Python programs from natural language descriptions. The Mostly Basic Programming Problems (MBPP) dataset contains 974 programming tasks, designed to be solvable by entry-level programmers. The MathQA-Python dataset, a Python version of the MathQA benchmark, contains 23914 problems that evaluate the ability of the models to synthesize code from more complex text. On both datasets, we find that synthesis performance scales log-linearly with model size. Our largest models, even without finetuning on a code dataset, can synthesize solutions to 59.6 percent of the problems from MBPP using few-shot learning with a well-designed prompt. Fine-tuning on a held-out portion of the dataset improves performance by about 10 percentage points across most model sizes. On the MathQA-Python dataset, the largest fine-tuned model achieves 83.8 percent accuracy. Going further, we study the model's ability to engage in dialog about code, incorporating human feedback to improve its solutions. We find that natural language feedback from a human halves the error rate compared to the model's initial prediction. Additionally, we conduct an error analysis to shed light on where these models fall short and what types of programs are most difficult to generate. Finally, we explore the semantic grounding of these models by fine-tuning them to predict the results of program execution. We find that even our best models are generally unable to predict the output of a program given a specific input.","external_url":"https://arxiv.org/abs/2108.07732","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-06-29T14:33:31.688491+00:00","pith_arxiv_id":"2108.07732","created_at":"2026-05-09T05:55:30.339769+00:00","updated_at":"2026-06-29T14:33:31.688491+00:00","title_quality_ok":true,"display_title":"Program Synthesis with Large Language Models","render_title":"Program Synthesis with Large Language Models"},"hub":{"state":{"work_id":"fd241a05-03b9-4de2-9588-9d77ce176125","tier":"super_hub","tier_reason":"100+ Pith inbound or 10,000+ external citations","pith_inbound_count":410,"external_cited_by_count":null,"distinct_field_count":22,"first_pith_cited_at":"2021-11-30T21:32:46+00:00","last_pith_cited_at":"2026-06-16T19:24:32+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-29T14:38:56.376932+00:00","tier_text":"super_hub"},"tier":"super_hub","role_counts":[{"context_role":"background","n":57},{"context_role":"dataset","n":41},{"context_role":"method","n":4},{"context_role":"other","n":2}],"polarity_counts":[{"context_polarity":"background","n":54},{"context_polarity":"use_dataset","n":36},{"context_polarity":"unclear","n":9},{"context_polarity":"use_method","n":4},{"context_polarity":"support","n":1}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"Program Synthesis with Large Language Models","claims":[{"claim_text":"This paper explores the limits of the current generation of large language models for program synthesis in general purpose programming languages. We evaluate a collection of such models (with between 244M and 137B parameters) on two new benchmarks, MBPP and MathQA-Python, in both the few-shot and fine-tuning regimes. Our benchmarks are designed to measure the ability of these models to synthesize short Python programs from natural language descriptions. The Mostly Basic Programming Problems (MBPP) dataset contains 974 programming tasks, designed to be solvable by entry-level programmers. The M","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Program Synthesis with Large Language Models because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-13T20:33:37.528772+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"5d884260-32e6-4d18-bee5-e8ffc89cdf73","orcid":null,"display_name":"Jacob Austin"},{"id":"b58c3b39-0170-46e1-b8f1-efe56c55ba90","orcid":null,"display_name":"Augustus Odena"},{"id":"1750ce8a-1201-4397-ab14-4b8d5df9389b","orcid":null,"display_name":"Maxwell Nye"},{"id":"0778e3ce-2eaf-448e-8c23-5a30ba22c9fb","orcid":null,"display_name":"Maarten Bosma"},{"id":"2aaf6fc6-91d5-4415-9832-2dab72c87cb4","orcid":null,"display_name":"Henryk Michalewski"},{"id":"38fa6331-dd40-4c8b-935a-542d8cda726c","orcid":null,"display_name":"David Dohan"}]},"error":null,"updated_at":"2026-05-13T20:33:34.855835+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-13T20:23:33.733592+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":139},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":73},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":43},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":39},{"title":"Measuring Mathematical Problem Solving With the MATH Dataset","work_id":"50652ac6-fb7c-4675-a2c2-159c241feb17","shared_citers":32},{"title":"Code Llama: Open Foundation Models for Code","work_id":"e73bffa4-7620-47ac-9327-259a60db52ca","shared_citers":30},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":30},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":30},{"title":"LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code","work_id":"ea9e51ce-1e75-4182-92d8-4d25f70d2ee4","shared_citers":28},{"title":"Qwen2.5-Coder Technical Report","work_id":"09ba463d-6377-4017-9801-444ffb94b056","shared_citers":27},{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":25},{"title":"Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge","work_id":"28ea1282-d657-4c61-a83c-f1249be6d6b1","shared_citers":25},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":24},{"title":"SWE-bench: Can Language Models Resolve Real-World GitHub Issues?","work_id":"d0effe15-a689-441a-8e3f-ea35f1c4e4b1","shared_citers":22},{"title":"DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence","work_id":"f22dae5a-27e2-41d0-a061-c4286418dee3","shared_citers":20},{"title":"Measuring Massive Multitask Language Understanding","work_id":"e87ec49a-544b-4ec8-8991-75298c64ff5e","shared_citers":20},{"title":"DeepSeek-V3 Technical Report","work_id":"57d2791d-2219-4c31-a077-afc04b12a75c","shared_citers":19},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":19},{"title":"Training Compute-Optimal Large Language Models","work_id":"b2faf28d-86b7-429c-bc42-469458efc246","shared_citers":19},{"title":"Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them","work_id":"513eb205-04ca-4722-9a43-a74e8cbe7e85","shared_citers":18},{"title":"Measuring Coding Challenge Competence With APPS","work_id":"c014c12f-1080-4cb2-ae03-ab6b7c09445c","shared_citers":18},{"title":"StarCoder: may the source be with you!","work_id":"7e9c3d6e-d6f7-4763-9ef6-de471506c58f","shared_citers":18},{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":17},{"title":"Qwen2.5 Technical Report","work_id":"d8432992-4980-4a81-85c7-9fa2c2b87f85","shared_citers":17}],"time_series":[{"n":2,"year":2021},{"n":4,"year":2022},{"n":7,"year":2023},{"n":15,"year":2024},{"n":10,"year":2025},{"n":147,"year":2026}]},"error":null,"updated_at":"2026-05-13T20:23:33.873412+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"fixed":1,"items":[{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-13T20:23:49.456067+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"Program Synthesis with Large Language Models","claims":[{"claim_text":"This paper explores the limits of the current generation of large language models for program synthesis in general purpose programming languages. We evaluate a collection of such models (with between 244M and 137B parameters) on two new benchmarks, MBPP and MathQA-Python, in both the few-shot and fine-tuning regimes. Our benchmarks are designed to measure the ability of these models to synthesize short Python programs from natural language descriptions. The Mostly Basic Programming Problems (MBPP) dataset contains 974 programming tasks, designed to be solvable by entry-level programmers. The M","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Program Synthesis with Large Language Models because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-13T20:23:32.195750+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Program Synthesis with Large Language Models","claims":[{"claim_text":"This paper explores the limits of the current generation of large language models for program synthesis in general purpose programming languages. We evaluate a collection of such models (with between 244M and 137B parameters) on two new benchmarks, MBPP and MathQA-Python, in both the few-shot and fine-tuning regimes. Our benchmarks are designed to measure the ability of these models to synthesize short Python programs from natural language descriptions. The Mostly Basic Programming Problems (MBPP) dataset contains 974 programming tasks, designed to be solvable by entry-level programmers. The M","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Program Synthesis with Large Language Models because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-13T20:23:49.475357+00:00"}},"summary":{"title":"Program Synthesis with Large Language Models","claims":[{"claim_text":"This paper explores the limits of the current generation of large language models for program synthesis in general purpose programming languages. We evaluate a collection of such models (with between 244M and 137B parameters) on two new benchmarks, MBPP and MathQA-Python, in both the few-shot and fine-tuning regimes. Our benchmarks are designed to measure the ability of these models to synthesize short Python programs from natural language descriptions. The Mostly Basic Programming Problems (MBPP) dataset contains 974 programming tasks, designed to be solvable by entry-level programmers. The M","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Program Synthesis with Large Language Models because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":139},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":73},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":43},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":39},{"title":"Measuring Mathematical Problem Solving With the MATH Dataset","work_id":"50652ac6-fb7c-4675-a2c2-159c241feb17","shared_citers":32},{"title":"Code Llama: Open Foundation Models for Code","work_id":"e73bffa4-7620-47ac-9327-259a60db52ca","shared_citers":30},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":30},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":30},{"title":"LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code","work_id":"ea9e51ce-1e75-4182-92d8-4d25f70d2ee4","shared_citers":28},{"title":"Qwen2.5-Coder Technical Report","work_id":"09ba463d-6377-4017-9801-444ffb94b056","shared_citers":27},{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":25},{"title":"Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge","work_id":"28ea1282-d657-4c61-a83c-f1249be6d6b1","shared_citers":25},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":24},{"title":"SWE-bench: Can Language Models Resolve Real-World GitHub Issues?","work_id":"d0effe15-a689-441a-8e3f-ea35f1c4e4b1","shared_citers":22},{"title":"DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence","work_id":"f22dae5a-27e2-41d0-a061-c4286418dee3","shared_citers":20},{"title":"Measuring Massive Multitask Language Understanding","work_id":"e87ec49a-544b-4ec8-8991-75298c64ff5e","shared_citers":20},{"title":"DeepSeek-V3 Technical Report","work_id":"57d2791d-2219-4c31-a077-afc04b12a75c","shared_citers":19},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":19},{"title":"Training Compute-Optimal Large Language Models","work_id":"b2faf28d-86b7-429c-bc42-469458efc246","shared_citers":19},{"title":"Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them","work_id":"513eb205-04ca-4722-9a43-a74e8cbe7e85","shared_citers":18},{"title":"Measuring Coding Challenge Competence With APPS","work_id":"c014c12f-1080-4cb2-ae03-ab6b7c09445c","shared_citers":18},{"title":"StarCoder: may the source be with you!","work_id":"7e9c3d6e-d6f7-4763-9ef6-de471506c58f","shared_citers":18},{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":17},{"title":"Qwen2.5 Technical Report","work_id":"d8432992-4980-4a81-85c7-9fa2c2b87f85","shared_citers":17}],"time_series":[{"n":2,"year":2021},{"n":4,"year":2022},{"n":7,"year":2023},{"n":15,"year":2024},{"n":10,"year":2025},{"n":147,"year":2026}]},"authors":[{"id":"b58c3b39-0170-46e1-b8f1-efe56c55ba90","orcid":null,"display_name":"Augustus Odena","source":"manual","import_confidence":0.72},{"id":"38fa6331-dd40-4c8b-935a-542d8cda726c","orcid":null,"display_name":"David Dohan","source":"manual","import_confidence":0.72},{"id":"2aaf6fc6-91d5-4415-9832-2dab72c87cb4","orcid":null,"display_name":"Henryk Michalewski","source":"manual","import_confidence":0.72},{"id":"5d884260-32e6-4d18-bee5-e8ffc89cdf73","orcid":null,"display_name":"Jacob Austin","source":"manual","import_confidence":0.72},{"id":"0778e3ce-2eaf-448e-8c23-5a30ba22c9fb","orcid":null,"display_name":"Maarten Bosma","source":"manual","import_confidence":0.72},{"id":"1750ce8a-1201-4397-ab14-4b8d5df9389b","orcid":null,"display_name":"Maxwell Nye","source":"manual","import_confidence":0.72}]}}