A Neural Network Solves, Explains, and Generates University Math Problems by Program Synthesis and Few-Shot Learning at Human Level

Albert Lu; Avi Shporer; Elizabeth Ke; Eugene Wu; Gilbert Strang; Iddo Drori; Jayson Lynch; Kevin Liu; Leonard Tang; Linda Chen

arxiv: 2112.15594 · v4 · pith:YG4TVE66new · submitted 2021-12-31 · 💻 cs.LG · cs.AI

A Neural Network Solves, Explains, and Generates University Math Problems by Program Synthesis and Few-Shot Learning at Human Level

Iddo Drori , Sarah Zhang , Reece Shuttleworth , Leonard Tang , Albert Lu , Elizabeth Ke , Kevin Liu , Linda Chen

show 10 more authors

Sunny Tran Newman Cheng Roman Wang Nikhil Singh Taylor L. Patti Jayson Lynch Avi Shporer Nakul Verma Eugene Wu Gilbert Strang

This is my paper

classification 💻 cs.LG cs.AI

keywords questionsmathematicslearningalgebraautomaticallycoursefew-shotproblems

0 comments

read the original abstract

We demonstrate that a neural network pre-trained on text and fine-tuned on code solves mathematics course problems, explains solutions, and generates new questions at a human level. We automatically synthesize programs using few-shot learning and OpenAI's Codex transformer and execute them to solve course problems at 81% automatic accuracy. We curate a new dataset of questions from MIT's largest mathematics courses (Single Variable and Multivariable Calculus, Differential Equations, Introduction to Probability and Statistics, Linear Algebra, and Mathematics for Computer Science) and Columbia University's Computational Linear Algebra. We solve questions from a MATH dataset (on Prealgebra, Algebra, Counting and Probability, Intermediate Algebra, Number Theory, and Precalculus), the latest benchmark of advanced mathematics problems designed to assess mathematical reasoning. We randomly sample questions and generate solutions with multiple modalities, including numbers, equations, and plots. The latest GPT-3 language model pre-trained on text automatically solves only 18.8% of these university questions using zero-shot learning and 30.8% using few-shot learning and the most recent chain of thought prompting. In contrast, program synthesis with few-shot learning using Codex fine-tuned on code generates programs that automatically solve 81% of these questions. Our approach improves the previous state-of-the-art automatic solution accuracy on the benchmark topics from 8.8% to 81.1%. We perform a survey to evaluate the quality and difficulty of generated questions. This work is the first to automatically solve university-level mathematics course questions at a human level and the first work to explain and generate university-level mathematics course questions at scale, a milestone for higher education.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Measuring Representation Robustness in Large Language Models for Geometry
cs.CL 2026-04 unverdicted novelty 6.0

LLMs display accuracy gaps of up to 14 percentage points on the same geometry problems solely due to representation choice, with vector forms consistently weakest and a convert-then-solve prompt helping only high-capa...
ORFS-agent: Tool-Using Agents for Chip Design Optimization
cs.AI 2025-06 unverdicted novelty 5.0

ORFS-agent uses LLM agents to tune parameters in chip design flows, improving geometric-mean wirelength, clock period, and co-optimization objectives by up to 2.7% over OR-AutoTuner with 40% fewer iterations on ASAP7 ...
PaLM 2 Technical Report
cs.CL 2023-05 unverdicted novelty 5.0

PaLM 2 reports state-of-the-art results on language, reasoning, and multilingual tasks with improved efficiency over PaLM.
A Survey of Large Language Models
cs.CL 2023-03 accept novelty 3.0

This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.