MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms

Aida Amini; Hannaneh Hajishirzi; Peter Lin; Rik Koncel-Kedziorski; Saadia Gabriel; Yejin Choi

arxiv: 1905.13319 · v1 · pith:ZHIHGHVHnew · submitted 2019-05-30 · 💻 cs.CL

classification 💻 cs.CL

keywords dataset · problem · math · introduce · mathqa · programs · aqua · challenges

We introduce a large-scale dataset of math word problems and an interpretable neural math problem solver that learns to map problems to operation programs. Due to annotation challenges, current datasets in this domain have either been relatively small in scale or have not offered precise operational annotations over diverse problem types. We introduce a new representation language to model precise operation programs corresponding to each math problem, with the aim of improving both the performance and the interpretability of the learned models. Using this representation language, our new dataset, MathQA, significantly enhances the AQuA dataset with fully-specified operational programs. We additionally introduce a neural sequence-to-program model enhanced with automatic problem categorization. Our experiments show improvements over competitive baselines on both MathQA and the AQuA dataset. The results are still significantly lower than human performance, indicating that the dataset poses new challenges for future research. Our dataset is available at: https://math-qa.github.io/math-QA/
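
As a rough illustration of what an "operation program" could look like, the sketch below evaluates a short program written in a pipe-separated notation. The notation itself is an assumption for illustration and is not specified in the abstract: that covers the operation names, the `n0` (number from the problem), `#0` (earlier step result), and `const_100` (constant) argument conventions, and the helper names `resolve` and `run_program`.

```python
import re

# Hypothetical operation table; the paper's representation language
# presumably defines a larger set than these four.
OPS = {
    "add": lambda a, b: a + b,
    "subtract": lambda a, b: a - b,
    "multiply": lambda a, b: a * b,
    "divide": lambda a, b: a / b,
}


def resolve(arg, numbers, results):
    """Turn one argument token into a value (assumed token conventions)."""
    if arg.startswith("const_"):
        return float(arg[len("const_"):])   # e.g. const_100 -> 100.0
    if arg.startswith("n"):
        return numbers[int(arg[1:])]        # n0 -> first number in the problem
    if arg.startswith("#"):
        return results[int(arg[1:])]        # #0 -> result of the first step
    raise ValueError(f"unknown argument: {arg}")


def run_program(program, numbers):
    """Execute steps left to right and return the last step's result."""
    results = []
    for step in program.split("|"):
        match = re.fullmatch(r"(\w+)\((.*)\)", step.strip())
        if match is None:
            raise ValueError(f"malformed step: {step}")
        name, raw_args = match.groups()
        args = [resolve(a.strip(), numbers, results) for a in raw_args.split(",")]
        results.append(OPS[name](*args))
    return results[-1]


# Made-up problem: "What is 20 percent of 150?" with numbers [20, 150].
answer = run_program("multiply(n0,n1)|divide(#0,const_100)", [20.0, 150.0])
print(answer)
```

Because each step is a named operation over explicit arguments, the intermediate results can be inspected one by one, which is the kind of interpretability the abstract refers to.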

Forward citations

Cited by 24 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Generative Language Modeling for Automated Theorem Proving
cs.LG 2020-09 unverdicted novelty 8.0

GPT-f, a transformer-based prover for Metamath, generated new short proofs that were accepted into the main library—the first such contribution from a deep-learning system.
EngiBench: A Benchmark for Evaluating Large Language Models on Engineering Problem Solving
cs.AI 2025-09 unverdicted novelty 7.0

EngiBench shows that LLM accuracy drops with task complexity, degrades under perturbations, and stays below human performance on open-ended engineering problems.
Data Mixing Agent: Learning to Re-weight Domains for Continual Pre-training
cs.LG 2025-07 unverdicted novelty 7.0

An RL agent learns domain re-weighting policies from evaluation feedback to improve balanced performance in continual pre-training of LLMs across source and target domains.
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
cs.LG 2025-02 unverdicted novelty 7.0

A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.
Semantic Integrity Matters: Benchmarking and Preserving High-Density Reasoning in KV Cache Compression
cs.CL 2025-02 unverdicted novelty 7.0

KV cache compression causes task-dependent degradation in high-density reasoning due to disrupted CoT links; ShotKV mitigates this by preserving few-shot examples as indivisible semantic units through phase separation...
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
cs.CV 2024-03 conditional novelty 7.0

MathVerse is a benchmark that tests multi-modal LLMs on visual math by providing each problem in six versions with progressively less diagram and text information to measure true visual understanding.
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
cs.LG 2024-01 conditional novelty 7.0

Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
GEMQ: Global Expert-Level Mixed-Precision Quantization for MoE LLMs
cs.LG 2026-05 unverdicted novelty 6.0

GEMQ applies global LP-based expert importance estimation and router fine-tuning within progressive quantization to cut memory and speed inference in MoE LLMs with little accuracy loss.
Know When To Fold 'Em: Token-Efficient LLM Synthetic Data Generation via Multi-Stage In-Flight Rejection
cs.AI 2026-05 unverdicted novelty 6.0

MSIFR stops faulty LLM generations early via staged rule-based checks, reducing token consumption 11-78% with no accuracy loss.
OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces
cs.AI 2026-05 unverdicted novelty 6.0

OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.
Transforming External Knowledge into Triplets for Enhanced Retrieval in RAG of LLMs
cs.CL 2026-04 unverdicted novelty 6.0

Tri-RAG turns external knowledge into Condition-Proof-Conclusion triplets and retrieves via the Condition anchor to improve efficiency and quality in LLM RAG.
TEMPER: Testing Emotional Perturbation in Quantitative Reasoning
cs.CL 2026-04 unverdicted novelty 6.0

Emotional framing in quantitative reasoning problems reduces LLM accuracy by 2-10 percentage points, recoverable by neutralization, unlike neutral paraphrases.
MathFlow: Enhancing the Perceptual Flow of MLLMs for Visual Mathematical Problems
cs.CV 2025-03 unverdicted novelty 6.0

MathFlow decouples perception and inference stages in MLLMs for visual math, with a dedicated perception model delivering gains on the FlowVerse benchmark when paired with existing reasoners.
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
cs.CV 2024-12 unverdicted novelty 6.0

InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation
cs.CV 2024-04 unverdicted novelty 6.0

SEED-X is a unified multimodal foundation model that handles multi-granularity visual semantics for both comprehension and generation across arbitrary image sizes and ratios.
Solving math word problems with process- and outcome-based feedback
cs.LG 2022-11 unverdicted novelty 6.0

On GSM8K, outcome-based supervision achieves similar final-answer error rates to process-based with less labeling, but process-based or learned reward models are needed to reach 3.4% reasoning error among correct solutions.
PaLM: Scaling Language Modeling with Pathways
cs.CL 2022-04 accept novelty 6.0

PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.
Training Verifiers to Solve Math Word Problems
cs.LG 2021-10 conditional novelty 6.0

Introduces GSM8K dataset and demonstrates that verifier-based selection of solutions from multiple candidates outperforms fine-tuning baselines on math word problems.
Generalizing Numerical Reasoning in Table Data through Operation Sketches and Self-Supervised Learning
cs.LG 2026-04 unverdicted novelty 5.0

TaNOS decouples table semantics from numerical structure via anonymization, sketches, and program-first self-supervision, yielding 80.13% FinQA accuracy with 10% data and near-zero cross-domain gap versus over 10pp fo...
Improving LLM First-Token Predictions in Multiple-Choice Question Answering via Output Prefilling
cs.CL 2025-05 conditional novelty 5.0

Output prefilling with a structured prefix steers LLMs to produce cleaner first tokens in MCQA, raising accuracy and calibration over standard first-token probability.
Artificial Intelligence in Number Theory: LLMs for Algorithm Generation and Ensemble Methods for Conjecture Verification
math.NT 2025-04 conditional novelty 5.0

LLM reaches >=0.95 accuracy on 60 number theory problems with optimal hints; LightGBM classifier empirically supports Dirichlet conductor conjecture via zero features at 93.9% test accuracy for small q.
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
cs.CV 2024-12 accept novelty 5.0

DeepSeek-VL2 is a series of MoE vision-language models using dynamic tiling and latent attention that reach competitive or state-of-the-art results on VQA, OCR, document understanding and grounding with 1.0B to 4.5B a...
InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model
cs.CV 2024-01 unverdicted novelty 5.0

InternLM-XComposer2 introduces Partial LoRA on InternLM2-7B to enable high-quality free-form text-image composition while matching or exceeding GPT-4V on select vision-language benchmarks.
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
cs.CV 2025-02 unverdicted novelty 4.0

Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.