MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms
read the original abstract
We introduce a large-scale dataset of math word problems and an interpretable neural math problem solver that learns to map problems to operation programs. Due to annotation challenges, current datasets in this domain have been either relatively small in scale or did not offer precise operational annotations over diverse problem types. We introduce a new representation language to model precise operation programs corresponding to each math problem that aim to improve both the performance and the interpretability of the learned models. Using this representation language, our new dataset, MathQA, significantly enhances the AQuA dataset with fully-specified operational programs. We additionally introduce a neural sequence-to-program model enhanced with automatic problem categorization. Our experiments show improvements over competitive baselines in our MathQA as well as the AQuA dataset. The results are still significantly lower than human performance indicating that the dataset poses new challenges for future research. Our dataset is available at: https://math-qa.github.io/math-QA/
This paper has not been read by Pith yet.
Forward citations
Cited by 24 Pith papers
-
Generative Language Modeling for Automated Theorem Proving
GPT-f, a transformer-based prover for Metamath, generated new short proofs that were accepted into the main library—the first such contribution from a deep-learning system.
-
EngiBench: A Benchmark for Evaluating Large Language Models on Engineering Problem Solving
EngiBench shows LLMs accuracy drops with task complexity, degrades under perturbations, and stays below human performance on open-ended engineering problems.
-
Data Mixing Agent: Learning to Re-weight Domains for Continual Pre-training
An RL agent learns domain re-weighting policies from evaluation feedback to improve balanced performance in continual pre-training of LLMs across source and target domains.
-
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.
-
Semantic Integrity Matters: Benchmarking and Preserving High-Density Reasoning in KV Cache Compression
KV cache compression causes task-dependent degradation in high-density reasoning due to disrupted CoT links; ShotKV mitigates this by preserving few-shot examples as indivisible semantic units through phase separation...
-
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
MathVerse is a benchmark that tests multi-modal LLMs on visual math by providing each problem in six versions with progressively less diagram and text information to measure true visual understanding.
-
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
-
GEMQ: Global Expert-Level Mixed-Precision Quantization for MoE LLMs
GEMQ applies global LP-based expert importance estimation and router fine-tuning within progressive quantization to cut memory and speed inference in MoE LLMs with little accuracy loss.
-
Know When To Fold 'Em: Token-Efficient LLM Synthetic Data Generation via Multi-Stage In-Flight Rejection
MSIFR stops faulty LLM generations early via staged rule-based checks, reducing token consumption 11-78% with no accuracy loss.
-
OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces
OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.
-
Transforming External Knowledge into Triplets for Enhanced Retrieval in RAG of LLMs
Tri-RAG turns external knowledge into Condition-Proof-Conclusion triplets and retrieves via the Condition anchor to improve efficiency and quality in LLM RAG.
-
TEMPER: Testing Emotional Perturbation in Quantitative Reasoning
Emotional framing in quantitative reasoning problems reduces LLM accuracy by 2-10 percentage points, recoverable by neutralization, unlike neutral paraphrases.
-
MathFlow: Enhancing the Perceptual Flow of MLLMs for Visual Mathematical Problems
MathFlow decouples perception and inference stages in MLLMs for visual math, with a dedicated perception model delivering gains on the FlowVerse benchmark when paired with existing reasoners.
-
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
-
SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation
SEED-X is a unified multimodal foundation model that handles multi-granularity visual semantics for both comprehension and generation across arbitrary image sizes and ratios.
-
Solving math word problems with process- and outcome-based feedback
On GSM8K, outcome-based supervision achieves similar final-answer error rates to process-based with less labeling, but process-based or learned reward models are needed to reach 3.4% reasoning error among correct solutions.
-
PaLM: Scaling Language Modeling with Pathways
PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.
-
Training Verifiers to Solve Math Word Problems
Introduces GSM8K dataset and demonstrates that verifier-based selection of solutions from multiple candidates outperforms fine-tuning baselines on math word problems.
-
Generalizing Numerical Reasoning in Table Data through Operation Sketches and Self-Supervised Learning
TaNOS decouples table semantics from numerical structure via anonymization, sketches, and program-first self-supervision, yielding 80.13% FinQA accuracy with 10% data and near-zero cross-domain gap versus over 10pp fo...
-
Improving LLM First-Token Predictions in Multiple-Choice Question Answering via Output Prefilling
Output prefilling with a structured prefix steers LLMs to produce cleaner first tokens in MCQA, raising accuracy and calibration over standard first-token probability.
-
Artificial Intelligence in Number Theory: LLMs for Algorithm Generation and Ensemble Methods for Conjecture Verification
LLM reaches >=0.95 accuracy on 60 number theory problems with optimal hints; LightGBM classifier empirically supports Dirichlet conductor conjecture via zero features at 93.9% test accuracy for small q.
-
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
DeepSeek-VL2 is a series of MoE vision-language models using dynamic tiling and latent attention that reach competitive or state-of-the-art results on VQA, OCR, document understanding and grounding with 1.0B to 4.5B a...
-
InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model
InternLM-XComposer2 introduces Partial LoRA on InternLM2-7B to enable high-quality free-form text-image composition while matching or exceeding GPT-4V on select vision-language benchmarks.
-
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.