Training and Evaluating Language Models with Template-based Data Generation

Yifan Zhang

arXiv: 2411.18104 · v6 · pith:LB2ILFMAnew · submitted 2024-11-27 · cs.CL · cs.AI · cs.LG

Pith reviewed 2026-05-23 16:58 UTC · model grok-4.3

classification cs.CL · cs.AI · cs.LG

keywords Template-based Data Generation · synthetic datasets · mathematical reasoning · large language models · data augmentation · verifiable solutions · grade school math · RLVR

The pith

Template-based Data Generation uses GPT-4 to create meta-templates that synthesize over 7 million verifiable grade-school math problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Template-based Data Generation (TDG) to address the shortage of large-scale, high-quality datasets required for training large language models on multi-step mathematical reasoning. TDG has frontier models like GPT-4 automatically produce parameterized meta-templates, from which an essentially unlimited supply of problems and solutions can be generated. The authors demonstrate the approach by building TemplateGSM, a dataset exceeding 7 million problems, each accompanied by a programmatically verifiable solution. This resource supports both supervised fine-tuning and reinforcement learning with verifiable rewards, aiming to overcome the data and verification bottlenecks that currently limit reliable reasoning in LLMs.

Core claim

The central claim is that frontier LLMs can generate parameterized meta-templates that in turn synthesize a virtually infinite stream of high-quality, diverse, and complex math problems with programmatically verifiable solutions, as realized in the TemplateGSM dataset of more than 7 million grade-school problems. According to the paper, this directly resolves the scarcity of domain-specific data needed to cultivate sophisticated reasoning abilities in language models.

What carries the argument

Template-based Data Generation (TDG), a paradigm that harnesses frontier LLMs to automatically generate parameterized meta-templates which synthesize problems and solutions.
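
To make the idea of a parameterized meta-template concrete, here is a minimal Python sketch. Everything in it is a hypothetical illustration rather than a template from TemplateGSM: the function name `make_purchase_problem`, the name and item pools, the parameter ranges, and the wording are all assumptions. The paper itself only states that GPT-4 generates the templates and that each problem comes with a programmatically verifiable solution.

```python
import random

NAMES = ["Ava", "Ben", "Chen", "Dara"]    # hypothetical name pool
ITEMS = ["apples", "pencils", "marbles"]  # hypothetical item pool


def make_purchase_problem(seed: int) -> dict:
    """Instantiate one hypothetical meta-template from a seed.

    Returns the problem text, a worked solution, and the numeric answer
    computed by code, so the answer can be checked mechanically.
    """
    rng = random.Random(seed)  # local RNG keeps generation reproducible
    name = rng.choice(NAMES)
    item = rng.choice(ITEMS)
    boxes = rng.randint(2, 9)
    per_box = rng.randint(3, 12)
    total = boxes * per_box
    # Give away fewer items than exist so the answer stays positive.
    given_away = rng.randint(1, total - 1)
    answer = total - given_away

    problem = (
        f"{name} buys {boxes} boxes of {item} with {per_box} {item} in each box. "
        f"{name} gives away {given_away} {item}. How many {item} are left?"
    )
    solution = (
        f"{name} starts with {boxes} * {per_box} = {total} {item}. "
        f"After giving away {given_away}, {total} - {given_away} = {answer} remain."
    )
    return {"problem": problem, "solution": solution, "answer": answer}
```

Calling `make_purchase_problem` with different seeds yields different surface forms and numbers from the same underlying structure, which is the sense in which one template can back many problems. Whether templates of this kind are diverse or complex enough is exactly the question the review raises.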

If this is right

Resolves the data scarcity issue for supervised fine-tuning of LLMs on reasoning tasks.
Provides a robust mechanism for model alignment through Reinforcement Learning with Verifiable Rewards (RLVR).
Elevates data augmentation by producing diverse and complex problem structures at scale.
Enables creation of virtually unlimited high-quality training examples without manual curation.
Supports development of LLMs with stronger and more reliable multi-step reasoning skills.
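
The RLVR point above depends on a reward that can be computed without human judgment. A minimal sketch of such a reward, assuming the reference answer is a single number and that the model's final answer is the last number in its completion, might look like the following. The function names, the last-number heuristic, and the tolerance are assumptions made for illustration, not details taken from the paper.

```python
import re
from typing import Optional

NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def extract_final_number(completion: str) -> Optional[float]:
    """Return the last number in the text, or None if there is none."""
    # Drop thousands separators so "1,234" parses as one number.
    matches = NUMBER.findall(completion.replace(",", ""))
    return float(matches[-1]) if matches else None


def verifiable_reward(completion: str, reference: float, tol: float = 1e-6) -> float:
    """Binary reward: 1.0 if the extracted answer matches the reference."""
    predicted = extract_final_number(completion)
    if predicted is None:
        return 0.0
    return 1.0 if abs(predicted - reference) <= tol else 0.0
```

A reward of this shape scores only the final number, so it inherits the limitation raised in the referee report: a completion with faulty intermediate steps still earns full reward if its last number is right.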

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same meta-template approach could be tested on domains outside mathematics that admit programmatic verification, such as code generation or symbolic manipulation.
Because the templates are parameterized, one could systematically vary problem features to measure how well trained models generalize to novel combinations.
Widespread adoption would shift dataset creation from human curation toward generator-model dependence, raising questions about how to audit the resulting distribution of problem types.
The method supplies a concrete route to study whether scale in synthetic data alone can close the gap between current LLMs and reliable reasoning without additional architectural changes.
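
The suggestion about varying problem features systematically can be sketched as a held-out split over feature combinations. The two feature axes below (`OPERATIONS` and `NUM_STEPS`) and both helper functions are hypothetical; the paper does not describe such a split, and real templates would expose different features.

```python
from itertools import product

OPERATIONS = ["add", "subtract", "multiply"]  # hypothetical feature axis
NUM_STEPS = [1, 2, 3]                         # hypothetical feature axis


def split_combinations(held_out: set) -> tuple:
    """Partition all (operation, steps) pairs into train and held-out lists."""
    all_combos = list(product(OPERATIONS, NUM_STEPS))
    train = [c for c in all_combos if c not in held_out]
    test = [c for c in all_combos if c in held_out]
    return train, test


def covers_each_value(train: list) -> bool:
    """Check that every individual feature value still appears in training."""
    seen_ops = {op for op, _ in train}
    seen_steps = {steps for _, steps in train}
    return seen_ops == set(OPERATIONS) and seen_steps == set(NUM_STEPS)


train, test = split_combinations({("multiply", 3), ("subtract", 1)})
```

With this split the model sees every operation and every step count during training, but never the held-out pairings, so evaluation on `test` measures generalization to novel combinations rather than to unseen feature values.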

Load-bearing premise

The meta-templates generated by GPT-4 produce problems and solutions that are high-quality, diverse, complex, and programmatically verifiable without introducing errors or biases from the generator model.

What would settle it

Training an LLM on TemplateGSM and finding no measurable improvement on standard math reasoning benchmarks compared with models trained on existing smaller datasets, or discovering that a substantial fraction of the generated solutions fail independent programmatic verification, would falsify the central claim.

original abstract

The rapid advancement of large language models (LLMs) such as GPT-3, PaLM, and Llama has significantly transformed natural language processing, showcasing remarkable capabilities in understanding and generating language. However, a fundamental bottleneck persists: these models often struggle with tasks requiring complex, multi-step reasoning, particularly in mathematical problem-solving. This deficiency stems from the critical scarcity of large-scale, high-quality, domain-specific datasets necessary for cultivating sophisticated reasoning abilities. To overcome this challenge, we introduce Template-based Data Generation (TDG), a novel and scalable paradigm that harnesses frontier LLMs (GPT-4) to automatically generate parameterized meta-templates, which in turn synthesize a virtually infinite stream of high-quality problems and solutions. Using this paradigm, we create TemplateMath Part I: TemplateGSM, a foundational dataset of over 7 million synthetically generated grade school math problems. Each problem is accompanied by a programmatically verifiable solution, offering an unprecedented level of quality at scale. This resource not only resolves the data scarcity issue for supervised fine-tuning but also provides a robust mechanism for model alignment through Reinforcement Learning with Verifiable Rewards (RLVR). Our approach elevates data augmentation by leveraging GPT-4 to generate meta-templates, ensuring diverse and complex problem structures. By providing a scalable solution to the data and verification bottleneck, TDG and TemplateGSM pave the way for a new generation of LLMs with powerful, reliable reasoning skills.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces Template-based Data Generation (TDG), a paradigm that uses frontier LLMs like GPT-4 to generate parameterized meta-templates for synthesizing large-scale, high-quality math problems and solutions. It describes the creation of TemplateMath Part I: TemplateGSM, comprising over 7 million synthetically generated grade school math problems, each with a programmatically verifiable solution, to address data scarcity for training LLMs on complex reasoning tasks and to support RLVR.

Significance. If the TDG method produces truly high-quality, error-free reasoning traces at scale, it would represent a significant advance in generating training data for mathematical reasoning in LLMs, potentially enabling better supervised fine-tuning and reinforcement learning with verifiable rewards, thus helping overcome current limitations in model performance on multi-step math problems.

major comments (2)

Abstract: The claim of an 'unprecedented level of quality at scale' for TemplateGSM is unsupported, as the manuscript provides no experimental results, quality validation, error analysis, human audits, or comparisons to existing datasets such as GSM8K.
§3: The quality of solutions is asserted to follow from GPT-4 meta-template generation and programmatic verification of final answers, but no step-level validation of reasoning chains is described; final-answer matching alone cannot detect hallucinated or flawed intermediate steps that happen to produce the correct number.
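
The second comment can be illustrated with a small Python sketch: a reasoning trace whose errors cancel passes a final-answer check, while a simple per-step arithmetic check flags part of the problem. The trace, the regular expression, and both function names are hypothetical, and the step checker only validates equations of the form `a op b = c`; it is not a method described in the manuscript.

```python
import re

STEP = re.compile(r"(\d+)\s*([+\-*])\s*(\d+)\s*=\s*(\d+)")
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}


def final_answer_matches(trace: str, reference: int) -> bool:
    """Final-answer check: compare the last integer in the trace."""
    numbers = re.findall(r"\d+", trace)
    return bool(numbers) and int(numbers[-1]) == reference


def bad_steps(trace: str) -> list:
    """Return every 'a op b = c' step whose arithmetic does not hold."""
    wrong = []
    for a, op, b, c in STEP.findall(trace):
        if OPS[op](int(a), int(b)) != int(c):
            wrong.append(f"{a} {op} {b} = {c}")
    return wrong


# Intended reasoning: 4 * 3 = 12, then 12 - 2 = 10.
flawed = "There are 4 * 3 = 11 apples. After giving away 2, 11 - 1 = 10 remain."
```

Here `final_answer_matches(flawed, 10)` is true even though the trace is wrong twice. `bad_steps(flawed)` reports the false product, but not the second step, which is arithmetically valid yet subtracts 1 where the problem says 2. Catching that kind of error would need a check against the problem statement, which goes beyond arithmetic validation.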

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We agree that the current manuscript overstates the quality claims without supporting evidence and will revise accordingly to qualify assertions and explicitly discuss limitations.

point-by-point responses

Referee: Abstract: The claim of an 'unprecedented level of quality at scale' for TemplateGSM is unsupported, as the manuscript provides no experimental results, quality validation, error analysis, human audits, or comparisons to existing datasets such as GSM8K.

Authors: We agree this claim is unsupported in the current version, which introduces the TDG method and dataset but contains no empirical validation or comparisons. We will revise the abstract to remove the phrase 'unprecedented level of quality at scale' and add a dedicated limitations/quality section that includes initial error analysis, human audit plans, and direct comparisons to GSM8K.

revision: yes

Referee: §3: The quality of solutions is asserted to follow from GPT-4 meta-template generation and programmatic verification of final answers, but no step-level validation of reasoning chains is described; final-answer matching alone cannot detect hallucinated or flawed intermediate steps that happen to produce the correct number.

Authors: This observation is correct. The manuscript relies solely on final-answer programmatic verification and does not describe or perform step-level validation. We will revise §3 to explicitly acknowledge this limitation, explain that correct final answers do not guarantee correct reasoning traces, and note the implications for downstream use in SFT and RLVR.

revision: yes

Circularity Check

0 steps flagged

No circularity; derivation is self-contained

full rationale

The paper presents TDG as an external generation process that uses GPT-4 to produce meta-templates, from which problems and programmatically verifiable solutions are synthesized at scale. No step equates a claimed output (e.g., high-quality verifiable solutions) to an input by definition, renames a fitted parameter as a prediction, or relies on a load-bearing self-citation whose content reduces to the present work. The central claims rest on the independent capabilities of frontier LLMs and standard programmatic answer checking, without any reduction of the result to the paper's own fitted values or prior self-referential theorems.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the unverified assumption that frontier LLMs can generate useful meta-templates yielding high-quality verifiable data at scale. No free parameters are mentioned. The approach assumes domain knowledge about math problem structures and LLM capabilities.

axioms (1)

domain assumption: Frontier LLMs like GPT-4 can reliably generate parameterized meta-templates that produce high-quality, verifiable math problems.
This is the core assumption enabling the TDG paradigm as described in the abstract.

invented entities (2)

Template-based Data Generation (TDG) · no independent evidence
purpose: Scalable automatic generation of math problems and solutions via meta-templates
New method introduced in the paper.

TemplateGSM dataset · no independent evidence
purpose: Foundational large-scale dataset for LLM reasoning training
Newly created resource claimed in the abstract.

pith-pipeline@v0.9.0 · 5777 in / 1561 out tokens · 50636 ms · 2026-05-23T16:58:58.296155+00:00 · methodology
