How Can We Synthesize High-Quality Pretraining Data? A Systematic Study of Prompt Design, Generator Model, and Source Data
Pith reviewed 2026-05-10 13:21 UTC · model grok-4.3
The pith
Rephrasing web text into structured formats like tables, FAQs, and math problems yields higher-quality synthetic pretraining data than raw web sources or prior synthetic techniques.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Structured output formats, such as tables, math problems, FAQs, and tutorials, consistently outperform both curated web baselines and prior synthetic methods. Increasing the size of the generator model beyond 1B parameters provides no additional benefit. The selection of the original data used for mixing substantially influences performance. FinePhrase, a 486-billion-token dataset of rephrased web text, outperforms all existing synthetic data baselines while reducing generation costs by up to 30 times.
What carries the argument
Controlled experiments that vary rephrasing prompt formats, generator model scale, and source data selection when synthesizing pretraining corpora from web text.
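A minimal sketch of how such a controlled design could be laid out, assuming a full factorial grid over the three factors. The format names follow the summary above; the generator sizes are the ones listed in the simulated rebuttal (not confirmed by the paper), and the source dataset names and helper functions are hypothetical placeholders.

```python
from itertools import product

# Factor levels. Only the format names come from the paper's summary.
PROMPT_FORMATS = ["table", "math_problem", "faq", "tutorial"]
GENERATOR_SIZES = ["1B", "3B", "7B"]  # from the simulated rebuttal; an assumption
SOURCE_DATASETS = ["web_source_a", "web_source_b"]  # hypothetical placeholders


def build_grid(formats, sizes, sources):
    """Return one config dict per combination of the three factors."""
    return [
        {"prompt_format": f, "generator_size": s, "source_data": d}
        for f, s, d in product(formats, sizes, sources)
    ]


def one_factor_sweep(grid, held_fixed):
    """Keep only configs matching `held_fixed`, so a single factor varies."""
    return [c for c in grid if all(c[k] == v for k, v in held_fixed.items())]


grid = build_grid(PROMPT_FORMATS, GENERATOR_SIZES, SOURCE_DATASETS)

# Vary generator size while holding prompt format and source data constant.
sweep = one_factor_sweep(
    grid, {"prompt_format": "faq", "source_data": "web_source_a"}
)
for config in sweep:
    print(config)
```

Holding two factors fixed while sweeping the third is what lets a difference in downstream score be attributed to that one factor rather than to an interaction.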
Load-bearing premise
The gains seen in smaller controlled experiments will carry over to full-scale pretraining of large frontier models without interference from other training variables.
What would settle it
A complete pretraining run of a large language model on the FinePhrase dataset compared directly to training on an equal volume of standard web data, measuring downstream performance differences.
Original abstract
Synthetic data is a standard component in training large language models, yet systematic comparisons across design dimensions, including rephrasing strategy, generator model, and source data, remain absent. We conduct extensive controlled experiments, generating over one trillion tokens, to identify critical factors in rephrasing web text into synthetic pretraining data. Our results reveal that structured output formats, such as tables, math problems, FAQs, and tutorials, consistently outperform both curated web baselines and prior synthetic methods. Notably, increasing the size of the generator model beyond 1B parameters provides no additional benefit. Our analysis also demonstrates that the selection of the original data used for mixing substantially influences performance. By applying our findings, we develop FinePhrase, a 486-billion-token open dataset of rephrased web text. We show that FinePhrase outperforms all existing synthetic data baselines while reducing generation costs by up to 30 times. We provide the dataset, all prompts, and the generation framework to the research community.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a systematic empirical investigation into the design of synthetic pretraining data through rephrasing of web text. It examines the impact of prompt structures (favoring formats like tables, math problems, FAQs, and tutorials), the scale of the generator model, and the choice of source data. Based on experiments that generate more than one trillion tokens, the authors identify that structured output formats yield superior performance compared to web baselines and previous synthetic approaches. They observe no gains from generator models exceeding 1 billion parameters and note the importance of source data selection. The work culminates in the creation and release of the FinePhrase dataset comprising 486 billion tokens, which demonstrates improved performance over existing synthetic data methods at significantly reduced generation costs, accompanied by the public release of prompts and the generation framework.
Significance. Should the findings prove robust, this study would offer valuable insights into optimizing synthetic data for LLM pretraining, a critical area given the reliance on such data in modern model training. The scale of the experiments (over 1T tokens generated) and the open-sourcing of the dataset, prompts, and tools represent a substantial contribution, enabling the community to build upon these results and potentially lower the costs associated with high-quality data generation.
major comments (3)
- [Experimental evaluation and FinePhrase results] The claim that structured output formats consistently outperform baselines and that FinePhrase outperforms all existing synthetic data baselines (abstract) rests on proxy evaluations with smaller models and chosen metrics; the manuscript does not report direct pretraining of frontier-scale models on FinePhrase versus matched baselines, leaving generalization to large-scale training unverified and load-bearing for the central claims.
- [Generator model scaling analysis] The finding that increasing the generator model size beyond 1B parameters provides no additional benefit requires specification of the exact models tested (e.g., parameter counts and architectures), the token volumes per condition, and any statistical tests confirming the lack of improvement across structured formats.
- [Source data selection experiments] The substantial influence of source data selection on performance is reported, but without exhaustive ablations on mixing ratios, deduplication strategies, or interactions with training hyperparameters, potential confounding cannot be ruled out.
minor comments (2)
- [Abstract] The abstract refers to 'curated web baselines' and 'prior synthetic methods' without defining their exact composition or selection criteria; this should be clarified in the main text to strengthen the comparisons.
- [Dataset description] Additional details on the exact composition, filtering, and rephrasing pipeline for the 486B-token FinePhrase dataset would improve reproducibility beyond the released artifacts.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below with clarifications on our experimental design, while acknowledging the inherent limitations of proxy-based evaluations at this scale. We believe our trillion-token experiments and consistent trends provide robust support for the claims, but we are happy to incorporate additional details and caveats where appropriate.
- Referee: [Experimental evaluation and FinePhrase results] The claim that structured output formats consistently outperform baselines and that FinePhrase outperforms all existing synthetic data baselines (abstract) rests on proxy evaluations with smaller models and chosen metrics; the manuscript does not report direct pretraining of frontier-scale models on FinePhrase versus matched baselines, leaving generalization to large-scale training unverified and load-bearing for the central claims.
  Authors: We agree that direct pretraining of frontier-scale models would offer the strongest possible validation. However, such experiments require resources (compute, data, and time) that are not feasible for this study or most academic efforts. Our proxy evaluations follow established practices in the synthetic data literature, involving controlled training of models up to 7B parameters on the generated data and assessment via standard benchmarks. The improvements from structured formats are consistent across multiple model scales, metrics, and over 1T tokens generated, which we argue supports generalization. We will add an explicit limitations section discussing the proxy nature of the evaluations and the rationale for this approach in the revised manuscript.
  Revision: partial
- Referee: [Generator model scaling analysis] The finding that increasing the generator model size beyond 1B parameters provides no additional benefit requires specification of the exact models tested (e.g., parameter counts and architectures), the token volumes per condition, and any statistical tests confirming the lack of improvement across structured formats.
  Authors: We will provide these details in the revision. The experiments used Llama-architecture models with 1B, 3B, and 7B parameters. For each structured output format, we generated identical token volumes (approximately 10B tokens per condition) to control for scale. Performance plateaued beyond 1B parameters with consistent trends across formats and multiple evaluation runs; while we did not apply formal statistical significance tests, we will include variance estimates and error bars to strengthen the presentation.
  Revision: yes
- Referee: [Source data selection experiments] The substantial influence of source data selection on performance is reported, but without exhaustive ablations on mixing ratios, deduplication strategies, or interactions with training hyperparameters, potential confounding cannot be ruled out.
  Authors: We acknowledge that exhaustive ablations across all mixing ratios, deduplication variants, and hyperparameter interactions would be ideal but are computationally prohibitive at our generation scale. Our experiments isolate source data effects through controlled mixing of web sources while holding other factors fixed, and we report clear performance differences. We will expand the manuscript with specifics on the mixing ratios, deduplication procedures applied to source data, and an explicit discussion of potential confounding as a limitation.
  Revision: partial
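The variance estimates and error bars promised in the generator-scaling response could be computed along the following lines. This is only a sketch: the benchmark scores are invented for illustration, the helper names are hypothetical, and the standard error of the mean is one reasonable choice of error bar, not necessarily the one the authors would use.

```python
import math
import statistics


def mean_and_stderr(scores):
    """Return (mean, standard error of the mean) for repeated runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate variance")
    mean = statistics.fmean(scores)
    # Sample standard deviation (n - 1 denominator) divided by sqrt(n).
    stderr = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, stderr


def summarize(runs_by_size):
    """Map each generator size to its (mean, stderr) pair."""
    return {size: mean_and_stderr(s) for size, s in runs_by_size.items()}


# Hypothetical scores from three evaluation runs per generator size.
runs_by_size = {
    "1B": [0.512, 0.508, 0.515],
    "3B": [0.510, 0.514, 0.509],
    "7B": [0.513, 0.507, 0.512],
}

for size, (mean, stderr) in summarize(runs_by_size).items():
    print(f"{size}: {mean:.4f} +/- {stderr:.4f}")
```

Overlapping error bars would be consistent with a plateau, but they are not a formal significance test; a claim of "no additional benefit" would be better supported by an equivalence test with a stated margin.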
Circularity Check
No circularity; purely empirical study with released artifacts
The paper reports results from controlled experiments that generate >1T tokens across prompt formats, generator sizes, and source data mixes, then measures downstream performance on smaller models. No equations, derivations, or self-referential definitions appear; claims about structured formats outperforming baselines and the FinePhrase dataset are direct empirical outcomes, not reductions of fitted parameters or prior self-citations. The dataset, prompts, and framework are released, allowing independent verification outside the paper's own measurements.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Downstream task performance, or perplexity on held-out data, serves as a reliable proxy for pretraining data quality at scale.
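For reference, held-out perplexity is the exponential of the average negative log-likelihood per token. The sketch below shows that standard definition; the function name and the input format (a list of natural-log token probabilities) are assumptions for illustration and do not describe the paper's evaluation code.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(-mean natural-log probability per token)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_neg_logprob = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_neg_logprob)


# A model that assigns probability 0.25 to every held-out token
# is as uncertain as a uniform choice among four options.
uniform_four = [math.log(0.25)] * 4
print(perplexity(uniform_four))
```

Lower perplexity means the model assigns higher probability to the held-out text, which is why it is used as a proxy; whether that proxy tracks downstream quality at scale is exactly the assumption listed above.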