Think you have Solved Question Answering?
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CL (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
Generating Pretraining Tokens from Organic Data for Data-Bound Scaling
SynPro uses RL-optimized rephrasing and reformatting of organic data to generate synthetic pretraining tokens that deliver 3.7-5.2x the effective learning of simple repetition and can exceed training on unique data at 1.1B scale.