TinyStories: How Small Can Language Models Be and Still Speak Coherent English?
Pith reviewed 2026-05-25 07:33 UTC · model grok-4.3
The pith
Language models with under 10 million parameters generate fluent multi-paragraph stories when trained on a dataset of simple synthetic tales.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TinyStories is a synthetic dataset of short stories generated by GPT-3.5 and GPT-4 that contain only words and concepts a typical 3- to 4-year-old understands. Training language models on this dataset allows models with fewer than 10 million total parameters, or architectures limited to one transformer block, to produce fluent and consistent stories with several paragraphs that are diverse, have almost perfect grammar, and demonstrate reasoning capabilities.
What carries the argument
The TinyStories synthetic dataset of short stories restricted to child-level vocabulary and concepts
Load-bearing premise
The synthetic stories generated by GPT-3.5 and GPT-4 contain only words and concepts that a typical 3- to 4-year-old understands and do not introduce hidden complexity or distributional artifacts from the generator models themselves.
What would settle it
A model with under 10 million parameters or a single transformer block, after training on TinyStories, produces stories that GPT-4 consistently grades as having poor grammar, inconsistencies across paragraphs, or no reasoning when evaluated as student work.
Original abstract
Language models (LMs) are powerful tools for natural language processing, but they often struggle to produce coherent and fluent text when they are small. Models with around 125M parameters such as GPT-Neo (small) or GPT-2 (small) can rarely generate coherent and consistent English text beyond a few words even after extensive training. This raises the question of whether the emergence of the ability to produce coherent English text only occurs at larger scales (with hundreds of millions of parameters or more) and complex architectures (with many layers of global attention). In this work, we introduce TinyStories, a synthetic dataset of short stories that only contain words that typical 3- to 4-year-olds usually understand, generated by GPT-3.5 and GPT-4. We show that TinyStories can be used to train and evaluate LMs that are much smaller than the state-of-the-art models (below 10 million total parameters), or have much simpler architectures (with only one transformer block), yet still produce fluent and consistent stories with several paragraphs that are diverse and have almost perfect grammar, and demonstrate reasoning capabilities. We also introduce a new paradigm for the evaluation of language models: We suggest a framework which uses GPT-4 to grade the content generated by these models as if those were stories written by students and graded by a (human) teacher. This new paradigm overcomes the flaws of standard benchmarks, which often require the model's output to be very structured, and moreover provides a multidimensional score for the model, providing scores for different capabilities such as grammar, creativity and consistency. We hope that TinyStories can facilitate the development, analysis and research of LMs, especially for low-resource or specialized domains, and shed light on the emergence of language capabilities in LMs.
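To make the "below 10 million total parameters" figure concrete, the following sketch counts parameters for a GPT-2-style decoder. The layer layout (4x MLP expansion, biases on every linear layer, learned position embeddings, output layer tied to the input embeddings) and both example configurations are illustrative assumptions, not the configurations reported in the paper.

```python
def block_params(d_model: int, mlp_ratio: int = 4) -> int:
    """Parameters in one GPT-2-style transformer block (weights and biases)."""
    # Q, K, V and output projections: four d x d matrices plus four bias vectors.
    attention = 4 * d_model * d_model + 4 * d_model
    # Two linear layers: d -> hidden and hidden -> d, each with a bias.
    hidden = mlp_ratio * d_model
    mlp = 2 * d_model * hidden + hidden + d_model
    # Two LayerNorms, each with a scale and a shift vector.
    layer_norms = 2 * 2 * d_model
    return attention + mlp + layer_norms


def total_params(vocab_size: int, d_model: int, n_layers: int, context_len: int) -> int:
    """Total parameters, assuming the output layer shares the token embedding matrix."""
    embeddings = vocab_size * d_model + context_len * d_model
    final_norm = 2 * d_model
    return embeddings + n_layers * block_params(d_model) + final_norm


if __name__ == "__main__":
    # Hypothetical configurations chosen only to illustrate the arithmetic.
    configs = {
        "4 blocks, width 256": dict(vocab_size=10_000, d_model=256, n_layers=4, context_len=512),
        "1 block, width 512": dict(vocab_size=10_000, d_model=512, n_layers=1, context_len=512),
    }
    for name, cfg in configs.items():
        count = total_params(**cfg)
        print(f"{name}: {count:,} parameters (under 10M: {count < 10_000_000})")
```

Under these assumptions the token embedding table is a large share of the budget, so the vocabulary size matters as much as depth or width when counting "total" parameters at this scale.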
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces TinyStories, a synthetic dataset of short stories generated by GPT-3.5 and GPT-4 using only vocabulary and concepts typical for 3- to 4-year-old children. It reports that language models with fewer than 10 million parameters, or with simplified architectures such as a single transformer block, can be trained on this dataset to produce fluent, consistent, multi-paragraph stories exhibiting near-perfect grammar, diversity, and some reasoning capabilities. The work also proposes a new evaluation paradigm in which GPT-4 grades model outputs across multiple dimensions (grammar, creativity, consistency) as if assessing student stories.
Significance. If the central empirical claims hold after validation, the results would indicate that coherent language generation can emerge at substantially smaller scales when the training distribution is appropriately constrained, offering a controlled testbed for studying emergence and enabling research in low-resource settings. The GPT-4 grading framework provides a multidimensional alternative to rigid benchmarks. The work supplies a new dataset and reproducible training setup that could facilitate follow-on analysis.
major comments (3)
- [Dataset Generation] Dataset generation section: the assertion that stories contain 'only words that typical 3- to 4-year-olds usually understand' and lack hidden complexity is load-bearing for the claim that small models learn coherence rather than distill generator patterns, yet no quantitative validation (vocabulary statistics, human complexity ratings, or checks for higher-order narrative regularities) is reported.
- [Evaluation Framework] Evaluation framework (abstract and results): the GPT-4 grading procedure lacks reported details on prompt design, controls for evaluator bias, or correlation with human judgments; without these, the multidimensional scores cannot be treated as reliable evidence for the claimed capabilities.
- [Experiments] Experimental results: baseline comparisons, training hyperparameters, and controls for data artifacts are not detailed, undermining the ability to assess whether the reported fluency and reasoning in <10M-parameter models exceed what would be expected from distilling GPT-3.5/4 patterns.
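One way the vocabulary statistics requested in the first major comment might be computed is sketched below. The regex tokenizer, the `allowed_words` set, and the toy stories are placeholders chosen for illustration; a real check would load an established age-appropriate word list and the actual corpus.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphabetic word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def out_of_list_stats(stories, allowed_words, top_k=10):
    """Return (fraction of tokens outside allowed_words, most frequent outside words)."""
    counts = Counter(word for story in stories for word in tokenize(story))
    outside = Counter({w: c for w, c in counts.items() if w not in allowed_words})
    total = sum(counts.values())
    rate = sum(outside.values()) / total if total else 0.0
    return rate, [w for w, _ in outside.most_common(top_k)]


def mean_sentence_length(story):
    """Average number of word tokens per sentence, splitting on '.', '!' and '?'."""
    sentences = [s for s in re.split(r"[.!?]+", story) if tokenize(s)]
    if not sentences:
        return 0.0
    return sum(len(tokenize(s)) for s in sentences) / len(sentences)


if __name__ == "__main__":
    # Toy inputs; both the stories and the word list are hypothetical.
    stories = ["The cat sat on the mat. The cat was happy!", "Tom saw a big dog."]
    allowed_words = {"the", "cat", "sat", "on", "mat", "was", "happy", "tom", "saw", "a", "big"}
    rate, offenders = out_of_list_stats(stories, allowed_words)
    print(f"out-of-list token rate: {rate:.3f}, examples: {offenders}")
    print(f"mean sentence length: {mean_sentence_length(stories[0]):.1f} words")
```

A low out-of-list rate would support the simplicity claim only at the word level; it says nothing about the higher-order narrative regularities the comment also raises.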
minor comments (2)
- The abstract could more explicitly separate the dataset contribution from the model-scale claims to improve clarity for readers.
- Figure captions and table headers should include explicit definitions of all reported metrics (e.g., what constitutes a 'reasoning' score) to aid interpretation.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight areas where additional rigor will strengthen the manuscript, and we will revise accordingly while preserving the core contributions.
Point-by-point responses
- Referee: [Dataset Generation] Dataset generation section: the assertion that stories contain 'only words that typical 3- to 4-year-olds usually understand' and lack hidden complexity is load-bearing for the claim that small models learn coherence rather than distill generator patterns, yet no quantitative validation (vocabulary statistics, human complexity ratings, or checks for higher-order narrative regularities) is reported.
  Authors: We agree that quantitative validation strengthens the central claim. In the revised manuscript we will add (i) vocabulary statistics comparing TinyStories token distributions against standard age-appropriate word lists for 3- to 4-year-olds, (ii) basic narrative-complexity metrics (e.g., average sentence length, dependency depth), and (iii) a short discussion of the generation prompts used to enforce simplicity. These additions are intended to show that the observed coherence is not merely pattern distillation.
  Revision: yes
- Referee: [Evaluation Framework] Evaluation framework (abstract and results): the GPT-4 grading procedure lacks reported details on prompt design, controls for evaluator bias, or correlation with human judgments; without these, the multidimensional scores cannot be treated as reliable evidence for the claimed capabilities.
  Authors: We will append the complete GPT-4 grading prompts and rubrics to the supplementary material and describe the controls already used (fixed temperature, identical instructions across models). We will also add a modest human-evaluation study on a held-out subset of stories to report correlation between GPT-4 and human grades on the same dimensions; this addresses the reliability concern directly.
  Revision: yes
- Referee: [Experiments] Experimental results: baseline comparisons, training hyperparameters, and controls for data artifacts are not detailed, undermining the ability to assess whether the reported fluency and reasoning in <10M-parameter models exceed what would be expected from distilling GPT-3.5/4 patterns.
  Authors: The revised experimental section will list all training hyperparameters (optimizer, learning-rate schedule, batch size, epochs) in a table, include baseline runs on non-TinyStories corpora of comparable size, and report controls for memorization (exact n-gram overlap checks between generated outputs and the training set). These additions will allow readers to evaluate whether the observed capabilities exceed simple distillation.
  Revision: yes
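Two of the checks promised in the responses above, the exact n-gram overlap control and the agreement between GPT-4 and human grades, could be sketched as follows. This is a minimal illustration under stated assumptions: whitespace tokenization, 4-grams, and Spearman rank correlation are choices made here, not details given in the rebuttal, and the function names are hypothetical.

```python
from math import sqrt


def ngrams(tokens, n):
    """Set of all contiguous n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(generated, training_texts, n=4):
    """Fraction of the generated text's distinct n-grams that also occur in the training texts."""
    gen = ngrams(generated.lower().split(), n)
    if not gen:
        return 0.0
    train = set()
    for text in training_texts:
        train |= ngrams(text.lower().split(), n)
    return len(gen & train) / len(gen)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation computed on average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least two items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Toy data only; the grades below are made up to exercise the functions.
    print(overlap_fraction("the cat sat on the mat today", ["the cat sat on the mat"]))
    gpt4_grades = [8, 6, 9, 4, 7]
    human_grades = [7, 6, 9, 5, 8]
    print(spearman(gpt4_grades, human_grades))
```

A high overlap fraction on long n-grams would point toward memorization, while a rank correlation is a natural fit for graded scores because it does not assume the two graders use the scale in the same way.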
Circularity Check
No circularity; purely empirical dataset generation, training, and evaluation.
Full rationale
The paper constructs TinyStories by prompting GPT-3.5/GPT-4 with vocabulary constraints, trains small models from scratch on the resulting corpus, and evaluates outputs via a separate GPT-4 grading rubric. No equations, fitted parameters renamed as predictions, self-citations, or uniqueness theorems appear in the provided text. The central claim—that models below 10M parameters or with one transformer block can produce multi-paragraph coherent stories—rests on external training runs and human-interpretable outputs rather than any reduction to the paper's own inputs by construction. The work is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Standard transformer language-model training produces text that can be evaluated for fluency, consistency, and reasoning.
Forward citations
Cited by 24 Pith papers
- Quantamination: Dynamic Quantization Leaks Your Data Across the Batch
  Dynamic quantization creates side channels allowing partial or full recovery of other users' batched data in at least four popular ML frameworks.
- Neural Continuous-Time Markov Chain: Discrete Diffusion via Decoupled Jump Timing and Direction
  Neural CTMC decouples jump timing and direction in continuous-time Markov chain diffusion via dedicated heads, achieving lower perplexity on TinyStories (16.36) and OpenWebText than GIDD or MDLM at equivalent training...
- Neural Continuous-Time Markov Chain: Discrete Diffusion via Decoupled Jump Timing and Direction
  Neural CTMC decouples discrete diffusion into separate exit-rate and jump-distribution heads, factorizing the path-space KL into Poisson and categorical terms and achieving the first pure-uniform outperformance of mas...
- How Do Transformers Learn to Associate Tokens: Gradient Leading Terms Bring Mechanistic Interpretability
  Transformer weights at early training stages are closed-form compositions of bigram, token-interchangeability, and context mappings that directly reflect text-corpus statistics and explain the emergence of semantic as...
- How does the optimizer implicitly bias the model merging loss landscape?
  Effective noise scale non-monotonically governs model merging success with an optimum, unifying effects of learning rate, weight decay, batch size, and augmentation on the loss landscape.
- RACE Attention: A Strictly Linear-Time Attention Layer for Training on Outrageously Large Contexts
  RACE Attention is a strictly linear-time attention mechanism that approximates softmax attention outputs using Gaussian projections and soft LSH to enable training on contexts up to 12 million tokens.
- SeedPrints: Fingerprints Can Even Tell Which Seed Your Large Language Model Was Trained From
  SeedPrints fingerprints LLMs using persistent biases from initialization seeds for lineage verification across pretraining and adaptation stages.
- All is Not Lost: LLM Recovery without Checkpoints
  CheckFree recovers intermediate stage failures in pipeline-parallel LLM training via neighbor averaging; CheckFree+ adds out-of-order execution to handle first/last stages by copying neighbors, with small embedding st...
- TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models
  The authors generate and publicly release the first large-scale open dataset of three million structured moral fables produced by small open language models together with a reproducible LLM-judge evaluation pipeline.
- Towards Human-Level Book-Writing Capability
  A prompt-to-book training framework that derives hierarchical summaries from public-domain novels and inverts them to supervise long-context models toward human literary prose instead of assistant-style output.
- Primal-Dual Guided Decoding for Constrained Discrete Diffusion
  Primal-dual guided decoding casts constrained discrete diffusion as a KL-regularized optimization solved online with adaptive Lagrangian multipliers to satisfy constraints while staying close to the unconstrained mode...
- Practical Scaling Laws: Converting Compute into Performance in a Data-Constrained World
  A new scaling law L(N, D, T) = E + (L0 - E) h/(1+h) with h = a/N^α + b/T^β + c N^γ/D^δ that decomposes loss into undercapacity, undertraining, and overfitting terms and saturates between E and L0.
- TextLDM: Language Modeling with Continuous Latent Diffusion
  TextLDM applies DiT-style latent diffusion with flow matching to language modeling via a REPA-aligned VAE, outperforming prior diffusion LMs and matching GPT-2 when trained from scratch on OpenWebText2.
- BackFlush: Knowledge-Free Backdoor Detection and Elimination with Watermark Preservation in Large Language Models
  BackFlush detects backdoors via susceptibility amplification and eliminates them with RoPE unlearning to reach 1% ASR and 99% clean accuracy while preserving watermarks.
- Latent Planning Emerges with Scale
  Latent planning ability in LLMs emerges and strengthens with scale, shown through internal features that represent future words and influence token choices on planning and rhyming tasks.
- Differences in Text Generated by Diffusion and Autoregressive Language Models
  DLMs exhibit lower n-gram entropy, higher semantic coherence, and higher semantic diversity than ARMs, primarily due to bidirectional context and remasking decoding strategies.
- Next-Latent Prediction Transformers Learn Compact World Models
  NextLat augments next-token prediction with latent next-state prediction, theoretically converging latents to belief states and showing empirical gains in world modeling, reasoning, planning, and faster inference via ...
- Two failure modes of deep transformers and how to avoid them: a unified theory of signal propagation at initialisation
  Analytical theory of signal propagation in deep transformers at initialization yields quantitative prescriptions for weights and residuals to avoid rank and entropy collapse via Random Energy Model analogy.
- MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models
  Bootstrapping math questions via rewriting creates MetaMathQA; fine-tuning LLaMA-2 on it yields 66.4% on GSM8K for 7B and 82.3% for 70B, beating prior same-size models by large margins.
- Textbooks Are All You Need II: phi-1.5 technical report
  phi-1.5 is a 1.3B parameter model trained on synthetic textbook data that matches the reasoning performance of models five times larger on natural language, math, and basic coding tasks.
- Textbooks Are All You Need
  A 1.3B-parameter code model trained on 7B tokens of curated textbook and synthetic data achieves 50.6% on HumanEval, indicating data quality can enable strong performance at small scale.
- Seed Bank, Co-op, Stoop Swap: Metaphors for Governing Language Model Data for Creative Writing
  Workshops with over 100 creative writers produced metaphors and four themes for language model governance that favor consent-driven, smaller open models encoding community values.
- Path Integral Solution for Dissipative Generative Dynamics
  Language generation requires dissipative quantum dynamics with non-local aggregation, not conservation laws, framing it as dissipative quantum field theory.
- Small Language Models (SLMs) Can Still Pack a Punch: A survey (updated 2026)
  A literature survey of Small Language Models (1-8B parameters) that can perform comparably or better than larger models, covering general-purpose and task-specific approaches plus creation techniques.