TinyStories: How Small Can Language Models Be and Still Speak Coherent English?

Ronen Eldan; Yuanzhi Li

arXiv: 2305.07759 · v2 · pith:6RZPVFJ3new · submitted 2023-05-12 · 💻 cs.CL · cs.AI · cs.LG


Pith reviewed 2026-05-25 07:33 UTC · model grok-4.3

classification: 💻 cs.CL · cs.AI · cs.LG

keywords: TinyStories · small language models · synthetic dataset · coherent text generation · transformer architecture · model evaluation · language capabilities

The pith

Language models with under 10 million parameters generate fluent multi-paragraph stories when trained on a dataset of simple synthetic tales.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TinyStories, a synthetic dataset of short stories that use only words and concepts a typical 3- to 4-year-old understands, generated by GPT-3.5 and GPT-4. It shows that this dataset enables training of language models much smaller than current state-of-the-art ones, or with far simpler architectures such as a single transformer block, to produce consistent stories across several paragraphs that are diverse, have near-perfect grammar, and exhibit reasoning. A new evaluation approach uses GPT-4 to grade model outputs on multiple dimensions like grammar, creativity, and consistency, addressing limitations of rigid benchmarks. The work aims to make research on language capabilities more accessible, especially in low-resource settings.

Core claim

TinyStories is a synthetic dataset of short stories generated by GPT-3.5 and GPT-4 that contain only words and concepts a typical 3- to 4-year-old understands. Training language models on this dataset allows models with fewer than 10 million total parameters, or architectures limited to one transformer block, to produce fluent and consistent stories with several paragraphs that are diverse, have almost perfect grammar, and demonstrate reasoning capabilities.
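To give a feel for what "fewer than 10 million total parameters" means, here is a rough sketch that counts the parameters of a GPT-style decoder. The vocabulary size, width, depth, and context length are illustrative assumptions and are not taken from the paper. The block layout (4x MLP expansion, biases, tied output embedding) is one common convention, not necessarily the authors' architecture.

```python
def gpt_param_count(vocab_size, d_model, n_layers, context_len, tie_embeddings=True):
    """Approximate parameter count for a GPT-style decoder-only transformer.

    Assumes learned positional embeddings, a 4x MLP expansion, biases on all
    linear layers, and two LayerNorms per block (hypothetical layout).
    """
    token_emb = vocab_size * d_model
    pos_emb = context_len * d_model
    # Attention: Q, K, V and output projections, each d_model x d_model plus bias.
    attn = 4 * (d_model * d_model + d_model)
    # MLP: up-projection to 4*d_model and down-projection back, with biases.
    mlp = (d_model * 4 * d_model + 4 * d_model) + (4 * d_model * d_model + d_model)
    # Two LayerNorms per block, each with a scale and a shift vector.
    norms = 2 * 2 * d_model
    block = attn + mlp + norms
    final_norm = 2 * d_model
    lm_head = 0 if tie_embeddings else vocab_size * d_model
    return token_emb + pos_emb + n_layers * block + final_norm + lm_head


if __name__ == "__main__":
    # Illustrative settings only: one block vs. eight blocks at width 256.
    for layers in (1, 8):
        total = gpt_param_count(vocab_size=10_000, d_model=256,
                                n_layers=layers, context_len=512)
        print(f"{layers} block(s): {total:,} parameters")
```

Under these assumed settings both configurations stay below 10 million parameters, and most of the single-block budget sits in the embedding table rather than in the transformer block itself.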

What carries the argument

The TinyStories synthetic dataset of short stories restricted to child-level vocabulary and concepts

Load-bearing premise

The synthetic stories generated by GPT-3.5 and GPT-4 contain only words and concepts that a typical 3- to 4-year-old understands and do not introduce hidden complexity or distributional artifacts from the generator models themselves.

What would settle it

A model with under 10 million parameters or a single transformer block, after training on TinyStories, produces stories that GPT-4 consistently grades as having poor grammar, inconsistencies across paragraphs, or no reasoning when evaluated as student work.

Original abstract

Language models (LMs) are powerful tools for natural language processing, but they often struggle to produce coherent and fluent text when they are small. Models with around 125M parameters such as GPT-Neo (small) or GPT-2 (small) can rarely generate coherent and consistent English text beyond a few words even after extensive training. This raises the question of whether the emergence of the ability to produce coherent English text only occurs at larger scales (with hundreds of millions of parameters or more) and complex architectures (with many layers of global attention). In this work, we introduce TinyStories, a synthetic dataset of short stories that only contain words that a typical 3 to 4-year-olds usually understand, generated by GPT-3.5 and GPT-4. We show that TinyStories can be used to train and evaluate LMs that are much smaller than the state-of-the-art models (below 10 million total parameters), or have much simpler architectures (with only one transformer block), yet still produce fluent and consistent stories with several paragraphs that are diverse and have almost perfect grammar, and demonstrate reasoning capabilities. We also introduce a new paradigm for the evaluation of language models: We suggest a framework which uses GPT-4 to grade the content generated by these models as if those were stories written by students and graded by a (human) teacher. This new paradigm overcomes the flaws of standard benchmarks, which often require the model's output to be very structured, and moreover provides a multidimensional score for the model, providing scores for different capabilities such as grammar, creativity and consistency. We hope that TinyStories can facilitate the development, analysis and research of LMs, especially for low-resource or specialized domains, and shed light on the emergence of language capabilities in LMs.
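The grading paradigm described in the abstract can be sketched as a prompt builder plus a score parser. This is a minimal illustration, not the authors' prompt: the wording, the 0-10 scale, and the function names `build_grading_prompt` and `parse_grades` are all assumptions, and the call to the grader model is deliberately left out.

```python
import re

# Dimensions named in the abstract; the "/10" scale is an assumption.
DIMENSIONS = ("grammar", "creativity", "consistency")


def build_grading_prompt(story_beginning: str, completion: str) -> str:
    """Assemble a hypothetical teacher-style grading prompt."""
    rubric = "\n".join(f"{d.capitalize()}: <score>/10" for d in DIMENSIONS)
    return (
        "A student was given the beginning of a story and asked to complete it.\n"
        f"Beginning: {story_beginning}\n"
        f"Student's completion: {completion}\n"
        "Grade the completion as a teacher would, using exactly this format:\n"
        f"{rubric}"
    )


def parse_grades(reply: str) -> dict:
    """Extract one integer score per dimension from the grader's reply.

    Dimensions missing from the reply are simply omitted from the result.
    """
    grades = {}
    for dim in DIMENSIONS:
        match = re.search(rf"{dim}\s*:\s*(\d+)\s*/\s*10", reply, flags=re.IGNORECASE)
        if match:
            grades[dim] = int(match.group(1))
    return grades
```

Keeping the reply format rigid while leaving the story itself free-form is the design point: the model under test is not forced into a benchmark template, only the grader is.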

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

TinyStories shows sub-10M models can output multi-paragraph stories on a vocab-restricted synthetic dataset, but the data may carry over patterns from its GPT generators.

The central result is that transformers below 10 million parameters or with only one block can produce fluent, multi-paragraph stories with good grammar and some consistency when trained on this new TinyStories dataset of short, child-level vocabulary tales. That scale is well below what prior work suggested was needed for coherent output. The dataset construction and the single-block result are the concrete new pieces, along with the GPT-4 grading setup that scores grammar, creativity, and consistency separately. This gives a practical way to test models without forcing outputs into rigid benchmark formats. The approach is useful for anyone studying how coherence and reasoning emerge when data complexity is deliberately limited. The main limitation is that every training story was itself written by GPT-3.5 or GPT-4 under vocabulary constraints. Strict word filtering does not automatically remove higher-order regularities such as typical story arcs or reasoning chains that large models favor. If those patterns are present, the small models may be distilling them rather than learning coherence from a minimal distribution. The abstract gives no quantitative details on data validation, hyperparameter choices, or controls for evaluator bias in the GPT-4 grades, so the full paper needs to supply those to make the claims hold. This work is aimed at people building efficient or specialized models and at researchers trying to separate data effects from scale effects. It is coherent enough on its own terms to merit peer review so the methods and data can be examined directly.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces TinyStories, a synthetic dataset of short stories generated by GPT-3.5 and GPT-4 using only vocabulary and concepts typical for 3- to 4-year-old children. It reports that language models with fewer than 10 million parameters, or with simplified architectures such as a single transformer block, can be trained on this dataset to produce fluent, consistent, multi-paragraph stories exhibiting near-perfect grammar, diversity, and some reasoning capabilities. The work also proposes a new evaluation paradigm in which GPT-4 grades model outputs across multiple dimensions (grammar, creativity, consistency) as if assessing student stories.

Significance. If the central empirical claims hold after validation, the results would indicate that coherent language generation can emerge at substantially smaller scales when the training distribution is appropriately constrained, offering a controlled testbed for studying emergence and enabling research in low-resource settings. The GPT-4 grading framework provides a multidimensional alternative to rigid benchmarks. The work supplies a new dataset and reproducible training setup that could facilitate follow-on analysis.

major comments (3)

[Dataset Generation] Dataset generation section: the assertion that stories contain 'only words that a typical 3 to 4-year-olds usually understand' and lack hidden complexity is load-bearing for the claim that small models learn coherence rather than distill generator patterns, yet no quantitative validation (vocabulary statistics, human complexity ratings, or checks for higher-order narrative regularities) is reported.
[Evaluation Framework] Evaluation framework (abstract and results): the GPT-4 grading procedure lacks reported details on prompt design, controls for evaluator bias, or correlation with human judgments; without these, the multidimensional scores cannot be treated as reliable evidence for the claimed capabilities.
[Experiments] Experimental results: baseline comparisons, training hyperparameters, and controls for data artifacts are not detailed, undermining the ability to assess whether the reported fluency and reasoning in <10M-parameter models exceed what would be expected from distilling GPT-3.5/4 patterns.
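The vocabulary statistics requested in the first major comment could be computed along the following lines. This is a sketch under stated assumptions: `allowed_words` stands in for an age-appropriate word list that would have to be sourced separately, and the regex tokenizer is a simplification.

```python
import re
from collections import Counter


def out_of_vocab_report(stories, allowed_words, top_k=10):
    """Return the out-of-vocabulary token rate and the most frequent offenders.

    `allowed_words` is a hypothetical child-level word list; tokens are
    lowercase runs of letters and apostrophes.
    """
    allowed = {w.lower() for w in allowed_words}
    offenders = Counter()
    total = 0
    for story in stories:
        for token in re.findall(r"[a-z']+", story.lower()):
            total += 1
            if token not in allowed:
                offenders[token] += 1
    rate = sum(offenders.values()) / total if total else 0.0
    return rate, offenders.most_common(top_k)
```

A low out-of-vocabulary rate would support the word-level part of the claim only. It says nothing about higher-order regularities such as story arcs, which need a separate check.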

minor comments (2)

The abstract could more explicitly separate the dataset contribution from the model-scale claims to improve clarity for readers.
Figure captions and table headers should include explicit definitions of all reported metrics (e.g., what constitutes a 'reasoning' score) to aid interpretation.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight areas where additional rigor will strengthen the manuscript, and we will revise accordingly while preserving the core contributions.

Referee: [Dataset Generation] Dataset generation section: the assertion that stories contain 'only words that a typical 3 to 4-year-olds usually understand' and lack hidden complexity is load-bearing for the claim that small models learn coherence rather than distill generator patterns, yet no quantitative validation (vocabulary statistics, human complexity ratings, or checks for higher-order narrative regularities) is reported.

Authors: We agree that quantitative validation strengthens the central claim. In the revised manuscript we will add (i) vocabulary statistics comparing TinyStories token distributions against standard age-appropriate word lists for 3-4 year olds, (ii) basic narrative-complexity metrics (e.g., average sentence length, dependency depth), and (iii) a short discussion of the generation prompts used to enforce simplicity. These additions will make explicit that the observed coherence is not merely pattern distillation.
revision: yes

Referee: [Evaluation Framework] Evaluation framework (abstract and results): the GPT-4 grading procedure lacks reported details on prompt design, controls for evaluator bias, or correlation with human judgments; without these, the multidimensional scores cannot be treated as reliable evidence for the claimed capabilities.

Authors: We will append the complete GPT-4 grading prompts and rubrics to the supplementary material and describe the controls already used (fixed temperature, identical instructions across models). We will also add a modest human-evaluation study on a held-out subset of stories to report correlation between GPT-4 and human grades on the same dimensions; this addresses the reliability concern directly.
revision: yes

Referee: [Experiments] Experimental results: baseline comparisons, training hyperparameters, and controls for data artifacts are not detailed, undermining the ability to assess whether the reported fluency and reasoning in <10M-parameter models exceed what would be expected from distilling GPT-3.5/4 patterns.

Authors: The revised experimental section will list all training hyperparameters (optimizer, learning-rate schedule, batch size, epochs) in a table, include baseline runs on non-TinyStories corpora of comparable size, and report controls for memorization (exact n-gram overlap checks between generated outputs and the training set). These additions will allow readers to evaluate whether the observed capabilities exceed simple distillation.
revision: yes
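The memorization control mentioned in the last response, exact n-gram overlap between a generated story and the training set, might look like the sketch below. Whitespace tokenization, lowercasing, and the default `n=4` are assumptions made for illustration and are not details reported for the paper.

```python
def ngrams(tokens, n):
    """Return the set of contiguous n-token tuples in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(generated, training_corpus, n=4):
    """Fraction of the generated text's distinct n-grams found verbatim in training data.

    Tokenization is lowercase whitespace splitting (a simplification).
    Returns 0.0 when the generated text is shorter than n tokens.
    """
    train = set()
    for doc in training_corpus:
        train |= ngrams(doc.lower().split(), n)
    gen = ngrams(generated.lower().split(), n)
    if not gen:
        return 0.0
    return len(gen & train) / len(gen)
```

A fraction near 1.0 for large `n` would suggest copying from the training set, while a low fraction at moderate `n` is consistent with novel text. For a corpus the size of TinyStories, the in-memory set would likely need to be replaced by a hashed or on-disk index.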

Circularity Check

0 steps flagged

No circularity; purely empirical dataset generation, training, and evaluation.

The paper constructs TinyStories by prompting GPT-3.5/GPT-4 with vocabulary constraints, trains small models from scratch on the resulting corpus, and evaluates outputs via a separate GPT-4 grading rubric. No equations, fitted parameters renamed as predictions, self-citations, or uniqueness theorems appear in the provided text. The central claim—that models below 10M parameters or with one transformer block can produce multi-paragraph coherent stories—rests on external training runs and human-interpretable outputs rather than any reduction to the paper's own inputs by construction. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the assumption that the GPT-generated stories faithfully reflect only 3-4-year-old vocabulary and concepts; no free parameters are introduced, and no new entities are postulated.

axioms (1)

Domain assumption: Standard transformer language-model training produces text that can be evaluated for fluency, consistency, and reasoning.
Invoked implicitly when the authors treat generated stories as evidence of the claimed capabilities.

Forward citations

Cited by 24 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Quantamination: Dynamic Quantization Leaks Your Data Across the Batch
cs.CR 2026-04 conditional novelty 7.0
Dynamic quantization creates side channels allowing partial or full recovery of other users' batched data in at least four popular ML frameworks.

Neural Continuous-Time Markov Chain: Discrete Diffusion via Decoupled Jump Timing and Direction
cs.LG 2026-04 unverdicted novelty 7.0
Neural CTMC decouples jump timing and direction in continuous-time Markov chain diffusion via dedicated heads, achieving lower perplexity on TinyStories (16.36) and OpenWebText than GIDD or MDLM at equivalent training...

Neural Continuous-Time Markov Chain: Discrete Diffusion via Decoupled Jump Timing and Direction
cs.LG 2026-04 unverdicted novelty 7.0
Neural CTMC decouples discrete diffusion into separate exit-rate and jump-distribution heads, factorizing the path-space KL into Poisson and categorical terms and achieving the first pure-uniform outperformance of mas...

How Do Transformers Learn to Associate Tokens: Gradient Leading Terms Bring Mechanistic Interpretability
cs.CL 2026-01 unverdicted novelty 7.0
Transformer weights at early training stages are closed-form compositions of bigram, token-interchangeability, and context mappings that directly reflect text-corpus statistics and explain the emergence of semantic as...

How does the optimizer implicitly bias the model merging loss landscape?
cs.LG 2025-10 unverdicted novelty 7.0
Effective noise scale non-monotonically governs model merging success with an optimum, unifying effects of learning rate, weight decay, batch size, and augmentation on the loss landscape.

RACE Attention: A Strictly Linear-Time Attention Layer for Training on Outrageously Large Contexts
cs.LG 2025-10 unverdicted novelty 7.0
RACE Attention is a strictly linear-time attention mechanism that approximates softmax attention outputs using Gaussian projections and soft LSH to enable training on contexts up to 12 million tokens.

SeedPrints: Fingerprints Can Even Tell Which Seed Your Large Language Model Was Trained From
cs.CR 2025-09 unverdicted novelty 7.0
SeedPrints fingerprints LLMs using persistent biases from initialization seeds for lineage verification across pretraining and adaptation stages.

All is Not Lost: LLM Recovery without Checkpoints
cs.DC 2025-06 conditional novelty 7.0
CheckFree recovers intermediate stage failures in pipeline-parallel LLM training via neighbor averaging; CheckFree+ adds out-of-order execution to handle first/last stages by copying neighbors, with small embedding st...

TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models
cs.CL 2025-04 unverdicted novelty 7.0
The authors generate and publicly release the first large-scale open dataset of three million structured moral fables produced by small open language models together with a reproducible LLM-judge evaluation pipeline.

Towards Human-Level Book-Writing Capability
cs.AI 2026-05 unverdicted novelty 6.0
A prompt-to-book training framework that derives hierarchical summaries from public-domain novels and inverts them to supervise long-context models toward human literary prose instead of assistant-style output.

Primal-Dual Guided Decoding for Constrained Discrete Diffusion
cs.AI 2026-05 unverdicted novelty 6.0
Primal-dual guided decoding casts constrained discrete diffusion as a KL-regularized optimization solved online with adaptive Lagrangian multipliers to satisfy constraints while staying close to the unconstrained mode...

Practical Scaling Laws: Converting Compute into Performance in a Data-Constrained World
cs.LG 2026-05 conditional novelty 6.0
A new scaling law L(N, D, T) = E + (L0 - E) h/(1+h) with h = a/N^α + b/T^β + c N^γ/D^δ that decomposes loss into undercapacity, undertraining, and overfitting terms and saturates between E and L0.

TextLDM: Language Modeling with Continuous Latent Diffusion
cs.CL 2026-05 unverdicted novelty 6.0
TextLDM applies DiT-style latent diffusion with flow matching to language modeling via a REPA-aligned VAE, outperforming prior diffusion LMs and matching GPT-2 when trained from scratch on OpenWebText2.

BackFlush: Knowledge-Free Backdoor Detection and Elimination with Watermark Preservation in Large Language Models
cs.CR 2026-04 unverdicted novelty 6.0
BackFlush detects backdoors via susceptibility amplification and eliminates them with RoPE unlearning to reach 1% ASR and 99% clean accuracy while preserving watermarks.

Latent Planning Emerges with Scale
cs.CL 2026-04 unverdicted novelty 6.0
Latent planning ability in LLMs emerges and strengthens with scale, shown through internal features that represent future words and influence token choices on planning and rhyming tasks.

Differences in Text Generated by Diffusion and Autoregressive Language Models
cs.CL 2026-04 unverdicted novelty 6.0
DLMs exhibit lower n-gram entropy, higher semantic coherence, and higher semantic diversity than ARMs, primarily due to bidirectional context and remasking decoding strategies.

Next-Latent Prediction Transformers Learn Compact World Models
cs.LG 2025-11 unverdicted novelty 6.0
NextLat augments next-token prediction with latent next-state prediction, theoretically converging latents to belief states and showing empirical gains in world modeling, reasoning, planning, and faster inference via ...

Two failure modes of deep transformers and how to avoid them: a unified theory of signal propagation at initialisation
stat.ML 2025-05 unverdicted novelty 6.0
Analytical theory of signal propagation in deep transformers at initialization yields quantitative prescriptions for weights and residuals to avoid rank and entropy collapse via Random Energy Model analogy.

MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models
cs.CL 2023-09 conditional novelty 6.0
Bootstrapping math questions via rewriting creates MetaMathQA; fine-tuning LLaMA-2 on it yields 66.4% on GSM8K for 7B and 82.3% for 70B, beating prior same-size models by large margins.

Textbooks Are All You Need II: phi-1.5 technical report
cs.CL 2023-09 unverdicted novelty 6.0
phi-1.5 is a 1.3B parameter model trained on synthetic textbook data that matches the reasoning performance of models five times larger on natural language, math, and basic coding tasks.

Textbooks Are All You Need
cs.CL 2023-06 unverdicted novelty 6.0
A 1.3B-parameter code model trained on 7B tokens of curated textbook and synthetic data achieves 50.6% on HumanEval, indicating data quality can enable strong performance at small scale.

Seed Bank, Co-op, Stoop Swap: Metaphors for Governing Language Model Data for Creative Writing
cs.HC 2026-05 unverdicted novelty 5.0
Workshops with over 100 creative writers produced metaphors and four themes for language model governance that favor consent-driven, smaller open models encoding community values.

Path Integral Solution for Dissipative Generative Dynamics
cs.LG 2025-12 unverdicted novelty 5.0
Language generation requires dissipative quantum dynamics with non-local aggregation, not conservation laws, framing it as dissipative quantum field theory.

Small Language Models (SLMs) Can Still Pack a Punch: A survey (updated 2026)
cs.CL 2025-01 unverdicted novelty 2.0
A literature survey of Small Language Models (1-8B parameters) that can perform comparably or better than larger models, covering general-purpose and task-specific approaches plus creation techniques.

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · cited by 23 Pith papers · 15 internal anchors

[1] Common Crawl. Accessed: 2019.
[2] Zeyuan Allen-Zhu and Yuanzhi Li. Towards understanding ensemble, knowledge distillation and self-distillation in deep learning. arXiv preprint arXiv:2012.09816, 2020.
[3] Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman. GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow, March 2021.
[4] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
[5] Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712, 2023.
[6] Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D Manning. What does BERT look at? An analysis of BERT’s attention. arXiv preprint arXiv:1906.04341, 2019.
[7] John H Flavell, Eleanor R Flavell, Frances L Green, and Louis J Moses. Young children’s understanding of fact beliefs versus value beliefs. Child Development, 61(4):915–928, 1990.
[8] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635, 2018.
[9] Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.
[10] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[11] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
[12] Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751, 2019.
[13] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. The Journal of Machine Learning Research, 18(1):6869–6898, 2017.
[14] Samy Jelassi, Michael Sander, and Yuanzhi Li. Vision transformers provably learn spatial structure. Advances in Neural Information Processing Systems, 35:37822–37836, 2022.
[15] Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551, 2017.
[16] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
[17] Hector Levesque, Ernest Davis, and Leora Morgenstern. The Winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning, 2012.
[18] Jiwei Li, Xinlei Chen, Eduard Hovy, and Dan Jurafsky. Visualizing and understanding neural models in NLP. arXiv preprint arXiv:1506.01066, 2015.
[19] Yuchen Li, Yuanzhi Li, and Andrej Risteski. How do transformers learn topic structure: Towards a mechanistic understanding. arXiv preprint arXiv:2303.04245, 2023.
[20] Wick Miller and Susan Ervin. The development of grammar in child language. Monographs of the Society for Research in Child Development, pages 9–34, 1964.
[21] OpenAI. GPT-4 technical report, 2023.
[22] Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context. arXiv preprint arXiv:1606.06031, 2016.
[23] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.
[24] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
[25] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.
[26] Michael Santacroce, Zixin Wen, Yelong Shen, and Yuanzhi Li. What matters in the structured pruning of generative language models? arXiv preprint arXiv:2302.03773, 2023.
[27] Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022.
[28] Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. MobileBERT: a compact task-agnostic BERT for resource-limited devices. arXiv preprint arXiv:2004.02984, 2020.
[29] Wilson L Taylor. “Cloze procedure”: A new tool for measuring readability. Journalism Quarterly, 30(4):415–433, 1953.
[30] Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. arXiv preprint arXiv:1905.09418, 2019.
[31] Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Edouard Grave. CCNet: Extracting high quality monolingual datasets from web crawl data. arXiv preprint arXiv:1911.00359, 2019.
[32] Terry Winograd. Understanding natural language. Cognitive Psychology, 3(1):1–191, 1972.
[33] Victor Zhong, Caiming Xiong, and Richard Socher. Seq2SQL: Generating structured queries from natural language using reinforcement learning. arXiv preprint arXiv:1709.00103, 2017.

[1] [1]

Accessed: 2019

Common crawl. Accessed: 2019

work page 2019

[2] [2]

Towards understanding ensemble, knowledge distillation and self-distillation in deep learning

Zeyuan Allen-Zhu and Yuanzhi Li. Towards understanding ensemble, knowledge distillation and self-distillation in deep learning. arXiv preprint arXiv:2012.09816 , 2020

work page arXiv 2012

[3] [3]

GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow, March 2021

Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman. GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow, March 2021

work page 2021

[4] [4]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems , 33:1877–1901, 2020. 25

work page 1901

[5] Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712, 2023.

[6] Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D Manning. What does BERT look at? An analysis of BERT’s attention. arXiv preprint arXiv:1906.04341, 2019.

[7] John H Flavell, Eleanor R Flavell, Frances L Green, and Louis J Moses. Young children’s understanding of fact beliefs versus value beliefs. Child development, 61(4):915–928, 1990.

[8] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635, 2018.

[9] Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.

[10] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

[11] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.

[12] Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751, 2019.

[13] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. The Journal of Machine Learning Research, 18(1):6869–6898, 2017.

[14] Samy Jelassi, Michael Sander, and Yuanzhi Li. Vision transformers provably learn spatial structure. Advances in Neural Information Processing Systems, 35:37822–37836, 2022.

[15] Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551, 2017.

[16] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.

[17] Hector Levesque, Ernest Davis, and Leora Morgenstern. The Winograd schema challenge. In Thirteenth international conference on the principles of knowledge representation and reasoning, 2012.

[18] Jiwei Li, Xinlei Chen, Eduard Hovy, and Dan Jurafsky. Visualizing and understanding neural models in NLP. arXiv preprint arXiv:1506.01066, 2015.

[19] Yuchen Li, Yuanzhi Li, and Andrej Risteski. How do transformers learn topic structure: Towards a mechanistic understanding. arXiv preprint arXiv:2303.04245, 2023.

[20] Wick Miller and Susan Ervin. The development of grammar in child language. Monographs of the Society for Research in Child Development, pages 9–34, 1964.

[21] OpenAI. GPT-4 technical report, 2023.

[22] Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context. arXiv preprint arXiv:1606.06031, 2016.

[23] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.

[24] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.

[25] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.

[26] Michael Santacroce, Zixin Wen, Yelong Shen, and Yuanzhi Li. What matters in the structured pruning of generative language models? arXiv preprint arXiv:2302.03773, 2023.

[27] Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022.

[28] Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. MobileBERT: a compact task-agnostic BERT for resource-limited devices. arXiv preprint arXiv:2004.02984, 2020.

[29] Wilson L Taylor. “Cloze procedure”: A new tool for measuring readability. Journalism quarterly, 30(4):415–433, 1953.

[30] Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. arXiv preprint arXiv:1905.09418, 2019.
