arxiv: 2305.07759 · v2 · pith:6RZPVFJ3new · submitted 2023-05-12 · 💻 cs.CL · cs.AI · cs.LG

TinyStories: How Small Can Language Models Be and Still Speak Coherent English?

Pith reviewed 2026-05-25 07:33 UTC · model grok-4.3

classification 💻 cs.CL cs.AI cs.LG
keywords TinyStories · small language models · synthetic dataset · coherent text generation · transformer architecture · model evaluation · language capabilities

The pith

Language models with under 10 million parameters generate fluent multi-paragraph stories when trained on a dataset of simple synthetic tales.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TinyStories, a synthetic dataset of short stories that use only words and concepts a typical 3- to 4-year-old understands, generated by GPT-3.5 and GPT-4. It shows that this dataset enables training of language models much smaller than current state-of-the-art ones, or with far simpler architectures such as a single transformer block, to produce consistent stories across several paragraphs that are diverse, have near-perfect grammar, and exhibit reasoning. A new evaluation approach uses GPT-4 to grade model outputs on multiple dimensions like grammar, creativity, and consistency, addressing limitations of rigid benchmarks. The work aims to make research on language capabilities more accessible, especially in low-resource settings.

Core claim

TinyStories is a synthetic dataset of short stories generated by GPT-3.5 and GPT-4 that contain only words and concepts a typical 3- to 4-year-old understands. Training language models on this dataset allows models with fewer than 10 million total parameters, or architectures limited to one transformer block, to produce fluent and consistent stories with several paragraphs that are diverse, have almost perfect grammar, and demonstrate reasoning capabilities.
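
To give a feel for what "fewer than 10 million total parameters" means, the arithmetic can be sketched as below. This is a rough count for a generic GPT-style decoder, not the paper's architecture: the function name `gpt_param_count`, the 4x MLP width, the learned positional embeddings, the tied output head, and every number in the usage lines are illustrative assumptions.

```python
def gpt_param_count(vocab_size, d_model, n_layers, context_len, tied_head=True):
    """Count trainable parameters in a generic GPT-style decoder.

    Assumed layout (hypothetical, not taken from the paper): learned
    positional embeddings, biases on all linear layers, an MLP with
    4x hidden width, two LayerNorms per block plus a final LayerNorm.
    """
    token_emb = vocab_size * d_model
    pos_emb = context_len * d_model
    # Q, K, V and output projections, each d_model x d_model plus a bias.
    attn = 4 * d_model * d_model + 4 * d_model
    # Up-projection (d -> 4d) and down-projection (4d -> d), with biases.
    mlp = 2 * (d_model * 4 * d_model) + 4 * d_model + d_model
    # Two LayerNorms per block, each with a scale and a shift vector.
    layer_norms = 2 * 2 * d_model
    block = attn + mlp + layer_norms
    final_norm = 2 * d_model
    # A tied head reuses the token embedding matrix, so it adds nothing.
    head = 0 if tied_head else vocab_size * d_model
    return token_emb + pos_emb + n_layers * block + final_norm + head


# Illustrative configuration only: one block, width 64, 10k-token vocabulary.
print(gpt_param_count(vocab_size=10_000, d_model=64, n_layers=1, context_len=512))
```

Under these assumptions the embedding table dominates the total at small widths, which is worth keeping in mind when comparing "total parameters" across models with different vocabularies.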

What carries the argument

The TinyStories synthetic dataset of short stories restricted to child-level vocabulary and concepts

Load-bearing premise

The synthetic stories generated by GPT-3.5 and GPT-4 contain only words and concepts that a typical 3- to 4-year-old understands and do not introduce hidden complexity or distributional artifacts from the generator models themselves.
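
One way this premise could be probed is a simple vocabulary check against an age-appropriate word list. The sketch below is only an illustration: the function name `out_of_list_words`, the tokenization rule, and the tiny word list in the usage lines are assumptions, and a real check would need a vetted word list, which this page does not supply.

```python
import re
from collections import Counter


def out_of_list_words(stories, allowed_words):
    """Return (fraction of tokens outside the allowed list, Counter of those tokens).

    Tokens are lowercase runs of letters and apostrophes; this is a
    deliberately crude tokenizer chosen for the sketch.
    """
    allowed = {w.lower() for w in allowed_words}
    offending = Counter()
    total = 0
    for story in stories:
        for token in re.findall(r"[a-z']+", story.lower()):
            total += 1
            if token not in allowed:
                offending[token] += 1
    rate = sum(offending.values()) / total if total else 0.0
    return rate, offending


# Hypothetical data: a toy word list and two toy "stories".
toy_allowed = {"the", "cat", "sat", "dog", "saw", "a"}
rate, offending = out_of_list_words(["The cat sat.", "The dog saw a photon."], toy_allowed)
print(rate, offending.most_common(3))
```

A word-level check of this kind says nothing about conceptual or narrative complexity, so it would address only the vocabulary half of the premise.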

What would settle it

A model with under 10 million parameters or a single transformer block, after training on TinyStories, produces stories that GPT-4 consistently grades as having poor grammar, inconsistencies across paragraphs, or no reasoning when evaluated as student work.

read the original abstract

Language models (LMs) are powerful tools for natural language processing, but they often struggle to produce coherent and fluent text when they are small. Models with around 125M parameters such as GPT-Neo (small) or GPT-2 (small) can rarely generate coherent and consistent English text beyond a few words even after extensive training. This raises the question of whether the emergence of the ability to produce coherent English text only occurs at larger scales (with hundreds of millions of parameters or more) and complex architectures (with many layers of global attention). In this work, we introduce TinyStories, a synthetic dataset of short stories that only contain words that a typical 3 to 4-year-olds usually understand, generated by GPT-3.5 and GPT-4. We show that TinyStories can be used to train and evaluate LMs that are much smaller than the state-of-the-art models (below 10 million total parameters), or have much simpler architectures (with only one transformer block), yet still produce fluent and consistent stories with several paragraphs that are diverse and have almost perfect grammar, and demonstrate reasoning capabilities. We also introduce a new paradigm for the evaluation of language models: We suggest a framework which uses GPT-4 to grade the content generated by these models as if those were stories written by students and graded by a (human) teacher. This new paradigm overcomes the flaws of standard benchmarks which often require the model's output to be very structured, and moreover provides a multidimensional score for the model, providing scores for different capabilities such as grammar, creativity and consistency. We hope that TinyStories can facilitate the development, analysis and research of LMs, especially for low-resource or specialized domains, and shed light on the emergence of language capabilities in LMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces TinyStories, a synthetic dataset of short stories generated by GPT-3.5 and GPT-4 using only vocabulary and concepts typical for 3- to 4-year-old children. It reports that language models with fewer than 10 million parameters, or with simplified architectures such as a single transformer block, can be trained on this dataset to produce fluent, consistent, multi-paragraph stories exhibiting near-perfect grammar, diversity, and some reasoning capabilities. The work also proposes a new evaluation paradigm in which GPT-4 grades model outputs across multiple dimensions (grammar, creativity, consistency) as if assessing student stories.

Significance. If the central empirical claims hold after validation, the results would indicate that coherent language generation can emerge at substantially smaller scales when the training distribution is appropriately constrained, offering a controlled testbed for studying emergence and enabling research in low-resource settings. The GPT-4 grading framework provides a multidimensional alternative to rigid benchmarks. The work supplies a new dataset and reproducible training setup that could facilitate follow-on analysis.

major comments (3)
  1. [Dataset Generation] Dataset generation section: the assertion that stories contain 'only words that a typical 3 to 4-year-olds usually understand' and lack hidden complexity is load-bearing for the claim that small models learn coherence rather than distill generator patterns, yet no quantitative validation (vocabulary statistics, human complexity ratings, or checks for higher-order narrative regularities) is reported.
  2. [Evaluation Framework] Evaluation framework (abstract and results): the GPT-4 grading procedure lacks reported details on prompt design, controls for evaluator bias, or correlation with human judgments; without these, the multidimensional scores cannot be treated as reliable evidence for the claimed capabilities.
  3. [Experiments] Experimental results: baseline comparisons, training hyperparameters, and controls for data artifacts are not detailed, undermining the ability to assess whether the reported fluency and reasoning in <10M-parameter models exceed what would be expected from distilling GPT-3.5/4 patterns.
minor comments (2)
  1. The abstract could more explicitly separate the dataset contribution from the model-scale claims to improve clarity for readers.
  2. Figure captions and table headers should include explicit definitions of all reported metrics (e.g., what constitutes a 'reasoning' score) to aid interpretation.
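
Major comment 2 asks for a correlation between GPT-4 grades and human judgments. A minimal sketch of one such measure, Spearman rank correlation with average ranks for ties, is given below in plain Python. The function names and the sample scores are hypothetical, and nothing here reflects a procedure reported in the paper.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on average ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical grammar scores for five stories from the two kinds of grader.
gpt4_scores = [8, 6, 9, 4, 7]
human_scores = [7, 6, 9, 5, 8]
print(spearman(gpt4_scores, human_scores))
```

A rank-based measure is used here because rubric scores are ordinal; agreement statistics that account for chance, or per-dimension breakdowns, would be reasonable alternatives.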

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight areas where additional rigor will strengthen the manuscript, and we will revise accordingly while preserving the core contributions.

read point-by-point responses
  1. Referee: [Dataset Generation] Dataset generation section: the assertion that stories contain 'only words that a typical 3 to 4-year-olds usually understand' and lack hidden complexity is load-bearing for the claim that small models learn coherence rather than distill generator patterns, yet no quantitative validation (vocabulary statistics, human complexity ratings, or checks for higher-order narrative regularities) is reported.

    Authors: We agree that quantitative validation strengthens the central claim. In the revised manuscript we will add (i) vocabulary statistics comparing TinyStories token distributions against standard age-appropriate word lists for 3- to 4-year-olds, (ii) basic narrative-complexity metrics (e.g., average sentence length, dependency depth), and (iii) a short discussion of the generation prompts used to enforce simplicity. These additions will make explicit that the observed coherence is not merely pattern distillation. revision: yes

  2. Referee: [Evaluation Framework] Evaluation framework (abstract and results): the GPT-4 grading procedure lacks reported details on prompt design, controls for evaluator bias, or correlation with human judgments; without these, the multidimensional scores cannot be treated as reliable evidence for the claimed capabilities.

    Authors: We will append the complete GPT-4 grading prompts and rubrics to the supplementary material and describe the controls already used (fixed temperature, identical instructions across models). We will also add a modest human-evaluation study on a held-out subset of stories to report correlation between GPT-4 and human grades on the same dimensions; this addresses the reliability concern directly. revision: yes

  3. Referee: [Experiments] Experimental results: baseline comparisons, training hyperparameters, and controls for data artifacts are not detailed, undermining the ability to assess whether the reported fluency and reasoning in <10M-parameter models exceed what would be expected from distilling GPT-3.5/4 patterns.

    Authors: The revised experimental section will list all training hyperparameters (optimizer, learning-rate schedule, batch size, epochs) in a table, include baseline runs on non-TinyStories corpora of comparable size, and report controls for memorization (exact n-gram overlap checks between generated outputs and the training set). These additions will allow readers to evaluate whether the observed capabilities exceed simple distillation. revision: yes
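
The memorization control mentioned in the third response, an exact n-gram overlap check between generated outputs and the training set, could look roughly like the sketch below. Whitespace tokenization, lowercasing, the default `n=4`, and the function names are assumptions made for illustration; they are not details from the paper or the rebuttal.

```python
def ngrams(tokens, n):
    """Return the set of distinct n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(generated, training_texts, n=4):
    """Fraction of distinct n-grams in `generated` that occur verbatim in training.

    Returns 0.0 when the generated text is shorter than n tokens.
    """
    generated_grams = ngrams(generated.lower().split(), n)
    if not generated_grams:
        return 0.0
    training_grams = set()
    for text in training_texts:
        training_grams |= ngrams(text.lower().split(), n)
    return len(generated_grams & training_grams) / len(generated_grams)


# Hypothetical one-sentence corpus and model output.
corpus = ["once upon a time there was a cat"]
print(ngram_overlap("once upon a time there was a dog", corpus, n=4))
```

For a corpus of realistic size, holding every training n-gram in one in-memory set may not be practical; a hashed or sharded index would be the usual substitute.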

Circularity Check

0 steps flagged

No circularity; purely empirical dataset generation, training, and evaluation.

full rationale

The paper constructs TinyStories by prompting GPT-3.5/GPT-4 with vocabulary constraints, trains small models from scratch on the resulting corpus, and evaluates outputs via a separate GPT-4 grading rubric. No equations, fitted parameters renamed as predictions, self-citations, or uniqueness theorems appear in the provided text. The central claim—that models below 10M parameters or with one transformer block can produce multi-paragraph coherent stories—rests on external training runs and human-interpretable outputs rather than any reduction to the paper's own inputs by construction. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the assumption that the GPT-generated stories faithfully reflect only 3-4-year-old vocabulary and concepts; no free parameters are introduced, and no new entities are postulated.

axioms (1)
  • domain assumption: Standard transformer language-model training produces text that can be evaluated for fluency, consistency, and reasoning.
    Invoked implicitly when the authors treat generated stories as evidence of the claimed capabilities.

pith-pipeline@v0.9.0 · 5861 in / 1340 out tokens · 24506 ms · 2026-05-25T07:33:25.772329+00:00 · methodology

Forward citations

Cited by 24 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Quantamination: Dynamic Quantization Leaks Your Data Across the Batch

    cs.CR 2026-04 conditional novelty 7.0

    Dynamic quantization creates side channels allowing partial or full recovery of other users' batched data in at least four popular ML frameworks.

  2. Neural Continuous-Time Markov Chain: Discrete Diffusion via Decoupled Jump Timing and Direction

    cs.LG 2026-04 unverdicted novelty 7.0

    Neural CTMC decouples jump timing and direction in continuous-time Markov chain diffusion via dedicated heads, achieving lower perplexity on TinyStories (16.36) and OpenWebText than GIDD or MDLM at equivalent training...

  3. Neural Continuous-Time Markov Chain: Discrete Diffusion via Decoupled Jump Timing and Direction

    cs.LG 2026-04 unverdicted novelty 7.0

    Neural CTMC decouples discrete diffusion into separate exit-rate and jump-distribution heads, factorizing the path-space KL into Poisson and categorical terms and achieving the first pure-uniform outperformance of mas...

  4. How Do Transformers Learn to Associate Tokens: Gradient Leading Terms Bring Mechanistic Interpretability

    cs.CL 2026-01 unverdicted novelty 7.0

    Transformer weights at early training stages are closed-form compositions of bigram, token-interchangeability, and context mappings that directly reflect text-corpus statistics and explain the emergence of semantic as...

  5. How does the optimizer implicitly bias the model merging loss landscape?

    cs.LG 2025-10 unverdicted novelty 7.0

    Effective noise scale non-monotonically governs model merging success with an optimum, unifying effects of learning rate, weight decay, batch size, and augmentation on the loss landscape.

  6. RACE Attention: A Strictly Linear-Time Attention Layer for Training on Outrageously Large Contexts

    cs.LG 2025-10 unverdicted novelty 7.0

    RACE Attention is a strictly linear-time attention mechanism that approximates softmax attention outputs using Gaussian projections and soft LSH to enable training on contexts up to 12 million tokens.

  7. SeedPrints: Fingerprints Can Even Tell Which Seed Your Large Language Model Was Trained From

    cs.CR 2025-09 unverdicted novelty 7.0

    SeedPrints fingerprints LLMs using persistent biases from initialization seeds for lineage verification across pretraining and adaptation stages.

  8. All is Not Lost: LLM Recovery without Checkpoints

    cs.DC 2025-06 conditional novelty 7.0

    CheckFree recovers intermediate stage failures in pipeline-parallel LLM training via neighbor averaging; CheckFree+ adds out-of-order execution to handle first/last stages by copying neighbors, with small embedding st...

  9. TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models

    cs.CL 2025-04 unverdicted novelty 7.0

    The authors generate and publicly release the first large-scale open dataset of three million structured moral fables produced by small open language models together with a reproducible LLM-judge evaluation pipeline.

  10. Towards Human-Level Book-Writing Capability

    cs.AI 2026-05 unverdicted novelty 6.0

    A prompt-to-book training framework that derives hierarchical summaries from public-domain novels and inverts them to supervise long-context models toward human literary prose instead of assistant-style output.

  11. Primal-Dual Guided Decoding for Constrained Discrete Diffusion

    cs.AI 2026-05 unverdicted novelty 6.0

    Primal-dual guided decoding casts constrained discrete diffusion as a KL-regularized optimization solved online with adaptive Lagrangian multipliers to satisfy constraints while staying close to the unconstrained mode...

  12. Practical Scaling Laws: Converting Compute into Performance in a Data-Constrained World

    cs.LG 2026-05 conditional novelty 6.0

    A new scaling law L(N, D, T) = E + (L0 - E) h/(1+h) with h = a/N^α + b/T^β + c N^γ/D^δ that decomposes loss into undercapacity, undertraining, and overfitting terms and saturates between E and L0.

  13. TextLDM: Language Modeling with Continuous Latent Diffusion

    cs.CL 2026-05 unverdicted novelty 6.0

    TextLDM applies DiT-style latent diffusion with flow matching to language modeling via a REPA-aligned VAE, outperforming prior diffusion LMs and matching GPT-2 when trained from scratch on OpenWebText2.

  14. BackFlush: Knowledge-Free Backdoor Detection and Elimination with Watermark Preservation in Large Language Models

    cs.CR 2026-04 unverdicted novelty 6.0

    BackFlush detects backdoors via susceptibility amplification and eliminates them with RoPE unlearning to reach 1% ASR and 99% clean accuracy while preserving watermarks.

  15. Latent Planning Emerges with Scale

    cs.CL 2026-04 unverdicted novelty 6.0

    Latent planning ability in LLMs emerges and strengthens with scale, shown through internal features that represent future words and influence token choices on planning and rhyming tasks.

  16. Differences in Text Generated by Diffusion and Autoregressive Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    DLMs exhibit lower n-gram entropy, higher semantic coherence, and higher semantic diversity than ARMs, primarily due to bidirectional context and remasking decoding strategies.

  17. Next-Latent Prediction Transformers Learn Compact World Models

    cs.LG 2025-11 unverdicted novelty 6.0

    NextLat augments next-token prediction with latent next-state prediction, theoretically converging latents to belief states and showing empirical gains in world modeling, reasoning, planning, and faster inference via ...

  18. Two failure modes of deep transformers and how to avoid them: a unified theory of signal propagation at initialisation

    stat.ML 2025-05 unverdicted novelty 6.0

    Analytical theory of signal propagation in deep transformers at initialization yields quantitative prescriptions for weights and residuals to avoid rank and entropy collapse via Random Energy Model analogy.

  19. MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models

    cs.CL 2023-09 conditional novelty 6.0

    Bootstrapping math questions via rewriting creates MetaMathQA; fine-tuning LLaMA-2 on it yields 66.4% on GSM8K for 7B and 82.3% for 70B, beating prior same-size models by large margins.

  20. Textbooks Are All You Need II: phi-1.5 technical report

    cs.CL 2023-09 unverdicted novelty 6.0

    phi-1.5 is a 1.3B parameter model trained on synthetic textbook data that matches the reasoning performance of models five times larger on natural language, math, and basic coding tasks.

  21. Textbooks Are All You Need

    cs.CL 2023-06 unverdicted novelty 6.0

    A 1.3B-parameter code model trained on 7B tokens of curated textbook and synthetic data achieves 50.6% on HumanEval, indicating data quality can enable strong performance at small scale.

  22. Seed Bank, Co-op, Stoop Swap: Metaphors for Governing Language Model Data for Creative Writing

    cs.HC 2026-05 unverdicted novelty 5.0

    Workshops with over 100 creative writers produced metaphors and four themes for language model governance that favor consent-driven, smaller open models encoding community values.

  23. Path Integral Solution for Dissipative Generative Dynamics

    cs.LG 2025-12 unverdicted novelty 5.0

    Language generation requires dissipative quantum dynamics with non-local aggregation, not conservation laws, framing it as dissipative quantum field theory.

  24. Small Language Models (SLMs) Can Still Pack a Punch: A survey (updated 2026)

    cs.CL 2025-01 unverdicted novelty 2.0

    A literature survey of Small Language Models (1-8B parameters) that can perform comparably or better than larger models, covering general-purpose and task-specific approaches plus creation techniques.

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · cited by 23 Pith papers · 15 internal anchors

  1. [1]

    Common Crawl

    Common crawl. Accessed: 2019

  2. [2]

    Towards understanding ensemble, knowledge distillation and self-distillation in deep learning

    Zeyuan Allen-Zhu and Yuanzhi Li. Towards understanding ensemble, knowledge distillation and self-distillation in deep learning. arXiv preprint arXiv:2012.09816 , 2020

  3. [3]

    GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow

    Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman. GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow, March 2021

  4. [4]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

  5. [5]

    Sparks of Artificial General Intelligence: Early experiments with GPT-4

    Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023

  6. [6]

    What Does BERT Look At? An Analysis of BERT's Attention

    Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D Manning. What does bert look at? an analysis of bert’s attention. arXiv preprint arXiv:1906.04341 , 2019

  7. [7]

    Young children’s understanding of fact beliefs versus value beliefs

    John H Flavell, Eleanor R Flavell, Frances L Green, and Louis J Moses. Young children’s understanding of fact beliefs versus value beliefs. Child development , 61(4):915–928, 1990

  8. [8]

    The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks

    Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635 , 2018

  9. [9]

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027 , 2020

  10. [10]

    Distilling the Knowledge in a Neural Network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

  11. [11]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556 , 2022

  12. [12]

    The Curious Case of Neural Text Degeneration

    Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751 , 2019

  13. [13]

    Quantized neural networks: Training neural networks with low precision weights and activations

    Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. The Journal of Machine Learning Research, 18(1):6869–6898, 2017

  14. [14]

    Vision transformers provably learn spatial structure

    Samy Jelassi, Michael Sander, and Yuanzhi Li. Vision transformers provably learn spatial structure. Advances in Neural Information Processing Systems , 35:37822–37836, 2022

  15. [15]

    TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

    Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551 , 2017

  16. [16]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

  17. [17]

    The winograd schema challenge

    Hector Levesque, Ernest Davis, and Leora Morgenstern. The winograd schema challenge. In Thirteenth international conference on the principles of knowledge representation and reasoning , 2012

  18. [18]

    Visualizing and Understanding Neural Models in NLP

    Jiwei Li, Xinlei Chen, Eduard Hovy, and Dan Jurafsky. Visualizing and understanding neural models in nlp. arXiv preprint arXiv:1506.01066 , 2015

  19. [19]

    How do transformers learn topic structure: Towards a mechanistic understanding

    Yuchen Li, Yuanzhi Li, and Andrej Risteski. How do transformers learn topic structure: Towards a mechanistic understanding. arXiv preprint arXiv:2303.04245 , 2023

  20. [20]

    The development of grammar in child language

    Wick Miller and Susan Ervin. The development of grammar in child language. Monographs of the Society for Research in Child Development , pages 9–34, 1964

  21. [21]

    GPT-4 Technical Report

    OpenAI. GPT-4 technical report, 2023

  22. [22]

    The LAMBADA dataset: Word prediction requiring a broad discourse context

    Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The lambada dataset: Word prediction requiring a broad discourse context. arXiv preprint arXiv:1606.06031, 2016

  23. [23]

    Language models are unsupervised multitask learners

    Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019

  24. [24]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research , 21(1):5485–5551, 2020

  25. [25]

    DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

    Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108 , 2019

  26. [26]

    What matters in the structured pruning of generative language models?

    Michael Santacroce, Zixin Wen, Yelong Shen, and Yuanzhi Li. What matters in the structured pruning of generative language models? arXiv preprint arXiv:2302.03773 , 2023

  27. [27]

    Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

    Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022

  28. [28]

    Mobilebert: a compact task-agnostic bert for resource-limited devices

    Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. Mobilebert: a compact task-agnostic bert for resource-limited devices. arXiv preprint arXiv:2004.02984 , 2020

  29. [29]

    “Cloze procedure”: A new tool for measuring readability

    Wilson L Taylor. “cloze procedure”: A new tool for measuring readability. Journalism quarterly, 30(4):415–433, 1953

  30. [30]

    Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned

    Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. arXiv preprint arXiv:1905.09418 , 2019

  31. [31]

    Ccnet: Extracting high quality monolingual datasets from web crawl data

    Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Edouard Grave. Ccnet: Extracting high quality monolingual datasets from web crawl data. arXiv preprint arXiv:1911.00359, 2019

  32. [32]

    Understanding natural language

    Terry Winograd. Understanding natural language. Cognitive psychology, 3(1):1–191, 1972

  33. [33]

    Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning

    Victor Zhong, Caiming Xiong, and Richard Socher. Seq2sql: Generating structured queries from natural language using reinforcement learning. arXiv preprint arXiv:1709.00103, 2017