Scaling Laws and Interpretability of Learning from Repeated Data
Pith reviewed 2026-05-17 15:44 UTC · model grok-4.3
The pith
Repeating 0.1% of training data 100 times makes an 800M model perform like a 400M model
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Repeating 0.1% of the data 100 times degrades the performance of an 800M parameter model to that of a 400M parameter model, even though 90% of training tokens remain unique. This is accompanied by a double descent in test loss and damage to internal structures associated with generalization.
What carries the argument
Memorization of the repeated data consumes model capacity and damages induction heads and copying mechanisms.
If this is right
- A predictable range of repetition frequencies leads to the worst degradation
- Data repetition harms generalization more than it affects memorization of unique data
- Induction heads are disproportionately affected by repeated data
- Small repeated fractions can cause large performance harms
Where Pith is reading between the lines
- Deduplication should be prioritized in data pipelines to prevent these effects
- Scaling laws might need to incorporate repetition rates as a variable
- Recovering from repeats could involve targeted unlearning techniques
Load-bearing premise
The performance degradation is caused by memorization consuming model capacity rather than other factors like optimization changes.
What would settle it
An experiment showing that models do not memorize the repeated data more than the unique data, or that induction heads remain intact despite the performance drop, would falsify the claim.
Original abstract
Recent large language models have been trained on vast datasets, but also often on repeated data, either intentionally for the purpose of upweighting higher quality data, or unintentionally because data deduplication is not perfect and the model is exposed to repeated data at the sentence, paragraph, or document level. Some works have reported substantial negative performance effects of this repeated data. In this paper we attempt to study repeated data systematically and to understand its effects mechanistically. To do this, we train a family of models where most of the data is unique but a small fraction of it is repeated many times. We find a strong double descent phenomenon, in which repeated data can lead test loss to increase midway through training. A predictable range of repetition frequency leads to surprisingly severe degradation in performance. For instance, performance of an 800M parameter model can be degraded to that of a 2x smaller model (400M params) by repeating 0.1% of the data 100 times, despite the other 90% of the training tokens remaining unique. We suspect there is a range in the middle where the data can be memorized and doing so consumes a large fraction of the model's capacity, and this may be where the peak of degradation occurs. Finally, we connect these observations to recent mechanistic interpretability work - attempting to reverse engineer the detailed computations performed by the model - by showing that data repetition disproportionately damages copying and internal structures associated with generalization, such as induction heads, providing a possible mechanism for the shift from generalization to memorization. Taken together, these results provide a hypothesis for why repeating a relatively small fraction of data in large language models could lead to disproportionately large harms to performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper studies the effects of repeated data during LLM pretraining by training model families where most tokens are unique but a small fraction (e.g. 0.1%) is repeated many times (e.g. 100×). It reports a pronounced double descent in test loss, with severe degradation in an intermediate repetition regime; a concrete example is that repeating 0.1% of the data 100 times reduces an 800M model’s performance to that of a 400M model despite 90% of the training tokens remaining unique. The authors hypothesize that memorization of the repeated slice consumes model capacity and support this by showing disproportionate damage to induction heads and copying circuits.
Significance. If the central empirical pattern and mechanistic link hold, the work supplies actionable guidance for data curation at scale and strengthens the connection between scaling-law phenomena and mechanistic interpretability. The quantitative degradation example and double-descent curves are concrete and potentially reproducible; the induction-head analysis offers a falsifiable mechanistic hypothesis.
major comments (2)
- [Experimental protocol and results (around the 800 M / 0.1% × 100 example)] The central comparison (repeating 0.1% of data 100 times, yielding ~10% repeated tokens and therefore only ~90% unique content for fixed total token count) is made against a no-repetition baseline that supplies 100% unique data. Because the manuscript invokes scaling-law relationships, a 10% reduction in unique data volume alone is expected to increase loss; without an explicit control that matches unique-data volume while eliminating repetition (e.g., training on 90% unique data for the same number of steps or an adjusted schedule), the capacity-consumption account is not isolated from a simpler distributional effect. This issue is load-bearing for the claim that repetition-induced memorization is the primary driver.
- [Mechanistic interpretability section] The mechanistic claim that repetition “disproportionately damages copying and internal structures associated with generalization, such as induction heads” requires quantitative controls. It is unclear whether the reported head damage exceeds what would be expected from the reduced unique-data volume or from changes in optimization trajectory; additional ablations (e.g., head ablation scores before/after repetition, or comparison to a matched-unique-data baseline) would be needed to establish causality.
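One way to make the requested induction-head comparison quantitative is a per-head score that can be computed identically for the repetition run and any control run. The sketch below is an illustration only, in the spirit of the prefix-matching diagnostics used in induction-head work; it is not confirmed to be the metric the paper uses. It assumes a head's attention pattern has been recorded on a sequence made of one random token block repeated twice, and it averages the attention each second-copy token pays to the token that followed its first occurrence. The function names and the toy patterns are hypothetical.

```python
def prefix_matching_score(attn, block_len):
    """Average attention from second-copy tokens to the token that followed
    their first occurrence. `attn[q][k]` is the weight query q puts on key k,
    for a sequence built as one random block of length `block_len` repeated twice."""
    seq_len = 2 * block_len
    if len(attn) != seq_len or any(len(row) != seq_len for row in attn):
        raise ValueError("expected a square pattern of side 2 * block_len")
    # Token at position q (second copy) first appeared at q - block_len;
    # an induction-style head looks one step after that earlier occurrence.
    scores = [attn[q][q - block_len + 1] for q in range(block_len, seq_len)]
    return sum(scores) / len(scores)


def uniform_causal(seq_len):
    """Toy pattern: each query spreads attention evenly over itself and the past."""
    return [[1.0 / (q + 1) if k <= q else 0.0 for k in range(seq_len)]
            for q in range(seq_len)]


def ideal_induction(block_len):
    """Toy pattern: second-copy queries put all weight on the induction target;
    first-copy queries fall back to position 0."""
    seq_len = 2 * block_len
    pattern = [[0.0] * seq_len for _ in range(seq_len)]
    for q in range(seq_len):
        target = q - block_len + 1 if q >= block_len else 0
        pattern[q][target] = 1.0
    return pattern


ideal_score = prefix_matching_score(ideal_induction(8), 8)
uniform_score = prefix_matching_score(uniform_causal(16), 8)
print(ideal_score, uniform_score)
```

Computing the same score per head for the baseline, the repetition run, and a matched-unique-data control would let the comparison the referee asks for be stated as a difference in scores rather than as a qualitative impression.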
minor comments (2)
- [Abstract] Clarify the exact token fractions: the abstract states “the other 90% of the training tokens remaining unique,” but the arithmetic (0.1% repeated 100 times) implies ~10% repeated tokens; a short table or sentence making the unique/repeated split explicit would remove ambiguity.
- [Methods / experimental details] Controls for optimizer state and data ordering are not fully detailed; adding a brief appendix note on whether the repeated-data runs used identical optimizer states or data-ordering seeds as the baselines would improve reproducibility.
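The token arithmetic behind the abstract comment can be checked with a short sketch. It assumes that "0.1% of the data" is measured against the fixed total training-token budget, which is an interpretation rather than something this page states explicitly; the function name `token_split` is hypothetical.

```python
from fractions import Fraction


def token_split(subset_fraction, repeats):
    """Split a fixed token budget (normalized to 1) into repeated and unique parts.

    subset_fraction: size of the repeated slice as a share of the total budget
                     (assumed interpretation of "0.1% of the data").
    repeats:         how many times that slice is seen during training.
    """
    repeated = subset_fraction * repeats      # budget spent on the repeated slice
    if repeated > 1:
        raise ValueError("repeated slice exceeds the token budget")
    unique = 1 - repeated                     # budget spent on data seen once
    distinct = unique + subset_fraction       # distinct content seen at least once
    return {"repeated": repeated, "unique": unique, "distinct": distinct}


split = token_split(Fraction(1, 1000), 100)
for name, share in split.items():
    print(f"{name}: {float(share):.1%}")
# repeated: 10.0%
# unique: 90.0%
# distinct: 90.1%
```

Under this reading, the repeated slice takes 10% of the training tokens, 90% of the tokens are seen once, and the model sees about 90.1% as much distinct content as a fully unique run of the same length.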
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. The points raised about experimental controls and mechanistic interpretability are important for strengthening the paper. We address each comment below and indicate the revisions we will make.
Point-by-point responses
-
Referee: [Experimental protocol and results (around the 800 M / 0.1% × 100 example)] The central comparison (repeating 0.1% of data 100 times, yielding ~10% repeated tokens and therefore only ~90% unique content for fixed total token count) is made against a no-repetition baseline that supplies 100% unique data. Because the manuscript invokes scaling-law relationships, a 10% reduction in unique data volume alone is expected to increase loss; without an explicit control that matches unique-data volume while eliminating repetition (e.g., training on 90% unique data for the same number of steps or an adjusted schedule), the capacity-consumption account is not isolated from a simpler distributional effect. This issue is load-bearing for the claim that repetition-induced memorization is the primary driver.
Authors: We agree that distinguishing the effect of repetition from the reduction in unique data volume is necessary. Our experiments hold total training tokens fixed, and the double-descent behavior—loss rising after an initial decline—cannot be explained by a static reduction in unique data alone, which would produce a monotonic shift rather than non-monotonic dynamics. To isolate the repetition effect more cleanly, we will add a control baseline in the revised manuscript that trains on 90% unique data (with no repetition) for the same total token count and compare it directly to the repetition setting. revision: yes
-
Referee: [Mechanistic interpretability section] The mechanistic claim that repetition “disproportionately damages copying and internal structures associated with generalization, such as induction heads” requires quantitative controls. It is unclear whether the reported head damage exceeds what would be expected from the reduced unique-data volume or from changes in optimization trajectory; additional ablations (e.g., head ablation scores before/after repetition, or comparison to a matched-unique-data baseline) would be needed to establish causality.
Authors: We recognize that additional quantitative controls would strengthen the causal link. The existing analysis already compares induction-head metrics between the repetition regime and the standard no-repetition baseline. In the revision we will include direct comparisons of head importance and ablation scores against the new 90%-unique no-repetition control, as well as before/after repetition measurements, to demonstrate that the damage exceeds what is attributable to reduced unique data volume or optimization differences alone. revision: yes
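The non-monotonicity argument in the first response can be made operational with a simple check on a recorded test-loss curve: measure how far the loss climbs above its running minimum. The sketch below is a hypothetical helper, and the loss values are made up for illustration rather than taken from the paper; real curves are noisy, so in practice the rise would need to be compared against a noise threshold.

```python
def largest_rebound(losses):
    """Return (rise, low_step, high_step): the largest climb of the loss above
    its running minimum, and the steps where that minimum and peak occur.
    A monotonically decreasing curve gives (0.0, None, None)."""
    if not losses:
        raise ValueError("need at least one loss value")
    best_rise, best_low, best_high = 0.0, None, None
    run_min, run_min_step = losses[0], 0
    for step, loss in enumerate(losses):
        if loss < run_min:
            # New running minimum: later rises are measured from here.
            run_min, run_min_step = loss, step
        elif loss - run_min > best_rise:
            best_rise, best_low, best_high = loss - run_min, run_min_step, step
    return best_rise, best_low, best_high


# Illustrative, made-up curves (not data from the paper).
monotone_curve = [4.0, 3.5, 3.2, 3.0, 2.9]
rebound_curve = [4.0, 3.5, 3.2, 3.4, 3.6, 3.3, 3.0]

monotone_result = largest_rebound(monotone_curve)
rebound_result = largest_rebound(rebound_curve)
print(monotone_result)
print(rebound_result)
```

A control run on 90% unique data with no repetition would be expected to behave like the first curve if the authors' argument holds, while the repetition run would show a clear rebound partway through training.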
Circularity Check
No circularity: purely empirical measurements from controlled training runs
Full rationale
The paper reports direct experimental results from training a family of models on datasets with a controlled small fraction of repeated tokens. Key claims, such as the degradation of an 800M model to the performance level of a 400M model by repeating 0.1% of data 100 times, are presented as measured outcomes from these runs rather than as outputs of any mathematical derivation or scaling-law equation that reduces to fitted inputs by construction. The suspicion that memorization consumes capacity is explicitly labeled as a hypothesis, not a derived result. Links to induction heads and copying mechanisms are based on post-training mechanistic analysis of the actual models. No self-citation chains, ansatzes, or uniqueness theorems are invoked as load-bearing premises; the work is self-contained through reproducible training experiments and does not rely on prior author results to close any logical loop.
Axiom & Free-Parameter Ledger
free parameters (1)
- repetition count and fraction
axioms (1)
- Domain assumption: model capacity is finite and can be allocated between memorization and generalization
Forward citations
Cited by 19 Pith papers
-
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.
-
Scaling Laws for Mixture Pretraining Under Data Constraints
Repetition-aware scaling laws show scarce target data in pretraining mixtures can be repeated 15-20 times optimally, with the best count depending on data size, compute, and model scale.
-
Mix, Don't Tune: Bilingual Pre-Training Outperforms Hyperparameter Search in Data-Constrained Settings
Mixing auxiliary high-resource language data outperforms hyperparameter tuning in data-constrained bilingual pre-training, with gains equivalent to 2-13 times more unique target data.
-
Practical Scaling Laws: Converting Compute into Performance in a Data-Constrained World
A new scaling law, L(N, D, T) = E + (L0 - E) h/(1+h) with h = a/N^α + b/T^β + c N^γ/D^δ, decomposes loss into undercapacity, undertraining, and overfitting terms and saturates between E and L0.
-
Prescriptive Scaling Laws for Data Constrained Training
A one-parameter scaling law models excess loss from data repetition as an additive overfitting penalty, recommending model capacity increases over excessive repetition and showing that strong weight decay reduces the ...
-
When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models
LLM accuracy on controlled procedural arithmetic drops from 61% at 5 steps to 20% at 95 steps, with failures including skipped steps, premature answers, and hallucinated operations.
-
MAGI-1: Autoregressive Video Generation at Scale
MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.
-
The Falcon Series of Open Language Models
Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.
-
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
Properly filtered web data from CommonCrawl alone trains LLMs that significantly outperform models trained on The Pile, with 600 billion tokens and 1.3B/7.5B parameter models released.
-
Scaling Data-Constrained Language Models
Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.
-
The False Promise of Imitating Proprietary LLMs
Finetuning open LMs on ChatGPT outputs creates models that mimic style and fool human raters but fail to close the performance gap to proprietary systems on tasks not well-represented in the imitation data.
-
Rethinking Data Curation in LLM Training: Online Reweighting Offers Better Generalization than Offline Methods
ADAPT is an online reweighting framework for LLM training that outperforms offline data selection and mixing methods in cross-benchmark generalization under equal compute.
-
Similarity Field Theory: A Mathematical Framework for Intelligence
Similarity Field Theory defines a similarity field over entities, concepts as superlevel-set fibers, and intelligence as a generative operator that preserves fiber membership under evolution.
-
StarCoder: may the source be with you!
StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.
-
Galactica: A Large Language Model for Science
Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.
-
Large Language Models: A Survey
The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.
-
A Survey of Large Language Models
This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.
-
Is More Data Worth the Cost? Dataset Scaling Laws in a Tiny Attention-Only Decoder
A reduced attention-only decoder shows diminishing returns in dataset scaling, reaching 90% of full accuracy with only 30% of the data.
- Superposition Yields Robust Neural Scaling
Reference graph
Works this paper leans on
-
[1]
Learning Transferable Visual Models From Natural Language Supervision
Radford, Alec and Kim, Jong Wook and Hallacy, Chris and Ramesh, Aditya and Goh, Gabriel and Agarwal, Sandhini and Sastry, Girish and Askell, Amanda and Mishkin, Pamela and Clark, Jack and Krueger, Gretchen and Sutskever, Ilya , copyright =. Learning Transferable Visual Models From Natural Language Supervision , url =. 2021 , bdsk-url-1 =. doi:10.48550/ARX...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2103.00020 2021
-
[2]
Multimodal Neurons in Artificial Neural Networks , year =
Goh, Gabriel and Nick, Cammarata and Chelsea, Voss and Carter, Shan and Petrov, Michael and Schubert, Ludwig and Radford, Alec and Olah, Chris , date-added =. Multimodal Neurons in Artificial Neural Networks , year =. doi:10.23915/distill.00030 , journal =
-
[3]
In-context Learning and Induction Heads , year =
Olsson, Catherine and Elhage, Nelson and Nanda, Neel and Joseph, Nicholas and DasSarma, Nova and Henighan, Tom and Mann, Ben and Askell, Amanda and Bai, Yuntao and Chen, Anna and Conerly, Tom and Drain, Dawn and Ganguli, Deep and Hatfield-Dodds, Zac and Hernandez, Danny and Johnston, Scott and Jones, Andy and Kernion, Jackson and Lovitt, Liane and Ndousse...
-
[4]
Ouyang, Long and Wu, Jeff and Jiang, Xu and Almeida, Diogo and Wainwright, Carroll L. and Mishkin, Pamela and Zhang, Chong and Agarwal, Sandhini and Slama, Katarina and Ray, Alex and Schulman, John and Hilton, Jacob and Kelton, Fraser and Miller, Luke and Simens, Maddie and Askell, Amanda and Welinder, Peter and Christiano, Paul and Leike, Jan and Lowe, R...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2203.02155 2022
-
[5]
A Variational Approach to Learning Curves , url =
Malzahn, D\". A Variational Approach to Learning Curves , url =. Advances in Neural Information Processing Systems , date-added =. 2001 , bdsk-url-1 =
work page 2001
-
[6]
Statistical mechanics of learning: Generalization , year =
Opper, Manfred , date-added =. Statistical mechanics of learning: Generalization , year =. doi:10.1007/978-1-4612-0723-8_5 , isbn =
-
[7]
Bi, Bin and Li, Chenliang and Wu, Chen and Yan, Ming and Wang, Wei and Huang, Songfang and Huang, Fei and Si, Luo , copyright =. PALM: Pre-training an Autoencoding and Autoregressive Language Model for Context-conditioned Generation , url =. 2020 , bdsk-url-1 =. doi:10.48550/ARXIV.2004.07159 , keywords =
-
[8]
Predictability and Surprise in Large Generative Models , url =
Ganguli, Deep and Hernandez, Danny and Lovitt, Liane and DasSarma, Nova and Henighan, Tom and Jones, Andy and Joseph, Nicholas and Kernion, Jackson and Mann, Ben and Askell, Amanda and Bai, Yuntao and Chen, Anna and Conerly, Tom and Drain, Dawn and Elhage, Nelson and Showk, Sheer El and Fort, Stanislav and Hatfield-Dodds, Zac and Johnston, Scott and Krave...
-
[9]
Deep Learning Scaling is Predictable, Empirically
Hestness, Joel and Narang, Sharan and Ardalani, Newsha and Diamos, Gregory and Jun, Heewoo and Kianinejad, Hassan and Patwary, Md. Mostofa Ali and Yang, Yang and Zhou, Yanqi , copyright =. Deep Learning Scaling is Predictable, Empirically , url =. 2017 , bdsk-url-1 =. doi:10.48550/ARXIV.1712.00409 , keywords =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1712.00409 2017
-
[10]
Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N. and Kaiser, Lukasz and Polosukhin, Illia , copyright =. Attention Is All You Need , url =. 2017 , bdsk-url-1 =. doi:10.48550/ARXIV.1706.03762 , keywords =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1706.03762 2017
-
[11]
HuggingFace's Transformers: State-of-the-art Natural Language Processing
Wolf, Thomas and Debut, Lysandre and Sanh, Victor and Chaumond, Julien and Delangue, Clement and Moi, Anthony and Cistac, Pierric and Rault, Tim and Louf, R. HuggingFace's Transformers: State-of-the-art Natural Language Processing , url =. 2019 , bdsk-url-1 =. doi:10.48550/ARXIV.1910.03771 , keywords =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1910.03771 2019
-
[12]
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor , copyright =. The Pile: An 800GB Dataset of Diverse Text for Language Modeling , url =. 2021 , bdsk-url-1 =. doi:10.48550/ARXIV.2101.00027 , keywords =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2101.00027 2021
-
[13]
A General Language Assistant as a Laboratory for Alignment
Askell, Amanda and Bai, Yuntao and Chen, Anna and Drain, Dawn and Ganguli, Deep and Henighan, Tom and Jones, Andy and Joseph, Nicholas and Mann, Ben and DasSarma, Nova and Elhage, Nelson and Hatfield-Dodds, Zac and Hernandez, Danny and Kernion, Jackson and Ndousse, Kamal and Olsson, Catherine and Amodei, Dario and Brown, Tom and Clark, Jack and McCandlish...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2112.00861 2021
-
[14]
doi:10.23915/distill.00024 , note =
Cammarata, Nick and Carter, Shan and Goh, Gabriel and Olah, Chris and Petrov, Michael and Schubert, Ludwig and Voss, Chelsea and Egan, Ben and Lim, Swee Kiat , date-added =. Thread: Circuits , year =. doi:10.23915/distill.00024 , journal =
-
[15]
A Mathematical Framework for Transformer Circuits , year =
Elhage, Nelson and Nanda, Neel and Olsson, Catherine and Henighan, Tom and Joseph, Nicholas and Mann, Ben and Askell, Amanda and Bai, Yuntao and Chen, Anna and Conerly, Tom and DasSarma, Nova and Drain, Dawn and Ganguli, Deep and Hatfield-Dodds, Zac and Hernandez, Danny and Jones, Andy and Kernion, Jackson and Lovitt, Liane and Ndousse, Kamal and Amodei, ...
-
[16]
Scaling Laws for the Few-Shot Adaptation of Pre-trained Image Classifiers , url =
Prato, Gabriele and Guiroy, Simon and Caballero, Ethan and Rish, Irina and Chandar, Sarath , copyright =. Scaling Laws for the Few-Shot Adaptation of Pre-trained Image Classifiers , url =. 2021 , bdsk-url-1 =. doi:10.48550/ARXIV.2110.06990 , keywords =
-
[17]
Scaling Laws for Acoustic Models , url =
Droppo, Jasha and Elibol, Oguz , copyright =. Scaling Laws for Acoustic Models , url =. 2021 , bdsk-url-1 =. doi:10.48550/ARXIV.2106.09488 , keywords =
-
[18]
GitHub Copilot: Parrot or crow? , url =
Albert Ziegler , date-added =. GitHub Copilot: Parrot or crow? , url =
-
[19]
Quantifying Memorization Across Neural Language Models
Carlini, Nicholas and Ippolito, Daphne and Jagielski, Matthew and Lee, Katherine and Tramer, Florian and Zhang, Chiyuan , copyright =. Quantifying Memorization Across Neural Language Models , url =. 2022 , bdsk-url-1 =. doi:10.48550/ARXIV.2202.07646 , keywords =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2202.07646 2022
-
[20]
Learning to summarize from human feedback
Stiennon, Nisan and Ouyang, Long and Wu, Jeff and Ziegler, Daniel M. and Lowe, Ryan and Voss, Chelsea and Radford, Alec and Amodei, Dario and Christiano, Paul , copyright =. Learning to summarize from human feedback , url =. 2020 , bdsk-url-1 =. doi:10.48550/ARXIV.2009.01325 , keywords =
work page internal anchor Pith review doi:10.48550/arxiv.2009.01325 2020
-
[21]
Extracting Training Data from Large Language Models , url =
Carlini, Nicholas and Tramer, Florian and Wallace, Eric and Jagielski, Matthew and Herbert-Voss, Ariel and Lee, Katherine and Roberts, Adam and Brown, Tom and Song, Dawn and Erlingsson, Ulfar and Oprea, Alina and Raffel, Colin , copyright =. Extracting Training Data from Large Language Models , url =. 2020 , bdsk-url-1 =. doi:10.48550/ARXIV.2012.07805 , k...
-
[22]
Nakkiran, Preetum and Kaplun, Gal and Bansal, Yamini and Yang, Tristan and Barak, Boaz and Sutskever, Ilya , copyright =. Deep Double Descent: Where Bigger Models and More Data Hurt , url =. 2019 , bdsk-url-1 =. doi:10.48550/ARXIV.1912.02292 , keywords =
-
[23]
Double Trouble in Double Descent : Bias and Variance(s) in the Lazy Regime , url =
d'Ascoli, St. Double Trouble in Double Descent : Bias and Variance(s) in the Lazy Regime , url =. 2020 , bdsk-url-1 =. doi:10.48550/ARXIV.2003.01054 , keywords =
-
[24]
Jamming transition as a paradigm to understand the loss landscape of deep neural networks , volume =
Geiger, Mario and Spigler, Stefano and d'Ascoli, St. Jamming transition as a paradigm to understand the loss landscape of deep neural networks , volume =. Physical Review E , number =
-
[25]
High-dimensional dynamics of generalization error in neural networks
Advani, Madhu S. and Saxe, Andrew M. , copyright =. High-dimensional dynamics of generalization error in neural networks , url =. 2017 , bdsk-url-1 =. doi:10.48550/ARXIV.1710.03667 , keywords =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1710.03667 2017
-
[26]
Belkin, Mikhail and Hsu, Daniel and Ma, Siyuan and Mandal, Soumik , copyright =. Reconciling modern machine learning practice and the bias-variance trade-off , url =. 2018 , bdsk-url-1 =. doi:10.48550/ARXIV.1812.11118 , keywords =
-
[27]
Language models are unsupervised multitask learners , volume =
Radford, Alec and Wu, Jeffrey and Child, Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya and others , date-added =. Language models are unsupervised multitask learners , volume =. OpenAI blog , number =
-
[28]
Kiros, Ryan and Zhu, Yukun and Salakhutdinov, Ruslan and Zemel, Richard S. and Torralba, Antonio and Urtasun, Raquel and Fidler, Sanja , copyright =. Skip-Thought Vectors , url =. 2015 , bdsk-url-1 =. doi:10.48550/ARXIV.1506.06726 , keywords =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1506.06726 2015
-
[29]
Pointer Sentinel Mixture Models
Merity, Stephen and Xiong, Caiming and Bradbury, James and Socher, Richard , copyright =. Pointer Sentinel Mixture Models , url =. 2016 , bdsk-url-1 =. doi:10.48550/ARXIV.1609.07843 , keywords =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1609.07843 2016
-
[30]
Exploring the Limits of Language Modeling
Jozefowicz, Rafal and Vinyals, Oriol and Schuster, Mike and Shazeer, Noam and Wu, Yonghui , copyright =. Exploring the Limits of Language Modeling , url =. 2016 , bdsk-url-1 =. doi:10.48550/ARXIV.1602.02410 , keywords =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1602.02410 2016
-
[31]
transfoerm-circuits.pub , title =
Catherine, Olsson and Nelson, Elhage and Neel, Nanda and Nicholas, Joseph and Nova, DasSarma and Tom, Henighan and Ben, Mann and Amanda, Askell and Yuntao, Bai and Anna, Chen and Tom, Conerly and Dawn, Drain and Deep, Ganguli and Zac, Hatfield-Dodds and Danny, Hernandez and Scott, Johnston and Andy, Jones and Jackson, Kernion and Liane, Lovitt and Kamal, ...
-
[32]
Scaling Laws for Autoregressive Generative Modeling
Henighan, Tom and Kaplan, Jared and Katz, Mor and Chen, Mark and Hesse, Christopher and Jackson, Jacob and Jun, Heewoo and Brown, Tom B. and Dhariwal, Prafulla and Gray, Scott and Hallacy, Chris and Mann, Benjamin and Radford, Alec and Ramesh, Aditya and Ryder, Nick and Ziegler, Daniel M. and Schulman, John and Amodei, Dario and McCandlish, Sam , copyrigh...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2010.14701 2020
-
[33]
Scaling Laws for Neural Language Models
Kaplan, Jared and McCandlish, Sam and Henighan, Tom and Brown, Tom B. and Chess, Benjamin and Child, Rewon and Gray, Scott and Radford, Alec and Wu, Jeffrey and Amodei, Dario , copyright =. Scaling Laws for Neural Language Models , url =. 2020 , bdsk-url-1 =. doi:10.48550/ARXIV.2001.08361 , keywords =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2001.08361 2020
-
[35]
Training Compute-Optimal Large Language Models
Hoffmann, Jordan and Borgeaud, Sebastian and Mensch, Arthur and Buchatskaya, Elena and Cai, Trevor and Rutherford, Eliza and Casas, Diego de Las and Hendricks, Lisa Anne and Welbl, Johannes and Clark, Aidan and Hennigan, Tom and Noland, Eric and Millican, Katie and Driessche, George van den and Damoc, Bogdan and Guy, Aurelia and Osindero, Simon and Simony...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2203.15556 2022
-
[36]
Brown, Tom B. and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and Agarwal, Sandhini and Herbert-Voss, Ariel and Krueger, Gretchen and Henighan, Tom and Child, Rewon and Ramesh, Aditya and Ziegler, Daniel M. and Wu, Jeffrey and W...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2005.14165 2020
-
[37]
Rae, Jack W. and Borgeaud, Sebastian and Cai, Trevor and Millican, Katie and Hoffmann, Jordan and Song, Francis and Aslanides, John and Henderson, Sarah and Ring, Roman and Young, Susannah and Rutherford, Eliza and Hennigan, Tom and Menick, Jacob and Cassirer, Albin and Powell, Richard and Driessche, George van den and Hendricks, Lisa Anne and Rauh, Marib...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2112.11446 2021
-
[38]
Heruntergeladen von https://blog
Amodei, Dario and Hernandez, Danny and Sastry, Girish and Clark, Jack and Brockman, Greg and Sutskever, Ilya , date-added =. Heruntergeladen von https://blog. openai. com/aiand-compute , title =
-
[41]
Advani, M. S. and Saxe, A. M. (2017). High-dimensional dynamics of generalization error in neural networks
work page 2017
-
[42]
Amodei, D., Hernandez, D., Sastry, G., Clark, J., Brockman, G., and Sutskever, I. (2018). Ai and compute. Heruntergeladen von https://blog. openai. com/aiand-compute
work page 2018
-
[43]
Askell, A., Bai, Y., Chen, A., Drain, D., Ganguli, D., Henighan, T., Jones, A., Joseph, N., Mann, B., DasSarma, N., Elhage, N., Hatfield-Dodds, Z., Hernandez, D., Kernion, J., Ndousse, K., Olsson, C., Amodei, D., Brown, T., Clark, J., McCandlish, S., Olah, C., and Kaplan, J. (2021). A general language assistant as a laboratory for alignment
work page 2021
-
[44]
Belkin, M., Hsu, D., Ma, S., and Mandal, S. (2018). Reconciling modern machine learning practice and the bias-variance trade-off
work page 2018
-
[45]
Bi, B., Li, C., Wu, C., Yan, M., Wang, W., Huang, S., Huang, F., and Si, L. (2020). Palm: Pre-training an autoencoding and autoregressive language model for context-conditioned generation
work page 2020
-
[46]
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, ...
work page 2020
-
[47]
Cammarata, N., Carter, S., Goh, G., Olah, C., Petrov, M., Schubert, L., Voss, C., Egan, B., and Lim, S. K. (2020). Thread: Circuits. Distill . https://distill.pub/2020/circuits
work page 2020
- [48]
-
[49]
Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., DasSarma, N., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S., and Olah, C. (2021). A mathematical framework for transfor...
work page 2021
-
[50]
Ganguli, D., Hernandez, D., Lovitt, L., DasSarma, N., Henighan, T., Jones, A., Joseph, N., Kernion, J., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., Drain, D., Elhage, N., Showk, S. E., Fort, S., Hatfield-Dodds, Z., Johnston, S., Kravec, S., Nanda, N., Ndousse, K., Olsson, C., Amodei, D., Amodei, D., Brown, T., Kaplan, J., McCandlish, S., Olah, C...
work page 2022
-
[51]
Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., Presser, S., and Leahy, C. (2021). The pile: An 800gb dataset of diverse text for language modeling
work page 2021
-
[52]
Geiger, M., Spigler, S., d'Ascoli, S., Sagun, L., Baity-Jesi, M., Biroli, G., and Wyart, M. (2019). Jamming transition as a paradigm to understand the loss landscape of deep neural networks. Physical Review E , 100(1):012115
work page 2019
-
[53]
Goh, G., Nick, C., Chelsea, V., Carter, S., Petrov, M., Schubert, L., Radford, A., and Olah, C. (2021). Multimodal neurons in artificial neural networks. Distill . https://distill.pub/2021/multimodal-neurons
-
[54]
Henighan, T., Kaplan, J., Katz, M., Chen, M., Hesse, C., Jackson, J., Jun, H., Brown, T. B., Dhariwal, P., Gray, S., Hallacy, C., Mann, B., Radford, A., Ramesh, A., Ryder, N., Ziegler, D. M., Schulman, J., Amodei, D., and McCandlish, S. (2020). Scaling laws for autoregressive generative modeling
-
[55]
Hernandez, D. and Brown, T. B. (2020). Measuring the algorithmic efficiency of neural networks. CoRR, abs/2005.04305
-
[56]
Hernandez, D., Kaplan, J., Henighan, T., and McCandlish, S. (2021). Scaling laws for transfer. arXiv preprint arXiv:2102.01293
-
[57]
Hestness, J., Narang, S., Ardalani, N., Diamos, G., Jun, H., Kianinejad, H., Patwary, M. M. A., Yang, Y., and Zhou, Y. (2017). Deep learning scaling is predictable, empirically
-
[58]
Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. d. L., Hendricks, L. A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., Driessche, G. v. d., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., Rae, J. W., Vinyals, O., and Sifre, L. (2022). Training compute-optimal large language models
-
[59]
Jozefowicz, R., Vinyals, O., Schuster, M., Shazeer, N., and Wu, Y. (2016). Exploring the limits of language modeling
-
[60]
Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. (2020). Scaling laws for neural language models
-
[61]
Kiros, R., Zhu, Y., Salakhutdinov, R., Zemel, R. S., Torralba, A., Urtasun, R., and Fidler, S. (2015). Skip-thought vectors
-
[62]
Lee, K., Ippolito, D., Nystrom, A., Zhang, C., Eck, D., Callison-Burch, C., and Carlini, N. (2021). Deduplicating training data makes language models better. arXiv preprint arXiv:2107.06499
-
[63]
Malzahn, D. and Opper, M. (2001). A variational approach to learning curves. In Dietterich, T., Becker, S., and Ghahramani, Z., editors, Advances in Neural Information Processing Systems, volume 14. MIT Press
-
[64]
Merity, S., Xiong, C., Bradbury, J., and Socher, R. (2016). Pointer sentinel mixture models
-
[65]
Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., and Sutskever, I. (2019). Deep double descent: Where bigger models and more data hurt
-
[66]
Olsson, C., Elhage, N., Nanda, N., Joseph, N., DasSarma, N., Henighan, T., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Johnston, S., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S., and Olah, C. (2022). In-context learning and...
-
[67]
Opper, M. (1995). Statistical mechanics of learning: Generalization
-
[68]
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., and Lowe, R. (2022). Training language models to follow instructions with human feedback
-
[69]
Prato, G., Guiroy, S., Caballero, E., Rish, I., and Chandar, S. (2021). Scaling laws for the few-shot adaptation of pre-trained image classifiers
-
[70]
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. (2021). Learning transferable visual models from natural language supervision
-
[71]
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. (2019). Language models are unsupervised multitask learners. OpenAI blog, 1(8):9
-
[72]
Rae, J. W., Borgeaud, S., Cai, T., Millican, K., Hoffmann, J., Song, F., Aslanides, J., Henderson, S., Ring, R., Young, S., Rutherford, E., Hennigan, T., Menick, J., Cassirer, A., Powell, R., Driessche, G. v. d., Hendricks, L. A., Rauh, M., Huang, P.-S., Glaese, A., Welbl, J., Dathathri, S., Huang, S., Uesato, J., Mellor, J., Higgins, I., Creswell, A., Mc...
-
[73]
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention is all you need
-
[74]
Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Scao, T. L., Gugger, S., Drame, M., Lhoest, Q., and Rush, A. M. (2019). Huggingface's transformers: State-of-the-art natural language processing