Scaling Laws and Interpretability of Learning from Repeated Data

Ben Mann; Catherine Olsson; Chris Olah; Danny Hernandez; Dario Amodei; Dawn Drain; Jared Kaplan; Nelson Elhage; Nicholas Joseph; Nova DasSarma

Reviewed by Pith at T0; open to challenge.

T0 means a machine referee read the full paper against a public rubric. The mark states how deep the mechanical check went, never who wrote it.

T0 review · grok-4.3

Repeating 0.1% of training data 100 times makes an 800M model perform like a 400M model

2026-05-17 15:44 UTC pith:DPM7NWRD

load-bearing objection Repeating 0.1% of data 100 times drops 800M model performance to 400M levels and damages induction heads, but the setup leaves open whether this is mostly from reduced unique content rather than capacity eaten by memorization.

arxiv 2205.10487 v1 pith:DPM7NWRD submitted 2022-05-21 cs.LG cs.AI

Danny Hernandez, Tom Brown, Tom Conerly, Nova DasSarma, Dawn Drain, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Tom Henighan, Tristan Hume, Scott Johnston, Ben Mann, Chris Olah, Catherine Olsson, Dario Amodei, Nicholas Joseph, Jared Kaplan, Sam McCandlish

classification cs.LG cs.AI

keywords repeated data, double descent, memorization, induction heads, scaling laws, interpretability, language models

verification ladder T0 review · T1 audit · T2 compute · T3 formal · T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper demonstrates that repeating even a small fraction of data during training can lead to significant performance degradation in large language models. The authors train models with mostly unique data but repeat a small portion many times and find a double descent effect where loss increases partway through training. They provide evidence that this happens because memorization of repeats consumes model capacity and damages key generalization mechanisms like induction heads. This offers a mechanistic explanation for why unintentional data repeats can cause outsized harm.

Core claim

Repeating 0.1% of the data 100 times degrades the performance of an 800M parameter model to that of a 400M parameter model, even though 90% of training tokens remain unique. This is accompanied by a double descent in test loss and damage to internal structures associated with generalization.
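The token split behind this claim can be checked with a few lines of arithmetic. The sketch below assumes the 0.1% figure is measured against the total training-token budget; the function name `token_split` is hypothetical and not taken from the paper.

```python
def token_split(repeated_subset_fraction, repeat_count):
    """Return (repeated_share, unique_share) of the total training-token budget.

    Assumption: repeated_subset_fraction is measured against the total
    token budget, so a slice of that size seen repeat_count times fills
    repeated_subset_fraction * repeat_count of training.
    """
    repeated_share = repeated_subset_fraction * repeat_count
    if not 0.0 <= repeated_share <= 1.0:
        raise ValueError("repeated tokens cannot exceed the total budget")
    return repeated_share, 1.0 - repeated_share


repeated, unique = token_split(0.001, 100)
print(f"repeated: {repeated:.1%}, unique: {unique:.1%}")
# prints "repeated: 10.0%, unique: 90.0%"
```

Under that assumption, a 0.1% slice seen 100 times accounts for about 10% of training tokens, which is where the 90% unique figure comes from.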

What carries the argument

Memorization of repeated data consuming model capacity and damaging induction heads and copying mechanisms
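To make "damage to induction heads" concrete, one common style of measurement is a prefix-matching score: on a sequence built from a random block repeated twice, check how much attention each token in the second copy places on the token that followed its earlier occurrence. The sketch below is a simplified illustration under that assumption, not the paper's exact metric; all function names are hypothetical, and the attention patterns are synthetic.

```python
def prefix_matching_score(attn, period):
    """Mean attention from second-copy tokens to the token after their first occurrence.

    attn   -- square list-of-lists; attn[q][k] is attention from query q to key k
    period -- length of the random block that is repeated twice
    """
    queries = range(period, len(attn))
    weights = [attn[q][q - period + 1] for q in queries]
    return sum(weights) / len(weights)


def uniform_causal(seq_len):
    """Synthetic head that spreads attention evenly over all earlier positions."""
    return [[1.0 / (q + 1) if k <= q else 0.0 for k in range(seq_len)]
            for q in range(seq_len)]


def ideal_induction(seq_len, period):
    """Synthetic head that attends only to the induction target in the second copy."""
    attn = uniform_causal(seq_len)
    for q in range(period, seq_len):
        attn[q] = [1.0 if k == q - period + 1 else 0.0 for k in range(seq_len)]
    return attn


print(prefix_matching_score(ideal_induction(16, 8), 8))  # ideal head
print(prefix_matching_score(uniform_causal(16), 8))      # unstructured head
```

A score near 1 indicates induction-like behavior; comparing such scores between repeated-data and baseline runs is one way a reader could quantify the claimed damage.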

Load-bearing premise

The performance degradation is caused by memorization consuming model capacity rather than other factors like optimization changes.

What would settle it

An experiment showing that models do not memorize the repeated data more than unique data, or that induction heads remain intact despite the performance drop, would falsify the claim.

If this is right

A predictable range of repetition frequencies leads to the worst degradation
Data repetition harms generalization more than it affects memorization of unique data
Induction heads are disproportionately affected by repeated data
Small repeated fractions can cause large performance harms

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Deduplication should be prioritized in data pipelines to prevent these effects
Scaling laws might need to incorporate repetition rates as a variable
Recovering from repeats could involve targeted unlearning techniques

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Referee Report

2 major / 2 minor

Summary. The paper studies the effects of repeated data during LLM pretraining by training model families where most tokens are unique but a small fraction (e.g. 0.1%) is repeated many times (e.g. 100×). It reports a pronounced double-descent in test loss, with severe degradation in an intermediate repetition regime; a concrete example is that repeating 0.1% of the data 100 times reduces an 800 M model’s performance to that of a 400 M model despite 90% of the training tokens remaining unique. The authors hypothesize that memorization of the repeated slice consumes model capacity and support this by showing disproportionate damage to induction heads and copying circuits.
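One way to read "an 800M model performs like a 400M model" is as an effective model size obtained by inverting a parameter scaling law. The sketch below assumes a simple power law L(N) = (N_c / N)^alpha with placeholder constants; `ALPHA` and `N_C` are illustrative values, not fits reported on this page, and the function names are hypothetical.

```python
ALPHA = 0.076  # illustrative exponent (assumption, not from this page)
N_C = 8.8e13   # illustrative scale constant (assumption, not from this page)


def loss_from_params(n_params):
    """Loss predicted by the assumed power law for a model with n_params parameters."""
    return (N_C / n_params) ** ALPHA


def effective_params(loss):
    """Invert the assumed power law: which model size would reach this loss?"""
    return N_C / loss ** (1.0 / ALPHA)


def effective_size_multiplier(observed_loss, n_params):
    """Ratio of effective model size to actual model size."""
    return effective_params(observed_loss) / n_params


# Hypothetical scenario: an 800M run ends at the loss the law assigns to 400M.
degraded_loss = loss_from_params(400e6)
print(effective_size_multiplier(degraded_loss, 800e6))  # close to 0.5 by construction
```

The multiplier is independent of the placeholder constants in this constructed case; with real loss measurements, the result would depend on the fitted law.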

Significance. If the central empirical pattern and mechanistic link hold, the work supplies actionable guidance for data curation at scale and strengthens the connection between scaling-law phenomena and mechanistic interpretability. The quantitative degradation example and double-descent curves are concrete and potentially reproducible; the induction-head analysis offers a falsifiable mechanistic hypothesis.

major comments (2)

[Experimental protocol and results (around the 800 M / 0.1% × 100 example)] The central comparison (repeating 0.1% of data 100 times, yielding ~10% repeated tokens and therefore only ~90% unique content for fixed total token count) is made against a no-repetition baseline that supplies 100% unique data. Because the manuscript invokes scaling-law relationships, a 10% reduction in unique data volume alone is expected to increase loss; without an explicit control that matches unique-data volume while eliminating repetition (e.g., training on 90% unique data for the same number of steps or an adjusted schedule), the capacity-consumption account is not isolated from a simpler distributional effect. This issue is load-bearing for the claim that repetition-induced memorization is the primary driver.
[Mechanistic interpretability section] The mechanistic claim that repetition “disproportionately damages copying and internal structures associated with generalization, such as induction heads” requires quantitative controls. It is unclear whether the reported head damage exceeds what would be expected from the reduced unique-data volume or from changes in optimization trajectory; additional ablations (e.g., head ablation scores before/after repetition, or comparison to a matched-unique-data baseline) would be needed to establish causality.
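The control requested in the first major comment can be sketched as a set of token plans. The example below assumes a hypothetical 100B-token budget and shows three runs: the no-repetition baseline, the repetition setting, and a control that sees the same unique tokens as the repetition run but no repeats. The class and function names are illustrative, and whether the control should also be padded to the full budget or run on an adjusted schedule is left open, as in the comment itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunPlan:
    name: str
    unique_tokens: int           # tokens seen exactly once
    repeated_subset_tokens: int  # size of the repeated slice
    repeat_count: int            # number of times the slice is seen

    @property
    def total_tokens(self):
        return self.unique_tokens + self.repeated_subset_tokens * self.repeat_count


def build_plans(budget, subset_fraction=0.001, repeat_count=100):
    """Build baseline, repetition, and matched-unique-data plans for one token budget."""
    subset = round(budget * subset_fraction)
    repeated = subset * repeat_count
    return [
        RunPlan("baseline", budget, 0, 0),
        RunPlan("repetition", budget - repeated, subset, repeat_count),
        RunPlan("matched_unique_control", budget - repeated, 0, 0),
    ]


# Hypothetical budget chosen only for illustration.
for plan in build_plans(budget=100_000_000_000):
    print(plan.name, plan.total_tokens, plan.unique_tokens)
```

Comparing the repetition run against the matched control, rather than against the baseline alone, separates the effect of repeats from the effect of seeing less unique data.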

minor comments (2)

[Abstract] Clarify the exact token fractions: the abstract states “the other 90% of the training tokens remaining unique,” which follows from the arithmetic (0.1% of the data repeated 100 times yields ~10% repeated tokens) but is never stated explicitly; a short table or sentence making the unique/repeated split explicit would remove ambiguity.
[Methods / experimental details] Full controls for optimizer state and data ordering are not detailed in the manuscript; adding a brief appendix note on whether the repeated-data runs used identical optimizer states or data-ordering seeds as the baselines would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. The points raised about experimental controls and mechanistic interpretability are important for strengthening the paper. We address each comment below and indicate the revisions we will make.

Referee: [Experimental protocol and results (around the 800 M / 0.1% × 100 example)] The central comparison (repeating 0.1% of data 100 times, yielding ~10% repeated tokens and therefore only ~90% unique content for fixed total token count) is made against a no-repetition baseline that supplies 100% unique data. Because the manuscript invokes scaling-law relationships, a 10% reduction in unique data volume alone is expected to increase loss; without an explicit control that matches unique-data volume while eliminating repetition (e.g., training on 90% unique data for the same number of steps or an adjusted schedule), the capacity-consumption account is not isolated from a simpler distributional effect. This issue is load-bearing for the claim that repetition-induced memorization is the primary driver.

Authors: We agree that distinguishing the effect of repetition from the reduction in unique data volume is necessary. Our experiments hold total training tokens fixed, and the double-descent behavior—loss rising after an initial decline—cannot be explained by a static reduction in unique data alone, which would produce a monotonic shift rather than non-monotonic dynamics. To isolate the repetition effect more cleanly, we will add a control baseline in the revised manuscript that trains on 90% unique data (with no repetition) for the same total token count and compare it directly to the repetition setting. revision: yes
Referee: [Mechanistic interpretability section] The mechanistic claim that repetition “disproportionately damages copying and internal structures associated with generalization, such as induction heads” requires quantitative controls. It is unclear whether the reported head damage exceeds what would be expected from the reduced unique-data volume or from changes in optimization trajectory; additional ablations (e.g., head ablation scores before/after repetition, or comparison to a matched-unique-data baseline) would be needed to establish causality.

Authors: We recognize that additional quantitative controls would strengthen the causal link. The existing analysis already compares induction-head metrics between the repetition regime and the standard no-repetition baseline. In the revision we will include direct comparisons of head importance and ablation scores against the new 90%-unique no-repetition control, as well as before/after repetition measurements, to demonstrate that the damage exceeds what is attributable to reduced unique data volume or optimization differences alone. revision: yes
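The rebuttal's distinction between a monotonic shift and non-monotonic dynamics can be checked mechanically on a logged test-loss curve. The sketch below finds the largest rise in loss above the running minimum; the function name `find_loss_bump` is hypothetical and the loss values are synthetic, not data from the paper.

```python
def find_loss_bump(losses, min_rise=0.0):
    """Return (trough_index, peak_index) of the largest rise above the running minimum.

    Returns None when the loss never rises by more than min_rise, which is
    what a purely monotonic shift of the curve would produce.
    """
    best = None
    best_rise = min_rise
    trough_idx = 0
    for i in range(1, len(losses)):
        if losses[i] < losses[trough_idx]:
            trough_idx = i
        rise = losses[i] - losses[trough_idx]
        if rise > best_rise:
            best_rise = rise
            best = (trough_idx, i)
    return best


# Synthetic curves for illustration only.
with_bump = [4.0, 3.5, 3.2, 3.4, 3.6, 3.3, 3.0]
monotonic = [4.0, 3.5, 3.2, 3.0]
print(find_loss_bump(with_bump))   # prints (2, 4)
print(find_loss_bump(monotonic))   # prints None
```

In practice a positive `min_rise` would be needed to ignore evaluation noise; the right threshold depends on the run and is not specified here.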

Circularity Check

0 steps flagged

No circularity: purely empirical measurements from controlled training runs

The paper reports direct experimental results from training a family of models on datasets with a controlled small fraction of repeated tokens. Key claims, such as the degradation of an 800M model to the performance level of a 400M model by repeating 0.1% of data 100 times, are presented as measured outcomes from these runs rather than as outputs of any mathematical derivation or scaling-law equation that reduces to fitted inputs by construction. The suspicion that memorization consumes capacity is explicitly labeled as a hypothesis, not a derived result. Links to induction heads and copying mechanisms are based on post-training mechanistic analysis of the actual models. No self-citation chains, ansatzes, or uniqueness theorems are invoked as load-bearing premises; the work is self-contained through reproducible training experiments and does not rely on prior author results to close any logical loop.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard assumptions of neural network training dynamics and the interpretation that capacity is consumed by memorization; no new free parameters or invented entities are introduced beyond the experimental repetition schedule.

free parameters (1)

repetition count and fraction
The specific values (0.1% repeated 100 times) are chosen experimentally to demonstrate the degradation effect.

axioms (1)

domain assumption: Model capacity is finite and can be allocated between memorization and generalization
Invoked when explaining why repetition leads to performance loss.

pith-pipeline@v0.9.0 · 5666 in / 1298 out tokens · 46892 ms · 2026-05-17T15:44:32.218364+00:00 · methodology

Original abstract

Recent large language models have been trained on vast datasets, but also often on repeated data, either intentionally for the purpose of upweighting higher quality data, or unintentionally because data deduplication is not perfect and the model is exposed to repeated data at the sentence, paragraph, or document level. Some works have reported substantial negative performance effects of this repeated data. In this paper we attempt to study repeated data systematically and to understand its effects mechanistically. To do this, we train a family of models where most of the data is unique but a small fraction of it is repeated many times. We find a strong double descent phenomenon, in which repeated data can lead test loss to increase midway through training. A predictable range of repetition frequency leads to surprisingly severe degradation in performance. For instance, performance of an 800M parameter model can be degraded to that of a 2x smaller model (400M params) by repeating 0.1% of the data 100 times, despite the other 90% of the training tokens remaining unique. We suspect there is a range in the middle where the data can be memorized and doing so consumes a large fraction of the model's capacity, and this may be where the peak of degradation occurs. Finally, we connect these observations to recent mechanistic interpretability work - attempting to reverse engineer the detailed computations performed by the model - by showing that data repetition disproportionately damages copying and internal structures associated with generalization, such as induction heads, providing a possible mechanism for the shift from generalization to memorization. Taken together, these results provide a hypothesis for why repeating a relatively small fraction of data in large language models could lead to disproportionately large harms to performance.

Forward citations

Cited by 32 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

DataComp-VLM: Improved Open Datasets for Vision-Language Models
cs.CV 2026-06 conditional novelty 8.0

DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
cs.CL 2023-04 accept novelty 8.0

Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.
Scaling Laws for Mixture Pretraining Under Data Constraints
cs.LG 2026-05 conditional novelty 7.0

Repetition-aware scaling laws show scarce target data in pretraining mixtures can be repeated 15-20 times optimally, with the best count depending on data size, compute, and model scale.
When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models
cs.CL 2026-05 unverdicted novelty 7.0

LLM first-answer accuracy on procedural arithmetic drops from 61% on 5-step tasks to 20% on 95-step tasks, with frequent failures including skipped steps, premature answers, and hallucinated operations.
DataComp-VLM: Improved Open Datasets for Vision-Language Models
cs.CV 2026-06 unverdicted novelty 6.0

DataComp-VLM benchmark shows instruction-heavy data mixtures outperform caption-heavy ones for VLM training, with DCVLM-Baseline reaching 63.6% on 33 tasks using 200B tokens, +5.4pp over FineVision.
Internal Data Repetition Destroys Language Models
cs.LG 2026-06 unverdicted novelty 6.0

Repetition of training data produces a systematic eval loss peak at intermediate repeat counts whose location scales with model size, quantifiable as large compute-equivalent loss even at modest repetition fractions.
Demystifying Training-Time Augmentation for Data-Constrained Language Model Pretraining
cs.LG 2026-06 unverdicted novelty 6.0

Training-time augmentations in token noise, permutation, and offset categories reduce overfitting and improve minimum validation loss in multi-epoch autoregressive pretraining on fixed corpora.
Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention
cs.LG 2026-05 unverdicted novelty 6.0

Larger models succeed on rare and complex tasks by reducing gradient interference from common tasks, allowing rare-task features to accumulate, as shown via synthetic task mixtures and OLMo pretraining from 4M to 4B p...
Predictable Confabulations: Factual Recall by LLMs Scales with Model Size and Topic Frequency
cs.CL 2026-05 unverdicted novelty 6.0

Factual recall quality in LLMs follows a sigmoid scaling law in the log-linear combination of model parameter count and topic frequency in training data, explaining 60% of variance across models and up to 94% within families.
Mix, Don't Tune: Bilingual Pre-Training Outperforms Hyperparameter Search in Data-Constrained Settings
cs.LG 2026-05 conditional novelty 6.0

Mixing auxiliary high-resource language data outperforms hyperparameter tuning in data-constrained bilingual pre-training, with gains equivalent to 2-13 times more unique target data.
Scaling Laws for Mixture Pretraining Under Data Constraints
cs.LG 2026-05 unverdicted novelty 6.0

Empirical study shows mixture pretraining tolerates higher target data repetition than single-source training, with a new repetition-aware scaling law enabling principled mixture selection based on data size, compute,...
Practical Scaling Laws: Converting Compute into Performance in a Data-Constrained World
cs.LG 2026-05 conditional novelty 6.0

A new scaling law L(N, D, T) = E + (L0 - E) h/(1+h) with h = a/N^α + b/T^β + c N^γ/D^δ that decomposes loss into undercapacity, undertraining, and overfitting terms and saturates between E and L0.
Prescriptive Scaling Laws for Data Constrained Training
cs.LG 2026-05 unverdicted novelty 6.0

A one-parameter scaling law models excess loss from data repetition as an additive overfitting penalty, recommending model capacity increases over excessive repetition and showing that strong weight decay reduces the ...
When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models
cs.CL 2026-05 unverdicted novelty 6.0

LLM accuracy on controlled procedural arithmetic drops from 61% at 5 steps to 20% at 95 steps, with failures including skipped steps, premature answers, and hallucinated operations.
When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models
cs.CL 2026-05 unverdicted novelty 6.0

A new benchmark shows LLM first-answer accuracy on procedural arithmetic drops from 63% (5 steps) to 20% (95 steps) due to execution failures like skipped steps and premature answers.
Mixture-of-Experts Can Surpass Dense LLMs Under Strictly Equal Resource
cs.CL 2025-06 conditional novelty 6.0

MoE models with activation rates in an optimal region outperform dense LLMs of identical total parameter count, training compute, and data budget, with the optimal region consistent across scales.
MAGI-1: Autoregressive Video Generation at Scale
cs.CV 2025-05 unverdicted novelty 6.0

MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.
Superposition Yields Robust Neural Scaling
cs.LG 2025-05 conditional novelty 6.0

Strong superposition causes neural loss to scale as the inverse of model dimension due to geometric feature overlaps, explaining scaling laws for broad frequency distributions.
Two-Point Deterministic Equivalence for Stochastic Gradient Dynamics in Linear Models
cond-mat.dis-nn 2025-02 unverdicted novelty 6.0

Derives a novel two-point deterministic equivalence for random matrix resolvents to obtain unified asymptotics for SGD-trained linear regression, kernel regression, and random feature models.
The Falcon Series of Open Language Models
cs.CL 2023-11 conditional novelty 6.0

Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
cs.CL 2023-06 unverdicted novelty 6.0

Properly filtered web data from CommonCrawl alone trains LLMs that significantly outperform models trained on The Pile, with 600 billion tokens and 1.3B/7.5B parameter models released.
Scaling Data-Constrained Language Models
cs.CL 2023-05 conditional novelty 6.0

Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.
The False Promise of Imitating Proprietary LLMs
cs.CL 2023-05 conditional novelty 6.0

Finetuning open LMs on ChatGPT outputs creates models that mimic style and fool human raters but fail to close the performance gap to proprietary systems on tasks not well-represented in the imitation data.
SLAP: Stratified Loss-based Pruning for On-Policy Data-Efficient Instruction Tuning
cs.CL 2026-05 unverdicted novelty 5.0

SLAP is a new batch-aware pruning framework that uses distribution-aware stratified sampling and Hessian-approximated gradients to select data, claiming 20-40% less data while matching or exceeding full-dataset perfor...
Rethinking Data Curation in LLM Training: Online Reweighting Offers Better Generalization than Offline Methods
cs.LG 2026-04 unverdicted novelty 5.0

ADAPT is an online reweighting framework for LLM training that outperforms offline data selection and mixing methods in cross-benchmark generalization under equal compute.
Similarity Field Theory: A Mathematical Framework for Intelligence
cs.AI 2025-09 unverdicted novelty 5.0

Similarity Field Theory defines a similarity field over entities, concepts as superlevel-set fibers, and intelligence as a generative operator that preserves fiber membership under evolution.
StarCoder: may the source be with you!
cs.CL 2023-05 accept novelty 5.0

StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.
Galactica: A Large Language Model for Science
cs.CL 2022-11 unverdicted novelty 5.0

Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
cs.CV 2025-02 unverdicted novelty 4.0

Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.
Large Language Models: A Survey
cs.CL 2024-02 accept novelty 3.0

The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.
A Survey of Large Language Models
cs.CL 2023-03 accept novelty 3.0

This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.
Is More Data Worth the Cost? Dataset Scaling Laws in a Tiny Attention-Only Decoder
cs.LG 2026-04 unverdicted novelty 2.0

A reduced attention-only decoder shows diminishing returns in dataset scaling, reaching 90% of full accuracy with only 30% of the data.

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · cited by 28 Pith papers · 20 internal anchors

[1]

Learning Transferable Visual Models From Natural Language Supervision

Radford, Alec and Kim, Jong Wook and Hallacy, Chris and Ramesh, Aditya and Goh, Gabriel and Agarwal, Sandhini and Sastry, Girish and Askell, Amanda and Mishkin, Pamela and Clark, Jack and Krueger, Gretchen and Sutskever, Ilya , copyright =. Learning Transferable Visual Models From Natural Language Supervision , url =. 2021 , bdsk-url-1 =. doi:10.48550/ARX...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2103.00020 2021
[2]

Distill , author =

Goh, Gabriel and Nick, Cammarata and Chelsea, Voss and Carter, Shan and Petrov, Michael and Schubert, Ludwig and Radford, Alec and Olah, Chris , date-added =. Multimodal Neurons in Artificial Neural Networks , year =. doi:10.23915/distill.00030 , journal =

work page doi:10.23915/distill.00030
[3]

In-context Learning and Induction Heads , year =

Olsson, Catherine and Elhage, Nelson and Nanda, Neel and Joseph, Nicholas and DasSarma, Nova and Henighan, Tom and Mann, Ben and Askell, Amanda and Bai, Yuntao and Chen, Anna and Conerly, Tom and Drain, Dawn and Ganguli, Deep and Hatfield-Dodds, Zac and Hernandez, Danny and Johnston, Scott and Jones, Andy and Kernion, Jackson and Lovitt, Liane and Ndousse...

work page
[4]

Ouyang, Long and Wu, Jeff and Jiang, Xu and Almeida, Diogo and Wainwright, Carroll L. and Mishkin, Pamela and Zhang, Chong and Agarwal, Sandhini and Slama, Katarina and Ray, Alex and Schulman, John and Hilton, Jacob and Kelton, Fraser and Miller, Luke and Simens, Maddie and Askell, Amanda and Welinder, Peter and Christiano, Paul and Leike, Jan and Lowe, R...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2203.02155 2022
[5]

A Variational Approach to Learning Curves , url =

Malzahn, D\". A Variational Approach to Learning Curves , url =. Advances in Neural Information Processing Systems , date-added =. 2001 , bdsk-url-1 =

work page 2001
[6]

Statistical mechanics of learning: Generalization , year =

Opper, Manfred , date-added =. Statistical mechanics of learning: Generalization , year =. doi:10.1007/978-1-4612-0723-8_5 , isbn =

work page doi:10.1007/978-1-4612-0723-8_5
[7]

PALM: Pre-training an Autoencoding and Autoregressive Language Model for Context-conditioned Generation , url =

Bi, Bin and Li, Chenliang and Wu, Chen and Yan, Ming and Wang, Wei and Huang, Songfang and Huang, Fei and Si, Luo , copyright =. PALM: Pre-training an Autoencoding and Autoregressive Language Model for Context-conditioned Generation , url =. 2020 , bdsk-url-1 =. doi:10.48550/ARXIV.2004.07159 , keywords =

work page doi:10.48550/arxiv.2004.07159 2020
[8]

Predictability and Surprise in Large Generative Models , url =

Ganguli, Deep and Hernandez, Danny and Lovitt, Liane and DasSarma, Nova and Henighan, Tom and Jones, Andy and Joseph, Nicholas and Kernion, Jackson and Mann, Ben and Askell, Amanda and Bai, Yuntao and Chen, Anna and Conerly, Tom and Drain, Dawn and Elhage, Nelson and Showk, Sheer El and Fort, Stanislav and Hatfield-Dodds, Zac and Johnston, Scott and Krave...

work page doi:10.48550/arxiv.2202.07785 2022
[9]

Deep Learning Scaling is Predictable, Empirically

Hestness, Joel and Narang, Sharan and Ardalani, Newsha and Diamos, Gregory and Jun, Heewoo and Kianinejad, Hassan and Patwary, Md. Mostofa Ali and Yang, Yang and Zhou, Yanqi , copyright =. Deep Learning Scaling is Predictable, Empirically , url =. 2017 , bdsk-url-1 =. doi:10.48550/ARXIV.1712.00409 , keywords =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1712.00409 2017
[10]

Attention Is All You Need

Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N. and Kaiser, Lukasz and Polosukhin, Illia , copyright =. Attention Is All You Need , url =. 2017 , bdsk-url-1 =. doi:10.48550/ARXIV.1706.03762 , keywords =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1706.03762 2017
[11]

HuggingFace's Transformers: State-of-the-art Natural Language Processing

Wolf, Thomas and Debut, Lysandre and Sanh, Victor and Chaumond, Julien and Delangue, Clement and Moi, Anthony and Cistac, Pierric and Rault, Tim and Louf, R. HuggingFace's Transformers: State-of-the-art Natural Language Processing , url =. 2019 , bdsk-url-1 =. doi:10.48550/ARXIV.1910.03771 , keywords =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1910.03771 2019
[12]

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor , copyright =. The Pile: An 800GB Dataset of Diverse Text for Language Modeling , url =. 2021 , bdsk-url-1 =. doi:10.48550/ARXIV.2101.00027 , keywords =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2101.00027 2021
[13]

A General Language Assistant as a Laboratory for Alignment

Askell, Amanda and Bai, Yuntao and Chen, Anna and Drain, Dawn and Ganguli, Deep and Henighan, Tom and Jones, Andy and Joseph, Nicholas and Mann, Ben and DasSarma, Nova and Elhage, Nelson and Hatfield-Dodds, Zac and Hernandez, Danny and Kernion, Jackson and Ndousse, Kamal and Olsson, Catherine and Amodei, Dario and Brown, Tom and Clark, Jack and McCandlish...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2112.00861 2021
[14]

doi:10.23915/distill.00024 , note =

Cammarata, Nick and Carter, Shan and Goh, Gabriel and Olah, Chris and Petrov, Michael and Schubert, Ludwig and Voss, Chelsea and Egan, Ben and Lim, Swee Kiat , date-added =. Thread: Circuits , year =. doi:10.23915/distill.00024 , journal =

work page doi:10.23915/distill.00024
[15]

A Mathematical Framework for Transformer Circuits , year =

Elhage, Nelson and Nanda, Neel and Olsson, Catherine and Henighan, Tom and Joseph, Nicholas and Mann, Ben and Askell, Amanda and Bai, Yuntao and Chen, Anna and Conerly, Tom and DasSarma, Nova and Drain, Dawn and Ganguli, Deep and Hatfield-Dodds, Zac and Hernandez, Danny and Jones, Andy and Kernion, Jackson and Lovitt, Liane and Ndousse, Kamal and Amodei, ...

work page
[16]

Scaling Laws for the Few-Shot Adaptation of Pre-trained Image Classifiers , url =

Prato, Gabriele and Guiroy, Simon and Caballero, Ethan and Rish, Irina and Chandar, Sarath , copyright =. Scaling Laws for the Few-Shot Adaptation of Pre-trained Image Classifiers , url =. 2021 , bdsk-url-1 =. doi:10.48550/ARXIV.2110.06990 , keywords =

work page doi:10.48550/arxiv.2110.06990 2021
[17]

Scaling laws for acoustic mode ls

Droppo, Jasha and Elibol, Oguz , copyright =. Scaling Laws for Acoustic Models , url =. 2021 , bdsk-url-1 =. doi:10.48550/ARXIV.2106.09488 , keywords =

work page doi:10.48550/arxiv.2106.09488 2021
[18]

GitHub Copilot: Parrot or crow? , url =

Albert Ziegler , date-added =. GitHub Copilot: Parrot or crow? , url =

work page
[19]

Quantifying Memorization Across Neural Language Models

Carlini, Nicholas and Ippolito, Daphne and Jagielski, Matthew and Lee, Katherine and Tramer, Florian and Zhang, Chiyuan , copyright =. Quantifying Memorization Across Neural Language Models , url =. 2022 , bdsk-url-1 =. doi:10.48550/ARXIV.2202.07646 , keywords =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2202.07646 2022
[20]

Learning to summarize from human feedback

Stiennon, Nisan and Ouyang, Long and Wu, Jeff and Ziegler, Daniel M. and Lowe, Ryan and Voss, Chelsea and Radford, Alec and Amodei, Dario and Christiano, Paul , copyright =. Learning to summarize from human feedback , url =. 2020 , bdsk-url-1 =. doi:10.48550/ARXIV.2009.01325 , keywords =

work page internal anchor Pith review doi:10.48550/arxiv.2009.01325 2020
[21]

Carlini, F

Carlini, Nicholas and Tramer, Florian and Wallace, Eric and Jagielski, Matthew and Herbert-Voss, Ariel and Lee, Katherine and Roberts, Adam and Brown, Tom and Song, Dawn and Erlingsson, Ulfar and Oprea, Alina and Raffel, Colin , copyright =. Extracting Training Data from Large Language Models , url =. 2020 , bdsk-url-1 =. doi:10.48550/ARXIV.2012.07805 , k...

work page doi:10.48550/arxiv.2012.07805 2020
[22]

arXiv , year =

Nakkiran, Preetum and Kaplun, Gal and Bansal, Yamini and Yang, Tristan and Barak, Boaz and Sutskever, Ilya , copyright =. Deep Double Descent: Where Bigger Models and More Data Hurt , url =. 2019 , bdsk-url-1 =. doi:10.48550/ARXIV.1912.02292 , keywords =

work page doi:10.48550/arxiv.1912.02292 2019
[23]

Double Trouble in Double Descent: Bias and Variance(s) in the Lazy Regime

d'Ascoli, St. Double Trouble in Double Descent: Bias and Variance(s) in the Lazy Regime. 2020. doi:10.48550/ARXIV.2003.01054

work page doi:10.48550/arxiv.2003.01054 2020
[24]

Jamming transition as a paradigm to understand the loss landscape of deep neural networks

Geiger, Mario and Spigler, Stefano and d'Ascoli, St. Jamming transition as a paradigm to understand the loss landscape of deep neural networks. Physical Review E

work page
[25]

High-dimensional dynamics of generalization error in neural networks

Advani, Madhu S. and Saxe, Andrew M. High-dimensional dynamics of generalization error in neural networks. 2017. doi:10.48550/ARXIV.1710.03667

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1710.03667 2017
[26]

Reconciling modern machine learning practice and the bias-variance trade-off

Belkin, Mikhail and Hsu, Daniel and Ma, Siyuan and Mandal, Soumik. Reconciling modern machine learning practice and the bias-variance trade-off. Proceedings of the National Academy of Sciences. 2018. doi:10.48550/ARXIV.1812.11118

work page doi:10.48550/arxiv.1812.11118 2018
[27]

Language models are unsupervised multitask learners

Radford, Alec and Wu, Jeffrey and Child, Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya and others. Language models are unsupervised multitask learners. OpenAI blog

work page
[28]

Skip-Thought Vectors

Kiros, Ryan and Zhu, Yukun and Salakhutdinov, Ruslan and Zemel, Richard S. and Torralba, Antonio and Urtasun, Raquel and Fidler, Sanja. Skip-Thought Vectors. 2015. doi:10.48550/ARXIV.1506.06726

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1506.06726 2015
[29]

Pointer Sentinel Mixture Models

Merity, Stephen and Xiong, Caiming and Bradbury, James and Socher, Richard. Pointer Sentinel Mixture Models. 2016. doi:10.48550/ARXIV.1609.07843

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1609.07843 2016
[30]

Exploring the Limits of Language Modeling

Jozefowicz, Rafal and Vinyals, Oriol and Schuster, Mike and Shazeer, Noam and Wu, Yonghui. Exploring the Limits of Language Modeling. 2016. doi:10.48550/ARXIV.1602.02410

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1602.02410 2016
[31]

transformer-circuits.pub

Catherine, Olsson and Nelson, Elhage and Neel, Nanda and Nicholas, Joseph and Nova, DasSarma and Tom, Henighan and Ben, Mann and Amanda, Askell and Yuntao, Bai and Anna, Chen and Tom, Conerly and Dawn, Drain and Deep, Ganguli and Zac, Hatfield-Dodds and Danny, Hernandez and Scott, Johnston and Andy, Jones and Jackson, Kernion and Liane, Lovitt and Kamal, ...

work page
[32]

Scaling Laws for Autoregressive Generative Modeling

Henighan, Tom and Kaplan, Jared and Katz, Mor and Chen, Mark and Hesse, Christopher and Jackson, Jacob and Jun, Heewoo and Brown, Tom B. and Dhariwal, Prafulla and Gray, Scott and Hallacy, Chris and Mann, Benjamin and Radford, Alec and Ramesh, Aditya and Ryder, Nick and Ziegler, Daniel M. and Schulman, John and Amodei, Dario and McCandlish, Sam , copyrigh...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2010.14701 2020
[33]

Scaling Laws for Neural Language Models

Kaplan, Jared and McCandlish, Sam and Henighan, Tom and Brown, Tom B. and Chess, Benjamin and Child, Rewon and Gray, Scott and Radford, Alec and Wu, Jeffrey and Amodei, Dario. Scaling Laws for Neural Language Models. 2020. doi:10.48550/ARXIV.2001.08361

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2001.08361 2020
[35]

Training Compute-Optimal Large Language Models

Hoffmann, Jordan and Borgeaud, Sebastian and Mensch, Arthur and Buchatskaya, Elena and Cai, Trevor and Rutherford, Eliza and Casas, Diego de Las and Hendricks, Lisa Anne and Welbl, Johannes and Clark, Aidan and Hennigan, Tom and Noland, Eric and Millican, Katie and Driessche, George van den and Damoc, Bogdan and Guy, Aurelia and Osindero, Simon and Simony...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2203.15556 2022
[36]

Brown, Tom B. and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and Agarwal, Sandhini and Herbert-Voss, Ariel and Krueger, Gretchen and Henighan, Tom and Child, Rewon and Ramesh, Aditya and Ziegler, Daniel M. and Wu, Jeffrey and W...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2005.14165 2020
[37]

Rae, Jack W. and Borgeaud, Sebastian and Cai, Trevor and Millican, Katie and Hoffmann, Jordan and Song, Francis and Aslanides, John and Henderson, Sarah and Ring, Roman and Young, Susannah and Rutherford, Eliza and Hennigan, Tom and Menick, Jacob and Cassirer, Albin and Powell, Richard and Driessche, George van den and Hendricks, Lisa Anne and Rauh, Marib...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2112.11446 2021
[38]

AI and Compute

Amodei, Dario and Hernandez, Danny and Sastry, Girish and Clark, Jack and Brockman, Greg and Sutskever, Ilya. AI and Compute. Retrieved from https://blog.openai.com/aiand-compute

work page
[41]

Advani, M. S. and Saxe, A. M. (2017). High-dimensional dynamics of generalization error in neural networks

work page 2017
[42]

Amodei, D., Hernandez, D., Sastry, G., Clark, J., Brockman, G., and Sutskever, I. (2018). AI and compute. Retrieved from https://blog.openai.com/aiand-compute

work page 2018
[43]

Askell, A., Bai, Y., Chen, A., Drain, D., Ganguli, D., Henighan, T., Jones, A., Joseph, N., Mann, B., DasSarma, N., Elhage, N., Hatfield-Dodds, Z., Hernandez, D., Kernion, J., Ndousse, K., Olsson, C., Amodei, D., Brown, T., Clark, J., McCandlish, S., Olah, C., and Kaplan, J. (2021). A general language assistant as a laboratory for alignment

work page 2021
[44]

Belkin, M., Hsu, D., Ma, S., and Mandal, S. (2018). Reconciling modern machine learning practice and the bias-variance trade-off

work page 2018
[45]

Bi, B., Li, C., Wu, C., Yan, M., Wang, W., Huang, S., Huang, F., and Si, L. (2020). Palm: Pre-training an autoencoding and autoregressive language model for context-conditioned generation

work page 2020
[46]

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, ...

work page 2020
[47]

Cammarata, N., Carter, S., Goh, G., Olah, C., Petrov, M., Schubert, L., Voss, C., Egan, B., and Lim, S. K. (2020). Thread: Circuits. Distill. https://distill.pub/2020/circuits

work page 2020
[48]

Droppo, J. and Elibol, O. (2021). Scaling laws for acoustic models

work page 2021
[49]

Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., DasSarma, N., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S., and Olah, C. (2021). A mathematical framework for transfor...

work page 2021
[50]

Ganguli, D., Hernandez, D., Lovitt, L., DasSarma, N., Henighan, T., Jones, A., Joseph, N., Kernion, J., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., Drain, D., Elhage, N., Showk, S. E., Fort, S., Hatfield-Dodds, Z., Johnston, S., Kravec, S., Nanda, N., Ndousse, K., Olsson, C., Amodei, D., Amodei, D., Brown, T., Kaplan, J., McCandlish, S., Olah, C...

work page 2022
[51]

Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., Presser, S., and Leahy, C. (2021). The pile: An 800gb dataset of diverse text for language modeling

work page 2021
[52]

Geiger, M., Spigler, S., d'Ascoli, S., Sagun, L., Baity-Jesi, M., Biroli, G., and Wyart, M. (2019). Jamming transition as a paradigm to understand the loss landscape of deep neural networks. Physical Review E, 100(1):012115

work page 2019
[53]

Goh, G., Cammarata, N., Voss, C., Carter, S., Petrov, M., Schubert, L., Radford, A., and Olah, C. (2021). Multimodal neurons in artificial neural networks. Distill. https://distill.pub/2021/multimodal-neurons

work page 2021
[54]

Henighan, T., Kaplan, J., Katz, M., Chen, M., Hesse, C., Jackson, J., Jun, H., Brown, T. B., Dhariwal, P., Gray, S., Hallacy, C., Mann, B., Radford, A., Ramesh, A., Ryder, N., Ziegler, D. M., Schulman, J., Amodei, D., and McCandlish, S. (2020). Scaling laws for autoregressive generative modeling

work page 2020
[55]

Hernandez, D. and Brown, T. B. (2020). Measuring the algorithmic efficiency of neural networks. CoRR, abs/2005.04305

work page arXiv 2020
[56]

Hernandez, D., Kaplan, J., Henighan, T., and McCandlish, S. (2021). Scaling laws for transfer. arXiv preprint arXiv:2102.01293

work page internal anchor Pith review arXiv 2021
[57]

Hestness, J., Narang, S., Ardalani, N., Diamos, G., Jun, H., Kianinejad, H., Patwary, M. M. A., Yang, Y., and Zhou, Y. (2017). Deep learning scaling is predictable, empirically

work page 2017
[58]

Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. d. L., Hendricks, L. A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., Driessche, G. v. d., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., Rae, J. W., Vinyals, O., and Sifre, L. (2022). Training compute-optimal large language models

work page 2022
[59]

Jozefowicz, R., Vinyals, O., Schuster, M., Shazeer, N., and Wu, Y. (2016). Exploring the limits of language modeling

work page 2016
[60]

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. (2020). Scaling laws for neural language models

work page 2020
[61]

Kiros, R., Zhu, Y., Salakhutdinov, R., Zemel, R. S., Torralba, A., Urtasun, R., and Fidler, S. (2015). Skip-thought vectors

work page 2015
[62]

Lee, K., Ippolito, D., Nystrom, A., Zhang, C., Eck, D., Callison-Burch, C., and Carlini, N. (2021). Deduplicating training data makes language models better. arXiv preprint arXiv:2107.06499

work page internal anchor Pith review Pith/arXiv arXiv 2021
[63]

Malzahn, D. and Opper, M. (2001). A variational approach to learning curves. In Dietterich, T., Becker, S., and Ghahramani, Z., editors, Advances in Neural Information Processing Systems, volume 14. MIT Press

work page 2001
[64]

Merity, S., Xiong, C., Bradbury, J., and Socher, R. (2016). Pointer sentinel mixture models

work page 2016
[65]

Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., and Sutskever, I. (2019). Deep double descent: Where bigger models and more data hurt

work page 2019
[66]

Olsson, C., Elhage, N., Nanda, N., Joseph, N., DasSarma, N., Henighan, T., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Johnston, S., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S., and Olah, C. (2022). In-context learning and...

work page 2022
[67]

Opper, M. (1995). Statistical mechanics of learning: Generalization

work page 1995
[68]

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., and Lowe, R. (2022). Training language models to follow instructions with human feedback

work page 2022
[69]

Prato, G., Guiroy, S., Caballero, E., Rish, I., and Chandar, S. (2021). Scaling laws for the few-shot adaptation of pre-trained image classifiers

work page 2021
[70]

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. (2021). Learning transferable visual models from natural language supervision

work page 2021
[71]

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. (2019). Language models are unsupervised multitask learners. OpenAI blog, 1(8):9

work page 2019
[72]

Rae, J. W., Borgeaud, S., Cai, T., Millican, K., Hoffmann, J., Song, F., Aslanides, J., Henderson, S., Ring, R., Young, S., Rutherford, E., Hennigan, T., Menick, J., Cassirer, A., Powell, R., Driessche, G. v. d., Hendricks, L. A., Rauh, M., Huang, P.-S., Glaese, A., Welbl, J., Dathathri, S., Huang, S., Uesato, J., Mellor, J., Higgins, I., Creswell, A., Mc...

work page 2021
[73]

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention is all you need

work page 2017
[74]

Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Scao, T. L., Gugger, S., Drame, M., Lhoest, Q., and Rush, A. M. (2019). Huggingface's transformers: State-of-the-art natural language processing

work page 2019

[1] [1]

Learning Transferable Visual Models From Natural Language Supervision

Radford, Alec and Kim, Jong Wook and Hallacy, Chris and Ramesh, Aditya and Goh, Gabriel and Agarwal, Sandhini and Sastry, Girish and Askell, Amanda and Mishkin, Pamela and Clark, Jack and Krueger, Gretchen and Sutskever, Ilya , copyright =. Learning Transferable Visual Models From Natural Language Supervision , url =. 2021 , bdsk-url-1 =. doi:10.48550/ARX...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2103.00020 2021

[2] [2]

Distill , author =

Goh, Gabriel and Nick, Cammarata and Chelsea, Voss and Carter, Shan and Petrov, Michael and Schubert, Ludwig and Radford, Alec and Olah, Chris , date-added =. Multimodal Neurons in Artificial Neural Networks , year =. doi:10.23915/distill.00030 , journal =

work page doi:10.23915/distill.00030

[3] [3]

In-context Learning and Induction Heads , year =

Olsson, Catherine and Elhage, Nelson and Nanda, Neel and Joseph, Nicholas and DasSarma, Nova and Henighan, Tom and Mann, Ben and Askell, Amanda and Bai, Yuntao and Chen, Anna and Conerly, Tom and Drain, Dawn and Ganguli, Deep and Hatfield-Dodds, Zac and Hernandez, Danny and Johnston, Scott and Jones, Andy and Kernion, Jackson and Lovitt, Liane and Ndousse...

work page

[4] [4]

Ouyang, Long and Wu, Jeff and Jiang, Xu and Almeida, Diogo and Wainwright, Carroll L. and Mishkin, Pamela and Zhang, Chong and Agarwal, Sandhini and Slama, Katarina and Ray, Alex and Schulman, John and Hilton, Jacob and Kelton, Fraser and Miller, Luke and Simens, Maddie and Askell, Amanda and Welinder, Peter and Christiano, Paul and Leike, Jan and Lowe, R...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2203.02155 2022

[5] [5]

A Variational Approach to Learning Curves , url =

Malzahn, D\". A Variational Approach to Learning Curves , url =. Advances in Neural Information Processing Systems , date-added =. 2001 , bdsk-url-1 =

work page 2001

[6] [6]

Statistical mechanics of learning: Generalization , year =

Opper, Manfred , date-added =. Statistical mechanics of learning: Generalization , year =. doi:10.1007/978-1-4612-0723-8_5 , isbn =

work page doi:10.1007/978-1-4612-0723-8_5

[7] [7]

PALM: Pre-training an Autoencoding and Autoregressive Language Model for Context-conditioned Generation , url =

Bi, Bin and Li, Chenliang and Wu, Chen and Yan, Ming and Wang, Wei and Huang, Songfang and Huang, Fei and Si, Luo , copyright =. PALM: Pre-training an Autoencoding and Autoregressive Language Model for Context-conditioned Generation , url =. 2020 , bdsk-url-1 =. doi:10.48550/ARXIV.2004.07159 , keywords =

work page doi:10.48550/arxiv.2004.07159 2020

[8] [8]

Predictability and Surprise in Large Generative Models , url =

Ganguli, Deep and Hernandez, Danny and Lovitt, Liane and DasSarma, Nova and Henighan, Tom and Jones, Andy and Joseph, Nicholas and Kernion, Jackson and Mann, Ben and Askell, Amanda and Bai, Yuntao and Chen, Anna and Conerly, Tom and Drain, Dawn and Elhage, Nelson and Showk, Sheer El and Fort, Stanislav and Hatfield-Dodds, Zac and Johnston, Scott and Krave...

work page doi:10.48550/arxiv.2202.07785 2022

[9] [9]

Deep Learning Scaling is Predictable, Empirically

Hestness, Joel and Narang, Sharan and Ardalani, Newsha and Diamos, Gregory and Jun, Heewoo and Kianinejad, Hassan and Patwary, Md. Mostofa Ali and Yang, Yang and Zhou, Yanqi , copyright =. Deep Learning Scaling is Predictable, Empirically , url =. 2017 , bdsk-url-1 =. doi:10.48550/ARXIV.1712.00409 , keywords =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1712.00409 2017

[10] [10]

Attention Is All You Need

Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N. and Kaiser, Lukasz and Polosukhin, Illia , copyright =. Attention Is All You Need , url =. 2017 , bdsk-url-1 =. doi:10.48550/ARXIV.1706.03762 , keywords =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1706.03762 2017

[11] [11]

HuggingFace's Transformers: State-of-the-art Natural Language Processing

Wolf, Thomas and Debut, Lysandre and Sanh, Victor and Chaumond, Julien and Delangue, Clement and Moi, Anthony and Cistac, Pierric and Rault, Tim and Louf, R. HuggingFace's Transformers: State-of-the-art Natural Language Processing , url =. 2019 , bdsk-url-1 =. doi:10.48550/ARXIV.1910.03771 , keywords =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1910.03771 2019

[12] [12]

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor , copyright =. The Pile: An 800GB Dataset of Diverse Text for Language Modeling , url =. 2021 , bdsk-url-1 =. doi:10.48550/ARXIV.2101.00027 , keywords =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2101.00027 2021

[13] [13]

A General Language Assistant as a Laboratory for Alignment

Askell, Amanda and Bai, Yuntao and Chen, Anna and Drain, Dawn and Ganguli, Deep and Henighan, Tom and Jones, Andy and Joseph, Nicholas and Mann, Ben and DasSarma, Nova and Elhage, Nelson and Hatfield-Dodds, Zac and Hernandez, Danny and Kernion, Jackson and Ndousse, Kamal and Olsson, Catherine and Amodei, Dario and Brown, Tom and Clark, Jack and McCandlish...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2112.00861 2021

[14] [14]

doi:10.23915/distill.00024 , note =

Cammarata, Nick and Carter, Shan and Goh, Gabriel and Olah, Chris and Petrov, Michael and Schubert, Ludwig and Voss, Chelsea and Egan, Ben and Lim, Swee Kiat , date-added =. Thread: Circuits , year =. doi:10.23915/distill.00024 , journal =

work page doi:10.23915/distill.00024

[15] [15]

A Mathematical Framework for Transformer Circuits , year =

Elhage, Nelson and Nanda, Neel and Olsson, Catherine and Henighan, Tom and Joseph, Nicholas and Mann, Ben and Askell, Amanda and Bai, Yuntao and Chen, Anna and Conerly, Tom and DasSarma, Nova and Drain, Dawn and Ganguli, Deep and Hatfield-Dodds, Zac and Hernandez, Danny and Jones, Andy and Kernion, Jackson and Lovitt, Liane and Ndousse, Kamal and Amodei, ...

work page

[16] [16]

Scaling Laws for the Few-Shot Adaptation of Pre-trained Image Classifiers , url =

Prato, Gabriele and Guiroy, Simon and Caballero, Ethan and Rish, Irina and Chandar, Sarath , copyright =. Scaling Laws for the Few-Shot Adaptation of Pre-trained Image Classifiers , url =. 2021 , bdsk-url-1 =. doi:10.48550/ARXIV.2110.06990 , keywords =

work page doi:10.48550/arxiv.2110.06990 2021

[17] [17]

Scaling laws for acoustic mode ls

Droppo, Jasha and Elibol, Oguz , copyright =. Scaling Laws for Acoustic Models , url =. 2021 , bdsk-url-1 =. doi:10.48550/ARXIV.2106.09488 , keywords =

work page doi:10.48550/arxiv.2106.09488 2021

[18] [18]

GitHub Copilot: Parrot or crow? , url =

Albert Ziegler , date-added =. GitHub Copilot: Parrot or crow? , url =

work page

[19] [19]

Quantifying Memorization Across Neural Language Models

Carlini, Nicholas and Ippolito, Daphne and Jagielski, Matthew and Lee, Katherine and Tramer, Florian and Zhang, Chiyuan , copyright =. Quantifying Memorization Across Neural Language Models , url =. 2022 , bdsk-url-1 =. doi:10.48550/ARXIV.2202.07646 , keywords =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2202.07646 2022

[20] [20]

Learning to summarize from human feedback

Stiennon, Nisan and Ouyang, Long and Wu, Jeff and Ziegler, Daniel M. and Lowe, Ryan and Voss, Chelsea and Radford, Alec and Amodei, Dario and Christiano, Paul , copyright =. Learning to summarize from human feedback , url =. 2020 , bdsk-url-1 =. doi:10.48550/ARXIV.2009.01325 , keywords =

work page internal anchor Pith review doi:10.48550/arxiv.2009.01325 2020

[21] [21]

Carlini, F

Carlini, Nicholas and Tramer, Florian and Wallace, Eric and Jagielski, Matthew and Herbert-Voss, Ariel and Lee, Katherine and Roberts, Adam and Brown, Tom and Song, Dawn and Erlingsson, Ulfar and Oprea, Alina and Raffel, Colin , copyright =. Extracting Training Data from Large Language Models , url =. 2020 , bdsk-url-1 =. doi:10.48550/ARXIV.2012.07805 , k...

work page doi:10.48550/arxiv.2012.07805 2020

[22] [22]

arXiv , year =

Nakkiran, Preetum and Kaplun, Gal and Bansal, Yamini and Yang, Tristan and Barak, Boaz and Sutskever, Ilya , copyright =. Deep Double Descent: Where Bigger Models and More Data Hurt , url =. 2019 , bdsk-url-1 =. doi:10.48550/ARXIV.1912.02292 , keywords =

work page doi:10.48550/arxiv.1912.02292 2019

[23] [23]

Double Trouble in Double Descent : Bias and Variance(s) in the Lazy Regime , url =

d'Ascoli, St. Double Trouble in Double Descent : Bias and Variance(s) in the Lazy Regime , url =. 2020 , bdsk-url-1 =. doi:10.48550/ARXIV.2003.01054 , keywords =

work page doi:10.48550/arxiv.2003.01054 2020

[24] [24]

Jamming transition as a paradigm to understand the loss landscape of deep neural networks , volume =

Geiger, Mario and Spigler, Stefano and d'Ascoli, St. Jamming transition as a paradigm to understand the loss landscape of deep neural networks , volume =. Physical Review E , number =

work page

[25] [25]

High-dimensional dynamics of generalization error in neural networks

Advani, Madhu S. and Saxe, Andrew M. , copyright =. High-dimensional dynamics of generalization error in neural networks , url =. 2017 , bdsk-url-1 =. doi:10.48550/ARXIV.1710.03667 , keywords =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1710.03667 2017

[26] [26]

Proceedings of the National Academy of Sciences , volume =

Belkin, Mikhail and Hsu, Daniel and Ma, Siyuan and Mandal, Soumik , copyright =. Reconciling modern machine learning practice and the bias-variance trade-off , url =. 2018 , bdsk-url-1 =. doi:10.48550/ARXIV.1812.11118 , keywords =

work page doi:10.48550/arxiv.1812.11118 2018

[27] [27]

Language models are unsupervised multitask learners , volume =

Radford, Alec and Wu, Jeffrey and Child, Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya and others , date-added =. Language models are unsupervised multitask learners , volume =. OpenAI blog , number =

work page

[28] [28]

Skip-Thought Vectors

Kiros, Ryan and Zhu, Yukun and Salakhutdinov, Ruslan and Zemel, Richard S. and Torralba, Antonio and Urtasun, Raquel and Fidler, Sanja , copyright =. Skip-Thought Vectors , url =. 2015 , bdsk-url-1 =. doi:10.48550/ARXIV.1506.06726 , keywords =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1506.06726 2015

[29] [29]

Pointer Sentinel Mixture Models

Merity, Stephen and Xiong, Caiming and Bradbury, James and Socher, Richard , copyright =. Pointer Sentinel Mixture Models , url =. 2016 , bdsk-url-1 =. doi:10.48550/ARXIV.1609.07843 , keywords =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1609.07843 2016

[30] [30]

Exploring the Limits of Language Modeling

Jozefowicz, Rafal and Vinyals, Oriol and Schuster, Mike and Shazeer, Noam and Wu, Yonghui , copyright =. Exploring the Limits of Language Modeling , url =. 2016 , bdsk-url-1 =. doi:10.48550/ARXIV.1602.02410 , keywords =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1602.02410 2016

[31] [31]

transfoerm-circuits.pub , title =

Catherine, Olsson and Nelson, Elhage and Neel, Nanda and Nicholas, Joseph and Nova, DasSarma and Tom, Henighan and Ben, Mann and Amanda, Askell and Yuntao, Bai and Anna, Chen and Tom, Conerly and Dawn, Drain and Deep, Ganguli and Zac, Hatfield-Dodds and Danny, Hernandez and Scott, Johnston and Andy, Jones and Jackson, Kernion and Liane, Lovitt and Kamal, ...

work page

[32] [32]

Scaling Laws for Autoregressive Generative Modeling

Henighan, Tom and Kaplan, Jared and Katz, Mor and Chen, Mark and Hesse, Christopher and Jackson, Jacob and Jun, Heewoo and Brown, Tom B. and Dhariwal, Prafulla and Gray, Scott and Hallacy, Chris and Mann, Benjamin and Radford, Alec and Ramesh, Aditya and Ryder, Nick and Ziegler, Daniel M. and Schulman, John and Amodei, Dario and McCandlish, Sam , copyrigh...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2010.14701 2020

[33] [33]

Scaling Laws for Neural Language Models

Kaplan, Jared and McCandlish, Sam and Henighan, Tom and Brown, Tom B. and Chess, Benjamin and Child, Rewon and Gray, Scott and Radford, Alec and Wu, Jeffrey and Amodei, Dario , copyright =. Scaling Laws for Neural Language Models , url =. 2020 , bdsk-url-1 =. doi:10.48550/ARXIV.2001.08361 , keywords =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2001.08361 2020

[34] [35]

Training Compute-Optimal Large Language Models

Hoffmann, Jordan and Borgeaud, Sebastian and Mensch, Arthur and Buchatskaya, Elena and Cai, Trevor and Rutherford, Eliza and Casas, Diego de Las and Hendricks, Lisa Anne and Welbl, Johannes and Clark, Aidan and Hennigan, Tom and Noland, Eric and Millican, Katie and Driessche, George van den and Damoc, Bogdan and Guy, Aurelia and Osindero, Simon and Simony...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2203.15556 2022

[35] [36]

Brown, Tom B. and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and Agarwal, Sandhini and Herbert-Voss, Ariel and Krueger, Gretchen and Henighan, Tom and Child, Rewon and Ramesh, Aditya and Ziegler, Daniel M. and Wu, Jeffrey and W...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2005.14165 2020

[36] [37]

Rae, Jack W. and Borgeaud, Sebastian and Cai, Trevor and Millican, Katie and Hoffmann, Jordan and Song, Francis and Aslanides, John and Henderson, Sarah and Ring, Roman and Young, Susannah and Rutherford, Eliza and Hennigan, Tom and Menick, Jacob and Cassirer, Albin and Powell, Richard and Driessche, George van den and Hendricks, Lisa Anne and Rauh, Marib...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2112.11446 2021

[37] [38]

Heruntergeladen von https://blog

Amodei, Dario and Hernandez, Danny and Sastry, Girish and Clark, Jack and Brockman, Greg and Sutskever, Ilya , date-added =. Heruntergeladen von https://blog. openai. com/aiand-compute , title =

work page

[38] [41]

Advani, M. S. and Saxe, A. M. (2017). High-dimensional dynamics of generalization error in neural networks

work page 2017

[39] [42]

Amodei, D., Hernandez, D., Sastry, G., Clark, J., Brockman, G., and Sutskever, I. (2018). Ai and compute. Heruntergeladen von https://blog. openai. com/aiand-compute

work page 2018

[40] [43]

Askell, A., Bai, Y., Chen, A., Drain, D., Ganguli, D., Henighan, T., Jones, A., Joseph, N., Mann, B., DasSarma, N., Elhage, N., Hatfield-Dodds, Z., Hernandez, D., Kernion, J., Ndousse, K., Olsson, C., Amodei, D., Brown, T., Clark, J., McCandlish, S., Olah, C., and Kaplan, J. (2021). A general language assistant as a laboratory for alignment

work page 2021

[41] [44]

Belkin, M., Hsu, D., Ma, S., and Mandal, S. (2018). Reconciling modern machine learning practice and the bias-variance trade-off

work page 2018

[42] [45]

Bi, B., Li, C., Wu, C., Yan, M., Wang, W., Huang, S., Huang, F., and Si, L. (2020). Palm: Pre-training an autoencoding and autoregressive language model for context-conditioned generation

work page 2020

[43] [46]

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, ...

work page 2020

[44] [47]

Cammarata, N., Carter, S., Goh, G., Olah, C., Petrov, M., Schubert, L., Voss, C., Egan, B., and Lim, S. K. (2020). Thread: Circuits. Distill . https://distill.pub/2020/circuits

work page 2020

[45] [48]

and Elibol, O

Droppo, J. and Elibol, O. (2021). Scaling laws for acoustic models

work page 2021

[46] [49]

Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., DasSarma, N., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S., and Olah, C. (2021). A mathematical framework for transfor...

work page 2021

[47] [50]

Ganguli, D., Hernandez, D., Lovitt, L., DasSarma, N., Henighan, T., Jones, A., Joseph, N., Kernion, J., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., Drain, D., Elhage, N., Showk, S. E., Fort, S., Hatfield-Dodds, Z., Johnston, S., Kravec, S., Nanda, N., Ndousse, K., Olsson, C., Amodei, D., Amodei, D., Brown, T., Kaplan, J., McCandlish, S., Olah, C...

work page 2022

[48] [51]

Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., Presser, S., and Leahy, C. (2021). The pile: An 800gb dataset of diverse text for language modeling

work page 2021

[49] [52]

Geiger, M., Spigler, S., d'Ascoli, S., Sagun, L., Baity-Jesi, M., Biroli, G., and Wyart, M. (2019). Jamming transition as a paradigm to understand the loss landscape of deep neural networks. Physical Review E , 100(1):012115

work page 2019

[50] [53]

Goh, G., Nick, C., Chelsea, V., Carter, S., Petrov, M., Schubert, L., Radford, A., and Olah, C. (2021). Multimodal neurons in artificial neural networks. Distill . https://distill.pub/2021/multimodal-neurons

[51] [54]

Henighan, T., Kaplan, J., Katz, M., Chen, M., Hesse, C., Jackson, J., Jun, H., Brown, T. B., Dhariwal, P., Gray, S., Hallacy, C., Mann, B., Radford, A., Ramesh, A., Ryder, N., Ziegler, D. M., Schulman, J., Amodei, D., and McCandlish, S. (2020). Scaling laws for autoregressive generative modeling

[52] [55]

Hernandez, D. and Brown, T. B. (2020). Measuring the algorithmic efficiency of neural networks. CoRR, abs/2005.04305

[53] [56]

Hernandez, D., Kaplan, J., Henighan, T., and McCandlish, S. (2021). Scaling laws for transfer. arXiv preprint arXiv:2102.01293

[54] [57]

Hestness, J., Narang, S., Ardalani, N., Diamos, G., Jun, H., Kianinejad, H., Patwary, M. M. A., Yang, Y., and Zhou, Y. (2017). Deep learning scaling is predictable, empirically

[55] [58]

Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. d. L., Hendricks, L. A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., Driessche, G. v. d., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., Rae, J. W., Vinyals, O., and Sifre, L. (2022). Training compute-optimal large language models

[56] [59]

Jozefowicz, R., Vinyals, O., Schuster, M., Shazeer, N., and Wu, Y. (2016). Exploring the limits of language modeling

[57] [60]

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. (2020). Scaling laws for neural language models

[58] [61]

Kiros, R., Zhu, Y., Salakhutdinov, R., Zemel, R. S., Torralba, A., Urtasun, R., and Fidler, S. (2015). Skip-thought vectors

[59] [62]

Lee, K., Ippolito, D., Nystrom, A., Zhang, C., Eck, D., Callison-Burch, C., and Carlini, N. (2021). Deduplicating training data makes language models better. arXiv preprint arXiv:2107.06499

[60] [63]

Malzahn, D. and Opper, M. (2001). A variational approach to learning curves. In Dietterich, T., Becker, S., and Ghahramani, Z., editors, Advances in Neural Information Processing Systems, volume 14. MIT Press

[61] [64]

Merity, S., Xiong, C., Bradbury, J., and Socher, R. (2016). Pointer sentinel mixture models

[62] [65]

Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., and Sutskever, I. (2019). Deep double descent: Where bigger models and more data hurt

[63] [66]

Olsson, C., Elhage, N., Nanda, N., Joseph, N., DasSarma, N., Henighan, T., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Johnston, S., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S., and Olah, C. (2022). In-context learning and...

[64] [67]

Opper, M. (1995). Statistical mechanics of learning: Generalization

[65] [68]

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., and Lowe, R. (2022). Training language models to follow instructions with human feedback

[66] [69]

Prato, G., Guiroy, S., Caballero, E., Rish, I., and Chandar, S. (2021). Scaling laws for the few-shot adaptation of pre-trained image classifiers

[67] [70]

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. (2021). Learning transferable visual models from natural language supervision

[68] [71]

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. (2019). Language models are unsupervised multitask learners. OpenAI blog, 1(8):9

[69] [72]

Rae, J. W., Borgeaud, S., Cai, T., Millican, K., Hoffmann, J., Song, F., Aslanides, J., Henderson, S., Ring, R., Young, S., Rutherford, E., Hennigan, T., Menick, J., Cassirer, A., Powell, R., Driessche, G. v. d., Hendricks, L. A., Rauh, M., Huang, P.-S., Glaese, A., Welbl, J., Dathathri, S., Huang, S., Uesato, J., Mellor, J., Higgins, I., Creswell, A., Mc...

[70] [73]

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention is all you need

[71] [74]

Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Scao, T. L., Gugger, S., Drame, M., Lhoest, Q., and Rush, A. M. (2019). Huggingface's transformers: State-of-the-art natural language processing
