arxiv: 2102.01293 · v1 · pith:EHVUIWH5new · submitted 2021-02-02 · 💻 cs.LG

Scaling Laws for Transfer

Danny Hernandez, Jared Kaplan, Tom Henighan, Sam McCandlish

Pith reviewed 2026-05-18 00:52 UTC · model grok-4.3

classification 💻 cs.LG

keywords: scaling laws, transfer learning, fine-tuning, language models, power laws, effective data, neural network scaling


The pith

Pre-training multiplies the effective size of fine-tuning datasets according to a power law in model size and data volume.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper measures how much pre-training on a large language corpus helps when later fine-tuning on much smaller datasets. It converts the observed fine-tuned loss back into an equivalent amount of from-scratch training data that would have been required to reach the same loss, then shows that this effective transferred quantity follows a clean power law in both the number of model parameters and the fine-tuning dataset size. A reader should care because the result supplies a concrete, testable rule for predicting transfer gains without running every possible experiment. The authors interpret the power-law exponents as direct measures of how general a model is and how close the pre-training and fine-tuning distributions are. They conclude that pre-training simply multiplies the fine-tuning dataset size by a predictable factor.

Core claim

When models are pre-trained on a large language dataset and then fine-tuned, the loss continues to drop with more parameters even after from-scratch training has saturated; inverting the from-scratch loss-versus-data curve shows that the amount of effective data transferred obeys a power law in parameter count and fine-tuning dataset size, so that pre-training multiplies the fine-tuning dataset size.

What carries the argument

Effective data transferred, obtained by inverting the observed fine-tuned loss against the loss curve measured in from-scratch training to find how much additional data would have produced the same loss.
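The inversion described above can be sketched numerically. This is a minimal illustration, not the authors' code: the functional form of `scratch_loss` is borrowed from the kind of joint loss surface used in earlier language-model scaling-law work, and every constant in it (`n_c`, `d_c`, `alpha_n`, `alpha_d`) is a placeholder rather than a value reported on this page. The function names are hypothetical.

```python
import math


def scratch_loss(n_params, n_tokens, n_c=1.0e14, d_c=1.0e13,
                 alpha_n=0.08, alpha_d=0.10):
    """Assumed from-scratch loss surface L(N, D); all constants are placeholders."""
    capacity_term = (n_c / n_params) ** (alpha_n / alpha_d)
    data_term = d_c / n_tokens
    return (capacity_term + data_term) ** alpha_d


def effective_data(n_params, target_loss, lo=1.0, hi=1.0e15, iters=200):
    """Solve scratch_loss(N, D_eff) = target_loss for D_eff by bisection in log space."""
    if not scratch_loss(n_params, hi) <= target_loss <= scratch_loss(n_params, lo):
        raise ValueError("target loss is outside the range of the from-scratch curve")
    log_lo, log_hi = math.log(lo), math.log(hi)
    for _ in range(iters):
        mid = 0.5 * (log_lo + log_hi)
        if scratch_loss(n_params, math.exp(mid)) > target_loss:
            log_lo = mid  # loss still too high, so more data is needed
        else:
            log_hi = mid
    return math.exp(0.5 * (log_lo + log_hi))


def transferred_data(n_params, d_finetune, finetuned_loss):
    """Effective data attributed to pre-training: D_eff minus the real fine-tuning data."""
    return effective_data(n_params, finetuned_loss) - d_finetune
```

Bisection is used rather than a closed-form inverse so that the same sketch works for any monotonically decreasing loss-versus-data curve, including an interpolated empirical one.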

If this is right

Transfer performance can be predicted in advance from parameter count, fine-tuning size, and the measured exponents.
The slope of the power law in model size quantifies how generally useful the pre-trained representations are.
The slope in fine-tuning data size quantifies how close the pre-training and target distributions are.
Overall scaling of transfer follows the same predictable pattern as scaling of performance from scratch.
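To make the first point concrete, here is a small sketch of how such a prediction could be computed. It assumes the transferred data takes the form k · D_ft^alpha · N^beta, and it defines the multiplier as (D_ft + D_T) / D_ft, which is one natural reading of "pre-training multiplies the fine-tuning dataset size". The constants `K`, `ALPHA`, and `BETA` are illustrative placeholders, not fitted values from the paper.

```python
# Placeholder constants for illustration only; not values reported on this page.
K = 1.0e4
ALPHA = 0.2   # exponent on fine-tuning dataset size
BETA = 0.4    # exponent on parameter count


def transferred_tokens(d_finetune, n_params, k, alpha, beta):
    """Assumed power law for effective data transferred: k * D_ft^alpha * N^beta."""
    return k * d_finetune ** alpha * n_params ** beta


def data_multiplier(d_finetune, n_params, k, alpha, beta):
    """Factor by which pre-training inflates the fine-tuning dataset."""
    d_t = transferred_tokens(d_finetune, n_params, k, alpha, beta)
    return (d_finetune + d_t) / d_finetune


for d_ft in (1.0e5, 1.0e6, 1.0e7, 1.0e8):
    m = data_multiplier(d_ft, n_params=1.0e8, k=K, alpha=ALPHA, beta=BETA)
    print(f"D_ft = {d_ft:.0e} tokens -> multiplier = {m:.1f}")
```

Under this assumed form, any exponent on fine-tuning size below 1 means the multiplier shrinks as the fine-tuning dataset grows, even though the absolute amount of transferred data keeps rising.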

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training budgets could be allocated by first estimating the multiplication factor from the power law and then deciding how much additional fine-tuning data is still worth collecting.
The same inversion technique might reveal whether pre-training on one modality transfers to another by comparing effective data across domains.
If the power-law exponents turn out stable across many tasks, they could serve as a cheap diagnostic for how well a new pre-trained model will generalize before any fine-tuning is run.

Load-bearing premise

The loss curve measured during ordinary from-scratch training can be inverted to give the exact amount of data that would produce the same loss after fine-tuning, with no extra effects from optimization or distribution mismatch.

What would settle it

Measure the actual loss after fine-tuning a model on a new small dataset and check whether the loss matches the value predicted by plugging the model size and dataset size into the reported power-law formula for effective transferred data.

Original abstract

We study empirical scaling laws for transfer learning between distributions in an unsupervised, fine-tuning setting. When we train increasingly large neural networks from-scratch on a fixed-size dataset, they eventually become data-limited and stop improving in performance (cross-entropy loss). When we do the same for models pre-trained on a large language dataset, the slope in performance gains is merely reduced rather than going to zero. We calculate the effective data "transferred" from pre-training by determining how much data a transformer of the same size would have required to achieve the same loss when training from scratch. In other words, we focus on units of data while holding everything else fixed. We find that the effective data transferred is described well in the low data regime by a power-law of parameter count and fine-tuning dataset size. We believe the exponents in these power-laws correspond to measures of the generality of a model and proximity of distributions (in a directed rather than symmetric sense). We find that pre-training effectively multiplies the fine-tuning dataset size. Transfer, like overall performance, scales predictably in terms of parameters, data, and compute.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The paper quantifies transfer as an effective data multiplier that follows power laws in model size and fine-tuning data, extending scaling laws but relying on an inversion whose robustness is not fully shown.


The main thing to know is that pre-training on a large language dataset acts like adding extra fine-tuning data, and the amount added grows as a power law in both parameter count and fine-tuning dataset size. Equivalently, pre-training multiplies the fine-tuning dataset by a predictable factor. The authors measure this by first fitting a loss-versus-data curve on from-scratch runs, then solving for the dataset size that would match the loss seen after fine-tuning a pre-trained model of the same size. The solved quantity, minus the actual fine-tuning data, is the effective data transferred, and they report power-law fits to it in the low-data regime. Pre-trained models therefore keep improving predictably with model size rather than saturating the way from-scratch training does on fixed data.

Referee Report

2 major / 2 minor

Summary. The manuscript studies empirical scaling laws for transfer learning in an unsupervised fine-tuning setting for transformers. It shows that, on a fixed-size fine-tuning dataset, pre-trained models keep improving as parameter count grows in regimes where from-scratch models become data-limited and plateau. Effective data transferred from pre-training is computed by inverting the from-scratch loss-versus-data curve to find the D_eff that would produce the observed fine-tuned loss for the same model size. The transferred portion of D_eff is reported to follow a power-law dependence on parameter count and fine-tuning dataset size in the low-data regime, with the interpretation that pre-training multiplies the fine-tuning data and that the exponents measure generality and distribution proximity.

Significance. If the central results hold after addressing the inversion assumptions, the work supplies a data-centric, quantitative description of transfer that extends existing scaling-law analyses and could guide decisions on pre-training compute allocation versus fine-tuning data. The focus on low-data regime and the explicit power-law form for effective transferred data are useful contributions, though they rest on the validity of treating from-scratch curves as an invertible baseline.

major comments (2)

[Effective data calculation (described in abstract and methods)] The effective-data inversion (L_scratch(N, D_eff) = L_finetune(N, D_ft)) is the load-bearing step for all subsequent power-law claims. The manuscript provides no diagnostics that the power-law regime, exponents, or location remain unchanged when training begins from a pre-trained checkpoint rather than random initialization; differing optimization trajectories or effective capacity could systematically bias D_eff. This assumption is not tested and directly affects the claim that pre-training multiplies fine-tuning data.
[Results on effective transferred data] The power-law fit to effective data in the low-data regime is presented without error bars on the fitted exponents, without the exact functional form or regression procedure used for the from-scratch baseline, and without comparisons to alternative forms (e.g., log or saturating functions). These omissions make it impossible to assess how well the power law actually describes the data or how sensitive the reported exponents are to fitting choices.

minor comments (2)

[Notation and definitions] Notation for D_eff and the power-law exponents should be introduced with explicit equations early in the text to improve readability when the same symbols appear in later figures and interpretations.
[Figures] Several loss-curve figures would be clearer if they overlaid the from-scratch and fine-tuned curves on identical axes with explicit indication of the inversion points used to obtain D_eff.
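As an illustration of what the notation comment asks for, the definitions could be stated along the following lines. This is a sketch using the symbols that appear in this report (D_eff, D_ft, N); the names D_T, k, alpha, and beta are labels introduced here for convenience and are not confirmed by the text on this page.

```latex
\begin{align}
  L_{\text{scratch}}(N, D_{\text{eff}}) &= L_{\text{finetune}}(N, D_{\text{ft}})
    && \text{(defines } D_{\text{eff}} \text{ by inversion)} \\
  D_T &= D_{\text{eff}} - D_{\text{ft}}
    && \text{(effective data transferred)} \\
  D_T &\approx k \, D_{\text{ft}}^{\alpha} \, N^{\beta}
    && \text{(assumed power-law fit, low-data regime)}
\end{align}
```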

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and insightful comments on our work. We believe the suggested revisions will strengthen the presentation of our results on scaling laws for transfer learning. Below we respond point-by-point to the major comments.


Referee: [Effective data calculation (described in abstract and methods)] The effective-data inversion (L_scratch(N, D_eff) = L_finetune(N, D_ft)) is the load-bearing step for all subsequent power-law claims. The manuscript provides no diagnostics that the power-law regime, exponents, or location remain unchanged when training begins from a pre-trained checkpoint rather than random initialization; differing optimization trajectories or effective capacity could systematically bias D_eff. This assumption is not tested and directly affects the claim that pre-training multiplies fine-tuning data.

Authors: We agree that validating the inversion assumption is important for the robustness of our claims. While the core methodology relies on matching observed losses to the from-scratch scaling curve, we did not explicitly test whether the from-scratch power-law exponents or regimes shift when initializing from a pre-trained model. In the revised version, we will add a discussion of this potential limitation and, where computationally feasible, include diagnostic experiments comparing loss curves starting from pre-trained weights versus random initialization in the low-data regime to assess any systematic bias in D_eff.
Revision: yes

Referee: [Results on effective transferred data] The power-law fit to effective data in the low-data regime is presented without error bars on the fitted exponents, without the exact functional form or regression procedure used for the from-scratch baseline, and without comparisons to alternative forms (e.g., log or saturating functions). These omissions make it impossible to assess how well the power law actually describes the data or how sensitive the reported exponents are to fitting choices.

Authors: We appreciate this point and acknowledge that additional details on the fitting procedure would enhance the clarity and reproducibility of our results. In the revision, we will specify the exact functional form used for the from-scratch baseline (power-law in N and D), detail the regression procedure (e.g., linear regression on log-transformed variables), include error bars or confidence intervals on the fitted exponents derived from bootstrap resampling or similar methods, and provide comparisons to alternative functional forms such as logarithmic or saturating models to justify the power-law choice in the low-data regime.
Revision: yes
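The regression and bootstrap procedure mentioned in this response could look roughly like the sketch below. It is an assumption-laden illustration, not the authors' pipeline: it fits log D_T = log k + alpha · log D_ft + beta · log N by ordinary least squares with NumPy and resamples (D_ft, N, D_T) triples with replacement to get percentile intervals. The function names are hypothetical, and the comparison against logarithmic or saturating alternatives is not covered.

```python
import numpy as np


def fit_power_law(d_ft, n_params, d_transfer):
    """Least-squares fit of log D_T = log k + alpha * log D_ft + beta * log N."""
    d_ft, n_params, d_transfer = map(np.asarray, (d_ft, n_params, d_transfer))
    design = np.column_stack([np.ones(len(d_ft)), np.log(d_ft), np.log(n_params)])
    coef, *_ = np.linalg.lstsq(design, np.log(d_transfer), rcond=None)
    log_k, alpha, beta = coef
    return float(np.exp(log_k)), float(alpha), float(beta)


def bootstrap_ci(d_ft, n_params, d_transfer, n_boot=500, level=0.95, seed=0):
    """Percentile bootstrap intervals for the two exponents."""
    rng = np.random.default_rng(seed)
    d_ft, n_params, d_transfer = map(np.asarray, (d_ft, n_params, d_transfer))
    n = len(d_ft)
    samples = np.empty((n_boot, 2))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample runs with replacement
        _, samples[b, 0], samples[b, 1] = fit_power_law(
            d_ft[idx], n_params[idx], d_transfer[idx]
        )
    tail = 100.0 * (1.0 - level) / 2.0
    lo, hi = np.percentile(samples, [tail, 100.0 - tail], axis=0)
    return {"alpha": (float(lo[0]), float(hi[0])),
            "beta": (float(lo[1]), float(hi[1]))}
```

Resampling whole runs treats each (model size, dataset size) measurement as one observation. If runs sharing a model size are correlated, a grouped bootstrap would be the more cautious choice.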

Circularity Check

0 steps flagged

No significant circularity; empirical definition of effective data followed by power-law fit is standard scaling analysis


The paper defines effective transferred data via inversion of the from-scratch loss curve to match observed fine-tuned loss, then empirically observes that this quantity follows a power-law in N and D_ft within the low-data regime. This is a measurement-plus-fitting procedure for reporting scaling relations, not a first-principles derivation whose claimed result reduces to its inputs by construction. The inversion step rests on an assumption about curve applicability (a correctness concern), but does not create a self-definitional loop or rename a fitted quantity as an independent prediction. No equations or steps in the abstract or described chain exhibit the specific reductions required for circularity flags (e.g., no power-law exponents derived tautologically from the inversion itself). The work remains self-contained as observational scaling laws.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim depends on the validity of inverting loss curves to obtain effective data and on the assumption that power-law forms observed in from-scratch training continue to apply when matching fine-tuned performance.

free parameters (1)

power-law exponents for effective transferred data
Exponents relating effective data to parameter count and to fine-tuning dataset size are determined by fitting observed values.

axioms (1)

Domain assumption: loss scales as a power law with dataset size in the from-scratch regime
Used to back-calculate how much data would have been needed to reach the fine-tuned loss.



Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
cs.CL 2023-04 accept novelty 8.0

Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.

On the Invariance and Generality of Neural Scaling Laws
cs.LG 2026-05 unverdicted novelty 7.0

Neural scaling laws are invariant under bijective data transformations and change predictably with information resolution ρ under non-bijective transformations, enabling cross-domain transport of fitted exponents.

Rectification Difficulty and Optimal Sample Allocation in LLM-Augmented Surveys
cs.AI 2026-04 unverdicted novelty 7.0

A method using predicted rectification difficulty for optimal human sample allocation in LLM-augmented surveys captures 61-79% of theoretical efficiency gains and reduces MSE by 11% on two datasets without pilot data.

Practical Scaling Laws: Converting Compute into Performance in a Data-Constrained World
cs.LG 2026-05 conditional novelty 6.0

A new scaling law L(N, D, T) = E + (L0 - E) h/(1+h) with h = a/N^α + b/T^β + c N^γ/D^δ that decomposes loss into undercapacity, undertraining, and overfitting terms and saturates between E and L0.

A Qualitative Test-Risk Mechanism for Scaling Behavior in Normalized Residual Networks
cs.LG 2026-05 unverdicted novelty 6.0

Depth expansion in normalized residual networks yields provable test-risk improvement through representational, optimization, and generalization gains under first-order descent and norm-control conditions.

Pretraining Induces a Reusable Spectral Basis for Downstream Task Adaptation
cs.LG 2026-05 unverdicted novelty 6.0

Pretraining induces stable leading singular vectors that form a reusable spectral basis inherited by downstream tasks, enabling competitive performance with 0.2% trainable parameters on GLUE.

Knowledge Transfer Scaling Laws for 3D Medical Imaging
cs.CV 2026-05 conditional novelty 6.0

Transfer-aware data allocation derived from observed power-law scaling laws for asymmetric knowledge transfer in 3D medical imaging outperforms standard proportional sampling by up to 58% and generalizes to new budgets.

A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws
cs.LG 2026-04 unverdicted novelty 6.0

Emergent intelligence is recast as the existence of the limit of performance E(N,P,K) as N,P,K to infinity, with necessary and sufficient conditions derived via nonlinear Lipschitz operator theory and scaling laws obt...

SAM 3D: 3Dfy Anything in Images
cs.CV 2025-11 unverdicted novelty 6.0

SAM 3D reconstructs 3D objects from single images with geometry, texture, and pose using human-model annotated data at scale and synthetic-to-real training, achieving 5:1 human preference wins.

Lessons from the Trenches on Reproducible Evaluation of Language Models
cs.CL 2024-05 accept novelty 6.0

The paper compiles practical lessons on reproducible LM evaluation and introduces the lm-eval library to mitigate common methodological problems in NLP.

Scaling Data-Constrained Language Models
cs.CL 2023-05 conditional novelty 6.0

Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.

BloombergGPT: A Large Language Model for Finance
cs.LG 2023-03 conditional novelty 6.0

BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.

SemDeDup: Data-efficient learning at web-scale through semantic deduplication
cs.LG 2023-03 unverdicted novelty 6.0

SemDeDup removes semantic duplicates from datasets like LAION using pre-trained embeddings, cutting data by 50% with minimal performance loss and efficiency gains on C4.

Efficient Training of Language Models to Fill in the Middle
cs.CL 2022-07 unverdicted novelty 6.0

Autoregressive language models trained on data with middle spans relocated to the end learn infilling without degrading left-to-right perplexity or sampling quality.

Language Models (Mostly) Know What They Know
cs.CL 2022-07 unverdicted novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

Scaling Laws and Interpretability of Learning from Repeated Data
cs.LG 2022-05 accept novelty 6.0

Repeating 0.1% of training data 100 times degrades an 800M parameter model's performance to that of a 400M model by damaging copying mechanisms and induction heads associated with generalization.

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
cs.CL 2022-04 unverdicted novelty 6.0

RLHF alignment training on language models boosts NLP performance, supports skill specialization, enables weekly online updates with fresh human data, and shows a linear relation between RL reward and sqrt(KL divergen...

Trust, but Verify: Peeling Low-Bit Transformer Networks for Training Monitoring
cs.LG 2026-05 unverdicted novelty 5.0

A layer-wise peeling framework creates reference bounds to diagnose under-optimized layers in trained decoder-only transformers, including low-bit and quantized versions.

A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws
cs.LG 2026-04 unverdicted novelty 5.0

Emergent intelligence corresponds to the limit of a performance function E(N,P,K) as N, P, K go to infinity, originating from a parameter-limit architecture whose existence is governed by Lipschitz conditions, with sc...

Small Language Models are the Future of Agentic AI
cs.AI 2025-06 unverdicted novelty 5.0

Small language models are sufficiently capable, more suitable, and far more economical than large models for the repetitive tasks that dominate agentic AI systems.


work page
[82]

, Booktitle =

Krizhevsky, Alex and Sutskever, Ilya and Hinton, Geoffrey E. , Booktitle =. ImageNet Classification with Deep Convolutional Neural Networks , Url =. 2012 , Bdsk-Url-1 =

work page 2012
[83]

openai.com , Title =

Gray, Scott and Radford, Alec and Kingma, Diederik P , Date-Added =. openai.com , Title =

work page
[84]

Decoupled Weight Decay Regularization

Fixing Weight Decay Regularization in Adam , Url =. 2017 , Bdsk-Url-1 =. arXiv , Author =:1711.05101 , Journal =

work page internal anchor Pith review Pith/arXiv arXiv 2017
[85]

Generating Long Sequences with Sparse Transformers

Generating Long Sequences with Sparse Transformers , Url =. 2019 , Bdsk-Url-1 =. arXiv , Author =:1904.10509 , Journal =

work page internal anchor Pith review Pith/arXiv arXiv 2019
[86]

GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism

GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism , Url =. 2018 , Bdsk-Url-1 =. arXiv , Author =:1811.06965 , Journal =

work page internal anchor Pith review Pith/arXiv arXiv 2018
[87]

On the Relation Between the Sharpest Directions of DNN Loss and the SGD Step Length , Year =

Stanislaw Jastrzebski and Zachary Kenton and Nicolas Ballas and Asja Fischer and Yoshua Bengio and Amos Storkey , Date-Added =. On the Relation Between the Sharpest Directions of DNN Loss and the SGD Step Length , Year =. arXiv , Keywords =. 1807.05031 , Eprinttype =

work page arXiv
[88]

arXiv , Author =:1908.08351 , Primaryclass =

Compositionality decomposed: how do neural networks generalise? , Year =. arXiv , Author =:1908.08351 , Primaryclass =

work page arXiv 1908
[90]

Generative Pretraining From Pixels , Year =

Chen, Mark and Radford, Alec and Child, Rewon and Wu, Jeffrey and Jun, Heewoo and Luan, David and Sutskever, Ilya , Booktitle =. Generative Pretraining From Pixels , Year =

work page
[91]

One Epoch Is All You Need

Aran Komatsuzaki , Date-Added =. arXiv:1906.06669 , Title =

work page internal anchor Pith review Pith/arXiv arXiv 1906
[92]

An Empirical Model of Large-Batch Training

Sam McCandlish and Jared Kaplan and Dario Amodei and OpenAI Dota Team , Date-Added =. arXiv:1812.06162 , Title =

work page internal anchor Pith review Pith/arXiv arXiv
[93]

XLNet: Generalized Autoregressive Pretraining for Language Understanding

Zhilin Yang and Zihang Dai and Yiming Yang and Jaime Carbonell and Ruslan Salakhutdinov and Quoc V. Le , Date-Added =. XLNet: Generalized Autoregressive Pretraining for Language Understanding , Year =. arXiv:1906.08237 , Keywords =

work page internal anchor Pith review arXiv 1906
[95]

Residual Networks Behave Like Ensembles of Relatively Shallow Networks , Year =

Andreas Veit and Michael Wilber and Serge Belongie , Eprint =. Residual Networks Behave Like Ensembles of Relatively Shallow Networks , Year =

work page
[96]

Language Models are Unsupervised Multitask Learners , Year =

Radford, Alec and Wu, Jeff and Child, Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya , Date-Modified =. Language Models are Unsupervised Multitask Learners , Year =. openai.com , Keywords =

work page
[97]

Improving language understanding by generative pre-training , Year =

Radford, Alec and Narasimhan, Karthik and Salimans, Tim and Sutskever, Ilya , Date-Modified =. Improving language understanding by generative pre-training , Year =. URL https://s3-us-west-2. amazonaws. com/openai-assets/research-covers/languageunsupervised/language understanding paper. pdf , Keywords =

work page
[98]

Attention is All you Need , Url =

Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, ukasz and Polosukhin, Illia , Booktitle =. Attention is All you Need , Url =. 2017 , Bdsk-Url-1 =

work page 2017
[99]

2018 , Bdsk-Url-1 =

Dario Amodei AND Danny Hernandez , Date-Added =. 2018 , Bdsk-Url-1 =

work page 2018
[100]

Selecting Sample Sizes , Url =

work page
[101]

Sample Size Determination , Url =

work page
[102]

On the Computational Inefficiency of Large Batch Sizes for Stochastic Gradient Descent

Noah Golmant and Nikita Vemuri and Zhewei Yao and Vladimir Feinberg and Amir Gholami and Kai Rothauge and Michael W. Mahoney and Joseph Gonzalez , Date-Added =. On the Computational Inefficiency of Large Batch Sizes for Stochastic Gradient Descent , Year =. 1811.12941 , Eprinttype =

work page internal anchor Pith review Pith/arXiv arXiv
[103]

The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning

Siyuan Ma and Raef Bassily and Mikhail Belkin , Date-Added =. The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning , Year =. 1712.06559 , Eprinttype =

work page internal anchor Pith review Pith/arXiv arXiv
[104]

TrueSkill : A Bayesian Skill Rating System , Url =

Ralf Herbrich and Minka, Tom and Graepel, Thore , Booktitle =. TrueSkill : A Bayesian Skill Rating System , Url =. 2007 , Bdsk-Url-1 =

work page 2007
[105]

Are Deep Policy Gradient Algorithms Truly Policy Gradient Algorithms? , Year =

Andrew Ilyas and Logan Engstrom and Shibani Santurkar and Dimitris Tsipras and Firdaus Janoos and Larry Rudolph and Aleksander Madry , Date-Added =. Are Deep Policy Gradient Algorithms Truly Policy Gradient Algorithms? , Year =. 1811.02553 , Eprinttype =

work page arXiv
