Scaling Laws for Transfer
Pith reviewed 2026-05-18 00:52 UTC · model grok-4.3
The pith
Pre-training multiplies the effective size of fine-tuning datasets according to a power law in model size and data volume.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When a model is pre-trained on a large language dataset and then fine-tuned, its loss keeps dropping as parameters are added, even after from-scratch training on the same fine-tuning data has saturated. Inverting the from-scratch loss-versus-data curve shows that the effective data transferred obeys a power law in parameter count and fine-tuning dataset size, so pre-training effectively multiplies the fine-tuning dataset size.
What carries the argument
Effective data transferred: the additional data a from-scratch model of the same size would have needed to reach the observed fine-tuned loss. It is obtained by locating the fine-tuned loss on the loss-versus-data curve measured in from-scratch training and reading off the corresponding dataset size.
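The inversion can be sketched numerically. The snippet below is a minimal illustration, not the paper's code: `scratch_loss` is a hypothetical stand-in for a fitted from-scratch loss curve at fixed model size (its functional form and constants are invented for the example), and `effective_data` uses bisection in log space to find the dataset size at which that curve reaches a given fine-tuned loss.

```python
import math

def scratch_loss(d_tokens, l_inf=1.5, d_c=5.0e6, alpha_d=0.3):
    """Hypothetical from-scratch loss at fixed model size (illustrative constants)."""
    return l_inf + (d_c / d_tokens) ** alpha_d

def effective_data(target_loss, loss_fn=scratch_loss, lo=1.0, hi=1.0e15, iters=200):
    """Invert a monotonically decreasing loss curve by bisection in log space."""
    if not (loss_fn(hi) < target_loss < loss_fn(lo)):
        raise ValueError("target loss is outside the range covered by the curve")
    log_lo, log_hi = math.log(lo), math.log(hi)
    for _ in range(iters):
        mid = 0.5 * (log_lo + log_hi)
        if loss_fn(math.exp(mid)) > target_loss:
            log_lo = mid  # loss still too high: more data is needed
        else:
            log_hi = mid
    return math.exp(0.5 * (log_lo + log_hi))

# Example: a fine-tuned model trained on d_ft tokens reaches loss l_ft.
d_ft = 1.0e6
l_ft = scratch_loss(2.0e7)      # pretend fine-tuning matched 2e7 from-scratch tokens
d_eff = effective_data(l_ft)    # total effective data
d_transferred = d_eff - d_ft    # the part attributable to pre-training
```

Bisection is used here only because it works for any monotone curve; a fitted closed-form curve could be inverted analytically instead.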
If this is right
- Transfer performance can be predicted in advance from parameter count, fine-tuning size, and the measured exponents.
- The slope of the power law in model size quantifies how generally useful the pre-trained representations are.
- The slope in fine-tuning data size quantifies how close the pre-training and target distributions are.
- Overall scaling of transfer follows the same predictable pattern as scaling of performance from scratch.
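These consequences can be made concrete with a small sketch. It assumes the power law takes the product form D_T = k * D_F^alpha * N^beta, where D_F is the fine-tuning dataset size and N the parameter count; the constants below are placeholders chosen for illustration, not values reported in the paper.

```python
def transferred_data(n_params, d_ft, k, alpha, beta):
    """Assumed power-law form: D_T = k * D_F**alpha * N**beta."""
    return k * (d_ft ** alpha) * (n_params ** beta)

def data_multiplier(n_params, d_ft, k, alpha, beta):
    """Effective multiplier on the fine-tuning set: (D_F + D_T) / D_F."""
    return 1.0 + transferred_data(n_params, d_ft, k, alpha, beta) / d_ft

# Placeholder constants, chosen only for illustration.
K, ALPHA, BETA = 1.0e4, 0.2, 0.4

small = data_multiplier(n_params=1.0e7, d_ft=1.0e6, k=K, alpha=ALPHA, beta=BETA)
large = data_multiplier(n_params=1.0e8, d_ft=1.0e6, k=K, alpha=ALPHA, beta=BETA)
more_data = data_multiplier(n_params=1.0e7, d_ft=1.0e8, k=K, alpha=ALPHA, beta=BETA)
```

Under this assumed form, any beta > 0 makes the multiplier grow with model size, and any alpha < 1 makes it shrink as the fine-tuning set grows.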
Where Pith is reading between the lines
- Training budgets could be allocated by first estimating the multiplication factor from the power law and then deciding how much additional fine-tuning data is still worth collecting.
- The same inversion technique might reveal whether pre-training on one modality transfers to another by comparing effective data across domains.
- If the power-law exponents turn out stable across many tasks, they could serve as a cheap diagnostic for how well a new pre-trained model will generalize before any fine-tuning is run.
Load-bearing premise
The loss curve measured during ordinary from-scratch training can be inverted to give the exact amount of from-scratch data that matches a fine-tuned model's loss, with no extra effects from optimization or distribution mismatch.
What would settle it
Measure the actual loss after fine-tuning a model on a new small dataset and check whether the loss matches the value predicted by plugging the model size and dataset size into the reported power-law formula for effective transferred data.
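A rough sketch of that check follows. Everything in it is hypothetical: `toy_scratch_loss` is an invented loss surface, the exponents are placeholders, and `measured` is a made-up number standing in for a real fine-tuning run. The prediction evaluates the from-scratch loss at D_F + D_T, with D_T taken from an assumed power law of the form k * D_F^alpha * N^beta.

```python
def toy_scratch_loss(n_params, d_tokens):
    """Hypothetical from-scratch loss surface; form and constants are invented."""
    return 1.5 + (5.0e6 / d_tokens) ** 0.3 + (1.0e6 / n_params) ** 0.1

def predicted_finetune_loss(n_params, d_ft, k, alpha, beta, scratch_loss=toy_scratch_loss):
    """Predict fine-tuned loss as the from-scratch loss at D_F + D_T."""
    d_transferred = k * d_ft ** alpha * n_params ** beta  # assumed power-law form
    return scratch_loss(n_params, d_ft + d_transferred)

def relative_error(measured, predicted):
    """Relative gap between a measured loss and the predicted one."""
    return abs(measured - predicted) / predicted

# Placeholder exponents and a made-up "measured" loss, for illustration only.
prediction = predicted_finetune_loss(n_params=1.0e7, d_ft=1.0e6, k=1.0e4, alpha=0.2, beta=0.4)
measured = 2.75
gap = relative_error(measured, prediction)
```

How small `gap` must be to count as agreement is a judgment call that depends on run-to-run noise in the measured loss.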
Original abstract
We study empirical scaling laws for transfer learning between distributions in an unsupervised, fine-tuning setting. When we train increasingly large neural networks from-scratch on a fixed-size dataset, they eventually become data-limited and stop improving in performance (cross-entropy loss). When we do the same for models pre-trained on a large language dataset, the slope in performance gains is merely reduced rather than going to zero. We calculate the effective data "transferred" from pre-training by determining how much data a transformer of the same size would have required to achieve the same loss when training from scratch. In other words, we focus on units of data while holding everything else fixed. We find that the effective data transferred is described well in the low data regime by a power-law of parameter count and fine-tuning dataset size. We believe the exponents in these power-laws correspond to measures of the generality of a model and proximity of distributions (in a directed rather than symmetric sense). We find that pre-training effectively multiplies the fine-tuning dataset size. Transfer, like overall performance, scales predictably in terms of parameters, data, and compute.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript studies empirical scaling laws for transfer learning in an unsupervised fine-tuning setting for transformers. It shows that, on a fixed-size fine-tuning dataset, pre-trained models continue improving with model size in regimes where from-scratch models plateau due to data limits. Effective data transferred from pre-training is computed by inverting the from-scratch loss-versus-data curve to find the D_eff that would produce the observed fine-tuned loss for the same model size. This D_eff is reported to follow a power-law dependence on parameter count and fine-tuning dataset size in the low-data regime, with the interpretation that pre-training multiplies the fine-tuning data and that the exponents measure generality and distribution proximity.
Significance. If the central results hold after addressing the inversion assumptions, the work supplies a data-centric, quantitative description of transfer that extends existing scaling-law analyses and could guide decisions on pre-training compute allocation versus fine-tuning data. The focus on low-data regime and the explicit power-law form for effective transferred data are useful contributions, though they rest on the validity of treating from-scratch curves as an invertible baseline.
major comments (2)
- [Effective data calculation (described in abstract and methods)] The effective-data inversion (L_scratch(N, D_eff) = L_finetune(N, D_ft)) is the load-bearing step for all subsequent power-law claims. The manuscript provides no diagnostics that the power-law regime, exponents, or location remain unchanged when training begins from a pre-trained checkpoint rather than random initialization; differing optimization trajectories or effective capacity could systematically bias D_eff. This assumption is not tested and directly affects the claim that pre-training multiplies fine-tuning data.
- [Results on effective transferred data] The power-law fit to effective data in the low-data regime is presented without error bars on the fitted exponents, without the exact functional form or regression procedure used for the from-scratch baseline, and without comparisons to alternative forms (e.g., log or saturating functions). These omissions make it impossible to assess how well the power law actually describes the data or how sensitive the reported exponents are to fitting choices.
minor comments (2)
- [Notation and definitions] Notation for D_eff and the power-law exponents should be introduced with explicit equations early in the text to improve readability when the same symbols appear in later figures and interpretations.
- [Figures] Several loss-curve figures would be clearer if they overlaid the from-scratch and fine-tuned curves on identical axes with explicit indication of the inversion points used to obtain D_eff.
Simulated Author's Rebuttal
We thank the referee for their detailed and insightful comments on our work. We believe the suggested revisions will strengthen the presentation of our results on scaling laws for transfer learning. Below we respond point-by-point to the major comments.
-
Referee: [Effective data calculation (described in abstract and methods)] The effective-data inversion (L_scratch(N, D_eff) = L_finetune(N, D_ft)) is the load-bearing step for all subsequent power-law claims. The manuscript provides no diagnostics that the power-law regime, exponents, or location remain unchanged when training begins from a pre-trained checkpoint rather than random initialization; differing optimization trajectories or effective capacity could systematically bias D_eff. This assumption is not tested and directly affects the claim that pre-training multiplies fine-tuning data.
Authors: We agree that validating the inversion assumption is important for the robustness of our claims. While the core methodology relies on matching observed losses to the from-scratch scaling curve, we did not explicitly test whether the from-scratch power-law exponents or regimes shift when initializing from a pre-trained model. In the revised version, we will add a discussion of this potential limitation and, where computationally feasible, include diagnostic experiments comparing loss curves starting from pre-trained weights versus random initialization in the low-data regime to assess any systematic bias in D_eff. revision: yes
-
Referee: [Results on effective transferred data] The power-law fit to effective data in the low-data regime is presented without error bars on the fitted exponents, without the exact functional form or regression procedure used for the from-scratch baseline, and without comparisons to alternative forms (e.g., log or saturating functions). These omissions make it impossible to assess how well the power law actually describes the data or how sensitive the reported exponents are to fitting choices.
Authors: We appreciate this point and acknowledge that additional details on the fitting procedure would enhance the clarity and reproducibility of our results. In the revision, we will specify the exact functional form used for the from-scratch baseline (power-law in N and D), detail the regression procedure (e.g., linear regression on log-transformed variables), include error bars or confidence intervals on the fitted exponents derived from bootstrap resampling or similar methods, and provide comparisons to alternative functional forms such as logarithmic or saturating models to justify the power-law choice in the low-data regime. revision: yes
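The fitting procedure the authors describe (linear regression on log-transformed variables, with bootstrap intervals on the exponents) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the power law is assumed to be D_T = k * D_F^alpha * N^beta, the data are synthetic and generated from made-up constants, and all function names are hypothetical.

```python
import math
import random

def solve3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            factor = m[r][col] / m[col][col]
            for c in range(col, 4):
                m[r][c] -= factor * m[col][c]
    x = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, 3))
        x[i] = (m[i][3] - tail) / m[i][i]
    return x

def fit_power_law(points):
    """OLS on log D_T = log k + alpha * log D_F + beta * log N.

    points: list of (n_params, d_ft, d_transferred). Returns (k, alpha, beta).
    """
    rows = [(1.0, math.log(d_ft), math.log(n)) for n, d_ft, _ in points]
    ys = [math.log(d_t) for _, _, d_t in points]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    log_k, alpha, beta = solve3(xtx, xty)
    return math.exp(log_k), alpha, beta

def percentile_interval(values, level):
    """Central percentile interval of a list of bootstrap estimates."""
    s = sorted(values)
    lo_idx = int((1.0 - level) / 2.0 * (len(s) - 1))
    hi_idx = int(round((1.0 + level) / 2.0 * (len(s) - 1)))
    return s[lo_idx], s[hi_idx]

def bootstrap_ci(points, n_boot=500, level=0.95, seed=0):
    """Percentile bootstrap intervals for (alpha, beta), resampling runs with replacement."""
    rng = random.Random(seed)
    alphas, betas = [], []
    for _ in range(n_boot):
        sample = [rng.choice(points) for _ in points]
        try:
            _, a, b = fit_power_law(sample)
        except ZeroDivisionError:
            continue  # skip a degenerate resample (singular design matrix)
        alphas.append(a)
        betas.append(b)
    return percentile_interval(alphas, level), percentile_interval(betas, level)

# Synthetic runs from made-up constants with multiplicative noise (illustration only).
rng = random.Random(0)
TRUE_K, TRUE_ALPHA, TRUE_BETA = 1.0e4, 0.2, 0.4
points = []
for n in (1e6, 1e7, 1e8, 1e9, 1e10):
    for d_ft in (1e5, 1e6, 1e7, 1e8, 1e9):
        noise = math.exp(rng.gauss(0.0, 0.05))
        points.append((n, d_ft, TRUE_K * d_ft ** TRUE_ALPHA * n ** TRUE_BETA * noise))

k_hat, alpha_hat, beta_hat = fit_power_law(points)
alpha_ci, beta_ci = bootstrap_ci(points)
```

The sketch covers only the power-law fit and its intervals; comparing against logarithmic or saturating forms would need a separate fit per candidate and a shared goodness-of-fit measure.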
Circularity Check
No significant circularity; empirical definition of effective data followed by power-law fit is standard scaling analysis
The paper defines effective transferred data via inversion of the from-scratch loss curve to match observed fine-tuned loss, then empirically observes that this quantity follows a power-law in N and D_ft within the low-data regime. This is a measurement-plus-fitting procedure for reporting scaling relations, not a first-principles derivation whose claimed result reduces to its inputs by construction. The inversion step rests on an assumption about curve applicability (a correctness concern), but does not create a self-definitional loop or rename a fitted quantity as an independent prediction. No equations or steps in the abstract or described chain exhibit the specific reductions required for circularity flags (e.g., no power-law exponents derived tautologically from the inversion itself). The work remains self-contained as observational scaling laws.
Axiom & Free-Parameter Ledger
free parameters (1)
- power-law exponents for effective transferred data
axioms (1)
- domain assumption: Loss scales as a power law with dataset size in the from-scratch regime
Forward citations
Cited by 20 Pith papers
-
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.
-
On the Invariance and Generality of Neural Scaling Laws
Neural scaling laws are invariant under bijective data transformations and change predictably with information resolution ρ under non-bijective transformations, enabling cross-domain transport of fitted exponents.
-
Rectification Difficulty and Optimal Sample Allocation in LLM-Augmented Surveys
A method using predicted rectification difficulty for optimal human sample allocation in LLM-augmented surveys captures 61-79% of theoretical efficiency gains and reduces MSE by 11% on two datasets without pilot data.
-
Practical Scaling Laws: Converting Compute into Performance in a Data-Constrained World
A new scaling law L(N, D, T) = E + (L0 - E) h/(1+h) with h = a/N^α + b/T^β + c N^γ/D^δ that decomposes loss into undercapacity, undertraining, and overfitting terms and saturates between E and L0.
-
A Qualitative Test-Risk Mechanism for Scaling Behavior in Normalized Residual Networks
Depth expansion in normalized residual networks yields provable test-risk improvement through representational, optimization, and generalization gains under first-order descent and norm-control conditions.
-
Pretraining Induces a Reusable Spectral Basis for Downstream Task Adaptation
Pretraining induces stable leading singular vectors that form a reusable spectral basis inherited by downstream tasks, enabling competitive performance with 0.2% trainable parameters on GLUE.
-
Knowledge Transfer Scaling Laws for 3D Medical Imaging
Transfer-aware data allocation derived from observed power-law scaling laws for asymmetric knowledge transfer in 3D medical imaging outperforms standard proportional sampling by up to 58% and generalizes to new budgets.
-
A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws
Emergent intelligence is recast as the existence of the limit of performance E(N,P,K) as N,P,K to infinity, with necessary and sufficient conditions derived via nonlinear Lipschitz operator theory and scaling laws obt...
-
SAM 3D: 3Dfy Anything in Images
SAM 3D reconstructs 3D objects from single images with geometry, texture, and pose using human-model annotated data at scale and synthetic-to-real training, achieving 5:1 human preference wins.
-
Lessons from the Trenches on Reproducible Evaluation of Language Models
The paper compiles practical lessons on reproducible LM evaluation and introduces the lm-eval library to mitigate common methodological problems in NLP.
-
Scaling Data-Constrained Language Models
Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.
-
BloombergGPT: A Large Language Model for Finance
BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.
-
SemDeDup: Data-efficient learning at web-scale through semantic deduplication
SemDeDup removes semantic duplicates from datasets like LAION using pre-trained embeddings, cutting data by 50% with minimal performance loss and efficiency gains on C4.
-
Efficient Training of Language Models to Fill in the Middle
Autoregressive language models trained on data with middle spans relocated to the end learn infilling without degrading left-to-right perplexity or sampling quality.
-
Language Models (Mostly) Know What They Know
Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
-
Scaling Laws and Interpretability of Learning from Repeated Data
Repeating 0.1% of training data 100 times degrades an 800M parameter model's performance to that of a 400M model by damaging copying mechanisms and induction heads associated with generalization.
-
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
RLHF alignment training on language models boosts NLP performance, supports skill specialization, enables weekly online updates with fresh human data, and shows a linear relation between RL reward and sqrt(KL divergen...
-
Trust, but Verify: Peeling Low-Bit Transformer Networks for Training Monitoring
A layer-wise peeling framework creates reference bounds to diagnose under-optimized layers in trained decoder-only transformers, including low-bit and quantized versions.
-
A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws
Emergent intelligence corresponds to the limit of a performance function E(N,P,K) as N, P, K go to infinity, originating from a parameter-limit architecture whose existence is governed by Lipschitz conditions, with sc...
-
Small Language Models are the Future of Agentic AI
Small language models are sufficiently capable, more suitable, and far more economical than large models for the repetitive tasks that dominate agentic AI systems.