pith. machine review for the scientific record.

arxiv: 2102.01293 · v1 · pith:EHVUIWH5new · submitted 2021-02-02 · 💻 cs.LG

Scaling Laws for Transfer

Pith reviewed 2026-05-18 00:52 UTC · model grok-4.3

classification 💻 cs.LG
keywords scaling laws, transfer learning, fine-tuning, language models, power laws, effective data, neural network scaling

The pith

Pre-training multiplies the effective size of fine-tuning datasets according to a power law in model size and data volume.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper measures how much pre-training on a large language corpus helps when later fine-tuning on much smaller datasets. It converts the observed fine-tuned loss back into an equivalent amount of from-scratch training data that would have been required to reach the same loss, then shows that this effective transferred quantity follows a clean power law in both the number of model parameters and the fine-tuning dataset size. A reader should care because the result supplies a concrete, testable rule for predicting transfer gains without running every possible experiment. The authors interpret the power-law exponents as direct measures of how general a model is and how close the pre-training and fine-tuning distributions are. They conclude that pre-training simply multiplies the fine-tuning dataset size by a predictable factor.

Core claim

When models are pre-trained on a large language dataset and then fine-tuned, the loss continues to drop with more parameters even after from-scratch training has saturated; inverting the from-scratch loss-versus-data curve shows that the amount of effective data transferred obeys a power law in parameter count and fine-tuning dataset size, so that pre-training multiplies the fine-tuning dataset size.
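The claim can be written compactly in symbols. The block below is a notational sketch, not the paper's own equations: $N$ is parameter count, $D_{\text{ft}}$ the fine-tuning dataset size, and $D_{\text{eff}}$ the from-scratch data that matches the fine-tuned loss. The names $D_T$ (transferred data), $k$, $\alpha$, and $\beta$ are assumptions introduced here for illustration.

```latex
\begin{align}
L_{\text{scratch}}(N, D_{\text{eff}}) &= L_{\text{finetune}}(N, D_{\text{ft}}), \\
D_T &= D_{\text{eff}} - D_{\text{ft}}, \\
D_T &\approx k \, D_{\text{ft}}^{\alpha} \, N^{\beta}, \\
\frac{D_{\text{eff}}}{D_{\text{ft}}} &= 1 + \frac{D_T}{D_{\text{ft}}}
  \approx 1 + k \, D_{\text{ft}}^{\alpha - 1} N^{\beta}.
\end{align}
```

The last line is the multiplicative reading: under these assumptions, pre-training scales the fine-tuning dataset by the factor $D_{\text{eff}} / D_{\text{ft}}$.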

What carries the argument

Effective data transferred, obtained by inverting the observed fine-tuned loss against the loss curve measured in from-scratch training to find how much additional data would have produced the same loss.
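The inversion can be sketched numerically. The code below is a minimal illustration, assuming a hypothetical from-scratch loss curve of the form `(d_c / D) ** alpha_d` at a fixed model size; the constants `d_c` and `alpha_d` are placeholders, not values from the paper, and all function names are invented for this example. It also assumes the transferred amount is the matched data minus the fine-tuning data actually used.

```python
import math


def scratch_loss(d_tokens, d_c=1.0e9, alpha_d=0.1):
    """Hypothetical from-scratch loss at a fixed model size (placeholder constants)."""
    return (d_c / d_tokens) ** alpha_d


def effective_data(target_loss, lo=1.0, hi=1.0e15, iters=200):
    """Find D such that scratch_loss(D) == target_loss by bisection in log-space.

    Relies on scratch_loss being strictly decreasing in D.
    """
    if not (scratch_loss(hi) <= target_loss <= scratch_loss(lo)):
        raise ValueError("target loss is outside the bracketed range")
    log_lo, log_hi = math.log(lo), math.log(hi)
    for _ in range(iters):
        mid = 0.5 * (log_lo + log_hi)
        if scratch_loss(math.exp(mid)) > target_loss:
            log_lo = mid  # loss still too high: more data is needed
        else:
            log_hi = mid
    return math.exp(0.5 * (log_lo + log_hi))


def transferred_data(finetuned_loss, d_finetune):
    """Matched from-scratch data minus the fine-tuning data actually used."""
    return effective_data(finetuned_loss) - d_finetune


# Usage: pretend a model fine-tuned on 1e6 tokens reached the loss that
# from-scratch training would need 5e7 tokens to reach.
observed = scratch_loss(5.0e7)
d_eff = effective_data(observed)
d_t = transferred_data(observed, 1.0e6)
```

Bisection is used here instead of a closed-form inverse so the same sketch works for any monotone loss curve, including an interpolated empirical one.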

If this is right

  • Transfer performance can be predicted in advance from parameter count, fine-tuning size, and the measured exponents.
  • The slope of the power law in model size quantifies how generally useful the pre-trained representations are.
  • The slope in fine-tuning data size quantifies how close the pre-training and target distributions are.
  • Overall scaling of transfer follows the same predictable pattern as scaling of performance from scratch.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training budgets could be allocated by first estimating the multiplication factor from the power law and then deciding how much additional fine-tuning data is still worth collecting.
  • The same inversion technique might reveal whether pre-training on one modality transfers to another by comparing effective data across domains.
  • If the power-law exponents turn out stable across many tasks, they could serve as a cheap diagnostic for how well a new pre-trained model will generalize before any fine-tuning is run.

Load-bearing premise

The loss curve measured during ordinary from-scratch training can be inverted to give the amount of from-scratch data that would reach the fine-tuned loss, with no extra effects from optimization or distribution mismatch.

What would settle it

Measure the actual loss after fine-tuning a model on a new small dataset and check whether the loss matches the value predicted by plugging the model size and dataset size into the reported power-law formula for effective transferred data.
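A sketch of that check follows, assuming the power law has the form `D_T = k * D_ft**alpha * N**beta` and that a from-scratch loss function is available. The constants `K`, `ALPHA`, `BETA` and the curve `toy_scratch_loss` are placeholders chosen for illustration, not values reported by the paper, and the function names are hypothetical.

```python
from typing import Callable


def predicted_transfer(n_params: float, d_ft: float,
                       k: float, alpha: float, beta: float) -> float:
    """Assumed power-law form for effective data transferred."""
    return k * d_ft ** alpha * n_params ** beta


def data_multiplier(n_params: float, d_ft: float,
                    k: float, alpha: float, beta: float) -> float:
    """Factor by which pre-training scales the fine-tuning dataset."""
    return 1.0 + predicted_transfer(n_params, d_ft, k, alpha, beta) / d_ft


def predicted_finetune_loss(n_params: float, d_ft: float,
                            k: float, alpha: float, beta: float,
                            scratch_loss: Callable[[float, float], float]) -> float:
    """Evaluate the from-scratch curve at the total effective data."""
    d_eff = d_ft + predicted_transfer(n_params, d_ft, k, alpha, beta)
    return scratch_loss(n_params, d_eff)


def relative_gap(predicted: float, measured: float) -> float:
    """Relative disagreement between predicted and measured loss."""
    return abs(predicted - measured) / measured


def toy_scratch_loss(n_params: float, d_tokens: float) -> float:
    # Hypothetical data-limited curve; it ignores n_params for simplicity.
    return (1.0e9 / d_tokens) ** 0.1


K, ALPHA, BETA = 2.0, 0.5, 0.5  # placeholder constants, not fitted values
pred = predicted_finetune_loss(1.0e6, 1.0e4, K, ALPHA, BETA, toy_scratch_loss)
```

In a real test, `relative_gap(pred, measured_loss)` would be computed for a held-out fine-tuning run, and a gap larger than the run-to-run noise would count against the formula.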

Original abstract

We study empirical scaling laws for transfer learning between distributions in an unsupervised, fine-tuning setting. When we train increasingly large neural networks from-scratch on a fixed-size dataset, they eventually become data-limited and stop improving in performance (cross-entropy loss). When we do the same for models pre-trained on a large language dataset, the slope in performance gains is merely reduced rather than going to zero. We calculate the effective data "transferred" from pre-training by determining how much data a transformer of the same size would have required to achieve the same loss when training from scratch. In other words, we focus on units of data while holding everything else fixed. We find that the effective data transferred is described well in the low data regime by a power-law of parameter count and fine-tuning dataset size. We believe the exponents in these power-laws correspond to measures of the generality of a model and proximity of distributions (in a directed rather than symmetric sense). We find that pre-training effectively multiplies the fine-tuning dataset size. Transfer, like overall performance, scales predictably in terms of parameters, data, and compute.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript studies empirical scaling laws for transfer learning in an unsupervised fine-tuning setting for transformers. It shows that pre-trained models continue improving with fine-tuning data in regimes where from-scratch models plateau due to data limits. Effective data transferred from pre-training is computed by inverting the from-scratch loss-versus-data curve to find the D_eff that would produce the observed fine-tuned loss for the same model size. This D_eff is reported to follow a power-law dependence on parameter count and fine-tuning dataset size in the low-data regime, with the interpretation that pre-training multiplies the fine-tuning data and that the exponents measure generality and distribution proximity.

Significance. If the central results hold after addressing the inversion assumptions, the work supplies a data-centric, quantitative description of transfer that extends existing scaling-law analyses and could guide decisions on pre-training compute allocation versus fine-tuning data. The focus on low-data regime and the explicit power-law form for effective transferred data are useful contributions, though they rest on the validity of treating from-scratch curves as an invertible baseline.

major comments (2)
  1. [Effective data calculation (described in abstract and methods)] The effective-data inversion (L_scratch(N, D_eff) = L_finetune(N, D_ft)) is the load-bearing step for all subsequent power-law claims. The manuscript provides no diagnostics that the power-law regime, exponents, or location remain unchanged when training begins from a pre-trained checkpoint rather than random initialization; differing optimization trajectories or effective capacity could systematically bias D_eff. This assumption is not tested and directly affects the claim that pre-training multiplies fine-tuning data.
  2. [Results on effective transferred data] The power-law fit to effective data in the low-data regime is presented without error bars on the fitted exponents, without the exact functional form or regression procedure used for the from-scratch baseline, and without comparisons to alternative forms (e.g., log or saturating functions). These omissions make it impossible to assess how well the power law actually describes the data or how sensitive the reported exponents are to fitting choices.
minor comments (2)
  1. [Notation and definitions] Notation for D_eff and the power-law exponents should be introduced with explicit equations early in the text to improve readability when the same symbols appear in later figures and interpretations.
  2. [Figures] Several loss-curve figures would be clearer if they overlaid the from-scratch and fine-tuned curves on identical axes with explicit indication of the inversion points used to obtain D_eff.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and insightful comments on our work. We believe the suggested revisions will strengthen the presentation of our results on scaling laws for transfer learning. Below we respond point-by-point to the major comments.

  1. Referee: [Effective data calculation (described in abstract and methods)] The effective-data inversion (L_scratch(N, D_eff) = L_finetune(N, D_ft)) is the load-bearing step for all subsequent power-law claims. The manuscript provides no diagnostics that the power-law regime, exponents, or location remain unchanged when training begins from a pre-trained checkpoint rather than random initialization; differing optimization trajectories or effective capacity could systematically bias D_eff. This assumption is not tested and directly affects the claim that pre-training multiplies fine-tuning data.

    Authors: We agree that validating the inversion assumption is important for the robustness of our claims. While the core methodology relies on matching observed losses to the from-scratch scaling curve, we did not explicitly test whether the from-scratch power-law exponents or regimes shift when initializing from a pre-trained model. In the revised version, we will add a discussion of this potential limitation and, where computationally feasible, include diagnostic experiments comparing loss curves starting from pre-trained weights versus random initialization in the low-data regime to assess any systematic bias in D_eff.

    Revision: yes

  2. Referee: [Results on effective transferred data] The power-law fit to effective data in the low-data regime is presented without error bars on the fitted exponents, without the exact functional form or regression procedure used for the from-scratch baseline, and without comparisons to alternative forms (e.g., log or saturating functions). These omissions make it impossible to assess how well the power law actually describes the data or how sensitive the reported exponents are to fitting choices.

    Authors: We appreciate this point and acknowledge that additional details on the fitting procedure would enhance the clarity and reproducibility of our results. In the revision, we will specify the exact functional form used for the from-scratch baseline (power-law in N and D), detail the regression procedure (e.g., linear regression on log-transformed variables), include error bars or confidence intervals on the fitted exponents derived from bootstrap resampling or similar methods, and provide comparisons to alternative functional forms such as logarithmic or saturating models to justify the power-law choice in the low-data regime.

    Revision: yes
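The fitting procedure described in this response can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it fits `log D_T = log k + alpha * log D_ft + beta * log N` by ordinary least squares and obtains percentile confidence intervals by resampling data points with replacement. The function names are hypothetical, and the data are synthetic, generated from made-up constants.

```python
import numpy as np


def fit_power_law(n_params, d_ft, d_transfer):
    """Least-squares fit in log-space. Returns (k, alpha, beta)."""
    n_params, d_ft, d_transfer = map(np.asarray, (n_params, d_ft, d_transfer))
    design = np.column_stack([np.ones(n_params.size), np.log(d_ft), np.log(n_params)])
    coef, *_ = np.linalg.lstsq(design, np.log(d_transfer), rcond=None)
    return float(np.exp(coef[0])), float(coef[1]), float(coef[2])


def bootstrap_ci(n_params, d_ft, d_transfer, n_boot=1000, level=0.95, seed=0):
    """Percentile intervals for (k, alpha, beta); rows are lower and upper bounds."""
    n_params, d_ft, d_transfer = map(np.asarray, (n_params, d_ft, d_transfer))
    rng = np.random.default_rng(seed)
    n = n_params.size
    samples = np.empty((n_boot, 3))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample points with replacement
        samples[b] = fit_power_law(n_params[idx], d_ft[idx], d_transfer[idx])
    tail = 100.0 * (1.0 - level) / 2.0
    return np.percentile(samples, [tail, 100.0 - tail], axis=0)


# Synthetic example: made-up constants with multiplicative log-normal noise.
rng = np.random.default_rng(1)
N, D = np.meshgrid(np.logspace(5, 8, 6), np.logspace(5, 8, 6))
N, D = N.ravel(), D.ravel()
true_k, true_alpha, true_beta = 3.0, 0.2, 0.4  # illustrative only
D_T = true_k * D**true_alpha * N**true_beta * np.exp(rng.normal(0.0, 0.05, N.size))

k_hat, alpha_hat, beta_hat = fit_power_law(N, D, D_T)
ci = bootstrap_ci(N, D, D_T, n_boot=200)
```

Comparing against logarithmic or saturating alternatives would need a separate fit per candidate form and a common goodness-of-fit measure, which this sketch does not cover.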

Circularity Check

0 steps flagged

No significant circularity; empirical definition of effective data followed by power-law fit is standard scaling analysis


The paper defines effective transferred data via inversion of the from-scratch loss curve to match observed fine-tuned loss, then empirically observes that this quantity follows a power-law in N and D_ft within the low-data regime. This is a measurement-plus-fitting procedure for reporting scaling relations, not a first-principles derivation whose claimed result reduces to its inputs by construction. The inversion step rests on an assumption about curve applicability (a correctness concern), but does not create a self-definitional loop or rename a fitted quantity as an independent prediction. No equations or steps in the abstract or described chain exhibit the specific reductions required for circularity flags (e.g., no power-law exponents derived tautologically from the inversion itself). The work remains self-contained as observational scaling laws.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim depends on the validity of inverting loss curves to obtain effective data and on the assumption that power-law forms observed in from-scratch training continue to apply when matching fine-tuned performance.

free parameters (1)
  • power-law exponents for effective transferred data
    Exponents relating effective data to parameter count and to fine-tuning dataset size are determined by fitting observed values.
axioms (1)
  • domain assumption: Loss scales as a power law with dataset size in the from-scratch regime
    Used to back-calculate how much data would have been needed to reach the fine-tuned loss.

pith-pipeline@v0.9.0 · 5721 in / 1337 out tokens · 42357 ms · 2026-05-18T00:52:50.896713+00:00 · methodology


Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

    cs.CL 2023-04 accept novelty 8.0

    Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.

  2. On the Invariance and Generality of Neural Scaling Laws

    cs.LG 2026-05 unverdicted novelty 7.0

    Neural scaling laws are invariant under bijective data transformations and change predictably with information resolution ρ under non-bijective transformations, enabling cross-domain transport of fitted exponents.

  3. Rectification Difficulty and Optimal Sample Allocation in LLM-Augmented Surveys

    cs.AI 2026-04 unverdicted novelty 7.0

    A method using predicted rectification difficulty for optimal human sample allocation in LLM-augmented surveys captures 61-79% of theoretical efficiency gains and reduces MSE by 11% on two datasets without pilot data.

  4. Practical Scaling Laws: Converting Compute into Performance in a Data-Constrained World

    cs.LG 2026-05 conditional novelty 6.0

    A new scaling law L(N, D, T) = E + (L0 - E) h/(1+h) with h = a/N^α + b/T^β + c N^γ/D^δ that decomposes loss into undercapacity, undertraining, and overfitting terms and saturates between E and L0.

  5. A Qualitative Test-Risk Mechanism for Scaling Behavior in Normalized Residual Networks

    cs.LG 2026-05 unverdicted novelty 6.0

    Depth expansion in normalized residual networks yields provable test-risk improvement through representational, optimization, and generalization gains under first-order descent and norm-control conditions.

  6. Pretraining Induces a Reusable Spectral Basis for Downstream Task Adaptation

    cs.LG 2026-05 unverdicted novelty 6.0

    Pretraining induces stable leading singular vectors that form a reusable spectral basis inherited by downstream tasks, enabling competitive performance with 0.2% trainable parameters on GLUE.

  7. Knowledge Transfer Scaling Laws for 3D Medical Imaging

    cs.CV 2026-05 conditional novelty 6.0

    Transfer-aware data allocation derived from observed power-law scaling laws for asymmetric knowledge transfer in 3D medical imaging outperforms standard proportional sampling by up to 58% and generalizes to new budgets.

  8. A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws

    cs.LG 2026-04 unverdicted novelty 6.0

    Emergent intelligence is recast as the existence of the limit of performance E(N,P,K) as N,P,K to infinity, with necessary and sufficient conditions derived via nonlinear Lipschitz operator theory and scaling laws obt...

  9. SAM 3D: 3Dfy Anything in Images

    cs.CV 2025-11 unverdicted novelty 6.0

    SAM 3D reconstructs 3D objects from single images with geometry, texture, and pose using human-model annotated data at scale and synthetic-to-real training, achieving 5:1 human preference wins.

  10. Lessons from the Trenches on Reproducible Evaluation of Language Models

    cs.CL 2024-05 accept novelty 6.0

    The paper compiles practical lessons on reproducible LM evaluation and introduces the lm-eval library to mitigate common methodological problems in NLP.

  11. Scaling Data-Constrained Language Models

    cs.CL 2023-05 conditional novelty 6.0

    Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.

  12. BloombergGPT: A Large Language Model for Finance

    cs.LG 2023-03 conditional novelty 6.0

    BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.

  13. SemDeDup: Data-efficient learning at web-scale through semantic deduplication

    cs.LG 2023-03 unverdicted novelty 6.0

    SemDeDup removes semantic duplicates from datasets like LAION using pre-trained embeddings, cutting data by 50% with minimal performance loss and efficiency gains on C4.

  14. Efficient Training of Language Models to Fill in the Middle

    cs.CL 2022-07 unverdicted novelty 6.0

    Autoregressive language models trained on data with middle spans relocated to the end learn infilling without degrading left-to-right perplexity or sampling quality.

  15. Language Models (Mostly) Know What They Know

    cs.CL 2022-07 unverdicted novelty 6.0

    Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

  16. Scaling Laws and Interpretability of Learning from Repeated Data

    cs.LG 2022-05 accept novelty 6.0

    Repeating 0.1% of training data 100 times degrades an 800M parameter model's performance to that of a 400M model by damaging copying mechanisms and induction heads associated with generalization.

  17. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    cs.CL 2022-04 unverdicted novelty 6.0

    RLHF alignment training on language models boosts NLP performance, supports skill specialization, enables weekly online updates with fresh human data, and shows a linear relation between RL reward and sqrt(KL divergen...

  18. Trust, but Verify: Peeling Low-Bit Transformer Networks for Training Monitoring

    cs.LG 2026-05 unverdicted novelty 5.0

    A layer-wise peeling framework creates reference bounds to diagnose under-optimized layers in trained decoder-only transformers, including low-bit and quantized versions.

  19. A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws

    cs.LG 2026-04 unverdicted novelty 5.0

    Emergent intelligence corresponds to the limit of a performance function E(N,P,K) as N, P, K go to infinity, originating from a parameter-limit architecture whose existence is governed by Lipschitz conditions, with sc...

  20. Small Language Models are the Future of Agentic AI

    cs.AI 2025-06 unverdicted novelty 5.0

    Small language models are sufficiently capable, more suitable, and far more economical than large models for the repetitive tasks that dominate agentic AI systems.


  65. [88]

    arXiv , Author =:1908.08351 , Primaryclass =

    Compositionality decomposed: how do neural networks generalise? , Year =. arXiv , Author =:1908.08351 , Primaryclass =

  66. [90]

    Generative Pretraining From Pixels , Year =

    Chen, Mark and Radford, Alec and Child, Rewon and Wu, Jeffrey and Jun, Heewoo and Luan, David and Sutskever, Ilya , Booktitle =. Generative Pretraining From Pixels , Year =

  67. [91]

    One Epoch Is All You Need

    Aran Komatsuzaki , Date-Added =. arXiv:1906.06669 , Title =

  68. [92]

    An Empirical Model of Large-Batch Training

    Sam McCandlish and Jared Kaplan and Dario Amodei and OpenAI Dota Team , Date-Added =. arXiv:1812.06162 , Title =

  69. [93]

    XLNet: Generalized Autoregressive Pretraining for Language Understanding

    Zhilin Yang and Zihang Dai and Yiming Yang and Jaime Carbonell and Ruslan Salakhutdinov and Quoc V. Le , Date-Added =. XLNet: Generalized Autoregressive Pretraining for Language Understanding , Year =. arXiv:1906.08237 , Keywords =

  70. [95]

    Residual Networks Behave Like Ensembles of Relatively Shallow Networks , Year =

    Andreas Veit and Michael Wilber and Serge Belongie , Eprint =. Residual Networks Behave Like Ensembles of Relatively Shallow Networks , Year =

  71. [96]

    Language Models are Unsupervised Multitask Learners , Year =

    Radford, Alec and Wu, Jeff and Child, Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya , Date-Modified =. Language Models are Unsupervised Multitask Learners , Year =. openai.com , Keywords =

  72. [97]

    Improving language understanding by generative pre-training , Year =

    Radford, Alec and Narasimhan, Karthik and Salimans, Tim and Sutskever, Ilya , Date-Modified =. Improving language understanding by generative pre-training , Year =. URL https://s3-us-west-2. amazonaws. com/openai-assets/research-covers/languageunsupervised/language understanding paper. pdf , Keywords =

  73. [98]

    Attention is All you Need , Url =

    Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, ukasz and Polosukhin, Illia , Booktitle =. Attention is All you Need , Url =. 2017 , Bdsk-Url-1 =

  74. [99]

    2018 , Bdsk-Url-1 =

    Dario Amodei AND Danny Hernandez , Date-Added =. 2018 , Bdsk-Url-1 =

  75. [100]

    Selecting Sample Sizes , Url =

  76. [101]

    Sample Size Determination , Url =

  77. [102]

    On the Computational Inefficiency of Large Batch Sizes for Stochastic Gradient Descent

    Noah Golmant and Nikita Vemuri and Zhewei Yao and Vladimir Feinberg and Amir Gholami and Kai Rothauge and Michael W. Mahoney and Joseph Gonzalez , Date-Added =. On the Computational Inefficiency of Large Batch Sizes for Stochastic Gradient Descent , Year =. 1811.12941 , Eprinttype =

  78. [103]

    The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning

    Siyuan Ma and Raef Bassily and Mikhail Belkin , Date-Added =. The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning , Year =. 1712.06559 , Eprinttype =

  79. [104]

    TrueSkill : A Bayesian Skill Rating System , Url =

    Ralf Herbrich and Minka, Tom and Graepel, Thore , Booktitle =. TrueSkill : A Bayesian Skill Rating System , Url =. 2007 , Bdsk-Url-1 =

  80. [105]

    Are Deep Policy Gradient Algorithms Truly Policy Gradient Algorithms? , Year =

    Andrew Ilyas and Logan Engstrom and Shibani Santurkar and Dimitris Tsipras and Firdaus Janoos and Larry Rudolph and Aleksander Madry , Date-Added =. Are Deep Policy Gradient Algorithms Truly Policy Gradient Algorithms? , Year =. 1811.02553 , Eprinttype =
