Pith · machine review for the scientific record

arxiv: 2001.08361 · v1 · submitted 2020-01-23 · 💻 cs.LG · stat.ML

Recognition: unknown

Scaling Laws for Neural Language Models

Authors on Pith: no claims yet
Classification: 💻 cs.LG · stat.ML
Keywords: model size · training · models · amount · compute · dataset · dependence
Original abstract

We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. Other architectural details such as network width or depth have minimal effects within a wide range. Simple equations govern the dependence of overfitting on model/dataset size and the dependence of training speed on model size. These relationships allow us to determine the optimal allocation of a fixed compute budget. Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.
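The power-law relationships the abstract describes can be made concrete with a short sketch. This is a minimal illustration, not the paper's code: the loss follows a law of the form L(N) = (N_c / N)^alpha_N in model size N, and the constants below are assumed placeholder values, not the paper's fitted ones. The fit recovers the exponent by linear regression in log-log space, where a power law becomes a straight line.

```python
# Minimal sketch of fitting a scaling-law exponent. The "true" constants
# here are illustrative assumptions, not the paper's reported fits.
import numpy as np

rng = np.random.default_rng(0)

alpha_N = 0.076   # assumed exponent for model-size scaling
N_c = 8.8e13      # assumed scale constant

# Synthetic model sizes spanning several orders of magnitude.
N = np.logspace(6, 11, 20)
# Loss with small multiplicative noise: L(N) = (N_c / N)^alpha_N.
loss = (N_c / N) ** alpha_N * np.exp(rng.normal(0, 0.01, N.size))

# In log-log space: log L = -alpha_N * log N + alpha_N * log N_c,
# so a degree-1 polyfit recovers both constants.
slope, intercept = np.polyfit(np.log(N), np.log(loss), 1)
alpha_hat = -slope
Nc_hat = np.exp(intercept / alpha_hat)

print(f"fitted alpha_N = {alpha_hat:.3f}, fitted N_c = {Nc_hat:.2e}")
```

The same log-log regression applies to the dataset-size and compute trends; the paper's point is that such straight-line fits hold across many orders of magnitude, which is what makes extrapolation and compute-budget allocation tractable.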

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    cs.CL 2022-01 accept novelty 9.0

    Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.

  2. Tokens-per-Parameter Coverage Is Critical for Robust LLM Scaling Law Extrapolation

    cs.LG 2026-05 unverdicted novelty 8.0

    Fixed tokens-per-parameter ratios in scaling law experiments induce ill-conditioned least-squares fits due to Jacobian geometry, making scale coefficients unidentifiable and extrapolations unreliable; diverse TPP cove...

  3. Quantum-enhanced Large Language Models on Quantum Hardware via Cayley Unitary Adapters

    quant-ph 2026-05 unverdicted novelty 8.0

    Cayley unitary adapters executed on real quantum hardware improve LLM perplexity by 1.4% on Llama 3.1 8B with 6000 parameters and recover 83% of compression-induced degradation on SmolLM2.

  4. Nearly Optimal Attention Coresets

    cs.DS 2026-05 unverdicted novelty 8.0

    ε-coresets for attention exist of size O(√d e^{ρ+o(ρ)}/ε) for unit-norm keys/values and queries of norm ≤ρ, nearly matching the Ω(√d e^ρ/ε) lower bound.

  5. Efficient Training on Multiple Consumer GPUs with RoundPipe

    cs.DC 2026-04 conditional novelty 8.0

    RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on...

  6. The Query Channel: Information-Theoretic Limits of Masking-Based Explanations

    cs.AI 2026-04 unverdicted novelty 8.0

    Masking-based explanations are governed by the information capacity of the query channel, with reliable recovery achievable below capacity via sparse maximum-likelihood decoding but impossible above it.

  7. The Spectral Lifecycle of Transformer Training: Transient Compression Waves, Persistent Spectral Gradients, and the Q/K-V Asymmetry

    cs.LG 2026-04 unverdicted novelty 8.0

    Transformer weight spectra exhibit transient compression waves that propagate layer-wise, persistent non-monotonic depth gradients in power-law exponents, and Q/K-V asymmetry, with the spectral exponent alpha predicti...

  8. Large Language Diffusion Models

    cs.CL 2025-02 unverdicted novelty 8.0

    LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

  9. Learning to (Learn at Test Time): RNNs with Expressive Hidden States

    cs.LG 2024-07 conditional novelty 8.0

    TTT layers treat the hidden state as a trainable model updated at test time, allowing linear-complexity sequence models to scale perplexity reduction with context length unlike Mamba.

  10. KAN: Kolmogorov-Arnold Networks

    cs.LG 2024-04 conditional novelty 8.0

    KANs with learnable univariate spline activations on edges achieve better accuracy than MLPs with fewer parameters, faster scaling, and direct visualization for scientific discovery.

  11. Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

    cs.CL 2023-04 accept novelty 8.0

    Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.

  12. Discovering Language Model Behaviors with Model-Written Evaluations

    cs.CL 2022-12 unverdicted novelty 8.0

    Language models can automatically generate high-quality evaluation datasets that reveal new cases of inverse scaling, sycophancy, and concerning goal-seeking behaviors, including some worsened by RLHF.

  13. The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    cs.CL 2020-12 conditional novelty 8.0

    The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl ...

  14. How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization

    cs.LG 2026-05 unverdicted novelty 7.0

    The authors derive a Maximally Scale-Stable Parameterization (MSSP) for MoE models that achieves robust learning-rate transfer and monotonic performance gains with scale across co-scaling regimes of width, experts, an...

  15. Do Language Models Align with Brains? Prediction Scores Are Not Enough

    q-bio.NC 2026-05 unverdicted novelty 7.0

    Language model representations fail all L-PACT alignment gates once controls explain the apparent predictive and relational effects.

  16. Scaling Laws for Mixture Pretraining Under Data Constraints

    cs.LG 2026-05 conditional novelty 7.0

    Repetition-aware scaling laws show scarce target data in pretraining mixtures can be repeated 15-20 times optimally, with the best count depending on data size, compute, and model scale.

  17. MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving

    cs.RO 2026-05 unverdicted novelty 7.0

    MindVLA-U1 introduces a unified streaming VLA with shared backbone, framewise memory, and language-guided action diffusion that surpasses human drivers on WOD-E2E planning metrics.

  18. AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration -- Learning from Cheap, Optimizing Expensive

    cs.AI 2026-05 unverdicted novelty 7.0

    AutoLLMResearch trains agents via a multi-fidelity environment and MDP pipeline to extrapolate configuration principles from inexpensive to costly LLM experiments.

  19. Uniform Scaling Limits in AdamW-Trained Transformers

    stat.ML 2026-05 unverdicted novelty 7.0

    AdamW-trained transformer hidden states and backpropagated variables converge uniformly in L2 to a forward-backward ODE system (McKean-Vlasov when non-causal) at rate O(L^{-1}+L^{-1/3}H^{-1/2}) as depth L and heads H ...

  20. Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining

    cs.CL 2026-05 unverdicted novelty 7.0

    Temporarily reducing the learning rate on upper-layer query and key projections during early GPT pretraining prevents premature attention specialization and improves model performance.

  21. Sharp feature-learning transitions and Bayes-optimal neural scaling laws in extensive-width networks

    stat.ML 2026-05 unverdicted novelty 7.0

    In extensive-width networks, features are recovered sequentially through sharp phase transitions, yielding an effective width k_c that unifies Bayes-optimal generalization error scaling as Θ(k_c d / n).

  22. GraphInstruct: A Progressive Benchmark for Diagnosing Capability Gaps in LLM Graph Generation

    cs.SI 2026-05 unverdicted novelty 7.0

    GraphInstruct is a progressive benchmark with six complexity levels for LLM graph generation that identifies multi-constraint composition as the hardest point and shows a verification-guided iterative framework outper...

  23. Urban-ImageNet: A Large-Scale Multi-Modal Dataset and Evaluation Framework for Urban Space Perception

    cs.CV 2026-05 unverdicted novelty 7.0

    Urban-ImageNet is a 2-million-image multi-modal dataset with HUSIC 10-class taxonomy enabling benchmarks for urban scene classification, cross-modal retrieval, and instance segmentation.

  24. The Wittgensteinian Representation Hypothesis: Is Language the Attractor of Multimodal Convergence?

    cs.AI 2026-05 unverdicted novelty 7.0

    Language representations serve as the asymptotic attractor for convergence in independently trained multimodal neural networks due to feature density asymmetry.

  25. How Much is Brain Data Worth for Machine Learning?

    cs.AI 2026-05 conditional novelty 7.0

    Brain data is worth a variable number of task samples depending on task-brain alignment, noise levels, and latent dimension, with conditions under which it also improves robustness to test distribution shift.

  26. DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards

    cs.LG 2026-05 unverdicted novelty 7.0

    DUET improves RLVR by allocating tokens across both prompt selection and rollout length, outperforming full-budget baselines even when using only half the tokens.

  27. Spectral Dynamics in Deep Networks: Feature Learning, Outlier Escape, and Learning Rate Transfer

    cond-mat.dis-nn 2026-05 unverdicted novelty 7.0

    A two-level DMFT predicts width-consistent outlier escape and hyperparameter transfer under μP in deep networks, with bulk restructuring dominating for tasks with many outputs.

  28. Curated Synthetic Data Doesn't Have to Collapse: A Theoretical Study of Generative Retraining with Pluralistic Preferences

    cs.LG 2026-05 unverdicted novelty 7.0

    Recursive generative retraining with pluralistic preferences converges to a stable diverse distribution that satisfies a weighted Nash bargaining solution.

  29. On the Invariance and Generality of Neural Scaling Laws

    cs.LG 2026-05 unverdicted novelty 7.0

    Neural scaling laws are invariant under bijective data transformations and change predictably with information resolution ρ under non-bijective transformations, enabling cross-domain transport of fitted exponents.

  30. Agentick: A Unified Benchmark for General Sequential Decision-Making Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    Agentick is a new benchmark for sequential decision-making agents that evaluates RL, LLM, VLM, hybrid, and human approaches across 37 tasks and finds no single method dominates.

  31. Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

    cs.AI 2026-05 unverdicted novelty 7.0

    RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.

  32. Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

    cs.AI 2026-05 unverdicted novelty 7.0

    RL training on more expressive logical tasks follows a steeper power-law scaling with reasoning depth and transfers more efficiently to math and reasoning benchmarks.

  33. Logic-Regularized Verifier Elicits Reasoning from LLMs

    cs.CL 2026-05 unverdicted novelty 7.0

    LOVER creates an unsupervised logic-regularized verifier that reaches 95% of supervised verifier performance on reasoning tasks across 10 datasets.

  34. Adaptive Selection of LoRA Components in Privacy-Preserving Federated Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    AS-LoRA adaptively chooses which LoRA factor to update per layer and round using a curvature-aware second-order score, eliminating reconstruction error floors and improving performance in DP federated learning.

  35. Attractor Geometry of Transformer Memory: From Conflict Arbitration to Confident Hallucination

    cs.AI 2026-05 unverdicted novelty 7.0

    Transformer hidden states encode facts as attractor basins; hallucinations occur from basin absence and conflicts from basin competition, detected cleanly by geometric margin rather than entropy.

  36. Attractor Geometry of Transformer Memory: From Conflict Arbitration to Confident Hallucination

    cs.AI 2026-05 unverdicted novelty 7.0

    Attractor basins in transformer hidden states unify conflict and hallucination as basin competition or absence, with geometric margin outperforming entropy for detection and a scaling law governing confident hallucina...

  37. The Predictive-Causal Gap: An Impossibility Theorem and Large-Scale Neural Evidence

    cs.LG 2026-05 unverdicted novelty 7.0

    Predictive representation learning structurally favors encoding slower or less noisy environment modes over causal system modes, as shown by an impossibility theorem for linear-Gaussian dynamics and large-scale neural...

  38. Budgeted LoRA: Distillation as Structured Compute Allocation for Efficient Inference

    cs.LG 2026-05 unverdicted novelty 7.0

    Budgeted LoRA treats LLM distillation as structured compute allocation under a single global budget, producing student models with tunable inference speedups of 1.74x to 4.05x while controlling perplexity and task accuracy.

  39. A foundation model of vision, audition, and language for in-silico neuroscience

    q-bio.NC 2026-05 unverdicted novelty 7.0

    TRIBE v2 is a multimodal AI model that predicts human brain activity more accurately than linear encoding models and recovers established neuroscientific findings through in-silico testing.

  40. RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs

    cs.LG 2026-05 unverdicted novelty 7.0

    RouteHijack is a routing-aware jailbreak that identifies safety-critical experts via activation contrast and optimizes suffixes to suppress them, reaching 69.3% average attack success rate on seven MoE LLMs with stron...

  41. Tempus: A Temporally Scalable Resource-Invariant GEMM Streaming Framework for Versal AI Edge

    cs.DC 2026-05 unverdicted novelty 7.0

    Tempus delivers 607 GOPS at 10.677 W using fixed 16 AIE cores on Versal AI Edge, with 211.2x better platform-aware utility than spatial SOTA ARIES and zero URAM/DSP utilization.

  42. InvEvolve: Evolving White-Box Inventory Policies via Large Language Models with Performance Guarantees

    cs.LG 2026-05 unverdicted novelty 7.0

    InvEvolve evolves white-box inventory policies from LLMs with statistical safety guarantees and outperforms classical and deep learning methods on synthetic and real retail data.

  43. CellxPert: Inference-Time MCMC Steering of a Multi-Omics Single-Cell Foundation Model for In-Silico Perturbation

    q-bio.GN 2026-04 unverdicted novelty 7.0

    CellxPert uses inference-time MCMC steering on a multi-omics single-cell foundation model to predict genome-wide transcriptomic responses to gene perturbations and outperforms baselines on cell-type annotation, pertur...

  44. Low Rank Adaptation for Adversarial Perturbation

    cs.LG 2026-04 unverdicted novelty 7.0

    Adversarial perturbations possess an inherently low-rank structure that enables more efficient and effective black-box adversarial attacks via subspace projection.

  45. The Cost of Consensus: Isolated Self-Correction Prevails Over Unguided Homogeneous Multi-Agent Debate

    cs.MA 2026-04 unverdicted novelty 7.0

    Homogeneous multi-agent debate introduces sycophantic conformity, contextual fragility, and consensus collapse, leading to equal or lower accuracy than isolated self-correction at 2.1-3.4x higher token cost on GSM-Har...

  46. An Empirical Study of Speculative Decoding on Software Engineering Tasks

    cs.SE 2026-04 unverdicted novelty 7.0

    Speculative decoding accelerates LLM inference on SE tasks without accuracy loss, with model-based methods suiting code generation and model-free methods suiting repository-level repair and editing.

  47. Optimizing ground state preparation protocols with autoresearch

    quant-ph 2026-04 unverdicted novelty 7.0

    AI coding agents evolve simple ground-state protocols into improved versions for VQE, DMRG, and AFQMC on spin models and molecules by using executable energy scores under fixed compute budgets.

  48. Optimizing ground state preparation protocols with autoresearch

    quant-ph 2026-04 unverdicted novelty 7.0

    AI coding agents mutate baseline protocols for VQE, DMRG, and AFQMC into versions with improved energy proxies on spin models and molecules while respecting computational budgets.

  49. Doing More With Less: Revisiting the Effectiveness of LLM Pruning for Test-Time Scaling

    cs.AI 2026-04 unverdicted novelty 7.0

    Unstructured pruning augments test-time scaling reasoning performance in LLMs and can outperform the unpruned model on benchmarks, contrary to expectations from structured pruning studies.

  50. Incompressible Knowledge Probes: Estimating Black-Box LLM Parameter Counts via Factual Capacity

    cs.LG 2026-04 unverdicted novelty 7.0

    Incompressible Knowledge Probes enable log-linear estimation of LLM parameter counts from factual accuracy on obscure questions, showing continued scaling of knowledge capacity across open and closed models.

  51. Agentic Witnessing: Pragmatic and Scalable TEE-Enabled Privacy-Preserving Auditing

    cs.CR 2026-04 unverdicted novelty 7.0

    Agentic Witnessing enables privacy-preserving auditing of semantic properties in private data by running an LLM auditor in a TEE that answers binary queries and produces cryptographic transcripts of its reasoning.

  52. Fine-tuning vs. In-context Learning in Large Language Models: A Formal Language Learning Perspective

    cs.CL 2026-04 unverdicted novelty 7.0

    Fine-tuning shows higher proficiency than in-context learning on in-distribution generalization in formal languages, with equal out-of-distribution performance and diverging inductive biases at high proficiency.

  53. How Much Is One Recurrence Worth? Iso-Depth Scaling Laws for Looped Language Models

    cs.LG 2026-04 unverdicted novelty 7.0

    A fitted iso-depth scaling law measures that one recurrence in looped transformers is worth r^0.46 unique blocks in validation loss.

  54. Stream-CQSA: Avoiding Out-of-Memory in Attention Computation via Flexible Workload Scheduling

    cs.LG 2026-04 unverdicted novelty 7.0

    Stream-CQSA uses CQS-based decomposition to stream exact attention computations for billion-token sequences on limited-memory hardware.

  55. LoopCTR: Unlocking the Loop Scaling Power for Click-Through Rate Prediction

    cs.IR 2026-04 unverdicted novelty 7.0

    LoopCTR trains CTR models with recursive layer reuse and process supervision so that zero-loop inference outperforms baselines on public and industrial datasets.

  56. Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms

    cs.CL 2026-04 unverdicted novelty 7.0

    Single-agent systems with tools provide the optimal performance-efficiency trade-off for small language models, outperforming base models and multi-agent setups.

  57. Rethinking Dataset Distillation: Hard Truths about Soft Labels

    cs.LG 2026-04 conditional novelty 7.0

    Soft labels hide the value of high-quality data subsets in dataset distillation, and a new compute-aware method outperforms existing approaches in hard-label settings on ImageNet-1K.

  58. Efficient Low-Resource Language Adaptation via Multi-Source Dynamic Logit Fusion

    cs.CL 2026-04 unverdicted novelty 7.0

    TriMix dynamically fuses logits from three model sources to outperform baselines and Proxy Tuning on eight low-resource languages across four model families.

  59. Causal inference for social network formation

    econ.EM 2026-04 conditional novelty 7.0

    Random team assignments in a professional firm reveal that indirect ties strongly increase new direct tie formation, while effects of degree and local density are smaller and less robust.

  60. Rectification Difficulty and Optimal Sample Allocation in LLM-Augmented Surveys

    cs.AI 2026-04 unverdicted novelty 7.0

    A method using predicted rectification difficulty for optimal human sample allocation in LLM-augmented surveys captures 61-79% of theoretical efficiency gains and reduces MSE by 11% on two datasets without pilot data.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · cited by 354 Pith papers · 3 internal anchors
