pith. machine review for the scientific record.

arxiv: 2010.14701 · v2 · submitted 2020-10-28 · 💻 cs.LG · cs.CL · cs.CV

Recognition: 2 theorem links

· Lean Theorem

Scaling Laws for Autoregressive Generative Modeling

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 07:44 UTC · model grok-4.3

classification 💻 cs.LG · cs.CL · cs.CV
keywords scaling laws · autoregressive transformers · cross-entropy loss · generative modeling · power-law scaling · multimodal models · image modeling · mathematical reasoning

The pith

Autoregressive Transformers improve via power-law-plus-constant scaling laws for cross-entropy loss across image, video, text-image, and math domains, with optimal size depending on compute through nearly universal exponents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper measures how cross-entropy loss decreases as model size and total compute increase in four separate autoregressive settings. In each case the loss follows a simple power-law-plus-constant form that holds smoothly over the measured range. The authors also find that the model size giving the best loss for a fixed compute budget itself follows a power law whose exponent is almost the same in every domain. This pattern lets them interpret the loss as the sum of the true data entropy and the KL divergence between data and model, and to predict how large a model must be to reach any target level of reducible error.
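
The fitting machinery is simple enough to sketch. A minimal pure-Python illustration on synthetic data; the coefficients a = 12, α = 0.30, L_∞ = 2.0 are invented for the example, not the paper's fitted values:

```python
import math

def fit_power_law_plus_constant(ns, losses):
    """Fit L(N) = a * N**(-alpha) + L_inf by grid search over
    (alpha, L_inf); for each pair the best prefactor a has a
    closed form from least squares. Returns (a, alpha, L_inf)."""
    best = None
    for ai in range(5, 61):                 # alpha in [0.05, 0.60]
        alpha = ai / 100.0
        for li in range(0, 401):            # L_inf in [0.00, 4.00]
            l_inf = li / 100.0
            xs = [n ** (-alpha) for n in ns]
            a = (sum(x * (l - l_inf) for x, l in zip(xs, losses))
                 / sum(x * x for x in xs))
            sse = sum((a * x + l_inf - l) ** 2
                      for x, l in zip(xs, losses))
            if best is None or sse < best[0]:
                best = (sse, a, alpha, l_inf)
    return best[1], best[2], best[3]

# Synthetic losses from a known law (a=12, alpha=0.30, L_inf=2.0).
ns = [10 ** (6 + 0.5 * k) for k in range(7)]
losses = [12.0 * n ** (-0.30) + 2.0 for n in ns]

a, alpha, l_inf = fit_power_law_plus_constant(ns, losses)
print(round(alpha, 2), round(l_inf, 2))     # → 0.3 2.0
```

On real measurements one would fit in log space and weight the points; solving for a in closed form keeps the search two-dimensional.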

Core claim

In generative image modeling, video modeling, multimodal image-text modeling, and mathematical problem solving, the cross-entropy loss of autoregressive Transformers decreases as a power law plus constant when plotted against model size or against compute; the optimal model size for a given compute budget likewise obeys a power law whose exponent is nearly the same across all four domains. The functional form allows an information-theoretic decomposition into irreducible entropy of the data and reducible KL divergence, and supplies concrete forecasts for model size needed to reach any chosen KL target.

What carries the argument

The power-law-plus-constant scaling relation between cross-entropy loss and (model size, compute budget) pair, together with the derived power-law relation for optimal model size versus compute.
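
The optimal-size relation can be derived from that form. A sketch under the illustrative assumption (ours, not the paper's parameterization) that loss separates as L(N, D) = a·N^(−α) + b·D^(−β) + L_∞ with compute C ≈ 6ND and invented coefficients: minimizing over N at fixed C gives N_opt ∝ C^(β/(α+β)), and a numeric minimization reproduces that exponent.

```python
import math

# Invented coefficients for illustration (not the paper's fits).
a, alpha = 15.0, 0.35
b, beta = 8.0, 0.25

def loss_at_fixed_compute(n, c):
    """L(N, D) = a*N**-alpha + b*D**-beta with D = C / (6*N)."""
    d = c / (6.0 * n)
    return a * n ** (-alpha) + b * d ** (-beta)

def n_opt(c):
    """Minimize the loss over a fine log-spaced grid of N."""
    best_n, best_l = None, float("inf")
    for i in range(4000):
        n = 10 ** (3 + 12 * i / 3999)       # N from 1e3 to 1e15
        l = loss_at_fixed_compute(n, c)
        if l < best_l:
            best_n, best_l = n, l
    return best_n

# Slope of log N_opt versus log C should equal beta/(alpha+beta).
cs = [10.0 ** e for e in range(15, 22)]
logs_n = [math.log10(n_opt(c)) for c in cs]
logs_c = [math.log10(c) for c in cs]
slope = (logs_n[-1] - logs_n[0]) / (logs_c[-1] - logs_c[0])
print(round(slope, 2), round(beta / (alpha + beta), 2))  # → 0.42 0.42
```

Setting dL/dN = 0 at fixed C gives the closed form N_opt = [αa/(βb·6^β)]^(1/(α+β)) · C^(β/(α+β)); the grid search just confirms it.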

If this is right

  • Billion-parameter models already achieve near-zero KL divergence on 8x8 downsampled YFCC100M images.
  • Model size required to reach any target reducible loss in nats per image can be read off directly from the fitted curves for other resolutions.
  • Mutual information between image and caption in multimodal models scales predictably with compute.
  • Extrapolation performance on mathematical problems outside the training distribution follows its own smooth scaling law.
  • After fine-tuning, classification loss and error rate on ImageNet continue to improve smoothly even after the generative loss has largely saturated.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training runs could be budgeted by first choosing the desired KL target and then solving for the minimal compute that supplies the corresponding optimal model size.
  • The near-universality of the optimal-size exponent suggests that similar scaling relations may govern other autoregressive tasks not tested here, such as audio or protein sequences.
  • If the power-law form persists, the marginal reduction in loss per additional FLOP remains positive and predictable at arbitrarily large scales, implying that performance gains from scale alone do not saturate within foreseeable compute limits.
  • The decomposition into entropy plus KL supplies a quantitative way to decide when a domain is 'solved' by a generative model versus when further scale is still required.
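
The budgeting idea in the first bullet reduces to a one-line inversion once the reducible term is fitted. Coefficients below are hypothetical:

```python
# Hypothetical fitted reducible loss (KL term): KL(N) = a * N**-alpha.
a, alpha = 12.0, 0.30

def model_size_for_kl(kl_target):
    """Invert KL = a * N**(-alpha) for the required model size."""
    return (a / kl_target) ** (1.0 / alpha)

# Halving the KL target multiplies the required size by 2**(1/alpha).
n1 = model_size_for_kl(0.10)
n2 = model_size_for_kl(0.05)
print(round(n2 / n1, 2))  # → 10.08, i.e. 2**(1/0.30)
```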

Load-bearing premise

That the same power-law-plus-constant shape that fits the measured range of model sizes and compute budgets continues to describe performance at scales well beyond those actually tested.

What would settle it

A single run at ten times the largest compute budget used in the paper in which the measured loss deviates by more than the reported uncertainty from the extrapolation of the fitted power-law-plus-constant curve.
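
As a check this is a single comparison; the numbers below are hypothetical:

```python
def deviates(measured, predicted, sigma):
    """True if the measured loss falls outside the extrapolation's
    reported uncertainty band, i.e. the fitted form fails."""
    return abs(measured - predicted) > sigma

# Hypothetical run at 10x compute: extrapolation 2.31 nats +/- 0.02.
print(deviates(2.40, 2.31, 0.02), deviates(2.32, 2.31, 0.02))  # → True False
```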

read the original abstract

We identify empirical scaling laws for the cross-entropy loss in four domains: generative image modeling, video modeling, multimodal image$\leftrightarrow$text models, and mathematical problem solving. In all cases autoregressive Transformers smoothly improve in performance as model size and compute budgets increase, following a power-law plus constant scaling law. The optimal model size also depends on the compute budget through a power-law, with exponents that are nearly universal across all data domains. The cross-entropy loss has an information theoretic interpretation as $S($True$) + D_{\mathrm{KL}}($True$||$Model$)$, and the empirical scaling laws suggest a prediction for both the true data distribution's entropy and the KL divergence between the true and model distributions. With this interpretation, billion-parameter Transformers are nearly perfect models of the YFCC100M image distribution downsampled to an $8\times 8$ resolution, and we can forecast the model size needed to achieve any given reducible loss (ie $D_{\mathrm{KL}}$) in nats/image for other resolutions. We find a number of additional scaling laws in specific domains: (a) we identify a scaling relation for the mutual information between captions and images in multimodal models, and show how to answer the question "Is a picture worth a thousand words?"; (b) in the case of mathematical problem solving, we identify scaling laws for model performance when extrapolating beyond the training distribution; (c) we finetune generative image models for ImageNet classification and find smooth scaling of the classification loss and error rate, even as the generative loss levels off. Taken together, these results strengthen the case that scaling laws have important implications for neural network performance, including on downstream tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that autoregressive Transformers exhibit consistent power-law scaling (plus constant) in cross-entropy loss as functions of model size N and compute budget C across four domains: image generation, video, multimodal image-text, and mathematical problem solving. From fits to L(N,C) the authors derive that the optimal model size N_opt(C) itself follows a power law in C whose exponents are nearly universal across domains. They interpret the loss information-theoretically as data entropy plus KL divergence, use the fits to forecast model sizes needed for target KL values, and report additional scaling relations for mutual information, out-of-distribution extrapolation in math, and finetuning to ImageNet classification.

Significance. If the reported empirical relations hold, the work supplies quantitative, cross-domain guidance for compute allocation and suggests that sufficiently large autoregressive models can approach the entropy of the underlying data distribution. The near-universality of the scaling exponents and the downstream-task results strengthen the practical case that scaling laws govern neural-network performance beyond the training objective.

major comments (1)
  1. [§3 and abstract] The functional form L(N,C) ≈ a N^{-α} + b C^{-β} + L_∞ and the derived N_opt(C) power law are obtained exclusively from data within the measured envelope (largest models ~10^9 parameters). The manuscript extrapolates this form to forecast KL divergences and optimal allocations at larger scales without additional held-out points or a derivation that would justify continued validity; this extrapolation is load-bearing for the universality claim and the forecasting statements.
minor comments (2)
  1. [Figures 1-4] Figure captions and axis labels should explicitly state the range of N and C used for each fit so readers can immediately see the extrapolation distance.
  2. [§2] The information-theoretic interpretation (S(True) + D_KL) is introduced in the abstract and §2 but the precise mapping from the fitted L_∞ to the entropy term is not restated in the main results section; a short clarifying sentence would help.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the positive assessment and for highlighting the extrapolation issue. We address this concern directly below and will make targeted revisions to clarify the empirical scope of our claims.

read point-by-point responses
  1. Referee: [§3 and abstract] The functional form L(N,C) ≈ a N^{-α} + b C^{-β} + L_∞ and the derived N_opt(C) power law are obtained exclusively from data within the measured envelope (largest models ~10^9 parameters). The manuscript extrapolates this form to forecast KL divergences and optimal allocations at larger scales without additional held-out points or a derivation that would justify continued validity; this extrapolation is load-bearing for the universality claim and the forecasting statements.

    Authors: We agree that the functional form and the derived N_opt(C) relation are fitted exclusively to data within the observed envelope (models up to ~10^9 parameters). The form L(N,C) = a N^{-α} + b C^{-β} + L_∞ was chosen because it yields excellent fits across all four domains with low residual error; the power-law terms capture the observed improvement with scale while L_∞ corresponds to the irreducible entropy of the data distribution under the information-theoretic interpretation given in the paper. Although we lack a rigorous derivation that guarantees the same exponents at all scales, the functional form is consistent with prior theoretical and empirical scaling-law literature. We validated robustness by refitting on data subsets and confirming that the exponents remain stable. The forecasts for target KL values are presented as extrapolations of the observed trends rather than guaranteed predictions. We will revise the abstract and §3 to state the empirical range explicitly, add a short limitations paragraph on extrapolation, and qualify the universality claim as holding within the measured regime and across the tested domains. revision: partial
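
The subset-refit stability check described in the response can be sketched as leave-one-out refits of the exponent. For simplicity the floor L_∞ is held fixed so each fit is linear in log-log space; the data are synthetic and noiseless, so the spread here is essentially zero, whereas real runs would show small but nonzero scatter:

```python
import math

def fit_alpha(ns, losses, l_inf):
    """With L_inf fixed, log(L - L_inf) = log(a) - alpha*log(N) is
    linear; recover alpha by ordinary least squares on the logs."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(l - l_inf) for l in losses]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return -num / den

# Synthetic runs from a known law (alpha = 0.30, L_inf = 2.0).
ns = [10 ** (6 + 0.5 * k) for k in range(7)]
losses = [12.0 * n ** (-0.30) + 2.0 for n in ns]

# Leave-one-out refits: the exponent should barely move.
alphas = [fit_alpha(ns[:i] + ns[i + 1:], losses[:i] + losses[i + 1:], 2.0)
          for i in range(len(ns))]
spread = max(alphas) - min(alphas)
print(round(spread, 6))  # → 0.0
```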

Circularity Check

0 steps flagged

No significant circularity; results are direct empirical fits to measured data

full rationale

The paper reports empirical scaling laws obtained by fitting the functional form L(N,C) ≈ a N^{-α} + b C^{-β} + L_∞ directly to cross-entropy loss measurements across model sizes and compute budgets in four domains. The optimal model size N_opt(C) is then obtained by minimizing the fitted loss at fixed C. These steps consist of standard curve-fitting to held-out validation data within the experimentally accessed range; no equation reduces to itself by definition, no parameter is renamed as a prediction, and no load-bearing premise rests on a self-citation chain. The derivation is therefore self-contained against the reported benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The paper assumes without derivation that loss follows a power-law-plus-constant form; all reported exponents and the constant floor are fitted to the observed data points.

free parameters (2)
  • power-law exponents (α, β)
    Fitted separately for each domain to match observed loss versus model size and compute.
  • constant floor L_∞
    Fitted per domain as the asymptotic loss value.
axioms (1)
  • domain assumption: Cross-entropy loss obeys a power-law-plus-constant functional form in model size and compute
    Invoked to perform the fits shown in the scaling plots.

pith-pipeline@v0.9.0 · 5682 in / 1290 out tokens · 31996 ms · 2026-05-13T07:44:52.955019+00:00 · methodology

discussion (0)


Forward citations

Cited by 32 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. KAN: Kolmogorov-Arnold Networks

    cs.LG 2024-04 conditional novelty 8.0

    KANs with learnable univariate spline activations on edges achieve better accuracy than MLPs with fewer parameters, faster scaling, and direct visualization for scientific discovery.

  2. Discovering Language Model Behaviors with Model-Written Evaluations

    cs.CL 2022-12 unverdicted novelty 8.0

    Language models can automatically generate high-quality evaluation datasets that reveal new cases of inverse scaling, sycophancy, and concerning goal-seeking behaviors, including some worsened by RLHF.

  3. The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    cs.CL 2020-12 conditional novelty 8.0

    The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl ...

  4. ScaleMoGen: Autoregressive Next-Scale Prediction for Human Motion Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    ScaleMoGen introduces a scale-wise autoregressive framework that quantizes motions into hierarchical discrete tokens and predicts next-scale maps to achieve SOTA FID 0.030 on HumanML3D and text-guided editing.

  5. On the Invariance and Generality of Neural Scaling Laws

    cs.LG 2026-05 unverdicted novelty 7.0

    Neural scaling laws are invariant under bijective data transformations and change predictably with information resolution ρ under non-bijective transformations, enabling cross-domain transport of fitted exponents.

  6. Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

    cs.AI 2026-05 unverdicted novelty 7.0

    RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.

  7. Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

    cs.AI 2026-05 unverdicted novelty 7.0

    RL training on more expressive logical tasks follows a steeper power-law scaling with reasoning depth and transfers more efficiently to math and reasoning benchmarks.

  8. Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

    cs.CV 2024-06 conditional novelty 7.0

    Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.

  9. Scaling and evaluating sparse autoencoders

    cs.LG 2024-06 unverdicted novelty 7.0

    K-sparse autoencoders with dead-latent fixes produce clean scaling laws and better feature quality metrics that improve with size, shown by training a 16-million-latent model on GPT-4 activations.

  10. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    cs.CL 2024-05 unverdicted novelty 7.0

    DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

  11. RWKV: Reinventing RNNs for the Transformer Era

    cs.CL 2023-05 unverdicted novelty 7.0

    RWKV uses a linear attention mechanism to deliver Transformer-level performance with RNN-style inference efficiency, demonstrated at up to 14 billion parameters.

  12. Scalable Diffusion Models with Transformers

    cs.CV 2022-12 unverdicted novelty 7.0

    DiTs achieve SOTA FID of 2.27 on ImageNet 256x256 by scaling transformer-based latent diffusion models, with performance improving consistently as Gflops increase.

  13. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

    cs.LG 2022-08 conditional novelty 7.0

    LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.

  14. LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

    cs.CV 2021-11 unverdicted novelty 7.0

    LAION-400M is a publicly released open dataset of 400 million CLIP-filtered image-text pairs with embeddings and kNN indices for efficient search.

  15. Practical Scaling Laws: Converting Compute into Performance in a Data-Constrained World

    cs.LG 2026-05 conditional novelty 6.0

    A new scaling law L(N, D, T) = E + (L0 - E) h/(1+h) with h = a/N^α + b/T^β + c N^γ/D^δ that decomposes loss into undercapacity, undertraining, and overfitting terms and saturates between E and L0.

  16. AIPO: Learning to Reason from Active Interaction

    cs.CL 2026-05 unverdicted novelty 6.0

    AIPO trains LLMs to expand their reasoning capability boundary via active multi-agent interaction with Verify, Knowledge, and Reasoning agents during RLVR, using importance sampling and clipping to handle feedback, th...

  17. The Power of Power Law: Asymmetry Enables Compositional Reasoning

    cs.AI 2026-04 unverdicted novelty 6.0

    Power-law data sampling creates beneficial asymmetry in the loss landscape that lets models acquire high-frequency skill compositions first, enabling more efficient learning of rare long-tail skills than uniform distr...

  18. Scaling-Aware Data Selection for End-to-End Autonomous Driving Systems

    cs.LG 2026-04 unverdicted novelty 6.0

    MOSAIC is a scaling-aware data selection framework that outperforms baselines in training end-to-end autonomous driving planners, achieving comparable or better EPDMS scores with up to 80% less data.

  19. TimelineReasoner: Advancing Timeline Summarization with Large Reasoning Models

    cs.CL 2026-04 unverdicted novelty 6.0

    TimelineReasoner applies large reasoning models in a Global Cognition plus Detail Exploration loop to produce more accurate, complete, and coherent timelines from news than prior LLM-based methods.

  20. Diffusion Model for Manifold Data: Score Decomposition, Curvature, and Statistical Complexity

    cs.LG 2026-03 unverdicted novelty 6.0

    Diffusion models on manifold-supported data admit score decompositions whose statistical rates are controlled by intrinsic dimension and curvature.

  21. Search-o1: Agentic Search-Enhanced Large Reasoning Models

    cs.AI 2025-01 unverdicted novelty 6.0

    Search-o1 integrates agentic retrieval-augmented generation and a Reason-in-Documents module into large reasoning models to dynamically supply missing knowledge and improve performance on complex science, math, coding...

  22. Scaling Relationship on Learning Mathematical Reasoning with Large Language Models

    cs.CL 2023-08 unverdicted novelty 6.0

    Pre-training loss predicts LLM math reasoning better than parameter count; rejection sampling fine-tuning with diverse paths raises LLaMA-7B accuracy on GSM8K from 35.9% with SFT to 49.3%.

  23. eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

    cs.CV 2022-11 unverdicted novelty 6.0

    An ensemble of stage-specialized text-to-image diffusion models improves prompt alignment over single shared-parameter models while preserving visual quality and inference speed.

  24. Language Models (Mostly) Know What They Know

    cs.CL 2022-07 unverdicted novelty 6.0

    Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

  25. A General Language Assistant as a Laboratory for Alignment

    cs.CL 2021-12 conditional novelty 6.0

    Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

  26. TIDE: Every Layer Knows the Token Beneath the Context

    cs.CL 2026-05 unverdicted novelty 5.0

    TIDE augments standard transformers with per-layer token embedding injection via an ensemble of memory blocks and a depth-conditioned router to mitigate rare-token undertraining and contextual collapse.

  27. Machine Unlearning for Class Removal through SISA-based Deep Neural Network Architectures

    cs.CV 2026-04 unverdicted novelty 5.0

    A modified SISA architecture with replay and gating achieves effective class removal from trained CNNs on image datasets while preserving accuracy and cutting retraining costs.

  28. Singularity Formation: Synergy in Theoretical, Numerical and Machine Learning Approaches

    math.NA 2026-04 unverdicted novelty 5.0

    The work introduces a modulation-based analytical method for singularity proofs in singular PDEs and refines ML techniques like PINNs and KANs to identify blowup solutions, with application to the open 3D Keller-Segel...

  29. Seed1.5-VL Technical Report

    cs.CV 2025-05 unverdicted novelty 4.0

    Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.

  30. DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

    cs.CL 2024-01 unverdicted novelty 4.0

    DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.

  31. A Survey of Large Language Models

    cs.CL 2023-03 accept novelty 3.0

    This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.

  32. Superposition Yields Robust Neural Scaling

    cs.LG 2025-05

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · cited by 31 Pith papers · 5 internal anchors

  1. [1]

    Language Models are Few-Shot Learners

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...

  2. [2]

    Generating Long Sequences with Sparse Transformers

    Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. CoRR, abs/1904.10509, 2019, 1904.10509 http://arxiv.org/abs/1904.10509

  3. [3]

    A Downsampled Variant of ImageNet as an Alternative to the CIFAR datasets

    Patryk Chrabaszcz, Ilya Loshchilov, and Frank Hutter. A downsampled variant of imagenet as an alternative to the CIFAR datasets. CoRR, abs/1707.08819, 2017, 1707.08819 http://arxiv.org/abs/1707.08819

  4. [4]

    Generative pretraining from pixels

    Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In Proceedings of Machine Learning and Systems 2020 , pages 10466--10478. 2020

  5. [5]

    Jukebox: A Generative Model for Music

    Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, and Ilya Sutskever. Jukebox: A generative model for music, 2020, 2005.00341 http://arxiv.org/abs/2005.00341

  6. [6]

    Compositionality decomposed: how do neural networks generalise?

    Dieuwke Hupkes, Verna Dankers, Mathijs Mul, and Elia Bruni. Compositionality decomposed: how do neural networks generalise?, 2019, 1908.08351 http://arxiv.org/abs/1908.08351

  7. [7]

    Deep Learning Scaling is Predictable, Empirically

    Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md. Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically, 2017, 1712.00409 http://arxiv.org/abs/1712.00409

  8. [8]

    Neural tangent kernel: Convergence and generalization in neural networks

    Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems, pages 8571--8580, 2018

  9. [9]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models, 2020, 2001.08361 http://arxiv.org/abs/2001.08361

  10. [10]

    One Epoch Is All You Need

    Aran Komatsuzaki. One epoch is all you need, 2019, 1906.06669 http://arxiv.org/abs/1906.06669

  11. [11]

    The Large Learning Rate Phase of Deep Learning: the Catapult Mechanism

    Aitor Lewkowycz, Yasaman Bahri, Ethan Dyer, Jascha Sohl-Dickstein, and Guy Gur-Ari. The large learning rate phase of deep learning: the catapult mechanism, 2020, 2003.02218 http://arxiv.org/abs/2003.02218

  12. [12]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Fixing weight decay regularization in adam. CoRR, abs/1711.05101, 2017, 1711.05101 http://arxiv.org/abs/1711.05101

  13. [13]

    Generating Wikipedia by summarizing long sequences

    Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. Generating wikipedia by summarizing long sequences. arXiv:1801.10198 [cs], 2018, 1801.10198 http://arxiv.org/abs/1801.10198

  14. [14]

    Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers

    Zhuohan Li, Eric Wallace, Sheng Shen, Kevin Lin, Kurt Keutzer, Dan Klein, and Joseph E. Gonzalez. Train large, then compress: Rethinking model size for efficient training and inference of transformers, 2020, 2002.11794 http://arxiv.org/abs/2002.11794

  15. [15]

    Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent

    Jaehoon Lee, Lechao Xiao, Samuel S. Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent, 2019, 1902.06720 http://arxiv.org/abs/1902.06720

  16. [16]

    The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning

    Siyuan Ma, Raef Bassily, and Mikhail Belkin. The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning. CoRR, abs/1712.06559, 2017, 1712.06559 http://arxiv.org/abs/1712.06559

  17. [17]

    An Empirical Model of Large-Batch Training

    Sam McCandlish, Jared Kaplan, Dario Amodei, and OpenAI Dota Team. An empirical model of large-batch training, 2018, 1812.06162 http://arxiv.org/abs/1812.06162

  18. [18]

    Recipes for Building an Open-Domain Chatbot

    Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, and Jason Weston. Recipes for building an open-domain chatbot, 2020, 2004.13637 http://arxiv.org/abs/2004.13637

  19. [19]

    On the Predictability of Pruning Across Scales

    Jonathan S. Rosenfeld, Jonathan Frankle, Michael Carbin, and Nir Shavit. On the predictability of pruning across scales, 2020, 2006.10621 http://arxiv.org/abs/2006.10621

  20. [20]

    A Constructive Prediction of the Generalization Error Across Scales

    Jonathan S. Rosenfeld, Amir Rosenfeld, Yonatan Belinkov, and Nir Shavit. A constructive prediction of the generalization error across scales, 2019, 1909.12673 http://arxiv.org/abs/1909.12673

  21. [21]

    Analysing Mathematical Reasoning Abilities of Neural Models

    David Saxton, Edward Grefenstette, Felix Hill, and Pushmeet Kohli. Analysing mathematical reasoning abilities of neural models. CoRR, abs/1904.01557, 2019, 1904.01557 http://arxiv.org/abs/1904.01557

  22. [22]

    A Neural Scaling Law from the Dimension of the Data Manifold

    Utkarsh Sharma and Jared Kaplan. A neural scaling law from the dimension of the data manifold, 2020, 2004.10802 http://arxiv.org/abs/2004.10802

  23. [23]

    Enhancing the Transformer with Explicit Relational Encoding for Math Problem Solving

    Imanol Schlag, Paul Smolensky, Roland Fernandez, Nebojsa Jojic, Jürgen Schmidhuber, and Jianfeng Gao. Enhancing the transformer with explicit relational encoding for math problem solving, 2019, 1910.06611 http://arxiv.org/abs/1910.06611

  24. [24]

    Multimodal transformer for unaligned multimodal language sequences

    Yao-Hung Hubert Tsai, Shaojie Bai, Paul Pu Liang, J Zico Kolter, Louis-Philippe Morency, and Ruslan Salakhutdinov. Multimodal transformer for unaligned multimodal language sequences. In Proceedings of the conference. Association for Computational Linguistics. Meeting , volume 2019, page 6558. NIH Public Access, 2019

  25. [25]

    The New Data and New Challenges in Multimedia Research

    Bart Thomee, David A. Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. The new data and new challenges in multimedia research. CoRR, abs/1503.01817, 2015, 1503.01817 http://arxiv.org/abs/1503.01817

  26. [26]

    Pixel Recurrent Neural Networks

    Aäron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. CoRR, abs/1601.06759, 2016, 1601.06759 http://arxiv.org/abs/1601.06759

  27. [27]

    Neural Discrete Representation Learning

    Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning, 2018, 1711.00937 http://arxiv.org/abs/1711.00937

  28. [28]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998--6008. Curran Associates, Inc., 2017...

  29. [29]

    Scaling autoregressive video models

    Dirk Weissenborn, Oscar Täckström, and Jakob Uszkoreit. Scaling autoregressive video models, 2019, 1906.02634 http://arxiv.org/abs/1906.02634