pith. machine review for the scientific record.

arxiv: 2010.14701 · v2 · submitted 2020-10-28 · 💻 cs.LG · cs.CL · cs.CV

Recognition: 2 theorem links

· Lean Theorem

Scaling Laws for Autoregressive Generative Modeling

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 07:44 UTC · model grok-4.3

classification 💻 cs.LG · cs.CL · cs.CV
keywords scaling laws · autoregressive transformers · cross-entropy loss · generative modeling · power-law scaling · multimodal models · image modeling · mathematical reasoning

The pith

Autoregressive Transformers improve via power-law-plus-constant scaling laws for cross-entropy loss across image, video, text-image, and math domains, with optimal size depending on compute through nearly universal exponents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper measures how cross-entropy loss decreases as model size and total compute increase in four separate autoregressive settings. In each case the loss follows a simple power-law-plus-constant form that holds smoothly over the measured range. The authors also find that the model size giving the best loss for a fixed compute budget itself follows a power law whose exponent is almost the same in every domain. This pattern lets them interpret the loss as the sum of the true data entropy and the KL divergence between data and model, and to predict how large a model must be to reach any target level of reducible error.
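
The fitting machinery is simple enough to sketch. A minimal pure-Python illustration on synthetic data; the coefficients a = 12, α = 0.30, L_∞ = 2.0 are invented for the example, not the paper's fitted values:

```python
import math

def fit_power_law_plus_constant(ns, losses):
    """Fit L(N) = a * N**(-alpha) + L_inf by grid search over
    (alpha, L_inf); for each pair the best prefactor a has a
    closed form from least squares. Returns (a, alpha, L_inf)."""
    best = None
    for ai in range(5, 61):                 # alpha in [0.05, 0.60]
        alpha = ai / 100.0
        for li in range(0, 401):            # L_inf in [0.00, 4.00]
            l_inf = li / 100.0
            xs = [n ** (-alpha) for n in ns]
            a = (sum(x * (l - l_inf) for x, l in zip(xs, losses))
                 / sum(x * x for x in xs))
            sse = sum((a * x + l_inf - l) ** 2
                      for x, l in zip(xs, losses))
            if best is None or sse < best[0]:
                best = (sse, a, alpha, l_inf)
    return best[1], best[2], best[3]

# Synthetic losses from a known law (a=12, alpha=0.30, L_inf=2.0).
ns = [10 ** (6 + 0.5 * k) for k in range(7)]
losses = [12.0 * n ** (-0.30) + 2.0 for n in ns]

a, alpha, l_inf = fit_power_law_plus_constant(ns, losses)
print(round(alpha, 2), round(l_inf, 2))     # → 0.3 2.0
```

On real measurements one would fit in log space and weight the points; solving for a in closed form keeps the search two-dimensional.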

Core claim

In generative image modeling, video modeling, multimodal image-text modeling, and mathematical problem solving, the cross-entropy loss of autoregressive Transformers decreases as a power law plus constant when plotted against model size or against compute; the optimal model size for a given compute budget likewise obeys a power law whose exponent is nearly the same across all four domains. The functional form allows an information-theoretic decomposition into irreducible entropy of the data and reducible KL divergence, and supplies concrete forecasts for model size needed to reach any chosen KL target.

What carries the argument

The power-law-plus-constant scaling relation between cross-entropy loss and (model size, compute budget) pair, together with the derived power-law relation for optimal model size versus compute.
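
The optimal-size relation can be derived from that form. A sketch under the illustrative assumption (ours, not the paper's parameterization) that loss separates as L(N, D) = a·N^(−α) + b·D^(−β) + L_∞ with compute C ≈ 6ND and invented coefficients: minimizing over N at fixed C gives N_opt ∝ C^(β/(α+β)), and a numeric minimization reproduces that exponent.

```python
import math

# Invented coefficients for illustration (not the paper's fits).
a, alpha = 15.0, 0.35
b, beta = 8.0, 0.25

def loss_at_fixed_compute(n, c):
    """L(N, D) = a*N**-alpha + b*D**-beta with D = C / (6*N)."""
    d = c / (6.0 * n)
    return a * n ** (-alpha) + b * d ** (-beta)

def n_opt(c):
    """Minimize the loss over a fine log-spaced grid of N."""
    best_n, best_l = None, float("inf")
    for i in range(4000):
        n = 10 ** (3 + 12 * i / 3999)       # N from 1e3 to 1e15
        l = loss_at_fixed_compute(n, c)
        if l < best_l:
            best_n, best_l = n, l
    return best_n

# Slope of log N_opt versus log C should equal beta/(alpha+beta).
cs = [10.0 ** e for e in range(15, 22)]
logs_n = [math.log10(n_opt(c)) for c in cs]
logs_c = [math.log10(c) for c in cs]
slope = (logs_n[-1] - logs_n[0]) / (logs_c[-1] - logs_c[0])
print(round(slope, 2), round(beta / (alpha + beta), 2))  # → 0.42 0.42
```

Setting dL/dN = 0 at fixed C gives the closed form N_opt = [αa/(βb·6^β)]^(1/(α+β)) · C^(β/(α+β)); the grid search just confirms it.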

If this is right

  • Billion-parameter models already achieve near-zero KL divergence on 8x8 downsampled YFCC100M images.
  • Model size required to reach any target reducible loss in nats per image can be read off directly from the fitted curves for other resolutions.
  • Mutual information between image and caption in multimodal models scales predictably with compute.
  • Extrapolation performance on mathematical problems outside the training distribution follows its own smooth scaling law.
  • After fine-tuning, classification loss and error rate on ImageNet continue to improve smoothly even after the generative loss has largely saturated.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training runs could be budgeted by first choosing the desired KL target and then solving for the minimal compute that supplies the corresponding optimal model size.
  • The near-universality of the optimal-size exponent suggests that similar scaling relations may govern other autoregressive tasks not tested here, such as audio or protein sequences.
  • If the power-law form persists, the marginal reduction in loss per additional FLOP remains positive and predictable at arbitrarily large scales, implying that performance gains from scale alone do not saturate within foreseeable compute limits.
  • The decomposition into entropy plus KL supplies a quantitative way to decide when a domain is 'solved' by a generative model versus when further scale is still required.
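
The budgeting idea in the first bullet reduces to a one-line inversion once the reducible term is fitted. Coefficients below are hypothetical:

```python
# Hypothetical fitted reducible loss (KL term): KL(N) = a * N**-alpha.
a, alpha = 12.0, 0.30

def model_size_for_kl(kl_target):
    """Invert KL = a * N**(-alpha) for the required model size."""
    return (a / kl_target) ** (1.0 / alpha)

# Halving the KL target multiplies the required size by 2**(1/alpha).
n1 = model_size_for_kl(0.10)
n2 = model_size_for_kl(0.05)
print(round(n2 / n1, 2))  # → 10.08, i.e. 2**(1/0.30)
```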

Load-bearing premise

That the same power-law-plus-constant shape that fits the measured range of model sizes and compute budgets continues to describe performance at scales well beyond those actually tested.

What would settle it

A single run at ten times the largest compute budget used in the paper in which the measured loss deviates by more than the reported uncertainty from the extrapolation of the fitted power-law-plus-constant curve.
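
As a check this is a single comparison; the numbers below are hypothetical:

```python
def deviates(measured, predicted, sigma):
    """True if the measured loss falls outside the extrapolation's
    reported uncertainty band, i.e. the fitted form fails."""
    return abs(measured - predicted) > sigma

# Hypothetical run at 10x compute: extrapolation 2.31 nats +/- 0.02.
print(deviates(2.40, 2.31, 0.02), deviates(2.32, 2.31, 0.02))  # → True False
```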

read the original abstract

We identify empirical scaling laws for the cross-entropy loss in four domains: generative image modeling, video modeling, multimodal image$\leftrightarrow$text models, and mathematical problem solving. In all cases autoregressive Transformers smoothly improve in performance as model size and compute budgets increase, following a power-law plus constant scaling law. The optimal model size also depends on the compute budget through a power-law, with exponents that are nearly universal across all data domains. The cross-entropy loss has an information theoretic interpretation as $S($True$) + D_{\mathrm{KL}}($True$||$Model$)$, and the empirical scaling laws suggest a prediction for both the true data distribution's entropy and the KL divergence between the true and model distributions. With this interpretation, billion-parameter Transformers are nearly perfect models of the YFCC100M image distribution downsampled to an $8\times 8$ resolution, and we can forecast the model size needed to achieve any given reducible loss (ie $D_{\mathrm{KL}}$) in nats/image for other resolutions. We find a number of additional scaling laws in specific domains: (a) we identify a scaling relation for the mutual information between captions and images in multimodal models, and show how to answer the question "Is a picture worth a thousand words?"; (b) in the case of mathematical problem solving, we identify scaling laws for model performance when extrapolating beyond the training distribution; (c) we finetune generative image models for ImageNet classification and find smooth scaling of the classification loss and error rate, even as the generative loss levels off. Taken together, these results strengthen the case that scaling laws have important implications for neural network performance, including on downstream tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that autoregressive Transformers exhibit consistent power-law scaling (plus constant) in cross-entropy loss as functions of model size N and compute budget C across four domains: image generation, video, multimodal image-text, and mathematical problem solving. From fits to L(N,C) the authors derive that the optimal model size N_opt(C) itself follows a power law in C whose exponents are nearly universal across domains. They interpret the loss information-theoretically as data entropy plus KL divergence, use the fits to forecast model sizes needed for target KL values, and report additional scaling relations for mutual information, out-of-distribution extrapolation in math, and finetuning to ImageNet classification.

Significance. If the reported empirical relations hold, the work supplies quantitative, cross-domain guidance for compute allocation and suggests that sufficiently large autoregressive models can approach the entropy of the underlying data distribution. The near-universality of the scaling exponents and the downstream-task results strengthen the practical case that scaling laws govern neural-network performance beyond the training objective.

major comments (1)
  1. [§3 and abstract] The functional form L(N,C) ≈ a N^{-α} + b C^{-β} + L_∞ and the derived N_opt(C) power law are obtained exclusively from data within the measured envelope (largest models ~10^9 parameters). The manuscript extrapolates this form to forecast KL divergences and optimal allocations at larger scales without additional held-out points or a derivation that would justify continued validity; this extrapolation is load-bearing for the universality claim and the forecasting statements.
minor comments (2)
  1. [Figures 1-4] Figure captions and axis labels should explicitly state the range of N and C used for each fit so readers can immediately see the extrapolation distance.
  2. [§2] The information-theoretic interpretation (S(True) + D_KL) is introduced in the abstract and §2 but the precise mapping from the fitted L_∞ to the entropy term is not restated in the main results section; a short clarifying sentence would help.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the positive assessment and for highlighting the extrapolation issue. We address this concern directly below and will make targeted revisions to clarify the empirical scope of our claims.

read point-by-point responses
  1. Referee: [§3 and abstract] The functional form L(N,C) ≈ a N^{-α} + b C^{-β} + L_∞ and the derived N_opt(C) power law are obtained exclusively from data within the measured envelope (largest models ~10^9 parameters). The manuscript extrapolates this form to forecast KL divergences and optimal allocations at larger scales without additional held-out points or a derivation that would justify continued validity; this extrapolation is load-bearing for the universality claim and the forecasting statements.

    Authors: We agree that the functional form and the derived N_opt(C) relation are fitted exclusively to data within the observed envelope (models up to ~10^9 parameters). The form L(N,C) = a N^{-α} + b C^{-β} + L_∞ was chosen because it yields excellent fits across all four domains with low residual error; the power-law terms capture the observed improvement with scale while L_∞ corresponds to the irreducible entropy of the data distribution under the information-theoretic interpretation given in the paper. Although we lack a rigorous derivation that guarantees the same exponents at all scales, the functional form is consistent with prior theoretical and empirical scaling-law literature. We validated robustness by refitting on data subsets and confirming that the exponents remain stable. The forecasts for target KL values are presented as extrapolations of the observed trends rather than guaranteed predictions. We will revise the abstract and §3 to state the empirical range explicitly, add a short limitations paragraph on extrapolation, and qualify the universality claim as holding within the measured regime and across the tested domains. revision: partial
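
The subset-refit stability check described in the response can be sketched as leave-one-out refits of the exponent. For simplicity the floor L_∞ is held fixed so each fit is linear in log-log space; the data are synthetic and noiseless, so the spread here is essentially zero, whereas real runs would show small but nonzero scatter:

```python
import math

def fit_alpha(ns, losses, l_inf):
    """With L_inf fixed, log(L - L_inf) = log(a) - alpha*log(N) is
    linear; recover alpha by ordinary least squares on the logs."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(l - l_inf) for l in losses]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return -num / den

# Synthetic runs from a known law (alpha = 0.30, L_inf = 2.0).
ns = [10 ** (6 + 0.5 * k) for k in range(7)]
losses = [12.0 * n ** (-0.30) + 2.0 for n in ns]

# Leave-one-out refits: the exponent should barely move.
alphas = [fit_alpha(ns[:i] + ns[i + 1:], losses[:i] + losses[i + 1:], 2.0)
          for i in range(len(ns))]
spread = max(alphas) - min(alphas)
print(round(spread, 6))  # → 0.0
```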

Circularity Check

0 steps flagged

No significant circularity; results are direct empirical fits to measured data

full rationale

The paper reports empirical scaling laws obtained by fitting the functional form L(N,C) ≈ a N^{-α} + b C^{-β} + L_∞ directly to cross-entropy loss measurements across model sizes and compute budgets in four domains. The optimal model size N_opt(C) is then obtained by minimizing the fitted loss at fixed C. These steps consist of standard curve-fitting to held-out validation data within the experimentally accessed range; no equation reduces to itself by definition, no parameter is renamed as a prediction, and no load-bearing premise rests on a self-citation chain. The derivation is therefore self-contained against the reported benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The paper assumes without derivation that loss follows a power-law-plus-constant form; all reported exponents and the constant floor are fitted to the observed data points.

free parameters (2)
  • power-law exponents (α, β)
    Fitted separately for each domain to match observed loss versus model size and compute.
  • constant floor L_∞
    Fitted per domain as the asymptotic loss value.
axioms (1)
  • domain assumption: Cross-entropy loss obeys a power-law-plus-constant functional form in model size and compute
    Invoked to perform the fits shown in the scaling plots.

pith-pipeline@v0.9.0 · 5682 in / 1290 out tokens · 31996 ms · 2026-05-13T07:44:52.955019+00:00 · methodology

discussion (0)


Forward citations

Cited by 32 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. KAN: Kolmogorov-Arnold Networks

    cs.LG 2024-04 conditional novelty 8.0

    KANs with learnable univariate spline activations on edges achieve better accuracy than MLPs with fewer parameters, faster scaling, and direct visualization for scientific discovery.

  2. Discovering Language Model Behaviors with Model-Written Evaluations

    cs.CL 2022-12 unverdicted novelty 8.0

    Language models can automatically generate high-quality evaluation datasets that reveal new cases of inverse scaling, sycophancy, and concerning goal-seeking behaviors, including some worsened by RLHF.

  3. The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    cs.CL 2020-12 conditional novelty 8.0

    The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl ...

  4. ScaleMoGen: Autoregressive Next-Scale Prediction for Human Motion Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    ScaleMoGen introduces a scale-wise autoregressive framework that quantizes motions into hierarchical discrete tokens and predicts next-scale maps to achieve SOTA FID 0.030 on HumanML3D and text-guided editing.

  5. On the Invariance and Generality of Neural Scaling Laws

    cs.LG 2026-05 unverdicted novelty 7.0

    Neural scaling laws are invariant under bijective data transformations and change predictably with information resolution ρ under non-bijective transformations, enabling cross-domain transport of fitted exponents.

  6. Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

    cs.AI 2026-05 unverdicted novelty 7.0

    RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.

  7. Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

    cs.AI 2026-05 unverdicted novelty 7.0

    RL training on more expressive logical tasks follows a steeper power-law scaling with reasoning depth and transfers more efficiently to math and reasoning benchmarks.

  8. Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

    cs.CV 2024-06 conditional novelty 7.0

    Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.

  9. Scaling and evaluating sparse autoencoders

    cs.LG 2024-06 unverdicted novelty 7.0

    K-sparse autoencoders with dead-latent fixes produce clean scaling laws and better feature quality metrics that improve with size, shown by training a 16-million-latent model on GPT-4 activations.

  10. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    cs.CL 2024-05 unverdicted novelty 7.0

    DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

  11. RWKV: Reinventing RNNs for the Transformer Era

    cs.CL 2023-05 unverdicted novelty 7.0

    RWKV uses a linear attention mechanism to deliver Transformer-level performance with RNN-style inference efficiency, demonstrated at up to 14 billion parameters.

  12. Scalable Diffusion Models with Transformers

    cs.CV 2022-12 unverdicted novelty 7.0

    DiTs achieve SOTA FID of 2.27 on ImageNet 256x256 by scaling transformer-based latent diffusion models, with performance improving consistently as Gflops increase.

  13. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

    cs.LG 2022-08 conditional novelty 7.0

    LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.

  14. LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

    cs.CV 2021-11 unverdicted novelty 7.0

    LAION-400M is a publicly released open dataset of 400 million CLIP-filtered image-text pairs with embeddings and kNN indices for efficient search.

  15. Practical Scaling Laws: Converting Compute into Performance in a Data-Constrained World

    cs.LG 2026-05 conditional novelty 6.0

    A new scaling law L(N, D, T) = E + (L0 - E) h/(1+h) with h = a/N^α + b/T^β + c N^γ/D^δ that decomposes loss into undercapacity, undertraining, and overfitting terms and saturates between E and L0.

  16. AIPO: Learning to Reason from Active Interaction

    cs.CL 2026-05 unverdicted novelty 6.0

    AIPO trains LLMs to expand their reasoning capability boundary via active multi-agent interaction with Verify, Knowledge, and Reasoning agents during RLVR, using importance sampling and clipping to handle feedback, th...

  17. The Power of Power Law: Asymmetry Enables Compositional Reasoning

    cs.AI 2026-04 unverdicted novelty 6.0

    Power-law data sampling creates beneficial asymmetry in the loss landscape that lets models acquire high-frequency skill compositions first, enabling more efficient learning of rare long-tail skills than uniform distr...

  18. Scaling-Aware Data Selection for End-to-End Autonomous Driving Systems

    cs.LG 2026-04 unverdicted novelty 6.0

    MOSAIC is a scaling-aware data selection framework that outperforms baselines in training end-to-end autonomous driving planners, achieving comparable or better EPDMS scores with up to 80% less data.

  19. TimelineReasoner: Advancing Timeline Summarization with Large Reasoning Models

    cs.CL 2026-04 unverdicted novelty 6.0

    TimelineReasoner applies large reasoning models in a Global Cognition plus Detail Exploration loop to produce more accurate, complete, and coherent timelines from news than prior LLM-based methods.

  20. Diffusion Model for Manifold Data: Score Decomposition, Curvature, and Statistical Complexity

    cs.LG 2026-03 unverdicted novelty 6.0

    Diffusion models on manifold-supported data admit score decompositions whose statistical rates are controlled by intrinsic dimension and curvature.

  21. Search-o1: Agentic Search-Enhanced Large Reasoning Models

    cs.AI 2025-01 unverdicted novelty 6.0

    Search-o1 integrates agentic retrieval-augmented generation and a Reason-in-Documents module into large reasoning models to dynamically supply missing knowledge and improve performance on complex science, math, coding...

  22. Scaling Relationship on Learning Mathematical Reasoning with Large Language Models

    cs.CL 2023-08 unverdicted novelty 6.0

    Pre-training loss predicts LLM math reasoning better than parameter count; rejection sampling fine-tuning with diverse paths raises LLaMA-7B accuracy on GSM8K from 35.9% with SFT to 49.3%.

  23. eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

    cs.CV 2022-11 unverdicted novelty 6.0

    An ensemble of stage-specialized text-to-image diffusion models improves prompt alignment over single shared-parameter models while preserving visual quality and inference speed.

  24. Language Models (Mostly) Know What They Know

    cs.CL 2022-07 unverdicted novelty 6.0

    Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

  25. A General Language Assistant as a Laboratory for Alignment

    cs.CL 2021-12 conditional novelty 6.0

    Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

  26. TIDE: Every Layer Knows the Token Beneath the Context

    cs.CL 2026-05 unverdicted novelty 5.0

    TIDE augments standard transformers with per-layer token embedding injection via an ensemble of memory blocks and a depth-conditioned router to mitigate rare-token undertraining and contextual collapse.

  27. Machine Unlearning for Class Removal through SISA-based Deep Neural Network Architectures

    cs.CV 2026-04 unverdicted novelty 5.0

    A modified SISA architecture with replay and gating achieves effective class removal from trained CNNs on image datasets while preserving accuracy and cutting retraining costs.

  28. Singularity Formation: Synergy in Theoretical, Numerical and Machine Learning Approaches

    math.NA 2026-04 unverdicted novelty 5.0

    The work introduces a modulation-based analytical method for singularity proofs in singular PDEs and refines ML techniques like PINNs and KANs to identify blowup solutions, with application to the open 3D Keller-Segel...

  29. Seed1.5-VL Technical Report

    cs.CV 2025-05 unverdicted novelty 4.0

    Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.

  30. DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

    cs.CL 2024-01 unverdicted novelty 4.0

    DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.

  31. A Survey of Large Language Models

    cs.CL 2023-03 accept novelty 3.0

    This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.

  32. Superposition Yields Robust Neural Scaling

    cs.LG 2025-05

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · cited by 31 Pith papers · 5 internal anchors

  1. [1]

    Language Models are Few-Shot Learners

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...

  2. [2]

    Generating Long Sequences with Sparse Transformers

    Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. CoRR, abs/1904.10509, 2019, 1904.10509 http://arxiv.org/abs/1904.10509

  3. [3]

    A Downsampled Variant of ImageNet as an Alternative to the CIFAR datasets

    Patryk Chrabaszcz, Ilya Loshchilov, and Frank Hutter. A downsampled variant of imagenet as an alternative to the CIFAR datasets. CoRR, abs/1707.08819, 2017, 1707.08819 http://arxiv.org/abs/1707.08819

  4. [4]

    Generative pretraining from pixels

    Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In Proceedings of Machine Learning and Systems 2020 , pages 10466--10478. 2020

  5. [5]

    Jukebox: A Generative Model for Music

    Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, and Ilya Sutskever. Jukebox: A generative model for music, 2020, 2005.00341 http://arxiv.org/abs/2005.00341

  6. [6]

    Compositionality decomposed: how do neural networks generalise?

    Dieuwke Hupkes, Verna Dankers, Mathijs Mul, and Elia Bruni. Compositionality decomposed: how do neural networks generalise?, 2019, 1908.08351 http://arxiv.org/abs/1908.08351

  7. [7]

    Deep Learning Scaling is Predictable, Empirically

    Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md. Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically, 2017, 1712.00409 http://arxiv.org/abs/1712.00409

  8. [8]

    Neural tangent kernel: Convergence and generalization in neural networks

    Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems, pages 8571--8580, 2018

  9. [9]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models, 2020, 2001.08361 http://arxiv.org/abs/2001.08361

  10. [10]

    One Epoch Is All You Need

    Aran Komatsuzaki. One epoch is all you need, 2019, 1906.06669 http://arxiv.org/abs/1906.06669

  11. [11]

    The Large Learning Rate Phase of Deep Learning: the Catapult Mechanism

    Aitor Lewkowycz, Yasaman Bahri, Ethan Dyer, Jascha Sohl-Dickstein, and Guy Gur-Ari. The large learning rate phase of deep learning: the catapult mechanism, 2020, 2003.02218 http://arxiv.org/abs/2003.02218

  12. [12]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Fixing weight decay regularization in adam. CoRR, abs/1711.05101, 2017, 1711.05101 http://arxiv.org/abs/1711.05101

  13. [13]

    Generating Wikipedia by summarizing long sequences

    Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. Generating wikipedia by summarizing long sequences. arXiv:1801.10198 [cs], 2018, 1801.10198 http://arxiv.org/abs/1801.10198

  14. [14]

    Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers

    Zhuohan Li, Eric Wallace, Sheng Shen, Kevin Lin, Kurt Keutzer, Dan Klein, and Joseph E. Gonzalez. Train large, then compress: Rethinking model size for efficient training and inference of transformers, 2020, 2002.11794 http://arxiv.org/abs/2002.11794

  15. [15]

    Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent

    Jaehoon Lee, Lechao Xiao, Samuel S. Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent, 2019, 1902.06720 http://arxiv.org/abs/1902.06720

  16. [16]

    The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning

    Siyuan Ma, Raef Bassily, and Mikhail Belkin. The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning. CoRR, abs/1712.06559, 2017, 1712.06559 http://arxiv.org/abs/1712.06559

  17. [17]

    An Empirical Model of Large-Batch Training

    Sam McCandlish, Jared Kaplan, Dario Amodei, and OpenAI Dota Team. An empirical model of large-batch training, 2018, 1812.06162 http://arxiv.org/abs/1812.06162

  18. [18]

    Recipes for Building an Open-Domain Chatbot

    Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, and Jason Weston. Recipes for building an open-domain chatbot, 2020, 2004.13637 http://arxiv.org/abs/2004.13637

  19. [19]

    On the Predictability of Pruning Across Scales

    Jonathan S. Rosenfeld, Jonathan Frankle, Michael Carbin, and Nir Shavit. On the predictability of pruning across scales, 2020, 2006.10621 http://arxiv.org/abs/2006.10621

  20. [20]

    A Constructive Prediction of the Generalization Error Across Scales

    Jonathan S. Rosenfeld, Amir Rosenfeld, Yonatan Belinkov, and Nir Shavit. A constructive prediction of the generalization error across scales, 2019, 1909.12673 http://arxiv.org/abs/1909.12673

  21. [21]

    Analysing Mathematical Reasoning Abilities of Neural Models

    David Saxton, Edward Grefenstette, Felix Hill, and Pushmeet Kohli. Analysing mathematical reasoning abilities of neural models. CoRR, abs/1904.01557, 2019, 1904.01557 http://arxiv.org/abs/1904.01557

  22. [22]

    A Neural Scaling Law from the Dimension of the Data Manifold

    Utkarsh Sharma and Jared Kaplan. A neural scaling law from the dimension of the data manifold, 2020, 2004.10802 http://arxiv.org/abs/2004.10802

  23. [23]

    Enhancing the Transformer with Explicit Relational Encoding for Math Problem Solving

    Imanol Schlag, Paul Smolensky, Roland Fernandez, Nebojsa Jojic, Jürgen Schmidhuber, and Jianfeng Gao. Enhancing the transformer with explicit relational encoding for math problem solving, 2019, 1910.06611 http://arxiv.org/abs/1910.06611

  24. [24]

    Multimodal transformer for unaligned multimodal language sequences

    Yao-Hung Hubert Tsai, Shaojie Bai, Paul Pu Liang, J Zico Kolter, Louis-Philippe Morency, and Ruslan Salakhutdinov. Multimodal transformer for unaligned multimodal language sequences. In Proceedings of the conference. Association for Computational Linguistics. Meeting , volume 2019, page 6558. NIH Public Access, 2019

  25. [25]

    The New Data and New Challenges in Multimedia Research

    Bart Thomee, David A. Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. The new data and new challenges in multimedia research. CoRR, abs/1503.01817, 2015, 1503.01817 http://arxiv.org/abs/1503.01817

  26. [26]

    Pixel Recurrent Neural Networks

    Aäron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. CoRR, abs/1601.06759, 2016, 1601.06759 http://arxiv.org/abs/1601.06759

  27. [27]

    Neural Discrete Representation Learning

    Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning, 2018, 1711.00937 http://arxiv.org/abs/1711.00937

  28. [28]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998--6008. Curran Associates, Inc., 2017...

  29. [29]

    Scaling autoregressive video models

    Dirk Weissenborn, Oscar Täckström, and Jakob Uszkoreit. Scaling autoregressive video models, 2019, 1906.02634 http://arxiv.org/abs/1906.02634