Scaling Laws for Autoregressive Generative Modeling
Pith reviewed 2026-05-13 07:44 UTC · model grok-4.3
The pith
Autoregressive Transformers' cross-entropy loss improves according to power-law-plus-constant scaling laws across image, video, image-text, and math domains, and the compute-optimal model size follows a power law in compute with a nearly universal exponent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In generative image modeling, video modeling, multimodal image-text modeling, and mathematical problem solving, the cross-entropy loss of autoregressive Transformers decreases as a power law plus constant when plotted against model size or against compute; the optimal model size for a given compute budget likewise obeys a power law whose exponent is nearly the same across all four domains. The functional form allows an information-theoretic decomposition into irreducible entropy of the data and reducible KL divergence, and supplies concrete forecasts for model size needed to reach any chosen KL target.
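A compact rendering of the claimed functional forms may help; the symbols below (N for model size, C for compute, L_∞ for the constant term) are illustrative notation consistent with the abstract rather than the paper's exact conventions.

```latex
% Claimed scaling shapes (illustrative notation, not the paper's exact symbols):
% loss is a power law plus a constant in model size N and in compute C,
% optimal size is a power law in C, and cross-entropy splits into entropy + KL.
\begin{aligned}
  L(N) &\approx L_\infty + \left(\frac{N_0}{N}\right)^{\alpha_N}, &
  L(C) &\approx L_\infty + \left(\frac{C_0}{C}\right)^{\alpha_C}, \\
  N_{\mathrm{opt}}(C) &\propto C^{\beta}, &
  L &= S(\mathrm{True}) + D_{\mathrm{KL}}\!\left(\mathrm{True}\,\|\,\mathrm{Model}\right).
\end{aligned}
```

Under this reading, L_∞ is identified with the irreducible entropy S(True) and the power-law terms with the reducible KL divergence.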
What carries the argument
The power-law-plus-constant scaling relation between cross-entropy loss and model size or compute budget, together with the derived power-law relation between optimal model size and compute.
If this is right
- Billion-parameter models already achieve near-zero KL divergence on 8x8 downsampled YFCC100M images.
- Model size required to reach any target reducible loss in nats per image can be read off directly from the fitted curves for other resolutions (see the sketch after this list).
- Mutual information between image and caption in multimodal models scales predictably with compute.
- Extrapolation performance on mathematical problems outside the training distribution follows its own smooth scaling law.
- After fine-tuning, classification loss and error rate on ImageNet continue to improve smoothly even after the generative loss has largely saturated.
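To make the forecasting item above concrete, here is a minimal sketch of how a reader could fit the power-law-plus-constant form to (model size, loss) measurements and invert the reducible part to forecast a model size; all data and constants are hypothetical, not the paper's fits.

```python
# Illustrative sketch (not the paper's code): fit a power-law-plus-constant
# curve L(N) = L_inf + A * N**(-alpha) to (model size, loss) measurements,
# then invert the reducible part to forecast the size needed for a target KL.
import numpy as np
from scipy.optimize import curve_fit

def loss_vs_size(N, L_inf, A, alpha):
    """Power law plus constant in model size N (parameters)."""
    return L_inf + A * N ** (-alpha)

# Hypothetical measurements: model sizes in parameters, losses in nats/image.
sizes = np.array([1e6, 3e6, 1e7, 3e7, 1e8, 3e8, 1e9])
losses = np.array([4.10, 3.62, 3.25, 2.98, 2.80, 2.69, 2.62])

(L_inf, A, alpha), _ = curve_fit(
    loss_vs_size, sizes, losses, p0=[2.5, 100.0, 0.3], maxfev=10_000
)

# Reducible loss (the KL term) is L(N) - L_inf = A * N**(-alpha); solve for N
# at a chosen target, e.g. 0.05 nats/image.
target_kl = 0.05
N_needed = (A / target_kl) ** (1.0 / alpha)
print(f"fit: L_inf={L_inf:.2f} nats, alpha={alpha:.2f}; "
      f"~{N_needed:.2e} params for {target_kl} nats reducible loss")
```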
Where Pith is reading between the lines
- Training runs could be budgeted by first choosing the desired KL target and then solving for the minimal compute that supplies the corresponding optimal model size (a sketch follows this list).
- The near-universality of the optimal-size exponent suggests that similar scaling relations may govern other autoregressive tasks not tested here, such as audio or protein sequences.
- If the power-law form persists, the marginal reduction in loss per additional FLOP remains positive and predictable at arbitrarily large scales, implying that performance gains from scale alone do not saturate within foreseeable compute limits.
- The decomposition into entropy plus KL supplies a quantitative way to decide when a domain is 'solved' by a generative model versus when further scale is still required.
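A minimal sketch of the budgeting workflow described in the first item above, assuming already-fitted constants; every number and unit below is hypothetical rather than taken from the paper's fits.

```python
# Minimal budgeting sketch with assumed, already-fitted constants (all values
# and units below are hypothetical, not taken from the paper).
def compute_for_kl(target_kl, C0, alpha_C):
    """Invert the reducible part of L(C) = L_inf + (C0 / C)**alpha_C."""
    return C0 * target_kl ** (-1.0 / alpha_C)

def optimal_size(C, k, beta):
    """Assumed compute-optimal size relation N_opt(C) = k * C**beta."""
    return k * C ** beta

C0, alpha_C = 1.0e-3, 0.20   # assumed fit; compute C in PF-days
k, beta = 5.0e8, 0.70        # assumed N_opt(C) fit

target_kl = 0.05             # chosen reducible-loss target, nats/token
C_min = compute_for_kl(target_kl, C0, alpha_C)
print(f"compute ≈ {C_min:.2e} PF-days, "
      f"optimal size ≈ {optimal_size(C_min, k, beta):.2e} params")
```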
Load-bearing premise
The same power-law-plus-constant shape that fits the measured range of model sizes and compute budgets continues to describe performance at scales well beyond those actually tested.
What would settle it
A single run at ten times the largest compute budget used in the paper in which the measured loss deviates by more than the reported uncertainty from the extrapolation of the fitted power-law-plus-constant curve.
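A toy version of this test, under an assumed fit and a hypothetical new measurement, would look roughly like this:

```python
# Toy version of the proposed test (assumed fit and hypothetical new run):
# extrapolate the fitted curve to 10x the largest measured compute and compare
# against a fresh measurement with its reported uncertainty.
def extrapolated_loss(C, L_inf, C0, alpha_C):
    return L_inf + (C0 / C) ** alpha_C

L_inf, C0, alpha_C = 2.55, 1.0e-3, 0.20   # assumed fit from the measured envelope
C_max_measured = 3.0e2                    # largest compute actually used (PF-days)
C_test = 10 * C_max_measured

predicted = extrapolated_loss(C_test, L_inf, C0, alpha_C)
measured, sigma = 2.602, 0.004            # hypothetical new run and its uncertainty
print(f"predicted {predicted:.3f}, measured {measured:.3f} ± {sigma:.3f}; "
      f"deviates: {abs(measured - predicted) > sigma}")
```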
Original abstract
We identify empirical scaling laws for the cross-entropy loss in four domains: generative image modeling, video modeling, multimodal image$\leftrightarrow$text models, and mathematical problem solving. In all cases autoregressive Transformers smoothly improve in performance as model size and compute budgets increase, following a power-law plus constant scaling law. The optimal model size also depends on the compute budget through a power-law, with exponents that are nearly universal across all data domains. The cross-entropy loss has an information theoretic interpretation as $S($True$) + D_{\mathrm{KL}}($True$||$Model$)$, and the empirical scaling laws suggest a prediction for both the true data distribution's entropy and the KL divergence between the true and model distributions. With this interpretation, billion-parameter Transformers are nearly perfect models of the YFCC100M image distribution downsampled to an $8\times 8$ resolution, and we can forecast the model size needed to achieve any given reducible loss (ie $D_{\mathrm{KL}}$) in nats/image for other resolutions. We find a number of additional scaling laws in specific domains: (a) we identify a scaling relation for the mutual information between captions and images in multimodal models, and show how to answer the question "Is a picture worth a thousand words?"; (b) in the case of mathematical problem solving, we identify scaling laws for model performance when extrapolating beyond the training distribution; (c) we finetune generative image models for ImageNet classification and find smooth scaling of the classification loss and error rate, even as the generative loss levels off. Taken together, these results strengthen the case that scaling laws have important implications for neural network performance, including on downstream tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that autoregressive Transformers exhibit consistent power-law scaling (plus constant) in cross-entropy loss as functions of model size N and compute budget C across four domains: image generation, video, multimodal image-text, and mathematical problem solving. From fits to L(N,C) the authors derive that the optimal model size N_opt(C) itself follows a power law in C whose exponents are nearly universal across domains. They interpret the loss information-theoretically as data entropy plus KL divergence, use the fits to forecast model sizes needed for target KL values, and report additional scaling relations for mutual information, out-of-distribution extrapolation in math, and finetuning to ImageNet classification.
Significance. If the reported empirical relations hold, the work supplies quantitative, cross-domain guidance for compute allocation and suggests that sufficiently large autoregressive models can approach the entropy of the underlying data distribution. The near-universality of the scaling exponents and the downstream-task results strengthen the practical case that scaling laws govern neural-network performance beyond the training objective.
major comments (1)
- [§3 and abstract] The functional form L(N,C) ≈ a N^{-α} + b C^{-β} + L_∞ and the derived N_opt(C) power law are obtained exclusively from data within the measured envelope (largest models ~10^9 parameters). The manuscript extrapolates this form to forecast KL divergences and optimal allocations at larger scales without additional held-out points or a derivation that would justify continued validity; this extrapolation is load-bearing for the universality claim and the forecasting statements.
minor comments (2)
- [Figures 1-4] Figure captions and axis labels should explicitly state the range of N and C used for each fit so readers can immediately see the extrapolation distance.
- [§2] The information-theoretic interpretation (S(True) + D_KL) is introduced in the abstract and §2 but the precise mapping from the fitted L_∞ to the entropy term is not restated in the main results section; a short clarifying sentence would help.
Simulated Author's Rebuttal
We thank the referee for the positive assessment and for highlighting the extrapolation issue. We address this concern directly below and will make targeted revisions to clarify the empirical scope of our claims.
Point-by-point responses
-
Referee: [§3 and abstract] The functional form L(N,C) ≈ a N^{-α} + b C^{-β} + L_∞ and the derived N_opt(C) power law are obtained exclusively from data within the measured envelope (largest models ~10^9 parameters). The manuscript extrapolates this form to forecast KL divergences and optimal allocations at larger scales without additional held-out points or a derivation that would justify continued validity; this extrapolation is load-bearing for the universality claim and the forecasting statements.
Authors: We agree that the functional form and the derived N_opt(C) relation are fitted exclusively to data within the observed envelope (models up to ~10^9 parameters). The form L(N,C) = a N^{-α} + b C^{-β} + L_∞ was chosen because it yields excellent fits across all four domains with low residual error; the power-law terms capture the observed improvement with scale while L_∞ corresponds to the irreducible entropy of the data distribution under the information-theoretic interpretation given in the paper. Although we lack a rigorous derivation that guarantees the same exponents at all scales, the functional form is consistent with prior theoretical and empirical scaling-law literature. We validated robustness by refitting on data subsets and confirming that the exponents remain stable. The forecasts for target KL values are presented as extrapolations of the observed trends rather than guaranteed predictions. We will revise the abstract and §3 to state the empirical range explicitly, add a short limitations paragraph on extrapolation, and qualify the universality claim as holding within the measured regime and across the tested domains.
Revision: partial
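The subset-refit check mentioned in the response could look roughly like the sketch below; the data, helper names, and leave-one-out scheme are illustrative assumptions, not the authors' actual procedure.

```python
# Sketch of an exponent-stability check via subset refits (illustrative data
# and a leave-one-out scheme chosen here as an assumption).
import numpy as np
from scipy.optimize import curve_fit

def loss_vs_size(N, L_inf, A, alpha):
    return L_inf + A * N ** (-alpha)

def fitted_alpha(sizes, losses):
    (_, _, alpha), _ = curve_fit(loss_vs_size, sizes, losses,
                                 p0=[2.5, 100.0, 0.3], maxfev=10_000)
    return alpha

sizes = np.array([1e6, 3e6, 1e7, 3e7, 1e8, 3e8, 1e9])        # hypothetical
losses = np.array([4.10, 3.62, 3.25, 2.98, 2.80, 2.69, 2.62])

alpha_full = fitted_alpha(sizes, losses)
# Drop one point at a time; a stable exponent should barely move.
alphas_loo = [fitted_alpha(np.delete(sizes, i), np.delete(losses, i))
              for i in range(len(sizes))]
print(f"alpha (all points): {alpha_full:.3f}; "
      f"leave-one-out range: {min(alphas_loo):.3f} to {max(alphas_loo):.3f}")
```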
Circularity Check
No significant circularity; results are direct empirical fits to measured data
full rationale
The paper reports empirical scaling laws obtained by fitting the functional form L(N,C) ≈ a N^{-α} + b C^{-β} + L_∞ directly to cross-entropy loss measurements across model sizes and compute budgets in four domains. The optimal model size N_opt(C) is then obtained by minimizing the fitted loss at fixed C. These steps consist of standard curve-fitting to held-out validation data within the experimentally accessed range; no equation reduces to itself by definition, no parameter is renamed as a prediction, and no load-bearing premise rests on a self-citation chain. The derivation is therefore self-contained against the reported benchmarks.
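One heuristic way such a power law for N_opt(C) can emerge from the two fitted curves, stated here as an assumed consistency relation rather than the paper's derivation: if the compute-efficient frontier is where the two reducible terms agree, then

```latex
% Heuristic consistency relation (an assumption for illustration, not the
% paper's derivation): equate the reducible parts of the size-law and the
% compute-law on the compute-efficient frontier.
\left(\frac{N_0}{N_{\mathrm{opt}}(C)}\right)^{\alpha_N}
  \approx \left(\frac{C_0}{C}\right)^{\alpha_C}
\quad\Longrightarrow\quad
N_{\mathrm{opt}}(C) \approx N_0\left(\frac{C}{C_0}\right)^{\alpha_C/\alpha_N}
  \propto C^{\beta},
\qquad \beta \approx \frac{\alpha_C}{\alpha_N}.
```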
Axiom & Free-Parameter Ledger
free parameters (2)
- power-law exponents (α, β)
- constant floor L_∞
axioms (1)
- domain assumption: cross-entropy loss obeys a power-law-plus-constant functional form in model size and compute
Forward citations
Cited by 32 Pith papers
-
KAN: Kolmogorov-Arnold Networks
KANs with learnable univariate spline activations on edges achieve better accuracy than MLPs with fewer parameters, faster scaling, and direct visualization for scientific discovery.
-
Discovering Language Model Behaviors with Model-Written Evaluations
Language models can automatically generate high-quality evaluation datasets that reveal new cases of inverse scaling, sycophancy, and concerning goal-seeking behaviors, including some worsened by RLHF.
-
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl ...
-
ScaleMoGen: Autoregressive Next-Scale Prediction for Human Motion Generation
ScaleMoGen introduces a scale-wise autoregressive framework that quantizes motions into hierarchical discrete tokens and predicts next-scale maps to achieve SOTA FID 0.030 on HumanML3D and text-guided editing.
-
On the Invariance and Generality of Neural Scaling Laws
Neural scaling laws are invariant under bijective data transformations and change predictably with information resolution ρ under non-bijective transformations, enabling cross-domain transport of fitted exponents.
-
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.
-
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
RL training on more expressive logical tasks follows a steeper power-law scaling with reasoning depth and transfers more efficiently to math and reasoning benchmarks.
-
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.
-
Scaling and evaluating sparse autoencoders
K-sparse autoencoders with dead-latent fixes produce clean scaling laws and better feature quality metrics that improve with size, shown by training a 16-million-latent model on GPT-4 activations.
-
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
-
RWKV: Reinventing RNNs for the Transformer Era
RWKV uses a linear attention mechanism to deliver Transformer-level performance with RNN-style inference efficiency, demonstrated at up to 14 billion parameters.
-
Scalable Diffusion Models with Transformers
DiTs achieve SOTA FID of 2.27 on ImageNet 256x256 by scaling transformer-based latent diffusion models, with performance improving consistently as Gflops increase.
-
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.
-
LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs
LAION-400M is a publicly released open dataset of 400 million CLIP-filtered image-text pairs with embeddings and kNN indices for efficient search.
-
Practical Scaling Laws: Converting Compute into Performance in a Data-Constrained World
A new scaling law L(N, D, T) = E + (L0 - E) h/(1+h) with h = a/N^α + b/T^β + c N^γ/D^δ that decomposes loss into undercapacity, undertraining, and overfitting terms and saturates between E and L0.
-
AIPO: Learning to Reason from Active Interaction
AIPO trains LLMs to expand their reasoning capability boundary via active multi-agent interaction with Verify, Knowledge, and Reasoning agents during RLVR, using importance sampling and clipping to handle feedback, th...
-
The Power of Power Law: Asymmetry Enables Compositional Reasoning
Power-law data sampling creates beneficial asymmetry in the loss landscape that lets models acquire high-frequency skill compositions first, enabling more efficient learning of rare long-tail skills than uniform distr...
-
Scaling-Aware Data Selection for End-to-End Autonomous Driving Systems
MOSAIC is a scaling-aware data selection framework that outperforms baselines in training end-to-end autonomous driving planners, achieving comparable or better EPDMS scores with up to 80% less data.
-
TimelineReasoner: Advancing Timeline Summarization with Large Reasoning Models
TimelineReasoner applies large reasoning models in a Global Cognition plus Detail Exploration loop to produce more accurate, complete, and coherent timelines from news than prior LLM-based methods.
-
Diffusion Model for Manifold Data: Score Decomposition, Curvature, and Statistical Complexity
Diffusion models on manifold-supported data admit score decompositions whose statistical rates are controlled by intrinsic dimension and curvature.
-
Search-o1: Agentic Search-Enhanced Large Reasoning Models
Search-o1 integrates agentic retrieval-augmented generation and a Reason-in-Documents module into large reasoning models to dynamically supply missing knowledge and improve performance on complex science, math, coding...
-
Scaling Relationship on Learning Mathematical Reasoning with Large Language Models
Pre-training loss predicts LLM math reasoning better than parameter count; rejection sampling fine-tuning with diverse paths raises LLaMA-7B accuracy on GSM8K from 35.9% with SFT to 49.3%.
-
eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers
An ensemble of stage-specialized text-to-image diffusion models improves prompt alignment over single shared-parameter models while preserving visual quality and inference speed.
-
Language Models (Mostly) Know What They Know
Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
-
A General Language Assistant as a Laboratory for Alignment
Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
-
TIDE: Every Layer Knows the Token Beneath the Context
TIDE augments standard transformers with per-layer token embedding injection via an ensemble of memory blocks and a depth-conditioned router to mitigate rare-token undertraining and contextual collapse.
-
Machine Unlearning for Class Removal through SISA-based Deep Neural Network Architectures
A modified SISA architecture with replay and gating achieves effective class removal from trained CNNs on image datasets while preserving accuracy and cutting retraining costs.
-
Singularity Formation: Synergy in Theoretical, Numerical and Machine Learning Approaches
The work introduces a modulation-based analytical method for singularity proofs in singular PDEs and refines ML techniques like PINNs and KANs to identify blowup solutions, with application to the open 3D Keller-Segel...
-
Seed1.5-VL Technical Report
Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.
-
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.
-
A Survey of Large Language Models
This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.
- Superposition Yields Robust Neural Scaling