Deep Learning Scaling is Predictable, Empirically
Pith reviewed 2026-05-12 03:56 UTC · model grok-4.3
The pith
Deep learning generalization error decreases as a power law of training set size across multiple domains.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Empirical tests show that generalization error scales as a power law with training set size in each of the four domains examined. The exponent that sets the rate of improvement stays the same when architectures or other model improvements are introduced; those changes only shift the absolute error level. Model size needed for best performance scales sublinearly with data size. The measurements cover a wide range of data volumes and produce consistent scaling behavior within the tested regimes.
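The three relationships in the core claim can be written compactly. The symbols below are illustrative assumptions, not notation quoted from this page: m is the training set size, ε(m) the generalization error, α a prefactor, β_g the scaling exponent, s(m) the best-fitting model size, and c and β_p its prefactor and exponent.

```latex
\begin{align}
\varepsilon(m) &\approx \alpha\, m^{\beta_g}, & \beta_g &< 0 \\
\log \varepsilon(m) &\approx \log \alpha + \beta_g \log m \\
s(m) &\approx c\, m^{\beta_p}, & 0 < \beta_p &< 1
\end{align}
```

Under this reading, a model improvement changes α (a vertical shift of the line in log-log space) while leaving β_g (its slope) unchanged, and the condition 0 < β_p < 1 expresses sublinear growth of model size with data.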
What carries the argument
Power-law scaling of generalization error with training set size
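As a minimal sketch of how such a scaling curve might be estimated, the snippet below fits error ≈ alpha * size**beta by ordinary least squares on the logarithms of both quantities. The function name `fit_power_law` and the synthetic data are assumptions made for illustration; they are not the paper's code or measurements.

```python
import math


def fit_power_law(sizes, errors):
    """Least-squares fit of error ~= alpha * size**beta in log-log space.

    Returns (alpha, beta). Assumes all sizes and errors are positive.
    """
    xs = [math.log(m) for m in sizes]
    ys = [math.log(e) for e in errors]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    beta = sxy / sxx                          # slope of the log-log line
    alpha = math.exp(y_mean - beta * x_mean)  # intercept mapped back from logs
    return alpha, beta


# Synthetic, made-up learning curve (not data from the paper).
sizes = [2 ** k for k in range(10, 20)]
errors = [3.0 * m ** -0.25 for m in sizes]
alpha, beta = fit_power_law(sizes, errors)
print(f"alpha={alpha:.3f}, beta={beta:.3f}")
```

Because the synthetic curve is an exact power law, the fit should recover values very close to the generating constants (3.0 and -0.25). Real learning curves are noisy, so the recovered exponent depends on which points are included.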
If this is right
- Accuracy targets can be set by extrapolating the measured power law rather than by exhaustive trial runs.
- Decisions on whether to collect more data can be guided by the expected error reduction per additional example.
- Model architecture work can be assessed by how far it shifts the error curve rather than by any change in the scaling rate.
- Hardware and systems planning can use the sublinear model-size relation to estimate compute needs as datasets grow.
- Continued scaling of data and compute is expected to deliver steady, predictable gains within the domains studied.
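The planning uses listed above reduce to simple arithmetic once a power law has been fitted. The sketch below assumes the form error = alpha * m**beta with beta < 0; the function names and the numeric values are hypothetical and chosen only for illustration.

```python
def examples_for_target(alpha, beta, target_error):
    """Invert error = alpha * m**beta to get the training set size m."""
    if beta >= 0:
        raise ValueError("beta must be negative for error to fall with data")
    return (target_error / alpha) ** (1.0 / beta)


def marginal_error_reduction(alpha, beta, m):
    """Approximate error drop from one more example at size m (negated derivative)."""
    return -alpha * beta * m ** (beta - 1.0)


def data_equivalent_of_shift(alpha_old, alpha_new, beta):
    """Factor by which the old model's data must grow to match the new model."""
    return (alpha_new / alpha_old) ** (1.0 / beta)


# Hypothetical fitted values, chosen only for illustration.
alpha, beta = 3.0, -0.25
needed = examples_for_target(alpha, beta, target_error=0.1)
gain = marginal_error_reduction(alpha, beta, m=1_000_000)
factor = data_equivalent_of_shift(alpha_old=3.0, alpha_new=2.4, beta=beta)
print(f"examples for 0.1 error: {needed:,.0f}")
print(f"error drop per extra example at 1M: {gain:.3e}")
print(f"a 20% lower prefactor is worth {factor:.2f}x more data")
```

The last helper expresses an architecture improvement as an equivalent amount of data: if only the prefactor changes, a better model matches the old model trained on a fixed multiple of the data, regardless of scale. These estimates are only as reliable as the assumption that the fitted curve continues to hold at the extrapolated size.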
Where Pith is reading between the lines
- If the scaling persists, theoretical explanations should target why the observed exponents take their particular values rather than only proving existence of some scaling.
- The invariance of the exponent under model changes suggests that data volume may dominate long-term progress more than incremental architectural tweaks.
- Sublinear growth of model size with data implies that the number of model parameters needed per training example falls as datasets enlarge, so capacity requirements grow more slowly than the data itself.
- A break in the power law at extreme sizes would signal a new regime, such as exhaustion of useful information in the data source.
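To make the sublinear model-size relation concrete, the sketch below assumes the best-fitting parameter count follows c * m**beta_p with 0 < beta_p < 1. The constants are made up for illustration and are not taken from the paper.

```python
def best_model_size(c, beta_p, m):
    """Hypothetical relation: best-fitting parameter count ~= c * m**beta_p."""
    return c * m ** beta_p


c, beta_p = 50.0, 0.7  # made-up constants, with 0 < beta_p < 1
rows = []
for m in (10**6, 10**7, 10**8):
    s = best_model_size(c, beta_p, m)
    rows.append((m, s, s / m))
    print(f"examples={m:>11,d}  params={s:>14,.0f}  params/example={s / m:.4f}")
```

Each tenfold increase in data multiplies the parameter count by 10**beta_p (about 5x for the assumed exponent), so the model keeps growing while the number of parameters per training example shrinks.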
Load-bearing premise
The power-law relationships seen in the tested range of data and model sizes will continue without breaks when both are made much larger.
What would settle it
A new experiment at ten times the largest data volume tested here that shows error deviating from the fitted power-law curve by more than the observed variation.
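One way such a test might be scored is sketched below: compare the log-space residual of the new measurement against the spread of residuals from the points used in the fit. The threshold of k = 2 standard deviations, the function names, and all numbers are assumptions for illustration; the page itself only says "more than the observed variation".

```python
import math
import statistics


def log_residual(size, error, alpha, beta):
    """Log-space gap between a measured error and the fitted power law."""
    return math.log(error) - (math.log(alpha) + beta * math.log(size))


def deviates_from_fit(size, error, alpha, beta, fit_residuals, k=2.0):
    """True if the new point lies more than k residual spreads off the curve."""
    spread = statistics.pstdev(fit_residuals)
    return abs(log_residual(size, error, alpha, beta)) > k * spread


# Made-up fitted curve and noisy points standing in for the tested regime.
alpha, beta = 3.0, -0.25
sizes = [2 ** p for p in range(10, 16)]
noise = [1.02, 0.98, 1.01, 0.99, 1.03, 0.97]
errors = [alpha * m ** beta * f for m, f in zip(sizes, noise)]
fit_residuals = [log_residual(m, e, alpha, beta) for m, e in zip(sizes, errors)]

big = 10 * sizes[-1]             # ten times the largest tested size
on_curve = alpha * big ** beta   # what the power law predicts
plateau = 1.5 * on_curve         # a hypothetical saturating result
print(deviates_from_fit(big, on_curve, alpha, beta, fit_residuals))
print(deviates_from_fit(big, plateau, alpha, beta, fit_residuals))
```

A single point is weak evidence either way; repeated runs at the larger scale would give a better estimate of whether a deviation reflects a new regime or ordinary run-to-run variation.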
Original abstract
Deep learning (DL) creates impactful advances following a virtuous recipe: model architecture search, creating large training data sets, and scaling computation. It is widely believed that growing training sets and models should improve accuracy and result in better products. As DL application domains grow, we would like a deeper understanding of the relationships between training set size, computational scale, and model accuracy improvements to advance the state-of-the-art. This paper presents a large scale empirical characterization of generalization error and model size growth as training sets grow. We introduce a methodology for this measurement and test four machine learning domains: machine translation, language modeling, image processing, and speech recognition. Our empirical results show power-law generalization error scaling across a breadth of factors, resulting in power-law exponents (the "steepness" of the learning curve) yet to be explained by theoretical work. Further, model improvements only shift the error but do not appear to affect the power-law exponent. We also show that model size scales sublinearly with data size. These scaling relationships have significant implications on deep learning research, practice, and systems. They can assist model debugging, setting accuracy targets, and decisions about data set growth. They can also guide computing system design and underscore the importance of continued computational scaling.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a large-scale empirical study of scaling in deep learning across four domains (machine translation, language modeling, image processing, and speech recognition). It claims that generalization error follows a power-law dependence on training set size, that the power-law exponent is invariant to model architecture improvements (which only shift the prefactor), and that optimal model size grows sublinearly with data size. These relationships are positioned as making DL scaling predictable, with implications for research, practice, and systems design.
Significance. If the reported power-law relationships hold, the work provides a valuable empirical foundation for quantifying the benefits of scaling data and compute in deep learning. The cross-domain consistency and the observation that architecture changes primarily affect the constant term rather than the exponent are particularly useful for guiding practical decisions on data collection and model sizing. The study also highlights open theoretical questions about the origin of the exponents.
major comments (3)
- [§3 (Experimental Methodology)] The description of how training subsets of varying sizes were constructed lacks detail on the sampling method (e.g., random vs. contiguous) and on any controls to ensure distributional equivalence across scales; without this, it is difficult to rule out selection effects that could artifactually produce or alter the observed power-law exponents.
- [Results sections (e.g., §4.1–4.4)] No error bars, confidence intervals, or goodness-of-fit statistics (such as R² or residual analysis) are reported for the fitted power-law exponents, and there is no sensitivity analysis to the choice of fitting range; this weakens the ability to assess the robustness of the central scaling claims.
- [§5 (Discussion)] The claim that scaling is 'predictable' rests on the power-law form and exponents persisting beyond the tested regimes, yet the manuscript contains no analysis or discussion of possible breaks, saturation, or changes in effective exponent at substantially larger data volumes or model capacities.
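The fit statistics requested in the second major comment could be produced along the following lines. This is a sketch under stated assumptions: the fit is a straight line in log-log space, R² is computed in that space, and range sensitivity is probed by refitting on sliding windows of points. The function names, window width, and data are hypothetical.

```python
import math


def fit_loglog(sizes, errors):
    """Return (alpha, beta, r2) for error ~= alpha * size**beta, fitted in log space."""
    xs = [math.log(m) for m in sizes]
    ys = [math.log(e) for e in errors]
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    beta = sxy / sxx
    alpha = math.exp(y_mean - beta * x_mean)
    r2 = (sxy * sxy) / (sxx * syy)  # R^2 of a one-variable linear fit
    return alpha, beta, r2


def exponents_by_window(sizes, errors, width=4):
    """Refit on each contiguous window of `width` points to probe range sensitivity."""
    return [
        fit_loglog(sizes[i:i + width], errors[i:i + width])[1]
        for i in range(len(sizes) - width + 1)
    ]


# Made-up noisy learning curve (not data from the paper).
sizes = [2 ** p for p in range(10, 18)]
noise = [1.01, 0.99, 1.02, 0.98, 1.00, 1.01, 0.99, 1.00]
errors = [3.0 * m ** -0.25 * f for m, f in zip(sizes, noise)]

alpha, beta, r2 = fit_loglog(sizes, errors)
window_betas = exponents_by_window(sizes, errors)
print(f"beta={beta:.3f}  R^2={r2:.4f}")
print("window exponents:", [round(b, 3) for b in window_betas])
```

A wide spread among the window exponents would suggest that the reported exponent depends on the chosen fitting range. A high R² in log space alone does not establish a power law, which is why the comment also asks for residual analysis.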
minor comments (3)
- [Abstract and §1] The abstract and introduction would benefit from explicitly stating the numerical values of the observed exponents and the precise functional form used for the power-law fits.
- [Figures] Figures showing learning curves should overlay the fitted power-law curves and report the fitted parameters for direct visual assessment of fit quality.
- [§2 (Related Work)] A brief comparison to prior empirical scaling observations (e.g., in speech or vision) would help situate the novelty of the cross-domain results.
Simulated Author's Rebuttal
We are grateful for the referee's positive assessment and constructive suggestions for improving the manuscript. We address each of the major comments below.
-
Referee: §3 (Experimental Methodology): The description of how training subsets of varying sizes were constructed lacks detail on sampling method (e.g., random vs. contiguous) and any controls to ensure distributional equivalence across scales; without this, it is difficult to rule out selection effects that could artifactually produce or alter the observed power-law exponents.
Authors: We agree that more detail on subset construction is needed for reproducibility. The training subsets were constructed via random sampling (without replacement) from the full training set to maintain distributional properties. We will revise §3 to include a clear description of this sampling method and any verification steps for distributional equivalence.
Revision: yes
-
Referee: Results sections (e.g., §4.1–4.4): No error bars, confidence intervals, or goodness-of-fit statistics (such as R² or residual analysis) are reported for the fitted power-law exponents, and there is no sensitivity analysis to the choice of fitting range; this weakens the ability to assess the robustness of the central scaling claims.
Authors: We acknowledge the value of these statistical measures. Although the fits were consistent across domains and visually robust, we will add error bars (from repeated trials where feasible), report R² and other fit statistics, and perform sensitivity analysis on the fitting range in the revised results sections.
Revision: yes
-
Referee: §5 (Discussion): The claim that scaling is 'predictable' rests on the power-law form and exponents persisting beyond the tested regimes, yet the manuscript contains no analysis or discussion of possible breaks, saturation, or changes in effective exponent at substantially larger data volumes or model capacities.
Authors: The claims are grounded in the empirical observations within the tested regimes. We will expand the Discussion section to address potential limitations at larger scales, including possible saturation or exponent changes, based on trends at the upper limits of our experiments and related literature. However, empirical analysis at substantially larger scales is beyond the scope of this work due to resource constraints.
Revision: partial
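The subset construction described in the first response (random sampling without replacement) might look like the sketch below. Shuffling once and taking prefixes yields uniform random subsets that are also nested; the nesting, the function names, and the toy dataset are assumptions for illustration and are not stated in the rebuttal.

```python
import random
from collections import Counter


def nested_random_subsets(examples, sizes, seed=0):
    """Shuffle once, then take prefixes so smaller subsets sit inside larger ones."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    return {n: shuffled[:n] for n in sorted(sizes) if n <= len(shuffled)}


def label_fractions(subset):
    """Share of each label in a subset of (example_id, label) pairs."""
    counts = Counter(label for _, label in subset)
    total = len(subset)
    return {label: c / total for label, c in counts.items()}


# Toy dataset of (example_id, label) pairs; purely illustrative.
data = [(i, "a" if i % 4 else "b") for i in range(4096)]
subsets = nested_random_subsets(data, sizes=[256, 512, 1024, 2048, 4096])
for n, subset in subsets.items():
    print(n, label_fractions(subset))
```

Comparing label fractions (or any other summary statistic) across subset sizes is one simple check on distributional equivalence: large differences between the small and large subsets would point to the selection effects the referee is concerned about.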
Circularity Check
No circularity: empirical scaling laws are direct observations, not reductions to fitted inputs
full rationale
The manuscript reports measured power-law relationships between generalization error and factors such as training-set size, model size, and compute across four domains. These relationships are obtained by fitting functional forms to experimental data points collected within the tested regimes; the paper does not derive the power-law exponents from prior equations, self-citations, or uniqueness theorems that would make the reported scaling equivalent to its own inputs by construction. Model-size sublinearity is likewise an observed trend, not a prediction forced by the fitting procedure itself. Because the central claims rest on reproducible empirical measurements rather than any self-referential derivation chain, the analysis is self-contained.
Axiom & Free-Parameter Ledger
free parameters (1)
- power-law exponent
axioms (1)
- domain assumption: Power-law functional form adequately captures the scaling relationship over the measured range
Lean theorems connected to this paper
-
HierarchyEmergence: hierarchy_emergence_forces_phi (unclear)
unclear: Relation between the paper passage and the cited Recognition theorem.
Our empirical results show power-law generalization error scaling across a breadth of factors, resulting in power-law exponents—the 'steepness' of the learning curve—yet to be explained by theoretical work. Further, model improvements only shift the error but do not appear to affect the power-law exponent. We also show that model size scales sublinearly with data size.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 60 Pith papers
-
KAN: Kolmogorov-Arnold Networks
KANs with learnable univariate spline activations on edges achieve better accuracy than MLPs with fewer parameters, faster scaling, and direct visualization for scientific discovery.
-
Language Models are Few-Shot Learners
GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.
-
LOSCAR-SGD: Local SGD with Communication-Computation Overlap and Delay-Corrected Sparse Model Averaging
LOSCAR-SGD combines local updates, sparse model averaging, and communication-computation overlap with a delay-corrected merge rule, providing convergence rates for smooth non-convex objectives under worker heterogeneity.
-
Ringmaster LMO: Asynchronous Linear Minimization Oracle Momentum Method
Ringmaster LMO extends delay-thresholding from ASGD to LMO-based momentum updates, providing convergence guarantees under (L0, L1)-smoothness and time-complexity bounds that recover optimal rates in the Euclidean case.
-
PEIRA: Learning Predictive Encoders through Inter-View Regressor Alignment
PEIRA learns predictive encoders by optimizing the trace of the optimal inter-view linear regressor, with only nontrivial global minimizers as stable equilibria that recover leading nonlinear canonical correlation subspaces.
-
Olivia: Harmonizing Time Series Foundation Models with Power Spectral Density
Olivia harmonizes time series datasets via normalized power spectral density using a Harmonizer module and resonator-based HarmonicAttention, achieving state-of-the-art zero-shot, few-shot, and full-shot forecasting o...
-
Characterizing Learning in Deep Neural Networks using Tractable Algorithmic Complexity Analysis
QuBD extends algorithmic complexity estimation to quantized DNN weights, revealing that complexity decreases during learning, increases with overfitting, follows grokking patterns, and correlates with generalization.
-
Sharp feature-learning transitions and Bayes-optimal neural scaling laws in extensive-width networks
In extensive-width networks, features are recovered sequentially through sharp phase transitions, yielding an effective width k_c that unifies Bayes-optimal generalization error scaling as Θ(k_c d / n).
-
Scalable Distributed Stochastic Optimization via Bidirectional Compression: Beyond Pessimistic Limits
Inkheart SGD and M4 use bidirectional compression to achieve time complexities in distributed SGD that improve with worker count n and surpass prior lower bounds under a necessary structural assumption.
-
Decision Boundary-aware Generation for Long-tailed Learning
DBG mitigates boundary overlap in long-tailed learning by generating near-boundary samples, leading to better tail class accuracy and more separable decision spaces.
-
Robust and Clinically Reliable EEG Biomarkers: A Cross Population Framework for Generalizable Parkinson's Disease Detection
A cross-population framework for EEG Parkinson's detection using exhaustive 75 directional evaluations and nested validation shows asymmetric transfer and accuracy up to 94.1% when training diversity increases, suppor...
-
Better and Worse with Scale: How Contextual Entrainment Diverges with Model Size
Contextual entrainment decreases for semantic contexts but increases for non-semantic ones as LLMs scale, following power-law trends with 4x better resistance to misinformation but 2x more copying of arbitrary tokens.
-
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
-
Scaling Laws for Autoregressive Generative Modeling
Autoregressive transformers follow power-law scaling laws for cross-entropy loss with nearly universal exponents relating optimal model size to compute budget across four domains.
-
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colo...
-
Label-Efficient Dataset Pruning via Semi-Supervised Pseudo-Labeling
SemiPrune uses a small labeled subset and semi-supervised pseudo-labeling to enable supervised dataset pruning methods, achieving state-of-the-art results on domain-specific, image-corrupted, and long-tailed datasets.
-
A Boundary-Layer Mechanism for One-Third Scaling in Online Softmax Classification
Derives α^{-1/3} scaling for generalization error in online softmax classification from boundary layers in a teacher-student model.
-
Practical Scaling Laws: Converting Compute into Performance in a Data-Constrained World
A new scaling law L(N, D, T) = E + (L0 - E) h/(1+h) with h = a/N^α + b/T^β + c N^γ/D^δ that decomposes loss into undercapacity, undertraining, and overfitting terms and saturates between E and L0.
-
AIPO: Learning to Reason from Active Interaction
AIPO trains LLMs to expand their reasoning capability boundary via active multi-agent interaction with Verify, Knowledge, and Reasoning agents during RLVR, using importance sampling and clipping to handle feedback, th...
-
AIPO: Learning to Reason from Active Interaction
AIPO adds active multi-agent consultation (Verify, Knowledge, Reasoning agents) plus custom importance sampling to RLVR training so LLMs expand their reasoning boundary and then operate without the agents.
-
A Qualitative Test-Risk Mechanism for Scaling Behavior in Normalized Residual Networks
Depth expansion in normalized residual networks yields provable test-risk improvement through representational, optimization, and generalization gains under first-order descent and norm-control conditions.
-
Policy-Guided Stepwise Model Routing for Cost-Effective Reasoning
A small RL-trained policy for stepwise model routing between LLM sizes improves the accuracy-cost tradeoff on math benchmarks over handcrafted strategies and matches large process reward model methods.
-
ZAYA1-8B Technical Report
ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.
-
InfoLaw: Information Scaling Laws for Large Language Models with Quality-Weighted Mixture Data and Repetition
InfoLaw models pretraining as information accumulation where quality sets information density and repetition causes scale-dependent diminishing returns, predicting loss with low error on unseen mixtures and larger sca...
-
A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws
Emergent intelligence is recast as the existence of the limit of performance E(N,P,K) as N,P,K to infinity, with necessary and sufficient conditions derived via nonlinear Lipschitz operator theory and scaling laws obt...
-
The Power of Power Law: Asymmetry Enables Compositional Reasoning
Power-law data sampling creates beneficial asymmetry in the loss landscape that lets models acquire high-frequency skill compositions first, enabling more efficient learning of rare long-tail skills than uniform distr...
-
Large language model-enabled automated data extraction for concrete materials informatics
LLM pipeline extracts nearly 9,000 high-quality blended-cement concrete records from over 27,000 publications with F1 scores up to 0.97 and enables ML analyses showing benefits of large diverse datasets.
-
Adaptive Test-Time Scaling for Zero-Shot Respiratory Audio Classification
TRIAGE adaptively scales test-time compute via tiered zero-shot stages for respiratory audio classification, reaching mean AUROC 0.744 across nine tasks while outperforming prior zero-shot methods.
-
Unsupervised domain adaptation for radioisotope identification in gamma spectroscopy
Unsupervised domain adaptation via feature alignment raises radioisotope identification accuracy on real LaBr3 gamma spectra from 0.754 to 0.904 for models trained only on synthetic data.
-
Model Merging Scaling Laws in Large Language Models
Empirical scaling laws for LLM merging show a size-dependent floor and 1/k-like tail in cross-entropy loss that holds across architectures and merging methods.
-
Surprisingly High Redundancy in Electronic Structure Data Across Materials Explained by Low Intrinsic Dimensionality
Electronic structure datasets across materials show high redundancy from low intrinsic dimensionality, allowing pruning to 1/100th size with preserved chemical accuracy.
-
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.
-
Superposition Yields Robust Neural Scaling
Strong superposition causes neural loss to scale as the inverse of model dimension due to geometric feature overlaps, explaining scaling laws for broad frequency distributions.
-
Learning to Reason under Off-Policy Guidance
LUFFY mixes off-policy reasoning traces into RLVR training via Mixed-Policy GRPO and regularized importance sampling, delivering over 6-point gains on math benchmarks and enabling training of weak models where on-poli...
-
LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws
Pretraining data determines loss-to-loss scaling laws in LLMs, while model size, optimization, tokenizer, and architecture have limited impact.
-
Two-Point Deterministic Equivalence for Stochastic Gradient Dynamics in Linear Models
Derives a novel two-point deterministic equivalence for random matrix resolvents to obtain unified asymptotics for SGD-trained linear regression, kernel regression, and random feature models.
-
Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models
Empirical analysis shows scaling inference compute via strategies like tree search can be more efficient than scaling model parameters, with 7B models plus novel search outperforming 34B models.
-
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
Repeated sampling scales problem coverage log-linearly with sample count, improving SWE-bench Lite performance from 15.9% to 56% using 250 samples.
-
The Falcon Series of Open Language Models
Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.
-
Textbooks Are All You Need
A 1.3B-parameter code model trained on 7B tokens of curated textbook and synthetic data achieves 50.6% on HumanEval, indicating data quality can enable strong performance at small scale.
-
SemDeDup: Data-efficient learning at web-scale through semantic deduplication
SemDeDup removes semantic duplicates from datasets like LAION using pre-trained embeddings, cutting data by 50% with minimal performance loss and efficiency gains on C4.
-
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
BLOOM is a 176B-parameter open-access multilingual language model trained on the ROOTS corpus that achieves competitive performance on benchmarks, with improved results after multitask prompted finetuning.
-
eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers
An ensemble of stage-specialized text-to-image diffusion models improves prompt alignment over single shared-parameter models while preserving visual quality and inference speed.
-
Language Models (Mostly) Know What They Know
Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
-
Scaling Laws and Interpretability of Learning from Repeated Data
Repeating 0.1% of training data 100 times degrades an 800M parameter model's performance to that of a 400M model by damaging copying mechanisms and induction heads associated with generalization.
-
A General Language Assistant as a Laboratory for Alignment
Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
-
Scaling Laws for Transfer
Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.
-
Asymmetric Scaling Laws from Sparse Features
A sparse-activation model predicts double-descent loss with distinct under- and over-parameterized scaling exponents set by sparsity, plus a compute-optimal frontier favoring dataset growth.
-
Scaling Laws and Tradeoffs in Recurrent Networks of Expressive Neurons
Recurrent networks built from tunable expressive neurons reveal scaling laws with an optimal parameter split that shifts toward higher per-neuron complexity at larger scales.
-
Rennala MVR: Improved Time Complexity for Parallel Stochastic Optimization via Momentum-Based Variance Reduction
Rennala MVR improves time complexity over Rennala SGD for smooth nonconvex stochastic optimization in heterogeneous parallel systems under a mean-squared smoothness assumption.
-
Physical Foundation Models: Fixed hardware implementations of large-scale neural networks
Physical Foundation Models are fixed physical hardware realizations of foundation-scale neural networks that compute via inherent material dynamics, potentially delivering orders-of-magnitude gains in energy efficienc...
-
A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws
Emergent intelligence corresponds to the limit of a performance function E(N,P,K) as N, P, K go to infinity, originating from a parameter-limit architecture whose existence is governed by Lipschitz conditions, with sc...
-
Singularity Formation: Synergy in Theoretical, Numerical and Machine Learning Approaches
The work introduces a modulation-based analytical method for singularity proofs in singular PDEs and refines ML techniques like PINNs and KANs to identify blowup solutions, with application to the open 3D Keller-Segel...
-
Cooperate to Compete: Strategic Data Generation and Incentivization Framework for Coopetitive Cross-Silo Federated Learning
CoCoGen+ models each federated learning round as a weighted potential game with strategic synthetic data generation and payoff redistribution incentives, showing improved efficiency over baselines under non-IID data a...
-
Towards Scaling Law Analysis For Spatiotemporal Weather Data
Scaling laws for weather models exhibit strong cross-channel and cross-horizon heterogeneity, where globally pooled metrics appear favorable while many individual channels degrade at longer leads.
-
The Platonic Representation Hypothesis
Representations learned by large AI models are converging toward a shared statistical model of reality.
-
StarCoder: may the source be with you!
StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.
-
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.
-
Less (Data) Is More: Why Small Data Holds the Key to the Future of Artificial Intelligence
Position paper arguing that AI's future lies in small-data, privacy-oriented human-machine collaboration rather than big-data scaling.
-
Multilingual and Multimodal LLMs in the Wild: Building for Low-Resource Languages
A tutorial synthesizing foundations, recent models such as PALO and Maya, and low-cost methods for tri-modal multilingual AI in resource-constrained settings.