Neural scaling laws are invariant under bijective data transformations and change predictably with information resolution ρ under non-bijective transformations, enabling cross-domain transport of fitted exponents.
3 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.LG (3)
roles: background (1)
polarities: background (1)
representative citing papers
Transformers with O(sum m^j) blocks and O(d sum m^j) parameters can exactly interpolate any finite dataset of input sequences in R^d to output sequences of lengths m^j.
The Transformer is interpreted as discretization of a structured integro-differential equation in continuous domains for tokens and features, unifying attention, feedforward, and normalization via operator and variational views.
citing papers explorer
- On the Invariance and Generality of Neural Scaling Laws
  Neural scaling laws are invariant under bijective data transformations and change predictably with information resolution ρ under non-bijective transformations, enabling cross-domain transport of fitted exponents.
- Exact Sequence Interpolation with Transformers
  Transformers with O(sum m^j) blocks and O(d sum m^j) parameters can exactly interpolate any finite dataset of input sequences in R^d to output sequences of lengths m^j.
- A Mathematical Explanation of Transformers
  The Transformer is interpreted as discretization of a structured integro-differential equation in continuous domains for tokens and features, unifying attention, feedforward, and normalization via operator and variational views.