Small transformers reproduce known Bayesian posteriors with 10^{-3} to 10^{-4} bit accuracy in verifiable wind-tunnel tasks via residual belief states, FFN updates, and attention routing, while MLPs do not.
Title resolution pending
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.LG 1years
2025 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
The Bayesian Geometry of Transformer Attention
Small transformers reproduce known Bayesian posteriors with 10^{-3} to 10^{-4} bit accuracy in verifiable wind-tunnel tasks via residual belief states, FFN updates, and attention routing, while MLPs do not.