Small transformers reproduce known Bayesian posteriors with 10^{-3} to 10^{-4} bit accuracy in verifiable wind-tunnel tasks via residual belief states, FFN updates, and attention routing, while MLPs do not.
3 Pith papers cite this work.
citing papers explorer
- The Bayesian Geometry of Transformer Attention
Small transformers reproduce known Bayesian posteriors with 10^{-3} to 10^{-4} bit accuracy in verifiable wind-tunnel tasks via residual belief states, FFN updates, and attention routing, while MLPs do not.
- Gradient Dynamics of Attention: How Cross-Entropy Sculpts Bayesian Manifolds
Gradient analysis shows cross-entropy induces an EM-like loop in attention that sculpts Bayesian manifolds supporting in-context probabilistic inference.
- Geometric Scaling of Bayesian Inference in LLMs
Large language models preserve a geometric substrate in value representations that correlates with uncertainty and matches patterns from small models performing exact Bayesian inference.