For uniform keys on the d-dimensional sphere, softmax attention becomes selective at inverse temperature scaling β_n* ≍ n^{2/(d-1)}, with explicit limiting laws for attention weights and outputs in each regime.
Advances in Neural Information Processing Systems , volume=
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 4verdicts
UNVERDICTED 4roles
background 1polarities
background 1representative citing papers
Inkheart SGD and M4 use bidirectional compression to achieve time complexities in distributed SGD that improve with worker count n and surpass prior lower bounds under a necessary structural assumption.
Common ID estimators fail to track the true intrinsic dimension of neural representations and are instead driven by other factors.
DQPOPE estimates the entire return distribution in off-policy evaluation via deep quantile process regression, providing statistical advantages over standard single-value methods with equivalent sample sizes.
citing papers explorer
-
Scaling Limits of Long-Context Transformers
For uniform keys on the d-dimensional sphere, softmax attention becomes selective at inverse temperature scaling β_n* ≍ n^{2/(d-1)}, with explicit limiting laws for attention weights and outputs in each regime.
-
Scalable Distributed Stochastic Optimization via Bidirectional Compression: Beyond Pessimistic Limits
Inkheart SGD and M4 use bidirectional compression to achieve time complexities in distributed SGD that improve with worker count n and surpass prior lower bounds under a necessary structural assumption.
-
Rethinking Intrinsic Dimension Estimation in Neural Representations
Common ID estimators fail to track the true intrinsic dimension of neural representations and are instead driven by other factors.
-
Distributional Off-Policy Evaluation with Deep Quantile Process Regression
DQPOPE estimates the entire return distribution in off-policy evaluation via deep quantile process regression, providing statistical advantages over standard single-value methods with equivalent sample sizes.