Sparse attention arises from compact kernel regression, with Epanechnikov and similar kernels mapping to normalized ReLU, sparsemax, and alpha-entmax attention.
Ryumei Nakada and Masaaki Imaizumi
7 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts: UNVERDICTED 7
roles: method 1
polarities: use method 1
representative citing papers
citing papers explorer
- Sparse Attention as Compact Kernel Regression
  Sparse attention arises from compact kernel regression, with Epanechnikov and similar kernels mapping to normalized ReLU, sparsemax, and alpha-entmax attention.
- Understanding In-Context Learning on Structured Manifolds: Bridging Attention to Kernel Methods
  Transformers perform kernel-based prediction for Hölder regression on manifolds and achieve intrinsic-dimension-dependent minimax rates with sufficient training tasks.
- Direct Estimation of Schrödinger Bridge Time-Series Drifts: Finite-Sample, Asymptotic, and Adaptive Guarantees
  A direct plug-in kernel estimator for Schrödinger bridge time-series drifts achieves uniform non-asymptotic bounds, pointwise CLT under undersmoothing, and minimax-rate optimal adaptive selection.
- VeloTree: Inferring single-cell trajectories from RNA velocity fields with varifold distances
  VeloTree infers differentiation trees from RNA velocity fields by defining cell dissimilarity as the squared varifold distance between integral curves of the velocity field.
- Towards Scalable Persistence-Based Topological Optimization
  Random slicing for subsampling combined with Nadaraya-Watson smoothing enables faster and improved persistence-based topological optimization of point clouds in 2D and 3D.
- Optimal Experimental Design for Reliable Learning of History-Dependent Constitutive Laws
  A Bayesian optimal experimental design framework with Gaussian approximation of expected information gain and surrogate Fisher information enables optimized uniaxial tests that significantly improve identifiability of history-dependent constitutive parameters over random designs.
- What makes a word hard to learn? Modeling L1 influence on English vocabulary difficulty
  Gradient-boosted models with SHAP analysis find word familiarity as the dominant predictor of English vocabulary difficulty across Spanish, German, and Chinese L1 learners, with orthographic transfer adding value only for the first two groups.