Sparse attention arises from compact kernel regression, with Epanechnikov and similar kernels mapping to normalized ReLU, sparsemax, and alpha-entmax attention.
Title resolution pending
4 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 4representative citing papers
LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via interventions.
Gated linear attention enables lower training and test errors in non-stationary in-context learning by adaptively modulating past inputs through a learnable recency bias under an autoregressive model of task evolution.
Gated DeltaNet integrates gating and delta rules into linear transformers, outperforming Mamba2 and DeltaNet on language modeling, reasoning, retrieval, and long-context tasks.
citing papers explorer
-
Sparse Attention as Compact Kernel Regression
Sparse attention arises from compact kernel regression, with Epanechnikov and similar kernels mapping to normalized ReLU, sparsemax, and alpha-entmax attention.
-
Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space
LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via interventions.
-
Learning to Adapt: In-Context Learning Beyond Stationarity
Gated linear attention enables lower training and test errors in non-stationary in-context learning by adaptively modulating past inputs through a learnable recency bias under an autoregressive model of task evolution.
-
Gated Delta Networks: Improving Mamba2 with Delta Rule
Gated DeltaNet integrates gating and delta rules into linear transformers, outperforming Mamba2 and DeltaNet on language modeling, reasoning, retrieval, and long-context tasks.