Cubit replaces Transformer's attention with a closed-form Kernel Ridge Regression token mixer and reports larger gains as training sequence length increases.
Understanding and improving layer normalization.Advances in neural information processing systems, 32
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
verdicts
UNVERDICTED 2representative citing papers
GLSTaGAT is a spatial-temporal graph attention network using data-driven fusion graphs, global-local blocks, node normalization, and a transformer encoder to outperform baselines on real-world network traffic datasets.
citing papers explorer
-
Cubit: Token Mixer with Kernel Ridge Regression
Cubit replaces Transformer's attention with a closed-form Kernel Ridge Regression token mixer and reports larger gains as training sequence length increases.
-
Global-local Spatial-temporal Aware Graph Attention Network for Network Traffic Forecasting
GLSTaGAT is a spatial-temporal graph attention network using data-driven fusion graphs, global-local blocks, node normalization, and a transformer encoder to outperform baselines on real-world network traffic datasets.