Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free

Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, Dayiheng Liu, Jingren Zhou, Junyang Lin · 2025

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Mean Mode Screaming: Mean-Variance Split Residuals for 1000-Layer Diffusion Transformers

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

Mean-Variance Split residuals separate centered variation from mean updates to prevent collapse and enable stable training of 1000-layer Diffusion Transformers.

Transformer as an Euler Discretization of Score-based Variational Flow

cs.LG · 2026-04-26 · unverdicted · novelty 6.0

The Transformer is recovered exactly as the forward Euler step of spherical SVFlow, with multi-head attention and MoE/FFN as approximations to its vector field.

Vision Transformers Need Better Token Interaction

cs.CV · 2026-05-22 · unverdicted · novelty 5.0

Replacing softmax attention with entmax-1.5 in DINOv1 ViT-S/16 improves semantic segmentation mIoU on three benchmarks while keeping ImageNet linear-probing accuracy unchanged.

citing papers explorer

Showing 3 of 3 citing papers.

Mean Mode Screaming: Mean-Variance Split Residuals for 1000-Layer Diffusion Transformers · cs.LG · 2026-05-07 · unverdicted · none · ref 48
Transformer as an Euler Discretization of Score-based Variational Flow · cs.LG · 2026-04-26 · unverdicted · none · ref 22
Vision Transformers Need Better Token Interaction · cs.CV · 2026-05-22 · unverdicted · none · ref 22
