Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free
3 Pith papers cite this work. Polarity classification is still indexing.
years
2026: 3
verdicts
UNVERDICTED: 3
representative citing papers
The Transformer is recovered exactly as the forward Euler step of spherical SVFlow, with multi-head attention and MoE/FFN as approximations to its vector field.
Replacing softmax attention with entmax-1.5 in DINOv1 ViT-S/16 improves semantic segmentation mIoU on three benchmarks while keeping ImageNet linear-probing accuracy unchanged.
citing papers explorer
- Mean Mode Screaming: Mean–Variance Split Residuals for 1000-Layer Diffusion Transformers
  Mean-Variance Split residuals separate centered variation from mean updates to prevent collapse and enable stable training of 1000-layer Diffusion Transformers.
- Transformer as an Euler Discretization of Score-based Variational Flow
  The Transformer is recovered exactly as the forward Euler step of spherical SVFlow, with multi-head attention and MoE/FFN as approximations to its vector field.
- Vision Transformers Need Better Token Interaction
  Replacing softmax attention with entmax-1.5 in DINOv1 ViT-S/16 improves semantic segmentation mIoU on three benchmarks while keeping ImageNet linear-probing accuracy unchanged.