Cosine-scored SAEs with a learned direction-magnitude blend learn more concept-aligned features than standard inner-product SAEs at matched reconstruction quality.
hub Canonical reference
ngpt: Normalized transformer with rep- resentation learning on the hypersphere
Canonical reference. 100% of citing Pith papers cite this work as background.
hub tools
citation-role summary
citation-polarity summary
roles
background 6polarities
background 6representative citing papers
Chem-GMNet uses sphere-native embeddings, DualSKA attention, and SH-FFN layers to match or beat ChemBERTa-2 on MoleculeNet tasks with fewer parameters and sometimes no pretraining.
MD Decoupling factorizes weights into fixed-norm directions and learnable per-row/column magnitudes updated at independent rates, improving Adam and Muon training stability and scale transfer without weight decay or warmup.
Attention sinks arise from variance discrepancy in self-attention value aggregation, amplified by super neurons and first-token dimension disparity, and can be mitigated by head-wise RMSNorm to accelerate pre-training convergence.
Manifold constraints via the new MACRO optimizer independently bound activation scales and enforce rotational equilibrium in LLM pre-training, subsuming RMS normalization and decoupled weight decay while delivering competitive performance with convergence guarantees.
Polaris separates semantic meaning from hierarchical structure in embeddings via angular geometry and radius on a hypersphere, yielding up to 19-point gains in taxonomy expansion retrieval over baselines.
FlashSAC improves training speed and final performance of off-policy RL on high-dimensional robot tasks by reducing update frequency, increasing model scale, and bounding norms to limit critic error accumulation.
Strong superposition causes neural loss to scale as the inverse of model dimension due to geometric feature overlaps, explaining scaling laws for broad frequency distributions.
Normalized Matching Transformer enforces unit-norm embeddings at every Transformer layer and trains with InfoNCE plus hyperspherical uniformity loss, reaching new state-of-the-art accuracy on PascalVOC and SPair-71k while converging faster than prior matching networks.
Proximal stochastic spectral preconditioning converges for nonconvex constrained objectives under heavy-tailed noise, with a variance-reduced version achieving faster rates and a refined analysis of Muon iterations.
DyT improves validation loss 27% at 64M params/1M tokens but worsens it 19% at 118M tokens, with saturation levels predicting the sign of the effect.