nGPT: Normalized Transformer with Representation Learning on the Hypersphere

nGPT: Normalized Transformer with Representation Learning on the Hypersphere · 2024 · arXiv 2410.01131

Canonical reference. 100% of citing Pith papers cite this work as background.

12 Pith papers citing it

Background 100% of classified citations

citation-role summary

background 6

citation-polarity summary

background 6

representative citing papers

Size Doesn't Matter: Cosine-Scored Sparse Autoencoders

cs.LG · 2026-06-13 · unverdicted · novelty 7.0

Cosine-scored SAEs with a learned direction-magnitude blend learn more concept-aligned features than standard inner-product SAEs at matched reconstruction quality.

Chem-GMNet: A Sphere-Native Geometric Transformer for Molecular Property Prediction

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

Chem-GMNet uses sphere-native embeddings, DualSKA attention, and SH-FFN layers to match or beat ChemBERTa-2 on MoleculeNet tasks with fewer parameters and sometimes no pretraining.

Improving Neural Network Training by Decoupling the Magnitude and Direction of Weight Vectors

cs.LG · 2026-06-24 · unverdicted · novelty 6.0

MD Decoupling factorizes weights into fixed-norm directions and learnable per-row/column magnitudes updated at independent rates, improving Adam and Muon training stability and scale transfer without weight decay or warmup.

The Structural Origin of Attention Sink: Variance Discrepancy, Super Neurons, and Dimension Disparity

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

Attention sinks arise from variance discrepancy in self-attention value aggregation, amplified by super neurons and first-token dimension disparity, and can be mitigated by head-wise RMSNorm to accelerate pre-training convergence.

Demystifying Manifold Constraints in LLM Pre-training

cs.LG · 2026-05-06 · unverdicted · novelty 6.0

Manifold constraints via the new MACRO optimizer independently bound activation scales and enforce rotational equilibrium in LLM pre-training, subsuming RMS normalization and decoupled weight decay while delivering competitive performance with convergence guarantees.

Polaris: Coupled Orbital Polar Embeddings for Hierarchical Concept Learning

cs.LG · 2026-04-30 · unverdicted · novelty 6.0

Polaris separates semantic meaning from hierarchical structure in embeddings via angular geometry and radius on a hypersphere, yielding up to 19-point gains in taxonomy expansion retrieval over baselines.

FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control

cs.LG · 2026-04-06 · unverdicted · novelty 6.0 · 2 refs

FlashSAC improves training speed and final performance of off-policy RL on high-dimensional robot tasks by reducing update frequency, increasing model scale, and bounding norms to limit critic error accumulation.

Superposition Yields Robust Neural Scaling

cs.LG · 2025-05-15 · conditional · novelty 6.0

Strong superposition causes neural loss to scale as the inverse of model dimension due to geometric feature overlaps, explaining scaling laws for broad frequency distributions.

Normalized Matching Transformer

cs.CV · 2025-03-22 · unverdicted · novelty 6.0

Normalized Matching Transformer enforces unit-norm embeddings at every Transformer layer and trains with InfoNCE plus hyperspherical uniformity loss, reaching new state-of-the-art accuracy on PascalVOC and SPair-71k while converging faster than prior matching networks.

Constrained Stochastic Spectral Preconditioning Converges for Nonconvex Objectives

math.OC · 2026-05-12 · unverdicted · novelty 5.0

Proximal stochastic spectral preconditioning converges for nonconvex constrained objectives under heavy-tailed noise, with a variance-reduced version achieving faster rates and a refined analysis of Muon iterations.

When Does Removing LayerNorm Help? Activation Bounding as a Regime-Dependent Implicit Regularizer

cs.LG · 2026-04-25 · unverdicted · novelty 5.0

Replacing LayerNorm with Dynamic Tanh (DyT) improves validation loss by 27% at 64M parameters and 1M tokens but worsens it by 19% at 118M tokens, with saturation levels predicting the sign of the effect.

Robust Filter Attention: Self-Attention as Precision-Weighted State Estimation

cs.LG · 2025-09-04 · unreviewed · ref 56
