The paper proves negative weight drift at initialization under MSE or cross-entropy with asymmetric activations, links it to up to 90% sparsity in GPT-nano, maps the sparsity-accuracy cliff across 79 configurations, and shows clipped ReLU² and GELU² improve validation loss.
arXiv preprint arXiv:2503.05613 , year=
4 Pith papers cite this work. Polarity classification is still indexing.
years
2026 4representative citing papers
SLIM decomposes LLM hidden states via sparse autoencoders with learnable gates to enable precise, interpretable steering of molecular properties, yielding up to 42.4-point gains on the MolEditRL benchmark.
Graph-motif clustering of SAE features via a frequency-binned WL kernel recovers structural families not captured by decoder cosine similarity or token histograms.
citing papers explorer
-
Bug or Feature$^2$: Weight Drift, Activation Sparsity and Spikes
The paper proves negative weight drift at initialization under MSE or cross-entropy with asymmetric activations, links it to up to 90% sparsity in GPT-nano, maps the sparsity-accuracy cliff across 79 configurations, and shows clipped ReLU² and GELU² improve validation loss.
-
SLIM: Sparse Latent Steering for Interpretable and Property-Directed LLM-Based Molecular Editing
SLIM decomposes LLM hidden states via sparse autoencoders with learnable gates to enable precise, interpretable steering of molecular properties, yielding up to 42.4-point gains on the MolEditRL benchmark.
-
From Token Lists to Graph Motifs: Weisfeiler-Lehman Analysis of Sparse Autoencoder Features
Graph-motif clustering of SAE features via a frequency-binned WL kernel recovers structural families not captured by decoder cosine similarity or token histograms.
- Relational Linear Properties in Language Models: An Empirical Investigation