The paper proves negative weight drift at initialization under MSE or cross-entropy with asymmetric activations, links it to up to 90% sparsity in GPT-nano, maps the sparsity-accuracy cliff across 79 configurations, and shows clipped ReLU² and GELU² improve validation loss.
arXiv preprint arXiv:2503.05613
8 Pith papers cite this work. Polarity classification is still indexing.
Citing papers by year: 2026 (8)
representative citing papers
Graph-motif clustering of SAE features via a frequency-binned WL kernel recovers structural families not captured by decoder cosine similarity or token histograms.
citing papers explorer
- SLIM: Sparse Latent Steering for Interpretable and Property-Directed LLM-Based Molecular Editing
  SLIM decomposes LLM hidden states via sparse autoencoders with learnable gates to enable precise, interpretable steering of molecular properties, yielding up to 42.4-point gains on the MolEditRL benchmark.
- Do Sparse Autoencoders Learn Meaningful Concept Hierarchies?
  Sparse autoencoders provide a basis for sensible concept hierarchies on visual data but are undermined by hard and soft feature absorption.
- The Tell-Tale Norm: $\ell_2$ Magnitude as a Signal for Reasoning Dynamics in Large Language Models
  The L2 norm of LLM hidden states signals reasoning intensity, with a theoretical bound on SAE feature activations, enabling three new test-time scaling techniques that boost performance.
- Sparse probes and murky physics: a case study of interpretability challenges in a foundation model for continuum dynamics
  Case study applies SAE probing with enstrophy triage to a continuum-dynamics foundation model and reports intermittent feature consistency that does not align with standard physics while linking some output discrepancies to specific feature changes.
- Characterize Then Distill: Mechanistic Reasoning in Large Output Spaces
  Reasoning in large output spaces proceeds via shortlisting then fine-grained reasoning; this characterization enables a mechanistic distillation strategy that outperforms standard distillation.
- Relational Linear Properties in Language Models: An Empirical Investigation
  Introduces KL-divergence probing to test relational linearity and reports its variation across models, layers, and paraphrased queries on four datasets.