WriteSAE introduces sparse autoencoders with rank-1 matrix atoms for recurrent state updates, allowing replacement tests that outperform deletion on 92.4% of positions and a formula predicting logit changes with R²=0.98.
Locating and Editing Factual Associations in
6 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 6representative citing papers
A single-layer transformer memorizes random subject-attribute bijections using logarithmic embedding dimension via linear superpositions in embeddings and ReLU-gated selection in the MLP, with zero-shot transfer to new facts and matching multi-hop constructions.
LinAlg-Bench shows LLMs switch from execution errors to computational abandonment and structured fabrication at 4x4 matrix scale, indicating a working memory limit rather than knowledge gaps.
Unpack decomposes transformer credit via a unified backward recursion on the φ(S)U template, recovering known IOI circuits with mode labels and showing consistent duplicate-name suppression across Pythia scales from a single forward pass.
Benchmark construction artifacts in hallucination detection corpora allow naive text-similarity baselines to achieve near-perfect scores, and controlled evaluations show most methods perform near chance except SAPLMA and the new DRIFT probe.
Suppressing one refusal neuron or amplifying one concept neuron bypasses safety alignment in LLMs from 1.7B to 70B parameters without training or prompt engineering.
citing papers explorer
-
WriteSAE: Sparse Autoencoders for Recurrent State
WriteSAE introduces sparse autoencoders with rank-1 matrix atoms for recurrent state updates, allowing replacement tests that outperform deletion on 92.4% of positions and a formula predicting logit changes with R²=0.98.
-
Geometric Factual Recall in Transformers
A single-layer transformer memorizes random subject-attribute bijections using logarithmic embedding dimension via linear superpositions in embeddings and ReLU-gated selection in the MLP, with zero-shot transfer to new facts and matching multi-hop constructions.
-
LinAlg-Bench: A Forensic Benchmark Revealing Structural Failure Modes in LLM Mathematical Reasoning
LinAlg-Bench shows LLMs switch from execution errors to computational abandonment and structured fabrication at 4x4 matrix scale, indicating a working memory limit rather than knowledge gaps.
-
Every Component is a Lookup: Token Attribution and Composition from a Single Decomposition
Unpack decomposes transformer credit via a unified backward recursion on the φ(S)U template, recovering known IOI circuits with mode labels and showing consistent duplicate-name suppression across Pythia scales from a single forward pass.
-
PARALLAX: Separating Genuine Hallucination Detection from Benchmark Construction Artifacts
Benchmark construction artifacts in hallucination detection corpora allow naive text-similarity baselines to achieve near-perfect scores, and controlled evaluations show most methods perform near chance except SAPLMA and the new DRIFT probe.
-
A Single Neuron Is Sufficient to Bypass Safety Alignment in Large Language Models
Suppressing one refusal neuron or amplifying one concept neuron bypasses safety alignment in LLMs from 1.7B to 70B parameters without training or prompt engineering.