WriteSAE introduces sparse autoencoders with rank-1 matrix atoms for recurrent state updates, allowing replacement tests that outperform deletion on 92.4% of positions and a formula predicting logit changes with R²=0.98.
Axbench: Steering llms? even simple baselines outperform sparse autoencoders
10 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 10verdicts
UNVERDICTED 10roles
background 1polarities
background 1representative citing papers
Cosine-scored SAEs with a learned direction-magnitude blend learn more concept-aligned features than standard inner-product SAEs at matched reconstruction quality.
Prompt-boundary directional alignment enables geometry-guided search that cuts trials to 95% best utility by 39.8% on average, while concept granularity predicts remaining difficulty via directional heterogeneity.
Activation steering produces synthetic safety-violating data that improves downstream classifiers over prompting on most tested concepts when a harmonic mean of alignment, coherence, and diversity is optimized.
An audit of SAEBench reveals that Targeted Probe Perturbation and Spurious Correlation Removal metrics fail reliability tests and should not be used to evaluate sparse autoencoders.
LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via interventions.
AI deployment in high-stakes areas requires domain-scoped calibrated verification with monitoring and revocation, using a proposed six-component Verification Coverage standard instead of mechanistic interpretability.
Decoder-based VLMs over-align visual embeddings to text manifold causing linguistic bias in top PCs of a universal text subspace; projecting out this subspace reduces hallucinations on POPE/CHAIR/AMBER and improves CLAIR.
Causal localization via attribution and patching identifies a temporal preference subgraph in mid-to-upper layers of Qwen3-4B-Instruct-2507, with time-horizon geometry in the residual stream and initial evidence for steering-vector control.
Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.
citing papers explorer
-
The Open-Box Fallacy: Why AI Deployment Needs a Calibrated Verification Regime
AI deployment in high-stakes areas requires domain-scoped calibrated verification with monitoring and revocation, using a proposed six-component Verification Coverage standard instead of mechanistic interpretability.