Dictionary learning improves patch-free circuit discovery in mechanistic interpretability: A case study on Othello-GPT. arXiv preprint arXiv:2402.12201, 2024b

Zhengfu He, Xuyang Ge, Qiong Tang, Tianxiang Sun, Qinyuan Cheng, Xipeng Qiu · 2024 · arXiv 2402.12201

2 Pith papers cite this work. Polarity classification is still indexing.


representative citing papers

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

cs.AI · 2026-05-28 · unverdicted · novelty 6.0

Sparse autoencoders scaled to 34 million features on Claude 3 Sonnet yield interpretable, steerable representations of concrete and abstract concepts that generalize across languages and modalities.

Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models

cs.CL · 2026-01-20 · unverdicted · novelty 5.0

The survey organizes mechanistic interpretability techniques into a Locate-Steer-Improve framework to enable actionable improvements in LLM alignment, capability, and efficiency.

citing papers explorer


Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

cs.AI · 2026-05-28 · unverdicted · none · ref 32

Sparse autoencoders scaled to 34 million features on Claude 3 Sonnet yield interpretable, steerable representations of concrete and abstract concepts that generalize across languages and modalities.
