Attribution patching outperforms automated circuit discovery

Aaquib Syed, Can Rager, Arthur Conmy · 2024

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

browse 2 citing papers

representative citing papers

Crafting Reversible SFT Behaviors in Large Language Models

cs.LG · 2026-05-07 · unverdicted · novelty 8.0

LCDD creates sparse carriers for SFT behaviors that SFT-Eraser can reverse, with ablations showing the sparse structure enables causal control.

Navigating by Old Maps: The Pitfalls of Static Mechanistic Localization in LLM Post-Training

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

Transformer circuits show free evolution during SFT, rendering static mechanistic localization inadequate for future parameter updates due to inherent temporal latency.

citing papers explorer

Showing 2 of 2 citing papers.

Crafting Reversible SFT Behaviors in Large Language Models cs.LG · 2026-05-07 · unverdicted · none · ref 15
LCDD creates sparse carriers for SFT behaviors that SFT-Eraser can reverse, with ablations showing the sparse structure enables causal control.
Navigating by Old Maps: The Pitfalls of Static Mechanistic Localization in LLM Post-Training cs.CL · 2026-05-07 · unverdicted · none · ref 16
Transformer circuits show free evolution during SFT, rendering static mechanistic localization inadequate for future parameter updates due to inherent temporal latency.

Attribution patching outperforms automated circuit discovery

fields

years

verdicts

representative citing papers

citing papers explorer