Factual associations in autoregressive transformers are localized to mid-layer feed-forward modules and can be edited via rank-one model editing while preserving both specificity and generalization on counterfactual tests.
Causal analysis of syntactic agreement mechanisms in neural language models
8 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
Re-derivation of activation patching NIE reveals it captures interaction effects in addition to direct causal effects, demonstrated via GPT-2 IOI circuit where INT explains component ranking issues and faithfulness instability.
Introduces the binning semiring and causal graphical models to show that correlational evaluation of learnability in formal language tasks leads to incorrect conclusions from confounders.
Structurally distinct circuits for literal sequence copying across token frequency bands implement the same computation, shown by broad transfer of band-specific edges, a shared core recovering 99% performance, and interchangeable representations via causal interventions.
Llama-3.1-8B computes sums for cyclic concepts using base-10 addition via task-agnostic Fourier features with periods 2, 5, and 10 rather than modular arithmetic in the concept period.
An exposure-based split on BLiMP data reveals delayed generalization in five grammatical phenomena during LLM pre-training, with post-generalization shifts in concept vector predictiveness and attention patterns.
Language models employ a highly localized shared mechanism for filler-gap dependencies but no unified mechanism for NPI licensing, and activation patching generalizes better than supervised alignment search.
At sufficient scale, LLMs linearly represent the truth value of factual statements, as shown by visualizations, cross-dataset generalization, and causal interventions that flip truth judgments.
citing papers explorer
No citing papers match the current filters.