Direct and Indirect Effects
2 Pith papers cite this work. Polarity classification is still indexing.
abstract
The direct effect of one event on another can be defined and measured by holding constant all intermediate variables between the two. Indirect effects present conceptual and practical difficulties (in nonlinear models), because they cannot be isolated by holding certain variables constant. This paper shows a way of defining any path-specific effect that does not invoke blocking the remaining paths. This permits the assessment of a more natural type of direct and indirect effects, one that is applicable in both linear and nonlinear models. The paper establishes conditions under which such assessments can be estimated consistently from experimental and nonexperimental data, and thus extends path-analytic techniques to nonlinear and nonparametric models.
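The distinction the abstract draws can be illustrated with a small worked example. The sketch below is not taken from the paper: the structural equations `f_m` and `f_y`, the binary noise term, and all function names are hypothetical choices made only so that every quantity can be computed exactly by enumeration. It contrasts the controlled direct effect (the mediator is held at a fixed value) with the natural direct and indirect effects (the mediator is set to the value it would have taken under a reference treatment), in a model whose treatment-mediator interaction makes it nonlinear.

```python
# Toy structural model (hypothetical, for illustration only):
#   X -> M -> Y and X -> Y, with an X*M interaction in Y.
U_VALUES = (0, 1)  # assumed noise values for the mediator, equally likely


def f_m(x, u):
    """Mediator equation: M = X + U."""
    return x + u


def f_y(x, m):
    """Outcome equation with an interaction term: Y = X + M + X*M."""
    return x + m + x * m


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def expected_y(x_direct, x_mediator):
    """E[Y(x_direct, M(x_mediator))]: X is set to x_direct on the direct
    path, while the mediator behaves as it would under X = x_mediator."""
    return mean(f_y(x_direct, f_m(x_mediator, u)) for u in U_VALUES)


def controlled_direct_effect(m):
    """Effect of X: 0 -> 1 on Y with the mediator held fixed at m."""
    return f_y(1, m) - f_y(0, m)


# Effects of changing X from the reference value 0 to 1.
total_effect = expected_y(1, 1) - expected_y(0, 0)
natural_direct = expected_y(1, 0) - expected_y(0, 0)
natural_indirect = expected_y(0, 1) - expected_y(0, 0)
# Natural indirect effect of the reverse transition (X: 1 -> 0).
reverse_indirect = expected_y(1, 0) - expected_y(1, 1)

print("CDE(m=0):", controlled_direct_effect(0))
print("CDE(m=1):", controlled_direct_effect(1))
print("total:", total_effect)
print("natural direct:", natural_direct)
print("natural indirect:", natural_indirect)
print("direct + indirect:", natural_direct + natural_indirect)
print("direct - reverse indirect:", natural_direct - reverse_indirect)
```

In this toy model the controlled direct effect depends on the value at which the mediator is held, and the natural direct and indirect effects do not add up to the total effect because of the interaction term. The total effect is recovered instead as the natural direct effect minus the natural indirect effect of the reverse transition, which is the kind of decomposition that remains valid outside linear models.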
fields
cs.LG 2
verdicts
UNVERDICTED 2
representative citing papers
Steering vectors for refusal primarily modify the OV circuit in attention, ignore most of the QK circuit, and can be sparsified to 1-10% of dimensions while retaining performance.
citing papers explorer
- Localizing Model Behavior with Path Patching
  Path patching provides a method to express and quantitatively test hypotheses that neural network behaviors are localized to sets of paths.
- What Drives Representation Steering? A Mechanistic Case Study on Steering Refusal
  Steering vectors for refusal primarily modify the OV circuit in attention, ignore most of the QK circuit, and can be sparsified to 1-10% of dimensions while retaining performance.