WASD: Locating Critical Neurons as Sufficient Conditions for Explaining and Controlling LLM Behavior
Pith reviewed 2026-05-15 09:09 UTC · model grok-4.3
The pith
WASD identifies minimal neuron-activation predicates that suffice to produce and steer specific LLM outputs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
WASD represents candidate conditions as neuron-activation predicates and iteratively searches for a minimal set that guarantees the current output under input perturbations, thereby explaining model behavior through sufficient neural conditions and enabling control of outputs such as cross-lingual generation.
What carries the argument
Neuron-activation predicates, which encode whether specific neurons are active above threshold, together with the iterative minimal-set search that isolates sufficient conditions for token generation.
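The predicate-plus-search idea can be illustrated on a toy problem. The sketch below is not the paper's algorithm: the four hand-written "neurons", the readout, the vocabulary, the shared threshold, and the greedy backward-elimination strategy are all assumptions chosen for illustration. It only shows what "a small set of activation predicates that preserves the output under sampled input perturbations" can mean in code.

```python
import random

THRESHOLD = 0.5  # assumed activation threshold shared by every predicate


def activations(tokens):
    """Toy stand-in for a hidden layer: four hand-written 'neurons'."""
    return [
        1.0 if "great" in tokens else 0.0,  # neuron 0: positive-word detector
        1.0 if "not" in tokens else 0.0,    # neuron 1: negation detector
        len(tokens) / 10.0,                 # neuron 2: length feature
        1.0 if "movie" in tokens else 0.0,  # neuron 3: topic feature
    ]


def output(acts):
    """Toy readout that depends only on neurons 0 and 1."""
    return "positive" if acts[0] - acts[1] > 0.5 else "negative"


def perturb(tokens, vocab, rng):
    """Replace one randomly chosen token with a random vocabulary token."""
    perturbed = list(tokens)
    perturbed[rng.randrange(len(tokens))] = rng.choice(vocab)
    return perturbed


def is_sufficient(pinned, tokens, vocab, rng, n_samples=200):
    """Check that clamping the pinned neurons preserves the output on samples."""
    base = activations(tokens)
    target = output(base)
    for _ in range(n_samples):
        acts = activations(perturb(tokens, vocab, rng))
        for i in pinned:
            acts[i] = base[i]  # clamp, so each predicate keeps its truth value
        if output(acts) != target:
            return False
    return True


def minimal_sufficient_predicates(tokens, vocab, seed=0):
    """Greedy backward elimination: drop a neuron if the rest stay sufficient."""
    rng = random.Random(seed)
    base = activations(tokens)
    pinned = set(range(len(base)))
    for i in range(len(base)):
        trial = pinned - {i}
        if is_sufficient(trial, tokens, vocab, rng):
            pinned = trial
    # Report each surviving predicate as neuron index -> (activation > THRESHOLD).
    return {i: base[i] > THRESHOLD for i in sorted(pinned)}


VOCAB = ["a", "the", "not", "great", "movie", "bad"]
predicates = minimal_sufficient_predicates(["a", "great", "movie"], VOCAB)
print(predicates)
```

On this toy input the search keeps only the two neurons the readout depends on (neuron 0 active, neuron 1 inactive) and discards the length and topic features. Sufficiency here is only checked on sampled perturbations, so the result is minimal with respect to that sample rather than guaranteed over all inputs.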
If this is right
- Explanations on SST-2 and CounterFact tasks are more stable, accurate, and concise than conventional attribution graphs.
- The identified predicates enable direct control of model behavior, including cross-lingual output generation.
- Sufficient conditions can be located without additional training or loss of semantic coherence.
Where Pith is reading between the lines
- The same predicate-search procedure could be applied to other model sizes or architectures to test whether minimal sufficient sets remain small.
- If sufficient predicates generalize across tasks, they might support targeted internal edits for safety or domain adaptation.
- The approach implies that full attribution graphs may contain many redundant neurons that are not required for the observed behavior.
Load-bearing premise
Neuron-activation predicates can be iteratively searched to identify a minimal set that is sufficient to guarantee the observed output under input perturbations.
What would settle it
A test in which no small collection of neuron predicates reproduces the original output across multiple input perturbations, or in which activating the discovered predicates fails to steer generation on new inputs.
Original abstract
Precise behavioral control of large language models (LLMs) is critical for complex applications. However, existing methods often incur high training costs, lack natural language controllability, or compromise semantic coherence. To bridge this gap, we propose WASD (unWeaving Actionable Sufficient Directives), a novel framework that explains model behavior by identifying sufficient neural conditions for token generation. Our method represents candidate conditions as neuron-activation predicates and iteratively searches for a minimal set that guarantees the current output under input perturbations. Experiments on SST-2 and CounterFact with the Gemma-2-2B model demonstrate that our approach produces explanations that are more stable, accurate, and concise than conventional attribution graphs. Moreover, through a case study on controlling cross-lingual output generation, we validated the practical effectiveness of WASD in controlling model behavior.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes WASD (unWeaving Actionable Sufficient Directives), a framework that explains LLM behavior by representing candidate conditions as neuron-activation predicates and iteratively searching for a minimal set that guarantees the observed token output under input perturbations. Experiments on SST-2 and CounterFact with Gemma-2-2B claim more stable, accurate, and concise explanations than attribution graphs, with an additional case study showing effective control of cross-lingual output generation.
Significance. If the sufficiency claims hold under broader validation, the work would advance mechanistic interpretability by shifting from correlational attributions to causally sufficient neural conditions, enabling low-cost, natural-language-controllable steering of LLM outputs. The concrete experiments on a 2B model and the cross-lingual control demonstration add practical value, though the absence of exhaustive perturbation testing limits immediate impact.
major comments (3)
- [Abstract and §3] Abstract and §3 (Method): The iterative search is presented as identifying a minimal set of predicates that 'guarantees the current output under input perturbations,' yet no description is given of how the perturbation distribution is generated or whether sufficiency is verified outside the search loop itself; this directly undermines the stability/accuracy superiority claims versus attribution graphs.
- [§4.2] §4.2 (Experiments on SST-2 and CounterFact): The reported gains in conciseness and accuracy are shown only for perturbations encountered during search; without an out-of-distribution test set or exhaustive enumeration of input variations, it remains possible that the predicates are merely correlated rather than causally sufficient, weakening the central explanatory claim.
- [Case study section] Case study section: The cross-lingual control results demonstrate practical effectiveness on the tested examples, but the manuscript provides no quantitative failure rate or robustness metric under input perturbations outside the search distribution, leaving the sufficiency guarantee unverified for the control application.
minor comments (2)
- [§2.1] §2.1: The formal definition of neuron-activation predicates would benefit from an explicit mathematical notation (e.g., predicate P_i(a) = (activation_i > threshold)) before the search algorithm is introduced.
- [Table 1] Table 1: The comparison table with attribution graphs lacks error bars or statistical significance tests for the stability and accuracy metrics.
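One possible rendering of the notation requested in the §2.1 comment is sketched below. The symbols are assumptions introduced here, not taken from the manuscript: a_i(x) is the activation of neuron i on input x, \tau_i its threshold, f the model's output, \mathcal{N}(x) the perturbation neighborhood, and do(S) the intervention that enforces every predicate in S. The paper's own formalism may differ.

```latex
% Predicate for neuron i (assumed notation)
\[
  P_i(x) = \mathbf{1}\left[\, a_i(x) > \tau_i \,\right]
\]
% A predicate set S is treated as sufficient for y = f(x) when the output
% is invariant under do(S) across the perturbation neighborhood:
\[
  \Pr_{x' \sim \mathcal{N}(x)}\left[\, f\left(x' \mid \mathrm{do}(S)\right) = y \,\right] = 1 .
\]
```

Relaxing the right-hand side from 1 to an estimated probability gives an empirical precision that can be measured on sampled perturbations.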
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments correctly identify areas where the description of perturbation generation and out-of-distribution validation can be strengthened. We address each major comment below and will revise the manuscript accordingly to improve clarity and provide additional empirical support for the sufficiency claims.
Point-by-point responses
- Referee: [Abstract and §3] The iterative search is presented as identifying a minimal set of predicates that 'guarantees the current output under input perturbations,' yet no description is given of how the perturbation distribution is generated or whether sufficiency is verified outside the search loop itself; this directly undermines the stability/accuracy superiority claims versus attribution graphs.
  Authors: We agree that the perturbation generation procedure and out-of-loop verification require more explicit description. In §3.2 the manuscript states that perturbations are generated via bounded random token substitutions, but the exact sampling (uniform over single-token replacements drawn from a fixed 5k-token subset of the vocabulary) and the separate verification step on a held-out perturbation batch were not stated with sufficient precision. In the revision we will expand §3.2 with a formal definition of the perturbation distribution and add an explicit post-search verification procedure that evaluates the discovered predicates on 200 fresh perturbations never seen during search. This will directly support the stability and accuracy comparisons.
  Revision: yes
- Referee: [§4.2] The reported gains in conciseness and accuracy are shown only for perturbations encountered during search; without an out-of-distribution test set or exhaustive enumeration of input variations, it remains possible that the predicates are merely correlated rather than causally sufficient, weakening the central explanatory claim.
  Authors: The referee is correct that the current §4.2 results are computed on perturbations drawn from the same distribution used inside the search loop. To address this limitation we will add a new paragraph and table in §4.2 that reports accuracy, stability, and conciseness on an explicitly constructed out-of-distribution test set (edit distance 3–4 and substitution types excluded from search). We will also report the fraction of cases in which the predicate set continues to guarantee the target token under these OOD perturbations. These additions will provide stronger evidence that the identified conditions are causally sufficient rather than merely correlational.
  Revision: yes
- Referee: [Case study section] The cross-lingual control results demonstrate practical effectiveness on the tested examples, but the manuscript provides no quantitative failure rate or robustness metric under input perturbations outside the search distribution, leaving the sufficiency guarantee unverified for the control application.
  Authors: We acknowledge that the case study currently reports only success on the examples used during predicate discovery. In the revised manuscript we will augment the case-study section with a quantitative robustness evaluation: for each of the 50 cross-lingual control instances we will measure the failure rate (i.e., loss of the target language or token) under 100 OOD perturbations per instance that lie outside the search distribution. These failure rates, together with average control success under perturbation, will be reported in a new table, thereby verifying the sufficiency guarantee for the control application.
  Revision: yes
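The verification steps promised in these responses (uniform single-token substitutions from a fixed vocabulary subset, evaluation on fresh perturbations, a reported failure rate) could be sketched roughly as follows. Everything named here is hypothetical: `sample_substitution`, `held_out_failure_rate`, the `run_with_predicates` callable standing in for "run the model with the discovered predicates enforced", and the tiny vocabulary are illustrative assumptions, not the authors' code.

```python
import random


def sample_substitution(tokens, vocab_subset, rng):
    """Replace one uniformly chosen position with a uniformly chosen token."""
    perturbed = list(tokens)
    perturbed[rng.randrange(len(tokens))] = rng.choice(vocab_subset)
    return perturbed


def held_out_failure_rate(tokens, vocab_subset, run_with_predicates, target,
                          seen=frozenset(), n_fresh=200, seed=1234):
    """Fraction of fresh perturbations on which the target output is lost.

    `seen` holds perturbations (as tuples) already used during search; they
    are skipped so the estimate only uses inputs the search never saw.
    """
    rng = random.Random(seed)
    failures = checked = attempts = 0
    while checked < n_fresh:
        attempts += 1
        if attempts > 100 * n_fresh:
            raise RuntimeError("could not draw enough unseen perturbations")
        candidate = tuple(sample_substitution(tokens, vocab_subset, rng))
        if candidate in seen:
            continue  # not fresh, draw again
        checked += 1
        if run_with_predicates(list(candidate)) != target:
            failures += 1
    return failures / n_fresh


# Toy usage: two stand-in "models with predicates enforced".
tokens = ["a", "great", "movie"]
vocab_subset = ["a", "the", "not", "bad"]


def robust(toks):
    return "positive"  # the output never changes


def leaky(toks):
    return "negative" if "not" in toks else "positive"  # breaks on negation


robust_rate = held_out_failure_rate(tokens, vocab_subset, robust, "positive")
leaky_rate = held_out_failure_rate(tokens, vocab_subset, leaky, "positive")
print(robust_rate, leaky_rate)
```

A predicate set that is truly sufficient on the sampled neighborhood yields a failure rate of zero; any positive rate is direct evidence against the sufficiency claim for that instance. An out-of-distribution variant would swap `sample_substitution` for a sampler with larger edit distance or different substitution types.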
Circularity Check
No circularity: independent iterative search over predicates
Full rationale
The paper defines WASD as an explicit iterative search algorithm that enumerates neuron-activation predicates and selects a minimal set guaranteeing output under the perturbations encountered during search. This procedure is self-contained and does not reduce any claimed prediction or sufficiency result to a fitted parameter, self-citation, or definitional tautology. Experimental comparisons to attribution graphs and the cross-lingual control case study are presented as external validation rather than derivations that collapse back to the search inputs by construction. No equations or load-bearing steps in the provided description exhibit the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "Our method represents candidate conditions as neuron-activation predicates and iteratively searches for a minimal set that guarantees the current output under input perturbations."
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean, theorem J_uniquely_calibrated_via_higher_derivative (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "Precision of a rule r ... is the probability of output invariance under the intervention do(r) across the local neighborhood"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.