WASD: Locating Critical Neurons as Sufficient Conditions for Explaining and Controlling LLM Behavior

Haonan Yu; Haoran Lin; Junhao Liu; Xin Zhang; Zhenyu Yan

arxiv: 2603.18474 · v2 · submitted 2026-03-19 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-15 09:09 UTC · model grok-4.3

keywords: sufficient conditions · neuron activation predicates · LLM interpretability · behavioral control · attribution methods · token generation · cross-lingual control

The pith

WASD identifies minimal neuron-activation predicates that suffice to produce and steer specific LLM outputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents WASD as a method for explaining LLM token generation. It treats candidate conditions as neuron-activation predicates and searches iteratively for the smallest set that still produces the observed output when inputs are perturbed. Experiments on SST-2 and CounterFact with Gemma-2-2B show that the resulting explanations are more stable, accurate, and concise than those from standard attribution graphs. The same sufficient predicates can then be used to control behavior, as demonstrated in a cross-lingual generation case study. The core idea is that locating these minimal sufficient neural conditions bridges explanation and direct intervention without retraining.

Core claim

WASD represents candidate conditions as neuron-activation predicates and iteratively searches for a minimal set that guarantees the current output under input perturbations, thereby explaining model behavior through sufficient neural conditions and enabling control of outputs such as cross-lingual generation.

What carries the argument

Neuron-activation predicates, which encode whether a specific neuron's activation exceeds a threshold, together with the iterative minimal-set search that isolates sufficient conditions for token generation.
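
The predicate-plus-search idea can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: `toy_model` stands in for the LLM, the names `Predicate`, `enforce`, `precision`, and `minimal_sufficient` are hypothetical, and the greedy backward elimination is a swapped-in technique, not necessarily the paper's search procedure, which this page does not spell out.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Predicate:
    neuron: int       # index of the neuron (hypothetical toy indexing)
    threshold: float  # predicate holds when activation > threshold


def toy_model(acts):
    # Stand-in for an LLM's next-token choice (assumption, not the paper's model):
    # emits "A" only when neurons 0 and 2 are both above 0.5.
    return "A" if acts[0] > 0.5 and acts[2] > 0.5 else "B"


def enforce(acts, predicates, margin=0.1):
    # Intervention: clamp each predicate's neuron just above its threshold.
    out = list(acts)
    for p in predicates:
        out[p.neuron] = max(out[p.neuron], p.threshold + margin)
    return out


def precision(predicates, target, perturbed):
    # Fraction of perturbed inputs whose output stays at `target`
    # once the predicates are enforced.
    hits = sum(toy_model(enforce(a, predicates)) == target for a in perturbed)
    return hits / len(perturbed)


def minimal_sufficient(predicates, target, perturbed, min_precision=1.0):
    # Greedy backward elimination: drop a predicate whenever the
    # remaining set still keeps the output on every perturbation.
    kept = list(predicates)
    for p in list(predicates):
        trial = [q for q in kept if q != p]
        if precision(trial, target, perturbed) >= min_precision:
            kept = trial
    return kept


rng = random.Random(0)
original = [0.9, 0.8, 0.7, 0.6]
target = toy_model(original)
# Candidate predicates: every neuron that is active on the original input.
candidates = [Predicate(i, 0.5) for i in range(4) if original[i] > 0.5]
# Perturbations here are random activation vectors (a simplification).
perturbed = [[rng.random() for _ in range(4)] for _ in range(200)]
result = minimal_sufficient(candidates, target, perturbed)
print(sorted(p.neuron for p in result))  # prints [0, 2]
```

In this toy setting four neurons are active on the original input, but only two survive the search, which mirrors the page's point that a sufficient set can be much smaller than the full set of active or attributed neurons.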

If this is right

Explanations on SST-2 and CounterFact tasks are more stable, accurate, and concise than conventional attribution graphs.
The identified predicates enable direct control of model behavior, including cross-lingual output generation.
Sufficient conditions can be located without additional training or loss of semantic coherence.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same predicate-search procedure could be applied to other model sizes or architectures to test whether minimal sufficient sets remain small.
If sufficient predicates generalize across tasks, they might support targeted internal edits for safety or domain adaptation.
The approach implies that full attribution graphs may contain many redundant neurons that are not required for the observed behavior.

Load-bearing premise

Neuron-activation predicates can be iteratively searched to identify a minimal set that is sufficient to guarantee the observed output under input perturbations.

What would settle it

A test in which no small collection of neuron predicates reproduces the original output across multiple input perturbations, or in which activating the discovered predicates fails to steer generation on new inputs.

Original abstract

Precise behavioral control of large language models (LLMs) is critical for complex applications. However, existing methods often incur high training costs, lack natural language controllability, or compromise semantic coherence. To bridge this gap, we propose WASD (unWeaving Actionable Sufficient Directives), a novel framework that explains model behavior by identifying sufficient neural conditions for token generation. Our method represents candidate conditions as neuron-activation predicates and iteratively searches for a minimal set that guarantees the current output under input perturbations. Experiments on SST-2 and CounterFact with the Gemma-2-2B model demonstrate that our approach produces explanations that are more stable, accurate, and concise than conventional attribution graphs. Moreover, through a case study on controlling cross-lingual output generation, we validated the practical effectiveness of WASD in controlling model behavior.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes WASD (unWeaving Actionable Sufficient Directives), a framework that explains LLM behavior by representing candidate conditions as neuron-activation predicates and iteratively searching for a minimal set that guarantees the observed token output under input perturbations. Experiments on SST-2 and CounterFact with Gemma-2-2B claim more stable, accurate, and concise explanations than attribution graphs, with an additional case study showing effective control of cross-lingual output generation.

Significance. If the sufficiency claims hold under broader validation, the work would advance mechanistic interpretability by shifting from correlational attributions to causally sufficient neural conditions, enabling low-cost, natural-language-controllable steering of LLM outputs. The concrete experiments on a 2B model and the cross-lingual control demonstration add practical value, though the absence of exhaustive perturbation testing limits immediate impact.

major comments (3)

[Abstract and §3 (Method)] The iterative search is presented as identifying a minimal set of predicates that 'guarantees the current output under input perturbations,' yet no description is given of how the perturbation distribution is generated or whether sufficiency is verified outside the search loop itself; this directly undermines the stability/accuracy superiority claims versus attribution graphs.
[§4.2 (Experiments on SST-2 and CounterFact)] The reported gains in conciseness and accuracy are shown only for perturbations encountered during search; without an out-of-distribution test set or exhaustive enumeration of input variations, it remains possible that the predicates are merely correlated rather than causally sufficient, weakening the central explanatory claim.
[Case study section] The cross-lingual control results demonstrate practical effectiveness on the tested examples, but the manuscript provides no quantitative failure rate or robustness metric under input perturbations outside the search distribution, leaving the sufficiency guarantee unverified for the control application.

minor comments (2)

[§2.1] The formal definition of neuron-activation predicates would benefit from an explicit mathematical notation (e.g., predicate P_i(a) = (activation_i > threshold)) before the search algorithm is introduced.
[Table 1] The comparison table with attribution graphs lacks error bars or statistical significance tests for the stability and accuracy metrics.
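
The notation requested in the first minor comment could be written out along the following lines. This is a sketch of one possible formalization, not the paper's own definitions: the symbols a_i, \tau_i, f, \mathcal{N}(x), and \mathrm{prec} are assumptions chosen for illustration, and the do(·) notation simply echoes the paper's phrase "intervention do(r)".

```latex
% Assumed notation: a_i(x) is the activation of neuron i on input x,
% and \tau_i is the threshold attached to that neuron.
P_i(x) \;=\; \mathbb{1}\!\left[\, a_i(x) > \tau_i \,\right]

% A predicate set S is sufficient for the output f(x) on a neighborhood
% \mathcal{N}(x) of perturbed inputs if enforcing S preserves the output
% for every perturbed input x'.
\forall x' \in \mathcal{N}(x):\quad f\bigl(x' \mid \mathrm{do}(S)\bigr) = f(x)

% Empirical relaxation: the precision of S over sampled perturbations.
\mathrm{prec}(S) \;=\; \Pr_{x' \sim \mathcal{N}(x)}
  \bigl[\, f\bigl(x' \mid \mathrm{do}(S)\bigr) = f(x) \,\bigr]
```

Under this reading, the strict guarantee corresponds to \mathrm{prec}(S) = 1 on the neighborhood, while any evaluation on finitely many sampled perturbations only estimates it, which is the gap the major comments point at.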

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments correctly identify areas where the description of perturbation generation and out-of-distribution validation can be strengthened. We address each major comment below and will revise the manuscript accordingly to improve clarity and provide additional empirical support for the sufficiency claims.

Referee: [Abstract and §3] The iterative search is presented as identifying a minimal set of predicates that 'guarantees the current output under input perturbations,' yet no description is given of how the perturbation distribution is generated or whether sufficiency is verified outside the search loop itself; this directly undermines the stability/accuracy superiority claims versus attribution graphs.

Authors: We agree that the perturbation generation procedure and out-of-loop verification require more explicit description. In §3.2 the manuscript states that perturbations are generated via bounded random token substitutions, but the exact sampling (uniform over single-token replacements drawn from a fixed 5k-token subset of the vocabulary) and the separate verification step on a held-out perturbation batch were not stated with sufficient precision. In the revision we will expand §3.2 with a formal definition of the perturbation distribution and add an explicit post-search verification procedure that evaluates the discovered predicates on 200 fresh perturbations never seen during search. This will directly support the stability and accuracy comparisons.
Revision: yes
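
The sampling scheme described in this response (single-token replacements from a fixed 5k-token subset, plus 200 fresh perturbations for verification) might look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' code: the function name `sample_perturbations`, the integer token IDs, the contiguous `vocab_subset`, and the search-batch size of 500 are all hypothetical.

```python
import random


def sample_perturbations(tokens, vocab_subset, n, rng, exclude=frozenset()):
    """Draw n distinct single-token substitutions of `tokens`.

    Position and replacement token are each chosen uniformly; any
    candidate already in `exclude` (e.g. the search batch) is skipped.
    """
    seen = set(exclude)
    out = []
    while len(out) < n:
        pos = rng.randrange(len(tokens))
        new_tok = rng.choice(vocab_subset)
        if new_tok == tokens[pos]:
            continue  # identical token: not an actual perturbation
        cand = tuple(tokens[:pos]) + (new_tok,) + tuple(tokens[pos + 1:])
        if cand in seen:
            continue  # already drawn, or reserved for the other batch
        seen.add(cand)
        out.append(cand)
    return out


rng = random.Random(0)
vocab_subset = list(range(5000))         # stand-in for the fixed 5k-token subset
tokens = [17, 4242, 901, 33, 2048, 7]    # hypothetical token IDs of one input

# Perturbations used inside the search loop (batch size is an assumption).
search_batch = sample_perturbations(tokens, vocab_subset, 500, rng)
# 200 fresh perturbations, guaranteed disjoint from the search batch.
held_out = sample_perturbations(tokens, vocab_subset, 200, rng,
                                exclude=frozenset(search_batch))
print(len(search_batch), len(held_out))  # prints 500 200
```

Passing the search batch as `exclude` is what makes the verification batch "never seen during search"; note that it is still drawn from the same distribution, so it does not address the out-of-distribution concern raised in the next comment.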
Referee: [§4.2] The reported gains in conciseness and accuracy are shown only for perturbations encountered during search; without an out-of-distribution test set or exhaustive enumeration of input variations, it remains possible that the predicates are merely correlated rather than causally sufficient, weakening the central explanatory claim.

Authors: The referee is correct that the current §4.2 results are computed on perturbations drawn from the same distribution used inside the search loop. To address this limitation we will add a new paragraph and table in §4.2 that reports accuracy, stability, and conciseness on an explicitly constructed out-of-distribution test set (edit distance 3–4 and substitution types excluded from search). We will also report the fraction of cases in which the predicate set continues to guarantee the target token under these OOD perturbations. These additions will provide stronger evidence that the identified conditions are causally sufficient rather than merely correlational.
Revision: yes
Referee: [Case study section] The cross-lingual control results demonstrate practical effectiveness on the tested examples, but the manuscript provides no quantitative failure rate or robustness metric under input perturbations outside the search distribution, leaving the sufficiency guarantee unverified for the control application.

Authors: We acknowledge that the case study currently reports only success on the examples used during predicate discovery. In the revised manuscript we will augment the case-study section with a quantitative robustness evaluation: for each of the 50 cross-lingual control instances we will measure the failure rate (i.e., loss of the target language or token) under 100 OOD perturbations per instance that lie outside the search distribution. These failure rates, together with average control success under perturbation, will be reported in a new table, thereby verifying the sufficiency guarantee for the control application.
Revision: yes
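
The two quantities promised here, per-instance failure rate and average control success, reduce to simple bookkeeping. The sketch below assumes each instance's outcomes are recorded as a list of booleans (True when the target language or token survived a perturbation); the function names and the two-instance toy data are made up for illustration and are not results from the paper.

```python
def failure_rates(outcomes):
    """Map each instance ID to its failure rate.

    `outcomes` maps an instance ID to a list of booleans, one per
    perturbation: True if control held, False if it was lost.
    """
    return {k: 1 - sum(v) / len(v) for k, v in outcomes.items()}


def mean_success(outcomes):
    """Average control success across instances (1 - mean failure rate)."""
    rates = failure_rates(outcomes)
    return 1 - sum(rates.values()) / len(rates)


# Made-up toy data: two instances with four perturbations each.
toy = {
    "inst-0": [True, True, True, False],
    "inst-1": [True, True, True, True],
}
rates = failure_rates(toy)
print(rates)              # prints {'inst-0': 0.25, 'inst-1': 0.0}
print(mean_success(toy))  # prints 0.875
```

With the planned setup, `outcomes` would hold 50 instances with 100 entries each; reporting the per-instance rates alongside the mean keeps a few badly failing instances from being hidden by the average.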

Circularity Check

0 steps flagged

No circularity: independent iterative search over predicates

The paper defines WASD as an explicit iterative search algorithm that enumerates neuron-activation predicates and selects a minimal set guaranteeing output under the perturbations encountered during search. This procedure is self-contained and does not reduce any claimed prediction or sufficiency result to a fitted parameter, self-citation, or definitional tautology. Experimental comparisons to attribution graphs and the cross-lingual control case study are presented as external validation rather than derivations that collapse back to the search inputs by construction. No equations or load-bearing steps in the provided description exhibit the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract; the method is presented as a search procedure without explicit additional assumptions beyond the existence of sufficient neuron sets.

pith-pipeline@v0.9.0 · 5447 in / 1069 out tokens · 37235 ms · 2026-05-15T09:09:59.165134+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "Our method represents candidate conditions as neuron-activation predicates and iteratively searches for a minimal set that guarantees the current output under input perturbations."

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear
Paper passage: "Precision of a rule r ... is the probability of output invariance under the intervention do(r) across the local neighborhood"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.