Neural Networks as Explicit Word-Based Rules
Pith reviewed 2026-05-25 00:01 UTC · model grok-4.3
The pith
A convolutional network for sentiment classification can be rewritten as explicit word-based rules that recover its original performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The filters of a convolutional network for sentiment classification can be interpreted as word-based rules by maximizing their responses over input words. The resulting rules recover the performance of the original model on the classification task.
What carries the argument
Maximizing filter responses on word embeddings to extract sets of words that define each filter's activation rule.
Load-bearing premise
That the words maximizing each filter's response form a complete and accurate representation of what the filter computes in the full model.
What would settle it
Testing the extracted word rules on the original test set and finding that their accuracy is substantially lower than the neural network's accuracy.
read the original abstract
Filters of convolutional networks used in computer vision are often visualized as image patches that maximize the response of the filter. We use the same approach to interpret weight matrices in simple architectures for natural language processing tasks. We interpret a convolutional network for sentiment classification as word-based rules. Using the rule, we recover the performance of the original model.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript adapts the max-response visualization technique from computer vision to interpret the weight matrices of a convolutional neural network for sentiment classification. Individual words that maximize filter responses are presented as explicit word-based rules; the authors claim these rules recover the performance of the original CNN on the test set.
Significance. If the extracted rules provably match the CNN's decisions (including max-pooling and the final classifier), the work would supply a concrete, falsifiable method for turning a neural NLP model into an interpretable rule set. The performance-recovery check is a strength that directly tests the fidelity of the interpretation.
major comments (2)
- [Abstract] Abstract: the central claim that 'using the rule, we recover the performance of the original model' is load-bearing, yet the abstract (and the description provided) supplies no account of how single-filter argmax words are aggregated into rules that replicate the CNN's max-pooling across filters and final linear layer.
- [Method] The transfer of the CV visualization technique assumes that words maximizing individual filter responses automatically compose into complete, composable rules; the manuscript does not demonstrate that this holds once n-gram interactions and the downstream classifier are taken into account.
minor comments (1)
- Clarify the exact procedure (thresholding, combination function, handling of negative contributions) used to turn per-filter word lists into executable rules.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. The comments highlight important points about the clarity of our claims and the description of how the extracted rules are formed and applied. We respond to each major comment below.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim that 'using the rule, we recover the performance of the original model' is load-bearing, yet the abstract (and the description provided) supplies no account of how single-filter argmax words are aggregated into rules that replicate the CNN's max-pooling across filters and final linear layer.
Authors: We agree that the abstract is concise and does not include an explicit account of the aggregation step. The manuscript forms rules by extracting the single word that maximizes the response of each filter and then applies these words by computing their activations under the same convolutional and pooling operations as the original network before feeding into the final linear layer. We will revise the abstract to add a brief clause describing this aggregation process so that the central claim is better supported within the word limit. revision: yes
-
Referee: [Method] The transfer of the CV visualization technique assumes that words maximizing individual filter responses automatically compose into complete, composable rules; the manuscript does not demonstrate that this holds once n-gram interactions and the downstream classifier are taken into account.
Authors: The manuscript demonstrates composition empirically: the extracted per-filter words are substituted back into the original CNN architecture (including max-pooling over the filter responses and the downstream linear classifier), and test-set performance is recovered. This empirical match serves as evidence that the rules capture the net effect of the model's operations. We acknowledge that an additional explicit walk-through of how n-gram filter responses and the classifier weights interact with the selected words would improve transparency. We will add a short clarifying paragraph in the method section. revision: yes
Circularity Check
No significant circularity; empirical interpretation validated externally
full rationale
The paper applies max-response visualization (transferred from CV) to extract word rules from a CNN's filters for sentiment classification, then reports that the resulting rules recover the CNN's test performance. This is an empirical demonstration of interpretability rather than a mathematical derivation or prediction step. No equations reduce a claimed result to its own fitted inputs by construction, no self-citation chain bears the central claim, and the performance match is measured against held-out data rather than being tautological. The approach is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We interpret a convolutional network for sentiment classification as word-based rules. Using the rule, we recover the performance of the original model.
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Filters of convolutional networks... visualized as image patches that maximize the response... retrieve the words whose embeddings have the highest dot-product
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.