Neural Networks as Explicit Word-Based Rules

Jind\v{r}ich Libovick\'y

arxiv: 1907.04613 · v1 · pith:2E5NHKO3new · submitted 2019-07-10 · 💻 cs.CL · cs.LG

Neural Networks as Explicit Word-Based Rules

Jind\v{r}ich Libovick\'y This is my paper

Pith reviewed 2026-05-25 00:01 UTC · model grok-4.3

classification 💻 cs.CL cs.LG

keywords convolutional networksinterpretabilitysentiment classificationword-based rulesfilter visualizationnatural language processingmodel explanation

0 comments

The pith

A convolutional network for sentiment classification can be rewritten as explicit word-based rules that recover its original performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper applies a visualization technique from computer vision to interpret the weight matrices of a convolutional neural network used for sentiment classification. By maximizing the response of each filter, the authors identify sets of words that activate the filters, turning them into explicit word-based rules. These rules are then used in place of the network to classify text. The extracted rules achieve the same level of performance as the original model. This demonstrates that the network's behavior can be fully captured by simple combinations of word indicators.

Core claim

The filters of a convolutional network for sentiment classification can be interpreted as word-based rules by maximizing their responses over input words. The resulting rules recover the performance of the original model on the classification task.

What carries the argument

Maximizing filter responses on word embeddings to extract sets of words that define each filter's activation rule.

Load-bearing premise

That the words maximizing each filter's response form a complete and accurate representation of what the filter computes in the full model.

What would settle it

Testing the extracted word rules on the original test set and finding that their accuracy is substantially lower than the neural network's accuracy.

read the original abstract

Filters of convolutional networks used in computer vision are often visualized as image patches that maximize the response of the filter. We use the same approach to interpret weight matrices in simple architectures for natural language processing tasks. We interpret a convolutional network for sentiment classification as word-based rules. Using the rule, we recover the performance of the original model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper extracts word-based rules from a simple CNN sentiment classifier by maximizing filter responses over words, and shows those rules recover the original accuracy.

read the letter

This paper extracts word-based rules from a simple CNN sentiment classifier by maximizing filter responses over words, and shows those rules recover the original accuracy. It does a clean job of borrowing the visualization technique from computer vision and applying it to NLP weight matrices. The fact that they can recover performance suggests the rules capture the essential behavior rather than just providing loose explanations. That's a solid check for this kind of work. The potential issue is whether the rules are truly explicit and composable without additional assumptions. In a CNN, the decision involves max-pooling across multiple filters and a final classification layer, so turning individual filter activations into rules requires a clear way to combine them. The abstract claims recovery, but without seeing the exact method for building and applying the rules, it's hard to tell if there's any circularity or if it works only because of specific choices in thresholding or aggregation. The scope is narrow too – simple architectures and one task – so it doesn't address more complex models. This paper is for people focused on interpretability of shallow neural nets in NLP. A reader looking for practical ways to extract rules from CNNs might find it useful, especially if they value the performance recovery metric. It deserves a serious referee because the idea is straightforward and the claim is testable, even if the contribution is incremental.

Referee Report

2 major / 1 minor

Summary. The manuscript adapts the max-response visualization technique from computer vision to interpret the weight matrices of a convolutional neural network for sentiment classification. Individual words that maximize filter responses are presented as explicit word-based rules; the authors claim these rules recover the performance of the original CNN on the test set.

Significance. If the extracted rules provably match the CNN's decisions (including max-pooling and the final classifier), the work would supply a concrete, falsifiable method for turning a neural NLP model into an interpretable rule set. The performance-recovery check is a strength that directly tests the fidelity of the interpretation.

major comments (2)

[Abstract] Abstract: the central claim that 'using the rule, we recover the performance of the original model' is load-bearing, yet the abstract (and the description provided) supplies no account of how single-filter argmax words are aggregated into rules that replicate the CNN's max-pooling across filters and final linear layer.
[Method] The transfer of the CV visualization technique assumes that words maximizing individual filter responses automatically compose into complete, composable rules; the manuscript does not demonstrate that this holds once n-gram interactions and the downstream classifier are taken into account.

minor comments (1)

Clarify the exact procedure (thresholding, combination function, handling of negative contributions) used to turn per-filter word lists into executable rules.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The comments highlight important points about the clarity of our claims and the description of how the extracted rules are formed and applied. We respond to each major comment below.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim that 'using the rule, we recover the performance of the original model' is load-bearing, yet the abstract (and the description provided) supplies no account of how single-filter argmax words are aggregated into rules that replicate the CNN's max-pooling across filters and final linear layer.

Authors: We agree that the abstract is concise and does not include an explicit account of the aggregation step. The manuscript forms rules by extracting the single word that maximizes the response of each filter and then applies these words by computing their activations under the same convolutional and pooling operations as the original network before feeding into the final linear layer. We will revise the abstract to add a brief clause describing this aggregation process so that the central claim is better supported within the word limit. revision: yes
Referee: [Method] The transfer of the CV visualization technique assumes that words maximizing individual filter responses automatically compose into complete, composable rules; the manuscript does not demonstrate that this holds once n-gram interactions and the downstream classifier are taken into account.

Authors: The manuscript demonstrates composition empirically: the extracted per-filter words are substituted back into the original CNN architecture (including max-pooling over the filter responses and the downstream linear classifier), and test-set performance is recovered. This empirical match serves as evidence that the rules capture the net effect of the model's operations. We acknowledge that an additional explicit walk-through of how n-gram filter responses and the classifier weights interact with the selected words would improve transparency. We will add a short clarifying paragraph in the method section. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical interpretation validated externally

full rationale

The paper applies max-response visualization (transferred from CV) to extract word rules from a CNN's filters for sentiment classification, then reports that the resulting rules recover the CNN's test performance. This is an empirical demonstration of interpretability rather than a mathematical derivation or prediction step. No equations reduce a claimed result to its own fitted inputs by construction, no self-citation chain bears the central claim, and the performance match is measured against held-out data rather than being tautological. The approach is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No information on free parameters, axioms, or invented entities available from the abstract alone.

pith-pipeline@v0.9.0 · 5564 in / 877 out tokens · 21912 ms · 2026-05-25T00:01:36.803192+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We interpret a convolutional network for sentiment classification as word-based rules. Using the rule, we recover the performance of the original model.
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Filters of convolutional networks... visualized as image patches that maximize the response... retrieve the words whose embeddings have the highest dot-product

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.