From Decision Trees to Boolean Logic: A Fast and Unified SHAP Algorithm
Pith reviewed 2026-05-17 23:18 UTC · model grok-4.3
The pith
WOODELF computes Background SHAP for tree ensembles in linear time by building one pseudo-Boolean formula per sample that encodes the trees, features, and full background dataset.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
WOODELF constructs, for each consumer, a pseudo-Boolean formula that captures their feature values, the structure of the decision tree ensemble, and the entire background dataset, then leverages this representation to compute Background SHAP in linear time while also supporting Path-Dependent SHAP, Shapley interaction values, Banzhaf values, and Banzhaf interaction values.
What carries the argument
The pseudo-Boolean formula that encodes a sample's features together with the full decision-tree ensemble and background dataset into a single logical expression for direct evaluation of game-theoretic values.
If this is right
- Background SHAP becomes practical on datasets with millions of background samples where previous exact methods timed out.
- A single code path can now produce both Background and Path-Dependent SHAP as well as interaction and Banzhaf variants.
- Standard Python array libraries suffice for both CPU and GPU execution, removing the need for custom low-level kernels.
- Reported wall-clock times drop from tens of minutes to under 20 seconds on GPU for three-million-row problems with five-million background samples.
Where Pith is reading between the lines
- The Boolean reduction may let similar linear-time techniques be applied to explanation methods for models other than trees if those models admit compact logical encodings.
- Practitioners could shift from approximate to exact Background SHAP more often once the computational barrier is lowered.
- The formula representation might support additional sensitivity queries beyond SHAP at marginal extra cost.
Load-bearing premise
The pseudo-Boolean formula can be built and evaluated without exponential blow-up for realistic tree depths and background sizes while exactly preserving the original game-theoretic SHAP definition.
What would settle it
Running WOODELF and an exhaustive coalition-enumeration implementation on the same small tree ensemble and background set and checking whether the numeric SHAP vectors match to machine precision.
Figures
read the original abstract
SHapley Additive exPlanations (SHAP) is a key tool for interpreting decision tree ensembles by assigning contribution values to features. It is widely used in finance, advertising, medicine, and other domains. Two main approaches to SHAP calculation exist: Path-Dependent SHAP, which leverages the tree structure for efficiency, and Background SHAP, which uses a background dataset to estimate feature distributions. We introduce WOODELF, a SHAP algorithm that integrates decision trees, game theory, and Boolean logic into a unified framework. For each consumer, WOODELF constructs a pseudo-Boolean formula that captures their feature values, the structure of the decision tree ensemble, and the entire background dataset. It then leverages this representation to compute Background SHAP in linear time. WOODELF can also compute Path-Dependent SHAP, Shapley interaction values, Banzhaf values, and Banzhaf interaction values. WOODELF is designed to run efficiently on CPU and GPU hardware alike. Available via the WOODELF Python package, it is implemented using NumPy, SciPy, and CuPy without relying on custom C++ or CUDA code. This design enables fast performance and seamless integration into existing frameworks, supporting large-scale computation of SHAP and other game-theoretic values in practice. For example, on a dataset with 3,000,000 rows, 5,000,000 background samples, and 127 features, WOODELF computed all Background Shapley values in 162 seconds on CPU and 16 seconds on GPU - compared to 44 minutes required by the best method on any hardware platform, representing 16x and 165x speedups, respectively.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces WOODELF, a unified SHAP algorithm for decision tree ensembles that constructs a per-instance pseudo-Boolean formula incorporating the instance feature values, the full tree ensemble structure, and the entire background dataset. This representation is used to compute Background SHAP values in linear time; the method also supports Path-Dependent SHAP, Shapley interaction values, Banzhaf values, and Banzhaf interaction values. The implementation uses only NumPy, SciPy, and CuPy (no custom C++ or CUDA) for CPU/GPU execution and reports concrete speedups, e.g., 162 s (CPU) and 16 s (GPU) versus 44 min for the best prior method on a dataset with 3 000 000 rows, 5 000 000 background samples, and 127 features.
Significance. If the pseudo-Boolean construction exactly reproduces the game-theoretic value function without hidden exponential steps or distributional approximations, the work would constitute a substantial practical advance in scalable model interpretation for tree ensembles. The reported speedups on very large data, the avoidance of low-level custom kernels, and the unification of multiple explanation types (Shapley, Banzhaf, interactions) are clear strengths that would be of immediate interest to practitioners in finance, medicine, and advertising.
major comments (1)
- [Abstract] Abstract: the central linear-time claim for Background SHAP rests on the assertion that a single pseudo-Boolean formula per instance 'captures ... the entire background dataset' and yields exact SHAP values. No construction rule, equivalence proof, or complexity argument is supplied showing that the Boolean encoding preserves the precise empirical conditional expectation v(S) = (1/|D|) sum_{x in D} f(x_S, x'_S^c) without introducing independence assumptions or collapsing the background distribution. This equivalence is load-bearing for both correctness and the O(n) runtime guarantee.
minor comments (2)
- The abstract uses the term 'consumer' without definition; the manuscript should clarify whether the algorithm is restricted to a particular application domain or applies to arbitrary decision-tree ensembles.
- The speed-up numbers are given for a single large instance; a table or figure reporting scaling with background size, number of trees, and feature count would strengthen the empirical section.
Simulated Author's Rebuttal
We thank the referee for their careful reading and constructive feedback. We address the major comment regarding the abstract and the supporting details for the Background SHAP claims below.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central linear-time claim for Background SHAP rests on the assertion that a single pseudo-Boolean formula per instance 'captures ... the entire background dataset' and yields exact SHAP values. No construction rule, equivalence proof, or complexity argument is supplied showing that the Boolean encoding preserves the precise empirical conditional expectation v(S) = (1/|D|) sum_{x in D} f(x_S, x'_S^c) without introducing independence assumptions or collapsing the background distribution. This equivalence is load-bearing for both correctness and the O(n) runtime guarantee.
Authors: We agree that the abstract, owing to length constraints, does not supply the explicit construction rule, equivalence proof, or complexity argument. The current manuscript describes the high-level approach but lacks a formal statement of how the single pseudo-Boolean formula is built to encode trees, instance values, and the full background dataset, as well as a proof that it exactly realizes v(S) without independence assumptions or distributional collapse. We will add these elements in the revision: a detailed construction procedure in Section 3, a theorem with proof in a new appendix establishing exact equivalence to the empirical conditional expectation, and a complexity section showing that formula evaluation yields linear time in the number of features. These changes will directly support the correctness and runtime claims. revision: yes
Circularity Check
No significant circularity in derivation chain
full rationale
The paper presents WOODELF as a direct algorithmic construction: for each instance it builds a pseudo-Boolean formula encoding the decision-tree ensemble plus the full background dataset, then evaluates Background SHAP from that representation in linear time. No equations, self-citations, or fitted parameters are shown that reduce the claimed linear-time result to a tautology or to the input data by construction. The central claim is an independent encoding step whose correctness is an external verification question, not a definitional loop. This matches the default expectation of a non-circular algorithmic paper.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math Shapley values are well-defined for the cooperative game induced by a tree ensemble and background distribution
invented entities (1)
-
pseudo-Boolean formula per consumer
no independent evidence
Reference graph
Works this paper leans on
-
[1]
In2024 IEEE Recent Advances in Intelligent Computational Systems (RAICS), 1–8
A Machine Learning Approach for Credit Card Fraud Detection in Massive Datasets Using SMOTE and Random Sampling. In2024 IEEE Recent Advances in Intelligent Computational Systems (RAICS), 1–8. Banzhaf, J. F. 1965. Weighted V oting Doesn’t Work: A Mathematical Analysis.Rutgers Law Review, 19: 317–343. Bracci, A. 2021. background data size limited to 100, xg...
-
[2]
Consistent Individualized Feature Attribution for Tree Ensembles
The Shapley Value of Tuples in Query Answering. Logical Methods in Computer Science, V olume 17, Issue 3. Lundberg, S.; and contributors, S. 2024. SHAP GPUTree Explainer Example. https://shap.readthedocs.io/en/latest/ example notebooks/api examples/explainers/GPUTree. html. Lundberg, S. M.; Erion, G.; Chen, H.; DeGrave, A.; Prutkin, J. M.; Nair, B.; Katz,...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[3]
In 2009 IEEE Symposium on Computational Intelligence for Security and Defense Applications, 1–6
A detailed analysis of the KDD CUP 99 data set. In 2009 IEEE Symposium on Computational Intelligence for Security and Defense Applications, 1–6. Xiao, Z. 2024. IEEE-CIS Fraud Detection Based on XGB. In Li, X.; Yuan, C.; and Kent, J., eds.,Proceedings of the 7th International Conference on Economic Management and Green Development, 1785–1796. Singapore: Sp...
-
[4]
+ 2·(1−p(f 2 <3)). Baseline SHAPcan not estimate the full probability as it only has a single baseline value forf 2.Path-Dependent SHAPestimates this probability using the cover property, which is based on the training data that reached the split node in the training process. Since only rows withX[f 1]< 4reached this node, the estimate is conditioned onX[...
work page 1953
-
[5]
The original weight of the clause was multiplied by−1
-
[6]
Using de-Morgan’s laws: |S+(ck)|=|S −(¬ck)|,|S −(ck)|=|S +(¬ck)| When applying 2 and 3 toβ simplified i andϕ simplified i , both changes cancel out and we are left with the same equa- tions. A direct corollary of Theorem 4 is the robustness of βsimplified i andϕ simplified i also w.r.t WCNF formulas. Corollary 5.For any playeri, bothβ simplified i andϕ si...
work page 1983
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.