
arxiv: 2606.19850 · v1 · pith:PAWUKM6Bnew · submitted 2026-06-18 · 💻 cs.LG · cs.AI

Neural Additive and Basis Models with Feature Selection and Interactions

Pith reviewed 2026-06-26 18:03 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords: neural additive model · neural basis model · feature selection · generalized additive model · feature interaction · interpretability · high-dimensional data · computational efficiency

The pith

Adding a learnable feature selection layer to neural additive and basis models reduces computation while enabling interactions on high-dimensional data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Neural additive models and neural basis models provide interpretable predictions through additive structure but become computationally intractable when using two-input networks for feature interactions or when applied to high-dimensional inputs. The paper adds a feature selection layer whose weights are learned during training to automatically choose which single features and feature pairs to retain. This keeps the models interpretable, cuts resource use and model size relative to the unmodified versions, and allows the two-input case to run on large datasets. The resulting models match or exceed the accuracy of existing state-of-the-art generalized additive models.

Core claim

Incorporating a feature selection layer into NAM and NBM, with selection weights updated during training, resolves the computational bottlenecks of the original models, reduces costs and sizes, permits two-input networks for feature interactions even on high-dimensional data, and produces performance that is better than or comparable to state-of-the-art GAMs.

What carries the argument

Feature selection layer whose weights are updated during training to retain relevant features and pairs.
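The abstract does not give the layer's equations, so the following is only a minimal sketch of how such a layer could work, not the authors' implementation. It assumes each selection slot holds one trainable score per input feature, and that the scores are turned into sparse weights by sparsemax (the α = 2 member of the entmax family; the Figure 3 caption mentions entmax, but the exact variant is an assumption here). The names `select_features`, `additive_predict`, and the toy shape functions are hypothetical stand-ins for trained networks.

```python
import math

def sparsemax(scores):
    """Project a score vector onto the probability simplex (sparsemax).

    Unlike softmax, the result can contain exact zeros, which lets a
    selection slot drop features entirely.
    """
    z_sorted = sorted(scores, reverse=True)
    cumulative = 0.0
    tau = 0.0
    for j, z in enumerate(z_sorted, start=1):
        cumulative += z
        if 1.0 + j * z > cumulative:
            # The j-th largest score is still in the support: update threshold.
            tau = (cumulative - 1.0) / j
    return [max(s - tau, 0.0) for s in scores]

def select_features(x, slot_scores):
    """Return one value per slot: a sparse convex mix of the input features."""
    selected = []
    for scores in slot_scores:
        weights = sparsemax(scores)
        selected.append(sum(w * xi for w, xi in zip(weights, x)))
    return selected

def additive_predict(x, slot_scores, shape_functions, bias=0.0):
    """Additive prediction: bias plus one shape function per selected slot."""
    selected = select_features(x, slot_scores)
    return bias + sum(f(v) for f, v in zip(shape_functions, selected))

# Toy example: 3 input features, 2 selection slots (scores are illustrative,
# in a real model they would be learned by gradient descent).
x = [3.0, -1.0, 0.5]
slot_scores = [[2.0, 1.0, 0.1], [0.0, 0.0, 5.0]]
shape_functions = [math.tanh, lambda v: v * v]  # stand-ins for trained NNs

print(select_features(x, slot_scores))  # -> [3.0, 0.5]
print(additive_predict(x, slot_scores, shape_functions))
```

In this toy case each slot's weights collapse onto a single feature, so the number of shape functions evaluated equals the number of slots rather than the number of input features.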

If this is right

  • Computational resources and model sizes decrease compared with vanilla NAM and NBM.
  • Two-input networks can be used to capture feature interactions on high-dimensional inputs.
  • Accuracy is better than or comparable to current state-of-the-art GAMs.
  • Interpretability is retained through the additive GAM structure.
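To make the scaling argument concrete, the short sketch below counts the shape functions a NAM-style model needs with and without selection. It assumes one network per single feature and one per unordered feature pair in the unselected model; the slot budgets of 50 are illustrative values, not settings confirmed by the source, and `subnetwork_counts` is a hypothetical helper.

```python
from math import comb

def subnetwork_counts(num_features, num_single_slots=None, num_pair_slots=None):
    """Count shape functions with and without a selection layer.

    Without selection: one network per feature plus one per unordered pair.
    With selection: only the budgeted slots remain (capped at the full count).
    """
    full_single = num_features
    full_pair = comb(num_features, 2)
    kept_single = full_single if num_single_slots is None else min(num_single_slots, full_single)
    kept_pair = full_pair if num_pair_slots is None else min(num_pair_slots, full_pair)
    return {"full": full_single + full_pair, "selected": kept_single + kept_pair}

for d in (10, 100, 1000, 10000):
    counts = subnetwork_counts(d, num_single_slots=50, num_pair_slots=50)
    print(d, counts["full"], counts["selected"])
```

Under these assumptions, a dataset with 1,000 features would need 500,500 term networks without selection and 100 with it, which is the quadratic growth that the selection layer is meant to avoid.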

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same selection layer could be inserted into other additive or basis models to improve scalability.
  • Automatic pair selection removes the need for exhaustive enumeration of all possible interactions.
  • The approach may generalize to other neural architectures that currently face similar dimensionality barriers.

Load-bearing premise

The learned selection weights reliably pick the important features and interactions without bias or the need for per-dataset retuning.

What would settle it

On a high-dimensional dataset whose ground-truth relevant features and interactions are known, the models select the wrong subset and lose accuracy relative to the unselected versions.
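One way to score such a test is to compare the selected features and pairs against the known relevant set. The helper below is a hypothetical sketch of that comparison (precision and recall of the selection), not a metric taken from the paper; pairs are treated as unordered.

```python
def selection_recovery(selected, relevant):
    """Precision and recall of a selected set against known ground truth.

    Items can be feature indices (ints) or interaction pairs (tuples/lists);
    pairs are compared without regard to order.
    """
    def canon(item):
        return tuple(sorted(item)) if isinstance(item, (tuple, list)) else item

    chosen = {canon(i) for i in selected}
    truth = {canon(i) for i in relevant}
    hits = len(chosen & truth)
    precision = hits / len(chosen) if chosen else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall

# Illustrative values: the model picked features 0 and 3 and the pair (5, 2),
# while the ground truth is features 0 and 1 and the pair (2, 5).
precision, recall = selection_recovery([0, 3, (5, 2)], [0, 1, (2, 5)])
print(precision, recall)
```

Low recall together with an accuracy drop relative to the unselected model would be the failure signature described above.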

Figures

Figures reproduced from arXiv: 2606.19850 by Kota Yamanishi, Shinichi Shirakawa, Takuya Matsuda, Yasutoshi Kishimoto.

Figure 1: Throughput (the number of inputs processed per second) of NAM, NAM-FS, NA2M, and NA2M-FS. [Plot: x-axis, number of features (D), ticks 10^1 to 10^4; y-axis, throughput (×10^3 x/sec), ticks 0 to 300; legend entries NBM, NBM-FS(50), NBM-FS(500), NB2M, NB2M-FS(50,50), NB2M-FS(500,500).]
Figure 3: Model architecture of NAM-FS for single output. [Diagram labels: shared one-input B-output DNNs; shared two-input B-output DNNs; bias; feature selection weights from entmax; feature-specific weights.]
Figure 4: Model architecture of NBM-FS for single output.
Original abstract

Deep neural networks (DNNs) exhibit attractive performance in various fields but often suffer from low interpretability. The neural additive model (NAM) and its variant called the neural basis model (NBM) use neural networks (NNs) as nonlinear shape functions in generalized additive models (GAMs). Both models are highly interpretable and exhibit good performance and flexibility for NN training. NAM and NBM can provide and visualize the contribution of each feature to the prediction owing to GAM-based architectures. However, when using two-input NNs to consider feature interactions or when applying them to high-dimensional datasets, training NAM and NBM becomes intractable due to the increase in the computational resources required. This paper proposes incorporating the feature selection mechanism into NAM and NBM to resolve computational bottlenecks. We introduce the feature selection layer in both models and update the selection weights during training. Our method is simple and can reduce computational costs and model sizes compared to vanilla NAM and NBM. In addition, it enables us to use two-input NNs even in high-dimensional datasets and capture feature interactions. We demonstrate that the proposed models are computationally efficient compared to vanilla NAM and NBM, and they exhibit better or comparable performance with state-of-the-art GAMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes adding a differentiable feature selection layer (with weights updated during training) to Neural Additive Models (NAM) and Neural Basis Models (NBM). This is claimed to resolve computational intractability when using two-input networks for feature interactions or when scaling to high-dimensional data, while preserving interpretability, reducing model size and training cost relative to vanilla NAM/NBM, and achieving performance that is better or comparable to state-of-the-art GAMs.

Significance. If the selection mechanism can be shown to prune irrelevant features and interactions reliably without dataset-specific tuning or selection bias, the work would meaningfully extend the practical reach of interpretable additive models to higher-dimensional regimes where two-input networks were previously infeasible.

major comments (3)
  1. [Abstract / Method] The abstract and method description provide no equations or pseudocode for the selection layer, the precise form of the selection weights, the loss term (or regularization schedule) that drives sparsity, or the update rule. This information is load-bearing for the central claim that the mechanism reliably identifies relevant features/interactions without introducing bias or requiring per-dataset hyperparameter search.
  2. [Experiments] No experimental protocol, datasets, baselines, error bars, or ablation on the selection weights is described. Without these, it is impossible to assess whether the reported efficiency gains and performance parity hold or whether they depend on favorable hyperparameter choices that undermine the "simple and general" claim.
  3. [Method] The skeptic's concern is not addressed: if the selection weights are learned via a plain differentiable gate without explicit sparsity or stability penalties, they can latch onto spurious correlations. The manuscript must demonstrate (via controlled experiments or theoretical argument) that this does not occur on the high-dimensional regimes it targets.
minor comments (2)
  1. [Method] Notation for the selection weights and their integration into the additive structure should be introduced with an equation rather than prose only.
  2. [Abstract] The abstract states "better or comparable performance with state-of-the-art GAMs" without naming the specific GAM baselines or reporting quantitative differences.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments identify key areas where additional detail and evidence are needed to support the central claims. We address each point below and commit to revisions that incorporate the requested clarifications and demonstrations.

  1. Referee: [Abstract / Method] The abstract and method description provide no equations or pseudocode for the selection layer, the precise form of the selection weights, the loss term (or regularization schedule) that drives sparsity, or the update rule. This information is load-bearing for the central claim that the mechanism reliably identifies relevant features/interactions without introducing bias or requiring per-dataset hyperparameter search.

    Authors: We agree that the current description is insufficiently precise. In the revised manuscript we will insert the explicit equations for the feature selection layer (including the trainable weights, their transformation, and integration into the NAM/NBM forward pass), the sparsity-inducing regularization term and its annealing schedule, and the gradient-based update rule. These additions will make the mechanism fully reproducible and allow direct evaluation of whether it avoids bias or per-dataset tuning. revision: yes

  2. Referee: [Experiments] No experimental protocol, datasets, baselines, error bars, or ablation on the selection weights is described. Without these, it is impossible to assess whether the reported efficiency gains and performance parity hold or whether they depend on favorable hyperparameter choices that undermine the "simple and general" claim.

    Authors: The manuscript reports efficiency and performance results, yet we acknowledge that the experimental protocol, dataset specifications, baseline implementations, error bars, and ablations on the selection weights are not presented at the required level of detail. The revision will expand the experimental section with complete protocols, the full list of datasets and baselines, statistical error bars across multiple runs, and dedicated ablations that vary the selection regularization strength to demonstrate robustness. revision: yes

  3. Referee: [Method] The skeptic's concern is not addressed: if the selection weights are learned via a plain differentiable gate without explicit sparsity or stability penalties, they can latch onto spurious correlations. The manuscript must demonstrate (via controlled experiments or theoretical argument) that this does not occur on the high-dimensional regimes it targets.

    Authors: We recognize the validity of this concern. Our approach includes an explicit sparsity regularization term on the selection weights, but the manuscript does not yet contain controlled experiments isolating spurious-feature behavior. The revision will add synthetic high-dimensional experiments with known ground-truth relevant and irrelevant features, together with quantitative metrics of selection accuracy, to show that the learned weights reliably recover the relevant set without latching onto noise. revision: yes
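A synthetic experiment of the kind promised in the third response could start from data like the following. This is a hedged sketch only: the target function, noise level, and the helper name `make_synthetic` are illustrative assumptions, not the authors' protocol. Only features 0 and 1 and the interaction between features 2 and 3 influence the target; every other column is pure noise.

```python
import random

def make_synthetic(num_samples, num_features, seed=0):
    """Regression data where only features 0, 1 and the pair (2, 3) matter."""
    if num_features < 4:
        raise ValueError("need at least 4 features for the relevant set")
    rng = random.Random(seed)
    rows, targets = [], []
    for _ in range(num_samples):
        x = [rng.uniform(-1.0, 1.0) for _ in range(num_features)]
        # Two main effects, one pairwise interaction, small Gaussian noise.
        y = x[0] ** 2 + 2.0 * x[1] + x[2] * x[3] + rng.gauss(0.0, 0.01)
        rows.append(x)
        targets.append(y)
    ground_truth = {"features": [0, 1], "pairs": [(2, 3)]}
    return rows, targets, ground_truth

rows, targets, ground_truth = make_synthetic(num_samples=200, num_features=1000)
print(len(rows), len(rows[0]), ground_truth)
```

Because the relevant set is known by construction, the features and pairs retained by a trained selection layer can be checked directly against `ground_truth`.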

Circularity Check

0 steps flagged

No circularity detected in model proposal or claims


The paper proposes an architectural extension (feature selection layer with trainable weights) to NAM/NBM and reports empirical efficiency and accuracy gains. No derivation chain, uniqueness theorem, fitted parameter renamed as prediction, or self-citation load-bearing step is present. Claims rest on experimental comparisons rather than any quantity that reduces to its own inputs by construction. This is the normal case of an applied modeling paper whose central results are externally falsifiable via replication on held-out data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities beyond the selection layer itself; selection weights appear to be ordinary learned parameters rather than hand-tuned constants.

pith-pipeline@v0.9.1-grok · 5754 in / 1046 out tokens · 22119 ms · 2026-06-26T18:03:19.341918+00:00 · methodology

