arxiv: 2606.04042 · v2 · pith:OP532E5Unew · submitted 2026-06-02 · ❄️ cond-mat.mtrl-sci

Uncertainty-Aware Symbolic Regression through Bayesian Support Selection

Pith reviewed 2026-06-28 09:38 UTC · model grok-4.3

classification ❄️ cond-mat.mtrl-sci
keywords symbolic regression · Bayesian inference · SISSO · uncertainty quantification · descriptor selection · Heusler alloys · materials informatics · model averaging

The pith

A Bayesian reformulation of the sparsifying operator in SISSO produces posterior probabilities over competing descriptors and predictive credible intervals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper extends the deterministic SISSO symbolic regression method by keeping the sure independence screening (SIS) stage fixed and recasting the sparsifying operator (SO) as Bayesian inference over the screened supports. This yields posterior probabilities for competing descriptor sets, feature-inclusion probabilities, model-averaged predictions, and credible intervals around them. In the maximum-a-posteriori (MAP) limit, it recovers the original deterministic result. Applied to magnetic-moment data for Heusler alloys, it shows modest gains in five-fold cross-validation RMSE and near-nominal empirical coverage of the 95% predictive intervals, while exposing multiple physically related descriptor families that a single deterministic run would miss. The approach is positioned as a diagnostic tool for assessing descriptor stability in small-data materials problems.

Core claim

The deterministic-SIS/Bayesian-SO framework yields posterior probabilities for competing descriptor supports, feature-inclusion probabilities, Bayesian-model-averaged predictions, and predictive credible intervals, while recovering the deterministic SO descriptor of standard SISSO in the maximum-a-posteriori limit. Applied to an X2YZ Heusler-alloy magnetic-moment dataset, the approach gives modest improvements in five-fold cross-validation RMSE and near-nominal empirical coverage of the 95% predictive intervals. The posterior exposes competing, physically related symbolic descriptor families.
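
In generic Bayesian-model-averaging notation, the four quantities named in the claim can be written as follows. The symbols are assumptions introduced here for illustration, not the paper's own notation: $\mathcal{S}_{\mathrm{SIS}}$ is the set of candidate supports built from the SIS-screened features, $\mathcal{D}$ the training data, and $y_*$ a new prediction.

```latex
\begin{align}
  p(S \mid \mathcal{D}) &= \frac{p(\mathcal{D} \mid S)\, p(S)}
      {\sum_{S' \in \mathcal{S}_{\mathrm{SIS}}} p(\mathcal{D} \mid S')\, p(S')}
      && \text{(support posterior)} \\
  \mathrm{PIP}_j &= \sum_{S \in \mathcal{S}_{\mathrm{SIS}} :\, j \in S} p(S \mid \mathcal{D})
      && \text{(feature-inclusion probability)} \\
  p(y_* \mid \mathcal{D}) &= \sum_{S \in \mathcal{S}_{\mathrm{SIS}}} p(S \mid \mathcal{D})\, p(y_* \mid S, \mathcal{D})
      && \text{(model-averaged predictive)} \\
  \hat{S}_{\mathrm{MAP}} &= \arg\max_{S \in \mathcal{S}_{\mathrm{SIS}}} p(S \mid \mathcal{D})
      && \text{(MAP support)}
\end{align}
```

Credible intervals are quantiles of the model-averaged predictive, and every quantity is conditional on the screened set $\mathcal{S}_{\mathrm{SIS}}$.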

What carries the argument

The deterministic-SIS/Bayesian-SO framework, which performs Bayesian inference over the support space identified by sure independence screening.
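
A minimal Python sketch of inference over a screened support space, under stated assumptions: supports are all fixed-size subsets of the screened feature columns, the prior over supports is uniform, and the marginal likelihood is approximated by BIC. The function name `support_posterior` and these modeling choices are hypothetical; the paper's actual priors and likelihood are not given in this summary.

```python
import itertools
import numpy as np

def support_posterior(X, y, dim):
    """Posterior over all size-`dim` supports of the screened columns of X.

    Assumptions of this sketch: uniform prior over supports and a BIC
    approximation to the log marginal likelihood of each linear model.
    """
    n, k = X.shape
    supports = list(itertools.combinations(range(k), dim))
    log_ev = np.empty(len(supports))
    for i, s in enumerate(supports):
        # Design matrix: intercept plus the selected features
        A = np.column_stack([np.ones(n), X[:, list(s)]])
        coef = np.linalg.lstsq(A, y, rcond=None)[0]
        rss = float(np.sum((y - A @ coef) ** 2))
        bic = n * np.log(rss / n) + A.shape[1] * np.log(n)
        log_ev[i] = -0.5 * bic
    # Normalize in a numerically stable way (subtract the maximum first)
    w = np.exp(log_ev - log_ev.max())
    post = w / w.sum()
    # Feature-inclusion probability: total posterior mass of supports containing j
    pip = np.zeros(k)
    for p, s in zip(post, supports):
        pip[list(s)] += p
    return supports, post, pip
```

With equal-size supports and a uniform prior, the highest-posterior support in this sketch is the one with the smallest residual sum of squares, which illustrates how a MAP limit can coincide with a deterministic least-squares selection.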

If this is right

  • The method recovers standard SISSO descriptors exactly in the MAP limit.
  • It provides Bayesian model-averaged predictions that can improve accuracy over single models.
  • Credible intervals achieve near-nominal coverage in cross-validation on the alloy dataset.
  • Multiple descriptor families become visible that share physical relations but would be hidden in deterministic output.
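
The model-averaged prediction and its credible interval can be sketched as a posterior-weighted mixture of per-support predictive distributions. The sketch below assumes each support gives a Gaussian predictive at a test point, which is an assumption made for illustration (the paper's predictive form is not stated in this summary), and the name `bma_predictive_summary` is hypothetical.

```python
import numpy as np

def bma_predictive_summary(means, sds, weights, level=0.95, n_draws=200_000, seed=0):
    """Mean and equal-tailed credible interval of a Gaussian mixture.

    means, sds : per-support predictive mean and standard deviation at one test point
    weights    : posterior probabilities of the supports (must sum to 1)
    """
    means, sds, weights = (np.asarray(a, dtype=float) for a in (means, sds, weights))
    rng = np.random.default_rng(seed)
    # Draw a support for each sample, then draw from that support's predictive
    comp = rng.choice(len(weights), size=n_draws, p=weights)
    draws = rng.normal(means[comp], sds[comp])
    alpha = 1.0 - level
    lo, hi = np.quantile(draws, [alpha / 2, 1 - alpha / 2])
    # The mixture mean is available in closed form
    return float(weights @ means), float(lo), float(hi)
```

When posterior mass is split across supports that disagree at a test point, the mixture interval widens relative to any single support's interval, which is the mechanism by which descriptor non-uniqueness enters the predictive uncertainty.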

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Uncertainty quantification could help prioritize which descriptors to validate experimentally in materials discovery.
  • The framework might extend to other symbolic regression methods by replacing their selection stages with Bayesian analogs.
  • Competing descriptors suggest that feature spaces in materials problems often contain redundant physical information.

Load-bearing premise

The sure independence screening stage can remain fully deterministic without compromising the validity or tractability of Bayesian inference over the resulting support space.

What would settle it

A test where the empirical coverage of the 95% predictive intervals on held-out data falls substantially below 95%, or where the posterior mass on the standard SISSO descriptor is not the highest.
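
The coverage half of this test can be sketched in a few lines. The function name `coverage_report` is hypothetical, and the z-score uses a binomial standard error that treats held-out points as independent, which is only an approximation for pooled cross-validation folds.

```python
import numpy as np

def coverage_report(y_true, lower, upper, nominal=0.95):
    """Empirical interval coverage and its distance from nominal in standard errors."""
    y_true, lower, upper = (np.asarray(a, dtype=float) for a in (y_true, lower, upper))
    inside = (y_true >= lower) & (y_true <= upper)
    coverage = float(inside.mean())
    # Standard error of the coverage estimate if the intervals were exactly nominal
    se = float(np.sqrt(nominal * (1.0 - nominal) / inside.size))
    return coverage, (coverage - nominal) / se
```

A strongly negative second value would indicate coverage substantially below nominal; on small held-out sets the standard error is large, so moderate shortfalls are not decisive.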

Figures

Figures reproduced from arXiv: 2606.04042 by Satadeep Bhattacharjee.

Figure 1. Workflow of the deterministic-SIS/Bayesian-SO […]
Figure 2. Summary of the X […]
Figure 3. Average validation RMSE for the deterministic SIS+SO baseline of standard SISSO and Bayesian posterior-mean […]
Figure 4. Two-dimensional uniform-prior PIP stability with re […]
Figure 5. Diagnostics for the two-dimensional Slater–Pauling prior. Top left: fold-resolved validation RMSE. Top right: average […]
Figure 6. Posterior feature-inclusion probabilities averaged over the five cross-validation folds. Each bar gives the average […]
Original abstract

The Sure Independence Screening and Sparsifying Operator (SISSO) framework is a powerful symbolic-regression method for extracting compact and interpretable descriptors from large nonlinear feature spaces. However, standard SISSO is deterministic: it returns a single descriptor and point prediction, without quantifying uncertainty in descriptor selection, regression coefficients, or predictions. Here we introduce a probabilistic extension in which the sure independence screening (SIS) stage is kept deterministic to preserve scalability, while the sparsifying operator (SO) stage is reformulated as Bayesian inference over the SIS-screened support space. The resulting deterministic-SIS/Bayesian-SO framework yields posterior probabilities for competing descriptor supports, feature-inclusion probabilities, Bayesian-model-averaged predictions, and predictive credible intervals, while recovering the deterministic SO descriptor of standard SISSO in the maximum-a-posteriori limit. Applied to an $X_2YZ$ Heusler-alloy magnetic-moment dataset, the approach gives modest improvements in five-fold cross-validation RMSE and near-nominal empirical coverage of the 95% predictive intervals. More importantly, the posterior exposes competing, physically related symbolic descriptor families that would appear artificially unique in a deterministic analysis. These results suggest that deterministic-SIS/Bayesian-SO can be used as an uncertainty-aware diagnostic extension of SISSO: a tool for assessing descriptor confidence, stability, and non-uniqueness in small-data materials regression problems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a Bayesian extension to the SISSO symbolic regression framework for materials informatics. The sure-independence-screening (SIS) stage remains fully deterministic to retain scalability, while the sparsifying-operator (SO) stage is recast as Bayesian inference over the reduced support space. This yields posterior probabilities over competing supports, feature-inclusion probabilities, Bayesian-model-averaged predictions, and predictive credible intervals; the deterministic SISSO solution is recovered in the maximum-a-posteriori limit. On an X2YZ Heusler-alloy magnetic-moment dataset the method reports modest five-fold cross-validation RMSE gains and near-nominal empirical coverage of the 95% predictive intervals, while surfacing physically related but competing descriptor families.

Significance. If the central construction is valid, the work supplies a practical uncertainty-aware diagnostic layer for an established symbolic-regression tool widely used in small-data materials problems. The explicit recovery of the deterministic solution in the MAP limit and the demonstration that posterior mass can be distributed across physically related descriptors are concrete strengths that would be useful to practitioners.

major comments (2)
  1. [§3] §3 (Bayesian-SO construction): the claim that posterior probabilities, inclusion probabilities, and credible intervals remain valid when the support space is restricted to the deterministic SIS output requires an explicit argument that no descriptor outside the screened set can carry non-negligible posterior mass under the joint model. The correlation-based screening criterion does not automatically guarantee this in high-dimensional nonlinear feature spaces; without such an argument or a diagnostic test the reported posteriors are conditioned on an incomplete model space.
  2. [§4.2, Table 2] §4.2 and Table 2 (Heusler results): the five-fold CV-RMSE improvement is stated as modest, yet no quantitative comparison is given against a baseline that performs Bayesian inference over a larger (non-SIS-screened) support or against standard SISSO with bootstrap uncertainty. Without this, it is unclear whether the reported gains and coverage are attributable to the Bayesian-SO layer or to other modeling choices.
minor comments (2)
  1. [Abstract] The abstract states 'near-nominal empirical coverage' but supplies neither the exact empirical coverage value nor the figure or table that reports it; adding the numerical result would improve clarity.
  2. [§3] Notation for the prior on regression coefficients and the sampling method (MCMC, variational, etc.) used to obtain the posteriors should be stated explicitly in the methods section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, proposing targeted revisions to clarify scope and limitations while preserving the manuscript's focus on a scalable uncertainty-aware extension of SISSO.

Point-by-point responses
  1. Referee: [§3] §3 (Bayesian-SO construction): the claim that posterior probabilities, inclusion probabilities, and credible intervals remain valid when the support space is restricted to the deterministic SIS output requires an explicit argument that no descriptor outside the screened set can carry non-negligible posterior mass under the joint model. The correlation-based screening criterion does not automatically guarantee this in high-dimensional nonlinear feature spaces; without such an argument or a diagnostic test the reported posteriors are conditioned on an incomplete model space.

    Authors: We agree that all reported posterior quantities are conditional on the SIS-screened support by construction. The deterministic SIS stage is retained explicitly for scalability, as a joint Bayesian treatment over the full nonlinear feature space is computationally prohibitive. While SIS has theoretical retention guarantees in linear high-dimensional settings, these are heuristic in the nonlinear case used here. In revision we will add an explicit paragraph in §3 stating that posteriors are conditioned on the screened space (with zero mass assigned outside it by design) and will include a short diagnostic note suggesting post-hoc correlation checks of excluded features with the target as a practical safeguard. revision: partial

  2. Referee: [§4.2, Table 2] §4.2 and Table 2 (Heusler results): the five-fold CV-RMSE improvement is stated as modest, yet no quantitative comparison is given against a baseline that performs Bayesian inference over a larger (non-SIS-screened) support or against standard SISSO with bootstrap uncertainty. Without this, it is unclear whether the reported gains and coverage are attributable to the Bayesian-SO layer or to other modeling choices.

    Authors: The central contribution is an uncertainty diagnostic that recovers the deterministic SISSO solution in the MAP limit rather than a claim of superior point accuracy. The modest CV-RMSE gain and near-nominal coverage arise from Bayesian model averaging over supports within the screened space. Full Bayesian inference without screening remains intractable for the feature dimensions considered, which is the motivation for the two-stage design. Bootstrap resampling of deterministic SISSO would quantify coefficient variability for a fixed support but would not capture posterior mass over alternative supports. In revision we will expand §4.2 with a paragraph explaining these distinctions and will add a direct numerical comparison of 5-fold CV-RMSE and interval coverage between the Bayesian-SO results and bootstrap-resampled standard SISSO on the same Heusler dataset. revision: partial
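
The two safeguards proposed in the responses above, a post-hoc correlation check on excluded features and a bootstrap of the deterministic fit, can be sketched as follows. This is an illustration under stated assumptions: screening is taken to rank features by absolute Pearson correlation with the target, the bootstrap refits ordinary least squares on a fixed support, and both function names are hypothetical.

```python
import numpy as np

def excluded_feature_check(F, y, k):
    """Compare correlation scores of the k kept features with those excluded.

    F is an (n_samples, n_features) candidate matrix with k < n_features.
    Constant columns would give a zero denominator and are assumed absent.
    """
    Fc = F - F.mean(axis=0)
    yc = y - y.mean()
    scores = np.abs(Fc.T @ yc) / (np.linalg.norm(Fc, axis=0) * np.linalg.norm(yc))
    order = np.argsort(scores)[::-1]  # highest score first
    kept, excluded = order[:k], order[k:]
    return {
        "kept": kept,
        "weakest_kept_score": float(scores[kept].min()),
        "strongest_excluded_score": float(scores[excluded].max()),
    }

def bootstrap_coefficients(X, y, support, n_boot=500, seed=0):
    """Bootstrap distribution of OLS coefficients (intercept first) for a FIXED support."""
    rng = np.random.default_rng(seed)
    n = len(y)
    A = np.column_stack([np.ones(n), X[:, list(support)]])
    coefs = np.empty((n_boot, A.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        coefs[b] = np.linalg.lstsq(A[idx], y[idx], rcond=None)[0]
    return coefs
```

A small gap between the weakest kept and strongest excluded scores would suggest the screening cut-off is not clear-cut. The bootstrap here holds the support fixed, so it reflects coefficient variability only and says nothing about posterior mass on alternative supports, which is the distinction the authors draw.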

Circularity Check

0 steps flagged

No circularity: Bayesian reformulation is an independent probabilistic layer on deterministic SIS

Full rationale

The paper keeps the sure-independence-screening stage fully deterministic (to preserve scalability) and reformulates only the sparsifying operator as standard Bayesian inference over the screened support. Posterior probabilities, feature-inclusion probabilities, model-averaged predictions and credible intervals follow directly from Bayesian model averaging with conventional priors; they are not obtained by fitting the same quantities used for the reported five-fold CV RMSE or coverage checks. The MAP recovery of the deterministic SISSO solution is an expected limiting case of the Bayesian construction rather than a reduction of new results to the input data. No self-citation is load-bearing for the core derivation, and the evaluation metrics are computed on held-out folds, keeping the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review yields no concrete free parameters, ad-hoc axioms, or invented entities; the method rests on standard Bayesian inference applied to a discrete support space after deterministic screening.

axioms (1)
  • domain assumption: Bayesian inference can be performed tractably over the finite set of supports returned by deterministic SIS.
    The framework presupposes that the screened support space is small enough for exact or efficient posterior computation.
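
To make the tractability assumption concrete, the size of the support space can be counted directly. A minimal sketch: the function name and any values of `k` and `max_dim` passed to it are hypothetical and not taken from the paper.

```python
from math import comb

def n_supports(k, max_dim):
    """Number of non-empty supports of size 1..max_dim drawn from k screened features."""
    return sum(comb(k, d) for d in range(1, max_dim + 1))
```

For example, `n_supports(20, 2)` gives 210 and `n_supports(100, 3)` gives 166750, so exhaustive enumeration stays feasible only while the screened set and the descriptor dimension are both small.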

pith-pipeline@v0.9.1-grok · 5773 in / 1356 out tokens · 58769 ms · 2026-06-28T09:38:55.904550+00:00 · methodology

