Expert-Guided Class-Conditional Goodness-of-Fit Scores for Interpretable Classification with Informative Missingness: An Application to Seismic Monitoring

David M. Steinberg; Shahar Cohen; Yael Radzyner; Yochai Ben Horin

arXiv: 2604.14809 · v1 · submitted 2026-04-16 · stat.ML · cs.LG · stat.AP


Pith reviewed 2026-05-10 10:20 UTC · model grok-4.3

classification: stat.ML · cs.LG · stat.AP

keywords: interpretable classification · informative missingness · goodness-of-fit scores · expert-guided models · seismic monitoring · small-sample performance · transparent decision rules · nuclear-test-ban treaty


The pith

A framework encodes expert knowledge into class-conditional goodness-of-fit features that yield interpretable classifications even with pervasive informative missingness and can outperform standard machine-learning methods when training data are scarce.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a classification method for settings that combine heavy missing data, partial expert knowledge, and a requirement for transparent decisions. It builds a class-conditional model from expert input and derives a small set of goodness-of-fit scores that measure how observed and missing values align with each class model. These scores are joined with a few simple auxiliary summaries and passed to a basic discriminative classifier. The resulting rule is easy to inspect and is shown, via simulation that isolates the framework's contribution, to match or exceed strong black-box classifiers when labeled training examples are scarce. The approach is demonstrated on seismic signals used to verify compliance with the nuclear-test-ban treaty, where it is positioned as a transparent screening tool that lowers the review burden on human analysts.

Core claim

The central claim is that prior expert knowledge can be encoded in a class-conditional model whose goodness-of-fit scores, together with transparent auxiliary summaries, produce an accurate yet fully inspectable classifier that handles informative missingness; simulations isolating the framework show this expert-guided method can outperform standard machine-learning classifiers, especially when training samples are small.

What carries the argument

Expert-guided class-conditional goodness-of-fit scores that quantify agreement of both observed and missing data components with the expert model for each class.
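
As a rough illustration of how such scores might be computed, the sketch below averages log-likelihood terms separately over detections, non-detections, and observed values for a single event. The model forms are assumptions made for this example and are not taken from the paper: a probit detection model with hypothetical parameters `tau` and `threshold`, a Gaussian model for observed values with hypothetical spread `sigma`, and a hypothetical per-station attenuation term `atten`.

```python
import math

EPS = 1e-12  # guards against log(0)

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def norm_logpdf(x, mu, sigma):
    """Log-density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

def mean_or_none(values):
    """Average a list, or return None when the component is empty."""
    return sum(values) / len(values) if values else None

def decomposed_scores(stations, magnitude, tau=0.5, sigma=0.3):
    """Per-event fit scores under a HYPOTHETICAL expert model.

    Each station is a dict with 'atten', 'threshold', and 'value'
    (value is None when the station did not detect the event).
    """
    det, nondet, obs = [], [], []
    for s in stations:
        mu = magnitude - s["atten"]                 # assumed expected value
        p = norm_cdf((mu - s["threshold"]) / tau)   # assumed probit detection model
        p = min(max(p, EPS), 1.0 - EPS)
        if s["value"] is None:
            nondet.append(math.log(1.0 - p))        # fit of the missing entry
        else:
            det.append(math.log(p))                 # fit of the detection itself
            obs.append(norm_logpdf(s["value"], mu, sigma))  # fit of the observed value
    return {"det": mean_or_none(det),
            "nondet": mean_or_none(nondet),
            "obs": mean_or_none(obs)}

# Illustrative event: one clear detection and one missed marginal signal.
stations = [
    {"atten": 1.0, "threshold": 2.0, "value": 3.1},
    {"atten": 2.0, "threshold": 2.0, "value": None},
]
scores = decomposed_scores(stations, magnitude=4.0)
```

Keeping the three components separate, rather than summing them into one log-likelihood, is what lets a downstream classifier weigh the missingness pattern independently of the observed values.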

If this is right

Decision rules can be inspected and justified component by component because each feature has a direct interpretation as agreement with an expert model.
Analyst workload in seismic monitoring is reduced by using the scores as an initial transparent filter before human review.
Performance holds or improves relative to black-box methods when labeled training data are limited because expert knowledge supplies the missing structure.
The framework isolates contributions from observed data, missing data patterns, and auxiliary summaries, allowing targeted diagnosis of classification errors.
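
The component-by-component inspection described in the points above can be sketched for a logistic model, where each feature contributes additively to the logit. The coefficient and feature values below are hypothetical and chosen only for illustration; the feature names `det`, `obs`, and `nondet` are placeholders for score components.

```python
import math

def explain_logit(weights, intercept, features):
    """Return each feature's additive contribution to the logit, plus P(Y=1)."""
    contributions = {name: weights[name] * features[name] for name in weights}
    logit = intercept + sum(contributions.values())
    prob = 1.0 / (1.0 + math.exp(-logit))
    return contributions, prob

# Hypothetical coefficients and feature values, for illustration only.
weights = {"det": 1.2, "obs": 0.8, "nondet": 0.5}
features = {"det": -0.4, "obs": -1.1, "nondet": -0.2}
contributions, prob = explain_logit(weights, intercept=2.0, features=features)

# List contributions from largest to smallest magnitude.
for name, value in sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{name:>7}: {value:+.3f}")
print(f"P(Y=1) = {prob:.3f}")
```

Because the contributions sum to the logit, an analyst can see which score drove a given decision without any post-hoc explanation tool.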

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same feature-construction step could be tested in medical triage tasks where laboratory results are frequently absent and domain models of normal versus diseased states already exist.
Because the scores remain meaningful even with high missingness, the method may offer a route to more robust classifiers in any domain where missingness itself carries class information.
Extending the auxiliary summaries to include domain-specific constraints (for example, physical bounds on seismic amplitudes) would be a direct next step that keeps the decision rule transparent.

Load-bearing premise

The expert-specified class-conditional model is accurate and complete enough that its goodness-of-fit scores capture the relevant class differences even when many values are missing.

What would settle it

A direct accuracy comparison between the proposed method and standard classifiers (random forests, neural nets) on a large labeled seismic dataset, repeated across training-set sizes from tens to thousands of examples.

Figures

Figures reproduced from arXiv: 2604.14809 by David M. Steinberg, Shahar Cohen, Yael Radzyner, Yochai Ben Horin.

**Figure 1.** Overview of the proposed pipeline. Starting from the observed data and the expert-guided class-conditional models, we estimate instance-level parameters (or latent states) $\{\hat{\theta}_i^{(y)}\}_{y \in M}$. The fitted quantities, together with the original data and the models, are then used in two expert-guided feature-engineering branches: (1) construction of model-fit score features (including interpretable decomposed com…

**Figure 2.** ROC curves for all models on the test set. The decomposed feature representation yields a clear improvement over the baseline logistic models, and the best overall performance is obtained by the random forest augmented with the proposed features.

**Figure 3.** Calibration plots for the expert-guided probit models at these stations, separately for valid and invalid SEL1 events, and …

**Figure 4.** Histograms of pooled SEL1 residuals together with the fitted expert-guided residual models. The top row shows log(a/T) residuals for valid events (a) and invalid events (b); the bottom row shows arrival-time residuals for valid events (c) and invalid events (d).

**Figure 5.** Predictive performance across training sample sizes. The top row shows AUROC and the bottom row shows TNR at TPR = 0.95. In each row, the left panel corresponds to the low-λ scenario and the right panel to the high-λ scenario. For each method, points denote the Monte Carlo mean across replicates; Monte Carlo standard errors were below 0.01. Overall, …

**Figure 6.** Paired performance gains of score-based representations over baselines in the simulation study. Each panel reports a paired difference between two methods evaluated on the same replicate and scenario: (LR-decomp) − (LR-obs) (left column) and (RF-raw + features) − (RF-raw) (right column). Rows correspond to the evaluation metric (top: AUROC; bottom: TNR at TPR = 0.95). Points show the mean paired difference…

**Figure 7.** Predictive performance under correct specification and expert-model misspecification, for training size n = 10,000. Points show Monte Carlo mean test-set performance; line segments connect the same method under correct specification and misspecification. The left panel reports AUROC, and the right panel reports TNR at TPR = 0.95.

**Figure 8.** Class-conditional distributions of the extracted score features in representative simulation settings. Here u denotes the corresponding feature summary constructed from the extracted ℓ-based scores.
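
The two metrics named in the captions above (AUROC and TNR at TPR = 0.95) and the paired differences with Monte Carlo standard errors can be computed as in the following minimal sketch. The function names are chosen for this example, and the thresholding convention (flag an item when its score is at or above the threshold) is an assumption; the paper's exact conventions are not visible in this excerpt.

```python
import math
import statistics

def auroc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def tnr_at_tpr(scores, labels, target=0.95):
    """True-negative rate at the strictest threshold whose TPR is at least `target`."""
    pos = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    neg = [s for s, y in zip(scores, labels) if y == 0]
    k = max(1, math.ceil(target * len(pos) - 1e-12))  # positives that must be flagged
    threshold = pos[k - 1]                            # flag when score >= threshold
    return sum(1 for s in neg if s < threshold) / len(neg)

def paired_gain(metric_a, metric_b):
    """Mean paired difference (a - b) across replicates and its Monte Carlo standard error."""
    diffs = [a - b for a, b in zip(metric_a, metric_b)]
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs), se
```

Pairing the two methods on the same replicate removes replicate-to-replicate variation from the comparison, which is why the standard error is computed on the differences rather than on each method separately.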

Original abstract

We study a classification problem with three key challenges: pervasive informative missingness, the integration of partial prior expert knowledge into the learning process, and the need for interpretable decision rules. We propose a framework that encodes prior knowledge through an expert-guided class-conditional model for one or more classes, and use this model to construct a small set of interpretable goodness-of-fit features. The features quantify how well the observed data agree with the expert model, isolating the contributions of different aspects of the data, including both observed and missing components. These features are combined with a few transparent auxiliary summaries in a simple discriminative classifier, resulting in a decision rule that is easy to inspect and justify. We develop and apply the framework in the context of seismic monitoring used to assess compliance with the Comprehensive Nuclear-Test-Ban Treaty. We show that the method has strong potential as a transparent screening tool, reducing workload for expert analysts. A simulation designed to isolate the contribution of the proposed framework shows that this interpretable expert-guided method can even outperform strong standard machine-learning classifiers, particularly when training samples are small.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a workable way to fold partial expert class-conditional knowledge into transparent GOF features that separate observed and missing contributions, but the simulation evidence for beating standard classifiers is too thin to judge yet.


The main contribution is a framework that takes an expert-specified class-conditional model for one or more classes and turns it into a handful of goodness-of-fit scores. These scores measure agreement on the observed data and on the missingness pattern separately, then feed into a simple discriminative stage with a few auxiliary summaries. The result is a decision rule that stays easy to inspect while using the expert input directly rather than trying to learn everything from scarce labeled examples.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a framework for interpretable classification with pervasive informative missingness. Prior expert knowledge is encoded via a class-conditional model for one or more classes; this model is used to construct a small set of goodness-of-fit features that quantify agreement between observed (and missing) data and the expert specification. These features are combined with transparent auxiliary summaries inside a simple discriminative classifier. The approach is developed and applied to seismic monitoring for CTBT compliance verification; a simulation is reported to show that the resulting interpretable method can outperform standard machine-learning classifiers, especially at small training-sample sizes.

Significance. If the simulation result is shown to hold when the expert model is only approximately correct, the framework would supply a practical, inspectable screening tool for domains that combine expert knowledge, missingness, and the need for justifiable decisions. The emphasis on isolating contributions of observed versus missing components and on low-data regimes is a genuine strength.

major comments (2)

[Simulation section] The description of the data-generating process used to isolate the framework's contribution is insufficient. It is not stated whether the synthetic observations (including the missingness pattern) are drawn exactly from the expert-specified class-conditional densities. If they are, the GOF features are informative by construction while black-box classifiers must discover the same structure from few samples; this setup does not test the misspecification case flagged by the weakest assumption and required for the seismic-monitoring claim.
[Abstract and results presentation] The central performance claim is asserted without any numerical values, confidence intervals, sample sizes, or description of how the expert model was elicited or validated. This leaves the outperformance statement unsupported by visible evidence and prevents assessment of whether the advantage survives realistic departures from the expert model.

minor comments (1)

[Method section] Notation for the goodness-of-fit features and the auxiliary summaries could be introduced more explicitly with a single table or equation block to improve readability.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment point by point below, indicating the revisions we will make to improve clarity and evidentiary support.


Referee: Simulation section: the description of the data-generating process used to isolate the framework's contribution is insufficient. It is not stated whether the synthetic observations (including the missingness pattern) are drawn exactly from the expert-specified class-conditional densities. If they are, the GOF features are informative by construction while black-box classifiers must discover the same structure from few samples; this setup does not test the misspecification case flagged by the weakest assumption and required for the seismic-monitoring claim.

Authors: We agree that the simulation section requires a more explicit description of the data-generating process. In the revised manuscript, we will state that the synthetic observations and missingness patterns are generated directly from the expert-specified class-conditional densities. This controlled setup isolates the benefit of embedding expert knowledge as goodness-of-fit features, showing their value for interpretable classification when training data are scarce. We acknowledge that the simulation assumes an exactly correct expert model and therefore does not evaluate performance under misspecification. We will add a discussion paragraph addressing the implications of approximate expert models for the seismic application, where the class-conditional specification derives from physical domain knowledge, and will note that sensitivity checks could be explored in future work. Revision: partial.
Referee: Abstract and results presentation: the central performance claim is asserted without any numerical values, confidence intervals, sample sizes, or description of how the expert model was elicited or validated. This leaves the outperformance statement unsupported by visible evidence and prevents assessment of whether the advantage survives realistic departures from the expert model.

Authors: We accept the need for greater specificity. We will revise the abstract to report key quantitative results from the simulation (e.g., classification accuracy or AUC values with confidence intervals at the small sample sizes examined) and will briefly indicate that the expert model was constructed from established seismic monitoring principles. These additions will make the outperformance claim directly verifiable from the abstract. We will also expand the results section to include the requested details on expert-model elicitation and validation, while adding a short note on the dependence of performance on the fidelity of the expert specification. Revision: yes.
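
The data-generating setup debated in the first exchange can be made concrete with a small simulator. The sketch below is a hypothetical stand-in, not the paper's simulation: every distribution and parameter (`tau`, `sigma`, `threshold`, the uniform ranges, the 0.3 detection rate for invalid events) is an assumption made for illustration. The `misspec_shift` argument moves the true mean away from what the expert model would assume, so a value of 0.0 corresponds to exact specification and a nonzero value to the misspecified case the referee asks about.

```python
import math
import random

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def simulate_event(rng, valid, n_stations=10, misspec_shift=0.0,
                   tau=0.5, sigma=0.3, threshold=2.0):
    """Draw one synthetic event with informative missingness (HYPOTHETICAL model).

    For valid events, detection depends on the expected signal, so the
    missingness pattern itself carries class information.
    """
    magnitude = rng.uniform(3.0, 5.0)
    stations = []
    for _ in range(n_stations):
        atten = rng.uniform(0.5, 3.0)
        expert_mu = magnitude - atten         # what the expert model assumes
        true_mu = expert_mu + misspec_shift   # what actually generates the data
        if valid:
            p_detect = norm_cdf((true_mu - threshold) / tau)
            spread = sigma
        else:
            p_detect = 0.3                    # detections unrelated to the signal
            spread = 3.0 * sigma
        detected = rng.random() < p_detect
        value = rng.gauss(true_mu, spread) if detected else None
        stations.append({"expert_mu": expert_mu, "value": value})
    return {"label": int(valid), "magnitude": magnitude, "stations": stations}

def make_dataset(n_events, seed=0, misspec_shift=0.0):
    """Alternate valid and invalid events; the seed makes the draw reproducible."""
    rng = random.Random(seed)
    return [simulate_event(rng, valid=(i % 2 == 0), misspec_shift=misspec_shift)
            for i in range(n_events)]
```

Generating one dataset with `misspec_shift=0.0` and another with a nonzero shift, then scoring both against the unshifted expert model, would be one way to run the sensitivity check mentioned in the response.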

Circularity Check

0 steps flagged

No significant circularity; framework uses external expert prior and simulation is presented as isolating contribution


The paper encodes prior expert knowledge via a class-conditional model to build GOF features that quantify agreement with that model (including missingness), then feeds the features plus auxiliary summaries into a simple classifier. This structure does not reduce the output to the inputs by construction because the expert model is supplied externally rather than fitted to the classification data. The simulation claim is framed as isolating the framework's contribution and demonstrating outperformance (especially at small n), with no quoted equations or self-citation chains showing that the synthetic data generation forces the GOF features to be informative tautologically. The method is therefore self-contained against its stated external benchmark and prior-knowledge assumption.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the existence and accuracy of an expert-provided class-conditional model and on the assumption that goodness-of-fit scores derived from it remain informative under missingness.

axioms (1)

Domain assumption: An expert can specify a class-conditional model that captures the relevant statistical structure for at least one class.
The framework encodes prior knowledge through this model to construct the goodness-of-fit features.

invented entities (1)

Expert-guided class-conditional goodness-of-fit features (no independent evidence)
Purpose: Quantify agreement between observed (and missing) data and the expert model for use in a transparent classifier.
These features are newly constructed from the expert model and are central to the interpretable decision rule.



Reference graph

Works this paper leans on

4 extracted references · 4 canonical work pages

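[1] If $\bar{\ell}^{(1)}_{i,\mathrm{det}} \le -1.81$, predict $Y = 0$, except in the narrow region $\hat{M} \le 7.674$, $-2.997 < \bar{\ell}^{(1)}_{i,\mathrm{det}} \le -1.81$, $\bar{\ell}^{(1)}_{i,\mathrm{obs}} > -1.772$, where the tree predicts $Y = 1$.

[2] If $\bar{\ell}^{(1)}_{i,\mathrm{det}} > -1.81$ and $\bar{\ell}^{(1)}_{i,\mathrm{obs}} \le -1.871$, predict $Y = 0$.

[3] If $\bar{\ell}^{(1)}_{i,\mathrm{det}} > -1.81$, $\bar{\ell}^{(1)}_{i,\mathrm{obs}} > -1.871$, and $\bar{R}_i \le 0.134$, predict $Y = 1$.

[4] L-BFGS-B

If $\bar{\ell}^{(1)}_{i,\mathrm{det}} > -1.81$, $\bar{\ell}^{(1)}_{i,\mathrm{obs}} > -1.871$, and $\bar{R}_i > 0.134$, then use $\bar{\ell}^{(1)}_{i,\mathrm{nondet}}$: predict $Y = 0$ if $\bar{\ell}^{(1)}_{i,\mathrm{nondet}} \le -0.133$, and $Y = 1$ otherwise. In the representative simulation setting considered here, the learned tree is shallow and its top-level splits are dominated by the detection-fit score $\bar{\ell}^{(1)}_{i,\mathrm{det}}$ and the observed-value fit score $\bar{\ell}^{(1)}_{i,\mathrm{obs}}$. In par...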
