
arXiv: 2506.06455 · v1 · submitted 2025-06-06 · cs.LG · cs.AI · stat.ML

WISCA: A Consensus-Based Approach to Harmonizing Interpretability in Tabular Datasets

Pith reviewed 2026-05-19 10:17 UTC · model grok-4.3

classification cs.LG · cs.AI · stat.ML
keywords WISCA · interpretability · consensus · tabular data · machine learning · explanations · synthetic datasets · model-agnostic

The pith

WISCA consensus aligns with the most reliable individual interpretability method on synthetic tabular data

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces WISCA as a new consensus method that combines outputs from multiple model-agnostic interpretability techniques applied to machine learning models on tabular data. It weights attributions by class probabilities and normalizes them to produce a single explanation when individual methods disagree. Experiments trained six models on six synthetic datasets with known ground truths, then compared WISCA against other consensus approaches. The new method aligned more closely with the single best individual technique than alternatives did. A sympathetic reader would care because conflicting explanations undermine trust in models used for science or high-stakes decisions.

Core claim

WISCA integrates class probability and normalized attributions to generate consensus explanations from various model-agnostic interpretability techniques. When applied to six ML models trained on six synthetic datasets with known ground truths, WISCA consistently aligned with the most reliable individual method, demonstrating the value of robust consensus strategies in improving explanation reliability.

What carries the argument

WISCA (Weighted Scaled Consensus Attributions), which weights and scales attributions using class probabilities to harmonize conflicting outputs from multiple interpretability algorithms.
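
The paper's exact weighting and scaling formula is not given on this page, so the following is only a minimal sketch of one way a probability-weighted, normalized consensus could be computed. The function names, the array layout, and the choice to sum weighted attributions over methods and samples are all assumptions made for illustration, not the authors' definition of WISCA.

```python
from typing import List


def normalize(attr: List[float]) -> List[float]:
    """Scale absolute attributions so they sum to 1 (an all-zero vector stays zero)."""
    mags = [abs(a) for a in attr]
    total = sum(mags)
    return [m / total for m in mags] if total > 0 else mags


def weighted_consensus(attributions, class_probs) -> List[float]:
    """Hypothetical probability-weighted consensus (not the paper's formula).

    attributions[m][s][f]: attribution of feature f for sample s from method m.
    class_probs[s]: probability the model assigns to its prediction for sample s.
    """
    n_features = len(attributions[0][0])
    scores = [0.0] * n_features
    for method in attributions:
        for sample_attr, prob in zip(method, class_probs):
            # Normalize per sample so methods with different scales are comparable,
            # then let confident predictions contribute more.
            for f, value in enumerate(normalize(sample_attr)):
                scores[f] += prob * value
    return normalize(scores)


# Two methods, two samples, three features (made-up numbers).
attributions = [
    [[2.0, 1.0, 1.0], [0.0, 3.0, 1.0]],
    [[1.0, 0.0, 1.0], [1.0, 1.0, 2.0]],
]
class_probs = [0.9, 0.6]
consensus = weighted_consensus(attributions, class_probs)
print(consensus)
```

Normalizing before weighting keeps any single explainer from dominating purely because its raw attribution scale is larger; whether the paper does the same is not stated here.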

Load-bearing premise

That alignment with the single most reliable individual method on synthetic data with known ground truth means the consensus explanations are higher quality in general.

What would settle it

Apply WISCA to a new synthetic dataset with known ground truth and observe whether it still aligns with the top individual method, or test it on real tabular data and measure agreement with downstream task performance or expert judgment.
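
A check of this kind needs a score against the known ground truth. The figure captions mention a hit rate metric but this page does not define it, so the sketch below uses one plausible definition (the share of the top-k ranked features that belong to the expected set). The function name, the feature names, and the numbers are hypothetical.

```python
def hit_rate(importances, expected, k=None):
    """Fraction of the top-k features (by absolute importance) that are expected.

    importances: dict mapping feature name -> importance score.
    expected: iterable of feature names known to drive the synthetic target.
    k: number of top features to inspect; defaults to the size of the expected set.
    """
    expected = set(expected)
    k = len(expected) if k is None else k
    if k <= 0:
        raise ValueError("k must be positive")
    ranked = sorted(importances, key=lambda name: abs(importances[name]), reverse=True)
    return sum(name in expected for name in ranked[:k]) / k


# Ground truth for a made-up synthetic dataset: only x1 and x2 matter.
ground_truth = {"x1", "x2"}
explanations = {
    "method_a": {"x1": 0.5, "x2": 0.3, "x3": 0.1, "x4": 0.1},
    "method_b": {"x1": 0.4, "x3": 0.35, "x2": 0.2, "x4": 0.05},
}
scores = {name: hit_rate(imp, ground_truth) for name, imp in explanations.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Scoring the consensus output with the same function would show whether it keeps pace with the top individual method on the new dataset.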

Figures

Figures reproduced from arXiv: 2506.06455 by Antonio Jesús Banegas-Luna, Carlos Martínez-Cortés, Horacio Pérez-Sánchez.

Figure 1: Families of functions that can implement the classification correction factor.
Figure 2: Heatmaps summarizing model performance metrics across datasets.
Figure 3: Score of the consensus functions measured in terms of the hit rate metric.
Figure 4: Average distance between expected and non-expected features across the datasets.
Figure 5: Score of the interpretability algorithms measured in terms of the hit rate metric.
Figure 6: Spearman correlation between WISCA and the interpretability algorithms.
Figure 7: Jensen-Shannon divergence between WISCA and the interpretability algorithms.
Figure 8: Hit rate score of LR and WISCA.
Figure 9: Consensus explanations returned by WISCA on the cervical cancer risk dataset.
Figure 10: Consensus explanations returned by WISCA on the wine dataset.
Figure 11: Consensus explanations returned by WISCA on the bike rental dataset.
Original abstract

While predictive accuracy is often prioritized in machine learning (ML) models, interpretability remains essential in scientific and high-stakes domains. However, diverse interpretability algorithms frequently yield conflicting explanations, highlighting the need for consensus to harmonize results. In this study, six ML models were trained on six synthetic datasets with known ground truths, utilizing various model-agnostic interpretability techniques. Consensus explanations were generated using established methods and a novel approach: WISCA (Weighted Scaled Consensus Attributions), which integrates class probability and normalized attributions. WISCA consistently aligned with the most reliable individual method, underscoring the value of robust consensus strategies in improving explanation reliability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces WISCA (Weighted Scaled Consensus Attributions), a novel consensus method that combines class probabilities with normalized attributions from multiple model-agnostic interpretability techniques. The authors train six ML models on six synthetic tabular datasets with known ground-truth feature attributions, generate explanations via established methods and WISCA, and report that WISCA consistently aligns with whichever individual method recovers the ground truth most reliably.

Significance. A well-supported consensus procedure for tabular interpretability could help reconcile conflicting explanations in scientific and high-stakes settings. The use of synthetic data with explicit ground truth provides a controlled test bed, which is a methodological strength. However, the absence of quantitative metrics, statistical tests, or results on real data limits the immediate significance of the reported findings.

major comments (2)
  1. [Abstract] Abstract: the central empirical claim that WISCA 'consistently aligned with the most reliable individual method' is stated without any quantitative metrics, error bars, statistical tests, or description of how the 'most reliable' method was identified on the six synthetic datasets. This makes it impossible to assess the magnitude or reliability of the reported alignment.
  2. [Evaluation] Evaluation section: the argument that alignment with the single best individual explainer on these particular synthetics constitutes improved explanation quality rests on an untested premise; no stability, fidelity, or human-grounded metrics on non-synthetic tabular data are provided to test whether the consensus adds robustness beyond simply reproducing the best individual method.
minor comments (2)
  1. [Method] Clarify the precise formula for weighting and scaling in WISCA (e.g., how class probability is combined with normalized attributions) and whether any hyperparameters are involved.
  2. [Results] Add a table or figure summarizing the quantitative agreement scores (e.g., rank correlation or attribution error) between WISCA and each individual method across the six datasets.
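
The agreement scores requested in the second minor comment could be computed along the following lines. This is a hedged sketch using only the Python standard library: Spearman rank correlation with average ranks for ties, and Jensen-Shannon divergence in bits. The two metrics are chosen because the figure captions name them; the example vectors are invented, and the paper's own computation may differ.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def spearman(a, b):
    """Pearson correlation of the rank vectors (undefined for constant input)."""
    ra, rb = average_ranks(a), average_ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = math.sqrt(sum((x - ma) ** 2 for x in ra))
    sb = math.sqrt(sum((y - mb) ** 2 for y in rb))
    return cov / (sa * sb)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits; p and q must each sum to 1."""
    def kl(u, v):
        return sum(x * math.log2(x / y) for x, y in zip(u, v) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Made-up normalized attribution vectors over three features.
consensus_vec = [0.35, 0.275, 0.375]
method_vec = [0.5, 0.125, 0.375]
rho = spearman(consensus_vec, method_vec)
jsd = js_divergence(consensus_vec, method_vec)
print(rho, jsd)
```

Rank correlation captures agreement in feature ordering, while the divergence is sensitive to how attribution mass is distributed; reporting both per dataset would give the table the referee asks for.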

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below, indicating where we agree and will revise the manuscript accordingly.

  1. Referee: [Abstract] Abstract: the central empirical claim that WISCA 'consistently aligned with the most reliable individual method' is stated without any quantitative metrics, error bars, statistical tests, or description of how the 'most reliable' method was identified on the six synthetic datasets. This makes it impossible to assess the magnitude or reliability of the reported alignment.

    Authors: We agree that the abstract would benefit from greater specificity. The manuscript identifies the most reliable method by direct comparison of each explainer's feature attributions against the known ground-truth importances in the synthetic datasets. In the revision we will update the abstract to report key quantitative results (e.g., mean alignment percentage or rank correlation with the best individual method across the six datasets) and note that reliability was assessed via fidelity to ground truth. revision: yes

  2. Referee: [Evaluation] Evaluation section: the argument that alignment with the single best individual explainer on these particular synthetics constitutes improved explanation quality rests on an untested premise; no stability, fidelity, or human-grounded metrics on non-synthetic tabular data are provided to test whether the consensus adds robustness beyond simply reproducing the best individual method.

    Authors: The synthetic setting with explicit ground truth is used precisely to enable objective identification of the best individual method; WISCA is shown to match that method's output without access to the ground truth. This constitutes a controlled test of whether consensus can reliably recover high-fidelity explanations. We acknowledge that additional evidence on real data would be valuable and will expand the evaluation section to include stability metrics (consistency of attributions across repeated model trainings) while adding an explicit limitations paragraph on the absence of real-world tabular results. revision: partial
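
The rebuttal does not say how the promised stability metric would be computed. As one possible reading, consistency across repeated trainings could be measured as the mean pairwise Jaccard overlap of the top-k features from each run. The sketch below assumes that definition; the function names and the example importances are hypothetical.

```python
from itertools import combinations


def top_k(importances, k):
    """Indices of the k features with the largest absolute importance."""
    ranked = sorted(range(len(importances)), key=lambda i: abs(importances[i]), reverse=True)
    return set(ranked[:k])


def stability(runs, k):
    """Mean pairwise Jaccard overlap of top-k feature sets across training runs."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        raise ValueError("need at least two runs")
    total = 0.0
    for a, b in pairs:
        ta, tb = top_k(a, k), top_k(b, k)
        total += len(ta & tb) / len(ta | tb)
    return total / len(pairs)


# Importances from three retrainings with different seeds (made-up numbers).
runs = [
    [0.5, 0.3, 0.1, 0.1],
    [0.45, 0.35, 0.15, 0.05],
    [0.4, 0.1, 0.3, 0.2],
]
score = stability(runs, k=2)
print(score)
```

A score of 1 means every retraining selects the same top-k features; comparing this value for the consensus against each individual explainer would indicate whether consensus adds robustness.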

Circularity Check

0 steps flagged

No circularity: empirical alignment on synthetic ground-truth data


The paper evaluates the proposed WISCA consensus method by training models on six synthetic datasets with known ground-truth feature attributions, applying multiple model-agnostic explainers, and measuring alignment of the consensus output with the single best individual explainer. This constitutes a direct empirical comparison against an external benchmark rather than any derivation, fitting step, or self-referential definition. No equations, self-citations, uniqueness theorems, or ansatzes are invoked that reduce the central claim to the paper's own inputs by construction. The evaluation is therefore self-contained and falsifiable via the provided synthetic ground truths.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the premise that synthetic datasets with known ground truths are sufficient proxies for evaluating explanation quality and that alignment with one reliable method indicates overall improvement.

axioms (1)
  • domain assumption: Synthetic datasets with known ground truths accurately reflect the behavior of interpretability methods on real tabular data.
    Invoked when results on synthetic data are used to claim improved reliability of explanations.
invented entities (1)
  • WISCA (Weighted Scaled Consensus Attributions) · no independent evidence
    purpose: To integrate class probability and normalized attributions into a consensus explanation.
    New method introduced in the abstract; no independent evidence outside the paper's own experiments is provided.


