WISCA: A Consensus-Based Approach to Harmonizing Interpretability in Tabular Datasets

Antonio Jesús Banegas-Luna; Carlos Martínez-Cortés; Horacio Pérez-Sánchez

arXiv: 2506.06455 · v1 · submitted 2025-06-06 · cs.LG · cs.AI · stat.ML

Pith reviewed 2026-05-19 10:17 UTC · model grok-4.3

keywords: WISCA · interpretability · consensus · tabular data · machine learning · explanations · synthetic datasets · model-agnostic

The pith

WISCA consensus aligns with the most reliable individual interpretability method on synthetic tabular data

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces WISCA as a new consensus method that combines outputs from multiple model-agnostic interpretability techniques applied to machine learning models on tabular data. It weights attributions by class probabilities and normalizes them to produce a single explanation when individual methods disagree. Experiments trained six models on six synthetic datasets with known ground truths, then compared WISCA against other consensus approaches. The new method aligned more closely with the single best individual technique than alternatives did. A sympathetic reader would care because conflicting explanations undermine trust in models used for science or high-stakes decisions.

Core claim

WISCA integrates class probability and normalized attributions to generate consensus explanations from various model-agnostic interpretability techniques. When applied to six ML models trained on six synthetic datasets with known ground truths, WISCA consistently aligned with the most reliable individual method, demonstrating the value of robust consensus strategies in improving explanation reliability.

What carries the argument

WISCA (Weighted Scaled Consensus Attributions), which weights and scales attributions using class probabilities to harmonize conflicting outputs from multiple interpretability algorithms.
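The weighting-and-scaling idea can be illustrated with a minimal sketch. This is not the authors' implementation: the only parts taken from this page are the parabolic factor π(p) = 1 − 4p(1 − p) and the notion of combining normalized attributions with class probability. How the factor enters the paper's formula is not fully recoverable from the excerpt quoted here, so the code assumes it acts as a per-sample weight, and it assumes attributions are normalized by their absolute sum. The names `parabolic_factor`, `scale_abs`, and `consensus` are hypothetical.

```python
from typing import Dict, List


def parabolic_factor(p: float) -> float:
    """pi(p) = 1 - 4p(1 - p): 0 at p = 0.5, 1 at p = 0 or p = 1."""
    return 1.0 - 4.0 * p * (1.0 - p)


def scale_abs(attr: List[float]) -> List[float]:
    """Scale absolute attributions so they sum to 1 (assumed normalization)."""
    mags = [abs(a) for a in attr]
    total = sum(mags)
    return [m / total for m in mags] if total > 0 else [0.0] * len(mags)


def consensus(attributions: Dict[str, List[List[float]]],
              probs: List[float]) -> List[float]:
    """Weighted mean of scaled attributions over explainers and samples.

    attributions[method][sample][feature] holds raw attributions;
    probs[sample] is the predicted-class probability for that sample.
    Using pi(p) as a multiplicative weight is an assumption of this sketch.
    """
    n_features = len(next(iter(attributions.values()))[0])
    totals = [0.0] * n_features
    weight_sum = 0.0
    for per_sample in attributions.values():
        for attr, p in zip(per_sample, probs):
            w = parabolic_factor(p)
            for j, v in enumerate(scale_abs(attr)):
                totals[j] += w * v
            weight_sum += w
    return [t / weight_sum for t in totals] if weight_sum > 0 else totals
```

Under this assumed weighting, a sample predicted with probability 0.5 contributes nothing to the consensus, while a fully confident prediction contributes with weight 1.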

Load-bearing premise

That alignment with the single most reliable individual method on synthetic data with known ground truth means the consensus explanations are higher quality in general.

What would settle it

Apply WISCA to a new synthetic dataset with known ground truth and observe whether it still aligns with the top individual method, or test it on real tabular data and measure agreement with downstream task performance or expert judgment.
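A new synthetic test of this kind needs a dataset whose relevant features are known by construction. The sketch below is one hypothetical generator, not one of the paper's six datasets (their generators are not described on this page): features are independent standard normal draws and the binary label depends only on the sum of the columns listed in `informative`.

```python
import random
from typing import List, Sequence, Set, Tuple


def make_synthetic(n_samples: int, n_features: int,
                   informative: Sequence[int],
                   seed: int = 0) -> Tuple[List[List[float]], List[int], Set[int]]:
    """Return (X, y, ground_truth) where only `informative` columns drive y."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
         for _ in range(n_samples)]
    # Label is 1 when the sum of the informative features is positive.
    y = [1 if sum(row[j] for j in informative) > 0 else 0 for row in X]
    return X, y, set(informative)
```

Any explainer or consensus function can then be scored by how highly it ranks the columns in the returned ground-truth set.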

Figures

Figures reproduced from arXiv: 2506.06455 by Antonio Jesús Banegas-Luna, Carlos Martínez-Cortés, Horacio Pérez-Sánchez.

**Figure 2.** Heatmaps summarizing model performance metrics across datasets. [PITH_FULL_IMAGE:figures/full_fig_p012_2.png]

**Figure 3.** Score of the consensus functions measured in terms of the hit rate metric. [PITH_FULL_IMAGE:figures/full_fig_p013_3.png]

**Figure 4.** Average distance between expected and non-expected features across the datasets. [PITH_FULL_IMAGE:figures/full_fig_p014_4.png]

**Figure 5.** Score of the interpretability algorithms measured in terms of the hit rate metric. [PITH_FULL_IMAGE:figures/full_fig_p015_5.png]

**Figure 6.** Spearman correlation between WISCA and the interpretability algorithms. [PITH_FULL_IMAGE:figures/full_fig_p016_6.png]

**Figure 7.** Jensen-Shannon divergence between WISCA and the interpretability algorithms. [PITH_FULL_IMAGE:figures/full_fig_p017_7.png]

**Figure 8.** Hit rate score of LR and WISCA.

**Figure 9.** Consensus explanations returned by WISCA on the cervical cancer risk dataset. [PITH_FULL_IMAGE:figures/full_fig_p020_9.png]

**Figure 10.** Consensus explanations returned by WISCA on the wine dataset. [PITH_FULL_IMAGE:figures/full_fig_p021_10.png]

**Figure 11.** Consensus explanations returned by WISCA on the bike rental dataset. [PITH_FULL_IMAGE:figures/full_fig_p022_11.png]

Original abstract

While predictive accuracy is often prioritized in machine learning (ML) models, interpretability remains essential in scientific and high-stakes domains. However, diverse interpretability algorithms frequently yield conflicting explanations, highlighting the need for consensus to harmonize results. In this study, six ML models were trained on six synthetic datasets with known ground truths, utilizing various model-agnostic interpretability techniques. Consensus explanations were generated using established methods and a novel approach: WISCA (Weighted Scaled Consensus Attributions), which integrates class probability and normalized attributions. WISCA consistently aligned with the most reliable individual method, underscoring the value of robust consensus strategies in improving explanation reliability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

WISCA is a straightforward weighting tweak for consensus on tabular explanations that matches the best single method on synthetics but offers no evidence it improves reliability outside those controlled cases.

The main point is that WISCA weights and scales attributions using class probabilities to produce a consensus that, on their six synthetic datasets with known ground truth, ends up close to whichever individual method recovered the features best. They trained six models, ran standard model-agnostic explainers, and compared several consensus baselines against this new one. The result is presented as support for using robust consensus to reduce disagreement in explanations. That is the actual contribution here: a simple, practical recipe for tabular settings where methods conflict. It is easy to implement and directly targets the harmonization problem the abstract describes. For practitioners who already run multiple explainers on tables, this gives one more option to try without having to choose a single method in advance. The paper does a clean job of setting up the synthetic testbed and showing the alignment result.

The soft spots are clear and proportionate. Everything rests on synthetic data with known ground truth; there are no results on real tabular datasets, no stability checks across seeds or perturbations, and no human or downstream-task evaluations. The claim that this improves explanation quality therefore depends on the untested step that matching the best synthetic performer is the right target when ground truth is unavailable. No quantitative metrics, confidence intervals, or statistical comparisons appear in the summary, which makes it hard to judge the size of any gain. The method itself is not a deep derivation, just a scaling and weighting step on top of existing attributions. Citations follow the usual interpretability references without obvious gaps or over-claiming.

This paper is for applied researchers or engineers working on tabular models who need a quick way to combine conflicting explanations. A reader looking for a new baseline or practical trick could extract value from the method description and the synthetic comparison. It deserves peer review because the core idea is well-scoped and testable; referees could reasonably ask for real-data experiments and clearer metrics without the work being incoherent on its own terms.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces WISCA (Weighted Scaled Consensus Attributions), a novel consensus method that combines class probabilities with normalized attributions from multiple model-agnostic interpretability techniques. The authors train six ML models on six synthetic tabular datasets with known ground-truth feature attributions, generate explanations via established methods and WISCA, and report that WISCA consistently aligns with whichever individual method recovers the ground truth most reliably.

Significance. A well-supported consensus procedure for tabular interpretability could help reconcile conflicting explanations in scientific and high-stakes settings. The use of synthetic data with explicit ground truth provides a controlled test bed, which is a methodological strength. However, the absence of quantitative metrics, statistical tests, or results on real data limits the immediate significance of the reported findings.

major comments (2)

[Abstract] The central empirical claim that WISCA 'consistently aligned with the most reliable individual method' is stated without any quantitative metrics, error bars, statistical tests, or description of how the 'most reliable' method was identified on the six synthetic datasets. This makes it impossible to assess the magnitude or reliability of the reported alignment.
[Evaluation] The argument that alignment with the single best individual explainer on these particular synthetics constitutes improved explanation quality rests on an untested premise; no stability, fidelity, or human-grounded metrics on non-synthetic tabular data are provided to test whether the consensus adds robustness beyond simply reproducing the best individual method.

minor comments (2)

[Method] Clarify the precise formula for weighting and scaling in WISCA (e.g., how class probability is combined with normalized attributions) and whether any hyperparameters are involved.
[Results] Add a table or figure summarizing the quantitative agreement scores (e.g., rank correlation or attribution error) between WISCA and each individual method across the six datasets.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below, indicating where we agree and will revise the manuscript accordingly.

Referee: [Abstract] The central empirical claim that WISCA 'consistently aligned with the most reliable individual method' is stated without any quantitative metrics, error bars, statistical tests, or description of how the 'most reliable' method was identified on the six synthetic datasets. This makes it impossible to assess the magnitude or reliability of the reported alignment.

Authors: We agree that the abstract would benefit from greater specificity. The manuscript identifies the most reliable method by direct comparison of each explainer's feature attributions against the known ground-truth importances in the synthetic datasets. In the revision we will update the abstract to report key quantitative results (e.g., mean alignment percentage or rank correlation with the best individual method across the six datasets) and note that reliability was assessed via fidelity to ground truth.
Revision: yes

Referee: [Evaluation] The argument that alignment with the single best individual explainer on these particular synthetics constitutes improved explanation quality rests on an untested premise; no stability, fidelity, or human-grounded metrics on non-synthetic tabular data are provided to test whether the consensus adds robustness beyond simply reproducing the best individual method.

Authors: The synthetic setting with explicit ground truth is used precisely to enable objective identification of the best individual method; WISCA is shown to match that method's output without access to the ground truth. This constitutes a controlled test of whether consensus can reliably recover high-fidelity explanations. We acknowledge that additional evidence on real data would be valuable and will expand the evaluation section to include stability metrics (consistency of attributions across repeated model trainings) while adding an explicit limitations paragraph on the absence of real-world tabular results.
Revision: partial
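One way to make the promised stability metric concrete is the mean pairwise overlap of the top-k features across repeated trainings. The sketch below is an illustrative choice, not a metric defined in the paper or the rebuttal: each run supplies one attribution vector, features are ranked by absolute attribution, and overlap is measured with the Jaccard index. The function names and the choice of Jaccard are assumptions.

```python
from itertools import combinations
from typing import List, Sequence, Set


def top_k(attr: Sequence[float], k: int) -> Set[int]:
    """Indices of the k features with the largest absolute attribution."""
    order = sorted(range(len(attr)), key=lambda j: abs(attr[j]), reverse=True)
    return set(order[:k])


def topk_stability(runs: List[Sequence[float]], k: int) -> float:
    """Mean pairwise Jaccard overlap of top-k feature sets across runs."""
    if k <= 0:
        raise ValueError("k must be positive")
    pairs = list(combinations(runs, 2))
    if not pairs:
        return 1.0  # a single run is trivially consistent with itself
    scores = []
    for a, b in pairs:
        sa, sb = top_k(a, k), top_k(b, k)
        scores.append(len(sa & sb) / len(sa | sb))
    return sum(scores) / len(scores)
```

A value of 1.0 means every retraining yields the same top-k set; values near 0 mean the highlighted features change from run to run.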

Circularity Check

0 steps flagged

No circularity: empirical alignment on synthetic ground-truth data

The paper evaluates the proposed WISCA consensus method by training models on six synthetic datasets with known ground-truth feature attributions, applying multiple model-agnostic explainers, and measuring alignment of the consensus output with the single best individual explainer. This constitutes a direct empirical comparison against an external benchmark rather than any derivation, fitting step, or self-referential definition. No equations, self-citations, uniqueness theorems, or ansatzes are invoked that reduce the central claim to the paper's own inputs by construction. The evaluation is therefore self-contained and falsifiable via the provided synthetic ground truths.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the premise that synthetic datasets with known ground truths are sufficient proxies for evaluating explanation quality and that alignment with one reliable method indicates overall improvement.

axioms (1)

Domain assumption: Synthetic datasets with known ground truths accurately reflect the behavior of interpretability methods on real tabular data.
Invoked when results on synthetic data are used to claim improved reliability of explanations.

invented entities (1)

WISCA (Weighted Scaled Consensus Attributions) · no independent evidence
Purpose: To integrate class probability and normalized attributions into a consensus explanation.
New method introduced in the abstract; no independent evidence outside the paper's own experiments is provided.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: WISCA Formulation. WISCA ... $\phi(f) = \sum \phi'(f) / (N \cdot \pi(s, m))$ ... parabolic correction factor $\pi(p) = 1 - 4p(1 - p)$

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat.equivNat · unclear
Paper passage: hit rate metric $P = \dfrac{\sum_i h(x_i)/i}{\sum_i 1/i}$

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: six synthetic datasets ... known ground truths ... Expected Explanation column
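The hit rate metric quoted on this page, P = ∑ h(x_i)/i / ∑ 1/i, can be sketched as a rank-weighted score. The reading used here is an assumption, since the page gives only the formula: i is the 1-based rank of a feature in the explanation, h(x_i) is 1 when the feature at rank i belongs to the expected (ground-truth) set and 0 otherwise, and both sums run over all ranked features.

```python
from typing import Hashable, Sequence, Set


def hit_rate(ranked_features: Sequence[Hashable],
             expected: Set[Hashable]) -> float:
    """Rank-weighted hit rate: sum(h(x_i)/i) / sum(1/i) over ranks i = 1..n.

    ranked_features is ordered from most to least important.
    """
    n = len(ranked_features)
    if n == 0:
        return 0.0
    # A hit at rank i contributes 1/i, so hits near the top count more.
    hits = sum(1.0 / i
               for i, feat in enumerate(ranked_features, start=1)
               if feat in expected)
    norm = sum(1.0 / i for i in range(1, n + 1))
    return hits / norm
```

With this reading, an explanation that ranks every feature as expected scores 1, and a single expected feature placed first among three scores 1 / (1 + 1/2 + 1/3) = 6/11.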

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
