pith. machine review for the scientific record.

arxiv: 2604.17494 · v1 · submitted 2026-04-19 · 💻 cs.LG · cs.AI

Recognition: unknown

A Probabilistic Consensus-Driven Approach for Robust Counterfactual Explanations

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 06:51 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords counterfactual explanations · robustness · normalizing flows · ensemble methods · model stability · interpretable machine learning · black-box models

The pith

A conditional normalizing flow trained on ensemble consensus data produces counterfactual explanations whose robustness is set by a single minimum-agreement threshold.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to generate counterfactual explanations that stay valid when the underlying classifier is later replaced or retrained. It trains a generative model on the data distribution conditioned on the fraction of an ensemble that assigns a given point to each class. At generation time one scalar sets the lowest agreement level the explanation must satisfy. If the approach works, users obtain stable, plausible counterfactuals without retraining the generator or hand-tuning robustness penalties for each new model.
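To make the conditioning signal concrete: the value the generative model is conditioned on is simply the fraction of ensemble members that vote for the target class at a given point. The snippet below is a toy illustration with random linear "models"; the names and setup are ours, not the paper's implementation.

```python
# Toy illustration (not the paper's code): the conditioning signal is the
# fraction of an ensemble that assigns a point to the target class.
import numpy as np

rng = np.random.default_rng(0)

def make_toy_ensemble(n_models=10):
    """Each 'model' is a random linear rule: class 1 iff w . x + b > 0."""
    return [(rng.normal(size=2), rng.normal()) for _ in range(n_models)]

def agreement_fraction(x, ensemble, target_class=1):
    """Fraction of ensemble members assigning x to target_class."""
    votes = [((w @ x + b) > 0) == (target_class == 1) for w, b in ensemble]
    return sum(votes) / len(votes)

ensemble = make_toy_ensemble()
x = np.array([1.0, -0.5])
gamma = agreement_fraction(x, ensemble)   # a value in {0, 0.1, ..., 1.0}
```

Training pairs of (data point, agreement fraction) like these are what a conditional density model would be fit on; the flow itself is omitted here.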

Core claim

We propose a novel approach that jointly models the data distribution and the space of plausible model decisions to ensure robustness to model changes. Using a probabilistic consensus over a model ensemble, we train a conditional normalizing flow that captures the data density under varying levels of classifier agreement. At inference time, a single interpretable parameter controls the robustness level; it specifies the minimum fraction of models that should agree on the target class without retraining the generative model. Our method effectively pushes CFEs toward regions that are both plausible and stable across model changes.

What carries the argument

A conditional normalizing flow whose density is conditioned on the agreement level of an ensemble of classifiers; the agreement level acts as a controllable robustness knob at sampling time.
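A minimal stand-in for that sampling-time knob: below, a rejection sampler over a toy Gaussian data density plays the role of the trained conditional flow, keeping only candidates whose ensemble agreement meets the threshold γ. Everything here (the linear ensemble, `sample_counterfactuals`) is a hypothetical sketch, not the paper's method.

```python
# Sketch (assumed, not from the paper): rejection sampling stands in for
# sampling a conditional normalizing flow at a chosen agreement level gamma.
import numpy as np

rng = np.random.default_rng(1)

def ensemble_agreement(x, ensemble):
    """Fraction of toy linear models w voting class 1 at x."""
    return np.mean([(w @ x) > 0 for w in ensemble])

def sample_counterfactuals(ensemble, gamma, n=50, max_tries=20000):
    """Draw candidates from a toy data density and keep only those whose
    ensemble consensus meets the minimum agreement threshold gamma."""
    kept = []
    for _ in range(max_tries):
        x = rng.normal(size=2)                # stand-in data distribution
        if ensemble_agreement(x, ensemble) >= gamma:
            kept.append(x)
            if len(kept) == n:
                break
    return np.array(kept)

# Correlated ensemble: noisy copies of one boundary, so consensus regions exist.
ensemble = [np.array([1.0, 1.0]) + rng.normal(scale=0.3, size=2)
            for _ in range(11)]
cfes = sample_counterfactuals(ensemble, gamma=0.7)
```

Raising `gamma` shrinks the feasible region toward high-consensus areas, which mirrors the behavior the paper illustrates: generated counterfactuals drift away from the decision boundary as the required agreement grows.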

If this is right

  • Counterfactuals remain valid under small model perturbations without requiring model-specific robustness losses.
  • A single scalar controls the trade-off between plausibility and robustness at inference time.
  • The same trained flow can serve multiple downstream models without retraining.
  • Empirical robustness improves while other standard metrics such as proximity and sparsity remain competitive.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same conditioning trick could be applied to other generative explanation techniques such as prototype-based or decision-tree explanations.
  • If the ensemble is built from models trained on successive data snapshots, the method might also protect against distribution shift over time.
  • Testing on streaming or online-learning settings would reveal whether the fixed flow continues to produce stable counterfactuals as the underlying models evolve.

Load-bearing premise

An ensemble of models spans the space of decisions a future model might make, and the normalizing flow faithfully reproduces the joint density of data and agreement levels.

What would settle it

Generate counterfactuals at a high agreement threshold, then replace the original ensemble with an independent set of models trained on the same task; if the new counterfactuals lose validity at the same rate as those from non-robust baselines, the claimed stability gain is absent.
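The decisive experiment can be sketched end to end with toy linear "models" standing in for real classifiers and high-agreement points standing in for flow-generated counterfactuals. All names (`make_ensemble`, `validity_rate`, the 0.9 and 0.5–0.7 agreement bands) are illustrative choices, not from the paper.

```python
# Hedged sketch of the settling experiment: do high-agreement CFEs stay valid
# under an *independent* ensemble trained on the same task?
import numpy as np

rng = np.random.default_rng(2)

def make_ensemble(n_models, noise=0.3):
    """Toy ensemble: noisy copies of a shared linear boundary w0 . x > 0."""
    w0 = np.array([1.0, 1.0])
    return [w0 + rng.normal(scale=noise, size=2) for _ in range(n_models)]

def agreement(x, ensemble):
    """Fraction of ensemble members assigning x to the target class."""
    return np.mean([(w @ x) > 0 for w in ensemble])

def validity_rate(points, ensemble):
    """Fraction of points still given the target class by a majority vote."""
    return float(np.mean([agreement(x, ensemble) > 0.5 for x in points]))

train_ens = make_ensemble(10)   # ensemble the generator would be trained on
fresh_ens = make_ensemble(10)   # independent replacement models, same task

candidates = rng.normal(size=(5000, 2))
robust  = [x for x in candidates if agreement(x, train_ens) >= 0.9][:200]
fragile = [x for x in candidates if 0.5 < agreement(x, train_ens) <= 0.7][:200]

# The claimed stability gain predicts r_robust clearly above r_fragile;
# roughly equal rates would mean the gain is absent.
r_robust = validity_rate(robust, fresh_ens)
r_fragile = validity_rate(fragile, fresh_ens)
```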

Figures

Figures reproduced from arXiv: 2604.17494 by Jerzy Stefanowski, Maciej Zięba, Marcin Kostrzewa.

Figure 1: Illustration of the consensus-driven mechanism on a synthetic dataset. The background color for Class 1 encodes the values of ensemble consensus p(y = 1|x). As the required agreement level γ increases, the generated counterfactual moves away from the decision boundary into regions of higher classifier consensus, trading proximity for robustness.
Figure 2: Robustness–proximity tradeoff controlled by γ, shown for the MLP classifier. Each point corresponds to a different value of γ, ranging from 0.55 to 0.95. The left and right panels show Rob.(ret) and Rob.(bs), respectively. Error bars indicate standard deviations across 5 folds.
original abstract

Counterfactual explanations (CFEs) are essential for interpreting black-box models, yet they often become invalid when models are slightly changed. Existing methods for generating robust CFEs are often limited to specific types of models, require costly tuning, or inflexible robustness controls. We propose a novel approach that jointly models the data distribution and the space of plausible model decisions to ensure robustness to model changes. Using a probabilistic consensus over a model ensemble, we train a conditional normalizing flow that captures the data density under varying levels of classifier agreement. At inference time, a single interpretable parameter controls the robustness level; it specifies the minimum fraction of models that should agree on the target class without retraining the generative model. Our method effectively pushes CFEs toward regions that are both plausible and stable across model changes. Experimental results demonstrate that our approach achieves superior empirical robustness while also maintaining good performance across other evaluation measures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces a method for generating robust counterfactual explanations (CFEs) by training a conditional normalizing flow on the joint distribution of data and classifier agreement levels from an ensemble. At inference, a single scalar parameter sets the minimum agreement fraction required for the target class, directing CFEs toward data regions that are both plausible and stable under model variations. The authors claim this yields superior empirical robustness to model changes while preserving performance on standard CFE metrics, without requiring retraining or model-specific adaptations.

Significance. If the central claims hold, the approach would provide a flexible, interpretable mechanism for controlling robustness in CFEs via a single parameter, addressing limitations of prior methods that are model-specific or lack tunable robustness. The probabilistic consensus framing and use of conditional flows to model decision-space variation represent a coherent technical contribution, with potential value in settings where models evolve over time. However, the significance is tempered by the unverified assumption that ensemble agreement serves as a reliable proxy for generalization to unseen model distributions.

major comments (3)
  1. [Abstract, §4] The headline claim of robustness to model changes (Abstract and §4) rests on ensemble agreement as a proxy, yet no ablation, theoretical argument, or cross-distribution experiment demonstrates that agreement within the finite training ensemble predicts validity for models drawn from different architectures, training seeds, or data splits. This is load-bearing for the central contribution.
  2. [§3] §3 (method): the conditional normalizing flow is trained on density conditioned on agreement fraction, but no analysis or diagnostic is provided to confirm that the flow recovers the true conditional density without artifacts correlated with the agreement variable, which could inflate apparent robustness.
  3. [§4] §4 (experiments): the abstract asserts superior empirical robustness, but the provided text lacks details on ensemble construction, baseline implementations, statistical significance tests, or sensitivity to ensemble size; without these, the superiority claim cannot be assessed and may reflect post-hoc choices.
minor comments (2)
  1. [§3] Notation for the agreement fraction parameter could be clarified with an explicit equation linking it to the flow conditioning variable.
  2. [Figures] Figure captions should explicitly state the ensemble size and agreement threshold values used in each panel for reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to incorporate additional experiments, diagnostics, and details as suggested. These changes strengthen the presentation of our central claims without altering the core method.

point-by-point responses
  1. Referee: [Abstract, §4] The headline claim of robustness to model changes (Abstract and §4) rests on ensemble agreement as a proxy, yet no ablation, theoretical argument, or cross-distribution experiment demonstrates that agreement within the finite training ensemble predicts validity for models drawn from different architectures, training seeds, or data splits. This is load-bearing for the central contribution.

    Authors: We agree that validating ensemble agreement as a proxy for robustness to unseen models is critical. In the revised manuscript, we added an ablation study in §4 evaluating CFEs on held-out models with different architectures and training seeds. We also included a theoretical argument in §3 based on ensemble consistency and the law of large numbers, plus cross-distribution experiments on varied data splits to support generalization of the proxy. revision: yes

  2. Referee: [§3] §3 (method): the conditional normalizing flow is trained on density conditioned on agreement fraction, but no analysis or diagnostic is provided to confirm that the flow recovers the true conditional density without artifacts correlated with the agreement variable, which could inflate apparent robustness.

    Authors: We appreciate this observation. The revised §3 now includes diagnostic analyses: sample agreement level histograms compared to conditioning values, quantitative density recovery metrics across agreement fractions, and correlation checks between generated artifacts and the agreement variable. We also added a limitations discussion on normalizing flow approximations. revision: yes

  3. Referee: [§4] §4 (experiments): the abstract asserts superior empirical robustness, but the provided text lacks details on ensemble construction, baseline implementations, statistical significance tests, or sensitivity to ensemble size; without these, the superiority claim cannot be assessed and may reflect post-hoc choices.

    Authors: We acknowledge the original submission omitted key experimental details. The revised §4 now provides: full ensemble construction specifications (architectures, training procedures, sizes), baseline implementation and hyperparameter details, statistical significance tests with p-values, and sensitivity analysis over ensemble sizes (5–20 models). These ensure the robustness claims are fully verifiable. revision: yes

Circularity Check

0 steps flagged

No significant circularity; method defines robustness via external ensemble parameter

full rationale

The paper trains a conditional normalizing flow on data density conditioned on varying levels of ensemble classifier agreement and uses a single scalar at inference to set the minimum agreement fraction for robustness. This is a direct generative modeling construction with an externally interpretable control parameter; no claimed prediction or result reduces by construction to a fitted input, self-citation chain, or renamed ansatz. The central robustness claim rests on the modeling choice and empirical results rather than tautological equivalence to inputs.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that model ensembles capture plausible variations and that normalizing flows can model the relevant conditional densities; no free parameters are explicitly fitted in the abstract description beyond the inference-time threshold.

free parameters (1)
  • minimum agreement fraction
    User-specified threshold at inference that controls robustness level; may require domain-specific tuning not detailed in abstract.
axioms (2)
  • domain assumption An ensemble of models represents the space of plausible model decisions under slight changes.
    Invoked to justify the probabilistic consensus for robustness.
  • domain assumption Conditional normalizing flows can accurately capture data density conditioned on classifier agreement levels.
    Core modeling assumption for the generative component.

pith-pipeline@v0.9.0 · 5452 in / 1377 out tokens · 33176 ms · 2026-05-10T06:51:10.104302+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

19 extracted references · 12 canonical work pages · 1 internal anchor

  1. [1] Artelt, A., Hammer, B.: Convex density constraints for computing plausible counterfactual explanations. In: Artificial Neural Networks and Machine Learning – ICANN 2020: 29th International Conference on Artificial Neural Networks, Bratislava, Slovakia, September 15–18, 2020, Proceedings, Part I, pp. 353–365. Springer-Verlag, Berlin, Heidelberg (2020). ...

  2. [2] Dutta, S., Long, J., Mishra, S., Tilli, C., Magazzeni, D.: Robust counterfactual explanations for tree-based ensembles. In: Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., Sabato, S. (eds.) Proceedings of the 39th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 162, pp. 5742–5756. PMLR (17–23 Jul 2022). ...

  3. [3] Ferrario, A., Loi, M.: The robustness of counterfactual explanations over time. IEEE Access, pp. 1–1 (08 2022). https://doi.org/10.1109/ACCESS.2022.3196917

  4. [4] Furman, O., Movsum-zada, U., Marszalek, P., Zieba, M., Smieja, M.: Dicoflex: Model-agnostic diverse counterfactuals with flexible control. CoRR abs/2505.23700 (2025). https://doi.org/10.48550/arXiv.2505.23700

  5. [5] Guidotti, R.: Counterfactual explanations and how to find them: literature review and benchmarking. Data Min. Knowl. Discov. 38(5), 2770–2824 (2022). https://doi.org/10.1007/s10618-022-00831-6

  6. [6] Jiang, J., Lan, J., Leofante, F., Rago, A., Toni, F.: Provably robust and plausible counterfactual explanations for neural networks via robust optimisation. In: Yanıkoğlu, B., Buntine, W. (eds.) Proceedings of the 15th Asian Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 222, pp. 582–597. PMLR (11–14 Nov 2024). ...

  7. [7] Jiang, J., Leofante, F., Rago, A., Toni, F.: Interval abstractions for robust counterfactual explanations. Artificial Intelligence 336, 104218 (2024). https://doi.org/10.1016/j.artint.2024.104218

  8. [8] Jiang, J., Leofante, F., Rago, A., Toni, F.: Robust counterfactual explanations in machine learning: a survey. In: Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI '24 (2024). https://doi.org/10.24963/ijcai.2024/894

  9. [9] Marzari, L., Leofante, F., Cicalese, F., Farinelli, A.: Rigorous probabilistic guarantees for robust counterfactual explanations. In: ECAI 2024, pp. 1059–1066. IOS Press (2024)

  10. [10] Mertes, S., Huber, T., Weitz, K., Heimerl, A., André, E.: GANterfactual—counterfactual explanations for medical non-experts using generative adversarial learning. Frontiers in Artificial Intelligence, Volume 5 (2022). https://doi.org/10.3389/frai.2022.825565

  11. [11] Nguyen, T.D.H., Bui, N., Nguyen, D., Yue, M.C., Nguyen, V.A.: Robust Bayesian recourse. In: Proceedings of the Thirty-Eighth Conference on Uncertainty in Artificial Intelligence. Proceedings of Machine Learning Research, PMLR (2022)

  12. [12] Ozaki, Y., Watanabe, S., Yanase, T.: OptunaHub: A platform for black-box optimization. arXiv preprint arXiv:2510.02798 (2025)

  13. [13] Papamakarios, G., Pavlakou, T., Murray, I.: Masked autoregressive flow for density estimation. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 2335–2344. NIPS'17, Curran Associates Inc., Red Hook, NY, USA (2017)

  14. [14] Rezende, D.J., Mohamed, S.: Variational inference with normalizing flows. In: Proceedings of the 32nd International Conference on Machine Learning - Volume 37, pp. 1530–1538. ICML'15, JMLR.org (2015)

  15. [15] Stępka, I., Stefanowski, J., Lango, M.: Counterfactual explanations with probabilistic guarantees on their robustness to model change. In: Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.1, pp. 1277–1288. KDD '25, Association for Computing Machinery, New York, NY, USA (2025). https://doi.org/10.1145/3690624.370930...

  16. [16] Upadhyay, S., Joshi, S., Lakkaraju, H.: Towards robust and reliable algorithmic recourse. In: Proceedings of the 35th International Conference on Neural Information Processing Systems. NIPS '21, Curran Associates Inc., Red Hook, NY, USA (2021)

  17. [17] Verma, S., Boonsanong, V., Hoang, M., Hines, K., Dickerson, J., Shah, C.: Counterfactual explanations and algorithmic recourses for machine learning: A review. ACM Comput. Surv. 56(12) (Oct 2024). https://doi.org/10.1145/3677119

  18. [18] Wachter, S., Mittelstadt, B., Russell, C.: Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harvard Journal of Law & Technology 31, 841–887 (04 2018). https://doi.org/10.2139/ssrn.3063289

  19. [19] Wielopolski, P., Furman, O., Stefanowski, J., Zięba, M.: Probabilistically Plausible Counterfactual Explanations with Normalizing Flows. IOS Press (Oct 2024). https://doi.org/10.3233/faia240584