pith.

arxiv: 2606.06334 · v1 · pith:KEPQJHCH · submitted 2026-06-04 · 💻 cs.LG

Quantifying the Privacy of Counterfactuals by Leveraging Membership Inference Attacks Against Synthetic Data

Pith reviewed 2026-06-28 02:00 UTC · model grok-4.3

classification 💻 cs.LG
keywords privacy · counterfactual explanations · membership inference attacks · synthetic data · machine learning

The pith

Counterfactual explanations can reveal whether a record was in a model's training data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that membership inference attacks created for synthetic data can be transferred directly to counterfactual explanations. These attacks identify training-set membership using only the counterfactuals themselves, without any queries to the original model. The work matters because counterfactuals are commonly released in high-stakes domains to explain decisions, and the same privacy risk that applies to synthetic data now applies to them. A sympathetic reader would conclude that releasing counterfactuals carries an under-appreciated privacy cost.

Core claim

Membership inference attacks designed for synthetic data succeed when applied to counterfactuals generated by a model, and the attacks remain effective when the attacker has access only to the set of counterfactuals and no access to the model.

What carries the argument

Direct transfer of membership inference attacks from synthetic data, treating counterfactuals as realistic substitutes for training records.
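To make the transfer concrete, here is a minimal sketch of one attack from the synthetic-data literature: the Monte Carlo attack of Hilprecht et al. (2019, listed in the paper's references), pointed at a set of counterfactuals instead of a synthetic dataset. It assumes numeric, already-scaled features and Euclidean distance. The radius `eps` and the function name `monte_carlo_mia_scores` are illustrative. This summary does not state the paper's actual attack suite or distance choices.

```python
import numpy as np

def monte_carlo_mia_scores(targets: np.ndarray, released: np.ndarray, eps: float) -> np.ndarray:
    """Score each target record by the fraction of released records within distance eps.

    targets:  (n_targets, d) candidate records whose membership is being tested
    released: (n_released, d) released counterfactuals, used in place of synthetic data
    eps:      neighbourhood radius (illustrative; would be tuned per dataset)
    """
    # Pairwise Euclidean distances between targets and released records, shape (n_targets, n_released).
    diffs = targets[:, None, :] - released[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    # Higher score = more released points close to the target = stronger membership signal.
    return (dists <= eps).mean(axis=1)

# Toy usage: two released points sit next to the first target and none near the second.
released = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
targets = np.array([[0.0, 0.05], [3.0, 3.0]])
scores = monte_carlo_mia_scores(targets, released, eps=0.2)
```

Thresholding or ranking these scores yields a membership guess. The only input the attacker uses is the released set, which is what makes this style of attack usable without querying the model.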

If this is right

  • Releasing counterfactuals can produce a privacy breach comparable to releasing synthetic data.
  • The attacks apply across multiple counterfactual generation methods.
  • Model developers must consider privacy when sharing counterfactuals with users or the public.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Other forms of model explanations may carry similar membership leakage risks.
  • Organizations releasing explanations might need new privacy safeguards or access controls.

Load-bearing premise

Counterfactuals function as realistic substitutes for real training data in the same way synthetic data does.

What would settle it

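An experiment in which the membership inference attack, run on the released counterfactuals, performs no better than when it is run on a set of randomly chosen non-training records. In that case, the counterfactuals would carry no membership signal beyond what unrelated data provides.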
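This comparison is usually read off the attack's ROC AUC, the area under ROC curves like those in Figure 2, where 0.5 means chance. The following is a minimal sketch under two assumptions: the attack outputs one real-valued score per candidate record, and member/non-member labels are known for evaluation. `membership_auc` is a hypothetical helper that computes AUC as the probability that a random member outscores a random non-member, with ties counted as one half.

```python
import numpy as np

def membership_auc(member_scores, nonmember_scores) -> float:
    """ROC AUC of a membership score via the pairwise (Mann-Whitney) formulation."""
    m = np.asarray(member_scores, dtype=float)[:, None]     # shape (n_members, 1)
    n = np.asarray(nonmember_scores, dtype=float)[None, :]  # shape (1, n_nonmembers)
    # Average over all member/non-member pairs: wins count 1, ties count 0.5.
    return float((m > n).mean() + 0.5 * (m == n).mean())
```

The experiment would compute this once with scores derived from the counterfactual set and once with scores derived from random non-training records. Similar values, both near 0.5, would undercut the paper's claim.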

Figures

Figures reproduced from arXiv: 2606.06334 by Hadrien Lautraite, Heber H. Arcolezi, Maryam Babaei, Sebastien Gambs, Ulrich Aivodji, Yingke Wang.

Figure 1: Attack pipeline. The attacker has no access to or information about the model or counterfactual generator. They only …
Figure 2: Attack performance comparison. Blue lines show ROC curves for …
Figure 3: Acs_income – distribution comparison between the real dataset and the counterfactual sets used to perform no-box ensemble …
Figure 4: Compas – distribution comparison between the real dataset and the counterfactual sets used to perform no-box ensemble …
Figure 5: heloc – distribution comparison between the real dataset and the counterfactual sets used to perform no-box ensemble …
Figure 6: adult – distribution comparison between the real dataset and the counterfactual sets used to perform no-box ensemble …
Original abstract

Counterfactuals are typically used in high-stakes decision areas to explain a machine learning model by showing how changes to the user profiles result in the desired outcome. However, explaining the model's decisions through counterfactuals can also be exploited by an adversary to conduct privacy attacks against the model or its training data. Drawing on the analogy that counterfactuals provide realistic substitutes for real training data, similar to synthetic data, we demonstrate in this paper how it is possible to successfully perform privacy attacks on counterfactuals by drawing on the attacks developed against synthetic data. More precisely, we investigate the effectiveness of the membership inference attacks designed for synthetic data on various types of counterfactuals. Additionally, while existing membership inference attacks against counterfactuals usually require the ability to query the model, we show how it is possible to perform successful membership inference attacks using only a set of counterfactuals, with no access to the model from which they are generated. Our results demonstrate that model developers should be more cautious when releasing counterfactuals to various users, as doing so can lead to a privacy breach.
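The no-box setting described in the abstract can be illustrated with a density-ratio attack in the spirit of DOMIAS (van Breugel et al., 2023, in the paper's references). A record is flagged as a likely member when the released set is denser around it than a reference sample from the same population. The sketch assumes the attacker holds such a reference sample. It uses a fixed-bandwidth Gaussian KDE written directly in NumPy, and all names are hypothetical. It is not the paper's implementation.

```python
import numpy as np

def kde_logpdf(points: np.ndarray, data: np.ndarray, bandwidth: float) -> np.ndarray:
    """Log-density of `points` (m, d) under an isotropic Gaussian KDE fitted to `data` (n, d)."""
    d = data.shape[1]
    # Squared Euclidean distance from every query point to every data point, shape (m, n).
    sq = ((points[:, None, :] - data[None, :, :]) ** 2).sum(axis=2)
    # Log of each Gaussian kernel, including its normalising constant.
    log_k = -sq / (2.0 * bandwidth**2) - d * np.log(bandwidth * np.sqrt(2.0 * np.pi))
    # Numerically stable log-mean-exp over the n kernels.
    mx = log_k.max(axis=1, keepdims=True)
    return mx[:, 0] + np.log(np.exp(log_k - mx).mean(axis=1))

def density_ratio_scores(targets, released, reference, bandwidth=0.5):
    """Membership score log p_released(x) - log p_reference(x); higher suggests membership."""
    return kde_logpdf(targets, released, bandwidth) - kde_logpdf(targets, reference, bandwidth)

# Toy usage: the released counterfactuals cluster near the origin, the reference sample is spread out.
released = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]])
reference = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 3.0], [3.0, -3.0], [-3.0, -3.0]])
targets = np.array([[0.0, 0.0], [3.0, 3.0]])
scores = density_ratio_scores(targets, released, reference)
```

Dividing by the reference density matters because it keeps records from dense regions of the population from being flagged simply for being typical. Only an over-representation in the released set relative to the population raises the score.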

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that counterfactual explanations can leak membership information about training data in a manner analogous to synthetic datasets. It asserts that membership inference attacks originally developed for synthetic data can be directly applied to sets of counterfactuals, achieving successful attacks even without query access to the underlying model. The work investigates this across various counterfactual types and concludes that model developers should exercise caution when releasing counterfactuals due to privacy risks.

Significance. If the empirical transfer of synthetic-data MIA techniques to counterfactuals is valid, the result would extend documented privacy vulnerabilities from data synthesis to the explainable-AI setting, providing a concrete reason to treat released counterfactual sets as potentially sensitive. The analogy itself is a useful framing, but the manuscript supplies no quantitative evidence, baselines, or controls in the visible text, preventing assessment of whether the claimed success is robust or merely an artifact of untested assumptions about sampling regimes.

major comments (2)
  1. [Abstract] Abstract and introduction: the central premise that counterfactuals 'provide realistic substitutes for real training data, similar to synthetic data' is asserted without any ablation, density comparison, or feature-distribution analysis showing that MIA signals (density ratios, reconstruction error, shadow-model outputs) remain discriminative when the data are sparse, validity-constrained, and instance-specific rather than drawn from a model's marginal.
  2. [Abstract] The manuscript states that attacks succeed 'with no access to the model' yet supplies no metrics, attack success rates, baseline comparisons, dataset descriptions, or statistical controls in the provided text, rendering the empirical claim impossible to evaluate for soundness.
minor comments (1)
  1. [Abstract] The abstract would be strengthened by including at least one quantitative result (e.g., attack accuracy or AUC) and the datasets used.
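The referee's first objection turns on two properties that distinguish counterfactuals from samples of a model's marginal. Sparsity means few features change between an instance and its counterfactual. Validity means the counterfactual must actually receive the desired prediction. An ablation along those axes would first need to measure both. The following is a minimal sketch, assuming numeric feature arrays with one counterfactual per original instance and a hypothetical `predict` callable standing in for the model.

```python
import numpy as np

def sparsity(originals: np.ndarray, counterfactuals: np.ndarray, tol: float = 1e-9) -> np.ndarray:
    """Number of features that differ between each original instance and its counterfactual."""
    return (np.abs(counterfactuals - originals) > tol).sum(axis=1)

def validity(predict, counterfactuals: np.ndarray, desired_label) -> float:
    """Fraction of counterfactuals that the classifier `predict` assigns to the desired label."""
    return float(np.mean(predict(counterfactuals) == desired_label))

# Toy usage with a hypothetical threshold classifier on the feature sum.
predict = lambda X: (X.sum(axis=1) > 1.0).astype(int)
orig = np.array([[0.0, 0.0], [0.5, 0.0]])
cf = np.array([[1.0, 0.5], [0.5, 0.8]])
changed = sparsity(orig, cf)
valid_rate = validity(predict, cf, desired_label=1)
```

Binning attack scores by `changed` would show whether membership signal survives as counterfactuals become sparser. That is the ablation the referee asks for.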

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback. We address each major comment point by point below, clarifying the empirical content of the full manuscript while agreeing to strengthen certain aspects in revision.

  1. Referee: [Abstract] Abstract and introduction: the central premise that counterfactuals 'provide realistic substitutes for real training data, similar to synthetic data' is asserted without any ablation, density comparison, or feature-distribution analysis showing that MIA signals (density ratios, reconstruction error, shadow-model outputs) remain discriminative when the data are sparse, validity-constrained, and instance-specific rather than drawn from a model's marginal.

    Authors: We agree that the analogy benefits from explicit supporting analysis. The full manuscript reports empirical transfer of MIA techniques with attack success rates on counterfactual sets, but does not include dedicated density or feature-distribution ablations. We will add these comparisons (e.g., KL divergence between counterfactual and training distributions, plus ablation on sparsity/validity constraints) to the revised version. revision: yes

  2. Referee: [Abstract] The manuscript states that attacks succeed 'with no access to the model' yet supplies no metrics, attack success rates, baseline comparisons, dataset descriptions, or statistical controls in the provided text, rendering the empirical claim impossible to evaluate for soundness.

    Authors: The full manuscript contains these elements in the experimental evaluation (attack success rates, baseline comparisons against random and shadow-model attacks, dataset details, and statistical significance). The referee's observation correctly notes their absence from the abstract alone. We will incorporate a concise summary of key quantitative results into the abstract during revision. revision: partial
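The KL-divergence comparison promised in the first response can be sketched one feature at a time. This minimal version handles a single categorical feature and treats the training data as P and the counterfactual set as Q. It adds Laplace smoothing (`alpha`, an assumption) so that a category missing from the counterfactuals does not make the divergence infinite. Numeric features would need binning first. `categorical_kl` is a hypothetical helper, not code from the paper.

```python
import numpy as np

def categorical_kl(p_values, q_values, categories, alpha: float = 1.0) -> float:
    """KL(P || Q) between the smoothed empirical distributions of one categorical feature."""
    p_arr, q_arr = np.asarray(p_values), np.asarray(q_values)
    # Count each category in both samples, adding `alpha` pseudo-counts for smoothing.
    p_counts = np.array([np.sum(p_arr == c) for c in categories], dtype=float) + alpha
    q_counts = np.array([np.sum(q_arr == c) for c in categories], dtype=float) + alpha
    p = p_counts / p_counts.sum()
    q = q_counts / q_counts.sum()
    return float(np.sum(p * np.log(p / q)))

# Toy usage: identical distributions give 0; a shifted counterfactual set gives a positive value.
same = categorical_kl(["a", "b"], ["a", "b"], ["a", "b"])
shifted = categorical_kl(["a", "a", "a", "b"], ["b", "b", "b", "b"], ["a", "b"])
```

KL divergence is asymmetric. Putting the training data in P asks how poorly the counterfactuals cover the real distribution. Swapping the arguments would instead penalise counterfactuals that land where real data is rare.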

Circularity Check

0 steps flagged

No circularity: empirical attack transfer with no derivations or self-referential reductions.

full rationale

The paper is an empirical demonstration that applies pre-existing membership inference attacks (designed for synthetic data) directly to released counterfactual sets. No equations, fitted parameters, or derivations appear in the provided text. The central claim rests on experimental transfer of attack success rates rather than any self-definition, fitted-input prediction, or load-bearing self-citation chain. The analogy between counterfactuals and synthetic data is stated as a premise for the experiments but is not used to derive results by construction; results are measured outcomes. This matches the default case of a self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review is based on abstract only; no free parameters, axioms, or invented entities are stated or derivable from the provided text.

pith-pipeline@v0.9.1-grok · 5738 in / 904 out tokens · 29704 ms · 2026-06-28T02:00:52.029690+00:00 · methodology


Reference graph

Works this paper leans on

40 extracted references · 12 canonical work pages · 2 internal anchors

  1. [1]

    Ulrich Aïvodji, Alexandre Bolot, and Sébastien Gambs. 2020. Model extraction from counterfactual explanations. arXiv preprint arXiv:2009.01884 (2020)

  2. [2]

    Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. 2016. Machine Bias: Risk Assessments in Criminal Sentencing. https://github.com/propublica/compas-analysis. Accessed 2025-05-15

  3. [3]

    Arthur Asuncion and David Newman. 2007. UCI machine learning repository

  4. [4]

    André Bauer, Simon Trapp, Michael Stenger, Robert Leppich, Samuel Kounev, Mark Leznik, Kyle Chard, and Ian Foster. 2024. Comprehensive exploration of synthetic data generation: A survey. arXiv preprint arXiv:2401.02524 (2024)

  5. [5]

    Brett K Beaulieu-Jones, Zhiwei Steven Wu, Chris Williams, Ran Lee, Sanjeev P Bhavnani, James Brian Byrd, and Casey S Greene. 2019. Privacy-preserving generative deep neural networks support clinical data sharing. Circulation: Cardiovascular Quality and Outcomes 12, 7 (2019), e005122

  6. [6]

    Dieter Brughmans, Pieter Leyman, and David Martens. 2023. NICE: an algorithm for nearest instance counterfactual explanations. Data Mining and Knowledge Discovery (2023), 1–39

  7. [7]

    Dingfan Chen, Ning Yu, Yang Zhang, and Mario Fritz. 2020. GAN-Leaks: A taxonomy of membership inference attacks against generative models. In Proceedings of the 2020 ACM SIGSAC conference on computer and communications security. 343–362

  8. [8]

    Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh. 2017. ZOO: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM workshop on artificial intelligence and security. 15–26

  9. [9]

    Cynthia Dwork, Aaron Roth, et al. 2014. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science 9, 3–4 (2014), 211–407

  10. [10]

    Sofie Goethals, Kenneth Sörensen, and David Martens. 2023. The Privacy Issue of Counterfactual Explanations: Explanation Linkage Attacks. ACM Trans. Intell. Syst. Technol. 14, 5, Article 83 (Aug. 2023), 24 pages. doi:10.1145/3608482

  11. [11]

    Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. Advances in neural information processing systems 27 (2014)

  13. [13]

    Jamie Hayes, Luca Melis, George Danezis, and Emiliano De Cristofaro. 2017. LOGAN: Membership inference attacks against generative models. arXiv preprint arXiv:1705.07663 (2017)

  14. [14]

    Benjamin Hilprecht, Martin Härterich, and Daniel Bernau. 2019. Monte Carlo and reconstruction membership inference attacks against generative models. Proceedings on Privacy Enhancing Technologies (2019)

  15. [15]

    Florimond Houssiau, James Jordon, Samuel N Cohen, Owen Daniel, Andrew Elliott, James Geddes, Callum Mole, Camila Rangel-Smith, and Lukasz Szpruch. 2022. TAPAS: a toolbox for adversarial privacy auditing of synthetic data. arXiv preprint arXiv:2211.06550 (2022)

  16. [16]

    Aoting Hu, Renjie Xie, Zhigang Lu, Aiqun Hu, and Minhui Xue. 2021. TableGAN-MCA: Evaluating membership collisions of GAN-synthesized tabular data releasing. In Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security. 2096–2112

  17. [17]

    Catherine Huang, Chelse Swoopes, Christina Xiao, Jiaqi Ma, and Himabindu Lakkaraju. 2023. Accurate, Explainable, and Private Models: Providing Recourse While Minimizing Training Data Leakage. arXiv preprint arXiv:2308.04341 (2023)

  18. [18]

    Ling Huang, Anthony D Joseph, Blaine Nelson, Benjamin IP Rubinstein, and J Doug Tygar. 2011. Adversarial machine learning. In Proceedings of the 4th ACM workshop on Security and artificial intelligence. 43–58

  19. [19]

    Amir-Hossein Karimi, Gilles Barthe, Bernhard Schölkopf, and Isabel Valera. 2022. A survey of algorithmic recourse: contrastive explanations and consequential recommendations. Comput. Surveys 55, 5 (2022), 1–29

  20. [20]

    Aditya Kuppa and Nhien-An Le-Khac. 2021. Adversarial XAI methods in cybersecurity. IEEE Transactions on Information Forensics and Security 16 (2021), 4924–4938

  21. [21]

    Thibault Laugel, Marie-Jeanne Lesot, Christophe Marsala, Xavier Renard, and Marcin Detyniecki. 2019. The dangers of post-hoc interpretability: unjustified counterfactual explanations. In Proceedings of the 28th International Joint Conference on Artificial Intelligence (Macao, China) (IJCAI’19). 2801–2807

  22. [22]

    Ryan McKenna, Gerome Miklau, and Daniel Sheldon. 2021. Winning the NIST contest: A scalable and general approach to differentially private synthetic data. arXiv preprint arXiv:2108.04978 (2021)

  23. [23]

    Ryan McKenna, Brett Mullins, Daniel Sheldon, and Gerome Miklau. 2022. AIM: an adaptive and iterative mechanism for differentially private synthetic data. Proc. VLDB Endow. 15, 11 (July 2022), 2599–2612. doi:10.14778/3551793.3551817

  24. [24]

    Ramaravind K Mothilal, Amit Sharma, and Chenhao Tan. 2020. Explaining machine learning classifiers through diverse counterfactual explanations. In Proceedings of the 2020 conference on fairness, accountability, and transparency. 607–617

  25. [25]

    DJ Nelson. 2022. Privacy-preserving counterfactual explanations to help humans contest AI-based decisions. Master’s thesis. University of Twente

  26. [26]

    Sergey I Nikolenko et al. 2021. Synthetic data for deep learning. Vol. 174. Springer

  27. [27]

    OpenML. 2018. FICO-HELOC-cleaned Dataset. https://openml.org/d/45554. Accessed 2025-05-15

  28. [28]

    Art B. Owen. 2013. Monte Carlo theory, methods and examples. https://artowen.su.domains/mc/. FAccT ’26, June 25–28, 2026, Montreal, QC, Canada Babaei et al

  29. [29]

    Martin Pawelczyk, Himabindu Lakkaraju, and Seth Neel. 2023. On the privacy risks of algorithmic recourse. In International Conference on Artificial Intelligence and Statistics. PMLR, 9680–9696

  30. [30]

    Sikha Pentyala, Shubham Sharma, Sanjay Kariyappa, Freddy Lecue, and Daniele Magazzeni. 2023. Privacy-Preserving Algorithmic Recourse. arXiv preprint arXiv:2311.14137 (2023)

  31. [31]

    Reza Shokri, Martin Strobel, and Yair Zick. 2021. On the privacy risks of model explanations. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society. 231–241

  32. [32]

    Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. 2017. Membership inference attacks against machine learning models. In 2017 IEEE symposium on security and privacy (SP). IEEE, 3–18

  33. [33]

    Marlon Tobaben, Hibiki Ito, Joonas Jälkö, Yuan He, and Antti Honkela. 2025. Impact of Dataset Properties on Membership Inference Vulnerability of Deep Transfer Learning. In Advances in Neural Information Processing Systems, Vol. 38. 67486–67537

  34. [34]

    Boris van Breugel, Hao Sun, Zhaozhi Qian, and Mihaela van der Schaar. 2023. Membership Inference Attacks against Synthetic Data through Overfitting Detection. In International Conference on Artificial Intelligence and Statistics. PMLR, 3493–3514

  35. [35]

    Giuseppe Vietri, Cedric Archambeau, Sergul Aydore, William Brown, Michael Kearns, Aaron Roth, Ankit Siva, Shuai Tang, and Steven Z Wu. 2022. Private synthetic data for multitask learning and marginal queries. Advances in Neural Information Processing Systems 35 (2022), 18282–18295

  36. [36]

    Sandra Wachter, Brent Mittelstadt, and Chris Russell. 2017. Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harv. JL & Tech. 31 (2017), 841

  37. [37]

    Yongjie Wang, Hangwei Qian, and Chunyan Miao. 2022. DualCF: Efficient Model Extraction Attack from Counterfactual Explanations. In 2022 ACM Conference on Fairness, Accountability, and Transparency. 1318–1329

  38. [38]

    Joshua Ward, Chi-Hua Wang, and Guang Cheng. 2024. Data plagiarism index: Characterizing the privacy risk of data-copying in tabular generative models. arXiv preprint arXiv:2406.13012 (2024)

  39. [39]

    Joshua Ward, Chi-Hua Wang, and Guang Cheng. 2025. Privacy Auditing Synthetic Data Release through Local Likelihood Attacks. arXiv preprint arXiv:2508.21146 (2025)

  40. [40]

    Joshua Ward, Yuxuan Yang, Chi-Hua Wang, and Guang Cheng. 2025. Ensembling Membership Inference Attacks Against Tabular Generative Models. In Proceedings of the 18th ACM Workshop on Artificial Intelligence and Security (AISec ’25). Association for Computing Machinery, New York, NY, USA, 182–193. doi:10.1145/3733799.3762977