arxiv: 2603.19185 · v2 · submitted 2026-03-19 · 💻 cs.LG


MIDST Challenge at SaTML 2025: Membership Inference over Diffusion-models-based Synthetic Tabular data

Masoumeh Shafieinejad, Xi He, Mahshid Alinoori, John Jewell, Sana Ayromlou, Wei Pang, Veronica Chatrath, Gauri Sharma, Deval Pandya


Pith reviewed 2026-05-15 08:03 UTC · model grok-4.3


keywords: membership inference attacks, diffusion models, synthetic tabular data, privacy evaluation, challenge, tabular data generation


The pith

The MIDST challenge shows that membership inference attacks can quantify privacy leakage in synthetic tabular data from diffusion models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper describes the MIDST challenge at SaTML 2025, which tests whether synthetic tabular data generated by diffusion models truly resists membership inference attacks. Synthetic data is promoted as a privacy solution because it aims to match statistical properties without exposing original records, yet its resilience for complex tabular formats with mixed types or relational constraints has not been systematically measured. The challenge invited new black-box and white-box attacks tailored to these diffusion models and used their success rates to evaluate privacy gain. A sympathetic reader would care because effective attacks would indicate that synthetic tabular data may still leak membership information from the training set.

Core claim

MIDST is a challenge that explores diffusion models for generating synthetic tabular data of mixed types and multi-relational structures with interconnected constraints, and it prompted the creation of specialized black-box and white-box membership inference attacks to evaluate how resistant the resulting data is to privacy threats.

What carries the argument

Membership inference attacks applied directly to diffusion models trained on tabular data, used as the metric to measure whether synthetic outputs leak information about the original training records.

If this is right

Novel black-box membership inference attacks were developed specifically for diffusion models on tabular data.
Novel white-box membership inference attacks were developed specifically for diffusion models on tabular data.
The evaluation covers both single tables with mixed data types and multi-relational tables.
Quantitative scores of privacy efficacy are produced by measuring attack success rates across the submitted methods.
The challenge provides a benchmark for comparing privacy protection levels of different diffusion model approaches to synthetic tabular data.
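
This page does not say which metric the challenge used to score attacks. A minimal sketch, assuming each attack outputs one membership score per candidate record and that success is summarized as the true-positive rate at a fixed false-positive rate together with AUC. The function names and toy data are hypothetical and are not taken from the MIDST repository.

```python
import numpy as np

def tpr_at_fpr(scores, is_member, max_fpr=0.1):
    """Best true-positive rate achievable with a false-positive rate <= max_fpr."""
    scores = np.asarray(scores, dtype=float)
    is_member = np.asarray(is_member, dtype=bool)
    best = 0.0
    # Try every observed score as a threshold; predict "member" when score >= threshold.
    for threshold in np.unique(scores):
        predicted = scores >= threshold
        fpr = np.mean(predicted[~is_member])
        tpr = np.mean(predicted[is_member])
        if fpr <= max_fpr:
            best = max(best, tpr)
    return float(best)

def auc(scores, is_member):
    """Probability that a random member outscores a random non-member (ties count half)."""
    scores = np.asarray(scores, dtype=float)
    is_member = np.asarray(is_member, dtype=bool)
    member_scores = scores[is_member][:, None]
    non_member_scores = scores[~is_member][None, :]
    wins = member_scores > non_member_scores
    ties = member_scores == non_member_scores
    return float(np.mean(wins + 0.5 * ties))

# Toy example (made-up scores, not challenge data).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 0, 1, 0, 0]
print(tpr_at_fpr(toy_scores, toy_labels, max_fpr=0.1))
print(auc(toy_scores, toy_labels))
```

On this toy input the attack reaches a TPR of 0.5 at 10% FPR and an AUC of 0.8125. Reporting TPR at a low FPR alongside AUC is a common choice because it shows whether an attack can confidently identify even a few members.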

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same attack-based evaluation approach could be adapted to other generative models such as GANs or VAEs for tabular data.
High-performing attacks in the challenge could guide the addition of privacy constraints directly into diffusion model training for tabular data.
Standardized challenges like MIDST may become a routine step before releasing synthetic tabular datasets for public use.

Load-bearing premise

That success or failure of membership inference attacks on the synthetic outputs can accurately reflect the real privacy leakage from the original dataset used to train the diffusion models.

What would settle it

A finding that all submitted membership inference attacks achieve accuracy no better than random guessing on the challenge's held-out test sets would show that the evaluation method does not detect meaningful privacy leakage.
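
One way to make "no better than random guessing" concrete is a one-sided binomial test on attack accuracy. The sketch below assumes a balanced challenge set (half members, half non-members), so chance accuracy is 0.5. That balance and the function name are assumptions, not details stated on this page.

```python
from math import exp, lgamma, log

def binomial_p_value(correct, total, p_chance=0.5):
    """One-sided P(X >= correct) for X ~ Binomial(total, p_chance), with 0 < p_chance < 1."""
    def log_pmf(k):
        # Log of the binomial coefficient via lgamma, to avoid overflow on large sets.
        log_comb = lgamma(total + 1) - lgamma(k + 1) - lgamma(total - k + 1)
        return log_comb + k * log(p_chance) + (total - k) * log(1 - p_chance)
    tail = sum(exp(log_pmf(k)) for k in range(correct, total + 1))
    return min(1.0, tail)

# Hypothetical attack: 120 correct membership guesses out of 200 records.
print(binomial_p_value(120, 200))
```

A small p-value means the attack's accuracy is unlikely under pure guessing. A p-value near 1 for every submitted attack would correspond to the outcome described above.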

Original abstract

Synthetic data is often perceived as a silver-bullet solution to data anonymization and privacy-preserving data publishing. Drawn from generative models like diffusion models, synthetic data is expected to preserve the statistical properties of the original dataset while remaining resilient to privacy attacks. Recent developments of diffusion models have been effective on a wide range of data types, but their privacy resilience, particularly for tabular formats, remains largely unexplored. MIDST challenge sought a quantitative evaluation of the privacy gain of synthetic tabular data generated by diffusion models, with a specific focus on its resistance to membership inference attacks (MIAs). Given the heterogeneity and complexity of tabular data, multiple target models were explored for MIAs, including diffusion models for single tables of mixed data types and multi-relational tables with interconnected constraints. MIDST inspired the development of novel black-box and white-box MIAs tailored to these target diffusion models as a key outcome, enabling a comprehensive evaluation of their privacy efficacy. The MIDST GitHub repository is available at https://github.com/VectorInstitute/MIDST

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a challenge report on membership inference for diffusion-generated synthetic tabular data, but the manuscript itself contains no attack details, results, or comparisons.


Hi, the main thing to know is that this paper is essentially a description of the MIDST challenge setup rather than a report of new findings or methods. It frames the problem of evaluating privacy leakage in synthetic tabular data from diffusion models and points to the need for tailored membership inference attacks on both single tables with mixed types and multi-relational tables.

The GitHub link is included so others can access the materials, which is a straightforward way to make the resources available. That part is useful for anyone working on synthetic data in domains like healthcare or finance where privacy claims need checking. The challenge format itself can encourage development of black-box and white-box attacks that account for the specific structure of tabular data and diffusion outputs.

On the downside, the central claim that the challenge produced novel MIAs and enabled a comprehensive privacy evaluation rests on nothing shown in the paper. No attack algorithms are described, no success rates or metrics appear, and there are no comparisons to prior MIA work on tabular or generative data. The text stays at the level of motivation and setup without delivering verifiable evidence or analysis of what was learned. This makes it hard to assess whether the privacy efficacy conclusions hold up.

The paper is aimed at researchers interested in privacy evaluations of generative models or those who might run or participate in similar challenges. It could be a useful pointer to resources and open questions, but it does not stand as a self-contained technical contribution. I would send it to peer review for a workshop or challenge track, since the topic is timely and the framing is clear, even though the current version would benefit from including actual results to strengthen it.

Referee Report

1 major / 1 minor

Summary. The paper presents the MIDST challenge at SaTML 2025, which evaluates the privacy resilience of synthetic tabular data generated by diffusion models against membership inference attacks (MIAs). It covers target models for single mixed-type tables and multi-relational tables, and claims that the challenge inspired novel black-box and white-box MIAs tailored to these models, enabling a comprehensive privacy evaluation. The work references a GitHub repository for resources but provides no attack algorithms, results, or quantitative metrics in the manuscript itself.

Significance. If the claimed novel MIAs and their evaluations were rigorously documented and shown to outperform prior methods with reproducible metrics, the challenge could meaningfully advance privacy assessment for diffusion-based tabular synthesis, an underexplored area. However, the absence of any technical details or results in the manuscript substantially reduces its standalone contribution to the literature.

major comments (1)

[Abstract] The central claim that MIDST 'inspired the development of novel black-box and white-box MIAs tailored to these target diffusion models as a key outcome' is unsupported by any attack descriptions, novelty arguments relative to existing tabular or diffusion MIAs, success rates, or ablation studies. The manuscript limits itself to challenge setup and a GitHub link, leaving the primary asserted contribution dependent on external, unexamined submissions.

minor comments (1)

The paper would benefit from a brief summary table of challenge submissions (e.g., attack types, AUC scores, or top-performing methods) even if full details are in the repository, to make the privacy evaluation claims more self-contained.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thoughtful review and for highlighting the need to better substantiate the manuscript's claims. The paper describes the MIDST challenge at SaTML 2025, which was designed to evaluate privacy resilience of diffusion-generated synthetic tabular data. We address the specific concern below and will revise the manuscript accordingly.


Referee: [Abstract] The central claim that MIDST 'inspired the development of novel black-box and white-box MIAs tailored to these target diffusion models as a key outcome' is unsupported by any attack descriptions, novelty arguments relative to existing tabular or diffusion MIAs, success rates, or ablation studies. The manuscript limits itself to challenge setup and a GitHub link, leaving the primary asserted contribution dependent on external, unexamined submissions.

Authors: We acknowledge that the current manuscript focuses primarily on the challenge design, target models (single mixed-type tables and multi-relational tables), and the overall evaluation framework, with technical details of participant submissions referenced via the GitHub repository. The claim that the challenge inspired novel MIAs is based on the fact that multiple teams developed and submitted tailored black-box and white-box attacks specifically for diffusion-based tabular generators, which were not previously explored in this setting. To address the referee's valid point and make the contribution more self-contained, we will revise the manuscript to include a concise summary section describing the key innovations in the top-performing attacks (e.g., adaptations for mixed data types and relational constraints), high-level performance metrics from the challenge leaderboard, and brief novelty arguments relative to prior tabular MIAs. Full algorithms and ablations will remain in the repository and associated participant reports, as is standard for challenge papers.

Revision: yes

Circularity Check

0 steps flagged

No circularity in challenge description paper


This manuscript describes the MIDST challenge setup for evaluating privacy of diffusion-based synthetic tabular data against membership inference attacks. It contains no equations, derivations, fitted parameters, predictions, or first-principles results. The statement that MIDST 'inspired the development of novel black-box and white-box MIAs' refers to external participant submissions (via GitHub link) rather than any internal reduction to the paper's own inputs. No self-citation load-bearing steps, ansatzes, uniqueness theorems, or renamings of known results appear. The derivation chain is empty; the paper is self-contained as a challenge report with no circular structure.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The provided abstract contains no mathematical derivations, free parameters, axioms, or invented entities. It is a description of a challenge event focused on privacy evaluation.


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

FERMI: Exploiting Relations for Membership Inference Against Tabular Diffusion Models
cs.LG 2026-05 unverdicted novelty 7.0

FERMI improves membership inference on tabular diffusion models by mapping relational auxiliary information into attack features, raising TPR at 0.1 FPR by up to 53% white-box and 22% black-box over single-table baselines.

On Privacy Leakage in Tabular Diffusion Models: Influential Factors, Attacker Knowledge, and Metrics
cs.LG 2026-05 unverdicted novelty 6.0

Tabular diffusion models leak membership information via attacks even with partial attacker knowledge, and common heuristic privacy metrics like distance-to-closest-record are unreliable.
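
For readers unfamiliar with the heuristic named in the second summary, distance-to-closest-record (DCR) is usually computed as the distance from each synthetic row to its nearest training row. A minimal sketch, assuming purely numeric, already-scaled columns and Euclidean distance. Mixed-type tables would need a different distance, and the function name is hypothetical.

```python
import numpy as np

def distance_to_closest_record(synthetic, real):
    """For each synthetic row, the Euclidean distance to its nearest real row."""
    synthetic = np.asarray(synthetic, dtype=float)
    real = np.asarray(real, dtype=float)
    # Pairwise differences with shape (n_synthetic, n_real, n_features).
    diffs = synthetic[:, None, :] - real[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    return dists.min(axis=1)

# Toy rows (made up): two training records, three synthetic records.
real_rows = [[0.0, 0.0], [3.0, 4.0]]
synthetic_rows = [[0.0, 0.0], [3.0, 0.0], [6.0, 8.0]]
print(distance_to_closest_record(synthetic_rows, real_rows))
```

A large DCR is often read as evidence that synthetic rows do not copy training rows. The summary above reports that this kind of heuristic is unreliable as a privacy metric, which is why attack-based evaluation is used instead.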
