Quality Degradation Attack in Synthetic Data

Dong Liu; Mohammad Khalil; Pedro P. Vergara Barrios; Qinyi Liu; Sam Urmian

arxiv: 2601.02947 · v1 · pith:FTGLJWEUnew · submitted 2026-01-06 · 💻 cs.CR

Pith reviewed 2026-05-21 21:06 UTC · model grok-4.3

keywords synthetic data generation · quality degradation attack · adversarial attacks · privacy-preserving data sharing · data integrity · SDG pipelines · statistical divergence

The pith

Adversaries with access to real data can substantially degrade synthetic data quality using small targeted changes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper investigates quality degradation attacks on synthetic data generation, where adversaries who have access to the real dataset or control over the generation process perform manipulations to reduce the utility of the synthetic data. It formalizes a threat model and demonstrates through experiments that actions like label flipping and feature-importance-based interventions, even when small, lead to lower downstream predictive performance and higher statistical divergence from the original data. A reader would care because synthetic data is increasingly used for privacy-preserving sharing, yet these attacks expose that current pipelines lack protections against integrity threats. The work calls for adding robustness mechanisms to ensure reliable synthetic data.

Core claim

The central claim is that even small perturbations to real data, such as label flipping and feature-importance-based interventions, can substantially reduce the downstream predictive performance of models trained on the resulting synthetic data and increase statistical divergence, thereby exposing vulnerabilities in synthetic data generation pipelines.
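As an illustration only, a random label-flipping perturbation of the kind named above might look like the following minimal Python sketch. The function name `flip_labels`, the binary 0/1 label encoding, and the 5% flip rate are assumptions made here for the example; the source does not state how the authors select which labels to flip.

```python
import random

def flip_labels(labels, rate, seed=0):
    """Return a copy of binary labels with a fraction `rate` flipped (0 <-> 1)."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be in [0, 1]")
    rng = random.Random(seed)
    n_flip = round(rate * len(labels))
    # Sample distinct row indices so that exactly n_flip labels change.
    flip_idx = set(rng.sample(range(len(labels)), n_flip))
    return [1 - y if i in flip_idx else y for i, y in enumerate(labels)]

# Toy data (hypothetical): 100 balanced binary labels, 5% of them flipped.
labels = [0, 1] * 50
poisoned = flip_labels(labels, rate=0.05, seed=42)
n_changed = sum(a != b for a, b in zip(labels, poisoned))
print(n_changed, "of", len(labels), "labels changed")
```

The perturbed labels would then be handed to the synthetic data generator in place of the originals, which is the step the threat model is concerned with.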

What carries the argument

Targeted manipulations of real data, including label flipping and feature-importance-based interventions, that exploit access to the real dataset to degrade the quality of generated synthetic data.
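One way to picture a feature-importance-based intervention is sketched below. It is a guess at the general shape, not the authors' method: absolute Pearson correlation with the label stands in for the importance score, and the intervention shuffles the top-ranked column within a random subset of rows. The source does not say which importance measure or which intervention the paper uses, and the helper names `pearson` and `perturb_top_feature` are hypothetical.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0

def perturb_top_feature(rows, labels, rate, seed=0):
    """Shuffle the most label-correlated column within a random subset of rows."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    # Assumed importance proxy: |correlation| between each column and the label.
    importance = [abs(pearson([r[j] for r in rows], labels))
                  for j in range(n_features)]
    top = max(range(n_features), key=importance.__getitem__)
    idx = rng.sample(range(len(rows)), round(rate * len(rows)))
    values = [rows[i][top] for i in idx]
    rng.shuffle(values)  # break the feature-label link for the chosen rows
    out = [list(r) for r in rows]
    for i, v in zip(idx, values):
        out[i][top] = v
    return out, top

# Toy data (hypothetical): column 0 is noise, column 1 tracks the label.
data_rng = random.Random(1)
labels = [i % 2 for i in range(200)]
rows = [[data_rng.random(), y + 0.1 * data_rng.random()] for y in labels]
perturbed, top = perturb_top_feature(rows, labels, rate=0.1, seed=7)
print("perturbed column:", top)
```

Shuffling keeps the column's marginal distribution unchanged while weakening its relationship to the label, which is one reason such a change could be hard to spot from summary statistics alone.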

If this is right

SDG systems require integrity verification alongside privacy protections.
Small changes by data owners or providers can render synthetic data unreliable for downstream tasks.
Statistical divergence and predictive performance are sensitive to these manipulations.
Trustworthiness of synthetic data sharing depends on robustness to such attacks.
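For readers who want a concrete handle on "statistical divergence", the sketch below computes the Jensen-Shannon divergence between the empirical distributions of one categorical column in two datasets. The choice of Jensen-Shannon divergence is an assumption made for illustration; the abstract does not name the divergence measure the paper reports, and `js_divergence` is a hypothetical helper.

```python
import math
from collections import Counter

def js_divergence(sample_a, sample_b):
    """Jensen-Shannon divergence (log base 2, range [0, 1]) of two categorical samples."""
    pa, pb = Counter(sample_a), Counter(sample_b)
    na, nb = len(sample_a), len(sample_b)
    jsd = 0.0
    for key in set(pa) | set(pb):
        p, q = pa[key] / na, pb[key] / nb
        m = 0.5 * (p + q)  # mixture distribution
        if p > 0:
            jsd += 0.5 * p * math.log2(p / m)
        if q > 0:
            jsd += 0.5 * q * math.log2(q / m)
    return jsd

# Identical samples give 0; samples with no shared category give 1.
same = js_divergence(["a", "b"], ["a", "b"])
disjoint = js_divergence(["a"] * 3, ["b"] * 3)
print(same, disjoint)
```

In an evaluation like the one described, such a score would be computed per column between the real data and each synthetic dataset, with higher values indicating that the synthetic data has drifted further from the original.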

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar attacks could affect other privacy-preserving techniques, such as differential privacy mechanisms.
Future work might explore automated detection of degraded synthetic data.
Regulatory frameworks for data sharing may need to mandate integrity checks for synthetic datasets.

Load-bearing premise

The adversaries have access to the real dataset or control over the generation process allowing them to perform targeted manipulations.

What would settle it

A demonstration that label flipping and feature-importance interventions do not lead to substantial reductions in predictive performance or increases in statistical divergence would falsify the claim.

Original abstract

Synthetic Data Generation (SDG) can be used to facilitate privacy-preserving data sharing. However, most existing research focuses on privacy attacks where the adversary is the recipient of the released synthetic data and attempts to infer sensitive information from it. This study investigates quality degradation attacks initiated by adversaries who possess access to the real dataset or control over the generation process, such as the data owner, the synthetic data provider, or potential intruders. We formalize a corresponding threat model and empirically evaluate the effectiveness of targeted manipulations of real data (e.g., label flipping and feature-importance-based interventions) on the quality of generated synthetic data. The results show that even small perturbations can substantially reduce downstream predictive performance and increase statistical divergence, exposing vulnerabilities within SDG pipelines. This study highlights the need to integrate integrity verification and robustness mechanisms, alongside privacy protection, to ensure the reliability and trustworthiness of synthetic data sharing frameworks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper flags a plausible insider threat to synthetic data quality, but the experiments do not yet isolate whether the damage is amplified by the generation step or just inherited from poisoned real data.

The main point is that data owners or providers could deliberately degrade synthetic data utility with small changes, and the authors are right to treat this as a distinct threat model from the usual recipient-side privacy attacks. They formalize the scenario where the adversary controls the real dataset or the generator itself, then test concrete manipulations like label flipping and feature-importance interventions. That shift in focus is useful for anyone thinking about regulated data sharing in healthcare or finance, where synthetic data is sold as a privacy fix but integrity also matters.

The empirical angle is straightforward: they show downstream models lose predictive power and statistical divergence grows after these tweaks. Credit for moving the conversation beyond inference attacks and for grounding the claim in a threat model that matches real deployment roles.

The soft spot is the missing comparison the stress-test note highlights. Label flips or feature changes on the real data will hurt any learner; without showing performance on the perturbed real data itself, or on synthetic data generated from clean data with matched noise levels, it is difficult to claim the effect is particular to SDG pipelines rather than ordinary data poisoning. The abstract gives no numbers, baselines, or error bars, so the size of the claimed degradation remains hard to judge. If the full paper includes those controls and reports clear effect sizes, the central argument strengthens; otherwise the results could be explained without invoking anything special about synthetic generation.

This is for people building or auditing synthetic data systems who need to consider integrity alongside privacy. A serious referee should see it because the threat model is new enough and practically relevant, even though the current evidence needs tighter controls to carry the full claim.

Referee Report

1 major / 1 minor

Summary. The paper formalizes a threat model for quality degradation attacks on synthetic data generation (SDG) pipelines, where adversaries with access to the real dataset (data owners, providers, or intruders) apply targeted perturbations such as label flipping and feature-importance-based interventions. Empirical results indicate that even small perturbations substantially reduce downstream predictive performance and increase statistical divergence in the generated synthetic data, exposing vulnerabilities in SDG systems and calling for integrated integrity verification alongside privacy protections.

Significance. If the central empirical findings hold after addressing baseline controls, the work is significant for shifting SDG security research from privacy attacks to integrity threats. It could inform the design of robust synthetic data frameworks used in privacy-preserving sharing, particularly in domains where data utility must be preserved.

major comments (1)

The experimental results (as summarized in the abstract and implied in the evaluation sections) report substantial drops in predictive performance and increased divergence for synthetic data from perturbed real data, but do not include baselines such as (a) performance of models trained directly on the perturbed real data or (b) synthetic data generated from clean real data with matched perturbation magnitude. Without these controls, the claimed SDG-specific amplification cannot be isolated from direct inheritance of source degradation, undermining the central claim that the effect exposes unique vulnerabilities in SDG pipelines.
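The comparison requested here can be sketched as a small grid of training arms scored on the same clean test set. Everything in the sketch is a stand-in chosen to keep it runnable: a per-class Gaussian sampler replaces a real synthesizer, a nearest-centroid rule replaces the downstream model, and the 20% flip rate and all function names are hypothetical. The toy numbers it prints say nothing about the paper's results; the point is only the shape of the controls.

```python
import random
import statistics

def make_data(n, rng):
    """Toy 1-D binary task: class 0 ~ N(0, 1), class 1 ~ N(2, 1)."""
    ys = [i % 2 for i in range(n)]
    xs = [rng.gauss(2.0 * y, 1.0) for y in ys]
    return xs, ys

def flip(ys, rate, rng):
    """Flip a fraction `rate` of binary labels."""
    idx = set(rng.sample(range(len(ys)), round(rate * len(ys))))
    return [1 - y if i in idx else y for i, y in enumerate(ys)]

def synthesize(xs, ys, n, rng):
    """Stand-in generator: fit one Gaussian per class, then sample from it."""
    params = {}
    for c in (0, 1):
        vals = [x for x, y in zip(xs, ys) if y == c]
        params[c] = (statistics.fmean(vals), statistics.stdev(vals))
    syn_y = [rng.choice(ys) for _ in range(n)]  # keeps the observed class prior
    syn_x = [rng.gauss(*params[y]) for y in syn_y]
    return syn_x, syn_y

def centroid_accuracy(train, test):
    """Nearest-centroid classifier trained on `train`, scored on `test`."""
    (xs, ys), (tx, ty) = train, test
    c0 = statistics.fmean([x for x, y in zip(xs, ys) if y == 0])
    c1 = statistics.fmean([x for x, y in zip(xs, ys) if y == 1])
    preds = [0 if abs(x - c0) <= abs(x - c1) else 1 for x in tx]
    return sum(p == y for p, y in zip(preds, ty)) / len(ty)

rng = random.Random(0)
real = make_data(400, rng)
test = make_data(400, rng)
rate = 0.2  # hypothetical label-flip rate
poisoned = (real[0], flip(real[1], rate, rng))
syn_clean = synthesize(*real, 400, rng)
syn_then_flip = (syn_clean[0], flip(syn_clean[1], rate, rng))

results = {
    "clean real": centroid_accuracy(real, test),
    "perturbed real (control a)": centroid_accuracy(poisoned, test),
    "synthetic from clean": centroid_accuracy(syn_clean, test),
    "synthetic from clean, then perturbed (control b)": centroid_accuracy(syn_then_flip, test),
    "synthetic from perturbed (attack)": centroid_accuracy(synthesize(*poisoned, 400, rng), test),
}
for name, acc in results.items():
    print(f"{acc:.3f}  {name}")
```

Amplification by the generation step would show up as the attack arm falling clearly below both control arms; if the three track each other, the degradation is more plausibly inherited from the poisoned source.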

minor comments (1)

Abstract lacks any quantitative metrics, error bars, dataset details, or specific effect sizes for the reported performance drops and divergence increases, making it difficult to gauge practical impact.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and the opportunity to clarify and strengthen our work. We address the major comment below and agree that the suggested baseline controls will improve the isolation of SDG-specific effects.

Referee: The experimental results (as summarized in the abstract and implied in the evaluation sections) report substantial drops in predictive performance and increased divergence for synthetic data from perturbed real data, but do not include baselines such as (a) performance of models trained directly on the perturbed real data or (b) synthetic data generated from clean real data with matched perturbation magnitude. Without these controls, the claimed SDG-specific amplification cannot be isolated from direct inheritance of source degradation, undermining the central claim that the effect exposes unique vulnerabilities in SDG pipelines.

Authors: We agree that these controls are necessary to rigorously demonstrate amplification through the SDG process rather than simple inheritance of degradation. In the revised manuscript we will add (a) downstream model performance when trained directly on the perturbed real data and (b) synthetic data generated from clean real data followed by application of perturbations with matched magnitude (e.g., same label-flip rate or feature-importance shift) to the resulting synthetic records. These additions will allow direct comparison of degradation magnitude before versus after the generation step and will be reported in the updated evaluation sections.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation is self-contained

The paper formalizes a threat model and reports experimental results from applying targeted perturbations (label flipping, feature interventions) to real data and measuring downstream effects on synthetic data quality metrics. No equations, derivations, or self-citations are presented that reduce any claimed result to a fitted parameter, self-definition, or prior author work by construction. Central claims rest on direct empirical comparisons rather than any load-bearing theoretical chain, satisfying the criteria for a non-circular empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work relies on standard domain assumptions about synthetic data utility and adversary capabilities but introduces no free parameters, new axioms beyond conventional ML threat modeling, or invented entities.

axioms (1)

domain assumption: Adversaries with access to real data or generation control can perform targeted manipulations such as label flipping and feature-importance interventions. This premise underpins the threat model and the choice of evaluated attacks.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: We formalize a corresponding threat model and empirically evaluate the effectiveness of targeted manipulations of real data (e.g., label flipping and feature-importance-based interventions) on the quality of generated synthetic data.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: The results show that even small perturbations can substantially reduce downstream predictive performance and increase statistical divergence

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

