arxiv: 2506.18499 · v1 · submitted 2025-06-23 · 💻 cs.LG · cs.AI · cs.DB

PuckTrick: A Library for Making Synthetic Data More Realistic

Pith reviewed 2026-05-19 07:28 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.DB
keywords synthetic data · data contamination · machine learning robustness · synthetic data generation · model performance · data imperfections · Python library · financial datasets

The pith

A library for adding controlled errors to synthetic data improves machine learning model performance on real tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PuckTrick, a Python library that injects realistic imperfections such as missing values, noise, outliers, label errors, duplicates, and class imbalance into synthetic datasets. The core idea is that purely clean synthetic data fails to prepare models for the messiness of actual data, while adding these errors in a controlled way produces better generalization. Experiments on financial datasets demonstrate higher performance for models trained on the contaminated versions, with gains especially visible in tree-based and linear methods. A reader would care because the approach offers a way to make synthetic data more useful for training without relying on restricted real-world records.

Core claim

PuckTrick supplies two modes for contaminating synthetic data: one that adds errors to clean datasets and another that further corrupts already imperfect data. Models trained on the resulting contaminated synthetic data outperform models trained on error-free synthetic data when both are evaluated on real financial datasets. The improvement appears most clearly for tree-based and linear models, with Extra Trees and support vector machines (SVMs) named as examples.
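The difference between the two modes can be sketched in a few lines of plain Python. This is an illustrative sketch only, not PuckTrick's API: the function names `contaminate_new` and `contaminate_extended` are hypothetical, and the assumption that the second mode tops up an existing error share to a target overall rate is ours, since the summary above only says that the second mode further corrupts already contaminated data.

```python
import random

def contaminate_new(values, rate, rng):
    """Mode 1 (hypothetical): blank out round(rate * n) entries of a clean column."""
    n = len(values)
    hit = set(rng.sample(range(n), round(rate * n)))
    return [None if i in hit else v for i, v in enumerate(values)]

def contaminate_extended(values, target_rate, rng):
    """Mode 2 (hypothetical): add missing entries to an already contaminated
    column until the overall missing share reaches target_rate.
    Existing gaps are kept and never counted twice."""
    n = len(values)
    already = [i for i, v in enumerate(values) if v is None]
    intact = [i for i, v in enumerate(values) if v is not None]
    extra = max(0, round(target_rate * n) - len(already))
    hit = set(rng.sample(intact, min(extra, len(intact))))
    return [None if i in hit else v for i, v in enumerate(values)]

rng = random.Random(42)
clean = list(range(200))
first = contaminate_new(clean, 0.10, rng)        # 10% of a clean column
second = contaminate_extended(first, 0.25, rng)  # topped up to 25% overall
```

Sampling the extra errors only from intact positions is the design choice that keeps the second pass from silently landing on cells that are already corrupted.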

What carries the argument

The PuckTrick library, which offers structured injection of multiple error types including missing data, noisy values, outliers, label misclassification, duplication, and class imbalance to simulate real-world data imperfections.
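To make the error types concrete, the sketch below injects three of them (missing values, label misclassification, duplication) into a toy table held as a list of dictionaries. It is a minimal illustration under stated assumptions: the helper names, the column names `price` and `target`, and the uniform random choice of affected rows are all hypothetical and do not reflect PuckTrick's actual functions or sampling strategy.

```python
import random

def inject_missing(rows, column, rate, rng):
    """Return a copy of rows with `column` set to None in round(rate * n) rows."""
    out = [dict(r) for r in rows]
    for i in rng.sample(range(len(out)), round(rate * len(out))):
        out[i][column] = None
    return out

def inject_label_noise(rows, label, rate, rng):
    """Return a copy of rows with a binary 0/1 label flipped in round(rate * n) rows."""
    out = [dict(r) for r in rows]
    for i in rng.sample(range(len(out)), round(rate * len(out))):
        out[i][label] = 1 - out[i][label]
    return out

def inject_duplicates(rows, rate, rng):
    """Return a copy of rows with round(rate * n) randomly chosen rows appended again."""
    out = [dict(r) for r in rows]
    extra = round(rate * len(rows))
    out.extend(dict(rng.choice(rows)) for _ in range(extra))
    return out

rng = random.Random(0)
# Toy stand-in for a clean synthetic dataset.
clean = [{"price": float(i), "target": i % 2} for i in range(100)]
with_missing = inject_missing(clean, "price", 0.10, rng)
with_flips = inject_label_noise(clean, "target", 0.05, rng)
with_dupes = inject_duplicates(clean, 0.20, rng)
```

Each helper copies its input, so one clean dataset can feed several independently contaminated variants, one per error type.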

If this is right

  • Machine learning models achieve higher accuracy on real tasks when trained with systematically contaminated synthetic data.
  • Tree-based and linear models such as SVMs and extra trees show the clearest benefits from the added errors.
  • Synthetic data generation can be made more effective by incorporating controlled imperfections rather than aiming for perfect cleanliness.
  • Evaluation of model resilience becomes more realistic when using contaminated synthetic datasets instead of error-free ones.
  • The library supports repeated application of contamination to already imperfect data for further realism.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could extend to non-financial domains such as medical or sensor data if similar error patterns are introduced.
  • Practitioners might integrate PuckTrick into data pipelines to test model robustness before deployment.
  • Different error combinations could be tuned to optimize performance for specific model families.
  • The results suggest that real-world data noise functions as a natural regularizer that clean synthetic data lacks.

Load-bearing premise

The specific error types and contamination levels chosen accurately reproduce the statistical properties of imperfections that appear in real-world datasets.

What would settle it

Retraining the tested models on contaminated versus clean synthetic data and finding no performance gain or a loss on held-out real financial data would falsify the central claim.
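One way to turn this test into a decision rule is a paired comparison over repeated runs: score the clean-trained and contaminated-trained model on the same held-out real data for each seed, then check whether the mean gain is distinguishable from zero. The sketch below uses a percentile bootstrap from the standard library. The function name `paired_gain` is hypothetical, and the score lists are placeholders for illustration only, not results from the paper.

```python
import random
import statistics

def paired_gain(clean_scores, contaminated_scores, n_boot=2000, seed=0):
    """Mean per-run gain (contaminated minus clean) with a 95% bootstrap interval."""
    diffs = [c - b for b, c in zip(clean_scores, contaminated_scores)]
    rng = random.Random(seed)
    # Resample the per-run differences with replacement and collect the means.
    means = sorted(
        statistics.mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_boot)
    )
    low = means[int(0.025 * n_boot)]
    high = means[int(0.975 * n_boot) - 1]
    return statistics.mean(diffs), (low, high)

# Placeholder numbers for illustration only; these are NOT results from the paper.
clean_runs = [0.61, 0.63, 0.60, 0.62, 0.64]
contaminated_runs = [0.61, 0.62, 0.61, 0.62, 0.63]
gain, (low, high) = paired_gain(clean_runs, contaminated_runs)
# An interval that includes or lies below zero would count against the claim.
claim_supported = low > 0
```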

Figures

Figures reproduced from arXiv: 2506.18499 by Alessandra Agostini, Andrea Maurino, Blerina Spahiu.

Figure 1: The proposed pipeline.
Surrounding text: To highlight the contribution of each type of error produced, the pipeline constructs a separate contaminated dataset for each introduced error type (e.g., labels misclassification, outliers, etc.). The contaminated datasets, along with the synthetic dataset, are used to train one or more machine learning models. Once training is completed, the resulting models are then employed to pr…
Figure 2: Correlation matrix of selected dataset (year 2014).
Surrounding text (Section 4.2, Dataset and experimental setup): To evaluate the effectiveness of Pucktrick, we selected a diverse set of datasets related to stock market activities spanning the years 2014 to 2018. These datasets were chosen due to their highly dynamic nature, real-world complexity, and susceptibility to various types of data errors, such as missing values, noise, o…
Figure 3: Missing date of the original dataset.
Surrounding text: For this experiment, we considered only the top 20 features with the highest correlation with the binary target variable. The correlation matrix of the new dataset is shown in …
Original abstract

The increasing reliance on machine learning (ML) models for decision-making requires high-quality training data. However, access to real-world datasets is often restricted due to privacy concerns, proprietary restrictions, and incomplete data availability. As a result, synthetic data generation (SDG) has emerged as a viable alternative, enabling the creation of artificial datasets that preserve the statistical properties of real data while ensuring privacy compliance. Despite its advantages, synthetic data is often overly clean and lacks real-world imperfections, such as missing values, noise, outliers, and misclassified labels, which can significantly impact model generalization and robustness. To address this limitation, we introduce Pucktrick, a Python library designed to systematically contaminate synthetic datasets by introducing controlled errors. The library supports multiple error types, including missing data, noisy values, outliers, label misclassification, duplication, and class imbalance, offering a structured approach to evaluating ML model resilience under real-world data imperfections. Pucktrick provides two contamination modes: one for injecting errors into clean datasets and another for further corrupting already contaminated datasets. Through extensive experiments on real-world financial datasets, we evaluate the impact of systematic data contamination on model performance. Our findings demonstrate that ML models trained on contaminated synthetic data outperform those trained on purely synthetic, error-free data, particularly for tree-based and linear models such as SVMs and Extra Trees.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces PuckTrick, a Python library for systematically contaminating synthetic datasets with controlled real-world-like errors including missing values, noise, outliers, label misclassifications, duplications, and class imbalance. It supports two contamination modes and reports experiments on real-world financial datasets claiming that ML models (particularly tree-based and linear models such as SVMs and Extra Trees) trained on the contaminated synthetic data outperform those trained on purely clean synthetic data.

Significance. If the performance gains are rigorously demonstrated, the library would address a practical gap in synthetic data generation by enabling controlled evaluation of model robustness to realistic imperfections, which is valuable for privacy-sensitive domains like finance where clean synthetic data may lead to overly optimistic generalization estimates.

major comments (2)
  1. [Abstract] Abstract: the central claim that models trained on contaminated synthetic data outperform those trained on error-free synthetic data is stated without any quantitative metrics, error bars, baseline comparisons, or details on how contamination parameters (rates, mechanisms) were selected or calibrated against the real financial datasets; this leaves the performance improvement only weakly supported.
  2. [Experiments] Experiments on real-world financial datasets: the reported outperformance rests on the unverified assumption that PuckTrick's specific error-injection schedule (missingness patterns, noise covariances, outlier frequencies) statistically matches imperfections in the source real data; without explicit distributional comparisons or calibration, the gains could be an artifact of the chosen contamination rather than evidence that added realism improves generalization.
minor comments (1)
  1. [Title and Abstract] Inconsistent library name spelling: 'PuckTrick' in the title versus 'Pucktrick' in the abstract.
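The distributional comparison requested in major comment 2 could start from two simple diagnostics per feature: the gap in missing-value rates and a two-sample Kolmogorov-Smirnov statistic on the observed values. The sketch below is a hedged illustration using only the standard library. The helper names and the toy columns are hypothetical, and nothing here is taken from the paper's experiments.

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Largest gap between the empirical CDFs of two numeric samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    # The maximum gap is always attained at one of the observed points.
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

def missing_rate(column):
    """Share of entries in the column that are None."""
    return sum(v is None for v in column) / len(column)

# Toy columns standing in for one feature of the real and contaminated data.
real = [1.0, 2.0, None, 4.0, 5.0, None, 7.0, 8.0]
contaminated = [1.5, None, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5]

rate_gap = abs(missing_rate(real) - missing_rate(contaminated))
observed_real = [v for v in real if v is not None]
observed_cont = [v for v in contaminated if v is not None]
shape_gap = ks_statistic(observed_real, observed_cont)
```

Small values of both gaps across features would indicate that the injected imperfections resemble those in the source data; large values would support the referee's concern that the contamination schedule is uncalibrated.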

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which helps clarify how to better support the paper's claims. We address each major comment below and indicate planned revisions.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that models trained on contaminated synthetic data outperform those trained on error-free synthetic data is stated without any quantitative metrics, error bars, baseline comparisons, or details on how contamination parameters (rates, mechanisms) were selected or calibrated against the real financial datasets; this leaves the performance improvement only weakly supported.

    Authors: We agree that the abstract would be strengthened by including concrete quantitative support. In the revised version we will add specific performance metrics from the experiments (e.g., accuracy or F1 improvements for Extra Trees and SVMs with standard deviations across runs) together with a concise statement of the contamination rates and mechanisms employed. This change directly addresses the concern while preserving the abstract's length and focus.

    Revision: yes

  2. Referee: [Experiments] Experiments on real-world financial datasets: the reported outperformance rests on the unverified assumption that PuckTrick's specific error-injection schedule (missingness patterns, noise covariances, outlier frequencies) statistically matches imperfections in the source real data; without explicit distributional comparisons or calibration, the gains could be an artifact of the chosen contamination rather than evidence that added realism improves generalization.

    Authors: We accept that the manuscript would benefit from greater transparency on parameter selection. PuckTrick is intended to enable controlled injection of realistic imperfections rather than exact distributional replication of any single dataset. We will revise the experiments section to (i) explain the rationale for the chosen rates and mechanisms by reference to documented error characteristics in financial data, and (ii) present a sensitivity analysis across multiple contamination intensities. These additions will show that the observed gains are robust and not an artifact of one particular schedule.

    Revision: yes
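A sensitivity analysis of the kind promised in the second response amounts to a sweep over contamination rates and random seeds, reporting the mean and spread of the test score at each rate. The sketch below shows that loop structure with deliberately simple stand-ins: a one-feature Gaussian dataset, label flipping as the only error type, and a toy threshold classifier. All of these are assumptions made for illustration and none of them is the paper's experimental setup.

```python
import random
import statistics

def flip_labels(labels, rate, rng):
    """Return a copy of the labels with round(rate * n) binary labels flipped."""
    out = list(labels)
    for i in rng.sample(range(len(out)), round(rate * len(out))):
        out[i] = 1 - out[i]
    return out

def fit_threshold(xs, ys):
    """Toy classifier: the midpoint between the two class means."""
    mean0 = statistics.mean(x for x, y in zip(xs, ys) if y == 0)
    mean1 = statistics.mean(x for x, y in zip(xs, ys) if y == 1)
    return (mean0 + mean1) / 2

def accuracy(threshold, xs, ys):
    """Share of points whose side of the threshold matches their label."""
    return sum((x > threshold) == bool(y) for x, y in zip(xs, ys)) / len(xs)

def make_data(n, rng):
    """Stand-in dataset: class 0 centred at 0.0, class 1 centred at 2.0."""
    ys = [i % 2 for i in range(n)]
    xs = [rng.gauss(2.0 * y, 1.0) for y in ys]
    return xs, ys

def sweep(rates, seeds):
    """Mean and standard deviation of test accuracy for each contamination rate."""
    table = {}
    for rate in rates:
        scores = []
        for seed in seeds:
            rng = random.Random(seed)
            train_x, train_y = make_data(400, rng)
            test_x, test_y = make_data(400, rng)  # evaluated with clean labels
            noisy_y = flip_labels(train_y, rate, rng)
            scores.append(accuracy(fit_threshold(train_x, noisy_y), test_x, test_y))
        table[rate] = (statistics.mean(scores), statistics.stdev(scores))
    return table

results = sweep(rates=[0.0, 0.1, 0.2, 0.3], seeds=range(5))
```

Reusing the same seeds at every rate keeps the underlying data fixed across the sweep, so differences between rows reflect the contamination level and not resampling noise.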

Circularity Check

0 steps flagged

No circularity; empirical results rest on external datasets

The paper presents a library for controlled contamination of synthetic data and reports performance comparisons from direct experiments on separate real-world financial datasets. No mathematical derivations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described claims. The central finding (contaminated synthetic data yielding better model performance) is obtained via standard train/evaluate loops on held-out external data rather than by construction from the library's own inputs or prior self-referential results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that deliberately introduced error types will produce training conditions representative of real data imperfections; no free parameters or new invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: Real-world data contains imperfections such as missing values, noise, outliers, and misclassifications that affect ML model generalization.
    This premise justifies the need for controlled contamination and is invoked to interpret the experimental results.

pith-pipeline@v0.9.0 · 5775 in / 1296 out tokens · 40673 ms · 2026-05-19T07:28:12.084394+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Measuring the Sensitivity of Classification Models with the Error Sensitivity Profile

    cs.LG · 2026-04 · unverdicted · novelty 5.0

    ESP measures model sensitivity to feature errors and shows performance drops are not always predictable from simple target correlations.
