PuckTrick: A Library for Making Synthetic Data More Realistic

Alessandra Agostini; Andrea Maurino; Blerina Spahiu

arXiv: 2506.18499 · v1 · submitted 2025-06-23 · 💻 cs.LG · cs.AI · cs.DB

Pith reviewed 2026-05-19 07:28 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.DB

keywords synthetic data · data contamination · machine learning robustness · synthetic data generation · model performance · data imperfections · Python library · financial datasets

The pith

Training on synthetic data with controlled, injected errors is reported to improve machine learning model performance on real tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PuckTrick, a Python library that injects realistic imperfections such as missing values, noise, outliers, label errors, duplicates, and class imbalance into synthetic datasets. The core idea is that purely clean synthetic data fails to prepare models for the messiness of actual data, while adding these errors in a controlled way produces better generalization. Experiments on financial datasets demonstrate higher performance for models trained on the contaminated versions, with gains especially visible in tree-based and linear methods. A reader would care because the approach offers a way to make synthetic data more useful for training without relying on restricted real-world records.
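To make the idea of controlled contamination concrete, here is a minimal sketch of three of the listed error types (missing values, noise, outliers) applied to a single numeric column. It uses only the Python standard library. The function names, parameters, and default values are hypothetical illustrations and are not PuckTrick's API.

```python
import random
import statistics


def inject_missing(values, rate, rng):
    """Return a copy with a `rate` fraction of entries replaced by None."""
    out = list(values)
    for i in rng.sample(range(len(out)), round(rate * len(out))):
        out[i] = None
    return out


def inject_noise(values, rate, rng, scale=0.1):
    """Add zero-mean Gaussian noise to a `rate` fraction of entries.

    The noise standard deviation is `scale` times the column's own
    standard deviation (an assumption made for this sketch).
    """
    out = list(values)
    sigma = scale * statistics.pstdev(out)
    for i in rng.sample(range(len(out)), round(rate * len(out))):
        out[i] += rng.gauss(0.0, sigma)
    return out


def inject_outliers(values, rate, rng, k_sigma=5.0):
    """Move a `rate` fraction of entries at least `k_sigma` std devs from the mean."""
    out = list(values)
    mu = statistics.fmean(out)
    sigma = statistics.pstdev(out)
    for i in rng.sample(range(len(out)), round(rate * len(out))):
        sign = rng.choice((-1.0, 1.0))
        out[i] = mu + sign * k_sigma * sigma * (1.0 + rng.random())
    return out


rng = random.Random(0)  # fixed seed so the contamination is reproducible
clean = [rng.gauss(100.0, 15.0) for _ in range(1000)]

with_missing = inject_missing(clean, 0.10, rng)
with_noise = inject_noise(clean, 0.20, rng)
with_outliers = inject_outliers(clean, 0.02, rng)
```

Each function takes an explicit rate and a seeded random generator, so the same contaminated dataset can be regenerated exactly, which is what "controlled" means in this sketch.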

Core claim

PuckTrick supplies two modes for contaminating synthetic data: one that adds errors to clean datasets and another that further corrupts already imperfect data. When machine learning models are trained on the resulting contaminated synthetic data, they outperform models trained on error-free synthetic data when evaluated on real financial datasets. The improvement appears most clearly for tree-based and linear models such as support vector machines and extra trees.
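The two modes can be pictured with a small sketch, shown here for missing values only. The assumption in this sketch is that the second mode compares the contaminated column against the clean original and corrupts only untouched entries until a target rate is reached; the paper's summary above does not spell out the mechanism, and the names `contaminate_new` and `contaminate_extended` are hypothetical.

```python
import random


def contaminate_new(values, rate, rng):
    """First mode (sketch): blank out a `rate` fraction of a clean column."""
    out = list(values)
    for i in rng.sample(range(len(out)), round(rate * len(out))):
        out[i] = None
    return out


def contaminate_extended(original, contaminated, target_rate, rng):
    """Second mode (sketch): raise the corrupted fraction to `target_rate`.

    Entries that already differ from the clean original count toward the
    target, and only untouched entries are eligible for new corruption.
    """
    if len(original) != len(contaminated):
        raise ValueError("original and contaminated must have the same length")
    out = list(contaminated)
    untouched = [i for i, (o, c) in enumerate(zip(original, out)) if o == c]
    already = len(out) - len(untouched)
    extra = round(target_rate * len(out)) - already
    if extra <= 0:
        return out  # the column is already at or above the target rate
    for i in rng.sample(untouched, extra):
        out[i] = None
    return out


rng = random.Random(1)
clean = [float(i) for i in range(200)]
step1 = contaminate_new(clean, 0.10, rng)                   # 10% missing
step2 = contaminate_extended(clean, step1, 0.25, rng)       # raised to 25% in total
```

Under this reading, the second call never undoes or re-corrupts earlier damage: every entry blanked in `step1` is still blank in `step2`.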

What carries the argument

The PuckTrick library, which offers structured injection of multiple error types including missing data, noisy values, outliers, label misclassification, duplication, and class imbalance to simulate real-world data imperfections.
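The row-level and label-level error types in this list (label misclassification, duplication, class imbalance) can be sketched the same way. This is an illustration under stated assumptions, not the library's implementation: rows are tuples whose last element is a binary 0/1 label, and all function names are hypothetical.

```python
import random
from collections import Counter


def flip_labels(labels, rate, rng):
    """Flip a `rate` fraction of binary (0/1) labels."""
    out = list(labels)
    for i in rng.sample(range(len(out)), round(rate * len(out))):
        out[i] = 1 - out[i]
    return out


def add_duplicates(rows, rate, rng):
    """Append round(rate * len(rows)) copies of randomly chosen rows."""
    return list(rows) + rng.choices(rows, k=round(rate * len(rows)))


def undersample_class(rows, target_class, keep_fraction, rng):
    """Create class imbalance by keeping only part of one class.

    The label is assumed to be the last element of each row.
    """
    target = [i for i, row in enumerate(rows) if row[-1] == target_class]
    keep = set(rng.sample(target, round(keep_fraction * len(target))))
    target_set = set(target)
    return [row for i, row in enumerate(rows) if i not in target_set or i in keep]


rng = random.Random(2)
rows = [(rng.random(), i % 2) for i in range(1000)]  # 500 rows per class
labels = [row[-1] for row in rows]

noisy_labels = flip_labels(labels, 0.05, rng)
duplicated = add_duplicates(rows, 0.10, rng)
imbalanced = undersample_class(rows, target_class=1, keep_fraction=0.2, rng=rng)
class_counts = Counter(row[-1] for row in imbalanced)
```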

If this is right

Machine learning models achieve higher accuracy on real tasks when trained with systematically contaminated synthetic data.
Tree-based and linear models such as SVMs and extra trees show the clearest benefits from the added errors.
Synthetic data generation can be made more effective by incorporating controlled imperfections rather than aiming for perfect cleanliness.
Evaluation of model resilience becomes more realistic when using contaminated synthetic datasets instead of error-free ones.
The library supports repeated application of contamination to already imperfect data for further realism.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could extend to non-financial domains such as medical or sensor data if similar error patterns are introduced.
Practitioners might integrate PuckTrick into data pipelines to test model robustness before deployment.
Different error combinations could be tuned to optimize performance for specific model families.
The results suggest that real-world data noise functions as a natural regularizer that clean synthetic data lacks.

Load-bearing premise

The specific error types and contamination levels chosen accurately reproduce the statistical properties of imperfections that appear in real-world datasets.

What would settle it

Retraining the tested models on contaminated versus clean synthetic data and finding no performance gain or a loss on held-out real financial data would falsify the central claim.

Figures

Figures reproduced from arXiv: 2506.18499 by Alessandra Agostini, Andrea Maurino, Blerina Spahiu.

**Figure 1.** The proposed pipeline. To highlight the contribution of each type of error produced, the pipeline constructs a separate contaminated dataset for each introduced error type (e.g., labels misclassification, outliers, etc.). The contaminated datasets, along with the synthetic dataset, are used to train one or more machine learning models. Once training is completed, the resulting models are then employed to pr…

**Figure 2.** Correlation matrix of selected dataset (year 2014). 4.2. Dataset and experimental setup: To evaluate the effectiveness of Pucktrick, we selected a diverse set of datasets related to stock market activities spanning the years 2014 to 2018. These datasets were chosen due to their highly dynamic nature, real-world complexity, and susceptibility to various types of data errors, such as missing values, noise, o…

**Figure 3.** Missing data of the original dataset. For this experiment, we considered only the top 20 features with the highest correlation with the binary target variable. The correlation matrix of the new dataset is shown in …

Original abstract

The increasing reliance on machine learning (ML) models for decision-making requires high-quality training data. However, access to real-world datasets is often restricted due to privacy concerns, proprietary restrictions, and incomplete data availability. As a result, synthetic data generation (SDG) has emerged as a viable alternative, enabling the creation of artificial datasets that preserve the statistical properties of real data while ensuring privacy compliance. Despite its advantages, synthetic data is often overly clean and lacks real-world imperfections, such as missing values, noise, outliers, and misclassified labels, which can significantly impact model generalization and robustness. To address this limitation, we introduce Pucktrick, a Python library designed to systematically contaminate synthetic datasets by introducing controlled errors. The library supports multiple error types, including missing data, noisy values, outliers, label misclassification, duplication, and class imbalance, offering a structured approach to evaluating ML model resilience under real-world data imperfections. Pucktrick provides two contamination modes: one for injecting errors into clean datasets and another for further corrupting already contaminated datasets. Through extensive experiments on real-world financial datasets, we evaluate the impact of systematic data contamination on model performance. Our findings demonstrate that ML models trained on contaminated synthetic data outperform those trained on purely synthetic, error-free data, particularly for tree-based and linear models such as SVMs and Extra Trees.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PuckTrick is a practical library for adding controlled errors to synthetic data, but its performance claims rest on limited details and an unverified assumption that the injected errors match real-world patterns.

PuckTrick is a library for adding controlled errors to synthetic data. The authors built a Python package that supports missing values, noise, outliers, label misclassifications, duplicates, and class imbalance, with two modes for applying the contamination. They test it on real financial datasets and report that models trained on the contaminated versions outperform those trained on clean synthetic data, especially tree-based and linear models like SVMs and Extra Trees.

Referee Report

2 major / 1 minor

Summary. The paper introduces PuckTrick, a Python library for systematically contaminating synthetic datasets with controlled real-world-like errors including missing values, noise, outliers, label misclassifications, duplications, and class imbalance. It supports two contamination modes and reports experiments on real-world financial datasets claiming that ML models (particularly tree-based and linear models such as SVMs and Extra Trees) trained on the contaminated synthetic data outperform those trained on purely clean synthetic data.

Significance. If the performance gains are rigorously demonstrated, the library would address a practical gap in synthetic data generation by enabling controlled evaluation of model robustness to realistic imperfections, which is valuable for privacy-sensitive domains like finance where clean synthetic data may lead to overly optimistic generalization estimates.

major comments (2)

[Abstract] Abstract: the central claim that models trained on contaminated synthetic data outperform those trained on error-free synthetic data is stated without any quantitative metrics, error bars, baseline comparisons, or details on how contamination parameters (rates, mechanisms) were selected or calibrated against the real financial datasets; this leaves the performance improvement only weakly supported.
[Experiments] Experiments on real-world financial datasets: the reported outperformance rests on the unverified assumption that PuckTrick's specific error-injection schedule (missingness patterns, noise covariances, outlier frequencies) statistically matches imperfections in the source real data; without explicit distributional comparisons or calibration, the gains could be an artifact of the chosen contamination rather than evidence that added realism improves generalization.
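One shape the distributional comparison requested in the second major comment could take is sketched below: compare per-column missing rates, and compute a two-sample Kolmogorov-Smirnov statistic between a real column and its contaminated synthetic counterpart. The data are toy stand-ins, the function names are hypothetical, and nothing here reflects checks the authors actually ran.

```python
from bisect import bisect_right


def missing_rate(values):
    """Fraction of entries that are None."""
    return sum(v is None for v in values) / len(values)


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic.

    This is the largest absolute gap between the two empirical CDFs.
    Missing entries (None) are ignored. Because both CDFs are step
    functions, checking the gap at every observed value is sufficient.
    """
    a = sorted(v for v in sample_a if v is not None)
    b = sorted(v for v in sample_b if v is not None)
    if not a or not b:
        raise ValueError("both samples need at least one observed value")
    gap = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Toy stand-ins for one column of real data and of contaminated synthetic data.
real_column = [1.0, 2.0, None, 3.0, 4.0, None, 5.0, 6.0]
synthetic_column = [1.5, None, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5]

rate_gap = abs(missing_rate(real_column) - missing_rate(synthetic_column))
shape_gap = ks_statistic(real_column, synthetic_column)
```

A small `rate_gap` and `shape_gap` for each column would be one piece of evidence that the contamination schedule resembles the real imperfections; large values would suggest the schedule was chosen arbitrarily.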

minor comments (1)

[Title and Abstract] Inconsistent library name spelling: 'PuckTrick' in the title versus 'Pucktrick' in the abstract.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which helps clarify how to better support the paper's claims. We address each major comment below and indicate planned revisions.

Referee: [Abstract] Abstract: the central claim that models trained on contaminated synthetic data outperform those trained on error-free synthetic data is stated without any quantitative metrics, error bars, baseline comparisons, or details on how contamination parameters (rates, mechanisms) were selected or calibrated against the real financial datasets; this leaves the performance improvement only weakly supported.

Authors: We agree that the abstract would be strengthened by including concrete quantitative support. In the revised version we will add specific performance metrics from the experiments (e.g., accuracy or F1 improvements for Extra Trees and SVMs with standard deviations across runs) together with a concise statement of the contamination rates and mechanisms employed. This change directly addresses the concern while preserving the abstract's length and focus.
Revision: yes

Referee: [Experiments] Experiments on real-world financial datasets: the reported outperformance rests on the unverified assumption that PuckTrick's specific error-injection schedule (missingness patterns, noise covariances, outlier frequencies) statistically matches imperfections in the source real data; without explicit distributional comparisons or calibration, the gains could be an artifact of the chosen contamination rather than evidence that added realism improves generalization.

Authors: We accept that the manuscript would benefit from greater transparency on parameter selection. PuckTrick is intended to enable controlled injection of realistic imperfections rather than exact distributional replication of any single dataset. We will revise the experiments section to (i) explain the rationale for the chosen rates and mechanisms by reference to documented error characteristics in financial data, and (ii) present a sensitivity analysis across multiple contamination intensities. These additions will show that the observed gains are robust and not an artifact of one particular schedule.
Revision: yes
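A sensitivity analysis of the kind described in these responses could look roughly like the sketch below: sweep the contamination rate, repeat over several seeds, and report the mean and standard deviation of accuracy on uncontaminated held-out data. Everything in it is a toy assumption: the data are two Gaussian blobs, the only error type is label flipping, and the model is a nearest-centroid classifier rather than the Extra Trees or SVM models used in the paper. It shows the structure of the experiment and says nothing about the paper's results.

```python
import random
import statistics


def make_data(n, rng):
    """Two unit-variance Gaussian blobs in 2-D, centred at (0, 0) and (2, 2)."""
    points, labels = [], []
    for _ in range(n):
        y = rng.randrange(2)
        center = 2.0 * y
        points.append((rng.gauss(center, 1.0), rng.gauss(center, 1.0)))
        labels.append(y)
    return points, labels


def flip_labels(labels, rate, rng):
    """Flip a `rate` fraction of binary labels."""
    out = list(labels)
    for i in rng.sample(range(len(out)), round(rate * len(out))):
        out[i] = 1 - out[i]
    return out


def fit_centroids(points, labels):
    """Nearest-centroid 'training': one mean point per class."""
    centroids = {}
    for cls in (0, 1):
        members = [p for p, y in zip(points, labels) if y == cls]
        centroids[cls] = tuple(statistics.fmean(axis) for axis in zip(*members))
    return centroids


def accuracy(centroids, points, labels):
    """Share of points whose closest centroid matches the true label."""
    def predict(p):
        return min(centroids, key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
    return sum(predict(p) == y for p, y in zip(points, labels)) / len(labels)


def sweep(rates, seeds):
    """Return {rate: (mean accuracy, std dev)} over the given seeds."""
    results = {}
    for rate in rates:
        scores = []
        for seed in seeds:
            rng = random.Random(seed)  # same seed gives the same data at every rate
            train_x, train_y = make_data(400, rng)
            test_x, test_y = make_data(400, rng)  # stand-in for held-out real data
            model = fit_centroids(train_x, flip_labels(train_y, rate, rng))
            scores.append(accuracy(model, test_x, test_y))
        results[rate] = (statistics.fmean(scores), statistics.stdev(scores))
    return results


results = sweep(rates=(0.0, 0.1, 0.2, 0.4), seeds=range(5))
for rate, (mean, std) in results.items():
    print(f"rate={rate:.1f}  accuracy={mean:.3f} +/- {std:.3f}")
```

Reusing the seed across rates means every rate is evaluated on identical train and test draws, so differences between rows come from the contamination alone. The rate of 0.0 serves as the clean baseline.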

Circularity Check

0 steps flagged

No circularity; empirical results rest on external datasets

The paper presents a library for controlled contamination of synthetic data and reports performance comparisons from direct experiments on separate real-world financial datasets. No mathematical derivations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described claims. The central finding (contaminated synthetic data yielding better model performance) is obtained via standard train/evaluate loops on held-out external data rather than by construction from the library's own inputs or prior self-referential results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that deliberately introduced error types will produce training conditions representative of real data imperfections; no free parameters or new invented entities are introduced in the abstract.

axioms (1)

Domain assumption: Real-world data contains imperfections such as missing values, noise, outliers, and misclassifications that affect ML model generalization.
This premise justifies the need for controlled contamination and is invoked to interpret the experimental results.

pith-pipeline@v0.9.0 · 5775 in / 1296 out tokens · 40673 ms · 2026-05-19T07:28:12.084394+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Pucktrick library ... supports multiple error types, including missing data, noisy values, outliers, label misclassification, duplication, and class imbalance"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Experiments show that training machine learning models on synthetic data with controlled errors ... results in better accuracy"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Measuring the Sensitivity of Classification Models with the Error Sensitivity Profile
cs.LG · 2026-04 · unverdicted · novelty 5.0

ESP measures model sensitivity to feature errors and shows performance drops are not always predictable from simple target correlations.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · cited by 1 Pith paper

[1] A. Paleyes, R.-G. Urma, N. D. Lawrence, Challenges in deploying machine learning: a survey of case studies, ACM Computing Surveys 55 (2022) 1–29.

[2] A. Figueira, B. Vaz, Survey on synthetic data generation, evaluation methods and GANs, Mathematics 10 (2022) 2733.

[3] K. S, M. Durgadevi, Generative adversarial network (GAN): a general review on different variants of GAN and applications, in: 2021 6th International Conference on Communication and Electronics Systems (ICCES), 2021, pp. 1–8. doi:10.1109/ICCES51350.2021.9489160

[4] R. Wei, C. Garcia, A. El-Sayed, V. Peterson, A. Mahmood, Variations in variational autoencoders - a comparative evaluation, IEEE Access 8 (2020) 153651–153670. doi:10.1109/ACCESS.2020.3018151

[5] F. Liu, D. Panagiotakos, Real-world data: a brief review of the methods, applications, challenges and opportunities, BMC Medical Research Methodology 22 (2022) 287.

[6] S. Iskander, N. Cohen, Z. Karnin, O. Shapira, S. Tolmach, Quality matters: Evaluating synthetic data for tool-using LLMs, arXiv preprint arXiv:2409.16341 (2024).

[7] J. Chen, Y. Zhang, B. Wang, W. X. Zhao, J.-R. Wen, W. Chen, Unveiling the flaws: Exploring imperfections in synthetic data and mitigation strategies for large language models, arXiv preprint arXiv:2406.12397 (2024).

[8] A. Bauer, S. Trapp, M. Stenger, R. Leppich, S. Kounev, M. Leznik, K. Chard, I. Foster, Comprehensive exploration of synthetic data generation: A survey, arXiv preprint arXiv:2401.02524 (2024).

[9] J. Fonseca, F. Bacao, Tabular and latent space synthetic data generation: a literature review, Journal of Big Data 10 (2023) 115. URL: https://doi.org/10.1186/s40537-023-00792-7. doi:10.1186/s40537-023-00792-7

[10] L. Xu, M. Skoularidou, A. Cuesta-Infante, K. Veeramachaneni, Modeling tabular data using conditional GAN, in: Advances in Neural Information Processing Systems (NeurIPS), volume 32, 2019, p. 1049.

[11] V. Borisov, K. Seßler, T. Leemann, M. Pawelczyk, G. Kasneci, Language models are realistic tabular data generators, in: The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023, OpenReview.net, 2023. URL: https://openreview.net/forum?id=cEygmQNOeI

[12] N. Patki, R. Wedge, K. Veeramachaneni, The synthetic data vault, in: 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), IEEE, 2016, pp. 399–410.

[13] N. Nasios, A. Bors, Variational learning for Gaussian mixture models, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 36 (2006) 849–862. doi:10.1109/TSMCB.2006.872273

[14] M. Walia, B. Tierney, S. Mckeever, Synthesising tabular data using Wasserstein conditional GANs with gradient penalty (WCGAN-GP), in: Irish Conference on Artificial Intelligence and Cognitive Science, 2020. URL: https://api.semanticscholar.org/CorpusID:229345165

[15] T. Milne, A. I. Nachman, Wasserstein GANs with gradient penalty compute congested transport, in: P.-L. Loh, M. Raginsky (Eds.), Proceedings of Thirty Fifth Conference on Learning Theory, volume 178 of Proceedings of Machine Learning Research, PMLR, 2022, pp. 103–129. URL: https://proceedings.mlr.press/v178/milne22a.html

[16] L. Hansen, N. Seedat, M. van der Schaar, A. Petrovic, Reimagining synthetic tabular data generation through data-centric AI: A comprehensive benchmark, Advances in Neural Information Processing Systems 36 (2023) 33781–33823.

[17] H. H. Rashidi, S. Albahra, B. P. Rubin, B. Hu, A novel and fully automated platform for synthetic tabular data generation and validation, Scientific Reports 14 (2024) 23312.

[18] Q. Liu, M. Khalil, J. Jovanovic, R. Shakya, Scaling while privacy preserving: A comprehensive synthetic tabular data generation and evaluation in learning analytics, in: Proceedings of the 14th Learning Analytics and Knowledge Conference, 2024, pp. 620–631.

[19] N. Natarajan, I. S. Dhillon, P. K. Ravikumar, A. Tewari, Learning with noisy labels, in: C. Burges, L. Bottou, M. Welling, Z. Ghahramani, K. Weinberger (Eds.), Advances in Neural Information Processing Systems, volume 26, Curran Associates, Inc., 2013. URL: https://proceedings.neurips.cc/paper_files/paper/2013/file/3871bd64012152bfb53fdf04b401193f-Paper.pdf

[20] T. Emmanuel, T. Maupong, D. Mpoeleng, T. Semong, E. Tabane, M. Velempini, A survey on missing data in machine learning, Journal of Big Data 8 (2021) 140. URL: https://doi.org/10.1186/s40537-021-00516-9. doi:10.1186/s40537-021-00516-9

[21] A. Boukerche, L. Zheng, O. Alfandi, Outlier detection: Methods, models, and classification, ACM Computing Surveys (CSUR) 53 (2020) 55:1–55:37. URL: https://doi.org/10.1145/3381028. doi:10.1145/3381028

[22] B. Frénay, A. Kabán, et al., A comprehensive introduction to label noise, in: ESANN, Citeseer, 2014.
