Can Synthetic Data be Fair and Private? A Comparative Study of Synthetic Data Generation and Fairness Algorithms
Pith reviewed 2026-05-23 06:05 UTC · model grok-4.3
The pith
DECAF achieves the best privacy-fairness balance among synthetic data generators, and fairness algorithms improve synthetic data more than real data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The DEbiasing CAusal Fairness (DECAF) algorithm achieves the best balance between privacy and fairness, but at a cost in utility, as reflected in its predictive accuracy. Applying pre-processing fairness algorithms to synthetic data improves fairness even more than applying them to real data. These findings suggest that combining synthetic data generation with fairness pre-processing offers a promising approach to creating fairer learning analytics (LA) models.
What carries the argument
The DEbiasing CAusal Fairness (DECAF) algorithm, which generates synthetic data while enforcing causal fairness constraints, and the empirical comparison of its performance against other generators and fairness methods on privacy and fairness metrics.
If this is right
- Synthetic data can enhance both privacy and fairness simultaneously in LA models.
- Pre-processing fairness on synthetic data yields superior fairness outcomes compared to real data.
- Trade-offs in predictive accuracy must be managed when prioritizing privacy and fairness.
- This combination provides a practical strategy for fairer learning analytics without direct use of sensitive real data.
Where Pith is reading between the lines
- The results may extend to other sensitive data domains like healthcare if similar metrics hold.
- Future work could explore whether different fairness definitions alter the observed advantages of DECAF.
- Integrating utility optimization into DECAF might address its accuracy limitations.
- This suggests synthetic data generation as a preprocessing step worth standardizing in fair ML pipelines.
Load-bearing premise
The selected privacy, fairness, and utility metrics along with the specific datasets used serve as reliable indicators of performance in actual learning analytics deployments.
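To make the premise concrete, here is a minimal sketch of two group-fairness metrics of the kind the rebuttal further down names (demographic parity and equalized odds). The function names, the binary 0/1 label encoding, and the toy data are assumptions for illustration; the paper's exact metric definitions are not given on this page.

```python
from collections import defaultdict

def selection_rates(y_pred, groups):
    """Fraction of positive (== 1) predictions within each sensitive group."""
    positives, totals = defaultdict(int), defaultdict(int)
    for pred, group in zip(y_pred, groups):
        totals[group] += 1
        positives[group] += int(pred == 1)
    return {g: positives[g] / totals[g] for g in totals}

def demographic_parity_difference(y_pred, groups):
    """Largest gap in positive-prediction rate between any two groups."""
    rates = selection_rates(y_pred, groups)
    return max(rates.values()) - min(rates.values())

def equalized_odds_difference(y_true, y_pred, groups):
    """Largest between-group gap in true-positive or false-positive rate."""
    gaps = []
    for label in (1, 0):  # label 1 gives TPR gaps, label 0 gives FPR gaps
        idx = [i for i, t in enumerate(y_true) if t == label]
        rates = selection_rates([y_pred[i] for i in idx],
                                [groups[i] for i in idx])
        gaps.append(max(rates.values()) - min(rates.values()))
    return max(gaps)

# Toy data, purely hypothetical (not taken from the paper's datasets).
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

dp_gap = demographic_parity_difference(y_pred, groups)
eo_gap = equalized_odds_difference(y_true, y_pred, groups)
print(dp_gap, eo_gap)
```

Both values are 0 for a perfectly group-balanced classifier under these definitions; whether such gaps track fairness in a real LA deployment is exactly what the premise assumes.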
What would settle it
Empirical results on additional datasets or with alternative metrics in which another generator outperforms DECAF on the privacy-fairness balance, or in which fairness pre-processing yields no larger fairness gain on synthetic data than on real data.
Original abstract
The increasing use of machine learning in learning analytics (LA) has raised significant concerns around algorithmic fairness and privacy. Synthetic data has emerged as a dual-purpose tool, enhancing privacy and improving fairness in LA models. However, prior research suggests an inverse relationship between fairness and privacy, making it challenging to optimize both. This study investigates which synthetic data generators can best balance privacy and fairness, and whether pre-processing fairness algorithms, typically applied to real datasets, are effective on synthetic data. Our results highlight that the DEbiasing CAusal Fairness (DECAF) algorithm achieves the best balance between privacy and fairness. However, DECAF suffers in utility, as reflected in its predictive accuracy. Notably, we found that applying pre-processing fairness algorithms to synthetic data improves fairness even more than when applied to real data. These findings suggest that combining synthetic data generation with fairness pre-processing offers a promising approach to creating fairer LA models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper conducts an empirical comparative study of synthetic data generators (including DECAF) and fairness pre-processing algorithms in learning analytics settings. It evaluates trade-offs among privacy, fairness, and utility metrics, concluding that DECAF offers the strongest privacy-fairness balance (at the expense of predictive accuracy) and that applying standard fairness pre-processors to synthetic data yields larger fairness gains than when applied to the original real data.
Significance. If the experimental outcomes are robust, the work supplies practical evidence that synthetic data plus fairness pre-processing can jointly advance privacy and fairness in LA models, a domain where both concerns are acute. The comparative design across multiple generators and algorithms is a strength, as is the explicit reporting of utility degradation for the top privacy-fairness performer. No machine-checked proofs or parameter-free derivations are present, but the falsifiable metric-based claims allow direct replication checks.
major comments (2)
- [§4, Table 2] §4 (Experimental Setup) and Table 2: the claim that 'DECAF achieves the best balance' requires an explicit scalar or Pareto criterion; the text does not state whether balance is defined by a weighted sum, dominance count, or threshold on the reported privacy and fairness scores, making it impossible to verify the ranking without re-deriving the ordering from raw numbers.
- [§5.2] §5.2 (Fairness Improvement on Synthetic Data): the statement that pre-processing 'improves fairness even more than when applied to real data' is load-bearing for the central recommendation, yet the section supplies no statistical test (e.g., paired t-test or Wilcoxon) or confidence intervals on the fairness deltas, and does not report whether the same random seeds and hyper-parameters were used for the real-data and synthetic-data fairness runs.
minor comments (2)
- [Abstract] Abstract: lists no datasets, sample sizes, or exact metric definitions; readers must reach §4 to learn these details.
- [Tables] Notation: 'privacy' and 'fairness' scores are used without a single consolidated table that also includes the utility (accuracy) column for every generator-algorithm pair.
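The first major comment asks for an explicit balance criterion. One candidate is a Pareto check, sketched below under stated assumptions: each generator is reduced to a pair (privacy leakage, fairness violation), lower is better on both, and the generator names and numbers are hypothetical placeholders rather than values from the paper.

```python
def dominates(a, b):
    """True if a is no worse than b on every metric and strictly better on one.

    Assumes lower is better for every metric in the tuples.
    """
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better

def pareto_front(scores):
    """Names of generators that no other generator dominates."""
    return sorted(
        name
        for name, own in scores.items()
        if not any(dominates(other, own)
                   for other_name, other in scores.items()
                   if other_name != name)
    )

# Hypothetical (privacy_leakage, fairness_violation) pairs, NOT paper results.
scores = {
    "gen_A": (0.10, 0.05),
    "gen_B": (0.20, 0.08),
    "gen_C": (0.08, 0.12),
    "gen_D": (0.25, 0.15),
}

front = pareto_front(scores)
print(front)
```

In this toy input two generators are non-dominated, which illustrates why a Pareto criterion alone may not single out one "best balance" and a tie-breaking rule (for example a weighted sum) might still need to be stated.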
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive assessment of the work's significance. We address each major comment below and will revise the manuscript accordingly to improve clarity and statistical rigor.
Point-by-point responses
- Referee: [§4, Table 2] §4 (Experimental Setup) and Table 2: the claim that 'DECAF achieves the best balance' requires an explicit scalar or Pareto criterion; the text does not state whether balance is defined by a weighted sum, dominance count, or threshold on the reported privacy and fairness scores, making it impossible to verify the ranking without re-deriving the ordering from raw numbers.
  Authors: We agree that an explicit definition of the balance criterion is needed for verifiability. In the original analysis, DECAF was selected because it simultaneously minimized privacy leakage (across membership inference and attribute inference attacks) and achieved the lowest fairness violations (demographic parity and equalized odds) among the generators tested, corresponding to Pareto dominance in the privacy-fairness plane. We will revise §4 and Table 2 to state explicitly that balance is defined by Pareto dominance (no other generator is at least as good on both metrics and strictly better on at least one), and we will add the raw metric values plus a short dominance table for direct verification.
  Revision: yes
- Referee: [§5.2] §5.2 (Fairness Improvement on Synthetic Data): the statement that pre-processing 'improves fairness even more than when applied to real data' is load-bearing for the central recommendation, yet the section supplies no statistical test (e.g., paired t-test or Wilcoxon) or confidence intervals on the fairness deltas, and does not report whether the same random seeds and hyper-parameters were used for the real-data and synthetic-data fairness runs.
  Authors: We acknowledge the need for statistical support on this central claim. The fairness pre-processing runs on real and synthetic data used identical random seeds, hyper-parameters, and train/test splits to ensure comparability. We will add Wilcoxon signed-rank tests on the per-dataset fairness deltas (with p-values) and 95% confidence intervals on the mean improvements, plus an explicit statement confirming the shared experimental controls. These additions will appear in §5.2 and the associated tables.
  Revision: yes
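The statistical additions promised in the second response could look roughly like the following standard-library sketch: an exact two-sided Wilcoxon signed-rank test (by enumerating sign patterns, so only practical for a handful of pairs) and a percentile bootstrap interval on the mean delta. The paired fairness gains are hypothetical placeholders, and the percentile bootstrap is an assumed choice; the page does not say how the authors will compute their intervals.

```python
import random
from itertools import product
from statistics import mean

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def wilcoxon_signed_rank_exact(deltas):
    """Return (W+, exact two-sided p-value); zero differences are dropped."""
    d = [x for x in deltas if x != 0]
    ranks = average_ranks([abs(x) for x in d])
    w_plus = sum(r for r, x in zip(ranks, d) if x > 0)
    centre = sum(ranks) / 2  # expected W+ under the null hypothesis
    observed = abs(w_plus - centre)
    hits = 0
    for signs in product((0, 1), repeat=len(d)):  # all 2**n sign patterns
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - centre) >= observed - 1e-12:
            hits += 1
    return w_plus, hits / 2 ** len(d)

def bootstrap_ci(deltas, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of the paired deltas."""
    rng = random.Random(seed)
    means = sorted(mean(rng.choices(deltas, k=len(deltas)))
                   for _ in range(n_boot))
    return means[int(n_boot * alpha / 2)], means[int(n_boot * (1 - alpha / 2)) - 1]

# Hypothetical fairness gains from pre-processing, one pair per run/dataset.
real_gain = [0.02, 0.03, 0.01, 0.04, 0.02, 0.05]
synth_gain = [0.05, 0.04, 0.03, 0.065, 0.06, 0.045]
deltas = [round(s - r, 6) for s, r in zip(synth_gain, real_gain)]

w_plus, p_value = wilcoxon_signed_rank_exact(deltas)
ci_low, ci_high = bootstrap_ci(deltas)
print(w_plus, p_value, (ci_low, ci_high))
```

With only six pairs the smallest attainable two-sided p-value is 2/64, so the number of paired runs or datasets limits how strong any such test can be.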
Circularity Check
No significant circularity; empirical study
The paper is a comparative empirical study of synthetic data generators and fairness pre-processing algorithms on learning analytics datasets. All claims rest on reported experimental outcomes for privacy, fairness, and utility metrics rather than any mathematical derivations, fitted parameters renamed as predictions, or self-citation chains. No equations, ansatzes, or uniqueness theorems appear in the abstract or framing; the structure is self-contained against external benchmarks with no load-bearing reductions to the paper's own inputs.