Can Synthetic Data be Fair and Private? A Comparative Study of Synthetic Data Generation and Fairness Algorithms

George Siemens; Mohammad Khalil; Oscar Deho; Qinyi Liu; Sam Urmian; Srecko Joksimovic

arXiv: 2501.01785 · v1 · pith:CDANHIZHnew · submitted 2025-01-03 · 💻 cs.LG · cs.AI · cs.CY


Qinyi Liu, Oscar Deho, Sam Urmian, Mohammad Khalil, Srecko Joksimovic, George Siemens

Pith reviewed 2026-05-23 06:05 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CY

keywords synthetic data · algorithmic fairness · privacy · learning analytics · DECAF · machine learning · fairness algorithms · data generation


The pith

DECAF achieves the best privacy-fairness balance among synthetic data generators, and fairness algorithms improve synthetic data more than real data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether synthetic data can simultaneously support privacy and fairness in machine learning models for learning analytics. It compares multiple synthetic data generation methods on privacy, fairness, and utility metrics. The study also checks if standard fairness pre-processing steps work better when applied to synthetic data instead of real data. The findings indicate a path to fairer and more private models by combining these techniques, which matters for educational applications where both concerns are acute.

Core claim

The DEbiasing CAusal Fairness (DECAF) algorithm achieves the best balance between privacy and fairness. However, it suffers in utility as reflected in predictive accuracy. Applying pre-processing fairness algorithms to synthetic data improves fairness even more than when applied to real data. These findings suggest that combining synthetic data generation with fairness pre-processing offers a promising approach to creating fairer LA models.
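To make "pre-processing fairness algorithm" concrete, here is a minimal sketch of one well-known method of that kind, reweighing (Kamiran and Calders, 2012). This summary does not say which pre-processing algorithms the paper actually ran, so treating reweighing as representative is an assumption, and the group labels, outcomes, and function names below are hypothetical toy values rather than the paper's data. Each row is weighted by P(group) · P(label) / P(group, label), so that under the weights the label is statistically independent of the group. The same function could be applied to a real or a synthetic training table.

```python
from collections import Counter


def reweighing_weights(groups, labels):
    """Per-row weights n_g * n_y / (n * n_gy), i.e. P(g) * P(y) / P(g, y)."""
    n = len(labels)
    n_g = Counter(groups)
    n_y = Counter(labels)
    n_gy = Counter(zip(groups, labels))
    return [n_g[g] * n_y[y] / (n * n_gy[(g, y)]) for g, y in zip(groups, labels)]


def weighted_positive_rate(groups, labels, weights, group):
    """Weighted share of positive labels inside one group."""
    rows = [(y, w) for g, y, w in zip(groups, labels, weights) if g == group]
    return sum(w for y, w in rows if y == 1) / sum(w for _, w in rows)


# Hypothetical toy table: "p" = privileged group, "u" = unprivileged group.
groups = ["p"] * 6 + ["u"] * 4
labels = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0]

weights = reweighing_weights(groups, labels)
rate_p = weighted_positive_rate(groups, labels, weights, "p")
rate_u = weighted_positive_rate(groups, labels, weights, "u")
print(rate_p, rate_u)
```

In this toy table the raw positive rates are 4/6 for group "p" and 1/4 for group "u"; after reweighing, both weighted rates equal the overall positive rate of 0.5.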

What carries the argument

The DEbiasing CAusal Fairness (DECAF) algorithm, which generates synthetic data while enforcing causal fairness constraints, and the empirical comparison of its performance against other generators and fairness methods on privacy and fairness metrics.

If this is right

Synthetic data can enhance both privacy and fairness simultaneously in LA models.
Pre-processing fairness on synthetic data yields superior fairness outcomes compared to real data.
Trade-offs in predictive accuracy must be managed when prioritizing privacy and fairness.
This combination provides a practical strategy for fairer learning analytics without direct use of sensitive real data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The results may extend to other sensitive data domains like healthcare if similar metrics hold.
Future work could explore whether different fairness definitions alter the observed advantages of DECAF.
Integrating utility optimization into DECAF might address its accuracy limitations.
This suggests synthetic data generation as a preprocessing step worth standardizing in fair ML pipelines.

Load-bearing premise

The selected privacy, fairness, and utility metrics along with the specific datasets used serve as reliable indicators of performance in actual learning analytics deployments.

What would settle it

Empirical results on additional datasets or with alternative metrics where another generator outperforms DECAF on the privacy-fairness balance or where fairness pre-processing does not improve more on synthetic data.

Figures

Figures reproduced from arXiv: 2501.01785 by George Siemens, Mohammad Khalil, Oscar Deho, Qinyi Liu, Sam Urmian, Srecko Joksimovic.

**Figure 1.** The overall flow of our experiments. It starts with (1) synthetic data generation and privacy evaluation, (2) training of baseline and fair models on both real and synthetic data, and (3) evaluation of baseline and fair models for fairness and predictive accuracy. A TPR value of 0 indicates perfect fairness, a positive value means the unprivileged group has a higher true positive rate, and a negative v…
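The TPR sign convention quoted in the caption can be sketched in a few lines. This is an illustrative reading, assuming the metric is the unprivileged group's true positive rate minus the privileged group's. The function names and toy labels are hypothetical and not taken from the paper.

```python
def true_positive_rate(y_true, y_pred):
    """TPR = TP / (TP + FN), computed over the rows whose true label is 1."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    if not preds_for_positives:
        raise ValueError("no positive examples in this group")
    return sum(preds_for_positives) / len(preds_for_positives)


def tpr_difference(y_true, y_pred, group, unprivileged):
    """TPR(unprivileged) - TPR(privileged); 0 means equal TPR across groups."""
    rows = list(zip(y_true, y_pred, group))
    unpriv = [(t, p) for t, p, g in rows if g == unprivileged]
    priv = [(t, p) for t, p, g in rows if g != unprivileged]
    tpr_unpriv = true_positive_rate([t for t, _ in unpriv], [p for _, p in unpriv])
    tpr_priv = true_positive_rate([t for t, _ in priv], [p for _, p in priv])
    return tpr_unpriv - tpr_priv


# Hypothetical toy labels: "u" = unprivileged group, "p" = privileged group.
y_true = [1, 1, 0, 1, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1, 1, 1]
group = ["u", "u", "u", "u", "p", "p", "p", "p"]

gap = tpr_difference(y_true, y_pred, group, unprivileged="u")
print(round(gap, 3))  # -0.333: TPR is 2/3 for "u" and 3/3 for "p"
```

Here the gap is negative because the toy classifier recovers every positive case in group "p" but misses one of three in group "u".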

Original abstract

The increasing use of machine learning in learning analytics (LA) has raised significant concerns around algorithmic fairness and privacy. Synthetic data has emerged as a dual-purpose tool, enhancing privacy and improving fairness in LA models. However, prior research suggests an inverse relationship between fairness and privacy, making it challenging to optimize both. This study investigates which synthetic data generators can best balance privacy and fairness, and whether pre-processing fairness algorithms, typically applied to real datasets, are effective on synthetic data. Our results highlight that the DEbiasing CAusal Fairness (DECAF) algorithm achieves the best balance between privacy and fairness. However, DECAF suffers in utility, as reflected in its predictive accuracy. Notably, we found that applying pre-processing fairness algorithms to synthetic data improves fairness even more than when applied to real data. These findings suggest that combining synthetic data generation with fairness pre-processing offers a promising approach to creating fairer LA models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's main new observation is that fairness pre-processing yields larger gains on synthetic data than real data, with DECAF balancing privacy and fairness best though at a utility cost.


The one or two things worth knowing are that this comparative study finds fairness pre-processing improves fairness metrics more when run on synthetic data than on the original real data, and that DECAF stands out among the tested options for the privacy-fairness trade-off even while lowering predictive accuracy. The work runs head-to-head tests of several synthetic generators paired with fairness algorithms in a learning analytics setting and reports how the combinations affect privacy, fairness, and utility. The differential effect on synthetic versus real data is the concrete empirical result they highlight. They build on existing observations about the fairness-privacy tension and test whether generating synthetic data first helps resolve it. The practical mapping of which algorithm gives the best balance is the part that could actually guide choices for LA practitioners. The soft spots are the usual ones for this kind of study: the claims rest on the chosen metrics and datasets serving as reasonable proxies, and the abstract supplies no information on statistical controls, variance across runs, or exact dataset characteristics. If the full paper shows multiple datasets and clear experimental design, that would tighten things up; otherwise the generalizability stays limited. This is targeted work for people already working on fairness and privacy in educational data. It does not introduce new methods or theory, but the reported comparison is straightforward enough to be worth checking. I would send it to referees so the experimental details can be examined directly.

Referee Report

2 major / 2 minor

Summary. The paper conducts an empirical comparative study of synthetic data generators (including DECAF) and fairness pre-processing algorithms in learning analytics settings. It evaluates trade-offs among privacy, fairness, and utility metrics, concluding that DECAF offers the strongest privacy-fairness balance (at the expense of predictive accuracy) and that applying standard fairness pre-processors to synthetic data yields larger fairness gains than when applied to the original real data.

Significance. If the experimental outcomes are robust, the work supplies practical evidence that synthetic data plus fairness pre-processing can jointly advance privacy and fairness in LA models, a domain where both concerns are acute. The comparative design across multiple generators and algorithms is a strength, as is the explicit reporting of utility degradation for the top privacy-fairness performer. No machine-checked proofs or parameter-free derivations are present, but the falsifiable metric-based claims allow direct replication checks.

major comments (2)

[§4, Table 2] §4 (Experimental Setup) and Table 2: the claim that 'DECAF achieves the best balance' requires an explicit scalar or Pareto criterion; the text does not state whether balance is defined by a weighted sum, dominance count, or threshold on the reported privacy and fairness scores, making it impossible to verify the ranking without re-deriving the ordering from raw numbers.
[§5.2] §5.2 (Fairness Improvement on Synthetic Data): the statement that pre-processing 'improves fairness even more than when applied to real data' is load-bearing for the central recommendation, yet the section supplies no statistical test (e.g., paired t-test or Wilcoxon) or confidence intervals on the fairness deltas, and does not report whether the same random seeds and hyper-parameters were used for the real-data and synthetic-data fairness runs.

minor comments (2)

[Abstract] Abstract: lists no datasets, sample sizes, or exact metric definitions; readers must reach §4 to learn these details.
[Tables] Notation: 'privacy' and 'fairness' scores are used without a single consolidated table that also includes the utility (accuracy) column for every generator-algorithm pair.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the work's significance. We address each major comment below and will revise the manuscript accordingly to improve clarity and statistical rigor.


Referee: [§4, Table 2] §4 (Experimental Setup) and Table 2: the claim that 'DECAF achieves the best balance' requires an explicit scalar or Pareto criterion; the text does not state whether balance is defined by a weighted sum, dominance count, or threshold on the reported privacy and fairness scores, making it impossible to verify the ranking without re-deriving the ordering from raw numbers.

Authors: We agree that an explicit definition of the balance criterion is needed for verifiability. In the original analysis, DECAF was selected because it simultaneously minimized privacy leakage (across membership inference and attribute inference attacks) and achieved the lowest fairness violations (demographic parity and equalized odds) among the generators tested, which corresponds to Pareto dominance in the privacy-fairness plane. We will revise §4 and Table 2 to state explicitly that balance is defined by Pareto dominance (no other generator is at least as good on both metrics and strictly better on one), and we will add the raw metric values plus a short dominance table for direct verification. revision: yes
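The Pareto criterion discussed in this exchange can be written down directly. The sketch below uses the standard definition (one candidate dominates another if it is at least as good on every metric and strictly better on at least one) and assumes both scores are "lower is better", as with privacy leakage and fairness violation. The generator names and numbers are hypothetical placeholders, not values from the paper's Table 2.

```python
from typing import Dict, List, Tuple

# (privacy_leakage, fairness_violation); lower is better for both (assumption).
Scores = Tuple[float, float]


def dominates(a: Scores, b: Scores) -> bool:
    """True if a is no worse than b on every metric and strictly better on one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(scores: Dict[str, Scores]) -> List[str]:
    """Names of candidates that no other candidate dominates."""
    return sorted(
        name
        for name, s in scores.items()
        if not any(dominates(o, s) for other, o in scores.items() if other != name)
    )


# Hypothetical scores for four unnamed generators.
toy_scores = {
    "gen_A": (0.10, 0.05),
    "gen_B": (0.20, 0.04),
    "gen_C": (0.25, 0.10),
    "gen_D": (0.10, 0.08),
}

front = pareto_front(toy_scores)
print(front)  # ['gen_A', 'gen_B']
```

In this toy case the front has two members, which shows why dominance alone may not single out one "best balance": when several generators are mutually non-dominated, an extra rule such as a weighted sum or a threshold is still needed to rank them.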
Referee: [§5.2] §5.2 (Fairness Improvement on Synthetic Data): the statement that pre-processing 'improves fairness even more than when applied to real data' is load-bearing for the central recommendation, yet the section supplies no statistical test (e.g., paired t-test or Wilcoxon) or confidence intervals on the fairness deltas, and does not report whether the same random seeds and hyper-parameters were used for the real-data and synthetic-data fairness runs.

Authors: We acknowledge the need for statistical support on this central claim. The fairness pre-processing runs on real and synthetic data used identical random seeds, hyper-parameters, and train/test splits to ensure comparability. We will add Wilcoxon signed-rank tests on the per-dataset fairness deltas (with p-values) and 95% confidence intervals on the mean improvements, plus an explicit statement confirming the shared experimental controls. These additions will appear in §5.2 and the associated tables. revision: yes
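A rough sketch of the kind of analysis promised in this response follows. It assumes one paired value per dataset, delta = (fairness improvement on synthetic data) minus (fairness improvement on real data), and uses only the Python standard library: an exact two-sided Wilcoxon signed-rank test by enumeration (practical only for a handful of pairs) and a percentile bootstrap for the 95% interval on the mean. The bootstrap is a swapped-in choice, since the response does not say how the intervals would be computed, and the deltas are hypothetical numbers, not results from the paper.

```python
import random
from itertools import product
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def wilcoxon_signed_rank_exact(deltas):
    """Return (W+, exact two-sided p) by enumerating every sign pattern."""
    nonzero = [d for d in deltas if d != 0]  # zero differences are dropped
    ranks = average_ranks([abs(d) for d in nonzero])
    w_plus = sum(r for r, d in zip(ranks, nonzero) if d > 0)
    center = sum(ranks) / 2
    observed = abs(w_plus - center)
    extreme = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - center) >= observed - 1e-12:
            extreme += 1
    return w_plus, extreme / 2 ** len(ranks)


def bootstrap_mean_ci(deltas, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of the paired deltas."""
    rng = random.Random(seed)
    means = sorted(mean(rng.choices(deltas, k=len(deltas))) for _ in range(n_resamples))
    lo = means[int(n_resamples * alpha / 2)]
    hi = means[int(n_resamples * (1 - alpha / 2)) - 1]
    return lo, hi


# Hypothetical per-dataset deltas (synthetic gain minus real gain).
deltas = [0.03, 0.04, 0.02, 0.06, 0.01, 0.07]

w_plus, p_value = wilcoxon_signed_rank_exact(deltas)
ci_low, ci_high = bootstrap_mean_ci(deltas)
print(w_plus, p_value)  # 21.0 0.03125
```

With six deltas that are all positive, 0.03125 (2 of 64 sign patterns) is the smallest two-sided p-value the exact test can produce, which is a reminder that a small number of datasets caps how strong this evidence can be. In practice a library routine such as `scipy.stats.wilcoxon` would normally replace the hand-rolled enumeration.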

Circularity Check

0 steps flagged

No significant circularity; empirical study


The paper is a comparative empirical study of synthetic data generators and fairness pre-processing algorithms on learning analytics datasets. All claims rest on reported experimental outcomes for privacy, fairness, and utility metrics rather than any mathematical derivations, fitted parameters renamed as predictions, or self-citation chains. No equations, ansatzes, or uniqueness theorems appear in the abstract or framing; the structure is self-contained against external benchmarks with no load-bearing reductions to the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract; the work is a standard empirical comparison of existing generators and algorithms.



[1] [1]

Adel Abroshan et al. 2024. Improving Fairness in Machine Learning via Synthetic Data Generation. In Proceedings of the 41st International Conference on Machine Learning, Vol. 238. PMLR. https://proceedings.mlr.press/v238/abroshan24a/abroshan24a.pdf

work page 2024

[2] [2]

Mahed Abroshan, Andrew Elliott, and Mohammad Mahdi Khalili. 2024. Imposing Fairness Constraints in Synthetic Data Generation. In Proceedings of The 27th International Conference on Artificial Intelligence and Statistics (Proceedings of Machine Learning Research, Vol. 238). 2269–2277. https: //proceedings.mlr.press/v238/abroshan24a.html Navigating Privacy ...

work page 2024

[3] [3]

Adel Abusitta, Esma Aïmeur, and Omar Abdel Wahab. 2019. Generative Adversarial Networks for Mitigating Biases in Machine Learning Systems. arXiv preprint arXiv:1905.09972 (2019). https://arxiv.org/abs/1905.09972

work page internal anchor Pith review Pith/arXiv arXiv 2019

[4] [4]

Ryan S Baker and Aaron Hawn. 2021. Algorithmic bias in education. International Journal of Artificial Intelligence in Education (2021), 1–41

work page 2021

[5] [5]

Richard Berk, Hoda Heidari, Shahin Jabbari, Michael Kearns, and Aaron Roth. 2021. Fairness in criminal justice risk assessments: The state of the art. Sociological Methods & Research 50, 1 (2021), 3–44

work page 2021

[6] [6]

Karan Bhanot. 2023. Synthetic Data Generation and Evaluation for Fairness. Ph.D. dissertation. Rensselaer Polytechnic Institute. https://www. proquest.com/docview/2869461606 ProQuest Document ID: 30570311

work page arXiv 2023

[7] [7]

Aqsa Bhatti and Binil Starly. 2022. Generative Design in Additive Manufacturing: A Review. Machines 4, 2 (2022), 22. https://doi.org/10.3390/ make4020022

work page 2022

[8] [8]

Borisov, K

V. Borisov, K. Seßler, T. Leemann, M. Pawelczyk, and G. Kasneci. 2023. LANGUAGE MODELS ARE REALISTIC TABULAR DATA GENERATORS. In The Eleventh International Conference on Learning Representations. https://openreview.net/pdf?id=cEygmQNOeI

work page 2023

[9] [9]

Joy Buolamwini and Timnit Gebru. 2018. Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification. In Proceedings of Machine Learning Research (PMLR) (New York, NY, USA), Sorelle A. Friedler and Christo Wilson (Eds.), Vol. 81. PMLR, 77–91

work page 2018

[10] [10]

Victoria Cheng et al. 2021. Can You Fake It Until You Make It?: Impacts of Differentially Private Synthetic Data on Downstream Classification Fairness. In Proceedings of the 2021 ACM Conference. ACM. https://doi.org/10.1145/3442188.3445879

work page doi:10.1145/3442188.3445879 2021

[11] [11]

Paulo Cortez. 2008. Student Performance. UCI Machine Learning Repository. DOI: https://doi.org/10.24432/C5TG7T

work page doi:10.24432/c5tg7t 2008

[12] [12]

Rachel Cummings, Damien Desfontaines, David Evans, Roxana Geambasu, Yangsibo Huang, Matthew Jagielski, Peter Kairouz, Gautam Kamath, Se- woong Oh, Olga Ohrimenko, Nicolas Papernot, Ryan Rogers, Milan Shen, Shuang Song, Weijie Su, Andreas Terzis, Abhradeep Thakurta, Sergei Vassil- vitskii, Yu-Xiang Wang, Li Xiong, Sergey Yekhanin, Da Yu, Huanyu Zhang, and ...

[13]

F. K. Dankar, M. K. Ibrahim, and L. Ismail. 2022. A Multi-Dimensional Evaluation of Synthetic Data Generators. IEEE Access 10 (2022), 11147–11158. https://doi.org/10.1109/access.2022.3144765

[14]

Oscar Blessed Deho, Srecko Joksimovic, Jiuyong Li, Chen Zhan, Jixue Liu, and Lin Liu. 2022. Should learning analytics models include sensitive attributes? Explaining the why. IEEE Transactions on Learning Technologies 16, 4 (2022), 560–572

[15]

Oscar Blessed Deho, Chen Zhan, Jiuyong Li, Jixue Liu, Lin Liu, and Thuc Duy Le. 2022. How do the existing fairness metrics and unfairness mitigation algorithms contribute to ethical learning analytics? British Journal of Educational Technology (2022)

[16]

Shayan Doroudi. 2024. On the Paradigms of Learning Analytics: Machine Learning Meets Epistemology. Computers and Education: Artificial Intelligence 6 (2024), 100192. https://doi.org/10.1016/j.caeai.2023.100192

[17]

Hendrik Drachsler and Wolfgang Greller. 2016. Privacy and analytics: it’s a DELICATE issue a checklist for trusted learning analytics. In Proceedings of the Sixth International Conference on Learning Analytics & Knowledge. Association for Computing Machinery (ACM), 89–98. https://doi.org/10.1145/2883851.2883893

[18]

Cynthia Dwork. 2006. Differential privacy. In International colloquium on automata, languages, and programming. Springer, 1–12

[19]

X. Fang, W. Xu, F. A. Tan, J. Zhang, Z. Hu, Y. Qi, S. Nickleach, D. Socolinsky, S. Sengamedu, and C. Faloutsos. 2024. Large Language Models (LLMs) on Tabular Data: Prediction, Generation, and Understanding – A Survey. ArXiv (2024). https://doi.org/10.48550/arxiv.2402.17944

[20]

Michael Feldman, Sorelle A Friedler, John Moeller, Carlos Scheidegger, and Suresh Venkatasubramanian. 2015. Certifying and removing disparate impact. In proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining. 259–268

[21]

A. Figueira and B. Vaz. 2022. Survey on Synthetic Data Generation, Evaluation Methods and GANs. Mathematics 10, 15 (2022), 2733. https://doi.org/10.3390/math10152733

[22]

Ferdinando Fioretto, Cuong Tran, Pascal Van Hentenryck, and Kan Zhu. 2022. Differential Privacy and Fairness in Decisions and Learning Tasks: A Survey. arXiv preprint arXiv:2202.08187 (2022). https://arxiv.org/pdf/2202.08187

[23]

Josh Gardner, Christopher Brooks, and Ryan Baker. 2019. Evaluating the fairness of predictive student models through slicing analysis. In Proceedings of the 9th international conference on learning analytics & knowledge. 225–234

[24]

Moritz Hardt, Eric Price, and Nati Srebro. 2016. Equality of opportunity in supervised learning. Advances in neural information processing systems 29 (2016)

[25]

Arto Hellas, Petri Ihantola, Andrew Petersen, Vangel V Ajanovski, Mirela Gutica, Timo Hynninen, Antti Knutas, Juho Leinonen, Chris Messom, and Soohyun Nam Liao. 2018. Predicting academic performance: a systematic literature review. In Proceedings companion of the 23rd annual ACM conference on innovation and technology in computer science education. 175–199

[26]

M. Hernadez, G. Epelde, A. Alberdi, R. Cilla, and D. Rankin. 2023. Synthetic Tabular Data Evaluation in the Health Domain Covering Resemblance, Utility, and Privacy Dimensions. Methods of Information in Medicine 62, Suppl 1 (2023), e19–e38. https://doi.org/10.1055/s-0042-1760247

[27]

Lan Jiang, Clara Belitz, and Nigel Bosch. 2024. Synthetic Dataset Generation for Fairer Unfairness Research. In Proceedings of the 14th Learning Analytics and Knowledge Conference (LAK’24). https://doi.org/10.1145/3636555.3636868

[28]

Jamie Jordon, Łukasz Szpruch, François Houssiau, Marco Bottarelli, Giovanni Cherubin, Carsten Maple, Samuel Cohen, and Adrian Weller. 2022. Synthetic Data - what, why and how? (2022). https://royalsociety.org/-/media/policy/projects/privacy-enhancing-technologies/Synthetic_Data_Survey-24.pdf

[29]

James Jordon, Jinsung Yoon, and Mihaela van der Schaar. 2019. PATE-GAN: Generating Synthetic Data with Differential Privacy Guarantees. In International Conference on Learning Representations (ICLR). https://openreview.net/forum?id=S1zk9iRqF7

[30]

Faisal Kamiran and Toon Calders. 2012. Data preprocessing techniques for classification without discrimination. Knowledge and information systems 33, 1 (2012), 1–33

[31]

Mohammad Khalil, Paul Prinsloo, and Sharon Slade. 2023. Fairness, Trust, Transparency, Equity, and Responsibility in Learning Analytics. Journal of Learning Analytics 10, 1 (2023). https://doi.org/10.18608/jla.2023.7983

[32]

M. Khalil, F. Vadiee, R. Shakya, and Q. Liu. 2025. Creating Artificial Students that Never Existed: Leveraging Large Language Models and CTGANs for Synthetic Data Generation. In Proceedings of the 15th Learning Analytics and Knowledge Conference (LAK’25)

[33]

Minjun Kim et al. 2023. Privacy Risks of Machine Learning Models with Unintended Memorization. arXiv preprint arXiv:2302.12580 (2023). https://arxiv.org/abs/2302.12580

[34]

René F Kizilcec and Hansol Lee. 2022. Algorithmic fairness in education. In The Ethics of Artificial Intelligence in Education. Routledge, 174–202

[35]

Matt J Kusner, Joshua Loftus, Chris Russell, and Ricardo Silva. 2017. Counterfactual fairness. Advances in neural information processing systems 30 (2017)

[36]

Jakub Kuzilek, Martin Hlosta, and Zdenek Zdrahal. 2017. Open university learning analytics dataset. Scientific data 4, 1 (2017), 1–8

[37]

Qinyi Liu and Mohammad Khalil. 2023. Understanding privacy and data protection issues in learning analytics: A systematic review. British Journal of Educational Technology 54, 5 (2023). https://doi.org/10.1111/bjet.13388

[38]

Qinyi Liu and Mohammad Khalil. 2024. Exploring the Generation of Synthetic Educational Tabular Data using LLMs. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD’24), AI for Education (AI4EDU): Advancing Personalized Education with LLM and Adaptive Learning Workshop. Barcelona. https://www.researchgate.net/public...

[39]

Q. Liu, M. Khalil, J. Jovanovic, and R. Shakya. 2024. Scaling While Privacy Preserving: A Comprehensive Synthetic Tabular Data Generation and Evaluation in Learning Analytics. In Proceedings of the 14th Learning Analytics and Knowledge Conference. https://doi.org/10.1145/3636555.3636921

[40]

WeiKang Liu, Yanchun Zhang, Hong Yang, and Qinxue Meng. 2024. A survey on differential privacy for medical data analysis. Annals of Data Science 11, 2 (2024), 733–747

[41]

Yingzhou Lu, Minjie Shen, Huazheng Wang, Xiao Wang, Capucine van Rechem, Tianfan Fu, and Wenqi Wei. 2023. Machine Learning for Synthetic Data Generation: A Review. arXiv preprint arXiv:2302.04062 (2023). https://arxiv.org/pdf/2302.04062

[42]

Amalia Luque, Alejandro Carrasco, Alejandro Martín, and Ana de las Heras. 2019. The impact of class imbalance in classification performance metrics based on the binary confusion matrix. Pattern Recognition 91 (2019), 216–231

[43]

Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. 2021. A survey on bias and fairness in machine learning. ACM computing surveys (CSUR) 54, 6 (2021), 1–35

[44]

Emmanouil Panagiotou, Arjun Roy, and Eirini Ntoutsi. 2024. Synthetic Tabular Data Generation for Class Imbalance and Fairness: A Comparative Study. arXiv preprint arXiv:2409.05215v1 (2024). https://arxiv.org/pdf/2409.05215v1

[45]

Mayana Pereira, Meghana Kshirsagar, Sumit Mukherjee, Rahul Dodhia, Juan Lavista Ferres, and Rafael de Sousa. 2024. Assessment of differentially private synthetic data for utility and fairness in end-to-end machine learning pipelines for tabular data. PLOS ONE 19, 9 (2024). https://doi.org/10.1371/journal.pone.0297271

[46]

David Pujol, Amir Gilad, and Ashwin Machanavajjhala. 2024. PreFair: Privately Generating Justifiably Fair Synthetic Data. In Proceedings of the VLDB Endowment (PVLDB), Vol. 16. https://www.vldb.org/pvldb/vol16/p1573-pujol.pdf

[47]

Zhaozhi Qian, Bogdan-Constantin Cebere, and Mihaela van der Schaar. 2023. Synthcity: facilitating innovative use cases of synthetic data in different data modalities. arXiv preprint arXiv:2301.07573 (2023)

[48]

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. In NeurIPS EMC^2 Workshop

[49]

Filippo Sciarrone. 2018. Machine Learning and Learning Analytics: Integrating Data with Learning. IEEE (2018). https://doi.org/10.1109/EDUCON.2018.8424780

[50]

Lele Sha, Mladen Rakovic, Alexander Whitelock-Wainwright, David Carroll, Victoria M Yew, Dragan Gasevic, and Guanliang Chen. 2021. Assessing algorithmic fairness in automatic classifiers of educational forum posts. In Artificial Intelligence in Education: 22nd International Conference, AIED 2021, Utrecht, The Netherlands, June 14–18, 2021, Proceedings, Pa...

[51]

Boris van Breugel, Trent Kyono, Jeroen Berrevoets, and Mihaela van der Schaar. 2021. DECAF: Generating Fair Synthetic Data Using Causally-Aware Generative Networks. arXiv preprint arXiv:2110.12884 (2021). https://arxiv.org/abs/2110.12884

[52]

Sahil Verma and Julia Rubin. 2018. Fairness definitions explained. In Proceedings of the international workshop on software fairness. 1–7

[53]

Hilde Weerts, Miroslav Dudík, Richard Edgar, Adrin Jalali, Roman Lutz, and Michael Madaio. 2023. Fairlearn: Assessing and Improving Fairness of AI Systems. Journal of Machine Learning Research 24 (2023), 8 pages. http://jmlr.org/papers/v24/23-0389.html

[54]

Linda F Wightman. 1998. LSAC National Longitudinal Bar Passage Study. LSAC Research Report Series. (1998)

[55]

Lei Xu, Marianna Skoularidou, Alfredo Cuesta-Infante, and Kalyan Veeramachaneni. 2019. Modeling Tabular Data using Conditional GAN. arXiv preprint arXiv:1907.00503 (2019). https://arxiv.org/abs/1907.00503

[56]

Erez Yacobson, Orly Fuhrman, Arnon Hershkovitz, and Giora Alexandron. 2021. De-identification is Insufficient to Protect Student Privacy, or – What Can a Field Trip Reveal? Journal of Learning Analytics 8, 2 (2021), 83–92. https://doi.org/10.18608/jla.2021.7353

[57]

Jinsung Yoon, Lydia N Drumright, and Mihaela van der Schaar. 2020. Anonymization through data synthesis using generative adversarial networks (ADS-GAN). IEEE Journal of Biomedical and Health Informatics 24, 8 (2020), 2378–2388. https://doi.org/10.1109/JBHI.2020.2980262

[58]

Renzhe Yu, Qiujie Li, Christian Fischer, Shayan Doroudi, and Di Xu. 2020. Towards accurate and fair prediction of college success: evaluating different sources of student data. In Proceedings of the 13th International Conference on Educational Data Mining (EDM 2020). ERIC, 292–301

[59]

Chen Zhan, Srećko Joksimović, Djazia Ladjal, Thierry Rakotoarivelo, Ruth Marshall, and Abelardo Pardo. 2024. Preserving Both Privacy and Utility in Learning Analytics. IEEE Transactions on Learning Technologies 17 (2024), 1655–1667. https://doi.org/10.1109/TLT.2024.3393766

[60]

Z. Zhao, A. Kunar, R. Birke, and L. Chen. 2021. CTAB-GAN: Effective Table Data Synthesizing. In Proceedings of Machine Learning Research, Vol. 157. 2021–2021. https://proceedings.mlr.press/v157/zhao21a/zhao21a.pdf
