Creating Artificial Students that Never Existed: Leveraging Large Language Models and CTGANs for Synthetic Data Generation

Mohammad Khalil; Qinyi Liu; Ronas Shakya; Sam Urmian

arxiv: 2501.01793 · v1 · pith:HCSTDOTLnew · submitted 2025-01-03 · 💻 cs.LG · cs.AI


Pith reviewed 2026-05-23 06:01 UTC · model grok-4.3


keywords: synthetic data, CTGAN, large language models, learning analytics, student data, GAN, privacy


The pith

CTGAN and LLMs generate synthetic student data that resembles real data for learning analytics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether CTGAN and large language models can create synthetic tabular records about students. Privacy rules restrict access to real student data, so the authors treat synthetic generation as a direct substitute. They train CTGAN along with GPT2, DistilGPT2, and DialoGPT on student datasets, then measure how closely the outputs match the originals in statistical distributions and in how well downstream models perform on them. The results indicate the generated records preserve enough structure to support learning analytics tasks.
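One way to make "how closely the outputs match the originals in statistical distributions" concrete is the Jensen-Shannon divergence between a real and a synthetic column. The sketch below is illustrative only: the column values are invented toy data, the helper names are hypothetical, and nothing here claims to be the exact metric implementation used in the paper.

```python
import math
from collections import Counter

def category_distribution(values, categories):
    """Relative frequency of each category, in a fixed category order."""
    counts = Counter(values)
    total = len(values)
    return [counts.get(c, 0) / total for c in categories]

def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Terms with a_i == 0 contribute nothing; b_i > 0 whenever a_i > 0.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical toy column ("final result"), not data from the paper.
real = ["pass", "pass", "fail", "distinction", "pass", "fail"]
synthetic = ["pass", "fail", "fail", "fail", "pass", "distinction"]
categories = sorted(set(real) | set(synthetic))

p = category_distribution(real, categories)
q = category_distribution(synthetic, categories)
jsd = jensen_shannon_divergence(p, q)
print(f"JSD(real, synthetic) = {jsd:.4f}")
```

With base-2 logarithms the value lies between 0 (identical distributions) and 1 (disjoint support), so lower is better for a synthetic column. A per-column score like this says nothing about cross-column dependencies, which is one reason resemblance metrics alone cannot carry the utility claim.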

Core claim

CTGAN and the three LLMs produce synthetic student datasets whose statistical properties and predictive utility closely match those of real student data, allowing the synthetic records to serve learning analytics models while avoiding direct use of protected information.

What carries the argument

Generative models (CTGAN for tabular synthesis and LLMs for feature generation) trained to replicate the joint distribution of real student records.
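For the LLM side, a language model such as GPT-2 can only emit table rows if each record is first turned into text. This page does not state which encoding the authors used, so the "column is value" sentence format below is an assumption, and the column names and helper functions are hypothetical. The sketch shows the round trip: serialize a row for fine-tuning, then parse generated text back into a record and reject malformed output.

```python
def row_to_text(row):
    """Serialize one record as comma-separated 'column is value' clauses."""
    return ", ".join(f"{col} is {val}" for col, val in row.items())

def text_to_row(text, columns):
    """Parse generated text back into a record; return None if it is malformed."""
    row = {}
    for clause in text.split(", "):
        col, sep, val = clause.partition(" is ")
        if not sep or col not in columns:
            return None  # clause lacks ' is ' or names an unknown column
        row[col] = val
    # Accept only records that contain every expected column.
    return row if set(row) == set(columns) else None

# Hypothetical student record; values come back as strings after parsing.
columns = ["age", "studytime", "passed"]
record = {"age": 17, "studytime": 2, "passed": "yes"}

text = row_to_text(record)
print(text)                        # age is 17, studytime is 2, passed is yes
print(text_to_row(text, columns))
print(text_to_row("age is 17, nonsense", columns))  # None
```

This toy parser breaks if a value itself contains `", "` or `" is "`; a real pipeline would need escaping and type restoration. The rejection step matters because a generative language model is free to produce text that is not a valid row.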

If this is right

Synthetic records can stand in for real student data when training or evaluating learning analytics models.
CTGAN and the tested LLMs can be ranked against one another by the same utility metrics.
Learning analytics work gains an additional data-generation route that sidesteps data-protection barriers.
Methodological experiments in the field can draw on larger or more varied synthetic collections than real data alone would allow.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same generators could be applied to other tabular domains that face comparable privacy limits.
Synthetic data sets produced this way could serve as controlled test beds for checking whether analytics models remain stable under distribution shifts.
Integration of these generators into existing analytics pipelines would let researchers measure end-to-end effects on model outputs.

Load-bearing premise

Utility metrics on synthetic data serve as reliable stand-ins for performance when the same data is used inside actual learning analytics applications.

What would settle it

Train the same learning analytics model once on real student data and once on the synthetic data, then observe whether accuracy or other task metrics drop sharply on an independent real test set.
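A minimal sketch of that check, under loudly stated assumptions: the records are toy Gaussian data from a hypothetical `make_toy_students` helper, the "synthetic" set is simply a noisier draw standing in for generator output, and the classifier is a plain nearest-centroid rule rather than any model from the paper. Only the shape of the comparison (train on real vs. train on synthetic, both tested on held-out real data) is the point.

```python
import random

def fit_centroids(rows, labels):
    """Nearest-centroid classifier: store the mean feature vector per class."""
    sums, counts = {}, {}
    for x, y in zip(rows, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

def predict(centroids, x):
    """Return the class whose centroid is closest in squared Euclidean distance."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda y: sq_dist(centroids[y]))

def accuracy(centroids, rows, labels):
    hits = sum(predict(centroids, x) == y for x, y in zip(rows, labels))
    return hits / len(rows)

def make_toy_students(n, rng, noise):
    """Hypothetical stand-in data: (hours_studied, quiz_score) -> passed (0/1)."""
    rows, labels = [], []
    for _ in range(n):
        passed = rng.random() < 0.5
        centre = (8.0, 70.0) if passed else (3.0, 45.0)
        rows.append([rng.gauss(centre[0], noise), rng.gauss(centre[1], 5 * noise)])
        labels.append(int(passed))
    return rows, labels

rng = random.Random(0)
real_train = make_toy_students(300, rng, noise=1.5)
real_test = make_toy_students(200, rng, noise=1.5)   # independent real test set
# Stand-in for generator output; a real study would sample from CTGAN or an LLM.
synthetic_train = make_toy_students(300, rng, noise=2.0)

acc_trtr = accuracy(fit_centroids(*real_train), *real_test)
acc_tstr = accuracy(fit_centroids(*synthetic_train), *real_test)
print(f"train real / test real:      {acc_trtr:.3f}")
print(f"train synthetic / test real: {acc_tstr:.3f}")
print(f"utility gap:                 {acc_trtr - acc_tstr:+.3f}")
```

A small gap supports the substitution claim for that task; a large drop would settle the question the other way. The test set must never pass through the generator, otherwise the comparison leaks information.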

Figures

Figures reproduced from arXiv: 2501.01793 by Mohammad Khalil, Qinyi Liu, Ronas Shakya, Sam Urmian.

**Figure 2.** t-SNE plots showing the similarity between synthetic data generated by each generative model and the real data for Dataset B2. Best viewed in colour.

**Figure 6.** The mean value for the classification machine learning evaluation metrics of Accuracy (left), AUCROC (middle), and F1-score (right) for all the LA datasets and the generative models. Black bar stands for the standard error of mean. Best viewed in colour.
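The error bars in Figure 6 are standard errors of the mean. As a reading aid, the sketch below computes a mean and its standard error from repeated metric values. The accuracy numbers are hypothetical, not results from the paper, and the paper's own aggregation (for example, what counts as one repetition) is not specified on this page.

```python
import math
import statistics

def mean_and_sem(values):
    """Mean and standard error of the mean (sample std dev / sqrt(n))."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values to estimate the SEM")
    return statistics.mean(values), statistics.stdev(values) / math.sqrt(n)

# Hypothetical accuracies from repeated runs of one generator on one dataset.
accuracies = [0.81, 0.79, 0.84, 0.80, 0.82]
mean, sem = mean_and_sem(accuracies)
print(f"accuracy = {mean:.3f} ± {sem:.3f} (SEM, n={len(accuracies)})")
```

Because the SEM shrinks with the square root of the number of runs, bars from different numbers of repetitions are not directly comparable.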

Original abstract

In this study, we explore the growing potential of AI and deep learning technologies, particularly Generative Adversarial Networks (GANs) and Large Language Models (LLMs), for generating synthetic tabular data. Access to quality students data is critical for advancing learning analytics, but privacy concerns and stricter data protection regulations worldwide limit their availability and usage. Synthetic data offers a promising alternative. We investigate whether synthetic data can be leveraged to create artificial students for serving learning analytics models. Using the popular GAN model CTGAN and three LLMs- GPT2, DistilGPT2, and DialoGPT, we generate synthetic tabular student data. Our results demonstrate the strong potential of these methods to produce high-quality synthetic datasets that resemble real students data. To validate our findings, we apply a comprehensive set of utility evaluation metrics to assess the statistical and predictive performance of the synthetic data and compare the different generator models used, specially the performance of LLMs. Our study aims to provide the learning analytics community with valuable insights into the use of synthetic data, laying the groundwork for expanding the field methodological toolbox with new innovative approaches for learning analytics data generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Applies CTGAN plus a few small GPT models to synthetic student records but reports no concrete metrics or downstream task checks.


The main thing here is a straightforward application of CTGAN and three small GPT variants (GPT2, DistilGPT2, DialoGPT) to tabular student data. The goal is privacy-preserving data for learning analytics. Nothing new in the methods themselves; the contribution is domain-specific use plus a comparison of those generators on student records. That is useful for people who need more data in this area and cannot share real records. The paper does lay out the privacy problem clearly and shows the generators can produce tables that pass basic statistical resemblance checks. Credit for running the comparison at all and for targeting a practical constraint in education research.

The soft spot is the evaluation. The abstract claims strong performance on a comprehensive set of utility metrics and says the synthetic data can serve learning analytics models, yet no numbers, no baselines, no significance tests, and no failure cases appear in the provided text. The stress-test concern lands: resemblance on generic metrics does not automatically mean models trained on the synthetic data will match real-data performance on dropout prediction or grade forecasting when tested on held-out real records. Without that end-to-end check or at least the raw metric values, the central claim stays untested.

The work is for learning-analytics practitioners who want a ready recipe for synthetic student tables. It is honest in its scope and does not overclaim theory. A serious editor should send it to review rather than desk-reject; the topic is timely and the setup is reproducible enough that referees can ask for the missing numbers and the downstream validation in one round.

Referee Report

2 major / 1 minor

Summary. The manuscript explores the application of CTGAN and LLMs (GPT-2, DistilGPT-2, DialoGPT) for generating synthetic tabular student data to address privacy concerns in learning analytics. It claims that a comprehensive set of utility evaluation metrics demonstrates the strong potential of these methods to produce high-quality synthetic datasets resembling real student data, which can then be used to train learning analytics models.

Significance. If the empirical results are substantiated with concrete metrics and validated against downstream learning analytics tasks, this could offer valuable methodological contributions to the field by expanding the toolbox for synthetic data generation in education. The paper does not indicate provision of code or data for reproducibility.

major comments (2)

[Abstract] The assertion that 'our results demonstrate the strong potential' based on 'utility evaluation metrics' is not accompanied by any specific numerical values, chosen metrics, baseline comparisons, or statistical significance tests, undermining the ability to evaluate the central claim.
[Evaluation section] The paper applies a set of utility metrics (statistical and predictive) but provides no evidence or discussion that these metrics correlate with or proxy actual performance on downstream learning analytics tasks (e.g., dropout prediction or grade forecasting) when models are trained on synthetic data and evaluated on real held-out data.

minor comments (1)

The abstract could benefit from briefly naming the specific utility metrics applied to improve clarity on the evaluation approach.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each major comment below and outline the revisions we will make.


Referee: [Abstract] The assertion that 'our results demonstrate the strong potential' based on 'utility evaluation metrics' is not accompanied by any specific numerical values, chosen metrics, baseline comparisons, or statistical significance tests, undermining the ability to evaluate the central claim.

Authors: We agree that the abstract would be strengthened by including concrete numerical results. The evaluation section contains the full set of metrics (including statistical similarity scores such as correlation and distribution distances, predictive utility accuracies, and model comparisons), but these were not summarized numerically in the abstract. In the revised manuscript we will update the abstract to report key quantitative findings, chosen metrics, and baseline comparisons from the results.
Revision: yes

Referee: [Evaluation section] The paper applies a set of utility metrics (statistical and predictive) but provides no evidence or discussion that these metrics correlate with or proxy actual performance on downstream learning analytics tasks (e.g., dropout prediction or grade forecasting) when models are trained on synthetic data and evaluated on real held-out data.

Authors: Our predictive utility evaluation already trains models on synthetic data and evaluates them on real held-out data, which directly measures performance on downstream-style tasks. However, we did not explicitly discuss the correlation of these metrics with specific learning analytics applications such as dropout prediction or grade forecasting, nor did we include those exact tasks. We will add a dedicated discussion paragraph in the evaluation section explaining how the chosen predictive utility metrics serve as proxies and will consider adding one illustrative downstream task experiment if space allows.
Revision: partial

Circularity Check

0 steps flagged

Empirical generator comparison with no derivation chain

full rationale

The paper is a standard empirical comparison of CTGAN and three LLMs for generating synthetic student tabular data, followed by application of statistical and predictive utility metrics. No equations, fitted parameters, or derivation steps are present that reduce any reported result to its inputs by construction. The work contains no self-citations used as load-bearing uniqueness theorems, no ansatzes smuggled via citation, and no renaming of known results as new derivations. The evaluation metrics are applied after generation and do not create a self-definitional loop. This is self-contained empirical research with no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review yields no explicit free parameters, invented entities, or paper-specific axioms; the work implicitly relies on the standard assumption that GANs and LLMs can approximate the joint distribution of real tabular records.

axioms (1)

Domain assumption: Generative models trained on real student records can produce synthetic records whose statistical properties support downstream predictive models.
Central premise required for the utility claim; stated in the abstract's description of validation.



Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 3 internal anchors

[1]

Mahed Abroshan, Andrew Elliott, and Mahdi Khalili. 2024. Imposing fairness constraints in synthetic data generation. Proceedings of The 27th International Conference on Artificial Intelligence and Statistics 238 (2024), 2269–2277. https://proceedings.mlr.press/v238/abroshan24a.html

work page 2024
[2]

Ahmed Alaa, Boris Van Breugel, Evgeny Saveliev, and Mihaela Van Der Schaar. 2022. How Faithful is your Synthetic Data? Sample-level Metrics for Evaluating and Auditing Generative Models. Proceedings of the 39 th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022 (2022). https://proceedings.mlr.press/v162/alaa22a/alaa22a.pdf

work page 2022
[3]

Chris Alexiuk, Shashank Verma, and Vivienne Zhang. 2024. Leverage the Latest Open Models for Synthetic Data Generation with NVIDIA Nemotron-4 340B. https://developer.nvidia.com/blog/leverage-our-latest-open-models-for-synthetic-data-generation-with-nvidia-nemotron-4-340b/

work page 2024
[4]

Francis Anscombe. 1973. Graphs in Statistical Analysis. The American Statistician 27 (1973), 17–21. https://doi.org/10.2307/2682899

work page doi:10.2307/2682899 1973
[5]

Berg, Stefan T

Alan M. Berg, Stefan T. Mol, Gábor Kismihók, and Niall Sclater. 2016. The role of a reference synthetic data generator within the field of learning analytics. Journal of Learning Analytics 3 (2016), 107–128. https://doi.org/10.18608/jla.2016.31.7

work page doi:10.18608/jla.2016.31.7 2016
[6]

Anabel Bethencourt-Aguilar, Dagoberto Castellanos-Nieves, Juan José Sosa-Alonso, and Manuel Area-Moreira. 2023. Use of Generative Adversarial Networks (GANs) in Educational Technology Research. Journal of New Approaches in Educational Research 12 (2023), 153–153. https://doi.org/10. 7821/naer.2023.1.1231

work page 2023
[7]

Karan Bhanot, Miao Qi, John S Erickson, Isabelle Guyon, and Kristin P Bennett. 2021. The Problem of Fairness in Synthetic Healthcare Data. Entropy 23 (2021). https://doi.org/10.3390/e23091165

work page doi:10.3390/e23091165 2021
[8]

Vadim Borisov, Kathrin Sessler, Tobias Leemann, Martin Pawelczyk, and Gjergji Kasneci. 2023. Language Models are Realistic Tabular Data Generators. ArXiv.org (2023). https://arxiv.org/pdf/2210.06280

work page arXiv 2023
[9]

Boris van Breugel and Mihaela van der Schaar. 2022. Why Tabular Foundation Models Should Be a Research Priority. Arxiv.org (2022). https: //arxiv.org/html/2405.01147v1#S7 18 Mohammad Khalil, Farhad Vadiee, Ronas Shakya, and Qinyi Liu Manuscript submitted to ACM

work page arXiv 2022
[10]

Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, John A Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuan-Fang Li, Scott M Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, and Yi Zhang. 2023. Sparks of artificial general intelligence: Early experiments with GPT-4. ArXiv abs/2303.12712 (2023). https://api.semanticscholar.org/Corpu...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[11]

Paulo Cortez. 2014. UCI Machine Learning Repository. https://archive.ics.uci.edu/dataset/320/student+performance

work page 2014
[12]

Tukur Dahiru. 2008. P - value, a true test of statistical significance? A cautionary note. Annals of Ibadan postgraduate medicine 6 (2008), 21–6. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4111019/

work page 2008
[13]

Shane Dawson, Srecko Joksimovic, Oleksandra Poquet, and George Siemens. 2019. Increasing the impact of learning analytics. LAK19: Proceedings of the 9th International Conference on Learning Analytics & Knowledge (2019), 446–455. https://doi.org/10.1145/3303772.3303784

work page doi:10.1145/3303772.3303784 2019
[14]

Erica Espinosa and Alvaro Figueira. 2023. On the Quality of Synthetic Generated Tabular Data. Mathematics 11 (2023). https://doi.org/10.3390/ math11153278

work page 2023
[15]

Wenzheng Feng, Jie Tang, and Tracy Xiao Liu. 2019. Understanding dropouts in MOOCs. AAAI’19/IAAI’19/EAAI’19: Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence (2019). https://do...

work page doi:10.1609/aaai.v33i01.3301517 2019
[16]

Rebecca Ferguson and Doug Clow. 2017. Where is the evidence? a call to action for learning analytics. LAK ’17: Proceedings of the Seventh International Learning Analytics & Knowledge Conference (2017), 56–65. https://doi.org/10.1145/3027385.3027396

work page doi:10.1145/3027385.3027396 2017
[17]

Alvaro Figueira and Bruno Vaz. 2022. Survey on Synthetic Data Generation, Evaluation Methods and GANs. Mathematics 10 (2022), 2733. https://doi.org/10.3390/math10152733

work page doi:10.3390/math10152733 2022
[18]

Brendan Flanagan, Rwitajit Majumdar, and Hiroaki Ogata. 2022. Fine Grain Synthetic Educational Data: Challenges and Limitations of Collaborative Learning Analytics. IEEE Access 10 (2022), 26230–26241. https://doi.org/10.1109/access.2022.3156073

work page doi:10.1109/access.2022.3156073 2022
[19]

Mikel Hernadez, Gorka Epelde, Ane Alberdi, Rodrigo Cilla, and Debbie Rankin. 2023. Synthetic Tabular Data Evaluation in the Health Domain Covering Resemblance, Utility, and Privacy Dimensions. Methods of Information in Medicine 62 (2023), e19–e38. https://doi.org/10.1055/s-0042- 1760247

work page doi:10.1055/s-0042- 2023
[20]

Geoffrey Hinton and Sam Roweis. 2002. Stochastic Neighbor Embedding. https://cs.nyu.edu/~roweis/papers/sne_final.pdf

work page 2002
[21]

Markus Hittmeir, Andreas Ekelhart, and Rudolf Mayer. 2019. On the utility of synthetic data: An empirical evaluation on machine learning tasks. ARES ’19: Proceedings of the 14th International Conference on Availability, Reliability and Security (2019). https://doi.org/10.1145/3339252.3339281

work page doi:10.1145/3339252.3339281 2019
[22]

Lan Jiang, Clara Belitz, and Nigel Bosch. 2024. Synthetic dataset generation for fairer unfairness research. LAK ’24: Proceedings of the 14th Learning Analytics and Knowledge Conference (2024), 200–209. https://doi.org/10.1145/3636555.3636868

work page doi:10.1145/3636555.3636868 2024
[23]

James Jordon, Lukasz Szpruch, Florimond Houssiau, Mirko Bottarelli, Giovanni Cherubin, Carsten Maple, Samuel N Cohen, and Adrian Weller. 2022. Synthetic Data – what, why and how? The Royal Society (2022). https://doi.org/10.48550/arxiv.2205.03257

work page doi:10.48550/arxiv.2205.03257 2022
[24]

Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. 2020. Analyzing and Improving the Image Quality of StyleGAN. , 8110 –8119 pages. https://openaccess.thecvf.com/content_CVPR_2020/html/Karras_Analyzing_and_Improving_the_Image_Quality_of_ StyleGAN_CVPR_2020_paper.html

work page 2020
[25]

Mohammad Khalil. 2018. Learning Analytics in Massive Open Online Courses. ArXiv.org (2018). https://doi.org/10.48550/arXiv.1802.09344

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1802.09344 2018
[26]

Harsh Kumar, Ilya Musabirov, Joseph Jay Williams, and Michael Liut. 2023. QuickTA: Exploring the Design Space of Using Large Language Models to Provide Support to Students. Learning Analytics and Knowledge Conference (LAK’23) (2023). https://tspace.library.utoronto.ca/bitstream/1807/ 127196/1/2023_Kumar_QuickTA_exploring_design_space.pdf

work page 2023
[27]

Jakub Kuzilek, Martin Hlosta, and Zdenek Zdrahal. 2017. Open University Learning Analytics dataset. Scientific Data 4 (2017), 170171. https: //doi.org/10.1038/sdata.2017.171

work page doi:10.1038/sdata.2017.171 2017
[28]

Jiayin Lin, Geng Sun, Jun Shen, Tingru Cui, Ping Yu, Dongming Xu, Li Li, and Ghassan Beydoun. 2019. Towards the readiness of learning analytics data for micro learning. Services Computing – SCC 2019: 16th International Conference, Held as Part of the Services Conference Federation, SCF 2019, San Diego, CA, USA, June 25–30, 2019, Proceedings (2019), 66–76....

work page doi:10.1007/978-3-030-23554-3_5 2019
[29]

Qinyi Liu, Oscar Deho, Farhad Vadiee, Mohammad Khalil, Srecko Joksimovic, and George Siemens. 2025. Can Synthetic Data Be Fair and Private? A Comparative Study of Synthetic Data Generation and Fairness Algorithms. In Proceedings of the 15th International Learning Analytics and Knowledge Conference (LAK’25). ACM. (2025). https://doi.org/10.1145/3706468.3706546

work page doi:10.1145/3706468.3706546 2025
[30]

Qinyi Liu and Mohammad Khalil. 2023. Understanding privacy and data protection issues in learning analytics using a systematic review. British Journal of Educational Technology 54 (2023). https://doi.org/10.1111/bjet.13388

work page doi:10.1111/bjet.13388 2023
[31]

Qinyi Liu, Mohammad Khalil, Jelena Jovanovic, and Ronas Shakya. 2024. Scaling while privacy preserving: A comprehensive synthetic tabular data generation and evaluation in learning analytics. LAK ’24: Proceedings of the 14th Learning Analytics and Knowledge Conference (2024), 620–631. https://doi.org/10.1145/3636555.3636921

work page doi:10.1145/3636555.3636921 2024
[32]

Ruibo Liu, Jerry Wei, Fangyu Liu, Google Deepmind, Chenglei Si, Yanzhe Zhang, Jinmeng Rao, Steven Zheng, Daiyi Peng, Diyi Yang, Denny Zhou, and Andrew Dai. 2024. Best Practices and Lessons Learned on Synthetic Data. Arxiv.org (2024). https://arxiv.org/pdf/2404.07503

work page arXiv 2024
[33]

Tennison Liu, Zhaozhi Qian, Jeroen Berrevoets, and van. 2023. GOGGLE: Generative Modelling for Tabular Data by Learning Relational Structure. https://openreview.net/forum?id=fPVRcJqspu

work page 2023
[34]

Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing Data using t-SNE Laurens van der Maaten. Journal of Machine Learning Research 9 (2008), 2579–2605. https://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf Creating Artificial Students that Never Existed: Using LLMs and CTGAN 19 Manuscript submitted to ACM

work page 2008
[35]

Melle Mendikowski and Mattis Hartwig. 2022. Creating customers that never existed: Synthesis of e-commerce data using CTGAN. In 18th International Conference on Machine Learning and Data Mining (MLDM-22). New York, US: IBAI Publishing. 91–105

work page 2022
[36]

Menéndez, J.A

M.L. Menéndez, J.A. Pardo, L. Pardo, and M.C. Pardo. 1997. The Jensen-Shannon divergence. Journal of the Franklin Institute 334 (1997), 307–318. https://doi.org/10.1016/s0016-0032(96)00063-4

work page doi:10.1016/s0016-0032(96)00063-4 1997
[37]

Marko Miletic and Murat Sariyar. 2024. Assessing the Potentials of LLMs and GANs as StateoftheArt Tabular Synthetic Data Generation Methods. Privacy in Statistical Databases (2024), 374–389. https://doi.org/10.3390/app14145975

work page doi:10.3390/app14145975 2024
[38]

Luis Moles, Alain Andres, Goretti Echegaray, and Fernando Boto. 2024. Exploring Data Augmentation and Active Learning Benefits in Imbalanced Datasets. Mathematics 12 (2024). https://doi.org/10.3390/math12121898

work page doi:10.3390/math12121898 2024
[39]

Abdallah Moubayed, MohammadNoor Injadat, Abdallah Shami, Ali Bou Nassif, and Hanan Lutfiyya. 2020. Student Performance and Engagement Prediction in eLearning datasets. IEEE dataport (2020). https://dx.doi.org/10.21227/4xkr-0f88

work page doi:10.21227/4xkr-0f88 2020
[40]

Panaretos and Yoav Zemel

Victor M. Panaretos and Yoav Zemel. 2019. Statistical Aspects of Wasserstein Distances. Annual Review of Statistics and Its Application 6 (2019), 405–431. https://doi.org/10.1146/annurev-statistics-030718-104938

work page doi:10.1146/annurev-statistics-030718-104938 2019
[41]

Stanislav Pozdniakov, Jonathan Brazil, Solmaz Abdi, Aneesha Bakharia, Shazia Sadiq, Dragan Gašević, Paul Denny, and Hassan Khosravi. 2024. Large language models meet user interfaces: The case of provisioning feedback. Computers and Education: Artificial Intelligence 7 (2024), 100289. https://doi.org/10.1016/j.caeai.2024.100289

work page doi:10.1016/j.caeai.2024.100289 2024
[42]

Paul Prinsloo, Mohammad Khalil, and Sharon Slade. 2023. Learning analytics as data ecology: a tentative proposal. Journal of Computing in Higher Education 36 (2023). https://doi.org/10.1007/s12528-023-09355-4

work page doi:10.1007/s12528-023-09355-4 2023
[43]

Paul Prinsloo, Sharon Slade, and Mohammad Khalil. 2019. Student data privacy in MOOCs: A sentiment analysis. Distance Education 40, 3 (2019), 395–413

work page 2019
[44]

Zhaozhi Qian, Bogdan-Constantin Cebere, and Mihaela van der Schaar. 2023. Synthcity: facilitating innovative use cases of synthetic data in different data modalities. arXiv:2301.07573 [cs] (2023). https://arxiv.org/abs/2301.07573

work page arXiv 2023
[45]

Zhaozhi Qian, Rob Davis, and Mihaela Van Der Schaar. 2024. Synthcity: a benchmark framework for diverse use cases of tabular synthetic data. https://proceedings.neurips.cc/paper_files/paper/2023/file/09723c9f291f6056fd1885081859c186-Paper-Datasets_and_Benchmarks.pdf

work page 2024
[46]

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language Models are Unsupervised Multitask Learners. (2019). https://huggingface.co/openai-community/gpt2

work page 2019
[47]

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2020. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. https://arxiv.org/pdf/1910.01108

work page internal anchor Pith review Pith/arXiv arXiv 2020
[48]

Neil Selwyn. 2020. Re-imagining ‘Learning Analytics’ . . . a case for starting again? The Internet and Higher Education 46 (2020), 100745. https: //doi.org/10.1016/j.iheduc.2020.100745

work page doi:10.1016/j.iheduc.2020.100745 2020
[49]

Wannapon Suraworachet, Jennifer Seon, and Mutlu Cukurova. 2024. Predicting challenge moments from students’ discourse: A comparison of GPT-4 to two traditional natural language processing approaches. LAK ’24: Proceedings of the 14th Learning Analytics and Knowledge Conference (2024), 473–485. https://doi.org/10.1145/3636555.3636905

work page doi:10.1145/3636555.3636905 2024
[50]

Dimitrios Tzimas and Stavros Demetriadis. 2021. Ethical issues in learning analytics: a review of the field. Educational Technology Research and Development 69 (2021). https://doi.org/10.1007/s11423-021-09977-4

work page doi:10.1007/s11423-021-09977-4 2021
[51]

Kurt VanLehn, Stellan Ohlsson, and Rod Nason. 1994. Applications of simulated students: An exploration. Journal of artificial intelligence in education 5 (1994), 135–135

work page 1994
[52]

Deborah West, Ann Luzeckyj, Bill Searle, Danny Toohey, Jessica Vanderlelie, and Kevin R Bell. 2020. Perspectives from the stakeholder: Students’ views regarding learning analytics and data collection. Australasian Journal of Educational Technology 36, 6 (2020), 72–88. https://doi.org/10.14742/ajet.5957

work page doi:10.14742/ajet.5957 2020
[53]

Lei Xu, Maria Skoularidou, Alfredo Cuesta -Infante, and Kalyan Veeramachaneni. 2019. Modeling Tabular data using Conditional GAN. Neural Information Processing Systems 32 (2019). https://papers.nips.cc/paper_files/paper/2019/hash/254ed7d2de3b23ab10936522dd547b78-Abstract.html

work page 2019
[54]

Shengzhe Xu, Virginia Tech, Cho-Ting Lee, Mandar Sharma, Raquib Yousuf, Nikhil Muralidhar, and Naren Ramakrishnan. 2024. Are LLMs Naturally Good at Synthetic Tabular Data Generation? ArXiv.org (2024). https://arxiv.org/pdf/2406.14541

work page arXiv 2024
[55]

Lixiang Yan, Linxuan Zhao, Dragan Gasevic, and Roberto Martinez-Maldonado. 2022. Scalability, sustainability, and ethicality of multimodal learning analytics. LAK22: 12th International Learning Analytics and Knowledge Conference (2022), 13–23. https://doi.org/10.1145/3506860.3506862

work page doi:10.1145/3506860.3506862 2022
[56]

Chen Zhan, Oscar Blessed Deho, Xuwei Zhang, Srecko Joksimovic, and Maarten de Laat. 2023. Synthetic data generator for student data serving learning analytics: A comparative study. Learning Letters 1 (2023), 5. https://doi.org/10.59453/KHZW9006

work page doi:10.59453/khzw9006 2023
[57]

Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. 2019. DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation. ArXiv.org (2019). https://arxiv.org/abs/1911.00536

work page arXiv 2019
[58]

Zilong Zhao, Aditya Kunar, Robert Birke, Lydia Chen, and Hiek Van der Scheer. 2021. CTAB-GAN: Effective Table Data Synthesizing. Proceedings of Machine Learning Research 157 (2021). https://proceedings.mlr.press/v157/zhao21a/zhao21a.pdf

work page 2021

[1] [1]

Mahed Abroshan, Andrew Elliott, and Mahdi Khalili. 2024. Imposing fairness constraints in synthetic data generation. Proceedings of The 27th International Conference on Artificial Intelligence and Statistics 238 (2024), 2269–2277. https://proceedings.mlr.press/v238/abroshan24a.html

work page 2024

[2] [2]

Ahmed Alaa, Boris Van Breugel, Evgeny Saveliev, and Mihaela Van Der Schaar. 2022. How Faithful is your Synthetic Data? Sample-level Metrics for Evaluating and Auditing Generative Models. Proceedings of the 39 th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022 (2022). https://proceedings.mlr.press/v162/alaa22a/alaa22a.pdf

work page 2022

[3] [3]

Chris Alexiuk, Shashank Verma, and Vivienne Zhang. 2024. Leverage the Latest Open Models for Synthetic Data Generation with NVIDIA Nemotron-4 340B. https://developer.nvidia.com/blog/leverage-our-latest-open-models-for-synthetic-data-generation-with-nvidia-nemotron-4-340b/

work page 2024

[4] [4]

Francis Anscombe. 1973. Graphs in Statistical Analysis. The American Statistician 27 (1973), 17–21. https://doi.org/10.2307/2682899

work page doi:10.2307/2682899 1973

[5] [5]

Berg, Stefan T

Alan M. Berg, Stefan T. Mol, Gábor Kismihók, and Niall Sclater. 2016. The role of a reference synthetic data generator within the field of learning analytics. Journal of Learning Analytics 3 (2016), 107–128. https://doi.org/10.18608/jla.2016.31.7

work page doi:10.18608/jla.2016.31.7 2016

[6] [6]

Anabel Bethencourt-Aguilar, Dagoberto Castellanos-Nieves, Juan José Sosa-Alonso, and Manuel Area-Moreira. 2023. Use of Generative Adversarial Networks (GANs) in Educational Technology Research. Journal of New Approaches in Educational Research 12 (2023), 153–153. https://doi.org/10. 7821/naer.2023.1.1231

work page 2023

[7] [7]

Karan Bhanot, Miao Qi, John S Erickson, Isabelle Guyon, and Kristin P Bennett. 2021. The Problem of Fairness in Synthetic Healthcare Data. Entropy 23 (2021). https://doi.org/10.3390/e23091165

work page doi:10.3390/e23091165 2021

[8] [8]

Vadim Borisov, Kathrin Sessler, Tobias Leemann, Martin Pawelczyk, and Gjergji Kasneci. 2023. Language Models are Realistic Tabular Data Generators. ArXiv.org (2023). https://arxiv.org/pdf/2210.06280

work page arXiv 2023

[9] [9]

Boris van Breugel and Mihaela van der Schaar. 2022. Why Tabular Foundation Models Should Be a Research Priority. Arxiv.org (2022). https: //arxiv.org/html/2405.01147v1#S7 18 Mohammad Khalil, Farhad Vadiee, Ronas Shakya, and Qinyi Liu Manuscript submitted to ACM

[10]

Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, John A Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuan-Fang Li, Scott M Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, and Yi Zhang. 2023. Sparks of artificial general intelligence: Early experiments with GPT-4. ArXiv abs/2303.12712 (2023). https://api.semanticscholar.org/Corpu...

[11]

Paulo Cortez. 2014. UCI Machine Learning Repository. https://archive.ics.uci.edu/dataset/320/student+performance

[12]

Tukur Dahiru. 2008. P-value, a true test of statistical significance? A cautionary note. Annals of Ibadan Postgraduate Medicine 6 (2008), 21–6. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4111019/

[13]

Shane Dawson, Srecko Joksimovic, Oleksandra Poquet, and George Siemens. 2019. Increasing the impact of learning analytics. LAK19: Proceedings of the 9th International Conference on Learning Analytics & Knowledge (2019), 446–455. https://doi.org/10.1145/3303772.3303784

[14]

Erica Espinosa and Alvaro Figueira. 2023. On the Quality of Synthetic Generated Tabular Data. Mathematics 11 (2023). https://doi.org/10.3390/math11153278

[15]

Wenzheng Feng, Jie Tang, and Tracy Xiao Liu. 2019. Understanding dropouts in MOOCs. AAAI’19/IAAI’19/EAAI’19: Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence (2019). https://do...

[16]

Rebecca Ferguson and Doug Clow. 2017. Where is the evidence? a call to action for learning analytics. LAK ’17: Proceedings of the Seventh International Learning Analytics & Knowledge Conference (2017), 56–65. https://doi.org/10.1145/3027385.3027396

[17]

Alvaro Figueira and Bruno Vaz. 2022. Survey on Synthetic Data Generation, Evaluation Methods and GANs. Mathematics 10 (2022), 2733. https://doi.org/10.3390/math10152733

[18]

Brendan Flanagan, Rwitajit Majumdar, and Hiroaki Ogata. 2022. Fine Grain Synthetic Educational Data: Challenges and Limitations of Collaborative Learning Analytics. IEEE Access 10 (2022), 26230–26241. https://doi.org/10.1109/access.2022.3156073

[19]

Mikel Hernandez, Gorka Epelde, Ane Alberdi, Rodrigo Cilla, and Debbie Rankin. 2023. Synthetic Tabular Data Evaluation in the Health Domain Covering Resemblance, Utility, and Privacy Dimensions. Methods of Information in Medicine 62 (2023), e19–e38. https://doi.org/10.1055/s-0042-1760247

[20]

Geoffrey Hinton and Sam Roweis. 2002. Stochastic Neighbor Embedding. https://cs.nyu.edu/~roweis/papers/sne_final.pdf

[21]

Markus Hittmeir, Andreas Ekelhart, and Rudolf Mayer. 2019. On the utility of synthetic data: An empirical evaluation on machine learning tasks. ARES ’19: Proceedings of the 14th International Conference on Availability, Reliability and Security (2019). https://doi.org/10.1145/3339252.3339281

[22]

Lan Jiang, Clara Belitz, and Nigel Bosch. 2024. Synthetic dataset generation for fairer unfairness research. LAK ’24: Proceedings of the 14th Learning Analytics and Knowledge Conference (2024), 200–209. https://doi.org/10.1145/3636555.3636868

[23]

James Jordon, Lukasz Szpruch, Florimond Houssiau, Mirko Bottarelli, Giovanni Cherubin, Carsten Maple, Samuel N Cohen, and Adrian Weller. 2022. Synthetic Data – what, why and how? The Royal Society (2022). https://doi.org/10.48550/arxiv.2205.03257

[24]

Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. 2020. Analyzing and Improving the Image Quality of StyleGAN. 8110–8119. https://openaccess.thecvf.com/content_CVPR_2020/html/Karras_Analyzing_and_Improving_the_Image_Quality_of_StyleGAN_CVPR_2020_paper.html

[25]

Mohammad Khalil. 2018. Learning Analytics in Massive Open Online Courses. ArXiv.org (2018). https://doi.org/10.48550/arXiv.1802.09344

[26]

Harsh Kumar, Ilya Musabirov, Joseph Jay Williams, and Michael Liut. 2023. QuickTA: Exploring the Design Space of Using Large Language Models to Provide Support to Students. Learning Analytics and Knowledge Conference (LAK’23) (2023). https://tspace.library.utoronto.ca/bitstream/1807/127196/1/2023_Kumar_QuickTA_exploring_design_space.pdf

[27]

Jakub Kuzilek, Martin Hlosta, and Zdenek Zdrahal. 2017. Open University Learning Analytics dataset. Scientific Data 4 (2017), 170171. https://doi.org/10.1038/sdata.2017.171

[28]

Jiayin Lin, Geng Sun, Jun Shen, Tingru Cui, Ping Yu, Dongming Xu, Li Li, and Ghassan Beydoun. 2019. Towards the readiness of learning analytics data for micro learning. Services Computing – SCC 2019: 16th International Conference, Held as Part of the Services Conference Federation, SCF 2019, San Diego, CA, USA, June 25–30, 2019, Proceedings (2019), 66–76....

[29]

Qinyi Liu, Oscar Deho, Farhad Vadiee, Mohammad Khalil, Srecko Joksimovic, and George Siemens. 2025. Can Synthetic Data Be Fair and Private? A Comparative Study of Synthetic Data Generation and Fairness Algorithms. In Proceedings of the 15th International Learning Analytics and Knowledge Conference (LAK’25). ACM. (2025). https://doi.org/10.1145/3706468.3706546

[30]

Qinyi Liu and Mohammad Khalil. 2023. Understanding privacy and data protection issues in learning analytics using a systematic review. British Journal of Educational Technology 54 (2023). https://doi.org/10.1111/bjet.13388

[31]

Qinyi Liu, Mohammad Khalil, Jelena Jovanovic, and Ronas Shakya. 2024. Scaling while privacy preserving: A comprehensive synthetic tabular data generation and evaluation in learning analytics. LAK ’24: Proceedings of the 14th Learning Analytics and Knowledge Conference (2024), 620–631. https://doi.org/10.1145/3636555.3636921

[32]

Ruibo Liu, Jerry Wei, Fangyu Liu, Chenglei Si, Yanzhe Zhang, Jinmeng Rao, Steven Zheng, Daiyi Peng, Diyi Yang, Denny Zhou, and Andrew Dai. 2024. Best Practices and Lessons Learned on Synthetic Data. Arxiv.org (2024). https://arxiv.org/pdf/2404.07503

[33]

Tennison Liu, Zhaozhi Qian, Jeroen Berrevoets, and Mihaela van der Schaar. 2023. GOGGLE: Generative Modelling for Tabular Data by Learning Relational Structure. https://openreview.net/forum?id=fPVRcJqspu

[34]

Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing Data using t-SNE. Journal of Machine Learning Research 9 (2008), 2579–2605. https://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf

[35]

Melle Mendikowski and Mattis Hartwig. 2022. Creating customers that never existed: Synthesis of e-commerce data using CTGAN. In 18th International Conference on Machine Learning and Data Mining (MLDM-22). New York, US: IBAI Publishing. 91–105

[36]

M.L. Menéndez, J.A. Pardo, L. Pardo, and M.C. Pardo. 1997. The Jensen-Shannon divergence. Journal of the Franklin Institute 334 (1997), 307–318. https://doi.org/10.1016/s0016-0032(96)00063-4

[37]

Marko Miletic and Murat Sariyar. 2024. Assessing the Potentials of LLMs and GANs as State-of-the-Art Tabular Synthetic Data Generation Methods. Privacy in Statistical Databases (2024), 374–389. https://doi.org/10.3390/app14145975

[38]

Luis Moles, Alain Andres, Goretti Echegaray, and Fernando Boto. 2024. Exploring Data Augmentation and Active Learning Benefits in Imbalanced Datasets. Mathematics 12 (2024). https://doi.org/10.3390/math12121898

[39]

Abdallah Moubayed, MohammadNoor Injadat, Abdallah Shami, Ali Bou Nassif, and Hanan Lutfiyya. 2020. Student Performance and Engagement Prediction in eLearning datasets. IEEE dataport (2020). https://dx.doi.org/10.21227/4xkr-0f88

[40]

Victor M. Panaretos and Yoav Zemel. 2019. Statistical Aspects of Wasserstein Distances. Annual Review of Statistics and Its Application 6 (2019), 405–431. https://doi.org/10.1146/annurev-statistics-030718-104938

[41]

Stanislav Pozdniakov, Jonathan Brazil, Solmaz Abdi, Aneesha Bakharia, Shazia Sadiq, Dragan Gašević, Paul Denny, and Hassan Khosravi. 2024. Large language models meet user interfaces: The case of provisioning feedback. Computers and Education: Artificial Intelligence 7 (2024), 100289. https://doi.org/10.1016/j.caeai.2024.100289

[42]

Paul Prinsloo, Mohammad Khalil, and Sharon Slade. 2023. Learning analytics as data ecology: a tentative proposal. Journal of Computing in Higher Education 36 (2023). https://doi.org/10.1007/s12528-023-09355-4

[43]

Paul Prinsloo, Sharon Slade, and Mohammad Khalil. 2019. Student data privacy in MOOCs: A sentiment analysis. Distance Education 40, 3 (2019), 395–413

[44]

Zhaozhi Qian, Bogdan-Constantin Cebere, and Mihaela van der Schaar. 2023. Synthcity: facilitating innovative use cases of synthetic data in different data modalities. arXiv:2301.07573 [cs] (2023). https://arxiv.org/abs/2301.07573

[45]

Zhaozhi Qian, Rob Davis, and Mihaela Van Der Schaar. 2024. Synthcity: a benchmark framework for diverse use cases of tabular synthetic data. https://proceedings.neurips.cc/paper_files/paper/2023/file/09723c9f291f6056fd1885081859c186-Paper-Datasets_and_Benchmarks.pdf

[46]

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language Models are Unsupervised Multitask Learners. (2019). https://huggingface.co/openai-community/gpt2

[47]

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2020. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. https://arxiv.org/pdf/1910.01108

[48]

Neil Selwyn. 2020. Re-imagining ‘Learning Analytics’ ... a case for starting again? The Internet and Higher Education 46 (2020), 100745. https://doi.org/10.1016/j.iheduc.2020.100745

[49]

Wannapon Suraworachet, Jennifer Seon, and Mutlu Cukurova. 2024. Predicting challenge moments from students’ discourse: A comparison of GPT-4 to two traditional natural language processing approaches. LAK ’24: Proceedings of the 14th Learning Analytics and Knowledge Conference (2024), 473–485. https://doi.org/10.1145/3636555.3636905

[50]

Dimitrios Tzimas and Stavros Demetriadis. 2021. Ethical issues in learning analytics: a review of the field. Educational Technology Research and Development 69 (2021). https://doi.org/10.1007/s11423-021-09977-4

[51]

Kurt VanLehn, Stellan Ohlsson, and Rod Nason. 1994. Applications of simulated students: An exploration. Journal of Artificial Intelligence in Education 5 (1994), 135–135

[52]

Deborah West, Ann Luzeckyj, Bill Searle, Danny Toohey, Jessica Vanderlelie, and Kevin R Bell. 2020. Perspectives from the stakeholder: Students’ views regarding learning analytics and data collection. Australasian Journal of Educational Technology 36, 6 (2020), 72–88. https://doi.org/10.14742/ajet.5957

[53]

Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, and Kalyan Veeramachaneni. 2019. Modeling Tabular data using Conditional GAN. Neural Information Processing Systems 32 (2019). https://papers.nips.cc/paper_files/paper/2019/hash/254ed7d2de3b23ab10936522dd547b78-Abstract.html

[54]

Shengzhe Xu, Cho-Ting Lee, Mandar Sharma, Raquib Yousuf, Nikhil Muralidhar, and Naren Ramakrishnan. 2024. Are LLMs Naturally Good at Synthetic Tabular Data Generation? ArXiv.org (2024). https://arxiv.org/pdf/2406.14541

[55]

Lixiang Yan, Linxuan Zhao, Dragan Gasevic, and Roberto Martinez-Maldonado. 2022. Scalability, sustainability, and ethicality of multimodal learning analytics. LAK22: 12th International Learning Analytics and Knowledge Conference (2022), 13–23. https://doi.org/10.1145/3506860.3506862

[56]

Chen Zhan, Oscar Blessed Deho, Xuwei Zhang, Srecko Joksimovic, and Maarten de Laat. 2023. Synthetic data generator for student data serving learning analytics: A comparative study. Learning Letters 1 (2023), 5. https://doi.org/10.59453/KHZW9006

[57]

Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. 2019. DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation. ArXiv.org (2019). https://arxiv.org/abs/1911.00536

[58]

Zilong Zhao, Aditya Kunar, Robert Birke, Lydia Chen, and Hiek Van der Scheer. 2021. CTAB-GAN: Effective Table Data Synthesizing. Proceedings of Machine Learning Research 157 (2021). https://proceedings.mlr.press/v157/zhao21a/zhao21a.pdf
