Creating Artificial Students that Never Existed: Leveraging Large Language Models and CTGANs for Synthetic Data Generation
Pith reviewed 2026-05-23 06:01 UTC · model grok-4.3
The pith
CTGAN and LLMs generate synthetic student data that resembles real data for learning analytics.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CTGAN and the three LLMs produce synthetic student datasets whose statistical properties and predictive utility closely match those of real student data, allowing the synthetic records to serve learning analytics models while avoiding direct use of protected information.
What carries the argument
Generative models (CTGAN for tabular synthesis and LLMs for feature generation) trained to replicate the joint distribution of real student records.
If this is right
- Synthetic records can stand in for real student data when training or evaluating learning analytics models.
- CTGAN and the tested LLMs can be ranked against one another by the same utility metrics.
- Learning analytics work gains an additional data-generation route that sidesteps data-protection barriers.
- Methodological experiments in the field can draw on larger or more varied synthetic collections than real data alone would allow.
Where Pith is reading between the lines
- The same generators could be applied to other tabular domains that face comparable privacy limits.
- Synthetic datasets produced this way could serve as controlled test beds for checking whether analytics models remain stable under distribution shifts.
- Integration of these generators into existing analytics pipelines would let researchers measure end-to-end effects on model outputs.
Load-bearing premise
Utility metrics on synthetic data serve as reliable stand-ins for performance when the same data is used inside actual learning analytics applications.
What would settle it
Train the same learning analytics model once on real student data and once on the synthetic data, then observe whether accuracy or other task metrics drop sharply on an independent real test set.
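The check described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the paper's procedure: the records are toy two-feature data generated in the script, the "synthetic" set is simply a noisier draw standing in for CTGAN or LLM output, and the nearest-centroid classifier and every function name (`make_records`, `fit_centroids`, `utility_gap`) are hypothetical choices made for the example.

```python
import random
from statistics import mean


def make_records(n, rng, noise=1.0):
    """Toy 'student' rows: ((hours_studied, prior_grade), passed).

    Assumption: stands in for real or generated data; not the paper's dataset.
    """
    rows = []
    for _ in range(n):
        label = int(rng.random() < 0.5)
        center = (8.0, 14.0) if label else (3.0, 9.0)
        features = tuple(rng.gauss(c, noise) for c in center)
        rows.append((features, label))
    return rows


def fit_centroids(rows):
    """Nearest-centroid model: one mean feature vector per class."""
    centroids = {}
    for label in {y for _, y in rows}:
        points = [x for x, y in rows if y == label]
        centroids[label] = tuple(mean(col) for col in zip(*points))
    return centroids


def predict(centroids, x):
    """Return the class whose centroid is closest in squared distance."""
    return min(
        centroids,
        key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])),
    )


def accuracy(centroids, rows):
    return mean(1.0 if predict(centroids, x) == y else 0.0 for x, y in rows)


def utility_gap(real_train, synthetic_train, real_test):
    """Train on real vs. train on synthetic, both scored on real held-out data."""
    trtr = accuracy(fit_centroids(real_train), real_test)
    tstr = accuracy(fit_centroids(synthetic_train), real_test)
    return trtr, tstr, trtr - tstr


rng = random.Random(0)
real_train = make_records(300, rng)
real_test = make_records(200, rng)
# Hypothetical stand-in for generator output: same classes, more noise.
synthetic_train = make_records(300, rng, noise=2.0)

trtr, tstr, gap = utility_gap(real_train, synthetic_train, real_test)
print(f"train-real acc={trtr:.3f}  train-synthetic acc={tstr:.3f}  gap={gap:.3f}")
```

A large positive gap on the real test set would count against the load-bearing premise; a gap near zero would support it for that task only.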
Original abstract
In this study, we explore the growing potential of AI and deep learning technologies, particularly Generative Adversarial Networks (GANs) and Large Language Models (LLMs), for generating synthetic tabular data. Access to quality student data is critical for advancing learning analytics, but privacy concerns and stricter data protection regulations worldwide limit its availability and use. Synthetic data offers a promising alternative. We investigate whether synthetic data can be leveraged to create artificial students for serving learning analytics models. Using the popular GAN model CTGAN and three LLMs (GPT2, DistilGPT2, and DialoGPT), we generate synthetic tabular student data. Our results demonstrate the strong potential of these methods to produce high-quality synthetic datasets that resemble real student data. To validate our findings, we apply a comprehensive set of utility evaluation metrics to assess the statistical and predictive performance of the synthetic data and compare the different generator models used, especially the performance of the LLMs. Our study aims to provide the learning analytics community with valuable insights into the use of synthetic data, laying the groundwork for expanding the field's methodological toolbox with new, innovative approaches to learning analytics data generation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript explores the application of CTGAN and LLMs (GPT-2, DistilGPT-2, DialoGPT) for generating synthetic tabular student data to address privacy concerns in learning analytics. It claims that a comprehensive set of utility evaluation metrics demonstrates the strong potential of these methods to produce high-quality synthetic datasets resembling real student data, which can then be used to train learning analytics models.
Significance. If the empirical results are substantiated with concrete metrics and validated against downstream learning analytics tasks, this work could offer a valuable methodological contribution to the field by expanding the toolbox for synthetic data generation in education. The paper does not state whether code or data are available for reproducibility.
major comments (2)
- [Abstract] Abstract: The assertion that 'our results demonstrate the strong potential' based on 'utility evaluation metrics' is not accompanied by any specific numerical values, chosen metrics, baseline comparisons, or statistical significance tests, undermining the ability to evaluate the central claim.
- [Evaluation section] Evaluation: The paper applies a set of utility metrics (statistical and predictive) but provides no evidence or discussion that these metrics correlate with or proxy actual performance on downstream learning analytics tasks (e.g., dropout prediction or grade forecasting) when models are trained on synthetic data and evaluated on real held-out data.
minor comments (1)
- The abstract could benefit from briefly naming the specific utility metrics applied to improve clarity on the evaluation approach.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address each major comment below and outline the revisions we will make.
- Referee: [Abstract] Abstract: The assertion that 'our results demonstrate the strong potential' based on 'utility evaluation metrics' is not accompanied by any specific numerical values, chosen metrics, baseline comparisons, or statistical significance tests, undermining the ability to evaluate the central claim.
  Authors: We agree that the abstract would be strengthened by including concrete numerical results. The evaluation section contains the full set of metrics (including statistical similarity scores such as correlation and distribution distances, predictive utility accuracies, and model comparisons), but these were not summarized numerically in the abstract. In the revised manuscript we will update the abstract to report key quantitative findings, chosen metrics, and baseline comparisons from the results.
  revision: yes
- Referee: [Evaluation section] Evaluation: The paper applies a set of utility metrics (statistical and predictive) but provides no evidence or discussion that these metrics correlate with or proxy actual performance on downstream learning analytics tasks (e.g., dropout prediction or grade forecasting) when models are trained on synthetic data and evaluated on real held-out data.
  Authors: Our predictive utility evaluation already trains models on synthetic data and evaluates them on real held-out data, which directly measures performance on downstream-style tasks. However, we did not explicitly discuss the correlation of these metrics with specific learning analytics applications such as dropout prediction or grade forecasting, nor did we include those exact tasks. We will add a dedicated discussion paragraph in the evaluation section explaining how the chosen predictive utility metrics serve as proxies and will consider adding one illustrative downstream task experiment if space allows.
  revision: partial
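The rebuttal mentions "distribution distances" as statistical similarity scores without naming them. As a hedged illustration of what such a score can look like, the sketch below computes the Jensen-Shannon divergence between one categorical column in a real table and the same column in a synthetic table. The choice of this particular distance, the toy columns, and the function names are all assumptions made for the example; the review does not confirm which distances the paper used.

```python
import math
from collections import Counter


def category_distribution(values, categories):
    """Relative frequency of each category, in a fixed category order."""
    counts = Counter(values)
    total = len(values)
    return [counts[c] / total for c in categories]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Symmetric divergence in [0, 1] (base-2 logs); 0 means identical."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical columns: outcome labels in a real and a synthetic table.
real_col = ["pass"] * 70 + ["fail"] * 30
synth_col = ["pass"] * 60 + ["fail"] * 40

categories = sorted(set(real_col) | set(synth_col))
p = category_distribution(real_col, categories)
q = category_distribution(synth_col, categories)
jsd = jensen_shannon_divergence(p, q)
print(f"JSD between real and synthetic column: {jsd:.4f}")
```

A per-column score like this only captures marginal similarity. It says nothing about whether relationships between columns are preserved, which is one reason the referee asks for evidence from downstream tasks as well.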
Circularity Check
Empirical generator comparison with no derivation chain
full rationale
The paper is a standard empirical comparison of CTGAN and three LLMs for generating synthetic student tabular data, followed by application of statistical and predictive utility metrics. No equations, fitted parameters, or derivation steps are present that reduce any reported result to its inputs by construction. The work contains no self-citations used as load-bearing uniqueness theorems, no ansatzes smuggled via citation, and no renaming of known results as new derivations. The evaluation metrics are applied after generation and do not create a self-definitional loop. This is self-contained empirical research with no circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Generative models trained on real student records can produce synthetic records whose statistical properties support downstream predictive models.