arxiv: 2501.01793 · v1 · pith:HCSTDOTLnew · submitted 2025-01-03 · 💻 cs.LG · cs.AI

Creating Artificial Students that Never Existed: Leveraging Large Language Models and CTGANs for Synthetic Data Generation

Pith reviewed 2026-05-23 06:01 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords synthetic data · CTGAN · large language models · learning analytics · student data · GAN · privacy

The pith

CTGAN and LLMs generate synthetic student data that resembles real data for learning analytics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether CTGAN and large language models can create synthetic tabular records about students. Privacy rules restrict access to real student data, so the authors treat synthetic generation as a direct substitute. They train CTGAN along with GPT2, DistilGPT2, and DialoGPT on student datasets, then measure how closely the outputs match the originals in statistical distributions and in how well downstream models perform on them. The results indicate the generated records preserve enough structure to support learning analytics tasks.

Core claim

CTGAN and the three LLMs produce synthetic student datasets whose statistical properties and predictive utility closely match those of real student data, allowing the synthetic records to serve learning analytics models while avoiding direct use of protected information.

What carries the argument

Generative models (CTGAN for tabular synthesis and LLMs for feature generation) trained to replicate the joint distribution of real student records.
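For the LLM side, a common approach in the tabular-generation literature is to serialize each record as a short sentence of "column is value" clauses, fine-tune a causal language model on those sentences, and parse sampled sentences back into rows. Whether the paper uses exactly this encoding is an assumption; the sketch below only illustrates the round trip, and the column names (`age`, `studytime`, `passed`) and helper names are hypothetical.

```python
import random


def encode_row(row, rng=None):
    """Serialize one record as comma-separated 'column is value' clauses."""
    items = list(row.items())
    if rng is not None:
        # Shuffling the clause order is optional; it avoids tying the
        # language model to one fixed column order.
        rng.shuffle(items)
    return ", ".join(f"{col} is {val}" for col, val in items)


def decode_row(text, columns):
    """Parse a generated sentence back into a dict of strings.

    Returns None when any expected column is missing, so malformed
    samples can be discarded.
    """
    row = {}
    for clause in text.split(", "):
        col, sep, val = clause.partition(" is ")
        if sep and col in columns:
            row[col] = val
    return row if set(row) == set(columns) else None


record = {"age": 17, "studytime": 2, "passed": "yes"}  # hypothetical columns
sentence = encode_row(record)
print(sentence)
print(decode_row(sentence, list(record)))
```

Decoded values come back as strings, so a real pipeline would also need to cast each column to its original type and reject values outside the column's valid range.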

If this is right

  • Synthetic records can stand in for real student data when training or evaluating learning analytics models.
  • CTGAN and the tested LLMs can be ranked against one another by the same utility metrics.
  • Learning analytics work gains an additional data-generation route that sidesteps data-protection barriers.
  • Methodological experiments in the field can draw on larger or more varied synthetic collections than real data alone would allow.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same generators could be applied to other tabular domains that face comparable privacy limits.
  • Synthetic data sets produced this way could serve as controlled test beds for checking whether analytics models remain stable under distribution shifts.
  • Integration of these generators into existing analytics pipelines would let researchers measure end-to-end effects on model outputs.

Load-bearing premise

Utility metrics on synthetic data serve as reliable stand-ins for performance when the same data is used inside actual learning analytics applications.

What would settle it

Train the same learning analytics model once on real student data and once on the synthetic data, then observe whether accuracy or other task metrics drop sharply on an independent real test set.
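A minimal sketch of that comparison, using only the Python standard library. Everything here is an illustrative assumption: the "real" data is a toy two-feature dataset, the generator is a per-class Gaussian fit standing in for CTGAN or an LLM, and the downstream model is a nearest-centroid classifier rather than any model from the paper.

```python
import random
from statistics import mean, stdev


def make_real(n, rng):
    """Toy 'real' records: two numeric features and a binary label."""
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        centre = 5.0 if label else 0.0
        rows.append(([rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)], label))
    return rows


def fit_stand_in_generator(rows):
    """Per-class, per-feature mean and std (a stand-in, NOT CTGAN or an LLM)."""
    params = {}
    for label in {y for _, y in rows}:
        cols = list(zip(*[x for x, y in rows if y == label]))
        params[label] = [(mean(c), stdev(c)) for c in cols]
    return params


def sample(params, n, rng):
    """Draw n synthetic records from the fitted per-class Gaussians."""
    labels = sorted(params)
    out = []
    for _ in range(n):
        label = rng.choice(labels)
        out.append(([rng.gauss(m, s) for m, s in params[label]], label))
    return out


def fit_centroids(rows):
    """Downstream model: one mean vector per class."""
    cents = {}
    for label in {y for _, y in rows}:
        cols = list(zip(*[x for x, y in rows if y == label]))
        cents[label] = [mean(c) for c in cols]
    return cents


def accuracy(cents, rows):
    """Share of rows whose nearest centroid carries the true label."""
    def predict(x):
        return min(cents, key=lambda l: sum((a - b) ** 2 for a, b in zip(x, cents[l])))
    return mean(1.0 if predict(x) == y else 0.0 for x, y in rows)


rng = random.Random(0)
real_train, real_test = make_real(400, rng), make_real(200, rng)
synthetic_train = sample(fit_stand_in_generator(real_train), 400, rng)

trtr = accuracy(fit_centroids(real_train), real_test)       # train real, test real
tstr = accuracy(fit_centroids(synthetic_train), real_test)  # train synthetic, test real
print(f"train-real/test-real: {trtr:.3f}  train-synthetic/test-real: {tstr:.3f}")
```

The quantity of interest is the gap between the two scores on the same held-out real test set. A small gap on an easy toy problem like this one says little; the check only becomes informative on the actual student datasets and task models.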

Figures

Figures reproduced from arXiv: 2501.01793 by Mohammad Khalil, Qinyi Liu, Ronas Shakya, Sam Urmian.

Figure 2: t-SNE plots showing the similarity between synthetic data generated by each generative model and the real data for Dataset B2. Best viewed in colour.

Figure 6: The mean value for the classification machine learning evaluation metrics of Accuracy (left), AUCROC (middle), and F1-score (right) for all the LA datasets and the generative models. Black bar stands for the standard error of mean. Best viewed in colour.
Original abstract

In this study, we explore the growing potential of AI and deep learning technologies, particularly Generative Adversarial Networks (GANs) and Large Language Models (LLMs), for generating synthetic tabular data. Access to quality student data is critical for advancing learning analytics, but privacy concerns and stricter data protection regulations worldwide limit its availability and usage. Synthetic data offers a promising alternative. We investigate whether synthetic data can be leveraged to create artificial students for serving learning analytics models. Using the popular GAN model CTGAN and three LLMs (GPT2, DistilGPT2, and DialoGPT), we generate synthetic tabular student data. Our results demonstrate the strong potential of these methods to produce high-quality synthetic datasets that resemble real student data. To validate our findings, we apply a comprehensive set of utility evaluation metrics to assess the statistical and predictive performance of the synthetic data and compare the different generator models used, especially the performance of LLMs. Our study aims to provide the learning analytics community with valuable insights into the use of synthetic data, laying the groundwork for expanding the field's methodological toolbox with new innovative approaches for learning analytics data generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript explores the application of CTGAN and LLMs (GPT-2, DistilGPT-2, DialoGPT) for generating synthetic tabular student data to address privacy concerns in learning analytics. It claims that a comprehensive set of utility evaluation metrics demonstrates the strong potential of these methods to produce high-quality synthetic datasets resembling real student data, which can then be used to train learning analytics models.

Significance. If the empirical results are substantiated with concrete metrics and validated against downstream learning analytics tasks, this could offer valuable methodological contributions to the field by expanding the toolbox for synthetic data generation in education. The paper does not indicate that code or data are provided for reproducibility.

major comments (2)
  1. [Abstract] Abstract: The assertion that 'our results demonstrate the strong potential' based on 'utility evaluation metrics' is not accompanied by any specific numerical values, chosen metrics, baseline comparisons, or statistical significance tests, undermining the ability to evaluate the central claim.
  2. [Evaluation section] Evaluation: The paper applies a set of utility metrics (statistical and predictive) but provides no evidence or discussion that these metrics correlate with or proxy actual performance on downstream learning analytics tasks (e.g., dropout prediction or grade forecasting) when models are trained on synthetic data and evaluated on real held-out data.
minor comments (1)
  1. The abstract could benefit from briefly naming the specific utility metrics applied to improve clarity on the evaluation approach.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each major comment below and outline the revisions we will make.

point-by-point responses
  1. Referee: [Abstract] Abstract: The assertion that 'our results demonstrate the strong potential' based on 'utility evaluation metrics' is not accompanied by any specific numerical values, chosen metrics, baseline comparisons, or statistical significance tests, undermining the ability to evaluate the central claim.

    Authors: We agree that the abstract would be strengthened by including concrete numerical results. The evaluation section contains the full set of metrics (including statistical similarity scores such as correlation and distribution distances, predictive utility accuracies, and model comparisons), but these were not summarized numerically in the abstract. In the revised manuscript we will update the abstract to report key quantitative findings, chosen metrics, and baseline comparisons from the results. revision: yes

  2. Referee: [Evaluation section] Evaluation: The paper applies a set of utility metrics (statistical and predictive) but provides no evidence or discussion that these metrics correlate with or proxy actual performance on downstream learning analytics tasks (e.g., dropout prediction or grade forecasting) when models are trained on synthetic data and evaluated on real held-out data.

    Authors: Our predictive utility evaluation already trains models on synthetic data and evaluates them on real held-out data, which directly measures performance on downstream-style tasks. However, we did not explicitly discuss the correlation of these metrics with specific learning analytics applications such as dropout prediction or grade forecasting, nor did we include those exact tasks. We will add a dedicated discussion paragraph in the evaluation section explaining how the chosen predictive utility metrics serve as proxies and will consider adding one illustrative downstream task experiment if space allows. revision: partial
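The "distribution distances" named in the first response can be illustrated with the Jensen-Shannon divergence between a real and a synthetic categorical column. Whether the paper reports this exact metric is an assumption; the column values (`pass`, `fail`) and their counts below are made up for illustration.

```python
from collections import Counter
from math import log2


def distribution(values, categories):
    """Relative frequency of each category, in a fixed category order."""
    counts = Counter(values)
    total = len(values)
    return [counts[c] / total for c in categories]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability in x contribute nothing to the sum.
        return sum(a * log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical outcome column in the real and synthetic tables.
real = ["pass"] * 70 + ["fail"] * 30
synthetic = ["pass"] * 60 + ["fail"] * 40

cats = sorted(set(real) | set(synthetic))
jsd = js_divergence(distribution(real, cats), distribution(synthetic, cats))
print(f"JSD = {jsd:.4f}")
```

A per-column score like this captures marginal similarity only. It does not show that relationships between columns are preserved, which is why the referee's second comment asks for evidence on downstream tasks as well.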

Circularity Check

0 steps flagged

Empirical generator comparison with no derivation chain

full rationale

The paper is a standard empirical comparison of CTGAN and three LLMs for generating synthetic student tabular data, followed by application of statistical and predictive utility metrics. No equations, fitted parameters, or derivation steps are present that reduce any reported result to its inputs by construction. The work contains no self-citations used as load-bearing uniqueness theorems, no ansatzes smuggled via citation, and no renaming of known results as new derivations. The evaluation metrics are applied after generation and do not create a self-definitional loop. This is self-contained empirical research with no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review yields no explicit free parameters or invented entities. The single axiom counted here is implicit: the standard assumption that GANs and LLMs can approximate the joint distribution of real tabular records.

axioms (1)
  • domain assumption: Generative models trained on real student records can produce synthetic records whose statistical properties support downstream predictive models.
    Central premise required for the utility claim; stated in the abstract's description of validation.

pith-pipeline@v0.9.0 · 5740 in / 1273 out tokens · 67875 ms · 2026-05-23T06:01:50.150531+00:00 · methodology
