Using Synthetic Data for Machine Learning-based Childhood Vaccination Prediction in Narok, Kenya
Pith reviewed 2026-05-10 16:57 UTC · model grok-4.3
The pith
Machine learning models trained on synthetic data can accurately predict which children are at risk of missing vaccines in Kenya while protecting privacy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Classification models trained on MOH 510 vaccination records from Narok County can reliably identify children at risk of missing key vaccines. Training the same models on TabSyn-generated synthetic data produces equivalent predictive results, allowing the use of data for forecasting without exposing individual patient details in a vulnerable population.
What carries the argument
TabSyn, a tabular diffusion-based synthetic data generator, whose output is used alongside the real records to train Logistic Regression and XGBoost classifiers that identify children at risk of missing a vaccine.
If this is right
- Targeted interventions can reach children predicted at highest risk of missing vaccines to raise overall coverage rates.
- Clinics with limited digital systems can still run scalable forecasts of immunization needs using synthetic records.
- Privacy concerns in nomadic and low-resource populations no longer block the use of health data for prediction.
- Resource allocation for vaccine delivery can rely on evidence from models that do not require sharing original patient files.
Where Pith is reading between the lines
- The method could extend to predicting other health service gaps in populations where real data sharing is restricted.
- Mobile tools for community health workers might incorporate these risk scores to prioritize home visits.
- Testing the same workflow on vaccination datasets from different regions would show whether TabSyn performs consistently across schedules and cultures.
Load-bearing premise
The synthetic data accurately reproduces the statistical distributions and relationships in the original vaccination records without introducing biases that would affect predictions for at-risk children.
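One way to probe this premise is to compare the real and synthetic tables directly, both overall and within the at-risk class. The sketch below is illustrative only: it uses a two-sample Kolmogorov-Smirnov statistic for marginals and a Pearson correlation gap for pairwise structure, and the column names (`age_months`, `distance_km`), the toy data process, and the "synthetic" table (drawn from the same toy process with a different seed, not from TabSyn) are all assumptions rather than anything taken from the paper.

```python
import bisect
import math
import random


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    for x in a_sorted + b_sorted:
        cdf_a = bisect.bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect.bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy)


def make_table(seed, n=500):
    """Hypothetical toy records: (age_months, distance_km, at_risk)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        age = rng.uniform(0, 24)
        dist = rng.expovariate(1 / 10)
        at_risk = 1 if dist + rng.gauss(0, 5) > 20 else 0
        rows.append((age, dist, at_risk))
    return rows


real = make_table(seed=1)
synth = make_table(seed=2)  # stand-in for generator output, NOT TabSyn

report = {}
for col, name in enumerate(["age_months", "distance_km"]):
    # Marginal fidelity over all records.
    report[name + "_ks_all"] = ks_statistic([r[col] for r in real],
                                            [s[col] for s in synth])
    # Class-conditional fidelity for the minority (at-risk) class only.
    report[name + "_ks_at_risk"] = ks_statistic(
        [r[col] for r in real if r[2] == 1],
        [s[col] for s in synth if s[2] == 1])

# Does the distance/label relationship survive in the synthetic table?
report["corr_gap"] = abs(
    pearson([r[1] for r in real], [r[2] for r in real])
    - pearson([s[1] for s in synth], [s[2] for s in synth]))

for key, value in report.items():
    print(f"{key}: {value:.3f}")
```

Values near zero suggest the synthetic table tracks the real one on that check; a large class-conditional gap alongside a small overall gap would point to the minority-class distortion that the premise rules out.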
What would settle it
Train separate models on real records and on TabSyn synthetic records, then test both on a fresh hold-out set of actual vaccination records; a meaningful drop in recall or precision for the synthetic-trained model would disprove the claim of no performance loss.
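The test described above can be sketched end to end on toy data. This is a minimal illustration under stated assumptions, not the paper's pipeline: the records are simulated, the classifier is a small hand-written logistic regression rather than the authors' Logistic Regression or XGBoost setup, and `jitter_resample` is a hypothetical stand-in for a synthetic data generator (it is not TabSyn).

```python
import math
import random


def make_records(rng, n):
    """Toy records: two standardized features and a minority 'at risk' label (1)."""
    rows, labels = [], []
    for _ in range(n):
        x1, x2 = rng.gauss(0, 1), rng.gauss(0, 1)
        rows.append([x1, x2])
        labels.append(1 if x1 + 0.5 * x2 + rng.gauss(0, 0.5) > 0.8 else 0)
    return rows, labels


def jitter_resample(rng, rows, labels, n, noise=0.1):
    """Stand-in synthetic generator (NOT TabSyn): resample rows, add Gaussian noise."""
    out_rows, out_labels = [], []
    for _ in range(n):
        i = rng.randrange(len(rows))
        out_rows.append([v + rng.gauss(0, noise) for v in rows[i]])
        out_labels.append(labels[i])
    return out_rows, out_labels


def train_logreg(rows, labels, epochs=300, lr=0.5):
    """Fit logistic regression by batch gradient descent; bias is the last weight."""
    w = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(rows, labels):
            z = w[-1] + sum(wi * xi for wi, xi in zip(w, x))
            z = max(-30.0, min(30.0, z))  # keep exp() in a safe range
            err = 1.0 / (1.0 + math.exp(-z)) - y
            for k, xi in enumerate(x):
                grad[k] += err * xi
            grad[-1] += err
        w = [wi - lr * g / len(rows) for wi, g in zip(w, grad)]
    return w


def predict(w, rows):
    """Label a row 1 when the linear score is non-negative (probability >= 0.5)."""
    return [1 if w[-1] + sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0
            for x in rows]


def recall_precision(y_true, y_pred):
    """Recall and precision for the positive (at-risk) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision


rng = random.Random(42)
real_train = make_records(rng, 400)
real_test = make_records(rng, 200)  # fresh hold-out of "real" records
synth_train = jitter_resample(rng, *real_train, n=400)

results = {}
for name, (rows, labels) in {"real": real_train, "synthetic": synth_train}.items():
    w = train_logreg(rows, labels)
    results[name] = recall_precision(real_test[1], predict(w, real_test[0]))

# Positive gap means the synthetic-trained model recalls fewer at-risk children.
recall_gap = results["real"][0] - results["synthetic"][0]
print(results, recall_gap)
```

The key design point is that both models are scored on the same real hold-out set, which neither the real-trained model nor the generator has seen; what counts as a "meaningful" gap is a judgment the sketch leaves open.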
Original abstract
Background: Limited data utilization in low-resource settings poses a barrier to the vaccine delivery ecosystem, undermining efforts to achieve equitable immunization coverage. In nomadic populations, individuals face an increased risk of missing crucial vaccination doses as children. One such population is the Maasai in Narok County, Kenya, where the absence of high-volume, quality data hampers accurate coverage estimates, impedes efficient resource allocation, and weakens the ability to deliver timely interventions. Additionally, data privacy concerns are heightened in groups with limited sensitive data.
Objectives: First, we aim to identify children at risk of missing key vaccines across a large population to provide timely, evidence-based interventions that support increased vaccination coverage. Second, we aim to better protect the privacy of sensitive health data in a vulnerable population.
Methods: We digitized 8 years of child vaccination records from the MOH 510 registry (n=6,913) and applied machine learning models (Logistic Regression and XGBoost) to identify children at risk. Additionally, we utilize a novel approach to tabular diffusion-based synthetic data generation (TabSyn) to protect patient privacy within the models.
Results: Our findings show that classification techniques can reliably and successfully predict children at risk of missing a vaccine, with recall, precision, and F1-scores exceeding 90% for some vaccines modeled. Additionally, training these models with synthetic data rather than real data, thus preserving the privacy of individuals within the original dataset, does not lead to a loss in predictive performance.
Conclusion: These results support the use of synthetic data implementation in health informatics strategies for clinics with limited digital infrastructure, enabling privacy-preserving, scalable forecasting for childhood immunization coverage.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript describes the application of machine learning classifiers (Logistic Regression and XGBoost) to predict children at risk of missing vaccinations using a dataset of 6,913 records from the MOH 510 registry in Narok, Kenya. It further explores the use of TabSyn for generating synthetic tabular data to preserve privacy and shows that models trained on this synthetic data achieve comparable performance to those trained on real data, with some models reporting recall, precision, and F1-scores above 90%.
Significance. If validated, these results would demonstrate the feasibility of using synthetic data for privacy-preserving predictive modeling in public health, particularly in low-resource and nomadic populations where data sensitivity is high. This could support better resource allocation for vaccination programs. The approach addresses both predictive accuracy and ethical data use, which is valuable for health informatics in similar contexts. However, the absence of key methodological details currently prevents a full evaluation of the claims' robustness.
major comments (3)
- [Methods] Methods: The description of the experimental setup lacks details on data partitioning (e.g., train/validation/test splits), cross-validation procedures, and hyperparameter optimization for the Logistic Regression and XGBoost models. These are essential to evaluate whether the reported performance metrics (recall, precision, F1 >90%) reflect true predictive capability or potential overfitting, especially with a dataset of n=6,913 and likely class imbalance.
- [Results] Results: There are no reported checks on the fidelity of the TabSyn-generated synthetic data, such as comparisons of statistical distributions, class-conditional metrics, or preservation of correlations for the at-risk (minority) class. Given the potential for distortion in rare-event tails with tabular diffusion models, this omission undermines the claim that synthetic data training does not lead to a loss in predictive performance.
- [Results] Results: No statistical significance testing or confidence intervals are provided for the performance metrics or the comparison between real and synthetic training setups. This is particularly important to substantiate the equivalence claim.
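The third comment asks for confidence intervals and a significance test on the real-versus-synthetic comparison. A minimal sketch of both follows, assuming a percentile bootstrap over test-set records for the interval and an exact McNemar test on the two models' discordant predictions. The prediction vectors are simulated here and every name is hypothetical; nothing below reproduces the paper's numbers.

```python
import math
import random


def recall(y_true, y_pred):
    """Recall for the positive (at-risk) class; 0.0 if there are no positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if tp + fn else 0.0


def bootstrap_ci(y_true, y_pred, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval: resample test records with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar p-value from the discordant pairs of two models."""
    only_a = sum(1 for t, a, b in zip(y_true, pred_a, pred_b) if a == t and b != t)
    only_b = sum(1 for t, a, b in zip(y_true, pred_a, pred_b) if a != t and b == t)
    n = only_a + only_b
    if n == 0:
        return 1.0  # the models never disagree on correctness
    tail = sum(math.comb(n, k) for k in range(min(only_a, only_b) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def flip(rng, labels, rate):
    """Simulate a model's predictions by flipping true labels at a given rate."""
    return [1 - y if rng.random() < rate else y for y in labels]


rng = random.Random(7)
y_true = [1 if rng.random() < 0.3 else 0 for _ in range(300)]
pred_real = flip(rng, y_true, 0.10)   # hypothetical real-trained model
pred_synth = flip(rng, y_true, 0.12)  # hypothetical synthetic-trained model

ci_real = bootstrap_ci(y_true, pred_real, recall)
ci_synth = bootstrap_ci(y_true, pred_synth, recall)
p_value = mcnemar_exact(y_true, pred_real, pred_synth)
print("recall CI (real):", ci_real)
print("recall CI (synthetic):", ci_synth)
print("McNemar p-value:", p_value)
```

Note that a non-significant McNemar result only fails to detect a difference; supporting an equivalence claim would also need a pre-stated margin for how large a gap is acceptable.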
minor comments (2)
- [Methods] Methods: Provide more details on how the target labels for 'at risk' were defined from the vaccination records.
- [Abstract] Abstract: The abstract mentions 'some vaccines modeled' but does not specify which ones achieved the high scores; this should be clarified for context.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which highlights important areas for improving the transparency and robustness of our work. We address each major comment point by point below, indicating the revisions we will incorporate to strengthen the manuscript.
- Referee: [Methods] Methods: The description of the experimental setup lacks details on data partitioning (e.g., train/validation/test splits), cross-validation procedures, and hyperparameter optimization for the Logistic Regression and XGBoost models. These are essential to evaluate whether the reported performance metrics (recall, precision, F1 >90%) reflect true predictive capability or potential overfitting, especially with a dataset of n=6,913 and likely class imbalance.
  Authors: We agree that these methodological details are essential for reproducibility and to allow proper assessment of overfitting risks given the dataset size and potential class imbalance. In the revised manuscript, we will expand the Methods section to explicitly describe: a stratified 70/15/15 train/validation/test split to preserve class distributions; 5-fold stratified cross-validation for model evaluation and tuning; and the hyperparameter optimization procedure (grid search over regularization parameters for Logistic Regression and learning rate, max depth, and estimators for XGBoost, with early stopping). We will also report the selected hyperparameters and any regularization techniques applied.
  Revision: yes
- Referee: [Results] Results: There are no reported checks on the fidelity of the TabSyn-generated synthetic data, such as comparisons of statistical distributions, class-conditional metrics, or preservation of correlations for the at-risk (minority) class. Given the potential for distortion in rare-event tails with tabular diffusion models, this omission undermines the claim that synthetic data training does not lead to a loss in predictive performance.
  Authors: We acknowledge that fidelity validation would provide stronger support for the equivalence claim, particularly for the minority class. While our evaluation centered on downstream predictive performance, we will add a dedicated subsection in Results presenting fidelity checks, including marginal distribution comparisons (means, variances, histograms), correlation matrix preservation, and class-conditional statistics for the at-risk group. We will also note limitations of tabular diffusion models regarding rare-event tails and their potential impact on generalizability.
  Revision: yes
- Referee: [Results] Results: No statistical significance testing or confidence intervals are provided for the performance metrics or the comparison between real and synthetic training setups. This is particularly important to substantiate the equivalence claim.
  Authors: We agree that statistical tests and confidence intervals are needed to rigorously support the claim of no performance loss. In the revision, we will report 95% bootstrap confidence intervals for all metrics (precision, recall, F1) based on multiple runs. We will also include statistical comparisons (e.g., paired t-tests on cross-validation fold metrics or McNemar's test) between real and synthetic models, with p-values, to assess whether observed differences are significant. Updated tables and text will present these results.
  Revision: yes
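The partitioning promised in the first response (a stratified 70/15/15 split plus 5-fold stratified cross-validation) can be sketched in a few lines. This is an illustrative standard-library version under stated assumptions: the label vector is a made-up example with 20% at-risk records, and the function names are hypothetical. In practice a library routine such as scikit-learn's `train_test_split` with `stratify` and `StratifiedKFold` would likely be used instead.

```python
import random
from collections import defaultdict


def group_by_class(labels):
    """Map each class label to the list of record indices carrying it."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    return by_class


def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=0):
    """Return (train, val, test) index lists with class proportions preserved."""
    rng = random.Random(seed)
    train, val, test = [], [], []
    for idx in group_by_class(labels).values():
        rng.shuffle(idx)
        n_train = round(fractions[0] * len(idx))
        n_val = round(fractions[1] * len(idx))
        train += idx[:n_train]
        val += idx[n_train:n_train + n_val]
        test += idx[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs; each fold keeps roughly the class ratio."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for idx in group_by_class(labels).values():
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)  # deal each class round-robin across folds
    for f in range(k):
        train_idx = [i for g in range(k) if g != f for i in folds[g]]
        yield train_idx, folds[f]


# Hypothetical labels: 1000 records, 20% at risk (1).
labels = [1] * 200 + [0] * 800
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))
print("at-risk share in test:", sum(labels[i] for i in test) / len(test))

for fold, (tr, va) in enumerate(stratified_kfold(labels)):
    print(fold, len(tr), len(va), sum(labels[i] for i in va))
```

One open design question the source does not address is where synthetic generation sits relative to the split. Fitting the generator on the training portion only would keep the test set unseen by both the generator and the classifier, but that ordering is an assumption here.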
Circularity Check
No significant circularity; empirical ML application with measured outcomes
The paper applies standard classifiers (Logistic Regression, XGBoost) to real and TabSyn-generated vaccination records, then reports recall/precision/F1 on test splits. No derivations, uniqueness theorems, or self-referential equations exist; performance figures are direct empirical measurements rather than quantities forced by construction from fitted inputs or prior self-citations. The analysis is self-contained against external benchmarks.