
arxiv: 2604.08902 · v1 · submitted 2026-04-10 · 💻 cs.LG

Using Synthetic Data for Machine Learning-based Childhood Vaccination Prediction in Narok, Kenya

Pith reviewed 2026-05-10 16:57 UTC · model grok-4.3

classification 💻 cs.LG
keywords synthetic data · machine learning · vaccination prediction · childhood immunization · privacy preservation · Kenya · risk classification · TabSyn

The pith

Machine learning models trained on synthetic data can accurately predict which children are at risk of missing vaccines in Kenya while protecting privacy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that logistic regression and XGBoost models applied to eight years of digitized child vaccination records can flag children likely to miss key doses, with recall, precision, and F1 scores above 90 percent for some of the vaccines modeled. It further reports that training on synthetic records generated by TabSyn, in place of the original records, causes no loss in predictive performance. Together these results address the shortage of usable data in nomadic communities and the heightened privacy needs around sensitive health information. The approach supports targeted interventions and better resource planning for immunization programs.

Core claim

Classification models trained on MOH 510 vaccination records from Narok County can reliably identify children at risk of missing key vaccines. Training the same models on TabSyn-generated synthetic data produces equivalent predictive results, allowing the use of data for forecasting without exposing individual patient details in a vulnerable population.

What carries the argument

TabSyn, a diffusion-based generator for tabular data, produces synthetic vaccination records. Logistic Regression and XGBoost classifiers are then trained on both the real and the generated records to identify children at risk of missing vaccines.

If this is right

  • Targeted interventions can reach children predicted at highest risk of missing vaccines to raise overall coverage rates.
  • Clinics with limited digital systems can still run scalable forecasts of immunization needs using synthetic records.
  • Privacy concerns in nomadic and low-resource populations no longer block the use of health data for prediction.
  • Resource allocation for vaccine delivery can rely on evidence from models that do not require sharing original patient files.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could extend to predicting other health service gaps in populations where real data sharing is restricted.
  • Mobile tools for community health workers might incorporate these risk scores to prioritize home visits.
  • Testing the same workflow on vaccination datasets from different regions would show whether TabSyn performs consistently across schedules and cultures.

Load-bearing premise

The synthetic data accurately reproduces the statistical distributions and relationships in the original vaccination records without introducing biases that would affect predictions for at-risk children.
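
One way to probe this premise is to compare simple summary statistics of the real and synthetic tables. The sketch below is illustrative only: the data are simulated, `fidelity_report` and `toy_table` are hypothetical helpers, and a second draw from the same process stands in for a well-behaved generator. It uses neither TabSyn nor the paper's records.

```python
import numpy as np

def fidelity_report(real_X, real_y, synth_X, synth_y):
    """Summarize how far a synthetic numeric table drifts from the real one."""
    scale = real_X.std(axis=0) + 1e-12  # real-data std, used to normalize mean gaps
    # Marginals: per-column mean difference in units of the real std.
    mean_gap = np.abs(real_X.mean(axis=0) - synth_X.mean(axis=0)) / scale
    # Relationships: element-wise difference between the two correlation matrices.
    corr_gap = np.abs(np.corrcoef(real_X, rowvar=False) - np.corrcoef(synth_X, rowvar=False))
    # Class-conditional marginals, restricted to at-risk rows (label 1).
    risk_gap = np.abs(
        real_X[real_y == 1].mean(axis=0) - synth_X[synth_y == 1].mean(axis=0)
    ) / scale
    return {
        "max_mean_gap": float(mean_gap.max()),
        "max_corr_gap": float(corr_gap.max()),
        "prevalence_gap": float(abs(real_y.mean() - synth_y.mean())),
        "max_at_risk_mean_gap": float(risk_gap.max()),
    }

def toy_table(rng, n, correlated=True):
    """Hypothetical stand-in data: two numeric features and a minority label."""
    rho = 0.6 if correlated else 0.0
    cov = np.array([[1.0, rho], [rho, 1.0]])
    X = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n)
    y = (X[:, 0] + 0.5 * rng.normal(size=n) > 1.2).astype(int)
    return X, y

rng = np.random.default_rng(0)
real_X, real_y = toy_table(rng, 5000)
good_X, good_y = toy_table(rng, 5000)                   # same generating process
bad_X, bad_y = toy_table(rng, 5000, correlated=False)   # feature correlation destroyed

good = fidelity_report(real_X, real_y, good_X, good_y)
bad = fidelity_report(real_X, real_y, bad_X, bad_y)
print("faithful stand-in:", good)
print("distorted stand-in:", bad)
```

A generator that loses the relationship between features shows up as a large `max_corr_gap` even when its marginal means look fine, which is why marginal checks alone would not be enough to support the premise.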

What would settle it

Train separate models on real records and on TabSyn synthetic records, then test both on a fresh hold-out set of actual vaccination records; a meaningful drop in recall or precision for the synthetic-trained model would disprove the claim of no performance loss.
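
The comparison described above can be sketched as follows. Everything here is an assumption made for illustration: the data come from scikit-learn's `make_classification`, and `gaussian_stand_in` is a hypothetical placeholder generator that samples each class from a fitted Gaussian. It is not TabSyn and does not reproduce the paper's pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

def gaussian_stand_in(X, y, rng):
    """Placeholder generator (NOT TabSyn): draw each class from a fitted Gaussian."""
    parts_X, parts_y = [], []
    for label in np.unique(y):
        rows = X[y == label]
        draws = rng.multivariate_normal(
            rows.mean(axis=0), np.cov(rows, rowvar=False), size=len(rows)
        )
        parts_X.append(draws)
        parts_y.append(np.full(len(rows), label))
    return np.vstack(parts_X), np.concatenate(parts_y)

# Simulated "real" records with a minority positive (at-risk) class.
X, y = make_classification(
    n_samples=4000, n_features=8, n_informative=5, n_redundant=0,
    weights=[0.8, 0.2], random_state=0,
)
# The hold-out set is real data that neither model sees during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
X_synth, y_synth = gaussian_stand_in(X_train, y_train, np.random.default_rng(0))

def evaluate(X_fit, y_fit):
    """Fit on the given training table, score on the real hold-out set."""
    model = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)
    pred = model.predict(X_test)
    return {
        "recall": recall_score(y_test, pred),
        "precision": precision_score(y_test, pred),
    }

real_scores = evaluate(X_train, y_train)
synth_scores = evaluate(X_synth, y_synth)
print("trained on real:     ", real_scores)
print("trained on synthetic:", synth_scores)
```

The key design choice is that both models are scored on the same real hold-out rows, so any gap between the two dictionaries reflects the training data rather than the test set.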

Figures

Figures reproduced from arXiv: 2604.08902 by Carrie B. Dolan, Haipeng Chen, Jimmy Bach, John Sankok, Julius N. Odhiambo, Rose Kimani, Yang Li, Yaqi Liu.

Figure 1. Data Preprocessing Steps
Figure 2. Number of Individuals Within Each Clinic Registry
Figure 3. Distances Traveled by Individuals to the Nearest…
Figure 4. Feature Importance Analysis: DPT3 Real Data
Figure 5. Feature Importance Analysis: DPT3 Synthetic Data

Original abstract

Background: Limited data utilization in low-resource settings poses a barrier to the vaccine delivery ecosystem, undermining efforts to achieve equitable immunization coverage. In nomadic populations, individuals face an increased risk of missing crucial vaccination doses as children. One such population is the Maasai in Narok County, Kenya, where the absence of high-volume, quality data hampers accurate coverage estimates, impedes efficient resource allocation, and weakens the ability to deliver timely interventions. Additionally, data privacy concerns are heightened in groups with limited sensitive data.

Objectives: First, we aim to identify children at risk of missing key vaccines across a large population to provide timely, evidence-based interventions that support increased vaccination coverage. Second, we aim to better protect the privacy of sensitive health data in a vulnerable population.

Methods: We digitized 8 years of child vaccination records from the MOH 510 registry (n=6,913) and applied machine learning models (Logistic Regression and XGBoost) to identify children at risk. Additionally, we utilize a novel approach to tabular diffusion-based synthetic data generation (TabSyn) to protect patient privacy within the models.

Results: Our findings show that classification techniques can reliably and successfully predict children at risk of missing a vaccine, with recall, precision, and F1-scores exceeding 90% for some vaccines modeled. Additionally, training these models with synthetic data rather than real data, thus preserving the privacy of individuals within the original dataset, does not lead to a loss in predictive performance.

Conclusion: These results support the use of synthetic data implementation in health informatics strategies for clinics with limited digital infrastructure, enabling privacy-preserving, scalable forecasting for childhood immunization coverage.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript describes the application of machine learning classifiers (Logistic Regression and XGBoost) to predict children at risk of missing vaccinations using a dataset of 6,913 records from the MOH 510 registry in Narok, Kenya. It further explores the use of TabSyn for generating synthetic tabular data to preserve privacy and shows that models trained on this synthetic data achieve comparable performance to those trained on real data, with some models reporting recall, precision, and F1-scores above 90%.

Significance. If validated, these results would demonstrate the feasibility of using synthetic data for privacy-preserving predictive modeling in public health, particularly in low-resource and nomadic populations where data sensitivity is high. This could support better resource allocation for vaccination programs. The approach addresses both predictive accuracy and ethical data use, which is valuable for health informatics in similar contexts. However, the absence of key methodological details currently prevents a full evaluation of the claims' robustness.

major comments (3)
  1. [Methods] Methods: The description of the experimental setup lacks details on data partitioning (e.g., train/validation/test splits), cross-validation procedures, and hyperparameter optimization for the Logistic Regression and XGBoost models. These are essential to evaluate whether the reported performance metrics (recall, precision, F1 >90%) reflect true predictive capability or potential overfitting, especially with a dataset of n=6,913 and likely class imbalance.
  2. [Results] Results: There are no reported checks on the fidelity of the TabSyn-generated synthetic data, such as comparisons of statistical distributions, class-conditional metrics, or preservation of correlations for the at-risk (minority) class. Given the potential for distortion in rare-event tails with tabular diffusion models, this omission undermines the claim that synthetic data training does not lead to loss in predictive performance.
  3. [Results] Results: No statistical significance testing or confidence intervals are provided for the performance metrics or the comparison between real and synthetic training setups. This is particularly important to substantiate the equivalence claim.
minor comments (2)
  1. [Methods] Methods: Provide more details on how the target labels for 'at risk' were defined from the vaccination records.
  2. [Abstract] Abstract: The abstract mentions 'some vaccines modeled' but does not specify which ones achieved the high scores; this should be clarified for context.
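
Major comment 1 asks for the partitioning, cross-validation, and tuning procedure. One conventional way to specify these is sketched below under assumed settings: a stratified 70/15/15 split, 5-fold stratified cross-validation, and a small grid over the regularization strength `C`. The data are simulated with `make_classification`, and none of these settings are confirmed by the paper.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Simulated imbalanced data standing in for the vaccination records.
X, y = make_classification(
    n_samples=3000, n_features=10, weights=[0.85, 0.15], random_state=0
)

# 70/15/15: hold out 30%, then split that portion in half.
# Stratifying keeps the at-risk share similar in every partition.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, stratify=y_rest, random_state=0
)

# Tune C with 5-fold stratified cross-validation on the training partition only.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(
    LogisticRegression(max_iter=1000, class_weight="balanced"),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="f1",
    cv=cv,
)
search.fit(X_train, y_train)

# The validation score is a sanity check; the test score is reported once at the end.
val_f1 = f1_score(y_val, search.predict(X_val))
test_f1 = f1_score(y_test, search.predict(X_test))
print("best C:", search.best_params_["C"])
print("validation F1:", round(val_f1, 3), "test F1:", round(test_f1, 3))
```

Keeping the test partition out of both the grid search and any threshold selection is what lets the final F1 speak to overfitting, which is the concern the comment raises.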

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which highlights important areas for improving the transparency and robustness of our work. We address each major comment point by point below, indicating the revisions we will incorporate to strengthen the manuscript.

  1. Referee: [Methods] Methods: The description of the experimental setup lacks details on data partitioning (e.g., train/validation/test splits), cross-validation procedures, and hyperparameter optimization for the Logistic Regression and XGBoost models. These are essential to evaluate whether the reported performance metrics (recall, precision, F1 >90%) reflect true predictive capability or potential overfitting, especially with a dataset of n=6,913 and likely class imbalance.

    Authors: We agree that these methodological details are essential for reproducibility and to allow proper assessment of overfitting risks given the dataset size and potential class imbalance. In the revised manuscript, we will expand the Methods section to explicitly describe: a stratified 70/15/15 train/validation/test split to preserve class distributions; 5-fold stratified cross-validation for model evaluation and tuning; and the hyperparameter optimization procedure (grid search over regularization parameters for Logistic Regression and over learning rate, max depth, and number of estimators for XGBoost, with early stopping). We will also report the selected hyperparameters and any regularization techniques applied.
    revision: yes

  2. Referee: [Results] Results: There are no reported checks on the fidelity of the TabSyn-generated synthetic data, such as comparisons of statistical distributions, class-conditional metrics, or preservation of correlations for the at-risk (minority) class. Given the potential for distortion in rare-event tails with tabular diffusion models, this omission undermines the claim that synthetic data training does not lead to a loss in predictive performance.

    Authors: We acknowledge that fidelity validation would provide stronger support for the equivalence claim, particularly for the minority class. While our evaluation centered on downstream predictive performance, we will add a dedicated subsection in Results presenting fidelity checks, including marginal distribution comparisons (means, variances, histograms), correlation matrix preservation, and class-conditional statistics for the at-risk group. We will also note limitations of tabular diffusion models regarding rare-event tails and their potential impact on generalizability.
    revision: yes

  3. Referee: [Results] Results: No statistical significance testing or confidence intervals are provided for the performance metrics or the comparison between real and synthetic training setups. This is particularly important to substantiate the equivalence claim.

    Authors: We agree that statistical tests and confidence intervals are needed to rigorously support the claim of no performance loss. In the revision, we will report 95% bootstrap confidence intervals for all metrics (precision, recall, F1) based on multiple runs. We will also include statistical comparisons (e.g., paired t-tests on cross-validation fold metrics or McNemar's test) between real and synthetic models, with p-values, to assess whether observed differences are significant. Updated tables and text will present these results.
    revision: yes
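
The interval and test named in response 3 could be computed along these lines. This is a minimal sketch on simulated predictions: `f1`, `bootstrap_f1_diff`, and `mcnemar_exact` are hypothetical helper names, the 8% error rate is arbitrary, and the exact binomial form of McNemar's test is one of several variants.

```python
import math
import numpy as np

def f1(y_true, y_pred):
    """F1 score for the positive (at-risk) class."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def bootstrap_f1_diff(y_true, pred_a, pred_b, n_boot=2000, seed=0):
    """Percentile 95% CI for F1(a) - F1(b), resampling test rows with replacement."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # same resampled rows for both models (paired)
        diffs[i] = f1(y_true[idx], pred_a[idx]) - f1(y_true[idx], pred_b[idx])
    return np.percentile(diffs, [2.5, 97.5])

def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar p-value computed from the discordant pairs."""
    b = int(np.sum((pred_a == y_true) & (pred_b != y_true)))  # only model a correct
    c = int(np.sum((pred_a != y_true) & (pred_b == y_true)))  # only model b correct
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(math.comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Simulated hold-out labels and two sets of predictions with independent 8% errors.
rng = np.random.default_rng(1)
y_true = (rng.random(1000) < 0.2).astype(int)
pred_real = np.where(rng.random(1000) < 0.08, 1 - y_true, y_true)
pred_synth = np.where(rng.random(1000) < 0.08, 1 - y_true, y_true)

ci_low, ci_high = bootstrap_f1_diff(y_true, pred_real, pred_synth)
p_value = mcnemar_exact(y_true, pred_real, pred_synth)
print("95% CI for F1 difference:", (round(ci_low, 3), round(ci_high, 3)))
print("McNemar exact p-value:", round(p_value, 3))
```

Note that a non-significant difference is not by itself evidence of equivalence. An equivalence claim would also need a pre-specified margin that the confidence interval must fall inside.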

Circularity Check

0 steps flagged

No significant circularity; empirical ML application with measured outcomes

The paper applies standard classifiers (Logistic Regression, XGBoost) to real and TabSyn-generated vaccination records, then reports recall/precision/F1 on test splits. No derivations, uniqueness theorems, or self-referential equations exist; performance figures are direct empirical measurements rather than quantities forced by construction from fitted inputs or prior self-citations. The analysis is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the assumption that standard supervised classification and diffusion-based synthetic data generation can be applied directly to tabular health records; no additional free parameters, axioms, or invented entities are introduced beyond those implicit in the chosen ML algorithms and TabSyn method.


