arXiv: 2508.09160 · v2 · submitted 2025-08-05 · cs.LG · cs.DB · q-bio.QM

Presenting DiaData for Research on Type 1 Diabetes

Pith reviewed 2026-05-18 23:54 UTC · model grok-4.3

classification cs.LG · cs.DB · q-bio.QM
keywords type 1 diabetes · continuous glucose monitoring · hypoglycemia · dataset integration · machine learning · heart rate · data quality

The pith

Integrating 15 type 1 diabetes datasets creates a unified collection of 2510 subjects and 149 million glucose readings taken every five minutes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper combines fifteen existing type 1 diabetes datasets into one large resource covering 2510 subjects. Glucose values appear at five-minute intervals for a total of 149 million measurements, with four percent falling in the hypoglycemic range at or below 70 mg/dL. Two extracted sub-databases add demographic details (the sex analysis covers 2096 subjects) and heart rate recordings for a subset, showing roughly equal numbers of male and female subjects across different ages. The authors also examine data quality and report a relationship between heart rate and glucose in the 15 to 55 minutes before hypoglycemia. This work directly tackles the scarcity of large, accessible datasets that has limited machine learning efforts to forecast and prevent dangerous low blood sugar episodes.

Core claim

By merging data from fifteen separate sources, the authors produced DiaData, a single database holding glucose measurements from 2510 type 1 diabetes subjects recorded every five minutes and totaling 149 million values, of which four percent lie in the hypoglycemic range (at or below 70 mg/dL). Sub-database I supplies demographic information and Sub-database II supplies heart rate data, both preserving balanced sex and age distributions. The integration process further reveals that missing values and class imbalance remain major data-quality obstacles, while correlation analysis identifies heart-rate patterns that precede hypoglycemia by 15 to 55 minutes.
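As a rough sanity check on the scale these figures imply, the arithmetic can be sketched as follows. The inputs are the rounded totals quoted above, so the outputs are order-of-magnitude estimates and not values reported by the paper; the function and variable names are illustrative only.

```python
# Rounded totals quoted in the summary above (not exact counts).
SUBJECTS = 2510
READINGS = 149_000_000
HYPO_FRACTION = 0.04      # share of readings in the hypoglycemic range
MINUTES_PER_READING = 5   # sampling interval


def scale_estimates(subjects, readings, hypo_fraction, minutes_per_reading):
    """Derive rough per-subject and hypoglycemia counts from headline totals."""
    hypo_readings = readings * hypo_fraction
    readings_per_subject = readings / subjects
    # Days of five-minute data that the average reading count corresponds to.
    days_per_subject = readings_per_subject * minutes_per_reading / (60 * 24)
    return hypo_readings, readings_per_subject, days_per_subject


hypo, per_subject, days = scale_estimates(
    SUBJECTS, READINGS, HYPO_FRACTION, MINUTES_PER_READING
)
print(f"~{hypo / 1e6:.2f} million hypoglycemic readings")
print(f"~{per_subject:,.0f} readings per subject on average")
print(f"~{days:.0f} days of five-minute data per subject on average")
```

With these rounded inputs the estimate is about 5.96 million hypoglycemic readings and roughly 206 days' worth of readings per subject on average. Per-subject coverage in the actual dataset will vary, and gaps mean the calendar span is longer than the reading count suggests.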

What carries the argument

The DiaData integrated database formed by merging fifteen source datasets of continuous glucose monitoring readings at five-minute resolution.

If this is right

  • Machine learning models for glucose forecasting and hypoglycemia alarms can now train on a much larger and demographically balanced collection than was previously available.
  • Any analysis using the dataset must account for the documented problems of missing values and class imbalance.
  • The 15-to-55-minute heart-rate correlation window supplies a concrete physiological signal that can be incorporated into early-warning algorithms.
  • Studies focused on age or sex differences in type 1 diabetes can draw directly on the balanced demographic sub-database.
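To make the missing-value point concrete, here is a minimal sketch of how absent five-minute slots might be counted for one subject. It assumes the timestamps already lie on a regular five-minute grid and treats every empty slot between the first and last reading as missing; the function name and this definition of "missing" are assumptions, not the paper's procedure.

```python
from datetime import datetime, timedelta

STEP = timedelta(minutes=5)  # sampling interval of the integrated dataset


def count_missing_slots(timestamps, step=STEP):
    """Count grid slots with no reading between the first and last timestamp.

    Assumes all timestamps lie on a regular grid with spacing `step`.
    """
    ordered = sorted(set(timestamps))
    if len(ordered) < 2:
        return 0
    # timedelta // timedelta yields an int number of whole steps.
    expected = (ordered[-1] - ordered[0]) // step + 1
    return expected - len(ordered)


# Toy series: readings at 0, 5, 10, 25 and 30 minutes (15 and 20 are absent).
start = datetime(2025, 1, 1, 8, 0)
present = [start + k * STEP for k in (0, 1, 2, 5, 6)]
missing = count_missing_slots(present)
print(missing)  # 2
```

A per-subject count like this could be summed across subjects to get an overall missing-value rate before deciding between imputation and dropping gaps.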

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The size and structure of the combined set could support longitudinal studies that track how individual glucose patterns evolve over months or years.
  • Extending the same integration approach to additional wearable signals such as activity or sleep data would likely strengthen prediction models further.
  • Because only four percent of readings are hypoglycemic, model training will probably require explicit techniques to avoid bias toward normal-range values.
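One common way to handle the imbalance noted in the last point is inverse-frequency class weighting. The sketch below is an illustrative option, not a technique the paper prescribes; it labels a reading as hypoglycemic at or below 70 mg/dL (the threshold in the abstract) and uses the same heuristic as scikit-learn's `class_weight="balanced"`. All names and the toy data are assumptions.

```python
from collections import Counter

HYPO_THRESHOLD_MG_DL = 70  # abstract defines hypoglycemia as <= 70 mg/dL


def label_readings(glucose_mg_dl):
    """Return 1 for hypoglycemic readings and 0 otherwise."""
    return [1 if g <= HYPO_THRESHOLD_MG_DL else 0 for g in glucose_mg_dl]


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


# Toy data mirroring the reported 4% hypoglycemic share.
glucose = [65] * 4 + [120] * 96
weights = inverse_frequency_weights(label_readings(glucose))
print(weights)
```

On this toy data the hypoglycemic class receives a weight of 12.5 and the normal class about 0.52, so each rare-class error counts roughly 24 times as much in a weighted loss. Resampling or threshold tuning are alternative choices.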

Load-bearing premise

The fifteen source datasets share compatible measurement units, recording frequencies, and patient characteristics so that merging them does not create major inconsistencies or biases.
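What such harmonization could involve is sketched below under stated assumptions: source values arrive in either mg/dL or mmol/L, and timestamps are snapped to the nearest five-minute slot. The record layout, function names, and the rule for duplicate slots are hypothetical. The paper's actual procedure is not described in this summary, and the sketch does not cover interpolation for sources with coarser native sampling.

```python
from datetime import datetime, timedelta

# 1 mmol/L of glucose is about 18.016 mg/dL (molar mass ~180.16 g/mol).
MG_DL_PER_MMOL_L = 18.016


def to_mg_dl(value, unit):
    """Convert a glucose value to mg/dL; only two units are handled here."""
    if unit == "mg/dL":
        return float(value)
    if unit == "mmol/L":
        return float(value) * MG_DL_PER_MMOL_L
    raise ValueError(f"unknown glucose unit: {unit!r}")


def snap_to_grid(ts, minutes=5):
    """Round a timestamp to the nearest multiple of `minutes` past midnight."""
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    step = timedelta(minutes=minutes)
    slots = round((ts - midnight) / step)
    return midnight + slots * step


def harmonize(records):
    """Map (subject_id, timestamp, value, unit) records onto a 5-minute grid.

    If two readings land in the same slot, the later one in input order wins
    (an arbitrary choice made for this sketch).
    """
    out = {}
    for subject_id, ts, value, unit in records:
        out[(subject_id, snap_to_grid(ts))] = to_mg_dl(value, unit)
    return out


records = [
    ("A01", datetime(2025, 1, 1, 8, 2, 10), 5.0, "mmol/L"),
    ("A01", datetime(2025, 1, 1, 8, 7, 0), 110, "mg/dL"),
]
grid = harmonize(records)
```

Any real pipeline would also need to log how many readings were shifted or overwritten, since those choices directly affect the reported totals.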

What would settle it

Re-running the integration on the original fifteen datasets and obtaining subject counts or total measurement numbers that differ substantially from 2510 subjects and 149 million readings would show that the sources cannot be combined without significant distortion.

Figures

Figures reproduced from arXiv: 2508.09160 by Beyza Cinar, Maria Maleshkova.

Figure 1: Database Schema
Adjacent text in the paper: …the corresponding dataframes. For the T1DGranada dataset, the age was computed from the reported birth year of the patient and the timestamps. The SHD dataset mentioned that patients at least aged 60 were chosen for the study. Thus, the age was stored as a string of "60-100", assuming 100 to be the maximum age. All extracted values were aligned to the timestamps and patient IDs. For all data…

Figure 2: Analysis of Glucose Levels in the Main Dataset

Figure 3: Missing Value Count of the Main Dataset
Adjacent text in the paper: C. Demographics. Turning now to the analysis of sub-dataset I, sex is equally distributed, with females comprising 52% (66 more females out of 2096 subjects) and males 48% (Figure 4a). Likewise, the total number of measurements is similarly balanced, with females contributing 54% and males 46% (Figure 4b). Females have up to 9.9 million more data points in total, which…

Figure 4: Distribution of Demography

Figure 5: Boxplot Representation of Glucose Levels Across

Figure 6: Analysis of Heart Rate Values
Adjacent text in the paper: …heart rate data was estimated with the Spearman's Rank correlation. The overall population correlation is 0.073, while the mean of all individual correlations per subject is only 0.038. The correlation of hypoglycemic points, assessed with different correlation methods, can be seen in Table III. Class 0 stands for the hypoglycemic event defined as ≤ 70 mg/dL, class 1 for 5-10 …

Figure 7: Correlation Between Hypoglycemic Glucose Values

Original abstract

Type 1 diabetes (T1D) is an autoimmune disorder that leads to the destruction of insulin-producing cells, resulting in insulin deficiency, as to why the affected individuals depend on external insulin injections. However, insulin can decrease blood glucose levels and can cause hypoglycemia. Hypoglycemia is a severe event of low blood glucose levels (≤70 mg/dL) with dangerous side effects of dizziness, coma, or death. Data analysis can significantly enhance diabetes care by identifying personal patterns and trends leading to adverse events. Especially, machine learning (ML) models can predict glucose levels and provide early alarms. However, diabetes and hypoglycemia research is limited by the unavailability of large datasets. Thus, this work systematically integrates 15 datasets to provide a large database of 2510 subjects with glucose measurements recorded every 5 minutes. In total, 149 million measurements are included, of which 4% represent values in the hypoglycemic range. Moreover, two sub-databases are extracted. Sub-database I includes demographics, and sub-database II includes heart rate data. The integrated dataset provides an equal distribution of sex and different age levels. As a further contribution, data quality is assessed, revealing that data imbalance and missing values present a significant challenge. Moreover, a correlation study on glucose levels and heart rate data is conducted, showing a relation between 15 and 55 minutes before hypoglycemia.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents DiaData, a large integrated database for type 1 diabetes research created by combining 15 existing datasets. It includes data from 2510 subjects with a total of 149 million glucose measurements recorded every 5 minutes, of which 4% are in the hypoglycemic range (≤70 mg/dL). Two sub-databases are derived: one with demographic information and another with heart rate data. The dataset shows a balanced sex distribution and varied age levels. Additionally, the work assesses data quality, noting challenges with class imbalance and missing values, and conducts a correlation study indicating a relationship between heart rate and glucose levels in the 15 to 55 minutes before hypoglycemic events.

Significance. If the dataset integration is performed rigorously and the resource is made available with proper documentation, this could significantly benefit the machine learning community working on glucose prediction and hypoglycemia detection by providing a large-scale, multi-source T1D dataset. The correlation findings between heart rate and glucose levels may inform feature engineering for predictive models, though their generalizability and statistical robustness require further validation.

major comments (3)
  1. [Data Integration and Harmonization] The methods for harmonizing the 15 source datasets to a consistent 5-minute sampling frequency and mg/dL units are insufficiently detailed. Specific steps for handling varying native sampling rates (e.g., 5, 10, or 15 minutes), any interpolation or downsampling procedures, and validation of the resulting 149 million measurements and 4% hypoglycemic prevalence are missing, which is critical to support the central claims of a bias-free integrated resource.
  2. [Demographic Analysis] The assertion of an equal distribution of sex and different age levels across the 2510 subjects lacks supporting quantitative evidence, such as breakdowns by category or statistical measures of balance, making it difficult to assess the representativeness of the integrated dataset.
  3. [Correlation Study] The correlation analysis between glucose levels and heart rate data showing a relation 15-55 minutes before hypoglycemia is presented without details on the statistical methods used (e.g., correlation coefficients, p-values, or controls for multiple comparisons), limiting the ability to evaluate the strength and reliability of this finding.
minor comments (2)
  1. [Abstract] The abstract mentions 'data quality is assessed' but does not specify the metrics or findings quantitatively, which could be clarified for better reader understanding.
  2. [Introduction] Consider adding references to established clinical definitions of hypoglycemia and standard CGM sampling practices to strengthen the background.
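Major comment 2 asks for quantitative evidence of sex balance. The Figure 3 excerpt above states that sub-dataset I has 66 more females than males among 2096 subjects. As a sketch of the kind of check the referee has in mind, the counts can be recovered from those two numbers and compared against an even split with a normal-approximation test. The test choice and all names are assumptions, as is the premise that every one of the 2096 subjects is recorded as either female or male.

```python
from math import erfc, sqrt

TOTAL = 2096         # subjects in the sex analysis, per the Figure 3 excerpt
EXCESS_FEMALES = 66  # females minus males, per the same excerpt


def split_counts(total, excess):
    """Recover (females, males) from their sum and difference."""
    if (total + excess) % 2:
        raise ValueError("sum and difference must have the same parity")
    females = (total + excess) // 2
    return females, total - females


def balance_z_test(females, males):
    """Two-sided z-test of the female share against 0.5 (normal approximation)."""
    n = females + males
    z = (females - n / 2) / sqrt(n * 0.25)
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value


females, males = split_counts(TOTAL, EXCESS_FEMALES)
z, p_value = balance_z_test(females, males)
print(females, males, round(z, 2), round(p_value, 3))
```

This gives 1081 females and 1015 males, which matches the reported 52% and 48% after rounding. A test of subject counts says nothing about balance in the number of measurements per sex, which the excerpt reports separately.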

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed review of our manuscript on DiaData. We have addressed each major comment below with clarifications and commitments to strengthen the manuscript. Revisions will focus on adding methodological transparency without altering the core contributions of the integrated dataset.

Point-by-point responses
  1. Referee: [Data Integration and Harmonization] The methods for harmonizing the 15 source datasets to a consistent 5-minute sampling frequency and mg/dL units are insufficiently detailed. Specific steps for handling varying native sampling rates (e.g., 5, 10, or 15 minutes), any interpolation or downsampling procedures, and validation of the resulting 149 million measurements and 4% hypoglycemic prevalence are missing, which is critical to support the central claims of a bias-free integrated resource.

    Authors: We agree that greater detail on harmonization is warranted to support reproducibility. In the revised manuscript, we will expand the Methods section to explicitly describe the standardization process, including how native sampling rates (5, 10, or 15 minutes) were handled via downsampling or linear interpolation where appropriate, unit conversions to mg/dL, and the validation steps used to confirm the aggregate totals of 149 million measurements and the 4% hypoglycemic prevalence across the integrated resource. revision: yes

  2. Referee: [Demographic Analysis] The assertion of an equal distribution of sex and different age levels across the 2510 subjects lacks supporting quantitative evidence, such as breakdowns by category or statistical measures of balance, making it difficult to assess the representativeness of the integrated dataset.

    Authors: We acknowledge that the current statement would benefit from quantitative support. The revised version will include explicit demographic breakdowns (e.g., counts and percentages for sex categories, age ranges or means with standard deviations) and any statistical assessments of balance or representativeness across the 2510 subjects from the 15 source datasets. revision: yes

  3. Referee: [Correlation Study] The correlation analysis between glucose levels and heart rate data showing a relation 15-55 minutes before hypoglycemia is presented without details on the statistical methods used (e.g., correlation coefficients, p-values, or controls for multiple comparisons), limiting the ability to evaluate the strength and reliability of this finding.

    Authors: We recognize the importance of reporting statistical details for the correlation findings. In the revision, we will specify the methods employed, including the lagged correlation analysis (Spearman's rank correlation among the methods used) over the 15-55 minute windows prior to hypoglycemic events, the resulting coefficients, p-values, and any corrections applied for multiple comparisons to allow proper evaluation of the observed relationships. revision: yes
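A lagged rank correlation of the kind discussed in the third response could be sketched as follows. It uses Spearman's rank correlation, which the Figure 6 excerpt names, and assumes heart rate and glucose are already aligned on the same five-minute grid, so a lag of 3 to 11 steps covers 15 to 55 minutes. The alignment scheme, function names, and toy data are assumptions; the sketch omits p-values and multiple-comparison corrections.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i..j, converted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def lagged_spearman(heart_rate, glucose, lag_steps):
    """Correlate heart rate at time t - lag_steps with glucose at time t."""
    if len(heart_rate) != len(glucose):
        raise ValueError("series must have the same length")
    if not 0 <= lag_steps < len(glucose) - 1:
        raise ValueError("lag leaves fewer than two paired points")
    return spearman(heart_rate[: len(heart_rate) - lag_steps], glucose[lag_steps:])


# Toy series: glucose falls exactly three steps (15 minutes) after heart rate rises.
heart_rate = list(range(60, 80))
glucose = [100, 100, 100] + [100 - k for k in range(17)]
rho = lagged_spearman(heart_rate, glucose, lag_steps=3)
```

Scanning `lag_steps` from 3 to 11 per subject and then summarizing across subjects would mirror the 15-to-55-minute window, though how the paper aggregates across subjects is not stated in this summary.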

Circularity Check

0 steps flagged

No significant circularity: purely descriptive data integration with no derivations or predictions

The paper presents an integrated T1D CGM dataset by combining 15 source collections, reporting aggregate counts (2510 subjects, 149 million measurements, 4% hypoglycemic) and extracting sub-databases for demographics and heart rate. It includes a data quality assessment and a basic correlation analysis between glucose and heart rate. No equations, models, predictions, or first-principles derivations are present. The central claims are empirical summaries of the harmonized resource rather than results derived from fitted parameters or self-referential steps. No self-citations, ansatzes, or uniqueness theorems are invoked in a load-bearing way. The work is self-contained as a data resource contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on successful merging of heterogeneous datasets under the assumption of compatibility; no free parameters, new entities, or ad-hoc inventions are introduced beyond standard data integration practices.

axioms (1)
  • domain assumption: The 15 source datasets can be harmonized for integration without introducing major biases in glucose measurements or patient demographics.
    This premise is required to support the reported total subject count, measurement volume, hypoglycemic percentage, and balanced sex/age distribution.


