Presenting DiaData for Research on Type 1 Diabetes
Pith reviewed 2026-05-18 23:54 UTC · model grok-4.3
The pith
Integrating 15 type 1 diabetes datasets creates a unified collection of 2510 subjects and 149 million glucose readings taken every five minutes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By merging data from fifteen separate sources, the authors produced DiaData, a single database holding glucose measurements from 2510 type 1 diabetes subjects, recorded every five minutes and totaling 149 million values, of which four percent lie in the hypoglycemic range (≤70 mg/dL). Sub-database I supplies demographic information and Sub-database II supplies heart rate data; the integrated dataset is reported to have an equal sex distribution and to span different age levels. The data-quality assessment further finds that missing values and class imbalance remain major obstacles, while a correlation study of glucose and heart rate points to a relation in the 15 to 55 minutes before hypoglycemia.
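As a rough illustration of how the hypoglycemic share could be computed, the sketch below counts readings at or below the 70 mg/dL threshold quoted in the abstract. The function name, the use of `None` for missing readings, and the toy trace are assumptions made for this example; they are not taken from the paper or from DiaData.

```python
HYPO_THRESHOLD_MG_DL = 70  # threshold quoted in the abstract (<= 70 mg/dL)

def hypoglycemic_fraction(readings_mg_dl, threshold=HYPO_THRESHOLD_MG_DL):
    """Return the share of non-missing readings at or below the threshold."""
    valid = [r for r in readings_mg_dl if r is not None]
    if not valid:
        raise ValueError("no valid readings")
    return sum(1 for r in valid if r <= threshold) / len(valid)

# Toy trace (hypothetical values): 2 of the 8 valid readings are <= 70 mg/dL.
toy = [110, 95, 70, None, 64, 88, 130, 142, 101]
fraction = hypoglycemic_fraction(toy)
print(f"{fraction:.3f}")  # 0.250

# Back-of-envelope count implied by the reported totals (4% of 149 million).
implied_hypo_readings = 149_000_000 * 4 // 100
print(implied_hypo_readings)  # 5960000
```

Taken at face value, the reported figures imply roughly six million hypoglycemic readings, which is large in absolute terms even though the class is rare.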
What carries the argument
The DiaData integrated database formed by merging fifteen source datasets of continuous glucose monitoring readings at five-minute resolution.
If this is right
- Machine learning models for glucose forecasting and hypoglycemia alarms can now train on a much larger and demographically balanced collection than was previously available.
- Any analysis using the dataset must account for the documented problems of missing values and class imbalance.
- The 15-to-55-minute heart-rate correlation window supplies a concrete physiological signal that can be incorporated into early-warning algorithms.
- Studies focused on age or sex differences in type 1 diabetes can draw directly on the balanced demographic sub-database.
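To make the missing-values point above concrete, here is a minimal sketch of locating holes in a series that is expected to follow a five-minute grid. The helper `find_gaps` and the toy timestamps are hypothetical; the paper's own quality metrics are not described in this summary.

```python
from datetime import datetime, timedelta

STEP = timedelta(minutes=5)  # sampling interval reported for DiaData

def find_gaps(timestamps, step=STEP):
    """Return (gap_start, gap_end, n_missing) for each hole in a sorted series."""
    gaps = []
    for prev, curr in zip(timestamps, timestamps[1:]):
        # Expected readings strictly between two consecutive observations.
        n_missing = (curr - prev) // step - 1
        if n_missing > 0:
            gaps.append((prev, curr, n_missing))
    return gaps

# Toy series with readings at minutes 0, 5, 10, 25, 30 (15 and 20 are absent).
start = datetime(2025, 1, 1)
timestamps = [start + timedelta(minutes=m) for m in (0, 5, 10, 25, 30)]
gaps = find_gaps(timestamps)
n_missing = sum(g[2] for g in gaps)
missing_share = n_missing / (len(timestamps) + n_missing)
print(n_missing, round(missing_share, 3))  # 2 0.286
```

Reporting gap lengths as well as the overall missing share matters in practice, because a few long outages and many single dropped readings call for different handling.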
Where Pith is reading between the lines
- The size and structure of the combined set could support longitudinal studies that track how individual glucose patterns evolve over months or years.
- Extending the same integration approach to additional wearable signals such as activity or sleep data would likely strengthen prediction models further.
- Because only four percent of readings are hypoglycemic, model training will probably require explicit techniques to avoid bias toward normal-range values.
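One common way to counter the imbalance mentioned in the last point is inverse-frequency class weighting, the same heuristic scikit-learn exposes as `class_weight="balanced"`. The sketch below is an assumption about how a user of the dataset might proceed, not a method described by the authors; the 4/96 label split simply mirrors the reported four percent.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * c) for cls, c in counts.items()}

# Hypothetical labels: 1 = hypoglycemic reading, 0 = everything else.
labels = [1] * 4 + [0] * 96
weights = balanced_class_weights(labels)
print(weights[1], round(weights[0], 4))  # 12.5 0.5208
```

With these weights a misclassified hypoglycemic sample costs 24 times as much as a misclassified normal-range sample, so the two classes contribute equally to the total loss.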
Load-bearing premise
The fifteen source datasets share compatible measurement units, recording frequencies, and patient characteristics so that merging them does not create major inconsistencies or biases.
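The premise can be illustrated with a small harmonization sketch: converting mmol/L to mg/dL and placing a coarser trace onto a five-minute grid by linear interpolation. This is only one plausible approach under stated assumptions (sorted sample times that are multiples of the grid step, hypothetical unit labels and function names); the summary does not say how the authors actually harmonized the sources, and interpolated points are synthetic values rather than measurements.

```python
MGDL_PER_MMOLL = 18.016  # glucose molar mass is 180.16 g/mol, so 1 mmol/L = 18.016 mg/dL

def to_mg_dl(value, unit):
    """Convert a glucose value to mg/dL; the unit labels are hypothetical."""
    if unit == "mg/dL":
        return float(value)
    if unit == "mmol/L":
        return value * MGDL_PER_MMOLL
    raise ValueError(f"unknown unit: {unit!r}")

def interpolate_to_grid(minutes, values, step=5):
    """Linearly interpolate samples onto a regular grid of `step` minutes.

    Assumes `minutes` is sorted and every sample time is a multiple of `step`.
    """
    grid, out = [], []
    for i in range(len(minutes) - 1):
        t0, t1 = minutes[i], minutes[i + 1]
        v0, v1 = values[i], values[i + 1]
        for t in range(t0, t1, step):
            grid.append(t)
            out.append(v0 + (v1 - v0) * (t - t0) / (t1 - t0))
    grid.append(minutes[-1])
    out.append(values[-1])
    return grid, out

# Toy 15-minute trace in mmol/L (illustrative values, not from any source dataset).
raw_minutes = [0, 15, 30]
raw_mmol = [5.0, 6.5, 5.0]
raw_mg_dl = [to_mg_dl(v, "mmol/L") for v in raw_mmol]
grid, glucose = interpolate_to_grid(raw_minutes, raw_mg_dl)
print(grid)  # [0, 5, 10, 15, 20, 25, 30]
print([round(g, 1) for g in glucose])
```

Keeping a flag for interpolated points would let downstream analyses separate measured from synthetic readings when checking whether the merge introduced bias.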
What would settle it
Re-running the integration on the original fifteen datasets and obtaining subject counts or total measurement numbers that differ substantially from 2510 subjects and 149 million readings would show that the sources cannot be combined without significant distortion.
Original abstract
Type 1 diabetes (T1D) is an autoimmune disorder that leads to the destruction of insulin-producing cells, resulting in insulin deficiency, as to why the affected individuals depend on external insulin injections. However, insulin can decrease blood glucose levels and can cause hypoglycemia. Hypoglycemia is a severe event of low blood glucose levels ($\le$70 mg/dL) with dangerous side effects of dizziness, coma, or death. Data analysis can significantly enhance diabetes care by identifying personal patterns and trends leading to adverse events. Especially, machine learning (ML) models can predict glucose levels and provide early alarms. However, diabetes and hypoglycemia research is limited by the unavailability of large datasets. Thus, this work systematically integrates 15 datasets to provide a large database of 2510 subjects with glucose measurements recorded every 5 minutes. In total, 149 million measurements are included, of which 4% represent values in the hypoglycemic range. Moreover, two sub-databases are extracted. Sub-database I includes demographics, and sub-database II includes heart rate data. The integrated dataset provides an equal distribution of sex and different age levels. As a further contribution, data quality is assessed, revealing that data imbalance and missing values present a significant challenge. Moreover, a correlation study on glucose levels and heart rate data is conducted, showing a relation between 15 and 55 minutes before hypoglycemia.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents DiaData, a large integrated database for type 1 diabetes research created by combining 15 existing datasets. It includes data from 2510 subjects with a total of 149 million glucose measurements recorded every 5 minutes, of which 4% are in the hypoglycemic range (≤70 mg/dL). Two sub-databases are derived: one with demographic information and another with heart rate data. The dataset is described as having a balanced sex distribution and varied age levels. Additionally, the work assesses data quality, noting challenges with imbalance and missing values, and conducts a correlation study indicating a relationship between heart rate and glucose levels in the 15 to 55 minutes before hypoglycemic events.
Significance. If the dataset integration is performed rigorously and the resource is made available with proper documentation, this could significantly benefit the machine learning community working on glucose prediction and hypoglycemia detection by providing a large-scale, multi-source T1D dataset. The correlation findings between heart rate and glucose levels may inform feature engineering for predictive models, though their generalizability and statistical robustness require further validation.
major comments (3)
- [Data Integration and Harmonization] The methods for harmonizing the 15 source datasets to a consistent 5-minute sampling frequency and mg/dL units are insufficiently detailed. Specific steps for handling varying native sampling rates (e.g., 5, 10, or 15 minutes), any interpolation or downsampling procedures, and validation of the resulting 149 million measurements and 4% hypoglycemic prevalence are missing, which is critical to support the central claims of a bias-free integrated resource.
- [Demographic Analysis] The assertion of an equal distribution of sex and different age levels across the 2510 subjects lacks supporting quantitative evidence, such as breakdowns by category or statistical measures of balance, making it difficult to assess the representativeness of the integrated dataset.
- [Correlation Study] The correlation analysis between glucose levels and heart rate data showing a relation 15-55 minutes before hypoglycemia is presented without details on the statistical methods used (e.g., correlation coefficients, p-values, or controls for multiple comparisons), limiting the ability to evaluate the strength and reliability of this finding.
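The quantitative evidence requested in the demographic comment above could take the form of a chi-square goodness-of-fit test against an even split. The sketch below handles the two-category case only, and the counts are hypothetical placeholders, since the paper's actual breakdown is not reproduced in this report.

```python
import math

def chi_square_equal_split(count_a, count_b):
    """Chi-square goodness-of-fit test against a 50/50 split (1 degree of freedom)."""
    total = count_a + count_b
    expected = total / 2
    chi2 = ((count_a - expected) ** 2 + (count_b - expected) ** 2) / expected
    # With 1 degree of freedom the survival function reduces to erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts for illustration only.
chi2, p = chi_square_equal_split(520, 480)
print(round(chi2, 2), round(p, 3))
```

A large p-value here would only mean that the data are compatible with an even split; reporting the raw counts and percentages per source dataset remains the more informative evidence.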
minor comments (2)
- [Abstract] The abstract mentions 'data quality is assessed' but does not specify the metrics or findings quantitatively, which could be clarified for better reader understanding.
- [Introduction] Consider adding references to established clinical definitions of hypoglycemia and standard CGM sampling practices to strengthen the background.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review of our manuscript on DiaData. We have addressed each major comment below with clarifications and commitments to strengthen the manuscript. Revisions will focus on adding methodological transparency without altering the core contributions of the integrated dataset.
Point-by-point responses
- Referee: [Data Integration and Harmonization] The methods for harmonizing the 15 source datasets to a consistent 5-minute sampling frequency and mg/dL units are insufficiently detailed. Specific steps for handling varying native sampling rates (e.g., 5, 10, or 15 minutes), any interpolation or downsampling procedures, and validation of the resulting 149 million measurements and 4% hypoglycemic prevalence are missing, which is critical to support the central claims of a bias-free integrated resource.
  Authors: We agree that greater detail on harmonization is warranted to support reproducibility. In the revised manuscript, we will expand the Methods section to explicitly describe the standardization process, including how native sampling rates (5, 10, or 15 minutes) were handled via downsampling or linear interpolation where appropriate, unit conversions to mg/dL, and the validation steps used to confirm the aggregate totals of 149 million measurements and the 4% hypoglycemic prevalence across the integrated resource.
  Revision: yes
- Referee: [Demographic Analysis] The assertion of an equal distribution of sex and different age levels across the 2510 subjects lacks supporting quantitative evidence, such as breakdowns by category or statistical measures of balance, making it difficult to assess the representativeness of the integrated dataset.
  Authors: We acknowledge that the current statement would benefit from quantitative support. The revised version will include explicit demographic breakdowns (e.g., counts and percentages for sex categories, age ranges or means with standard deviations) and any statistical assessments of balance or representativeness across the 2510 subjects from the 15 source datasets.
  Revision: yes
- Referee: [Correlation Study] The correlation analysis between glucose levels and heart rate data showing a relation 15-55 minutes before hypoglycemia is presented without details on the statistical methods used (e.g., correlation coefficients, p-values, or controls for multiple comparisons), limiting the ability to evaluate the strength and reliability of this finding.
  Authors: We recognize the importance of reporting statistical details for the correlation findings. In the revision, we will specify the methods employed, including the use of lagged Pearson correlations over the 15-55 minute windows prior to hypoglycemic events, the resulting coefficients, p-values, and any corrections applied for multiple comparisons to allow proper evaluation of the observed relationships.
  Revision: yes
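The lagged Pearson analysis named in the last response could look roughly like the sketch below, which correlates heart rate at time t minus a lag with glucose at time t for lags of 15 to 55 minutes on a five-minute grid. Everything here is an assumption for illustration: the function names, the synthetic signals (glucose is constructed to mirror heart rate 30 minutes earlier), and the choice of a Bonferroni correction. P-values are not computed; in practice they could come from a routine such as SciPy's `scipy.stats.pearsonr`.

```python
import math

def pearson(x, y):
    """Plain Pearson correlation coefficient for two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def lagged_correlations(heart_rate, glucose, lags_min=range(15, 60, 5), step_min=5):
    """Correlate heart rate at time t - lag with glucose at time t."""
    result = {}
    for lag in lags_min:
        k = lag // step_min  # lag expressed in samples
        result[lag] = pearson(heart_rate[:-k], glucose[k:])
    return result

# Synthetic signals: glucose is built to mirror heart rate from 30 minutes earlier.
heart_rate = [60 + (i * 37) % 23 for i in range(60)]
glucose = [100] * 6 + [200 - h for h in heart_rate[:-6]]

corrs = lagged_correlations(heart_rate, glucose)
best_lag = max(corrs, key=lambda lag: abs(corrs[lag]))
bonferroni_alpha = 0.05 / len(corrs)  # nine lags tested, so a stricter threshold
print(best_lag, round(corrs[best_lag], 3))  # 30 -1.0
```

Because nine lags are tested per analysis, any per-lag p-value would need to fall below `bonferroni_alpha` (about 0.0056) rather than 0.05 to count as significant under this correction.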
Circularity Check
No significant circularity: purely descriptive data integration with no derivations or predictions
full rationale
The paper presents an integrated T1D CGM dataset by combining 15 source collections, reporting aggregate counts (2510 subjects, 149 million measurements, 4% hypoglycemic) and extracting sub-databases for demographics and heart rate. It includes a data quality assessment and a basic correlation analysis between glucose and heart rate. No equations, models, predictions, or first-principles derivations are present. The central claims are empirical summaries of the harmonized resource rather than results derived from fitted parameters or self-referential steps. No self-citations, ansatzes, or uniqueness theorems are invoked in a load-bearing way. The work is self-contained as a data resource contribution.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The 15 source datasets can be harmonized for integration without introducing major biases in glucose measurements or patient demographics.