MetaboNet-Bench: A Multi-modal Benchmark for Glucose Forecasting in Type 1 Diabetes

Caleb Mayer; David Klonoff; Elizabeth Healey; Michael Snyder; Miriam Wolff; Nathaniel Jeffries; Sam Royston; Tao Wang

arxiv: 2606.18640 · v2 · pith:3MLMTKLWnew · submitted 2026-06-17 · 💻 cs.LG · q-bio.QM

Nathaniel Jeffries, Miriam Wolff, Sam Royston, Elizabeth Healey, Caleb Mayer, David Klonoff, Michael Snyder, Tao Wang

Pith reviewed 2026-06-26 21:52 UTC · model grok-4.3

keywords: glucose forecasting · type 1 diabetes · multimodal data · benchmark · insulin dosing · carbohydrate intake · time series models

The pith

The benefit of adding data modalities to glucose forecasting depends on model complexity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MetaboNet-Bench, an open-source framework for evaluating glucose forecasting models that combine continuous glucose monitor readings with insulin dosing and carbohydrate intake data in type 1 diabetes. It applies the benchmark to several published models plus a custom multimodal time-series model to test performance differences. A reader would care because the work directly addresses the lack of standardized evaluation that currently blocks fair comparisons and slows progress toward better glycemic control tools.

Core claim

MetaboNet-Bench supplies an extensible evaluation framework for multimodal glucose forecasting. When applied to existing models and a custom architecture, it shows that gains from including insulin and carbohydrate signals are conditioned on the complexity of the underlying model, while the expanded set of clinical metrics surfaces concrete gaps that future algorithms must address.

What carries the argument

MetaboNet-Bench, an extensible open-source evaluation framework that standardizes testing of glucose forecasting models on glucose, insulin, and carbohydrate inputs.

If this is right

Algorithms can now be compared on identical multimodal datasets rather than single-modality subsets.
Developers can isolate whether their architecture gains from insulin or carbohydrate signals.
Expanded clinical metrics make it easier to name specific forecasting weaknesses for targeted fixes.
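
As one illustration of a clinically stratified metric, the sketch below computes RMSE separately for hypoglycemic, in-range, and hyperglycemic reference values. It is a minimal sketch, not the benchmark's implementation: the function name `rmse_by_region` is hypothetical, and the 70 and 180 mg/dL cut-offs are the commonly used clinical thresholds, assumed here rather than taken from the paper.

```python
import numpy as np

# Assumed thresholds in mg/dL (common clinical cut-offs, not taken from the paper).
HYPO_MAX = 70.0
HYPER_MIN = 180.0

def rmse_by_region(y_true, y_pred):
    """Return RMSE per glycemic region, keyed by region name.

    Regions are assigned from the reference (true) glucose value.
    A region with no samples maps to NaN.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    masks = {
        "hypoglycemia": y_true < HYPO_MAX,
        "in_range": (y_true >= HYPO_MAX) & (y_true <= HYPER_MIN),
        "hyperglycemia": y_true > HYPER_MIN,
    }
    out = {}
    for name, mask in masks.items():
        if mask.any():
            err = y_pred[mask] - y_true[mask]
            out[name] = float(np.sqrt(np.mean(err ** 2)))
        else:
            out[name] = float("nan")
    return out
```

Reporting the three values side by side can show when a model with a good overall RMSE is still weak in the hypoglycemic range, where samples are usually scarce.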

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Wider adoption of the benchmark could shift research focus from single-modality CGM models to integrated closed-loop designs.
Testing the framework on larger, more diverse patient cohorts might expose whether modality benefits vary by individual physiology.

Load-bearing premise

The handful of recently published models and one custom model chosen for testing are representative of the wider space of glucose forecasting algorithms.

What would settle it

Re-running the benchmark on a fresh collection of models and obtaining performance gains from extra modalities that show no dependence on model complexity would falsify the central claim.
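
One way such a re-run could be summarized is a rank correlation between model size and the RMSE gain from adding insulin and carbohydrate inputs. The sketch below is an assumption about how that check might look, not the paper's analysis. The helper names are hypothetical, and every number in it is a placeholder chosen for illustration, not a reported result.

```python
import numpy as np

def modality_gain(rmse_cgm_only, rmse_multimodal):
    """RMSE improvement from adding insulin and carbohydrate inputs (positive = better)."""
    return np.asarray(rmse_cgm_only, dtype=float) - np.asarray(rmse_multimodal, dtype=float)

def spearman_rho(x, y):
    """Spearman rank correlation; this simple version assumes no tied values."""
    rank_x = np.argsort(np.argsort(x)).astype(float)
    rank_y = np.argsort(np.argsort(y)).astype(float)
    return float(np.corrcoef(rank_x, rank_y)[0, 1])

# Placeholder values for illustration only; these are NOT results from the paper.
param_counts = np.array([1e4, 1e5, 1e6, 1e7])
rmse_cgm = np.array([20.0, 19.5, 19.0, 18.5])
rmse_multi = np.array([20.1, 19.3, 18.5, 17.6])

gain = modality_gain(rmse_cgm, rmse_multi)
rho = spearman_rho(np.log10(param_counts), gain)
print(f"Spearman rho between log10(params) and RMSE gain: {rho:.2f}")
```

A rho near zero on a fresh set of models would count against the claimed dependence. A rho near one would be consistent with it, although, as the referee report notes, it would not separate complexity from architecture or training differences.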

Figures

Figures reproduced from arXiv: 2606.18640 by Caleb Mayer, David Klonoff, Elizabeth Healey, Michael Snyder, Miriam Wolff, Nathaniel Jeffries, Sam Royston, Tao Wang.

**Figure 1.** Illustrates the MetaboNet-Bench workflow. Data are retrieved and preprocessed by filtering features, imputing zeros for missing insulin and carbohydrate values, and removing outliers. The data are then segmented using a sliding window before model inference. Finally, models are evaluated on the glucose forecasting task using quantitative metrics and visualizations of clinical accuracy.

**Figure 2.** (Left) Linear extrapolation DTS error grid at 30-minute PH. (Middle) GluForecast DTS …

**Figure 3.** Distribution of samples across glycemic regions (left) and corresponding RMSE results …

**Figure 4.** Comparison of performance of the ablated models in the presence of common blood glucose …

**Figure 5.** Relation between parameter count and RMSE improvement due to introduction of Insulin and Carbohydrate data (∆ in …

**Figure 6.** RMSE for Novel Patients, vs Known Patients with some data in the training set. The RMSE …

**Figure 7.** Illustration of the signed error for each model across prediction horizons, from shorter …

**Figure 8.** Heatmap illustrating, for each model, the relationship between prediction horizon, reference …

**Figure 9.** Comparison of model RMSE when there are no recent meals present in the data (left), …

**Figure 10.** Comparison of model RMSE during hyperglycemic samples in the true values of the …

**Figure 11.** RMSE results across subpopulations aggregated over all prediction horizons, illustrating …

**Figure 12.** RMSE results across subpopulations aggregated over all prediction horizons, illustrating …
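
The preprocessing and segmentation steps named in the Figure 1 caption can be sketched roughly as follows. This is a minimal sketch under stated assumptions: the function names, the window lengths, and the choice of column 0 as the CGM channel are all hypothetical, and the outlier-removal step is left out because the caption does not say which rule the authors use.

```python
import numpy as np

def impute_zero(values):
    """Replace missing (NaN) insulin or carbohydrate entries with 0."""
    values = np.asarray(values, dtype=float)
    return np.where(np.isnan(values), 0.0, values)

def sliding_windows(series, history, horizon, stride=1):
    """Split a 2-D (T, F) array into model inputs and forecast targets.

    inputs:  (N, history, F) past observations of all features
    targets: (N, horizon) future values of feature 0 (assumed to be CGM glucose)
    """
    series = np.asarray(series, dtype=float)
    total = history + horizon
    starts = range(0, series.shape[0] - total + 1, stride)
    if len(starts) == 0:
        raise ValueError("series is shorter than history + horizon")
    inputs = np.stack([series[s:s + history] for s in starts])
    targets = np.stack([series[s + history:s + total, 0] for s in starts])
    return inputs, targets

# Toy example: 10 time steps, columns = [CGM, insulin, carbohydrates].
cgm = np.arange(100.0, 110.0)
insulin = np.array([np.nan, 1.0] + [np.nan] * 8)
carbs = np.full(10, np.nan)
series = np.column_stack([cgm, impute_zero(insulin), impute_zero(carbs)])

X, y = sliding_windows(series, history=4, horizon=2)
print(X.shape, y.shape)
```

Zero imputation treats an absent insulin or carbohydrate record as "no dose" or "no meal", which is why it applies only to those two channels here and not to the CGM signal.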

Original abstract

Glucose forecasting algorithms are an important aspect of glycemic control management in type 1 diabetes. So far, the research community has developed numerous algorithms and models for forecasting. However, it is well-recognized that the lack of standardized model performance evaluation benchmarks makes fair comparison difficult and hinders further innovation, and thus benchmark standardization is in urgent need. Furthermore, many published glucose forecasting algorithms are limited to CGM data alone, ignoring other multimodal signals such as insulin dosing and carbohydrate intake. Here, we introduce MetaboNet-Bench, a benchmark for multimodal glucose forecasting for patients with type 1 diabetes that provides an extensible open-source evaluation framework for comparison of glucose forecasting algorithms that leverage glucose, insulin, and carbohydrate data. We then demonstrate its utility by benchmarking several recently published glucose forecasting models and a custom multimodal time-series model, representing different model architectures. The results show that the benefit of adding data modalities is conditioned on the complexity of the model and that incorporating more clinical metrics helps identify meaningful gaps to fill for future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MetaboNet-Bench supplies a needed open framework for comparing multimodal glucose models, but the claim that modality gains depend on complexity rests on uncontrolled model comparisons.

MetaboNet-Bench is an open-source evaluation setup for glucose forecasting that accepts CGM plus insulin and carbohydrate inputs. The authors run it on several recently published models plus one custom multimodal time-series model and report that extra modalities help more with complex models while also surfacing research gaps.

The release itself is the clearest contribution. The field has complained about non-comparable results for years, and an extensible public benchmark lowers the barrier for new work. That part is straightforward and addresses a documented pain point.

The central observation about modality benefits being conditioned on complexity is harder to accept at face value. The paper simply takes existing published models that already differ in architecture, training procedure, and parameter count, then looks for patterns. Nothing in the abstract shows they held architecture family fixed and varied only depth or width, so the interaction could easily trace to other factors. The stress-test note is on target here.

No numbers, dataset sizes, error bars, or statistical details appear in the abstract, which makes it impossible to judge how large or consistent the reported effects are. If the full paper supplies those and the code is released as promised, the work becomes more usable.

This is for researchers in diabetes technology who need a shared testbed. A reader who wants to try new multimodal inputs or compare against a standard set will find it practical. The paper shows clear thinking about the standardization problem even if the complexity analysis is not tightly controlled.

I would send it to peer review. A benchmark paper does not need to be flawless in every analysis to be worth referee time, provided the framework is reproducible and the experiments are reported with enough detail to let others build on it.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces MetaboNet-Bench, an extensible open-source benchmark framework for multimodal glucose forecasting in type 1 diabetes that incorporates CGM glucose, insulin dosing, and carbohydrate intake data. It evaluates several recently published glucose forecasting models plus one custom multimodal time-series model representing different architectures, and reports that the benefit of additional modalities is conditioned on model complexity while multimodal clinical metrics help surface research gaps.

Significance. A standardized, reproducible benchmark for multimodal T1D forecasting would address a recognized gap in the literature and could facilitate fairer model comparisons. The open-source framework and emphasis on clinical metrics are constructive contributions; if the empirical claims are supported by properly controlled experiments, the work could usefully guide future multimodal modeling efforts.

major comments (2)

[Abstract] Abstract: the claim that 'the benefit of adding data modalities is conditioned on the complexity of the model' is not supported by a controlled isolation of complexity. The evaluation compares a small set of published models plus one custom model; these differ simultaneously in architecture family, training procedure, and other factors, so any observed interaction cannot be attributed specifically to complexity rather than confounders.
[Abstract] Abstract (and presumed Results section): no quantitative performance numbers, error bars, dataset sizes, train/test splits, or statistical tests are referenced, preventing assessment of whether the modality-benefit pattern is robust or merely descriptive.
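
To make the request for error bars concrete, the sketch below shows one common way to attach an interval to a modality benefit: a paired percentile bootstrap on the RMSE difference between a CGM-only model and its multimodal counterpart. It is an illustration only. The function names are hypothetical, and nothing here is claimed to be the statistical procedure used in the manuscript.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length arrays."""
    diff = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def bootstrap_rmse_gain(y_true, pred_base, pred_multi, n_boot=1000, alpha=0.05, seed=0):
    """Percentile interval for RMSE(base) - RMSE(multimodal).

    Samples are resampled with replacement; the same indices are used for
    both models so the comparison stays paired.
    """
    y_true = np.asarray(y_true, dtype=float)
    pred_base = np.asarray(pred_base, dtype=float)
    pred_multi = np.asarray(pred_multi, dtype=float)
    rng = np.random.default_rng(seed)
    n = y_true.shape[0]
    gains = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        gains[b] = rmse(y_true[idx], pred_base[idx]) - rmse(y_true[idx], pred_multi[idx])
    low, high = np.quantile(gains, [alpha / 2, 1 - alpha / 2])
    return float(low), float(high)
```

An interval that excludes zero suggests the gain is not a resampling artifact. Resampling individual samples ignores correlation within a patient, so resampling whole patients would give a more conservative interval.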

minor comments (1)

[Abstract] The abstract would be strengthened by a brief statement of the primary evaluation metric(s) and the number of subjects or time-series length used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the scope of our claims and the presentation of results. We address each major comment below and indicate planned revisions.

Referee: [Abstract] Abstract: the claim that 'the benefit of adding data modalities is conditioned on the complexity of the model' is not supported by a controlled isolation of complexity. The evaluation compares a small set of published models plus one custom model; these differ simultaneously in architecture family, training procedure, and other factors, so any observed interaction cannot be attributed specifically to complexity rather than confounders.

Authors: We agree that the evaluated models vary across multiple dimensions (architecture family, training procedures, hyperparameters, etc.) and that our results do not isolate model complexity through controlled ablation or matched experiments. The reported pattern is therefore an empirical observation across the selected models rather than a causal attribution to complexity alone. We will revise the abstract to state that the benefit of additional modalities 'appears to depend on model complexity in the evaluated models' and will add a limitations paragraph in the discussion section explicitly noting the presence of confounding factors and the observational nature of the finding.

revision: yes

Referee: [Abstract] Abstract (and presumed Results section): no quantitative performance numbers, error bars, dataset sizes, train/test splits, or statistical tests are referenced, preventing assessment of whether the modality-benefit pattern is robust or merely descriptive.

Authors: The abstract is intentionally concise and summarizes the high-level contribution and key observation. The full manuscript contains the requested quantitative details (performance metrics with error bars, dataset sizes, train/test splits, and statistical comparisons) in the Results and Experimental Setup sections. To improve accessibility, we will add one or two representative quantitative highlights (e.g., key RMSE or clinical metric values) to the abstract while remaining within length limits, and we will ensure the Results section explicitly cross-references all evaluation details.

revision: partial

Circularity Check

0 steps flagged

Empirical benchmark paper with no derivation chain or self-referential structure

The paper introduces MetaboNet-Bench as an evaluation framework and reports empirical results from benchmarking published models plus one custom model on glucose, insulin, and carbohydrate data. No mathematical derivations, equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described content. The central observation that modality benefit is conditioned on model complexity is presented as a direct outcome of the benchmark experiments rather than reducing to any input by construction. This is a standard empirical comparison study whose claims rest on external data evaluations, not internal redefinitions or citation chains.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract; the contribution is a benchmarking framework rather than a theoretical derivation.

Reference graph

Works this paper leans on

12 extracted references · 9 canonical work pages · 2 internal anchors

[1]

Cappon, G., Prendin, F., Facchinetti, A., Sparacino, G., and Del Favero, S. Individualized models for glucose prediction in type 1 diabetes: Comparing black-box approaches to a physiological white-box one. IEEE Transactions on Biomedical Engineering, 70(11):3105–3115, Nov 2023a. doi: 10.1109/TBME.2023.3276193. Cappon, G., Vettoretti, M., Sparacino, G., Del...

work page doi:10.1109/tbme.2023.3276193 2023
[2]

Deep Multi-Output Forecasting: Learning to Accurately Predict Blood Glucose Trajectories

doi: 10.48550/arXiv.1806.05357. URL https://arxiv.org/abs/1806.05357. Gao, S., Hartvigsen, T., Koker, T., Queen, O., Tsiligkaridis, T., and Zitnik, M. UniTS: A unified multi-task time series model. In Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., and Zhang, C. (eds.), Advances in Neural Information Processing Systems 37, volum...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1806.05357
[3]

eCollection 2024 Aug

doi: 10.1016/j.dib.2024.110559. eCollection 2024 Aug. Jaloli, M. and Cescon, M. Long-term prediction of blood glucose levels in type 1 diabetes using a CNN-LSTM-based deep neural network. J. Diabetes Sci. Technol., 17(6):1590–1601, November

work page doi:10.1016/j.dib.2024.110559 2024
[4]

URL https://www.nature.com/articles/s41597-023-02469-5

doi: 10.1038/s41597-023-02469-5. URL https://www.nature.com/articles/s41597-023-02469-5. Prioleau, T., Lu, B., and Cui, Y. Glucose-ml: A collection of longitudinal diabetes datasets for development of robust ai solutions

work page doi:10.1038/s41597-023-02469-5
[5]

URL https://arxiv.org/abs/2507.14077

doi: 10.48550/arXiv.2507.14077. URL https://arxiv.org/abs/2507.14077. Replica Health. Metabonet data dictionary. https://metabo-net.org/data-dictionary,

work page doi:10.48550/arxiv.2507.14077
[6]

Glucobench: Curated list of continuous glucose monitoring datasets with prediction benchmarks

Sergazinov, R., Chun, E., Rogovchenko, V., Fernandes, N., Kasman, N., and Gaynanova, I. Glucobench: Curated list of continuous glucose monitoring datasets with prediction benchmarks. arXiv preprint arXiv:2410.05780,

arXiv
[7]

URL https://doi.org/10.21105/joss.06904

doi: 10.21105/joss.06904. URL https://doi.org/10.21105/joss.06904. Wolff, M. K., Royston, S., Fougner, A. L., Schaathun, H. G., Steinert, M., and Volden, R. A perspective on harmonizing diabetes management datasets. Data Brief, 59(111399):111399, April 2025a. Wolff, M. K., Schaathun, H. G., Gros, S., Volden, R., Steinert, M., and Fougner, A. L. Blood gluc...

work page doi:10.21105/joss.06904 2025
[8]

doi: 10.48550/arXiv.2601.11505. URL https://arxiv.org/abs/2601.11505. Xie, J. and Wang, Q. Benchmarking machine learning algorithms on blood glucose prediction for type i diabetes in comparison with classical time-series models. IEEE Transactions on Biomedical Engineering, 67(11):3101–3124,

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2601
[9]

URL https://pubmed.ncbi.nlm.nih.gov/32091990/

doi: 10.1109/TBME.2020.2975959. URL https://pubmed.ncbi.nlm.nih.gov/32091990/. Yang, T., Wu, R., Tao, R., Wen, S., Ma, N., Zhao, Y., Yu, X., and Li, H. Multi-scale long short-term memory network with multi-lag structure for blood glucose prediction. In Proceedings of the 5th International Workshop on Knowledge Discovery in Healthcare Data (KDH@ECAI 2020)...

work page doi:10.1109/tbme.2020.2975959 2020
[10]

Zisser, H., Renard, E., Kovatchev, B., Cobelli, C., Avogaro, A., Nimri, R., Magni, L., Buckingham, B

doi: 10.1038/s41597-023-01940-7. Zisser, H., Renard, E., Kovatchev, B., Cobelli, C., Avogaro, A., Nimri, R., Magni, L., Buckingham, B. A., Chase, H. P., Doyle, 3rd, F. J., Lum, J., Calhoun, P., Kollman, C., Dassau, E., Farret, A., Place, J., Breton, M., Anderson, S. M., Dalla Man, C., Del Favero, S., Bruttomesso, D., Filippi, A., Scotton, R., Phillip, M.,...

work page doi:10.1038/s41597-023-01940-7
[11]

# Not MDI

A Appendix - Datasets. Table 3: Overview of all datasets included in the public release of the MetaboNet dataset (Wolff et al., 2026), showing the number of subjects per dataset. The column “# Not MDI” reports the number of subjects using continuous insulin pump therapy, i.e., not treated with Multiple Daily Injections (MDI). The MetaboNet consolidated ...

2026
[12]

If any of these carbs are nonzero we say the sample is postprandial

Postprandial: For a given sample, we look at the current reported carbohydrates as well as the reported carbs every 5 minutes up until 30 minutes prior to the sample. If any of these carbs are nonzero we say the sample is postprandial. Correction Bolus: If a sample has a CGM value > 250 mg/dL and an insulin value of > 2IU for any of the intervals until 30 mi...

2026
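
The postprandial rule quoted in entry [12] can be sketched as a lookback over the carbohydrate channel. The function name and the 5-minute sampling grid as a parameter are assumptions, as is the choice to include the value exactly 30 minutes back. The correction-bolus rule is not sketched because its definition is truncated in the extracted text.

```python
import numpy as np

def postprandial_mask(carbs, lookback_min=30, step_min=5):
    """Flag samples whose current or recent carbohydrate entries are nonzero.

    A sample is flagged if any reported carbohydrate value at the current
    step or within the preceding `lookback_min` minutes is nonzero. The
    window is treated as inclusive of the value `lookback_min` minutes back
    (an assumption). Missing (NaN) entries count as zero.
    """
    carbs = np.asarray(carbs, dtype=float)
    n = carbs.shape[0]
    steps = lookback_min // step_min
    nonzero = np.nan_to_num(carbs) != 0
    mask = np.zeros(n, dtype=bool)
    # A meal at index i marks samples i .. i + steps as postprandial.
    for lag in range(min(steps, n - 1) + 1):
        mask[lag:] |= nonzero[: n - lag]
    return mask

# One 30 g meal logged at the second 5-minute step.
carbs = [0, 30, 0, 0, 0, 0, 0, 0, 0, 0]
print(postprandial_mask(carbs))
```

With the default settings a single logged meal marks seven consecutive samples: the step where it was reported and the six that follow.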
