pith. machine review for the scientific record.

arxiv: 2604.17133 · v1 · submitted 2026-04-18 · 💻 cs.AI · cs.CR


If Only My CGM Could Speak: A Privacy-Preserving Agent for Question Answering over Continuous Glucose Data


Pith reviewed 2026-05-10 06:19 UTC · model grok-4.3

classification 💻 cs.AI cs.CR
keywords continuous glucose monitoring · privacy-preserving AI · question answering · large language models · diabetes self-management · local computation · health data agents

The pith

Privacy-preserving agent uses LLMs only to select local functions for answering questions on personal glucose data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CGM-Agent, a system designed to let people ask free-form questions about their continuous glucose monitor readings without sending any health data to external servers. An LLM acts solely as a planner that chooses from a fixed set of analytical functions based on the query text, after which the chosen function runs entirely on the user's device. This setup targets privacy worries in diabetes care while still supporting inquisitive queries that current static CGM apps cannot handle. A new benchmark of over four thousand questions, mixing templates and real user examples, shows leading models reaching 94 percent accuracy on synthetic cases and 88 percent on ambiguous real ones, with most mistakes traced to unclear intent or timing rather than calculation errors.

Core claim

CGM-Agent is a privacy-preserving framework for question answering over personal glucose data in which the LLM serves purely as a reasoning engine that selects analytical functions. All computation occurs locally, and personal health data never leaves the user's device. For evaluation, the authors built a benchmark of 4,180 questions from parameterized templates and real queries with ground truth from deterministic programs, finding that top models reach 94% value accuracy on synthetic queries and 88% on ambiguous real-world queries, while lightweight models remain competitive.
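The benchmark-construction idea, pairing a parameterized question template with ground truth from deterministic program execution, can be sketched as follows. The template wording, function names, and data layout here are illustrative assumptions, not the released benchmark.

```python
# Sketch: a parameterized template plus deterministic execution yields a
# benchmark item with objective ground truth. All names are hypothetical.
import statistics

def ground_truth_mean(readings, start, end):
    # Deterministic program that defines the "correct" answer for this item.
    return round(statistics.mean(readings[start:end]), 1)

def make_question(readings, start, end):
    # Instantiate the parameterized template and attach its computed answer.
    question = f"What was my average glucose between hour {start} and hour {end}?"
    return {"question": question,
            "ground_truth": ground_truth_mean(readings, start, end)}

readings = [100, 120, 150, 170, 140, 110]
item = make_question(readings, 1, 4)
# mean of readings[1:4] = (120 + 150 + 170) / 3
```

An agent's answer can then be scored for value accuracy by comparing it against `item["ground_truth"]`.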

What carries the argument

The CGM-Agent architecture in which the LLM selects from a library of analytical functions using only the query text, after which the selected function executes locally on the device to produce the answer.
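The selection-then-local-execution pattern can be sketched in a few lines. Everything below is a hypothetical illustration of that pattern, not the authors' actual API: `FUNCTION_LIBRARY`, `select_function`, and the two analytical functions are invented names, and the keyword-matching planner stands in for the LLM call.

```python
# Sketch of the function-selection pattern: the planner sees only the query
# text and the library's descriptions; only local functions touch the data.
from typing import Callable

def mean_glucose(readings: list[float]) -> float:
    # Runs on-device over the raw CGM values.
    return sum(readings) / len(readings)

def time_in_range(readings: list[float], lo: float = 70, hi: float = 180) -> float:
    # Percent of readings inside the 70-180 mg/dL target band.
    return 100.0 * sum(lo <= g <= hi for g in readings) / len(readings)

FUNCTION_LIBRARY: dict[str, Callable] = {
    "mean_glucose": mean_glucose,
    "time_in_range": time_in_range,
}

def select_function(query: str) -> str:
    """Stand-in for the LLM planner. In the real system this is an LLM call
    that receives the query and function descriptions, never the readings."""
    return "time_in_range" if "range" in query.lower() else "mean_glucose"

def answer(query: str, readings: list[float]) -> float:
    fn_name = select_function(query)            # planner sees the query only
    return FUNCTION_LIBRARY[fn_name](readings)  # data stays on-device
```

The privacy property falls out of the structure: `select_function` is the only step that could involve a remote model, and its inputs contain no glucose values.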

If this is right

  • People with diabetes can pose natural-language questions about trends, patterns, and events in their CGM data and receive accurate answers without uploading records.
  • Lightweight models achieve competitive results inside the agent design, opening the door to on-device deployment without cloud costs.
  • Most errors arise from ambiguity in user intent or time references rather than from failures in function execution or computation.
  • Releasing the code and the 4,180-question benchmark enables others to build and test similar local agents for other personal sensor streams.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same function-selection pattern could apply to other continuous personal sensors such as heart-rate or activity monitors while keeping raw readings local.
  • Adding mechanisms to clarify ambiguous time periods or intent before function selection might raise accuracy on messy real-world queries.
  • Keeping all data on-device could simplify compliance with health-data regulations that restrict cloud transmission of identifiable records.
  • Over time the function library could be expanded or made user-extensible without retraining the underlying LLM.

Load-bearing premise

A fixed library of analytical functions is enough to cover every user question about glucose data, and the LLM can pick the right function from the text of the query alone without ever seeing the actual data values.

What would settle it

A collection of real user queries that cannot be answered by any function in the current library or that cause the LLM to pick the wrong function, producing answers that differ from ground-truth results computed directly from the glucose readings.

Figures

Figures reproduced from arXiv: 2604.17133 by Ali Emami, Nikhil Singh, Temiloluwa Prioleau, Yanjun Cui.

Figure 1
Figure 1: Overview of CGM-Agent. The pipeline consists of three layers: (1) the Input Processor resolves ambiguous queries, (2) the Analytical Agent generates tool calls executed in a local sandbox, and (3) the Response Generator synthesizes a natural language answer. Raw CGM data (shown in the bottom sandbox) never crosses the privacy boundary; only function calls and aggregated metrics are exchanged with the LLM.
Figure 2
Figure 2: Average Daily Glucose Profile. Generated by the plot_daily_trends function, this visualization aggregates 7 days of data to display the mean glucose trajectory (solid green line) and glucose standard deviation (shaded area, ±1 SD) relative to standard clinical target boundaries (70–180 mg/dL).
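The aggregation behind such a daily profile can be sketched as bucketing readings by time of day and reporting mean ± 1 SD against the 70–180 mg/dL target band. The `daily_profile` function and the `(hour, glucose)` data layout are assumptions for illustration, not the paper's plot_daily_trends implementation.

```python
# Sketch: average daily glucose profile from multi-day CGM samples.
import statistics
from collections import defaultdict

TARGET_LO, TARGET_HI = 70, 180  # mg/dL clinical target boundaries

def daily_profile(samples):
    """samples: iterable of (hour_of_day, glucose_mg_dl) over several days.
    Returns {hour: (mean, standard deviation)} for plotting mean ± 1 SD."""
    by_hour = defaultdict(list)
    for hour, glucose in samples:
        by_hour[hour].append(glucose)
    profile = {}
    for hour, values in sorted(by_hour.items()):
        mean = statistics.mean(values)
        sd = statistics.pstdev(values) if len(values) > 1 else 0.0
        profile[hour] = (round(mean, 1), round(sd, 1))
    return profile

samples = [(8, 110), (8, 130), (8, 120), (12, 160), (12, 190), (12, 175)]
profile = daily_profile(samples)
```

Because only these aggregated per-hour statistics leave the sandbox, the plot can be described to the LLM without exposing individual readings.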
Figure 3
Figure 3: Control flow for two representative query types.
Original abstract

Continuous glucose monitors (CGMs) used in diabetes care collect rich personal health data that could improve day-to-day self-management. However, current patient platforms only offer static summaries which do not support inquisitive user queries. Large language models (LLMs) could enable free-form inquiries about continuous glucose data, but deploying them over sensitive health records raises privacy and accuracy concerns. In this paper, we present CGM-Agent, a privacy-preserving framework for question answering over personal glucose data. In our design, the LLM serves purely as a reasoning engine that selects analytical functions. All computation occurs locally, and personal health data never leaves the user's device. For evaluation, we construct a benchmark of 4,180 questions combining parameterized question templates with real user queries and ground truth derived from deterministic program execution. Evaluating 6 leading LLMs, we find that top models achieve 94% value accuracy on synthetic queries and 88% on ambiguous real-world queries. Errors stem primarily from intent and temporal ambiguity rather than computational failures. Additionally, lightweight models achieve competitive performance in our agent design, suggesting opportunities for low-cost deployment. We release our code and benchmark to support future work on trustworthy health agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript presents CGM-Agent, a privacy-preserving framework for question answering over continuous glucose monitor (CGM) data. In this system, an LLM functions solely as a reasoning engine to select from a predefined library of analytical functions, with all data processing and computation occurring locally on the user's device to preserve privacy. The authors construct a benchmark consisting of 4,180 questions by combining parameterized templates with real user queries, deriving ground truth via deterministic program execution. Evaluation of six leading LLMs shows top-performing models achieving 94% value accuracy on synthetic queries and 88% on ambiguous real-world queries, with errors primarily attributed to intent and temporal ambiguity rather than computational issues. Lightweight models are shown to perform competitively, and the code and benchmark are released publicly.

Significance. If the central results hold, this work demonstrates a viable approach for enabling interactive, free-form querying of sensitive personal health data using LLMs without compromising privacy, which has significant implications for diabetes self-management and the broader development of trustworthy health agents. The public release of the code and benchmark is a notable strength that supports reproducibility and future research. However, the significance is tempered by the need to confirm that the fixed analytical function library adequately covers the range of real-world user queries, as the current evaluation focuses on in-scope performance.

major comments (1)
  1. [Abstract (benchmark construction)] As described in the abstract, the benchmark of 4,180 questions is constructed by combining parameterized question templates with real user queries. This construction ties queries to the fixed library of analytical functions, so the reported 94% value accuracy on synthetic queries and 88% on ambiguous real-world queries measure LLM function-selection skill for in-scope queries. No coverage statistic, out-of-scope rejection rate, or analysis of the fraction of genuine patient questions that fall outside the library is provided. This is load-bearing for the central claim that the system supports free-form inquiries about CGM data.
minor comments (2)
  1. [Abstract] The abstract provides no details on prompt engineering for function selection, the size and coverage of the analytical function library, or the protocol for handling ambiguous queries; adding these in the methods or evaluation section would improve verifiability.
  2. A diagram of the CGM-Agent architecture would clarify the separation between LLM reasoning and local computation for readers.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive review and for highlighting the importance of clarifying the scope of our benchmark relative to free-form queries. We address the major comment below and will incorporate revisions to strengthen the presentation of the work's limitations and capabilities.

point-by-point responses
  1. Referee: As described in the abstract, the benchmark of 4,180 questions is constructed by combining parameterized question templates with real user queries. This construction ties queries to the fixed library of analytical functions, so the reported 94% value accuracy on synthetic queries and 88% on ambiguous real-world queries measure LLM function-selection skill for in-scope queries. No coverage statistic, out-of-scope rejection rate, or analysis of the fraction of genuine patient questions that fall outside the library is provided. This is load-bearing for the central claim that the system supports free-form inquiries about CGM data.

    Authors: We agree that the benchmark is deliberately constructed around queries for which ground truth can be obtained via deterministic execution of the analytical function library; this ensures objective, reproducible evaluation of the LLM's reasoning and selection performance. The 4,180 questions therefore reflect in-scope cases by design. We acknowledge that an explicit coverage analysis would better contextualize the central claim of supporting free-form inquiries. In the revised manuscript we will add a dedicated subsection describing the real-user query collection process, the fraction of collected queries that could be mapped to the library, and the system's intended behavior (e.g., explicit rejection or clarification request) for out-of-scope inputs. This addition will make the evaluation boundaries transparent without altering the reported accuracy figures. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical system evaluation

full rationale

The paper presents the design and empirical evaluation of CGM-Agent, a privacy-preserving agent that uses LLMs solely as function selectors for local analytical computations over CGM data. The benchmark of 4,180 questions is constructed from parameterized templates (tied to the fixed function library) plus real user queries, with ground truth obtained via deterministic program execution. Reported accuracies (94% synthetic, 88% real-world) measure LLM selection performance on this benchmark. No mathematical derivation chain, predictions, fitted parameters, self-definitional constructs, or load-bearing self-citations exist. The evaluation is self-contained and externally verifiable via the released code and benchmark; any limitations on query coverage fall under generalizability rather than circularity by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on two domain assumptions about LLM reasoning and function coverage rather than any fitted parameters or new invented entities.

axioms (2)
  • domain assumption LLMs can select the correct analytical functions from natural-language queries without access to the underlying glucose data.
    This is the core premise that allows the LLM to act only as a reasoning engine while keeping data local.
  • domain assumption A fixed set of analytical functions is sufficient to answer typical user questions about continuous glucose data.
    Required for the local-computation approach to cover the query distribution in the benchmark.

pith-pipeline@v0.9.0 · 5525 in / 1342 out tokens · 45278 ms · 2026-05-10T06:19:28.592773+00:00 · methodology

