If Only My CGM Could Speak: A Privacy-Preserving Agent for Question Answering over Continuous Glucose Data
Pith reviewed 2026-05-10 06:19 UTC · model grok-4.3
The pith
Privacy-preserving agent uses LLMs only to select local functions for answering questions on personal glucose data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CGM-Agent is a privacy-preserving framework for question answering over personal glucose data in which the LLM serves purely as a reasoning engine that selects analytical functions. All computation occurs locally, and personal health data never leaves the user's device. For evaluation, the authors built a benchmark of 4,180 questions from parameterized templates and real queries with ground truth from deterministic programs, finding that top models reach 94% value accuracy on synthetic queries and 88% on ambiguous real-world queries, while lightweight models remain competitive.
What carries the argument
The CGM-Agent architecture in which the LLM selects from a library of analytical functions using only the query text, after which the selected function executes locally on the device to produce the answer.
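The selection-then-local-execution pattern can be sketched as follows. This is an illustrative reconstruction, not the paper's actual API: the function names and the keyword-matching "selector" standing in for the LLM are hypothetical, but the division of labor matches the described design, where the selector sees only the query text and function descriptions, never the glucose readings.

```python
# Minimal sketch of the function-selection pattern (names illustrative).
# The "LLM" is stubbed as a keyword router so the sketch is runnable;
# the point is that selection uses only the query, never the data.
from statistics import mean

# Local analytical function library: description (shown to the selector)
# plus an implementation (executed on-device over the raw readings).
LIBRARY = {
    "mean_glucose": ("Average glucose over the period",
                     lambda readings: round(mean(readings), 1)),
    "time_in_range": ("Percent of readings in 70-180 mg/dL",
                      lambda readings: round(
                          100 * sum(70 <= r <= 180 for r in readings)
                          / len(readings), 1)),
}

def select_function(query: str) -> str:
    """Stand-in for the LLM: choose a function from the query text alone."""
    if "average" in query.lower() or "mean" in query.lower():
        return "mean_glucose"
    return "time_in_range"

def answer(query: str, readings: list[float]) -> float:
    name = select_function(query)   # reasoning step: no data involved
    _, fn = LIBRARY[name]
    return fn(readings)             # computation stays on-device

readings = [95.0, 110.0, 190.0, 150.0]  # mg/dL, toy trace
print(answer("What was my average glucose yesterday?", readings))  # 136.2
```

Because the selector's output is only a function name, the raw readings never appear in any prompt, which is the privacy property the architecture relies on.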
If this is right
- People with diabetes can pose natural-language questions about trends, patterns, and events in their CGM data and receive accurate answers without uploading records.
- Lightweight models achieve competitive results inside the agent design, opening the door to on-device deployment without cloud costs.
- Most errors arise from ambiguity in user intent or time references rather than from failures in function execution or computation.
- Releasing the code and the 4,180-question benchmark enables others to build and test similar local agents for other personal sensor streams.
Where Pith is reading between the lines
- The same function-selection pattern could apply to other continuous personal sensors such as heart-rate or activity monitors while keeping raw readings local.
- Adding mechanisms to clarify ambiguous time periods or intent before function selection might raise accuracy on messy real-world queries.
- Keeping all data on-device could simplify compliance with health-data regulations that restrict cloud transmission of identifiable records.
- Over time the function library could be expanded or made user-extensible without retraining the underlying LLM.
Load-bearing premise
A fixed library of analytical functions is enough to cover every user question about glucose data, and the LLM can pick the right function from the text of the query alone without ever seeing the actual data values.
What would settle it
A collection of real user queries that cannot be answered by any function in the current library or that cause the LLM to pick the wrong function, producing answers that differ from ground-truth results computed directly from the glucose readings.
Original abstract
Continuous glucose monitors (CGMs) used in diabetes care collect rich personal health data that could improve day-to-day self-management. However, current patient platforms only offer static summaries which do not support inquisitive user queries. Large language models (LLMs) could enable free-form inquiries about continuous glucose data, but deploying them over sensitive health records raises privacy and accuracy concerns. In this paper, we present CGM-Agent, a privacy-preserving framework for question answering over personal glucose data. In our design, the LLM serves purely as a reasoning engine that selects analytical functions. All computation occurs locally, and personal health data never leaves the user's device. For evaluation, we construct a benchmark of 4,180 questions combining parameterized question templates with real user queries and ground truth derived from deterministic program execution. Evaluating 6 leading LLMs, we find that top models achieve 94% value accuracy on synthetic queries and 88% on ambiguous real-world queries. Errors stem primarily from intent and temporal ambiguity rather than computational failures. Additionally, lightweight models achieve competitive performance in our agent design, suggesting opportunities for low-cost deployment. We release our code and benchmark to support future work on trustworthy health agents.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents CGM-Agent, a privacy-preserving framework for question answering over continuous glucose monitor (CGM) data. In this system, an LLM functions solely as a reasoning engine to select from a predefined library of analytical functions, with all data processing and computation occurring locally on the user's device to preserve privacy. The authors construct a benchmark consisting of 4,180 questions by combining parameterized templates with real user queries, deriving ground truth via deterministic program execution. Evaluation of six leading LLMs shows top-performing models achieving 94% value accuracy on synthetic queries and 88% on ambiguous real-world queries, with errors primarily attributed to intent and temporal ambiguity rather than computational issues. Lightweight models are shown to perform competitively, and the code and benchmark are released publicly.
Significance. If the central results hold, this work demonstrates a viable approach for enabling interactive, free-form querying of sensitive personal health data using LLMs without compromising privacy, which has significant implications for diabetes self-management and the broader development of trustworthy health agents. The public release of the code and benchmark is a notable strength that supports reproducibility and future research. However, the significance is tempered by the need to confirm that the fixed analytical function library adequately covers the range of real-world user queries, as the current evaluation focuses on in-scope performance.
major comments (1)
- [Abstract (benchmark construction)] As described in the abstract, the benchmark of 4,180 questions is constructed by combining parameterized question templates with real user queries. This construction ties queries to the fixed library of analytical functions, so the reported 94% value accuracy on synthetic queries and 88% on ambiguous real-world queries measure LLM function-selection skill for in-scope queries. No coverage statistic, out-of-scope rejection rate, or analysis of the fraction of genuine patient questions that fall outside the library is provided. This is load-bearing for the central claim that the system supports free-form inquiries about CGM data.
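The template-plus-deterministic-ground-truth construction the comment describes can be sketched as below. The template text, parameter names, and toy data are all illustrative, not drawn from the paper's benchmark; the sketch only shows why such a benchmark is in-scope by design: every generated question has, by construction, a program that answers it.

```python
# Hedged sketch of parameterized benchmark construction: each template
# instantiation is paired with a deterministic program that computes
# the reference answer from the same parameters.
import itertools

TEMPLATE = "How many readings above {threshold} mg/dL on day {day}?"

def ground_truth(data, day, threshold):
    """Deterministic program execution yields the ground-truth answer."""
    return sum(1 for r in data[day] if r > threshold)

data = {1: [80, 190, 210], 2: [100, 120]}  # toy CGM traces per day
benchmark = [
    {"question": TEMPLATE.format(threshold=t, day=d),
     "answer": ground_truth(data, d, t)}
    for d, t in itertools.product(data, (140, 180))
]
print(benchmark[0])
```

A question a patient actually asks that no template (and no library function) covers never enters such a benchmark, which is exactly the coverage gap the comment flags.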
minor comments (2)
- [Abstract] The abstract provides no details on prompt engineering for function selection, the size and coverage of the analytical function library, or the protocol for handling ambiguous queries; adding these in the methods or evaluation section would improve verifiability.
- A diagram of the CGM-Agent architecture would clarify the separation between LLM reasoning and local computation for readers.
Simulated Author's Rebuttal
We thank the referee for the constructive review and for highlighting the importance of clarifying the scope of our benchmark relative to free-form queries. We address the major comment below and will incorporate revisions to strengthen the presentation of the work's limitations and capabilities.
Point-by-point responses
Referee: As described in the abstract, the benchmark of 4,180 questions is constructed by combining parameterized question templates with real user queries. This construction ties queries to the fixed library of analytical functions, so the reported 94% value accuracy on synthetic queries and 88% on ambiguous real-world queries measure LLM function-selection skill for in-scope queries. No coverage statistic, out-of-scope rejection rate, or analysis of the fraction of genuine patient questions that fall outside the library is provided. This is load-bearing for the central claim that the system supports free-form inquiries about CGM data.
Authors: We agree that the benchmark is deliberately constructed around queries for which ground truth can be obtained via deterministic execution of the analytical function library; this ensures objective, reproducible evaluation of the LLM's reasoning and selection performance. The 4,180 questions therefore reflect in-scope cases by design. We acknowledge that an explicit coverage analysis would better contextualize the central claim of supporting free-form inquiries. In the revised manuscript we will add a dedicated subsection describing the real-user query collection process, the fraction of collected queries that could be mapped to the library, and the system's intended behavior (e.g., explicit rejection or clarification request) for out-of-scope inputs. This addition will make the evaluation boundaries transparent without altering the reported accuracy figures. revision: yes
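The out-of-scope behavior the authors propose (explicit rejection or a clarification request instead of a forced function choice) can be sketched as follows. The sentinel value and routing logic are assumptions for illustration, not the paper's implementation; the selector is again stubbed with keyword matching.

```python
# Illustrative sketch of out-of-scope handling: the selector may emit
# a sentinel instead of forcing a library choice, and the agent then
# rejects rather than fabricating an answer. All names are hypothetical.
LIBRARY = {"mean_glucose", "time_in_range"}
OUT_OF_SCOPE = "__none__"

def select_or_reject(query: str) -> str:
    """Stub selector; the real system would let the LLM emit the sentinel."""
    q = query.lower()
    if "average" in q or "mean" in q:
        return "mean_glucose"
    if "range" in q:
        return "time_in_range"
    return OUT_OF_SCOPE

def route(query: str) -> str:
    name = select_or_reject(query)
    if name == OUT_OF_SCOPE:
        return "I can't answer that from your glucose data alone."
    return f"running {name} locally"

print(route("Will I develop complications?"))  # rejected, not guessed
```

Measuring how often real patient queries hit the sentinel would yield exactly the coverage statistic the referee asks for.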
Circularity Check
No significant circularity in empirical system evaluation
Full rationale
The paper presents the design and empirical evaluation of CGM-Agent, a privacy-preserving agent that uses LLMs solely as function selectors for local analytical computations over CGM data. The benchmark of 4,180 questions is constructed from parameterized templates (tied to the fixed function library) plus real user queries, with ground truth obtained via deterministic program execution. Reported accuracies (94% synthetic, 88% real-world) measure LLM selection performance on this benchmark. No mathematical derivation chain, predictions, fitted parameters, self-definitional constructs, or load-bearing self-citations exist. The evaluation is self-contained and externally verifiable via the released code and benchmark; any limitations on query coverage fall under generalizability rather than circularity by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: LLMs can select the correct analytical functions from natural-language queries without access to the underlying glucose data.
- domain assumption: A fixed set of analytical functions is sufficient to answer typical user questions about continuous glucose data.
Reference graph
Works this paper leans on
- [1] Average blood glucose, standard deviation, glycemic variability (CV), estimated A1c, estimated Glucose Management Indicator (eGMI)
- [2] Min/max blood glucose, hypoglycemia and hyperglycemia events
- [3] Behavioral context (food/exercise/sleep) mapped to glucose excursions and trends; answerability logic: questions about past glucose data are directly answerable, behavioral questions are answerable if convertible to glucose trends over a specific time window, and general medical knowledge or future predictions are not
- [4] Prompt templates for event analysis ("Analyze glucose excursions for {dates_str}. Find significant rapid changes and details on timing, magnitude, and speed") and visualization ("Plot my typical daily CGM blood glucose trends for {dates_str}") (2025)
- [5] Type 2 diabetes CGM dataset with 15-minute sampling