Aim High, Stay Private: Differentially Private Synthetic Data Enables Public Release of Behavioral Health Information with High Utility
Pith reviewed 2026-05-19 07:28 UTC · model grok-4.3
The pith
Synthetic data generated with differential privacy at epsilon=5 retains adequate predictive utility for a real behavioral health study while reducing re-identification risks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors generate differentially private synthetic data for the LEMURS behavioral health dataset using the Adaptive Iterative Mechanism and demonstrate that datasets produced at epsilon=5 preserve adequate predictive utility for downstream tasks while significantly mitigating privacy risks, as measured by a utility framework informed by real uses of the original records.
What carries the argument
The Adaptive Iterative Mechanism (AIM), which builds synthetic data by iteratively refining noisy statistics to meet a chosen differential privacy budget across many attributes and records.
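As a rough illustration of what "noisy statistics" means here, the sketch below measures a single one-way marginal with Laplace noise and samples synthetic values from it. This is not AIM: AIM chooses which marginals to measure and fits a model to many of them, and none of that is reproduced. The column values, the `noisy_marginal` and `sample_synthetic` helpers, and the toy data are all hypothetical.

```python
import random
from collections import Counter

def laplace_noise(scale, rng):
    # The difference of two exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_marginal(values, domain, epsilon, rng):
    """Histogram over `domain` with Laplace noise of scale 1/epsilon per cell.

    Assumes neighbouring datasets differ by adding or removing one record,
    so the histogram has L1 sensitivity 1.
    """
    counts = Counter(values)
    return {v: counts.get(v, 0) + laplace_noise(1.0 / epsilon, rng) for v in domain}

def sample_synthetic(noisy, n, rng):
    # Post-processing only: clip negatives, then sample in proportion to noisy counts.
    cells = list(noisy)
    weights = [max(noisy[c], 0.0) for c in cells]
    if sum(weights) == 0:
        weights = [1.0] * len(cells)
    return rng.choices(cells, weights=weights, k=n)

rng = random.Random(0)
domain = ["low", "medium", "high"]                     # hypothetical categories
real = rng.choices(domain, weights=[5, 3, 2], k=500)   # toy stand-in, not LEMURS data
noisy = noisy_marginal(real, domain, epsilon=5.0, rng=rng)
# Assumption: the number of records is treated as public here.
synthetic = sample_synthetic(noisy, n=len(real), rng=rng)
```

Lowering `epsilon` widens the noise and makes the sampled marginal drift further from the real one, which is the trade-off the paper evaluates at scale.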
If this is right
- Public release of the synthetic LEMURS data becomes feasible at epsilon=5 without exposing participants to standard re-identification attacks.
- Researchers can run predictive models on the released data and obtain results close to those from the protected original records.
- The same generation and evaluation steps can be repeated on other multi-attribute health datasets to decide acceptable epsilon values.
- Data stewards gain a reproducible method to document privacy-utility trade-offs before sharing behavioral health information.
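One way the "decide acceptable epsilon values" step could be written down is sketched below: pick the smallest epsilon whose mean utility stays within a tolerance of a reference score. The function name, the tolerance rule, and every number in the example are assumptions for illustration; the paper's own criterion for "adequate" utility is not given in this summary.

```python
from statistics import mean

def smallest_adequate_epsilon(utility_by_eps, baseline, tolerance):
    """Return the smallest epsilon whose mean utility is within `tolerance` of `baseline`.

    utility_by_eps maps epsilon -> list of scores from repeated runs (higher is better).
    Returns None when no epsilon qualifies.
    """
    for eps in sorted(utility_by_eps):
        if baseline - mean(utility_by_eps[eps]) <= tolerance:
            return eps
    return None

# Arbitrary placeholder scores for illustration only; NOT results from the paper.
scores = {
    1: [0.40, 0.45, 0.42],
    5: [0.60, 0.62, 0.58],
    10: [0.63, 0.64, 0.62],
    100: [0.66, 0.65, 0.67],
}
chosen = smallest_adequate_epsilon(scores, baseline=0.68, tolerance=0.10)
```

Writing the rule as an explicit function makes the threshold a documented choice rather than something read off a plot after the fact.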
Where Pith is reading between the lines
- The approach could be applied to other wearable-health collections to test whether epsilon=5 remains sufficient when the number of attributes or participants changes.
- If institutions adopt this workflow, review boards might require explicit epsilon reporting for any public behavioral data release.
- Extending the utility tests to tasks outside the original framework, such as longitudinal trend analysis, would clarify the limits of the current findings.
Load-bearing premise
The chosen utility evaluation framework, built from existing uses of the LEMURS dataset, accurately reflects the downstream tasks that future users of the released data will perform.
What would settle it
A new prediction task, such as forecasting a specific mental-health outcome not included in the paper's evaluation, that shows substantially lower accuracy on the epsilon=5 synthetic data than on the original data would falsify the utility claim.
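Such a check is often framed as "train on synthetic, test on real". The sketch below shows the comparison on toy data with a one-predictor least-squares model; the variable meanings, the `draw` generator, and the noise levels are hypothetical stand-ins, not the paper's tasks or data.

```python
import random

def fit_line(xs, ys):
    # Ordinary least squares for y = a + b*x with one predictor.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def mae(model, xs, ys):
    a, b = model
    return sum(abs(y - (a + b * x)) for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(1)

def draw(n, noise):
    # Toy generator: x could be hours of sleep, y a stress score (both hypothetical).
    xs = [rng.uniform(4.0, 10.0) for _ in range(n)]
    ys = [10.0 - 0.8 * x + rng.gauss(0.0, noise) for x in xs]
    return xs, ys

train_x, train_y = draw(300, noise=1.0)   # stands in for the protected records
test_x, test_y = draw(200, noise=1.0)     # held-out real records
synth_x, synth_y = draw(300, noise=2.0)   # stands in for a DP synthetic release

mae_real = mae(fit_line(train_x, train_y), test_x, test_y)
mae_synth = mae(fit_line(synth_x, synth_y), test_x, test_y)
utility_gap = mae_synth - mae_real   # a large positive gap would count against the claim
```

Both models are scored on the same held-out real records, so the gap isolates what is lost by training on the synthetic release.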
Original abstract
Sharing health and behavioral data raises significant privacy concerns, as conventional de-identification methods are susceptible to privacy attacks. Differential Privacy (DP) provides formal guarantees against re-identification risks, but practical implementation necessitates balancing privacy protection and the utility of data. We demonstrate the use of DP to protect individuals in a real behavioral health study, while making the data publicly available and retaining high utility for downstream users of the data. We use the Adaptive Iterative Mechanism (AIM) to generate DP synthetic data for Phase 1 of the Lived Experiences Measured Using Rings Study (LEMURS). The LEMURS dataset comprises physiological measurements from wearable devices (Oura rings) and self-reported survey data from first-year college students. We evaluate the synthetic datasets across a range of privacy budgets, epsilon = 1 to 100, focusing on the trade-off between privacy and utility. We evaluate the utility of the synthetic data using a framework informed by actual uses of the LEMURS dataset. Our evaluation identifies the trade-off between privacy and utility across synthetic datasets generated with different privacy budgets. We find that synthetic data sets with epsilon = 5 preserve adequate predictive utility while significantly mitigating privacy risks. Our methodology establishes a reproducible framework for evaluating the practical impacts of epsilon on generating private synthetic datasets with numerous attributes and records, contributing to informed decision-making in data sharing practices.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript applies the Adaptive Iterative Mechanism (AIM) to generate differentially private synthetic data from the LEMURS behavioral health dataset (Oura ring physiological measurements and self-reported surveys from first-year college students). It evaluates synthetic datasets across epsilon values from 1 to 100 and concludes that epsilon=5 preserves adequate predictive utility for downstream tasks while substantially reducing privacy risks, proposing a reproducible evaluation framework grounded in actual uses of the LEMURS data.
Significance. If the utility results hold under broader validation, the work provides a practical demonstration of releasing sensitive multi-attribute behavioral health data publicly under formal differential privacy guarantees. It supplies concrete guidance on privacy-utility trade-offs for high-dimensional datasets and a template for epsilon selection that could support data-sharing practices in health research.
major comments (2)
- [Utility evaluation framework] The central claim that epsilon=5 preserves adequate predictive utility rests on an evaluation framework informed by actual uses of the LEMURS dataset. However, the manuscript does not demonstrate that the chosen predictive tasks and metrics capture critical statistical properties for behavioral-health research, such as joint distributions, temporal correlations between wearables and surveys, or heterogeneity across student subgroups. If these properties degrade faster under AIM at epsilon=5, the reported trade-off does not generalize to the full range of downstream analyses.
- [Methods and experimental setup] The abstract reports an epsilon sweep and utility evaluation, yet the methods description lacks error bars, baseline comparisons (e.g., non-private synthetic data or alternative DP mechanisms), and statistical tests for the utility results. Without these, it is not possible to confirm that the epsilon=5 conclusion is robust rather than influenced by post-hoc choices or task selection.
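The concern about joint structure in the first comment can be made concrete with a pairwise correlation fidelity score. The sketch below compares Spearman correlations between real and synthetic columns; the `correlation_gap` metric, the column names, and the five-row toy tables are assumptions for illustration, not the paper's evaluation code.

```python
def ranks(values):
    # Average ranks (1-based), so tied values share the mean of their positions.
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out

def spearman(xs, ys):
    # Pearson correlation of the ranks; undefined for constant columns.
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

def correlation_gap(real, synth, pairs):
    """Mean absolute difference in Spearman correlation over attribute pairs.

    `real` and `synth` map column name -> list of numbers.
    """
    diffs = [abs(spearman(real[a], real[b]) - spearman(synth[a], synth[b]))
             for a, b in pairs]
    return sum(diffs) / len(diffs)

# Toy tables with hypothetical column names.
real = {"sleep_hours": [6, 7, 8, 5, 9], "stress": [7, 5, 4, 8, 2]}
synth = {"sleep_hours": [6, 7, 8, 5, 9], "stress": [2, 4, 5, 7, 8]}
gap = correlation_gap(real, synth, [("sleep_hours", "stress")])
```

A gap near zero means the synthetic data preserves the rank relationship between the two attributes; in this toy case the relationship is badly distorted, so the gap is large.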
minor comments (1)
- [Abstract] The abstract could more explicitly state the specific predictive metrics and thresholds used to define 'adequate' utility and 'significant' privacy mitigation.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. The comments highlight important opportunities to strengthen the evaluation framework and experimental rigor. We address each major comment below and indicate which revisions will be made in the next version of the manuscript.
Point-by-point responses
- Referee: [Utility evaluation framework] The central claim that epsilon=5 preserves adequate predictive utility rests on an evaluation framework informed by actual uses of the LEMURS dataset. However, the manuscript does not demonstrate that the chosen predictive tasks and metrics capture critical statistical properties for behavioral-health research, such as joint distributions, temporal correlations between wearables and surveys, or heterogeneity across student subgroups. If these properties degrade faster under AIM at epsilon=5, the reported trade-off does not generalize to the full range of downstream analyses.
  Authors: We appreciate the referee's emphasis on broader statistical fidelity. Our predictive tasks were selected to reflect documented downstream uses of the LEMURS data in prior behavioral-health analyses. We agree that explicit checks on joint distributions, cross-modal correlations, and subgroup heterogeneity would better support generalizability claims. In the revised manuscript we will add marginal and pairwise correlation fidelity metrics between wearable and survey attributes, plus a basic subgroup stability check (e.g., by gender and academic major where sample sizes permit). We note that the LEMURS Phase 1 data are primarily cross-sectional summaries rather than fine-grained longitudinal traces, which limits the depth of temporal correlation analysis we can perform without additional data processing; however, we will report the available pairwise temporal alignments where they exist.
  Revision: partial
- Referee: [Methods and experimental setup] The abstract reports an epsilon sweep and utility evaluation, yet the methods description lacks error bars, baseline comparisons (e.g., non-private synthetic data or alternative DP mechanisms), and statistical tests for the utility results. Without these, it is not possible to confirm that the epsilon=5 conclusion is robust rather than influenced by post-hoc choices or task selection.
  Authors: We agree that these elements are needed for robustness. The revised manuscript will include (i) error bars computed over multiple independent runs of AIM for each epsilon value, (ii) a non-private synthetic baseline generated with the same AIM procedure but epsilon set to infinity, and (iii) paired statistical tests (e.g., Wilcoxon signed-rank) comparing utility metrics across epsilon values. We will also briefly discuss why alternative mechanisms such as PATE or DP-GAN were not included, given the mixed tabular structure of the LEMURS dataset and the computational constraints of the study.
  Revision: yes
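The error bars and paired test promised in the second response could look roughly like the sketch below. It computes a mean and standard error over runs and an exact two-sided Wilcoxon signed-rank test written out by hand. How runs are paired (for example by downstream task or by random seed) is an assumption, and the scores are arbitrary placeholders rather than results from the paper.

```python
from itertools import product
from statistics import mean, stdev

def error_bar(scores):
    # Mean and standard error over independent runs.
    return mean(scores), stdev(scores) / len(scores) ** 0.5

def wilcoxon_exact(xs, ys):
    """Two-sided exact Wilcoxon signed-rank test for paired samples.

    Assumes no zero differences and no ties among |differences|.
    Enumerates all 2**n sign patterns, so keep n small (about 15 or fewer).
    Returns (W_plus, p_value).
    """
    diffs = [x - y for x, y in zip(xs, ys)]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    rank = {idx: r + 1 for r, idx in enumerate(order)}
    w_plus = sum(rank[i] for i, d in enumerate(diffs) if d > 0)
    n = len(diffs)
    centre = n * (n + 1) / 4.0
    observed = abs(w_plus - centre)
    hits = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(range(1, n + 1), signs) if s)
        if abs(w - centre) >= observed:
            hits += 1
    return w_plus, hits / 2 ** n

# Toy placeholder scores for eight paired runs; NOT results from the paper.
scores_high_eps = [0.61, 0.64, 0.59, 0.66, 0.62, 0.68, 0.60, 0.65]
scores_low_eps = [0.60, 0.62, 0.56, 0.62, 0.57, 0.62, 0.53, 0.57]

bar_high = error_bar(scores_high_eps)
w_plus, p_value = wilcoxon_exact(scores_high_eps, scores_low_eps)
```

With real data, ties and zero differences are common, so a library implementation that handles them would be the safer choice; the hand-rolled version is only meant to show what the test computes.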
Circularity Check
No circularity: empirical application of external DP mechanism with held-out utility evaluation
The paper applies the established Adaptive Iterative Mechanism (AIM) for differential privacy to generate synthetic data from the LEMURS dataset and measures utility empirically across epsilon values on predictive tasks drawn from actual dataset uses. Privacy guarantees derive from the standard DP definition rather than any internal construction, and no prediction or result reduces to a fitted parameter or self-citation by definition. The central claims rest on measured trade-offs rather than self-referential steps, making the derivation self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- standard math: The Adaptive Iterative Mechanism satisfies epsilon-differential privacy for the chosen privacy budget.
- domain assumption: The utility framework based on actual LEMURS uses is representative of future analysts' needs.
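For reference, the first axiom refers to the standard definition of differential privacy, written here in its pure-epsilon form. The symbols (mechanism M, neighbouring datasets D and D', output set S) are the usual textbook ones and do not come from this summary. If the mechanism is run under a relaxed variant such as (epsilon, delta)-DP, an additive delta term appears on the right-hand side; which variant the paper uses is not stated here.

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[\mathcal{M}(D') \in S]
\quad \text{for all neighbouring } D, D' \text{ and all } S \subseteq \mathrm{Range}(\mathcal{M}).

% Worked values of the multiplicative bound e^{\varepsilon}:
% \varepsilon = 1:   e^{1}   \approx 2.72
% \varepsilon = 5:   e^{5}   \approx 148.4
% \varepsilon = 100: e^{100} \approx 2.7 \times 10^{43}
```

The bound is a worst-case ratio, so the large value at epsilon = 5 does not by itself say how much an attacker learns in practice; that is why the paper pairs the formal guarantee with empirical privacy-risk measurements.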
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem.
  Passage: We use the Adaptive Iterative Mechanism (AIM) to generate DP synthetic data... evaluate utility using regression models, Spearman correlation, UMAP, and L1/L2 marginal errors across epsilon = 1 to 100.
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean: J_uniquely_calibrated_via_higher_derivative (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem.
  Passage: synthetic data sets with epsilon = 5 preserve adequate predictive utility
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.