pith. sign in

arxiv: 1907.00782 · v1 · pith:RCTWPYF7new · submitted 2019-06-28 · 💻 cs.CR · cs.CY· cs.DB· cs.LG

Collecting and Analyzing Multidimensional Data with Local Differential Privacy

Pith reviewed 2026-05-25 13:49 UTC · model grok-4.3

classification 💻 cs.CR cs.CYcs.DBcs.LG
keywords local differential privacymultidimensional datanoise variancestochastic gradient descentdata perturbationnumeric attributescategorical attributesprivacy mechanisms
0
0 comments X

The pith

New local differential privacy mechanisms collect multidimensional numeric and categorical data with lower worst-case noise variance than prior methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper observes that even basic tasks like mean estimation under local differential privacy lack sufficient solutions for multidimensional data. It introduces perturbation mechanisms for numeric attributes that match or exceed existing accuracy measured by worst-case noise variance, then generalizes them to mixed numeric and categorical attributes where the new methods consistently improve on that metric. The same mechanisms support construction of an LDP-compliant stochastic gradient descent procedure. Real-data experiments confirm the accuracy gains over earlier approaches.

Core claim

The authors propose novel LDP mechanisms for collecting a numeric attribute whose accuracy is at least no worse (and usually better) than existing solutions in terms of worst-case noise variance. They extend these mechanisms to multidimensional data that can contain both numeric and categorical attributes, where the mechanisms always outperform existing solutions regarding worst-case noise variance. As a case study, they apply the solutions to build an LDP-compliant stochastic gradient descent algorithm.

What carries the argument

Perturbation mechanisms that minimize worst-case noise variance while satisfying local differential privacy for numeric attributes and their extension to mixed numeric-categorical multidimensional data.

If this is right

  • Mean estimation and similar statistics over single or multiple attributes become more accurate under LDP.
  • LDP-compliant stochastic gradient descent can train models with less noise than previous LDP baselines.
  • Systems collecting mixed-type user data can preserve more utility while meeting local privacy guarantees.
  • Basic data-collection primitives improve, which benefits any downstream LDP analysis that relies on them.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same variance-reduction approach might apply to other statistics or learning objectives if the perturbation design generalizes beyond means and gradients.
  • If worst-case variance aligns with observed utility in deployed systems, the methods could raise data quality in existing LDP deployments such as browser telemetry.
  • Evaluating performance when attributes are correlated could expose further gains or edge cases not addressed by the independent-attribute analysis.

Load-bearing premise

Worst-case noise variance fully captures practical utility, and the perturbation schemes meet the LDP definition for any input distribution without assumptions on attribute ranges or correlations.

What would settle it

A calculation or experiment on any multidimensional dataset with numeric and categorical attributes that shows an existing LDP mechanism achieving strictly lower worst-case noise variance than the proposed mechanisms.

Figures

Figures reproduced from arXiv: 1907.00782 by Ge Yu, Hyejin Shin, Junbum Shin, Jun Zhao, Ning Wang, Siu Cheung Hui, Xiaokui Xiao, Yin Yang.

Figure 1
Figure 1. Figure 1: Different mechanisms’ worst-case noise variances for [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The noisy output t ∗ i ’s probability density function pdf(t ∗ i ) in the Piecewise Mechanism. class of applications.) In contrast, Duchi et al.’s solution incurs a noise variance that increases with the decrease of |ti |, as shown in Equation 4. Now consider the estimator 1 n Pn i t ∗ i used by the aggregator to infer the mean value of all ti . The variance of this estimator is 1/n of the average variance… view at source ↗
Figure 3
Figure 3. Figure 3: The worst-case variance of PM (resp. HM) as a fraction of the worst-case variance of Duchi et al.’s solution, for various [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Result accuracy for mean estimation (on numeric attributes) and frequency estimation (on categorical attributes). [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Result accuracy on synthetic datasets with [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Result accuracy vs. privacy budget on different data [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Result accuracy vs. number of users. 1e-006 1e-005 0.0001 0.001 5 10 15 19 MSE dimensionality Laplace SCDF Duchi PM HM 1e-006 1e-005 0.0001 0.001 5 10 15 19 MSE dimensionality OUE Proposed solution (a) Numeric (b) Categorical [PITH_FULL_IMAGE:figures/full_fig_p010_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Result accuracy vs. dimensionality. Meanwhile, on numeric attributes, the proposed solutions PM and HM outperform Duchi et al.’s solution in all settings. This is because (i) although all three methods are asymptoti￾cally optimal, Duchi et al.’s solution incurs a larger constant than the proposed algorithms and (ii) Duchi et al.’s solution cannot handle categorical attributes, and, thus, needs to be combin… view at source ↗
Figure 9
Figure 9. Figure 9: Logistic Regression. Laplace Duchi et al. PM HM Non-private 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 0.5 1 2 4 misclassification rate privacy budget ε 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 0.5 1 2 4 misclassification rate privacy budget ε (a) BR (b) MX [PITH_FULL_IMAGE:figures/full_fig_p011_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Support Vector Machines (SVM). no clear trend for the performance gap between PM/HM and Duchi et al.’s solution [PITH_FULL_IMAGE:figures/full_fig_p011_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: demonstrates the mean squared error (MSE) of the linear regression model generated by each method with varying . We omit the MSE results for the Laplace mecha￾nism, since they are far higher than the other three methods. The proposed solutions PM and HM once again consistently outperform Duchi et al.’s solution. Overall, our experimental results demonstrate the effectiveness of PM and HM for empir￾ical r… view at source ↗
read the original abstract

Local differential privacy (LDP) is a recently proposed privacy standard for collecting and analyzing data, which has been used, e.g., in the Chrome browser, iOS and macOS. In LDP, each user perturbs her information locally, and only sends the randomized version to an aggregator who performs analyses, which protects both the users and the aggregator against private information leaks. Although LDP has attracted much research attention in recent years, the majority of existing work focuses on applying LDP to complex data and/or analysis tasks. In this paper, we point out that the fundamental problem of collecting multidimensional data under LDP has not been addressed sufficiently, and there remains much room for improvement even for basic tasks such as computing the mean value over a single numeric attribute under LDP. Motivated by this, we first propose novel LDP mechanisms for collecting a numeric attribute, whose accuracy is at least no worse (and usually better) than existing solutions in terms of worst-case noise variance. Then, we extend these mechanisms to multidimensional data that can contain both numeric and categorical attributes, where our mechanisms always outperform existing solutions regarding worst-case noise variance. As a case study, we apply our solutions to build an LDP-compliant stochastic gradient descent algorithm (SGD), which powers many important machine learning tasks. Experiments using real datasets confirm the effectiveness of our methods, and their advantages over existing solutions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes novel LDP mechanisms for numeric attributes that achieve worst-case noise variance at least as good as (and usually better than) prior solutions such as the Laplace mechanism. It extends these to multidimensional mixed numeric/categorical data, claiming strict outperformance on the same metric, and applies the mechanisms to construct an LDP-compliant SGD algorithm. Experiments on real datasets are reported to confirm the advantages.

Significance. If the variance bounds and LDP proofs hold, the work improves practical utility for LDP data collection tasks used in production systems (Chrome, iOS) and for downstream ML. Credit is due for focusing on an explicit, comparable metric (worst-case noise variance) and for including both mechanism constructions and an SGD case study with real-data experiments.

major comments (2)
  1. [§3.2] §3.2, variance analysis: the claim that the new numeric mechanism is 'at least no worse' than baselines requires an explicit inequality proof comparing the derived worst-case variance (e.g., Eq. (X)) against Duchi et al. and Laplace for all ε; without it the central improvement claim rests on the abstract statement alone.
  2. [§4.3] §4.3, multidimensional extension: the 'always outperform' statement for mixed attributes (Table 2) assumes attribute independence in the variance calculation; if the paper's LDP proof or variance bound does not address possible correlations, the outperformance may not hold for arbitrary input distributions.
minor comments (2)
  1. [Abstract] The abstract states accuracy claims without referencing the specific variance equations or theorems; add forward references to §3 and §4.
  2. [§3] Notation for the perturbation functions (numeric vs. categorical) would benefit from a summary table of input ranges and output domains.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thorough review and the recommendation for minor revision. We address the major comments below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [§3.2] §3.2, variance analysis: the claim that the new numeric mechanism is 'at least no worse' than baselines requires an explicit inequality proof comparing the derived worst-case variance (e.g., Eq. (X)) against Duchi et al. and Laplace for all ε; without it the central improvement claim rests on the abstract statement alone.

    Authors: We concur that an explicit proof is required to rigorously support the claim. The revised manuscript will include a dedicated lemma that proves the worst-case variance of our proposed numeric mechanism is at most that of the Laplace mechanism and Duchi et al.'s approach for every ε > 0. This follows directly from algebraic comparison of the respective variance formulas derived in §3.2. revision: yes

  2. Referee: [§4.3] §4.3, multidimensional extension: the 'always outperform' statement for mixed attributes (Table 2) assumes attribute independence in the variance calculation; if the paper's LDP proof or variance bound does not address possible correlations, the outperformance may not hold for arbitrary input distributions.

    Authors: The mechanisms for numeric and categorical attributes are applied independently to each attribute, and the worst-case noise variance is the sum of the per-attribute worst-case variances. This bound holds irrespective of any correlations among attributes, as the worst-case is taken over all possible input values for each attribute separately. The LDP property is satisfied locally per user without assuming independence. We will add a clarifying remark in §4.3 to emphasize that the outperformance claim is with respect to this worst-case metric and does not depend on data correlations. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces novel LDP mechanisms for numeric attributes and their multidimensional extensions, with claims of improved worst-case noise variance over prior solutions. The abstract presents these as explicit new constructions whose accuracy is analyzed and compared directly to existing work, without any self-definitional loops, fitted inputs relabeled as predictions, or load-bearing self-citations that collapse the central results to the inputs by construction. The derivation chain for the mechanisms, extensions, and SGD case study remains independent of the target claims, consistent with a self-contained presentation against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the work appears to rest on the standard definition of local differential privacy and the choice of worst-case noise variance as the utility metric.

pith-pipeline@v0.9.0 · 5804 in / 1028 out tokens · 21945 ms · 2026-05-25T13:49:56.282209+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 4 internal anchors

  1. [1]

    https://www.ipums.org

    Integrated public use microdata series, 2017. https://www.ipums.org

  2. [2]

    https://sites.google.com/site/icde2019tr/tr

    Collecting and analyzing multidimensional data with local differential privacy (technical report), 2018. https://sites.google.com/site/icde2019tr/tr

  3. [3]

    Avent, A

    B. Avent, A. Korolova, D. Zeber, T. Hovden, and B. Livshits. BLENDER: Enabling local search with a hybrid differential privacy model. In USENIX Security Symposium , pages 747–764, 2017

  4. [4]

    Bassily, K

    R. Bassily, K. Nissim, U. Stemmer, and A. Thakurta. Practical locally private heavy hitters. In NIPS, pages 2285–2293, 2017

  5. [5]

    Bassily and A

    R. Bassily and A. Smith. Local, private, efficient protocols for succinct histograms. In ACM STOC, pages 127–135, 2015

  6. [6]

    Bittau, U

    A. Bittau, U. Erlingsson, P. Maniatis, I. Mironov, A. Raghunathan, D. Lie, M. Rudominer, U. Kode, J. Tinnes, and B. Seefeld. PROCHLO: Strong privacy for analytics in the crowd. In ACM SOSP, pages 441– 459, 2017

  7. [7]

    M. Bun, J. Nelson, and U. Stemmer. Heavy hitters and the structure of local privacy. In ACM PODS, pages 435–447, 2018

  8. [8]

    R. Chen, H. Li, A. K. Qin, S. P. Kasiviswanathan, and H. Jin. Private spatial data aggregation in the local setting. In IEEE ICDE, pages 289– 300, 2016

  9. [9]

    Cormode, S

    G. Cormode, S. Jha, T. Kulkarni, N. Li, D. Srivastava, and T. Wang. Privacy at scale: Local differential privacy in practice. In ACM SIGMOD Tutorials, 2018

  10. [10]

    Cormode, T

    G. Cormode, T. Kulkarni, and D. Srivastava. Marginal release under local differential privacy. In ACM SIGMOD, 2018

  11. [11]

    Cortes and V

    C. Cortes and V . Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995

  12. [12]

    B. Ding, J. Kulkarni, and S. Yekhanin. Collecting telemetry data privately. In NIPS, pages 3574–3583, 2017

  13. [13]

    J. C. Duchi, M. I. Jordan, and M. J. Wainwright. Local privacy and statistical minimax rates. In IEEE FOCS, pages 429–438, 2013

  14. [14]

    J. C. Duchi, M. I. Jordan, and M. J. Wainwright. Minimax optimal procedures for locally private estimation. Journal of the American Statistical Association, 113(521):182–201, 2018

  15. [15]

    J. C. Duchi, M. J. Wainwright, and M. I. Jordan. Local privacy and minimax bounds: Sharp rates for probability estimation. In NIPS, pages 1529–1537, 2013

  16. [16]

    Dwork, F

    C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In TCC, pages 265–284, 2006

  17. [17]

    Dwork and A

    C. Dwork and A. Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science , 9(3- 4):211–407, 2014

  18. [18]

    Erlingsson, V

    ´U. Erlingsson, V . Pihur, and A. Korolova. RAPPOR: Randomized aggregatable privacy-preserving ordinal response. In ACM CCS, pages 1054–1067, 2014

  19. [19]

    Fanti, V

    G. Fanti, V . Pihur, and ´U. Erlingsson. Building a RAPPOR with the unknown: Privacy-preserving learning of associations and data dictionaries. PoPETs, 2016(3):41–61, 2016

  20. [20]

    Local Private Hypothesis Testing: Chi-Square Tests

    M. Gaboardi and R. Rogers. Local private hypothesis testing: Chi-square tests. arXiv preprint arXiv:1709.07155 , 2017

  21. [21]

    Q. Geng, P. Kairouz, S. Oh, and P. Viswanath. The staircase mechanism in differential privacy. J. Sel. Topics Signal Processing, 9(7):1176–1184, 2015

  22. [22]

    J. Hamm, A. C. Champion, G. Chen, M. Belkin, and D. Xuan. Crowd- ML: A privacy-preserving learning framework for a crowd of smart devices. In IEEE ICDCS, pages 11–20, 2015

  23. [23]

    Kairouz, K

    P. Kairouz, K. Bonawitz, and D. Ramage. Discrete distribution estima- tion under local privacy. In ICML, 2016

  24. [24]

    Kairouz, S

    P. Kairouz, S. Oh, and P. Viswanath. Extremal mechanisms for local differential privacy. In NIPS, pages 2879–2887, 2014

  25. [25]

    S. P. Kasiviswanathan, H. K. Lee, K. Nissim, S. Raskhodnikova, and A. Smith. What can we learn privately? In IEEE FOCS, pages 531–540, 2008

  26. [26]

    J. W. Kim, D.-H. Kim, and B. Jang. Application of local differential privacy to collection of indoor positioning data. IEEE Access, 6:4276– 4286, 2018

  27. [27]

    Krishnan, J

    S. Krishnan, J. Wang, M. J. Franklin, K. Goldberg, and T. Kraska. PrivateClean: Data cleaning and differential privacy. In ACM SIGMOD, pages 937–951, 2016

  28. [28]

    C. Liu, S. Chakraborty, and P. Mittal. DEEProtect: Enabling inference- based access control on mobile sensing applications. arXiv preprint arXiv:1702.06159, 2017

  29. [29]

    McSherry and K

    F. McSherry and K. Talwar. Mechanism design via differential privacy. In IEEE FOCS, pages 94–103, 2007

  30. [30]

    Murakami, H

    T. Murakami, H. Hino, and J. Sakuma. Toward distribution estima- tion under local differential privacy with small samples. PoPETs, 2018(3):84–104, 2018

  31. [31]

    Nissim and U

    K. Nissim and U. Stemmer. Clustering algorithms for the centralized and local models. In Algorithmic Learning Theory (ALT) , pages 619–653, Apr 2018

  32. [32]

    Pastore and M

    A. Pastore and M. Gastpar. Locally differentially-private distribution estimation. In IEEE International Symposium on Information Theory (ISIT), pages 2694–2698, 2016

  33. [33]

    Z. Qin, Y . Yang, T. Yu, I. Khalil, X. Xiao, and K. Ren. Heavy hitter estimation over set-valued data with local differential privacy. In ACM CCS, pages 192–203, 2016

  34. [34]

    X. Ren, C. M. Yu, W. Yu, S. Yang, X. Yang, J. A. McCann, and P. S. Yu. LoPub: High-dimensional crowdsourced data publication with local differential privacy. IEEE Transactions on Information Forensics and Security, 13(9):2151–2166, Sept 2018

  35. [35]

    Soria-Comas and J

    J. Soria-Comas and J. Domingo-Ferrer. Optimal data-independent noise for differential privacy. Inf. Sci., 250:200–214, 2013

  36. [36]

    J. Tang, A. Korolova, X. Bai, X. Wang, and X. Wang. Privacy loss in Apple’s implementation of differential privacy on macOS 10.12. arXiv preprint arXiv:1709.02753, 2017

  37. [37]

    N. Wang, X. Xiao, Y . Yang, T. D. Hoang, H. Shin, J. Shin, and G. Yu. PrivTrie: Effective frequent term discovery under local differential privacy. In IEEE ICDE, 2018

  38. [38]

    T. Wang, J. Blocki, N. Li, and S. Jha. Locally differentially private protocols for frequency estimation. In USENIX Security, 2017

  39. [39]

    T. Wang, N. Li, and S. Jha. Locally differentially private heavy hitter identification. arXiv preprint arXiv:1708.06674 , 2017

  40. [40]

    T. Wang, N. Li, and S. Jha. Locally differentially private frequent itemset mining. In IEEE Symposium on Security and Privacy (S&P) , 2018

  41. [41]

    S. L. Warner. Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association , 60(309):63–69, 1965

  42. [42]

    Ye and A

    M. Ye and A. Barg. Optimal schemes for discrete distribution estimation under local differential privacy. In IEEE International Symposium on Information Theory (ISIT) , pages 759–763, 2017. 12