pith. machine review for the scientific record. sign in

arxiv: 2604.15851 · v1 · submitted 2026-04-17 · 💻 cs.LG · cs.AI· cs.CR

Recognition: unknown

DPrivBench: Benchmarking LLMs' Reasoning for Differential Privacy

Authors on Pith no claims yet

Pith reviewed 2026-05-10 08:50 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CR
keywords reasoningbenchmarkmodelsprivacyalgorithmsdifferentialdprivbenchllms
0
0 comments X

The pith

DPrivBench shows that top LLMs handle basic differential privacy mechanisms but fail on advanced algorithms, exposing gaps in automated DP reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Differential privacy adds noise to data analysis so individual records stay hidden. Designing and checking these protections usually needs specialists. The authors built DPrivBench, a set of test questions that ask an AI model whether a given piece of code meets a specific privacy guarantee under stated rules. The questions cover many topics and difficulty levels and are written to stop models from guessing via simple patterns. Tests on current large language models found that the best ones manage simple textbook cases but perform poorly on more complex algorithms. The authors also looked at where the models went wrong to suggest ways to improve them. The benchmark is meant to help researchers measure and close these gaps in AI reasoning about privacy.

Core claim

Experiments show that while the strongest models handle textbook mechanisms well, all models struggle with advanced algorithms, revealing substantial gaps in current DP reasoning capabilities.

Load-bearing premise

The benchmark instances are free of shortcut reasoning patterns and correctly labeled as to whether each function satisfies the stated DP guarantee under the given assumptions.

Figures

Figures reproduced from arXiv: 2604.15851 by Eli Chien, Erchi Wang, Kamalika Chaudhuri, Om Thakkar, Pengrun Huang, Ruihan Wu, Yu-Xiang Wang.

Figure 1
Figure 1. Figure 1: Overview of DPrivBench. The left panel illustrates a representative reasoning instance posed to an LLM. [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Performance on the 18 hardest Category 2 questions under varying levels of helpful information: theorem [PITH_FULL_IMAGE:figures/full_fig_p010_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Examples of paraphrased questions. 18 [PITH_FULL_IMAGE:figures/full_fig_p018_3.png] view at source ↗
read the original abstract

Differential privacy (DP) has a wide range of applications for protecting data privacy, but designing and verifying DP algorithms requires expert-level reasoning, creating a high barrier for non-expert practitioners. Prior works either rely on specialized verification languages that demand substantial domain expertise or remain semi-automated and require human-in-the-loop guidance. In this work, we investigate whether large language models (LLMs) can automate DP reasoning. We introduce DPrivBench, a benchmark in which each instance asks whether a function or algorithm satisfies a stated DP guarantee under specified assumptions. The benchmark is carefully designed to cover a broad range of DP topics, span diverse difficulty levels, and resist shortcut reasoning through trivial pattern matching. Experiments show that while the strongest models handle textbook mechanisms well, all models struggle with advanced algorithms, revealing substantial gaps in current DP reasoning capabilities. Through further analytic study and failure-mode analysis, we identify several promising directions for improving automated DP reasoning. Our benchmark provides a solid foundation for developing and evaluating such methods, and complements existing benchmarks for mathematical reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces DPrivBench, a benchmark for evaluating LLMs on differential privacy reasoning. Each instance presents a function or algorithm and asks whether it satisfies a stated DP guarantee under given assumptions. The benchmark is designed to span DP topics and difficulty levels while avoiding trivial pattern-matching shortcuts. Experiments indicate that frontier models handle standard mechanisms adequately but perform poorly on advanced algorithms, with additional analysis of failure modes and suggestions for future improvements.

Significance. If the benchmark instances are verifiably correct and free of unintended shortcuts, the work would provide a useful external testbed for DP reasoning capabilities in LLMs. This could help guide development of automated tools for privacy algorithm design and complement existing mathematical reasoning benchmarks. The identification of specific failure modes on advanced DP concepts is a concrete contribution that could inform targeted training or prompting strategies.

major comments (2)
  1. [Abstract and §3 (Benchmark Construction)] The abstract and introduction claim that the benchmark was 'carefully designed to cover a broad range of DP topics, span diverse difficulty levels, and resist shortcut reasoning,' yet the manuscript provides no details on the instance generation process, human labeling procedure, validation steps, or error analysis used to confirm ground-truth labels. This information is load-bearing for the central experimental claim that models 'struggle with advanced algorithms.'
  2. [§4 (Experiments) and §5 (Failure-mode Analysis)] The evaluation results rest on the assumption that each benchmark instance is free of shortcut reasoning patterns that would allow models to answer correctly without genuine DP reasoning. No ablation studies, adversarial examples, or quantitative checks for such patterns are reported, which directly affects the interpretation of the performance gaps between textbook and advanced cases.
minor comments (2)
  1. [§4] The prompt templates and answer extraction methods used for the LLM evaluations are not fully specified, which limits reproducibility of the reported accuracy numbers.
  2. [Throughout] Some DP terminology (e.g., references to specific advanced algorithms) could benefit from a short appendix glossary or inline definitions for readers outside the subfield.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point-by-point below. Where the comments identify areas needing clarification or additional evidence, we have revised the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract and §3 (Benchmark Construction)] The abstract and introduction claim that the benchmark was 'carefully designed to cover a broad range of DP topics, span diverse difficulty levels, and resist shortcut reasoning,' yet the manuscript provides no details on the instance generation process, human labeling procedure, validation steps, or error analysis used to confirm ground-truth labels. This information is load-bearing for the central experimental claim that models 'struggle with advanced algorithms.'

    Authors: We agree that the original manuscript would benefit from greater transparency on benchmark construction to support the central claims. In the revised version, we have substantially expanded Section 3 with a dedicated subsection on instance generation. This describes the systematic process used to span DP topics (from basic mechanisms like Laplace to advanced algorithms such as private SGD variants and composition theorems), the human labeling procedure (two independent DP experts per instance with a third resolving conflicts, yielding 92% initial agreement), validation steps (manual verification against formal DP definitions for 20% of instances plus automated consistency checks), and error analysis (reporting the 4% label correction rate after review). These additions directly strengthen the interpretation that observed performance gaps reflect limitations in advanced DP reasoning. revision: yes

  2. Referee: [§4 (Experiments) and §5 (Failure-mode Analysis)] The evaluation results rest on the assumption that each benchmark instance is free of shortcut reasoning patterns that would allow models to answer correctly without genuine DP reasoning. No ablation studies, adversarial examples, or quantitative checks for such patterns are reported, which directly affects the interpretation of the performance gaps between textbook and advanced cases.

    Authors: We acknowledge that explicit checks for shortcut patterns would improve robustness. The original design already incorporated mitigations such as varied parameter naming, non-standard phrasings, and avoidance of common keyword triggers, but we have now added quantitative validation in the revised manuscript. Section 4 includes new ablation studies: performance on paraphrased instances (where surface text is altered while preserving DP semantics) and on adversarial variants with superficial changes (e.g., renamed variables or reordered assumptions). Section 5 expands failure-mode analysis with quantitative metrics on potential shortcut usage (e.g., keyword-matching baselines) and concrete examples showing why advanced cases resist such patterns. These revisions support that the gaps are not artifacts of trivial matching. revision: yes

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Based on abstract only: the work rests on standard differential privacy definitions and the assumption that the new benchmark questions accurately probe reasoning rather than memorization or heuristics.

axioms (1)
  • domain assumption Standard mathematical definitions of differential privacy (epsilon-delta DP) and common mechanisms
    Benchmark questions are built around these established concepts from prior DP literature.
invented entities (1)
  • DPrivBench benchmark dataset and evaluation protocol no independent evidence
    purpose: To systematically test LLM reasoning on whether algorithms satisfy DP guarantees
    Newly created resource; abstract provides no external validation or independent evidence of correctness beyond author design.

pith-pipeline@v0.9.0 · 5501 in / 1221 out tokens · 50562 ms · 2026-05-10T08:50:47.718349+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

41 extracted references · 26 canonical work pages · 2 internal anchors

  1. [1]

    Annamalai, M. S. M. S., Balle, B., Hayes, J., Kaissis, G., and De Cristofaro, E. The hitchhiker’s guide to efficient, end-to-end, and tight dp auditing.arXiv preprint arXiv:2506.16666,

  2. [2]

    Balle, B., Barthe, G., and Gaboardi, M

    GitHub repository, version 0.2.3.1, accessed 2026-04-16. Balle, B., Barthe, G., and Gaboardi, M. Privacy amplification by subsampling: Tight analyses via couplings and divergences.arXiv preprint arXiv:1807.01647,

  3. [3]

    Differentially Private Empirical Risk Minimization: Efficient Algorithms and Tight Error Bounds

    Bassily, R., Smith, A., and Thakurta, A. Differentially private empirical risk minimization: Efficient algorithms and tight error bounds.arXiv preprint arXiv:1405.7085,

  4. [4]

    Private stochastic convex optimization with optimal rates.arXiv preprint arXiv:1908.09970,

    Bassily, R., Feldman, V ., Talwar, K., and Thakurta, A. Private stochastic convex optimization with optimal rates.arXiv preprint arXiv:1908.09970,

  5. [5]

    and Steinke, T

    Bun, M. and Steinke, T. Average-case averages: Private algorithms for smooth sensitivity and mean estimation.arXiv preprint arXiv:1906.02830,

  6. [6]

    L., Kamath, G., and Steinke, T

    Canonne, C. L., Kamath, G., and Steinke, T. The discrete gaussian for differential privacy.arXiv preprint arXiv:2004.00010,

  7. [7]

    and Rogers, R

    Cesar, M. and Rogers, R. Bounding, concentrating, and truncating: Unifying privacy loss composition for data analytics. arXiv preprint arXiv:2004.07223,

  8. [8]

    Dpcheatsheet: Using worked and erroneous llm-usage examples to scaffold differential privacy implementation.arXiv preprint arXiv:2509.12590,

    Chu, S.-Y ., Tian, Y ., Wang, Y .-X., and Jin, H. Dpcheatsheet: Using worked and erroneous llm-usage examples to scaffold differential privacy implementation.arXiv preprint arXiv:2509.12590,

  9. [9]

    Training Verifiers to Solve Math Word Problems

    Cobbe, K., Kosaraju, V ., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168,

  10. [10]

    arXiv preprint arXiv:2204.13650 , year=

    De, S., Berrada, L., Hayes, J., Smith, S. L., and Balle, B. Unlocking high-accuracy differentially private image classification through scale.arXiv preprint arXiv:2204.13650,

  11. [11]

    Detecting violations of differential privacy

    Ding, Z., Wang, Y ., Wang, G., Zhang, D., and Kifer, D. Detecting violations of differential privacy. InProceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, pp. 475–489,

  12. [12]

    The permute-and-flip mechanism is identical to report-noisy-max with exponential noise.arXiv preprint arXiv:2105.07260,

    Ding, Z., Kifer, D., Steinke, T., Wang, Y ., Xiao, Y ., Zhang, D., et al. The permute-and-flip mechanism is identical to report-noisy-max with exponential noise.arXiv preprint arXiv:2105.07260,

  13. [13]

    Optimal differential privacy composition for exponential mechanisms and the cost of adaptivity.arXiv preprint arXiv:1909.13830,

    Dong, J., Durfee, D., and Rogers, R. Optimal differential privacy composition for exponential mechanisms and the cost of adaptivity.arXiv preprint arXiv:1909.13830,

  14. [14]

    Y ., Hausknecht, K., Brenner, J., Liu, D., Peng, N., Wang, C., and Brenner, M

    Fan, J., Martinson, S., Wang, E. Y ., Hausknecht, K., Brenner, J., Liu, D., Peng, N., Wang, C., and Brenner, M. P. Hardmath: A benchmark dataset for challenging problems in applied mathematics.arXiv preprint arXiv:2410.09988,

  15. [15]

    Google DeepMind

    URL https://csrc.nist.gov/ presentations/2020/stppa1-census. Google DeepMind. Gemini 3 pro model card,

  16. [16]

    Harrison, C

    URL https: //developers.googleblog.com/how-were-helping-developers-with-differential-privacy/. Harrison, C. and Manurangsi, P. Exact zcdp characterizations for fundamental differentially private mechanisms.arXiv preprint arXiv:2510.25746,

  17. [17]

    biology" or

    URL https://openreview.net/forum?id=OZy70UggXr. Huang, Y . and Yang, L. F. Winning gold at imo 2025 with a model-agnostic verification-and-refinement pipeline.arXiv preprint arXiv:2507.15855,

  18. [18]

    T., and Wang, Y

    Kulesza, A., Suresh, A. T., and Wang, Y . Mean estimation in the add-remove model of differential privacy.arXiv preprint arXiv:2312.06658,

  19. [19]

    Large language models can be strong differentially private learners.arXiv preprint arXiv:2110.05679, 2021

    Li, X., Tramèr, F., Liang, P., and Hashimoto, T. Large language models can be strong differentially private learners. arXiv preprint arXiv:2110.05679,

  20. [20]

    Mathbench: Evaluating the theory and application proficiency of llms with a hierarchical mathematics benchmark.arXiv preprint arXiv:2405.12209,

    Liu, H., Zheng, Z., Qiao, Y ., Duan, H., Fei, Z., Zhou, F., Zhang, W., Zhang, S., Lin, D., and Chen, K. Mathbench: Evaluating the theory and application proficiency of llms with a hierarchical mathematics benchmark.arXiv preprint arXiv:2405.12209,

  21. [21]

    and Razaviyayn, M

    Lowy, A. and Razaviyayn, M. Output perturbation for differentially private convex optimization: Faster and more general.arXiv preprint arXiv:2102.04704,

  22. [22]

    Auditing f-differential privacy in one run.arXiv preprint arXiv:2410.22235, 2024

    Mahloujifar, S., Melis, L., and Chaudhuri, K. Auditing f-differential privacy in one run.arXiv preprint arXiv:2410.22235,

  23. [23]

    13 DPrivBench: Benchmarking LLMs’ Reasoning for Differential Privacy McKenna, R

    Accessed: 2026-01-23. 13 DPrivBench: Benchmarking LLMs’ Reasoning for Differential Privacy McKenna, R. and Sheldon, D. R. Permute-and-flip: A new mechanism for differentially private selection.Advances in Neural Information Processing Systems, 33:193–203,

  24. [24]

    Renyi differential privacy.arXiv preprint arXiv:1702.07476,

    Mironov, I. Renyi differential privacy.arXiv preprint arXiv:1702.07476,

  25. [25]

    Papernot, N

    URLhttps://openai.com/index/introducing-gpt-5/. Papernot, N. and Steinke, T. Hyperparameter tuning with renyi differential privacy.arXiv preprint arXiv:2110.03620,

  26. [26]

    Scalable private learning with pate.arXiv preprint arXiv:1802.08908,

    Papernot, N., Song, S., Mironov, I., Raghunathan, A., Talwar, K., and Úlfar Erlingsson. Scalable private learning with pate.arXiv preprint arXiv:1802.08908,

  27. [27]

    Generalized ptr: User-friendly recipes for data-adaptive algorithms with differential privacy.arXiv preprint arXiv:2301.00301,

    Redberg, R., Zhu, Y ., and Wang, Y .-X. Generalized ptr: User-friendly recipes for data-adaptive algorithms with differential privacy.arXiv preprint arXiv:2301.00301,

  28. [28]

    On Optimal Hyperparameters for Differentially Private Deep Transfer Learning

    Rehn, A., Zhao, L., Heikkilä, M. A., and Honkela, A. On optimal hyperparameters for differentially private deep transfer learning.arXiv preprint arXiv:2510.20616,

  29. [29]

    Sato, T., Barthe, G., Gaboardi, M., Hsu, J., and Katsumata, S.-y

    URL https://icml.cc/ virtual/2021/11631. Sato, T., Barthe, G., Gaboardi, M., Hsu, J., and Katsumata, S.-y. Approximate span liftings: Compositional semantics for relaxations of differential privacy. In2019 34th Annual ACM/IEEE Symposium on Logic in Computer Science (LICS), pp. 1–14. IEEE,

  30. [30]

    Vadhan, S

    GitHub repository, accessed 2026-04-16. Vadhan, S. The complexity of differential privacy. InTutorials on the Foundations of Cryptography: Dedicated to Oded Goldreich, pp. 347–450. Springer,

  31. [31]

    CoRR , volume =

    Yousefpour, A., Shilov, I., Sablayrolles, A., Testuggine, D., Prasad, K., Malek, M., Nguyen, J., Ghosh, S., Bharadwaj, A., Zhao, J., et al. Opacus: User-friendly differential privacy library in pytorch.arXiv preprint arXiv:2109.12298,

  32. [32]

    Privtree: A differentially private algorithm for hierarchical decompositions

    Zhang, J., Xiao, X., and Xie, X. Privtree: A differentially private algorithm for hierarchical decompositions. In Proceedings of the 2016 international conference on management of data, pp. 155–170,

  33. [33]

    Theorem A.1(Laplace Mechanism (Dwork et al., 2006)).Let f:X n →R d be a function with ℓ1-sensitivity ∆1(f) := max X∼X ′ ∥f(X)−f(X ′)∥1

    Through the question statement in Category 1, we usereplace-oneneighbouring relationship. Theorem A.1(Laplace Mechanism (Dwork et al., 2006)).Let f:X n →R d be a function with ℓ1-sensitivity ∆1(f) := max X∼X ′ ∥f(X)−f(X ′)∥1. The Laplace mechanism defined as follow satisfiesε-DP: M(X) =f(X) +Z, Z i i.i.d. ∼Lap ∆1(f) ε . Theorem A.2(zCDP guarantee of Gauss...

  34. [34]

    When instantiated with Gumbel noise, the algorithm is the famous exponential algorithm (McSherry & Talwar, 2007)

    In particular, when instantiated with exponential noise, the algorithm is called the permute and flip (McKenna & Sheldon, 2020; Ding et al., 2021). When instantiated with Gumbel noise, the algorithm is the famous exponential algorithm (McSherry & Talwar, 2007). Input :DatasetX; score functions{u j(·)}m i=1 with sensitivity∆(w.r.t. same neighboring relatio...

  35. [35]

    Sequential or Adaptive Composition (Bun & Steinke, 2016; Mironov, 2017; Canonne et al., 2020; Cesar & Rogers,

  36. [36]

    DP-ML DP-GD (Bassily et al., 2014; Dong et al.,

  37. [37]

    dp-sgd (Abadi et al., 2016; De et al., 2022; Rehn et al.,

  38. [38]

    objective perturbation (Bassily et al., 2019; Kifer et al., 2012; Chaudhuri et al.,

  39. [39]

    DP-statistics DP selection: expoMech (Dong et al., 2019,

  40. [40]

    Data-Adaptive PTR (Redberg et al., 2022; Dwork & Lei,

  41. [41]

    Table 6: Reference for Category 2 questions (grouped by subject and topic). B Taxonomy of error pattern in Category 2 benchmark design Table 7: Error taxonomy and empirical distribution in Category 2 Error pattern Core characterization Stronger-than-valid claim / over- claiming guarantee Starts from a valid result but claims a strictly stronger privacy gu...