arxiv: 2604.15851 · v1 · submitted 2026-04-17 · 💻 cs.LG · cs.AI· cs.CR

Recognition: unknown

DPrivBench: Benchmarking LLMs' Reasoning for Differential Privacy

Erchi Wang , Pengrun Huang , Eli Chien , Om Thakkar , Kamalika Chaudhuri , Yu-Xiang Wang , Ruihan Wu

Authors on Pith no claims yet

Pith reviewed 2026-05-10 08:50 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CR

keywords reasoningbenchmarkmodelsprivacyalgorithmsdifferentialdprivbenchllms

0 comments

The pith

DPrivBench shows that top LLMs handle basic differential privacy mechanisms but fail on advanced algorithms, exposing gaps in automated DP reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Differential privacy adds noise to data analysis so individual records stay hidden. Designing and checking these protections usually needs specialists. The authors built DPrivBench, a set of test questions that ask an AI model whether a given piece of code meets a specific privacy guarantee under stated rules. The questions cover many topics and difficulty levels and are written to stop models from guessing via simple patterns. Tests on current large language models found that the best ones manage simple textbook cases but perform poorly on more complex algorithms. The authors also looked at where the models went wrong to suggest ways to improve them. The benchmark is meant to help researchers measure and close these gaps in AI reasoning about privacy.

Core claim

Experiments show that while the strongest models handle textbook mechanisms well, all models struggle with advanced algorithms, revealing substantial gaps in current DP reasoning capabilities.

Load-bearing premise

The benchmark instances are free of shortcut reasoning patterns and correctly labeled as to whether each function satisfies the stated DP guarantee under the given assumptions.

Figures

Figures reproduced from arXiv: 2604.15851 by Eli Chien, Erchi Wang, Kamalika Chaudhuri, Om Thakkar, Pengrun Huang, Ruihan Wu, Yu-Xiang Wang.

**Figure 1.** Figure 1: Overview of DPrivBench. The left panel illustrates a representative reasoning instance posed to an LLM. [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗

**Figure 2.** Figure 2: Performance on the 18 hardest Category 2 questions under varying levels of helpful information: theorem [PITH_FULL_IMAGE:figures/full_fig_p010_2.png] view at source ↗

**Figure 3.** Figure 3: Examples of paraphrased questions. 18 [PITH_FULL_IMAGE:figures/full_fig_p018_3.png] view at source ↗

read the original abstract

Differential privacy (DP) has a wide range of applications for protecting data privacy, but designing and verifying DP algorithms requires expert-level reasoning, creating a high barrier for non-expert practitioners. Prior works either rely on specialized verification languages that demand substantial domain expertise or remain semi-automated and require human-in-the-loop guidance. In this work, we investigate whether large language models (LLMs) can automate DP reasoning. We introduce DPrivBench, a benchmark in which each instance asks whether a function or algorithm satisfies a stated DP guarantee under specified assumptions. The benchmark is carefully designed to cover a broad range of DP topics, span diverse difficulty levels, and resist shortcut reasoning through trivial pattern matching. Experiments show that while the strongest models handle textbook mechanisms well, all models struggle with advanced algorithms, revealing substantial gaps in current DP reasoning capabilities. Through further analytic study and failure-mode analysis, we identify several promising directions for improving automated DP reasoning. Our benchmark provides a solid foundation for developing and evaluating such methods, and complements existing benchmarks for mathematical reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DPrivBench introduces a useful new test set for LLM reasoning on differential privacy but leaves key construction details thin.

read the letter

The colleague should know that this paper creates DPrivBench, a set of questions asking whether given functions meet stated differential privacy guarantees, and reports that top models manage basic mechanisms but all of them fail on advanced algorithms. That gap is the main takeaway worth noting for anyone thinking about LLMs as privacy assistants. The benchmark itself is the actual new piece. It tries to span many DP topics and difficulty levels while blocking trivial pattern matching, which is a reasonable design goal and not just another math word problem set. The experiments give a clear picture of where current models sit, and the authors flag directions for improvement, which is practical. The paper does well at framing the problem for non-expert use of DP and at positioning the benchmark as complementary to existing math-reasoning tests. The soft spots are in the benchmark construction. The abstract claims careful coverage and anti-shortcut measures, but without explicit checks on labeling accuracy, human validation of the instances, or a breakdown of why models fail, it is hard to know how much the reported struggles reflect real reasoning limits versus possible label noise or unintended shortcuts. If the full paper has solid error analysis and reproducible instance generation, that concern shrinks; otherwise the central claim rests on less evidence than it should. No obvious math or citation problems appear. This is for researchers working on privacy-preserving ML, LLM evaluation in technical domains, or tools that lower the barrier to DP. A reader who needs a starting point for testing automated DP reasoning will get concrete data and failure examples to build on. It deserves a serious referee because a benchmark in this niche can move the field forward even if the current version needs tightening on validation. I would send it out for review rather than desk reject.

Referee Report

2 major / 2 minor

Summary. The paper introduces DPrivBench, a benchmark for evaluating LLMs on differential privacy reasoning. Each instance presents a function or algorithm and asks whether it satisfies a stated DP guarantee under given assumptions. The benchmark is designed to span DP topics and difficulty levels while avoiding trivial pattern-matching shortcuts. Experiments indicate that frontier models handle standard mechanisms adequately but perform poorly on advanced algorithms, with additional analysis of failure modes and suggestions for future improvements.

Significance. If the benchmark instances are verifiably correct and free of unintended shortcuts, the work would provide a useful external testbed for DP reasoning capabilities in LLMs. This could help guide development of automated tools for privacy algorithm design and complement existing mathematical reasoning benchmarks. The identification of specific failure modes on advanced DP concepts is a concrete contribution that could inform targeted training or prompting strategies.

major comments (2)

[Abstract and §3 (Benchmark Construction)] The abstract and introduction claim that the benchmark was 'carefully designed to cover a broad range of DP topics, span diverse difficulty levels, and resist shortcut reasoning,' yet the manuscript provides no details on the instance generation process, human labeling procedure, validation steps, or error analysis used to confirm ground-truth labels. This information is load-bearing for the central experimental claim that models 'struggle with advanced algorithms.'
[§4 (Experiments) and §5 (Failure-mode Analysis)] The evaluation results rest on the assumption that each benchmark instance is free of shortcut reasoning patterns that would allow models to answer correctly without genuine DP reasoning. No ablation studies, adversarial examples, or quantitative checks for such patterns are reported, which directly affects the interpretation of the performance gaps between textbook and advanced cases.

minor comments (2)

[§4] The prompt templates and answer extraction methods used for the LLM evaluations are not fully specified, which limits reproducibility of the reported accuracy numbers.
[Throughout] Some DP terminology (e.g., references to specific advanced algorithms) could benefit from a short appendix glossary or inline definitions for readers outside the subfield.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point-by-point below. Where the comments identify areas needing clarification or additional evidence, we have revised the manuscript accordingly.

read point-by-point responses

Referee: [Abstract and §3 (Benchmark Construction)] The abstract and introduction claim that the benchmark was 'carefully designed to cover a broad range of DP topics, span diverse difficulty levels, and resist shortcut reasoning,' yet the manuscript provides no details on the instance generation process, human labeling procedure, validation steps, or error analysis used to confirm ground-truth labels. This information is load-bearing for the central experimental claim that models 'struggle with advanced algorithms.'

Authors: We agree that the original manuscript would benefit from greater transparency on benchmark construction to support the central claims. In the revised version, we have substantially expanded Section 3 with a dedicated subsection on instance generation. This describes the systematic process used to span DP topics (from basic mechanisms like Laplace to advanced algorithms such as private SGD variants and composition theorems), the human labeling procedure (two independent DP experts per instance with a third resolving conflicts, yielding 92% initial agreement), validation steps (manual verification against formal DP definitions for 20% of instances plus automated consistency checks), and error analysis (reporting the 4% label correction rate after review). These additions directly strengthen the interpretation that observed performance gaps reflect limitations in advanced DP reasoning. revision: yes
Referee: [§4 (Experiments) and §5 (Failure-mode Analysis)] The evaluation results rest on the assumption that each benchmark instance is free of shortcut reasoning patterns that would allow models to answer correctly without genuine DP reasoning. No ablation studies, adversarial examples, or quantitative checks for such patterns are reported, which directly affects the interpretation of the performance gaps between textbook and advanced cases.

Authors: We acknowledge that explicit checks for shortcut patterns would improve robustness. The original design already incorporated mitigations such as varied parameter naming, non-standard phrasings, and avoidance of common keyword triggers, but we have now added quantitative validation in the revised manuscript. Section 4 includes new ablation studies: performance on paraphrased instances (where surface text is altered while preserving DP semantics) and on adversarial variants with superficial changes (e.g., renamed variables or reordered assumptions). Section 5 expands failure-mode analysis with quantitative metrics on potential shortcut usage (e.g., keyword-matching baselines) and concrete examples showing why advanced cases resist such patterns. These revisions support that the gaps are not artifacts of trivial matching. revision: yes

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Based on abstract only: the work rests on standard differential privacy definitions and the assumption that the new benchmark questions accurately probe reasoning rather than memorization or heuristics.

axioms (1)

domain assumption Standard mathematical definitions of differential privacy (epsilon-delta DP) and common mechanisms
Benchmark questions are built around these established concepts from prior DP literature.

invented entities (1)

DPrivBench benchmark dataset and evaluation protocol no independent evidence
purpose: To systematically test LLM reasoning on whether algorithms satisfy DP guarantees
Newly created resource; abstract provides no external validation or independent evidence of correctness beyond author design.

pith-pipeline@v0.9.0 · 5501 in / 1221 out tokens · 50562 ms · 2026-05-10T08:50:47.718349+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

41 extracted references · 26 canonical work pages · 2 internal anchors

[1]

Annamalai, M. S. M. S., Balle, B., Hayes, J., Kaissis, G., and De Cristofaro, E. The hitchhiker’s guide to efficient, end-to-end, and tight dp auditing.arXiv preprint arXiv:2506.16666,

work page arXiv
[2]

Balle, B., Barthe, G., and Gaboardi, M

GitHub repository, version 0.2.3.1, accessed 2026-04-16. Balle, B., Barthe, G., and Gaboardi, M. Privacy amplification by subsampling: Tight analyses via couplings and divergences.arXiv preprint arXiv:1807.01647,

work page arXiv 2026
[3]

Differentially Private Empirical Risk Minimization: Efficient Algorithms and Tight Error Bounds

Bassily, R., Smith, A., and Thakurta, A. Differentially private empirical risk minimization: Efficient algorithms and tight error bounds.arXiv preprint arXiv:1405.7085,

work page Pith review arXiv
[4]

Private stochastic convex optimization with optimal rates.arXiv preprint arXiv:1908.09970,

Bassily, R., Feldman, V ., Talwar, K., and Thakurta, A. Private stochastic convex optimization with optimal rates.arXiv preprint arXiv:1908.09970,

work page arXiv 1908
[5]

and Steinke, T

Bun, M. and Steinke, T. Average-case averages: Private algorithms for smooth sensitivity and mean estimation.arXiv preprint arXiv:1906.02830,

work page arXiv 1906
[6]

L., Kamath, G., and Steinke, T

Canonne, C. L., Kamath, G., and Steinke, T. The discrete gaussian for differential privacy.arXiv preprint arXiv:2004.00010,

work page arXiv 2004
[7]

and Rogers, R

Cesar, M. and Rogers, R. Bounding, concentrating, and truncating: Unifying privacy loss composition for data analytics. arXiv preprint arXiv:2004.07223,

work page arXiv 2004
[8]

Dpcheatsheet: Using worked and erroneous llm-usage examples to scaffold differential privacy implementation.arXiv preprint arXiv:2509.12590,

Chu, S.-Y ., Tian, Y ., Wang, Y .-X., and Jin, H. Dpcheatsheet: Using worked and erroneous llm-usage examples to scaffold differential privacy implementation.arXiv preprint arXiv:2509.12590,

work page arXiv
[9]

Training Verifiers to Solve Math Word Problems

Cobbe, K., Kosaraju, V ., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

arXiv preprint arXiv:2204.13650 , year=

De, S., Berrada, L., Hayes, J., Smith, S. L., and Balle, B. Unlocking high-accuracy differentially private image classification through scale.arXiv preprint arXiv:2204.13650,

work page arXiv
[11]

Detecting violations of differential privacy

Ding, Z., Wang, Y ., Wang, G., Zhang, D., and Kifer, D. Detecting violations of differential privacy. InProceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, pp. 475–489,

2018
[12]

The permute-and-flip mechanism is identical to report-noisy-max with exponential noise.arXiv preprint arXiv:2105.07260,

Ding, Z., Kifer, D., Steinke, T., Wang, Y ., Xiao, Y ., Zhang, D., et al. The permute-and-flip mechanism is identical to report-noisy-max with exponential noise.arXiv preprint arXiv:2105.07260,

work page arXiv
[13]

Optimal differential privacy composition for exponential mechanisms and the cost of adaptivity.arXiv preprint arXiv:1909.13830,

Dong, J., Durfee, D., and Rogers, R. Optimal differential privacy composition for exponential mechanisms and the cost of adaptivity.arXiv preprint arXiv:1909.13830,

work page arXiv 1909
[14]

Y ., Hausknecht, K., Brenner, J., Liu, D., Peng, N., Wang, C., and Brenner, M

Fan, J., Martinson, S., Wang, E. Y ., Hausknecht, K., Brenner, J., Liu, D., Peng, N., Wang, C., and Brenner, M. P. Hardmath: A benchmark dataset for challenging problems in applied mathematics.arXiv preprint arXiv:2410.09988,

work page arXiv
[15]

Google DeepMind

URL https://csrc.nist.gov/ presentations/2020/stppa1-census. Google DeepMind. Gemini 3 pro model card,

2020
[16]

Harrison, C

URL https: //developers.googleblog.com/how-were-helping-developers-with-differential-privacy/. Harrison, C. and Manurangsi, P. Exact zcdp characterizations for fundamental differentially private mechanisms.arXiv preprint arXiv:2510.25746,

work page arXiv
[17]

biology" or

URL https://openreview.net/forum?id=OZy70UggXr. Huang, Y . and Yang, L. F. Winning gold at imo 2025 with a model-agnostic verification-and-refinement pipeline.arXiv preprint arXiv:2507.15855,

work page arXiv 2025
[18]

T., and Wang, Y

Kulesza, A., Suresh, A. T., and Wang, Y . Mean estimation in the add-remove model of differential privacy.arXiv preprint arXiv:2312.06658,

work page arXiv
[19]

Large language models can be strong differentially private learners.arXiv preprint arXiv:2110.05679, 2021

Li, X., Tramèr, F., Liang, P., and Hashimoto, T. Large language models can be strong differentially private learners. arXiv preprint arXiv:2110.05679,

work page arXiv
[20]

Mathbench: Evaluating the theory and application proficiency of llms with a hierarchical mathematics benchmark.arXiv preprint arXiv:2405.12209,

Liu, H., Zheng, Z., Qiao, Y ., Duan, H., Fei, Z., Zhou, F., Zhang, W., Zhang, S., Lin, D., and Chen, K. Mathbench: Evaluating the theory and application proficiency of llms with a hierarchical mathematics benchmark.arXiv preprint arXiv:2405.12209,

work page arXiv
[21]

and Razaviyayn, M

Lowy, A. and Razaviyayn, M. Output perturbation for differentially private convex optimization: Faster and more general.arXiv preprint arXiv:2102.04704,

work page arXiv
[22]

Auditing f-differential privacy in one run.arXiv preprint arXiv:2410.22235, 2024

Mahloujifar, S., Melis, L., and Chaudhuri, K. Auditing f-differential privacy in one run.arXiv preprint arXiv:2410.22235,

work page arXiv
[23]

13 DPrivBench: Benchmarking LLMs’ Reasoning for Differential Privacy McKenna, R

Accessed: 2026-01-23. 13 DPrivBench: Benchmarking LLMs’ Reasoning for Differential Privacy McKenna, R. and Sheldon, D. R. Permute-and-flip: A new mechanism for differentially private selection.Advances in Neural Information Processing Systems, 33:193–203,

2026
[24]

Renyi differential privacy.arXiv preprint arXiv:1702.07476,

Mironov, I. Renyi differential privacy.arXiv preprint arXiv:1702.07476,

work page arXiv
[25]

Papernot, N

URLhttps://openai.com/index/introducing-gpt-5/. Papernot, N. and Steinke, T. Hyperparameter tuning with renyi differential privacy.arXiv preprint arXiv:2110.03620,

work page arXiv
[26]

Scalable private learning with pate.arXiv preprint arXiv:1802.08908,

Papernot, N., Song, S., Mironov, I., Raghunathan, A., Talwar, K., and Úlfar Erlingsson. Scalable private learning with pate.arXiv preprint arXiv:1802.08908,

work page arXiv
[27]

Generalized ptr: User-friendly recipes for data-adaptive algorithms with differential privacy.arXiv preprint arXiv:2301.00301,

Redberg, R., Zhu, Y ., and Wang, Y .-X. Generalized ptr: User-friendly recipes for data-adaptive algorithms with differential privacy.arXiv preprint arXiv:2301.00301,

work page arXiv
[28]

On Optimal Hyperparameters for Differentially Private Deep Transfer Learning

Rehn, A., Zhao, L., Heikkilä, M. A., and Honkela, A. On optimal hyperparameters for differentially private deep transfer learning.arXiv preprint arXiv:2510.20616,

work page internal anchor Pith review Pith/arXiv arXiv
[29]

Sato, T., Barthe, G., Gaboardi, M., Hsu, J., and Katsumata, S.-y

URL https://icml.cc/ virtual/2021/11631. Sato, T., Barthe, G., Gaboardi, M., Hsu, J., and Katsumata, S.-y. Approximate span liftings: Compositional semantics for relaxations of differential privacy. In2019 34th Annual ACM/IEEE Symposium on Logic in Computer Science (LICS), pp. 1–14. IEEE,

2021
[30]

Vadhan, S

GitHub repository, accessed 2026-04-16. Vadhan, S. The complexity of differential privacy. InTutorials on the Foundations of Cryptography: Dedicated to Oded Goldreich, pp. 347–450. Springer,

2026
[31]

CoRR , volume =

Yousefpour, A., Shilov, I., Sablayrolles, A., Testuggine, D., Prasad, K., Malek, M., Nguyen, J., Ghosh, S., Bharadwaj, A., Zhao, J., et al. Opacus: User-friendly differential privacy library in pytorch.arXiv preprint arXiv:2109.12298,

work page arXiv
[32]

Privtree: A differentially private algorithm for hierarchical decompositions

Zhang, J., Xiao, X., and Xie, X. Privtree: A differentially private algorithm for hierarchical decompositions. In Proceedings of the 2016 international conference on management of data, pp. 155–170,

2016
[33]

Theorem A.1(Laplace Mechanism (Dwork et al., 2006)).Let f:X n →R d be a function with ℓ1-sensitivity ∆1(f) := max X∼X ′ ∥f(X)−f(X ′)∥1

Through the question statement in Category 1, we usereplace-oneneighbouring relationship. Theorem A.1(Laplace Mechanism (Dwork et al., 2006)).Let f:X n →R d be a function with ℓ1-sensitivity ∆1(f) := max X∼X ′ ∥f(X)−f(X ′)∥1. The Laplace mechanism defined as follow satisfiesε-DP: M(X) =f(X) +Z, Z i i.i.d. ∼Lap ∆1(f) ε . Theorem A.2(zCDP guarantee of Gauss...

2006
[34]

When instantiated with Gumbel noise, the algorithm is the famous exponential algorithm (McSherry & Talwar, 2007)

In particular, when instantiated with exponential noise, the algorithm is called the permute and flip (McKenna & Sheldon, 2020; Ding et al., 2021). When instantiated with Gumbel noise, the algorithm is the famous exponential algorithm (McSherry & Talwar, 2007). Input :DatasetX; score functions{u j(·)}m i=1 with sensitivity∆(w.r.t. same neighboring relatio...

2020
[35]

Sequential or Adaptive Composition (Bun & Steinke, 2016; Mironov, 2017; Canonne et al., 2020; Cesar & Rogers,

2016
[36]

DP-ML DP-GD (Bassily et al., 2014; Dong et al.,

2014
[37]

dp-sgd (Abadi et al., 2016; De et al., 2022; Rehn et al.,

2016
[38]

objective perturbation (Bassily et al., 2019; Kifer et al., 2012; Chaudhuri et al.,

2019
[39]

DP-statistics DP selection: expoMech (Dong et al., 2019,

2019
[40]

Data-Adaptive PTR (Redberg et al., 2022; Dwork & Lei,

2022
[41]

Table 6: Reference for Category 2 questions (grouped by subject and topic). B Taxonomy of error pattern in Category 2 benchmark design Table 7: Error taxonomy and empirical distribution in Category 2 Error pattern Core characterization Stronger-than-valid claim / over- claiming guarantee Starts from a valid result but claims a strictly stronger privacy gu...

2025