arxiv: 2602.22699 · v2 · submitted 2026-02-26 · 💻 cs.CR · cs.DB · cs.LG

DPSQL+: A Differentially Private SQL Library with a Minimum Frequency Rule

Tomoya Matsumoto, Shokichi Takakura, Shun Takagi, Satoshi Hasegawa

Pith reviewed 2026-05-15 19:25 UTC · model grok-4.3

classification 💻 cs.CR cs.DB cs.LG

keywords differential privacy, SQL queries, minimum frequency rule, privacy-preserving analysis, TPC-H benchmark, query validation, exploratory data analysis

The pith

DPSQL+ combines user-level differential privacy with a minimum frequency rule in a modular SQL library that supports aggregates, joins, and quadratic statistics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents DPSQL+ as a library that lets analysts run SQL queries on sensitive data while satisfying both differential privacy and the governance requirement that every released group must draw from at least k distinct individuals. It works by statically checking queries against a safe SQL subset, tracking cumulative privacy spend across multiple queries, and routing results through a portable backend. Experiments on the TPC-H benchmark show that the restricted queries still deliver usable accuracy for basic aggregates, joins, and quadratic statistics, and that the system supports substantially more queries than prior libraries before the global privacy budget is exhausted. If the static restrictions preserve enough utility, organizations could release query results while limiting individuals' exposure to membership or attribute inference attacks. The design separates validation, accounting, and execution so that the same privacy logic can sit in front of different database engines.
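As a rough illustration of that flow, the sketch below wires a validator, an accountant, and a backend into the four stages named in the Figure 1 caption (validation, budget checking, execution, delivery). Every name in it (`Accountant`, `BudgetExceeded`, `run_query`, and the `validate` and `execute` callables) is hypothetical and is not the DPSQL+ API. The accountant uses plain sequential composition, which is only the simplest accounting rule and may well differ from what the paper implements.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class BudgetExceeded(Exception):
    """Raised when a query would push spending past the global budget."""


@dataclass
class Accountant:
    # Hypothetical accountant: basic sequential composition, so the total
    # (epsilon, delta) spent is the plain sum over answered queries.
    epsilon_total: float
    delta_total: float
    epsilon_spent: float = 0.0
    delta_spent: float = 0.0

    def charge(self, epsilon: float, delta: float) -> None:
        over_eps = self.epsilon_spent + epsilon > self.epsilon_total
        over_delta = self.delta_spent + delta > self.delta_total
        if over_eps or over_delta:
            raise BudgetExceeded("global privacy budget would be exceeded")
        self.epsilon_spent += epsilon
        self.delta_spent += delta


def run_query(
    sql: str,
    epsilon: float,
    delta: float,
    validate: Callable[[str], bool],
    accountant: Accountant,
    execute: Callable[[str, float, float], float],
) -> Optional[float]:
    # Stage 1, validation: rejected queries never reach the data.
    if not validate(sql):
        return None
    # Stage 2, budget checking: charge before touching the data.
    accountant.charge(epsilon, delta)
    # Stage 3, execution: the backend is assumed to add the noise itself.
    result = execute(sql, epsilon, delta)
    # Stage 4, delivery: hand the noisy result back to the analyst.
    return result
```

In this sketch a rejected query costs nothing, while an accepted query is charged before execution, so a backend failure can never produce a result that was not accounted for.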

Core claim

DPSQL+ enforces user-level (ε,δ)-DP together with the minimum frequency rule through three components: a Validator that statically restricts queries to a DP-safe SQL subset, an Accountant that tracks cumulative privacy loss, and a Backend that interfaces with various database engines. With this design it achieves practical accuracy across analytical workloads ranging from basic aggregates to quadratic statistics and join operations, and it allows substantially more queries under a fixed global privacy budget than prior libraries.

What carries the argument

The Validator that statically restricts incoming queries to a DP-safe subset of SQL, paired with the Accountant that maintains a consistent record of total privacy loss across multiple queries.

If this is right

Basic aggregate queries can be answered with calibrated noise while still producing results that analysts can use.
Join operations and quadratic statistics remain feasible inside the privacy and frequency constraints.
A fixed global privacy budget supports more total queries than earlier DP SQL libraries in the same evaluation setting.
The same privacy logic can be applied to different database engines without rewriting the validator or accountant.
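To make the first consequence concrete, here is a minimal sketch of a per-group distinct-user count that combines a hard minimum frequency check with a noisy threshold. It is not the mechanism from the paper: the function names, the "first group per user" bounding rule, and the threshold formula are assumptions chosen for a simplified setting in which each user contributes to exactly one group. The floating-point Laplace sampler is also naive and should not be treated as a hardened implementation.

```python
import math
import random
from collections import defaultdict


def laplace(scale: float, rng: random.Random) -> float:
    # The difference of two i.i.d. exponentials is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_user_counts(rows, k, epsilon, delta, rng):
    """Noisy distinct-user count per group; rows are (user_id, group) pairs."""
    # Contribution bounding (assumed rule): each user counts toward only
    # the first group they appear in, so one user changes one count by 1.
    first_group = {}
    for user_id, group in rows:
        first_group.setdefault(user_id, group)

    users_per_group = defaultdict(int)
    for group in first_group.values():
        users_per_group[group] += 1

    scale = 1.0 / epsilon  # Laplace scale for sensitivity 1
    # Noisy threshold, chosen so that a group with exactly k users clears
    # it with probability delta (assumes delta <= 0.5).
    tau = k + scale * math.log(1.0 / (2.0 * delta))

    released = {}
    for group, n_users in users_per_group.items():
        if n_users < k:
            continue  # minimum frequency rule: never release small groups
        noisy = n_users + laplace(scale, rng)
        if noisy >= tau:
            released[group] = noisy
    return released
```

The hard check guarantees that no released group has fewer than k distinct users, while the noisy threshold keeps the decision to release a borderline group from depending sharply on any single user.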

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Data platforms that must satisfy both privacy regulations and minimum-frequency governance rules could adopt the library as a drop-in query gateway.
Extending the validator to additional SQL constructs would expand the range of analyses that can be performed without manual workarounds.
The accounting mechanism could be reused in other query languages if a comparable static validator is built for them.
Evaluating the system on production workloads with real schema complexity would reveal whether TPC-H results generalize to typical enterprise data.

Load-bearing premise

Statically restricting queries to a DP-safe SQL subset via the Validator preserves sufficient utility for typical exploratory data analysis workloads without needing post-hoc adjustments.

What would settle it

A controlled test that runs a realistic collection of exploratory SQL queries through the Validator and measures both the fraction of queries rejected and the end-to-end accuracy loss on the queries that pass would directly test whether utility remains adequate.
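A test of this kind needs only two measurements, which the sketch below computes: the fraction of a workload that a validator rejects, and the mean relative error (in percent) of the noisy answers, summarized over repeated runs. The helper names and the `run_fn(seed)` convention are hypothetical, and the validator is passed in as an arbitrary callable rather than taken from DPSQL+.

```python
from statistics import mean, stdev


def rejection_fraction(queries, validate):
    """Share of queries that the validator refuses to run."""
    rejected = sum(1 for q in queries if not validate(q))
    return rejected / len(queries)


def mean_relative_error(truth, estimates):
    """Mean of |estimate - true| / |true|, in percent (true values nonzero)."""
    ratios = [abs(e - t) / abs(t) for t, e in zip(truth, estimates)]
    return 100.0 * mean(ratios)


def mre_over_runs(truth, run_fn, n_runs):
    """Mean and sample standard deviation of the error over independent runs.

    run_fn(seed) is assumed to return one list of noisy answers per call.
    """
    errors = [mean_relative_error(truth, run_fn(seed)) for seed in range(n_runs)]
    spread = stdev(errors) if n_runs > 1 else 0.0
    return mean(errors), spread
```

Reporting the spread alongside the mean matters here because a single noisy run can look much better or worse than the typical behavior of the mechanism.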

Figures

Figures reproduced from arXiv: 2602.22699 by Satoshi Hasegawa, Shokichi Takakura, Shun Takagi, Tomoya Matsumoto.

**Figure 1.** The architecture of DPSQL+. […] across queries. Finally, the Backend interfaces with data engines such as Spark SQL or DuckDB to apply contribution bounding, double thresholding, and noisy aggregation. When a query is submitted, the system guides it through four stages: validation, budget checking, execution, and delivery. This automated pipeline ensures that all privacy rules are strictly followed and the g…

**Figure 2.** Mean Relative Error (%) against the ground truth as a function of the privacy parameter…

**Figure 3.** Maximum number of queries executable with fixed per…

Original abstract

SQL is the de facto interface for exploratory data analysis; however, releasing exact query results can expose sensitive information through membership or attribute inference attacks. Differential privacy (DP) provides rigorous privacy guarantees, but in practice, DP alone may not satisfy governance requirements such as the minimum frequency rule, which requires each released group (cell) to include contributions from at least k distinct individuals. In this paper, we present DPSQL+, a privacy-preserving SQL library that simultaneously enforces user-level (ε,δ)-DP and the minimum frequency rule. DPSQL+ adopts a modular architecture consisting of: (i) a Validator that statically restricts queries to a DP-safe subset of SQL; (ii) an Accountant that consistently tracks cumulative privacy loss across multiple queries; and (iii) a Backend that interfaces with various database engines, ensuring portability and extensibility. Experiments on the TPC-H benchmark demonstrate that DPSQL+ achieves practical accuracy across a wide range of analytical workloads, from basic aggregates to quadratic statistics and join operations, and allows substantially more queries under a fixed global privacy budget than prior libraries in our evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

DPSQL+ packages DP and minimum-frequency rules into a modular SQL library that runs more queries under budget on TPC-H, but the validator's static restrictions are the part that needs checking against real EDA workloads.

DPSQL+ combines user-level differential privacy with the minimum frequency rule in a single SQL library. The concrete addition is the three-part architecture: a validator that accepts only a safe SQL subset, an accountant that tracks total privacy spend, and a backend that talks to different engines. That packaging is new even if the pieces draw from prior DP-SQL work and the frequency rule itself is not original.

Referee Report

2 major / 2 minor

Summary. The manuscript presents DPSQL+, a modular library for executing SQL queries under user-level (ε,δ)-differential privacy while also enforcing a minimum frequency rule requiring each output group to have contributions from at least k distinct individuals. The system uses a Validator to restrict queries to a DP-safe SQL subset, an Accountant to track privacy loss, and a Backend for database portability. Evaluation on the TPC-H benchmark claims practical accuracy for aggregates, quadratic statistics, and joins, along with allowing more queries under a fixed privacy budget compared to prior work.

Significance. If the results hold, this contribution is significant as it bridges differential privacy with practical governance requirements like the minimum frequency rule in a usable SQL interface. The modular design facilitates extensibility across database engines, and the TPC-H experiments provide evidence of utility for analytical workloads. This could enable broader adoption of privacy-preserving analytics in settings requiring both formal DP guarantees and frequency-based protections.

major comments (2)

[§3.2] Validator: The description of the Validator's static restrictions for enforcing both DP and the minimum frequency rule lacks a formal characterization of the supported query language fragment (e.g., allowed join types, aggregation forms, or group-by clauses). Without this, it is unclear whether typical exploratory queries are supported or require reformulation, which directly impacts the central claim of practical accuracy across wide workloads.
[§5] Evaluation: The reported gains in query volume under fixed budget and accuracy metrics for quadratic statistics and joins lack exact counts of accepted/rejected queries by the Validator, the precise privacy budget allocation, and details such as error bars or run-to-run variance. This makes it difficult to assess whether the performance claims generalize beyond the specific TPC-H subset tested.

minor comments (2)

[§5] The experimental setup would benefit from an explicit table listing the (ε, δ, k) parameter values used across all TPC-H workloads.
[Figure 4] Figure captions for throughput plots should include the number of runs and any statistical tests performed for the 'substantially more queries' comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and positive recommendation for minor revision. We address each major comment below and will incorporate the suggested clarifications into the revised manuscript.

Referee: [§3.2] Validator: The description of the Validator's static restrictions for enforcing both DP and the minimum frequency rule lacks a formal characterization of the supported query language fragment (e.g., allowed join types, aggregation forms, or group-by clauses). Without this, it is unclear whether typical exploratory queries are supported or require reformulation, which directly impacts the central claim of practical accuracy across wide workloads.

Authors: We agree that an explicit formal characterization of the supported query fragment would improve clarity and help readers assess the scope of supported exploratory queries. In the revised version, we will add a dedicated paragraph and table in §3.2 that enumerates the allowed constructs: supported join types (inner joins on foreign-key relationships only), permitted aggregation functions (SUM, COUNT, AVG, and quadratic forms such as variance), group-by requirements, and the precise conditions under which the minimum-frequency rule is enforced by the Validator. This addition will directly address the concern without altering the underlying implementation.
Revision: yes

Referee: [§5] Evaluation: The reported gains in query volume under fixed budget and accuracy metrics for quadratic statistics and joins lack exact counts of accepted/rejected queries by the Validator, the precise privacy budget allocation, and details such as error bars or run-to-run variance. This makes it difficult to assess whether the performance claims generalize beyond the specific TPC-H subset tested.

Authors: We acknowledge that the current evaluation section would benefit from greater quantitative transparency. In the revision we will expand §5 with: (i) exact counts of queries accepted versus rejected by the Validator on the TPC-H workload, (ii) the concrete per-query privacy-budget allocation policy (including how the global (ε,δ) budget is partitioned), and (iii) error bars or standard deviations computed over multiple independent runs to quantify run-to-run variance. These additions will strengthen the evidence for the reported gains in query volume and accuracy.
Revision: yes
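Taking the constructs named in the simulated response to the §3.2 comment at face value, an allowlist check of the kind the referee asks to see documented could look like the sketch below. It operates on an already-parsed query, and everything in it is an assumption made for illustration: the `ParsedQuery` and `Join` types, the `VAR` spelling for variance, and the rule that a query must contain at least one aggregate. None of it is taken from the DPSQL+ code base.

```python
from dataclasses import dataclass, field
from typing import List

# Assumed allowlists, based only on the constructs listed in the response.
ALLOWED_AGGREGATES = {"SUM", "COUNT", "AVG", "VAR"}
ALLOWED_JOIN_TYPES = {"INNER"}


@dataclass
class Join:
    kind: str              # e.g. "INNER", "LEFT"
    on_foreign_key: bool   # True if the join condition follows a foreign key


@dataclass
class ParsedQuery:
    aggregates: List[str]
    joins: List[Join] = field(default_factory=list)
    group_by: List[str] = field(default_factory=list)


def validate(query: ParsedQuery) -> List[str]:
    """Return a list of violations; an empty list means the query is accepted."""
    problems = []
    if not query.aggregates:
        problems.append("query must contain at least one aggregate")
    for name in query.aggregates:
        if name.upper() not in ALLOWED_AGGREGATES:
            problems.append(f"aggregate {name} is not allowed")
    for join in query.joins:
        if join.kind.upper() not in ALLOWED_JOIN_TYPES:
            problems.append(f"{join.kind} join is not allowed")
        elif not join.on_foreign_key:
            problems.append("joins must follow a foreign-key relationship")
    return problems
```

Returning the reasons for rejection, rather than a bare boolean, would also make it straightforward to report the accepted and rejected query counts requested in the §5 comment.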

Circularity Check

0 steps flagged

No significant circularity in engineering system

The paper describes an engineering library (DPSQL+) with modular components (Validator for DP-safe SQL subset, Accountant for privacy budget tracking, Backend for DB interfacing) and supports its claims of practical accuracy on TPC-H workloads via direct empirical evaluation rather than any mathematical derivation. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text; the minimum-frequency rule and DP guarantees are enforced by construction in the architecture but are not presented as derived results that reduce to their own inputs. The contribution is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is a systems contribution describing library architecture and benchmark results; the abstract introduces no free parameters, mathematical axioms, or new postulated entities.
