arxiv: 2605.29093 · v1 · pith:LJFQ3LCRnew · submitted 2026-05-27 · 💻 cs.DB

ScanTwin: Simulating Performance Regressions Without Access to Tenant Data

Pith reviewed 2026-06-29 09:06 UTC · model grok-4.3

classification 💻 cs.DB
keywords: differential privacy · Parquet · row-group pruning · performance regression · scan timing · synthetic data · cloud databases

The pith

ScanTwin extracts privacy-protected Parquet row-group sketches to reproduce scan pruning and timing without tenant data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ScanTwin as a way for cloud platform developers to recreate performance regressions seen on specific tenant datasets. It works by pulling boundary values and compressed sizes from each row group in the Parquet footer, then releasing a version of those values under differential privacy. The goal is to keep the physical layout details that control which row groups get skipped during scans and how long each scan takes. Experiments on TPC-H and SSB show that at ε=∞ (no privacy noise) the method matches the original exactly on pruning, while at ε=5 high-selectivity queries (above 30 percent) stay within 8.5 percent pruning error and DuckDB scan times track closely.

Core claim

ScanTwin extracts a per-row-group sketch containing boundary values and compressed sizes from the Parquet footer and releases it under ε-differential privacy using boundary parameterization. This sketch is sufficient to drive row-group pruning decisions and scan timing behavior in an engine such as DuckDB, producing 0 percent pruning error and less than 1 percent byte error at ε=∞, and below 8.5 percent pruning error for queries with selectivity above 30 percent at ε=5 on both TPC-H and SSB.
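To make concrete how a sketch alone can drive pruning decisions, here is a minimal sketch in Python. It assumes a single numeric filter column and an inclusive range predicate; the `RowGroupSketch` type and both helper functions are hypothetical names used for illustration, not the paper's API, and the sizes are placeholder values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RowGroupSketch:
    """Hypothetical per-row-group sketch entry (illustrative names)."""
    lo: float              # minimum of the filter column in this row group
    hi: float              # maximum of the filter column in this row group
    compressed_bytes: int  # compressed size of the row group


def scanned_row_groups(sketch, pred_lo, pred_hi):
    """Indices of row groups whose [lo, hi] range overlaps the predicate
    pred_lo <= x <= pred_hi. Every other row group can be skipped."""
    return [i for i, rg in enumerate(sketch)
            if rg.hi >= pred_lo and rg.lo <= pred_hi]


def scanned_bytes(sketch, pred_lo, pred_hi):
    """Total compressed bytes of the row groups that survive pruning."""
    return sum(sketch[i].compressed_bytes
               for i in scanned_row_groups(sketch, pred_lo, pred_hi))


# Placeholder sketch of four row groups on a sorted column.
sketch = [
    RowGroupSketch(0, 99, 1000),
    RowGroupSketch(100, 199, 1200),
    RowGroupSketch(200, 299, 900),
    RowGroupSketch(300, 399, 1100),
]
kept = scanned_row_groups(sketch, 150, 250)   # [1, 2]
total = scanned_bytes(sketch, 150, 250)       # 2100
```

Only the boundary values decide which row groups are read, and only the compressed sizes decide how many bytes that read costs, which is why those two fields are the ones the sketch carries.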

What carries the argument

Per-row-group sketches of boundary values and compressed sizes, released under ε-differential privacy via boundary parameterization, that preserve the layout properties used for pruning and timing.
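One way such a release could look is sketched below. Everything here is an assumption made for illustration: that a file sorted on the filter column with K row groups is described by K+1 ordered boundary values, that noise is drawn from a Laplace distribution, that noisy values are clamped to a declared public domain, and that sorting restores monotonicity as post-processing. The `sensitivity` argument is a caller-supplied placeholder, and the code makes no claim about how the budget should be split across boundaries, so it is not a privacy-accounting reference.

```python
import math
import random


def laplace(scale, rng):
    """Draw one Laplace(0, scale) sample as the difference of two
    independent exponential samples with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def release_boundaries(boundaries, epsilon, sensitivity, domain, rng):
    """Hypothetical release of K+1 boundary values.

    epsilon = inf returns the boundaries unchanged (no noise).
    Otherwise each boundary gets Laplace noise of scale
    sensitivity / epsilon, is clamped to `domain`, and the result is
    sorted so that row-group ranges stay well ordered."""
    if math.isinf(epsilon):
        return list(boundaries)
    scale = sensitivity / epsilon
    lo, hi = domain
    noisy = [min(hi, max(lo, b + laplace(scale, rng))) for b in boundaries]
    return sorted(noisy)


rng = random.Random(0)  # fixed seed so the example is repeatable
true_boundaries = [0.0, 100.0, 200.0, 300.0, 400.0]
exact = release_boundaries(true_boundaries, math.inf, 1.0, (0.0, 400.0), rng)
noisy = release_boundaries(true_boundaries, 5.0, 1.0, (0.0, 400.0), rng)
```

Clamping and sorting only transform already-noised values, so they do not consume additional privacy budget under the usual post-processing property of differential privacy.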

If this is right

  • Developers can reproduce tenant-specific scan regressions locally without any access to the tenant data.
  • High-selectivity queries retain usable pruning accuracy under moderate privacy budgets on standard benchmarks.
  • DuckDB scan timing on the released sketches closely follows the timing observed on the original files.
  • The approach works for both TPC-H and SSB at the reported row counts and privacy levels.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same sketch format could be applied to other columnar formats that store row-group metadata.
  • Varying the privacy parameter ε across a range would let teams choose the accuracy-privacy tradeoff for different debugging tasks.
  • If the sketches also captured column statistics beyond boundaries, pruning simulation for more complex predicates might improve.

Load-bearing premise

The noisy boundary values and sizes from the Parquet footer still produce the same pruning decisions and scan costs as the original data.
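The premise can be probed with an aggregate pruning-error measure. The sketch below assumes one plausible definition, the mean absolute percentage error in the number of row groups scanned per query; the paper's exact MAPE-RG definition is not given on this page, and the boundary values and queries are illustrative placeholders rather than measurements.

```python
def count_scanned(boundaries, pred_lo, pred_hi):
    """Number of row groups read for pred_lo <= x <= pred_hi, where
    row group i covers [boundaries[i], boundaries[i + 1]]."""
    return sum(1 for lo, hi in zip(boundaries, boundaries[1:])
               if hi >= pred_lo and lo <= pred_hi)


def pruning_error_pct(true_b, noisy_b, queries):
    """Assumed metric: mean absolute percentage error of the scanned
    row-group count, over queries that scan at least one true row group."""
    errs = []
    for q_lo, q_hi in queries:
        t = count_scanned(true_b, q_lo, q_hi)
        n = count_scanned(noisy_b, q_lo, q_hi)
        if t > 0:
            errs.append(abs(n - t) / t)
    return 100.0 * sum(errs) / len(errs) if errs else 0.0


true_b = [0, 100, 200, 300, 400]
noisy_b = [0, 104, 197, 300, 400]   # placeholder noisy release
queries = [(150, 250), (100, 102), (0, 400)]
err = pruning_error_pct(true_b, noisy_b, queries)
```

In this toy case only the narrow query near a boundary changes its row-group count (2 versus 1), giving an average error of about 16.7 percent; wide queries are unaffected, which is consistent with the reported pattern that high-selectivity queries tolerate noise better.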

What would settle it

Run the same queries on DuckDB using both the original Parquet files and the ScanTwin sketches at ε=5; if per-query scan times diverge by more than a few percent even when pruning error stays low, the claim fails.
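A minimal sketch of that check follows, assuming per-query timings have already been collected into two dictionaries keyed by query name. The 5 percent tolerance is an assumed reading of "a few percent", the function names are hypothetical, and the timing values are invented placeholders, not results from the paper.

```python
def relative_divergence(orig_ms, twin_ms):
    """Per-query relative difference |twin - orig| / orig."""
    return {q: abs(twin_ms[q] - orig_ms[q]) / orig_ms[q] for q in orig_ms}


def flag_divergent(orig_ms, twin_ms, tolerance=0.05):
    """Names of queries whose timing diverges by more than `tolerance`."""
    div = relative_divergence(orig_ms, twin_ms)
    return sorted(q for q, d in div.items() if d > tolerance)


# Placeholder timings in milliseconds (illustrative only).
orig_ms = {"q1": 100.0, "q2": 40.0, "q3": 250.0}
twin_ms = {"q1": 102.0, "q2": 47.0, "q3": 249.0}
flagged = flag_divergent(orig_ms, twin_ms)   # ["q2"]
```

Reporting the flagged set per query, rather than a single average, keeps one badly mispredicted query from being hidden by many accurate ones.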

Figures

Figures reproduced from arXiv: 2605.29093 by Donghyun Sohn, Jennie Rogers.

Figure 1
Figure 1 illustrates this layout. Only the filter column’s per-RG metadata is extracted; other columns are ignored at sketch time. The sketch S is composed of the row count 𝑁, the number of RGs 𝐾, and the column count. Since Parquet stores statistics in the footer, we extract the sketch without reading data pages. Sorting by the filter column is standard practice (e.g., Snowflake clustering keys [1], Databricks Z-o…
Figure 2
Figure 2. MAPE-RG vs. 𝜀 broken down by selectivity.
Figure 3
Figure 3. MAPE-RG vs. declared 𝑚 at 𝜀=5.
Original abstract

In cloud data platforms, developers often encounter performance regressions that occur in specific tenant datasets. However, due to confidentiality constraints, they cannot access the original data, which makes it difficult to reproduce these regressions locally. Current methods for synthetic data usually focus on statistical properties, such as matching data distributions or improving query accuracy. However, they overlook the physical properties that control how the engine behaves during scans, including row-group pruning. We propose ScanTwin, a lightweight framework that extracts a per-row-group sketch from the Parquet footer, including boundary values and compressed sizes, and releases them under ε-differential privacy using a boundary parameterization. On TPC-H and SSB (6M rows), ScanTwin achieves 0% pruning error and less than 1% byte error at ε=∞. Under ε=5, high-selectivity queries (>30%) incur below 8.5% pruning error on both datasets, and per-query scan timing on DuckDB closely tracks the original.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces ScanTwin, a framework that extracts per-row-group sketches (boundary values and compressed sizes) from Parquet footers and releases them under ε-differential privacy using a boundary parameterization. The goal is to simulate scan performance regressions on tenant data without direct access. On TPC-H and SSB (6M rows), it reports 0% pruning error and <1% byte error at ε=∞; under ε=5, high-selectivity queries (>30%) incur <8.5% pruning error on both datasets, with per-query DuckDB scan timings closely tracking the original.

Significance. If the timing fidelity claim holds, the work addresses a practical need in multi-tenant cloud databases by enabling local reproduction of regressions from privacy-preserving metadata alone. The evaluation on independent public benchmarks (TPC-H, SSB) is a positive for reproducibility.

major comments (3)
  1. [Abstract/Evaluation] Abstract and Evaluation: The central claim that DuckDB scan timings closely track the original under ε=5 rests on pruning decisions, yet only aggregate pruning error (<8.5% for >30% selectivity queries) is reported; no per-query set-overlap, exact-match rate for scanned row-groups, or byte-error at ε=5 is provided. These metrics are required to confirm that I/O volumes and thus timings are preserved, as independent Laplace noise on boundaries can flip individual pruning decisions near predicates even when aggregate error remains small.
  2. [Evaluation] Evaluation methodology: The reported error percentages lack error bars, full details on how the >30% selectivity threshold was chosen, or sensitivity analysis for boundary noise application; without these, the concrete performance claims cannot be assessed for robustness.
  3. [Boundary parameterization] Boundary parameterization section: The assumption that min/max plus compressed sizes released under ε-DP suffice to preserve physical pruning and scan-cost properties needs direct validation (e.g., via per-row-group decision fidelity), as the paper's aggregate metric does not rule out timing divergence from flipped pruning decisions.
minor comments (1)
  1. Clarify the exact sensitivity used for Laplace noise on boundary values and how compressed sizes are handled under DP.
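The per-query metrics requested in the major comments can be illustrated with a short sketch. It assumes row group i covers the range between consecutive boundary values and uses Jaccard similarity as the set-overlap measure; the function names and the boundary values are hypothetical placeholders.

```python
def scanned_set(boundaries, pred_lo, pred_hi):
    """Set of row-group indices read for pred_lo <= x <= pred_hi, where
    row group i covers [boundaries[i], boundaries[i + 1]]."""
    return {i for i, (lo, hi) in enumerate(zip(boundaries, boundaries[1:]))
            if hi >= pred_lo and lo <= pred_hi}


def jaccard(a, b):
    """Set overlap |a & b| / |a | b|, defined as 1.0 for two empty sets."""
    return 1.0 if not a and not b else len(a & b) / len(a | b)


def decision_fidelity(true_b, noisy_b, queries):
    """Return (mean set overlap, exact-match rate) across queries."""
    overlaps, exact = [], 0
    for q_lo, q_hi in queries:
        t = scanned_set(true_b, q_lo, q_hi)
        n = scanned_set(noisy_b, q_lo, q_hi)
        overlaps.append(jaccard(t, n))
        exact += int(t == n)
    return sum(overlaps) / len(queries), exact / len(queries)


true_b = [0, 100, 200, 300, 400]
noisy_b = [0, 104, 197, 300, 400]   # placeholder noisy release
queries = [(101, 103), (150, 250)]
mean_overlap, exact_rate = decision_fidelity(true_b, noisy_b, queries)
```

For the query (101, 103) both versions scan exactly one row group, so a count-based error is zero, yet the original reads row group 1 and the noisy version reads row group 0. The overlap for that query is 0, which is the kind of flipped decision an aggregate metric cannot reveal.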

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below and agree to incorporate additional metrics and analyses to strengthen the evaluation.

Point-by-point responses
  1. Referee: [Abstract/Evaluation] Abstract and Evaluation: The central claim that DuckDB scan timings closely track the original under ε=5 rests on pruning decisions, yet only aggregate pruning error (<8.5% for >30% selectivity queries) is reported; no per-query set-overlap, exact-match rate for scanned row-groups, or byte-error at ε=5 is provided. These metrics are required to confirm that I/O volumes and thus timings are preserved, as independent Laplace noise on boundaries can flip individual pruning decisions near predicates even when aggregate error remains small.

    Authors: We agree that per-query metrics would strengthen the evidence. While the reported per-query DuckDB timings already indicate that I/O volumes are preserved in practice, we will add set-overlap, exact-match rates for scanned row-groups, and byte-error at ε=5 to the revised evaluation section. revision: yes

  2. Referee: [Evaluation] Evaluation methodology: The reported error percentages lack error bars, full details on how the >30% selectivity threshold was chosen, or sensitivity analysis for boundary noise application; without these, the concrete performance claims cannot be assessed for robustness.

    Authors: We will revise the evaluation to include error bars on the reported percentages, explain the rationale for the >30% selectivity threshold (focusing on queries where row-group pruning has the largest impact on scan cost), and add a sensitivity analysis for boundary noise. revision: yes

  3. Referee: [Boundary parameterization] Boundary parameterization section: The assumption that min/max plus compressed sizes released under ε-DP suffice to preserve physical pruning and scan-cost properties needs direct validation (e.g., via per-row-group decision fidelity), as the paper's aggregate metric does not rule out timing divergence from flipped pruning decisions.

    Authors: The close tracking of per-query DuckDB timings provides empirical support that scan-cost properties are preserved overall. Nevertheless, we will add explicit per-row-group decision fidelity metrics in the revision to directly validate the boundary parameterization as requested. revision: yes

Circularity Check

0 steps flagged

No circularity; claims rest on independent empirical evaluation

Full rationale

The paper introduces ScanTwin as a DP release mechanism for Parquet row-group sketches and evaluates it directly on external public benchmarks (TPC-H, SSB). Reported metrics (pruning error, byte error, DuckDB scan timing) are measured outcomes on those datasets rather than quantities derived from the method's own parameters or equations. No self-definitional relations, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided text; the central results do not reduce to the inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on the assumption that footer metadata alone captures pruning-relevant physical properties and that the chosen parameterization plus DP noise does not destroy that utility; no new entities are postulated.

free parameters (1)
  • epsilon
    Privacy budget selected for experiments; controls the noise level and resulting error rates reported.
axioms (1)
  • domain assumption: Parquet footer boundary values and compressed sizes determine row-group pruning decisions during scans
    Invoked in the extraction and simulation steps described in the abstract.
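
For intuition on how the single free parameter controls error, the relation below assumes the standard Laplace mechanism applied to each boundary value. The referee report mentions Laplace noise, but the exact mechanism and the sensitivity are not stated on this page, so the symbols $b_i$ (true boundary), $\tilde{b}_i$ (released boundary), $\eta_i$ (noise), and $\Delta$ (sensitivity) are illustrative.

```latex
\tilde{b}_i = b_i + \eta_i, \qquad
\eta_i \sim \mathrm{Lap}\!\left(\frac{\Delta}{\varepsilon}\right), \qquad
\mathbb{E}\,\lvert \eta_i \rvert = \frac{\Delta}{\varepsilon}, \qquad
\eta_i \to 0 \ \text{as} \ \varepsilon \to \infty .
```

Under this assumption the expected boundary displacement shrinks in proportion to $1/\varepsilon$, and at $\varepsilon = \infty$ the released boundaries coincide with the true ones, which matches the reported 0 percent pruning error in that setting.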

pith-pipeline@v0.9.1-grok · 5701 in / 1184 out tokens · 27822 ms · 2026-06-29T09:06:23.278844+00:00 · methodology

Reference graph

Works this paper leans on

17 extracted references · 4 canonical work pages · 1 internal anchor

  [1] 2024. Clustering Keys & Clustered Tables. https://docs.snowflake.com/en/user-guide/tables-clustering-keys

  [2] 2024. File Formats — DuckDB Documentation. https://duckdb.org/docs/stable/guides/performance/file_formats

  [3] 2024. When to partition tables on Databricks. https://docs.databricks.com/aws/en/tables/partitions

  [4] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. 2006. Calibrating Noise to Sensitivity in Private Data Analysis. In Theory of Cryptography

  [5] Cynthia Dwork and Aaron Roth. 2014. The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science 9, 3–4 (2014), 211–407

  [6] Yunqing Ge, Jianbin Qin, Shuyuan Zheng, Yongrui Zhong, Bo Tang, Yu-Xuan Qiu, Rui Mao, Ye Yuan, Makoto Onizuka, and Chuan Xiao. 2024. Privacy-Enhanced Database Synthesis for Benchmark Publishing. Proceedings of the VLDB Endowme...

  [7–8] Naoise Holohan, Spiros Antonatos, Stefano Braghin, and Pól Mac Aonghusa. The Bounded Laplace Mechanism in Differential Privacy. Journal of Privacy and Confidentiality 10, 1 (2019). doi:10.29012/jpc.715

  [9] Jinho Jung, Hong Hu, Joy Arulraj, Taesoo Kim, and Woonhak Kang. 2020. APOLLO: Automatic detection and diagnosis of performance regressions in database systems. Proceedings of the VLDB Endowment 13, 1 (2020), 57–70

  [10] Ryan McKenna, Brendan Sheldon, and Gerome Miklau. 2022. AIM: An Adaptive and Iterative Mechanism for Differentially Private Synthetic Data. In Proceedings of the 39th International Conference on Machine Learning (ICML)

  [11] Pat O’Neil, Betty O’Neil, and Xuedong Chen. 2009. Star Schema Benchmark. Technical Report Revision 3. University of Massachusetts at Boston

  [12] Transaction Processing Performance Council. 2023. TPC-H benchmark specification. Published at http://www.tpc.org (2023)

  [13] Wentao Wu, Anshuman Dutt, Gaoxiang Xu, Vivek Narasayya, and Surajit Chaudhuri. 2026. Understanding and Detecting Query Performance Regression in Practical Index Tuning. Proceedings of the ACM on Management of Data (SIGMOD) 3, 6 (2026)

  [14] Liyang Xie, Kexin Lin, Shu Wang, Fei Wang, and Jiayu Zhou. 2018. Differentially Private Generative Adversarial Network. arXiv preprint arXiv:1802.06739 (2018)

  [15–16] Jia Xu, Zhenjie Zhang, Xiaokui Xiao, Yin Yang, Ge Yu, and Marianne Winslett. Differentially Private Histogram Publication. The VLDB Journal 22, 6 (2013), 797–822. doi:10.1007/s00778-013-0309-y

  [17] Jun Zhang, Graham Cormode, Cecilia M. Procopiuc, Divesh Srivastava, and Xiaokui Xiao. 2017. PrivBayes: Private Data Release via Bayesian Networks. ACM Transactions on Database Systems 42, 4 (2017).

Venue, from the paper's running header: SeQureDB ’26, May 31-June 05, 2026, Bengaluru, India.