Robust Spectral Watermark for Synthetic Tabular Data
Pith reviewed 2026-05-17 04:51 UTC · model grok-4.3
The pith
A frequency-domain method watermarks synthetic tabular data for reliable origin detection after common modifications.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TAB-DRW embeds watermark signals in the frequency domain: it normalizes heterogeneous features via the Yeo-Johnson transformation and standardization, applies the discrete Fourier transform, and adjusts the imaginary parts of adaptively selected entries according to precomputed pseudorandom bits. A novel rank-based pseudorandom bit generation method enables row-wise retrieval without incurring storage overhead. Experiments on five benchmark tabular datasets show that TAB-DRW achieves strong detectability and robustness against post-processing and adaptive attacks, while preserving high data fidelity and fully supporting mixed-type features.
What carries the argument
Adjustment of imaginary parts in selected DFT entries after Yeo-Johnson normalization and standardization, paired with rank-based pseudorandom bit generation for embedding and row-wise detection.
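The pipeline named above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the Yeo-Johnson parameter is fixed at 0.5 (in practice it would be fitted per column), every non-DC, non-Nyquist bin is treated as "selected" because the adaptive selection rule is not given here, the bits are supplied by hand rather than by the rank-based generator, and `delta`, `embed_row`, and `extract_bits` are hypothetical names. The inverse standardization and inverse Yeo-Johnson steps are omitted.

```python
import cmath
import math


def yeo_johnson(x, lam):
    """Yeo-Johnson power transform of one value for a fixed lambda."""
    if x >= 0:
        return math.log1p(x) if lam == 0 else ((x + 1) ** lam - 1) / lam
    return -math.log1p(-x) if lam == 2 else -((1 - x) ** (2 - lam) - 1) / (2 - lam)


def standardize(col):
    """Center a column and scale it to unit (population) standard deviation."""
    mu = sum(col) / len(col)
    sd = math.sqrt(sum((v - mu) ** 2 for v in col) / len(col)) or 1.0
    return [(v - mu) / sd for v in col]


def dft(row):
    """Naive O(n^2) discrete Fourier transform of one row."""
    n = len(row)
    return [sum(row[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spec):
    """Naive inverse DFT."""
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def embed_row(row, bits, delta=0.5):
    """Set sign(Im y_k) from bits for k = 1..ceil(n/2)-1 (assumed embedding rule)."""
    n = len(row)
    y = dft(row)
    for bit, k in zip(bits, range(1, (n + 1) // 2)):
        sign = 1.0 if bit else -1.0
        y[k] = complex(y[k].real, sign * max(abs(y[k].imag), delta))
        y[n - k] = y[k].conjugate()  # Hermitian symmetry keeps the inverse real
    return [v.real for v in idft(y)]


def extract_bits(row):
    """Read the bits back from the signs of the imaginary parts."""
    y = dft(row)
    return [1 if y[k].imag > 0 else 0 for k in range(1, (len(row) + 1) // 2)]


# Toy table: 4 rows x 5 columns (values are made up for illustration).
table = [[23, 0, 12, 40.5, 3.2], [51, 1, 16, 88.0, 0.0],
         [37, 1, 14, 61.2, 7.9], [64, 0, 10, 30.1, 1.4]]
cols = [standardize([yeo_johnson(v, 0.5) for v in col]) for col in zip(*table)]
rows = [list(r) for r in zip(*cols)]
bits = [1, 0]  # placeholder bits; the paper derives them with its rank-based generator
marked = [embed_row(r, bits) for r in rows]
recovered = [extract_bits(r) for r in marked]
```

In this sketch the margin `delta` trades distortion for robustness: a larger margin moves the row further from its original values but makes the sign of each imaginary part harder to flip with small noise.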
If this is right
- The watermark remains detectable after typical post-processing operations that preserve overall statistics.
- Detection works row by row without needing to store any extra information.
- Both discrete and continuous columns keep their original modeling value after watermarking.
- The scheme runs efficiently as a post-edit step on already-generated tables.
Where Pith is reading between the lines
- The same frequency-domain idea could be tested on other generated data formats if suitable normalizations are identified.
- Embedding during the generation step rather than afterward might reduce any residual utility cost.
- Over time the method could support public registries that verify data provenance for regulated synthetic datasets.
Load-bearing premise
That small changes to the imaginary parts of the Fourier transform will leave the overall statistical properties and downstream usefulness of the synthetic data essentially unchanged.
What would settle it
A post-processing step such as mild rounding, scaling, or noise addition that erases the ability to recover the watermark bits while the data still matches the original distribution on standard utility metrics.
Original abstract
The rise of generative AI has enabled the production of high-fidelity synthetic tabular data across fields such as healthcare, finance, and public policy, raising growing concerns about data provenance and misuse. Watermarking offers a promising solution to address these concerns by ensuring the traceability of synthetic data, but existing methods face many limitations: they are computationally expensive due to reliance on the inverse process of large diffusion models, struggle with mixed discrete-continuous data, or lack robustness to common post-processing attacks. To address these limitations, we propose TAB-DRW, an efficient and robust post-editing watermarking scheme for synthetic tabular data. TAB-DRW embeds watermark signals in the frequency domain: it normalizes heterogeneous features via the Yeo-Johnson transformation and standardization, applies the discrete Fourier transform (DFT), and adjusts the imaginary parts of adaptively selected entries according to precomputed pseudorandom bits. To further enhance robustness and efficiency, we introduce a novel rank-based pseudorandom bit generation method that enables row-wise retrieval without incurring storage overhead. Experiments on five benchmark tabular datasets show that TAB-DRW achieves strong detectability and robustness against post-processing and adaptive attacks, while preserving high data fidelity and fully supporting mixed-type features.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes TAB-DRW, a post-editing spectral watermarking scheme for synthetic tabular data. It applies Yeo-Johnson normalization and standardization to heterogeneous features, computes the DFT, and perturbs the imaginary parts of adaptively selected entries according to precomputed pseudorandom bits generated by a novel rank-based method that supports row-wise retrieval without storage. The central empirical claim is that the resulting watermarked tables exhibit strong detectability and robustness to post-processing and adaptive attacks on five benchmark datasets while preserving high data fidelity and supporting mixed discrete-continuous features.
Significance. If the experimental results hold with adequate quantitative support, the work would offer a computationally lightweight alternative to diffusion-model-based watermarking, addressing key limitations in cost, mixed-type support, and storage overhead. The rank-based keying mechanism and frequency-domain embedding could enable practical provenance tracking for synthetic data in regulated domains such as healthcare and finance.
major comments (2)
- [§4] §4 (Experiments) and abstract: The claim of 'strong detectability and robustness' is presented without any reported quantitative metrics (e.g., detection accuracy, AUC, bit-error rate), error bars, or statistical tests across the five datasets and listed attacks. This absence prevents verification of the central empirical claim and must be addressed with full tables of results.
- [§3.1–3.2] §3.1–3.2 (Method): The adaptive selection criterion for DFT entries and the precise effect of imaginary-part perturbation on the inverse transform are not fully specified. Because utility preservation is load-bearing for the overall contribution, the manuscript should include an explicit derivation or pseudocode showing that the modification preserves the real-valued output and does not systematically alter marginal distributions.
minor comments (2)
- [Figures in §4] Figure captions and axis labels in the experimental section should explicitly state the attack parameters (e.g., noise level, row deletion fraction) used in each robustness test.
- [§2] The related-work section should include a direct comparison table against at least one recent tabular watermarking baseline on the same five datasets to contextualize the reported fidelity-robustness trade-off.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address each major comment below and will revise the manuscript to incorporate the suggested improvements.
- Referee: [§4] §4 (Experiments) and abstract: The claim of 'strong detectability and robustness' is presented without any reported quantitative metrics (e.g., detection accuracy, AUC, bit-error rate), error bars, or statistical tests across the five datasets and listed attacks. This absence prevents verification of the central empirical claim and must be addressed with full tables of results.
  Authors: We agree that explicit quantitative metrics are required to support the central claims. In the revised manuscript we will add comprehensive tables in Section 4 (and update the abstract) that report detection accuracy, AUC, and bit-error rate for each dataset and attack, together with error bars from repeated trials and results of statistical significance tests. Revision: yes.
- Referee: [§3.1–3.2] §3.1–3.2 (Method): The adaptive selection criterion for DFT entries and the precise effect of imaginary-part perturbation on the inverse transform are not fully specified. Because utility preservation is load-bearing for the overall contribution, the manuscript should include an explicit derivation or pseudocode showing that the modification preserves the real-valued output and does not systematically alter marginal distributions.
  Authors: We thank the referee for noting the need for greater precision. The adaptive selection chooses high-magnitude DFT coefficients to limit utility impact. Perturbations are applied to imaginary parts while conjugate symmetry is maintained on the Hermitian pair, guaranteeing a real-valued inverse DFT. The rank-based perturbations are zero-mean and bounded, so expected marginal distributions remain unchanged. The revision will add a short derivation and pseudocode in Section 3 to make these properties explicit. Revision: yes.
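The real-valuedness claim in the second response follows from a standard DFT identity. The short derivation below is a sketch in notation chosen here for illustration (row length $p$, spectrum $y_k$, perturbation size $\eta_k$); it is not the derivation the authors promise to add, and it says nothing about the separate claim on marginal distributions.

```latex
% Inverse DFT of a length-p spectrum, and its complex conjugate (indices mod p):
x_t = \frac{1}{p}\sum_{k=0}^{p-1} y_k\, e^{2\pi i k t/p},
\qquad
\overline{x_t} = \frac{1}{p}\sum_{k=0}^{p-1} \overline{y_k}\, e^{-2\pi i k t/p}
             = \frac{1}{p}\sum_{k=0}^{p-1} \overline{y_{p-k}}\, e^{2\pi i k t/p}.

% Hence x_t is real for all t whenever y_{p-k} = \overline{y_k} for all k.
% Perturb one imaginary part and mirror it on the conjugate bin:
y_k' = y_k + i\eta_k, \qquad y_{p-k}' = \overline{y_k'} = y_{p-k} - i\eta_k .

% The symmetry is preserved, and the change in the row is real:
\Delta x_t = \frac{i\eta_k}{p}\left(e^{2\pi i k t/p} - e^{-2\pi i k t/p}\right)
           = -\frac{2\eta_k}{p}\,\sin\!\left(\frac{2\pi k t}{p}\right).
```

Under these assumptions each perturbed bin adds a real sinusoid of amplitude $2|\eta_k|/p$ across the columns of the row, which is the quantity a fidelity argument would need to bound.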
Circularity Check
No significant circularity; empirical method with independent experimental validation
The paper describes a practical post-editing watermarking pipeline (Yeo-Johnson normalization, standardization, DFT, imaginary-part perturbation keyed by rank-based pseudorandom bits) whose detectability and robustness claims rest on experiments across five benchmark datasets rather than any closed-form derivation or self-referential prediction. No equations reduce a claimed result to a fitted parameter or prior self-citation by construction; the central contribution is an engineering scheme whose performance is measured externally via fidelity metrics and attack survival rates. The method is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Yeo-Johnson transformation followed by standardization produces suitable input for DFT on mixed discrete-continuous tabular features
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "TAB-DRW embeds watermark signals in the frequency domain: ... applies the discrete Fourier transform (DFT), and adjusts the imaginary parts of adaptively selected entries according to precomputed pseudorandom bits."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Fundamental Trade-Offs in Multi-Bit Watermarking of Stochastic Processes
  Derives matched converse and achievability bounds that characterize optimal trade-offs among false-alarm probability, detection error probability, distortion, and information rate for multi-bit watermarking of station...
Reference graph
Works this paper leans on
- [1] Scott Aaronson. Watermarking of large language models. https://simons.berkeley.edu/talks/scott-aaronson-ut-austin-openai-2023-08-17, 2023
- [2] Samuel A Assefa, Danial Dervovic, Mahmoud Mahfouz, Robert E Tillman, Prashant Reddy, and Manuela Veloso. Generating synthetic data in finance: Opportunities, challenges and pitfalls. In Proceedings of the First ACM International Conference on AI in Finance, pages 1–8, 2020
- [3] André Bauer, Simon Trapp, Michael Stenger, Robert Leppich, Samuel Kounev, Mark Leznik, Kyle Chard, and Ian Foster. Comprehensive exploration of synthetic data generation: A survey. arXiv preprint arXiv:2401.02524, 2024
- [4] Barry Becker and Ronny Kohavi. Adult. UCI Machine Learning Repository, 1996. DOI: https://doi.org/10.24432/C5XW20
- [5] R. Bock. MAGIC Gamma Telescope. UCI Machine Learning Repository, 2004. DOI: https://doi.org/10.24432/C52C8B
- [6] Vadim Borisov, Tobias Leemann, Kathrin Seßler, Johannes Haug, Martin Pawelczyk, and Gjergji Kasneci. Deep neural networks and tabular data: A survey. IEEE Transactions on Neural Networks and Learning Systems, 2022
- [7] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794, 2016
- [8] Liancheng Fang, Aiwei Liu, Henry Peng Zou, Yankai Chen, Hengrui Zhang, Zhongfen Deng, and Philip S Yu. MUSE: Model-Agnostic Tabular Watermarking via Multi-Sample Selection. arXiv preprint arXiv:2505.24267, 2025
- [9] Manbir Gulati and Paul Roysdon. TabMT: Generating tabular data with masked transformers. In Advances in Neural Information Processing Systems, 2024
- [10] Xu Guo and Yiqiang Chen. Generative AI for synthetic data generation: Methods, challenges and the future. arXiv preprint arXiv:2403.04190, 2024
- [11] Hengzhi He, Peiyu Yu, Junpeng Ren, Ying Nian Wu, and Guang Cheng. Watermarking generative tabular data. arXiv preprint arXiv:2405.14018, 2024
- [12] Baizhou Huang and Xiaojun Wan. WaterPool: A language model watermark mitigating trade-offs among imperceptibility, efficacy and robustness. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 4156–4182. Association for Compu...
- [13] Jayoung Kim, Chaejeong Lee, and Noseong Park. Stasy: Score-based tabular data synthesis. In International Conference on Learning Representations, 2023
- [14] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In International Conference on Learning Representations, 2014
- [15] John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom Goldstein. A watermark for large language models. In International Conference on Machine Learning, pages 17061–17084. PMLR, 2023
- [16] Murat Koklu and Ilker Ali Özkan. Multiclass classification of dry beans using computer vision and machine learning techniques. Comput. Electron. Agric., 174:105507, 2020
- [17] Akim Kotelnikov, Dmitry Baranchuk, Ivan Rubachev, and Artem Babenko. TabDDPM: Modelling tabular data with diffusion models. In International Conference on Machine Learning, pages 17564–17579. PMLR, 2023
- [18] Rohith Kuditipudi, John Thickstun, Tatsunori Hashimoto, and Percy Liang. Robust distortion-free watermarks for language models. Transactions on Machine Learning Research, 2024
- [19] Xiang Li, Feng Ruan, Huiyuan Wang, Qi Long, and Weijie J Su. A statistical framework of watermarks for large language models: Pivot, detection efficiency and optimal rules. The Annals of Statistics, 53(1):322–351, 2025
- [20] Ruibo Liu, Jerry Wei, Fangyu Liu, Chenglei Si, Yanzhe Zhang, Jinmeng Rao, Steven Zheng, Daiyi Peng, Diyi Yang, Denny Zhou, and Andrew M. Dai. Best practices and lessons learned on synthetic data. In First Conference on Language Modeling, 2024
- [21] Tennison Liu, Zhaozhi Qian, Jeroen Berrevoets, and Mihaela van der Schaar. GOGGLE: Generative modelling for tabular data by learning relational structure. In International Conference on Learning Representations, 2023
- [22] Nils Lukas and Florian Kerschbaum. PTW: Pivotal tuning watermarking for pre-trained image generators. In 32nd USENIX Security Symposium (USENIX Security 23), pages 2241–2258, 2023
- [23] Alan V Oppenheim. Discrete-time signal processing. Pearson Education India, 1999
- [24] Christine I Podilchuk and Edward J Delp. Digital watermarking: Algorithms and applications. IEEE Signal Processing Magazine, 18(4):33–46, 2001
- [25] Zhaozhi Qian, Thomas Callender, Bogdan Cebere, Sam M Janes, Neal Navani, and Mihaela van der Schaar. Synthetic data for privacy-preserving clinical risk prediction. Scientific Reports, 14(1):25676, 2024
- [26] C. Sakar and Yomi Kastro. Online Shoppers Purchasing Intention Dataset. UCI Machine Learning Repository. DOI: https://doi.org/10.24432/C5F88Q
- [28] SciPy Developers. scipy.stats.yeojohnson — scipy documentation. https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.yeojohnson.html, 2025
- [29] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021
- [30] Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge University Press, 2018
- [31] Nikhil Vyas, Sham M Kakade, and Boaz Barak. On provable copyright protection for generative models. In International Conference on Machine Learning, pages 35277–35299. PMLR, 2023
- [32] Alex X. Wang and Binh P. Nguyen. TTVAE: Transformer-based generative modeling for tabular data generation. Artificial Intelligence, 340:104292, 2025
- [33] Yuxin Wen, John Kirchenbauer, Jonas Geiping, and Tom Goldstein. Tree-rings watermarks: Invisible fingerprints for diffusion images. In Advances in Neural Information Processing Systems, pages 58047–58063, 2023
- [34] Bram Wouters. Optimizing watermarks for large language models. In International Conference on Machine Learning, pages 53251–53269. PMLR, 2024
- [35] Yihan Wu, Zhengmian Hu, Hongyang Zhang, and Heng Huang. A resilient and accessible distribution-preserving watermark for large language models. In International Conference on Machine Learning. PMLR, 2024
- [36] Yangxinyu Xie, Xiang Li, Tanwi Mallick, Weijie Su, and Ruixun Zhang. Debiasing watermarks for large language models via maximal coupling. Journal of the American Statistical Association, (just-accepted):1–21, 2025
- [37] Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, and Kalyan Veeramachaneni. Modeling tabular data using conditional GAN. In Advances in Neural Information Processing Systems, 2019
- [38] Zijin Yang, Kai Zeng, Kejiang Chen, Han Fang, Weiming Zhang, and Nenghai Yu. Gaussian shading: Provable performance-lossless image watermarking for diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12162–12171, 2024
- [39] I-Cheng Yeh. Default of Credit Card Clients. UCI Machine Learning Repository, 2009. DOI: https://doi.org/10.24432/C55S3H
- [40] In-Kwon Yeo and Richard A Johnson. A new family of power transformations to improve normality or symmetry. Biometrika, 87(4):954–959, 2000
- [41] Hengrui Zhang, Jiani Zhang, Balasubramaniam Srinivasan, Zhengyuan Shen, Xiao Qin, Christos Faloutsos, Huzefa Rangwala, and George Karypis. Mixed-type tabular data synthesis with score-based diffusion in latent space. In International Conference on Learning Representations, 2024
- [42] Lijun Zhang, Xiao Liu, Antoni Viros Martin, Cindy Xiong Bearfield, Yuriy Brun, and Hui Guan. Attack-resilient image watermarking using stable diffusion. In Advances in Neural Information Processing Systems, 2024
- [43] Yishuo Zhang, Nayyar A Zaidi, Jiahui Zhou, and Gang Li. GANBLR: A tabular data generation model. In 2021 IEEE International Conference on Data Mining, pages 181–190. IEEE, 2021
- [44] Xuandong Zhao, Prabhanjan Vijendra Ananth, Lei Li, and Yu-Xiang Wang. Provable robust watermarking for AI-generated text. In International Conference on Learning Representations, 2024
- [45] Zilong Zhao, Aditya Kunar, Robert Birke, and Lydia Y Chen. CTAB-GAN: Effective table data synthesizing. In Asian Conference on Machine Learning, pages 97–112. PMLR, 2021
- [46] Yihao Zheng, Haocheng Xia, Junyuan Pang, Jinfei Liu, Kui Ren, Lingyang Chu, Yang Cao, and Li Xiong. TabularMark: Watermarking tabular datasets for machine learning. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, pages 3570–3584, 2024
- [47] Chaoyi Zhu, Jiayi Tang, Jeroen M Galjaard, Pin-Yu Chen, Robert Birke, Cornelis Bos, Lydia Y Chen, et al. TabWak: A watermark for tabular diffusion models. In International Conference on Learning Representations, 2025
- [48] [Figure residue: pipeline diagram over example columns age, sex, edu, income, debt, with stages labeled Standardized Data, YJT, DFT, IDFT, IYJT.]
  The key-dependent variability in the frequency-domain representation does not substantially affect watermark distortion or detectability. In other words, given randomly selected watermark keys, the privacy-enhanced TAB-DRW should exhibit nearly consistent performance in terms of data fidelity and watermark detectability. From a theoretical perspective, si...
- [49] All models are wrong, but some are useful
  The approach supports multi-key scenarios, as a watermark embedded with one key cannot be detected using another, thereby effectively avoiding false positives. Furthermore, the collision-free key space must be sufficiently large to support large-scale deployment. The motivation behind this design lies in the sensitivity of the row-wise DFT to column order:...
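- [50] Pearson correlation coefficients (PCC). Given that each column is centered and standardized, the difference of PCC between each column pair $(j, \ell)$ is given by $\Delta r_{j\ell} = \frac{1}{N}\sum_{i=1}^{N}\left(x_{i,j}\,\Delta x_{i,\ell} + x_{i,\ell}\,\Delta x_{i,j} + \Delta x_{i,j}\,\Delta x_{i,\ell}\right)$. Plugging $\Delta x_{i,j} = -\alpha\,\beta_j^{\top} x_i$ leads to $\Delta r_{j\ell} = -\alpha\left([\Sigma\beta_\ell]_j + [\Sigma\beta_j]_\ell\right) + \alpha^2\,\beta_j^{\top}\Sigma\,\beta_\ell$, where $\Sigma = \frac{1}{N}X^{\top}X$ with $\operatorname{diag}(\Sigma) = I$.
- [51] Empirical distribution. Consider the coupling matching $x_{i,j}$ to $x_{i,j} + \Delta x_{i,j}$ for each $i$; we bound the transport cost as below: $W_2^2(\rho_j, \nu_j) \le \frac{1}{N}\sum_{i=1}^{N}(\Delta x_{i,j})^2 = \alpha^2\,\beta_j^{\top}\Sigma\,\beta_j$, which leads to the claimed inequality.
  E.3 Proof of Theorem 2
  Lemma 1 (Gaussian tail bound). Let $\Phi(u)$ denote the standard normal CDF and $Q(u) := 1 - \Phi(u)$. For any $u > 0$, $Q(u) \le \frac{1}{2}e^{-u^2/2}$. Proof...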
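- [52] [Equation fragment: $\Phi\!\left(\sqrt{1 + \frac{\lambda_{\min}}{\sigma^2}}\right) - \frac{1}{2}\Big] + \frac{\sigma}{\sqrt{\sigma^2 + \lambda_{\max}}}$]
  under $H_0$, thus has expected value $\frac{m}{2}$ and variance $\frac{m}{4}$. By the Central Limit Theorem, the Z-score under $H_0$ follows $Z = \dfrac{\sum_{i=1}^{N} T_i - \frac{mN}{2}}{\sqrt{\frac{mN}{4}}} \xrightarrow{d} \mathcal{N}(0,1)$ as $N \to \infty$.
  Proof of Lemma 3. For each index pair $(i, j)$ of effective entries, define the events $E_{i,j} := \{\operatorname{sign}(\Im(y_{i,j})) = 2\zeta_{i,j} - 1\}$ and $A_{i,j} := \{\operatorname{sign}(\Im(y_{i,j})) = 1\}$. We will show that the indicator variables $\{I(E_{i,j})\}$i...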
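The Z-score in the fragment above can be computed directly once per-row match counts are available. The sketch below assumes that $T_i$ is the number of effective entries in row $i$ whose imaginary-part sign agrees with the expected bit, and that every row has the same number $m$ of effective entries; both readings are inferred from the truncated passage, and the function names are hypothetical.

```python
import math


def z_score(match_counts, m):
    """Z = (sum_i T_i - m*N/2) / sqrt(m*N/4) for N rows with m effective entries each."""
    n_rows = len(match_counts)
    return (sum(match_counts) - m * n_rows / 2) / math.sqrt(m * n_rows / 4)


def upper_tail(z):
    """Q(z) = 1 - Phi(z), the one-sided tail of the N(0, 1) limit."""
    return 0.5 * math.erfc(z / math.sqrt(2))


# Four rows, four effective entries each (made-up counts).
z_null_like = z_score([2, 2, 2, 2], m=4)   # exactly half the entries match
z_marked_like = z_score([4, 4, 4, 4], m=4)  # every entry matches
```

A table would be flagged as watermarked when its Z-score exceeds a calibrated threshold; `upper_tail` gives the corresponding asymptotic tail probability and also lets one check the Gaussian tail bound $Q(u) \le \frac{1}{2}e^{-u^2/2}$ numerically.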
- [53] Density measures the distributional similarity between synthetic and real data. For each numerical column, we compute the Kolmogorov–Smirnov statistic (KST); for each categorical column, we compute the total variation distance (TVD). The per-column scores are then averaged to yield the overall Density score. Higher values indicate closer alignment of margi...
- [54] Corr evaluates preservation of inter-column relationships. We compute the Pearson correlation coefficient for every pair of columns and report their mean as the Corr score. Larger values indicate more faithful reproduction of real feature dependencies
- [55] C2ST quantifies statistical indistinguishability between synthetic and real data. A logistic regression model is trained and evaluated on the training and validation sets, both of which contain a mix of real and synthetic data. We then report the complement of the ROC AUC averaged over all validation splits. Higher values indicate that the model cannot dis...
- [56] MLE assesses downstream machine learning utility on supervised tasks. We train an XGBoost model [7] on synthetic data, then evaluate it on the real testing set, using AUC for the classification task and RMSE for the regression task. Higher MLE scores reflect better machine learning utility of the synthetic data. Regarding the metric for watermark detect...
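The two per-column quantities behind the Density metric are easy to state in code. The sketch below uses only the standard library; turning each distance into a "higher is better" score by taking its complement (1 minus the statistic) is an assumption made here to match the stated direction of the metric, and `density_score` is a hypothetical helper name.

```python
import bisect
from collections import Counter


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    return max(abs(bisect.bisect_right(a, v) / len(a) - bisect.bisect_right(b, v) / len(b))
               for v in set(a) | set(b))


def total_variation(a, b):
    """Total variation distance between two empirical categorical distributions."""
    ca, cb = Counter(a), Counter(b)
    return 0.5 * sum(abs(ca[k] / len(a) - cb[k] / len(b)) for k in set(ca) | set(cb))


def density_score(real_num, synth_num, real_cat, synth_cat):
    """Mean of (1 - KS) over numerical and (1 - TVD) over categorical columns (assumed form)."""
    scores = [1 - ks_statistic(r, s) for r, s in zip(real_num, synth_num)]
    scores += [1 - total_variation(r, s) for r, s in zip(real_cat, synth_cat)]
    return sum(scores) / len(scores)


# Identical columns give the best possible score under this convention.
perfect = density_score([[1, 2, 3]], [[1, 2, 3]], [["a", "b"]], [["a", "b"]])
```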
- [57] Generate 100 unwatermarked tables with 1K rows (100K rows in total) using TabSyn
- [58] Bootstrap sampling rows from 100K rows to construct 100K synthetic tables for watermark detection
- [59] Set the 100-th order statistic $Z_{(100)}$ as the threshold. Then we have $F_{H_0}(Z_{(100)}) \sim \mathrm{Beta}(100, 99901)$. By the Clopper-Pearson interval, the estimation procedure above is sufficient to calibrate the critical value $q_{0.001}$, since the realized FPR has a 95% confidence interval of roughly $0.001 \pm 2\times 10^{-4}$. Table 9 presents the empirical mean and standard deviation of the...
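The quoted interval can be sanity-checked from the moments of the Beta(100, 99901) distribution. The sketch below uses a two-standard-deviation band as a rough stand-in for the Clopper-Pearson interval named in the text, and `calibrate_threshold` reads "the 100-th order statistic" as the 100th largest of the 100K null Z-scores, which is the reading that yields an upper-tail false-positive rate near 0.001; both choices are assumptions made for illustration.

```python
import math


def beta_mean_sd(a, b):
    """Mean and standard deviation of a Beta(a, b) distribution."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(var)


def calibrate_threshold(null_scores, k=100):
    """Return the k-th largest null Z-score (upper-tail reading of the order statistic)."""
    return sorted(null_scores, reverse=True)[k - 1]


mean, sd = beta_mean_sd(100, 99901)
half_width = 2 * sd  # rough 95% band under a normal approximation
```

Under these assumptions the mean is about 0.001 and the half-width is about 2e-4, consistent with the interval stated in the calibration step.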
- [60] G(aussian)-Noise: adds Gaussian noise with zero mean and a standard deviation equal to 10% of each cell’s value for numerical attributes
- [61] C(ategorical)-Noise: perturbs categorical entries by randomly replacing 10% of cells with values sampled from other rows in the same column
- [62] A(daptive)-Noise: adds Gaussian noise with zero mean and 0.1 standard deviation to standardized attributes. Specifically, we conduct the process below for each column $j \in \{1, \dots, p\}$: $z_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j}$, $z'_{ij} = z_{ij} + \epsilon \cdot \mathcal{N}(0, 1)$, $x'_{ij} = z'_{ij} \cdot \sigma_j + \mu_j$, where $\epsilon = 0.1$ is the attack strength, and $\mu_j$ and $\sigma_j$ are the empirical mean and standard deviation of co...
- [63] Quantization: discretizes numerical columns using quantile transformation with the 10 quantile bins and maps those discrete quantile levels back to the original data domain with the inverse transform
- [64] Resample: redistributes samples to achieve equal representation across target classes by super-sampling underrepresented classes and sub-sampling overrepresented ones.
  10. Shuffle: randomly permutes all rows of the table.
  Table 10: Data fidelity and watermark detectability evaluated on tables generated by original TabSyn implementation. No watermarking ...
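Three of the attacks listed above are simple enough to sketch with the standard library. The code below is an illustration under assumptions, not the paper's evaluation code: the population standard deviation is used for $\sigma_j$ (the text only says "empirical"), a seeded `random.Random` is used for reproducibility, and all function names are hypothetical.

```python
import math
import random


def gaussian_noise(column, frac=0.1, rng=None):
    """G-Noise: add N(0, (frac * |x|)^2) noise to each numerical cell."""
    rng = random.Random(0) if rng is None else rng
    return [x + rng.gauss(0.0, abs(frac * x)) for x in column]


def adaptive_noise(column, eps=0.1, rng=None):
    """A-Noise: standardize, add eps * N(0, 1), then map back to the original scale."""
    rng = random.Random(0) if rng is None else rng
    mu = sum(column) / len(column)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in column) / len(column))
    if sigma == 0:
        return list(column)  # constant column: nothing to standardize
    return [((x - mu) / sigma + eps * rng.gauss(0.0, 1.0)) * sigma + mu for x in column]


def shuffle_rows(table, rng=None):
    """Shuffle: return a randomly permuted copy of the rows."""
    rng = random.Random(0) if rng is None else rng
    rows = list(table)
    rng.shuffle(rows)
    return rows


income = [40.5, 88.0, 61.2, 30.1]  # made-up values
attacked = adaptive_noise(income, eps=0.1)
```

Note the difference in scale between the two noise attacks: G-Noise is proportional to each cell's own magnitude, whereas A-Noise is proportional to the column's spread, so cells near zero are barely touched by the former but fully exposed to the latter.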
- [65] G(aussian)-Noise: adds Gaussian noise with zero mean and a standard deviation equal to 20% of each cell’s value for numerical attributes
- [66] C(ategorical)-Noise: perturbs categorical entries by randomly replacing 20% of cells with values sampled from other rows in the same column.
  6. A(daptive)-Noise: adds Gaussian noise with zero mean and 0.2 standard deviation to standardized attributes
- [67] Quantization: discretizes numerical columns using quantile transformation with the 10 quantile bins and maps those discrete quantile levels back to the original data domain with the inverse transform.
  Table 21: Data fidelity and watermark detectability of privacy-enhanced TAB-DRW under varying watermark keys. All experiments use $(\gamma, \delta) = (0.5, 0.5)$. Fide...