arXiv preprint arXiv:2401.02524
5 Pith papers cite this work. Polarity classification is still indexing.
citing papers explorer
- Robust Spectral Watermark for Synthetic Tabular Data
  TAB-DRW embeds detectable watermarks in the frequency domain of normalized synthetic tabular data via DFT and rank-based pseudorandom bits, achieving robustness to attacks while preserving fidelity and supporting mixed data types.
- Adversarial Arena: Crowdsourcing Data Generation through Interactive Competition
  Adversarial competition between attacker and defender teams generates diverse multi-turn conversational data that improves LLM performance on secure code generation benchmarks by 18-29%.
- Quality Degradation Attack in Synthetic Data
  Adversaries can degrade synthetic data quality via small manipulations such as label flipping or feature-importance interventions, substantially harming downstream model performance and increasing statistical divergence from real data.
- Scaling Synthetic Data Creation with 1,000,000,000 Personas
  A curated set of one billion personas enables scalable, diverse synthetic data generation for LLM training across reasoning, instructions, knowledge, NPCs, and tools.
- PuckTrick: A Library for Making Synthetic Data More Realistic
  PuckTrick library adds controlled imperfections to synthetic data and shows that models trained on the resulting contaminated data outperform those trained on clean synthetic data in financial dataset experiments.