
arxiv: 1906.08149 · v1 · pith:65DNVIG3new · submitted 2019-06-19 · 💻 cs.DB · cs.CR

Efficient privacy preservation of big data for accurate data mining

Pith reviewed 2026-05-25 19:50 UTC · model grok-4.3

classification 💻 cs.DB cs.CR
keywords privacy preservation · big data · data mining · perturbation algorithm · geometric transformations · data classification · nonreversible · scalability

The pith

PABIDOT uses optimal geometric transformations to perturb big data while preserving classification accuracy and privacy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PABIDOT as a nonreversible perturbation algorithm for privacy preservation of big data. It targets limitations in existing methods that struggle with efficiency, scalability, data utility, or privacy strength. The approach relies on optimal geometric transformations to create the perturbations. Experiments using nine datasets and five classification algorithms show PABIDOT outperforms two related algorithms in execution speed, scalability, attack resistance, and accuracy. This would matter if it allows organizations to perform data mining on large sensitive collections without major privacy or performance costs.

Core claim

PABIDOT is an efficient and scalable nonreversible perturbation algorithm for privacy preservation of big data via optimal geometric transformations. When tested with nine datasets and five classification algorithms, it excels in execution speed, scalability, attack resistance and accuracy in large-scale privacy-preserving data classification when compared with two other related privacy-preserving algorithms.

What carries the argument

PABIDOT, a perturbation algorithm that applies optimal geometric transformations to achieve non-reversibility while supporting downstream classification.
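
As a rough illustration of why a geometric transformation can leave classifiers largely unaffected, the sketch below rotates toy records in the plane of two attributes and checks that pairwise Euclidean distances are unchanged. It is a generic rotation example, not PABIDOT's actual procedure; the helper names `rotate_pair` and `pairwise_distances`, the toy data, and the choice of attributes are all assumptions made for illustration.

```python
import math

def rotate_pair(rows, i, j, theta_deg):
    """Rotate every record in the plane spanned by attributes i and j."""
    t = math.radians(theta_deg)
    c, s = math.cos(t), math.sin(t)
    out = []
    for row in rows:
        new = list(row)
        new[i] = c * row[i] - s * row[j]
        new[j] = s * row[i] + c * row[j]
        out.append(new)
    return out

def pairwise_distances(rows):
    """Euclidean distance between every pair of records."""
    return [math.dist(a, b) for k, a in enumerate(rows) for b in rows[k + 1:]]

# Hypothetical toy data: three records, three already-normalized attributes.
data = [[0.5, -1.2, 0.3], [1.1, 0.4, -0.7], [-0.9, 0.8, 1.5]]
perturbed = rotate_pair(data, 0, 2, theta_deg=35.0)

before = pairwise_distances(data)
after = pairwise_distances(perturbed)
```

The individual values change, but `before` and `after` agree up to floating-point error, which is the property distance-based classifiers depend on. A pure rotation like this one is invertible, so on its own it does not provide the non-reversibility the paper claims.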

If this is right

  • Privacy-preserving classification on big data can scale without major losses in speed or accuracy.
  • Nonreversible perturbation can provide stronger attack resistance than prior geometric methods while keeping utility high.
  • The same transformation approach works across multiple classification algorithms without per-algorithm redesign.
  • Execution time for privacy steps becomes short enough for routine use on large datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same geometric approach might extend to regression or clustering tasks on sensitive data with similar utility retention.
  • Widespread use could reduce reliance on heavier anonymization techniques that distort data more severely.
  • Testing on streaming or real-time big data sources would check whether the speed gains hold under continuous processing.

Load-bearing premise

The chosen geometric transformations can simultaneously prevent reversal to recover original data and retain enough statistical structure for high classification accuracy.
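
The Figure 2 caption describes a randomized expansion step in which positive values receive a positive random shift and negative values a negative one. A minimal sketch of that idea follows, assuming half-normal noise with scale `sigma` and leaving exact zeros untouched; both choices, and the function name `randomized_expansion`, are assumptions, since the page does not say how the noise is calibrated.

```python
import random

def randomized_expansion(values, sigma, rng):
    """Push each value further from zero by a random non-negative amount."""
    out = []
    for v in values:
        shift = abs(rng.gauss(0.0, sigma))  # assumed noise model: half-normal
        if v > 0:
            out.append(v + shift)   # positive values become more positive
        elif v < 0:
            out.append(v - shift)   # negative values become more negative
        else:
            out.append(v)           # assumption: zeros are left unchanged
    return out

rng = random.Random(0)              # fixed seed so the example is repeatable
values = [0.8, -0.3, 0.0, 2.1]
expanded = randomized_expansion(values, sigma=0.3, rng=rng)
```

Because the shifts are random and not stored, there is no fixed inverse map from `expanded` back to `values`, while the sign of every value is preserved. Whether this is enough to resist a determined attacker is exactly what the premise above leaves open.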

What would settle it

A replication experiment in which the perturbed data can be reversed to recover original sensitive values or in which classification accuracy falls below the two compared algorithms on the same nine datasets.
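
A replication of this kind needs a number for how close an attacker's reconstruction gets. The Figure 9 caption reports the minimum and average of std(D − Dr), and the sketch below computes those two summaries for an original dataset `D` and a reconstruction `Dr`. The use of the population standard deviation and the name `reconstruction_error` are assumptions; the toy numbers are made up.

```python
import statistics

def reconstruction_error(original, reconstructed):
    """Per-attribute std of (D - Dr), summarized as (minimum, average)."""
    n_attrs = len(original[0])
    stds = []
    for j in range(n_attrs):
        diffs = [o[j] - r[j] for o, r in zip(original, reconstructed)]
        stds.append(statistics.pstdev(diffs))
    return min(stds), sum(stds) / len(stds)

# Hypothetical data: attribute 0 is recovered exactly, attribute 1 is not.
D = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
Dr = [[1.0, 2.5], [3.0, 3.5], [5.0, 6.5]]
lowest, average = reconstruction_error(D, Dr)
```

A minimum near zero means at least one attribute has been recovered almost exactly, which is the outcome that would count against the non-reversibility claim.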

Figures

Figures reproduced from arXiv: 1906.08149 by D. Liu, I. Khalil, M.A.P. Chamikara, P. Bertok, S. Camtepe.

Figure 1: Basic flow and the architecture of PABIDOT. In this setting, the data owner is considered to be the trusted …
Figure 2: Effect of Randomized Expansion. The red arrows of the right-hand side show a positive shift where a calibrated positive random value is added to the positive value to increase the positiveness of the original value. The left-hand side which is represented by the blue arrows show a negative shift where a calibrated negative random value is added to the negative value to increase the negativeness of the orig…
Figure 3: Time consumption of PABIDOT. PABIDOT shows linear time complexity for the number of instances, and …
Figure 4: Time Consumption of PABIDOT before and after the efficiency optimization. Both PABIDOT and …
Figure 5: Time consumption comparison of the three methods. Due to the extremely low time consumption of PABIDOT, …
Figure 6: The process used to generate the classification models trained by the perturbed data. This figure represents …
Figure 7: Box plots for the datasets listed in Table 5. The boxplots in the figure show how each perturbation algorithm …
Figure 8: φ vs. θ. Figure 8a shows variation of the local minimum privacy guarantee (φi) curves for each attribute of the WCDS dataset. The φi values are utilized to generate the global minimum privacy guarantee (φ) curve as shown in Figure 8b; PABIDOT considers the global maximum of φ to select the best perturbation parameters. For the WCDS dataset the best perturbation parameters are θoptimal = 35 and Rifoptimal =…
Figure 9: minimum std(D − Dr) and average std(D − Dr) of the reconstructed datasets produced by Naive Snooping, ICA and known I/O. The red vertical lines show the instance of optimal perturbation parameter selection of PABIDOT. The red lines nearly indicate the point at which the corresponding perturbed dataset provides the highest privacy guarantee. This provides empirical evidence on PABIDOT providing the optimal pr…
Figure 10: Effect of σ on min(std(D − Dr)) and classification accuracy. When the σ of the randomized expansion is increased, the minimum std(D − Dp) increases as shown in Figure 10a. However, the classification accuracy shows only a minimal decrease against increasing σ. This confirms PABIDOT’s capability of maintaining utility at a constant level while providing increased resistance to increasing randomized expans…
Figure 1: We assume that only the perturbed data is released and the original data is not accessible …
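
The Figure 8 caption describes parameter selection as a max-min search: a privacy value φi per attribute, a global φ taken as the weakest attribute, and the angle θ chosen where φ peaks. The sketch below shows that search for a plain two-attribute rotation. It assumes φi is the standard deviation of the difference between original and perturbed values, which the page does not confirm, and it omits the other parameters the caption mentions; `rotate`, `phi`, and `best_angle` are hypothetical names.

```python
import math
import statistics

def rotate(rows, theta_deg):
    """Rotate two-attribute records counter-clockwise by theta_deg degrees."""
    t = math.radians(theta_deg)
    c, s = math.cos(t), math.sin(t)
    return [[c * x - s * y, s * x + c * y] for x, y in rows]

def phi(original, perturbed):
    """Global guarantee: smallest per-attribute std of (original - perturbed)."""
    per_attr = []
    for j in range(len(original[0])):
        diffs = [o[j] - p[j] for o, p in zip(original, perturbed)]
        per_attr.append(statistics.pstdev(diffs))  # assumed form of phi_i
    return min(per_attr)

def best_angle(rows, candidates):
    """Pick the candidate angle whose weakest attribute is best protected."""
    return max(candidates, key=lambda a: phi(rows, rotate(rows, a)))

# Hypothetical toy data and candidate angles in whole degrees.
data = [[0.5, -1.2], [1.1, 0.4], [-0.9, 0.8], [0.2, 1.5]]
candidates = list(range(1, 180))
theta = best_angle(data, candidates)
```

Maximizing the minimum, rather than the average, means no single attribute is left weakly protected in exchange for strong protection elsewhere.
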
Original abstract

Computing technologies pervade physical spaces and human lives, and produce a vast amount of data that is available for analysis. However, there is a growing concern that potentially sensitive data may become public if the collected data are not appropriately sanitized before being released for investigation. Although there are more than a few privacy-preserving methods available, they are not efficient, scalable or have problems with data utility, and/or privacy. This paper addresses these issues by proposing an efficient and scalable nonreversible perturbation algorithm, PABIDOT, for privacy preservation of big data via optimal geometric transformations. PABIDOT was tested for efficiency, scalability, resistance, and accuracy using nine datasets and five classification algorithms. Experiments show that PABIDOT excels in execution speed, scalability, attack resistance and accuracy in large-scale privacy-preserving data classification when compared with two other, related privacy-preserving algorithms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper proposes PABIDOT, a non-reversible perturbation algorithm for privacy preservation of big data that relies on optimal geometric transformations. It evaluates the algorithm on nine datasets with five classification algorithms, reporting superior execution speed, scalability, attack resistance, and classification accuracy relative to two existing privacy-preserving methods.

Significance. If the empirical claims hold, the work provides a practical, scalable technique for privacy-preserving classification on large datasets that improves upon prior methods in both efficiency and the utility-privacy balance. The breadth of evaluation across multiple datasets and classifiers supplies concrete evidence that could inform deployment decisions in data-mining applications.

minor comments (2)
  1. The abstract asserts positive experimental outcomes without supplying algorithm equations, attack-model definitions, or statistical tests; the full manuscript should make these elements explicit in the method and evaluation sections to allow independent verification of the superiority claims.
  2. The description of the geometric transformations should include a clear statement of the attack model and a formal argument (or empirical test) establishing non-invertibility, as this property is load-bearing for the privacy guarantee.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary of our work and the recommendation of minor revision. The referee's description accurately captures the PABIDOT proposal, its evaluation across nine datasets and five classifiers, and the reported advantages in speed, scalability, attack resistance, and accuracy.

Circularity Check

0 steps flagged

No significant circularity

The paper proposes the PABIDOT algorithm based on geometric transformations and reports empirical results on nine datasets with five classifiers, comparing speed, scalability, resistance, and accuracy to two baselines. No equations, derivations, or load-bearing steps are present in the provided text that reduce any claimed prediction, uniqueness, or result to a fitted parameter, self-citation chain, or definitional tautology. The evaluation is self-contained against external benchmarks and does not invoke prior author work as a substitute for independent verification.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated.

pith-pipeline@v0.9.0 · 5687 in / 1013 out tokens · 27099 ms · 2026-05-25T19:50:10.736580+00:00 · methodology

