arxiv: 2403.02780 · v9 · submitted 2024-03-05 · 💻 cs.LG · math.OC

Data Collaboration Analysis with Orthonormal Basis Selection and Alignment

Pith reviewed 2026-05-24 03:34 UTC · model grok-4.3

classification 💻 cs.LG · math.OC
keywords data collaboration · orthogonal Procrustes · basis alignment · privacy-preserving learning · multi-party machine learning · linear projections · orthonormal bases · change of basis

The pith

Enforcing orthonormal bases turns data collaboration alignment into a closed-form Orthogonal Procrustes solution that makes performance invariant to the target basis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to align linear projections from multiple private datasets without sharing the secret bases by requiring both secret and target bases to be orthonormal. This change converts the alignment step into the Orthogonal Procrustes problem, which has a closed-form solution, and produces change-of-basis matrices that achieve orthogonal concordance. Under this concordance every party's representation matches the others up to one shared orthogonal transformation, so the accuracy of any downstream model becomes independent of which target basis was selected. The method keeps the original one-round communication pattern and privacy assumptions while cutting alignment cost from O(min{a(cl)^2,a^2cl}) to O(acl^2).
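
The concordance idea in the paragraph above can be illustrated with a toy simulation. This is a minimal sketch, not the paper's algorithm: it assumes every party's secret basis spans the same subspace and differs only by a private rotation, and it uses party 0's anchor representation as the target. All names and sizes (`anchor`, `secrets`, `m`, `l`, `a`) are hypothetical.

```python
import numpy as np

def procrustes(A, B):
    """Orthogonal G minimizing ||A @ G - B||_F, via the SVD of A^T B."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

rng = np.random.default_rng(1)
m, l, a, parties = 10, 3, 40, 3          # hypothetical sizes

anchor = rng.standard_normal((a, m))     # anchor data shared by all parties
F, _ = np.linalg.qr(rng.standard_normal((m, l)))  # orthonormal basis of the common subspace

# Assumption: each secret basis is the common basis times a private rotation.
secrets = []
for _ in range(parties):
    E, _ = np.linalg.qr(rng.standard_normal((l, l)))
    secrets.append(F @ E)

# Assumption: the target is party 0's projected anchor data.
target = anchor @ secrets[0]
G = [procrustes(anchor @ Fi, target) for Fi in secrets]

# Project the same records through every party's pipeline and compare.
x = rng.standard_normal((5, m))
reps = [x @ Fi @ Gi for Fi, Gi in zip(secrets, G)]
print(all(np.allclose(reps[0], r) for r in reps))
```

In this idealized setting the aligned representations coincide across parties, which is the "shared orthogonal transformation" property in miniature. Real data and noisy anchors would only satisfy it approximately unless the paper's assumptions hold.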

Core claim

By selecting orthonormal secret and target bases, the resulting change-of-basis matrices achieve orthogonal concordance: all parties' representations are aligned up to a shared orthogonal transform. This renders downstream performance invariant to the target basis. Alignment reduces to the Orthogonal Procrustes problem and admits a closed-form solution that lowers complexity from O(min{a(cl)^2,a^2cl}) to O(acl^2).
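
For readers unfamiliar with the closed form mentioned above, the following sketch shows the standard SVD solution of the Orthogonal Procrustes problem in NumPy. The matrices `A`, `B`, and `Q` are made-up test data, and the function name is an illustrative choice; nothing here is taken from the paper's code.

```python
import numpy as np

def orthogonal_procrustes(A, B):
    """Return the orthogonal matrix G that minimizes ||A @ G - B||_F."""
    # Closed form: take the SVD of A^T B and discard the singular values.
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 4))                   # hypothetical source matrix
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))   # hidden orthogonal map
B = A @ Q                                          # target is an exact rotation of A

G = orthogonal_procrustes(A, B)
print(np.allclose(G.T @ G, np.eye(4)))   # checks that G is orthogonal
print(np.allclose(G, Q))                 # checks that the hidden map is recovered
```

The solver needs one SVD of a small square matrix and no iteration, which is the property the core claim relies on.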

What carries the argument

Orthonormal Data Collaboration (ODC), which forces orthonormal secret and target bases so that alignment becomes the Orthogonal Procrustes problem and yields orthogonal concordance.

If this is right

  • Alignment cost drops from O(min{a(cl)^2,a^2cl}) to O(acl^2), which is linear in a and c and quadratic only in l.
  • Reported wall-clock speedups reach up to 100 times on the paper's benchmarks, with equal or better accuracy.
  • One-round communication and the original privacy assumptions of data collaboration remain intact.
  • Downstream model performance no longer depends on the particular choice of target basis.
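
The complexity bullet can be made concrete by dividing the two bounds quoted in the core claim. The worked ratio below treats the O(·) expressions as plain operation counts, which is an assumption that ignores constant factors, and it leaves a, c, and l undefined, as this page does.

```latex
\frac{\min\{a(cl)^2,\; a^2 cl\}}{a c l^2}
  = \min\left\{\frac{a c^2 l^2}{a c l^2},\; \frac{a^2 c l}{a c l^2}\right\}
  = \min\left\{c,\; \frac{a}{l}\right\}
```

Under that reading, the closed-form alignment saves a factor of min{c, a/l}, so the gain is large only when both c and a/l are large.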

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The invariance property could let practitioners pick the numerically most stable orthonormal target basis without accuracy trade-offs.
  • The same orthonormal reduction might apply to other multi-party linear-projection schemes that currently solve alignment iteratively.
  • Because the method is a drop-in replacement, existing data-collaboration codebases can adopt it with minimal refactoring.

Load-bearing premise

Forcing orthonormality on the bases still spans the common subspace and leaves the original linear-projection semantics, information content, and privacy properties unchanged.
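
The spanning part of this premise is easy to check numerically. The sketch below orthonormalizes a random, hypothetical basis `F` with a QR factorization and compares the two column spaces. It says nothing about the information-content or privacy parts of the premise.

```python
import numpy as np

rng = np.random.default_rng(2)
F = rng.standard_normal((8, 3))    # hypothetical non-orthonormal basis (m x l)
Q, R = np.linalg.qr(F)             # reduced QR: Q has orthonormal columns, F = Q @ R

# Orthogonal projectors onto the column spaces of F and Q.
P_F = F @ np.linalg.pinv(F)
P_Q = Q @ Q.T

print(np.allclose(Q.T @ Q, np.eye(3)))   # Q is orthonormal
print(np.allclose(P_F, P_Q))             # same column space

# Projected data differ only by the invertible factor R.
X = rng.standard_normal((5, 8))
print(np.allclose(X @ F, (X @ Q) @ R))
```

Because `R` is invertible whenever `F` has full column rank, the projection through `Q` carries the same linear information as the projection through `F`.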

What would settle it

An experiment in which the same downstream model is trained on ODC-aligned data using two different orthonormal target bases and accuracy differs by more than numerical precision, or a dataset where the closed-form Procrustes solution fails to produce exact orthogonal concordance.
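
A small synthetic version of the first test might look like the sketch below. It is an illustration under strong assumptions, not the paper's experimental protocol: two parties share a common subspace, the two targets are orthogonal rotations of the same anchor representation, and ridge regression stands in for the downstream model. Every name, size, and the regularization value are hypothetical.

```python
import numpy as np

def procrustes(A, B):
    """Orthogonal G minimizing ||A @ G - B||_F (SVD closed form)."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

def random_orthogonal(k, rng):
    Q, _ = np.linalg.qr(rng.standard_normal((k, k)))
    return Q

def ridge_predict(Z_train, y_train, Z_test, lam=1e-2):
    k = Z_train.shape[1]
    w = np.linalg.solve(Z_train.T @ Z_train + lam * np.eye(k), Z_train.T @ y_train)
    return Z_test @ w

rng = np.random.default_rng(3)
m, l, a, n = 12, 4, 60, 100                     # hypothetical sizes
anchor = rng.standard_normal((a, m))
basis, _ = np.linalg.qr(rng.standard_normal((m, l)))   # common subspace
secrets = [basis @ random_orthogonal(l, rng) for _ in range(2)]

X = [rng.standard_normal((n, m)) for _ in range(2)]    # private data of two parties
w_true = rng.standard_normal(m)
y = np.concatenate([Xi @ w_true for Xi in X])
X_test = rng.standard_normal((20, m))

def predictions(target):
    # Align each party to the target, pool the aligned data, fit, and predict.
    G = [procrustes(anchor @ Fi, target) for Fi in secrets]
    Z_train = np.vstack([Xi @ Fi @ Gi for Xi, Fi, Gi in zip(X, secrets, G)])
    Z_test = X_test @ secrets[0] @ G[0]
    return ridge_predict(Z_train, y, Z_test)

target_1 = anchor @ basis @ random_orthogonal(l, rng)
target_2 = anchor @ basis @ random_orthogonal(l, rng)
pred_1, pred_2 = predictions(target_1), predictions(target_2)
gap = np.max(np.abs(pred_1 - pred_2))
print(gap)   # largest prediction difference between the two targets
```

If the invariance claim holds, `gap` should sit at the level of floating-point round-off. A gap well above that on real ODC-aligned data would be the kind of evidence this section describes.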

Figures

Figures reproduced from arXiv: 2403.02780 by Akiko Yoshise, Keiyu Nosaka, Yamato Suetake, Yuichi Takano.

Figure 1: Conceptual illustration of the Orthonormal Data Collaboration (ODC) framework. Each participating user independently projects their …
Figure 2: Visual privacy verification using CelebA [24]. Original images (panel (a)) compared to images after orthonormal projections (panel (b))
Figure 3: Absolute communication volume versus quantization bit-width
Figure 4: Heatmaps of the threshold number of FL rounds
Figure 5: Sensitivity curves for the break-even FL rounds
Figure 6: Wall-clock time with varying parameters ( …
Figure 7: Illustration of extremely heterogeneous splitting applied to TDC datasets. Binary-labeled data are partitioned across four users, each of …
Figure 8: Visual privacy analysis using CelebA.

Original abstract

Data Collaboration (DC) enables multiple parties to jointly train a model by sharing only linear projections of their private datasets. The core challenge in DC is to align the bases of these projections without revealing each party's secret basis. While existing theory suggests that any target basis spanning the common subspace should suffice, in practice, the choice of basis can substantially affect both accuracy and numerical stability. We introduce Orthonormal Data Collaboration (ODC), which enforces orthonormal secret and target bases, thereby reducing alignment to the classical Orthogonal Procrustes problem, which admits a closed-form solution. We prove that the resulting change-of-basis matrices achieve orthogonal concordance, aligning all parties' representations up to a shared orthogonal transform and rendering downstream performance invariant to the target basis. Computationally, ODC reduces the alignment complexity from O(min{a(cl)^2,a^2cl}) to O(acl^2), and empirical evaluations show up to 100 times speedups with equal or better accuracy across benchmarks. ODC preserves DC's one-round communication pattern and privacy assumptions, providing a simple and efficient drop-in improvement to existing DC pipelines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 0 minor

Summary. The manuscript introduces Orthonormal Data Collaboration (ODC) as an enhancement to standard Data Collaboration (DC). In DC, parties share only linear projections of private data and must align bases without revealing secret bases. ODC enforces orthonormal secret and target bases, reducing the alignment step to the classical Orthogonal Procrustes problem (closed-form SVD solution). The central claim is a proof that the resulting change-of-basis matrices achieve orthogonal concordance: all parties' representations become aligned up to a shared orthogonal transform, rendering downstream performance invariant to target-basis choice. The method preserves the original one-round communication pattern and privacy model, reduces alignment complexity from O(min{a(cl)^2,a^2 cl}) to O(acl^2), and reports empirical speedups up to 100x with equal or better accuracy on benchmarks.

Significance. If the proof of orthogonal concordance holds, the result supplies a theoretically grounded, drop-in improvement that directly resolves the practical sensitivity of DC to basis selection while adding no communication or privacy overhead. The reduction to a standard, parameter-free problem (Orthogonal Procrustes) and the explicit complexity improvement are clear strengths; the empirical speedups and accuracy parity across benchmarks further support utility. The work credits the classical Procrustes literature and maintains the one-round privacy assumptions of prior DC papers.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive review and recommendation to accept. The referee's summary accurately reflects the contributions of Orthonormal Data Collaboration (ODC), including the reduction of alignment to the Orthogonal Procrustes problem, the orthogonal concordance property, complexity reduction, and preservation of the original privacy model.

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained

The paper proves that orthonormal secret and target bases reduce alignment to the classical Orthogonal Procrustes problem (an external, standard result in linear algebra) and yield change-of-basis matrices achieving orthogonal concordance. No equations or claims in the provided material reduce the result by construction to a fitted parameter, self-citation chain, or renamed input. The appeal to Procrustes is not load-bearing self-citation, and the one-round privacy model plus spanning-property preservation are stated without internal reduction to the target claim. The derivation is therefore independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, domain axioms, or invented entities are identifiable. The method relies on standard linear-algebra facts about orthonormal bases and the Orthogonal Procrustes problem.

pith-pipeline@v0.9.0 · 5732 in / 1125 out tokens · 24845 ms · 2026-05-24T03:34:40.837353+00:00 · methodology

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · 1 internal anchor

  [1] P. Rosati, P. Deeney, M. Cummins, L. van der Werff, T. Lynn, Social media and stock price reaction to data breach announcements: Evidence from US listed companies, Research in International Business and Finance 47 (2019) 458–469

  [2] B. McMahan, E. Moore, D. Ramage, S. Hampson, B. A. y Arcas, Communication-efficient learning of deep networks from decentralized data, in: Artificial Intelligence and Statistics, PMLR, 2017, pp. 1273–1282

  [3] C. Dwork, Differential privacy: A survey of results, in: International Conference on Theory and Applications of Models of Computation, Springer, 2008, pp. 1–19

  [4] K. Wei, J. Li, M. Ding, C. Ma, H. H. Yang, F. Farokhi, S. Jin, T. Q. Quek, H. V. Poor, Federated learning with differential privacy: Algorithms and performance analysis, IEEE transactions on information forensics and security 15 (2020) 3454–3469

  [5] R. Xu, N. Baracaldo, J. Joshi, Privacy-preserving machine learning: Methods, challenges and directions, arXiv preprint arXiv:2108.04417 (2021)

  [6] A. Imakura, T. Sakurai, Data collaboration analysis framework using centralization of individual intermediate representations for distributed data sets, ASCE-ASME Journal of Risk and Uncertainty in Engineering Systems, Part A: Civil Engineering 6 (2) (2020) 04020018

  [7] A. Imakura, X. Ye, T. Sakurai, Collaborative data analysis: Non-model sharing-type machine learning for distributed data, in: Knowledge Management and Acquisition for Intelligent Systems: 17th Pacific Rim Knowledge Acquisition Workshop, PKAW 2020, Yokohama, Japan, January 7–8, 2021, Proceedings 17, Springer, 2021, pp. 14–29

  [8] A. Imakura, A. Bogdanova, T. Yamazoe, K. Omote, T. Sakurai, Accuracy and privacy evaluations of collaborative data analysis, Proceedings of the AAAI Conference on Artificial Intelligence (2021)

  [9] A. Imakura, T. Sakurai, Y. Okada, T. Fujii, T. Sakamoto, H. Abe, Non-readily identifiable data collaboration analysis for multiple datasets including personal information, Information Fusion 98 (2023) 101826

  [10] H. Yamashiro, K. Omote, A. Imakura, T. Sakurai, Toward the application of differential privacy to data collaboration, IEEE Access PP (2024) 1–1. doi:10.1109/ACCESS.2024.3396146

  [11] A. Imakura, T. Sakurai, Feddcl: a federated data collaboration learning as a hybrid-type privacy-preserving framework based on federated learning and data collaboration, arXiv preprint arXiv:2409.18356 (2024)

  [12] Y. Kawakami, Y. Takano, A. Imakura, New solutions based on the generalized eigenvalue problem for the data collaboration analysis, arXiv preprint arXiv:2404.14164 (2024)

  [13] K. Nosaka, A. Yoshise, Creating collaborative data representations using matrix manifold optimal computation and automated hyperparameter tuning, in: 2023 IEEE 3rd International Conference on Electronic Communications, Internet of Things and Big Data (ICEIB), IEEE, 2023, pp. 180–185

  [14] P. H. Schönemann, A generalized solution of the orthogonal procrustes problem, Psychometrika 31 (1) (1966) 1–10

  [15] R. Penrose, A generalized inverse for matrices, Proceedings of the Cambridge Philosophical Society 51 (1955) 406–413

  [16] A. Mizoguchi, A. Imakura, T. Sakurai, Application of data collaboration analysis to distributed data with misaligned features, Informatics in Medicine Unlocked 32 (2022) 101013

  [17] A. Mizoguchi, A. Bogdanova, A. Imakura, T. Sakurai, Data collaboration analysis applied to compound datasets and the introduction of projection data to non-iid settings (2023)

  [18] T. Nakayama, Y. Kawamata, A. Toyoda, A. Imakura, R. Kagawa, M. Sanuki, R. Tsunoda, K. Yamagata, T. Sakurai, Y. Okada, Data collaboration for causal inference from limited medical testing and medication data (2025). arXiv:2501.06511. URL https://arxiv.org/abs/2501.06511

  [19] Y. Kawamata, R. Motai, Y. Okada, A. Imakura, T. Sakurai, Collaborative causal inference on distributed data, Expert Systems with Applications 244 (2024) 123024. doi:https://doi.org/10.1016/j.eswa.2023.123024. URL https://www.sciencedirect.com/science/article/pii/S0957417423035261

  [20] A. Bogdanova, A. Imakura, T. Sakurai, Dc-shap method for consistent explainability in privacy-preserving distributed machine learning, Human-Centric Intelligent Systems 3 (3) (2023) 197–210

  [21] A. Imakura, R. Tsunoda, R. Kagawa, K. Yamagata, T. Sakurai, Dc-cox: Data collaboration cox proportional hazards model for privacy-preserving survival analysis on multiple parties, Journal of Biomedical Informatics 137 (2023) 104264

  [22] A. Imakura, H. Inaba, Y. Okada, T. Sakurai, Interpretable collaborative data analysis on distributed data, Expert Systems with Applications 177 (2021) 114891. doi:https://doi.org/10.1016/j.eswa.2021.114891. URL https://www.sciencedirect.com/science/article/pii/S0957417421003328

  [23] T. Yanagi, S. Ikeda, N. Sukegawa, Y. Takano, Privacy-preserving recommender system using the data collaboration analysis for distributed datasets, arXiv preprint arXiv:2406.01603 (2024)

  [24] Z. Liu, P. Luo, X. Wang, X. Tang, Deep learning face attributes in the wild, in: Proceedings of International Conference on Computer Vision (ICCV), 2015

  [25] H. Nguyen, D. Zhuang, P.-Y. Wu, M. Chang, Autogan-based dimension reduction for privacy preservation, Neurocomputing 384 (2020) 94–103

  [26] J. V. Haxby, J. S. Guntupalli, A. C. Connolly, Y. O. Halchenko, B. R. Conroy, M. I. Gobbini, M. Hanke, P. J. Ramadge, A common, high-dimensional model of the representational space in human ventral temporal cortex, Neuron 72 (2) (2011) 404–416. doi:10.1016/j.neuron.2011.08.026

  [27] A. Lorbert, P. J. Ramadge, Kernel hyperalignment, in: Advances in Neural Information Processing Systems 25, 2012, pp. 1799–1807

  [28] S. Ling, Near-optimal bounds for generalized orthogonal procrustes problem via generalized power method, Applied and Computational Harmonic Analysis 66 (2023) 62–100

  [29] F. Nie, L. Tian, X. Li, Multiview clustering via adaptively weighted procrustes, in: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2018, pp. 2022–2030. doi:10.1145/3219819.3220049

  [30] X. Dong, D. Wu, F. Nie, R. Wang, X. Li, Multi-view clustering with adaptive procrustes on grassmann manifold, Information Sciences 609 (2022) 855–875. doi:10.1016/j.ins.2022.07.089

  [31] C. Wang, S. Mahadevan, Manifold alignment using procrustes analysis, in: Proceedings of the 25th International Conference on Machine Learning, 2008, pp. 1120–1127

  [32] E. Grave, A. Joulin, Q. Berthet, Unsupervised alignment of embeddings with wasserstein procrustes, in: Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, Vol. 89 of Proceedings of Machine Learning Research, 2019, pp. 1880–1890

  [33] X. Peng, G. Chen, C. Lin, M. Stevenson, Highly efficient knowledge graph embedding learning with orthogonal procrustes analysis, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021, pp. 2364–2375

  [34] R. Iakymchuk, D. Defour, C. Collange, S. Graillat, Reproducible and accurate matrix multiplication, in: International Symposium on Scientific Computing, Computer Arithmetic, and Validated Numerics, Springer, 2015, pp. 126–137

  [35] P.-G. Martinsson, G. Quintana-Ortí, N. Heavner, R. Van De Geijn, Householder qr factorization with randomization for column pivoting (hqrrp), SIAM Journal on Scientific Computing 39 (2) (2017) C96–C115

  [36] L. Wang, G. Libert, P. Manneback, Kalman filter algorithm based on singular value decomposition, in: [1992] Proceedings of the 31st IEEE Conference on Decision and Control, IEEE, 1992, pp. 1224–1229

  [37] R. Mahfoudhi, A fast triangular matrix inversion, in: Proceedings of the World Congress on Engineering, Vol. 1, 2012

  [38] K. Chen, L. Liu, Geometric data perturbation for privacy preserving outsourced data mining, Knowledge and information systems 29 (3) (2011) 657–695

  [39] K. Huang, T. Fu, W. Gao, Y. Zhao, Y. Roohani, J. Leskovec, C. W. Coley, C. Xiao, J. Sun, M. Zitnik, Therapeutics data commons: Machine learning datasets and tasks for drug discovery and development, arXiv preprint arXiv:2102.09548 (2021)

  [40] B. Becker, R. Kohavi, Adult, UCI Machine Learning Repository, DOI: https://doi.org/10.24432/C5XW20 (1996)

  [41] T. J. Pollard, A. E. Johnson, J. D. Raffa, L. A. Celi, R. G. Mark, O. Badawi, The eicu collaborative research database, a freely available multi-center database for critical care research, Scientific data 5 (1) (2018) 1–13

  [42] B. Balle, Y.-X. Wang, Improving the gaussian mechanism for differential privacy: Analytical calibration and optimal denoising, in: International Conference on Machine Learning, PMLR, 2018, pp. 394–403

  [43] C. Xu, F. Cheng, L. Chen, Z. Du, W. Li, G. Liu, P. W. Lee, Y. Tang, In silico prediction of chemical ames mutagenicity, Journal of chemical information and modeling 52 (11) (2012) 2840–2847

  [44] A. Mayr, G. Klambauer, T. Unterthiner, S. Hochreiter, Deeptox: toxicity prediction using deep learning, Frontiers in Environmental Science 3 (2016) 80

  [45] Z. Wu, B. Ramsundar, E. N. Feinberg, J. Gomes, C. Geniesse, A. S. Pappu, K. Leswing, V. Pande, Moleculenet: a benchmark for molecular machine learning, Chemical science 9 (2) (2018) 513–530

  [46] H. Veith, N. Southall, R. Huang, T. James, D. Fayne, N. Artemenko, M. Shen, J. Inglese, C. P. Austin, D. G. Lloyd, et al., Comprehensive characterization of cytochrome p450 isozyme selectivity across chemical libraries, Nature biotechnology 27 (11) (2009) 1050–1055

  [47] F. Schroff, D. Kalenichenko, J. Philbin, Facenet: A unified embedding for face recognition and clustering, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 815–823. doi:10.1109/CVPR.2015.7298682

  [48] C. Szegedy, S. Ioffe, V. Vanhoucke, A. A. Alemi, Inception-v4, inception-resnet and the impact of residual connections on learning, in: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI), 2017, pp. 4278–4284

  [49] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, A. Zisserman, Vggface2: A dataset for recognising faces across pose and age, in: 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG), 2018, pp. 67–74. doi:10.1109/FG.2018.00020

  [50] L. Deng, The mnist database of handwritten digit images for machine learning research [best of the web], IEEE signal processing magazine 29 (6) (2012) 141–142

  [51] H. Xiao, K. Rasul, R. Vollgraf, Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms, arXiv preprint arXiv:1708.07747 (2017)

  [52] Y. Wang, J. Xiao, T. O. Suzek, J. Zhang, J. Wang, S. H. Bryant, Pubchem: a public information system for analyzing bioactivities of small molecules, Nucleic acids research 37 (suppl_2) (2009) W623–W633