Synchronous and Asynchronous Parallelism Approaches for Generalized Canonical Polyadic Tensor Decomposition with GenTen
Pith reviewed 2026-05-21 07:14 UTC · model grok-4.3
The pith
Parallel synchronous and asynchronous methods scale generalized canonical polyadic tensor decomposition to large datasets while maintaining accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors claim that the proposed synchronous hybrid MPI+Kokkos and asynchronous distributed approaches for GCP tensor decomposition achieve even better scalability to large data sets while maintaining accuracy, as studied on synthetic and publicly-available real-world datasets of varying sizes, dimensions, and sparsity patterns using several loss functions.
What carries the argument
The hybrid synchronous parallelism combining Kokkos for shared-memory random sampling and stochastic optimization with MPI for distribution, plus an asynchronous distributed scheme modeled on federated learning techniques.
If this is right
- Enables decomposition of larger count or binary datasets using flexible loss functions without excessive compute time.
- Supports portable execution across CPU and GPU hardware through the Kokkos layer.
- Delivers linear or near-linear speedups on distributed systems for varying sparsity patterns.
- Maintains decomposition accuracy across multiple loss functions on both synthetic and real data.
Where Pith is reading between the lines
- The asynchronous approach could extend to privacy-sensitive settings where raw data cannot be centralized.
- Similar parallelism patterns might apply to other tensor models such as Tucker decomposition for comparable gains.
- Testing on streaming or dynamically growing datasets would reveal whether the methods remain stable over time.
- Integration into existing data pipelines could reduce the barrier to using GCP on industry-scale problems.
Load-bearing premise
That the randomization, stochastic optimization, and parallelization steps can be combined without introducing convergence problems or accuracy loss relative to the serial GCP baseline.
What would settle it
Running the parallel GCP implementations and the serial baseline on the same large sparse real-world dataset and checking whether the final loss values and reconstruction errors remain comparable while wall-clock time decreases with added processors.
Figures
read the original abstract
The Canonical Polyadic (CP) tensor decomposition is a well-known method for interpretable analysis of high-dimensional data. Recently, the Generalized CP method (GCP) was introduced by Hong and Kolda to allow for flexible choice of the loss function in the optimization problem defining the CP model, enabling more interpretable decompositions of strongly non-Gaussian data such as count or binary data. Furthermore, Kolda and Hong introduced a version of GCP that leverages randomization and stochastic optimization to address scalability to large, sparse data sets. In this work, we take these ideas a step further and consider synchronous and asynchronous algorithms for parallel GCP tensor decomposition through the GenTen software package, exploiting both shared and distributed memory parallelism. We build on shared memory parallel CP decomposition algorithms utilizing Kokkos for portability across CPU and GPU architectures to support the random sampling and stochastic optimization methods required by GCP. We then couple this approach to the well-known medium-grained distributed memory parallelism scheme developed for traditional CP decompositions through MPI, providing a synchronous, hybrid MPI+Kokkos, parallel GCP decomposition capability. Finally, we propose an asynchronous distributed parallelism approach building on related techniques for federated learning to achieve even better scalability to large data sets. We study the effectiveness of the proposed synchronous and asynchronous approaches vis-a-vis computational cost and accuracy on synthetic and publicly-available real-world datasets of varying sizes, dimensions, and sparsity patterns using several loss functions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript extends prior Generalized CP (GCP) tensor decomposition work by Kolda and Hong, adding synchronous hybrid MPI+Kokkos and asynchronous distributed parallelism schemes within the GenTen package. The central claim is that these parallel approaches achieve improved scalability on large sparse tensors while preserving accuracy relative to the serial randomized/stochastic GCP baseline, demonstrated via experiments on synthetic and real-world datasets of varying size, dimension, and sparsity using multiple loss functions.
Significance. A sound demonstration of portable, scalable GCP implementations would strengthen practical tools for interpretable analysis of non-Gaussian high-dimensional data. The hybrid shared/distributed memory design and federated-learning-inspired asynchrony are technically interesting extensions; however, the absence of quantitative results, error bars, or convergence analysis in the provided description leaves the accuracy-preservation claim unverified at present.
major comments (2)
- [asynchronous distributed parallelism approach] The asynchronous distributed scheme (described after the synchronous MPI+Kokkos approach) applies local stochastic gradients computed on stale factor copies. For loss functions whose gradients are nonlinear in the factors (Poisson, Bernoulli, etc.), this introduces potential bias or variance inflation not present in the synchronous or serial case. No fixed-point analysis, convergence-rate bound, or empirical safeguard (e.g., bounded staleness, compensation terms) is supplied to show that the fixed point or accuracy remains comparable to the serial GCP baseline.
- [evaluation on synthetic and real-world datasets] The abstract and evaluation plan state that accuracy is maintained, yet no quantitative metrics, tables, or figures reporting fit quality, reconstruction error, or iteration counts versus the serial baseline appear in the manuscript description. Without these data, the claim that the parallel schemes “maintain accuracy” cannot be assessed.
minor comments (1)
- [algorithm description] Notation for the stochastic sampling and local gradient computation should be aligned explicitly with the earlier GCP formulation (Hong & Kolda) to make the extension transparent.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each major comment point by point below, indicating revisions where appropriate. Our responses focus on substance and aim to clarify or strengthen the presentation of the parallel GCP approaches.
read point-by-point responses
-
Referee: [asynchronous distributed parallelism approach] The asynchronous distributed scheme (described after the synchronous MPI+Kokkos approach) applies local stochastic gradients computed on stale factor copies. For loss functions whose gradients are nonlinear in the factors (Poisson, Bernoulli, etc.), this introduces potential bias or variance inflation not present in the synchronous or serial case. No fixed-point analysis, convergence-rate bound, or empirical safeguard (e.g., bounded staleness, compensation terms) is supplied to show that the fixed point or accuracy remains comparable to the serial GCP baseline.
Authors: We acknowledge the referee's concern regarding potential bias in asynchronous updates for nonlinear loss functions. The approach is motivated by federated learning methods that tolerate staleness in stochastic settings. In the revised manuscript, we have added a discussion of the bounded-staleness mechanism implemented in GenTen and included additional empirical results (convergence plots and accuracy metrics) comparing the asynchronous scheme to the synchronous and serial baselines across the tested loss functions. A full fixed-point or convergence-rate analysis lies beyond the scope of this applied paper but is noted as future work. revision: partial
-
Referee: [evaluation on synthetic and real-world datasets] The abstract and evaluation plan state that accuracy is maintained, yet no quantitative metrics, tables, or figures reporting fit quality, reconstruction error, or iteration counts versus the serial baseline appear in the manuscript description. Without these data, the claim that the parallel schemes “maintain accuracy” cannot be assessed.
Authors: We have revised the manuscript to prominently feature the existing quantitative results. New tables and figures now explicitly report fit quality, reconstruction error, and iteration counts for the synchronous and asynchronous schemes versus the serial randomized GCP baseline on all synthetic and real-world datasets. Multiple independent runs with error bars are included to support the accuracy-preservation claim and address variability. revision: yes
- A rigorous fixed-point analysis or convergence-rate bound for the asynchronous scheme with nonlinear loss functions.
Circularity Check
No circularity: new parallel schemes are independent algorithmic extensions evaluated empirically
full rationale
The paper cites the GCP formulation and randomized stochastic optimization from prior work by Hong and Kolda as the foundation, then describes new synchronous (hybrid MPI+Kokkos) and asynchronous (federated-learning-style) parallel implementations as additions. These are presented through algorithmic descriptions and empirical studies on synthetic and real datasets with varying loss functions, without any claimed first-principles derivation, fitted-parameter predictions, or self-referential definitions that reduce the new results to the inputs by construction. No self-citation chains, uniqueness theorems, or ansatzes are invoked in a load-bearing way; the central claims rest on direct implementation and benchmarking rather than re-deriving or renaming prior quantities.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Randomization and stochastic optimization preserve useful properties of GCP for non-Gaussian data as shown in prior work.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We propose a novel federated learning approach for GCP that incorporates asynchronous parallelism... GCP-FedAdam is an extension of Reddi et al.’s FedOpt to GCP.
-
IndisputableMonolith/Foundation/BranchSelection.leanbranch_selection unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
The Generalized Canonical Polyadic (GCP) low-rank tensor decomposition... min F(X,M) = sum f(xi,mi)
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
B. M. Adams, W. J. Bohnhoff, K. R. Dalbey, M. S. Ebeida, J. P. Eddy, M. S. Eldred, R. W. Hooper, P. D. Hough, K. T. Hu, J. D. Jakeman, M. Khalil, K. A. Maupin, J. A. Mon- schke, E. M. Ridgway, A. . Rushdi, D. T. Seidl, J. A. Stephens, and J. G. Winokur , Dakota, A multilevel parallel object-oriented framework for design optimization, parameter esti- matio...
- [2]
-
[3]
J. D. Carroll and J.-J. Chang , Analysis of individual differences in multidimensional scaling via an N-way generalization of “Eckart-Young” decomposition , Psychometrika, 35 (1970), pp. 283–319
work page 1970
-
[4]
E. C. Chi and T. G. Kolda , On Tensors, Sparsity, and Nonnegative Factorizations , SIAM Journal on Matrix Analysis and Applications, 33 (2012), pp. 1272–1299
work page 2012
-
[5]
J. H. Choi and S. Vishwanathan, DFacTo: Distributed factorization of tensors , in Advances in Neural Information Processing Systems (NIPS 2014), 2014, pp. 1296–1304
work page 2014
- [6]
-
[7]
J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, Q. Le, and A. Ng , Large scale distributed deep networks , in Advances in Neural Information Processing Systems, F. Pereira, C. Burges, L. Bottou, and K. Weinberger, eds., vol. 25, Curran Associates, Inc., 2012
work page 2012
-
[8]
D. M. Dunlavy, N. T. Johnson, and others , Pyttb: Python Tensor Toolbox, v1.8.2 , Jan. 2025
work page 2025
-
[9]
H. Fanaee-T and J. Gama , Tensor-based anomaly detection: An interdisciplinary survey , Knowledge- Based Systems, 98 (2016), pp. 130–147, https://doi.org/10.1016/j.knosys.2016.01.027
- [10]
-
[11]
R. A. Harshman, Foundations of the PARAFAC procedure: Models and conditions for an “explanatory” multi-modal factor analysis , UCLA Working Papers in Phonetics, 16 (1970), pp. 1–84
work page 1970
-
[12]
A. E. Helal, J. Laukemann, F. Checconi, J. J. Tithi, T. Ranadive, F. Petrini, and J. Choi , ALTO: Adaptive linearized storage of sparse tensors , in Proceedings of the 35th ACM International Conference on Supercomputing, Ics ’21, Virtual Event, USA and New York, NY, USA, 2021, Associ- ation for Computing Machinery, pp. 404–416, https://doi.org/10.1145/344...
-
[13]
J. Helton and F. Davis , Latin hypercube sampling and the propagation of uncertainty in analyses of complex systems, Reliability Engineering & System Safety, 81 (2003), pp. 23–69, https://doi.org/10. 1016/S0951-8320(03)00058-9
work page 2003
-
[14]
D. Hong, T. G. Kolda, and J. A. Duersch , Generalized Canonical Polyadic Tensor Decomposition , SIAM Review, 62 (2020), pp. 133–163
work page 2020
-
[15]
Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods
M. Janzamin, H. Sedghi, and A. Anandkumar , Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods , June 2015, https://arxiv.org/abs/1506.08473v3
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[16]
O. Kaya and B. Uc ¸ar , Scalable sparse tensor decompositions in distributed memory systems , in Proceedings of the International Conference for High Performance Computing, Networking, Stor- age and Analysis, SC’15, Austin, Texas and New York, NY, USA, 2015, ACM, pp. 77:1–77:11, https://doi.org/10.1145/2807591.2807624
-
[17]
O. Kaya and B. Uc ¸ar, Parallel CANDECOMP/PARAFAC decomposition of sparse tensors using di- mension trees, SIAM Journal On Scientific Computing, 40 (2018), pp. C99–C130
work page 2018
-
[18]
D. P. Kingma and J. Ba, Adam: A Method for Stochastic Optimization , in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun, eds., 2015
work page 2015
-
[19]
T. G. Kolda and B. W. Bader , Tensor Decompositions and Applications , SIAM Review, 51 (2009), pp. 455–500
work page 2009
-
[20]
T. G. Kolda and D. Hong, Stochastic Gradients for Large-Scale Tensor Decomposition, SIAM Journal on Mathematics of Data Science, 2 (2020), pp. 1066–1095
work page 2020
-
[21]
C. Lewis and E. Phipps, Low-Communication Asynchronous Distributed Generalized Canonical Polyadic Tensor Decomposition, in 2021 IEEE High Performance Extreme Computing Conference (HPEC), PARALLELISM APPROACHES FOR GCP WITH GENTEN 27 Waltham, MA, USA, Sept. 2021, IEEE, pp. 1–5, https://doi.org/10.1109/HPEC49654.2021.9622844
-
[22]
J. Li, J. Sun, and R. Vuduc , HiCOO: Hierarchical storage of sparse tensors , in ACM/IEEE Inter- national Conference for High-Performance Computing, Networking, Storage, and Analysis (SC18), 2018
work page 2018
-
[23]
B. Liu, C. Wen, A. D. Sarwate, and M. Mehri Dehnavi, A unified optimization approach for sparse tensor operations on gpus . ArXiv e-prints, 2017
work page 2017
-
[24]
M. D. McKay, R. J. Beckman, and W. J. Conover , A Comparison of Three Methods for Selecting Values of Input Variables in the Analysis of Output from a Computer Code , Technometrics, 21 (1979), pp. 239–245, https://doi.org/10.2307/1268522, https://arxiv.org/abs/1268522
-
[25]
Z. Miao, J. Li, J. C. Calhoun, and R. Ge , BALA-CPD: BALanced and asynchronous distributed tensor decomposition, in 2022 IEEE International Conference on Cluster Computing (CLUSTER), 2022, pp. 440–450, https://doi.org/10.1109/CLUSTER51413.2022.00054
-
[26]
J. M. Myers and D. M. Dunlavy , Tensor decompositions for count data that leverage stochastic and deterministic optimization , Optimization Methods and Software, 40 (2025), pp. 352–387, https: //doi.org/10.1080/10556788.2024.2401981, https://arxiv.org/abs/https://doi.org/10.1080/10556788. 2024.2401981
-
[27]
A. Novikov, D. Podoprikhin, A. Osokin, and D. Vetrov , Tensorizing neural networks , CoRR, abs/1509.06569 (2015)
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[28]
E. T. Phipps, N. T. Johnson, and T. G. Kolda , Streaming generalized canonical polyadic tensor decompositions, in Proceedings of the Platform for Advanced Scientific Computing Conference, Pasc ’23, New York, NY, USA, 2023, Association for Computing Machinery, https://doi.org/10.1145/ 3592979.3593405
-
[29]
E. T. Phipps and T. G. Kolda , Software for sparse tensor decomposition on emerging computing architectures, SIAM Journal on Scientific Computing, 41 (2019), pp. C269–C290, https://doi.org/10. 1137/18M1210691
work page 2019
-
[30]
W. Pu, S. Ibrahim, X. Fu, and M. Hong , Stochastic mirror descent for low-rank tensor decomposition under non-euclidean losses, IEEE Transactions on Signal Processing, 70 (2022), pp. 1803–1818, https: //doi.org/10.1109/TSP.2022.3163896
-
[31]
Adaptive Federated Optimization
S. Reddi, Z. Charles, M. Zaheer, Z. Garrett, K. Rush, J. Kone ˇcn´y, S. Kumar, and H. B. McMahan, Adaptive Federated Optimization, Sept. 2021, https://arxiv.org/abs/2003.00295
work page internal anchor Pith review arXiv 2021
- [32]
-
[33]
S. Smith and G. Karypis , A medium-grained algorithm for sparse tensor factorization , in 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2016, pp. 902–911, https: //doi.org/10.1109/IPDPS.2016.113
-
[34]
S. Smith, N. Ravindran, N. D. Sidiropoulos, and G. Karypis, SPLATT: Efficient and parallel sparse tensor-matrix multiplication, in IPDPS 2015: IEEE International Parallel and Distributed Processing Symposium, 2015 IEEE International Parallel and Distributed Processing Symposium, May 2015, pp. 61–70, https://doi.org/10.1109/ipdps.2015.27
-
[35]
S. U. Stich, Local SGD converges fast and communicates little , 2019, https://arxiv.org/abs/1805.09767
work page internal anchor Pith review Pith/arXiv arXiv 2019
-
[36]
C. R. Trott, D. Lebrun-Grandi ´e, D. Arndt, J. Ciesko, V. Dang, N. Ellingwood, R. Gayatri, E. Harvey, D. S. Hollman, D. Ibanez, N. Liber, J. Madsen, J. Miles, D. Poliakoff, A. Pow- ell, S. Rajamanickam, M. Simberg, D. Sunderland, B. Turcksin, and J. Wilke , Kokkos 3: Programming model extensions for the exascale era , IEEE Transactions on Parallel and Dis...
-
[37]
M. Vandecappelle, N. Vervliet, and L. D. Lathauwer, Inexact generalized gauss–newton for scaling the canonical polyadic decomposition with non-least-squares cost functions , IEEE Journal of Selected Topics in Signal Processing, 15 (2021), pp. 491–505, https://doi.org/10.1109/JSTSP.2020.3045911
-
[38]
Y. Wang, R. Chen, J. Ghosh, J. C. Denny, A. Kho, Y. Chen, B. A. Malin, and J. Sun , Rubik: Knowledge guided tensor factorization and completion for health data analytics , in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Kdd ’15, Sydney, NSW, Australia and New York, NY, USA, 2015, ACM, pp. 1265–1274, h...
-
[39]
S. Zhang, A. Choromanska, and Y. LeCun, Deep learning with elastic averaging SGD, in Proceedings 28 J. M. MYERS AND E. T. PHIPPS of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS’15, Cambridge, MA, USA, 2015, MIT Press, pp. 685–693
work page 2015
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.