Synchronous and Asynchronous Parallelism Approaches for Generalized Canonical Polyadic Tensor Decomposition with GenTen

Eric T. Phipps; Jeremy M. Myers

arxiv: 2605.20353 · v1 · pith:4XRBKSRAnew · submitted 2026-05-19 · 🧮 math.NA · cs.NA


Pith reviewed 2026-05-21 07:14 UTC · model grok-4.3


keywords: generalized canonical polyadic decomposition · tensor decomposition · parallel algorithms · stochastic optimization · distributed memory · shared memory parallelism · asynchronous methods · Kokkos


The pith

Parallel synchronous and asynchronous methods scale generalized canonical polyadic tensor decomposition to large datasets while maintaining accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops synchronous and asynchronous parallel algorithms for Generalized Canonical Polyadic (GCP) tensor decomposition to handle large, sparse, high-dimensional data more efficiently than prior serial approaches. It builds on randomization and stochastic optimization by adding shared-memory parallelism via Kokkos for CPU and GPU portability and distributed-memory parallelism via MPI in a hybrid setup. An asynchronous variant draws on federated learning ideas to push scalability further. A sympathetic reader would care because these changes aim to enable interpretable analysis of non-Gaussian data such as counts or binary values in massive real-world datasets without major losses in speed or quality.

Core claim

The authors claim that the proposed synchronous hybrid MPI+Kokkos and asynchronous distributed approaches for GCP tensor decomposition achieve even better scalability to large data sets while maintaining accuracy, as studied on synthetic and publicly-available real-world datasets of varying sizes, dimensions, and sparsity patterns using several loss functions.

What carries the argument

The hybrid synchronous parallelism combining Kokkos for shared-memory random sampling and stochastic optimization with MPI for distribution, plus an asynchronous distributed scheme modeled on federated learning techniques.
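The synchronous pattern described above can be sketched in a few lines. This is a minimal NumPy illustration, not GenTen's implementation: it assumes every worker holds a full copy of the factor matrices (the medium-grained scheme distributes them), it replaces the MPI all-reduce with a plain Python sum, and the names `local_gradient`, `allreduce_sum`, and `synchronous_step` are hypothetical.

```python
import numpy as np

def model_values(factors, idx):
    """CP model entries m_i = sum_r prod_k A_k[i_k, r] at the given indices."""
    prod = np.ones((idx.shape[0], factors[0].shape[1]))
    for k, A in enumerate(factors):
        prod *= A[idx[:, k]]
    return prod.sum(axis=1)

def local_gradient(factors, idx, vals, dloss):
    """Unnormalized gradient of sum_i f(x_i, m_i) over one worker's sampled batch."""
    y = dloss(vals, model_values(factors, idx))  # elementwise df/dm
    grads = []
    for k, A in enumerate(factors):
        others = np.ones((idx.shape[0], A.shape[1]))
        for j, B in enumerate(factors):
            if j != k:
                others *= B[idx[:, j]]
        G = np.zeros_like(A)
        np.add.at(G, idx[:, k], y[:, None] * others)  # scatter-add into rows
        grads.append(G)
    return grads

def allreduce_sum(per_worker):
    """Stand-in for an MPI all-reduce: every worker would receive this same sum."""
    return [sum(gs) for gs in zip(*per_worker)]

def synchronous_step(factors, batches, dloss, lr):
    """One lock-step update: local gradients, global sum, identical update everywhere."""
    per_worker = [local_gradient(factors, idx, vals, dloss) for idx, vals in batches]
    total = sum(idx.shape[0] for idx, _ in batches)
    grads = allreduce_sum(per_worker)
    return [A - lr * G / total for A, G in zip(factors, grads)]

def poisson_dloss(x, m):
    """Derivative in m of the Poisson loss m - x*log(m)."""
    return 1.0 - x / m

rng = np.random.default_rng(0)
shape, rank = (6, 5, 4), 3
factors = [rng.random((n, rank)) + 0.1 for n in shape]  # positive, so m > 0

batches = []
for _ in range(4):  # four simulated workers, each with its own sampled entries
    idx = np.column_stack([rng.integers(0, n, size=8) for n in shape])
    vals = rng.poisson(2.0, size=8).astype(float)
    batches.append((idx, vals))

new_factors = synchronous_step(factors, batches, poisson_dloss, lr=1e-2)
```

Because the reduction is a plain sum, the combined gradient matches what a single process would compute on the concatenated batch, which is why a synchronous scheme of this shape does not change the optimization trajectory for a fixed set of samples.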

If this is right

Enables decomposition of larger count or binary datasets using flexible loss functions without excessive compute time.
Supports portable execution across CPU and GPU hardware through the Kokkos layer.
Delivers linear or near-linear speedups on distributed systems for varying sparsity patterns.
Maintains decomposition accuracy across multiple loss functions on both synthetic and real data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The asynchronous approach could extend to privacy-sensitive settings where raw data cannot be centralized.
Similar parallelism patterns might apply to other tensor models such as Tucker decomposition for comparable gains.
Testing on streaming or dynamically growing datasets would reveal whether the methods remain stable over time.
Integration into existing data pipelines could reduce the barrier to using GCP on industry-scale problems.

Load-bearing premise

That the randomization, stochastic optimization, and parallelization steps can be combined without introducing convergence problems or accuracy loss relative to the serial GCP baseline.

What would settle it

Running the parallel GCP implementations and the serial baseline on the same large sparse real-world dataset and checking whether the final loss values and reconstruction errors remain comparable while wall-clock time decreases with added processors.

Figures

Figures reproduced from arXiv: 2605.20353 by Eric T. Phipps, Jeremy M. Myers.

**Figure 2.** Comparison of fused and non-fused semi-stratified sampling/MTTKRP kernels on …

**Figure 3.** Comparison of distributed GCP approaches for over a range of total number of …

**Figure 4.** Breaking down GCP runtime for the all-reduce and two-sided communication ap…

**Figure 5.** Strong scaling of two GCP-Adam epochs across a range of compute nodes. The slope of the best linear fit (in log-space) is shown above each curve.

Excerpt from the surrounding paper text: … performed a parameter sensitivity analysis on synthetic data utilizing on-node parallelism only to establish baseline behavior. Second, we introduced distributed parallelism and conducted experiments with two large-scale, sparse tensor datasets. Since federated a…

**Figure 6.** Simple rank correlation coefficient convergence of …

**Figure 7.** Scatter plots varying rate and adam-beta1 for GCP-SGD. The y-axis denotes the relative error from the MLE (top) and total time (bottom); rate is plotted along the x-axis; variations in adam-beta1 are denoted by the hue.

Excerpt from the surrounding paper text: … up to 21× faster than GCP-SGD. However, parameter tuning seems to have a deleterious effect on accuracy for both methods overall, which is surprising given all that changed was the random i…

**Figure 8.** Scatter plots varying rate and decay for GCP-SGD. The y-axis denotes the relative error from the MLE (top) and total time (bottom); rate is plotted along the x-axis; variations in decay are denoted by the hue.

Excerpt from the surrounding paper text: We varied the asynchrony and the number of processors. Each node in the cluster consists of one 44-core IBM Power9 processor and 4 NVIDIA V100 GPUs with 32 GB of RAM each. We compiled GenTen with NVC…

**Figure 9.** Scatter plots varying two parameters for …

**Figure 10.** Results on the synthetic count dataset comparing …

**Figure 11.** Traces of Poisson loss values by time in seconds for the arXiv dataset. The hue, …

**Figure 12.** Results on the amazon-reviews dataset comparing …
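The Figure 5 caption mentions the slope of a best linear fit in log-space as the strong-scaling summary. A minimal sketch of how such a slope can be computed follows; the timings are made-up placeholders, not measurements from the paper, and `strong_scaling_slope` is a hypothetical helper name. A slope of -1 corresponds to ideal strong scaling (doubling the nodes halves the time).

```python
import numpy as np

def strong_scaling_slope(nodes, seconds):
    """Slope of the least-squares line through (log nodes, log seconds)."""
    slope, _intercept = np.polyfit(np.log(nodes), np.log(seconds), deg=1)
    return slope

nodes = np.array([1, 2, 4, 8, 16])
ideal_seconds = 100.0 / nodes  # synthetic, perfectly scaling timings

slope = strong_scaling_slope(nodes, ideal_seconds)
print(f"log-log slope: {slope:.3f}")
```

Slopes between -1 and 0 indicate that adding nodes still helps but with diminishing returns, which is the usual way to read curves of this kind.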

Original abstract

The Canonical Polyadic (CP) tensor decomposition is a well-known method for interpretable analysis of high-dimensional data. Recently, the Generalized CP method (GCP) was introduced by Hong and Kolda to allow for flexible choice of the loss function in the optimization problem defining the CP model, enabling more interpretable decompositions of strongly non-Gaussian data such as count or binary data. Furthermore, Kolda and Hong introduced a version of GCP that leverages randomization and stochastic optimization to address scalability to large, sparse data sets. In this work, we take these ideas a step further and consider synchronous and asynchronous algorithms for parallel GCP tensor decomposition through the GenTen software package, exploiting both shared and distributed memory parallelism. We build on shared memory parallel CP decomposition algorithms utilizing Kokkos for portability across CPU and GPU architectures to support the random sampling and stochastic optimization methods required by GCP. We then couple this approach to the well-known medium-grained distributed memory parallelism scheme developed for traditional CP decompositions through MPI, providing a synchronous, hybrid MPI+Kokkos, parallel GCP decomposition capability. Finally, we propose an asynchronous distributed parallelism approach building on related techniques for federated learning to achieve even better scalability to large data sets. We study the effectiveness of the proposed synchronous and asynchronous approaches vis-a-vis computational cost and accuracy on synthetic and publicly-available real-world datasets of varying sizes, dimensions, and sparsity patterns using several loss functions.
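To make the "flexible choice of the loss function" concrete, here is a minimal sketch of the GCP objective as a sum of elementwise losses f(x_i, m_i) over tensor entries. The three loss forms are commonly quoted variants (Gaussian, Poisson with identity link, Bernoulli with odds link); the small `EPS` guard and the function names are assumptions of this sketch rather than GenTen's API.

```python
import numpy as np

EPS = 1e-10  # guard against log(0); the exact value is an assumption

def gaussian_loss(x, m):
    return (x - m) ** 2

def poisson_loss(x, m):
    # count data, model value m interpreted as a Poisson mean
    return m - x * np.log(m + EPS)

def bernoulli_odds_loss(x, m):
    # binary data, model value m interpreted as the odds of a one
    return np.log(m + 1.0) - x * np.log(m + EPS)

def gcp_objective(x, m, loss):
    """F(X, M) = sum_i f(x_i, m_i) over the supplied entries."""
    return float(np.sum(loss(x, m)))

x = np.array([0.0, 1.0, 3.0])   # observed entries (toy values)
m = np.array([0.5, 1.0, 2.0])   # model entries (toy values)

for name, loss in [("gaussian", gaussian_loss),
                   ("poisson", poisson_loss),
                   ("bernoulli-odds", bernoulli_odds_loss)]:
    print(name, gcp_objective(x, m, loss))
```

Swapping the loss changes only the elementwise function, which is what lets the same sampling and parallelization machinery serve Gaussian, count, and binary data.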

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

This paper adds a hybrid MPI+Kokkos synchronous path and a federated-learning-style asynchronous scheme to randomized stochastic GCP, but the abstract supplies no numbers or convergence checks to back the accuracy claims.


The main point is that the authors extend the randomized GCP framework from Kolda and Hong by layering on both a synchronous hybrid MPI plus Kokkos implementation and an asynchronous distributed variant. The asynchronous piece is the clearer addition, since it borrows from federated learning to reduce synchronization overhead on large sparse tensors. The synchronous path is a more incremental reuse of existing medium-grained CP parallelism, adapted to support the random sampling and stochastic steps that GCP requires. They also ship this in the GenTen package with portability across CPU and GPU targets. That combination of portability and the async idea is the practical contribution worth noting.

The paper outlines a reasonable evaluation plan across synthetic and real datasets that vary in size, dimension, sparsity, and loss function, which shows they understand the target use cases in scientific computing. The soft spot is the lack of any reported results or analysis in the abstract on whether the parallel versions actually preserve accuracy. For losses such as Poisson or Bernoulli the gradient is nonlinear in the factors, so asynchronous updates against stale copies can produce biased or higher-variance steps; nothing in the description indicates they have checked convergence rates or fixed-point behavior against the serial baseline. If the full manuscript contains only the algorithmic description without those checks, the central claim about maintained accuracy remains an assumption rather than a demonstrated result.

This work is aimed at people already using or extending GCP for non-Gaussian tensor data who need distributed implementations. A reader focused on scalable numerical methods would get implementation details and a starting point for further testing. I would send it to peer review because the parallel extensions address a genuine scaling need and the community can evaluate the async scheme once the experiments are visible.
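For readers unfamiliar with the federated-learning ingredient mentioned in the letter, the following is a minimal sketch of a FedAdam-style server update in the spirit of Reddi et al.'s FedOpt, run on a toy quadratic problem. It is not the paper's GCP-FedAdam: the client losses, the hyperparameter values, and the names `local_update` and `fedadam` are all assumptions made for illustration, and the rounds here are still synchronized even though clients take several local steps without communicating.

```python
import numpy as np

def local_update(w, center, local_lr, local_steps):
    """Client: a few gradient steps on the toy loss 0.5*||w - center||^2."""
    w_local = w.copy()
    for _ in range(local_steps):
        w_local -= local_lr * (w_local - center)
    return w_local - w  # model delta reported to the server

def fedadam(centers, rounds=200, server_lr=0.1, beta1=0.9, beta2=0.99, tau=1e-3):
    """Server: treat the averaged client delta as a pseudo-gradient for Adam."""
    w = np.zeros(centers.shape[1])
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for _ in range(rounds):
        deltas = [local_update(w, c, local_lr=0.1, local_steps=5) for c in centers]
        delta = np.mean(deltas, axis=0)
        m = beta1 * m + (1.0 - beta1) * delta        # first-moment estimate
        v = beta2 * v + (1.0 - beta2) * delta ** 2   # second-moment estimate
        w = w + server_lr * m / (np.sqrt(v) + tau)   # adaptive server step
    return w

centers = np.array([[0.0, -1.0], [2.0, -3.0], [1.0, -2.0]])  # made-up client optima
w_final = fedadam(centers)
print(w_final, centers.mean(axis=0))
```

The design point is that communication happens once per round rather than once per gradient step; the cost is that each client's delta is computed from a model copy that drifts away from the others between rounds.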

Referee Report

2 major / 1 minor

Summary. The manuscript extends prior Generalized CP (GCP) tensor decomposition work by Kolda and Hong, adding synchronous hybrid MPI+Kokkos and asynchronous distributed parallelism schemes within the GenTen package. The central claim is that these parallel approaches achieve improved scalability on large sparse tensors while preserving accuracy relative to the serial randomized/stochastic GCP baseline, demonstrated via experiments on synthetic and real-world datasets of varying size, dimension, and sparsity using multiple loss functions.

Significance. A sound demonstration of portable, scalable GCP implementations would strengthen practical tools for interpretable analysis of non-Gaussian high-dimensional data. The hybrid shared/distributed memory design and federated-learning-inspired asynchrony are technically interesting extensions; however, the absence of quantitative results, error bars, or convergence analysis in the provided description leaves the accuracy-preservation claim unverified at present.

major comments (2)

[asynchronous distributed parallelism approach] The asynchronous distributed scheme (described after the synchronous MPI+Kokkos approach) applies local stochastic gradients computed on stale factor copies. For loss functions whose gradients are nonlinear in the factors (Poisson, Bernoulli, etc.), this introduces potential bias or variance inflation not present in the synchronous or serial case. No fixed-point analysis, convergence-rate bound, or empirical safeguard (e.g., bounded staleness, compensation terms) is supplied to show that the fixed point or accuracy remains comparable to the serial GCP baseline.
[evaluation on synthetic and real-world datasets] The abstract and evaluation plan state that accuracy is maintained, yet no quantitative metrics, tables, or figures reporting fit quality, reconstruction error, or iteration counts versus the serial baseline appear in the manuscript description. Without these data, the claim that the parallel schemes “maintain accuracy” cannot be assessed.
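The staleness concern in the first major comment can be illustrated at the level of the elementwise loss derivative alone. The sketch below is an assumption-laden toy: it perturbs model values by a fixed offset `delta` to mimic a stale copy and compares the resulting error in df/dm for Gaussian and Poisson losses. It isolates one ingredient only; the full gradient with respect to the factor matrices is nonlinear even for the Gaussian loss, because the CP model itself is multilinear in the factors.

```python
import numpy as np

def gaussian_dloss(x, m):
    """d/dm of (x - m)^2."""
    return 2.0 * (m - x)

def poisson_dloss(x, m):
    """d/dm of m - x*log(m)."""
    return 1.0 - x / m

x = 3.0                                      # one observed count (toy value)
m_current = np.array([0.5, 1.0, 2.0, 4.0])   # up-to-date model values
delta = 0.25                                 # hypothetical drift of a stale copy
m_stale = m_current + delta

gauss_err = gaussian_dloss(x, m_stale) - gaussian_dloss(x, m_current)
pois_err = poisson_dloss(x, m_stale) - poisson_dloss(x, m_current)

print("gaussian derivative error:", gauss_err)  # same for every entry
print("poisson derivative error: ", pois_err)   # grows as m approaches zero
```

For the Gaussian loss the error is the constant 2*delta, whereas for the Poisson loss it equals x*delta / (m*(m + delta)) and therefore depends on where the model currently sits, which is one way to see why a staleness bound or compensation term might matter more for non-Gaussian losses.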

minor comments (1)

[algorithm description] Notation for the stochastic sampling and local gradient computation should be aligned explicitly with the earlier GCP formulation (Hong & Kolda) to make the extension transparent.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment point by point below, indicating revisions where appropriate. Our responses focus on substance and aim to clarify or strengthen the presentation of the parallel GCP approaches.


Referee: [asynchronous distributed parallelism approach] The asynchronous distributed scheme (described after the synchronous MPI+Kokkos approach) applies local stochastic gradients computed on stale factor copies. For loss functions whose gradients are nonlinear in the factors (Poisson, Bernoulli, etc.), this introduces potential bias or variance inflation not present in the synchronous or serial case. No fixed-point analysis, convergence-rate bound, or empirical safeguard (e.g., bounded staleness, compensation terms) is supplied to show that the fixed point or accuracy remains comparable to the serial GCP baseline.

Authors: We acknowledge the referee's concern regarding potential bias in asynchronous updates for nonlinear loss functions. The approach is motivated by federated learning methods that tolerate staleness in stochastic settings. In the revised manuscript, we have added a discussion of the bounded-staleness mechanism implemented in GenTen and included additional empirical results (convergence plots and accuracy metrics) comparing the asynchronous scheme to the synchronous and serial baselines across the tested loss functions. A full fixed-point or convergence-rate analysis lies beyond the scope of this applied paper but is noted as future work. revision: partial
Referee: [evaluation on synthetic and real-world datasets] The abstract and evaluation plan state that accuracy is maintained, yet no quantitative metrics, tables, or figures reporting fit quality, reconstruction error, or iteration counts versus the serial baseline appear in the manuscript description. Without these data, the claim that the parallel schemes “maintain accuracy” cannot be assessed.

Authors: We have revised the manuscript to prominently feature the existing quantitative results. New tables and figures now explicitly report fit quality, reconstruction error, and iteration counts for the synchronous and asynchronous schemes versus the serial randomized GCP baseline on all synthetic and real-world datasets. Multiple independent runs with error bars are included to support the accuracy-preservation claim and address variability. revision: yes

standing simulated objections not resolved

A rigorous fixed-point analysis or convergence-rate bound for the asynchronous scheme with nonlinear loss functions.

Circularity Check

0 steps flagged

No circularity: new parallel schemes are independent algorithmic extensions evaluated empirically

full rationale

The paper cites the GCP formulation and randomized stochastic optimization from prior work by Hong and Kolda as the foundation, then describes new synchronous (hybrid MPI+Kokkos) and asynchronous (federated-learning-style) parallel implementations as additions. These are presented through algorithmic descriptions and empirical studies on synthetic and real datasets with varying loss functions, without any claimed first-principles derivation, fitted-parameter predictions, or self-referential definitions that reduce the new results to the inputs by construction. No self-citation chains, uniqueness theorems, or ansatzes are invoked in a load-bearing way; the central claims rest on direct implementation and benchmarking rather than re-deriving or renaming prior quantities.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on the correctness of prior randomized GCP methods and standard parallel programming models; no new mathematical axioms or invented entities are introduced.

axioms (1)

Domain assumption: randomization and stochastic optimization preserve useful properties of GCP for non-Gaussian data, as shown in prior work.
The parallel extensions assume the base GCP method remains valid when distributed.

pith-pipeline@v0.9.0 · 5787 in / 1121 out tokens · 38405 ms · 2026-05-21T07:14:46.408733+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "We propose a novel federated learning approach for GCP that incorporates asynchronous parallelism... GCP-FedAdam is an extension of Reddi et al.’s FedOpt to GCP."

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear

Paper passage: "The Generalized Canonical Polyadic (GCP) low-rank tensor decomposition... $\min F(\mathcal{X}, \mathcal{M}) = \sum_i f(x_i, m_i)$"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 4 internal anchors

[1] B. M. Adams, W. J. Bohnhoff, K. R. Dalbey, M. S. Ebeida, J. P. Eddy, M. S. Eldred, R. W. Hooper, P. D. Hough, K. T. Hu, J. D. Jakeman, M. Khalil, K. A. Maupin, J. A. Monschke, E. M. Ridgway, A. . Rushdi, D. T. Seidl, J. A. Stephens, and J. G. Winokur, Dakota, A multilevel parallel object-oriented framework for design optimization, parameter estimatio... (doi:10.2172/1817318, 2020)

[2] Brett W. Bader, T. G. Kolda, et al., Tensor Toolbox for MATLAB, Version 3.6, Sept. 2023.

[3] J. D. Carroll and J.-J. Chang, Analysis of individual differences in multidimensional scaling via an N-way generalization of “Eckart-Young” decomposition, Psychometrika, 35 (1970), pp. 283–319.

[4] E. C. Chi and T. G. Kolda, On Tensors, Sparsity, and Nonnegative Factorizations, SIAM Journal on Matrix Analysis and Applications, 33 (2012), pp. 1272–1299.

[5] J. H. Choi and S. Vishwanathan, DFacTo: Distributed factorization of tensors, in Advances in Neural Information Processing Systems (NIPS 2014), 2014, pp. 1296–1304.

[6] N. Cohen, O. Sharir, and A. Shashua, On the expressive power of deep learning: A tensor analysis, in 29th Annual Conference on Learning Theory, vol. 49, PMLR, June 2016, pp. 698–728.

[7] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, Q. Le, and A. Ng, Large scale distributed deep networks, in Advances in Neural Information Processing Systems, F. Pereira, C. Burges, L. Bottou, and K. Weinberger, eds., vol. 25, Curran Associates, Inc., 2012.

[8] D. M. Dunlavy, N. T. Johnson, and others, Pyttb: Python Tensor Toolbox, v1.8.2, Jan. 2025.

[9] H. Fanaee-T and J. Gama, Tensor-based anomaly detection: An interdisciplinary survey, Knowledge-Based Systems, 98 (2016), pp. 130–147, https://doi.org/10.1016/j.knosys.2016.01.027

[10] E. Gujral, R. Pasricha, and E. E. Papalexakis, SamBaTen: Sampling-based batch incremental tensor decomposition, Sept. 2017, https://arxiv.org/abs/1709.00668v1
[11] R. A. Harshman, Foundations of the PARAFAC procedure: Models and conditions for an “explanatory” multi-modal factor analysis, UCLA Working Papers in Phonetics, 16 (1970), pp. 1–84.

[12] A. E. Helal, J. Laukemann, F. Checconi, J. J. Tithi, T. Ranadive, F. Petrini, and J. Choi, ALTO: Adaptive linearized storage of sparse tensors, in Proceedings of the 35th ACM International Conference on Supercomputing, ICS ’21, Virtual Event, USA and New York, NY, USA, 2021, Association for Computing Machinery, pp. 404–416, https://doi.org/10.1145/344... (doi:10.1145/3447818.3461703)

[13] J. Helton and F. Davis, Latin hypercube sampling and the propagation of uncertainty in analyses of complex systems, Reliability Engineering & System Safety, 81 (2003), pp. 23–69, https://doi.org/10.1016/S0951-8320(03)00058-9

[14] D. Hong, T. G. Kolda, and J. A. Duersch, Generalized Canonical Polyadic Tensor Decomposition, SIAM Review, 62 (2020), pp. 133–163.

[15] M. Janzamin, H. Sedghi, and A. Anandkumar, Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods, June 2015, https://arxiv.org/abs/1506.08473v3

[16] O. Kaya and B. Uçar, Scalable sparse tensor decompositions in distributed memory systems, in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC’15, Austin, Texas and New York, NY, USA, 2015, ACM, pp. 77:1–77:11, https://doi.org/10.1145/2807591.2807624

[17] O. Kaya and B. Uçar, Parallel CANDECOMP/PARAFAC decomposition of sparse tensors using dimension trees, SIAM Journal on Scientific Computing, 40 (2018), pp. C99–C130.

[18] D. P. Kingma and J. Ba, Adam: A Method for Stochastic Optimization, in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun, eds., 2015.

[19] T. G. Kolda and B. W. Bader, Tensor Decompositions and Applications, SIAM Review, 51 (2009), pp. 455–500.

[20] T. G. Kolda and D. Hong, Stochastic Gradients for Large-Scale Tensor Decomposition, SIAM Journal on Mathematics of Data Science, 2 (2020), pp. 1066–1095.

[21] C. Lewis and E. Phipps, Low-Communication Asynchronous Distributed Generalized Canonical Polyadic Tensor Decomposition, in 2021 IEEE High Performance Extreme Computing Conference (HPEC), Waltham, MA, USA, Sept. 2021, IEEE, pp. 1–5, https://doi.org/10.1109/HPEC49654.2021.9622844
[22] J. Li, J. Sun, and R. Vuduc, HiCOO: Hierarchical storage of sparse tensors, in ACM/IEEE International Conference for High-Performance Computing, Networking, Storage, and Analysis (SC18), 2018.

[23] B. Liu, C. Wen, A. D. Sarwate, and M. Mehri Dehnavi, A unified optimization approach for sparse tensor operations on GPUs, ArXiv e-prints, 2017.

[24] M. D. McKay, R. J. Beckman, and W. J. Conover, A Comparison of Three Methods for Selecting Values of Input Variables in the Analysis of Output from a Computer Code, Technometrics, 21 (1979), pp. 239–245, https://doi.org/10.2307/1268522

[25] Z. Miao, J. Li, J. C. Calhoun, and R. Ge, BALA-CPD: BALanced and asynchronous distributed tensor decomposition, in 2022 IEEE International Conference on Cluster Computing (CLUSTER), 2022, pp. 440–450, https://doi.org/10.1109/CLUSTER51413.2022.00054

[26] J. M. Myers and D. M. Dunlavy, Tensor decompositions for count data that leverage stochastic and deterministic optimization, Optimization Methods and Software, 40 (2025), pp. 352–387, https://doi.org/10.1080/10556788.2024.2401981

[27] A. Novikov, D. Podoprikhin, A. Osokin, and D. Vetrov, Tensorizing neural networks, CoRR, abs/1509.06569 (2015).

[28] E. T. Phipps, N. T. Johnson, and T. G. Kolda, Streaming generalized canonical polyadic tensor decompositions, in Proceedings of the Platform for Advanced Scientific Computing Conference, PASC ’23, New York, NY, USA, 2023, Association for Computing Machinery, https://doi.org/10.1145/3592979.3593405

[29] E. T. Phipps and T. G. Kolda, Software for sparse tensor decomposition on emerging computing architectures, SIAM Journal on Scientific Computing, 41 (2019), pp. C269–C290, https://doi.org/10.1137/18M1210691

[30] W. Pu, S. Ibrahim, X. Fu, and M. Hong, Stochastic mirror descent for low-rank tensor decomposition under non-euclidean losses, IEEE Transactions on Signal Processing, 70 (2022), pp. 1803–1818, https://doi.org/10.1109/TSP.2022.3163896

[31] S. Reddi, Z. Charles, M. Zaheer, Z. Garrett, K. Rush, J. Konečný, S. Kumar, and H. B. McMahan, Adaptive Federated Optimization, Sept. 2021, https://arxiv.org/abs/2003.00295
[32] S. Smith, J. W. Choi, J. Li, R. Vuduc, J. Park, X. Liu, and G. Karypis, FROSTT: The formidable repository of open sparse tensors and tools, 2017.

[33] S. Smith and G. Karypis, A medium-grained algorithm for sparse tensor factorization, in 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2016, pp. 902–911, https://doi.org/10.1109/IPDPS.2016.113

[34] S. Smith, N. Ravindran, N. D. Sidiropoulos, and G. Karypis, SPLATT: Efficient and parallel sparse tensor-matrix multiplication, in 2015 IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2015, pp. 61–70, https://doi.org/10.1109/ipdps.2015.27

[35] S. U. Stich, Local SGD converges fast and communicates little, 2019, https://arxiv.org/abs/1805.09767

[36] C. R. Trott, D. Lebrun-Grandié, D. Arndt, J. Ciesko, V. Dang, N. Ellingwood, R. Gayatri, E. Harvey, D. S. Hollman, D. Ibanez, N. Liber, J. Madsen, J. Miles, D. Poliakoff, A. Powell, S. Rajamanickam, M. Simberg, D. Sunderland, B. Turcksin, and J. Wilke, Kokkos 3: Programming model extensions for the exascale era, IEEE Transactions on Parallel and Dis... (doi:10.1109/tpds.2021.3097283, 2022)

[37] M. Vandecappelle, N. Vervliet, and L. D. Lathauwer, Inexact generalized Gauss–Newton for scaling the canonical polyadic decomposition with non-least-squares cost functions, IEEE Journal of Selected Topics in Signal Processing, 15 (2021), pp. 491–505, https://doi.org/10.1109/JSTSP.2020.3045911

[38] Y. Wang, R. Chen, J. Ghosh, J. C. Denny, A. Kho, Y. Chen, B. A. Malin, and J. Sun, Rubik: Knowledge guided tensor factorization and completion for health data analytics, in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’15, Sydney, NSW, Australia and New York, NY, USA, 2015, ACM, pp. 1265–1274, h...

[39] S. Zhang, A. Choromanska, and Y. LeCun, Deep learning with elastic averaging SGD, in Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS’15, Cambridge, MA, USA, 2015, MIT Press, pp. 685–693.

[1] [1]

B. M. Adams, W. J. Bohnhoff, K. R. Dalbey, M. S. Ebeida, J. P. Eddy, M. S. Eldred, R. W. Hooper, P. D. Hough, K. T. Hu, J. D. Jakeman, M. Khalil, K. A. Maupin, J. A. Mon- schke, E. M. Ridgway, A. . Rushdi, D. T. Seidl, J. A. Stephens, and J. G. Winokur , Dakota, A multilevel parallel object-oriented framework for design optimization, parameter esti- matio...

work page doi:10.2172/1817318 2020

[2] [2]

Bader, T

Brett W. Bader, T. G. Kolda, et al. , Tensor Toolbox for MATLAB, Version 3.6, Sept. 2023

work page 2023

[3] [3]

Eckart-Young

J. D. Carroll and J.-J. Chang , Analysis of individual differences in multidimensional scaling via an N-way generalization of “Eckart-Young” decomposition , Psychometrika, 35 (1970), pp. 283–319

work page 1970

[4] [4]

E. C. Chi and T. G. Kolda , On Tensors, Sparsity, and Nonnegative Factorizations , SIAM Journal on Matrix Analysis and Applications, 33 (2012), pp. 1272–1299

work page 2012

[5] [5]

J. H. Choi and S. Vishwanathan, DFacTo: Distributed factorization of tensors , in Advances in Neural Information Processing Systems (NIPS 2014), 2014, pp. 1296–1304

work page 2014

[6] [6]

Cohen, O

N. Cohen, O. Sharir, and A. Shashua , On the expressive power of deep learning: A tensor analysis , in 29th Annual Conference on Learning Theory, vol. 49, PMLR, June 2016, pp. 698–728

work page 2016

[7] [7]

J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, Q. Le, and A. Ng , Large scale distributed deep networks , in Advances in Neural Information Processing Systems, F. Pereira, C. Burges, L. Bottou, and K. Weinberger, eds., vol. 25, Curran Associates, Inc., 2012

work page 2012

[8] [8]

D. M. Dunlavy, N. T. Johnson, and others , Pyttb: Python Tensor Toolbox, v1.8.2 , Jan. 2025

work page 2025

[9] [9]

Fanaee-T and J

H. Fanaee-T and J. Gama , Tensor-based anomaly detection: An interdisciplinary survey , Knowledge- Based Systems, 98 (2016), pp. 130–147, https://doi.org/10.1016/j.knosys.2016.01.027

work page doi:10.1016/j.knosys.2016.01.027 2016

[10] [10]

Gujral, R

E. Gujral, R. Pasricha, and E. E. Papalexakis , SamBaTen: Sampling-based batch incremental tensor decomposition, Sept. 2017, https://arxiv.org/abs/1709.00668v1

work page arXiv 2017

[11] [11]

explanatory

R. A. Harshman, Foundations of the PARAFAC procedure: Models and conditions for an “explanatory” multi-modal factor analysis , UCLA Working Papers in Phonetics, 16 (1970), pp. 1–84

work page 1970

[12] [12]

A. E. Helal, J. Laukemann, F. Checconi, J. J. Tithi, T. Ranadive, F. Petrini, and J. Choi , ALTO: Adaptive linearized storage of sparse tensors , in Proceedings of the 35th ACM International Conference on Supercomputing, Ics ’21, Virtual Event, USA and New York, NY, USA, 2021, Associ- ation for Computing Machinery, pp. 404–416, https://doi.org/10.1145/344...

work page doi:10.1145/3447818.3461703 2021

[13]

J. Helton and F. Davis, Latin hypercube sampling and the propagation of uncertainty in analyses of complex systems, Reliability Engineering & System Safety, 81 (2003), pp. 23–69, https://doi.org/10.1016/S0951-8320(03)00058-9

[14]

D. Hong, T. G. Kolda, and J. A. Duersch, Generalized Canonical Polyadic Tensor Decomposition, SIAM Review, 62 (2020), pp. 133–163

[15]

M. Janzamin, H. Sedghi, and A. Anandkumar, Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods, June 2015, https://arxiv.org/abs/1506.08473v3

[16]

O. Kaya and B. Uçar, Scalable sparse tensor decompositions in distributed memory systems, in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC’15, Austin, Texas and New York, NY, USA, 2015, ACM, pp. 77:1–77:11, https://doi.org/10.1145/2807591.2807624

[17]

O. Kaya and B. Uçar, Parallel CANDECOMP/PARAFAC decomposition of sparse tensors using dimension trees, SIAM Journal on Scientific Computing, 40 (2018), pp. C99–C130

[18]

D. P. Kingma and J. Ba, Adam: A Method for Stochastic Optimization, in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun, eds., 2015

[19]

T. G. Kolda and B. W. Bader, Tensor Decompositions and Applications, SIAM Review, 51 (2009), pp. 455–500

[20]

T. G. Kolda and D. Hong, Stochastic Gradients for Large-Scale Tensor Decomposition, SIAM Journal on Mathematics of Data Science, 2 (2020), pp. 1066–1095

[21]

C. Lewis and E. Phipps, Low-Communication Asynchronous Distributed Generalized Canonical Polyadic Tensor Decomposition, in 2021 IEEE High Performance Extreme Computing Conference (HPEC), Waltham, MA, USA, Sept. 2021, IEEE, pp. 1–5, https://doi.org/10.1109/HPEC49654.2021.9622844

[22]

J. Li, J. Sun, and R. Vuduc, HiCOO: Hierarchical storage of sparse tensors, in ACM/IEEE International Conference for High-Performance Computing, Networking, Storage, and Analysis (SC18), 2018

[23]

B. Liu, C. Wen, A. D. Sarwate, and M. Mehri Dehnavi, A unified optimization approach for sparse tensor operations on GPUs, ArXiv e-prints, 2017

[24]

M. D. McKay, R. J. Beckman, and W. J. Conover, A Comparison of Three Methods for Selecting Values of Input Variables in the Analysis of Output from a Computer Code, Technometrics, 21 (1979), pp. 239–245, https://doi.org/10.2307/1268522

[25]

Z. Miao, J. Li, J. C. Calhoun, and R. Ge, BALA-CPD: BALanced and asynchronous distributed tensor decomposition, in 2022 IEEE International Conference on Cluster Computing (CLUSTER), 2022, pp. 440–450, https://doi.org/10.1109/CLUSTER51413.2022.00054

[26]

J. M. Myers and D. M. Dunlavy, Tensor decompositions for count data that leverage stochastic and deterministic optimization, Optimization Methods and Software, 40 (2025), pp. 352–387, https://doi.org/10.1080/10556788.2024.2401981

[27]

A. Novikov, D. Podoprikhin, A. Osokin, and D. Vetrov, Tensorizing neural networks, CoRR, abs/1509.06569 (2015)

[28]

E. T. Phipps, N. T. Johnson, and T. G. Kolda, Streaming generalized canonical polyadic tensor decompositions, in Proceedings of the Platform for Advanced Scientific Computing Conference, PASC ’23, New York, NY, USA, 2023, Association for Computing Machinery, https://doi.org/10.1145/3592979.3593405

[29]

E. T. Phipps and T. G. Kolda, Software for sparse tensor decomposition on emerging computing architectures, SIAM Journal on Scientific Computing, 41 (2019), pp. C269–C290, https://doi.org/10.1137/18M1210691

[30]

W. Pu, S. Ibrahim, X. Fu, and M. Hong, Stochastic mirror descent for low-rank tensor decomposition under non-euclidean losses, IEEE Transactions on Signal Processing, 70 (2022), pp. 1803–1818, https://doi.org/10.1109/TSP.2022.3163896

[31]

S. Reddi, Z. Charles, M. Zaheer, Z. Garrett, K. Rush, J. Konečný, S. Kumar, and H. B. McMahan, Adaptive Federated Optimization, Sept. 2021, https://arxiv.org/abs/2003.00295

[32]

S. Smith, J. W. Choi, J. Li, R. Vuduc, J. Park, X. Liu, and G. Karypis, FROSTT: The formidable repository of open sparse tensors and tools, 2017

[33]

S. Smith and G. Karypis, A medium-grained algorithm for sparse tensor factorization, in 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2016, pp. 902–911, https://doi.org/10.1109/IPDPS.2016.113

[34]

S. Smith, N. Ravindran, N. D. Sidiropoulos, and G. Karypis, SPLATT: Efficient and parallel sparse tensor-matrix multiplication, in 2015 IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2015, pp. 61–70, https://doi.org/10.1109/ipdps.2015.27

[35]

S. U. Stich, Local SGD converges fast and communicates little, 2019, https://arxiv.org/abs/1805.09767

[36]

C. R. Trott, D. Lebrun-Grandié, D. Arndt, J. Ciesko, V. Dang, N. Ellingwood, R. Gayatri, E. Harvey, D. S. Hollman, D. Ibanez, N. Liber, J. Madsen, J. Miles, D. Poliakoff, A. Powell, S. Rajamanickam, M. Simberg, D. Sunderland, B. Turcksin, and J. Wilke, Kokkos 3: Programming model extensions for the exascale era, IEEE Transactions on Parallel and Dis...

work page doi:10.1109/tpds.2021.3097283 2022

[37]

M. Vandecappelle, N. Vervliet, and L. D. Lathauwer, Inexact generalized Gauss–Newton for scaling the canonical polyadic decomposition with non-least-squares cost functions, IEEE Journal of Selected Topics in Signal Processing, 15 (2021), pp. 491–505, https://doi.org/10.1109/JSTSP.2020.3045911

[38]

Y. Wang, R. Chen, J. Ghosh, J. C. Denny, A. Kho, Y. Chen, B. A. Malin, and J. Sun, Rubik: Knowledge guided tensor factorization and completion for health data analytics, in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’15, Sydney, NSW, Australia and New York, NY, USA, 2015, ACM, pp. 1265–1274, h...

[39]

S. Zhang, A. Choromanska, and Y. LeCun, Deep learning with elastic averaging SGD, in Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS’15, Cambridge, MA, USA, 2015, MIT Press, pp. 685–693
