pith. sign in

arxiv: 2605.20353 · v1 · pith:4XRBKSRAnew · submitted 2026-05-19 · 🧮 math.NA · cs.NA

Synchronous and Asynchronous Parallelism Approaches for Generalized Canonical Polyadic Tensor Decomposition with GenTen

Pith reviewed 2026-05-21 07:14 UTC · model grok-4.3

classification 🧮 math.NA cs.NA
keywords generalized canonical polyadic decompositiontensor decompositionparallel algorithmsstochastic optimizationdistributed memoryshared memory parallelismasynchronous methodsKokkos
0
0 comments X

The pith

Parallel synchronous and asynchronous methods scale generalized canonical polyadic tensor decomposition to large datasets while maintaining accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops synchronous and asynchronous parallel algorithms for Generalized Canonical Polyadic (GCP) tensor decomposition to handle large, sparse, high-dimensional data more efficiently than prior serial approaches. It builds on randomization and stochastic optimization by adding shared-memory parallelism via Kokkos for CPU and GPU portability and distributed-memory parallelism via MPI in a hybrid setup. An asynchronous variant draws on federated learning ideas to push scalability further. A sympathetic reader would care because these changes aim to enable interpretable analysis of non-Gaussian data such as counts or binary values in massive real-world datasets without major losses in speed or quality.

Core claim

The authors claim that the proposed synchronous hybrid MPI+Kokkos and asynchronous distributed approaches for GCP tensor decomposition achieve even better scalability to large data sets while maintaining accuracy, as studied on synthetic and publicly-available real-world datasets of varying sizes, dimensions, and sparsity patterns using several loss functions.

What carries the argument

The hybrid synchronous parallelism combining Kokkos for shared-memory random sampling and stochastic optimization with MPI for distribution, plus an asynchronous distributed scheme modeled on federated learning techniques.

If this is right

  • Enables decomposition of larger count or binary datasets using flexible loss functions without excessive compute time.
  • Supports portable execution across CPU and GPU hardware through the Kokkos layer.
  • Delivers linear or near-linear speedups on distributed systems for varying sparsity patterns.
  • Maintains decomposition accuracy across multiple loss functions on both synthetic and real data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The asynchronous approach could extend to privacy-sensitive settings where raw data cannot be centralized.
  • Similar parallelism patterns might apply to other tensor models such as Tucker decomposition for comparable gains.
  • Testing on streaming or dynamically growing datasets would reveal whether the methods remain stable over time.
  • Integration into existing data pipelines could reduce the barrier to using GCP on industry-scale problems.

Load-bearing premise

That the randomization, stochastic optimization, and parallelization steps can be combined without introducing convergence problems or accuracy loss relative to the serial GCP baseline.

What would settle it

Running the parallel GCP implementations and the serial baseline on the same large sparse real-world dataset and checking whether the final loss values and reconstruction errors remain comparable while wall-clock time decreases with added processors.

Figures

Figures reproduced from arXiv: 2605.20353 by Eric T. Phipps, Jeremy M. Myers.

Figure 1
Figure 1. Figure 1: Two versions of distributed parallelism in GenTen. The figures illustrate a tensor [PITH_FULL_IMAGE:figures/full_fig_p010_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Comparison of fused and non-fused semi-stratified sampling/MTTKRP kernels on [PITH_FULL_IMAGE:figures/full_fig_p012_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Comparison of distributed GCP approaches for over a range of total number of [PITH_FULL_IMAGE:figures/full_fig_p013_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Breaking down GCP runtime for the all-reduce and two-sided communication ap [PITH_FULL_IMAGE:figures/full_fig_p014_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Strong scaling of two GCP-Adam epochs across a range of compute nodes. The slope of the best linear fit (in log-space) is shown above each curve. performed a parameter sensitivity analysis on synthetic data utilizing on-node parallelism only to establish baseline behavior. Second, we introduced distributed parallelism and conducted experiments with two large-scale, sparse tensor datasets. Since federated a… view at source ↗
Figure 6
Figure 6. Figure 6: Simple rank correlation coefficient convergence of [PITH_FULL_IMAGE:figures/full_fig_p018_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Scatter plots varying rate and adam-beta1 for GCP-SGD. The y-axis denotes the relative error from the MLE (top) and total time (bottom); rate is plotted along the x-axis; variations in adam-beta1 are denoted by the hue. up to 21× faster than GCP-SGD. However, parameter tuning seems to have a deleterious effect on accuracy for both methods overall, which is surprising given all that changed was the random i… view at source ↗
Figure 8
Figure 8. Figure 8: Scatter plots varying rate and decay for GCP-SGD. The y-axis denotes the relative error from the MLE (top) and total time (bottom); rate is plotted along the x-axis; variations in decay are denoted by the hue. We varied the asynchrony and the number of processors. Each node in the cluster consists of one 44-core IBM Power9 processor and 4 NVIDIA V100 GPUs with 32 GB of RAM each. We compiled GenTen with NVC… view at source ↗
Figure 9
Figure 9. Figure 9: Scatter plots varying two parameters for [PITH_FULL_IMAGE:figures/full_fig_p021_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Results on the synthetic count dataset comparing [PITH_FULL_IMAGE:figures/full_fig_p022_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Traces of Poisson loss values by time in seconds for the arXiv dataset. The hue, [PITH_FULL_IMAGE:figures/full_fig_p023_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Results on the amazon-reviews dataset comparing [PITH_FULL_IMAGE:figures/full_fig_p024_12.png] view at source ↗
read the original abstract

The Canonical Polyadic (CP) tensor decomposition is a well-known method for interpretable analysis of high-dimensional data. Recently, the Generalized CP method (GCP) was introduced by Hong and Kolda to allow for flexible choice of the loss function in the optimization problem defining the CP model, enabling more interpretable decompositions of strongly non-Gaussian data such as count or binary data. Furthermore, Kolda and Hong introduced a version of GCP that leverages randomization and stochastic optimization to address scalability to large, sparse data sets. In this work, we take these ideas a step further and consider synchronous and asynchronous algorithms for parallel GCP tensor decomposition through the GenTen software package, exploiting both shared and distributed memory parallelism. We build on shared memory parallel CP decomposition algorithms utilizing Kokkos for portability across CPU and GPU architectures to support the random sampling and stochastic optimization methods required by GCP. We then couple this approach to the well-known medium-grained distributed memory parallelism scheme developed for traditional CP decompositions through MPI, providing a synchronous, hybrid MPI+Kokkos, parallel GCP decomposition capability. Finally, we propose an asynchronous distributed parallelism approach building on related techniques for federated learning to achieve even better scalability to large data sets. We study the effectiveness of the proposed synchronous and asynchronous approaches vis-a-vis computational cost and accuracy on synthetic and publicly-available real-world datasets of varying sizes, dimensions, and sparsity patterns using several loss functions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript extends prior Generalized CP (GCP) tensor decomposition work by Kolda and Hong, adding synchronous hybrid MPI+Kokkos and asynchronous distributed parallelism schemes within the GenTen package. The central claim is that these parallel approaches achieve improved scalability on large sparse tensors while preserving accuracy relative to the serial randomized/stochastic GCP baseline, demonstrated via experiments on synthetic and real-world datasets of varying size, dimension, and sparsity using multiple loss functions.

Significance. A sound demonstration of portable, scalable GCP implementations would strengthen practical tools for interpretable analysis of non-Gaussian high-dimensional data. The hybrid shared/distributed memory design and federated-learning-inspired asynchrony are technically interesting extensions; however, the absence of quantitative results, error bars, or convergence analysis in the provided description leaves the accuracy-preservation claim unverified at present.

major comments (2)
  1. [asynchronous distributed parallelism approach] The asynchronous distributed scheme (described after the synchronous MPI+Kokkos approach) applies local stochastic gradients computed on stale factor copies. For loss functions whose gradients are nonlinear in the factors (Poisson, Bernoulli, etc.), this introduces potential bias or variance inflation not present in the synchronous or serial case. No fixed-point analysis, convergence-rate bound, or empirical safeguard (e.g., bounded staleness, compensation terms) is supplied to show that the fixed point or accuracy remains comparable to the serial GCP baseline.
  2. [evaluation on synthetic and real-world datasets] The abstract and evaluation plan state that accuracy is maintained, yet no quantitative metrics, tables, or figures reporting fit quality, reconstruction error, or iteration counts versus the serial baseline appear in the manuscript description. Without these data, the claim that the parallel schemes “maintain accuracy” cannot be assessed.
minor comments (1)
  1. [algorithm description] Notation for the stochastic sampling and local gradient computation should be aligned explicitly with the earlier GCP formulation (Hong & Kolda) to make the extension transparent.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment point by point below, indicating revisions where appropriate. Our responses focus on substance and aim to clarify or strengthen the presentation of the parallel GCP approaches.

read point-by-point responses
  1. Referee: [asynchronous distributed parallelism approach] The asynchronous distributed scheme (described after the synchronous MPI+Kokkos approach) applies local stochastic gradients computed on stale factor copies. For loss functions whose gradients are nonlinear in the factors (Poisson, Bernoulli, etc.), this introduces potential bias or variance inflation not present in the synchronous or serial case. No fixed-point analysis, convergence-rate bound, or empirical safeguard (e.g., bounded staleness, compensation terms) is supplied to show that the fixed point or accuracy remains comparable to the serial GCP baseline.

    Authors: We acknowledge the referee's concern regarding potential bias in asynchronous updates for nonlinear loss functions. The approach is motivated by federated learning methods that tolerate staleness in stochastic settings. In the revised manuscript, we have added a discussion of the bounded-staleness mechanism implemented in GenTen and included additional empirical results (convergence plots and accuracy metrics) comparing the asynchronous scheme to the synchronous and serial baselines across the tested loss functions. A full fixed-point or convergence-rate analysis lies beyond the scope of this applied paper but is noted as future work. revision: partial

  2. Referee: [evaluation on synthetic and real-world datasets] The abstract and evaluation plan state that accuracy is maintained, yet no quantitative metrics, tables, or figures reporting fit quality, reconstruction error, or iteration counts versus the serial baseline appear in the manuscript description. Without these data, the claim that the parallel schemes “maintain accuracy” cannot be assessed.

    Authors: We have revised the manuscript to prominently feature the existing quantitative results. New tables and figures now explicitly report fit quality, reconstruction error, and iteration counts for the synchronous and asynchronous schemes versus the serial randomized GCP baseline on all synthetic and real-world datasets. Multiple independent runs with error bars are included to support the accuracy-preservation claim and address variability. revision: yes

standing simulated objections not resolved
  • A rigorous fixed-point analysis or convergence-rate bound for the asynchronous scheme with nonlinear loss functions.

Circularity Check

0 steps flagged

No circularity: new parallel schemes are independent algorithmic extensions evaluated empirically

full rationale

The paper cites the GCP formulation and randomized stochastic optimization from prior work by Hong and Kolda as the foundation, then describes new synchronous (hybrid MPI+Kokkos) and asynchronous (federated-learning-style) parallel implementations as additions. These are presented through algorithmic descriptions and empirical studies on synthetic and real datasets with varying loss functions, without any claimed first-principles derivation, fitted-parameter predictions, or self-referential definitions that reduce the new results to the inputs by construction. No self-citation chains, uniqueness theorems, or ansatzes are invoked in a load-bearing way; the central claims rest on direct implementation and benchmarking rather than re-deriving or renaming prior quantities.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper relies on the correctness of prior randomized GCP methods and standard parallel programming models; no new mathematical axioms or invented entities are introduced.

axioms (1)
  • domain assumption Randomization and stochastic optimization preserve useful properties of GCP for non-Gaussian data as shown in prior work.
    The parallel extensions assume the base GCP method remains valid when distributed.

pith-pipeline@v0.9.0 · 5787 in / 1121 out tokens · 38405 ms · 2026-05-21T07:14:46.408733+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 4 internal anchors

  1. [1]

    B. M. Adams, W. J. Bohnhoff, K. R. Dalbey, M. S. Ebeida, J. P. Eddy, M. S. Eldred, R. W. Hooper, P. D. Hough, K. T. Hu, J. D. Jakeman, M. Khalil, K. A. Maupin, J. A. Mon- schke, E. M. Ridgway, A. . Rushdi, D. T. Seidl, J. A. Stephens, and J. G. Winokur , Dakota, A multilevel parallel object-oriented framework for design optimization, parameter esti- matio...

  2. [2]

    Bader, T

    Brett W. Bader, T. G. Kolda, et al. , Tensor Toolbox for MATLAB, Version 3.6, Sept. 2023

  3. [3]

    Eckart-Young

    J. D. Carroll and J.-J. Chang , Analysis of individual differences in multidimensional scaling via an N-way generalization of “Eckart-Young” decomposition , Psychometrika, 35 (1970), pp. 283–319

  4. [4]

    E. C. Chi and T. G. Kolda , On Tensors, Sparsity, and Nonnegative Factorizations , SIAM Journal on Matrix Analysis and Applications, 33 (2012), pp. 1272–1299

  5. [5]

    J. H. Choi and S. Vishwanathan, DFacTo: Distributed factorization of tensors , in Advances in Neural Information Processing Systems (NIPS 2014), 2014, pp. 1296–1304

  6. [6]

    Cohen, O

    N. Cohen, O. Sharir, and A. Shashua , On the expressive power of deep learning: A tensor analysis , in 29th Annual Conference on Learning Theory, vol. 49, PMLR, June 2016, pp. 698–728

  7. [7]

    J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, Q. Le, and A. Ng , Large scale distributed deep networks , in Advances in Neural Information Processing Systems, F. Pereira, C. Burges, L. Bottou, and K. Weinberger, eds., vol. 25, Curran Associates, Inc., 2012

  8. [8]

    D. M. Dunlavy, N. T. Johnson, and others , Pyttb: Python Tensor Toolbox, v1.8.2 , Jan. 2025

  9. [9]

    Fanaee-T and J

    H. Fanaee-T and J. Gama , Tensor-based anomaly detection: An interdisciplinary survey , Knowledge- Based Systems, 98 (2016), pp. 130–147, https://doi.org/10.1016/j.knosys.2016.01.027

  10. [10]

    Gujral, R

    E. Gujral, R. Pasricha, and E. E. Papalexakis , SamBaTen: Sampling-based batch incremental tensor decomposition, Sept. 2017, https://arxiv.org/abs/1709.00668v1

  11. [11]

    explanatory

    R. A. Harshman, Foundations of the PARAFAC procedure: Models and conditions for an “explanatory” multi-modal factor analysis , UCLA Working Papers in Phonetics, 16 (1970), pp. 1–84

  12. [12]

    A. E. Helal, J. Laukemann, F. Checconi, J. J. Tithi, T. Ranadive, F. Petrini, and J. Choi , ALTO: Adaptive linearized storage of sparse tensors , in Proceedings of the 35th ACM International Conference on Supercomputing, Ics ’21, Virtual Event, USA and New York, NY, USA, 2021, Associ- ation for Computing Machinery, pp. 404–416, https://doi.org/10.1145/344...

  13. [13]

    Helton and F

    J. Helton and F. Davis , Latin hypercube sampling and the propagation of uncertainty in analyses of complex systems, Reliability Engineering & System Safety, 81 (2003), pp. 23–69, https://doi.org/10. 1016/S0951-8320(03)00058-9

  14. [14]

    D. Hong, T. G. Kolda, and J. A. Duersch , Generalized Canonical Polyadic Tensor Decomposition , SIAM Review, 62 (2020), pp. 133–163

  15. [15]

    Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods

    M. Janzamin, H. Sedghi, and A. Anandkumar , Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods , June 2015, https://arxiv.org/abs/1506.08473v3

  16. [16]

    Kaya and B

    O. Kaya and B. Uc ¸ar , Scalable sparse tensor decompositions in distributed memory systems , in Proceedings of the International Conference for High Performance Computing, Networking, Stor- age and Analysis, SC’15, Austin, Texas and New York, NY, USA, 2015, ACM, pp. 77:1–77:11, https://doi.org/10.1145/2807591.2807624

  17. [17]

    Kaya and B

    O. Kaya and B. Uc ¸ar, Parallel CANDECOMP/PARAFAC decomposition of sparse tensors using di- mension trees, SIAM Journal On Scientific Computing, 40 (2018), pp. C99–C130

  18. [18]

    D. P. Kingma and J. Ba, Adam: A Method for Stochastic Optimization , in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun, eds., 2015

  19. [19]

    T. G. Kolda and B. W. Bader , Tensor Decompositions and Applications , SIAM Review, 51 (2009), pp. 455–500

  20. [20]

    T. G. Kolda and D. Hong, Stochastic Gradients for Large-Scale Tensor Decomposition, SIAM Journal on Mathematics of Data Science, 2 (2020), pp. 1066–1095

  21. [21]

    Lewis and E

    C. Lewis and E. Phipps, Low-Communication Asynchronous Distributed Generalized Canonical Polyadic Tensor Decomposition, in 2021 IEEE High Performance Extreme Computing Conference (HPEC), PARALLELISM APPROACHES FOR GCP WITH GENTEN 27 Waltham, MA, USA, Sept. 2021, IEEE, pp. 1–5, https://doi.org/10.1109/HPEC49654.2021.9622844

  22. [22]

    J. Li, J. Sun, and R. Vuduc , HiCOO: Hierarchical storage of sparse tensors , in ACM/IEEE Inter- national Conference for High-Performance Computing, Networking, Storage, and Analysis (SC18), 2018

  23. [23]

    B. Liu, C. Wen, A. D. Sarwate, and M. Mehri Dehnavi, A unified optimization approach for sparse tensor operations on gpus . ArXiv e-prints, 2017

  24. [24]

    M. D. McKay, R. J. Beckman, and W. J. Conover , A Comparison of Three Methods for Selecting Values of Input Variables in the Analysis of Output from a Computer Code , Technometrics, 21 (1979), pp. 239–245, https://doi.org/10.2307/1268522, https://arxiv.org/abs/1268522

  25. [25]

    Z. Miao, J. Li, J. C. Calhoun, and R. Ge , BALA-CPD: BALanced and asynchronous distributed tensor decomposition, in 2022 IEEE International Conference on Cluster Computing (CLUSTER), 2022, pp. 440–450, https://doi.org/10.1109/CLUSTER51413.2022.00054

  26. [26]

    J. M. Myers and D. M. Dunlavy , Tensor decompositions for count data that leverage stochastic and deterministic optimization , Optimization Methods and Software, 40 (2025), pp. 352–387, https: //doi.org/10.1080/10556788.2024.2401981, https://arxiv.org/abs/https://doi.org/10.1080/10556788. 2024.2401981

  27. [27]

    Tensorizing Neural Networks

    A. Novikov, D. Podoprikhin, A. Osokin, and D. Vetrov , Tensorizing neural networks , CoRR, abs/1509.06569 (2015)

  28. [28]

    E. T. Phipps, N. T. Johnson, and T. G. Kolda , Streaming generalized canonical polyadic tensor decompositions, in Proceedings of the Platform for Advanced Scientific Computing Conference, Pasc ’23, New York, NY, USA, 2023, Association for Computing Machinery, https://doi.org/10.1145/ 3592979.3593405

  29. [29]

    E. T. Phipps and T. G. Kolda , Software for sparse tensor decomposition on emerging computing architectures, SIAM Journal on Scientific Computing, 41 (2019), pp. C269–C290, https://doi.org/10. 1137/18M1210691

  30. [30]

    W. Pu, S. Ibrahim, X. Fu, and M. Hong , Stochastic mirror descent for low-rank tensor decomposition under non-euclidean losses, IEEE Transactions on Signal Processing, 70 (2022), pp. 1803–1818, https: //doi.org/10.1109/TSP.2022.3163896

  31. [31]

    Adaptive Federated Optimization

    S. Reddi, Z. Charles, M. Zaheer, Z. Garrett, K. Rush, J. Kone ˇcn´y, S. Kumar, and H. B. McMahan, Adaptive Federated Optimization, Sept. 2021, https://arxiv.org/abs/2003.00295

  32. [32]

    Smith, J

    S. Smith, J. W. Choi, J. Li, R. Vuduc, J. Park, X. Liu, and G. Karypis, FROSTT: The formidable repository of open sparse tensors and tools , 2017

  33. [33]

    Smith and G

    S. Smith and G. Karypis , A medium-grained algorithm for sparse tensor factorization , in 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2016, pp. 902–911, https: //doi.org/10.1109/IPDPS.2016.113

  34. [34]

    Smith, N

    S. Smith, N. Ravindran, N. D. Sidiropoulos, and G. Karypis, SPLATT: Efficient and parallel sparse tensor-matrix multiplication, in IPDPS 2015: IEEE International Parallel and Distributed Processing Symposium, 2015 IEEE International Parallel and Distributed Processing Symposium, May 2015, pp. 61–70, https://doi.org/10.1109/ipdps.2015.27

  35. [35]

    S. U. Stich, Local SGD converges fast and communicates little , 2019, https://arxiv.org/abs/1805.09767

  36. [36]

    C. R. Trott, D. Lebrun-Grandi ´e, D. Arndt, J. Ciesko, V. Dang, N. Ellingwood, R. Gayatri, E. Harvey, D. S. Hollman, D. Ibanez, N. Liber, J. Madsen, J. Miles, D. Poliakoff, A. Pow- ell, S. Rajamanickam, M. Simberg, D. Sunderland, B. Turcksin, and J. Wilke , Kokkos 3: Programming model extensions for the exascale era , IEEE Transactions on Parallel and Dis...

  37. [37]

    Vandecappelle, N

    M. Vandecappelle, N. Vervliet, and L. D. Lathauwer, Inexact generalized gauss–newton for scaling the canonical polyadic decomposition with non-least-squares cost functions , IEEE Journal of Selected Topics in Signal Processing, 15 (2021), pp. 491–505, https://doi.org/10.1109/JSTSP.2020.3045911

  38. [38]

    Y. Wang, R. Chen, J. Ghosh, J. C. Denny, A. Kho, Y. Chen, B. A. Malin, and J. Sun , Rubik: Knowledge guided tensor factorization and completion for health data analytics , in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Kdd ’15, Sydney, NSW, Australia and New York, NY, USA, 2015, ACM, pp. 1265–1274, h...

  39. [39]

    Zhang, A

    S. Zhang, A. Choromanska, and Y. LeCun, Deep learning with elastic averaging SGD, in Proceedings 28 J. M. MYERS AND E. T. PHIPPS of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS’15, Cambridge, MA, USA, 2015, MIT Press, pp. 685–693