
arxiv: 2605.29728 · v1 · pith:HCQ2GQDRnew · submitted 2026-05-28 · 💻 cs.DC

PRISM: Processing-In-Memory Sparse MTTKRP for Tensor Decomposition Acceleration

Pith reviewed 2026-06-29 05:42 UTC · model grok-4.3

classification 💻 cs.DC
keywords processing-in-memory · sparse tensor decomposition · MTTKRP · CP-ALS · heterogeneous computing · tensor acceleration

The pith

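Processing-in-memory hardware accelerates sparse MTTKRP, the core operation of CP-ALS tensor decomposition, by up to 2.37 times on its own and by up to 2.64 times when paired with a CPU, both measured against state-of-the-art CPU implementations.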

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes that the memory-bound sparse matricized tensor times Khatri-Rao product can be mapped to processing-in-memory hardware to speed up alternating least squares tensor decomposition. The work tests partitioning strategies, number formats, and kernel changes on UPMEM PIM, then adds CPU collaboration to reach the reported gains. A sympathetic reader would care because sparse tensor decompositions appear throughout machine learning yet run slowly on ordinary processors due to memory traffic. If the mapping succeeds, the same hardware could handle larger sparse datasets without proportional increases in runtime or energy.
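To make the "number formats" axis concrete, here is a minimal sketch of fixed-point arithmetic, one format family that such a study might cover. The Q16.16 layout and the helper names `to_fixed`, `fixed_mul`, and `to_float` are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical Q16.16 fixed-point helpers (assumed format, not the paper's).
FRAC_BITS = 16
SCALE = 1 << FRAC_BITS


def to_fixed(x):
    """Convert a float to a Q16.16 integer, rounding to nearest."""
    return int(round(x * SCALE))


def fixed_mul(a, b):
    """Multiply two Q16.16 integers and rescale back to Q16.16."""
    return (a * b) >> FRAC_BITS


def to_float(a):
    """Convert a Q16.16 integer back to a float."""
    return a / SCALE


# One MTTKRP-style product: nonzero value times two factor-matrix entries.
v, a, b = 2.0, 1.5, 0.25
prod = fixed_mul(fixed_mul(to_fixed(v), to_fixed(a)), to_fixed(b))
result = to_float(prod)  # these inputs are exactly representable in Q16.16
```

Values that are not exactly representable (0.1, for example) pick up a small rounding error on conversion, which is presumably why the effect of the number format on convergence has to be measured rather than assumed.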

Core claim

PRISM provides the first processing-in-memory implementation of sparse MTTKRP and shows that it reaches up to 2.37 times the speed of state-of-the-art CPU code; a heterogeneous CPU-PIM version reaches up to 2.64 times. The approach also uses a significantly larger fraction of peak hardware performance than CPU or GPU implementations, although the distributed memory layout on UPMEM can significantly hinder performance on some workloads.

What carries the argument

Partitioning strategies that distribute the sparse MTTKRP computation across UPMEM PIM modules together with number-format and kernel choices that adapt the irregular memory accesses to the hardware.
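As an illustration of one such strategy, the sketch below splits the computation along the rank (column) dimension, so each worker only needs its own column slice of the factor matrices. It is a plain-Python model under stated assumptions: the function names, the `num_workers` parameter, and the sequential loop standing in for PIM modules are all hypothetical, and the paper's actual kernel is not reproduced here.

```python
def mttkrp_columns(coords, values, A, B, num_rows, cols):
    """Compute the given output columns of a 3-mode spMTTKRP (COO input)."""
    out = [[0.0] * len(cols) for _ in range(num_rows)]
    for (i, j, k), v in zip(coords, values):
        for c, r in enumerate(cols):
            out[k][c] += v * A[i][r] * B[j][r]
    return out


def rank_partitioned_mttkrp(coords, values, A, B, num_rows, rank, num_workers):
    """Assign contiguous column ranges to workers and concatenate results."""
    chunk = -(-rank // num_workers)  # ceiling division
    out = [[] for _ in range(num_rows)]
    for w in range(num_workers):  # each iteration models one worker
        cols = range(w * chunk, min((w + 1) * chunk, rank))
        if not cols:
            continue  # more workers than columns
        part = mttkrp_columns(coords, values, A, B, num_rows, cols)
        for k in range(num_rows):
            out[k].extend(part[k])
    return out


# Toy data (made up for illustration): 3 nonzeros, rank 3.
coords = [(0, 1, 0), (1, 0, 1), (1, 1, 1)]
values = [2.0, 3.0, 1.0]
A = [[1.0, 2.0, 0.5], [3.0, 4.0, 1.5]]
B = [[5.0, 6.0, 2.0], [7.0, 8.0, 1.0]]

full = mttkrp_columns(coords, values, A, B, 2, range(3))
split = rank_partitioned_mttkrp(coords, values, A, B, 2, 3, num_workers=2)
```

Because output columns are independent of each other, the partitioned result matches the unpartitioned one exactly. In this model every worker still reads all nonzeros, which is the cost traded for not replicating factor-matrix data.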

If this is right

  • Sparse MTTKRP can be performed faster on PIM hardware than on general-purpose CPUs alone.
  • Adding a CPU to the PIM execution yields an extra performance increment beyond PIM-only runs.
  • The fraction of peak performance achieved is higher than for CPU or GPU versions of the same operation.
  • Workload-dependent variation occurs because the distributed memory can hinder performance on certain sparse patterns.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same PIM mapping techniques could be tried on other memory-bound sparse linear-algebra kernels common in machine learning.
  • Future PIM architectures might benefit from hardware features that reduce overhead for irregular sparse accesses.
  • Combining PIM with additional accelerators could create multi-level heterogeneous systems for tensor workloads.

Load-bearing premise

The UPMEM distributed memory system and the chosen partitioning strategies will not introduce slowdowns large enough to erase the measured speedups on the workloads of interest.

What would settle it

Running the same PRISM code on a broader collection of real-world sparse tensors and finding that the distributed-memory overhead drops performance below the CPU baseline on more than a small fraction of cases.

Figures

Figures reproduced from arXiv: 2605.29728 by Aleksandar Ilic, Daniel Pacheco, Leonel Sousa.

Figure 1: Representation of element-wise MTTKRP partial result. Finally, this partial result is added to the output row indexed by the nonzero’s third coordinate, as it is the only coordinate that did not index any factor matrix. The number of columns, both of the factor matrices and output, is the same as the decomposition rank, and each column of the result is associated with a different rank. Since there are no d…
Figure 2: Representation of the UPMEM PIM architecture
Figure 3: COO and the proposed representation of a tensor
Figure 5: Partitioning decider flowchart. … dimension size and nonzero partitioning. As analyzed before, rank partitioning does not require factor matrix data replication, so it is the partitioning technique favored in the proposed approach. Regarding the two other dimensions, dimension size partitioning is preferable to nonzero partitioning. However, completely avoiding nonzero partitioning may not be optimal. This h…
Figure 6: Influence of fixed-precision and lock usage on convergence and …
Figure 8: Heterogeneous spMTTKRP execution time and workload distribution
Figure 9: Speedup of PIM-only and heterogeneous approaches over ALTO
Original abstract

Sparse tensors are the most used representation of sparse multidimensional data. Operations that decompose them, selecting their most important features while reducing their dimension, have become prevalent procedures in machine learning. One of the most used tensor decomposition algorithms is the Alternating Least Squares Canonical Polyadic Decomposition (CP-ALS), where the most time-consuming operation is the Sparse Matricized Tensor Times Khatri-Rao Product (spMTTKRP). This operation is strongly memory-bound, making it hard to implement efficiently on general-purpose processors. This work proposes PRISM, the first approach to tackle this operation using Processing-In-Memory (PIM) technology. We extensively characterize different partitioning strategies, number formats, and kernel optimizations that efficiently adapt this operation to UPMEM PIM, which is further boosted by heterogeneous collaboration with the CPU. The experimental results show that the proposed PIM-based and heterogeneous approaches achieve up to 2.37x and 2.64x speedup compared to state-of-the-art CPU implementations, respectively. However, the UPMEM distributed memory system can significantly hinder performance on certain workloads. Nonetheless, the efficiency of resource consumption for this approach, measured by peak performance fraction usage, is significantly higher than for both CPU and GPU.
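For readers unfamiliar with spMTTKRP, the following is a minimal reference sketch of the element-wise computation that the Figure 1 caption describes for a three-mode tensor: each nonzero's value is multiplied by the rows of two factor matrices selected by its first two coordinates, and the product is accumulated into the output row selected by its third coordinate. The COO input layout, the function name `spmttkrp_mode3`, and the toy data are assumptions made for illustration; this is not the PRISM kernel.

```python
def spmttkrp_mode3(coords, values, A, B, num_rows, rank):
    """Reference spMTTKRP for the third mode of a COO tensor.

    coords: list of (i, j, k) index triples, one per nonzero
    values: list of nonzero values, same order as coords
    A, B:   factor matrices as lists of rows, each row of length `rank`
    """
    out = [[0.0] * rank for _ in range(num_rows)]
    for (i, j, k), v in zip(coords, values):
        a_row, b_row, out_row = A[i], B[j], out[k]
        for r in range(rank):
            # Element-wise partial result, accumulated into output row k.
            out_row[r] += v * a_row[r] * b_row[r]
    return out


# Toy tensor with three nonzeros and rank 2 (values chosen by hand).
coords = [(0, 1, 0), (1, 0, 1), (1, 1, 1)]
values = [2.0, 3.0, 1.0]
A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]

M = spmttkrp_mode3(coords, values, A, B, num_rows=2, rank=2)
# M == [[14.0, 32.0], [66.0, 104.0]]
```

Each nonzero triggers data-dependent reads of `A[i]` and `B[j]` and a read-modify-write of `out[k]`, with only a few arithmetic operations per byte moved, which is consistent with the abstract's description of the operation as strongly memory-bound.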

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes PRISM, the first Processing-In-Memory approach for the sparse Matricized Tensor Times Khatri-Rao Product (spMTTKRP) in CP-ALS tensor decomposition on UPMEM PIM hardware. It characterizes partitioning strategies, number formats, and kernel optimizations, augmented by heterogeneous CPU collaboration. Experimental results report up to 2.37x (PIM-only) and 2.64x (heterogeneous) speedups over state-of-the-art CPU baselines, with higher peak-performance-fraction efficiency than CPU or GPU, while noting that UPMEM distributed memory can significantly hinder performance on certain workloads.

Significance. If the measured speedups prove representative across workloads and the partitioning strategies generalize, the work would demonstrate a practical path for accelerating memory-bound sparse tensor kernels via PIM, with explicit credit for the extensive characterization of partitioning, formats, and heterogeneous mapping that enables the reported gains.

major comments (2)
  1. [Abstract] Abstract and experimental results: the headline speedups of 2.37x/2.64x are reported without error bars, workload-size distributions, or baseline implementation details; this directly affects whether the central performance claim can be considered robust rather than potentially influenced by favorable sparsity patterns or tensor sizes.
  2. [Abstract] Abstract: the statement that 'the UPMEM distributed memory system can significantly hinder performance on certain workloads' is load-bearing for the generalization of the speedup claims, yet no quantification is provided of the fraction of tested workloads where speedup exceeds 1x versus where hindrance occurs; this leaves the representativeness of the headline numbers unverified.
minor comments (2)
  1. Clarify the exact state-of-the-art CPU baselines used (library versions, optimization flags) to allow direct reproduction of the reported speedups.
  2. Add a table or figure summarizing the fraction of workloads exhibiting speedup versus slowdown to address the generalization concern.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback on the robustness of our performance claims. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of results.

Point-by-point responses
  1. Referee: [Abstract] Abstract and experimental results: the headline speedups of 2.37x/2.64x are reported without error bars, workload-size distributions, or baseline implementation details; this directly affects whether the central performance claim can be considered robust rather than potentially influenced by favorable sparsity patterns or tensor sizes.

    Authors: We agree that the abstract's headline numbers would benefit from additional context on variability and baselines. The full manuscript already includes per-tensor results across a range of sizes and sparsity patterns in Section 5, along with descriptions of the CPU baselines (state-of-the-art sparse MTTKRP implementations from the literature). To address the concern directly, we will add error bars (standard deviation across repeated runs) to the key speedup figures, include a summary table or histogram of workload sizes and sparsity levels, and expand the baseline implementation details in the experimental setup subsection. These changes will be reflected in both the abstract (if space permits) and the main text.

    revision: yes

  2. Referee: [Abstract] Abstract: the statement that 'the UPMEM distributed memory system can significantly hinder performance on certain workloads' is load-bearing for the generalization of the speedup claims, yet no quantification is provided of the fraction of tested workloads where speedup exceeds 1x versus where hindrance occurs; this leaves the representativeness of the headline numbers unverified.

    Authors: The manuscript already notes the hindrance effect and reports that the maximum speedups are 2.37x/2.64x while acknowledging cases below 1x. However, we did not include an explicit count or fraction of workloads in each category. We will add this quantification in the revised version by including a breakdown (e.g., number/percentage of tensors where PIM or heterogeneous execution exceeds 1x speedup versus those hindered), supported by the per-workload data already collected. This will be presented in a new table or as part of the existing results figures to clarify representativeness.

    revision: yes
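The breakdown requested here is simple to compute once per-tensor speedups are available. The sketch below shows one possible form; the function name `speedup_breakdown` and the tensor names and values in `example` are placeholders invented for illustration, not measurements from the paper.

```python
def speedup_breakdown(speedups):
    """Summarize how many workloads run faster than the baseline.

    speedups: dict mapping a workload name to its speedup over the
    baseline (values above 1.0 mean the new approach is faster).
    """
    total = len(speedups)
    faster = sum(1 for s in speedups.values() if s > 1.0)
    return {
        "total": total,
        "faster": faster,
        "not_faster": total - faster,
        "fraction_faster": faster / total if total else 0.0,
    }


# Placeholder inputs only; these are NOT results reported by the authors.
example = {"tensor_a": 2.0, "tensor_b": 0.5, "tensor_c": 1.5, "tensor_d": 1.0}
summary = speedup_breakdown(example)
```

Reporting `fraction_faster` next to the maximum speedup would let a reader judge how representative the headline numbers are.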

Circularity Check

0 steps flagged

No circularity: experimental measurements only


The paper is an experimental systems work reporting measured speedups (up to 2.37x/2.64x) on UPMEM PIM hardware for spMTTKRP. No derivations, equations, fitted parameters presented as predictions, or self-citation chains appear in the provided abstract or described claims. All performance numbers are direct empirical results, not reductions to inputs by construction. The paper itself notes workload-dependent hindrance, so no hidden self-definition or load-bearing self-citation is required for the central claim.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no equations, parameters, or invented entities are described.


