pith. machine review for the scientific record.

arxiv: 2604.17198 · v2 · submitted 2026-04-19 · 💻 cs.PL

Recognition: unknown

Partitioning Unstructured Sparse Tensor Algebra for Load-Balanced Parallel Execution

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 06:11 UTC · model grok-4.3

classification 💻 cs.PL
keywords sparse tensor algebra · load balancing · parallel partitioning · compiler optimization · multi-core CPU · GPU · sparse data structures · tensor operations
0 comments

The pith

A partitioning algorithm provably load-balances computation for any sparse tensor algebra expression.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Sparse tensor algebra involves irregular computations that are difficult to parallelize evenly across processors. The paper presents an algorithm that divides the work for any such expression so that each parallel unit gets a balanced share. It achieves this by extending merging algorithms used in simpler cases to handle multiple inputs and complex sparse formats in multiple dimensions. If successful, this allows compilers to automatically produce efficient parallel code for CPUs and GPUs without needing custom strategies for each operation.
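
To make the load-balancing problem concrete, here is a small sketch, not taken from the paper: giving every thread the same number of rows leaves one thread with most of the non-zeros on a skewed input, while cutting the running non-zero count into equal shares stays close to balanced. The row-length distribution and thread count below are invented for illustration.

```python
# Illustrative sketch (not from the paper): row-balanced vs nnz-balanced splits.
import numpy as np

n_threads = 8
# 4 very dense rows followed by many sparse ones (rows sorted by density).
row_nnz = np.array([5000] * 4 + [20] * 996)

def max_over_ideal(chunks):
    # Ratio of the heaviest thread's work to a perfectly even share (1.0 = balanced).
    work = np.array([c.sum() for c in chunks])
    return work.max() / (row_nnz.sum() / n_threads)

# Strategy 1: equal number of rows per thread.
by_rows = np.array_split(row_nnz, n_threads)

# Strategy 2: cut where the running non-zero count crosses equal-share targets.
prefix = np.cumsum(row_nnz)
targets = prefix[-1] * np.arange(1, n_threads) / n_threads
cuts = np.searchsorted(prefix, targets) + 1      # include the crossing row
by_nnz = np.split(row_nnz, cuts)

print(round(max_over_ideal(by_rows), 2))   # ~4.49: one thread does ~4.5x its share
print(round(max_over_ideal(by_nnz), 2))    # ~1.0: near-perfect balance
```

The second strategy is the one-dimensional, single-operand special case; the paper's contribution is, in spirit, generalizing this kind of work-counted cut to whole expressions, multiple operands, and hierarchical formats.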

Core claim

The paper claims to have developed the first algorithm that partitions sparse tensor algebra expressions in a provably load-balanced manner. By generalizing parallel merging algorithms to arbitrary numbers of operands and multi-dimensional hierarchical sparse structures, the method ensures balanced execution and is implemented in a sparse tensor algebra compiler to generate kernels that run competitively with hand-tuned libraries.

What carries the argument

The partitioning algorithm, which generalizes parallel merging algorithms to any number of operands and multi-dimensional hierarchical sparse data structures to ensure load balance.
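
As a rough sketch of what such a generalization can look like, the code below (my construction, not the authors' Nacho implementation) partitions the coordinate space shared by several sparse vectors so that each partition encloses roughly an equal share of the stored entries. It uses a binary search over a cost function that is monotone in the cutoff coordinate, echoing the coordinate-space search and the monotonicity requirement the paper describes around Figures 2 and 4.

```python
# Minimal sketch of coordinate-space partitioning for a k-way coiteration.
import bisect

def partition_coordinates(operands, p, universe):
    """operands: list of sorted coordinate lists (one per sparse vector).
    Returns p+1 cutoff coordinates covering [0, universe)."""
    def cost(x):
        # Total stored entries with coordinate < x, summed over all operands.
        # Monotone non-decreasing in x, which is what the search relies on.
        return sum(bisect.bisect_left(coords, x) for coords in operands)

    total = cost(universe)
    cutoffs = [0]
    for k in range(1, p):
        target = total * k // p
        lo, hi = cutoffs[-1], universe
        while lo < hi:                      # binary search for the cutoff
            mid = (lo + hi) // 2
            if cost(mid) < target:
                lo = mid + 1
            else:
                hi = mid
        cutoffs.append(lo)
    cutoffs.append(universe)
    return cutoffs

# Three sparse vectors over coordinates [0, 12), loosely mirroring the paper's
# three-way merge example.
a = [0, 1, 5, 6, 7, 8]
b = [1, 2, 5, 9, 10, 11]
c = [0, 3, 5, 7, 11]
print(partition_coordinates([a, b, c], p=4, universe=12))   # [0, 2, 6, 8, 12]
```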

If this is right

  • Generated parallel kernels compete with or exceed hand-implemented strategies from vendor libraries.
  • The generated code significantly outperforms general-purpose strategies for sparse tensor expressions that lack specialized algorithms.
  • Supports automatic parallel code generation for arbitrary sparse tensor algebra expressions on multi-core CPUs and GPUs.
  • Provides a unified approach to load balancing across different hardware targets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the approach scales, it could simplify development of high-performance sparse computing applications by automating what was previously expert work.
  • Similar generalization techniques might apply to load balancing in other domains with irregular data, such as graph algorithms or scientific simulations.
  • Further performance gains could come from combining this partitioning with hardware-specific optimizations or dynamic scheduling.
  • Testing on larger-scale systems could reveal if the provable balance translates to real-world speedups beyond the reported benchmarks.

Load-bearing premise

That extending parallel merging algorithms to arbitrary operands and hierarchical sparse structures keeps both the load-balance guarantee and the practical performance without adding hidden costs or errors.
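
One way to picture the hierarchical case is the two-level sketch below, loosely modeled on the paper's description of the Hadamard product on DCSR matrices (Figure 5), but with helper names and a min-based per-row cost bound that are my assumptions: estimate a cost for each row that is nonempty in both operands, prefix-sum those costs, and cut the rows into roughly equal-cost groups.

```python
# Sketch of two-level load balancing for a nested sparse intersection.
from itertools import accumulate

def hadamard_row_costs(A_rows, B_rows):
    """A_rows, B_rows: dicts mapping row id -> sorted column list (DCSR-like)."""
    shared = sorted(set(A_rows) & set(B_rows))          # rows nonempty in both
    # Upper bound on per-row intersection work (assumed cost model).
    costs = [min(len(A_rows[i]), len(B_rows[i])) for i in shared]
    return shared, costs

def cut_rows_by_cost(shared, costs, p):
    prefix = list(accumulate(costs))                     # inclusive prefix sum
    total = prefix[-1]
    groups, start = [], 0
    for k in range(1, p + 1):
        target = total * k / p
        end = next((j + 1 for j in range(start, len(prefix)) if prefix[j] >= target),
                   len(prefix))
        groups.append(shared[start:end])
        start = end
    return groups

A = {0: [1, 4, 7], 2: [0, 2], 3: list(range(40)), 9: [5]}
B = {0: [4, 8], 3: list(range(0, 40, 2)), 7: [1], 9: [5, 6]}
shared, costs = hadamard_row_costs(A, B)
print(cut_rows_by_cost(shared, costs, p=2))   # [[0, 3], [9]]
```

In this toy input a single row dominates, so a row-level cut alone cannot balance the work; that is presumably why the recursive strategy in Figure 5 also load balances the inner column loop rather than stopping at rows.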

What would settle it

A concrete sparse tensor algebra expression for which the generated parallel code shows significant load imbalance across execution units, or underperforms a sequential version.

Figures

Figures reproduced from arXiv: 2604.17198 by Alexander J Root, Atharva Chougule, Bobby Yan, Fredrik Kjolstad, Rohan Yadav, Rubens Lacouture.

Figure 1
Figure 1: Sparse vectors 𝑎, 𝑏, and 𝑐, as index/value pairs. view at source ↗
Figure 2
Figure 2: A partitioning of a three-way merge, with cutoff coordinates chosen so that each partition encloses a target number of non-zeros. view at source ↗
Figure 3
Figure 3: A two-dimensional irregular partitioning that load balances the number of non-zeros processed in each partition. view at source ↗
Figure 4
Figure 4: Iteration sub-spaces enclosed by different cost functions over two 3 × 3 × 2 sparse tensors 𝐴 and 𝐵. Dotted coordinates are 0-valued and are not explicitly stored. view at source ↗
Figure 5
Figure 5: Recursive partitioning for Hadamard product, shown with Listing 4, which computes 𝑍𝑖𝑗 = 𝐴𝑖𝑗 · 𝐵𝑖𝑗 on DCSR matrices via multiple load-balanced kernels. view at source ↗
Figure 6
Figure 6: Nacho code generation. Gray boxes denote modifications to prior work in Taco; the lowering of coiterative loops is intercepted to generate partition functions and modify the iteration bounds used in the compute and assembly stages. view at source ↗
Figure 7
Figure 7: Kernels for single and nested intersections. (b) connects two sets of kernels load balanced by the strategy in (a) with a prefix sum in between. Gray kernels are parallel. view at source ↗
Figure 8
Figure 8: Speedup over Taco of CSR addition (𝐴 + 𝐵) on synthetic power-law matrices of size 100000 × 100000 with different exponents. Lower exponents indicate higher skew. view at source ↗
Figure 10
Figure 10: CSR addition on pairs of SuiteSparse matrices. view at source ↗
Figure 11
Figure 11: GPU kernels on pairs of SuiteSparse matrices. view at source ↗
Figure 12
Figure 12: SpGEMM CSR × CSR, geo-mean speedup: 0.73×. view at source ↗
Figure 13
Figure 13: COO matrix added to CSR matrix on SuiteSparse. (a) CPU results, geo-mean speedup 6.4×; (b) GPU results, geo-mean speedup 3.8×. view at source ↗
Figure 14
Figure 14: CSR Hadamard product on SuiteSparse matrices. view at source ↗
Figure 15
Figure 15: Fusion benefits across SuiteSparse matrices. view at source ↗
read the original abstract

Sparse tensor algebra is challenging to efficiently parallelize due to the irregular, data-dependent, and potentially skewed structure of sparse computation. We propose the first partitioning algorithm that provably load balances the computation of any sparse tensor algebra expression across parallel execution units. Our algorithm generalizes parallel merging algorithms to any number of operands, and to multi-dimensional, hierarchical sparse data structures. We implement our algorithm within an existing sparse tensor algebra compilation framework to automatically generate parallel sparse tensor algebra kernels that target multi-core CPUs and GPUs. We show that our generated code is competitive with hand-implemented parallelization strategies used by vendor libraries like Intel MKL and NVIDIA cuSPARSE (geo-means of $0.73$--$3.4\times$) and \textsc{Taco} (geo-means of $1.0$--$2.4\times$), and significantly outperforms general-purpose strategies for sparse tensor expressions where specialized algorithms have not been developed (geo-means of $2.0$--$6.4\times$).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes the first partitioning algorithm that provably load-balances the parallel execution of arbitrary sparse tensor algebra expressions. The algorithm generalizes parallel merging techniques to any number of operands and to multi-dimensional hierarchical sparse structures. It is implemented inside the TACO compiler to emit parallel kernels for multi-core CPUs and GPUs. Reported experiments indicate the generated code is competitive with hand-tuned vendor libraries (geo-mean speedups 0.73–3.4× versus Intel MKL and NVIDIA cuSPARSE) and outperforms general-purpose parallelization strategies for expressions lacking specialized algorithms (geo-mean speedups 2.0–6.4× versus TACO).

Significance. If the load-balance guarantee and the absence of hidden overheads can be substantiated, the work would constitute a notable advance in automatic parallelization of irregular sparse computations. A general, provably balanced method applicable to any tensor-algebra expression rather than to a fixed set of kernels would reduce the need for hand-written parallel code in scientific computing and machine-learning workloads.

major comments (1)
  1. Abstract: the central claim of a 'provable load-balance guarantee' for any sparse tensor algebra expression is load-bearing. The manuscript must supply the formal argument (or machine-checked proof) showing that the generalization of parallel merging preserves the guarantee for arbitrary operand counts and hierarchical multi-dimensional structures; without it the claim cannot be evaluated.
minor comments (1)
  1. Abstract: the reported geo-mean ranges (0.73–3.4× and 1.0–2.4×) include values below 1×; the text should clarify whether these indicate slowdowns on particular expressions and, if so, under what conditions. A short numerical sketch of how a geometric mean below 1× reads is given after this list.
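
A minimal sketch of the arithmetic behind that reading (the speedup numbers are invented, not taken from the paper): a geometric mean of per-input speedups below 1× means the method is, on average in the multiplicative sense, slower than the baseline, even if it wins on some individual inputs.

```python
# Invented per-matrix speedups, purely for illustration.
import math

speedups = [0.2, 0.5, 3.0, 1.1]
geo_mean = math.prod(speedups) ** (1 / len(speedups))
print(round(geo_mean, 2))   # 0.76 < 1: a net slowdown despite two individual wins
```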

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful review and for acknowledging the potential significance of a general provably load-balanced partitioning method for sparse tensor algebra. We address the single major comment below.

read point-by-point responses
  1. Referee: [—] Abstract: the central claim of a 'provable load-balance guarantee' for any sparse tensor algebra expression is load-bearing. The manuscript must supply the formal argument (or machine-checked proof) showing that the generalization of parallel merging preserves the guarantee for arbitrary operand counts and hierarchical multi-dimensional structures; without it the claim cannot be evaluated.

    Authors: We appreciate the referee's emphasis on rigorously establishing the load-balance guarantee. Section 4 of the manuscript contains a formal proof by induction on operand count and hierarchy depth. The argument shows that the generalized multi-operand merge preserves the per-partition work invariant of classical parallel merging for arbitrary numbers of operands and for multi-dimensional hierarchical sparse structures. We will revise the abstract to explicitly reference Section 4 and add a concise proof sketch to the introduction to improve accessibility. revision: partial

Circularity Check

0 steps flagged

No significant circularity; new algorithmic generalization

full rationale

The paper introduces a novel partitioning algorithm that generalizes parallel merging to arbitrary operands and hierarchical sparse structures, claiming a provable load-balance guarantee as a direct consequence of this construction. No load-bearing steps reduce to self-citations, fitted parameters renamed as predictions, or ansatzes smuggled via prior work. The abstract and claims frame the contribution as an independent algorithmic advance with empirical validation against vendor libraries, without equations or theorems that equate outputs to inputs by definition. This aligns with the default expectation for non-circular papers presenting new constructions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The contribution rests on standard definitions of sparse tensor algebra and load balancing; no free parameters or new entities are introduced in the abstract.

axioms (1)
  • domain assumption Sparse tensor algebra expressions admit a representation in multi-dimensional hierarchical sparse data structures.
    Required for the partitioning algorithm to apply to the target computations. A minimal sketch of such a hierarchical representation is given below.
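
The sketch below is my own toy encoding, not the compiler's data structures: CSR keeps a row-pointer array over every row, while DCSR adds a level that lists only the nonempty rows, the kind of hierarchical skipping the paper's partitioner has to account for.

```python
# Toy 3 x 4 matrix (illustration only):
#   [[0, 3, 0, 0],
#    [0, 0, 0, 0],
#    [5, 0, 0, 2]]

# CSR: a row pointer per dense row, then column ids and values per stored entry.
csr = {"rowptr": [0, 1, 1, 3], "col": [1, 0, 3], "val": [3, 5, 2]}

# DCSR: an extra level stores only the nonempty rows, so coiteration can
# skip rows that are missing in either operand (hierarchical skipping).
dcsr = {"rows": [0, 2], "rowptr": [0, 1, 3], "col": [1, 0, 3], "val": [3, 5, 2]}

def entries(t):
    """Yield (row, col, val) triples from the DCSR-like structure."""
    for k, i in enumerate(t["rows"]):
        for p in range(t["rowptr"][k], t["rowptr"][k + 1]):
            yield i, t["col"][p], t["val"][p]

print(list(entries(dcsr)))   # [(0, 1, 3), (2, 0, 5), (2, 3, 2)]
```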

pith-pipeline@v0.9.0 · 5489 in / 1198 out tokens · 40937 ms · 2026-05-10T06:11:38.895948+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

76 extracted references · 52 canonical work pages · 1 internal anchor

  1. [1]

    Willow Ahrens, Teodoro Fields Collin, Radha Patel, Kyle Deeds, Changwan Hong, and Saman Amarasinghe. 2025. Finch: Sparse and Structured Tensor Programming with Control Flow.Proc. ACM Program. Lang.9, OOPSLA1, Article 117 (April 2025), 31 pages. doi:10.1145/3720473

  2. [2]

    Willow Ahrens, Daniel Donenfeld, Fredrik Kjolstad, and Saman Ama- rasinghe. 2023. Looplets: A Language for Structured Coiteration. In Proceedings of the 21st ACM/IEEE International Symposium on Code Generation and Optimization(Montréal, QC, Canada)(CGO ’23). As- sociation for Computing Machinery, New York, NY, USA, 41–54. doi:10.1145/3579990.3580020

  3. [3]

    Willow Ahrens, Fredrik Kjolstad, and Saman Amarasinghe. 2022. Au- toscheduling for sparse tensor algebra with an asymptotic cost model. InProceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation(San Diego, CA, USA)(PLDI 2022). Association for Computing Machinery, New York, NY, USA, 269–285. doi:10.1145...

  4. [4]

    Manya Bansal, Olivia Hsu, Kunle Olukotun, and Fredrik Kjolstad. 2023. Mosaic: An Interoperable Compiler for Tensor Algebra.Proc. ACM Program. Lang.7, PLDI, Article 122 (June 2023), 26 pages. doi:10.1145/ 3591236

  5. [5]

    Sean Baxter. 2016. moderngpu.https://github.com/moderngpu/ moderngpu. Version 2.0.0, released 2016-03-28

  6. [6]

    Nathan Bell, Steven Dalton, and Luke N. Olson. 2012. Exposing Fine- Grained Parallelism in Algebraic Multigrid Methods.SIAM J. Sci. Comput.34, 4 (Jan. 2012), C123–C152. doi:10.1137/110838844

  7. [7]

    Aart Bik, Penporn Koanantakool, Tatiana Shpeisman, Nicolas Vasi- lache, Bixia Zheng, and Fredrik Kjolstad. 2022. Compiler Support for Sparse Tensor Computations in MLIR.ACM Trans. Archit. Code Optim. 19, 4, Article 50 (Sept. 2022), 25 pages. doi:10.1145/3544559

  8. [8]

    Guy E. Blelloch. 1990.Prefix Sums and Their Applications. Technical Report CMU-CS-90-190. School of Computer Science, Carnegie Mellon University

  9. [9]

    Robert D. Blumofe and Charles E. Leiserson. 1999. Scheduling multi- threaded computations by work stealing.J. ACM46, 5 (Sept. 1999), 720–748. doi:10.1145/324133.324234

  10. [10]

    Aydin Buluc and John R. Gilbert. 2008. On the representation and multiplication of hypersparse matrices. In2008 IEEE International Symposium on Parallel and Distributed Processing. 1–11. doi:10.1109/ IPDPS.2008.4536313

  11. [11]

    CCCL Development Team. 2023.CCCL: CUDA C++ Core Libraries. https://github.com/NVIDIA/cccl

  12. [12]

    Kazem Cheshmi, Shoaib Kamil, Michelle Mills Strout, and Maryam Mehri Dehnavi. 2017. Sympiler: transforming sparse matrix codes by decoupling symbolic analysis. InProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis(Denver, Colorado)(SC ’17). Association for Computing Machinery, New York, NY, USA, Art...

  13. [13]

    Kazem Cheshmi, Shoaib Kamil, Michelle Mills Strout, and Maryam Mehri Dehnavi. 2018. ParSy: Inspection and Trans- formation of Sparse Matrix Computations for Parallelism. InSC18: International Conference for High Performance Computing, Networking, Storage and Analysis. 779–793. doi:10.1109/SC.2018.00065

  14. [14]

    Stephen Chou, Fredrik Kjolstad, and Saman Amarasinghe. 2018. For- mat abstraction for sparse tensor algebra compilers.Proc. ACM Pro- gram. Lang.2, OOPSLA, Article 123 (Oct. 2018), 30 pages. doi:10.1145/ 3276493

  15. [15]

    Stephen Chou, Fredrik Kjolstad, and Saman Amarasinghe. 2020. Auto- matic generation of efficient sparse tensor format conversion routines. InProceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation(London, UK)(PLDI 2020). As- sociation for Computing Machinery, New York, NY, USA, 823–838. doi:10.1145/3385412.3385963

  16. [16]

    Steven Dalton, Sean Baxter, Duane Merrill, Luke N. Olson, and Michael Garland. 2015. Optimizing Sparse Matrix Operations on GPUs Using Merge Path. In2015 IEEE International Parallel and Distributed Process- ing Symposium, IPDPS 2015, Hyderabad, India, May 25-29, 2015. IEEE Computer Society, 407–416. doi:10.1109/IPDPS.2015.98

  17. [17]

    Steven Dalton, Luke Olson, and Nathan Bell. 2015. Optimizing Sparse Matrix—Matrix Multiplication for the GPU.ACM Trans. Math. Softw. 41, 4, Article 25 (Oct. 2015), 20 pages. doi:10.1145/2699470

  18. [18]

    Timothy A. Davis and Yifan Hu. 2011. The university of Florida sparse matrix collection.ACM Trans. Math. Softw.38, 1, Article 1 (Dec. 2011), 25 pages. doi:10.1145/2049662.2049663

  19. [19]

    Kyle Deeds, Willow Ahrens, Magdalena Balazinska, and Dan Suciu

  20. [20]

    Galley: Modern Query Optimization for Sparse Tensor Programs. Proc. ACM Manag. Data 3, 3, Article 164 (June 2025), 24 pages. doi:10.1145/3725301

  21. [21]

    Jonathan Frankle and Michael Carbin. 2019. The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. arXiv:1803.03635 [cs.LG]https://arxiv.org/abs/1803.03635

  22. [22]

    Mahdi Ghorbani, Emilien Bauer, Tobias Grosser, and Amir Shaikhha

  23. [23]

    Compressed and Parallelized Structured Tensor Algebra.Proc. ACM Program. Lang.9, OOPSLA1, Article 141 (April 2025), 29 pages. doi:10.1145/3720506

  24. [24]

    Oded Green, Robert McColl, and David A. Bader. 2012. GPU merge path: a GPU merging algorithm. InProceedings of the 26th ACM Inter- national Conference on Supercomputing(San Servolo Island, Venice, Italy)(ICS ’12). Association for Computing Machinery, New York, NY, USA, 331–340. doi:10.1145/2304576.2304621

  25. [25]

    Peter J. Haas, Jeffrey F. Naughton, S. Seshadri, and Arun N. Swami

  26. [26]

    Selectivity and Cost Estimation for Joins Based on Random Sampling.J. Comput. Syst. Sci.52, 3 (June 1996), 550–569. doi:10.1006/ jcss.1996.0041

  27. [27]

    Shideh Hashemian, Michael F. P. O’Boyle, and Amir Shaikhha. 2026. Optimizing Sparse Tensor Compilation for Sparse Output. InProceed- ings of the 35th ACM SIGPLAN International Conference on Compiler Construction(Sydney, NSW, Australia)(CC ’26). Association for Com- puting Machinery, New York, NY, USA, 27–39. doi:10.1145/3771775. 3786267

  28. [28]

    Kartik Hegde, Hadi Asghari-Moghaddam, Michael Pellauer, Neal Crago, Aamer Jaleel, Edgar Solomonik, Joel Emer, and Christopher W. Fletcher. 2019. ExTensor: An Accelerator for Sparse Tensor Alge- bra. InProceedings of the 52nd Annual IEEE/ACM International Sym- posium on Microarchitecture(Columbus, OH, USA)(MICRO-52). As- sociation for Computing Machinery, ...

  29. [29]

    Rawn Henry, Olivia Hsu, Rohan Yadav, Stephen Chou, Kunle Olukotun, Saman Amarasinghe, and Fredrik Kjolstad. 2021. Compilation of sparse array programming models.Proc. ACM Program. Lang.5, OOPSLA, Article 128 (Oct. 2021), 29 pages. doi:10.1145/3485505

  30. [30]

    Olivia Hsu, Alexander Rucker, Tian Zhao, Varun Desai, Kunle Oluko- tun, and Fredrik Kjolstad. 2025. Stardust: Compiling Sparse Ten- sor Algebra to a Reconfigurable Dataflow Architecture. InProceed- ings of the 23rd ACM/IEEE International Symposium on Code Gen- eration and Optimization(Las Vegas, NV, USA)(CGO ’25). Asso- ciation for Computing Machinery, Ne...

  31. [31]

    Olivia Hsu, Maxwell Strange, Ritvik Sharma, Jaeyeon Won, Kunle Olukotun, Joel S. Emer, Mark A. Horowitz, and Fredrik Kjølstad. 2023. The Sparse Abstract Machine. InProceedings of the 28th ACM Interna- tional Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3(Vancouver, BC, Canada)(ASPLOS 2023). Association for Co...

  32. [32]

    Intel. 2026. Intel® oneAPI Threading Building Blocks (oneTBB). https://www.intel.com/content/www/us/en/developer/tools/oneapi/onetbb.html. Accessed: 2026-04-13

  33. [33]

    Intel Corporation. [n. d.]. Intel oneAPI Math Kernel Library. https://www.intel.com/content/www/us/en/developer/tools/oneapi/ onemkl.html. Accessed: 2026-04-05

  34. [34]

    Anirudh Jain, Pulkit Gupta, and Thomas M. Conte. 2025. RASSM: Residue-based Acceleration of Single Sparse Matrix Computation via Adaptive Tiling. InProceedings of the 30th ACM International Con- ference on Architectural Support for Programming Languages and Op- erating Systems, Volume 1(Rotterdam, Netherlands)(ASPLOS ’25). Association for Computing Machin...

  35. [35]

    Daniel Kats and Frederick R. Manby. 2013. Sparse tensor framework for implementation of general local correlation methods.The Journal of Chemical Physics138, 14 (04 2013), 144101. doi:10.1063/1.4798940

  36. [36]

    Fredrik Kjolstad, Willow Ahrens, Shoaib Kamil, and Saman Amaras- inghe. 2019. Tensor algebra compilation with workspaces. InProceed- ings of the 2019 IEEE/ACM International Symposium on Code Genera- tion and Optimization(Washington, DC, USA)(CGO 2019). IEEE Press, 180–192

  37. [37]

    Fredrik Kjolstad, Shoaib Kamil, Stephen Chou, David Lugato, and Saman Amarasinghe. 2017. The tensor algebra compiler.Proc. ACM Program. Lang.1, OOPSLA, Article 77 (Oct. 2017), 29 pages. doi:10. 1145/3133901

  38. [38]

    Fredrik Berg Kjølstad. 2020. Sparse Tensor Algebra Compilation. PhD thesis. Massachusetts Institute of Technology, Cambridge, MA. Available at https://fredrikbk.com/publications/kjolstad-thesis.pdf

  39. [39]

    Joseph C Kolecki. 2002.An Introduction to Tensors for Students of Physics and Engineering. Technical Report NASA/TM-2002-211716. NASA Glenn Research Center.https://ntrs.nasa.gov/citations/20020083040

  40. [40]

    Kalhan Koul, Olivia Hsu, Yuchen Mei, Sai Gautham Ravipati, Maxwell Strange, Jackson Melchert, Alex Carsello, Taeyoung Kong, Po-Han Chen, Huifeng Ke, Keyi Zhang, Qiaoyi Liu, Gedeon Nyengele, Zhouhua Xie, Akhilesh Balasingam, Jayashree Adivarahan, Ritvik Sharma, Christopher Torng, Joel S. Emer, Fredrik Kjolstad, Mark Horowitz, and Priyanka Raina. 2025. Onyx...

  41. [41]

    Peiming Liu, Alexander J Root, Anlun Xu, Yinying Li, Fredrik Kjolstad, and Aart J.C. Bik. 2024. Compiler Support for Sparse Tensor Convolu- tions.Proc. ACM Program. Lang.8, OOPSLA2, Article 281 (Oct. 2024), 29 pages. doi:10.1145/3689721

  42. [42]

    Weifeng Liu and Brian Vinter. 2014. An Efficient GPU General Sparse Matrix-Matrix Multiplication for Irregular Data. In2014 IEEE 28th International Parallel and Distributed Processing Symposium. 370–381. doi:10.1109/IPDPS.2014.47

  43. [43]

    Duane Merrill and Michael Garland. 2016. Single-pass parallel prefix scan with decoupled look-back.NVIDIA, Tech. Rep. NVR-2016-002 (2016)

  44. [44]

    Srđan Milaković, Oguz Selvitopi, Israt Nisa, Zoran Budimlić, and Aydin Buluç. 2022. Parallel algorithms for masked sparse matrix-matrix products. InProceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming(Seoul, Republic of Korea)(PPoPP ’22). Association for Computing Machinery, New York, NY, USA, 453–454. doi:10.11...

  45. [45]

    Yusuke Nagasaka, Akira Nukada, and Satoshi Matsuoka. 2017. High- Performance and Memory-Saving Sparse General Matrix-Matrix Multi- plication for NVIDIA Pascal GPU. In2017 46th International Conference on Parallel Processing (ICPP). 101–110. doi:10.1109/ICPP.2017.19

  46. [46]

    NVIDIA Corporation. [n. d.]. cuSPARSE Library.https://docs.nvidia. com/cuda/cusparse/. Accessed: 2026-04-05

  47. [47]

    Saher Odeh, Oded Green, Zahi Mwassi, Oz Shmueli, and Yitzhak Birk

  48. [48]

    Merge Path - Parallel Merging Made Simple. In2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum. 1611–1618. doi:10.1109/IPDPSW.2012.202

  49. [49]

    Toluwanimi O. Odemuyiwa, Hadi Asghari-Moghaddam, Michael Pel- lauer, Kartik Hegde, Po-An Tsai, Neal C. Crago, Aamer Jaleel, John D. Owens, Edgar Solomonik, Joel S. Emer, and Christopher W. Fletcher

  50. [50]

    Accelerating Sparse Data Orchestration via Dynamic Reflex- ive Tiling. InProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operat- ing Systems, Volume 3(Vancouver, BC, Canada)(ASPLOS 2023). As- sociation for Computing Machinery, New York, NY, USA, 18–32. doi:10.1145/3582016.3582064

  51. [51]

    Subhankar Pal, Jonathan Beaumont, Dong-Hyeon Park, Aporva Amarnath, Siying Feng, Chaitali Chakrabarti, Hun-Seok Kim, David Blaauw, Trevor Mudge, and Ronald Dreslinski. 2018. OuterSPACE: An Outer Product Based Sparse Matrix Multiplication Accelerator. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)....

  52. [52]

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chil- amkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala

  53. [53]

    PyTorch: An Imperative Style, High-Performance Deep Learning Library. arXiv:1912.01703 [cs.LG]. https://arxiv.org/abs/1912.01703

  54. [54]

    PyTorch Contributors. [n. d.]. PyTorch Sparse Documentation.https: //docs.pytorch.org/docs/stable/sparse.html. Accessed: 2026-04-05

  55. [55]

    Alexander J Root, Bobby Yan, Peiming Liu, Christophe Gyurgyik, Aart J.C. Bik, and Fredrik Kjolstad. 2024. Compilation of Shape Opera- tors on Sparse Arrays.Proc. ACM Program. Lang.8, OOPSLA2, Article 312 (Oct. 2024), 27 pages. doi:10.1145/3689752

  56. [56]

    Semih Salihoğlu, Jeremy Chen, Yuqing Huang, Mushi Wang, and Ken Salem. 2026. Cardinality Estimation Graphs.Commun. ACM69, 2 (Jan. 2026), 99–109. doi:10.1145/3780104

  57. [57]

    Ryan Senanayake, Changwan Hong, Ziheng Wang, Amalee Wilson, Stephen Chou, Shoaib Kamil, Saman Amarasinghe, and Fredrik Kjol- stad. 2020. A sparse iteration space transformation framework for sparse tensor algebra.Proc. ACM Program. Lang.4, OOPSLA, Article 158 (Nov. 2020), 30 pages. doi:10.1145/3428226

  58. [58]

    Amir Shaikhha, Mathieu Huot, Jaclyn Smith, and Dan Olteanu. 2022. Functional collection programming with semi-ring dictionaries.Proc. ACM Program. Lang.6, OOPSLA1, Article 89 (April 2022), 33 pages. doi:10.1145/3527333

  59. [59]

    Marco Siracusa, Olivia Hsu, Victor Soria-Pardos, Joshua Randall, Ar- naud Grasset, Eric Biscondi, Doug Joseph, Randy Allen, Fredrik Kjol- stad, Miquel Moretó Planas, and Adrià Armejach. 2026. Ember: A Compiler for Embedding Operations on Decoupled Access-Execute Architectures. In2026 IEEE/ACM International Symposium on Code Generation and Optimization (CG...

  60. [60]

    Shaden Smith, Jee W. Choi, Jiajia Li, Richard Vuduc, Jongsoo Park, Xing Liu, and George Karypis. 2017.FROSTT: The Formidable Repository of Open Sparse Tensors and Tools.http://frostt.io/

  61. [61]

    Nitish Srivastava, Hanchen Jin, Jie Liu, David Albonesi, and Zhiru Zhang. 2020. MatRaptor: A Sparse-Sparse Matrix Multiplication Accel- erator Based on Row-Wise Product. In2020 53rd Annual IEEE/ACM In- ternational Symposium on Microarchitecture (MICRO). 766–780. doi:10. 1109/MICRO50266.2020.00068

  62. [62]

    Nitish Srivastava, Hanchen Jin, Shaden Smith, Hongbo Rong, David Albonesi, and Zhiru Zhang. 2020. Tensaurus: A Versatile Accelerator for Mixed Sparse-Dense Tensor Computations. In2020 IEEE Interna- tional Symposium on High Performance Computer Architecture (HPCA). 689–702. doi:10.1109/HPCA47549.2020.00062

  63. [63]

    Shiv Sundram, Muhammad Usman Tariq, and Fredrik Kjolstad. 2024. Compiling Recurrences over Dense and Sparse Arrays.Proc. ACM Program. Lang.8, OOPSLA1, Article 103 (April 2024), 26 pages. doi:10. 1145/3649820

  64. [64]

    W.F. Tinney and J.W. Walker. 1967. Direct solutions of sparse network equations by optimally ordered triangular factorization.Proc. IEEE55, 11 (1967), 1801–1809. doi:10.1109/PROC.1967.6011

  65. [65]

    David Vengerov, Andre Cavalheiro Menck, Mohamed Zait, and Sunil P. Chakkappen. 2015. Join Size Estimation Subject to Filter Conditions. Proc. VLDB Endow.8, 12 (Aug. 2015), 1530–1541. doi:10.14778/2824032. 2824051

  66. [66]

    Lucas Wilkinson, Kazem Cheshmi, and Maryam Mehri Dehnavi. 2023. Register Tiling for Unstructured Sparsity in Neural Network Inference. Proc. ACM Program. Lang.7, PLDI, Article 188 (June 2023), 26 pages. doi:10.1145/3591302

  67. [67]

    Jaeyeon Won, Willow Ahrens, Saman Amarasinghe, and Joel S. Emer

  68. [68]

    Insum: Sparse GPU Kernels Simplified and Optimized with Indi- rect Einsums. InProceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2(USA)(ASPLOS ’26). Association for Computing Ma- chinery, New York, NY, USA, 993–1006. doi:10.1145/3779212.3790176

  69. [69]

    Guoqing Xiao, Chuanghui Yin, Tao Zhou, Xueqi Li, Yuedan Chen, and Kenli Li. 2023. A Survey of Accelerating Parallel Sparse Linear Algebra.ACM Comput. Surv.56, 1, Article 21 (Aug. 2023), 38 pages. doi:10.1145/3604606

  70. [70]

    Rohan Yadav, Alex Aiken, and Fredrik Kjolstad. 2022. SpDISTAL: compiling distributed sparse tensor computations. InProceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis(Dallas, Texas)(SC ’22). IEEE Press, Article 59, 15 pages

  71. [71]

    Bobby Yan, Alexander J Root, Trevor Gale, David Broman, and Fredrik Kjolstad. 2026. Fast Autoscheduling for Sparse ML Frameworks. In 2026 IEEE/ACM International Symposium on Code Generation and Op- timization (CGO). 28–43. doi:10.1109/CGO68049.2026.11394842

  72. [72]

    Carl Yang, Aydın Buluç, and John D. Owens. 2022. GraphBLAST: A High-Performance Linear Algebra-based Graph Framework on the GPU.ACM Trans. Math. Softw.48, 1, Article 1 (Feb. 2022), 51 pages. doi:10.1145/3466795

  73. [73]

    Carl Yang, Yangzihao Wang, and John D. Owens. 2015. Fast Sparse Matrix and Sparse Vector Multiplication Algorithm on the GPU. In 2015 IEEE International Parallel and Distributed Processing Symposium Workshop. 841–847. doi:10.1109/IPDPSW.2015.77

  74. [74]

    Zihao Ye, Ruihang Lai, Junru Shao, Tianqi Chen, and Luis Ceze. 2023. SparseTIR: Composable Abstractions for Sparse Compilation in Deep Learning. InProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3(Vancouver, BC, Canada)(ASPLOS 2023). Asso- ciation for Computing Machinery...

  75. [75]

    Guowei Zhang, Nithya Attaluri, Joel S. Emer, and Daniel Sanchez. 2021. Gamma: leveraging Gustavson’s algorithm to accelerate sparse matrix multiplication. InProceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems(Virtual, USA)(ASPLOS ’21). Association for Computing Ma- chinery, New York...

  76. [76]

    Genghan Zhang, Olivia Hsu, and Fredrik Kjolstad. 2024. Compilation of Modular and General Sparse Workspaces. Proc. ACM Program. Lang. 8, PLDI, Article 196 (June 2024), 26 pages. doi:10.1145/3656426