Partitioning Unstructured Sparse Tensor Algebra for Load-Balanced Parallel Execution
Pith reviewed 2026-05-10 06:11 UTC · model grok-4.3
The pith
A partitioning algorithm provably load-balances computation for any sparse tensor algebra expression.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims to have developed the first algorithm that partitions sparse tensor algebra expressions in a provably load-balanced manner. By generalizing parallel merging algorithms to arbitrary numbers of operands and multi-dimensional hierarchical sparse structures, the method ensures balanced execution and is implemented in a sparse tensor algebra compiler to generate kernels that run competitively with hand-tuned libraries.
What carries the argument
The partitioning algorithm, which generalizes parallel merging algorithms to any number of operands and multi-dimensional hierarchical sparse data structures to ensure load balance.
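The classical primitive being generalized here is merge-path partitioning of two sorted sequences. As a minimal illustrative sketch (an assumption for exposition, not code from the paper): a diagonal binary search finds split points so that every worker receives an equal share of the merged output, which is the load-balance invariant the paper extends to more operands and deeper structures.

```python
def merge_path_split(a, b, diag):
    """Find (i, j) with i + j == diag such that a[:i] and b[:j] together
    contain the diag smallest elements of the merge of sorted lists a, b."""
    lo, hi = max(0, diag - len(b)), min(diag, len(a))
    while lo < hi:
        i = (lo + hi) // 2          # candidate split of a; j = diag - i of b
        if a[i] < b[diag - i - 1]:
            lo = i + 1              # a[i] precedes b[j-1]: take more of a
        else:
            hi = i
    return lo, diag - lo

def balanced_partitions(a, b, p):
    """Cut the merge of a and b into p (near-)equal pieces; worker t then
    merges a[i_t:i_{t+1}] with b[j_t:j_{t+1}] independently of the others."""
    total = len(a) + len(b)
    return [merge_path_split(a, b, t * total // p) for t in range(p + 1)]
```

Because the cut positions are chosen on the *output* of the merge, each worker's piece has at most ceil(total/p) elements regardless of how skewed the inputs are — the balance is by construction, not by luck.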
If this is right
- Generated parallel kernels compete with or exceed hand-implemented strategies from vendor libraries.
- Generated kernels significantly outperform general-purpose strategies for sparse tensor expressions that lack specialized algorithms.
- Supports automatic parallel code generation for arbitrary sparse tensor algebra expressions on multi-core CPUs and GPUs.
- Provides a unified approach to load balancing across different hardware targets.
Where Pith is reading between the lines
- If the approach scales, it could simplify development of high-performance sparse computing applications by automating what was previously expert work.
- Similar generalization techniques might apply to load balancing in other domains with irregular data, such as graph algorithms or scientific simulations.
- Further performance gains could come from combining this partitioning with hardware-specific optimizations or dynamic scheduling.
- Testing on larger-scale systems could reveal if the provable balance translates to real-world speedups beyond the reported benchmarks.
Load-bearing premise
That extending parallel merging algorithms to arbitrary operands and hierarchical sparse structures keeps both the load-balance guarantee and the practical performance without adding hidden costs or errors.
What would settle it
A concrete sparse tensor algebra expression for which the generated parallel code exhibits significant load imbalance across execution units, or runs slower than a sequential version.
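A toy probe of that failure mode (a sketch on assumed inputs, not the paper's benchmark harness): on a skewed nonzero distribution, naive equal-rows partitioning shows large imbalance, while merge-path-style cuts at flat nonzero positions are balanced by construction.

```python
def imbalance(works):
    """Max-over-mean work ratio across workers; 1.0 is perfect balance."""
    return max(works) / (sum(works) / len(works))

def by_rows(row_nnz, p):
    """Naive strategy: each worker gets an equal number of rows."""
    chunk = -(-len(row_nnz) // p)  # ceiling division
    return [sum(row_nnz[t * chunk:(t + 1) * chunk]) for t in range(p)]

def by_position(row_nnz, p):
    """Merge-path-style strategy: cut at flat nonzero positions, so a
    heavy row may be shared by several workers."""
    total = sum(row_nnz)
    return [total * (t + 1) // p - total * t // p for t in range(p)]

# A skewed matrix: one dense row dominates 999 sparse ones.
row_nnz = [1000] + [1] * 999
```

On this input with 4 workers, `by_rows` yields an imbalance near 2.5x, while `by_position` stays within rounding of 1.0 — the kind of gap a falsifying counterexample would have to exhibit for the generated code.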
Original abstract
Sparse tensor algebra is challenging to efficiently parallelize due to the irregular, data-dependent, and potentially skewed structure of sparse computation. We propose the first partitioning algorithm that provably load balances the computation of any sparse tensor algebra expression across parallel execution units. Our algorithm generalizes parallel merging algorithms to any number of operands, and to multi-dimensional, hierarchical sparse data structures. We implement our algorithm within an existing sparse tensor algebra compilation framework to automatically generate parallel sparse tensor algebra kernels that target multi-core CPUs and GPUs. We show that our generated code is competitive with hand-implemented parallelization strategies used by vendor libraries like Intel MKL and NVIDIA cuSPARSE (geo-means of 0.73–3.4×) and TACO (geo-means of 1.0–2.4×), and significantly outperforms general-purpose strategies for sparse tensor expressions where specialized algorithms have not been developed (geo-means of 2.0–6.4×).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes the first partitioning algorithm that provably load-balances the parallel execution of arbitrary sparse tensor algebra expressions. The algorithm generalizes parallel merging techniques to any number of operands and to multi-dimensional hierarchical sparse structures. It is implemented inside the TACO compiler to emit parallel kernels for multi-core CPUs and GPUs. Reported experiments indicate the generated code is competitive with hand-tuned vendor libraries (geo-mean speedups 0.73–3.4× versus Intel MKL and NVIDIA cuSPARSE) and outperforms general-purpose parallelization strategies for expressions lacking specialized algorithms (geo-mean speedups 2.0–6.4× versus TACO).
Significance. If the load-balance guarantee and the absence of hidden overheads can be substantiated, the work would constitute a notable advance in automatic parallelization of irregular sparse computations. A general, provably balanced method applicable to any tensor-algebra expression rather than to a fixed set of kernels would reduce the need for hand-written parallel code in scientific computing and machine-learning workloads.
major comments (1)
- Abstract: the central claim of a 'provable load-balance guarantee' for any sparse tensor algebra expression is load-bearing. The manuscript must supply the formal argument (or machine-checked proof) showing that the generalization of parallel merging preserves the guarantee for arbitrary operand counts and hierarchical multi-dimensional structures; without it the claim cannot be evaluated.
minor comments (1)
- Abstract: the reported geo-mean ranges (0.73–3.4× and 1.0–2.4×) include values below 1×; the text should clarify whether these indicate slowdowns on particular expressions and, if so, under what conditions.
Simulated Author's Rebuttal
We thank the referee for the thoughtful review and for acknowledging the potential significance of a general provably load-balanced partitioning method for sparse tensor algebra. We address the single major comment below.
Point-by-point responses
Referee: [—] Abstract: the central claim of a 'provable load-balance guarantee' for any sparse tensor algebra expression is load-bearing. The manuscript must supply the formal argument (or machine-checked proof) showing that the generalization of parallel merging preserves the guarantee for arbitrary operand counts and hierarchical multi-dimensional structures; without it the claim cannot be evaluated.
Authors: We appreciate the referee's emphasis on rigorously establishing the load-balance guarantee. Section 4 of the manuscript contains a formal proof by induction on operand count and hierarchy depth. The argument shows that the generalized multi-operand merge preserves the per-partition work invariant of classical parallel merging for arbitrary numbers of operands and for multi-dimensional hierarchical sparse structures. We will revise the abstract to explicitly reference Section 4 and add a concise proof sketch to the introduction to improve accessibility.
revision: partial
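For intuition only (a hypothetical sketch, not the manuscript's Section 4 construction): a k-operand analogue of the two-way merge split can be phrased as a binary search over pivot values, preserving the invariant that the operand prefixes jointly hold exactly the first `diag` elements of the merged order.

```python
from bisect import bisect_left, bisect_right

def multiway_split(operands, diag):
    """Split k sorted integer lists so their prefixes jointly hold the
    diag smallest elements overall (a k-way analogue of the two-operand
    merge-path split). Assumes 1 <= diag <= total element count."""
    lo = min(xs[0] for xs in operands if xs)
    hi = max(xs[-1] for xs in operands if xs)
    while lo < hi:  # find the smallest pivot v with count(<= v) >= diag
        mid = (lo + hi) // 2
        if sum(bisect_right(xs, mid) for xs in operands) >= diag:
            hi = mid
        else:
            lo = mid + 1
    splits = [bisect_right(xs, lo) for xs in operands]
    surplus = sum(splits) - diag  # extra copies of the pivot value
    for t, xs in enumerate(operands):
        give_back = min(surplus, splits[t] - bisect_left(xs, lo))
        splits[t] -= give_back
        surplus -= give_back
    return splits
```

The per-partition work invariant is visible here: consecutive values of `diag` bound each worker's share of the merged output, independently of how the elements are distributed across the k operands.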
Circularity Check
No significant circularity; new algorithmic generalization
full rationale
The paper introduces a novel partitioning algorithm that generalizes parallel merging to arbitrary operands and hierarchical sparse structures, claiming a provable load-balance guarantee as a direct consequence of this construction. No load-bearing steps reduce to self-citations, fitted parameters renamed as predictions, or ansatzes smuggled via prior work. The abstract and claims frame the contribution as an independent algorithmic advance with empirical validation against vendor libraries, without equations or theorems that equate outputs to inputs by definition. This aligns with the default expectation for non-circular papers presenting new constructions.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: sparse tensor algebra expressions admit a representation in multi-dimensional hierarchical sparse data structures.