pith. sign in

arxiv: 2606.26701 · v1 · pith:TFJKDEHAnew · submitted 2026-06-25 · 💻 cs.AR

SegFold: Accelerating Sparse GEMM with a Fine-Grained Dynamic Dataflow

Pith reviewed 2026-06-26 02:07 UTC · model grok-4.3

classification 💻 cs.AR
keywords SpGEMMdynamic dataflowsparse matrix multiplicationacceleratorload balancedata reuseSegment dataflowdynamic scheduling
0
0 comments X

The pith

SegFold shows that fine-grained dynamic scheduling and remapping in SpGEMM dataflows produce reuse and load-balance gains no static dataflow can match.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

SpGEMM is a core operation in many domains, yet accelerators rely on static dataflows such as inner-product or Gustavson that each leave reuse opportunities unused or create load imbalance. The paper extends these dataflows by adding fine-grained dynamic scheduling that identifies reuse inside a local window of the stationary matrix and dynamic remapping that moves partially completed work to idle processing elements. These extensions are packaged as the Segment dataflow and realized in the SegFold accelerator through a memory controller and a merge network. Measurements across varied matrix densities and sizes show a 1.95 times geometric-mean speedup over prior accelerators and 5.3 times over the best static configuration. A reader would care because the result indicates that the performance ceiling for sparse matrix multiplication has been set too low by the assumption of static scheduling alone.

Core claim

The paper claims that extending typical SpGEMM dataflows with fine-grained dynamic scheduling to optimize reuse and reduce contention, together with dynamic remapping of partially completed work to improve load balance and parallelism, formalized as the Segment dataflow and implemented in SegFold via a memory controller that detects local reuse and a merge network that routes and remaps work, produces a geometric-mean 1.95 times speedup over state-of-the-art SpGEMM accelerators and 5.3 times over the best static dataflow.

What carries the argument

The Segment dataflow, which adds fine-grained dynamic scheduling for reuse optimization and dynamic remapping for load balance, executed by SegFold's memory controller that scans a local window of the stationary input and its merge network that routes and reassigns work across processing elements.

If this is right

  • The same matrix set yields substantially higher throughput when dynamism is present than when any fixed static dataflow is used in isolation.
  • Load imbalance and missed reuse that appear under static assignment disappear once work can be reassigned at fine granularity.
  • Hardware that only supports one static schedule cannot reach the performance level demonstrated by the combined dynamic mechanisms.
  • The gains hold across a wide range of matrix densities and sizes rather than being limited to particular sparsity patterns.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Designers of future sparse accelerators may need to allocate silicon budget to flexible controllers rather than to larger fixed-function arrays.
  • The same dynamic-remapping idea could be tested on other sparse kernels such as sparse triangular solve or sparse tensor contraction.
  • If matrix sizes continue to grow, the relative benefit of dynamic load balancing is likely to increase because static partitioning becomes harder to tune in advance.

Load-bearing premise

The memory controller and merge network can locate fine-grained reuse opportunities and execute dynamic remapping with overhead small enough to preserve the reported speedups.

What would settle it

Cycle-accurate hardware measurements showing that the added area, power, or latency of the dynamic memory controller and merge network exceeds the 1.95 times performance gain would falsify the claim that the dynamism is practically beneficial.

Figures

Figures reproduced from arXiv: 2606.26701 by Hanyu Wang, Jason Cong, Tony Nowatzki, Xinrui Wu.

Figure 1
Figure 1. Figure 1: Comparison of classic dataflows. Every two consecutive blocks in a column (top vs bottom) are two sequential snapshots. [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: SegFold Dataflow Example. A batch of (m, k) pairs is chosen by SELECTA, and SEGMENTBC(m, k, n) is invoked over B[k, :] nonzeros for ordered processing in the virtual coordinate space. Reuse benefits over two iterations are shown in pink. Tick marks indicate the A elements selected for each spatial iteration under different dataflows. same output row and may contend when the corresponding B rows contribute … view at source ↗
Figure 3
Figure 3. Figure 3: Example active window of SELECTA, shown in red for two time steps. For example, in [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Example of on-the-fly update of the mapping from the [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: SegFold µarch overview. PEs communicate over a local network, and share a row-level memory. PE0 PE1 PE2 PE3 1 3 4 2 (a) Merger at PE0, b > c (case 1) b will be sent to PE1 PE0 PE1 PE2 PE3 1 3 4 (b) Merger at PE1, b < c (case 2) 3, 4 will be pushed to the right PE0 PE1 PE2 PE3 1 3 4 2 (c) Merger at PE1, b = c (case 3) b matches and consumes at PE1 PE0 PE1 PE2 PE3 1 3 4 1 2 (d) Merger at PE1, b > c (case 2) … view at source ↗
Figure 7
Figure 7. Figure 7: Index to PE mapper (IPM) using binary search. Each level is in its [PITH_FULL_IMAGE:figures/full_fig_p007_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Speedup over Spada and static dataflow implementations on SuiteSparse matrices. Sparsity patterns shown below. [PITH_FULL_IMAGE:figures/full_fig_p011_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Nonsquare SuiteSparse matrices. (a) SegFold vs. Spada speedup, normalized to Spada. (b) Effect of multiplication direction on SegFold throughput, normalized to nonsquare matrix as A. bcpwr stk03 stk18 Cond GrQc del13 flow fv1 d2q wood olm pcb pois rdb ros tols geo 0.5 1.0 1.5 2.0 Speedup (vs Zero-Offset) 1.14x 1.42x 1.13x 1.01x 1.14x 1.11x 1.33x 1.04x 1.28x 1.06x 1.32x 1.33x 1.22x 1.23x 1.53x 1.05x 1.20x Z… view at source ↗
Figure 12
Figure 12. Figure 12: Hardware-parameter sensitivity studies across three matrix sizes ( [PITH_FULL_IMAGE:figures/full_fig_p011_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Per-simulation efficiency (cycles per MAC, lower is better) on [PITH_FULL_IMAGE:figures/full_fig_p012_13.png] view at source ↗
read the original abstract

Generalized sparse matrix-matrix multiplication (SpGEMM) is critical in many domains. Existing CPUs and GPUs, as well as specialized accelerators, rely on static dataflows (e.g., inner product, outer product, Gustavson, etc.). Each static dataflow sacrifices some data reuse opportunities and imposes constraints on load balance. To address this inefficiency, we extend the typical SpGEMM dataflows by considering dynamism. Specifically, we add fine-grained dynamic scheduling to optimize reuse and reduce resource contention. We also develop dynamic remapping of partially completed work to improve load balance and parallelism. These ideas are formalized into a specific dataflow called Segment. To demonstrate Segment, we codesign a SpGEMM accelerator called SegFold. SegFold includes a memory controller that identifies fine-grained reuse opportunities in a local window of the stationary input array and exploits them through dynamic work assignment. It also incorporates a merge network that routes inputs to appropriate processing elements (PEs) for computation while dynamically remapping the work assigned to each PE to balance load. Across diverse densities and matrix sizes, SegFold achieves a geometric-mean $1.95\times$ speedup over state-of-the-art SpGEMM accelerators and $5.3\times$ over the best static dataflow configuration, demonstrating that adding dynamism to the dataflow design space unlocks reuse and load-balance gains that no static scheduling choice can achieve in isolation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes SegFold, a SpGEMM accelerator that implements a dynamic dataflow called Segment. Segment extends static dataflows (inner/outer product, Gustavson) with fine-grained dynamic scheduling to exploit reuse opportunities in a local window of the stationary input and with dynamic remapping of partial results via a merge network to improve load balance. The accelerator includes a specialized memory controller for reuse detection and work assignment. Across varied matrix densities and sizes, SegFold is reported to deliver 1.95× geometric-mean speedup versus prior SpGEMM accelerators and 5.3× versus the best static dataflow configuration.

Significance. If the net speedup holds after accounting for control overhead, the result would be significant for the sparse-accelerator community: it supplies concrete empirical evidence that dynamism can extract reuse and parallelism gains unavailable to any fixed static schedule. The codesign of dataflow and hardware, together with the reported cross-density speedups, would strengthen the case for exploring dynamic scheduling in future SpGEMM designs.

major comments (1)
  1. [Architecture description and evaluation sections] The central performance claim (1.95× geo-mean and 5.3× over best static) is load-bearing on the assumption that the memory controller's reuse detection and the merge network's dynamic remapping incur sufficiently low area, power, and cycle overhead. No quantitative characterization of these overheads (gate count, energy per operation, or added latency) appears in the architecture description or evaluation, leaving open whether the reported speedups are net positive.
minor comments (1)
  1. [Abstract] The abstract states results 'across diverse densities and matrix sizes' yet supplies neither the exact benchmark matrices nor the density/size ranges used; adding this information would allow readers to assess coverage.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for highlighting the importance of quantifying control overheads in the dynamic components of SegFold. This is a substantive point that directly affects the strength of our performance claims. We will revise the manuscript to include the requested characterization.

read point-by-point responses
  1. Referee: [Architecture description and evaluation sections] The central performance claim (1.95× geo-mean and 5.3× over best static) is load-bearing on the assumption that the memory controller's reuse detection and the merge network's dynamic remapping incur sufficiently low area, power, and cycle overhead. No quantitative characterization of these overheads (gate count, energy per operation, or added latency) appears in the architecture description or evaluation, leaving open whether the reported speedups are net positive.

    Authors: We agree that the absence of quantitative overhead data leaves the net speedup claim open to question. The current manuscript focuses on the dataflow and high-level speedup results but does not report synthesized gate counts, energy per operation, or cycle latency for the reuse-detection logic in the memory controller or the merge network. In the revised version we will add a dedicated subsection (likely in Section 4) that presents these metrics from our RTL implementation and synthesis runs at the target process node, allowing direct assessment of whether the 1.95× and 5.3× speedups remain positive after overheads. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical hardware proposal with external benchmarks

full rationale

The paper proposes a dynamic dataflow (Segment) and accelerator (SegFold) for SpGEMM, with claims resting on geometric-mean speedups from evaluation across matrix densities/sizes. No equations, fitted parameters, or self-referential derivations appear in the abstract or described structure. The central result (1.95× over SOTA, 5.3× over best static) is presented as an empirical outcome of simulation, not reduced by construction to inputs or self-citations. This is a standard self-contained architecture paper whose validity depends on external benchmarks and implementation details, not internal definitional loops.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Review performed on abstract only; no equations or detailed assumptions visible. The design implicitly assumes standard hardware constraints and sparse matrix properties typical of the domain.

axioms (1)
  • domain assumption Standard assumptions about sparse matrix access patterns and hardware resource constraints hold for the evaluated workloads.
    Typical background for architecture papers; invoked implicitly when claiming speedups across densities.
invented entities (1)
  • Segment dataflow no independent evidence
    purpose: To combine dynamic scheduling and remapping for reuse and load balance beyond static dataflows.
    Newly introduced in the paper as the core contribution.

pith-pipeline@v0.9.1-grok · 5787 in / 1279 out tokens · 22433 ms · 2026-06-26T02:07:50.125338+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

59 extracted references · 24 canonical work pages

  1. [1]

    On the representation and multiplication of hypersparse matrices,

    A. Buluc ¸ and J. R. Gilbert, “On the representation and multiplication of hypersparse matrices,” in2008 IEEE International Symposium on Parallel and Distributed Processing (IPDPS), 2008, pp. 1–11

  2. [2]

    ASAP7: A 7-nm FinFET predictive process design kit,

    L. T. Clark, V . Vashishtha, L. Shifren, A. Gujja, S. Sinha, B. Cline, C. Ramamurthy, and G. Yeric, “ASAP7: A 7-nm FinFET predictive process design kit,”Microelectronics Journal, vol. 53, pp. 105–115, 2016

  3. [3]

    NVIDIA GPU programming guide, v2.2.1, November, 2004

    N. Corp, “NVIDIA GPU programming guide, v2.2.1, November, 2004.”

  4. [4]

    PolyGraph: Exposing the value of flexibility for graph processing accelerators,

    V . Dadu, S. Liu, and T. Nowatzki, “PolyGraph: Exposing the value of flexibility for graph processing accelerators,” in2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), 2021, pp. 595–608

  5. [5]

    Towards general purpose acceleration by exploiting common data-dependence forms,

    V . Dadu, J. Weng, S. Liu, and T. Nowatzki, “Towards general purpose acceleration by exploiting common data-dependence forms,” ser. MICRO ’52. New York, NY , USA: ACM, 2019, pp. 924–939. [Online]. Available: http://doi.acm.org/10.1145/3352460.3358276

  6. [6]

    ZeD: A generalized accelerator for variably sparse matrix computations in ML,

    P. Dangi, Z. Bai, R. Juneja, D. Wijerathne, and T. Mitra, “ZeD: A generalized accelerator for variably sparse matrix computations in ML,” inProceedings of the 2024 International Conference on Parallel Architectures and Compilation Techniques, ser. PACT ’24. New York, NY , USA: Association for Computing Machinery, 2024, p. 246–257. [Online]. Available: htt...

  7. [7]

    Hardware acceleration of sparse and irregular tensor computa- tions of ML models: A survey and insights,

    S. Dave, R. Baghdadi, T. Nowatzki, S. Avancha, A. Shrivastava, and B. Li, “Hardware acceleration of sparse and irregular tensor computa- tions of ML models: A survey and insights,”Proceedings of the IEEE, vol. 109, no. 10, pp. 1706–1752, 2021

  8. [8]

    The university of florida sparse matrix collection,

    T. A. Davis and Y . Hu, “The university of florida sparse matrix collection,”ACM Transactions on Mathematical Software, vol. 38, no. 1, pp. 1:1–1:25, 2011

  9. [9]

    Recent advances in artificial intelligence via machine learning and the implications for computer system design,

    J. Dean, “Recent advances in artificial intelligence via machine learning and the implications for computer system design,” in2017 IEEE Hot Chips 29 Symposium, 2017

  10. [10]

    A systematic survey of general sparse Matrix-Matrix multiplication,

    J. Gao, W. Ji, F. Chang, S. Han, B. Wei, Z. Liu, and Y . Wang, “A systematic survey of general sparse Matrix-Matrix multiplication,” ACM Comput. Surv., vol. 55, no. 12, Mar. 2023. [Online]. Available: https://doi.org/10.1145/3571157

  11. [11]

    Starnuma: Mitigating numa challenges with memory pooling,

    H. N. Genc, H. Kim, P. Ganesh, and Y . S. Shao, “Stellar: An automated design framework for dense and sparse spatial accelerators,” inProceedings of the 2024 57th IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’24. IEEE Press, 2024, p. 409–422. [Online]. Available: https://doi.org/10.1109/MICRO61859.2024.00038

  12. [12]

    AWB-GCN: A graph convolutional network accelerator with runtime workload rebalancing,

    T. Geng, A. Li, R. Shi, C. Wu, T. Wang, Y . Li, P. Haghi, A. Tumeo, S. Che, S. Reinhardt, and M. C. Herbordt, “AWB-GCN: A graph convolutional network accelerator with runtime workload rebalancing,” in2020 53rd Annual IEEE/ACM International Symposium on Microar- chitecture (MICRO), 2020, pp. 922–936

  13. [13]

    RipTide: A programmable, energy-minimal dataflow compiler and architecture,

    G. Gobieski, S. Ghosh, M. Heule, T. Mowry, T. Nowatzki, N. Beckmann, and B. Lucia, “RipTide: A programmable, energy-minimal dataflow compiler and architecture,” in2022 55th IEEE/ACM International Sym- posium on Microarchitecture (MICRO), 2022, pp. 546–564

  14. [14]

    SparTen: A sparse tensor accelerator for convolutional neural networks,

    A. Gondimalla, N. Chesnut, M. Thottethodi, and T. N. Vijaykumar, “SparTen: A sparse tensor accelerator for convolutional neural networks,” inProceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’52. New York, NY , USA: Association for Computing Machinery, 2019, p. 151–165. [Online]. Available: https://doi.org/10...

  15. [15]

    Eie: efficient inference engine on compressed deep neural network,

    S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, “Eie: efficient inference engine on compressed deep neural network,” inProceedings of the 43rd International Symposium on Computer Architecture, ser. ISCA ’16. IEEE Press, 2016, p. 243–254. [Online]. Available: https://doi.org/10.1109/ISCA.2016.30

  16. [16]

    Sparse-TPU: Adapting systolic arrays for sparse matrices,

    X. He, S. Pal, A. Amarnath, S. Feng, D.-H. Park, A. Rovinski, H. Ye, Y . Chen, R. Dreslinski, and T. Mudge, “Sparse-TPU: Adapting systolic arrays for sparse matrices,” inProceedings of the 34th ACM interna- tional conference on supercomputing, 2020, pp. 1–12

  17. [17]

    Towards general purpose acceleration by exploiting common data-dependence forms,

    K. Hegde, H. Asghari-Moghaddam, M. Pellauer, N. Crago, A. Jaleel, E. Solomonik, J. Emer, and C. W. Fletcher, “ExTensor: An accelerator for sparse tensor algebra,” inProceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-52. New York, NY , USA: Association for Computing Machinery, 2019, p. 319–333. [Online]. Availa...

  18. [18]

    Stardust: Compiling sparse tensor algebra to a reconfigurable dataflow architecture,

    O. Hsu, A. Rucker, T. Zhao, V . Desai, K. Olukotun, and F. Kjolstad, “Stardust: Compiling sparse tensor algebra to a reconfigurable dataflow architecture,” inProceedings of the 23rd ACM/IEEE International Symposium on Code Generation and Optimization, ser. CGO ’25. New York, NY , USA: Association for Computing Machinery, 2025, p. 628–643. [Online]. Availa...

  19. [19]

    SpaHet: A software/hardware co-design for accelerating heterogeneous-sparsity based sparse matrix multiplication,

    H. Huang, P. Yao, Z. An, Y . Sun, A. Hu, P. Xu, L. Zheng, X. Liao, and H. Jin, “SpaHet: A software/hardware co-design for accelerating heterogeneous-sparsity based sparse matrix multiplication,” inProceedings of the 61st ACM/IEEE Design Automation Conference, ser. DAC ’24. New York, NY , USA: Association for Computing Machinery, 2024. [Online]. Available:...

  20. [20]

    Cerberus: Triple mode acceleration of sparse matrix and vector multiplication,

    S. Hwang, D. Baek, J. Park, and J. Huh, “Cerberus: Triple mode acceleration of sparse matrix and vector multiplication,”ACM Trans. Archit. Code Optim., vol. 21, no. 2, May 2024. [Online]. Available: https://doi.org/10.1145/3653020

  21. [21]

    Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, and Rick et al

    N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P.-l. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V . Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. R. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan, ...

  22. [22]

    Zena: Zero-aware neural network accel- erator,

    D. Kim, J. Ahn, and S. Yoo, “Zena: Zero-aware neural network accel- erator,”IEEE Design & Test, vol. 35, no. 1, pp. 39–46, 2017

  23. [23]

    Onyx: A programmable accelerator for sparse tensor algebra,

    K. Koul, M. Strange, J. Melchert, A. Carsello, Y . Mei, O. Hsu, T. Kong, P.-H. Chen, H. Ke, K. Zhanget al., “Onyx: A programmable accelerator for sparse tensor algebra,” in2024 IEEE Hot Chips 36 Symposium (HCS). IEEE Computer Society, 2024, pp. 1–91

  24. [24]

    Spada: Accelerating sparse matrix multiplication with adaptive dataflow,

    Z. Li, J. Li, T. Chen, D. Niu, H. Zheng, Y . Xie, and M. Gao, “Spada: Accelerating sparse matrix multiplication with adaptive dataflow,” inProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, ser. ASPLOS 2023. New York, NY , USA: Association for Computing Machinery, 2023, ...

  25. [25]

    Ramulator 2.0: A modern, modular, and extensible DRAM simulator,

    H. Luo, Y . C. Tu ˘grul, F. N. Bostancı, A. Olgun, A. G. Ya ˘glıkc ¸ı, and O. Mutlu, “Ramulator 2.0: A modern, modular, and extensible DRAM simulator,” 2023. [Online]. Available: https://arxiv.org/abs/2308.11030

  26. [26]

    Sparm: A sparse matrix multiplication accelerator supporting multiple dataflows,

    S. Luo, B. Wang, Y . Shi, X. Zhang, Q. Xue, and S. Ma, “Sparm: A sparse matrix multiplication accelerator supporting multiple dataflows,” in2024 IEEE 35th International Conference on Application-specific Systems, Architectures and Processors (ASAP), 2024, pp. 122–130

  27. [27]

    Mosaic: Processing a trillion-edge graph on a single machine,

    S. Maass, C. Min, S. Kashyap, W. Kang, M. Kumar, and T. Kim, “Mosaic: Processing a trillion-edge graph on a single machine,” inProceedings of the Twelfth European Conference on Computer Systems, ser. EuroSys ’17. New York, NY , USA: Association for Computing Machinery, 2017, p. 527–543. [Online]. Available: https://doi.org/10.1145/3064176.3064191

  28. [28]

    Tpp: Transparent page placement for cxl-enabled tiered-memory,

    F. Mu ˜noz Mart ´ınez, R. Garg, M. Pellauer, J. L. Abell ´an, M. E. Acacio, and T. Krishna, “Flexagon: A multi-dataflow sparse-sparse matrix multiplication accelerator for efficient DNN processing,” inProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, ser. ASPLOS 2023. N...

  29. [29]

    Exploiting locality in graph analytics through hardware-accelerated traversal scheduling,

    A. Mukkara, N. Beckmann, M. Abeydeera, X. Ma, and D. Sanchez, “Exploiting locality in graph analytics through hardware-accelerated traversal scheduling,” inMICRO. IEEE, 2018, pp. 1–14

  30. [30]

    STONNE: Enabling cycle-level microarchitectural simulation for DNN inference accelerators,

    F. Mu ˜noz-Mart´ınez, J. L. Abell ´an, M. E. Acacio, and T. Krishna, “STONNE: Enabling cycle-level microarchitectural simulation for DNN inference accelerators,” in2021 IEEE International Symposium on Workload Characterization (IISWC), 2021, pp. 201–213

  31. [31]

    TeAAL: A declarative framework for modeling sparse tensor accelerators,

    N. Nayak, T. O. Odemuyiwa, S. Ugare, C. Fletcher, M. Pellauer, and J. Emer, “TeAAL: A declarative framework for modeling sparse tensor accelerators,” inProceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’23. New York, NY , USA: Association for Computing Machinery, 2023, p. 1255–1270. [Online]. Available: https...

  32. [32]

    A sparse matrix vector multiply accelerator for support vector machine,

    E. Nurvitadhi, A. Mishra, and D. Marr, “A sparse matrix vector multiply accelerator for support vector machine,” in2015 International Confer- ence on Compilers, Architecture and Synthesis for Embedded Systems (CASES), Oct 2015, pp. 109–116

  33. [33]

    OuterSPACE: An outer product based sparse matrix multiplication accelerator,

    S. Pal, J. Beaumont, D. Park, A. Amarnath, S. Feng, C. Chakrabarti, H. Kim, D. Blaauw, T. Mudge, and R. Dreslinski, “OuterSPACE: An outer product based sparse matrix multiplication accelerator,” in 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), Feb 2018, pp. 724–736

  34. [34]

    SparseAdapt: Runtime control for sparse linear algebra on a reconfigurable accelerator,

    S. Pal, A. Amarnath, S. Feng, M. O’Boyle, R. Dreslinski, and C. Dubach, “SparseAdapt: Runtime control for sparse linear algebra on a reconfigurable accelerator,” inMICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’21. New York, NY , USA: Association for Computing Machinery, 2021, p. 1005–1021. [Online]. Available: ht...

  35. [35]

    Transmuter: Bridging the efficiency gap using memory and dataflow reconfiguration,

    S. Pal, S. Feng, D.-h. Park, S. Kim, A. Amarnath, C.-S. Yang, X. He, J. Beaumont, K. May, Y . Xiong, K. Kaszyk, J. M. Morton, J. Sun, M. O’Boyle, M. Cole, C. Chakrabarti, D. Blaauw, H.-S. Kim, T. Mudge, and R. Dreslinski, “Transmuter: Bridging the efficiency gap using memory and dataflow reconfiguration,” inProceedings of the ACM International Conference ...

  36. [36]

    SCNN: An accelerator for compressed-sparse convolutional neural networks,

    A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, “SCNN: An accelerator for compressed-sparse convolutional neural networks,” in Proceedings of the 44th Annual International Symposium on Computer Architecture, ser. ISCA ’17. New York, NY , USA: ACM, 2017, pp. 27–

  37. [37]

    Available: http://doi.acm.org/10.1145/3079856.3080254

    [Online]. Available: http://doi.acm.org/10.1145/3079856.3080254

  38. [38]

    Available: http://doi.acm.org/10.1145/3079856.3080254

    R. Prabhakar, Y . Zhang, D. Koeplinger, M. Feldman, T. Zhao, S. Hadjis, A. Pedram, C. Kozyrakis, and K. Olukotun, “Plasticine: A reconfigurable architecture for parallel paterns,” ser. ISCA ’17. New York, NY , USA: ACM, 2017, pp. 389–402. [Online]. Available: http://doi.acm.org/10.1145/3079856.3080256

  39. [39]

    Extending sparse tensor accelerators to support multiple compression formats,

    E. Qin, G. Jeong, W. Won, S.-C. Kao, H. Kwon, S. Srinivasan, D. Das, G. E. Moon, S. Rajamanickam, and T. Krishna, “Extending sparse tensor accelerators to support multiple compression formats,” in2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 2021, pp. 1014–1024

  40. [40]

    SIGMA: A sparse and irregular GEMM accelerator with flexible interconnects for DNN training,

    E. Qin, A. Samajdar, H. Kwon, V . Nadella, S. Srinivasan, D. Das, B. Kaul, and T. Krishna, “SIGMA: A sparse and irregular GEMM accelerator with flexible interconnects for DNN training,” in2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2020, pp. 58–70

  41. [41]

    Capstan: A vector RDA for sparsity,

    A. Rucker, M. Vilim, T. Zhao, Y . Zhang, R. Prabhakar, and K. Olukotun, “Capstan: A vector RDA for sparsity,” 2021

  42. [42]

    Serpens: A high bandwidth memory based accelerator for general-purpose sparse matrix-vector mul- tiplication,

    L. Song, Y . Chi, L. Guo, and J. Cong, “Serpens: A high bandwidth memory based accelerator for general-purpose sparse matrix-vector mul- tiplication,” inProceedings of the 59th ACM/IEEE design automation conference, 2022, pp. 211–216

  43. [43]

    Sextans: A streaming accelerator for general-purpose sparse-matrix dense-matrix multiplication,

    L. Song, Y . Chi, A. Sohrabizadeh, Y .-k. Choi, J. Lau, and J. Cong, “Sextans: A streaming accelerator for general-purpose sparse-matrix dense-matrix multiplication,” inProceedings of the 2022 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2022, pp. 65–77

  44. [44]

    MatRaptor: A sparse-sparse matrix multiplication accelerator based on row-wise product,

    N. Srivastava, H. Jin, J. Liu, D. Albonesi, and Z. Zhang, “MatRaptor: A sparse-sparse matrix multiplication accelerator based on row-wise product,” in2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2020, pp. 766–780

  45. [45]

    Tensaurus: A versatile accelerator for mixed sparse-dense tensor com- putations,

    N. Srivastava, H. Jin, S. Smith, H. Rong, D. Albonesi, and Z. Zhang, “Tensaurus: A versatile accelerator for mixed sparse-dense tensor com- putations,” in2020 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2020, pp. 689–702

  46. [46]

    SparGD: A sparse GEMM accelerator with dynamic dataflow,

    B. Wang, S. Ma, S. Luo, L. Wu, J. Zhang, C. Zhang, and T. Li, “SparGD: A sparse GEMM accelerator with dynamic dataflow,”ACM Trans. Des. Autom. Electron. Syst., vol. 29, no. 2, Jan. 2024. [Online]. Available: https://doi.org/10.1145/3634703

  47. [47]

    SpMARD: A sparse-sparse matrix multiplication accelerator with reconfigurable dataflow for DNN workloads,

    B. Wang, S. Ma, Y . Zhao, S. Luo, L. Wu, J. Zhang, D. Li, T. Li, and Z. Chen, “SpMARD: A sparse-sparse matrix multiplication accelerator with reconfigurable dataflow for DNN workloads,”ACM Trans. Archit. Code Optim., Aug. 2025, just Accepted. [Online]. Available: https://doi.org/10.1145/3747847

  48. [48]

    Sparseloop: An analytical approach to sparse tensor accelerator modeling,

    Y . N. Wu, P.-A. Tsai, A. Parashar, V . Sze, and J. S. Emer, “Sparseloop: An analytical approach to sparse tensor accelerator modeling,” in Proceedings of the 55th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’22. IEEE Press, 2023, p. 1377–1395. [Online]. Available: https://doi.org/10.1109/MICRO56248.2022.00096

  49. [49]

    DynaFlow: An ML framework for dynamic dataflow selection in SpGEMM accelerators,

    S. Yadav and B. Asgari, “DynaFlow: An ML framework for dynamic dataflow selection in SpGEMM accelerators,”IEEE Computer Architec- ture Letters, vol. 24, no. 1, pp. 189–192, 2025

  50. [50]

    Trapezoid: A versatile accelerator for dense and sparse matrix multiplications,

    Y . Yang, J. S. Emer, and D. Sanchez, “Trapezoid: A versatile accelerator for dense and sparse matrix multiplications,” in2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), 2024, pp. 931–945

  51. [51]

    ScalaGraph: A scalable accelerator for massively parallel graph processing,

    P. Yao, L. Zheng, Y . Huang, Q. Wang, C. Gui, Z. Zeng, X. Liao, H. Jin, and J. Xue, “ScalaGraph: A scalable accelerator for massively parallel graph processing,” in2022 IEEE International Symposium on High- Performance Computer Architecture (HPCA), 2022, pp. 199–212

  52. [52]

    Gamma: Leveraging Gustavson’s algorithm to accelerate sparse matrix multiplication,

    G. Zhang, N. Attaluri, J. S. Emer, and D. Sanchez, “Gamma: Leveraging Gustavson’s algorithm to accelerate sparse matrix multiplication,” ser. ASPLOS 2021. New York, NY , USA: Association for Computing Machinery, 2021. [Online]. Available: https://doi.org/10.1145/3445814. 3446702

  53. [53]

    Exploring the hidden dimension in graph processing,

    M. Zhang, Y . Wu, K. Chen, X. Qian, X. Li, and W. Zheng, “Exploring the hidden dimension in graph processing,” inProceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, ser. OSDI’16. USA: USENIX Association, 2016, p. 285–300

  54. [54]

    Cambricon-X: An accelerator for sparse neural networks,

    S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and Y . Chen, “Cambricon-X: An accelerator for sparse neural networks,” in 2016 49th Annual IEEE/ACM International Symposium on Microarchi- tecture (MICRO), 2016, pp. 1–12

  55. [55]

    Sparch: Efficient architecture for sparse matrix multiplication,

    Z. Zhang, H. Wang, S. Han, and W. J. Dally, “Sparch: Efficient architecture for sparse matrix multiplication,” in2020 IEEE Interna- tional Symposium on High Performance Computer Architecture (HPCA). IEEE, 2020, pp. 261–274

  56. [56]

    Feasta: A flexible and efficient accelerator for sparse tensor algebra in machine learning,

    K. Zhong, Z. Zhu, G. Dai, H. Wang, X. Yang, H. Zhang, J. Si, Q. Mao, S. Zeng, K. Honget al., “Feasta: A flexible and efficient accelerator for sparse tensor algebra in machine learning,” inProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 2024, pp. 349–366

  57. [57]

    rdkit.org, version [2025.03.6]; DOI: 10.5281/zen- odo.591637

    X. Zhou, Z. Du, Q. Guo, S. Liu, C. Liu, C. Wang, X. Zhou, L. Li, T. Chen, and Y . Chen, “Cambricon-S: Addressing irregularity in sparse neural networks through a cooperative software/hardware approach,” in 2018 51st Annual IEEE/ACM International Symposium on Microarchi- tecture (MICRO), 2018, pp. 15–28. APPENDIX Abstract We release the complete SegFold ar...

  58. [58]

    Build the simulator and run a smoke test: ./scripts/setup.sh

  59. [59]

    Download benchmark matrices: python3 scripts/download_matrices.py Docker (alternative).A pre-configured container is also avail- able: docker compose build docker compose run artifact \ ./scripts/run_all.sh Experiment Workflow A single command reproduces every experiment: ./scripts/run_all.sh Outputs are placed inoutput/ae_<timestamp>/. Each figure or tab...