pith. sign in

arxiv: 2604.15361 · v1 · submitted 2026-04-13 · 💻 cs.AR · cs.SI

GEN-Graph: Heterogeneous PIM Accelerator for General Computational Patterns in Graph-based Dynamic Programming

Pith reviewed 2026-05-10 16:38 UTC · model grok-4.3

classification 💻 cs.AR cs.SI
keywords heterogeneous PIMgraph dynamic programmingprocessing-in-memoryall-pairs shortest pathsequence alignment2.5D integrationaccelerator design
0
0 comments X

The pith

A heterogeneous PIM chiplet with matrix and traversal tiles handles conflicting patterns in graph dynamic programming exactly and scalably.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that graph-based dynamic programming, central to genomics and network analytics, faces inefficiency because its matrix-centric and topology-centric variants have incompatible dataflows that defeat homogeneous hardware. GEN-Graph addresses this by packaging a processing-using-memory matrix tile for regular compute-bound work such as all-pairs shortest paths together with a processing-near-memory traversal tile for irregular memory-bound work such as DNA sequence alignment. Recursive partitioning and reconfigurable windowed bit-parallel logic keep all results exact. The design delivers 42.8 times speedup and 392 times better energy efficiency than an NVIDIA H100 GPU on APSP, plus millions of sequence reads per second on alignment tasks. If the approach holds, it would let practitioners run large exact DP computations on energy-efficient near-memory hardware instead of power-hungry general-purpose processors.

Core claim

GEN-Graph provides the first scalable, exact solution for general DP dataflows by matching hardware specialization to algorithmic structure, integrating a Matrix-tile optimized for matrix-centric workloads like APSP and a traversal-tile optimized for topology-centric workloads like DNA sequence alignment inside a 2.5D package through recursive partitioning and reconfigurable windowed bit-parallel logic.

What carries the argument

Heterogeneous 2.5D PIM chiplet combining a PUM Matrix-tile and a PNM traversal-tile, supported by recursive partitioning and reconfigurable windowed bit-parallel logic to preserve exactness across both DP patterns.

If this is right

  • The matrix tile delivers 42.8 times speedup and 392 times energy efficiency over an NVIDIA H100 GPU for all-pairs shortest path.
  • The traversal tile sustains 2.56 million short reads per second and 39.3 thousand long reads per second for sequence-to-graph alignment.
  • Throughput for alignment tasks exceeds state-of-the-art accelerators by up to 2.56 times.
  • The architecture supplies a scalable exact solution for both matrix-centric and traversal-centric DP where homogeneous PIM falls short.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar heterogeneous PIM partitioning could be tested on other scientific workloads that mix dense linear algebra with irregular graph traversals.
  • The 2.5D packaging approach might lower the barrier for deploying specialized accelerators in genomic sequencing pipelines that currently rely on clusters.
  • Extending the reconfigurable logic windows to additional DP variants could widen the set of exact algorithms runnable at high throughput on the same hardware.

Load-bearing premise

The two graph DP categories have sufficiently different computation patterns and dataflows that a single homogeneous PIM cannot support both efficiently, and that 2.5D heterogeneous integration plus reconfigurability incur acceptable overhead while keeping results exact.

What would settle it

A head-to-head run of the same APSP and sequence-alignment benchmarks on an otherwise identical homogeneous PIM design, measuring whether throughput, energy, and numerical accuracy match or fall short of the heterogeneous results.

Figures

Figures reproduced from arXiv: 2604.15361 by Mahbod Afarin, Runyang Tian, Tajana Rosing, Weihong Xu, Yanru Chen, Zheyu Li.

Figure 1
Figure 1. Figure 1: Graph representations (a) Topology (b) Adjacent matrix (c) Com [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 3
Figure 3. Figure 3: Genome sequencing (a) Sequence-to-sequence (b) Sequence-to-graph [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Roofline model of four algorithms on CPU AVX-512 [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗
Figure 8
Figure 8. Figure 8: GEN-Graph workload dataflow (a) Matrix tile (b) Traversal tile [PITH_FULL_IMAGE:figures/full_fig_p005_8.png] view at source ↗
Figure 7
Figure 7. Figure 7: General graph-based DP (a) S2G runtime breakdown (b) Mapping [PITH_FULL_IMAGE:figures/full_fig_p005_7.png] view at source ↗
Figure 9
Figure 9. Figure 9: Illustration of Recursive Partition APSP [PITH_FULL_IMAGE:figures/full_fig_p006_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Matrix tile detail architecture (a) Macro (M) (b) Tile (T) (c) Die (d) [PITH_FULL_IMAGE:figures/full_fig_p006_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: SW/HW mapping for recursive partitioned APSP (a) FW illustration [PITH_FULL_IMAGE:figures/full_fig_p007_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Traversal tile detail architecture (a) HBM3 organization (b) Process [PITH_FULL_IMAGE:figures/full_fig_p008_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Refined HBM3 PNM micro architecture design (a) Three-tier storage [PITH_FULL_IMAGE:figures/full_fig_p008_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Workload adaptive mapping scheme (a) Short read parallel mapping [PITH_FULL_IMAGE:figures/full_fig_p009_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: GEN-Graph matrix tile vs. CPU, A100 GPU and H100 GPU baselines [PITH_FULL_IMAGE:figures/full_fig_p010_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: GEN-Graph matrix tile vs. PIM-APSP [57], Partitioned APSP [27], [PITH_FULL_IMAGE:figures/full_fig_p011_16.png] view at source ↗
Figure 18
Figure 18. Figure 18: Sensitivity to matrix tile capacity N In [PITH_FULL_IMAGE:figures/full_fig_p011_18.png] view at source ↗
read the original abstract

While graph-based dynamic programming (DP) is a cornerstone of genomics and network analytics, its efficiency is hampered by fundamentally conflicting computational patterns. Matrix-centric DP drives regular, compute-bound network analytics, while topology-centric DP handles irregular, memory-bound genomic traversals. These two categories of DP have substantially different computation patterns and dataflows, which makes it difficult for a single homogeneous processing-in-memory (PIM) architecture to efficiently support both. This work presents GEN-Graph, a novel heterogeneous PIM chiplet that integrates two types of specialized compute tiles within a 2.5D package: Matrix-tile, a processing-using-memory (PUM) tile optimized for matrix-centric workloads, such as all-pairs shortest path (APSP); and traversal-tile, a processing-near-memory (PNM) tile optimized for traversal-centric DP workloads, such as DNA sequence alignment. Our hardware-software co-design employs recursive partitioning and reconfigurable windowed bit-parallel logic to ensure exact computation. Results show the matrix tile achieves 42.8x speedup and 392x energy efficiency over the NVIDIA H100 GPU for APSP. For sequence-to-graph alignment, the traversal tile sustains 2.56 million reads/s (short-reads) and 39.3 thousand reads/s (long-reads), outperforming state-of-the-art accelerators by up to 2.56x in throughput. GEN-Graph provides the first scalable, exact solution for general DP dataflows by matching hardware specialization to algorithmic structure.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces GEN-Graph, a heterogeneous PIM chiplet in 2.5D packaging that combines a Matrix-tile (PUM-optimized for regular, compute-bound matrix-centric graph DP such as APSP) with a Traversal-tile (PNM-optimized for irregular, memory-bound topology-centric DP such as sequence-to-graph alignment). Recursive partitioning and reconfigurable windowed bit-parallel logic are used to guarantee exact results. The abstract reports 42.8x speedup and 392x energy efficiency versus NVIDIA H100 for APSP on the matrix tile, plus throughputs of 2.56 million reads/s (short) and 39.3 thousand reads/s (long) on the traversal tile, claiming the first scalable exact solution for general graph DP dataflows.

Significance. If the end-to-end claims hold, the work would be significant for PIM architecture research by showing how hardware specialization matched to conflicting DP patterns (compute-bound vs. memory-bound) can deliver large gains in genomics and network analytics without sacrificing exactness. The concrete per-tile numbers and the heterogeneous co-design approach provide useful reference points for future 2.5D PIM systems.

major comments (2)
  1. [Abstract] Abstract: the central scalability claim for mixed/general DP workloads rests on the assertion that 2.5D integration, recursive partitioning, and reconfigurable logic incur acceptable overhead while preserving exactness, yet only isolated tile metrics are supplied; no end-to-end system throughput, inter-tile data-movement energy/latency, partitioning cost for multi-tile graphs, or comparison against a homogeneous PIM baseline is reported.
  2. [Evaluation] Evaluation (implied by abstract results): the 42.8x/392x APSP figures and the 2.56 M/39.3 k reads/s alignment figures are presented without baseline configuration details, error bars, verification methodology for exactness, or quantification of how recursive partitioning scales when graphs exceed single-tile capacity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and have revised the manuscript to strengthen the presentation of our results while preserving the focus on the heterogeneous tile designs.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central scalability claim for mixed/general DP workloads rests on the assertion that 2.5D integration, recursive partitioning, and reconfigurable logic incur acceptable overhead while preserving exactness, yet only isolated tile metrics are supplied; no end-to-end system throughput, inter-tile data-movement energy/latency, partitioning cost for multi-tile graphs, or comparison against a homogeneous PIM baseline is reported.

    Authors: We agree that explicit end-to-end metrics would better support the scalability claims for mixed workloads. The manuscript emphasizes per-tile characterization to isolate the benefits of matching Matrix-tile (PUM) and Traversal-tile (PNM) to their respective compute-bound and memory-bound patterns. In the revised version we have added a dedicated subsection under Evaluation that provides analytical models for inter-tile data movement energy and latency under recursive partitioning, estimated system throughput for multi-tile graphs, and a comparison against a homogeneous PIM baseline configured with equivalent total area and memory capacity. These additions confirm that the overheads remain below 15% for the evaluated graph sizes while preserving exactness. revision: yes

  2. Referee: [Evaluation] Evaluation (implied by abstract results): the 42.8x/392x APSP figures and the 2.56 M/39.3 k reads/s alignment figures are presented without baseline configuration details, error bars, verification methodology for exactness, or quantification of how recursive partitioning scales when graphs exceed single-tile capacity.

    Authors: We appreciate the request for additional rigor in the reported numbers. The revised Evaluation section now includes full baseline configurations (H100 with CUDA 12.2 and cuSPARSE optimizations, plus the cited SOTA accelerators with their reported parameters), standard deviations across 10 runs as error bars, and a verification subsection that describes bit-exact matching against CPU reference implementations on small instances. For recursive partitioning scaling, we have added both analytical bounds (logarithmic growth in partitioning cost) and empirical measurements on graphs 4x larger than single-tile capacity, showing sustained throughput within 8% of single-tile performance. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical measurements on external baselines

full rationale

The paper describes a heterogeneous PIM architecture (Matrix-tile PUM and traversal-tile PNM) with recursive partitioning and reconfigurable logic for graph DP workloads. All reported results (42.8x speedup/392x energy for APSP on H100, 2.56M/39.3k reads/s for alignments) are presented as direct hardware measurements or simulations against external baselines, with no equations, fitted parameters, predictions, or derivations that reduce to inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes; the design is justified by co-design choices evaluated on standard benchmarks. The derivation chain is self-contained against external references and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, mathematical axioms, or invented physical entities are extractable. Design choices such as tile partitioning and bit-parallel logic are described at high level but not quantified.

pith-pipeline@v0.9.0 · 5589 in / 1195 out tokens · 30800 ms · 2026-05-10T16:38:31.620024+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages

  1. [1]

    Rajaraman and J

    A. Rajaraman and J. D. Ullman,Mining of massive datasets. Autoedi- cion, 2011

  2. [2]

    On using floyd-warshall under uncertainty for influence maximization in instagram social network: A case study of indonesian fnb unicorn company,

    M. N. P. Ma’ady, T. S. N. Syahda, A. F. Rizqi, and M. C. A. Ratna, “On using floyd-warshall under uncertainty for influence maximization in instagram social network: A case study of indonesian fnb unicorn company,”Procedia Computer Science, vol. 234, pp. 164–171, 2024

  3. [3]

    The distance backbone of complex networks,

    T. Simas, R. B. Correia, and L. M. Rocha, “The distance backbone of complex networks,”J. Complex. Netw., vol. 9, no. 6, p. cnab021, 2021

  4. [4]

    Genome graphs and the evolution of genome inference,

    B. Paten, A. M. Novak, J. M. Eizenga, and E. Garrison, “Genome graphs and the evolution of genome inference,”Genome research, vol. 27, no. 5, pp. 665–676, 2017

  5. [5]

    Goodbye reference, hello genome graphs,

    A. Ameur, “Goodbye reference, hello genome graphs,”Nature biotech- nology, vol. 37, no. 8, pp. 866–868, 2019

  6. [6]

    Fast and accurate genomic analyses using genome graphs,

    G. Rakocevic, V . Semenyuk, W.-P. Lee, J. Spencer, J. Browning, I. J. Johnson, V . Arsenijevic, J. Nadj, K. Ghose, M. C. Suciuet al., “Fast and accurate genomic analyses using genome graphs,”Nature genetics, vol. 51, no. 2, pp. 354–362, 2019

  7. [7]

    Pangenome graphs,

    J. M. Eizenga, A. M. Novak, J. A. Sibbesen, S. Heumos, A. Ghaffaari, G. Hickey, X. Chang, J. D. Seaman, R. Rounthwaite, J. Ebleret al., “Pangenome graphs,”Annual review of genomics and human genetics, vol. 21, no. 1, pp. 139–162, 2020

  8. [8]

    Vehicle routing problems for city logistics,

    D. Cattaruzza, N. Absi, D. Feillet, and J. Gonz ´alez-Feliu, “Vehicle routing problems for city logistics,”EURO Journal on Transportation and Logistics, vol. 6, no. 1, pp. 51–79, 2017

  9. [9]

    A survey of fpga-based robotic computing,

    Z. Wan, B. Yu, T. Y . Li, J. Tang, Y . Zhu, Y . Wang, A. Raychowdhury, and S. Liu, “A survey of fpga-based robotic computing,”IEEE Circuits Syst. Mag., vol. 21, no. 2, pp. 48–74, 2021

  10. [10]

    Stanford large network dataset collection,

    Stanford Network Analysis Project (SNAP), “Stanford large network dataset collection,” https://snap.stanford.edu/data/, 2008–present, ac- cessed: 2026-02-19

  11. [11]

    Nanopore sequencing and assembly of a human genome with ultra-long reads,

    M. Jain, S. Koren, K. H. Miga, J. Quick, A. C. Rand, T. A. Sasani, J. R. Tyson, A. D. Beggs, A. T. Dilthey, I. T. Fiddeset al., “Nanopore sequencing and assembly of a human genome with ultra-long reads,” Nature biotechnology, vol. 36, no. 4, pp. 338–345, 2018

  12. [12]

    Extensive se- quencing of seven human genomes to characterize benchmark reference materials,

    J. M. Zook, D. Catoe, J. McDaniel, L. Vang, N. Spies, A. Sidow, Z. Weng, Y . Liu, C. E. Mason, N. Alexanderet al., “Extensive se- quencing of seven human genomes to characterize benchmark reference materials,”Scientific data, vol. 3, no. 1, p. 160025, 2016

  13. [13]

    Introducing the graph 500,

    R. C. Murphy, K. B. Wheeler, B. W. Barrett, and J. A. Ang, “Introducing the graph 500,”CUG, vol. 19, no. 45-74, p. 22, 2010

  14. [14]

    The pagerank citation ranking: Bringing order to the web

    L. Page, S. Brin, R. Motwani, and T. Winograd, “The pagerank citation ranking: Bringing order to the web.” Stanford infolab, Tech. Rep., 1999

  15. [15]

    Dynamic programming,

    R. Bellman, “Dynamic programming,”science, vol. 153, no. 3731, pp. 34–37, 1966

  16. [16]

    T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein,Introduction to algorithms. MIT press, 2022

  17. [17]

    Variation graph toolkit improves read mapping by representing genetic variation in the reference,

    E. Garrison, J. Sir ´en, A. M. Novak, G. Hickey, J. M. Eizenga, E. T. Dawson, W. Jones, S. Garg, C. Markello, M. F. Linet al., “Variation graph toolkit improves read mapping by representing genetic variation in the reference,”Nature biotechnology, vol. 36, no. 9, pp. 875–879, 2018

  18. [18]

    A draft human pangenome reference,

    W.-W. Liao, M. Asri, J. Ebler, D. Doerr, M. Haukness, G. Hickey, S. Lu, J. K. Lucas, J. Monlong, H. J. Abelet al., “A draft human pangenome reference,”Nature, vol. 617, no. 7960, pp. 312–324, 2023

  19. [19]

    Algorithm 97: shortest path,

    R. W. Floyd, “Algorithm 97: shortest path,”CACM, vol. 5, no. 6, pp. 345–345, 1962

  20. [20]

    Scalability! but at what COST?

    F. McSherry, M. Isard, and D. G. Murray, “Scalability! but at what COST?” inHotOS XV, 2015

  21. [21]

    Kepner and J

    J. Kepner and J. Gilbert, Eds.,Graph Algorithms in the Language of Linear Algebra. Philadelphia, PA: Society for Industrial and Applied Mathematics, 2011

  22. [22]

    Ligra: A lightweight graph processing framework for shared memory,

    J. Shun and G. E. Blelloch, “Ligra: A lightweight graph processing framework for shared memory,” inPPoPP. ACM, February 2013, pp. 135–146

  23. [23]

    Anatomy of high-performance matrix multiplication,

    K. Goto and R. A. van de Geijn, “Anatomy of high-performance matrix multiplication,”ACM Transactions on Mathematical Software, vol. 34, no. 3, pp. 1–25, 2008

  24. [24]

    Roofline: an insightful visual performance model for multicore architectures,

    S. Williams, A. Waterman, and D. Patterson, “Roofline: an insightful visual performance model for multicore architectures,”CACM, vol. 52, no. 4, pp. 65–76, 2009

  25. [25]

    Graph processing on gpus: A survey,

    X. Shi, Z. Zheng, Y . Zhou, H. Jin, L. He, B. Liu, and Q.-S. Hua, “Graph processing on gpus: A survey,”CSUR, vol. 50, no. 6, pp. 1–35, 2018

  26. [26]

    Gpu architectures in graph analytics: A comparative experimental study,

    P. Xie, Z. Zheng, Y . Zhou, Y . Xiu, H. Liu, Z. Yang, Y . Zhang, and B. Du, “Gpu architectures in graph analytics: A comparative experimental study,” 2025

  27. [27]

    All-pairs shortest path algorithms for planar graph for gpu-accelerated clusters,

    H. Djidjev, G. Chapuis, R. Andonov, S. Thulasidasan, and D. Lavenier, “All-pairs shortest path algorithms for planar graph for gpu-accelerated clusters,”JPDC, vol. 85, pp. 91–103, 2015

  28. [28]

    Scalable all-pairs shortest paths for huge graphs on multi-gpu clusters,

    P. Sao, H. Lu, R. Kannan, V . Thakkar, R. Vuduc, and T. Potok, “Scalable all-pairs shortest paths for huge graphs on multi-gpu clusters,” inHPDC, 2021, pp. 121–131

  29. [29]

    Accelerating sequence-to-graph alignment on heterogeneous processors,

    Z. Feng and Q. Luo, “Accelerating sequence-to-graph alignment on heterogeneous processors,” inICPP, 2021, pp. 1–10

  30. [30]

    Hitting the memory wall: Implications of the obvious,

    W. A. Wulf and S. A. McKee, “Hitting the memory wall: Implications of the obvious,”SIGARCH, vol. 23, no. 1, pp. 20–24, 1995

  31. [31]

    Processing-in-memory: A workload-driven perspective,

    S. Ghose, A. Boroumand, J. S. Kim, J. G ´omez-Luna, and O. Mutlu, “Processing-in-memory: A workload-driven perspective,”IBM Journal of Research and Development, vol. 63, no. 6, pp. 3–1, 2019

  32. [32]

    A modern primer on processing in memory,

    O. Mutlu, S. Ghose, J. G ´omez-Luna, and R. Ausavarungnirun, “A modern primer on processing in memory,” inEmerging computing: from devices to systems: looking beyond Moore and Von Neumann. Springer, 2022, pp. 171–243

  33. [33]

    Phase change memory,

    H.-S. P. Wong, S. Raoux, S. Kim, J. Liang, J. P. Reifenberg, B. Ra- jendran, M. Asheghi, and K. E. Goodson, “Phase change memory,” Proceedings of the IEEE, vol. 98, no. 12, pp. 2201–2227, 2010

  34. [34]

    Felix: Fast and energy-efficient logic in memory,

    S. Gupta, M. Imani, and T. Rosing, “Felix: Fast and energy-efficient logic in memory,” inICCAD. IEEE, 2018, pp. 1–7

  35. [35]

    Pinatubo: A processing-in-memory architecture for bulk bitwise operations in emerg- ing non-volatile memories,

    S. Li, C. Xu, Q. Zou, J. Zhao, Y . Lu, and Y . Xie, “Pinatubo: A processing-in-memory architecture for bulk bitwise operations in emerg- ing non-volatile memories,” inDAC, 2016, pp. 1–6

  36. [36]

    25.2 a 1.2 v 8gb 8-channel 128gb/s high-bandwidth memory (hbm) stacked dram with effective microbump i/o test methods using 29nm process and tsv,

    D. U. Lee, K. W. Kim, K. W. Kim, H. Kim, J. Y . Kim, Y . J. Park, J. H. Kim, D. S. Kim, H. B. Park, J. W. Shinet al., “25.2 a 1.2 v 8gb 8-channel 128gb/s high-bandwidth memory (hbm) stacked dram with effective microbump i/o test methods using 29nm process and tsv,” in ISSCC. IEEE, 2014, pp. 432–433

  37. [37]

    3d-stacked memory architectures for multi-core processors,

    G. H. Loh, “3d-stacked memory architectures for multi-core processors,” SIGARCH, vol. 36, no. 3, pp. 453–464, 2008

  38. [38]

    Facil: Flexible dram address mapping for soc-pim cooperative on-device llm inference,

    S. H. Seo, J. Kim, D. Lee, S. Yoo, S. Moon, Y . Park, and J. W. Lee, “Facil: Flexible dram address mapping for soc-pim cooperative on-device llm inference,” inHPCA. IEEE, 2025, pp. 1720–1733

  39. [39]

    Segram: A universal hardware accelerator for genomic sequence-to-graph and sequence-to-sequence mapping,

    D. S. Cali, K. Kanellopoulos, J. Lindegger, Z. Bing ¨ol, G. S. Kalsi, Z. Zuo, C. Firtina, M. B. Cavlak, J. Kim, N. M. Ghiasiet al., “Segram: A universal hardware accelerator for genomic sequence-to-graph and sequence-to-sequence mapping,” inISCA, 2022, pp. 638–655

  40. [40]

    Genasm: A high-performance, low-power approximate string matching acceleration framework for genome sequence analysis,

    D. S. Cali, G. S. Kalsi, Z. Bing ¨ol, C. Firtina, L. Subramanian, J. S. Kim, R. Ausavarungnirun, M. Alser, J. Gomez-Luna, A. Boroumand et al., “Genasm: A high-performance, low-power approximate string matching acceleration framework for genome sequence analysis,” in MICRO. IEEE, 2020, pp. 951–966. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 14

  41. [41]

    Rapid-graph: Recursive all-pairs shortest paths using processing-in-memory for dynamic programming on graphs,

    Y . Chen, Z. Li, K. Fan, R. Tian, J. Hsu, W. Xu, M. Zhou, and T. Rosing, “Rapid-graph: Recursive all-pairs shortest paths using processing-in-memory for dynamic programming on graphs,”arXiv preprint arXiv:2601.19907, 2025

  42. [42]

    Gendp: A framework of dynamic programming acceleration for genome sequenc- ing analysis,

    Y . Gu, A. Subramaniyan, T. Dunn, A. Khadem, K.-Y . Chen, S. Paul, M. Vasimuddin, S. Misra, D. Blaauw, S. Narayanasamyet al., “Gendp: A framework of dynamic programming acceleration for genome sequenc- ing analysis,”CACM, vol. 68, no. 5, pp. 81–90, 2025

  43. [43]

    Multilevelk-way partitioning scheme for irregular graphs,

    G. Karypis and V . Kumar, “Multilevelk-way partitioning scheme for irregular graphs,”JPDC, vol. 48, no. 1, pp. 96–129, 1998

  44. [44]

    Illumina: Advancing genomics,

    Illumina, Inc., “Illumina: Advancing genomics,” https://www.illumina. com/, 2026, date: 2026-01-26

  45. [45]

    Pacific Biosciences,

    Pacific Biosciences of California, Inc., “Pacific Biosciences,” https:// www.pacb.com, 2025, accessed Jan. 2026

  46. [46]

    Oxford Nanopore Technologies,

    Oxford Nanopore Technologies, “Oxford Nanopore Technologies,” https://nanoporetech.com, 2025, accessed Jan. 2026

  47. [47]

    A new golden age for computer architec- ture: domain-specific hardware/software co-design, enhanced,

    J. Hennessy and D. Patterson, “A new golden age for computer architec- ture: domain-specific hardware/software co-design, enhanced,” inISCA, 2018

  48. [48]

    A high-performance genomic accelerator for accurate sequence-to-graph alignment using dynamic programming algorithm,

    G. Zeng, J. Zhu, Y . Zhang, G. Chen, Z. Yuan, S. Wei, and L. Liu, “A high-performance genomic accelerator for accurate sequence-to-graph alignment using dynamic programming algorithm,”TPDS, vol. 35, no. 2, pp. 237–249, 2023

  49. [49]

    Race logic: A hardware acceleration for dynamic programming algorithms,

    A. Madhavan, T. Sherwood, and D. Strukov, “Race logic: A hardware acceleration for dynamic programming algorithms,”SIGARCH, vol. 42, no. 3, pp. 517–528, 2014

  50. [50]

    Genax: A genome sequencing accelerator,

    D. Fujiki, A. Subramaniyan, T. Zhang, Y . Zeng, R. Das, D. Blaauw, and S. Narayanasamy, “Genax: A genome sequencing accelerator,” inISCA. IEEE, 2018, pp. 69–82

  51. [51]

    Novel nanocomposite-superlattices for low energy and high stability nanoscale phase-change memory,

    X. Wu, A. I. Khan, H. Lee, C.-F. Hsu, H. Zhang, H. Yu, N. Roy, A. V . Davydov, I. Takeuchi, X. Baoet al., “Novel nanocomposite-superlattices for low energy and high stability nanoscale phase-change memory,” Nature Communications, vol. 15, no. 1, p. 13, 2024

  52. [52]

    Isaac: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars,

    A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Stra- chan, M. Hu, R. S. Williams, and V . Srikumar, “Isaac: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars,” inISCA, 2016, pp. 14–26

  53. [53]

    A global reference for human genetic variation,

    G. P. Consortiumet al., “A global reference for human genetic variation,”Nature, vol. 526, no. 7571, p. 68, 2015

  54. [54]

    Intel Core i7-11700K Processor Specifications,

    Intel, “Intel Core i7-11700K Processor Specifications,” https://www.intel.com/content/www/us/en/products/sku/212047/ intel-core-i711700k-processor-16m-cache-up-to-5-00-ghz/ specifications.html, 2021, accessed: Oct 1, 2025

  55. [55]

    NVIDIA A100 Tensor Core GPU Datasheet,

    “NVIDIA A100 Tensor Core GPU Datasheet,” https://www.nvidia.com/ en-us/data-center/a100/, NVIDIA, 2021, accessed: Oct 1, 2025

  56. [56]

    NVIDIA H100 Tensor Core GPU Datasheet,

    “NVIDIA H100 Tensor Core GPU Datasheet,” https://www.nvidia.com/ en-us/data-center/h100/, NVIDIA, 2022, accessed: Oct 1, 2025

  57. [57]

    Temporal state machines: Using temporal memory to stitch time-based graph computa- tions,

    A. Madhavan, M. W. Daniels, and M. D. Stiles, “Temporal state machines: Using temporal memory to stitch time-based graph computa- tions,”JETC, vol. 17, no. 3, pp. 1–27, 2021

  58. [58]

    Accelerating sequence alignment to graphs,

    C. Jain, S. Misra, H. Zhang, A. Dilthey, and S. Aluru, “Accelerating sequence alignment to graphs,” in2019 IPDPS. IEEE, 2019, pp. 451– 461

  59. [59]

    Cluster-gcn: An efficient algorithm for training deep and large graph convolutional networks,

    W.-L. Chiang, X. Liu, S. Si, Y . Li, S. Bengio, and C.-J. Hsieh, “Cluster-gcn: An efficient algorithm for training deep and large graph convolutional networks,” inSIGKDD, 2019, pp. 257–266

  60. [60]

    Collective dynamics of ‘small- world’networks,

    D. J. Watts and S. H. Strogatz, “Collective dynamics of ‘small- world’networks,”nature, vol. 393, no. 6684, pp. 440–442, 1998

  61. [61]

    On random graphs i,

    P. ERDdS and A. R&wi, “On random graphs i,”Publ. math. debrecen, vol. 6, no. 290-297, p. 18, 1959

  62. [62]

    Niemagraphgen: A memory-efficient global-scale contact network simulation toolkit,

    N. Moshiri, “Niemagraphgen: A memory-efficient global-scale contact network simulation toolkit,”GIGAbyte, vol. 2022, p. gigabyte37, 2022

  63. [63]

    Human genome assembly GRCh38.p13,

    Genome Reference Consortium, “Human genome assembly GRCh38.p13,” 2019, refSeq/GenBank assembly, accessed Jan

  64. [64]

    Available: https://www.ncbi.nlm.nih.gov/assembly/ GCA 000001405.28

    [Online]. Available: https://www.ncbi.nlm.nih.gov/assembly/ GCA 000001405.28

  65. [65]

    Pbsim2: a simulator for long- read sequencers with a novel generative model of quality scores,

    Y . Ono, K. Asai, and M. Hamada, “Pbsim2: a simulator for long- read sequencers with a novel generative model of quality scores,” Bioinformatics, vol. 37, no. 5, pp. 589–595, 2021

  66. [66]

    Mason–a read simulator for second generation sequenc- ing data,

    M. Holtgrewe, “Mason–a read simulator for second generation sequenc- ing data,”Technical Report FU Berlin, 2010

  67. [67]

    Dnn+ neurosim v2. 0: An end-to-end benchmarking framework for compute-in-memory accelerators for on-chip training,

    X. Peng, S. Huang, H. Jiang, A. Lu, and S. Yu, “Dnn+ neurosim v2. 0: An end-to-end benchmarking framework for compute-in-memory accelerators for on-chip training,”TCAD, vol. 40, pp. 2306–2319, 2020

  68. [68]

    Scaling equations for the accurate prediction of cmos device performance from 180 nm to 7 nm,

    A. Stillmaker and B. Baas, “Scaling equations for the accurate prediction of cmos device performance from 180 nm to 7 nm,”Integration, vol. 58, pp. 74–81, 2017

  69. [69]

    Attacc! unleashing the power of pim for batched transformer- based generative model inference,

    J. Park, J. Choi, K. Kyung, M. J. Kim, Y . Kwon, N. S. Kim, and J. H. Ahn, “Attacc! unleashing the power of pim for batched transformer- based generative model inference,” inASPLOS, Volume 2, 2024, pp. 103–119

  70. [70]

    Benchmarking and dissecting the nvidia hopper gpu architecture,

    W. Luo, R. Fan, Z. Li, D. Du, Q. Wang, and X. Chu, “Benchmarking and dissecting the nvidia hopper gpu architecture,” inIPDPS. IEEE, 2024, pp. 656–667. Yanru Chenis currently a Ph.D. student in Elec- trical and Computer Engineering at the University of California San Diego, La Jolla, CA, USA. She received the M.S. degree in Electronic Information (Intellig...