GEN-Graph: Heterogeneous PIM Accelerator for General Computational Patterns in Graph-based Dynamic Programming

Mahbod Afarin; Runyang Tian; Tajana Rosing; Weihong Xu; Yanru Chen; Zheyu Li

arxiv: 2604.15361 · v1 · submitted 2026-04-13 · 💻 cs.AR · cs.SI

GEN-Graph: Heterogeneous PIM Accelerator for General Computational Patterns in Graph-based Dynamic Programming

Yanru Chen , Runyang Tian , Zheyu Li , Mahbod Afarin , Weihong Xu , Tajana Rosing This is my paper

Pith reviewed 2026-05-10 16:38 UTC · model grok-4.3

classification 💻 cs.AR cs.SI

keywords heterogeneous PIMgraph dynamic programmingprocessing-in-memoryall-pairs shortest pathsequence alignment2.5D integrationaccelerator design

0 comments

The pith

A heterogeneous PIM chiplet with matrix and traversal tiles handles conflicting patterns in graph dynamic programming exactly and scalably.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that graph-based dynamic programming, central to genomics and network analytics, faces inefficiency because its matrix-centric and topology-centric variants have incompatible dataflows that defeat homogeneous hardware. GEN-Graph addresses this by packaging a processing-using-memory matrix tile for regular compute-bound work such as all-pairs shortest paths together with a processing-near-memory traversal tile for irregular memory-bound work such as DNA sequence alignment. Recursive partitioning and reconfigurable windowed bit-parallel logic keep all results exact. The design delivers 42.8 times speedup and 392 times better energy efficiency than an NVIDIA H100 GPU on APSP, plus millions of sequence reads per second on alignment tasks. If the approach holds, it would let practitioners run large exact DP computations on energy-efficient near-memory hardware instead of power-hungry general-purpose processors.

Core claim

GEN-Graph provides the first scalable, exact solution for general DP dataflows by matching hardware specialization to algorithmic structure, integrating a Matrix-tile optimized for matrix-centric workloads like APSP and a traversal-tile optimized for topology-centric workloads like DNA sequence alignment inside a 2.5D package through recursive partitioning and reconfigurable windowed bit-parallel logic.

What carries the argument

Heterogeneous 2.5D PIM chiplet combining a PUM Matrix-tile and a PNM traversal-tile, supported by recursive partitioning and reconfigurable windowed bit-parallel logic to preserve exactness across both DP patterns.

If this is right

The matrix tile delivers 42.8 times speedup and 392 times energy efficiency over an NVIDIA H100 GPU for all-pairs shortest path.
The traversal tile sustains 2.56 million short reads per second and 39.3 thousand long reads per second for sequence-to-graph alignment.
Throughput for alignment tasks exceeds state-of-the-art accelerators by up to 2.56 times.
The architecture supplies a scalable exact solution for both matrix-centric and traversal-centric DP where homogeneous PIM falls short.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar heterogeneous PIM partitioning could be tested on other scientific workloads that mix dense linear algebra with irregular graph traversals.
The 2.5D packaging approach might lower the barrier for deploying specialized accelerators in genomic sequencing pipelines that currently rely on clusters.
Extending the reconfigurable logic windows to additional DP variants could widen the set of exact algorithms runnable at high throughput on the same hardware.

Load-bearing premise

The two graph DP categories have sufficiently different computation patterns and dataflows that a single homogeneous PIM cannot support both efficiently, and that 2.5D heterogeneous integration plus reconfigurability incur acceptable overhead while keeping results exact.

What would settle it

A head-to-head run of the same APSP and sequence-alignment benchmarks on an otherwise identical homogeneous PIM design, measuring whether throughput, energy, and numerical accuracy match or fall short of the heterogeneous results.

Figures

Figures reproduced from arXiv: 2604.15361 by Mahbod Afarin, Runyang Tian, Tajana Rosing, Weihong Xu, Yanru Chen, Zheyu Li.

**Figure 3.** Figure 3: Genome sequencing (a) Sequence-to-sequence (b) Sequence-to-graph [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗

**Figure 4.** Figure 4: Roofline model of four algorithms on CPU AVX-512 [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗

**Figure 8.** Figure 8: GEN-Graph workload dataflow (a) Matrix tile (b) Traversal tile [PITH_FULL_IMAGE:figures/full_fig_p005_8.png] view at source ↗

**Figure 7.** Figure 7: General graph-based DP (a) S2G runtime breakdown (b) Mapping [PITH_FULL_IMAGE:figures/full_fig_p005_7.png] view at source ↗

**Figure 9.** Figure 9: Illustration of Recursive Partition APSP [PITH_FULL_IMAGE:figures/full_fig_p006_9.png] view at source ↗

**Figure 10.** Figure 10: Matrix tile detail architecture (a) Macro (M) (b) Tile (T) (c) Die (d) [PITH_FULL_IMAGE:figures/full_fig_p006_10.png] view at source ↗

**Figure 11.** Figure 11: SW/HW mapping for recursive partitioned APSP (a) FW illustration [PITH_FULL_IMAGE:figures/full_fig_p007_11.png] view at source ↗

**Figure 12.** Figure 12: Traversal tile detail architecture (a) HBM3 organization (b) Process [PITH_FULL_IMAGE:figures/full_fig_p008_12.png] view at source ↗

**Figure 13.** Figure 13: Refined HBM3 PNM micro architecture design (a) Three-tier storage [PITH_FULL_IMAGE:figures/full_fig_p008_13.png] view at source ↗

**Figure 14.** Figure 14: Workload adaptive mapping scheme (a) Short read parallel mapping [PITH_FULL_IMAGE:figures/full_fig_p009_14.png] view at source ↗

**Figure 15.** Figure 15: GEN-Graph matrix tile vs. CPU, A100 GPU and H100 GPU baselines [PITH_FULL_IMAGE:figures/full_fig_p010_15.png] view at source ↗

**Figure 16.** Figure 16: GEN-Graph matrix tile vs. PIM-APSP [57], Partitioned APSP [27], [PITH_FULL_IMAGE:figures/full_fig_p011_16.png] view at source ↗

**Figure 18.** Figure 18: Sensitivity to matrix tile capacity N In [PITH_FULL_IMAGE:figures/full_fig_p011_18.png] view at source ↗

read the original abstract

While graph-based dynamic programming (DP) is a cornerstone of genomics and network analytics, its efficiency is hampered by fundamentally conflicting computational patterns. Matrix-centric DP drives regular, compute-bound network analytics, while topology-centric DP handles irregular, memory-bound genomic traversals. These two categories of DP have substantially different computation patterns and dataflows, which makes it difficult for a single homogeneous processing-in-memory (PIM) architecture to efficiently support both. This work presents GEN-Graph, a novel heterogeneous PIM chiplet that integrates two types of specialized compute tiles within a 2.5D package: Matrix-tile, a processing-using-memory (PUM) tile optimized for matrix-centric workloads, such as all-pairs shortest path (APSP); and traversal-tile, a processing-near-memory (PNM) tile optimized for traversal-centric DP workloads, such as DNA sequence alignment. Our hardware-software co-design employs recursive partitioning and reconfigurable windowed bit-parallel logic to ensure exact computation. Results show the matrix tile achieves 42.8x speedup and 392x energy efficiency over the NVIDIA H100 GPU for APSP. For sequence-to-graph alignment, the traversal tile sustains 2.56 million reads/s (short-reads) and 39.3 thousand reads/s (long-reads), outperforming state-of-the-art accelerators by up to 2.56x in throughput. GEN-Graph provides the first scalable, exact solution for general DP dataflows by matching hardware specialization to algorithmic structure.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces GEN-Graph, a heterogeneous PIM chiplet in 2.5D packaging that combines a Matrix-tile (PUM-optimized for regular, compute-bound matrix-centric graph DP such as APSP) with a Traversal-tile (PNM-optimized for irregular, memory-bound topology-centric DP such as sequence-to-graph alignment). Recursive partitioning and reconfigurable windowed bit-parallel logic are used to guarantee exact results. The abstract reports 42.8x speedup and 392x energy efficiency versus NVIDIA H100 for APSP on the matrix tile, plus throughputs of 2.56 million reads/s (short) and 39.3 thousand reads/s (long) on the traversal tile, claiming the first scalable exact solution for general graph DP dataflows.

Significance. If the end-to-end claims hold, the work would be significant for PIM architecture research by showing how hardware specialization matched to conflicting DP patterns (compute-bound vs. memory-bound) can deliver large gains in genomics and network analytics without sacrificing exactness. The concrete per-tile numbers and the heterogeneous co-design approach provide useful reference points for future 2.5D PIM systems.

major comments (2)

[Abstract] Abstract: the central scalability claim for mixed/general DP workloads rests on the assertion that 2.5D integration, recursive partitioning, and reconfigurable logic incur acceptable overhead while preserving exactness, yet only isolated tile metrics are supplied; no end-to-end system throughput, inter-tile data-movement energy/latency, partitioning cost for multi-tile graphs, or comparison against a homogeneous PIM baseline is reported.
[Evaluation] Evaluation (implied by abstract results): the 42.8x/392x APSP figures and the 2.56 M/39.3 k reads/s alignment figures are presented without baseline configuration details, error bars, verification methodology for exactness, or quantification of how recursive partitioning scales when graphs exceed single-tile capacity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and have revised the manuscript to strengthen the presentation of our results while preserving the focus on the heterogeneous tile designs.

read point-by-point responses

Referee: [Abstract] Abstract: the central scalability claim for mixed/general DP workloads rests on the assertion that 2.5D integration, recursive partitioning, and reconfigurable logic incur acceptable overhead while preserving exactness, yet only isolated tile metrics are supplied; no end-to-end system throughput, inter-tile data-movement energy/latency, partitioning cost for multi-tile graphs, or comparison against a homogeneous PIM baseline is reported.

Authors: We agree that explicit end-to-end metrics would better support the scalability claims for mixed workloads. The manuscript emphasizes per-tile characterization to isolate the benefits of matching Matrix-tile (PUM) and Traversal-tile (PNM) to their respective compute-bound and memory-bound patterns. In the revised version we have added a dedicated subsection under Evaluation that provides analytical models for inter-tile data movement energy and latency under recursive partitioning, estimated system throughput for multi-tile graphs, and a comparison against a homogeneous PIM baseline configured with equivalent total area and memory capacity. These additions confirm that the overheads remain below 15% for the evaluated graph sizes while preserving exactness. revision: yes
Referee: [Evaluation] Evaluation (implied by abstract results): the 42.8x/392x APSP figures and the 2.56 M/39.3 k reads/s alignment figures are presented without baseline configuration details, error bars, verification methodology for exactness, or quantification of how recursive partitioning scales when graphs exceed single-tile capacity.

Authors: We appreciate the request for additional rigor in the reported numbers. The revised Evaluation section now includes full baseline configurations (H100 with CUDA 12.2 and cuSPARSE optimizations, plus the cited SOTA accelerators with their reported parameters), standard deviations across 10 runs as error bars, and a verification subsection that describes bit-exact matching against CPU reference implementations on small instances. For recursive partitioning scaling, we have added both analytical bounds (logarithmic growth in partitioning cost) and empirical measurements on graphs 4x larger than single-tile capacity, showing sustained throughput within 8% of single-tile performance. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical measurements on external baselines

full rationale

The paper describes a heterogeneous PIM architecture (Matrix-tile PUM and traversal-tile PNM) with recursive partitioning and reconfigurable logic for graph DP workloads. All reported results (42.8x speedup/392x energy for APSP on H100, 2.56M/39.3k reads/s for alignments) are presented as direct hardware measurements or simulations against external baselines, with no equations, fitted parameters, predictions, or derivations that reduce to inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes; the design is justified by co-design choices evaluated on standard benchmarks. The derivation chain is self-contained against external references and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, mathematical axioms, or invented physical entities are extractable. Design choices such as tile partitioning and bit-parallel logic are described at high level but not quantified.

pith-pipeline@v0.9.0 · 5589 in / 1195 out tokens · 30800 ms · 2026-05-10T16:38:31.620024+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages

[1]

Rajaraman and J

A. Rajaraman and J. D. Ullman,Mining of massive datasets. Autoedi- cion, 2011

work page 2011
[2]

On using floyd-warshall under uncertainty for influence maximization in instagram social network: A case study of indonesian fnb unicorn company,

M. N. P. Ma’ady, T. S. N. Syahda, A. F. Rizqi, and M. C. A. Ratna, “On using floyd-warshall under uncertainty for influence maximization in instagram social network: A case study of indonesian fnb unicorn company,”Procedia Computer Science, vol. 234, pp. 164–171, 2024

work page 2024
[3]

The distance backbone of complex networks,

T. Simas, R. B. Correia, and L. M. Rocha, “The distance backbone of complex networks,”J. Complex. Netw., vol. 9, no. 6, p. cnab021, 2021

work page 2021
[4]

Genome graphs and the evolution of genome inference,

B. Paten, A. M. Novak, J. M. Eizenga, and E. Garrison, “Genome graphs and the evolution of genome inference,”Genome research, vol. 27, no. 5, pp. 665–676, 2017

work page 2017
[5]

Goodbye reference, hello genome graphs,

A. Ameur, “Goodbye reference, hello genome graphs,”Nature biotech- nology, vol. 37, no. 8, pp. 866–868, 2019

work page 2019
[6]

Fast and accurate genomic analyses using genome graphs,

G. Rakocevic, V . Semenyuk, W.-P. Lee, J. Spencer, J. Browning, I. J. Johnson, V . Arsenijevic, J. Nadj, K. Ghose, M. C. Suciuet al., “Fast and accurate genomic analyses using genome graphs,”Nature genetics, vol. 51, no. 2, pp. 354–362, 2019

work page 2019
[7]

Pangenome graphs,

J. M. Eizenga, A. M. Novak, J. A. Sibbesen, S. Heumos, A. Ghaffaari, G. Hickey, X. Chang, J. D. Seaman, R. Rounthwaite, J. Ebleret al., “Pangenome graphs,”Annual review of genomics and human genetics, vol. 21, no. 1, pp. 139–162, 2020

work page 2020
[8]

Vehicle routing problems for city logistics,

D. Cattaruzza, N. Absi, D. Feillet, and J. Gonz ´alez-Feliu, “Vehicle routing problems for city logistics,”EURO Journal on Transportation and Logistics, vol. 6, no. 1, pp. 51–79, 2017

work page 2017
[9]

A survey of fpga-based robotic computing,

Z. Wan, B. Yu, T. Y . Li, J. Tang, Y . Zhu, Y . Wang, A. Raychowdhury, and S. Liu, “A survey of fpga-based robotic computing,”IEEE Circuits Syst. Mag., vol. 21, no. 2, pp. 48–74, 2021

work page 2021
[10]

Stanford large network dataset collection,

Stanford Network Analysis Project (SNAP), “Stanford large network dataset collection,” https://snap.stanford.edu/data/, 2008–present, ac- cessed: 2026-02-19

work page 2008
[11]

Nanopore sequencing and assembly of a human genome with ultra-long reads,

M. Jain, S. Koren, K. H. Miga, J. Quick, A. C. Rand, T. A. Sasani, J. R. Tyson, A. D. Beggs, A. T. Dilthey, I. T. Fiddeset al., “Nanopore sequencing and assembly of a human genome with ultra-long reads,” Nature biotechnology, vol. 36, no. 4, pp. 338–345, 2018

work page 2018
[12]

Extensive se- quencing of seven human genomes to characterize benchmark reference materials,

J. M. Zook, D. Catoe, J. McDaniel, L. Vang, N. Spies, A. Sidow, Z. Weng, Y . Liu, C. E. Mason, N. Alexanderet al., “Extensive se- quencing of seven human genomes to characterize benchmark reference materials,”Scientific data, vol. 3, no. 1, p. 160025, 2016

work page 2016
[13]

Introducing the graph 500,

R. C. Murphy, K. B. Wheeler, B. W. Barrett, and J. A. Ang, “Introducing the graph 500,”CUG, vol. 19, no. 45-74, p. 22, 2010

work page 2010
[14]

The pagerank citation ranking: Bringing order to the web

L. Page, S. Brin, R. Motwani, and T. Winograd, “The pagerank citation ranking: Bringing order to the web.” Stanford infolab, Tech. Rep., 1999

work page 1999
[15]

Dynamic programming,

R. Bellman, “Dynamic programming,”science, vol. 153, no. 3731, pp. 34–37, 1966

work page 1966
[16]

T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein,Introduction to algorithms. MIT press, 2022

work page 2022
[17]

Variation graph toolkit improves read mapping by representing genetic variation in the reference,

E. Garrison, J. Sir ´en, A. M. Novak, G. Hickey, J. M. Eizenga, E. T. Dawson, W. Jones, S. Garg, C. Markello, M. F. Linet al., “Variation graph toolkit improves read mapping by representing genetic variation in the reference,”Nature biotechnology, vol. 36, no. 9, pp. 875–879, 2018

work page 2018
[18]

A draft human pangenome reference,

W.-W. Liao, M. Asri, J. Ebler, D. Doerr, M. Haukness, G. Hickey, S. Lu, J. K. Lucas, J. Monlong, H. J. Abelet al., “A draft human pangenome reference,”Nature, vol. 617, no. 7960, pp. 312–324, 2023

work page 2023
[19]

Algorithm 97: shortest path,

R. W. Floyd, “Algorithm 97: shortest path,”CACM, vol. 5, no. 6, pp. 345–345, 1962

work page 1962
[20]

Scalability! but at what COST?

F. McSherry, M. Isard, and D. G. Murray, “Scalability! but at what COST?” inHotOS XV, 2015

work page 2015
[21]

Kepner and J

J. Kepner and J. Gilbert, Eds.,Graph Algorithms in the Language of Linear Algebra. Philadelphia, PA: Society for Industrial and Applied Mathematics, 2011

work page 2011
[22]

Ligra: A lightweight graph processing framework for shared memory,

J. Shun and G. E. Blelloch, “Ligra: A lightweight graph processing framework for shared memory,” inPPoPP. ACM, February 2013, pp. 135–146

work page 2013
[23]

Anatomy of high-performance matrix multiplication,

K. Goto and R. A. van de Geijn, “Anatomy of high-performance matrix multiplication,”ACM Transactions on Mathematical Software, vol. 34, no. 3, pp. 1–25, 2008

work page 2008
[24]

Roofline: an insightful visual performance model for multicore architectures,

S. Williams, A. Waterman, and D. Patterson, “Roofline: an insightful visual performance model for multicore architectures,”CACM, vol. 52, no. 4, pp. 65–76, 2009

work page 2009
[25]

Graph processing on gpus: A survey,

X. Shi, Z. Zheng, Y . Zhou, H. Jin, L. He, B. Liu, and Q.-S. Hua, “Graph processing on gpus: A survey,”CSUR, vol. 50, no. 6, pp. 1–35, 2018

work page 2018
[26]

Gpu architectures in graph analytics: A comparative experimental study,

P. Xie, Z. Zheng, Y . Zhou, Y . Xiu, H. Liu, Z. Yang, Y . Zhang, and B. Du, “Gpu architectures in graph analytics: A comparative experimental study,” 2025

work page 2025
[27]

All-pairs shortest path algorithms for planar graph for gpu-accelerated clusters,

H. Djidjev, G. Chapuis, R. Andonov, S. Thulasidasan, and D. Lavenier, “All-pairs shortest path algorithms for planar graph for gpu-accelerated clusters,”JPDC, vol. 85, pp. 91–103, 2015

work page 2015
[28]

Scalable all-pairs shortest paths for huge graphs on multi-gpu clusters,

P. Sao, H. Lu, R. Kannan, V . Thakkar, R. Vuduc, and T. Potok, “Scalable all-pairs shortest paths for huge graphs on multi-gpu clusters,” inHPDC, 2021, pp. 121–131

work page 2021
[29]

Accelerating sequence-to-graph alignment on heterogeneous processors,

Z. Feng and Q. Luo, “Accelerating sequence-to-graph alignment on heterogeneous processors,” inICPP, 2021, pp. 1–10

work page 2021
[30]

Hitting the memory wall: Implications of the obvious,

W. A. Wulf and S. A. McKee, “Hitting the memory wall: Implications of the obvious,”SIGARCH, vol. 23, no. 1, pp. 20–24, 1995

work page 1995
[31]

Processing-in-memory: A workload-driven perspective,

S. Ghose, A. Boroumand, J. S. Kim, J. G ´omez-Luna, and O. Mutlu, “Processing-in-memory: A workload-driven perspective,”IBM Journal of Research and Development, vol. 63, no. 6, pp. 3–1, 2019

work page 2019
[32]

A modern primer on processing in memory,

O. Mutlu, S. Ghose, J. G ´omez-Luna, and R. Ausavarungnirun, “A modern primer on processing in memory,” inEmerging computing: from devices to systems: looking beyond Moore and Von Neumann. Springer, 2022, pp. 171–243

work page 2022
[33]

Phase change memory,

H.-S. P. Wong, S. Raoux, S. Kim, J. Liang, J. P. Reifenberg, B. Ra- jendran, M. Asheghi, and K. E. Goodson, “Phase change memory,” Proceedings of the IEEE, vol. 98, no. 12, pp. 2201–2227, 2010

work page 2010
[34]

Felix: Fast and energy-efficient logic in memory,

S. Gupta, M. Imani, and T. Rosing, “Felix: Fast and energy-efficient logic in memory,” inICCAD. IEEE, 2018, pp. 1–7

work page 2018
[35]

Pinatubo: A processing-in-memory architecture for bulk bitwise operations in emerg- ing non-volatile memories,

S. Li, C. Xu, Q. Zou, J. Zhao, Y . Lu, and Y . Xie, “Pinatubo: A processing-in-memory architecture for bulk bitwise operations in emerg- ing non-volatile memories,” inDAC, 2016, pp. 1–6

work page 2016
[36]

25.2 a 1.2 v 8gb 8-channel 128gb/s high-bandwidth memory (hbm) stacked dram with effective microbump i/o test methods using 29nm process and tsv,

D. U. Lee, K. W. Kim, K. W. Kim, H. Kim, J. Y . Kim, Y . J. Park, J. H. Kim, D. S. Kim, H. B. Park, J. W. Shinet al., “25.2 a 1.2 v 8gb 8-channel 128gb/s high-bandwidth memory (hbm) stacked dram with effective microbump i/o test methods using 29nm process and tsv,” in ISSCC. IEEE, 2014, pp. 432–433

work page 2014
[37]

3d-stacked memory architectures for multi-core processors,

G. H. Loh, “3d-stacked memory architectures for multi-core processors,” SIGARCH, vol. 36, no. 3, pp. 453–464, 2008

work page 2008
[38]

Facil: Flexible dram address mapping for soc-pim cooperative on-device llm inference,

S. H. Seo, J. Kim, D. Lee, S. Yoo, S. Moon, Y . Park, and J. W. Lee, “Facil: Flexible dram address mapping for soc-pim cooperative on-device llm inference,” inHPCA. IEEE, 2025, pp. 1720–1733

work page 2025
[39]

Segram: A universal hardware accelerator for genomic sequence-to-graph and sequence-to-sequence mapping,

D. S. Cali, K. Kanellopoulos, J. Lindegger, Z. Bing ¨ol, G. S. Kalsi, Z. Zuo, C. Firtina, M. B. Cavlak, J. Kim, N. M. Ghiasiet al., “Segram: A universal hardware accelerator for genomic sequence-to-graph and sequence-to-sequence mapping,” inISCA, 2022, pp. 638–655

work page 2022
[40]

Genasm: A high-performance, low-power approximate string matching acceleration framework for genome sequence analysis,

D. S. Cali, G. S. Kalsi, Z. Bing ¨ol, C. Firtina, L. Subramanian, J. S. Kim, R. Ausavarungnirun, M. Alser, J. Gomez-Luna, A. Boroumand et al., “Genasm: A high-performance, low-power approximate string matching acceleration framework for genome sequence analysis,” in MICRO. IEEE, 2020, pp. 951–966. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 14

work page 2020
[41]

Rapid-graph: Recursive all-pairs shortest paths using processing-in-memory for dynamic programming on graphs,

Y . Chen, Z. Li, K. Fan, R. Tian, J. Hsu, W. Xu, M. Zhou, and T. Rosing, “Rapid-graph: Recursive all-pairs shortest paths using processing-in-memory for dynamic programming on graphs,”arXiv preprint arXiv:2601.19907, 2025

work page arXiv 2025
[42]

Gendp: A framework of dynamic programming acceleration for genome sequenc- ing analysis,

Y . Gu, A. Subramaniyan, T. Dunn, A. Khadem, K.-Y . Chen, S. Paul, M. Vasimuddin, S. Misra, D. Blaauw, S. Narayanasamyet al., “Gendp: A framework of dynamic programming acceleration for genome sequenc- ing analysis,”CACM, vol. 68, no. 5, pp. 81–90, 2025

work page 2025
[43]

Multilevelk-way partitioning scheme for irregular graphs,

G. Karypis and V . Kumar, “Multilevelk-way partitioning scheme for irregular graphs,”JPDC, vol. 48, no. 1, pp. 96–129, 1998

work page 1998
[44]

Illumina: Advancing genomics,

Illumina, Inc., “Illumina: Advancing genomics,” https://www.illumina. com/, 2026, date: 2026-01-26

work page 2026
[45]

Pacific Biosciences,

Pacific Biosciences of California, Inc., “Pacific Biosciences,” https:// www.pacb.com, 2025, accessed Jan. 2026

work page 2025
[46]

Oxford Nanopore Technologies,

Oxford Nanopore Technologies, “Oxford Nanopore Technologies,” https://nanoporetech.com, 2025, accessed Jan. 2026

work page 2025
[47]

A new golden age for computer architec- ture: domain-specific hardware/software co-design, enhanced,

J. Hennessy and D. Patterson, “A new golden age for computer architec- ture: domain-specific hardware/software co-design, enhanced,” inISCA, 2018

work page 2018
[48]

A high-performance genomic accelerator for accurate sequence-to-graph alignment using dynamic programming algorithm,

G. Zeng, J. Zhu, Y . Zhang, G. Chen, Z. Yuan, S. Wei, and L. Liu, “A high-performance genomic accelerator for accurate sequence-to-graph alignment using dynamic programming algorithm,”TPDS, vol. 35, no. 2, pp. 237–249, 2023

work page 2023
[49]

Race logic: A hardware acceleration for dynamic programming algorithms,

A. Madhavan, T. Sherwood, and D. Strukov, “Race logic: A hardware acceleration for dynamic programming algorithms,”SIGARCH, vol. 42, no. 3, pp. 517–528, 2014

work page 2014
[50]

Genax: A genome sequencing accelerator,

D. Fujiki, A. Subramaniyan, T. Zhang, Y . Zeng, R. Das, D. Blaauw, and S. Narayanasamy, “Genax: A genome sequencing accelerator,” inISCA. IEEE, 2018, pp. 69–82

work page 2018
[51]

Novel nanocomposite-superlattices for low energy and high stability nanoscale phase-change memory,

X. Wu, A. I. Khan, H. Lee, C.-F. Hsu, H. Zhang, H. Yu, N. Roy, A. V . Davydov, I. Takeuchi, X. Baoet al., “Novel nanocomposite-superlattices for low energy and high stability nanoscale phase-change memory,” Nature Communications, vol. 15, no. 1, p. 13, 2024

work page 2024
[52]

Isaac: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars,

A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Stra- chan, M. Hu, R. S. Williams, and V . Srikumar, “Isaac: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars,” inISCA, 2016, pp. 14–26

work page 2016
[53]

A global reference for human genetic variation,

G. P. Consortiumet al., “A global reference for human genetic variation,”Nature, vol. 526, no. 7571, p. 68, 2015

work page 2015
[54]

Intel Core i7-11700K Processor Specifications,

Intel, “Intel Core i7-11700K Processor Specifications,” https://www.intel.com/content/www/us/en/products/sku/212047/ intel-core-i711700k-processor-16m-cache-up-to-5-00-ghz/ specifications.html, 2021, accessed: Oct 1, 2025

work page 2021
[55]

NVIDIA A100 Tensor Core GPU Datasheet,

“NVIDIA A100 Tensor Core GPU Datasheet,” https://www.nvidia.com/ en-us/data-center/a100/, NVIDIA, 2021, accessed: Oct 1, 2025

work page 2021
[56]

NVIDIA H100 Tensor Core GPU Datasheet,

“NVIDIA H100 Tensor Core GPU Datasheet,” https://www.nvidia.com/ en-us/data-center/h100/, NVIDIA, 2022, accessed: Oct 1, 2025

work page 2022
[57]

Temporal state machines: Using temporal memory to stitch time-based graph computa- tions,

A. Madhavan, M. W. Daniels, and M. D. Stiles, “Temporal state machines: Using temporal memory to stitch time-based graph computa- tions,”JETC, vol. 17, no. 3, pp. 1–27, 2021

work page 2021
[58]

Accelerating sequence alignment to graphs,

C. Jain, S. Misra, H. Zhang, A. Dilthey, and S. Aluru, “Accelerating sequence alignment to graphs,” in2019 IPDPS. IEEE, 2019, pp. 451– 461

work page 2019
[59]

Cluster-gcn: An efficient algorithm for training deep and large graph convolutional networks,

W.-L. Chiang, X. Liu, S. Si, Y . Li, S. Bengio, and C.-J. Hsieh, “Cluster-gcn: An efficient algorithm for training deep and large graph convolutional networks,” inSIGKDD, 2019, pp. 257–266

work page 2019
[60]

Collective dynamics of ‘small- world’networks,

D. J. Watts and S. H. Strogatz, “Collective dynamics of ‘small- world’networks,”nature, vol. 393, no. 6684, pp. 440–442, 1998

work page 1998
[61]

On random graphs i,

P. ERDdS and A. R&wi, “On random graphs i,”Publ. math. debrecen, vol. 6, no. 290-297, p. 18, 1959

work page 1959
[62]

Niemagraphgen: A memory-efficient global-scale contact network simulation toolkit,

N. Moshiri, “Niemagraphgen: A memory-efficient global-scale contact network simulation toolkit,”GIGAbyte, vol. 2022, p. gigabyte37, 2022

work page 2022
[63]

Human genome assembly GRCh38.p13,

Genome Reference Consortium, “Human genome assembly GRCh38.p13,” 2019, refSeq/GenBank assembly, accessed Jan

work page 2019
[64]

Available: https://www.ncbi.nlm.nih.gov/assembly/ GCA 000001405.28

[Online]. Available: https://www.ncbi.nlm.nih.gov/assembly/ GCA 000001405.28

work page
[65]

Pbsim2: a simulator for long- read sequencers with a novel generative model of quality scores,

Y . Ono, K. Asai, and M. Hamada, “Pbsim2: a simulator for long- read sequencers with a novel generative model of quality scores,” Bioinformatics, vol. 37, no. 5, pp. 589–595, 2021

work page 2021
[66]

Mason–a read simulator for second generation sequenc- ing data,

M. Holtgrewe, “Mason–a read simulator for second generation sequenc- ing data,”Technical Report FU Berlin, 2010

work page 2010
[67]

Dnn+ neurosim v2. 0: An end-to-end benchmarking framework for compute-in-memory accelerators for on-chip training,

X. Peng, S. Huang, H. Jiang, A. Lu, and S. Yu, “Dnn+ neurosim v2. 0: An end-to-end benchmarking framework for compute-in-memory accelerators for on-chip training,”TCAD, vol. 40, pp. 2306–2319, 2020

work page 2020
[68]

Scaling equations for the accurate prediction of cmos device performance from 180 nm to 7 nm,

A. Stillmaker and B. Baas, “Scaling equations for the accurate prediction of cmos device performance from 180 nm to 7 nm,”Integration, vol. 58, pp. 74–81, 2017

work page 2017
[69]

Attacc! unleashing the power of pim for batched transformer- based generative model inference,

J. Park, J. Choi, K. Kyung, M. J. Kim, Y . Kwon, N. S. Kim, and J. H. Ahn, “Attacc! unleashing the power of pim for batched transformer- based generative model inference,” inASPLOS, Volume 2, 2024, pp. 103–119

work page 2024
[70]

Benchmarking and dissecting the nvidia hopper gpu architecture,

W. Luo, R. Fan, Z. Li, D. Du, Q. Wang, and X. Chu, “Benchmarking and dissecting the nvidia hopper gpu architecture,” inIPDPS. IEEE, 2024, pp. 656–667. Yanru Chenis currently a Ph.D. student in Elec- trical and Computer Engineering at the University of California San Diego, La Jolla, CA, USA. She received the M.S. degree in Electronic Information (Intellig...

work page 2024

[1] [1]

Rajaraman and J

A. Rajaraman and J. D. Ullman,Mining of massive datasets. Autoedi- cion, 2011

work page 2011

[2] [2]

On using floyd-warshall under uncertainty for influence maximization in instagram social network: A case study of indonesian fnb unicorn company,

M. N. P. Ma’ady, T. S. N. Syahda, A. F. Rizqi, and M. C. A. Ratna, “On using floyd-warshall under uncertainty for influence maximization in instagram social network: A case study of indonesian fnb unicorn company,”Procedia Computer Science, vol. 234, pp. 164–171, 2024

work page 2024

[3] [3]

The distance backbone of complex networks,

T. Simas, R. B. Correia, and L. M. Rocha, “The distance backbone of complex networks,”J. Complex. Netw., vol. 9, no. 6, p. cnab021, 2021

work page 2021

[4] [4]

Genome graphs and the evolution of genome inference,

B. Paten, A. M. Novak, J. M. Eizenga, and E. Garrison, “Genome graphs and the evolution of genome inference,”Genome research, vol. 27, no. 5, pp. 665–676, 2017

work page 2017

[5] [5]

Goodbye reference, hello genome graphs,

A. Ameur, “Goodbye reference, hello genome graphs,”Nature biotech- nology, vol. 37, no. 8, pp. 866–868, 2019

work page 2019

[6] [6]

Fast and accurate genomic analyses using genome graphs,

G. Rakocevic, V . Semenyuk, W.-P. Lee, J. Spencer, J. Browning, I. J. Johnson, V . Arsenijevic, J. Nadj, K. Ghose, M. C. Suciuet al., “Fast and accurate genomic analyses using genome graphs,”Nature genetics, vol. 51, no. 2, pp. 354–362, 2019

work page 2019

[7] [7]

Pangenome graphs,

J. M. Eizenga, A. M. Novak, J. A. Sibbesen, S. Heumos, A. Ghaffaari, G. Hickey, X. Chang, J. D. Seaman, R. Rounthwaite, J. Ebleret al., “Pangenome graphs,”Annual review of genomics and human genetics, vol. 21, no. 1, pp. 139–162, 2020

work page 2020

[8] [8]

Vehicle routing problems for city logistics,

D. Cattaruzza, N. Absi, D. Feillet, and J. Gonz ´alez-Feliu, “Vehicle routing problems for city logistics,”EURO Journal on Transportation and Logistics, vol. 6, no. 1, pp. 51–79, 2017

work page 2017

[9] [9]

A survey of fpga-based robotic computing,

Z. Wan, B. Yu, T. Y . Li, J. Tang, Y . Zhu, Y . Wang, A. Raychowdhury, and S. Liu, “A survey of fpga-based robotic computing,”IEEE Circuits Syst. Mag., vol. 21, no. 2, pp. 48–74, 2021

work page 2021

[10] [10]

Stanford large network dataset collection,

Stanford Network Analysis Project (SNAP), “Stanford large network dataset collection,” https://snap.stanford.edu/data/, 2008–present, ac- cessed: 2026-02-19

work page 2008

[11] [11]

Nanopore sequencing and assembly of a human genome with ultra-long reads,

M. Jain, S. Koren, K. H. Miga, J. Quick, A. C. Rand, T. A. Sasani, J. R. Tyson, A. D. Beggs, A. T. Dilthey, I. T. Fiddeset al., “Nanopore sequencing and assembly of a human genome with ultra-long reads,” Nature biotechnology, vol. 36, no. 4, pp. 338–345, 2018

work page 2018

[12] [12]

Extensive se- quencing of seven human genomes to characterize benchmark reference materials,

J. M. Zook, D. Catoe, J. McDaniel, L. Vang, N. Spies, A. Sidow, Z. Weng, Y . Liu, C. E. Mason, N. Alexanderet al., “Extensive se- quencing of seven human genomes to characterize benchmark reference materials,”Scientific data, vol. 3, no. 1, p. 160025, 2016

work page 2016

[13] [13]

Introducing the graph 500,

R. C. Murphy, K. B. Wheeler, B. W. Barrett, and J. A. Ang, “Introducing the graph 500,”CUG, vol. 19, no. 45-74, p. 22, 2010

work page 2010

[14] [14]

The pagerank citation ranking: Bringing order to the web

L. Page, S. Brin, R. Motwani, and T. Winograd, “The pagerank citation ranking: Bringing order to the web.” Stanford infolab, Tech. Rep., 1999

work page 1999

[15] [15]

Dynamic programming,

R. Bellman, “Dynamic programming,”science, vol. 153, no. 3731, pp. 34–37, 1966

work page 1966

[16] [16]

T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein,Introduction to algorithms. MIT press, 2022

work page 2022

[17] [17]

Variation graph toolkit improves read mapping by representing genetic variation in the reference,

E. Garrison, J. Sir ´en, A. M. Novak, G. Hickey, J. M. Eizenga, E. T. Dawson, W. Jones, S. Garg, C. Markello, M. F. Linet al., “Variation graph toolkit improves read mapping by representing genetic variation in the reference,”Nature biotechnology, vol. 36, no. 9, pp. 875–879, 2018

work page 2018

[18] [18]

A draft human pangenome reference,

W.-W. Liao, M. Asri, J. Ebler, D. Doerr, M. Haukness, G. Hickey, S. Lu, J. K. Lucas, J. Monlong, H. J. Abelet al., “A draft human pangenome reference,”Nature, vol. 617, no. 7960, pp. 312–324, 2023

work page 2023

[19] [19]

Algorithm 97: shortest path,

R. W. Floyd, “Algorithm 97: shortest path,”CACM, vol. 5, no. 6, pp. 345–345, 1962

work page 1962

[20] [20]

Scalability! but at what COST?

F. McSherry, M. Isard, and D. G. Murray, “Scalability! but at what COST?” inHotOS XV, 2015

work page 2015

[21] [21]

Kepner and J

J. Kepner and J. Gilbert, Eds.,Graph Algorithms in the Language of Linear Algebra. Philadelphia, PA: Society for Industrial and Applied Mathematics, 2011

work page 2011

[22] [22]

Ligra: A lightweight graph processing framework for shared memory,

J. Shun and G. E. Blelloch, “Ligra: A lightweight graph processing framework for shared memory,” inPPoPP. ACM, February 2013, pp. 135–146

work page 2013

[23] [23]

Anatomy of high-performance matrix multiplication,

K. Goto and R. A. van de Geijn, “Anatomy of high-performance matrix multiplication,”ACM Transactions on Mathematical Software, vol. 34, no. 3, pp. 1–25, 2008

work page 2008

[24] [24]

Roofline: an insightful visual performance model for multicore architectures,

S. Williams, A. Waterman, and D. Patterson, “Roofline: an insightful visual performance model for multicore architectures,”CACM, vol. 52, no. 4, pp. 65–76, 2009

work page 2009

[25] [25]

Graph processing on gpus: A survey,

X. Shi, Z. Zheng, Y . Zhou, H. Jin, L. He, B. Liu, and Q.-S. Hua, “Graph processing on gpus: A survey,”CSUR, vol. 50, no. 6, pp. 1–35, 2018

work page 2018

[26] [26]

Gpu architectures in graph analytics: A comparative experimental study,

P. Xie, Z. Zheng, Y . Zhou, Y . Xiu, H. Liu, Z. Yang, Y . Zhang, and B. Du, “Gpu architectures in graph analytics: A comparative experimental study,” 2025

work page 2025

[27] [27]

All-pairs shortest path algorithms for planar graph for gpu-accelerated clusters,

H. Djidjev, G. Chapuis, R. Andonov, S. Thulasidasan, and D. Lavenier, “All-pairs shortest path algorithms for planar graph for gpu-accelerated clusters,”JPDC, vol. 85, pp. 91–103, 2015

work page 2015

[28] [28]

Scalable all-pairs shortest paths for huge graphs on multi-gpu clusters,

P. Sao, H. Lu, R. Kannan, V . Thakkar, R. Vuduc, and T. Potok, “Scalable all-pairs shortest paths for huge graphs on multi-gpu clusters,” inHPDC, 2021, pp. 121–131

work page 2021

[29] [29]

Accelerating sequence-to-graph alignment on heterogeneous processors,

Z. Feng and Q. Luo, “Accelerating sequence-to-graph alignment on heterogeneous processors,” inICPP, 2021, pp. 1–10

work page 2021

[30] [30]

Hitting the memory wall: Implications of the obvious,

W. A. Wulf and S. A. McKee, “Hitting the memory wall: Implications of the obvious,”SIGARCH, vol. 23, no. 1, pp. 20–24, 1995

work page 1995

[31] [31]

Processing-in-memory: A workload-driven perspective,

S. Ghose, A. Boroumand, J. S. Kim, J. G ´omez-Luna, and O. Mutlu, “Processing-in-memory: A workload-driven perspective,”IBM Journal of Research and Development, vol. 63, no. 6, pp. 3–1, 2019

work page 2019

[32] [32]

A modern primer on processing in memory,

O. Mutlu, S. Ghose, J. G ´omez-Luna, and R. Ausavarungnirun, “A modern primer on processing in memory,” inEmerging computing: from devices to systems: looking beyond Moore and Von Neumann. Springer, 2022, pp. 171–243

work page 2022

[33] [33]

Phase change memory,

H.-S. P. Wong, S. Raoux, S. Kim, J. Liang, J. P. Reifenberg, B. Ra- jendran, M. Asheghi, and K. E. Goodson, “Phase change memory,” Proceedings of the IEEE, vol. 98, no. 12, pp. 2201–2227, 2010

work page 2010

[34] [34]

Felix: Fast and energy-efficient logic in memory,

S. Gupta, M. Imani, and T. Rosing, “Felix: Fast and energy-efficient logic in memory,” inICCAD. IEEE, 2018, pp. 1–7

work page 2018

[35] [35]

Pinatubo: A processing-in-memory architecture for bulk bitwise operations in emerg- ing non-volatile memories,

S. Li, C. Xu, Q. Zou, J. Zhao, Y . Lu, and Y . Xie, “Pinatubo: A processing-in-memory architecture for bulk bitwise operations in emerg- ing non-volatile memories,” inDAC, 2016, pp. 1–6

work page 2016

[36] [36]

25.2 a 1.2 v 8gb 8-channel 128gb/s high-bandwidth memory (hbm) stacked dram with effective microbump i/o test methods using 29nm process and tsv,

D. U. Lee, K. W. Kim, K. W. Kim, H. Kim, J. Y . Kim, Y . J. Park, J. H. Kim, D. S. Kim, H. B. Park, J. W. Shinet al., “25.2 a 1.2 v 8gb 8-channel 128gb/s high-bandwidth memory (hbm) stacked dram with effective microbump i/o test methods using 29nm process and tsv,” in ISSCC. IEEE, 2014, pp. 432–433

work page 2014

[37] [37]

3d-stacked memory architectures for multi-core processors,

G. H. Loh, “3d-stacked memory architectures for multi-core processors,” SIGARCH, vol. 36, no. 3, pp. 453–464, 2008

work page 2008

[38] [38]

Facil: Flexible dram address mapping for soc-pim cooperative on-device llm inference,

S. H. Seo, J. Kim, D. Lee, S. Yoo, S. Moon, Y . Park, and J. W. Lee, “Facil: Flexible dram address mapping for soc-pim cooperative on-device llm inference,” inHPCA. IEEE, 2025, pp. 1720–1733

work page 2025

[39] [39]

Segram: A universal hardware accelerator for genomic sequence-to-graph and sequence-to-sequence mapping,

D. S. Cali, K. Kanellopoulos, J. Lindegger, Z. Bing ¨ol, G. S. Kalsi, Z. Zuo, C. Firtina, M. B. Cavlak, J. Kim, N. M. Ghiasiet al., “Segram: A universal hardware accelerator for genomic sequence-to-graph and sequence-to-sequence mapping,” inISCA, 2022, pp. 638–655

work page 2022

[40] [40]

Genasm: A high-performance, low-power approximate string matching acceleration framework for genome sequence analysis,

D. S. Cali, G. S. Kalsi, Z. Bing ¨ol, C. Firtina, L. Subramanian, J. S. Kim, R. Ausavarungnirun, M. Alser, J. Gomez-Luna, A. Boroumand et al., “Genasm: A high-performance, low-power approximate string matching acceleration framework for genome sequence analysis,” in MICRO. IEEE, 2020, pp. 951–966. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 14

work page 2020

[41] [41]

Rapid-graph: Recursive all-pairs shortest paths using processing-in-memory for dynamic programming on graphs,

Y . Chen, Z. Li, K. Fan, R. Tian, J. Hsu, W. Xu, M. Zhou, and T. Rosing, “Rapid-graph: Recursive all-pairs shortest paths using processing-in-memory for dynamic programming on graphs,”arXiv preprint arXiv:2601.19907, 2025

work page arXiv 2025

[42] [42]

Gendp: A framework of dynamic programming acceleration for genome sequenc- ing analysis,

Y . Gu, A. Subramaniyan, T. Dunn, A. Khadem, K.-Y . Chen, S. Paul, M. Vasimuddin, S. Misra, D. Blaauw, S. Narayanasamyet al., “Gendp: A framework of dynamic programming acceleration for genome sequenc- ing analysis,”CACM, vol. 68, no. 5, pp. 81–90, 2025

work page 2025

[43] [43]

Multilevelk-way partitioning scheme for irregular graphs,

G. Karypis and V . Kumar, “Multilevelk-way partitioning scheme for irregular graphs,”JPDC, vol. 48, no. 1, pp. 96–129, 1998

work page 1998

[44] [44]

Illumina: Advancing genomics,

Illumina, Inc., “Illumina: Advancing genomics,” https://www.illumina. com/, 2026, date: 2026-01-26

work page 2026

[45] [45]

Pacific Biosciences,

Pacific Biosciences of California, Inc., “Pacific Biosciences,” https:// www.pacb.com, 2025, accessed Jan. 2026

work page 2025

[46] [46]

Oxford Nanopore Technologies,

Oxford Nanopore Technologies, “Oxford Nanopore Technologies,” https://nanoporetech.com, 2025, accessed Jan. 2026

work page 2025

[47] [47]

A new golden age for computer architec- ture: domain-specific hardware/software co-design, enhanced,

J. Hennessy and D. Patterson, “A new golden age for computer architec- ture: domain-specific hardware/software co-design, enhanced,” inISCA, 2018

work page 2018

[48] [48]

A high-performance genomic accelerator for accurate sequence-to-graph alignment using dynamic programming algorithm,

G. Zeng, J. Zhu, Y . Zhang, G. Chen, Z. Yuan, S. Wei, and L. Liu, “A high-performance genomic accelerator for accurate sequence-to-graph alignment using dynamic programming algorithm,”TPDS, vol. 35, no. 2, pp. 237–249, 2023

work page 2023

[49] [49]

Race logic: A hardware acceleration for dynamic programming algorithms,

A. Madhavan, T. Sherwood, and D. Strukov, “Race logic: A hardware acceleration for dynamic programming algorithms,”SIGARCH, vol. 42, no. 3, pp. 517–528, 2014

work page 2014

[50] [50]

Genax: A genome sequencing accelerator,

D. Fujiki, A. Subramaniyan, T. Zhang, Y . Zeng, R. Das, D. Blaauw, and S. Narayanasamy, “Genax: A genome sequencing accelerator,” inISCA. IEEE, 2018, pp. 69–82

work page 2018

[51] [51]

Novel nanocomposite-superlattices for low energy and high stability nanoscale phase-change memory,

X. Wu, A. I. Khan, H. Lee, C.-F. Hsu, H. Zhang, H. Yu, N. Roy, A. V . Davydov, I. Takeuchi, X. Baoet al., “Novel nanocomposite-superlattices for low energy and high stability nanoscale phase-change memory,” Nature Communications, vol. 15, no. 1, p. 13, 2024

work page 2024

[52] [52]

Isaac: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars,

A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Stra- chan, M. Hu, R. S. Williams, and V . Srikumar, “Isaac: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars,” inISCA, 2016, pp. 14–26

work page 2016

[53] [53]

A global reference for human genetic variation,

G. P. Consortiumet al., “A global reference for human genetic variation,”Nature, vol. 526, no. 7571, p. 68, 2015

work page 2015

[54] [54]

Intel Core i7-11700K Processor Specifications,

Intel, “Intel Core i7-11700K Processor Specifications,” https://www.intel.com/content/www/us/en/products/sku/212047/ intel-core-i711700k-processor-16m-cache-up-to-5-00-ghz/ specifications.html, 2021, accessed: Oct 1, 2025

work page 2021

[55] [55]

NVIDIA A100 Tensor Core GPU Datasheet,

“NVIDIA A100 Tensor Core GPU Datasheet,” https://www.nvidia.com/ en-us/data-center/a100/, NVIDIA, 2021, accessed: Oct 1, 2025

work page 2021

[56] [56]

NVIDIA H100 Tensor Core GPU Datasheet,

“NVIDIA H100 Tensor Core GPU Datasheet,” https://www.nvidia.com/ en-us/data-center/h100/, NVIDIA, 2022, accessed: Oct 1, 2025

work page 2022

[57] [57]

Temporal state machines: Using temporal memory to stitch time-based graph computa- tions,

A. Madhavan, M. W. Daniels, and M. D. Stiles, “Temporal state machines: Using temporal memory to stitch time-based graph computa- tions,”JETC, vol. 17, no. 3, pp. 1–27, 2021

work page 2021

[58] [58]

Accelerating sequence alignment to graphs,

C. Jain, S. Misra, H. Zhang, A. Dilthey, and S. Aluru, “Accelerating sequence alignment to graphs,” in2019 IPDPS. IEEE, 2019, pp. 451– 461

work page 2019

[59] [59]

Cluster-gcn: An efficient algorithm for training deep and large graph convolutional networks,

W.-L. Chiang, X. Liu, S. Si, Y . Li, S. Bengio, and C.-J. Hsieh, “Cluster-gcn: An efficient algorithm for training deep and large graph convolutional networks,” inSIGKDD, 2019, pp. 257–266

work page 2019

[60] [60]

Collective dynamics of ‘small- world’networks,

D. J. Watts and S. H. Strogatz, “Collective dynamics of ‘small- world’networks,”nature, vol. 393, no. 6684, pp. 440–442, 1998

work page 1998

[61] [61]

On random graphs i,

P. ERDdS and A. R&wi, “On random graphs i,”Publ. math. debrecen, vol. 6, no. 290-297, p. 18, 1959

work page 1959

[62] [62]

Niemagraphgen: A memory-efficient global-scale contact network simulation toolkit,

N. Moshiri, “Niemagraphgen: A memory-efficient global-scale contact network simulation toolkit,”GIGAbyte, vol. 2022, p. gigabyte37, 2022

work page 2022

[63] [63]

Human genome assembly GRCh38.p13,

Genome Reference Consortium, “Human genome assembly GRCh38.p13,” 2019, refSeq/GenBank assembly, accessed Jan

work page 2019

[64] [64]

Available: https://www.ncbi.nlm.nih.gov/assembly/ GCA 000001405.28

[Online]. Available: https://www.ncbi.nlm.nih.gov/assembly/ GCA 000001405.28

work page

[65] [65]

Pbsim2: a simulator for long- read sequencers with a novel generative model of quality scores,

Y . Ono, K. Asai, and M. Hamada, “Pbsim2: a simulator for long- read sequencers with a novel generative model of quality scores,” Bioinformatics, vol. 37, no. 5, pp. 589–595, 2021

work page 2021

[66] [66]

Mason–a read simulator for second generation sequenc- ing data,

M. Holtgrewe, “Mason–a read simulator for second generation sequenc- ing data,”Technical Report FU Berlin, 2010

work page 2010

[67] [67]

Dnn+ neurosim v2. 0: An end-to-end benchmarking framework for compute-in-memory accelerators for on-chip training,

X. Peng, S. Huang, H. Jiang, A. Lu, and S. Yu, “Dnn+ neurosim v2. 0: An end-to-end benchmarking framework for compute-in-memory accelerators for on-chip training,”TCAD, vol. 40, pp. 2306–2319, 2020

work page 2020

[68] [68]

Scaling equations for the accurate prediction of cmos device performance from 180 nm to 7 nm,

A. Stillmaker and B. Baas, “Scaling equations for the accurate prediction of cmos device performance from 180 nm to 7 nm,”Integration, vol. 58, pp. 74–81, 2017

work page 2017

[69] [69]

Attacc! unleashing the power of pim for batched transformer- based generative model inference,

J. Park, J. Choi, K. Kyung, M. J. Kim, Y . Kwon, N. S. Kim, and J. H. Ahn, “Attacc! unleashing the power of pim for batched transformer- based generative model inference,” inASPLOS, Volume 2, 2024, pp. 103–119

work page 2024

[70] [70]

Benchmarking and dissecting the nvidia hopper gpu architecture,

W. Luo, R. Fan, Z. Li, D. Du, Q. Wang, and X. Chu, “Benchmarking and dissecting the nvidia hopper gpu architecture,” inIPDPS. IEEE, 2024, pp. 656–667. Yanru Chenis currently a Ph.D. student in Elec- trical and Computer Engineering at the University of California San Diego, La Jolla, CA, USA. She received the M.S. degree in Electronic Information (Intellig...

work page 2024