AcOrch: Accelerating Sampling-based GNN Training under CPU-NPU Heterogeneous Environments

Ge Yu; Kefu Chen; Qiange Wang; Xin Ai; Yanfeng Zhang

arxiv: 2606.01161 · v1 · pith:WAO5AXBZnew · submitted 2026-05-31 · 💻 cs.DC

Kefu Chen, Xin Ai, Qiange Wang, Yanfeng Zhang, Ge Yu

Pith reviewed 2026-06-28 16:35 UTC · model grok-4.3

classification 💻 cs.DC

keywords sampling-based GNN training · heterogeneous CPU-NPU · task orchestration · two-level pipeline · Ascend AI processor · graph neural networks · NPU acceleration · feature gathering

The pith

AcOrch achieves an average 2.31x speedup over MindSporeGL on sampling-based GNN training by mapping tasks to CPU, AIC, and AIV units in a two-level pipeline on the Ascend 910B processor.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AcOrch as a system to accelerate sampling-based Graph Neural Network training on CPU-NPU heterogeneous platforms such as the Ascend AI processor. It establishes that fine-grained task orchestration, which maps sampling, feature gathering, and model training to appropriate compute units, combined with a two-level pipelined execution model, overlaps these stages both across CPU and NPU and among units inside the NPU. This approach is presented as a way to maximize resource utilization in workloads where stages have mismatched resource needs and computation volumes. A sympathetic reader would care because sampling-based GNN training on large graphs is resource-intensive, and better overlap could reduce training time on existing NPU hardware without requiring new accelerators.

Core claim

AcOrch is a sampling-based GNN training system optimized for CPU-NPU heterogeneous platforms. It offers fine-grained task orchestration and adopts a two-level pipelined execution model to overlap sampling, gathering, and training. It analyzes the heterogeneous compute features of NPUs and maps tasks to AI Cube (AIC) units, AI Vector (AIV) units, and CPU cores accordingly. The two-level pipeline enables overlapping execution not only between the CPU and NPU, but also among different types of compute units within the NPU, thereby maximizing the utilization of available resources. Experiments on an Ascend 910B AI processor show that AcOrch achieves an average speedup of 2.31x over the state-of-the-art NPU-native graph learning system, MindSporeGL.

What carries the argument

The two-level pipelined execution model with fine-grained task orchestration that maps tasks to AIC units, AIV units, and CPU cores.
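To make the pipelining idea concrete, the following is a minimal sketch of a three-stage producer-consumer pipeline in plain Python. It is not AcOrch's implementation and uses no Ascend or MindSpore API: the functions `sample`, `gather`, and `train`, the `depth` parameter, and the stage-to-unit assignment noted in the comments are all assumptions made for illustration. Python threads only model the structure of the overlap, not real hardware concurrency.

```python
import queue
import threading

SENTINEL = None  # marks the end of the batch stream


def stage(worker, inbox, outbox):
    """Apply `worker` to every item from `inbox` and forward results to `outbox`."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)  # propagate shutdown to the next stage
            return
        outbox.put(worker(item))


# Hypothetical stand-ins for the three stages. In a real system these would
# dispatch to CPU cores, AIV units, and AIC units (assumed assignment).
def sample(batch_id):
    return {"batch": batch_id, "nodes": [batch_id, batch_id + 1]}


def gather(sub):
    return {**sub, "features": [float(n) for n in sub["nodes"]]}


def train(sub):
    return (sub["batch"], sum(sub["features"]))


def run_pipeline(num_batches, depth=2):
    # Bounded queues cap how far one stage may run ahead of the next.
    q_in = queue.Queue()
    q_sampled = queue.Queue(maxsize=depth)   # level 1: CPU -> NPU hand-off
    q_gathered = queue.Queue(maxsize=depth)  # level 2: hand-off inside the NPU
    q_out = queue.Queue()

    threads = [
        threading.Thread(target=stage, args=(sample, q_in, q_sampled)),
        threading.Thread(target=stage, args=(gather, q_sampled, q_gathered)),
        threading.Thread(target=stage, args=(train, q_gathered, q_out)),
    ]
    for t in threads:
        t.start()
    for b in range(num_batches):
        q_in.put(b)
    q_in.put(SENTINEL)

    results = []
    while True:
        r = q_out.get()
        if r is SENTINEL:
            break
        results.append(r)
    for t in threads:
        t.join()
    return results
```

Because each stage runs in its own thread and the queues are first-in-first-out, batch k can be in training while batch k+1 is being gathered and batch k+2 is being sampled, and results still come out in batch order.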

If this is right

Overlapping execution across CPU-NPU boundaries and within NPU units reduces idle time during the multi-stage training process.
Task mapping to specialized units allows each stage to run on the compute type best suited to its requirements.
The approach scales mini-batch training on sampled subgraphs for larger graphs by keeping more units busy.
Resource utilization improves without changes to the underlying GNN model or sampling algorithm.
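The idle-time argument above can be illustrated with a standard idealized pipeline model. With per-batch stage times s, g, and t and n batches, serial execution takes n(s + g + t), while a fully overlapped pipeline takes roughly (s + g + t) + (n - 1) * max(s, g, t). The sketch below encodes that model; it assumes zero synchronization and data-movement overhead, and the stage times in the usage note are hypothetical, not measurements from the paper.

```python
def serial_time(stage_times, num_batches):
    """Total time when every batch runs all stages back to back."""
    return num_batches * sum(stage_times)


def pipelined_time(stage_times, num_batches):
    """Ideal pipelined time: fill the pipeline once, then the slowest stage sets the pace."""
    if num_batches == 0:
        return 0.0
    return sum(stage_times) + (num_batches - 1) * max(stage_times)


def ideal_speedup(stage_times, num_batches):
    """Upper bound on speedup from overlap alone; requires num_batches >= 1."""
    return serial_time(stage_times, num_batches) / pipelined_time(stage_times, num_batches)
```

For example, with hypothetical stage times of 2, 3, and 5 time units and 10 batches, the serial time is 100 units and the ideal pipelined time is 55 units. Under this model the attainable speedup is bounded by the ratio of the total stage time to the slowest stage, so balanced stages benefit most from overlap.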

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same task-mapping and pipelining logic could be adapted to other NPUs or heterogeneous accelerators that expose distinct internal compute units.
The results suggest that orchestration overhead, rather than raw peak performance, is often the main limiter in current NPU-based GNN systems.
Dynamic adjustment of the pipeline depth based on graph size or batch characteristics might further improve results on varied workloads.

Load-bearing premise

The heterogeneous compute features of the NPU can be analyzed and tasks mapped to AIC, AIV, and CPU units such that the two-level pipeline overlaps sampling, gathering, and training with negligible synchronization or data-movement overhead on the target platform.

What would settle it

Running identical sampling-based GNN workloads on the Ascend 910B with and without the two-level pipeline and task mapping, then checking whether the measured speedup over MindSporeGL drops below the reported 2.31x average.
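A comparison of this kind reduces to per-workload speedups and a summary statistic. The sketch below shows one way to compute both. The abstract does not say which mean underlies the 2.31x average, so the code reports the arithmetic and the geometric mean; the function names are hypothetical, and any timings passed in must come from real measurements.

```python
from statistics import geometric_mean, mean


def speedups(baseline_s, candidate_s):
    """Per-workload speedup = baseline time / candidate time (same units)."""
    if baseline_s.keys() != candidate_s.keys():
        raise ValueError("both runs must cover the same workloads")
    return {name: baseline_s[name] / candidate_s[name] for name in baseline_s}


def summarize(per_workload):
    """Report both common averages, since the choice of mean changes the headline number."""
    values = list(per_workload.values())
    return {
        "arithmetic_mean": mean(values),
        "geometric_mean": geometric_mean(values),
        "min": min(values),
        "max": max(values),
    }
```

Reporting the minimum and maximum alongside the means shows whether an average is driven by a single favorable workload.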

Original abstract

Graph Neural Networks (GNNs) have achieved remarkable success in various applications. Sampling-based GNN training, which conducts mini-batch training on sampled subgraphs, has become a promising solution for large-scale graphs. Given the resource-intensive nature of sampling-based GNN training, Neural Processing Units (NPUs), such as the Ascend AI processor, offer a promising alternative due to their high throughput and energy efficiency, making them well-suited for GNN workloads. However, sampling-based training is a multi-stage process involving subgraph sampling, feature gathering, and model training, each with different resource requirements and computation volumes. This requires careful coordination to fully utilize the heterogeneous computation resources of CPUs and NPUs. In this work, we present AcOrch, a sampling-based GNN training system optimized for CPU-NPU heterogeneous platforms. AcOrch offers fine-grained task orchestration and adopts a two-level pipelined execution model to overlap sampling, gathering, and training. It analyzes the heterogeneous compute features of NPUs and maps tasks to AI Cube (AIC) units, AI Vector (AIV) units, and CPU cores accordingly. Moreover, the two-level pipeline enables overlapping execution not only between the CPU and NPU, but also among different types of compute units within the NPU (e.g., AIC and AIV units), thereby maximizing the utilization of available resources. Experiments on an Ascend 910B AI processor show that AcOrch achieves an average speedup of 2.31x over the state-of-the-art NPU-native graph learning system, MindSporeGL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

AcOrch describes a concrete two-level pipeline and unit mapping for sampling GNNs on Ascend NPUs with a 2.31x claim, but the abstract supplies almost no experimental details to check whether the overlaps actually work.

AcOrch claims a 2.31x speedup for sampling-based GNN training on Ascend 910B by using fine-grained task orchestration and a two-level pipelined model that overlaps sampling, gathering, and training across CPU and NPU units including AIC and AIV.

The new part is the concrete mapping of tasks to those specific units and the extension of pipelining to overlap within the NPU as well as between CPU and NPU. That addresses the multi-stage nature of the workload directly.

It does well at explaining the different compute requirements of each stage and why heterogeneous resources need careful coordination.

The soft spots are in the evaluation. The abstract reports the speedup but gives no information on the graph datasets, model architectures, number of runs, or baseline configuration details. Without those, it's difficult to judge whether the gains come from the proposed pipeline or from other factors. The central assumption that synchronization and data movement overheads stay negligible after the mapping also lacks supporting measurements like unit utilization or timing breakdowns in what is shown here.

This work is aimed at practitioners and researchers building GNN training systems for Ascend or similar CPU-NPU environments. A reader focused on performance engineering for graph neural networks on specialized hardware would get the most out of the orchestration strategy.

I would send it for peer review. The contribution is a real implementation on actual hardware with a measurable claim, so referees can verify the experiments and see if the pipeline delivers as described.

Referee Report

2 major / 0 minor

Summary. The paper presents AcOrch, a sampling-based GNN training system for CPU-NPU heterogeneous platforms (e.g., Ascend 910B). It introduces fine-grained task orchestration that maps tasks to AIC, AIV, and CPU units, combined with a two-level pipelined execution model to overlap subgraph sampling, feature gathering, and model training, claiming an average 2.31x speedup over the NPU-native baseline MindSporeGL.

Significance. If the performance claims are substantiated with complete experimental evidence, the work could meaningfully advance systems support for large-scale GNN training on specialized AI processors by demonstrating practical exploitation of intra-NPU heterogeneity via pipelining.

major comments (2)

[Abstract] The central claim of a 2.31x average speedup is presented without any description of the experimental setup, graph datasets, GNN model sizes/architectures, number of runs, variance across runs, or confirmation that the MindSporeGL baseline received equivalent tuning on the Ascend 910B. This absence prevents evaluation of the numerical result.
[Abstract] The two-level pipeline is asserted to overlap sampling/gathering/training with negligible synchronization and data-movement overhead after mapping to AIC/AIV/CPU units, yet no utilization metrics, per-stage timing breakdowns, or overhead measurements versus the baseline are supplied to substantiate that the weakest assumption holds.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We agree that the abstract should be more self-contained to allow readers to better assess the performance claims. We will revise the abstract accordingly while preserving its conciseness. Below we respond point by point to the major comments.

Referee: [Abstract] The central claim of a 2.31x average speedup is presented without any description of the experimental setup, graph datasets, GNN model sizes/architectures, number of runs, variance across runs, or confirmation that the MindSporeGL baseline received equivalent tuning on the Ascend 910B. This absence prevents evaluation of the numerical result.

Authors: We acknowledge the referee's point that the abstract, as currently written, does not provide sufficient context for the 2.31x claim. The full manuscript details the experimental setup (Ascend 910B platform, standard large-scale graph datasets, GNN architectures such as GCN and GraphSAGE, multiple independent runs with reported variance) and confirms equivalent tuning of the MindSporeGL baseline in the Experiments section. To address the concern directly, we will revise the abstract to include a concise summary of these elements.

Revision: yes

Referee: [Abstract] The two-level pipeline is asserted to overlap sampling/gathering/training with negligible synchronization and data-movement overhead after mapping to AIC/AIV/CPU units, yet no utilization metrics, per-stage timing breakdowns, or overhead measurements versus the baseline are supplied to substantiate that the weakest assumption holds.

Authors: The manuscript supplies per-stage timing breakdowns, resource utilization measurements, and overhead comparisons versus the baseline in the evaluation section to support the pipelining claims. However, these details are not referenced in the abstract. We will revise the abstract to briefly note the availability of these supporting measurements and their key findings.

Revision: yes

Circularity Check

0 steps flagged

No circularity; empirical speedup claim rests on hardware measurements, not derivations or self-referential fits

The paper describes a systems implementation (AcOrch) with task mapping to AIC/AIV/CPU units and a two-level pipeline, then reports measured wall-clock speedups (2.31x average) versus MindSporeGL on Ascend 910B hardware. No equations, fitted parameters, or mathematical derivations appear in the provided text that could reduce to inputs by construction. The central performance claim is an empirical observation, not a prediction derived from prior results or self-citations. No load-bearing self-citation chains, ansatzes, or renamings are present. This is a standard non-circular systems paper whose validity hinges on experimental reproducibility rather than definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, mathematical axioms, or invented entities are introduced; the contribution is an engineering orchestration layer whose correctness depends on platform-specific performance characteristics rather than new theoretical constructs.

Reference graph

Works this paper leans on

62 extracted references · 7 canonical work pages · 1 internal anchor

[1] Case 1: Assigning both sampling and gathering to the CPU. As shown in Fig. 5 (a), during each iteration, the CPU first performs neighbor sampling, then gathers the required node features, and finally transfers the prepared graph topology and feature data in batches to the NPU for training. Throughout the data preparation phase, the NPU remains idle and on...

[2] Case 2: Assigning sampling to the CPU and gathering to the AIV. As shown in Fig. 5 (b), the CPU first performs neighbor sampling and then sends the sampled subgraph topology to the NPU. The AIV is responsible for gathering the corresponding node features from the NPU’s memory, and together with the subgraph topology, delivers them to the AIC for training....

[3] Case 3: Assigning sampling to the AIV and gathering to the CPU. As shown in Fig. 5 (c), the AIV first completes graph sampling, after which the CPU collects the corresponding node features from main memory and transfers the processed feature data in batches to the NPU for training by the AIC. In this mode, the gathering step is again constrained by the PC...

[4] Case 4: Assigning both sampling and gathering to the AIV. As shown in Fig. 5 (d), the graph topology and feature data are cached in the NPU, and the AIV1 first performs subgraph topology sampling, followed by the AIV2 collecting the corresponding node features from NPU memory. Finally, both are sent to the AIC for training. This approach better leverages ...
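The four task-assignment cases quoted above form a small design space: sampling and gathering can each run on the CPU or on the AIV, while training runs on the AIC. The sketch below enumerates that space and records two properties stated in the excerpts, namely whether gathered feature data must be transferred from the CPU to the NPU, and whether the NPU sits idle during data preparation. The `Placement` class and its property names are hypothetical, and the code models only what the truncated excerpts say.

```python
from dataclasses import dataclass
from itertools import product

UNITS = ("CPU", "AIV")  # candidate units for sampling and gathering


@dataclass(frozen=True)
class Placement:
    sampling: str   # "CPU" or "AIV"
    gathering: str  # "CPU" or "AIV"

    @property
    def features_cross_link(self):
        # Features gathered on the CPU must be shipped to the NPU for training
        # (Cases 1 and 3); gathering on the AIV reads NPU memory instead.
        return self.gathering == "CPU"

    @property
    def npu_idle_during_preparation(self):
        # With both preparation stages on the CPU (Case 1), the NPU has no
        # work until the prepared batch arrives.
        return self.sampling == "CPU" and self.gathering == "CPU"


def enumerate_cases():
    """Return the placements in the order Case 1, Case 2, Case 3, Case 4."""
    return [Placement(s, g) for s, g in product(UNITS, repeat=2)]
```

Iterating over `enumerate_cases()` and printing each placement with its two properties reproduces the qualitative comparison in the excerpts: only the placements that gather on the AIV avoid moving feature data across the CPU-NPU link.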
[5]

NeutronOrch: rethinking sample-based GNN training under CPU- GPU heterogeneous environments

Ai X, Wang Q, Cao C, Zhang Y, Chen C, Yuan H, Gu Y, Yu G. NeutronOrch: rethinking sample-based GNN training under CPU- GPU heterogeneous environments. Proceedings of the VLDB Endow- ment, 2024, 17(8): 1995–2008

2024
[6]

Graph attention networks for neural social recommendation

Mu N, Zha D, He Y, Tang Z. Graph attention networks for neural social recommendation. In: Proceedings of the 31st IEEE International Conference on Tools with Artificial Intelligence. 2019, 1320–1327

2019
[7]

NeutronStar: distributed GNN training with hybrid dependency management

Wang Q, Zhang Y, Wang H, Chen C, Zhang X, Yu G. NeutronStar: distributed GNN training with hybrid dependency management. In: Proceedings of 2022 International Conference on Management of Data. 2022, 1301–1315

2022
[8]

Deep Graph Library: A Graph-Centric, Highly-Performant Package for Graph Neural Networks

Wang M, Yu L, Zheng D, Gan Q, Gai Y, Ye Z, Li M, Zhou J, Huang Q, Ma C, Huang Z, Guo Q, Zhang H, Lin H, Zhao J, Li J, Smola A J, Zhang Z. Deep graph library: towards efficient and scalable deep learning on graphs. 2019, arXiv preprint arXiv: 1909.01315

work page internal anchor Pith review Pith/arXiv arXiv 2019
[9]

XGCN: a library for large- scale graph neural network recommendations

Song X, Huang H, Lian J, Jin H. XGCN: a library for large- scale graph neural network recommendations. Frontiers of Computer Science, 2024, 18(3): 183343

2024
[10]

Semi-supervised classification with graph convolutional networks

Kipf T N, Welling M. Semi-supervised classification with graph convolutional networks. In: Proceedings of the 5th International Con- ference on Learning Representations. 2017

2017
[11]

TurboGNN: improving the end-to- end performance for sampling-based GNN training on GPUs

Wu W, Shi X, He L, Jin H. TurboGNN: improving the end-to- end performance for sampling-based GNN training on GPUs. IEEE Transactions on Computers, 2023, 72(9): 2571–2584

2023
[12]

Sampling meth- ods for efficient training of graph convolutional networks: a survey

Liu X, Yan M, Deng L, Li G, Ye X, Fan D. Sampling meth- ods for efficient training of graph convolutional networks: a survey. IEEE/CAA Journal of Automatica Sinica, 2022, 9(2): 205–234

2022
[13]

A comprehen- sive survey on graph neural networks

Wu Z, Pan S, Chen F, Long G, Zhang C, Yu P S. A comprehen- sive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, 2021, 32(1): 4–24

2021
[14]

A comprehensive survey on graph neural network accelerators

Liu J, Chen S, Shen L. A comprehensive survey on graph neural network accelerators. Frontiers of Computer Science, 2025, 19(2): 192104

2025
[15]

A survey of dynamic graph neural net- works

Zheng Y, Yi L, Wei Z. A survey of dynamic graph neural net- works. Frontiers of Computer Science, 2025, 19(6): 196323

2025
[16]

SAN- CUS: staleness-aware communication-avoiding full-graph decentral- ized training in large-scale graph neural networks

Peng J, Chen Z, Shao Y, Shen Y, Chen L, Cao J. SAN- CUS: staleness-aware communication-avoiding full-graph decentral- ized training in large-scale graph neural networks. Proceedings of the VLDB Endowment, 2022, 15(9): 1937–1950

2022
[17]

ASA-GNN: adaptive sampling and aggregation-based graph neural network for transaction fraud de- tection

Tian Y, Liu G, Wang J, Zhou M. ASA-GNN: adaptive sampling and aggregation-based graph neural network for transaction fraud de- tection. IEEE Transactions on Computational Social Systems, 2024, 11(3): 3536–3549

2024
[18]

Tissue specific tumor-gene link prediction through sampling based GNN using a heterogeneous network

Mishra S, Singh G, Bhattacharya M. Tissue specific tumor-gene link prediction through sampling based GNN using a heterogeneous network. Medical & Biological Engineering & Computing, 2024, 62(8): 2499–2510

2024
[19]

PGSampler: accelerating GPU- based graph sampling in GNN systems via workload fusion

Wei X, Tang W, Qi H, Yue H. PGSampler: accelerating GPU- based graph sampling in GNN systems via workload fusion. In: Pro- ceedings of 2024 IEEE International Conference on Cluster Comput- ing. 2024, 51–61

2024
[20]

Scalable graph neural network training: the case for sampling

Serafini M. Scalable graph neural network training: the case for sampling. ACM SIGOPS Operating Systems Review, 2021, 55(1): 68–76

2021
[21]

Efficient data loader for fast sampling-based GNN training on large graphs

Bai Y, Li C, Lin Z, Wu Y, Miao Y, Liu Y, Xu Y. Efficient data loader for fast sampling-based GNN training on large graphs. IEEE Transactions on Parallel and Distributed Systems, 2021, 32(10): 2541–2556

2021
[22]

FastGL: a GPU- efficient framework for accelerating sampling-based GNN training at large scale

Zhu Z, Wang P, Hu Q, Li G, Liang X, Cheng J. FastGL: a GPU- efficient framework for accelerating sampling-based GNN training at large scale. In: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. 2024, 94–110

2024
[23]

A local graph limits perspective on sampling-based GNNs

Alimohammadi Y, Ruiz L, Saberi A. A local graph limits perspective on sampling-based GNNs. 2023, arXiv preprint arXiv: 2310.10953

work page arXiv 2023
[24]

GNNLab: a factored system for sample-based GNN training over GPUs

Yang J, Tang D, Song X, Wang L, Yin Q, Chen R, Yu W, Zhou J. GNNLab: a factored system for sample-based GNN training over GPUs. In: Proceedings of the 17th European Conference on Computer Systems. 2022, 417–434 FrontiersofComputer Science|Issue 5|Volume 21|May 2027|2105103–14 Front. Comput. Sci., 2027, 21(5): 2105103

2022
[25]

Graph neural network training and data tiering

Min S, Wu K, Hidayetoglu M, Xiong J, Song X, Hwu W. Graph neural network training and data tiering. In: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Min- ing. 2022, 3555–3565

2022
[26]

DUCATI: a Dual-Cache training system for graph neural networks on giant graphs with the GPU

Zhang X, Shen Y, Shao Y, Chen L. DUCATI: a Dual-Cache training system for graph neural networks on giant graphs with the GPU. Proceedings of the ACM on Management of Data, 2023, 1(2): 166:1–166:24

2023
[27]

Large graph convolutional network training with GPU-oriented data communication architecture

Min S, Wu K, Huang S, Hidayetoglu M, Xiong J, Ebrahimi E, Chen D, Hwu W W. Large graph convolutional network training with GPU-oriented data communication architecture. Proceedings of the VLDB Endowment, 2021, 14(11): 2087–2100

2021
[28]

FastGCN: fast learning with graph con- volutional networks via importance sampling

Chen J, Ma T, Xiao C. FastGCN: fast learning with graph con- volutional networks via importance sampling. In: Proceedings of the 6th International Conference on Learning Representations. 2018

2018
[29]

Efficient neighbor-sampling-based GNN training on CPU-FPGA heterogeneous platform

Zhang B, Kuppannagari S R, Kannan R, Prasanna V K. Efficient neighbor-sampling-based GNN training on CPU-FPGA heterogeneous platform. In: Proceedings of 2021 IEEE High Performance Extreme Computing Conference, HPEC 2021. 2021, 1–7

2021
[30]

FreshGNN: reducing memory access via stable historical embeddings for graph neural network training

Huang K, Jiang H, Wang M, Xiao G, Wipf D, Song X, Gan Q, Huang Z, Zhai J, Zhang Z. FreshGNN: reducing memory access via stable historical embeddings for graph neural network training. Proceedings of the VLDB Endowment, 2024, 17(6): 1473–1486

2024
[31]

GNNAutoScale: scalable and expressive graph neural networks via historical embed- dings

Fey M, Lenssen J E, Weichert F, Leskovec J. GNNAutoScale: scalable and expressive graph neural networks via historical embed- dings. In: Proceedings of the 38th International Conference on Ma- chine Learning. 2021, 3294–3304

2021
[32]

Marius: learning massive graph embeddings on a single machine

Mohoney J, Waleffe R, Xu H, Rekatsinas T, Venkataraman S. Marius: learning massive graph embeddings on a single machine. In: Proceedings of the 15th USENIX Symposium on Operating Systems Design and Implementation. 2021, 533–549

2021
[33]

WholeGraph: a fast graph neural network training framework with multi-GPU distributed shared mem- ory architecture

Yang D, Liu J, Qi J, Lai J. WholeGraph: a fast graph neural network training framework with multi-GPU distributed shared mem- ory architecture. In: Proceedings of International Conference for High Performance Computing, Networking, Storage and Analysis. 2022, 54:1–54:14

2022
[34]

SaGNN: a sample- based GNN training and inference hardware accelerator

Wang H, Zhang S, Feng K, Wang M, Yang Z. SaGNN: a sample- based GNN training and inference hardware accelerator. In: Proceed- ings of 2023 IEEE International Symposium on Circuits and Systems. 2023, 1–5

2023
[35]

An efficient sampling- based SpMM kernel for balancing accuracy and speed in GNN infer- ence

Song Y, Wang Y, Xiong C, Wang T, Tang P. An efficient sampling- based SpMM kernel for balancing accuracy and speed in GNN infer- ence. In: Proceedings of 2024 IEEE International Symposium on Parallel and Distributed Processing with Applications. 2024, 468–475

2024
[36]

SCGraph: accelerating sample-based GNN training by staged caching of features on GPUs

He Y, Lai Z, Ran Z, Zhang L, Li D. SCGraph: accelerating sample-based GNN training by staged caching of features on GPUs. In: Proceedings of 2022 IEEE International Conference on Parallel & Distributed Processing with Applications, Big Data & Cloud Comput- ing, Sustainable Computing & Communications, Social Computing & Networking. 2022, 106–113

2022
[37]

Ascend: a scalable and unified architecture for ubiquitous deep neural network computing : industry track paper

Liao H, Tu J, Xia J, Liu H, Zhou X, Yuan H, Hu Y. Ascend: a scalable and unified architecture for ubiquitous deep neural network computing : industry track paper. In: Proceedings of 2021 IEEE In- ternational Symposium on High-Performance Computer Architecture. 2021, 789–801

2021
[38]

Performance evaluation of MindSpore and PyTorch based on Ascend NPU

Zhu Z, Wang B, Yang C, Zhu R, Zhou M, Zheng N. Performance evaluation of MindSpore and PyTorch based on Ascend NPU. In: Proceedings of the 29th IEEE International Conference on Parallel and Distributed Systems. 2023, 1826–1832

2023
[39]

In-datacenter performance analysis of a tensor processing unit

Jouppi N P, Young C, Patil N, Patterson D A, Agrawal G, Bajwa R, Bates S, Bhatia S, Boden N, Borchers A, Boyle R, Cantin P, Chao C, Clark C, Coriell J, Daley M, Dau M, Dean J, Gelb B, Ghaemmaghami T V, Gottipati R, Gulland W, Hagmann R, Ho C R, Hogberg D, Hu J, Hundt R, Hurt D, Ibarz J, Jaffey A, Jaworski A, Kaplan A, Khaitan H, Killebrew D, Koch A, Kumar...
[40]

Habana labs purpose-built AI inference and training processor architectures: Scaling AI training systems using standard ethernet with gaudi processor

Medina E, Dagan E. Habana labs purpose-built AI inference and training processor architectures: Scaling AI training systems using standard ethernet with gaudi processor. IEEE Micro, 2020, 40(2): 17–24

2020
[41]

AIbench: a tool for benchmarking Huawei Ascend AI processors

Xiao Y, Wang Z. AIbench: a tool for benchmarking Huawei Ascend AI processors. CCF Transactions on High Performance Com- puting, 2024, 6(2): 115–129

2024
[42]

Ascend-CC: confi- dential computing on heterogeneous NPU for emerging generative AI workloads

Dhar A, Thorens C, Lazier L M, Cavigelli L. Ascend-CC: confi- dential computing on heterogeneous NPU for emerging generative AI workloads. 2024, arXiv preprint arXiv: 2407.11888

work page arXiv 2024
[43]

Ma- chine learning fleet efficiency: analyzing and optimizing large-scale Google TPU systems with ML productivity goodput

Wongpanich A, Oguntebi T, Paredes J B, Wang Y E, Phothilimthana P M, Mitra R, Zhou Z, Kumar N, Reddi V J. Ma- chine learning fleet efficiency: analyzing and optimizing large-scale Google TPU systems with ML productivity goodput. 2025, arXiv preprint arXiv: 2502.06982

work page arXiv 2025
[44]

Tackling the dynamicity in a produc- tion LLM serving system with SOTA optimizations via hybrid pre- fill/decode/verify scheduling on efficient Meta-kernels

Song M, Tang X, Hou F, Li J, Wei W, Ma Y, Xiao R, Si H, Jiang D, Yin S, Hu Y, Long G. Tackling the dynamicity in a produc- tion LLM serving system with SOTA optimizations via hybrid pre- fill/decode/verify scheduling on efficient Meta-kernels. 2024, arXiv preprint arXiv: 2412.18106

work page arXiv 2024
[45]

Analysis of performance and optimization in MindSpore on Ascend NPUs

Wang B, Yang C, Zhu R, Liu X, Zhou M, Zheng N. Analysis of performance and optimization in MindSpore on Ascend NPUs. In: Proceedings of the 29th IEEE International Conference on Parallel and Distributed Systems. 2023, 1701–1708

2023
[46]

Machine learning-enabled performance model for DNN applications and AI accelerator

Wu R, Li M, Li H, Chen T, Tian X, Xu X, Zhou B, Chen J, An H. Machine learning-enabled performance model for DNN applications and AI accelerator. In: Proceedings of the 2022 IEEE 24th Int Conf on High Performance Computing & Communications; 8th Int Conf on Data Science & Systems; 20th Int Conf on Smart City; 8th Int Conf on Dependability in Sensor, Cloud ...

2022
[47]

Unlocking high performance with low-bit NPUs and CPUs for highly optimized HPL- MxP on Cloud Brain II

Xue W, Yang K, Liu Y, Fan D, Xu P, Tian Y. Unlocking high performance with low-bit NPUs and CPUs for highly optimized HPL- MxP on Cloud Brain II. In: Proceedings of International Conference for High Performance Computing, Networking, Storage, and Analysis. 2024, 82

2024
[48]

Inductive representation learning on large graphs

Hamilton W L, Ying Z, Leskovec J. Inductive representation learning on large graphs. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017, 1024– 1034

2017
[49]

Defining and evaluating network communi- ties based on ground-truth

Yang J, Leskovec J. Defining and evaluating network communi- ties based on ground-truth. In: Proceedings of the 12th IEEE Interna- tional Conference on Data Mining. 2012, 745–754

2012
[50]

Predicting positive and negative links in online social networks

Leskovec J, Huttenlocher D P, Kleinberg J M. Predicting positive and negative links in online social networks. In: Proceedings of the 19th International Conference on World Wide Web. 2010, 641–650

2010
[51]

when to sample

Ramezani M, Cong W, Mahdavi M, Sivasubramaniam A, Kan- demir M T. GCN meets GPU: decoupling “when to sample” from “how to sample”. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. 2020

2020
[52]

Community structure in large networks: natural cluster sizes and the absence of large well-defined clusters

Leskovec J, Lang K J, Dasgupta A, Mahoney M W. Community structure in large networks: natural cluster sizes and the absence of large well-defined clusters. Internet Mathematics, 2009, 6(1): 29–123

2009
[53]

Cube-fx: mapping Taylor ex- pansion onto matrix multiplier-accumulators of Huawei Ascend AI processors

Tang Y, Zhou H, Ji Z, Wang C. Cube-fx: mapping Taylor ex- pansion onto matrix multiplier-accumulators of Huawei Ascend AI processors. IEEE Transactions on Parallel and Distributed Systems, 2025, 36(6): 1115–1129

2025
[54]

High- utilization GPGPU design for accelerating GEMM workloads: an incremental approach

Wang C, Song P, Zhao H, Zhang F, Wang J, Zhang L. High- utilization GPGPU design for accelerating GEMM workloads: an incremental approach. In: Proceedings of 2024 IEEE International Symposium on Circuits and Systems. 2024, 1–5

2024
[55]

HBM-based hardware accelerator for GNN sampling and aggregation

Gui Y, Wu Q, Yuan W, Liang H, Wang X, Jin X. HBM-based hardware accelerator for GNN sampling and aggregation. In: Proceed- ings of 2024 IEEE High Performance Extreme Computing Conference. 2024, 1–7

2024
[56]

HongTu: scalable full-graph GNN training on multiple GPUs (via communication-optimized CPU data offloading)

Wang Q, Chen Y, Wong W, He B. HongTu: scalable full-graph GNN training on multiple GPUs (via communication-optimized CPU data offloading). 2023, arXiv preprint arXiv: 2311.14898

work page arXiv 2023
[57]

Quiver: supporting GPUs for low-latency, high-throughput GNN serving with workload awareness

Tan Z, Yuan X, He C, Sit M, Li G, Liu X, Ai B, Zeng K, Pietzuch P R, Mai L. Quiver: supporting GPUs for low-latency, high-throughput GNN serving with workload awareness. 2023, arXiv preprint arXiv: 2305.10863

work page arXiv 2023
[58]

Principal component analysis in the local differ- ential privacy model

Wang D, Xu J. Principal component analysis in the local differ- ential privacy model. In: Proceedings of the 28th International Joint Conference on Artificial Intelligence, IJCAI 2019. 2019, 4795–4801

2019
[59]

Accelerating graph sampling for graph machine learning using GPUs

Jangda A, Polisetty S, Guha A, Serafini M. Accelerating graph sampling for graph machine learning using GPUs. In: Proceedings of the 16th European Conference on Computer Systems. 2021, 311–326

2021
[60]

PaGraph: scaling GNN training on large graphs via computation-aware caching

Lin Z, Li C, Miao Y, Liu Y, Xu Y. PaGraph: scaling GNN training on large graphs via computation-aware caching. In: Proceedings of the 11th ACM Symposium on Cloud Computing. 2020, 401–415

2020
[61]

NeutronAscend: optimizing GNN training with Ascend AI processors

Ai X, Zhang B, Wang Q, Zhang Y, Yuan H, Gong S, Yu G. NeutronAscend: optimizing GNN training with Ascend AI processors. ACM Transactions on Architecture and Code Optimization, 2025

2025
[62]

Performance modeling on DaVinci AI core

Tang Y, Wang C. Performance modeling on DaVinci AI core. Journal of Parallel and Distributed Computing, 2023, 175: 134–149

2023

Kefu Chen is currently a master’s student in computer science at Northeastern University, China. His major research interests include acceleration of graph computing and learning systems on emerging hardware. Xin Ai is currently wo...

Case 1: Assigning both sampling and gathering to the CPU. As shown in Fig. 5 (a), during each iteration, the CPU first performs neighbor sampling, then gathers the required node features, and finally transfers the prepared graph topology and feature data in batches to the NPU for training. Throughout the data preparation phase, the NPU remains idle and on...

Case 2: Assigning sampling to the CPU and gathering to the AIV. As shown in Fig. 5 (b), the CPU first performs neighbor sampling and then sends the sampled subgraph topology to the NPU. The AIV is responsible for gathering the corresponding node features from the NPU’s memory, and together with the subgraph topology, delivers them to the AIC for training....

Case 3: Assigning sampling to the AIV and gathering to the CPU. As shown in Fig. 5 (c), the AIV first completes graph sampling, after which the CPU collects the corresponding node features from main memory and transfers the processed feature data in batches to the NPU for training by the AIC. In this mode, the gathering step is again constrained by the PC...

Case 4: Assigning both sampling and gathering to the AIV. As shown in Fig. 5 (d), the graph topology and feature data are cached in the NPU, and the AIV1 first performs subgraph topology sampling, followed by the AIV2 collecting the corresponding node features from NPU memory. Finally, both are sent to the AIC for training. This approach better leverages ...

[5]

NeutronOrch: rethinking sample-based GNN training under CPU-GPU heterogeneous environments

Ai X, Wang Q, Cao C, Zhang Y, Chen C, Yuan H, Gu Y, Yu G. NeutronOrch: rethinking sample-based GNN training under CPU-GPU heterogeneous environments. Proceedings of the VLDB Endowment, 2024, 17(8): 1995–2008

2024

[6]

Graph attention networks for neural social recommendation

Mu N, Zha D, He Y, Tang Z. Graph attention networks for neural social recommendation. In: Proceedings of the 31st IEEE International Conference on Tools with Artificial Intelligence. 2019, 1320–1327

2019

[7]

NeutronStar: distributed GNN training with hybrid dependency management

Wang Q, Zhang Y, Wang H, Chen C, Zhang X, Yu G. NeutronStar: distributed GNN training with hybrid dependency management. In: Proceedings of 2022 International Conference on Management of Data. 2022, 1301–1315

2022

[8]

Deep Graph Library: A Graph-Centric, Highly-Performant Package for Graph Neural Networks

Wang M, Yu L, Zheng D, Gan Q, Gai Y, Ye Z, Li M, Zhou J, Huang Q, Ma C, Huang Z, Guo Q, Zhang H, Lin H, Zhao J, Li J, Smola A J, Zhang Z. Deep graph library: towards efficient and scalable deep learning on graphs. 2019, arXiv preprint arXiv: 1909.01315

2019

[9]

XGCN: a library for large-scale graph neural network recommendations

Song X, Huang H, Lian J, Jin H. XGCN: a library for large-scale graph neural network recommendations. Frontiers of Computer Science, 2024, 18(3): 183343

2024

[10]

Semi-supervised classification with graph convolutional networks

Kipf T N, Welling M. Semi-supervised classification with graph convolutional networks. In: Proceedings of the 5th International Conference on Learning Representations. 2017

2017

[11]

TurboGNN: improving the end-to-end performance for sampling-based GNN training on GPUs

Wu W, Shi X, He L, Jin H. TurboGNN: improving the end-to-end performance for sampling-based GNN training on GPUs. IEEE Transactions on Computers, 2023, 72(9): 2571–2584

2023

[12]

Sampling methods for efficient training of graph convolutional networks: a survey

Liu X, Yan M, Deng L, Li G, Ye X, Fan D. Sampling methods for efficient training of graph convolutional networks: a survey. IEEE/CAA Journal of Automatica Sinica, 2022, 9(2): 205–234

2022

[13]

A comprehensive survey on graph neural networks

Wu Z, Pan S, Chen F, Long G, Zhang C, Yu P S. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, 2021, 32(1): 4–24

2021

[14]

A comprehensive survey on graph neural network accelerators

Liu J, Chen S, Shen L. A comprehensive survey on graph neural network accelerators. Frontiers of Computer Science, 2025, 19(2): 192104

2025

[15]

A survey of dynamic graph neural networks

Zheng Y, Yi L, Wei Z. A survey of dynamic graph neural networks. Frontiers of Computer Science, 2025, 19(6): 196323

2025

[16]

SANCUS: staleness-aware communication-avoiding full-graph decentralized training in large-scale graph neural networks

Peng J, Chen Z, Shao Y, Shen Y, Chen L, Cao J. SANCUS: staleness-aware communication-avoiding full-graph decentralized training in large-scale graph neural networks. Proceedings of the VLDB Endowment, 2022, 15(9): 1937–1950

2022

[17]

ASA-GNN: adaptive sampling and aggregation-based graph neural network for transaction fraud detection

Tian Y, Liu G, Wang J, Zhou M. ASA-GNN: adaptive sampling and aggregation-based graph neural network for transaction fraud detection. IEEE Transactions on Computational Social Systems, 2024, 11(3): 3536–3549

2024

[18]

Tissue specific tumor-gene link prediction through sampling based GNN using a heterogeneous network

Mishra S, Singh G, Bhattacharya M. Tissue specific tumor-gene link prediction through sampling based GNN using a heterogeneous network. Medical & Biological Engineering & Computing, 2024, 62(8): 2499–2510

2024

[19]

PGSampler: accelerating GPU-based graph sampling in GNN systems via workload fusion

Wei X, Tang W, Qi H, Yue H. PGSampler: accelerating GPU-based graph sampling in GNN systems via workload fusion. In: Proceedings of 2024 IEEE International Conference on Cluster Computing. 2024, 51–61

2024

[20]

Scalable graph neural network training: the case for sampling

Serafini M. Scalable graph neural network training: the case for sampling. ACM SIGOPS Operating Systems Review, 2021, 55(1): 68–76

2021

[21]

Efficient data loader for fast sampling-based GNN training on large graphs

Bai Y, Li C, Lin Z, Wu Y, Miao Y, Liu Y, Xu Y. Efficient data loader for fast sampling-based GNN training on large graphs. IEEE Transactions on Parallel and Distributed Systems, 2021, 32(10): 2541–2556

2021

[22]

FastGL: a GPU-efficient framework for accelerating sampling-based GNN training at large scale

Zhu Z, Wang P, Hu Q, Li G, Liang X, Cheng J. FastGL: a GPU-efficient framework for accelerating sampling-based GNN training at large scale. In: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. 2024, 94–110

2024

[23]

A local graph limits perspective on sampling-based GNNs

Alimohammadi Y, Ruiz L, Saberi A. A local graph limits perspective on sampling-based GNNs. 2023, arXiv preprint arXiv: 2310.10953

2023

[24]

GNNLab: a factored system for sample-based GNN training over GPUs

Yang J, Tang D, Song X, Wang L, Yin Q, Chen R, Yu W, Zhou J. GNNLab: a factored system for sample-based GNN training over GPUs. In: Proceedings of the 17th European Conference on Computer Systems. 2022, 417–434

2022

Frontiers of Computer Science | Issue 5 | Volume 21 | May 2027 | 2105103–14. Front. Comput. Sci., 2027, 21(5): 2105103

[25]

Graph neural network training and data tiering

Min S, Wu K, Hidayetoglu M, Xiong J, Song X, Hwu W. Graph neural network training and data tiering. In: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2022, 3555–3565

2022

[26]

DUCATI: a Dual-Cache training system for graph neural networks on giant graphs with the GPU

Zhang X, Shen Y, Shao Y, Chen L. DUCATI: a Dual-Cache training system for graph neural networks on giant graphs with the GPU. Proceedings of the ACM on Management of Data, 2023, 1(2): 166:1–166:24

2023

[27]

Large graph convolutional network training with GPU-oriented data communication architecture

Min S, Wu K, Huang S, Hidayetoglu M, Xiong J, Ebrahimi E, Chen D, Hwu W W. Large graph convolutional network training with GPU-oriented data communication architecture. Proceedings of the VLDB Endowment, 2021, 14(11): 2087–2100

2021

[28]

FastGCN: fast learning with graph convolutional networks via importance sampling

Chen J, Ma T, Xiao C. FastGCN: fast learning with graph convolutional networks via importance sampling. In: Proceedings of the 6th International Conference on Learning Representations. 2018

2018

[29]

Efficient neighbor-sampling-based GNN training on CPU-FPGA heterogeneous platform

Zhang B, Kuppannagari S R, Kannan R, Prasanna V K. Efficient neighbor-sampling-based GNN training on CPU-FPGA heterogeneous platform. In: Proceedings of 2021 IEEE High Performance Extreme Computing Conference, HPEC 2021. 2021, 1–7

2021

[30]

FreshGNN: reducing memory access via stable historical embeddings for graph neural network training

Huang K, Jiang H, Wang M, Xiao G, Wipf D, Song X, Gan Q, Huang Z, Zhai J, Zhang Z. FreshGNN: reducing memory access via stable historical embeddings for graph neural network training. Proceedings of the VLDB Endowment, 2024, 17(6): 1473–1486

2024

[31]

GNNAutoScale: scalable and expressive graph neural networks via historical embeddings

Fey M, Lenssen J E, Weichert F, Leskovec J. GNNAutoScale: scalable and expressive graph neural networks via historical embeddings. In: Proceedings of the 38th International Conference on Machine Learning. 2021, 3294–3304

2021

[32]

Marius: learning massive graph embeddings on a single machine

Mohoney J, Waleffe R, Xu H, Rekatsinas T, Venkataraman S. Marius: learning massive graph embeddings on a single machine. In: Proceedings of the 15th USENIX Symposium on Operating Systems Design and Implementation. 2021, 533–549

2021

[33]

WholeGraph: a fast graph neural network training framework with multi-GPU distributed shared memory architecture

Yang D, Liu J, Qi J, Lai J. WholeGraph: a fast graph neural network training framework with multi-GPU distributed shared memory architecture. In: Proceedings of International Conference for High Performance Computing, Networking, Storage and Analysis. 2022, 54:1–54:14

2022

[34]

SaGNN: a sample-based GNN training and inference hardware accelerator

Wang H, Zhang S, Feng K, Wang M, Yang Z. SaGNN: a sample-based GNN training and inference hardware accelerator. In: Proceedings of 2023 IEEE International Symposium on Circuits and Systems. 2023, 1–5

2023

[35]

An efficient sampling-based SpMM kernel for balancing accuracy and speed in GNN inference

Song Y, Wang Y, Xiong C, Wang T, Tang P. An efficient sampling-based SpMM kernel for balancing accuracy and speed in GNN inference. In: Proceedings of 2024 IEEE International Symposium on Parallel and Distributed Processing with Applications. 2024, 468–475

2024

[36]

SCGraph: accelerating sample-based GNN training by staged caching of features on GPUs

He Y, Lai Z, Ran Z, Zhang L, Li D. SCGraph: accelerating sample-based GNN training by staged caching of features on GPUs. In: Proceedings of 2022 IEEE International Conference on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking. 2022, 106–113

2022

[37]

Ascend: a scalable and unified architecture for ubiquitous deep neural network computing: industry track paper

Liao H, Tu J, Xia J, Liu H, Zhou X, Yuan H, Hu Y. Ascend: a scalable and unified architecture for ubiquitous deep neural network computing: industry track paper. In: Proceedings of 2021 IEEE International Symposium on High-Performance Computer Architecture. 2021, 789–801

2021

[38]

Performance evaluation of MindSpore and PyTorch based on Ascend NPU

Zhu Z, Wang B, Yang C, Zhu R, Zhou M, Zheng N. Performance evaluation of MindSpore and PyTorch based on Ascend NPU. In: Proceedings of the 29th IEEE International Conference on Parallel and Distributed Systems. 2023, 1826–1832

2023

[39]

In-datacenter performance analysis of a tensor processing unit

Jouppi N P, Young C, Patil N, Patterson D A, Agrawal G, Bajwa R, Bates S, Bhatia S, Boden N, Borchers A, Boyle R, Cantin P, Chao C, Clark C, Coriell J, Daley M, Dau M, Dean J, Gelb B, Ghaemmaghami T V, Gottipati R, Gulland W, Hagmann R, Ho C R, Hogberg D, Hu J, Hundt R, Hurt D, Ibarz J, Jaffey A, Jaworski A, Kaplan A, Khaitan H, Killebrew D, Koch A, Kumar...

[40]

Habana labs purpose-built AI inference and training processor architectures: Scaling AI training systems using standard ethernet with gaudi processor

Medina E, Dagan E. Habana labs purpose-built AI inference and training processor architectures: Scaling AI training systems using standard ethernet with gaudi processor. IEEE Micro, 2020, 40(2): 17–24

2020

[41]

AIbench: a tool for benchmarking Huawei Ascend AI processors

Xiao Y, Wang Z. AIbench: a tool for benchmarking Huawei Ascend AI processors. CCF Transactions on High Performance Computing, 2024, 6(2): 115–129

2024

[42]

Ascend-CC: confidential computing on heterogeneous NPU for emerging generative AI workloads

Dhar A, Thorens C, Lazier L M, Cavigelli L. Ascend-CC: confidential computing on heterogeneous NPU for emerging generative AI workloads. 2024, arXiv preprint arXiv: 2407.11888

2024

[43]

Machine learning fleet efficiency: analyzing and optimizing large-scale Google TPU systems with ML productivity goodput

Wongpanich A, Oguntebi T, Paredes J B, Wang Y E, Phothilimthana P M, Mitra R, Zhou Z, Kumar N, Reddi V J. Machine learning fleet efficiency: analyzing and optimizing large-scale Google TPU systems with ML productivity goodput. 2025, arXiv preprint arXiv: 2502.06982

2025

[44]

Tackling the dynamicity in a production LLM serving system with SOTA optimizations via hybrid prefill/decode/verify scheduling on efficient Meta-kernels

Song M, Tang X, Hou F, Li J, Wei W, Ma Y, Xiao R, Si H, Jiang D, Yin S, Hu Y, Long G. Tackling the dynamicity in a production LLM serving system with SOTA optimizations via hybrid prefill/decode/verify scheduling on efficient Meta-kernels. 2024, arXiv preprint arXiv: 2412.18106

2024

[45]

Analysis of performance and optimization in MindSpore on Ascend NPUs

Wang B, Yang C, Zhu R, Liu X, Zhou M, Zheng N. Analysis of performance and optimization in MindSpore on Ascend NPUs. In: Proceedings of the 29th IEEE International Conference on Parallel and Distributed Systems. 2023, 1701–1708

2023

[46]

Machine learning-enabled performance model for DNN applications and AI accelerator

Wu R, Li M, Li H, Chen T, Tian X, Xu X, Zhou B, Chen J, An H. Machine learning-enabled performance model for DNN applications and AI accelerator. In: Proceedings of the 2022 IEEE 24th Int Conf on High Performance Computing & Communications; 8th Int Conf on Data Science & Systems; 20th Int Conf on Smart City; 8th Int Conf on Dependability in Sensor, Cloud ...

2022

[47]

Unlocking high performance with low-bit NPUs and CPUs for highly optimized HPL-MxP on Cloud Brain II

Xue W, Yang K, Liu Y, Fan D, Xu P, Tian Y. Unlocking high performance with low-bit NPUs and CPUs for highly optimized HPL-MxP on Cloud Brain II. In: Proceedings of International Conference for High Performance Computing, Networking, Storage, and Analysis. 2024, 82

2024

[48]

Inductive representation learning on large graphs

Hamilton W L, Ying Z, Leskovec J. Inductive representation learning on large graphs. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017, 1024–1034

2017

[49]

Defining and evaluating network communities based on ground-truth

Yang J, Leskovec J. Defining and evaluating network communities based on ground-truth. In: Proceedings of the 12th IEEE International Conference on Data Mining. 2012, 745–754

2012

[50]

Predicting positive and negative links in online social networks

Leskovec J, Huttenlocher D P, Kleinberg J M. Predicting positive and negative links in online social networks. In: Proceedings of the 19th International Conference on World Wide Web. 2010, 641–650

2010

[51]

GCN meets GPU: decoupling “when to sample” from “how to sample”

Ramezani M, Cong W, Mahdavi M, Sivasubramaniam A, Kandemir M T. GCN meets GPU: decoupling “when to sample” from “how to sample”. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. 2020

2020

[52]

Community structure in large networks: natural cluster sizes and the absence of large well-defined clusters

Leskovec J, Lang K J, Dasgupta A, Mahoney M W. Community structure in large networks: natural cluster sizes and the absence of large well-defined clusters. Internet Mathematics, 2009, 6(1): 29–123

2009

[53]

Cube-fx: mapping Taylor expansion onto matrix multiplier-accumulators of Huawei Ascend AI processors

Tang Y, Zhou H, Ji Z, Wang C. Cube-fx: mapping Taylor expansion onto matrix multiplier-accumulators of Huawei Ascend AI processors. IEEE Transactions on Parallel and Distributed Systems, 2025, 36(6): 1115–1129

2025
