pith. sign in

arxiv: 2606.01161 · v1 · pith:WAO5AXBZnew · submitted 2026-05-31 · 💻 cs.DC

AcOrch: Accelerating Sampling-based GNN Training under CPU-NPU Heterogeneous Environments

Pith reviewed 2026-06-28 16:35 UTC · model grok-4.3

classification 💻 cs.DC
keywords sampling-based GNN trainingheterogeneous CPU-NPUtask orchestrationtwo-level pipelineAscend AI processorgraph neural networksNPU accelerationfeature gathering
0
0 comments X

The pith

AcOrch achieves 2.31x speedup on sampling-based GNN training by mapping tasks to CPU, AIC, and AIV units in a two-level pipeline on Ascend processors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AcOrch as a system to accelerate sampling-based Graph Neural Network training on CPU-NPU heterogeneous platforms such as the Ascend AI processor. It establishes that fine-grained task orchestration, which maps sampling, feature gathering, and model training to appropriate compute units, combined with a two-level pipelined execution model, overlaps these stages both across CPU and NPU and among units inside the NPU. This approach is presented as a way to maximize resource utilization in workloads where stages have mismatched resource needs and computation volumes. A sympathetic reader would care because sampling-based GNN training on large graphs is resource-intensive, and better overlap could reduce training time on existing NPU hardware without requiring new accelerators.

Core claim

AcOrch is a sampling-based GNN training system optimized for CPU-NPU heterogeneous platforms. It offers fine-grained task orchestration and adopts a two-level pipelined execution model to overlap sampling, gathering, and training. It analyzes the heterogeneous compute features of NPUs and maps tasks to AI Cube (AIC) units, AI Vector (AIV) units, and CPU cores accordingly. The two-level pipeline enables overlapping execution not only between the CPU and NPU, but also among different types of compute units within the NPU, thereby maximizing the utilization of available resources. Experiments on an Ascend 910B AI processor show that AcOrch achieves an average speedup of 2.31x over the state-of-

What carries the argument

The two-level pipelined execution model with fine-grained task orchestration that maps tasks to AIC units, AIV units, and CPU cores.

If this is right

  • Overlapping execution across CPU-NPU boundaries and within NPU units reduces idle time during the multi-stage training process.
  • Task mapping to specialized units allows each stage to run on the compute type best suited to its requirements.
  • The approach scales mini-batch training on sampled subgraphs for larger graphs by keeping more units busy.
  • Resource utilization improves without changes to the underlying GNN model or sampling algorithm.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same task-mapping and pipelining logic could be adapted to other NPUs or heterogeneous accelerators that expose distinct internal compute units.
  • The results suggest that orchestration overhead, rather than raw peak performance, is often the main limiter in current NPU-based GNN systems.
  • Dynamic adjustment of the pipeline depth based on graph size or batch characteristics might further improve results on varied workloads.

Load-bearing premise

The heterogeneous compute features of the NPU can be analyzed and tasks mapped to AIC, AIV, and CPU units such that the two-level pipeline overlaps sampling, gathering, and training with negligible synchronization or data-movement overhead on the target platform.

What would settle it

Running identical sampling-based GNN workloads on the Ascend 910B with and without the two-level pipeline and task mapping, then checking whether the measured speedup over MindSporeGL drops below the reported 2.31x average.

read the original abstract

Graph Neural Networks (GNNs) have achieved remarkable success in various applications. Sampling-based GNN training, which conducts mini-batch training on sampled subgraphs, has become a promising solution for large-scale graphs. Given the resource-intensive nature of sampling-based GNN training, Neural Processing Units (NPUs), such as the Ascend AI processor, offer a promising alternative due to their high throughput and energy efficiency, making them well-suited for GNN workloads. However, the multi-stage nature of sampling-based training, which involves subgraph sampling, feature gathering, and model training, with different resource requirements and computation volume. This requires careful coordination to fully utilize the heterogeneous computation resources of CPUs and NPUs. In this work, we present AcOrch, a sampling-based GNN training system optimized for CPU-NPU heterogeneous platforms. AcOrch offers fine-grained task orchestration and adopts a two-level pipelined execution model to overlap sampling, gathering, and training. It analyzes the heterogeneous compute features of NPUs and maps tasks to AI Cube (AIC) units, AI Vector (AIV) units, and CPU cores accordingly. Moreover, the two-level pipeline enables overlapping execution not only between the CPU and NPU, but also among different types of compute units within the NPU (e.g., AIC and AIV units), thereby maximizing the utilization of available resources. Experiments on an Ascend 910B AI processor show that AcOrch achieves an average speedup of 2.31x over the state-of-the-art NPU-native graph learning system, MindSporeGL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper presents AcOrch, a sampling-based GNN training system for CPU-NPU heterogeneous platforms (e.g., Ascend 910B). It introduces fine-grained task orchestration that maps tasks to AIC, AIV, and CPU units, combined with a two-level pipelined execution model to overlap subgraph sampling, feature gathering, and model training, claiming an average 2.31x speedup over the NPU-native baseline MindSporeGL.

Significance. If the performance claims are substantiated with complete experimental evidence, the work could meaningfully advance systems support for large-scale GNN training on specialized AI processors by demonstrating practical exploitation of intra-NPU heterogeneity via pipelining.

major comments (2)
  1. [Abstract] Abstract: The central claim of a 2.31x average speedup is presented without any description of the experimental setup, graph datasets, GNN model sizes/architectures, number of runs, variance across runs, or confirmation that the MindSporeGL baseline received equivalent tuning on the Ascend 910B. This absence prevents evaluation of the numerical result.
  2. [Abstract] Abstract: The two-level pipeline is asserted to overlap sampling/gathering/training with negligible synchronization and data-movement overhead after mapping to AIC/AIV/CPU units, yet no utilization metrics, per-stage timing breakdowns, or overhead measurements versus the baseline are supplied to substantiate that the weakest assumption holds.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We agree that the abstract should be more self-contained to allow readers to better assess the performance claims. We will revise the abstract accordingly while preserving its conciseness. Below we respond point by point to the major comments.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim of a 2.31x average speedup is presented without any description of the experimental setup, graph datasets, GNN model sizes/architectures, number of runs, variance across runs, or confirmation that the MindSporeGL baseline received equivalent tuning on the Ascend 910B. This absence prevents evaluation of the numerical result.

    Authors: We acknowledge the referee's point that the abstract, as currently written, does not provide sufficient context for the 2.31x claim. The full manuscript details the experimental setup (Ascend 910B platform, standard large-scale graph datasets, GNN architectures such as GCN and GraphSAGE, multiple independent runs with reported variance) and confirms equivalent tuning of the MindSporeGL baseline in the Experiments section. To address the concern directly, we will revise the abstract to include a concise summary of these elements. revision: yes

  2. Referee: [Abstract] Abstract: The two-level pipeline is asserted to overlap sampling/gathering/training with negligible synchronization and data-movement overhead after mapping to AIC/AIV/CPU units, yet no utilization metrics, per-stage timing breakdowns, or overhead measurements versus the baseline are supplied to substantiate that the weakest assumption holds.

    Authors: The manuscript supplies per-stage timing breakdowns, resource utilization measurements, and overhead comparisons versus the baseline in the evaluation section to support the pipelining claims. However, these details are not referenced in the abstract. We will revise the abstract to briefly note the availability of these supporting measurements and their key findings. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical speedup claim rests on hardware measurements, not derivations or self-referential fits

full rationale

The paper describes a systems implementation (AcOrch) with task mapping to AIC/AIV/CPU units and a two-level pipeline, then reports measured wall-clock speedups (2.31x average) versus MindSporeGL on Ascend 910B hardware. No equations, fitted parameters, or mathematical derivations appear in the provided text that could reduce to inputs by construction. The central performance claim is an empirical observation, not a prediction derived from prior results or self-citations. No load-bearing self-citation chains, ansatzes, or renamings are present. This is a standard non-circular systems paper whose validity hinges on experimental reproducibility rather than definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, mathematical axioms, or invented entities are introduced; the contribution is an engineering orchestration layer whose correctness depends on platform-specific performance characteristics rather than new theoretical constructs.

pith-pipeline@v0.9.1-grok · 5829 in / 1136 out tokens · 25838 ms · 2026-06-28T16:35:14.661711+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

62 extracted references · 7 canonical work pages · 1 internal anchor

  1. [1]

    As shown in Fig

    Case 1: Assigning both sampling and gathering to the CPU. As shown in Fig. 5 (a), during each iteration, the CPU first performs neighbor sampling, then gathers the required node features, and finally transfers the prepared graph topology and feature data in batches to the NPU for training. Throughout the data preparation phase, the NPU remains idle and on...

  2. [2]

    As shown in Fig

    Case 2: Assigning sampling to the CPU and gathering to the AIV. As shown in Fig. 5 (b), the CPU first performs neighbor sampling and then sends the sampled subgraph topology to the NPU. The AIV is responsible for gathering the corresponding node features from the NPU’s memory, and together with the subgraph topology, delivers them to the AIC for training....

  3. [3]

    As shown in Fig

    Case 3: Assigning sampling to the AIV and gathering to the CPU. As shown in Fig. 5 (c), the AIV first completes graph sampling, after which the CPU collects the corresponding node features from main memory and transfers the processed feature data in batches to the NPU for training by the AIC. In this mode, the gathering step is again constrained by the PC...

  4. [4]

    As shown in Fig

    Case 4: Assigning both sampling and gathering to the AIV. As shown in Fig. 5 (d), the graph topology and feature data are cached in the NPU, and the AIV1 first performs subgraph topology sampling, followed by the AIV2 collecting the corresponding node features from NPU memory. Finally, both are sent to the AIC for training. This approach better leverages ...

  5. [5]

    NeutronOrch: rethinking sample-based GNN training under CPU- GPU heterogeneous environments

    Ai X, Wang Q, Cao C, Zhang Y, Chen C, Yuan H, Gu Y, Yu G. NeutronOrch: rethinking sample-based GNN training under CPU- GPU heterogeneous environments. Proceedings of the VLDB Endow- ment, 2024, 17(8): 1995–2008

  6. [6]

    Graph attention networks for neural social recommendation

    Mu N, Zha D, He Y, Tang Z. Graph attention networks for neural social recommendation. In: Proceedings of the 31st IEEE International Conference on Tools with Artificial Intelligence. 2019, 1320–1327

  7. [7]

    NeutronStar: distributed GNN training with hybrid dependency management

    Wang Q, Zhang Y, Wang H, Chen C, Zhang X, Yu G. NeutronStar: distributed GNN training with hybrid dependency management. In: Proceedings of 2022 International Conference on Management of Data. 2022, 1301–1315

  8. [8]

    Deep Graph Library: A Graph-Centric, Highly-Performant Package for Graph Neural Networks

    Wang M, Yu L, Zheng D, Gan Q, Gai Y, Ye Z, Li M, Zhou J, Huang Q, Ma C, Huang Z, Guo Q, Zhang H, Lin H, Zhao J, Li J, Smola A J, Zhang Z. Deep graph library: towards efficient and scalable deep learning on graphs. 2019, arXiv preprint arXiv: 1909.01315

  9. [9]

    XGCN: a library for large- scale graph neural network recommendations

    Song X, Huang H, Lian J, Jin H. XGCN: a library for large- scale graph neural network recommendations. Frontiers of Computer Science, 2024, 18(3): 183343

  10. [10]

    Semi-supervised classification with graph convolutional networks

    Kipf T N, Welling M. Semi-supervised classification with graph convolutional networks. In: Proceedings of the 5th International Con- ference on Learning Representations. 2017

  11. [11]

    TurboGNN: improving the end-to- end performance for sampling-based GNN training on GPUs

    Wu W, Shi X, He L, Jin H. TurboGNN: improving the end-to- end performance for sampling-based GNN training on GPUs. IEEE Transactions on Computers, 2023, 72(9): 2571–2584

  12. [12]

    Sampling meth- ods for efficient training of graph convolutional networks: a survey

    Liu X, Yan M, Deng L, Li G, Ye X, Fan D. Sampling meth- ods for efficient training of graph convolutional networks: a survey. IEEE/CAA Journal of Automatica Sinica, 2022, 9(2): 205–234

  13. [13]

    A comprehen- sive survey on graph neural networks

    Wu Z, Pan S, Chen F, Long G, Zhang C, Yu P S. A comprehen- sive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, 2021, 32(1): 4–24

  14. [14]

    A comprehensive survey on graph neural network accelerators

    Liu J, Chen S, Shen L. A comprehensive survey on graph neural network accelerators. Frontiers of Computer Science, 2025, 19(2): 192104

  15. [15]

    A survey of dynamic graph neural net- works

    Zheng Y, Yi L, Wei Z. A survey of dynamic graph neural net- works. Frontiers of Computer Science, 2025, 19(6): 196323

  16. [16]

    SAN- CUS: staleness-aware communication-avoiding full-graph decentral- ized training in large-scale graph neural networks

    Peng J, Chen Z, Shao Y, Shen Y, Chen L, Cao J. SAN- CUS: staleness-aware communication-avoiding full-graph decentral- ized training in large-scale graph neural networks. Proceedings of the VLDB Endowment, 2022, 15(9): 1937–1950

  17. [17]

    ASA-GNN: adaptive sampling and aggregation-based graph neural network for transaction fraud de- tection

    Tian Y, Liu G, Wang J, Zhou M. ASA-GNN: adaptive sampling and aggregation-based graph neural network for transaction fraud de- tection. IEEE Transactions on Computational Social Systems, 2024, 11(3): 3536–3549

  18. [18]

    Tissue specific tumor-gene link prediction through sampling based GNN using a heterogeneous network

    Mishra S, Singh G, Bhattacharya M. Tissue specific tumor-gene link prediction through sampling based GNN using a heterogeneous network. Medical & Biological Engineering & Computing, 2024, 62(8): 2499–2510

  19. [19]

    PGSampler: accelerating GPU- based graph sampling in GNN systems via workload fusion

    Wei X, Tang W, Qi H, Yue H. PGSampler: accelerating GPU- based graph sampling in GNN systems via workload fusion. In: Pro- ceedings of 2024 IEEE International Conference on Cluster Comput- ing. 2024, 51–61

  20. [20]

    Scalable graph neural network training: the case for sampling

    Serafini M. Scalable graph neural network training: the case for sampling. ACM SIGOPS Operating Systems Review, 2021, 55(1): 68–76

  21. [21]

    Efficient data loader for fast sampling-based GNN training on large graphs

    Bai Y, Li C, Lin Z, Wu Y, Miao Y, Liu Y, Xu Y. Efficient data loader for fast sampling-based GNN training on large graphs. IEEE Transactions on Parallel and Distributed Systems, 2021, 32(10): 2541–2556

  22. [22]

    FastGL: a GPU- efficient framework for accelerating sampling-based GNN training at large scale

    Zhu Z, Wang P, Hu Q, Li G, Liang X, Cheng J. FastGL: a GPU- efficient framework for accelerating sampling-based GNN training at large scale. In: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. 2024, 94–110

  23. [23]

    A local graph limits perspective on sampling-based GNNs

    Alimohammadi Y, Ruiz L, Saberi A. A local graph limits perspective on sampling-based GNNs. 2023, arXiv preprint arXiv: 2310.10953

  24. [24]

    GNNLab: a factored system for sample-based GNN training over GPUs

    Yang J, Tang D, Song X, Wang L, Yin Q, Chen R, Yu W, Zhou J. GNNLab: a factored system for sample-based GNN training over GPUs. In: Proceedings of the 17th European Conference on Computer Systems. 2022, 417–434 FrontiersofComputer Science|Issue 5|Volume 21|May 2027|2105103–14 Front. Comput. Sci., 2027, 21(5): 2105103

  25. [25]

    Graph neural network training and data tiering

    Min S, Wu K, Hidayetoglu M, Xiong J, Song X, Hwu W. Graph neural network training and data tiering. In: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Min- ing. 2022, 3555–3565

  26. [26]

    DUCATI: a Dual-Cache training system for graph neural networks on giant graphs with the GPU

    Zhang X, Shen Y, Shao Y, Chen L. DUCATI: a Dual-Cache training system for graph neural networks on giant graphs with the GPU. Proceedings of the ACM on Management of Data, 2023, 1(2): 166:1–166:24

  27. [27]

    Large graph convolutional network training with GPU-oriented data communication architecture

    Min S, Wu K, Huang S, Hidayetoglu M, Xiong J, Ebrahimi E, Chen D, Hwu W W. Large graph convolutional network training with GPU-oriented data communication architecture. Proceedings of the VLDB Endowment, 2021, 14(11): 2087–2100

  28. [28]

    FastGCN: fast learning with graph con- volutional networks via importance sampling

    Chen J, Ma T, Xiao C. FastGCN: fast learning with graph con- volutional networks via importance sampling. In: Proceedings of the 6th International Conference on Learning Representations. 2018

  29. [29]

    Efficient neighbor-sampling-based GNN training on CPU-FPGA heterogeneous platform

    Zhang B, Kuppannagari S R, Kannan R, Prasanna V K. Efficient neighbor-sampling-based GNN training on CPU-FPGA heterogeneous platform. In: Proceedings of 2021 IEEE High Performance Extreme Computing Conference, HPEC 2021. 2021, 1–7

  30. [30]

    FreshGNN: reducing memory access via stable historical embeddings for graph neural network training

    Huang K, Jiang H, Wang M, Xiao G, Wipf D, Song X, Gan Q, Huang Z, Zhai J, Zhang Z. FreshGNN: reducing memory access via stable historical embeddings for graph neural network training. Proceedings of the VLDB Endowment, 2024, 17(6): 1473–1486

  31. [31]

    GNNAutoScale: scalable and expressive graph neural networks via historical embed- dings

    Fey M, Lenssen J E, Weichert F, Leskovec J. GNNAutoScale: scalable and expressive graph neural networks via historical embed- dings. In: Proceedings of the 38th International Conference on Ma- chine Learning. 2021, 3294–3304

  32. [32]

    Marius: learning massive graph embeddings on a single machine

    Mohoney J, Waleffe R, Xu H, Rekatsinas T, Venkataraman S. Marius: learning massive graph embeddings on a single machine. In: Proceedings of the 15th USENIX Symposium on Operating Systems Design and Implementation. 2021, 533–549

  33. [33]

    WholeGraph: a fast graph neural network training framework with multi-GPU distributed shared mem- ory architecture

    Yang D, Liu J, Qi J, Lai J. WholeGraph: a fast graph neural network training framework with multi-GPU distributed shared mem- ory architecture. In: Proceedings of International Conference for High Performance Computing, Networking, Storage and Analysis. 2022, 54:1–54:14

  34. [34]

    SaGNN: a sample- based GNN training and inference hardware accelerator

    Wang H, Zhang S, Feng K, Wang M, Yang Z. SaGNN: a sample- based GNN training and inference hardware accelerator. In: Proceed- ings of 2023 IEEE International Symposium on Circuits and Systems. 2023, 1–5

  35. [35]

    An efficient sampling- based SpMM kernel for balancing accuracy and speed in GNN infer- ence

    Song Y, Wang Y, Xiong C, Wang T, Tang P. An efficient sampling- based SpMM kernel for balancing accuracy and speed in GNN infer- ence. In: Proceedings of 2024 IEEE International Symposium on Parallel and Distributed Processing with Applications. 2024, 468–475

  36. [36]

    SCGraph: accelerating sample-based GNN training by staged caching of features on GPUs

    He Y, Lai Z, Ran Z, Zhang L, Li D. SCGraph: accelerating sample-based GNN training by staged caching of features on GPUs. In: Proceedings of 2022 IEEE International Conference on Parallel & Distributed Processing with Applications, Big Data & Cloud Comput- ing, Sustainable Computing & Communications, Social Computing & Networking. 2022, 106–113

  37. [37]

    Ascend: a scalable and unified architecture for ubiquitous deep neural network computing : industry track paper

    Liao H, Tu J, Xia J, Liu H, Zhou X, Yuan H, Hu Y. Ascend: a scalable and unified architecture for ubiquitous deep neural network computing : industry track paper. In: Proceedings of 2021 IEEE In- ternational Symposium on High-Performance Computer Architecture. 2021, 789–801

  38. [38]

    Performance evaluation of MindSpore and PyTorch based on Ascend NPU

    Zhu Z, Wang B, Yang C, Zhu R, Zhou M, Zheng N. Performance evaluation of MindSpore and PyTorch based on Ascend NPU. In: Proceedings of the 29th IEEE International Conference on Parallel and Distributed Systems. 2023, 1826–1832

  39. [39]

    In-datacenter performance analysis of a tensor processing unit

    Jouppi N P, Young C, Patil N, Patterson D A, Agrawal G, Bajwa R, Bates S, Bhatia S, Boden N, Borchers A, Boyle R, Cantin P, Chao C, Clark C, Coriell J, Daley M, Dau M, Dean J, Gelb B, Ghaemmaghami T V, Gottipati R, Gulland W, Hagmann R, Ho C R, Hogberg D, Hu J, Hundt R, Hurt D, Ibarz J, Jaffey A, Jaworski A, Kaplan A, Khaitan H, Killebrew D, Koch A, Kumar...

  40. [40]

    Habana labs purpose-built AI inference and training processor architectures: Scaling AI training systems using standard ethernet with gaudi processor

    Medina E, Dagan E. Habana labs purpose-built AI inference and training processor architectures: Scaling AI training systems using standard ethernet with gaudi processor. IEEE Micro, 2020, 40(2): 17–24

  41. [41]

    AIbench: a tool for benchmarking Huawei Ascend AI processors

    Xiao Y, Wang Z. AIbench: a tool for benchmarking Huawei Ascend AI processors. CCF Transactions on High Performance Com- puting, 2024, 6(2): 115–129

  42. [42]

    Ascend-CC: confi- dential computing on heterogeneous NPU for emerging generative AI workloads

    Dhar A, Thorens C, Lazier L M, Cavigelli L. Ascend-CC: confi- dential computing on heterogeneous NPU for emerging generative AI workloads. 2024, arXiv preprint arXiv: 2407.11888

  43. [43]

    Ma- chine learning fleet efficiency: analyzing and optimizing large-scale Google TPU systems with ML productivity goodput

    Wongpanich A, Oguntebi T, Paredes J B, Wang Y E, Phothilimthana P M, Mitra R, Zhou Z, Kumar N, Reddi V J. Ma- chine learning fleet efficiency: analyzing and optimizing large-scale Google TPU systems with ML productivity goodput. 2025, arXiv preprint arXiv: 2502.06982

  44. [44]

    Tackling the dynamicity in a produc- tion LLM serving system with SOTA optimizations via hybrid pre- fill/decode/verify scheduling on efficient Meta-kernels

    Song M, Tang X, Hou F, Li J, Wei W, Ma Y, Xiao R, Si H, Jiang D, Yin S, Hu Y, Long G. Tackling the dynamicity in a produc- tion LLM serving system with SOTA optimizations via hybrid pre- fill/decode/verify scheduling on efficient Meta-kernels. 2024, arXiv preprint arXiv: 2412.18106

  45. [45]

    Analysis of performance and optimization in MindSpore on Ascend NPUs

    Wang B, Yang C, Zhu R, Liu X, Zhou M, Zheng N. Analysis of performance and optimization in MindSpore on Ascend NPUs. In: Proceedings of the 29th IEEE International Conference on Parallel and Distributed Systems. 2023, 1701–1708

  46. [46]

    Machine learning-enabled performance model for DNN applications and AI accelerator

    Wu R, Li M, Li H, Chen T, Tian X, Xu X, Zhou B, Chen J, An H. Machine learning-enabled performance model for DNN applications and AI accelerator. In: Proceedings of the 2022 IEEE 24th Int Conf on High Performance Computing & Communications; 8th Int Conf on Data Science & Systems; 20th Int Conf on Smart City; 8th Int Conf on Dependability in Sensor, Cloud ...

  47. [47]

    Unlocking high performance with low-bit NPUs and CPUs for highly optimized HPL- MxP on Cloud Brain II

    Xue W, Yang K, Liu Y, Fan D, Xu P, Tian Y. Unlocking high performance with low-bit NPUs and CPUs for highly optimized HPL- MxP on Cloud Brain II. In: Proceedings of International Conference for High Performance Computing, Networking, Storage, and Analysis. 2024, 82

  48. [48]

    Inductive representation learning on large graphs

    Hamilton W L, Ying Z, Leskovec J. Inductive representation learning on large graphs. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017, 1024– 1034

  49. [49]

    Defining and evaluating network communi- ties based on ground-truth

    Yang J, Leskovec J. Defining and evaluating network communi- ties based on ground-truth. In: Proceedings of the 12th IEEE Interna- tional Conference on Data Mining. 2012, 745–754

  50. [50]

    Predicting positive and negative links in online social networks

    Leskovec J, Huttenlocher D P, Kleinberg J M. Predicting positive and negative links in online social networks. In: Proceedings of the 19th International Conference on World Wide Web. 2010, 641–650

  51. [51]

    when to sample

    Ramezani M, Cong W, Mahdavi M, Sivasubramaniam A, Kan- demir M T. GCN meets GPU: decoupling “when to sample” from “how to sample”. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. 2020

  52. [52]

    Community structure in large networks: natural cluster sizes and the absence of large well-defined clusters

    Leskovec J, Lang K J, Dasgupta A, Mahoney M W. Community structure in large networks: natural cluster sizes and the absence of large well-defined clusters. Internet Mathematics, 2009, 6(1): 29–123

  53. [53]

    Cube-fx: mapping Taylor ex- pansion onto matrix multiplier-accumulators of Huawei Ascend AI processors

    Tang Y, Zhou H, Ji Z, Wang C. Cube-fx: mapping Taylor ex- pansion onto matrix multiplier-accumulators of Huawei Ascend AI processors. IEEE Transactions on Parallel and Distributed Systems, 2025, 36(6): 1115–1129

  54. [54]

    High- utilization GPGPU design for accelerating GEMM workloads: an incremental approach

    Wang C, Song P, Zhao H, Zhang F, Wang J, Zhang L. High- utilization GPGPU design for accelerating GEMM workloads: an incremental approach. In: Proceedings of 2024 IEEE International Symposium on Circuits and Systems. 2024, 1–5

  55. [55]

    HBM-based hardware accelerator for GNN sampling and aggregation

    Gui Y, Wu Q, Yuan W, Liang H, Wang X, Jin X. HBM-based hardware accelerator for GNN sampling and aggregation. In: Proceed- ings of 2024 IEEE High Performance Extreme Computing Conference. 2024, 1–7

  56. [56]

    HongTu: scalable full-graph GNN training on multiple GPUs (via communication-optimized CPU data offloading)

    Wang Q, Chen Y, Wong W, He B. HongTu: scalable full-graph GNN training on multiple GPUs (via communication-optimized CPU data offloading). 2023, arXiv preprint arXiv: 2311.14898

  57. [57]

    Quiver: supporting GPUs for low-latency, high-throughput GNN serving with workload awareness

    Tan Z, Yuan X, He C, Sit M, Li G, Liu X, Ai B, Zeng K, Pietzuch P R, Mai L. Quiver: supporting GPUs for low-latency, high-throughput GNN serving with workload awareness. 2023, arXiv preprint arXiv: 2305.10863

  58. [58]

    Principal component analysis in the local differ- ential privacy model

    Wang D, Xu J. Principal component analysis in the local differ- ential privacy model. In: Proceedings of the 28th International Joint Conference on Artificial Intelligence, IJCAI 2019. 2019, 4795–4801

  59. [59]

    Accelerating graph sampling for graph machine learning using GPUs

    Jangda A, Polisetty S, Guha A, Serafini M. Accelerating graph sampling for graph machine learning using GPUs. In: Proceedings of the 16th European Conference on Computer Systems. 2021, 311–326

  60. [60]

    PaGraph: scaling GNN training on large graphs via computation-aware caching

    Lin Z, Li C, Miao Y, Liu Y, Xu Y. PaGraph: scaling GNN training on large graphs via computation-aware caching. In: Proceedings of the 11th ACM Symposium on Cloud Computing. 2020, 401–415

  61. [61]

    Neutronascend: Optimizing gnn training with ascend ai processors

    Ai X, Zhang B, Wang Q, Zhang Y, Yuan H, Gong S, Yu G. Neutronascend: Optimizing gnn training with ascend ai processors. ACM Transactions on Architecture and Code Optimization, 2025

  62. [62]

    Paper of Distinction

    Tang Y, Wang C. Performance modeling on DaVinci AI core. Journal of Parallel and Distributed Computing, 2023, 175: 134–149 Kefu Chen is currently a master’s student in computer sci- ence at Northeastern University, China. His major research interests include acceleration of graph computing and learn- ing system on emerging hardware. Xin Ai is currently wo...