arxiv: 1906.09702 · v1 · pith:LEA5E5DLnew · submitted 2019-06-24 · 💻 cs.DC · cs.PL

Heterogeneous Active Messages (HAM) -- Implementing Lightweight Remote Procedure Calls in C++

Pith reviewed 2026-05-25 17:39 UTC · model grok-4.3

classification 💻 cs.DC cs.PL
keywords active messages · remote procedure calls · heterogeneous systems · C++ template metaprogramming · offloading · address translation · distributed computing

The pith

HAM uses C++ template metaprogramming to generate active message types and handlers for lightweight RPC across heterogeneous architectures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces HAM as a C++-only active messaging library for distributed systems that span different processors and binaries. Template meta-programming automatically creates the message types and their handler functions without explicit definitions. An address translation step maps handler locations between processes that run on mismatched architectures. When paired with a communication protocol this becomes a generic RPC mechanism. The same design has been applied to offload work to accelerators such as Xeon Phi and SX-Aurora while exposing per-type hooks for serialization.

Core claim

HAM uses template meta-programming to implicitly generate active message types and their corresponding handler functions. Heterogeneity is enabled by providing an efficient address translation mechanism between the individual handler code addresses of processes running different binaries on different architectures, as well as hooks to inject serialisation and deserialisation code on a per-type basis.
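
The implicit generation of message types and handlers can be pictured with a small C++14 sketch. Every name in it (`msg_base`, `active_msg`, `dispatch`, `call_add_via_message`) is hypothetical and not taken from HAM's API. Sender and receiver are collapsed into one process, and functions returning `void` are not handled.

```cpp
#include <cstddef>
#include <tuple>
#include <utility>

// All names below are hypothetical and chosen for illustration only.
struct msg_base;
using handler_ptr = void (*)(const msg_base& msg, void* result);

// Common header: the part of a message a receiver can use without
// knowing the concrete message type.
struct msg_base {
    handler_ptr handler;
};

template <typename FnPtr, FnPtr Fn>
struct active_msg;  // only the function-pointer specialisation is defined

// One message type per function. Instantiating it also instantiates the
// matching static handler, so no handler is written by hand.
template <typename R, typename... Args, R (*Fn)(Args...)>
struct active_msg<R (*)(Args...), Fn> : msg_base {
    std::tuple<Args...> args;  // the call's arguments travel inside the message

    explicit active_msg(Args... a)
        : msg_base{&active_msg::handle}, args(std::move(a)...) {}

    // Generated handler: recover the concrete type, unpack, call Fn.
    static void handle(const msg_base& msg, void* result) {
        const auto& self = static_cast<const active_msg&>(msg);
        *static_cast<R*>(result) = self.invoke(std::index_sequence_for<Args...>{});
    }

private:
    template <std::size_t... I>
    R invoke(std::index_sequence<I...>) const {
        return Fn(std::get<I>(args)...);
    }
};

// Receiver side: needs only the header, never the concrete type.
void dispatch(const msg_base& msg, void* result) { msg.handler(msg, result); }

int add(int a, int b) { return a + b; }

// Sender and receiver collapsed into one process for this sketch.
int call_add_via_message(int a, int b) {
    const active_msg<decltype(&add), &add> msg(a, b);
    int result = 0;
    dispatch(msg, &result);
    return result;
}
```

The handler pointer stored in the header is valid only inside the binary that created it, which is the gap the address translation step has to close.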

What carries the argument

Template meta-programming for implicit generation of active message types and handlers, together with an address translation mechanism that maps handler addresses across different binaries and architectures.
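
One way such a translation could work is sketched below, under a loud assumption: every binary registers the same handlers under names that all binaries agree on, and sorts its table by name, so that a table index becomes a portable key. The text reviewed here does not spell out the algorithm, so `handler_table`, `key_of`, `addr_of`, and the name-sorting strategy are illustrative guesses, not HAM's implementation. Two tables in one process stand in for two binaries.

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

using handler_fn = int (*)(int);

struct handler_entry {
    std::string name;  // identifier shared by all binaries (assumption)
    handler_fn addr;   // address that is valid only inside one binary
};

// Hypothetical per-binary handler table.
class handler_table {
public:
    void register_handler(std::string name, handler_fn addr) {
        entries_.push_back({std::move(name), addr});
    }
    // Sorting by name gives every binary the same index for the same handler.
    void finalize() {
        std::sort(entries_.begin(), entries_.end(),
                  [](const handler_entry& a, const handler_entry& b) {
                      return a.name < b.name;
                  });
    }
    // Sender side: local address -> portable key (linear search for brevity).
    std::size_t key_of(handler_fn addr) const {
        for (std::size_t i = 0; i < entries_.size(); ++i)
            if (entries_[i].addr == addr) return i;
        throw std::out_of_range("handler not registered");
    }
    // Receiver side: portable key -> local address, a plain array lookup.
    handler_fn addr_of(std::size_t key) const { return entries_.at(key).addr; }

private:
    std::vector<handler_entry> entries_;
};

// Stand-ins for the same two handlers compiled into two different binaries.
int host_square(int x) { return x * x; }
int host_negate(int x) { return -x; }
int device_square(int x) { return x * x; }
int device_negate(int x) { return -x; }

// Simulates one remote call: the host turns its local address into a key,
// the device turns the key back into its own local address and calls it.
int call_on_device(handler_fn host_addr, int argument) {
    handler_table host, device;
    host.register_handler("negate", &host_negate);
    host.register_handler("square", &host_square);
    device.register_handler("square", &device_square);  // different order on purpose
    device.register_handler("negate", &device_negate);
    host.finalize();
    device.finalize();
    const std::size_t key = host.key_of(host_addr);
    return device.addr_of(key)(argument);
}
```

Without the `finalize()` calls the two tables would disagree on the index of `square`, and the device would run the wrong handler. That failure mode is the one a correctness test of the real mechanism would look for.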

If this is right

  • Combined with any communication protocol, HAM supplies a generic RPC facility.
  • It supports low-overhead offloading between CPUs and accelerators such as the Xeon Phi and SX-Aurora.
  • Per-type hooks allow custom serialization code to be inserted without changing the core message generation.
  • The implementation surfaces gaps in the current C++ standard for distributed heterogeneous code.
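
The per-type hooks in the third point can be pictured as a trait that is specialised for each type needing custom treatment. The sketch below is an assumption about the general shape of such hooks, not HAM's interface: `serializer`, `buffer`, and `round_trip_ok` are invented names. It ignores endianness and integer-width differences between architectures and does no bounds checking.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <type_traits>
#include <vector>

using buffer = std::vector<unsigned char>;

// Default hook (hypothetical): byte-wise copy, valid only for trivially
// copyable types and only between architectures with matching layouts.
template <typename T>
struct serializer {
    static_assert(std::is_trivially_copyable<T>::value,
                  "provide a serializer<T> specialisation for this type");

    static void write(buffer& out, const T& value) {
        const auto* p = reinterpret_cast<const unsigned char*>(&value);
        out.insert(out.end(), p, p + sizeof(T));
    }
    static T read(const buffer& in, std::size_t& pos) {
        T value;
        std::memcpy(&value, in.data() + pos, sizeof(T));
        pos += sizeof(T);
        return value;
    }
};

// Per-type hook: std::string is sent as its length followed by its characters.
template <>
struct serializer<std::string> {
    static void write(buffer& out, const std::string& s) {
        serializer<std::size_t>::write(out, s.size());
        out.insert(out.end(), s.begin(), s.end());
    }
    static std::string read(const buffer& in, std::size_t& pos) {
        const std::size_t n = serializer<std::size_t>::read(in, pos);
        std::string s(reinterpret_cast<const char*>(in.data() + pos), n);
        pos += n;
        return s;
    }
};

// Writes an int and a string to one buffer and reads both back.
bool round_trip_ok() {
    buffer wire;
    serializer<int>::write(wire, 42);
    serializer<std::string>::write(wire, std::string("ham"));

    std::size_t pos = 0;
    const int i = serializer<int>::read(wire, pos);
    const std::string s = serializer<std::string>::read(wire, pos);
    return i == 42 && s == "ham" && pos == wire.size();
}
```

Because the hook is selected by type, adding support for a new argument type means adding one specialisation and leaves the message generation code untouched.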

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The technique could reduce boilerplate in other mixed-architecture HPC codes that currently hand-write message formats.
  • Similar metaprogramming patterns might be explored for languages that lack C++-style templates but target heterogeneous hardware.
  • Wider use could motivate clearer language rules for code-address portability in distributed settings.

Load-bearing premise

C++ template metaprogramming and address translation can be made to work reliably across heterogeneous binaries and architectures without violating language rules or incurring high runtime cost.

What would settle it

A concrete test in which address translation between two processes on different architectures produces an incorrect handler invocation or adds measurable overhead to the call path.
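
A rough in-process stand-in for such a test is sketched below. It times a key lookup followed by an indirect call and counts how often the handler actually ran. The names (`counting_handler`, `measure_translated_calls`, `measurement`) are hypothetical, and a single-process loop cannot settle the cross-architecture question. It only shows the shape a microbenchmark might take.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

using handler_fn = void (*)(unsigned long&);

// Nearly empty handler: it only records that it was invoked.
void counting_handler(unsigned long& calls) { ++calls; }

struct measurement {
    unsigned long calls;  // how often the handler actually ran
    double ns_per_call;   // mean wall-clock cost of lookup plus call
};

// Hypothetical microbenchmark: the vector stands in for a receiver-side
// handler table, and the key stands in for a value read from a message.
measurement measure_translated_calls(unsigned long iterations) {
    const std::vector<handler_fn> table{&counting_handler};
    volatile std::size_t key = 0;  // volatile so the lookup is not folded away
    unsigned long calls = 0;

    const auto start = std::chrono::steady_clock::now();
    for (unsigned long i = 0; i < iterations; ++i) {
        table[key](calls);
    }
    const auto stop = std::chrono::steady_clock::now();

    const double ns =
        std::chrono::duration<double, std::nano>(stop - start).count();
    const double per_call =
        iterations != 0 ? ns / static_cast<double>(iterations) : 0.0;
    return {calls, per_call};
}
```

A mismatch between `calls` and the requested iteration count would indicate a dispatch error. Comparing `ns_per_call` against a direct call would give a first estimate of the translation cost on one machine.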

Figures

Figures reproduced from arXiv: 1906.09702 by Matthias Noack.

Figure 1: HAM in the context of the HAM-Offload framework [16]. The Heterogeneous Active Message (HAM) mechanism is used to implement the HAM-Offload API to offload function calls to other processes running on local or remote resources like CPUs or accelerators. The communication between processes is provided by an abstract Communication Backend, for which multiple implementations exist …
Figure 3: Comparison of the offload overhead of vendor-provided solutions and HAM-Offload, measured as the time for offloading an empty function. For the Intel Xeon Phi, Intel LEO [12] is used as vendor solution following the same microbenchmark as published in [16]. For offloading to the NEC Vector Engine (VE), we show the numbers published in [15], measured on an NEC SX-Aurora TSUBASA A300-8 system with NEC VEO a…
Figure 5: Sequence of entities and transformations for o…
Figure 4: A class diagram showing the main entities of HAM and putting them into relation with each other.
Figure 6: The message-handler address translation between …
Figure 7: An output of the actual handler tables generated by HAM in the context of HAM-O…
Original abstract

We present HAM (Heterogeneous Active Messages), a C++-only active messaging solution for heterogeneous distributed systems. Combined with a communication protocol, HAM can be used as a generic Remote Procedure Call (RPC) mechanism. It has been used in HAM-Offload to implement a low-overhead offloading framework for inter- and intra-node offloading between different architectures including accelerators like the Intel Xeon Phi x100 series and the NEC SX-Aurora TSUBASA Vector Engine. HAM uses template meta-programming to implicitly generate active message types and their corresponding handler functions. Heterogeneity is enabled by providing an efficient address translation mechanism between the individual handler code addresses of processes running different binaries on different architectures, as well as hooks to inject serialisation and deserialisation code on a per-type basis. Implementing such a solution in modern C++ sheds some light on the shortcomings and grey areas of the C++ standard when it comes to distributed and heterogeneous environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript presents Heterogeneous Active Messages (HAM), a C++-only active messaging library for heterogeneous distributed systems that can serve as a generic RPC mechanism when combined with a communication protocol. It has been deployed in the HAM-Offload framework for low-overhead offloading between architectures including Intel Xeon Phi and NEC SX-Aurora. HAM relies on template metaprogramming to implicitly generate active message types and handler functions, supplies an address translation mechanism to map handler code addresses across processes with different binaries and architectures, and provides per-type hooks for serialization/deserialization. The authors also discuss shortcomings and grey areas of the C++ standard for distributed and heterogeneous environments.

Significance. If the address translation mechanism can be shown to be portable, standards-compliant, and low-overhead, the work would supply a practical, dependency-free C++ solution for lightweight RPC in heterogeneous HPC settings and could usefully inform future language extensions.

major comments (2)
  1. [Abstract] Abstract: the central claim of an 'efficient address translation mechanism' that maps handler code addresses between processes running different binaries on different architectures is load-bearing for the heterogeneity guarantee, yet no algorithm, casting strategy, registry approach, memory-model assumptions, or overhead data are supplied; the text only notes 'grey areas of the C++ standard.'
  2. [Abstract] Abstract: the assertion that HAM 'has been used in HAM-Offload to implement a low-overhead offloading framework' is unsupported by any benchmarks, error analysis, or performance derivation, leaving the 'lightweight' and practical-utility claims unverified.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their review and the opportunity to respond to the comments on our manuscript. We address each major comment below.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim of an 'efficient address translation mechanism' that maps handler code addresses between processes running different binaries on different architectures is load-bearing for the heterogeneity guarantee, yet no algorithm, casting strategy, registry approach, memory-model assumptions, or overhead data are supplied; the text only notes 'grey areas of the C++ standard.'

    Authors: The abstract is a concise summary and does not contain the implementation details. The manuscript body describes the address translation mechanism, including the registry-based approach for mapping handler addresses across different binaries and architectures, the use of template metaprogramming, and our analysis of C++ standard limitations. Overhead considerations are discussed in the context of the design. We agree the abstract could better signal the presence of these details and will revise it to include a brief qualifier or section reference. revision: yes

  2. Referee: [Abstract] Abstract: the assertion that HAM 'has been used in HAM-Offload to implement a low-overhead offloading framework' is unsupported by any benchmarks, error analysis, or performance derivation, leaving the 'lightweight' and practical-utility claims unverified.

    Authors: The abstract summarizes the use of HAM within the HAM-Offload framework. The manuscript contains the supporting performance evaluation and analysis in the dedicated evaluation section. To address the concern that the abstract makes an unsupported assertion, we will revise the abstract to qualify the 'low-overhead' claim or add a reference to the evaluation results. revision: yes

Circularity Check

0 steps flagged

No circularity: implementation description without equations or self-referential derivations

full rationale

The paper describes a C++ template-metaprogramming technique for generating active-message types and an address-translation mechanism for heterogeneous binaries. No equations, fitted parameters, predictions, or self-citations appear in the provided text. The central claims concern the feasibility of an implementation artifact rather than any derivation that reduces to its own inputs by construction. This is the expected non-finding for a systems paper whose contribution is code-level engineering rather than a mathematical result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The paper introduces a new library mechanism whose correctness rests on unstated assumptions about C++ template behavior and address translation across binaries; no free parameters or external benchmarks are mentioned in the abstract.

axioms (1)
  • domain assumption: C++ template metaprogramming suffices to implicitly generate active message types and handler functions at compile time
    Invoked to support the core generation of message types without explicit user code.
invented entities (1)
  • HAM address translation mechanism (no independent evidence)
    purpose: Maps handler code addresses between processes running different binaries on different architectures
    Required to enable heterogeneity; no independent evidence supplied in abstract.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages

  [1] Bilge Acun, Abhishek Gupta, Nikhil Jain, Akhil Langer, Harshitha Menon, Eric Mikida, Xiang Ni, Michael Robson, Yanhua Sun, Ehsan Totoni, Lukasz Wesolowski, and Laxmikant Kale. 2014. Parallel Programming with Migratable Objects: Charm++ in Practice. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Anal...

  [2] C. Chen, F. Yang, F. Wang, L. Deng, and D. Zhao. 2018. Review of Programming and Performance Optimization on CPU-MIC Heterogeneous System. In 2018 IEEE 3rd International Conference on Image, Vision and Computing (ICIVC). 894–900. https://doi.org/10.1109/ICIVC.2018.8492841

  [3] Mattias De Wael, Stefan Marr, Bruno De Fraine, Tom Van Cutsem, and Wolfgang De Meuter. 2015. Partitioned Global Address Space Languages. ACM Comput. Surv. 47, 4, Article 62 (May 2015), 27 pages. https://doi.org/10.1145/2716320

  [4] Daniel Deppisch. 2018. Advancing the Heterogeneous Active Messages Approach. Master's thesis. Humboldt-Universität zu Berlin, Faculty of Mathematics and Natural Sciences, Department of Computer Science.

  [5] Intel Corporation. [n. d.]. Itanium C++ ABI, v1.86. Intel Corporation.

  [6] James Jeffers and James Reinders. 2013. Intel Xeon Phi Coprocessor High Performance Programming (1st ed.). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.

  [7] H. Kaiser, M. Brodowicz, and T. Sterling. 2009. ParalleX An Advanced Parallel Execution Model for Scaling-Impaired Applications. In Parallel Processing Workshops, 2009. ICPPW '09. International Conference on. 394–401. https://doi.org/10.1109/ICPPW.2009.14

  [8] H.J. Lu, Milind Girkar, Michael Matz, Jan Hubika, Andreas Jaeger, and Mark Mitchell. [n. d.]. System V Application Binary Interface, K1OM Architecture Processor Supplement, v1.0.

  [9] Lukáš Malý, Jan Zapletal, Michal Merta, and Martin Čermák. 2018. Xeon Phi acceleration of domain decomposition iterations via heterogeneous active messages. AIP Conference Proceedings 1978, 1 (2018), 360004. https://doi.org/10.1063/1.5043963 arXiv:https://aip.scitation.org/doi/pdf/10.1063/1.5043963

  [10] Michael Matz, Jan Hubika, Andreas Jaeger, and Mark Mitchell. [n. d.]. System V Application Binary Interface, AMD64 Architecture Processor Supplement, Draft v0.99.6.

  [11] NEC Corporation. [n. d.]. System V Application Binary Interface VE Architecture Processor Supplement, v1.1. NEC Corporation. https://www.nec.com/en/global/prod/hpc/aurora/document/VE-ABI_v1.1.pdf

  [12] Chris J. Newburn, Rajiv Deodhar, Serguei Dmitriev, Ravi Murty, Ravi Narayanaswamy, John Wiegert, Francisco Chinchilla, and Russell McGuire. 2013. Offload Compiler Runtime for the Intel Xeon Phi Coprocessor. In Supercomputing. Springer Berlin Heidelberg, 239–254. https://doi.org/10.1007/978-3-642-38750-0_18

  [13] C. J. Newburn, G. Bansal, M. Wood, L. Crivelli, J. Planas, A. Duran, P. Souza, L. Borges, P. Luszczek, S. Tomov, J. Dongarra, H. Anzt, M. Gates, A. Haidar, Y. Jia, K. Kabir, I. Yamazaki, and J. Labarta. 2016. Heterogeneous Streaming. In 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). 611–620. https://doi.org/10.110...

  [14] Matthias Noack. 2019. HAM-Offload GitHub Repository. (2019). https://github.com/noma/ham

  [15] Matthias Noack, Erich Focht, and Thomas Steinke. 2019. Heterogeneous Active Messages for Offloading on the NEC SX-Aurora TSUBASA. In 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW).

  [16] Matthias Noack, Florian Wende, Thomas Steinke, and Frank Cordes. 2014. A Unified Programming Model for Intra- and Inter-node Offloading on Xeon Phi Clusters. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '14). IEEE Press, Piscataway, NJ, USA, 203–214. https://doi.org/10.1109/SC.2014.22

  [17] Jörg Nolte, Yutaka Ishikawa, and Mitsuhisa Sato. 2001. TACO: Prototyping High-Level Object-Oriented Programming Constructs by Means of Template Based Programming Techniques. SIGPLAN Not. 36 (December 2001), 35–49. Issue 12. https://doi.org/10.1145/583960.583965

  [18] OpenMP Architecture Review Board. 2018. OpenMP Application Program Interface, Version 5.0. OpenMP Architecture Review Board. https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5.0.pdf

  [19] TOP500.org. 2018. Top500: TOP 500 Supercomputer Sites. (November 2018). http://www.top500.org