Heterogeneous Active Messages (HAM) -- Implementing Lightweight Remote Procedure Calls in C++
Pith reviewed 2026-05-25 17:39 UTC · model grok-4.3
The pith
HAM uses C++ template metaprogramming to generate active message types and handlers for lightweight RPC across heterogeneous architectures.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HAM uses template meta-programming to implicitly generate active message types and their corresponding handler functions. Heterogeneity is enabled by providing an efficient address translation mechanism between the individual handler code addresses of processes running different binaries on different architectures, as well as hooks to inject serialisation and deserialisation code on a per-type basis.
What carries the argument
Template meta-programming for implicit generation of active message types and handlers, together with an address translation mechanism that maps handler addresses across different binaries and architectures.
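To make the first half of that mechanism concrete, here is a minimal C++17 sketch of how a template can derive both a message type and its handler from a plain function. It is an illustration under stated assumptions, not HAM's actual API: the names `active_msg`, `make_msg`, `handler_t`, and `run_add_remotely` are hypothetical, and the "transfer" is simulated by passing a pointer within one process.

```cpp
#include <cassert>
#include <tuple>

// Hypothetical names throughout; this is not HAM's API.

// Type-erased handler signature: takes the message object and a place
// to store the result.
using handler_t = void (*)(const void* msg, void* result);

// Primary template, specialised below for function pointers.
template <typename FnPtr, FnPtr Fn>
struct active_msg;

// One message type per target function. The struct stores the call
// arguments; the static member function is the generated handler.
template <typename R, typename... Args, R (*Fn)(Args...)>
struct active_msg<R (*)(Args...), Fn> {
    std::tuple<Args...> args;

    // Casts the opaque pointer back to the message type and applies the
    // stored arguments to the target function.
    static void handler(const void* msg, void* result) {
        const auto* self = static_cast<const active_msg*>(msg);
        *static_cast<R*>(result) = std::apply(Fn, self->args);
    }
};

// Convenience factory: the message type is implied by the function.
template <auto Fn, typename... Args>
auto make_msg(Args... args) {
    return active_msg<decltype(Fn), Fn>{{args...}};
}

// Example target function.
int add(int a, int b) { return a + b; }

// Simulates "send, then dispatch": only the type-erased handler and an
// opaque pointer are used on the receiving side.
int run_add_remotely(int a, int b) {
    auto msg = make_msg<&add>(a, b);
    handler_t h = &decltype(msg)::handler;
    int result = 0;
    h(&msg, &result);
    return result;
}
```

The point of the design is that the caller never writes a message struct or a handler by hand; both fall out of the function's signature. Shipping the handler address itself to another binary is the part that needs address translation, which this sketch leaves out.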
If this is right
- Combined with any communication protocol, HAM supplies a generic RPC facility.
- It supports low-overhead offloading between CPUs and accelerators such as the Xeon Phi and SX-Aurora.
- Per-type hooks allow custom serialization code to be inserted without changing the core message generation.
- The implementation surfaces gaps in the current C++ standard for distributed heterogeneous code.
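The per-type hooks mentioned in the list above could look roughly like the following trait-based sketch. It assumes a design in which a default bitwise copy covers trivially copyable types and a specialisation overrides it for other types; `serializer`, `buffer`, and `round_trip` are hypothetical names, and the sketch ignores endianness and type-size differences that a real heterogeneous setup would have to handle.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <type_traits>
#include <vector>

using buffer = std::vector<unsigned char>;

// Default hook: bitwise copy, valid only for trivially copyable types.
template <typename T>
struct serializer {
    static_assert(std::is_trivially_copyable<T>::value,
                  "provide a serializer specialisation for this type");

    static void write(buffer& out, const T& v) {
        const auto* p = reinterpret_cast<const unsigned char*>(&v);
        out.insert(out.end(), p, p + sizeof(T));
    }

    static T read(const buffer& in, std::size_t& pos) {
        T v;
        std::memcpy(&v, in.data() + pos, sizeof(T));
        pos += sizeof(T);
        return v;
    }
};

// Per-type hook for std::string: a 64-bit length prefix followed by the
// characters. The message-generation code does not need to change.
template <>
struct serializer<std::string> {
    static void write(buffer& out, const std::string& s) {
        serializer<std::uint64_t>::write(out, s.size());
        out.insert(out.end(), s.begin(), s.end());
    }

    static std::string read(const buffer& in, std::size_t& pos) {
        const auto n =
            static_cast<std::size_t>(serializer<std::uint64_t>::read(in, pos));
        std::string s(reinterpret_cast<const char*>(in.data() + pos), n);
        pos += n;
        return s;
    }
};

// Writes a value into a fresh buffer and reads it back.
template <typename T>
T round_trip(const T& v) {
    buffer b;
    serializer<T>::write(b, v);
    std::size_t pos = 0;
    return serializer<T>::read(b, pos);
}
```

For example, `round_trip(std::string("ham"))` goes through the specialised hook, while `round_trip(42)` uses the default bitwise path.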
Where Pith is reading between the lines
- The technique could reduce boilerplate in other mixed-architecture HPC codes that currently hand-write message formats.
- Similar metaprogramming patterns might be explored for languages that lack C++-style templates but target heterogeneous hardware.
- Wider use could motivate clearer language rules for code-address portability in distributed settings.
Load-bearing premise
C++ template metaprogramming and address translation can be made to work reliably across heterogeneous binaries and architectures without violating language rules or incurring high runtime cost.
What would settle it
A concrete test in which address translation between two processes on different architectures produces an incorrect handler invocation or adds measurable overhead to the call path.
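A small-scale version of such a test can be sketched in a single process. The sketch assumes one possible translation scheme, a per-process handler table sorted by a key that is identical in every binary, so that a table index is the portable representation of a handler. This page does not describe HAM's own mechanism in enough detail to confirm that it works this way. The names `handler_registry` and `call_via_index` are hypothetical, and two registry instances stand in for two different binaries.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using handler_t = int (*)(int);

// Hypothetical per-process registry (an assumed design, not HAM's).
class handler_registry {
    using entry = std::pair<std::string, handler_t>;
    std::vector<entry> entries_;

public:
    void add(std::string key, handler_t h) {
        entries_.emplace_back(std::move(key), h);
    }

    // Sort by key so that every process that registered the same keys
    // ends up with the same key-to-index mapping.
    void finalize() {
        std::sort(entries_.begin(), entries_.end(),
                  [](const entry& a, const entry& b) { return a.first < b.first; });
    }

    // Sender side: local handler address -> portable index.
    // Returns the table size if the handler is unknown.
    std::size_t index_of(handler_t h) const {
        for (std::size_t i = 0; i < entries_.size(); ++i)
            if (entries_[i].second == h) return i;
        return entries_.size();
    }

    // Receiver side: portable index -> local handler address.
    handler_t handler_at(std::size_t i) const { return entries_.at(i).second; }
};

// Two functions per logical handler simulate the same handler living at
// different addresses in two binaries.
int twice_host(int x) { return 2 * x; }
int twice_accel(int x) { return x + x; }
int inc_host(int x) { return x + 1; }
int inc_accel(int x) { return 1 + x; }

// Translates a host-side handler address to an index, then resolves and
// invokes the matching handler on the simulated accelerator side.
int call_via_index(int arg) {
    handler_registry host, accel;
    host.add("twice", twice_host);
    host.add("inc", inc_host);
    accel.add("inc", inc_accel);      // registered in a different order
    accel.add("twice", twice_accel);
    host.finalize();
    accel.finalize();

    const std::size_t wire = host.index_of(twice_host);
    return accel.handler_at(wire)(arg);
}
```

A wrong result from `call_via_index` would be the in-process analogue of the incorrect handler invocation described above. Measuring the overhead of the lookup on a real call path would need a benchmark across actual heterogeneous binaries, which this sketch does not attempt.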
Original abstract
We present HAM (Heterogeneous Active Messages), a C++-only active messaging solution for heterogeneous distributed systems. Combined with a communication protocol, HAM can be used as a generic Remote Procedure Call (RPC) mechanism. It has been used in HAM-Offload to implement a low-overhead offloading framework for inter- and intra-node offloading between different architectures including accelerators like the Intel Xeon Phi x100 series and the NEC SX-Aurora TSUBASA Vector Engine. HAM uses template meta-programming to implicitly generate active message types and their corresponding handler functions. Heterogeneity is enabled by providing an efficient address translation mechanism between the individual handler code addresses of processes running different binaries on different architectures, as well as hooks to inject serialisation and deserialisation code on a per-type basis. Implementing such a solution in modern C++ sheds some light on the shortcomings and grey areas of the C++ standard when it comes to distributed and heterogeneous environments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents Heterogeneous Active Messages (HAM), a C++-only active messaging library for heterogeneous distributed systems that can serve as a generic RPC mechanism when combined with a communication protocol. It has been deployed in the HAM-Offload framework for low-overhead offloading between architectures including Intel Xeon Phi and NEC SX-Aurora. HAM relies on template metaprogramming to implicitly generate active message types and handler functions, supplies an address translation mechanism to map handler code addresses across processes with different binaries and architectures, and provides per-type hooks for serialization/deserialization. The authors also discuss shortcomings and grey areas of the C++ standard for distributed and heterogeneous environments.
Significance. If the address translation mechanism can be shown to be portable, standards-compliant, and low-overhead, the work would supply a practical, dependency-free C++ solution for lightweight RPC in heterogeneous HPC settings and could usefully inform future language extensions.
major comments (2)
- [Abstract] The central claim of an 'efficient address translation mechanism' that maps handler code addresses between processes running different binaries on different architectures is load-bearing for the heterogeneity guarantee, yet no algorithm, casting strategy, registry approach, memory-model assumptions, or overhead data are supplied; the text only notes 'grey areas of the C++ standard.'
- [Abstract] The assertion that HAM 'has been used in HAM-Offload to implement a low-overhead offloading framework' is unsupported by any benchmarks, error analysis, or performance derivation, leaving the 'lightweight' and practical-utility claims unverified.
Simulated Author's Rebuttal
We thank the referee for their review and the opportunity to respond to the comments on our manuscript. We address each major comment below.
- Referee: [Abstract] The central claim of an 'efficient address translation mechanism' that maps handler code addresses between processes running different binaries on different architectures is load-bearing for the heterogeneity guarantee, yet no algorithm, casting strategy, registry approach, memory-model assumptions, or overhead data are supplied; the text only notes 'grey areas of the C++ standard.'
  Authors: The abstract is a concise summary and does not contain the implementation details. The manuscript body describes the address translation mechanism, including the registry-based approach for mapping handler addresses across different binaries and architectures, the use of template metaprogramming, and our analysis of C++ standard limitations. Overhead considerations are discussed in the context of the design. We agree the abstract could better signal the presence of these details and will revise it to include a brief qualifier or section reference.
  Revision: yes
- Referee: [Abstract] The assertion that HAM 'has been used in HAM-Offload to implement a low-overhead offloading framework' is unsupported by any benchmarks, error analysis, or performance derivation, leaving the 'lightweight' and practical-utility claims unverified.
  Authors: The abstract summarizes the use of HAM within the HAM-Offload framework. The manuscript contains the supporting performance evaluation and analysis in the dedicated evaluation section. To address the concern that the abstract makes an unsupported assertion, we will revise the abstract to qualify the 'low-overhead' claim or add a reference to the evaluation results.
  Revision: yes
Circularity Check
No circularity: implementation description without equations or self-referential derivations
The paper describes a C++ template-metaprogramming technique for generating active-message types and an address-translation mechanism for heterogeneous binaries. No equations, fitted parameters, predictions, or self-citations appear in the provided text. The central claims concern the feasibility of an implementation artifact rather than any derivation that reduces to its own inputs by construction. This is the expected non-finding for a systems paper whose contribution is code-level engineering rather than a mathematical result.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: C++ template metaprogramming suffices to implicitly generate active message types and handler functions at compile time
invented entities (1)
- HAM address translation mechanism: no independent evidence