pith. sign in

arxiv: 2605.30294 · v1 · pith:IHWKG6HUnew · submitted 2026-05-28 · 💻 cs.DC · cs.GR

RAFI -- A Ray/Work Forwarding Infrastructure for Data Parallel Multi-Node/Multi-GPU Computing

Pith reviewed 2026-06-29 05:22 UTC · model grok-4.3

classification 💻 cs.DC cs.GR
keywords RaFIwork forwardingmulti-GPUCUDAMPIdata-parallel computingray migrationdistributed GPU
0
0 comments X

The pith

RaFI gives CUDA kernels a simple way to forward rays and work items across GPUs while handling all CUDA and MPI details internally.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents RaFI as a framework that lets programmers build data-parallel GPU applications where work items such as rays can move between different GPUs and nodes. It exposes an easy-to-use interface inside CUDA kernels for sending those items onward, and the system takes care of the required CUDA memory transfers and MPI messaging without further programmer intervention. A sympathetic reader would care because many rendering, simulation, and scientific codes naturally produce work that must cross device boundaries, and removing the need to write and debug that communication code can shorten development time. The authors describe the design choices, implementation, and use in example applications to show how the approach works in practice.

Core claim

RaFI is a CUDA and MPI based software framework that simplifies the task of building GPU-enabled data-parallel software where rays or similar work items need to migrate between different GPUs. RaFI provides a simple interface for CUDA kernels to forward such work items to other GPUs, while under the hood managing all the CUDA and MPI related work required to make this happen.

What carries the argument

RaFI's kernel-level forwarding interface, which lets CUDA code send work items to remote GPUs and delegates all inter-device CUDA transfers plus MPI messaging to the framework.

If this is right

  • Data-parallel codes that move rays or particles across GPUs can be written with far less explicit communication logic.
  • Developers can focus kernel logic on the computation rather than on device-to-device data movement.
  • Existing multi-GPU ray-tracing and simulation workloads become easier to extend across additional nodes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same forwarding abstraction could support dynamic load balancing by letting kernels decide at runtime where to send work.
  • Adoption would likely require only modest changes to existing CUDA kernels that already contain forwarding logic.
  • Performance behavior on very large node counts remains an open question that the provided examples do not address.

Load-bearing premise

The overhead of the added communication layer can be hidden behind the simple interface without large performance penalties or loss of needed flexibility in data-parallel GPU codes.

What would settle it

A side-by-side timing and code-complexity comparison, on a multi-node GPU cluster, between an application written with RaFI and the same application written with direct CUDA and MPI calls.

Figures

Figures reproduced from arXiv: 2605.30294 by Alper Sahistan, Andrea Paris, Ingo Wald, Milan Jaros, Patrick Moran, Serkan Demirci, Stefan Zellmann, Tatiana von Landesberger, Ugur Gudukbay, Valerio Pascucci.

Figure 1
Figure 1. Figure 1: Illustration of the VoPaT data-parallel volume path tracing system when using RaFI (Section 5.1). From top to bottom: 1) Input Data: each one of three MPI-connected ranks computes the same k-d tree domain partitioning, loads its part of the input data, but also stores “proxy” bounding boxes that describe the other ranks’ domains. 2) RayGen: Each rank generates all primary rays and traces these rays against… view at source ↗
Figure 2
Figure 2. Figure 2: Two screenshots from our RaFI-enabled VoPaT volume path tracer, running on four 10GB-Ethernet networked PCs with two RTX8000 (Turing) GPUs each. Left: the 4K ×4K ×4K rotstrat data set. Right: the 305×3001×2501 thunderstorm data set. each such entry-exit ray segment produces a separate color-plus-alpha fragment. Traditional sort-last rendering can handle only a single frag￾ment per pixel and thus cannot han… view at source ↗
Figure 3
Figure 3. Figure 3: Illustration of before (top) vs. after adopting [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Images of the 788 million unstructured element Mars Lander [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Two images from the RaFI-enabled Schlieren renderer described in Section 5.3, in both cases showing the Mars Lander Retropropulsion data set. Left: knife-edge filter in the u direction (emphasizing horizontal gradients with respect to the viewer). Right: knife-edge filter in the v direction (emphasizing vertical gradients). on the exterior side of this face (or -1 if it is on the model boundary). Ray trave… view at source ↗
Figure 6
Figure 6. Figure 6: Streamlines computed with our RaFI-enabled data-parallel particle advection application described in Section 5.4. Left: velocity field around an airplane computed with OpenFoam. Center: airflow around a car. Right: wind turbulence at low altitude from a daily snapshot computed with ICON, by the German weather forecasting institute (DWD). the application ran on a single GPU; only at the end of each update s… view at source ↗
Figure 7
Figure 7. Figure 7: Visualization of the communication patterns in our [PITH_FULL_IMAGE:figures/full_fig_p007_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Communication bandwidth utilization as a function of [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗
read the original abstract

We present RaFI, a CUDA and MPI based software framework that simplifies the task of building GPU-enabled data-parallel software where rays or similar work items need to migrate between different GPUs. RaFI provides a simple interface for CUDA kernels to forward such work items to other GPUs, while under the hood managing all the CUDA and MPI related work required to make this happen. We describe RaFI's motivation and implementation, and show its potential in several example applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper presents RaFI, a CUDA- and MPI-based software framework for simplifying construction of data-parallel multi-GPU applications in which work items (e.g., rays) must migrate between GPUs. It claims to expose a simple kernel-level forwarding interface while transparently handling all CUDA and MPI details (queues, serialization, transfers, synchronization). The manuscript describes motivation and implementation and illustrates the framework with several example applications.

Significance. If the claimed simplification is achieved without material performance penalty or loss of control relative to hand-written CUDA+MPI code, the framework would be useful for developers of ray-tracing, particle-transport, or similar irregular data-parallel workloads on multi-node GPU systems. The absence of any quantitative evaluation, however, leaves this utility unestablished.

major comments (1)
  1. [Abstract] Abstract: the central claim that RaFI 'manages all the CUDA and MPI related work' while imposing no significant performance penalties is load-bearing for the framework's value, yet the manuscript supplies no head-to-head timing data, latency measurements, or comparisons against equivalent manual implementations; this omission prevents assessment of whether the added layers (queues, serialization, inter-GPU transfers) are negligible.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that quantitative performance data is required to substantiate the framework's value proposition and will address this in the revision.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that RaFI 'manages all the CUDA and MPI related work' while imposing no significant performance penalties is load-bearing for the framework's value, yet the manuscript supplies no head-to-head timing data, latency measurements, or comparisons against equivalent manual implementations; this omission prevents assessment of whether the added layers (queues, serialization, inter-GPU transfers) are negligible.

    Authors: We acknowledge that the manuscript currently lacks head-to-head timing data and comparisons against manual CUDA+MPI implementations, which is necessary to evaluate any overhead from the framework's queues, serialization, and transfer mechanisms. In the revised version we will add a dedicated evaluation section containing latency and throughput measurements for the example applications, directly comparing RaFI against equivalent hand-written code on the same hardware and workloads. This will allow readers to assess whether the added layers impose material penalties. revision: yes

Circularity Check

0 steps flagged

No circularity: software framework description with no derivations or fitted predictions

full rationale

The paper presents a CUDA/MPI software infrastructure (RaFI) for forwarding work items across GPUs, describing its interface, implementation, and example applications. No mathematical derivations, first-principles predictions, fitted parameters, or uniqueness theorems appear in the provided text or abstract. Claims reduce directly to the described code and usage rather than to any self-referential inputs, self-citations, or ansatzes. This is a systems/software paper, not a theoretical derivation; the central claim (simple kernel interface hiding CUDA/MPI details) is an engineering assertion evaluated by implementation and examples, not by construction from its own outputs. No load-bearing steps match any enumerated circularity pattern.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a software infrastructure paper with no mathematical derivations, fitted parameters, or postulated entities; the ledger is empty.

pith-pipeline@v0.9.1-grok · 5634 in / 991 out tokens · 20534 ms · 2026-06-29T05:22:48.198029+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

30 extracted references · 4 canonical work pages

  1. [1]

    S. Abbott. GPUDirect, CUDA-Aware MPI, and CUDA IPC. Oak Ridge Leadership Facility, Summit Training Workshop, https://www.olcf.ornl.gov/wp-content/uploads/2018/ 12/summit_workshop_CUDA-Aware-MPI.pdf, 2018. 2

  2. [2]

    Abram, P

    G. Abram, P. Navrátil, P. Grossett, D. Rogers, and J. Ahrens. Galaxy: Asynchronous ray tracing for large high-fidelity visualization. InProceed- ings of the IEEE 8th Symposium on Large Data Analysis and Visualization, LDA V ’18, 2018. 1

  3. [3]

    Available at https: //rocm.docs.amd.com/projects/HIP, Last accessed: 29 March 2026

    Introduction to the HIP programming model, 2026. Available at https: //rocm.docs.amd.com/projects/HIP, Last accessed: 29 March 2026. 2

  4. [4]

    Barnes and P

    J. Barnes and P. Hut. A hierarchical O (N log N) force-calculation algo- rithm.Nature, 324(6096), 1986. 7

  5. [5]

    Bickford, T

    N. Bickford, T. Lorach, and C. Hebert. Hands-on class: Introduction to slang: The next generation shading language. InProceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Labs, 2025. SIGGRAPH Labs ’25. 2

  6. [6]

    Chalmers, E

    A. Chalmers, E. Reinhard, and T. Davis.Practical Parallel Rendering. CRC Press, 2002. 1

  7. [7]

    K. M. Chandy and L. Lamport. Distributed Snapshots: Determining Global States of Distributed Systems.ACM Transactions on Computing Systems, 3(1):63–75, Feb. 1985. doi: 10.1145/214451.214456 3

  8. [8]

    Childs et al

    H. Childs et al. A Terminology for In Situ Visualization and Analysis Systems.The International Journal of High Performance Computing Applications, 34(6), 2020. 1

  9. [9]

    J. Fong, M. Wrenninge, C. Kulla, and R. Habel. Production V olume Rendering: SIGGRAPH 2017 Course. InACM SIGGRAPH 2017 Courses, SIGGRAPH ’17, 2017. doi: 10.1145/3084873.3084907 5

  10. [10]

    J. Kim, S. Seo, J. Lee, J. Nah, G. Jo, and J. Lee. OpenCL as a unified programming model for heterogeneous CPU/GPU clusters.SIGPLAN Not., 47(8), 2012. 2

  11. [11]

    J. Kraus. An Introduction to CUDA-Aware MPI. NVIDIA Technical Blog, https://developer.nvidia.com/blog/ introduction-cuda-aware-mpi/, 2013. 2

  12. [12]

    CUDA Toolkit

    NVIDIA Corporation . CUDA Toolkit. Available at https:// developer.nvidia.com/cuda-toolkit. 2

  13. [13]

    Molnar, M

    S. Molnar, M. Cox, D. Ellsworth, and H. Fuchs. A sorting classification of parallel rendering.IEEE Computer Graphics and Applications, 14(4):23– 32, 1994. 1

  14. [14]

    Moreland, W

    K. Moreland, W. Kendall, T. Peterka, and J. Huang. An Image Composit- ing Solution at Scale. InProceedings of International Conference for High Performance Computing, Networking, Storage and Analysis, 2011. 1

  15. [15]

    Morrical, A

    N. Morrical, A. Sahistan, U. Güdükbay, I. Wald, and V . Pascucci. Quick Clusters: A GPU-Parallel Partitioning for Efficient Path Tracing of Un- structured V olumetric Grids.IEEE Transactions on Visualization and Computer Graphics, 29(01), 2023. doi: 10.1109/TVCG.2022.3209418 5

  16. [16]

    P. A. Navratil.Memory-Efficient, Scalable Ray Tracing. PhD thesis, University of Texas, Austin, 2010. 1

  17. [17]

    Available at https: //github.com/NVIDIA/cuBQL, Accessed: Feb 26, 2026

    cuBQL - The CUDA BVH Build and Query Library. Available at https: //github.com/NVIDIA/cuBQL, Accessed: Feb 26, 2026. 7

  18. [18]

    Available at https: //github.com/NVIDIA/owl, Accessed: Feb 26, 2026

    OWL – A Productivity Library for OptiX, 2026. Available at https: //github.com/NVIDIA/owl, Accessed: Feb 26, 2026. 5

  19. [19]

    H. Park, D. Fussell, and P. Navrátil. SpRay: speculative ray scheduling for large data visualization. InProceedings of IEEE 8th Symposium on Large Data Analysis and Visualization, LDA V 08, pp. 77–86, 2018. 1

  20. [20]

    Reinhard.Scheduling and Data Management for Parallel Ray Tracing

    E. Reinhard.Scheduling and Data Management for Parallel Ray Tracing. PhD thesis, University of East Anglia, 1995. 1

  21. [21]

    Sahistan, S

    A. Sahistan, S. Demirci, N. Morrical, S. Zellmann, A. Aman, I. Wald, and U. Güdükbay. Ray-traced Shell Traversal of Tetrahedral Meshes for Direct V olume Visualization. InProceedings of the IEEE Visualization Conference-Short Papers, VIS ’21, 2021. doi: 10.1109/VIS49827.2021. 9623298 5

  22. [22]

    Sahistan, S

    A. Sahistan, S. Demirci, I. Wald, S. Zellmann, J. Barbosa, N. Morrical, and U. Güdükbay. Visualization of Large Non-Trivially Partitioned Un- structured Data with Native Distribution on High-Performance Computing Systems.IEEE Transactions on Visualization and Computer Graphics, 31(9):5000–5014, September 2025. 4, 5

  23. [23]

    G. S. Settles.Schlieren and Shadowgraph Techniques: Visualizing Phe- nomena in Transparent Media. Springer-Verlag, 2001. 5

  24. [24]

    Toepler.Beobachtungen nach einer neuen optischen Methode: e

    A. Toepler.Beobachtungen nach einer neuen optischen Methode: e. Beitr. zur Experimental-Physik. Max Cohen & Son, Bonn, Germany, 1864. 5

  25. [25]

    Trott, L

    C. Trott, L. Berger-Vergiat, D. Poliakoff, S. Rajamanickam, D. Lebrun- Grandie, J. Madsen, N. Al Awar, M. Gligoric, G. Shipman, and G. Womel- dorff. The Kokkos Ecosystem: Comprehensive Performance Portability for High Performance Computing.Computing in Science Engineering, 23(5), 2021. 2

  26. [26]

    I. Wald, M. Jaros, and S. Zellmann. Data Parallel Multi-GPU Path Tracing using Ray Queue Cycling.Computer Graphics Forum, 42, 2023. 8

  27. [27]

    Wald and S

    I. Wald and S. G. Parker. Data Parallel Path Tracing with Object Hierar- chies.Proceedings of the ACM on Computer Graphics and Interactive Techniques, 5(3), 2022. (Proceedings of High Performance Graphics). 1, 4

  28. [28]

    I. Wald, S. Zellmann, and N. Morrical. Data-Parallel V olume Path Tracing (with Ray Forwarding). Available at https://github.com/ingowald/ vopat, Accessed: 2026/03/16. 1, 4

  29. [29]

    D. W. Walker and J. J. Dongarra. MPI: a standard message passing interface.Supercomputer, 12, 1996. 2

  30. [30]

    L. A. Yates. Images constructed from computed flowfields.AIAA Journal, 31(10), October 1993. 5 9