arxiv: 2605.16253 · v1 · pith:5WBQNOLBnew · submitted 2026-05-15 · 💻 cs.AR

TTP: A Hardware-Efficient Design for Precise Prefetching in Ray Tracing

Pith reviewed 2026-05-19 18:11 UTC · model grok-4.3

classification 💻 cs.AR
keywords ray tracing · prefetching · BVH traversal · hardware prefetcher · GPU architecture · memory latency · traversal stack

The pith

A prefetcher that monitors consecutive pops from ray tracing traversal stacks delivers 1.48x average speedup with negligible hardware overhead.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes TTP, a hardware prefetcher for ray tracing that exploits the traversal stack already built into dedicated RT units. It generates prefetch requests for BVH nodes when addresses are popped consecutively from the stack during depth-first traversal, a pattern that potentially corresponds to upward movement through the tree. In cycle-level simulation this yields a 1.48x average speedup (up to 1.89x) with nearly negligible hardware overhead. The prefetches are highly accurate, at 98.92 percent average L1 accuracy, and cover 31.54 percent of baseline L1 misses.

Core claim

TTP prefetches nodes using the addresses already available on the hardware traversal stacks of each thread. For DFS-based traversal, prefetches are generated when nodes are being popped consecutively from the traversal stack, potentially corresponding to upward traversal through the tree. Evaluated on the cycle-level simulator Vulkan-sim 2.0, TTP achieves 1.48x speedup on average compared to the baseline with nearly negligible hardware overhead, 98.92 percent average L1 accuracy, and 31.54 percent coverage.

What carries the argument

Tree Traversal Prefetcher (TTP), which issues prefetch requests from addresses on the existing hardware traversal stack during consecutive pops in depth-first BVH traversal.
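
To make the trigger concrete, here is a minimal Python sketch of a pop-streak prefetch rule on a single thread's traversal stack. It is an illustration under stated assumptions, not the paper's state machine (Figure 8 is not reproduced on this page): the class `TraversalStackModel`, the `prefetch_depth` knob, and the choice to prefetch the entries nearest the top of the stack on the second and later pops in a row are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TraversalStackModel:
    """Toy model of one thread's traversal stack with a pop-streak prefetch trigger."""

    prefetch_depth: int = 1  # hypothetical knob: stack entries prefetched per trigger (>= 1)
    stack: list = field(default_factory=list)       # node addresses waiting to be visited
    prefetches: list = field(default_factory=list)  # addresses for which a prefetch was issued
    last_op_was_pop: bool = False

    def push(self, addr: int) -> None:
        self.stack.append(addr)
        self.last_op_was_pop = False  # a push ends any pop streak

    def pop(self) -> int:
        addr = self.stack.pop()  # demand read for this node
        if self.last_op_was_pop:
            # Second or later pop in a row (assumed trigger): the entries now
            # nearest the top of the stack are the next ones to be demanded.
            upcoming = self.stack[::-1][: self.prefetch_depth]
            for a in upcoming:
                if a not in self.prefetches:
                    self.prefetches.append(a)
        self.last_op_was_pop = True
        return addr


model = TraversalStackModel()
for node_addr in (0x100, 0x140, 0x180):
    model.push(node_addr)
model.pop()  # first pop after a push: no prefetch
model.pop()  # consecutive pop: prefetch the new top of stack (0x100)
```

The sketch only reads addresses that already sit on the stack, so it needs no prediction table.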

If this is right

  • Ray tracing workloads experience fewer memory stalls during BVH traversal, increasing overall throughput.
  • High prefetch accuracy keeps cache pollution low and avoids wasting bandwidth on unused data.
  • The design integrates into existing RT cores because it reuses already-present stack hardware.
  • Speedup scales with scene size and complexity where memory latency dominates computation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Stack-based monitoring may extend to prefetching in other tree-traversal workloads such as spatial databases or physics simulations.
  • Future hardware revisions could expose additional traversal state to enable even more targeted prefetch decisions.
  • The approach suggests that domain-specific knowledge of internal hardware structures can outperform generic prefetch algorithms.

Load-bearing premise

The cycle-level simulator accurately reproduces the memory access patterns and traversal stack behavior of real ray tracing hardware, and consecutive stack pops reliably indicate useful upward traversal without causing cache pollution.

What would settle it

Measuring prefetch accuracy and overall speedup when the same TTP logic is placed in real ray tracing silicon and run on physical GPUs instead of a cycle-level simulator.

Figures

Figures reproduced from arXiv: 2605.16253 by Anshul Naithani, Huiyang Zhou, Yavuz Selim Tozlu.

Figure 1: Thread status distribution in an RT unit. Threads may be waiting for …
Figure 2: DRAM activity with and without (i.e., baseline) TTP.
Figure 3: Simplified BVH structure for Stanford Bunny. The green arrow …
Figure 4: Diagram of the GPU model used in this study. Purple blocks indicate the modified components. Redrawn from [39].
Figure 5: Example DFS BVH traversal and the corresponding traversal stack.
Figure 6: Analysis of pop streaks.
    Accompanying text: "… rays (or threads) have fetched them, and capacity or conflict misses if they were fetched by other threads but evicted from the caches later on. To see the impact of such traversal stack pop streaks, we extract traversal stack activity from the simulator. Note that during BVH traversal, every memory read request is a pop from a traversal stack. Therefore, we categorize the cache mi…"
    Accompanying text: "In addition, we perform a limit study by simulating with perfect upward and perfect downward traversals. For perfect upward traversal, 2nd and later pops after a push always hit in the L1 cache, and for perfect downward traversal, 1st pops after a push always hit in the L1 cache"
Figure 7: Percentage of RT read misses where the node was in the traversal …
Figure 8: State machine that generates prefetches.
Figure 9: TTP implementation. Top of the stack is denoted by …
Figure 10: TTP speedup (higher the better), power and energy (lower the better) with DFS traversal, normalized to baseline.
Figure 11: Normalized speedups with perfect upward and downward traversal.
Figure 12: Normalized speedups with larger L1 data caches.
Figure 13: RT read MPKI with TTP normalized to baseline. Lower is better.
Figure 14: Prefetcher accuracy and coverage with DFS traversal. Higher is …
Figure 15: L1 and L2 cache responses to prefetch requests. For each scene, first …
Figure 16: Normalized speedup, power and energy at 256x256 resolution.
Figure 19: Speedups for different prefetch intensities, normalized to baseline.
Figure 21: RT read MPKI with TTP normalized to baseline, with BFS …
Figure 22: BFS and DFS speedups normalized to DFS without TTP.
Figure 25: Total DRAM reads for Treelet prefetcher normalized to baseline.
Figure 26: Normalized speedups with Park et al.’s prefetching strategy.

Original abstract

Ray tracing (RT) is a 3D graphics technique that offers highly realistic visuals. It is becoming prominent and accessible as GPU vendors have integrated dedicated ray tracing acceleration hardware. However, tracing millions of rays through 3D scenes consisting of high numbers of triangles in real time is challenging and requires expensive hardware. The main bottleneck in RT workloads is the expensive Bounding Volume Hierarchy (BVH) traversal task, which is a large tree structure that encodes the 3D scene. BVH traversal is a memory-bound problem, as the GPU threads spend most of their time reading tree node data from memory. In this work, we attack the memory latency bottleneck of ray tracing through prefetching. We propose a novel hardware prefetcher, named Tree Traversal Prefetcher (TTP), for ray tracing. The main idea is to leverage the existing tree traversal stack in the RT units for highly accurate prefetching. In particular, TTP prefetches nodes using the addresses already available on the hardware traversal stacks of each thread. For DFS (Depth-first search) based traversal, prefetches are generated when nodes are being popped consecutively from the traversal stack, potentially corresponding to upward traversal through the tree. We evaluate TTP on a cycle-level simulator, Vulkan-sim 2.0, and show that it achieves 1.48x speedup on average (up to 1.89x) compared to the baseline, with nearly negligible hardware overhead. TTP achieves 98.92% average L1 accuracy, which is the ratio of the prefetched blocks being actually referenced by demand loads. The coverage, computed as the ratio of L1 miss reduction over baseline L1 misses, is 31.54%, correlating well with the achieved speedup.
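
The two metrics defined in the abstract can be written out as a short Python sketch. The function names and the counts in the example are hypothetical and chosen only for illustration; they are not measurements from the paper.

```python
def prefetch_accuracy(useful_prefetches: int, issued_prefetches: int) -> float:
    """Fraction of prefetched blocks that are later referenced by demand loads."""
    if issued_prefetches == 0:
        return 0.0
    return useful_prefetches / issued_prefetches


def prefetch_coverage(baseline_misses: int, misses_with_prefetcher: int) -> float:
    """L1 miss reduction relative to the baseline L1 miss count."""
    if baseline_misses == 0:
        return 0.0
    return (baseline_misses - misses_with_prefetcher) / baseline_misses


# Made-up counts, for illustration only.
accuracy = prefetch_accuracy(useful_prefetches=990, issued_prefetches=1000)
coverage = prefetch_coverage(baseline_misses=2000, misses_with_prefetcher=1400)
```

Accuracy is normalized by prefetches issued, while coverage is normalized by baseline misses, so a prefetcher can score very high on the first and moderately on the second, as the reported 98.92% and 31.54% do.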

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Tree Traversal Prefetcher (TTP), a hardware prefetcher for ray tracing that leverages addresses already present on the traversal stacks of RT units. Prefetches are triggered on consecutive pops from the stack (assumed to mark useful upward DFS traversal). Evaluated in cycle-level simulation on Vulkan-sim 2.0, TTP reports 1.48× average speedup (max 1.89×), 98.92 % L1 accuracy, 31.54 % coverage, and negligible hardware overhead relative to a baseline without prefetching.

Significance. If the simulation results hold, the work demonstrates a low-cost, high-accuracy prefetching technique that exploits an existing RT hardware structure (the traversal stack) rather than adding new prediction tables. The reported correlation between coverage and speedup, together with the near-negligible area cost, would be a useful contribution to memory-latency mitigation in dedicated ray-tracing hardware.
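
As a quick arithmetic check on the reported numbers (an editorial calculation, not a result from the paper): assuming speedup S is defined as baseline execution time divided by execution time with TTP, a speedup of S corresponds to a runtime reduction of 1 − 1/S.

```latex
\[
1 - \frac{1}{S} \;=\; 1 - \frac{1}{1.48} \;\approx\; 0.324,
\qquad
1 - \frac{1}{1.89} \;\approx\; 0.471 .
\]
```

The average-case reduction of about 32.4% is close to the 31.54% coverage figure. Coverage counts L1 misses removed rather than cycles saved, so the closeness is at most suggestive.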

major comments (2)
  1. [Evaluation] Evaluation section: all speedup, accuracy, and coverage numbers are obtained from Vulkan-sim 2.0 runs; the manuscript provides no cross-validation against real RT cores, alternative simulators, or sensitivity analysis to stack-pop timing and memory latency parameters. Because the prefetch logic is triggered specifically by consecutive stack pops, any mismatch in modeled stack behavior directly undermines the 1.48× speedup and 98.92 % accuracy claims.
  2. [Evaluation] The coverage metric (31.54 %) is defined as L1-miss reduction relative to baseline, yet the text does not quantify how much of the observed speedup is attributable to latency hiding versus other simulator effects (e.g., changes in thread scheduling). This link is load-bearing for the central performance claim.
minor comments (2)
  1. [Abstract] The abstract states 'nearly negligible hardware overhead' without quoting concrete area or power numbers; these figures should appear in the main text or a table.
  2. [Abstract] Benchmark scenes, triangle counts, and exact baseline configuration (cache sizes, memory latency, etc.) are not summarized in the abstract; a short table or sentence would improve readability.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment on the evaluation methodology below, providing clarifications and committing to revisions where appropriate to strengthen the paper.

point-by-point responses
  1. Referee: [Evaluation] Evaluation section: all speedup, accuracy, and coverage numbers are obtained from Vulkan-sim 2.0 runs; the manuscript provides no cross-validation against real RT cores, alternative simulators, or sensitivity analysis to stack-pop timing and memory latency parameters. Because the prefetch logic is triggered specifically by consecutive stack pops, any mismatch in modeled stack behavior directly undermines the 1.48× speedup and 98.92 % accuracy claims.

    Authors: We agree that additional validation would be valuable. Cross-validation on real RT cores is not feasible in this work, as commercial ray-tracing hardware implementations and their detailed microarchitectural models are proprietary. Vulkan-sim 2.0 is a cycle-level simulator specifically designed for Vulkan and ray-tracing workloads and is commonly used in the literature for such studies. To address the concern about sensitivity to modeling assumptions, we will add a new subsection in the revised manuscript presenting sensitivity analysis on memory latency parameters and variations in the timing of consecutive stack pops. revision: partial

  2. Referee: [Evaluation] The coverage metric (31.54 %) is defined as L1-miss reduction relative to baseline, yet the text does not quantify how much of the observed speedup is attributable to latency hiding versus other simulator effects (e.g., changes in thread scheduling). This link is load-bearing for the central performance claim.

    Authors: We acknowledge that a more explicit breakdown would improve the clarity of the performance results. The reported correlation between coverage and speedup, combined with the very high L1 accuracy, indicates that the gains stem primarily from reduced memory latency. In the revision we will add execution-time breakdowns and memory-stall-cycle statistics to better quantify the contribution of latency hiding versus other simulator effects such as scheduling. revision: yes

standing simulated objections not resolved
  • Direct validation against real commercial RT cores or alternative closed-source simulators remains unaddressed, as we lack access to proprietary hardware models.

Circularity Check

0 steps flagged

No circularity: TTP design evaluated via independent simulation measurements

full rationale

The paper proposes TTP as a hardware prefetcher that triggers on consecutive pops from the existing RT traversal stack to prefetch BVH nodes during upward DFS traversal. All reported results (1.48x average speedup, 98.92% L1 accuracy defined as prefetched blocks later referenced, 31.54% coverage as miss reduction ratio) are measured outcomes from separate cycle-level simulation runs on Vulkan-sim 2.0. These quantities are not algebraically or definitionally forced by the prefetch rule itself; they are empirical outputs. No equations, fitted parameters renamed as predictions, self-citations as load-bearing premises, or ansatzes smuggled via prior work appear in the derivation. The design is self-contained; simulator fidelity is an external validity assumption, not an internal circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The design rests on standard domain assumptions about DFS traversal in ray tracing hardware and uses simulation for validation rather than introducing new fitted parameters or entities.

axioms (1)
  • domain assumption: Ray tracing hardware performs DFS-based BVH traversal and maintains a per-thread traversal stack of node addresses.
    This assumption is used to justify prefetch generation when nodes are popped consecutively from the stack.
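
The assumption can be illustrated with a toy depth-first traversal in Python that uses an explicit stack and logs every push and pop. This is a sketch only: the tree, the `hit` set (a stand-in for real ray-box intersection tests), and the helper names `dfs_stack_events` and `pop_streaks` are hypothetical, and real RT units may order or cull children differently.

```python
def dfs_stack_events(tree, root, hit):
    """Traverse `tree` depth-first with an explicit stack and log stack events.

    tree: dict mapping a node to its list of children (leaves may be absent)
    hit:  set of nodes whose bounding box the ray intersects (toy stand-in)
    """
    events = [("push", root)]
    stack = [root]
    while stack:
        node = stack.pop()
        events.append(("pop", node))
        # Push intersected children in reverse so the first child is visited first.
        for child in reversed(tree.get(node, [])):
            if child in hit:
                stack.append(child)
                events.append(("push", child))
    return events


def pop_streaks(events):
    """Lengths of maximal runs of consecutive pops in an event log."""
    streaks, run = [], 0
    for op, _ in events:
        if op == "pop":
            run += 1
        else:
            if run:
                streaks.append(run)
            run = 0
    if run:
        streaks.append(run)
    return streaks


tree = {"root": ["A", "B"], "A": ["A1", "A2"], "B": ["B1", "B2"]}
hit = {"A", "B", "A1", "A2", "B1"}
events = dfs_stack_events(tree, "root", hit)
streaks = pop_streaks(events)
```

In this toy run the pops occur in the order root, A, A1, A2, B, B1, and the streak lengths are [1, 1, 3, 1]. The streak of three (A1, A2, B) ends by leaving the finished subtree under A and moving back up to B, which is the kind of upward movement the pop-streak trigger is meant to catch.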

pith-pipeline@v0.9.0 · 5860 in / 1204 out tokens · 48355 ms · 2026-05-19T18:11:46.288479+00:00 · methodology

Reference graph

Works this paper leans on

44 references

  [1] “Code repo for Treelet Prefetching For Ray Tracing (MICRO 2023).” [Online]. Available: https://github.com/ubc-aamodt-group/treelet-prefetching-for-rt
  [2] “DirectX Raytracing (DXR) Functional Spec.” [Online]. Available: https://microsoft.github.io/DirectX-Specs/d3d/Raytracing.html
  [3] “Intel Embree.” [Online]. Available: https://www.embree.org/
  [4] “Intel® Arc™ Graphics Developer Guide for Real-Time Ray Tracing in...” [Online]. Available: https://www.intel.com/content/www/us/en/developer/articles/guide/real-time-ray-tracing-in-games.html
  [5] “NVIDIA ADA GPU ARCHITECTURE.” [Online]. Available: https://images.nvidia.com/aem-dam/Solutions/geforce/ada/nvidia-ada-gpu-architecture.pdf
  [6] “NVIDIA AMPERE GA102 GPU ARCHITECTURE.” [Online]. Available: https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf
  [7] “Real-Time Ray Tracing.” [Online]. Available: https://dev.epicgames.com/documentation/en-us/unreal-engine/hardware-ray-tracing-tips-and-tricks-in-unreal-engine
  [8] “Real-time Raytracing for Interactive Global Illumination Workflows in Frostbite.” [Online]. Available: https://www.gdcvault.com/play/1024801/
  [9] “The Stanford 3D Scanning Repository.” [Online]. Available: https://graphics.stanford.edu/data/3Dscanrep/
  [10] “Unity real-time Ray Tracing.” [Online]. Available: https://unity.com/ray-tracing
  [11] T. Aila and T. Karras, “Architecture considerations for tracing incoherent rays,” in Proceedings of the Conference on High Performance Graphics, ser. HPG ’10. Goslar, DEU: Eurographics Association, Jun. 2010, pp. 113–122.
  [12] T. Aila and S. Laine, “Understanding the efficiency of ray traversal on GPUs,” in Proceedings of the Conference on High Performance Graphics 2009, ser. HPG ’09. New York, NY, USA: Association for Computing Machinery, Aug. 2009, pp. 145–149. [Online]. Available: https://doi.org/10.1145/1572769.1572792
  [13] S. Ainsworth and T. M. Jones, “Graph Prefetching Using Data Structure Knowledge,” in Proceedings of the 2016 International Conference on Supercomputing, ser. ICS ’16. New York, NY, USA: Association for Computing Machinery, Jun. 2016, pp. 1–11. [Online]. Available: https://doi.org/10.1145/2925426.2926254
  [14] A. Barnes, F. Shen, and T. G. Rogers, “Extending GPU Ray-Tracing Units for Hierarchical Search Acceleration,” in Proceedings of the 57th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’24. New York, NY, USA: Association for Computing Machinery, Nov. 2024.
  [15] A. Buluç and K. Madduri, “Parallel breadth-first search on distributed memory systems,” in Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’11. New York, NY, USA: Association for Computing Machinery, Nov. 2011, pp. 1–12. [Online]. Available: https://dl.acm.org/doi/10.1145/2063384.2063471
  [16] J. Burgess, “RTX on—The NVIDIA Turing GPU,” IEEE Micro, vol. 40, no. 2, pp. 36–44, Mar. 2020. [Online]. Available: https://ieeexplore.ieee.org/document/8981896
  [17] B. Caulfield, “What’s the Difference Between Ray Tracing and Rasterization?” Mar. 2018. [Online]. Available: https://blogs.nvidia.com/blog/whats-difference-between-ray-tracing-rasterization/
  [18] Y. H. Chou and T. M. Aamodt, “Treelet Accelerated Ray Tracing on GPUs,” in Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, ser. ASPLOS ’25. New York, NY, USA: Association for Computing Machinery, Mar. 2025, pp. 1334–1347. [Online]. Available: https://dl.acm.org/doi/1...
  [19] Y. H. Chou, T. Nowicki, and T. M. Aamodt, “Treelet Prefetching For Ray Tracing,” in Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’23. New York, NY, USA: Association for Computing Machinery, Dec. 2023, pp. 742–755.
  [20] P. H. Christensen, J. Fong, D. M. Laur, and D. Batali, “Ray Tracing for the Movie ‘Cars’,” in 2006 IEEE Symposium on Interactive Ray Tracing, Sep. 2006, pp. 1–6. [Online]. Available: https://ieeexplore.ieee.org/document/4061539
  [21] Y. Deng, Y. Ni, Z. Li, S. Mu, and W. Zhang, “Toward Real-Time Ray Tracing: A Survey on Hardware Acceleration and Microarchitecture Techniques,” ACM Computing Surveys, vol. 50, no. 4, pp. 58:1–58:41, Aug. 2017. [Online]. Available: https://doi.org/10.1145/3104067
  [22] Y. Feng, Y. Li, J. Lee, W. W. Ro, and H. Jeon, “Heliostat: Harnessing Ray Tracing Accelerators for Page Table Walks,” in Proceedings of the 52nd Annual International Symposium on Computer Architecture, ser. ISCA ’25. New York, NY, USA: Association for Computing Machinery, Jun. 2025, pp. 122–136. [Online]. Available: https://dl.acm.org/doi/10.1145/36950...
  [23] J. Fu, J. Patel, and B. Janssens, “Stride Directed Prefetching In Scalar Processors,” in [1992] Proceedings the 25th Annual International Symposium on Microarchitecture MICRO 25, Dec. 1992, pp. 102–110. [Online]. Available: https://ieeexplore.ieee.org/document/697004
  [24] L. Geng, R. Lee, and X. Zhang, “LibRTS: A Spatial Indexing Library by Ray Tracing,” in Proceedings of the 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, ser. PPoPP ’25. New York, NY, USA: Association for Computing Machinery, Feb. 2025, pp. 396–411. [Online]. Available: https://dl.acm.org/doi/10.1145/3710848.3710850
  [25] J. Gunther, S. Popov, H.-P. Seidel, and P. Slusallek, “Realtime Ray Tracing on GPU with BVH-based Packet Traversal,” in 2007 IEEE Symposium on Interactive Ray Tracing, Sep. 2007, pp. 113–118. [Online]. Available: https://ieeexplore.ieee.org/document/4342598
  [26] D. Ha, L. Liu, Y. H. Chou, S. Go, W. W. Ro, H.-W. Tseng, and T. M. Aamodt, “Generalizing Ray Tracing Accelerators for Tree Traversals on GPUs,” in Proceedings of the 57th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’24. New York, NY, USA: Association for Computing Machinery, Nov. 2024.
  [27] N. Jouppi, “Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers,” in Proceedings. The 17th Annual International Symposium on Computer Architecture, May 1990, pp. 364–373. [Online]. Available: https://ieeexplore.ieee.org/document/134547
  [28] M. Khairy, Z. Shen, T. M. Aamodt, and T. G. Rogers, “Accel-Sim: An Extensible Simulation Framework for Validated GPU Modeling,” in 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), May 2020, pp. 473–486. [Online]. Available: https://ieeexplore.ieee.org/document/9138922
  [29] J. Lee, N. B. Lakshminarayana, H. Kim, and R. Vuduc, “Many-Thread Aware Prefetching Mechanisms for GPGPU Applications,” in 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture, Dec. 2010, pp. 213–224, ISSN: 2379-3155. [Online]. Available: https://ieeexplore.ieee.org/document/5695538
  [30] J. Leng, T. Hetherington, A. ElTantawy, S. Gilani, N. S. Kim, T. M. Aamodt, and V. J. Reddi, “GPUWattch: enabling energy optimizations in GPGPUs,” ACM SIGARCH Computer Architecture News, vol. 41, no. 3, pp. 487–498, Jun. 2013. [Online]. Available: https://doi.org/10.1145/2508148.2485964
  [31] L. Liu, W. Chang, F. Demoullin, Y. H. Chou, M. Saed, D. Pankratz, T. Nowicki, and T. M. Aamodt, “Intersection Prediction for Accelerated GPU Ray Tracing,” in MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’21. New York, NY, USA: Association for Computing Machinery, Oct. 2021, pp. 709–723. [Online]. Available: http...
  [32] L. Liu, M. Saed, Y. H. Chou, D. Grigoryan, T. Nowicki, and T. M. Aamodt, “LumiBench: A Benchmark Suite for Hardware Ray Tracing,” in 2023 IEEE International Symposium on Workload Characterization (IISWC), Oct. 2023, pp. 1–14, ISSN: 2835-2238. [Online]. Available: https://ieeexplore.ieee.org/document/10289559
  [33] P. Liu, J. Yu, and M. C. Huang, “Thread-Aware Adaptive Prefetcher on Multicore Systems: Improving the Performance for Multithreaded Workloads,” ACM Trans. Archit. Code Optim., vol. 13, no. 1, pp. 13:1–13:25, Mar. 2016. [Online]. Available: https://doi.org/10.1145/2890505
  [34] L. Luo, M. Wong, and W.-m. Hwu, “An effective GPU implementation of breadth-first search,” in Proceedings of the 47th Design Automation Conference, ser. DAC ’10. New York, NY, USA: Association for Computing Machinery, Jun. 2010, pp. 52–55. [Online]. Available: https://dl.acm.org/doi/10.1145/1837274.1837289
  [35] D. K. Mandarapu, V. Nagarajan, A. Pelenitsyn, and M. Kulkarni, “Arkade: k-Nearest Neighbor Search With Non-Euclidean Distances using GPU Ray Tracing,” in Proceedings of the 38th ACM International Conference on Supercomputing, ser. ICS ’24. New York, NY, USA: Association for Computing Machinery, Jun. 2024, pp. 14–25. [Online]. Available: https://dl.acm.or...
  [36] D. Merrill, M. Garland, and A. Grimshaw, “Scalable GPU graph traversal,” SIGPLAN Not., vol. 47, no. 8, pp. 117–128, Feb. 2012. [Online]. Available: https://doi.org/10.1145/2370036.2145832
  [37] K. Nesbit and J. Smith, “Data Cache Prefetching Using a Global History Buffer,” in 10th International Symposium on High Performance Computer Architecture (HPCA’04), Feb. 2004, pp. 96–96, ISSN: 1530-… [Online]. Available: https://ieeexplore.ieee.org/document/1410068
  [38] J.-s. Park, W.-c. Park, J.-H. Nah, and T.-d. Han, “Node pre-fetching architecture for real-time ray tracing,” IEICE Electronics Express, vol. 10, no. 14, pp. 20130468–20130468, 2013.
  [39] M. Saed, Y. H. Chou, L. Liu, T. Nowicki, and T. M. Aamodt, “Vulkan-Sim: A GPU Architecture Simulator for Ray Tracing,” in 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO), Oct. 2022, pp. 263–281. [Online]. Available: https://ieeexplore.ieee.org/document/9923844/citations?tabFilter=papers#citations
  [40] M. Saed, P. J. Nair, and T. M. Aamodt, “RayN: Ray Tracing Acceleration with Near-memory Computing,” in Proceedings of the 58th IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’25. New York, NY, USA: Association for Computing Machinery, Oct. 2025, pp. 277–291. [Online]. Available: https://dl.acm.org/doi/10.1145/3725843.3756067
  [41] J. E. Stine, I. Castellanos, M. Wood, J. Henson, F. Love, W. R. Davis, P. D. Franzon, M. Bucher, S. Basavarajaiah, J. Oh, and R. Jenkal, “FreePDK: An Open-Source Variation-Aware Design Kit,” in 2007 IEEE International Conference on Microelectronic Systems Education (MSE’07), Jun. 2007, pp. 173–174. [Online]. Available: https://ieeexplore.ieee.org/document/4231502
  [42] Vulkan, “Home|Vulkan|Cross platform 3D Graphics,” May 2024. [Online]. Available: https://vulkan.org/
  [43] X. Yu, C. J. Hughes, N. Satish, and S. Devadas, “IMP: indirect memory prefetcher,” in Proceedings of the 48th International Symposium on Microarchitecture, ser. MICRO-48. New York, NY, USA: Association for Computing Machinery, Dec. 2015, pp. 178–190. [Online]. Available: https://doi.org/10.1145/2830772.2830807
  [44] H. Zhang, Y. Zhang, and H.-W. Tseng, “RTSpMSpM: Harnessing Ray Tracing for Efficient Sparse Matrix Computations,” in Proceedings of the 52nd Annual International Symposium on Computer Architecture, ser. ISCA ’25. New York, NY, USA: Association for Computing Machinery, Jun. 2025, pp. 359–373. [Online]. Available: https://dl.acm.org/doi/10.1145/3695053.3731072